├── LICENSE ├── README.md ├── accuracy ├── README.md ├── lm_eval │ ├── evaluate_task_result.py │ ├── figure11-config.json │ ├── figure11.py │ ├── figure13.sh │ ├── figure17.sh │ ├── full_cache.sh │ ├── generate_task_data.py │ ├── h2o.sh │ ├── ours.sh │ ├── quant.sh │ ├── run_lm_eval_harness.py │ ├── run_text_generation.py │ ├── tasks │ │ ├── __init__.py │ │ ├── eval_harness.py │ │ └── util.py │ └── utils_lm_eval │ │ ├── modify_gptneox.py │ │ ├── modify_llama.py │ │ └── modify_opt.py ├── perplexity │ ├── datautils.py │ ├── figure12.sh │ ├── llama.py │ ├── opt.py │ └── table2.sh ├── scripts │ ├── figure11 │ │ └── run.sh │ ├── figure12 │ │ └── run.sh │ ├── figure13 │ │ └── run.sh │ ├── figure17 │ │ └── run.sh │ ├── run_all.sh │ └── table2 │ │ └── run.sh ├── setup │ ├── gen_llama_skewing_matrix.py │ ├── gen_opt_model.py │ ├── gen_partial_weight.py │ ├── pg19_firstbook.txt │ ├── setup.sh │ └── utils.py └── src │ ├── modeling_llama_orig.py │ ├── modeling_llama_ours.py │ ├── modeling_llama_ours_prune.py │ ├── modeling_llama_ours_setup.py │ ├── modeling_opt_orig.py │ ├── modeling_opt_ours.py │ ├── modeling_opt_ours_prune.py │ └── modeling_opt_ours_setup.py ├── figures └── infinigen.png ├── requirements.txt └── speedup ├── README.md ├── flexgen ├── LICENSE ├── README.md ├── flexgen │ ├── __init__.py │ ├── compression.py │ ├── flex_opt.py │ ├── opt_config.py │ ├── pytorch_backend.py │ ├── timer.py │ └── utils.py ├── h2o │ ├── flex_opt.py │ └── pytorch_backend.py ├── infinigen │ ├── flex_opt.py │ └── pytorch_backend.py ├── original │ ├── flex_opt.py │ └── pytorch_backend.py ├── pg19_firstbook.txt └── pyproject.toml ├── infinigen ├── infinigen │ ├── kv_selection_controller.py │ ├── partial_weight_generation_controller.py │ └── skewing_controller.py └── setup.py ├── install.sh ├── scripts ├── figure14 │ ├── pg19_firstbook.txt │ └── run.sh ├── figure15 │ ├── pg19_firstbook.txt │ └── run.sh ├── figure16a │ ├── pg19_firstbook.txt │ └── run.sh ├── figure16b │ ├── pg19_firstbook.txt 
│ └── run.sh ├── figure17a │ ├── pg19_firstbook.txt │ └── run.sh ├── figure17b │ ├── pg19_firstbook.txt │ └── run.sh └── run_all.sh └── uvm ├── allocate.cpp ├── h2o_attention.py ├── selfattention.py ├── transformer.py └── transformerlayer.py /README.md: -------------------------------------------------------------------------------- 1 | # InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management 2 | [[paper](https://arxiv.org/abs/2406.19707)] 3 | 4 | ![infinigen](figures/infinigen.png) 5 | 6 | ## Overview 7 | 8 | This repository contains the code for the artifact evaluation of our paper "InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management". 9 | 10 | - Getting Started (10 minutes) 11 | - Run Experiments 12 | 13 | ## Abstract 14 | 15 | Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. 
Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy. 16 | 17 | ## Getting Started (10 minutes) 18 | 19 | ```sh 20 | git clone https://github.com/snu-comparch/infinigen && cd infinigen 21 | conda create -n infinigen python=3.9 22 | conda activate infinigen 23 | pip install -r requirements.txt 24 | ``` 25 | 26 | ## Run Experiments 27 | 28 | We provide scripts for the accuracy and speedup evaluations. 29 | You can find the source code for the accuracy evaluation in the `accuracy` directory and for the speedup evaluation in the `speedup` directory. 30 | The accuracy and speedup experiments take about 40 hours and 7 hours, respectively. 31 | 32 | ## Citation 33 | 34 | ```bibtex 35 | @inproceedings{lee-osdi24, 36 | title = {{I}nfini{G}en: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management}, 37 | author = {Lee, Wonbeom and Lee, Jungi and Seo, Junghwan and Sim, Jaewoong}, 38 | booktitle = {18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)}, 39 | year = {2024} 40 | } 41 | ``` 42 | -------------------------------------------------------------------------------- /accuracy/README.md: -------------------------------------------------------------------------------- 1 | # Language Modeling Evaluation 2 | This directory contains the source code for evaluating language modeling performance. You can reproduce the perplexity and accuracy results in the paper. Some of the code is adapted from the H2O (NeurIPS'23) GitHub repository. 3 | 4 | ## Getting Started (60 minutes) 5 | We evaluate accuracy using the HuggingFace Transformers library. Also, please 6 | fetch Llama-2 from [here](https://llama.meta.com/llama-downloads). Follow the 7 | instructions and set up Llama-2.
You may also need to convert the model to 8 | HuggingFace format using the `convert_llama_weights_to_hf.py` script in 9 | `transformers/src/transformers/models/llama`. 10 | 11 | NOTE: We recommend using a GPU with a large VRAM size. We evaluate accuracy on an A100-80GB GPU. 12 | 13 | ```sh 14 | git clone -b v4.35-release https://github.com/huggingface/transformers.git 15 | cd transformers 16 | pip install -e . 17 | ``` 18 | 19 | After setting up the library and the Llama models, generate the partial weights and skewing matrix. You can safely ignore the uninitialized weight warning. 20 | ```sh 21 | cd setup 22 | export LLAMA_PATH=/path/to/llama-2 23 | bash setup.sh 24 | ``` 25 | 26 | For a "Hello world"-sized example, please run the following commands (10 minutes): 27 | ``` 28 | cd lm_eval 29 | mkdir results 30 | python -u generate_task_data.py --output-file results/openbookqa-5.jsonl --task-name openbookqa --num-fewshot 5 31 | bash ours.sh openbookqa ../setup/opt-model/opt-6.7b facebook/opt-6.7b opt 5 0.2 4 1.0 0.2 32 | ``` 33 | 34 | ## Run Experiments (40 hours) 35 | You can reproduce the experimental results from Figures 11-13 and Table 2 by running the following commands: 36 | 37 | ```sh 38 | cd scripts 39 | sh run_all.sh 40 | ``` 41 | 42 | If you want to reproduce the results for a specific figure, please run `sh run.sh` in the corresponding directory.
For example, 43 | ``` 44 | cd scripts/figure11 45 | sh run.sh 46 | ``` 47 | -------------------------------------------------------------------------------- /accuracy/lm_eval/evaluate_task_result.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | 5 | from lm_eval import evaluator, tasks 6 | from tasks import EvalHarnessAdaptor 7 | 8 | def json_to_key(obj): 9 | return json.dumps(obj) 10 | 11 | 12 | if __name__ == '__main__': 13 | 14 | 15 | parser = argparse.ArgumentParser( 16 | prog = 'ProgramName', 17 | description = 'What the program does', 18 | epilog = 'Text at the bottom of help') 19 | 20 | parser.add_argument('--result-file', type=str, default='result.jsonl') 21 | parser.add_argument('--task-name', type=str, default='hellaswag') 22 | parser.add_argument('--model-type', type=str, default='opt') 23 | parser.add_argument('--debug', action='store_true', default=False) 24 | parser.add_argument('--num-fewshot', type=int, default=0) 25 | args = parser.parse_args() 26 | 27 | if args.model_type == 'opt': 28 | os.environ['MODEL_NAME'] = "facebook/opt-66b" 29 | elif args.model_type == 'bloom': 30 | os.environ['MODEL_NAME'] = "bigscience/bloom" 31 | elif args.model_type == 'gpt_neox': 32 | os.environ['MODEL_NAME'] = "EleutherAI/gpt-neox-20b" 33 | elif args.model_type == 'llama': 34 | os.environ['MODEL_NAME'] = "huggyllama/llama-7b" 35 | else: 36 | assert False 37 | 38 | seq = 1024 39 | total_batch = 1 40 | pe = 'fixed' 41 | 42 | class RealRunner: 43 | 44 | def __init__(self, args): 45 | 46 | self.results = {} 47 | 48 | with open(args.result_file, 'r') as f: 49 | 50 | for line in f: 51 | if line.strip() == '': 52 | continue 53 | 54 | item = json.loads(line) 55 | 56 | request = item['request'] 57 | result = item['result'] 58 | 59 | self.results[json_to_key(request)] = result 60 | 61 | print(f"{len(self.results)} items in the cache") 62 | 63 | def eval(self, batch): 64 | 65 | from
tasks.eval_harness import tokenizer 66 | 67 | mask_loss = [] 68 | each_correct = [] 69 | 70 | for i, text in enumerate(batch['text']): 71 | 72 | request = { 73 | "best_of": 1, 74 | "echo": True, 75 | "logprobs": 1, 76 | "max_tokens": 0, 77 | "model": "x", 78 | "n": 1, 79 | "prompt": text, 80 | "request_type": "language-model-inference", 81 | "stop": None, 82 | "temperature": 0, 83 | "top_p": 1 84 | } 85 | 86 | key = json_to_key(request) 87 | 88 | correct = True 89 | 90 | if key in self.results: 91 | result = self.results[key] 92 | 93 | token_logprobs = result['choices'][0]['logprobs']['token_logprobs'] 94 | tokens = result['choices'][0]['logprobs']['tokens'] 95 | top_logprobs = result['choices'][0]['logprobs']['top_logprobs'] 96 | assert token_logprobs[0] is None 97 | 98 | token_ids = tokenizer.convert_tokens_to_ids(tokens) 99 | 100 | obs = batch['obs'][i] 101 | target = batch['target'][i] 102 | eval_mask = batch['eval_mask'][i] 103 | 104 | n_positive = 0 105 | sum_lobprob = 0 106 | if args.debug: 107 | print(target) 108 | for i, mask in enumerate(eval_mask): 109 | try: 110 | 111 | if i+1 >= len(tokens): 112 | break 113 | 114 | if mask == True: 115 | if args.debug: 116 | print(tokens[i+1], next(iter(top_logprobs[i+1].keys()))) 117 | correct = correct and (tokens[i+1] == next(iter(top_logprobs[i+1].keys()))) 118 | sum_lobprob += token_logprobs[i+1] 119 | n_positive += 1 120 | except Exception as e: 121 | raise e 122 | 123 | # avg_logprob = sum(token_logprobs[1:]) / (len(token_logprobs) - 1) 124 | avg_logprob = sum_lobprob / n_positive 125 | 126 | mask_loss.append( - avg_logprob) 127 | 128 | each_correct.append( correct ) 129 | 130 | else: 131 | assert False 132 | 133 | 134 | out = { 135 | 'mask_loss': mask_loss, 136 | 'each_correct': each_correct, 137 | } 138 | 139 | 140 | return out 141 | 142 | t = RealRunner(args) 143 | 144 | adaptor = EvalHarnessAdaptor(t, seq, total_batch, shrink=pe != "fixed") 145 | 146 | results = evaluator.evaluate(adaptor, 
tasks.get_task_dict([args.task_name 147 | #"lambada_openai", 148 | #"piqa", 149 | #"hellaswag", 150 | #"winogrande", 151 | #"mathqa", 152 | #"pubmedqa", 153 | # "boolq", 154 | # "cb", 155 | # "copa", 156 | # "multirc", 157 | # "record", 158 | # "wic", 159 | # "wsc", 160 | ]), False, args.num_fewshot, None) 161 | 162 | dumped = json.dumps(results, indent=2) 163 | print(dumped) 164 | -------------------------------------------------------------------------------- /accuracy/lm_eval/figure11-config.json: -------------------------------------------------------------------------------- 1 | { 2 | "opt-6.7b":{ 3 | "openbookqa":[ 4 | [2, 0.2], 5 | [4, 0.2], 6 | [5, 0.4], 7 | [7, 0.6] 8 | ], 9 | "piqa":[ 10 | [2, 0.2], 11 | [4, 0.2], 12 | [5, 0.4], 13 | [7, 0.7] 14 | ] 15 | }, 16 | "opt-13b":{ 17 | "winogrande":[ 18 | [2, 0.2], 19 | [4, 0.2], 20 | [4, 0.6], 21 | [7, 0.6] 22 | ], 23 | "openbookqa":[ 24 | [2, 0.2], 25 | [4, 0.2], 26 | [4, 0.6], 27 | [7, 0.6] 28 | ] 29 | }, 30 | "opt-30b":{ 31 | "openbookqa":[ 32 | [2, 0.2], 33 | [4, 0.2], 34 | [5, 0.4], 35 | [7, 0.6] 36 | ], 37 | "copa":[ 38 | [2, 0.2], 39 | [4, 0.2], 40 | [5, 0.4], 41 | [7, 0.6] 42 | ] 43 | }, 44 | "llama-2-7b":{ 45 | "rte":[ 46 | [5, 0.2], 47 | [7, 0.2], 48 | [7, 0.6], 49 | [7, 0.8] 50 | ], 51 | "piqa":[ 52 | [5, 0.2], 53 | [7, 0.2], 54 | [7, 0.6], 55 | [9.5, 0.6] 56 | ] 57 | }, 58 | "llama-2-13b":{ 59 | "copa":[ 60 | [4, 0.2], 61 | [6, 0.2], 62 | [7, 0.4], 63 | [9, 0.6] 64 | ], 65 | "winogrande":[ 66 | [4, 0.2], 67 | [6, 0.2], 68 | [7, 0.4], 69 | [9, 0.6] 70 | ] 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /accuracy/lm_eval/figure11.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | with open("figure11-config.json") as f: 5 | config = json.load(f) 6 | os.system("mkdir -p results") 7 | 8 | shots = 5 9 | partial = 0.2 10 | capacity = 1.0 11 | 12 | # Prepare dataset 13 | for task in 
["piqa", "openbookqa", "winogrande", "copa", "rte"]: 14 | cmd = [] 15 | cmd.append("python -u generate_task_data.py") 16 | cmd.append(f"--output-file results/{task}-{shots}.jsonl") 17 | cmd.append(f"--task-name {task}") 18 | cmd.append(f"--num-fewshot {shots}") 19 | cmd = ' '.join(cmd) 20 | os.system(cmd) 21 | 22 | ## Baseline 23 | print("="*10+" Full cache " + "="*10) 24 | # OPT 25 | for size in ["6.7b", "13b", "30b"]: 26 | if size == "6.7b": 27 | tasks = ["piqa", "openbookqa"] 28 | elif size == "13b": 29 | tasks = ["winogrande", "openbookqa"] 30 | elif size == "30b": 31 | tasks = ["copa", "openbookqa"] 32 | for task in tasks: 33 | cmd = [] 34 | cmd.append("bash full_cache.sh") 35 | cmd.append(task) 36 | cmd.append(f"facebook/opt-{size}") 37 | cmd.append("opt") 38 | cmd.append(str(shots)) 39 | cmd = ' '.join(cmd) 40 | print(cmd) 41 | os.system(cmd) 42 | print("-------------------------------------------") 43 | 44 | # Llama-2 45 | llama_2_dir = os.environ["LLAMA_PATH"] 46 | for size in ["7b", "13b"]: 47 | if size == "7b": 48 | tasks = ["rte", "piqa"] 49 | elif size == "13b": 50 | tasks = ["copa", "winogrande"] 51 | for task in tasks: 52 | cmd = [] 53 | cmd.append("bash full_cache.sh") 54 | cmd.append(task) 55 | cmd.append(f"{llama_2_dir}/llama-2-{size}") 56 | cmd.append("llama") 57 | cmd.append(str(shots)) 58 | cmd = ' '.join(cmd) 59 | print(cmd) 60 | os.system(cmd) 61 | print("-------------------------------------------") 62 | 63 | # InfiniGen 64 | print("="*10+" InfiniGen " + "="*10) 65 | # OPT 66 | for size in ["6.7b", "13b", "30b"]: 67 | if size == "6.7b": 68 | tasks = ["piqa", "openbookqa"] 69 | elif size == "13b": 70 | tasks = ["winogrande", "openbookqa"] 71 | elif size == "30b": 72 | tasks = ["copa", "openbookqa"] 73 | for task in tasks: 74 | for retain_ratio in range(4): 75 | alpha, budget = config[f"opt-{size}"][task][retain_ratio] 76 | cmd = [] 77 | cmd.append("bash ours.sh") 78 | cmd.append(task) 79 | cmd.append(f"../setup/opt-model/opt-{size}") 80 | 
cmd.append(f"facebook/opt-{size}") 81 | cmd.append("opt") 82 | cmd.append(str(shots)) 83 | cmd.append(str(partial)) 84 | cmd.append(str(alpha)) 85 | cmd.append(str(capacity)) 86 | cmd.append(str(budget)) 87 | cmd = ' '.join(cmd) 88 | print(cmd) 89 | os.system(cmd) 90 | print("-------------------------------------------") 91 | 92 | 93 | # Llama-2 94 | llama_2_dir = os.environ["LLAMA_PATH"] 95 | for size in ["7b", "13b"]: 96 | if size == "7b": 97 | tasks = ["rte", "piqa"] 98 | elif size == "13b": 99 | tasks = ["copa", "winogrande"] 100 | for task in tasks: 101 | for retain_ratio in range(4): 102 | alpha, budget = config[f"llama-2-{size}"][task][retain_ratio] 103 | cmd = [] 104 | cmd.append("bash ours.sh") 105 | cmd.append(task) 106 | cmd.append(f"{llama_2_dir}/llama-2-{size}") 107 | cmd.append(f"{llama_2_dir}/llama-2-{size}") 108 | cmd.append("llama") 109 | cmd.append(str(shots)) 110 | cmd.append(str(partial)) 111 | cmd.append(str(alpha)) 112 | cmd.append(str(capacity)) 113 | cmd.append(str(budget)) 114 | cmd = ' '.join(cmd) 115 | print(cmd) 116 | os.system(cmd) 117 | print("-------------------------------------------") 118 | ## H2O 119 | print("="*10+" H2O " + "="*10) 120 | # OPT 121 | for size in ["6.7b", "13b", "30b"]: 122 | if size == "6.7b": 123 | tasks = ["piqa", "openbookqa"] 124 | elif size == "13b": 125 | tasks = ["winogrande", "openbookqa"] 126 | elif size == "30b": 127 | tasks = ["copa", "openbookqa"] 128 | for task in tasks: 129 | for ratio in [0.25, 0.125, 0.0625, 0.03125]: 130 | cmd = [] 131 | cmd.append("bash h2o.sh") 132 | cmd.append(task) 133 | cmd.append(f"facebook/opt-{size}") 134 | cmd.append("opt") 135 | cmd.append(str(shots)) 136 | cmd.append(str(ratio)) # heavy_ratio 137 | cmd.append(str(ratio)) # recent_ratio 138 | cmd = ' '.join(cmd) 139 | print(cmd) 140 | os.system(cmd) 141 | print("-------------------------------------------") 142 | 143 | # Llama-2 144 | llama_2_dir = os.environ["LLAMA_PATH"] 145 | for size in ["7b", "13b"]: 146 | if size 
== "7b": 147 | tasks = ["rte", "piqa"] 148 | elif size == "13b": 149 | tasks = ["copa", "winogrande"] 150 | for task in tasks: 151 | for ratio in [0.25, 0.125, 0.0625, 0.03125]: 152 | cmd = [] 153 | cmd.append("bash h2o.sh") 154 | cmd.append(task) 155 | cmd.append(f"{llama_2_dir}/llama-2-{size}") 156 | cmd.append("llama") 157 | cmd.append(str(shots)) 158 | cmd.append(str(ratio)) # heavy_ratio 159 | cmd.append(str(ratio)) # recent_ratio 160 | cmd = ' '.join(cmd) 161 | print(cmd) 162 | os.system(cmd) 163 | print("-------------------------------------------") 164 | 165 | ## Quant. 166 | print("="*10+" Quantization " + "="*10) 167 | # OPT 168 | for size in ["6.7b", "13b", "30b"]: 169 | if size == "6.7b": 170 | tasks = ["piqa", "openbookqa"] 171 | elif size == "13b": 172 | tasks = ["winogrande", "openbookqa"] 173 | elif size == "30b": 174 | tasks = ["copa", "openbookqa"] 175 | for task in tasks: 176 | for qbits in [8, 4, 2, 1]: 177 | cmd = [] 178 | cmd.append("bash quant.sh") 179 | cmd.append(task) 180 | cmd.append(f"facebook/opt-{size}") 181 | cmd.append("opt") 182 | cmd.append(str(shots)) 183 | cmd.append(str(qbits)) 184 | cmd = ' '.join(cmd) 185 | print(cmd) 186 | os.system(cmd) 187 | print("-------------------------------------------") 188 | 189 | # Llama-2 190 | llama_2_dir = os.environ["LLAMA_PATH"] 191 | for size in ["7b", "13b"]: 192 | if size == "7b": 193 | tasks = ["rte", "piqa"] 194 | elif size == "13b": 195 | tasks = ["copa", "winogrande"] 196 | for task in tasks: 197 | for qbits in [8, 4, 2, 1]: 198 | cmd = [] 199 | cmd.append("bash quant.sh") 200 | cmd.append(task) 201 | cmd.append(f"{llama_2_dir}/llama-2-{size}") 202 | cmd.append("llama") 203 | cmd.append(str(shots)) 204 | cmd.append(str(qbits)) 205 | cmd = ' '.join(cmd) 206 | print(cmd) 207 | os.system(cmd) 208 | print("-------------------------------------------") 209 | -------------------------------------------------------------------------------- /accuracy/lm_eval/figure13.sh: 
-------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | shots=5 4 | # Prepare dataset 5 | echo "prepare dataset" 6 | for task in copa openbookqa winogrande piqa rte; do 7 | python -u generate_task_data.py \ 8 | --output-file "results/${task}-${shots}.jsonl" \ 9 | --task-name ${task} \ 10 | --num-fewshot ${shots} 11 | done 12 | 13 | # Baseline 14 | echo "full cache" 15 | for task in copa openbookqa winogrande piqa rte; do 16 | bash full_cache.sh ${task} "facebook/opt-6.7b" opt ${shots} 17 | done 18 | 19 | # InfiniGen 20 | partial=0.1 21 | capacity=1.0 22 | alpha=99 23 | budget=0.2 24 | 25 | # w/o skewing 26 | echo "InfiniGen w/o skewing" 27 | for task in copa openbookqa winogrande piqa rte; do 28 | bash ours.sh ${task} "../setup/opt-model-no-skew/opt-6.7b" "facebook/opt-6.7b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget} "no-skew" 29 | done 30 | 31 | 32 | # w/ skewing 33 | echo "InfiniGen w/ skewing" 34 | for task in copa openbookqa winogrande piqa rte; do 35 | bash ours.sh ${task} "../setup/opt-model/opt-6.7b" "facebook/opt-6.7b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget} 36 | done 37 | -------------------------------------------------------------------------------- /accuracy/lm_eval/figure17.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | shots=5 4 | # Prepare dataset 5 | echo "prepare dataset" 6 | for task in openbookqa; do 7 | python -u generate_task_data.py \ 8 | --output-file "results/${task}-${shots}.jsonl" \ 9 | --task-name ${task} \ 10 | --num-fewshot ${shots} 11 | done 12 | 13 | 14 | # Threshold (alpha) sweep 15 | partial=0.2 16 | capacity=1.0 17 | budget=0.2 18 | 19 | echo "Threshold (alpha) sweep" 20 | for alpha in 1 2 3 4 5 6 7 8 9; do 21 | echo alpha "${alpha}" 22 | bash ours.sh winogrande "../setup/opt-model/opt-13b" "facebook/opt-13b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget} 23 | done 24 | 25 | # Partial 
weight sweep 26 | alpha=4 27 | capacity=1.0 28 | budget=0.2 29 | 30 | echo "Partial weight ratio sweep" 31 | for partial in 0.1 0.2 0.4 0.6 0.8 1; do 32 | echo partial ratio "${partial}" 33 | bash ours.sh winogrande "../setup/opt-model/opt-13b" "facebook/opt-13b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget} 34 | done 35 | 36 | -------------------------------------------------------------------------------- /accuracy/lm_eval/full_cache.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Inference, and generate output json file 4 | task=$1 5 | shots=$4 6 | model=$2 7 | model_arch=$3 8 | base_name=$(basename "${model}") 9 | 10 | python -u run_lm_eval_harness.py \ 11 | --input-path results/${task}-${shots}.jsonl \ 12 | --output-path results/${task}-${shots}-${base_name}-full.jsonl \ 13 | --model-name ${model} \ 14 | --model-type ${model_arch} 15 | 16 | # Evaluate results 17 | python -u evaluate_task_result.py \ 18 | --result-file results/${task}-${shots}-${base_name}-full.jsonl \ 19 | --task-name ${task} \ 20 | --num-fewshot ${shots} \ 21 | --model-type ${model_arch} 22 | 23 | rm results/${task}-${shots}-${base_name}-full.jsonl 24 | -------------------------------------------------------------------------------- /accuracy/lm_eval/generate_task_data.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | 4 | from lm_eval import evaluator, tasks 5 | from tasks import EvalHarnessAdaptor 6 | 7 | 8 | if __name__ == '__main__': 9 | 10 | 11 | parser = argparse.ArgumentParser( 12 | prog = 'ProgramName', 13 | description = 'What the program does', 14 | epilog = 'Text at the bottom of help') 15 | 16 | parser.add_argument('--output-file', type=str, default='input.jsonl') 17 | parser.add_argument('--task-name', type=str, default='hellaswag') 18 | parser.add_argument('--num-fewshot', type=int, default=0) 19 | args = parser.parse_args() 20 | 21 | 
seq = 1024 22 | total_batch = 1 23 | pe = 'fixed' 24 | 25 | with open(args.output_file, 'w') as f: 26 | pass 27 | 28 | class DryRunner: 29 | def eval(self, batch): 30 | 31 | with open(args.output_file, 'a') as f: 32 | 33 | for text in batch['text']: 34 | item = { 35 | "best_of": 1, 36 | "echo": True, 37 | "logprobs": 1, 38 | "max_tokens": 0, 39 | "model": "x", 40 | "n": 1, 41 | "prompt": text, 42 | "request_type": "language-model-inference", 43 | "stop": None, 44 | "temperature": 0, 45 | "top_p": 1 46 | } 47 | 48 | f.write(json.dumps(item) + '\n') 49 | 50 | out = { 51 | 'mask_loss': [1.0] * len(batch), 52 | 'each_correct': [True] * len(batch), 53 | } 54 | return out 55 | 56 | t = DryRunner() 57 | adaptor = EvalHarnessAdaptor(t, seq, total_batch, shrink=pe != "fixed") 58 | results = evaluator.evaluate(adaptor, tasks.get_task_dict([args.task_name 59 | #"lambada_openai", 60 | #"piqa", 61 | #"hellaswag", 62 | #"winogrande", 63 | #"mathqa", 64 | #"pubmedqa", 65 | # "boolq", 66 | # "cb", 67 | # "copa", 68 | # "multirc", 69 | # "record", 70 | # "wic", 71 | # "wsc", 72 | ]), False, args.num_fewshot, None) 73 | print('Finished') 74 | 75 | # dumped = json.dumps(results, indent=2) 76 | # print(dumped) -------------------------------------------------------------------------------- /accuracy/lm_eval/h2o.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Inference, and generate output json file 4 | task=$1 5 | shots=$4 6 | model=$2 7 | model_arch=$3 8 | base_name=$(basename "${model}") 9 | heavy_ratio=$5 10 | recent_ratio=$6 11 | 12 | python -u run_lm_eval_harness.py \ 13 | --input-path results/${task}-${shots}.jsonl \ 14 | --output-path results/${task}-${shots}-${base_name}-h2o.jsonl \ 15 | --model-name ${model} \ 16 | --model-type ${model_arch} \ 17 | --heavy_ratio ${heavy_ratio} \ 18 | --recent_ratio ${recent_ratio} \ 19 | --enable_small_cache 20 | 21 | ## Evaluate results 22 | python -u evaluate_task_result.py \ 23 
| --result-file results/${task}-${shots}-${base_name}-h2o.jsonl \ 24 | --task-name ${task} \ 25 | --num-fewshot ${shots} \ 26 | --model-type ${model_arch} 27 | 28 | rm results/${task}-${shots}-${base_name}-h2o.jsonl 29 | -------------------------------------------------------------------------------- /accuracy/lm_eval/ours.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Inference, and generate output json file 4 | task=$1 5 | shots=$5 6 | model_path=$2 7 | model=$3 8 | model_arch=$4 9 | partial_weight=$6 10 | alpha=$7 11 | capacity=$8 12 | budget=$9 13 | no_skewing=${10} 14 | base_name=$(basename "${model}") 15 | if [ -z $no_skewing ]; then 16 | weight_path="../setup/weights/${base_name}_${partial_weight}" 17 | else 18 | weight_path="../setup/weights-no-skew/${base_name}_${partial_weight}" 19 | fi 20 | skewing_path="../setup/skewing_matrix/${base_name}.pt" 21 | 22 | python -u run_lm_eval_harness.py \ 23 | --input-path results/${task}-${shots}.jsonl \ 24 | --output-path results/${task}-${shots}-${base_name}-ours.jsonl \ 25 | --model-name ${model} \ 26 | --model-type ${model_arch} \ 27 | --partial_weight_ratio ${partial_weight} \ 28 | --partial_weight_path ${weight_path} \ 29 | --ours \ 30 | --model-path ${model_path} \ 31 | --skewing_matrix_path ${skewing_path} \ 32 | --alpha ${alpha} \ 33 | --capacity ${capacity} \ 34 | --budget ${budget} 35 | 36 | # Evaluate results 37 | python -u evaluate_task_result.py \ 38 | --result-file results/${task}-${shots}-${base_name}-ours.jsonl \ 39 | --task-name ${task} \ 40 | --num-fewshot ${shots} \ 41 | --model-type ${model_arch} 42 | 43 | rm results/${task}-${shots}-${base_name}-ours.jsonl 44 | -------------------------------------------------------------------------------- /accuracy/lm_eval/quant.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Inference, and generate output json file 4 | task=$1 5 | 
shots=$4 6 | model=$2 7 | model_arch=$3 8 | base_name=$(basename "${model}") 9 | qbits=$5 10 | 11 | python -u run_lm_eval_harness.py \ 12 | --input-path results/${task}-${shots}.jsonl \ 13 | --output-path results/${task}-${shots}-${base_name}-quant.jsonl \ 14 | --model-name ${model} \ 15 | --model-type ${model_arch} \ 16 | --qbits ${qbits} \ 17 | --enable_quant 18 | 19 | ## Evaluate results 20 | python -u evaluate_task_result.py \ 21 | --result-file results/${task}-${shots}-${base_name}-quant.jsonl \ 22 | --task-name ${task} \ 23 | --num-fewshot ${shots} \ 24 | --model-type ${model_arch} 25 | 26 | rm results/${task}-${shots}-${base_name}-quant.jsonl 27 | -------------------------------------------------------------------------------- /accuracy/lm_eval/run_lm_eval_harness.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json, tqdm 3 | import torch 4 | import copy 5 | import os, sys 6 | import math 7 | 8 | def set_symlink(model_type, fname): 9 | model_path = "../transformers/src/transformers/models/" + model_type 10 | linker_path = os.path.realpath("../src/" + fname) 11 | if not os.path.exists(linker_path): 12 | print(f"No file exists at {linker_path}") 13 | exit(0) 14 | if not os.path.exists(model_path): 15 | print(f"No file exists at {model_path}") 16 | exit(0) 17 | curr_dir = os.getcwd() 18 | os.chdir(model_path) 19 | if os.path.exists(f'modeling_{model_type}.py'): 20 | cmd = f"rm modeling_{model_type}.py" 21 | os.system(cmd) 22 | cmd = f"ln -s {linker_path} modeling_{model_type}.py" 23 | os.system(cmd) 24 | os.chdir(curr_dir) 25 | 26 | if __name__ == '__main__': 27 | 28 | parser = argparse.ArgumentParser( 29 | prog = 'ProgramName', 30 | description = 'What the program does', 31 | epilog = 'Text at the bottom of help') 32 | 33 | parser.add_argument('--input-path', type=str, default=None) 34 | parser.add_argument('--output-path', type=str, default=None) 35 | parser.add_argument('--model-name', 
type=str, default='facebook/opt-350m') 36 | parser.add_argument('--model-path', type=str, default=None) 37 | parser.add_argument('--model-type', type=str, default='opt') 38 | 39 | # Quant. 40 | parser.add_argument('--enable_quant', action='store_true') 41 | parser.add_argument("--qbits", type=int, default=8) 42 | 43 | # H2O 44 | parser.add_argument('--enable_small_cache', action='store_true') 45 | parser.add_argument("--heavy_ratio", type=float, default=0.1) 46 | parser.add_argument("--recent_ratio", type=float, default=0.1) 47 | 48 | # InfiniGen 49 | parser.add_argument('--ours', action='store_true') 50 | parser.add_argument("--partial_weight_ratio", type=float, default=0.1) 51 | parser.add_argument("--partial_weight_path", type=str) 52 | parser.add_argument("--skewing_matrix_path", type=str) 53 | parser.add_argument("--alpha",type=float, default=5) 54 | parser.add_argument("--capacity",type=float, default=1.0) 55 | parser.add_argument("--budget",type=float, default=0.2) 56 | args = parser.parse_args() 57 | 58 | if args.ours: 59 | set_symlink(args.model_type, f"modeling_{args.model_type}_ours.py") 60 | else: 61 | set_symlink(args.model_type, f"modeling_{args.model_type}_orig.py") 62 | 63 | 64 | input_path = args.input_path 65 | output_path = args.output_path 66 | model_name = args.model_name 67 | 68 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig 69 | config = AutoConfig.from_pretrained(model_name) 70 | tokenizer = AutoTokenizer.from_pretrained(model_name, device_map = 'auto', torch_dtype=torch.float16) 71 | if args.model_path is None: 72 | model = AutoModelForCausalLM.from_pretrained(model_name, device_map = 'auto', torch_dtype=torch.float16) 73 | else: 74 | model = AutoModelForCausalLM.from_pretrained(args.model_path) 75 | 76 | if args.enable_quant: 77 | if args.model_type == "opt": 78 | for i, layer in enumerate(model.model.decoder.layers): 79 | if i>=2: 80 | layer.self_attn.enable_quant = True 81 | layer.self_attn.qbits = args.qbits 
        if args.model_type == "llama":
            for i, layer in enumerate(model.model.layers):
                if i>=2:
                    layer.self_attn.enable_quant = True
                    layer.self_attn.qbits = args.qbits

    elif args.enable_small_cache:
        from utils_lm_eval.modify_llama import convert_kvcache_llama_heavy_recent, LlamaAttention_heavy_hitter
        from utils_lm_eval.modify_gptneox import convert_kvcache_gpt_neox_heavy_recent, GPTNeoXAttention_Mask
        from utils_lm_eval.modify_opt import convert_kvcache_opt_heavy_recent, OPTAttention_Mask
        ENABLE_Heavy_Hitter_FUNCTIONS = {
            "llama": convert_kvcache_llama_heavy_recent,
            "opt": convert_kvcache_opt_heavy_recent,
            "gpt_neox": convert_kvcache_gpt_neox_heavy_recent,
        }
        print('Enable Small Cache Size')
        config.heavy_ratio = args.heavy_ratio
        config.recent_ratio = args.recent_ratio
        base_path = os.path.basename(args.model_name)
        if not os.path.exists(f"../h2o_model/{base_path}.pt"):
            # Create the checkpoint directory without failing if it already exists.
            os.makedirs("../h2o_model", exist_ok=True)
            checkpoint = copy.deepcopy(model.state_dict())
            torch.save(checkpoint, f"../h2o_model/{base_path}.pt")
        model = ENABLE_Heavy_Hitter_FUNCTIONS[args.model_type](model, config)
        model.load_state_dict(torch.load(f"../h2o_model/{base_path}.pt"))
        model = model.to(torch.float16)

    elif args.ours:
        if args.model_type == "opt":
            for layer in range(len(model.model.decoder.layers)):
                model.model.decoder.layers[layer].self_attn.partial_weight_ratio = args.partial_weight_ratio
                model.model.decoder.layers[layer].self_attn.partial_weight_q = torch.load(args.partial_weight_path + "/partial_weight_q_" + str(layer) + ".pt")
                model.model.decoder.layers[layer].self_attn.alpha = args.alpha
                model.model.decoder.layers[layer].self_attn.capacity = args.capacity
                model.model.decoder.layers[layer].self_attn.budget = args.budget
        if args.model_type == "llama":
            if args.skewing_matrix_path is not None:
                A = torch.load(args.skewing_matrix_path)
            for layer in range(len(model.model.layers)):
                model.model.layers[layer].self_attn.partial_weight_ratio = args.partial_weight_ratio
                model.model.layers[layer].self_attn.partial_weight_q = torch.load(args.partial_weight_path + "/partial_weight_q_" + str(layer) + ".pt")
                model.model.layers[layer].self_attn.alpha = args.alpha
                model.model.layers[layer].self_attn.capacity = args.capacity
                model.model.layers[layer].self_attn.budget = args.budget
                if args.skewing_matrix_path is not None:
                    model.model.layers[layer].self_attn.skewing_matrix = A[layer]

    model.half().eval().cuda()

    requests = []
    with open(input_path, 'r') as f:
        for line in f:
            if line.strip() != '':
                requests.append(json.loads(line))

    results = []
    density=[]
    with torch.no_grad():
        for request in tqdm.tqdm(requests):
            result = {'request': request, 'result': {}}
            prompt = request['prompt']
            input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids.to(model.device)

            logits = model(input_ids).logits.log_softmax(dim=-1)
            if args.ours:
                density.append(model.get_density())

            values, indices = logits.squeeze(0).topk(dim=-1, k=1)
            tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze(0))

            gold_indices = input_ids[:, 1:] # skip first
            logprobs = [None] + torch.gather(logits, -1, gold_indices.unsqueeze(-1)).squeeze(-1).squeeze(0).detach().cpu().tolist()
            top_logprobs = [None] + [{tokenizer.convert_ids_to_tokens(i.item()): v.item()} for v, i in zip(values.squeeze(-1), indices.squeeze(-1))]

            result['result'] = {
                "choices": [
                    {
                        "text": prompt,
                        "logprobs": {
                            "tokens": tokens,
                            "token_logprobs": logprobs,
                            "top_logprobs": top_logprobs,
                            "text_offset": []
                        },
                        "finish_reason": "length"
                    }
                ],
                "request_time": {
                    "batch_time": 0,
                    "batch_size": 1}
            }

            results.append(result)

            # Reset the per-layer state before the next request.
            if args.ours:
                if args.model_type == "opt":
                    for layer in model.model.decoder.layers:
                        layer.self_attn.previous_hidden_states = None
                if args.model_type == "llama":
                    for layer in model.model.layers:
                        layer.self_attn.previous_hidden_states = None

    if args.ours:
        density = sum(density) / len(density) * 100
        retain_ratio = (1 - math.sqrt(1 - density / 100)) * 100
        #print("\ndensity: %.2f"%(density))
        print("retain ratio: %.2f\n"%(retain_ratio))

    with open(output_path, 'w') as f:
        for result in results:
            f.write(json.dumps(result) + '\n')

--------------------------------------------------------------------------------
/accuracy/lm_eval/run_text_generation.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# coding=utf-8
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Conditional text generation with the auto-regressive models
"""


import argparse
import logging

import numpy as np
import torch
import json
import tqdm
import copy

from transformers import (
    CTRLLMHeadModel,
    CTRLTokenizer,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    TransfoXLLMHeadModel,
    TransfoXLTokenizer,
    XLMTokenizer,
    XLMWithLMHeadModel,
    XLNetLMHeadModel,
    XLNetTokenizer,
)

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

# The helper modules live in utils_lm_eval/ in this repository (there is no utils_hh/ package).
from utils_lm_eval.modify_llama import convert_kvcache_llama_heavy_recent, LlamaAttention_heavy_hitter
from utils_lm_eval.modify_gptneox import convert_kvcache_gpt_neox_heavy_recent, GPTNeoXAttention_Mask
from utils_lm_eval.modify_opt import convert_kvcache_opt_heavy_recent, OPTAttention_Mask


logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

MAX_LENGTH = int(10000)  # Hardcoded max length to avoid infinite loop

MODEL_CLASSES = {
    "gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
    "ctrl": (CTRLLMHeadModel, CTRLTokenizer),
    "openai-gpt": (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "xlnet": (XLNetLMHeadModel, XLNetTokenizer),
    "transfo-xl": (TransfoXLLMHeadModel, TransfoXLTokenizer),
    "xlm": (XLMWithLMHeadModel, XLMTokenizer),
}

# Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
# in https://github.com/rusiaaman/XLNet-gen#methodology
# and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
PREFIX = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. """


def set_seed(args):
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


ENABLE_Heavy_Hitter_FUNCTIONS = {
    "llama": convert_kvcache_llama_heavy_recent,
    "opt": convert_kvcache_opt_heavy_recent,
    "gpt_neox": convert_kvcache_gpt_neox_heavy_recent,
}


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("--model_arch", type=str, default='llama')
    parser.add_argument("--model_name", type=str, default='huggyllama/llama-13b')
    parser.add_argument("--cache_dir", type=str, default='../../checkpoint/')

    parser.add_argument("--heavy_ratio", type=float, default=0.1)
    parser.add_argument("--recent_ratio", type=float, default=0.1)

    parser.add_argument("--length", type=int, default=64)

    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument(
        "--fp16",
        action="store_true",
        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
    )
    args = parser.parse_args()

    args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()

    logger.warning(f"device: {args.device}, n_gpu: {args.n_gpu}, 16-bits training: {args.fp16}")
    set_seed(args)

    # Change to your custom prompt text
    # prompt_text = 'In the year 2087, humanity has achieved remarkable technological advancements and established colonies on multiple planets within the Milky Way galaxy. Interstellar travel has become commonplace, with faster-than-light spacecraft enabling people to explore distant star systems. Earth has undergone significant changes due to sustainable development efforts, such as harnessing renewable energy sources and implementing widespread ecological restoration projects. However, alongside these triumphs, new challenges have emerged, including the rise of artificial intelligence, ethical dilemmas surrounding genetic engineering, and interplanetary political tensions. Against this backdrop, a team of intrepid scientists embarks on a mission to uncover the secrets of an ancient alien civilization, hidden deep within an uncharted exoplanet. As they navigate treacherous terrains and encounter otherworldly phenomena, they must confront their own fears and reconcile humanity\'s thirst for knowledge with the potential consequences of uncovering secrets that were better left buried. The fate of both their mission and the future of humanity hang in the balance.'
    prompt_text = 'In a small, bustling cafe nestled in the heart of a vibrant city, a serendipitous event unfolded, leaving a lasting impression on all who witnessed it. As the patrons sat sipping their coffees and engaging in animated conversations, a talented street musician entered the cafe, carrying a weathered guitar and radiating an aura of creativity.'

    model_name = args.model_name
    config = AutoConfig.from_pretrained(model_name, cache_dir=args.cache_dir)
    config.heavy_ratio = args.heavy_ratio
    config.recent_ratio = args.recent_ratio

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, cache_dir=args.cache_dir)

    ######## Generate with Full Cache
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=args.cache_dir)
    model.half().eval().cuda()

    # input_ids = tokenizer(prompt_text, return_tensors='pt').input_ids.to(model.device)
    input_ids = tokenizer(prompt_text, add_special_tokens=False, return_tensors='pt').input_ids.to(model.device)

    generate_ids = model.generate(input_ids, max_new_tokens=args.length)
    result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print("################## Generated Context with Full Cache ###################")
    print(result)


    ######### Enable HH
    checkpoint = copy.deepcopy(model.state_dict())
    model = ENABLE_Heavy_Hitter_FUNCTIONS[args.model_arch](model, config)
    model.load_state_dict(checkpoint)
    model.half().eval().cuda()

    generate_ids_hh = model.generate(input_ids, max_new_tokens=args.length)
    result_hh = tokenizer.batch_decode(generate_ids_hh, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print("################## Generated Context with Heavy Hitter Oracle ###################")
    print(result_hh)


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/accuracy/lm_eval/tasks/__init__.py:
--------------------------------------------------------------------------------
from tasks.eval_harness import EvalHarnessAdaptor

--------------------------------------------------------------------------------
/accuracy/lm_eval/tasks/eval_harness.py:
--------------------------------------------------------------------------------
from functools import partial

import os
import transformers
from lm_eval.base import LM
from tqdm import tqdm
import numpy as np

from tasks.util import sample_batch, shrink_seq
import multiprocessing
import ftfy

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

tokenizer = None

def process_init():
    global tokenizer
    model_name = os.environ.get('MODEL_NAME', 'facebook/opt-1.3b')

    if model_name == "EleutherAI/gpt-neox-20b":
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.model_max_length = int(1e30)
        tokenizer.pad_token = "<|endoftext|>"
    elif model_name == 'huggyllama/llama-7b':
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.model_max_length = int(1e30)
        tokenizer.pad_token = "<|endoftext|>"
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.add_bos_token = False

def process_request(x, seq):
    global tokenizer

    ctx, cont = x
    # ctx_tokens = tokenizer.encode("<|endoftext|>" + ftfy.fix_text(ctx, normalization="NFKC"))
    ctx_text = ftfy.fix_text(ctx, normalization="NFKC")
    cont_text = ftfy.fix_text(cont, normalization="NFKC")
    all_text = ctx_text + cont_text

    ctx_tokens = tokenizer(ctx_text, add_special_tokens=False)['input_ids']
    cont_tokens = tokenizer(cont_text, add_special_tokens=False)['input_ids']

    all_tokens = ctx_tokens + cont_tokens
    all_tokens = np.array(all_tokens)[-seq:]  # truncate sequence at seq length

    provided_ctx = len(all_tokens) - 1
    pad_amount = seq - provided_ctx

    return {
        "obs": np.pad(all_tokens[:-1], ((0, pad_amount),), constant_values=tokenizer.pad_token_id),
        # NOTE: the "target" key is repeated further down in this dict; the later
        # entry (cont_text) overrides this padded token array.
        "target": np.pad(all_tokens[1:], ((0, pad_amount),), constant_values=tokenizer.pad_token_id),
        "ctx_length": seq,
        "eval_mask": np.logical_and(
            np.arange(0, seq) > len(all_tokens) - len(cont_tokens) - 2,
            np.arange(0, seq) < len(all_tokens) - 1
        ),
        "prompt": ctx_text,
        "target": cont_text,
        "text": all_text,
    }


class EvalHarnessAdaptor(LM):
    def greedy_until(self, requests):
        raise Exception("unimplemented")

    def loglikelihood_rolling(self, requests):
        raise Exception("unimplemented")

    def __init__(self, tpu_cluster, seq, batch, shrink, min_seq=None):
        super().__init__()
        self.tpu = tpu_cluster
        self.seq = seq
        self.batch = batch
        self.shrink = shrink
        self.min_seq = min_seq

        self.pool = multiprocessing.Pool(processes=1, initializer=process_init)
        # self.pool = multiprocessing.Pool(initializer=process_init)
        process_init()

    def convert_requests(self, requests):
        return self.pool.imap(partial(process_request, seq=self.seq), requests)

    def loglikelihood(self, requests):
        output = []

        r = self.convert_requests(requests)
        zero_example = process_request(requests[0], self.seq)

        for b in tqdm(sample_batch(r, self.batch, zero_example),
                      desc="LM eval harness",
                      total=len(requests) // self.batch):

            if self.shrink:
                b = shrink_seq(b, min_seq=self.min_seq)

            out = self.tpu.eval(b)

            for loss, correct in zip(out["mask_loss"], out["each_correct"]):
                output.append((float(-loss), bool(correct)))

        return output



--------------------------------------------------------------------------------
/accuracy/lm_eval/tasks/util.py:
--------------------------------------------------------------------------------
from itertools import zip_longest

import numpy as np


def grouper(n, iterable, fillvalue):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)


# divide the seq length by 2 until it would truncate actual context
def shrink_seq(examples, min_seq=None):
    length = examples["obs"].shape[-1]

    new_length = length // 2

    if min_seq is not None:
        if new_length < min_seq:
            return examples

    max_length = np.max(examples["eval_mask"] * np.arange(0, length)) + 1

    if max_length < new_length:
        examples["obs"] = examples["obs"][:, :new_length]
        examples["target"] = examples["target"][:, :new_length]
        examples["eval_mask"] = examples["eval_mask"][:, :new_length]

        return shrink_seq(examples, min_seq=min_seq)
    else:
        return examples


def sample_batch(examples, bs, zero_example_shape):
    zero_example = {
        "obs": np.zeros_like(zero_example_shape["obs"]),
        "target": np.zeros_like(zero_example_shape["target"]),
        "eval_mask": np.zeros_like(zero_example_shape["eval_mask"]),
        "ctx_length": 0,
    }

    for batch in grouper(bs, examples, zero_example):
        batch_flattened = {
            "obs": [],
            "target": [],
            "eval_mask": [],
            "ctx_length": [],
            "text": [],
        }

        for sample in batch:
            batch_flattened["obs"].append(sample["obs"])
            batch_flattened["target"].append(sample["target"])
            batch_flattened["eval_mask"].append(sample["eval_mask"])
            batch_flattened["ctx_length"].append(sample["ctx_length"])
            batch_flattened["text"].append(sample["text"])

        batch_flattened["obs"] = np.array(batch_flattened["obs"])
        batch_flattened["target"] = np.array(batch_flattened["target"])
        batch_flattened["eval_mask"] = np.array(batch_flattened["eval_mask"])
        batch_flattened["ctx_length"] = np.array(batch_flattened["ctx_length"])

        yield batch_flattened

--------------------------------------------------------------------------------
/accuracy/lm_eval/utils_lm_eval/modify_llama.py:
--------------------------------------------------------------------------------
import os
import pdb
import copy
import math
import numpy as np
from dataclasses import dataclass
from typing import Optional, Tuple, Union

import torch
from torch import nn
import torch.utils.checkpoint
import torch.nn.functional as F
from torch.cuda.amp import autocast
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss


from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaRotaryEmbedding, LlamaAttention, apply_rotary_pos_emb


__all__ = ['convert_kvcache_llama_heavy_recent', 'LlamaAttention_heavy_hitter']


def local_heavy_hitter_mask(attn_weights, heavy_budget):

    # attn_weights (BS, head, query, keys)
    dtype_attn_weights = attn_weights.dtype
    seq_length = attn_weights.shape[-1]
    padding_length = 0

    offset = torch.finfo(attn_weights.dtype).min
    tmp_attn = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(dtype_attn_weights)

    accumulated_attention_score = torch.sum(tmp_attn[:,:,padding_length:heavy_budget+padding_length,:], dim=-2) #(head, keys)
    accumulated_attention_score[:,:,heavy_budget+padding_length:] = 0
    accumulated_attention_score[:,:,:padding_length] = 0

    mask_bottom = torch.zeros_like(attn_weights, dtype=torch.bool)
    mask_bottom[:,:, padding_length:heavy_budget+padding_length, padding_length:heavy_budget+padding_length] = True

    for token_index in range(heavy_budget+padding_length, seq_length):

        tmp_attn_index = nn.functional.softmax(attn_weights[:,:,token_index,:], dim=-1, dtype=torch.float32).to(dtype_attn_weights)
        _, tmp_topk_index = accumulated_attention_score.topk(k=heavy_budget-1, dim=-1)
        zeros_index = torch.zeros_like(tmp_attn_index, dtype=torch.bool)
        mask_bottom_index = zeros_index.scatter(-1, tmp_topk_index, True) #(head, keys)
        mask_bottom_index[:,:, token_index] = True

        mask_bottom[:,:,token_index,:] = mask_bottom_index
        accumulated_attention_score += tmp_attn_index
        accumulated_attention_score = accumulated_attention_score * mask_bottom_index

    return mask_bottom


class LlamaAttention_heavy_hitter(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.max_position_embeddings = config.max_position_embeddings

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
        self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)

        self.heavy_budget_ratio = config.heavy_ratio
        self.recent_budget_ratio = config.recent_ratio


    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            kv_seq_len += past_key_value[0].shape[-2]
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
        # [bsz, nh, t, hd]

        if past_key_value is not None:
            # reuse k, v, self_attention
            key_states = torch.cat([past_key_value[0], key_states], dim=2)
            value_states = torch.cat([past_key_value[1], value_states], dim=2)

        past_key_value = (key_states, value_states) if use_cache else None

        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.size()}"
            )

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights + attention_mask
            attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min))

        ### Heavy + Recent
        heavy_budget = int(self.heavy_budget_ratio * attn_weights.shape[-1])
        recent_budget = int(self.recent_budget_ratio * attn_weights.shape[-1])

        # Heavy Hitter Mask
        if heavy_budget > 0:
            mask_bottom = local_heavy_hitter_mask(attn_weights, heavy_budget) # Default: No padding applied to input
        else:
            mask_bottom = torch.zeros_like(attn_weights, dtype=torch.bool)

        ones = torch.ones_like(attn_weights, dtype=torch.bool)
        ones = torch.triu(ones, diagonal=-recent_budget)
        mask_bottom = torch.logical_or(mask_bottom, ones)

        mask_bottom = torch.tril(mask_bottom, diagonal=0)

        # mask_bottom = ones
        attn_weights[~mask_bottom] = torch.min(attention_mask)

        # upcast attention to fp32
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
        attn_output = torch.matmul(attn_weights, value_states)

        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value


def convert_kvcache_llama_heavy_recent(model, config):

    #for name, module in reversed(model._modules.items()):
    #    if len(list(module.children())) > 0:
    #        model._modules[name] = convert_kvcache_llama_heavy_recent(module, config)
    #    if isinstance(module, LlamaAttention):
    #        model._modules[name] = LlamaAttention_heavy_hitter(config)
    for idx, module in enumerate(model.model.layers):
        if (idx != 0) and (idx!=1):
        #if (idx != 32) and (idx!=33):
            model.model.layers[idx].self_attn = LlamaAttention_heavy_hitter(config)
        else:
            print("skip layer: ", idx)

    return model


--------------------------------------------------------------------------------
/accuracy/perplexity/datautils.py:
--------------------------------------------------------------------------------
from transformers import LlamaTokenizer, AutoTokenizer
from datasets import load_dataset
import numpy as np
import torch
import os

def set_symlink(model_type, fname):
    model_path = "../transformers/src/transformers/models/" + model_type
    linker_path = os.path.realpath("../src/" + fname)
    if not os.path.exists(linker_path):
        print(f"No file exists at {linker_path}")
        exit(0)
    if not os.path.exists(model_path):
        print(f"No file exists at {model_path}")
        exit(0)
    curr_dir = os.getcwd()
    os.chdir(model_path)
    if os.path.exists(f'modeling_{model_type}.py'):
        cmd = f"rm modeling_{model_type}.py"
        os.system(cmd)
    cmd = f"ln -s {linker_path} modeling_{model_type}.py"
    os.system(cmd)
    os.chdir(curr_dir)

def set_seed(seed):
    np.random.seed(seed)
    torch.random.manual_seed(seed)

def get_wikitext2(nsamples, seed, seqlen, model):
    testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    try:
        tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    except Exception:
        # Fall back to the LLaMA tokenizer when AutoTokenizer cannot load the model.
        tokenizer = LlamaTokenizer.from_pretrained(model, use_fast=False)
    testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')
    return testenc

def get_ptb(nsamples, seed, seqlen, model):
    valdata = load_dataset('ptb_text_only', 'penn_treebank', split='validation')
    try:
        tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    except Exception:
        # Fall back to the LLaMA tokenizer when AutoTokenizer cannot load the model.
        tokenizer = LlamaTokenizer.from_pretrained(model, use_fast=False)
    testenc = tokenizer("\n\n".join(valdata['sentence']), return_tensors='pt')
    return testenc

def get_loaders(
    name, nsamples=128, seed=0, seqlen=2048, model=''
):
    if 'wikitext2' in name:
        return get_wikitext2(nsamples, seed, seqlen, model)
    if 'ptb' in name:
        return get_ptb(nsamples, seed, seqlen, model)

--------------------------------------------------------------------------------
/accuracy/perplexity/figure12.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# InfiniGen
echo "== InfiniGen =="
partial=0.2
seqlen=2048
alpha=4.0
budget=0.2
echo opt-13b alpha $alpha budget $budget
python opt.py --model "../setup/opt-model/opt-13b" \
    --eval_dataset "wikitext2" \
    --seq_len ${seqlen} \
    --eval_samples 0 \
    --model_name "opt-13b" \
    --infinigen \
    --print_blk_ppl \
    --partial_weight_ratio ${partial} \
    --partial_weight_path "../setup/weights/opt-13b_${partial}" \
    --alpha ${alpha} \
    --budget ${budget} \
    --capacity 1.0

partial=0.2 24 | seqlen=4096 25 | alpha=5.2 26 | budget=0.2 27 | echo llama-2 alpha $alpha budget $budget 28 | python llama.py --model "${LLAMA_PATH}/llama-2-13b" \ 29 | --eval_dataset "wikitext2" \ 30 | --seq_len ${seqlen} \ 31 | --eval_samples 0 \ 32 | --model_name "llama-2-13b" \ 33 | --infinigen \ 34 | --print_blk_ppl \ 35 | --partial_weight_ratio ${partial} \ 36 | --partial_weight_path "../setup/weights/llama-2-13b_${partial}" \ 37 | --skewing_matrix_path "../setup/skewing_matrix/llama-2-13b.pt" \ 38 | --alpha ${alpha} \ 39 | --budget ${budget} \ 40 | --capacity 1.0 41 | 42 | echo "===============" 43 | 44 | # H2O 45 | echo "== H2O ==" 46 | partial=0.2 47 | seqlen=2048 48 | heavy=0.01875 49 | recent=0.01875 50 | echo opt-13b heavy $heavy recent $recent 51 | python opt.py --model "facebook/opt-13b" \ 52 | --eval_dataset "wikitext2" \ 53 | --seq_len ${seqlen} \ 54 | --eval_samples 0 \ 55 | --model_name "opt-13b" \ 56 | --print_blk_ppl \ 57 | --heavy_ratio ${heavy} \ 58 | --recent_ratio ${recent} 59 | 60 | partial=0.2 61 | seqlen=4096 62 | heavy=0.01875 63 | recent=0.01875 64 | echo llama-2-13b heavy $heavy recent $recent 65 | python llama.py --model "${LLAMA_PATH}/llama-2-13b" \ 66 | --eval_dataset "wikitext2" \ 67 | --seq_len ${seqlen} \ 68 | --eval_samples 0 \ 69 | --model_name "llama-2-13b" \ 70 | --print_blk_ppl \ 71 | --heavy_ratio ${heavy} \ 72 | --recent_ratio ${recent} 73 | 74 | echo "===============" 75 | -------------------------------------------------------------------------------- /accuracy/perplexity/llama.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import copy 4 | import argparse 5 | import math 6 | from datautils import * 7 | 8 | 9 | def get_llama(model, seqlen): 10 | def skip(*args, **kwargs): 11 | pass 12 | torch.nn.init.kaiming_uniform_ = skip 13 | torch.nn.init.uniform_ = skip 14 | torch.nn.init.normal_ = skip 15 | 16 | from transformers import 
LlamaForCausalLM 17 | model = LlamaForCausalLM.from_pretrained(model, device_map='cpu', torch_dtype=torch.float16) 18 | 19 | model.seqlen = seqlen 20 | return model 21 | 22 | 23 | @torch.no_grad() 24 | def llama_eval(model, testenc, dev, eval_sample, ours, print_chunk = False): 25 | print('Evaluating ...') 26 | 27 | testenc = testenc.input_ids 28 | if eval_sample: 29 | nsamples = eval_sample 30 | else: 31 | nsamples = min(1000, testenc.numel() // model.seqlen) 32 | print("nsamples: ", nsamples) 33 | 34 | use_cache = model.config.use_cache 35 | model.config.use_cache = False 36 | layers = model.model.layers 37 | 38 | model.model.embed_tokens = model.model.embed_tokens.to(dev) 39 | layers[0] = layers[0].to(dev) 40 | 41 | dtype = next(iter(model.parameters())).dtype 42 | inps = torch.zeros((nsamples, model.seqlen, model.config.hidden_size), dtype=dtype, device=dev) 43 | cache = {'i': 0, 'attention_mask': None} 44 | 45 | class Catcher(nn.Module): 46 | 47 | def __init__(self, module): 48 | super().__init__() 49 | self.module = module 50 | 51 | def forward(self, inp, **kwargs): 52 | inps[cache['i']] = inp 53 | cache['i'] += 1 54 | cache['attention_mask'] = kwargs['attention_mask'] 55 | cache['position_ids'] = kwargs['position_ids'] 56 | raise ValueError 57 | 58 | layers[0] = Catcher(layers[0]) 59 | for i in range(nsamples): 60 | batch = testenc[:, (i * model.seqlen):((i + 1) * model.seqlen)].to(dev) 61 | try: 62 | model(batch) 63 | except ValueError: 64 | pass 65 | layers[0] = layers[0].module 66 | 67 | layers[0] = layers[0].cpu() 68 | model.model.embed_tokens = model.model.embed_tokens.cpu() 69 | torch.cuda.empty_cache() 70 | 71 | outs = torch.zeros_like(inps) 72 | attention_mask = cache['attention_mask'] 73 | position_ids = cache['position_ids'] 74 | prev_hidden = [] 75 | 76 | density = [] 77 | for i in range(len(layers)): 78 | layer = layers[i].to(dev) 79 | 80 | for j in range(nsamples): 81 | if ours: 82 | if i >= 2: 83 | layer.self_attn.previous_hidden_states = 
prev_hidden[j] 84 | outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0] 85 | if ours: 86 | if i >= 1: 87 | cur_bsz = layer.self_attn.current_hidden_states.shape[0] 88 | cur_tgt_len = layer.self_attn.current_hidden_states.shape[1] 89 | cur_device = layer.self_attn.current_hidden_states.device 90 | cur_dtype = layer.self_attn.current_hidden_states.dtype 91 | if i == 1: 92 | prev_hidden.append(layer.self_attn.current_hidden_states) 93 | else: 94 | prev_hidden[j] = layer.self_attn.current_hidden_states 95 | if ours and layer.self_attn.density is not None: 96 | density.append(layer.self_attn.density) 97 | 98 | layers[i] = layer.cpu() 99 | del layer 100 | torch.cuda.empty_cache() 101 | inps, outs = outs, inps 102 | print(i, end=' ',flush=True) 103 | print() 104 | 105 | if ours: 106 | density = sum(density) / len(density) * 100 107 | retain_ratio = (1 - math.sqrt(1 - (density/100))) * 100 108 | #print("density %f"%(density)) 109 | print("retain ratio %f"%((retain_ratio))) 110 | 111 | if model.model.norm is not None: 112 | model.model.norm = model.model.norm.to(dev) 113 | model.lm_head = model.lm_head.to(dev) 114 | 115 | testenc = testenc.to(dev) 116 | nlls = [] 117 | for i in range(nsamples): 118 | hidden_states = inps[i].unsqueeze(0) 119 | if model.model.norm is not None: 120 | hidden_states = model.model.norm(hidden_states) 121 | lm_logits = model.lm_head(hidden_states) 122 | shift_logits = lm_logits[:, :-1, :].contiguous() 123 | shift_labels = testenc[:, (i * model.seqlen):((i + 1) * model.seqlen)][:, 1:] 124 | loss_fct = nn.CrossEntropyLoss(reduction='none') 125 | # loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) 126 | # neg_log_likelihood = loss.float() * model.seqlen 127 | neg_log_likelihood = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)).to(torch.float) 128 | nlls.append(neg_log_likelihood) 129 | nlls = torch.stack(nlls) 130 | for seqlen in 
range(int(model.seqlen/256)): 131 | start = seqlen * 256 132 | end = (seqlen+1)*256 133 | per_token_ppl = torch.exp(nlls[:, start:end].sum(dim=0) / nsamples) 134 | if seqlen == 0: 135 | var = torch.var(per_token_ppl[25:], correction=0) 136 | else: 137 | var = torch.var(per_token_ppl, correction=0) 138 | q1 = torch.quantile(per_token_ppl, 0.25, interpolation='nearest') 139 | q2 = torch.quantile(per_token_ppl, 0.5, interpolation='nearest') 140 | q3 = torch.quantile(per_token_ppl, 0.75, interpolation='nearest') 141 | ppl = torch.exp(nlls[:, start:end].sum() / (nsamples*256)) 142 | if print_chunk: 143 | print("seqlen: ", end) 144 | print("perplexity, variance, q1, q2, q3: ", ppl.item(), var.item(), q1.item(), q2.item(), q3.item()) 145 | 146 | print("Total") 147 | ppl = torch.exp(nlls.sum() / (nsamples * model.seqlen)) 148 | print("Perplexity: ", ppl.item()) 149 | 150 | 151 | model.config.use_cache = use_cache 152 | 153 | if __name__ == '__main__': 154 | 155 | parser = argparse.ArgumentParser() 156 | 157 | parser.add_argument('--model', type=str, help='llama model to load; pass `/path/to/llama`') 158 | parser.add_argument('--eval_dataset', type=str, help='evaluation dataset') 159 | parser.add_argument('--seq_len', type=int, help='model sequence length') 160 | parser.add_argument('--eval_samples', type=int, default=0, help='number of evaluation samples (0 uses up to 1000 full-length samples)') 161 | parser.add_argument('--model_name', type=str, help='name of the model') 162 | parser.add_argument('--print_blk_ppl', action='store_true', help='print perplexity statistics for each 256-token block') 163 | 164 | ## H2O 165 | parser.add_argument("--heavy_ratio", type=float, default=None) 166 | parser.add_argument("--recent_ratio", type=float, default=None) 167 | 168 | ## InfiniGen 169 | parser.add_argument('--infinigen', action='store_true', help='evaluate with the InfiniGen modeling file (modeling_llama_ours.py)') 170 | parser.add_argument("--partial_weight_ratio", type=float, default=None) 171 | parser.add_argument("--partial_weight_path", type=str) 172 | parser.add_argument("--skewing_matrix_path", type=str) 173 | 
parser.add_argument("--alpha", type=float, default=0.0) 174 | parser.add_argument("--budget", type=float, default=0.0) 175 | parser.add_argument("--capacity", type=float, default=0.0) 176 | parser.add_argument("--eviction_policy", type=str, default="lru") 177 | 178 | args = parser.parse_args() 179 | 180 | if args.infinigen: 181 | set_symlink("llama", "modeling_llama_ours.py") 182 | else: 183 | set_symlink("llama", "modeling_llama_orig.py") 184 | 185 | model = get_llama(args.model, args.seq_len) 186 | 187 | ## H2O 188 | if args.heavy_ratio is not None: 189 | import sys 190 | sys.path.append("../lm_eval") 191 | from utils_lm_eval.modify_llama import convert_kvcache_llama_heavy_recent, LlamaAttention_heavy_hitter 192 | 193 | model.config.heavy_ratio = args.heavy_ratio 194 | model.config.recent_ratio = args.recent_ratio 195 | if not os.path.exists(f"../h2o_model/{args.model_name}.pt"): 196 | os.system("mkdir ../h2o_model") 197 | checkpoint = copy.deepcopy(model.state_dict()) 198 | torch.save(checkpoint, f"../h2o_model/{args.model_name}.pt") 199 | model = convert_kvcache_llama_heavy_recent(model, model.config) 200 | model.load_state_dict(torch.load(f"../h2o_model/{args.model_name}.pt")) 201 | model = model.to(torch.float16) 202 | 203 | ## InfiniGen 204 | if args.infinigen: 205 | A = torch.load(args.skewing_matrix_path) 206 | for layer in range(len(model.model.layers)): 207 | model.model.layers[layer].self_attn.partial_weight_ratio = args.partial_weight_ratio 208 | model.model.layers[layer].self_attn.partial_weight_q = torch.load(args.partial_weight_path + "/partial_weight_q_" + str(layer) + ".pt") 209 | model.model.layers[layer].self_attn.skewing_matrix = A[layer] 210 | 211 | for layer in range(len(model.model.layers)): 212 | model.model.layers[layer].self_attn.alpha = args.alpha 213 | model.model.layers[layer].self_attn.budget = args.budget 214 | model.model.layers[layer].self_attn.capacity = args.capacity 215 | model.model.layers[layer].self_attn.eviction_policy = 
args.eviction_policy 216 | 217 | model.eval() 218 | 219 | dataset = args.eval_dataset 220 | testloader = get_loaders(dataset, model=args.model, seqlen=model.seqlen) 221 | 222 | llama_eval(model, testloader, 'cuda', args.eval_samples, args.infinigen, args.print_blk_ppl) 223 | -------------------------------------------------------------------------------- /accuracy/perplexity/table2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | partial=0.2 4 | seqlen=2048 5 | 6 | ## OPT 7 | alpha=4 8 | budget=0.2 9 | 10 | for size in 6.7b 13b 30b;do 11 | for dataset in "wikitext2" "ptb";do 12 | echo opt-$size ${dataset} 100% cache 13 | python opt.py --model "../setup/opt-model/opt-${size}" \ 14 | --eval_dataset ${dataset} \ 15 | --seq_len ${seqlen} \ 16 | --eval_samples 0 \ 17 | --model_name "opt-${size}" \ 18 | --infinigen \ 19 | --partial_weight_ratio ${partial} \ 20 | --partial_weight_path "../setup/weights/opt-${size}_${partial}" \ 21 | --alpha ${alpha} \ 22 | --budget ${budget} \ 23 | --capacity 1.0 24 | done 25 | done 26 | 27 | for size in 6.7b 13b 30b;do 28 | for dataset in "wikitext2" "ptb";do 29 | for evict in fifo lru counter;do 30 | echo opt-$size ${dataset} 80% cache evict ${evict} 31 | python opt.py --model "../setup/opt-model/opt-${size}" \ 32 | --eval_dataset ${dataset} \ 33 | --seq_len ${seqlen} \ 34 | --eval_samples 0 \ 35 | --model_name "opt-${size}" \ 36 | --infinigen \ 37 | --partial_weight_ratio ${partial} \ 38 | --partial_weight_path "../setup/weights/opt-${size}_${partial}" \ 39 | --alpha ${alpha} \ 40 | --budget ${budget} \ 41 | --capacity 0.8 \ 42 | --eviction_policy ${evict} 43 | done 44 | done 45 | done 46 | 47 | ## Llama-2 48 | alpha=5 49 | budget=0.2 50 | 51 | for size in 7b 13b;do 52 | for dataset in "wikitext2" "ptb";do 53 | echo llama-2-${size} ${dataset} 100% cache 54 | python llama.py --model "${LLAMA_PATH}/llama-2-${size}" \ 55 | --eval_dataset ${dataset} \ 56 | --seq_len ${seqlen} \ 57 
| --eval_samples 0 \ 58 | --model_name "llama-${size}" \ 59 | --infinigen \ 60 | --partial_weight_ratio ${partial} \ 61 | --partial_weight_path "../setup/weights/llama-2-${size}_${partial}" \ 62 | --skewing_matrix_path "../setup/skewing_matrix/llama-2-${size}.pt" \ 63 | --alpha ${alpha} \ 64 | --budget ${budget} \ 65 | --capacity 1.0 66 | done 67 | done 68 | 69 | for size in 7b 13b;do 70 | for dataset in "wikitext2" "ptb";do 71 | for evict in fifo lru counter;do 72 | echo llama-2-${size} ${dataset} 80% cache evict ${evict} 73 | python llama.py --model "${LLAMA_PATH}/llama-2-${size}" \ 74 | --eval_dataset ${dataset} \ 75 | --seq_len ${seqlen} \ 76 | --eval_samples 0 \ 77 | --model_name "llama-${size}" \ 78 | --infinigen \ 79 | --partial_weight_ratio ${partial} \ 80 | --partial_weight_path "../setup/weights/llama-2-${size}_${partial}" \ 81 | --skewing_matrix_path "../setup/skewing_matrix/llama-2-${size}.pt" \ 82 | --alpha ${alpha} \ 83 | --budget ${budget} \ 84 | --capacity 0.8 \ 85 | --eviction_policy ${evict} 86 | done 87 | done 88 | done 89 | -------------------------------------------------------------------------------- /accuracy/scripts/figure11/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cwd=$PWD 4 | 5 | cd ../../lm_eval 6 | python figure11.py 7 | cd $cwd 8 | -------------------------------------------------------------------------------- /accuracy/scripts/figure12/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cwd=$PWD 4 | 5 | cd ../../perplexity 6 | bash figure12.sh 7 | cd $cwd 8 | -------------------------------------------------------------------------------- /accuracy/scripts/figure13/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cwd=$PWD 4 | 5 | cd ../../lm_eval 6 | bash figure13.sh 7 | cd $cwd 8 | 
-------------------------------------------------------------------------------- /accuracy/scripts/figure17/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cwd=$PWD 4 | 5 | cd ../../lm_eval 6 | bash figure17.sh 7 | cd $cwd 8 | -------------------------------------------------------------------------------- /accuracy/scripts/run_all.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | for FIG in "figure11" "figure12" "figure13" "table2" "figure17"; do 4 | cd $FIG 5 | sh run.sh 6 | cd .. 7 | done 8 | -------------------------------------------------------------------------------- /accuracy/scripts/table2/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cwd=$PWD 4 | 5 | cd ../../perplexity 6 | bash table2.sh 7 | cd $cwd 8 | -------------------------------------------------------------------------------- /accuracy/setup/gen_llama_skewing_matrix.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig 2 | import argparse 3 | import torch 4 | import os 5 | from utils import * 6 | 7 | ### Parameters 8 | 9 | def process_options(): 10 | parser = argparse.ArgumentParser(description="Llama-2 Model") 11 | parser.add_argument("--model", required=True, 12 | help='Llama-2 model to load') 13 | parser.add_argument("--output", required=True, 14 | help='output directory to store result') 15 | return parser 16 | 17 | def main(): 18 | parser = process_options() 19 | args = parser.parse_args() 20 | 21 | ### Model load 22 | set_symlink("llama", "modeling_llama_orig.py") 23 | 24 | model_name = os.path.basename(args.model) 25 | config = AutoConfig.from_pretrained(args.model) 26 | tokenizer = AutoTokenizer.from_pretrained(args.model) 27 | model = AutoModelForCausalLM.from_pretrained(args.model, 
torch_dtype=torch.float16).cuda() 28 | head_dim = model.model.layers[0].self_attn.head_dim 29 | n_head = model.model.layers[0].self_attn.num_heads 30 | n_layer = config.num_hidden_layers 31 | 32 | ### Generation 33 | file_path = "./pg19_firstbook.txt" 34 | 35 | with open(file_path, 'r') as file: 36 | prompt = file.read() 37 | 38 | input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()[:, :2048] 39 | 40 | print("Start Generation") 41 | 42 | generated_ids = model.generate(input_ids, max_new_tokens = 1, min_new_tokens = 1) 43 | 44 | print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)) 45 | 46 | query_v = {} 47 | key_v = {} 48 | 49 | for i, layer in enumerate(model.model.layers): 50 | query_v[str(i)] = layer.self_attn.rope_query 51 | key_v[str(i)] = layer.self_attn.rope_key 52 | 53 | ### Gen Skewing Matrix A 54 | A = torch.zeros(n_layer, n_head, head_dim, head_dim).to('cuda').to(torch.float16) 55 | for name in query_v: 56 | layer = int(name) 57 | query = query_v[name] 58 | key = key_v[name] 59 | 60 | for head in range(n_head): 61 | in_q = query[0, head] 62 | in_k = key[0, head] 63 | uq, sq, vq = torch.svd(in_q.to(torch.float)) 64 | uk, sk, vk = torch.svd(in_k.to(torch.float)) 65 | s = sq * sk 66 | a = torch.zeros(head_dim, head_dim).to('cuda') 67 | _, ind = s.sort() 68 | r,c = a.shape 69 | A[layer, head] = a.scatter(-1, ind.unsqueeze(0).repeat(r,1), vq).to(torch.float16) 70 | 71 | save_dir = args.output 72 | if not os.path.exists(save_dir): 73 | os.system(f"mkdir -p {save_dir}") 74 | torch.save(A, save_dir + "/" + model_name + ".pt") 75 | 76 | if __name__ == "__main__": 77 | main() 78 | -------------------------------------------------------------------------------- /accuracy/setup/gen_opt_model.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig 2 | import argparse 3 | import torch 4 | import os 5 | from utils import * 6 | 7 | 
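# Note on the `scatter` call that builds the skewing matrix in these setup
# scripts: `_, ind = s.sort()` yields the ascending sort order of `s = sq * sk`,
# and `scatter(-1, ind.unsqueeze(0).repeat(r, 1), vq)` then writes column j of
# `vq` into column `ind[j]` of a zero matrix. The sketch below is an
# illustration only, assuming small hand-written inputs; the helper names
# `argsort` and `scatter_columns` are hypothetical and not part of the repository.

```python
# Illustrative sketch only: a pure-Python stand-in for
# `torch.zeros(r, c).scatter(-1, ind.unsqueeze(0).repeat(r, 1), vq)`.

def argsort(values):
    """Return the indices that sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda k: values[k])


def scatter_columns(src, index):
    """Write column j of `src` into column index[j] of a zero matrix."""
    rows, cols = len(src), len(src[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][index[j]] = src[i][j]
    return out


s = [0.5, 2.0, 1.0]            # stands in for sq * sk
vq = [[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]         # stands in for the right singular vectors of the query
ind = argsort(s)               # ascending order of s
A = scatter_columns(vq, ind)   # columns of vq permuted according to ind
print(ind)
print(A)
```

# The result is only a column permutation of `vq`, so no values are changed;
# the toy numbers above are made up and carry no meaning beyond the indexing.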
def process_options(): 8 | parser = argparse.ArgumentParser(description="OPT Model") 9 | parser.add_argument("--model", default="facebook/opt-6.7b", 10 | help='OPT model to load') 11 | parser.add_argument("--output", required=True, 12 | help='output directory to store result') 13 | parser.add_argument("--no_skewing", action='store_true', 14 | help='disable weight skewing (bias and scaling are still folded into the weights)') 15 | return parser 16 | 17 | def main(): 18 | parser = process_options() 19 | args = parser.parse_args() 20 | 21 | ### Model load 22 | set_symlink("opt", "modeling_opt_orig.py") 23 | 24 | model_name = os.path.basename(args.model) 25 | config = AutoConfig.from_pretrained(args.model) 26 | tokenizer = AutoTokenizer.from_pretrained(args.model) 27 | model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype=torch.float16).cuda() 28 | head_dim = model.model.decoder.layers[0].self_attn.head_dim 29 | n_head = model.model.decoder.layers[0].self_attn.num_heads 30 | 31 | ### Add hook 32 | query_v = {} 33 | key_v = {} 34 | 35 | def get_query(name): 36 | def hook(model, input, output): 37 | query_v[name] = output 38 | return hook 39 | def get_key(name): 40 | def hook(model, input, output): 41 | key_v[name] = output 42 | return hook 43 | 44 | for i, layer in enumerate(model.model.decoder.layers): 45 | query = layer.self_attn.q_proj.register_forward_hook(get_query("%d"%(i))) 46 | key = layer.self_attn.k_proj.register_forward_hook(get_key("%d"%(i))) 47 | 48 | ### Generation 49 | file_path = "./pg19_firstbook.txt" 50 | 51 | with open(file_path, 'r') as file: 52 | prompt = file.read() 53 | 54 | input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()[:, :2048] 55 | 56 | print("Start Generation") 57 | 58 | generated_ids = model.generate(input_ids, max_new_tokens = 1, min_new_tokens = 1) 59 | 60 | print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)) 61 | 62 | ### New weight generation 63 | for name in query_v: 64 | layer = int(name) 65 | query = query_v[name][0] 66 | query = 
query * (head_dim ** -0.5) 67 | key = key_v[name][0] 68 | 69 | wq = model.model.decoder.layers[layer].self_attn.q_proj.weight.data 70 | bq = model.model.decoder.layers[layer].self_attn.q_proj.bias.data 71 | wk = model.model.decoder.layers[layer].self_attn.k_proj.weight.data 72 | bk = model.model.decoder.layers[layer].self_attn.k_proj.bias.data 73 | 74 | new_wq = torch.cat((wq.transpose(-1,-2), bq.unsqueeze(0)), dim = 0) * (head_dim**-0.5) 75 | new_wk = torch.cat((wk.transpose(-1,-2), bk.unsqueeze(0)), dim = 0) 76 | 77 | if not args.no_skewing: 78 | for h in range(n_head): 79 | start = h * head_dim 80 | end = (h+1) * head_dim 81 | uq, sq, vq = torch.svd(query[:, start:end].to(torch.float)) 82 | uk, sk, vk = torch.svd(key[:, start:end].to(torch.float)) 83 | uq = uq.to(torch.float16) 84 | sq = sq.to(torch.float16) 85 | vq = vq.to(torch.float16) 86 | uk = uk.to(torch.float16) 87 | sk = sk.to(torch.float16) 88 | vk = vk.to(torch.float16) 89 | s = sq * sk 90 | 91 | A = torch.zeros(head_dim, head_dim).to('cuda').to(torch.float16) 92 | _, ind = s.sort() 93 | r,c = A.shape 94 | A = A.scatter(-1, ind.unsqueeze(0).repeat(r,1), vq) 95 | new_wq[:, start:end] = new_wq[:, start:end] @ A 96 | new_wk[:, start:end] = new_wk[:, start:end] @ A 97 | 98 | model.model.decoder.layers[layer].self_attn.q_proj.weight.data = new_wq 99 | model.model.decoder.layers[layer].self_attn.k_proj.weight.data = new_wk 100 | 101 | save_dir = args.output + "/" + model_name 102 | if not os.path.exists(save_dir): 103 | os.system(f"mkdir -p {save_dir}") 104 | 105 | model.save_pretrained(save_dir) 106 | 107 | if __name__ == "__main__": 108 | main() 109 | -------------------------------------------------------------------------------- /accuracy/setup/gen_partial_weight.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoModelForCausalLM, AutoTokenizer 2 | import argparse 3 | import torch 4 | import os 5 | from utils import * 6 | 7 | def 
process_options(): 8 | parser = argparse.ArgumentParser(description="Generate partial weight") 9 | parser.add_argument("--our_model_path", default=None, 10 | help='our OPT model') 11 | parser.add_argument("--skewing_matrix_path", default=None, 12 | help='path to skewing matrix') 13 | parser.add_argument("--model", default="facebook/opt-6.7b", 14 | help='model') 15 | parser.add_argument("--model_type", default = "opt", 16 | help='model arch (opt, llama)') 17 | parser.add_argument("--partial_weight_ratio", required=False, default=0.1, 18 | help='Ours: partial weight ratio') 19 | parser.add_argument("--output", required=True, 20 | help='output directory to store result') 21 | return parser 22 | 23 | def main(): 24 | ## get arguments 25 | parser = process_options() 26 | args = parser.parse_args() 27 | file_path = "./pg19_firstbook.txt" 28 | 29 | fname = f"modeling_{args.model_type}_ours_setup.py" 30 | set_symlink(args.model_type, fname) 31 | 32 | if args.our_model_path is not None: 33 | model = AutoModelForCausalLM.from_pretrained(args.our_model_path, torch_dtype=torch.float16).cuda() 34 | else: 35 | model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype=torch.float16).cuda() 36 | 37 | if args.skewing_matrix_path is not None: 38 | A = torch.load(args.skewing_matrix_path).to('cuda').to(torch.float16) 39 | if args.model_type == 'llama': 40 | for layer_num, layer in enumerate(model.model.layers): 41 | layer.self_attn.skewing_matrix = A[layer_num] 42 | 43 | tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False) 44 | 45 | prompt = ["The bartender refused to serve the patron because the patron was drunk.\n\nThe girl politely declined the hamburger because she was a vegetarian.\n\nThe spy discovered the enemy's location because the spy bugged the enemy's phone.\n\nI tossed the ball upwards therefore the ball hit the ceiling.\n\nThe rider fell to the ground because the bull bucked the rider.\n\nThe pair of students came under scrutiny by the 
teacher because the students both received excellent grades."] 46 | 47 | input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda() 48 | 49 | if args.model_type == "opt": 50 | for layer in model.model.decoder.layers: 51 | layer.self_attn.partial_weight_ratio = float(args.partial_weight_ratio) 52 | elif args.model_type == "llama": 53 | for layer in model.model.layers: 54 | layer.self_attn.partial_weight_ratio = float(args.partial_weight_ratio) 55 | 56 | print("Start Generation") 57 | 58 | generated_ids = model.generate(input_ids, max_new_tokens = 1, min_new_tokens = 1) 59 | 60 | print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)) 61 | 62 | basepath = args.output + "/" + os.path.basename(os.path.normpath(args.model)) + "_%s"%(args.partial_weight_ratio) 63 | if not os.path.exists(basepath): 64 | os.system("mkdir -p %s"%(basepath)) 65 | 66 | if args.model_type == "opt": 67 | for layer in range(len(model.model.decoder.layers)): 68 | partial_weight = model.model.decoder.layers[layer].self_attn.partial_weight_q 69 | torch.save(partial_weight, "%s/partial_weight_q_"%(basepath) + str(layer) + ".pt") 70 | elif args.model_type == "llama": 71 | for layer in range(len(model.model.layers)): 72 | partial_weight = model.model.layers[layer].self_attn.partial_weight_q 73 | torch.save(partial_weight, "%s/partial_weight_q_"%(basepath) + str(layer) + ".pt") 74 | 75 | if __name__ == "__main__": 76 | main() 77 | -------------------------------------------------------------------------------- /accuracy/setup/pg19_firstbook.txt: -------------------------------------------------------------------------------- 1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. 
For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house. 2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. 
These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. 
To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. 
After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. 
Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud. 9 | -------------------------------------------------------------------------------- /accuracy/setup/setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | CWD=${PWD} 4 | cd ../transformers/src/transformers/models 5 | 6 | for model in llama opt;do 7 | mv ${model}/modeling_${model}.py ${model}/modeling_${model}_orig.py 8 | done 9 | 10 | cd ${CWD} 11 | 12 | # ========= InfiniGen ============ 13 | # generate opt models w/skewing 14 | for size in 6.7b 13b 30b;do 15 | python gen_opt_model.py \ 16 | --model "facebook/opt-${size}" \ 17 | --output "./opt-model" 18 | done 19 | 20 | # generate skewing matrices for llama 21 | for size in 7b 13b;do 22 | python gen_llama_skewing_matrix.py \ 23 | --model "${LLAMA_PATH}/llama-2-${size}" \ 24 | --output "./skewing_matrix" 25 | done 26 | 27 | 28 | # generate partial weight matrices for prediction 29 | PARTIAL_RATIO=0.2 30 | # opt 31 | for size in 6.7b 13b 30b;do 32 | python gen_partial_weight.py \ 33 | --our_model_path "./opt-model/opt-${size}" \ 34 | --model "facebook/opt-${size}" \ 35 | --model_type "opt" \ 36 | --partial_weight_ratio $PARTIAL_RATIO \ 37 | --output "./weights" 38 | done 39 | 40 | # llama 41 | for size in 7b 13b;do 42 | python gen_partial_weight.py \ 
43 | --skewing_matrix_path "./skewing_matrix/llama-2-${size}.pt" \ 44 | --model "${LLAMA_PATH}/llama-2-${size}" \ 45 | --model_type "llama" \ 46 | --partial_weight_ratio $PARTIAL_RATIO \ 47 | --output "./weights" 48 | done 49 | 50 | 51 | # ========= w/o skewing (figure 13) 52 | PARTIAL_RATIO=0.1 53 | python gen_partial_weight.py \ 54 | --our_model_path "./opt-model/opt-6.7b" \ 55 | --model "facebook/opt-6.7b" \ 56 | --model_type "opt" \ 57 | --partial_weight_ratio $PARTIAL_RATIO \ 58 | --output "./weights" 59 | 60 | python gen_opt_model.py \ 61 | --model "facebook/opt-6.7b" \ 62 | --output "./opt-model-no-skew" \ 63 | --no_skewing 64 | 65 | python gen_partial_weight.py \ 66 | --our_model_path "./opt-model-no-skew/opt-6.7b" \ 67 | --model "facebook/opt-6.7b" \ 68 | --model_type "opt" \ 69 | --partial_weight_ratio $PARTIAL_RATIO \ 70 | --output "./weights-no-skew" 71 | 72 | # ========= partial weight sweep (figure 17) 73 | for PARTIAL_RATIO in 0.1 0.4 0.6 0.8 1.0;do 74 | python gen_partial_weight.py \ 75 | --our_model_path "./opt-model/opt-13b" \ 76 | --model "facebook/opt-13b" \ 77 | --model_type "opt" \ 78 | --partial_weight_ratio $PARTIAL_RATIO \ 79 | --output "./weights" 80 | done 81 | -------------------------------------------------------------------------------- /accuracy/setup/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | def set_symlink(model_type, fname): 3 | model_path = "../transformers/src/transformers/models/" + model_type 4 | linker_path = os.path.realpath("../src/" + fname) 5 | if not os.path.exists(linker_path): 6 | print(f"No file exists at {linker_path}") 7 | exit(0) 8 | if not os.path.exists(model_path): 9 | print(f"No file exists at {model_path}") 10 | exit(0) 11 | curr_dir = os.getcwd() 12 | os.chdir(model_path) 13 | if os.path.exists(f'modeling_{model_type}.py'): 14 | cmd = f"rm modeling_{model_type}.py" 15 | os.system(cmd) 16 | cmd = f"ln -s {linker_path} modeling_{model_type}.py" 
17 | os.system(cmd) 18 | os.chdir(curr_dir) 19 | -------------------------------------------------------------------------------- /figures/infinigen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/snu-comparch/InfiniGen/3754f52d131c30ab7a310e3c4cf356e8e3326ac3/figures/infinigen.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch==2.0.1 2 | torchvision==0.15.2 3 | torchaudio==2.0.2 4 | sentencepiece 5 | lm-eval==0.3.0 6 | ftfy 7 | accelerate 8 | -------------------------------------------------------------------------------- /speedup/README.md: -------------------------------------------------------------------------------- 1 | # Speedup Evaluation 2 | This directory contains the source code for evaluating the speedup. You can use it to reproduce the inference latency results in the paper. Some of the code is adapted from the FlexGen (ICML'23) GitHub repository. 3 | - Getting Started (10 minutes) 4 | - Run Experiments (7 hours) 5 | 6 | ## Getting Started (10 minutes) 7 | ```sh 8 | sh install.sh 9 | export CUDA_HOME=/path/to/cuda 10 | ``` 11 | For a "Hello world"-sized example, please run the following command (10 minutes): 12 | ``` 13 | python -m flexgen.flex_opt --model huggingface/opt-6.7b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 2 --num-gpu-batches 1 --prompt-len 384 --gen-len 128 --warmup-input-path flexgen/pg19_firstbook.txt --test-input-path flexgen/pg19_firstbook.txt --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 102 14 | ``` 15 | ## Run Experiments (7 hours) 16 | We provide scripts to reproduce the experiment results for Figure 14 through Figure 17. To reproduce all the results at once, please run the following commands (7 hours). 
17 | ``` 18 | cd scripts 19 | sh run_all.sh 20 | ``` 21 | If you want to reproduce the results for a specific figure, please run `sh run.sh` in the corresponding directory. For example, 22 | ``` 23 | cd scripts/figure14 24 | sh run.sh 25 | ``` 26 | The following is the time needed to run each experiment on our system (NVIDIA RTX A6000 GPU with 48GB of memory, Intel Xeon Gold 6136 processor with 96GB of DDR4-2666 memory, PCIe 3.0 x16 interconnect). 27 | - Figure 14: 80 minutes 28 | - Figure 15: 220 minutes 29 | - Figure 16a: 30 minutes 30 | - Figure 16b: 60 minutes 31 | - Figure 17a: 20 minutes 32 | - Figure 17b: 10 minutes 33 | 34 | -------------------------------------------------------------------------------- /speedup/flexgen/flexgen/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /speedup/flexgen/flexgen/flex_opt.py: -------------------------------------------------------------------------------- 1 | ../infinigen/flex_opt.py -------------------------------------------------------------------------------- /speedup/flexgen/flexgen/pytorch_backend.py: -------------------------------------------------------------------------------- 1 | ../infinigen/pytorch_backend.py -------------------------------------------------------------------------------- /speedup/flexgen/flexgen/timer.py: -------------------------------------------------------------------------------- 1 | """Global timer for profiling.""" 2 | from collections import namedtuple 3 | import time 4 | from typing import Callable, Any 5 | 6 | 7 | class _Timer: 8 | """An internal timer.""" 9 | 10 | def __init__(self, name: str): 11 | self.name = name 12 | self.started = False 13 | self.start_time = None 14 | 15 | # start-stop timestamp pairs 16 | self.start_times = [] 17 | self.stop_times = [] 18 | self.costs = [] 19 | 20 | def start(self, sync_func: Callable = None): 21 | 
"""Start the timer.""" 22 | assert not self.started, f"timer {self.name} has already been started." 23 | if sync_func: 24 | sync_func() 25 | 26 | self.start_time = time.perf_counter() 27 | self.start_times.append(self.start_time) 28 | self.started = True 29 | 30 | def stop(self, sync_func: Callable = None): 31 | """Stop the timer.""" 32 | assert self.started, f"timer {self.name} is not started." 33 | if sync_func: 34 | sync_func() 35 | 36 | stop_time = time.perf_counter() 37 | self.costs.append(stop_time - self.start_time) 38 | self.stop_times.append(stop_time) 39 | self.started = False 40 | 41 | def reset(self): 42 | """Reset timer.""" 43 | self.started = False 44 | self.start_time = None 45 | self.start_times = [] 46 | self.stop_times = [] 47 | self.costs = [] 48 | 49 | def elapsed(self, mode: str = "average"): 50 | """Calculate the elapsed time.""" 51 | if not self.costs: 52 | return 0.0 53 | if mode == "average": 54 | return sum(self.costs) / len(self.costs) 55 | elif mode == "sum": 56 | return sum(self.costs) 57 | else: 58 | raise RuntimeError("Supported mode is: average | sum") 59 | 60 | 61 | class Timers: 62 | """A group of timers.""" 63 | 64 | def __init__(self): 65 | self.timers = {} 66 | 67 | def __call__(self, name: str): 68 | if name not in self.timers: 69 | self.timers[name] = _Timer(name) 70 | return self.timers[name] 71 | 72 | def __contains__(self, name: str): 73 | return name in self.timers 74 | 75 | 76 | timers = Timers() 77 | 78 | Event = namedtuple("Event", ("tstamp", "name", "info")) 79 | 80 | 81 | class Tracer: 82 | """An activity tracer.""" 83 | 84 | def __init__(self): 85 | self.events = [] 86 | 87 | def log(self, name: str, info: Any, sync_func: Callable = None): 88 | if sync_func: 89 | sync_func() 90 | 91 | self.events.append(Event(time.perf_counter(), name, info)) 92 | 93 | 94 | tracer = Tracer() 95 | -------------------------------------------------------------------------------- /speedup/flexgen/flexgen/utils.py: 
-------------------------------------------------------------------------------- 1 | import argparse 2 | import dataclasses 3 | from attr import define, field 4 | from attr.setters import frozen 5 | import functools 6 | import gc 7 | import math 8 | import os 9 | from typing import Tuple, Union, Optional, Any, Sequence, List 10 | 11 | import numpy as np 12 | import torch 13 | 14 | 15 | KB = 1 << 10 16 | MB = 1 << 20 17 | GB = 1 << 30 18 | T = 1e12 19 | 20 | 21 | @dataclasses.dataclass(frozen=True) 22 | class Task: 23 | """A generation task.""" 24 | inputs: Union[np.array, List[List[int]]] 25 | prompt_len: int 26 | gen_len: int 27 | cut_gen_len: Optional[int] 28 | 29 | do_sample: bool 30 | temperature: float 31 | stop: Optional[int] 32 | 33 | 34 | @dataclasses.dataclass(frozen=True) 35 | class ExecutionEnv: 36 | """Hardware environment.""" 37 | gpu: Any = None 38 | cpu: Any = None 39 | disk: Any = None 40 | mixed: Any = None 41 | 42 | @classmethod 43 | def create(cls, offload_dir): 44 | # fix recursive import 45 | from flexgen.pytorch_backend import TorchDevice, TorchDisk, TorchMixedDevice 46 | gpu = TorchDevice("cuda:0") 47 | cpu = TorchDevice("cpu") 48 | disk = TorchDisk(offload_dir) 49 | return cls(gpu=gpu, cpu=cpu, disk=disk, mixed=TorchMixedDevice([gpu, cpu, disk])) 50 | 51 | def close_copy_threads(self): 52 | self.disk.close_copy_threads() 53 | 54 | 55 | @dataclasses.dataclass(frozen=True) 56 | class BenchmarkResult: 57 | """Benchmark results.""" 58 | prefill_latency: float 59 | prefill_throughput: float 60 | decode_latency: float 61 | decode_throughput: float 62 | total_latency: float 63 | total_throughput: float 64 | 65 | 66 | np_dtype_to_torch_dtype = { 67 | np.float16: torch.float16, np.float32: torch.float32, np.uint8: torch.uint8, 68 | np.int8: torch.int8, np.int32: torch.int32, np.int64: torch.int64, 69 | bool: torch.bool, 70 | } 71 | 72 | torch_dtype_to_np_dtype = { 73 | torch.float16: np.float16, torch.float32: np.float32, 74 | torch.uint8: np.uint8, 
torch.int8: np.int8, torch.int32: np.int32, 75 | torch.int64: np.int64, torch.bool: bool, 76 | } 77 | 78 | torch_dtype_to_num_bytes = { 79 | torch.float16: 2, torch.float32: 4, 80 | torch.int8: 1, torch.uint8: 1, torch.int32: 4, torch.int64: 8, 81 | torch.bool: 1, 82 | } 83 | 84 | 85 | def piecewise_linear_func(xs, ys): 86 | """Return a function created by linear interpolation.""" 87 | indices = np.argsort(xs) 88 | xs = [xs[i] for i in indices] 89 | ys = [ys[i] for i in indices] 90 | 91 | # pad left and right 92 | k = 1e5 93 | delta_x_left = xs[0] - xs[1] 94 | delta_y_left = ys[0] - ys[1] 95 | delta_x_right = xs[-1] - xs[-2] 96 | delta_y_right = ys[-1] - ys[-2] 97 | 98 | xs = [xs[0] + delta_x_left * k] + xs + [xs[-1] + delta_x_right * k] 99 | ys = [ys[0] + delta_y_left * k] + ys + [ys[-1] + delta_y_right * k] 100 | 101 | return functools.partial(piecewise_linear_func_ret_func, xs, ys) 102 | 103 | 104 | def piecewise_linear_func_ret_func(xs, ys, x): 105 | assert x >= xs[0] and x <= xs[-1] 106 | return np.interp(x, xs, ys) 107 | 108 | 109 | def sample_from_range(n, k): 110 | assert n >= 1 111 | 112 | if k == -1: 113 | ret = [1] 114 | while ret[-1] * 2 < n: 115 | ret.append(ret[-1] * 2) 116 | return ret 117 | else: 118 | if k == 1: return [1] 119 | step = (n - 1) // (k - 1) 120 | return list(range(1, n + 1, step)) 121 | 122 | 123 | def cpu_mem_stats(): 124 | objects = gc.get_objects() 125 | tensors = [obj for obj in objects if torch.is_tensor(obj) and not obj.is_cuda] 126 | 127 | total_numel = 0 128 | total_mem = 0 129 | visited_data = set() 130 | for tensor in tensors: 131 | # a data_ptr indicates a memory block allocated 132 | data_ptr = tensor.storage().data_ptr() 133 | if data_ptr in visited_data: 134 | continue 135 | visited_data.add(data_ptr) 136 | 137 | numel = tensor.numel() 138 | total_numel += numel 139 | element_size = tensor.storage().element_size() 140 | mem = numel * element_size 141 | total_mem += mem 142 | 143 | return total_mem 144 | 145 | 146 | def 
torch_mem_stats(): 147 | objects = gc.get_objects() 148 | tensors = [obj for obj in objects if torch.is_tensor(obj) and obj.is_cuda] 149 | 150 | total_numel = 0 151 | total_mem = 0 152 | visited_data = set() 153 | for tensor in tensors: 154 | # a data_ptr indicates a memory block allocated 155 | data_ptr = tensor.storage().data_ptr() 156 | if data_ptr in visited_data: 157 | continue 158 | visited_data.add(data_ptr) 159 | 160 | print(tensor.shape, tensor.data_ptr()) 161 | 162 | numel = tensor.numel() 163 | total_numel += numel 164 | element_size = tensor.storage().element_size() 165 | mem = numel * element_size 166 | total_mem += mem 167 | 168 | return total_mem 169 | 170 | 171 | class ValueHolder: 172 | def __init__(self): 173 | self.val = None 174 | 175 | def store(self, val): 176 | assert self.val is None 177 | self.val = val 178 | 179 | def pop(self): 180 | ret = self.val 181 | self.val = None 182 | return ret 183 | 184 | def clear(self): 185 | self.val = None 186 | 187 | 188 | def array_1d(a, cls): 189 | return [cls() for _ in range(a)] 190 | 191 | 192 | def array_2d(a, b, cls): 193 | return [[cls() for _ in range(b)] for _ in range(a)] 194 | 195 | 196 | def array_3d(a, b, c, cls): 197 | return [[[cls() for _ in range(c)] for _ in range(b)] for _ in range(a)] 198 | 199 | 200 | def array_4d(a, b, c, d, cls): 201 | return [[[[cls() for _ in range(d)] for _ in range(c)] for _ in range(b)] for _ in range(a)] 202 | 203 | 204 | def vector_gather(vectors, indices): 205 | """ 206 | Gathers (batched) vectors according to indices. 
207 | Arguments: 208 | vectors: Tensor[S, B, H] 209 | indices: Tensor[K, B] 210 | Returns: 211 | Tensor[K, B, H] 212 | """ 213 | S, B, H = vectors.shape 214 | K, B2 = indices.shape 215 | assert B == B2 216 | indices = indices.reshape(K, B, 1).expand(K, B, H) 217 | out = vectors.gather(dim=0, index=indices) 218 | return out 219 | 220 | 221 | def run_cmd(cmd): 222 | print(cmd) 223 | os.system(cmd) 224 | 225 | 226 | def str2bool(v): 227 | if isinstance(v, bool): 228 | return v 229 | if v.lower() in ('yes', 'true', 't', 'y', '1'): 230 | return True 231 | elif v.lower() in ('no', 'false', 'f', 'n', '0'): 232 | return False 233 | else: 234 | raise argparse.ArgumentTypeError('Boolean value expected.') 235 | 236 | 237 | def project_decode_latency(costs, prompt_len, gen_len): 238 | decode_costs = costs[1:] 239 | 240 | if gen_len / prompt_len < 0.1: 241 | warmup = 2 242 | decode_latency = (sum(decode_costs[:warmup]) + 243 | np.mean(decode_costs[warmup:]) * (gen_len - 1 - warmup)) 244 | else: 245 | warmup = 2 246 | decode_latency = (sum(decode_costs[:warmup]) + 247 | np.mean(decode_costs[warmup:]) * (gen_len - 1 - warmup)) 248 | 249 | #assert len(decode_costs) >= 4 250 | #warmup = 2 251 | #xs = np.arange(warmup, len(decode_costs)) 252 | #ys = np.asarray(decode_costs[warmup:]) 253 | #curve = np.poly1d(np.polyfit(xs, ys, deg=1)) 254 | #ys_pred = [curve(x) for x in range(gen_len-1)] 255 | #decode_latency = sum(ys_pred) 256 | 257 | #print([round(x, 4) for x in decode_costs]) 258 | #print([round(x, 4) for x in ys_pred]) 259 | 260 | return decode_latency 261 | 262 | 263 | def write_benchmark_log(filename, model_size, cache_size, hidden_size, 264 | gpu_peak_mem, projected, prefill_latency, prefill_throughput, 265 | decode_latency, decode_throughput, total_latency, total_throughput): 266 | 267 | log_str = (f"model size: {model_size/GB:.3f} GB\t" 268 | f"cache size: {cache_size/GB:.3f} GB\t" 269 | f"hidden size (p): {hidden_size/GB:.3f} GB\n" 270 | f"peak gpu mem: {gpu_peak_mem / 
GB:.3f} GB\t" 271 | f"projected: {projected}\n" 272 | f"prefill latency: {prefill_latency:.3f} s\t" 273 | f"prefill throughput: {prefill_throughput:.3f} token/s\n" 274 | f"decode latency: {decode_latency:.3f} s\t" 275 | f"decode throughput: {decode_throughput:.3f} token/s\n" 276 | f"total latency: {total_latency:.3f} s\t" 277 | f"total throughput: {total_throughput:.3f} token/s") 278 | with open(filename, "a") as fout: 279 | fout.write(log_str + "\n") 280 | 281 | return log_str 282 | 283 | 284 | def read_benchmark_log(filename): 285 | with open(filename) as fin: 286 | lines = fin.readlines() 287 | 288 | def extract(line): 289 | a, b = line.split("\t") 290 | latency = a[a.index(":") + 1:a.index(" s")] 291 | throughput = b[b.index(":") + 1:b.index(" to")] 292 | return float(latency), float(throughput) 293 | 294 | prefill_latency, prefill_throughput = extract(lines[2]) 295 | decode_latency, decode_throughput = extract(lines[3]) 296 | total_latency, total_throughput = extract(lines[4]) 297 | 298 | return BenchmarkResult( 299 | prefill_latency, prefill_throughput, 300 | decode_latency, decode_throughput, 301 | total_latency, total_throughput, 302 | ) 303 | -------------------------------------------------------------------------------- /speedup/flexgen/pg19_firstbook.txt: -------------------------------------------------------------------------------- 1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. 
This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house. 2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. 
An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. 
This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. 
Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. 
To the young man the most radiant star of morning seemed to be hidden by a cloud. 9 | -------------------------------------------------------------------------------- /speedup/flexgen/pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools>=61.0"] 3 | build-backend = "setuptools.build_meta" 4 | 5 | [project] 6 | name = "flexgen" 7 | version = "0.1.7" 8 | description = "Running large language models like OPT-175B/GPT-3 on a single GPU. Focusing on high-throughput large-batch generation." 9 | readme = "README.md" 10 | requires-python = ">=3.7" 11 | classifiers = [ 12 | "Programming Language :: Python :: 3", 13 | "License :: OSI Approved :: Apache Software License", 14 | ] 15 | dependencies = [ 16 | "torch>=1.12", "transformers>=4.24", 17 | "numpy", "tqdm", "pulp", "attrs", 18 | ] 19 | 20 | [project.urls] 21 | "Homepage" = "https://github.com/FMInference/FlexGen" 22 | "Bug Tracker" = "https://github.com/FMInference/FlexGen/issues" 23 | 24 | [tool.setuptools.packages.find] 25 | exclude = ["benchmark*", "dist*", "playground*", "scripts*"] 26 | 27 | [tool.wheel] 28 | exclude = ["benchmark*", "dist*", "playground*", "scripts*"] 29 | 30 | 31 | -------------------------------------------------------------------------------- /speedup/infinigen/infinigen/kv_selection_controller.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | def select_kv(prefetch_idx, k_cache, v_cache): 6 | """Selects and aggregates critical KV caches using speculated indices 7 | 8 | On the decoding stage, aggregates the critical KV caches corresponding to 9 | the speculated prefetch index using embedding function. 
10 | 11 | Args: 12 | prefetch_idx: Indices of critical KV cache tokens for each head and batch (n', 1, bh) 13 | k_cache: Key cache (n, bh, d) 14 | v_cache: Value cache (n, bh, d) 15 | 16 | Returns: 17 | selected_k: selected key cache (n', bh, d) 18 | selected_v: selected value cache (n', bh, d) 19 | """ 20 | 21 | prefetch_idx = prefetch_idx.squeeze().to(k_cache.device) 22 | ind = prefetch_idx * k_cache.shape[1] + torch.arange(k_cache.shape[1])[None, :] 23 | selected_k = F.embedding(ind, k_cache.reshape(-1, k_cache.shape[2])) 24 | selected_v = F.embedding(ind, v_cache.reshape(-1, v_cache.shape[2])) 25 | return selected_k, selected_v 26 | 27 | 28 | def speculate_attention(hidden, p_w_q, p_k_c, n_head, alpha, max_num_kv): 29 | """Speculates the indices of the critical KV caches of the next attention layer. 30 | 31 | On the decoding stage, by using the hidden states (layer i), partial query 32 | weight (layer i+1), and partial key cache (layer i+1), speculates the 33 | attention score of the next layer. After that, counts the number of 34 | critical tokens and gets the indices of the top-k KV cache tokens with high 35 | attention scores. 
36 | 37 | Args: 38 | hidden: Hidden states of layer i (b, 1, D) 39 | p_w_q: Partial query weight (D', D) 40 | p_k_c: Partial key cache (n, bh, d') 41 | 42 | Note that bh * d' == D' 43 | 44 | Returns: 45 | prefetch_idx: Indices of critical KV cache tokens for each head and batch (n', 1, bh) 46 | """ 47 | b = hidden.shape[0] 48 | p_q = F.linear(hidden, p_w_q, bias=None) 49 | p_q = p_q.view(b, 1, n_head, -1) 50 | p_q = p_q.permute(0, 2, 1, 3).reshape(b * n_head, 1, -1) 51 | 52 | p_attn = torch.bmm(p_q, p_k_c.permute(1, 2, 0)) 53 | max_ = torch.max(p_attn, dim=-1)[0] 54 | thr_ = (max_ - alpha).unsqueeze(-1).repeat(1, 1, p_attn.shape[-1]) 55 | count = torch.where( 56 | p_attn > thr_, torch.ones_like(p_attn), torch.zeros_like(p_attn) 57 | ) 58 | mean = torch.mean(torch.sum(count, dim=-1)).item() 59 | prefetch_idx = torch.topk( 60 | p_attn.permute(2, 1, 0), min(int(mean), max_num_kv), dim=0 61 | )[1] 62 | 63 | return prefetch_idx 64 | -------------------------------------------------------------------------------- /speedup/infinigen/infinigen/partial_weight_generation_controller.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | def partial_weight_index_generation(query, n_head, head_dim, partial_weight_ratio): 6 | """Generates the indices of partial weight query and partial key cache. 7 | 8 | On the prefill stage, generates the indices of partial weight query and 9 | partial key cache using the query matrix. By comparing the absolute sum of 10 | each column of the query matrix, gets the indices of top-k columns. These 11 | columns correspond to the columns that strongly affect the attention score. 12 | Thus, we use only those partial columns of query and key for speculation. 
13 | 14 | Args: 15 | query: Query matrix (b, n, D) 16 | n_head: Number of heads which we refer to as h 17 | head_dim: Hidden dimension of each head which we refer to as d 18 | partial_weight_ratio: Ratio of the top-k columns 19 | 20 | Returns: 21 | partial_weight_index: Indices of top-k columns (b, h, d') 22 | where d' is d * (partial_weight_ratio). 23 | """ 24 | 25 | partial_weight_index = torch.zeros(n_head, int(head_dim * partial_weight_ratio)).to( 26 | query.device 27 | ) 28 | b = query.shape[0] 29 | 30 | for h_idx in range(n_head): 31 | start = h_idx * head_dim 32 | end = (h_idx + 1) * head_dim 33 | _, ind = torch.topk( 34 | torch.sum(torch.abs(query[0, :, start:end]), dim=-2), 35 | int(head_dim * partial_weight_ratio), 36 | ) 37 | partial_weight_index[h_idx] = ind 38 | 39 | return partial_weight_index.unsqueeze(0).repeat(b, 1, 1).to(torch.int64) 40 | 41 | 42 | def set_partial_cache(k_cache, partial_index, n_head, head_dim): 43 | """Sets the partial key cache. 44 | 45 | On the prefill and decoding stages, generates the partial key cache 46 | following the partial_index which indicates the indices of the important 47 | columns. 48 | 49 | Args: 50 | k_cache: Key cache (n, bh, d) 51 | partial_index: Indices of top-k columns (b, h, d') 52 | n_head: Number of heads which we refer to as h 53 | head_dim: Hidden dimension of each head which we refer to as d 54 | 55 | Returns: 56 | partial_cache: Partial key cache (n, bh, d') 57 | """ 58 | 59 | n, bh, _ = k_cache.shape 60 | partial_cache = torch.gather( 61 | k_cache.view(n, -1, n_head, head_dim), 62 | 3, 63 | partial_index.unsqueeze(0).repeat(n, 1, 1, 1), 64 | ) 65 | return partial_cache.view(n, bh, -1) 66 | 67 | 68 | def set_partial_weight(w_q, partial_index, n_head, head_dim): 69 | """Sets the partial query weight. 70 | 71 | On the prefill stage, generates the partial query weight following the 72 | partial_index which indicates the indices of the important columns. 
73 | 74 | Args: 75 | w_q: Query weight (D, D) 76 | partial_index: Indices of top-k columns (b, h, d') 77 | n_head: Number of heads which we refer to as h 78 | head_dim: Hidden dimension of each head which we refer to as d 79 | 80 | Returns: 81 | partial_weight: Partial query weight (D', D) 82 | """ 83 | 84 | partial_weight = F.embedding( 85 | partial_index[0] 86 | + torch.arange(n_head)[:, None].to(partial_index.device) * head_dim, 87 | w_q.view(-1, w_q.shape[-1]), 88 | ) 89 | return partial_weight.view(-1, w_q.shape[-1]) 90 | -------------------------------------------------------------------------------- /speedup/infinigen/infinigen/skewing_controller.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | 4 | def weight_bias_concat(weight, bias, scaling=False, head_dim=1.0): 5 | """Concatenates the weight matrix and bias. 6 | 7 | In the warmup phase, concatenates the weight matrix and bias for skewing. 8 | This manipulation does not affect correctness. 9 | 10 | Args: 11 | weight: Weight matrix (D, D) 12 | bias: Bias vector (D) 13 | scaling: If true, scales the concatenated weight and bias to skip 14 | the scaling after projection. 15 | head_dim: Hidden dimension of each head which we refer to as d 16 | 17 | Returns: 18 | concatenated weight and bias (D, D+1) 19 | """ 20 | 21 | if scaling: 22 | return torch.cat((weight, bias.unsqueeze(1).to(weight.device)), dim=1) * ( 23 | head_dim**-0.5 24 | ) 25 | else: 26 | return torch.cat((weight, bias.unsqueeze(1).to(weight.device)), dim=1) 27 | 28 | 29 | def reform_hidden_states(hidden_states): 30 | """Appends a column of ones to the hidden states. 31 | 32 | Concatenates the hidden states with a column of ones. 33 | Together with the concatenated weight and bias, this turns the linear 34 | projection into a single matrix multiplication without bias addition. 
35 | 36 | Args: 37 | hidden_states: Hidden states (b, n, D) 38 | 39 | Returns: 40 | reformed hidden states (b, n, D+1) 41 | """ 42 | 43 | return torch.cat( 44 | (hidden_states, torch.ones_like(hidden_states)[:, :, 1].unsqueeze(2)), dim=-1 45 | ) 46 | 47 | 48 | def skew(query, key, wq, wk, n_head, head_dim): 49 | """Manipulates the query/key weight matrix for skewing the query and key matrix. 50 | 51 | In the warmup phase, manipulates the query/key weight matrix for 52 | skewing the query and key matrix. By doing so, a few columns of 53 | the query and key matrix become much more important. We use 54 | these columns for attention speculation. 55 | 56 | Args: 57 | query: Query matrix (b, n, h, d) 58 | key: Key matrix (b, n, h, d) 59 | wq: Concatenated query weight and bias (D, D+1) 60 | wk: Concatenated key weight and bias (D, D+1) 61 | n_head: Number of heads which we refer to as h 62 | head_dim: Hidden dimension of each head which we refer to as d 63 | 64 | Returns: 65 | wq: Manipulated wq (D, D+1) 66 | wk: Manipulated wk (D, D+1) 67 | 68 | """ 69 | 70 | for h_idx in range(n_head): 71 | start = h_idx * head_dim 72 | end = (h_idx + 1) * head_dim 73 | _, sq, vq = torch.svd(query[0, :, h_idx].to(torch.float)) 74 | _, sk, _ = torch.svd(key[0, :, h_idx].to(torch.float)) 75 | sq = sq.to(torch.float16) 76 | vq = vq.to(torch.float16) 77 | sk = sk.to(torch.float16) 78 | sq = sq * sk 79 | A = torch.zeros(head_dim, head_dim).to(query.device).to(torch.float16) 80 | _, ind = sq.sort() 81 | A = A.scatter(-1, ind.unsqueeze(0).repeat(head_dim, 1), vq) 82 | wq[start:end, :] = A.t() @ wq[start:end] 83 | wk[start:end, :] = A.t() @ wk[start:end] 84 | return wq, wk 85 | -------------------------------------------------------------------------------- /speedup/infinigen/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup(name="infinigen", version="0.1.0", 
packages=find_packages(include=["infinigen"])) 4 | -------------------------------------------------------------------------------- /speedup/install.sh: -------------------------------------------------------------------------------- 1 | pip install -e infinigen 2 | pip install -e flexgen 3 | -------------------------------------------------------------------------------- /speedup/scripts/figure14/pg19_firstbook.txt: -------------------------------------------------------------------------------- 1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house. 2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. 
Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. 
Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. 
His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. 
No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud. 
9 | -------------------------------------------------------------------------------- /speedup/scripts/figure14/run.sh: -------------------------------------------------------------------------------- 1 | UVM_PATH=$PWD/../../uvm 2 | export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH 3 | for SCHEME in "uvm" "uvm_h2o" 4 | do 5 | g++ $UVM_PATH/allocate.cpp -o allocate.so --shared -fPIC -I$CUDA_HOME/include 6 | CMD="--embed_dim 5120 --ffn_dim 20480 --enable_bias --n_head 40 --do_layer_norm_before --n_layer 40 --bsz 20 --prompt_len 1920 --gen_len 128 --runs 1" 7 | 8 | if [ "$SCHEME" = "uvm_h2o" ] 9 | then 10 | CMD=$CMD" --is_h2o --h2o_ratio 0.2" 11 | fi 12 | python $UVM_PATH/transformer.py $CMD 13 | rm allocate.so 14 | done 15 | 16 | FLEXGEN_PATH=$PWD/../../flexgen 17 | for SCHEME in "original" "int4" "h2o" "infinigen" 18 | do 19 | rm $FLEXGEN_PATH/flexgen/flex_opt.py 20 | rm $FLEXGEN_PATH/flexgen/pytorch_backend.py 21 | if [ "$SCHEME" = "int4" ] 22 | then 23 | ln -s ../original/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py 24 | ln -s ../original/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py 25 | else 26 | ln -s ../$SCHEME/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py 27 | ln -s ../$SCHEME/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py 28 | fi 29 | 30 | CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 20 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt" 31 | if [ "$SCHEME" = "int4" ] 32 | then 33 | CMD=$CMD" --compress-cache" 34 | elif [ "$SCHEME" = "h2o" ] 35 | then 36 | CMD=$CMD" --max-num-kv 415 --hh-ratio 0.1 --hh-all" 37 | elif [ "$SCHEME" = "infinigen" ] 38 | then 39 | CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 400" 40 | fi 41 | python -m flexgen.flex_opt $CMD 42 | done 43 | -------------------------------------------------------------------------------- /speedup/scripts/figure15/pg19_firstbook.txt: 
-------------------------------------------------------------------------------- [same contents as /speedup/scripts/figure14/pg19_firstbook.txt] -------------------------------------------------------------------------------- /speedup/scripts/figure15/run.sh: -------------------------------------------------------------------------------- 1 | UVM_PATH=$PWD/../../uvm 2 | export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH 3 | for SCHEME in "uvm" "uvm_h2o" 4 | do 5 | g++ $UVM_PATH/allocate.cpp -o allocate.so --shared -fPIC -I$CUDA_HOME/include 6 | for BSZ in 4 8 12 16 20 7 | do 8 | CMD="--embed_dim 5120 --ffn_dim 20480 --enable_bias --n_head 40 --do_layer_norm_before --n_layer 40 --bsz $BSZ --prompt_len 1920 --gen_len 128 --runs 1" 9 | 10 | if [ "$SCHEME" = "uvm_h2o" ] 11 | then 12 | CMD=$CMD" --is_h2o --h2o_ratio 0.2" 13 | fi 14 | python $UVM_PATH/transformer.py $CMD 15 | done 16 | rm allocate.so 17 | done 18 | 19 | FLEXGEN_PATH=$PWD/../../flexgen 20 | for SCHEME in "original" "int4" "h2o" "infinigen" 21 | do 22 | rm $FLEXGEN_PATH/flexgen/flex_opt.py 23 | rm $FLEXGEN_PATH/flexgen/pytorch_backend.py 24 | if [ "$SCHEME" = "int4" ] 25 | then 26 | ln -s ../original/flex_opt.py 
$FLEXGEN_PATH/flexgen/flex_opt.py 27 | ln -s ../original/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py 28 | else 29 | ln -s ../$SCHEME/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py 30 | ln -s ../$SCHEME/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py 31 | fi 32 | 33 | for BSZ in 4 8 12 16 20 34 | do 35 | CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size $BSZ --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt" 36 | if [ "$SCHEME" = "int4" ] 37 | then 38 | CMD=$CMD" --compress-cache" 39 | elif [ "$SCHEME" = "h2o" ] 40 | then 41 | CMD=$CMD" --max-num-kv 409 --hh-ratio 0.1 --hh-all" 42 | elif [ "$SCHEME" = "infinigen" ] 43 | then 44 | CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 409" 45 | fi 46 | python -m flexgen.flex_opt $CMD 47 | done 48 | done 49 | -------------------------------------------------------------------------------- /speedup/scripts/figure16a/pg19_firstbook.txt: -------------------------------------------------------------------------------- 1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. 
This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house. 2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. 
Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. 
His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. 
The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud. 
9 |
--------------------------------------------------------------------------------
/speedup/scripts/figure16a/run.sh:
--------------------------------------------------------------------------------
FLEXGEN_PATH=$PWD/../../flexgen
for SCHEME in "original" "int4" "h2o" "infinigen"
do
  rm $FLEXGEN_PATH/flexgen/flex_opt.py
  rm $FLEXGEN_PATH/flexgen/pytorch_backend.py
  if [ "$SCHEME" = "int4" ]
  then
    ln -s ../original/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py
    ln -s ../original/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py
  else
    ln -s ../$SCHEME/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py
    ln -s ../$SCHEME/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py
  fi

  for PROMPT_LEN in 384 896 1408 1920
  do
    CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 8 --num-gpu-batches 1 --prompt-len $PROMPT_LEN --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
    if [ "$SCHEME" = "int4" ]
    then
      CMD=$CMD" --compress-cache"
    elif [ "$SCHEME" = "h2o" ]
    then
      CMD=$CMD" --max-num-kv `expr \( $PROMPT_LEN + 128 \) / 5` --hh-ratio 0.1 --hh-all"
    elif [ "$SCHEME" = "infinigen" ]
    then
      CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv `expr \( $PROMPT_LEN + 128 \) / 5`"
    fi
    python -m flexgen.flex_opt $CMD
  done
done

--------------------------------------------------------------------------------
/speedup/scripts/figure16b/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics.
For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house. 2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. 
These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. 
To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. 
After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. 
Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 |
--------------------------------------------------------------------------------
/speedup/scripts/figure16b/run.sh:
--------------------------------------------------------------------------------
FLEXGEN_PATH=$PWD/../../flexgen
for SCHEME in "original" "int4" "h2o" "infinigen"
do
  rm $FLEXGEN_PATH/flexgen/flex_opt.py
  rm $FLEXGEN_PATH/flexgen/pytorch_backend.py
  if [ "$SCHEME" = "int4" ]
  then
    ln -s ../original/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py
    ln -s ../original/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py
  else
    ln -s ../$SCHEME/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py
    ln -s ../$SCHEME/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py
  fi

  for MODEL in "opt-6.7b" "opt-13b" "opt-30b"
  do
    CMD="--model huggingface/$MODEL"
    if [ "$MODEL" = "opt-30b" ]
    then
      CMD=$CMD" --percent 70 30 0 100 100 0"
    else
      CMD=$CMD" --percent 100 0 0 100 100 0"
    fi
    CMD=$CMD" --overlap false --gpu-batch-size 4 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
    if [ "$SCHEME" = "int4" ]
    then
      CMD=$CMD" --compress-cache"
    elif [ "$SCHEME" = "h2o" ]
    then
      CMD=$CMD" --max-num-kv 409 --hh-ratio 0.1 --hh-all"
    elif [ "$SCHEME" = "infinigen" ]
    then
      CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 409"
    fi
    python -m flexgen.flex_opt $CMD
  done
done

--------------------------------------------------------------------------------
/speedup/scripts/figure17a/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer.
Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. 
Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. 
His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. 
No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud. 
9 |
--------------------------------------------------------------------------------
/speedup/scripts/figure17a/run.sh:
--------------------------------------------------------------------------------
FLEXGEN_PATH=$PWD/../../flexgen
rm $FLEXGEN_PATH/flexgen/flex_opt.py
rm $FLEXGEN_PATH/flexgen/pytorch_backend.py
ln -s ../infinigen/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py
ln -s ../infinigen/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py

for ALPHA in 1 2 3 4 5 6 7 8 9
do
  CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 8 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
  CMD=$CMD" --alpha $ALPHA --partial-weight-ratio 0.2 --max-num-kv 409"
  python -m flexgen.flex_opt $CMD
done

--------------------------------------------------------------------------------
/speedup/scripts/figure17b/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window.
This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house. 2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details. 
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography. 4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. 
Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis. 5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century. 6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. 
His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed. 7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame. 8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. 
The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud. 
9 | -------------------------------------------------------------------------------- /speedup/scripts/figure17b/run.sh: -------------------------------------------------------------------------------- 1 | FLEXGEN_PATH=$PWD/../../flexgen 2 | rm $FLEXGEN_PATH/flexgen/flex_opt.py 3 | rm $FLEXGEN_PATH/flexgen/pytorch_backend.py 4 | ln -s ../infinigen/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py 5 | ln -s ../infinigen/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py 6 | 7 | for PARTIAL_WEIGHT_RATIO in 0.1 0.2 0.4 0.6 0.8 1.0 8 | do 9 | CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 8 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt" 10 | CMD=$CMD" --alpha 4 --partial-weight-ratio $PARTIAL_WEIGHT_RATIO --max-num-kv 409" 11 | python -m flexgen.flex_opt $CMD 12 | done 13 | -------------------------------------------------------------------------------- /speedup/scripts/run_all.sh: -------------------------------------------------------------------------------- 1 | for FIG in "14" "15" "16a" "16b" "17a" "17b" 2 | do 3 | cd figure$FIG 4 | sh run.sh 5 | cd .. 
6 | done 7 | -------------------------------------------------------------------------------- /speedup/uvm/allocate.cpp: -------------------------------------------------------------------------------- 1 | #include <sys/types.h> 2 | #include <cuda_runtime_api.h> 3 | 4 | extern "C" { 5 | void* uvm_malloc(ssize_t size, int device, cudaStream_t stream) { 6 | void *ptr; 7 | //cudaMalloc(&ptr, size); 8 | cudaMallocManaged(&ptr, size); 9 | return ptr; 10 | } 11 | 12 | void uvm_free(void* ptr, ssize_t size, int device, cudaStream_t stream) { 13 | cudaFree(ptr); 14 | } 15 | } 16 | -------------------------------------------------------------------------------- /speedup/uvm/h2o_attention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from typing import Tuple 4 | 5 | class SelfAttention(nn.Module): 6 | """Multi-headed attention from 'Attention Is All You Need' paper""" 7 | 8 | def __init__( 9 | self, 10 | embed_dim: int, 11 | num_heads: int, 12 | h2o_ratio: float, 13 | bias: bool = True 14 | ): 15 | super().__init__() 16 | self.embed_dim = embed_dim 17 | self.num_heads = num_heads 18 | self.head_dim = embed_dim // num_heads 19 | 20 | if (self.head_dim * num_heads) != self.embed_dim: 21 | raise ValueError( 22 | f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}" 23 | f" and `num_heads`: {num_heads})." 
24 | ) 25 | self.scaling = self.head_dim**-0.5 26 | 27 | self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 28 | self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 29 | self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 30 | self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 31 | 32 | self.acc = None 33 | self.ratio = h2o_ratio 34 | self.i = 0 35 | self.past_key_value = None 36 | 37 | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): 38 | return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() 39 | 40 | def _heavy_hitter_pruning(self, k, v, attn_weights, hh_k): 41 | # k, v: (s, b * n_head, head_dim) 42 | # attn_weights: (b * n_head, s, s) 43 | aggr_attn = torch.sum(attn_weights, 1) 44 | # (b * n_head, hh_k) 45 | _, topk_indices = aggr_attn[:, :].topk( 46 | min(hh_k, aggr_attn.shape[1]), dim=1) 47 | 48 | # select heavy-hitters 49 | # k, v: (b * n_head, s, head_dim) 50 | k_t = k.transpose(1, 0) 51 | v_t = v.transpose(1, 0) 52 | dim0_indices = torch.arange(k_t.size(0))[:, None] 53 | dim0_indices = dim0_indices.expand_as(topk_indices) 54 | # (b * n_head, hh_k, head_dim) 55 | k_hh_t = k_t[dim0_indices, topk_indices] 56 | v_hh_t = v_t[dim0_indices, topk_indices] 57 | # (hh_k, b * n_head, head_dim) 58 | k = k_hh_t.transpose(1, 0) 59 | v = v_hh_t.transpose(1, 0) 60 | # new shape (hh_k, b * n_head) 61 | aggr_attn = aggr_attn.transpose(0, 1) 62 | dim1_indices = torch.arange(aggr_attn.size(1)).unsqueeze(0) 63 | # (hh_k * 2, b * n_head) 64 | acc = aggr_attn[topk_indices.transpose(0, 1), dim1_indices] 65 | return k, v, acc 66 | 67 | def forward( 68 | self, 69 | hidden_states: torch.Tensor 70 | ) -> Tuple[torch.Tensor, Tuple[torch.Tensor]]: 71 | """Input shape: Batch x Time x Channel""" 72 | 73 | bsz, 
tgt_len, _ = hidden_states.size() 74 | 75 | # get query proj 76 | query_states = self.q_proj(hidden_states) * self.scaling 77 | 78 | # get key/value proj 79 | if self.past_key_value is not None: 80 | # reuse k, v, self_attention 81 | k = self._shape(self.k_proj(hidden_states), -1, bsz).squeeze() 82 | v = self._shape(self.v_proj(hidden_states), -1, bsz).squeeze() 83 | key_states = self.past_key_value[0] 84 | key_states[:, :, -1] = k 85 | value_states = self.past_key_value[1] 86 | value_states[:, :, -1] = v 87 | else: 88 | # self_attention 89 | key_states = self._shape(self.k_proj(hidden_states), -1, bsz) 90 | value_states = self._shape(self.v_proj(hidden_states), -1, bsz) 91 | 92 | # update kv cache 93 | #past_key_value = (key_states, value_states) 94 | 95 | # reshape 96 | proj_shape = (bsz * self.num_heads, -1, self.head_dim) 97 | query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) 98 | key_states = key_states.view(*proj_shape) 99 | value_states = value_states.view(*proj_shape) 100 | 101 | # qkt 102 | src_len = key_states.size(1) 103 | attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) 104 | 105 | # masking 106 | if self.i == 0: # prefill 107 | mask = torch.triu(torch.ones(attn_weights.shape).to('cuda'), diagonal=1) * -10000 108 | attn_weights = attn_weights + mask 109 | 110 | # softmax 111 | attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16) 112 | 113 | # sv 114 | attn_output = torch.bmm(attn_weights, value_states) 115 | 116 | # reshape 117 | attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) 118 | attn_output = attn_output.transpose(1, 2) 119 | attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) 120 | 121 | attn_output = self.out_proj(attn_output) 122 | 123 | 124 | ##### h2o #### 125 | if self.acc is None: 126 | self.hh = int(attn_weights.shape[-1] * self.ratio) 127 | key_states, value_states, self.acc = 
self._heavy_hitter_pruning(key_states.permute(1,0,2), value_states.permute(1,0,2), attn_weights, self.hh) 128 | key_states = key_states.permute(1, 0, 2) 129 | value_states = value_states.permute(1, 0, 2) 130 | self.past_key_value = (torch.cat((key_states.reshape(bsz, self.num_heads, key_states.shape[-2], key_states.shape[-1]), torch.zeros(bsz, self.num_heads, 1, key_states.shape[-1]).to('cuda').to(torch.float16)), dim = -2), 131 | torch.cat((value_states.reshape(bsz, self.num_heads, value_states.shape[-2], value_states.shape[-1]), torch.zeros(bsz, self.num_heads, 1, key_states.shape[-1]).to('cuda').to(torch.float16)), dim = -2)) 132 | 133 | else: 134 | temp_attn = attn_weights.squeeze(1).transpose(0, 1) 135 | self.acc = torch.cat((self.acc, torch.zeros(1, bsz * self.num_heads).to('cuda')), dim=0) 136 | self.acc = self.acc + temp_attn 137 | kick_ind = self.acc.argmin(dim=0).squeeze() 138 | 139 | # reduce accumulated result 140 | indices = kick_ind.unsqueeze(0) 141 | self.acc.scatter_(0, indices, self.acc[-1].unsqueeze(0).clone()) 142 | self.acc = self.acc[:-1] 143 | 144 | # modify kv cache 145 | indices = kick_ind.view(-1, 1).expand(-1, self.head_dim).unsqueeze(1) 146 | key_states.scatter_(1, indices, key_states[:, -1].unsqueeze(1)) 147 | value_states.scatter_(1, indices, value_states[:, -1].unsqueeze(1)) 148 | #key_states = key_states[:, :-1] 149 | #value_states = value_states[:, :-1] 150 | self.past_key_value = (key_states.reshape(bsz, self.num_heads, key_states.shape[-2], key_states.shape[-1]), 151 | value_states.reshape(bsz, self.num_heads, value_states.shape[-2], value_states.shape[-1])) 152 | 153 | self.i += 1 154 | return attn_output 155 | -------------------------------------------------------------------------------- /speedup/uvm/selfattention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from typing import Tuple 4 | 5 | class SelfAttention(nn.Module): 6 | """Multi-headed 
attention from 'Attention Is All You Need' paper""" 7 | 8 | def __init__( 9 | self, 10 | embed_dim: int, 11 | num_heads: int, 12 | bias: bool = True 13 | ): 14 | super().__init__() 15 | self.embed_dim = embed_dim 16 | self.num_heads = num_heads 17 | self.head_dim = embed_dim // num_heads 18 | 19 | if (self.head_dim * num_heads) != self.embed_dim: 20 | raise ValueError( 21 | f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}" 22 | f" and `num_heads`: {num_heads})." 23 | ) 24 | self.scaling = self.head_dim**-0.5 25 | 26 | self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 27 | self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 28 | self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 29 | self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 30 | 31 | self.past_key_value = None 32 | self.src_s = 0 33 | 34 | 35 | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): 36 | return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() 37 | 38 | def forward( 39 | self, 40 | hidden_states: torch.Tensor, 41 | ) -> Tuple[torch.Tensor, Tuple[torch.Tensor]]: 42 | """Input shape: Batch x Time x Channel""" 43 | 44 | bsz, tgt_len, _ = hidden_states.size() 45 | if tgt_len > 1: 46 | self.src_s = tgt_len 47 | else: 48 | self.src_s += 1 49 | 50 | # get query proj 51 | query_states = self.q_proj(hidden_states) * self.scaling 52 | 53 | # get key/value proj 54 | if self.past_key_value is not None: 55 | # reuse k, v, self_attention 56 | k = self._shape(self.k_proj(hidden_states), -1, bsz).squeeze() 57 | v = self._shape(self.v_proj(hidden_states), -1, bsz).squeeze() 58 | key_states = self.past_key_value[0, :, :, :self.src_s] 59 | key_states[:, :, -1] = k 60 | self.past_key_value[0, :, :, self.src_s] = k 
61 | value_states = self.past_key_value[1, :, :, :self.src_s] 62 | value_states[:, :, -1] = v 63 | self.past_key_value[1, :, :, self.src_s] = v 64 | else: 65 | # self_attention 66 | self.past_key_value = torch.zeros((2, bsz, self.num_heads, 2048, self.head_dim), dtype=torch.float16, device=torch.device('cuda')) 67 | key_states = self._shape(self.k_proj(hidden_states), -1, bsz) 68 | value_states = self._shape(self.v_proj(hidden_states), -1, bsz) 69 | self.past_key_value[0, :, :, :tgt_len] = key_states 70 | self.past_key_value[1, :, :, :tgt_len] = value_states 71 | 72 | # update kv cache 73 | #self.past_key_value = (key_states, value_states) 74 | 75 | # reshape 76 | proj_shape = (bsz * self.num_heads, -1, self.head_dim) 77 | query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) 78 | key_states = key_states.view(*proj_shape) 79 | value_states = value_states.view(*proj_shape) 80 | 81 | # qkt 82 | src_len = key_states.size(1) 83 | attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) 84 | 85 | # masking 86 | if attn_weights.shape[1] > 1: # prefill 87 | mask = torch.triu(torch.ones(attn_weights.shape).to('cuda'), diagonal=1) * -10000 88 | attn_weights = attn_weights + mask 89 | 90 | # softmax 91 | attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16) 92 | 93 | # sv 94 | attn_output = torch.bmm(attn_weights, value_states) 95 | 96 | # reshape 97 | attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) 98 | attn_output = attn_output.transpose(1, 2) 99 | attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) 100 | 101 | attn_output = self.out_proj(attn_output) 102 | 103 | return attn_output 104 | -------------------------------------------------------------------------------- /speedup/uvm/transformer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import transformerlayer 3 | import time 4 | 5 | import argparse 6 | 
parser = argparse.ArgumentParser() 7 | 8 | parser.add_argument( 9 | '--embed_dim', type=int, default=4096, 10 | help='model dim.' 11 | ) 12 | parser.add_argument( 13 | '--ffn_dim', type=int, default=4096*4, 14 | help='ffn dim' 15 | ) 16 | parser.add_argument( 17 | '--enable_bias', action='store_true', 18 | help='enable bias for projections' 19 | ) 20 | parser.add_argument( 21 | '--n_head', type=int, default=32, 22 | help='num heads' 23 | ) 24 | parser.add_argument( 25 | '--do_layer_norm_before', action='store_true', 26 | help='do layernorm before attention/ffn' 27 | ) 28 | parser.add_argument( 29 | '--n_layer', type=int, default=32, 30 | help='num layers' 31 | ) 32 | parser.add_argument( 33 | '--bsz', type=int, default=4, 34 | help='batch size' 35 | ) 36 | parser.add_argument( 37 | '--prompt_len', type=int, default=2048, 38 | help='length of input prompt' 39 | ) 40 | parser.add_argument( 41 | '--gen_len', type=int, default=1024, 42 | help='length of output' 43 | ) 44 | parser.add_argument( 45 | '--is_h2o', action='store_true', 46 | help='enable h2o' 47 | ) 48 | parser.add_argument( 49 | '--h2o_ratio', type=float, default=0.2, 50 | help='ratio of heavy hitter' 51 | ) 52 | parser.add_argument( 53 | '--runs', type=int, default=1, 54 | help='number of runs' 55 | ) 56 | args = parser.parse_args() 57 | 58 | ### Parameters ### 59 | embed_dim = args.embed_dim 60 | ffn_dim = args.ffn_dim 61 | bias = args.enable_bias 62 | n_head = args.n_head 63 | do_layer_norm_before = args.do_layer_norm_before 64 | n_layer = args.n_layer 65 | 66 | bsz = args.bsz 67 | prompt_len = args.prompt_len 68 | gen_len = args.gen_len 69 | ################## 70 | 71 | new_alloc = torch.cuda.memory.CUDAPluggableAllocator( 72 | 'allocate.so', 'uvm_malloc', 'uvm_free') 73 | torch.cuda.memory.change_current_allocator(new_alloc) 74 | 75 | ################## 76 | """ 77 | def prefetch(name): 78 | def hook(attn, input, output): 79 | ind = output[2] 80 | if int(name) < (n_layer - 1) and ind is not None: 81 | 
next_layer = transformer[int(name) + 1] 82 | torch.gather(next_layer.past_key_value[0], 2, ind) 83 | torch.gather(next_layer.past_key_value[1], 2, ind) 84 | 85 | return hook 86 | 87 | for layer_num in range(n_layer): 88 | attn_in = transformer[layer_num].self_attn.register_forward_hook(prefetch("%d"%(layer_num))) 89 | """ 90 | ################# 91 | 92 | total_prefill = 0.0 93 | total_decode = 0.0 94 | for run in range(args.runs): 95 | transformer = [transformerlayer.TransformerLayer(embed_dim, ffn_dim, bias, n_head, do_layer_norm_before, args.is_h2o, args.h2o_ratio) for _ in range(n_layer)] 96 | 97 | prompt = torch.rand(bsz, prompt_len, embed_dim).to(torch.float16).to('cuda') 98 | new_input = torch.rand(bsz, 1, embed_dim).to(torch.float16).to('cuda') 99 | 100 | # Warmup 101 | for i in range(n_layer): 102 | prompt = transformer[i].forward(prompt) 103 | 104 | for i in range(n_layer): 105 | transformer[i].self_attn.past_key_value = None 106 | if args.is_h2o: 107 | transformer[i].self_attn.i = 0 108 | transformer[i].self_attn.acc = None 109 | 110 | start = time.time() 111 | for i in range(n_layer): 112 | prompt = transformer[i].forward(prompt) 113 | 114 | prefill_time = time.time() - start 115 | start = time.time() 116 | 117 | for k in range(gen_len - 1): 118 | for i in range(n_layer): 119 | new_input = transformer[i].forward(new_input) 120 | 121 | decode_time = time.time() - start 122 | 123 | total_prefill += prefill_time 124 | total_decode += decode_time 125 | 126 | del transformer 127 | 128 | prefill_time = total_prefill / float(args.runs) 129 | decode_time = total_decode / float(args.runs) 130 | 131 | print("+++++++++++++++++++++++++++++++++++++++++++++++++") 132 | if args.is_h2o: 133 | print("UVM + H2O") 134 | else: 135 | print("UVM") 136 | 137 | print("input: " + str(prompt_len) + " output: " + str(gen_len) + " bsz: " + str(bsz)) 138 | print("+++++++++++++++++++++++++++++++++++++++++++++++++") 139 | print("Total: " + str(prefill_time + decode_time) + " Prefill: 
" + str(prefill_time) + " Decode: " + str(decode_time)) 140 | print("=================================================") 141 | -------------------------------------------------------------------------------- /speedup/uvm/transformerlayer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from typing import Tuple 4 | 5 | class TransformerLayer(nn.Module): 6 | def __init__( 7 | self, 8 | embed_dim, 9 | ffn_dim, 10 | bias, 11 | n_head, 12 | do_layer_norm_before, 13 | is_h2o, 14 | h2o_ratio 15 | ): 16 | super().__init__() 17 | self.embed_dim = embed_dim 18 | if is_h2o: 19 | import h2o_attention 20 | self.self_attn = h2o_attention.SelfAttention( 21 | embed_dim=self.embed_dim, 22 | num_heads=n_head, 23 | bias=bias, 24 | h2o_ratio=h2o_ratio 25 | ) 26 | else: 27 | import selfattention 28 | self.self_attn = selfattention.SelfAttention( 29 | embed_dim=self.embed_dim, 30 | num_heads=n_head, 31 | bias=bias, 32 | ) 33 | 34 | self.do_layer_norm_before = do_layer_norm_before 35 | self.activation_fn = nn.ReLU() 36 | 37 | self.self_attn_layer_norm = nn.LayerNorm( 38 | self.embed_dim, elementwise_affine=True, dtype=torch.float16, device=torch.device('cuda') 39 | ) 40 | self.fc1 = nn.Linear(self.embed_dim, ffn_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 41 | self.fc2 = nn.Linear(ffn_dim, self.embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda')) 42 | self.final_layer_norm = nn.LayerNorm( 43 | self.embed_dim, elementwise_affine=True, dtype=torch.float16, device=torch.device('cuda') 44 | ) 45 | 46 | def forward( 47 | self, 48 | hidden_states: torch.Tensor, 49 | ) -> torch.Tensor: 50 | 51 | with torch.no_grad(): 52 | residual = hidden_states 53 | 54 | # OPT: 125m, 1.7B, ..., 175B applies layer norm BEFORE attention 55 | if self.do_layer_norm_before: 56 | hidden_states = self.self_attn_layer_norm(hidden_states) 57 | 58 | # Self Attention 59 | hidden_states = 
self.self_attn( 60 | hidden_states=hidden_states 61 | ) 62 | hidden_states = residual + hidden_states 63 | 64 | # 350m applies layer norm AFTER attention 65 | if not self.do_layer_norm_before: 66 | hidden_states = self.self_attn_layer_norm(hidden_states) 67 | 68 | # Fully Connected 69 | hidden_states_shape = hidden_states.shape 70 | hidden_states = hidden_states.reshape(-1, hidden_states.size(-1)) 71 | residual = hidden_states 72 | 73 | # 125m, 1.7B, ..., 175B applies layer norm BEFORE attention 74 | if self.do_layer_norm_before: 75 | hidden_states = self.final_layer_norm(hidden_states) 76 | 77 | hidden_states = self.fc1(hidden_states) 78 | hidden_states = self.activation_fn(hidden_states) 79 | 80 | hidden_states = self.fc2(hidden_states) 81 | 82 | hidden_states = (residual + hidden_states).view(hidden_states_shape) 83 | 84 | # 350m applies layer norm AFTER attention 85 | if not self.do_layer_norm_before: 86 | hidden_states = self.final_layer_norm(hidden_states) 87 | 88 | return hidden_states 89 | --------------------------------------------------------------------------------
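For reference, the decode-time eviction in `speedup/uvm/h2o_attention.py` (the `else` branch under `##### h2o ####`) can be sketched for a single attention head in plain Python. This is an illustrative sketch under simplifying assumptions, not the repository's code: `h2o_decode_step` and its list-based arguments are hypothetical names, tensors are replaced by Python lists, and ties are broken by lowest index, which the tensor version does not guarantee.

```python
def h2o_decode_step(acc, cache, new_entry, attn):
    """One heavy-hitter eviction step for a single head (hypothetical helper).

    acc:       accumulated attention score of each cached token (length n)
    cache:     cached KV entries, one per retained token (length n)
    new_entry: KV entry of the token generated in this step
    attn:      attention weights over the n cached tokens plus the new token
               (length n + 1, new token last)
    Returns the updated (acc, cache), both still of length n.
    """
    # Append a zero score for the new token, then add this step's attention.
    scores = [a + w for a, w in zip(acc + [0.0], attn)]
    # The token with the smallest accumulated score is evicted.
    evict = min(range(len(scores)), key=scores.__getitem__)
    if evict < len(cache):
        # Reuse the evicted slot for the new token and its score.
        cache = cache[:evict] + [new_entry] + cache[evict + 1:]
        scores[evict] = scores[-1]
    # Otherwise the new token itself scored lowest and is simply dropped.
    return scores[:-1], cache


acc, cache = h2o_decode_step(
    [2.0, 0.25, 1.0], ["kv0", "kv1", "kv2"], "kv3", [0.25, 0.125, 0.125, 0.5]
)
print(acc, cache)  # [2.25, 0.5, 1.125] ['kv0', 'kv3', 'kv2']
```

Writing the new token into the evicted slot keeps the cache at a fixed size, which mirrors how the tensor version uses `scatter_` on the last row rather than reallocating the cache at every step.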