├── LICENSE
├── README.md
├── accuracy
    ├── README.md
    ├── lm_eval
    │   ├── evaluate_task_result.py
    │   ├── figure11-config.json
    │   ├── figure11.py
    │   ├── figure13.sh
    │   ├── figure17.sh
    │   ├── full_cache.sh
    │   ├── generate_task_data.py
    │   ├── h2o.sh
    │   ├── ours.sh
    │   ├── quant.sh
    │   ├── run_lm_eval_harness.py
    │   ├── run_text_generation.py
    │   ├── tasks
    │   │   ├── __init__.py
    │   │   ├── eval_harness.py
    │   │   └── util.py
    │   └── utils_lm_eval
    │   │   ├── modify_gptneox.py
    │   │   ├── modify_llama.py
    │   │   └── modify_opt.py
    ├── perplexity
    │   ├── datautils.py
    │   ├── figure12.sh
    │   ├── llama.py
    │   ├── opt.py
    │   └── table2.sh
    ├── scripts
    │   ├── figure11
    │   │   └── run.sh
    │   ├── figure12
    │   │   └── run.sh
    │   ├── figure13
    │   │   └── run.sh
    │   ├── figure17
    │   │   └── run.sh
    │   ├── run_all.sh
    │   └── table2
    │   │   └── run.sh
    ├── setup
    │   ├── gen_llama_skewing_matrix.py
    │   ├── gen_opt_model.py
    │   ├── gen_partial_weight.py
    │   ├── pg19_firstbook.txt
    │   ├── setup.sh
    │   └── utils.py
    └── src
    │   ├── modeling_llama_orig.py
    │   ├── modeling_llama_ours.py
    │   ├── modeling_llama_ours_prune.py
    │   ├── modeling_llama_ours_setup.py
    │   ├── modeling_opt_orig.py
    │   ├── modeling_opt_ours.py
    │   ├── modeling_opt_ours_prune.py
    │   └── modeling_opt_ours_setup.py
├── figures
    └── infinigen.png
├── requirements.txt
└── speedup
    ├── README.md
    ├── flexgen
        ├── LICENSE
        ├── README.md
        ├── flexgen
        │   ├── __init__.py
        │   ├── compression.py
        │   ├── flex_opt.py
        │   ├── opt_config.py
        │   ├── pytorch_backend.py
        │   ├── timer.py
        │   └── utils.py
        ├── h2o
        │   ├── flex_opt.py
        │   └── pytorch_backend.py
        ├── infinigen
        │   ├── flex_opt.py
        │   └── pytorch_backend.py
        ├── original
        │   ├── flex_opt.py
        │   └── pytorch_backend.py
        ├── pg19_firstbook.txt
        └── pyproject.toml
    ├── infinigen
        ├── infinigen
        │   ├── kv_selection_controller.py
        │   ├── partial_weight_generation_controller.py
        │   └── skewing_controller.py
        └── setup.py
    ├── install.sh
    ├── scripts
        ├── figure14
        │   ├── pg19_firstbook.txt
        │   └── run.sh
        ├── figure15
        │   ├── pg19_firstbook.txt
        │   └── run.sh
        ├── figure16a
        │   ├── pg19_firstbook.txt
        │   └── run.sh
        ├── figure16b
        │   ├── pg19_firstbook.txt
        │   └── run.sh
        ├── figure17a
        │   ├── pg19_firstbook.txt
        │   └── run.sh
        ├── figure17b
        │   ├── pg19_firstbook.txt
        │   └── run.sh
        └── run_all.sh
    └── uvm
        ├── allocate.cpp
        ├── h2o_attention.py
        ├── selfattention.py
        ├── transformer.py
        └── transformerlayer.py


/README.md:
--------------------------------------------------------------------------------
 1 | # InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
 2 | [[paper](https://arxiv.org/abs/2406.19707)]
 3 | 
 4 | ![infinigen](figures/infinigen.png)
 5 | 
 6 | ## Overview
 7 | 
 8 | This repository contains the code for the artifact evaluation of our paper "InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management".
 9 | 
10 | - Getting Started (10 minutes)
11 | - Run Experiments
12 | 
13 | ## Abstract
14 | 
15 | Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy.
16 | 
17 | ## Getting Started (10 minutes)
18 | 
19 | ```sh
20 | git clone https://github.com/snu-comparch/infinigen
21 | conda create -n infinigen python=3.9
22 | conda activate infinigen
23 | pip install -r requirements.txt
24 | ```
25 | 
26 | ## Run Experiments
27 | 
28 | We provide the scripts for accuracy and speedup evaluation.
29 | You can find the source code for accuracy evaluation in the `accuracy` directory and speedup evaluation in the `speedup` directory.
30 | The accuracy and speedup experiments take about 40 hours and 7 hours, respectively.
31 | 
32 | ## Citation
33 | 
34 | ```bibtex
35 | @inproceedings{lee-osdi24,
36 |   title = {{I}nfini{G}en: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management},
37 |   author = {Lee, Wonbeom and Lee, Jungi and Seo, Junghwan and Sim, Jaewoong},
38 |   booktitle = {18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
39 |   year = {2024}
40 | }
41 | ```
42 | 


--------------------------------------------------------------------------------
/accuracy/README.md:
--------------------------------------------------------------------------------
 1 | # Language Modeling Evaluation
 2 | This directory contains the source code for evaluating language modeling performance. You can use it to reproduce the perplexity and accuracy results in the paper. Some of the code is adapted from the H2O (NeurIPS'23) GitHub repository.
 3 | 
 4 | ## Getting Started (60 minutes)
 5 | We evaluate accuracy using the HuggingFace Transformers library. Also, please
 6 | fetch Llama-2 from [here](https://llama.meta.com/llama-downloads). Follow the
 7 | instructions and set up Llama-2. You may also need to convert the model to
 8 | HuggingFace format using `convert_llama_weights_to_hf.py` in
 9 | `transformers/src/transformers/models/llama`.
10 | 
11 | NOTE: We recommend using a GPU with a large VRAM size. We evaluate accuracy on an `A100-80GB` GPU.
12 | 
13 | ```sh
14 | git clone -b v4.35-release https://github.com/huggingface/transformers.git
15 | cd transformers
16 | pip install -e .
17 | ```
18 | 
19 | After setting up the library and llama models, generate the partial weights and skewing matrix. You can safely ignore the uninitialized weight warning.
20 | ```sh
21 | cd setup
22 | export LLAMA_PATH=/path/to/llama-2
23 | bash setup.sh
24 | ```
25 | 
26 | For a "Hello world"-sized example, please run the following command (10 minutes):
27 | ```sh
28 | cd lm_eval
29 | mkdir results
30 | python -u generate_task_data.py --output-file results/openbookqa-5.jsonl --task-name openbookqa --num-fewshot 5
31 | bash ours.sh openbookqa ../setup/opt-model/opt-6.7b facebook/opt-6.7b opt 5 0.2 4 1.0 0.2 
32 | ```
33 | 
34 | ## Run Experiments (40 hours)
35 | You can reproduce the experimental results in Figures 11-13 and Table 2 by running the following commands:
36 | 
37 | ```sh
38 | cd scripts
39 | sh run_all.sh
40 | ```
41 | 
42 | If you want to reproduce the results for a specific figure, run `sh run.sh` in the corresponding directory. For example,
43 | ```sh
44 | cd scripts/figure11
45 | sh run.sh
46 | ```
47 | 
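The "Hello world" command above passes nine positional arguments to `ours.sh`. The sketch below spells out what each position means, following the `$1`..`$10` assignments at the top of `ours.sh`. The helper name `build_ours_command` is hypothetical and is not part of the repository; it only assembles the command string.

```python
from typing import Optional


def build_ours_command(task: str, model_path: str, model: str, model_arch: str,
                       shots: int, partial_weight: float, alpha: float,
                       capacity: float, budget: float,
                       no_skewing: Optional[str] = None) -> str:
    """Assemble the positional call that ours.sh expects ($1..$10)."""
    # Order mirrors ours.sh: task=$1, model_path=$2, model=$3, model_arch=$4,
    # shots=$5, partial_weight=$6, alpha=$7, capacity=$8, budget=$9.
    args = [task, model_path, model, model_arch, shots,
            partial_weight, alpha, capacity, budget]
    # The optional tenth argument selects the no-skew weights when non-empty.
    if no_skewing is not None:
        args.append(no_skewing)
    return "bash ours.sh " + " ".join(str(a) for a in args)


hello_world = build_ours_command(
    task="openbookqa",
    model_path="../setup/opt-model/opt-6.7b",
    model="facebook/opt-6.7b",
    model_arch="opt",
    shots=5, partial_weight=0.2, alpha=4, capacity=1.0, budget=0.2)
print(hello_world)
```

Naming the arguments this way makes it easier to see that `0.2 4 1.0 0.2` in the example are the partial weight ratio, alpha, capacity, and budget, in that order.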


--------------------------------------------------------------------------------
/accuracy/lm_eval/evaluate_task_result.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import json
  3 | import os
  4 | 
  5 | from lm_eval import evaluator, tasks
  6 | from tasks import EvalHarnessAdaptor
  7 | 
  8 | def json_to_key(obj):
  9 |     return json.dumps(obj)
 10 | 
 11 | 
 12 | if __name__ == '__main__':
 13 |     
 14 | 
 15 |     parser = argparse.ArgumentParser(
 16 |                         prog = 'evaluate_task_result.py',
 17 |                         description = 'Score cached model outputs against an lm-eval-harness task',
 18 |                         epilog = 'Expects a JSONL file with one request/result object per line')
 19 | 
 20 |     parser.add_argument('--result-file', type=str, default='result.jsonl')
 21 |     parser.add_argument('--task-name', type=str, default='hellaswag')
 22 |     parser.add_argument('--model-type', type=str, default='opt')
 23 |     parser.add_argument('--debug', action='store_true', default=False)
 24 |     parser.add_argument('--num-fewshot', type=int, default=0)
 25 |     args = parser.parse_args()
 26 |     
 27 |     if args.model_type == 'opt':
 28 |         os.environ['MODEL_NAME'] = "facebook/opt-66b"
 29 |     elif args.model_type == 'bloom':
 30 |         os.environ['MODEL_NAME'] = "bigscience/bloom"
 31 |     elif args.model_type == 'gpt_neox':
 32 |         os.environ['MODEL_NAME'] = "EleutherAI/gpt-neox-20b"
 33 |     elif args.model_type == 'llama':
 34 |         os.environ['MODEL_NAME'] = "huggyllama/llama-7b"
 35 |     else:
 36 |         raise ValueError(f"unsupported model type: {args.model_type}")
 37 | 
 38 |     seq = 1024
 39 |     total_batch = 1
 40 |     pe = 'fixed'
 41 | 
 42 |     class RealRunner:
 43 |         
 44 |         def __init__(self, args):
 45 |             
 46 |             self.results = {}
 47 |             
 48 |             with open(args.result_file, 'r') as f:
 49 |                 
 50 |                 for line in f:
 51 |                     if line.strip() == '':
 52 |                         continue
 53 |                     
 54 |                     item = json.loads(line)
 55 |                     
 56 |                     request = item['request']
 57 |                     result = item['result']
 58 |                     
 59 |                     self.results[json_to_key(request)] = result
 60 |             
 61 |             print(f"{len(self.results)} items in the cache")
 62 |         
 63 |         def eval(self, batch):
 64 |             
 65 |             from tasks.eval_harness import tokenizer
 66 |             
 67 |             mask_loss = []
 68 |             each_correct = []
 69 | 
 70 |             for i, text in enumerate(batch['text']):
 71 |                 
 72 |                 request = {
 73 |                         "best_of": 1, 
 74 |                         "echo": True, 
 75 |                         "logprobs": 1, 
 76 |                         "max_tokens": 0, 
 77 |                         "model": "x", 
 78 |                         "n": 1, 
 79 |                         "prompt": text, 
 80 |                         "request_type": "language-model-inference", 
 81 |                         "stop": None, 
 82 |                         "temperature": 0, 
 83 |                         "top_p": 1
 84 |                     }
 85 |                 
 86 |                 key = json_to_key(request)
 87 |                 
 88 |                 correct = True
 89 |                 
 90 |                 if key in self.results:
 91 |                     result = self.results[key]
 92 |                     
 93 |                     token_logprobs = result['choices'][0]['logprobs']['token_logprobs']
 94 |                     tokens = result['choices'][0]['logprobs']['tokens']
 95 |                     top_logprobs = result['choices'][0]['logprobs']['top_logprobs']
 96 |                     assert token_logprobs[0] is None
 97 |                     
 98 |                     token_ids = tokenizer.convert_tokens_to_ids(tokens)
 99 |                     
100 |                     obs = batch['obs'][i]
101 |                     target = batch['target'][i]
102 |                     eval_mask = batch['eval_mask'][i]
103 |                     
104 |                     n_positive = 0
105 |                     sum_logprob = 0
106 |                     if args.debug:
107 |                         print(target)
108 |                     for j, mask in enumerate(eval_mask):
109 |                         try:
110 |                             
111 |                             if j+1 >= len(tokens):
112 |                                 break
113 |                             
114 |                             if mask == True:
115 |                                 if args.debug:
116 |                                     print(tokens[j+1], next(iter(top_logprobs[j+1].keys())))
117 |                                 correct = correct and (tokens[j+1] == next(iter(top_logprobs[j+1].keys())))
118 |                                 sum_logprob += token_logprobs[j+1]
119 |                                 n_positive += 1
120 |                         except Exception as e:
121 |                             raise e
122 |                     
123 |                     # avg_logprob = sum(token_logprobs[1:]) / (len(token_logprobs) - 1)
124 |                     avg_logprob = sum_logprob / n_positive
125 |                     
126 |                     mask_loss.append( - avg_logprob)
127 |             
128 |                     each_correct.append( correct )
129 |                     
130 |                 else:
131 |                     assert False
132 |                 
133 | 
134 |             out = {
135 |                 'mask_loss': mask_loss,
136 |                 'each_correct': each_correct,
137 |             }
138 |             
139 |             
140 |             return out
141 | 
142 |     t = RealRunner(args)
143 | 
144 |     adaptor = EvalHarnessAdaptor(t, seq, total_batch, shrink=pe != "fixed")
145 | 
146 |     results = evaluator.evaluate(adaptor, tasks.get_task_dict([args.task_name
147 |                                                                #"lambada_openai",
148 |                                                                #"piqa",
149 |                                                                #"hellaswag",
150 |                                                                #"winogrande",
151 |                                                                #"mathqa",
152 |                                                                #"pubmedqa",
153 |                                                                # "boolq",
154 |                                                                # "cb",
155 |                                                                # "copa",
156 |                                                                # "multirc",
157 |                                                                # "record",
158 |                                                                # "wic",
159 |                                                                # "wsc",
160 |                                                                ]), False, args.num_fewshot, None)
161 |     
162 |     dumped = json.dumps(results, indent=2)
163 |     print(dumped)
164 | 
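The scoring loop in `RealRunner.eval` can be read in isolation as follows. This is a minimal sketch of the same arithmetic on plain Python lists: position `i` of the mask refers to token `i + 1`, the loss is the negative mean log-probability over masked tokens, and a prompt counts as correct only if every masked token is also the top-1 prediction. The function name `score_masked` and the toy inputs are hypothetical; the explicit empty-mask check is an addition that the original loop does not have.

```python
def score_masked(tokens, token_logprobs, top_logprobs, eval_mask):
    """Return (mask_loss, correct) for a single prompt."""
    n_positive = 0
    sum_logprob = 0.0
    correct = True
    for i, mask in enumerate(eval_mask):
        # The mask is shifted by one: entry i scores token i + 1.
        if i + 1 >= len(tokens):
            break
        if mask:
            # top_logprobs holds one dict per position; its first key is the
            # model's top-1 token (the request asks for logprobs=1).
            top_token = next(iter(top_logprobs[i + 1]))
            correct = correct and (tokens[i + 1] == top_token)
            sum_logprob += token_logprobs[i + 1]
            n_positive += 1
    if n_positive == 0:
        # Added guard (assumption): fail loudly instead of dividing by zero.
        raise ValueError("eval_mask selects no tokens")
    return -sum_logprob / n_positive, correct


# Toy data: the first position has no log-probability, as the script asserts.
tokens = ["The", " sky", " is", " blue"]
token_logprobs = [None, -1.0, -0.5, -0.25]
top_logprobs = [None, {" sky": -1.0}, {" is": -0.5}, {" red": -0.1}]
eval_mask = [False, True, True]  # scores " is" and " blue"

loss, correct = score_masked(tokens, token_logprobs, top_logprobs, eval_mask)
print(loss, correct)
```

In the toy input the model ranks " red" above " blue", so the prompt is marked incorrect even though the masked loss is still computed from the log-probabilities of the reference tokens.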


--------------------------------------------------------------------------------
/accuracy/lm_eval/figure11-config.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "opt-6.7b":{
 3 |     "openbookqa":[
 4 |       [2, 0.2],
 5 |       [4, 0.2],
 6 |       [5, 0.4],
 7 |       [7, 0.6]
 8 |     ],
 9 |     "piqa":[
10 |       [2, 0.2],
11 |       [4, 0.2],
12 |       [5, 0.4],
13 |       [7, 0.7]
14 |     ]
15 |   },
16 |   "opt-13b":{
17 |     "winogrande":[
18 |       [2, 0.2],
19 |       [4, 0.2],
20 |       [4, 0.6],
21 |       [7, 0.6]
22 |     ],
23 |     "openbookqa":[
24 |       [2, 0.2],
25 |       [4, 0.2],
26 |       [4, 0.6],
27 |       [7, 0.6]
28 |     ]
29 |   },
30 |   "opt-30b":{
31 |     "openbookqa":[
32 |       [2, 0.2],
33 |       [4, 0.2],
34 |       [5, 0.4],
35 |       [7, 0.6]
36 |     ],
37 |     "copa":[
38 |       [2, 0.2],
39 |       [4, 0.2],
40 |       [5, 0.4],
41 |       [7, 0.6]
42 |     ]
43 |   },
44 |   "llama-2-7b":{
45 |     "rte":[
46 |       [5, 0.2],
47 |       [7, 0.2],
48 |       [7, 0.6],
49 |       [7, 0.8]
50 |     ],
51 |     "piqa":[
52 |       [5, 0.2],
53 |       [7, 0.2],
54 |       [7, 0.6],
55 |       [9.5, 0.6]
56 |     ]
57 |   },
58 |   "llama-2-13b":{
59 |     "copa":[
60 |       [4, 0.2],
61 |       [6, 0.2],
62 |       [7, 0.4],
63 |       [9, 0.6]
64 |     ],
65 |     "winogrande":[
66 |       [4, 0.2],
67 |       [6, 0.2],
68 |       [7, 0.4],
69 |       [9, 0.6]
70 |     ]
71 |   }
72 | }
73 | 
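Each inner pair in this config is consumed by `figure11.py` as `alpha, budget = config[model][task][i]`, with four pairs per model and task. The sketch below shows that lookup on a two-entry excerpt of the file. The helper name `settings_for` and the inline excerpt are for illustration only.

```python
import json

# Excerpt of figure11-config.json, inlined so the sketch is self-contained.
CONFIG_EXCERPT = """
{
  "opt-6.7b": {"openbookqa": [[2, 0.2], [4, 0.2], [5, 0.4], [7, 0.6]]},
  "llama-2-7b": {"piqa": [[5, 0.2], [7, 0.2], [7, 0.6], [9.5, 0.6]]}
}
"""


def settings_for(config, model, task):
    """Return the (alpha, budget) pairs swept for one model and task."""
    return [(alpha, budget) for alpha, budget in config[model][task]]


config = json.loads(CONFIG_EXCERPT)
opt_points = settings_for(config, "opt-6.7b", "openbookqa")
for alpha, budget in opt_points:
    print(f"alpha={alpha} budget={budget}")
```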


--------------------------------------------------------------------------------
/accuracy/lm_eval/figure11.py:
--------------------------------------------------------------------------------
  1 | import json
  2 | import os
  3 | 
  4 | with open("figure11-config.json") as f:
  5 |     config = json.load(f)
  6 | os.makedirs("results", exist_ok=True)
  7 | 
  8 | shots = 5
  9 | partial = 0.2
 10 | capacity = 1.0
 11 | 
 12 | # Prepare dataset
 13 | for task in ["piqa", "openbookqa", "winogrande", "copa", "rte"]:
 14 |     cmd = []
 15 |     cmd.append("python -u generate_task_data.py")
 16 |     cmd.append(f"--output-file results/{task}-{shots}.jsonl")
 17 |     cmd.append(f"--task-name {task}")
 18 |     cmd.append(f"--num-fewshot {shots}")
 19 |     cmd = ' '.join(cmd)
 20 |     os.system(cmd)
 21 | 
 22 | ## Baseline
 23 | print("="*10+" Full cache " + "="*10)
 24 | # OPT
 25 | for size in ["6.7b", "13b", "30b"]:
 26 |     if size == "6.7b":
 27 |         tasks = ["piqa", "openbookqa"]
 28 |     elif size == "13b":
 29 |         tasks = ["winogrande", "openbookqa"]
 30 |     elif size == "30b":
 31 |         tasks = ["copa", "openbookqa"]
 32 |     for task in tasks:
 33 |         cmd = []
 34 |         cmd.append("bash full_cache.sh")
 35 |         cmd.append(task)
 36 |         cmd.append(f"facebook/opt-{size}")
 37 |         cmd.append("opt")
 38 |         cmd.append(str(shots))
 39 |         cmd = ' '.join(cmd)
 40 |         print(cmd)
 41 |         os.system(cmd)
 42 |         print("-------------------------------------------")
 43 | 
 44 | # Llama-2
 45 | llama_2_dir = os.environ["LLAMA_PATH"]
 46 | for size in ["7b", "13b"]:
 47 |     if size == "7b":
 48 |         tasks = ["rte", "piqa"]
 49 |     elif size == "13b":
 50 |         tasks = ["copa", "winogrande"]
 51 |     for task in tasks:
 52 |         cmd = []
 53 |         cmd.append("bash full_cache.sh")
 54 |         cmd.append(task)
 55 |         cmd.append(f"{llama_2_dir}/llama-2-{size}")
 56 |         cmd.append("llama")
 57 |         cmd.append(str(shots))
 58 |         cmd = ' '.join(cmd)
 59 |         print(cmd)
 60 |         os.system(cmd)
 61 |         print("-------------------------------------------")
 62 | 
 63 | # InfiniGen
 64 | print("="*10+" InfiniGen " + "="*10)
 65 | # OPT
 66 | for size in ["6.7b", "13b", "30b"]:
 67 |     if size == "6.7b":
 68 |         tasks = ["piqa", "openbookqa"]
 69 |     elif size == "13b":
 70 |         tasks = ["winogrande", "openbookqa"]
 71 |     elif size == "30b":
 72 |         tasks = ["copa", "openbookqa"]
 73 |     for task in tasks:
 74 |         for retain_ratio in range(4):
 75 |             alpha, budget = config[f"opt-{size}"][task][retain_ratio]
 76 |             cmd = []
 77 |             cmd.append("bash ours.sh")
 78 |             cmd.append(task)
 79 |             cmd.append(f"../setup/opt-model/opt-{size}")
 80 |             cmd.append(f"facebook/opt-{size}")
 81 |             cmd.append("opt")
 82 |             cmd.append(str(shots))
 83 |             cmd.append(str(partial))
 84 |             cmd.append(str(alpha))
 85 |             cmd.append(str(capacity))
 86 |             cmd.append(str(budget))
 87 |             cmd = ' '.join(cmd)
 88 |             print(cmd)
 89 |             os.system(cmd)
 90 |             print("-------------------------------------------")
 91 | 
 92 | 
 93 | # Llama-2
 94 | llama_2_dir = os.environ["LLAMA_PATH"]
 95 | for size in ["7b", "13b"]:
 96 |     if size == "7b":
 97 |         tasks = ["rte", "piqa"]
 98 |     elif size == "13b":
 99 |         tasks = ["copa", "winogrande"]
100 |     for task in tasks:
101 |         for retain_ratio in range(4):
102 |             alpha, budget = config[f"llama-2-{size}"][task][retain_ratio]
103 |             cmd = []
104 |             cmd.append("bash ours.sh")
105 |             cmd.append(task)
106 |             cmd.append(f"{llama_2_dir}/llama-2-{size}")
107 |             cmd.append(f"{llama_2_dir}/llama-2-{size}")
108 |             cmd.append("llama")
109 |             cmd.append(str(shots))
110 |             cmd.append(str(partial))
111 |             cmd.append(str(alpha))
112 |             cmd.append(str(capacity))
113 |             cmd.append(str(budget))
114 |             cmd = ' '.join(cmd)
115 |             print(cmd)
116 |             os.system(cmd)
117 |             print("-------------------------------------------")
118 | ## H2O
119 | print("="*10+" H2O " + "="*10)
120 | # OPT
121 | for size in ["6.7b", "13b", "30b"]:
122 |     if size == "6.7b":
123 |         tasks = ["piqa", "openbookqa"]
124 |     elif size == "13b":
125 |         tasks = ["winogrande", "openbookqa"]
126 |     elif size == "30b":
127 |         tasks = ["copa", "openbookqa"]
128 |     for task in tasks:
129 |         for ratio in [0.25, 0.125, 0.0625, 0.03125]:
130 |             cmd = []
131 |             cmd.append("bash h2o.sh")
132 |             cmd.append(task)
133 |             cmd.append(f"facebook/opt-{size}")
134 |             cmd.append("opt")
135 |             cmd.append(str(shots))
136 |             cmd.append(str(ratio)) # heavy_ratio
137 |             cmd.append(str(ratio)) # recent_ratio
138 |             cmd = ' '.join(cmd)
139 |             print(cmd)
140 |             os.system(cmd)
141 |             print("-------------------------------------------")
142 | 
143 | # Llama-2
144 | llama_2_dir = os.environ["LLAMA_PATH"]
145 | for size in ["7b", "13b"]:
146 |     if size == "7b":
147 |         tasks = ["rte", "piqa"]
148 |     elif size == "13b":
149 |         tasks = ["copa", "winogrande"]
150 |     for task in tasks:
151 |         for ratio in [0.25, 0.125, 0.0625, 0.03125]:
152 |             cmd = []
153 |             cmd.append("bash h2o.sh")
154 |             cmd.append(task)
155 |             cmd.append(f"{llama_2_dir}/llama-2-{size}")
156 |             cmd.append("llama")
157 |             cmd.append(str(shots))
158 |             cmd.append(str(ratio)) # heavy_ratio
159 |             cmd.append(str(ratio)) # recent_ratio
160 |             cmd = ' '.join(cmd)
161 |             print(cmd)
162 |             os.system(cmd)
163 |             print("-------------------------------------------")
164 | 
165 | ## Quant.
166 | print("="*10+" Quantization " + "="*10)
167 | # OPT
168 | for size in ["6.7b", "13b", "30b"]:
169 |     if size == "6.7b":
170 |         tasks = ["piqa", "openbookqa"]
171 |     elif size == "13b":
172 |         tasks = ["winogrande", "openbookqa"]
173 |     elif size == "30b":
174 |         tasks = ["copa", "openbookqa"]
175 |     for task in tasks:
176 |         for qbits in [8, 4, 2, 1]:
177 |             cmd = []
178 |             cmd.append("bash quant.sh")
179 |             cmd.append(task)
180 |             cmd.append(f"facebook/opt-{size}")
181 |             cmd.append("opt")
182 |             cmd.append(str(shots))
183 |             cmd.append(str(qbits))
184 |             cmd = ' '.join(cmd)
185 |             print(cmd)
186 |             os.system(cmd)
187 |             print("-------------------------------------------")
188 | 
189 | # Llama-2
190 | llama_2_dir = os.environ["LLAMA_PATH"]
191 | for size in ["7b", "13b"]:
192 |     if size == "7b":
193 |         tasks = ["rte", "piqa"]
194 |     elif size == "13b":
195 |         tasks = ["copa", "winogrande"]
196 |     for task in tasks:
197 |         for qbits in [8, 4, 2, 1]:
198 |             cmd = []
199 |             cmd.append("bash quant.sh")
200 |             cmd.append(task)
201 |             cmd.append(f"{llama_2_dir}/llama-2-{size}")
202 |             cmd.append("llama")
203 |             cmd.append(str(shots))
204 |             cmd.append(str(qbits))
205 |             cmd = ' '.join(cmd)
206 |             print(cmd)
207 |             os.system(cmd)
208 |             print("-------------------------------------------")
209 | 
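The four sections of `figure11.py` repeat the same model-to-task mapping. A possible way to express that mapping once is sketched below for the H2O section only; it builds the command strings without executing them, so it can serve as a dry run. The names `MODEL_TASKS`, `model_name`, and `h2o_commands` are hypothetical, and `/path/to/llama-2` is the placeholder used in the README rather than a real path.

```python
# Model/task pairs exactly as hard-coded in each section of figure11.py.
MODEL_TASKS = {
    ("opt", "6.7b"): ["piqa", "openbookqa"],
    ("opt", "13b"): ["winogrande", "openbookqa"],
    ("opt", "30b"): ["copa", "openbookqa"],
    ("llama", "7b"): ["rte", "piqa"],
    ("llama", "13b"): ["copa", "winogrande"],
}


def model_name(arch, size, llama_dir):
    """OPT models come from the hub; Llama-2 models from a local directory."""
    if arch == "opt":
        return f"facebook/opt-{size}"
    return f"{llama_dir}/llama-2-{size}"


def h2o_commands(llama_dir, shots=5, ratios=(0.25, 0.125, 0.0625, 0.03125)):
    """Build (but do not run) the h2o.sh calls for every model and task."""
    cmds = []
    for (arch, size), tasks in MODEL_TASKS.items():
        for task in tasks:
            for ratio in ratios:
                # heavy_ratio and recent_ratio are set to the same value.
                cmds.append(
                    f"bash h2o.sh {task} {model_name(arch, size, llama_dir)} "
                    f"{arch} {shots} {ratio} {ratio}")
    return cmds


cmds = h2o_commands("/path/to/llama-2")
print(len(cmds))
print(cmds[0])
```

Because dictionaries preserve insertion order, the commands come out in the same order as in the script: all OPT sizes first, then Llama-2.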


--------------------------------------------------------------------------------
/accuracy/lm_eval/figure13.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | shots=5
 4 | # Prepare dataset
 5 | echo "prepare dataset"
 6 | for task in copa openbookqa winogrande piqa rte; do
 7 |   python -u generate_task_data.py \
 8 |   --output-file "results/${task}-${shots}.jsonl" \
 9 |   --task-name ${task} \
10 |   --num-fewshot ${shots} 
11 | done
12 | 
13 | # Baseline
14 | echo "full cache"
15 | for task in copa openbookqa winogrande piqa rte; do
16 |   bash full_cache.sh ${task} "facebook/opt-6.7b" opt ${shots}
17 | done
18 | 
19 | # InfiniGen
20 | partial=0.1
21 | capacity=1.0
22 | alpha=99 
23 | budget=0.2
24 | 
25 | # w/o skewing
26 | echo "InfiniGen w/o skewing"
27 | for task in copa openbookqa winogrande piqa rte; do
28 |   bash ours.sh ${task} "../setup/opt-model-no-skew/opt-6.7b" "facebook/opt-6.7b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget} "no-skew"
29 | done
30 | 
31 | 
32 | # w/ skewing
33 | echo "InfiniGen w/ skewing"
34 | for task in copa openbookqa winogrande piqa rte; do
35 |   bash ours.sh ${task} "../setup/opt-model/opt-6.7b" "facebook/opt-6.7b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget}
36 | done
37 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/figure17.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | shots=5
 4 | # Prepare dataset
 5 | echo "prepare dataset"
 6 | for task in winogrande; do
 7 |   python -u generate_task_data.py \
 8 |   --output-file "results/${task}-${shots}.jsonl" \
 9 |   --task-name ${task} \
10 |   --num-fewshot ${shots} 
11 | done
12 | 
13 | 
14 | # Threshold (alpha) sweep
15 | partial=0.2
16 | capacity=1.0
17 | budget=0.2
18 | 
19 | echo "Threshold (alpha) sweep"
20 | for alpha in 1 2 3 4 5 6 7 8 9; do
21 |   echo alpha "${alpha}"
22 |   bash ours.sh winogrande "../setup/opt-model/opt-13b" "facebook/opt-13b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget}
23 | done
24 | 
25 | # Partial weight sweep
26 | alpha=4
27 | capacity=1.0
28 | budget=0.2
29 | 
30 | echo "Partial weight ratio sweep"
31 | for partial in 0.1 0.2 0.4 0.6 0.8 1; do
32 |   echo partial ratio "${partial}"
33 |   bash ours.sh winogrande "../setup/opt-model/opt-13b" "facebook/opt-13b" opt ${shots} ${partial} ${alpha} ${capacity} ${budget}
34 | done
35 | 
36 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/full_cache.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Inference, and generate output json file
 4 | task=$1
 5 | shots=$4
 6 | model=$2
 7 | model_arch=$3
 8 | base_name=$(basename "${model}")
 9 | 
10 | python -u run_lm_eval_harness.py \
11 |   --input-path results/${task}-${shots}.jsonl \
12 |   --output-path results/${task}-${shots}-${base_name}-full.jsonl \
13 |   --model-name ${model} \
14 |   --model-type ${model_arch}
15 | 
16 | # Evaluate results
17 | python -u evaluate_task_result.py \
18 |   --result-file results/${task}-${shots}-${base_name}-full.jsonl \
19 |   --task-name ${task} \
20 |   --num-fewshot ${shots} \
21 |   --model-type ${model_arch}
22 | 
23 | rm results/${task}-${shots}-${base_name}-full.jsonl
24 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/generate_task_data.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import json
 3 | 
 4 | from lm_eval import evaluator, tasks
 5 | from tasks import EvalHarnessAdaptor
 6 | 
 7 | 
 8 | if __name__ == '__main__':
 9 |     
10 | 
11 |     parser = argparse.ArgumentParser(
12 |                         prog = 'generate_task_data.py',
13 |                         description = 'Dump the prompts of an lm-eval-harness task as JSONL requests',
14 |                         epilog = 'Writes one request object per line to --output-file')
15 | 
16 |     parser.add_argument('--output-file', type=str, default='input.jsonl')
17 |     parser.add_argument('--task-name', type=str, default='hellaswag')
18 |     parser.add_argument('--num-fewshot', type=int, default=0)
19 |     args = parser.parse_args()
20 | 
21 |     seq = 1024
22 |     total_batch = 1
23 |     pe = 'fixed'
24 | 
25 |     with open(args.output_file, 'w') as f:
26 |         pass
27 | 
28 |     class DryRunner:
29 |         def eval(self, batch):
30 | 
31 |             with open(args.output_file, 'a') as f:
32 | 
33 |                 for text in batch['text']:
34 |                     item = {
35 |                         "best_of": 1, 
36 |                         "echo": True, 
37 |                         "logprobs": 1, 
38 |                         "max_tokens": 0, 
39 |                         "model": "x", 
40 |                         "n": 1, 
41 |                         "prompt": text, 
42 |                         "request_type": "language-model-inference", 
43 |                         "stop": None, 
44 |                         "temperature": 0, 
45 |                         "top_p": 1
46 |                     }
47 | 
48 |                     f.write(json.dumps(item) + '\n')
49 | 
50 |             out = {
51 |                 'mask_loss': [1.0] * len(batch['text']),
52 |                 'each_correct': [True] * len(batch['text']),
53 |             }
54 |             return out
55 | 
56 |     t = DryRunner()
57 |     adaptor = EvalHarnessAdaptor(t, seq, total_batch, shrink=pe != "fixed")
58 |     results = evaluator.evaluate(adaptor, tasks.get_task_dict([args.task_name
59 |                                                             #"lambada_openai",
60 |                                                             #"piqa",
61 |                                                             #"hellaswag",
62 |                                                             #"winogrande",
63 |                                                             #"mathqa",
64 |                                                             #"pubmedqa",
65 |                                                             # "boolq",
66 |                                                             # "cb",
67 |                                                             # "copa",
68 |                                                             # "multirc",
69 |                                                             # "record",
70 |                                                             # "wic",
71 |                                                             # "wsc",
72 |                                                             ]), False, args.num_fewshot, None)
73 |     print('Finished')
74 | 
75 |     # dumped = json.dumps(results, indent=2)
76 |     # print(dumped)
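The request dictionary written here is rebuilt field by field in `evaluate_task_result.py`, which then uses `json.dumps(request)` as a cache key. The sketch below illustrates why that lookup works after a JSONL round trip and why it depends on key order. The `{"request", "result"}` line layout follows what `evaluate_task_result.py` reads; the writer of those lines is not shown in this listing, so the `line` built here is an assumption. `make_request` and `stable_key` are hypothetical helpers.

```python
import json


def make_request(prompt):
    """Same fields, in the same order, as the scripts' request dict."""
    return {
        "best_of": 1, "echo": True, "logprobs": 1, "max_tokens": 0,
        "model": "x", "n": 1, "prompt": prompt,
        "request_type": "language-model-inference",
        "stop": None, "temperature": 0, "top_p": 1,
    }


def json_to_key(obj):
    # As in evaluate_task_result.py: the serialized request is the key.
    return json.dumps(obj)


def stable_key(obj):
    # Hypothetical alternative that does not depend on key order.
    return json.dumps(obj, sort_keys=True)


prompt = "Question: What do plants need to grow?\nAnswer:"
line = json.dumps({"request": make_request(prompt), "result": {"choices": []}})

# Reading the line back preserves key order, so the rebuilt key matches.
item = json.loads(line)
cache = {json_to_key(item["request"]): item["result"]}
hit = json_to_key(make_request(prompt)) in cache

# A request with the same fields in a different order misses the plain key.
reordered = dict(reversed(list(make_request(prompt).items())))
plain_match = json_to_key(reordered) == json_to_key(make_request(prompt))
stable_match = stable_key(reordered) == stable_key(make_request(prompt))
print(hit, plain_match, stable_match)
```

The practical consequence is that the request dictionaries in the two scripts must list their fields in the same order, unless the key function sorts them.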


--------------------------------------------------------------------------------
/accuracy/lm_eval/h2o.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Inference, and generate output json file
 4 | task=$1
 5 | shots=$4
 6 | model=$2
 7 | model_arch=$3
 8 | base_name=$(basename "${model}")
 9 | heavy_ratio=$5
10 | recent_ratio=$6
11 | 
12 | python -u run_lm_eval_harness.py \
13 |   --input-path results/${task}-${shots}.jsonl \
14 |   --output-path results/${task}-${shots}-${base_name}-h2o.jsonl \
15 |   --model-name ${model} \
16 |   --model-type ${model_arch} \
17 |   --heavy_ratio ${heavy_ratio} \
18 |   --recent_ratio ${recent_ratio} \
19 |   --enable_small_cache
20 | 
21 | ## Evaluate results
22 | python -u evaluate_task_result.py \
23 |   --result-file results/${task}-${shots}-${base_name}-h2o.jsonl \
24 |   --task-name ${task} \
25 |   --num-fewshot ${shots} \
26 |   --model-type ${model_arch}
27 | 
28 | rm results/${task}-${shots}-${base_name}-h2o.jsonl
29 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/ours.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Inference, and generate output json file
 4 | task=$1
 5 | shots=$5
 6 | model_path=$2
 7 | model=$3
 8 | model_arch=$4
 9 | partial_weight=$6
10 | alpha=$7
11 | capacity=$8
12 | budget=$9
13 | no_skewing=${10}
14 | base_name=$(basename "${model}")
15 | if [ -z "${no_skewing}" ]; then
16 |   weight_path="../setup/weights/${base_name}_${partial_weight}"
17 | else
18 |   weight_path="../setup/weights-no-skew/${base_name}_${partial_weight}"
19 | fi
20 | skewing_path="../setup/skewing_matrix/${base_name}.pt"
21 | 
22 | python -u run_lm_eval_harness.py \
23 |   --input-path results/${task}-${shots}.jsonl \
24 |   --output-path results/${task}-${shots}-${base_name}-ours.jsonl \
25 |   --model-name ${model} \
26 |   --model-type ${model_arch} \
27 |   --partial_weight_ratio ${partial_weight} \
28 |   --partial_weight_path ${weight_path} \
29 |   --ours \
30 |   --model-path ${model_path} \
31 |   --skewing_matrix_path ${skewing_path} \
32 |   --alpha ${alpha} \
33 |   --capacity ${capacity} \
34 |   --budget ${budget}
35 | 
36 | # Evaluate results
37 | python -u evaluate_task_result.py \
38 |   --result-file results/${task}-${shots}-${base_name}-ours.jsonl \
39 |   --task-name ${task} \
40 |   --num-fewshot ${shots} \
41 |   --model-type ${model_arch}
42 | 
43 | rm results/${task}-${shots}-${base_name}-ours.jsonl
44 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/quant.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Inference, and generate output json file
 4 | task=$1
 5 | shots=$4
 6 | model=$2
 7 | model_arch=$3
 8 | base_name=$(basename "${model}")
 9 | qbits=$5
10 | 
11 | python -u run_lm_eval_harness.py \
12 |   --input-path results/${task}-${shots}.jsonl \
13 |   --output-path results/${task}-${shots}-${base_name}-quant.jsonl \
14 |   --model-name ${model} \
15 |   --model-type ${model_arch} \
16 |   --qbits ${qbits} \
17 |   --enable_quant
18 | 
19 | ## Evaluate results
20 | python -u evaluate_task_result.py \
21 |   --result-file results/${task}-${shots}-${base_name}-quant.jsonl \
22 |   --task-name ${task} \
23 |   --num-fewshot ${shots} \
24 |   --model-type ${model_arch}
25 | 
26 | rm results/${task}-${shots}-${base_name}-quant.jsonl
27 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/run_lm_eval_harness.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import json, tqdm
  3 | import torch
  4 | import copy
  5 | import os, sys
  6 | import math
  7 | 
  8 | def set_symlink(model_type, fname):
  9 |     model_path = "../transformers/src/transformers/models/" + model_type
 10 |     linker_path = os.path.realpath("../src/" + fname)
 11 |     if not os.path.exists(linker_path):
 12 |         print(f"No file exists at {linker_path}")
 13 |         sys.exit(1)
 14 |     if not os.path.exists(model_path):
 15 |         print(f"No directory exists at {model_path}")
 16 |         sys.exit(1)
 17 |     curr_dir = os.getcwd()
 18 |     os.chdir(model_path)
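 19 |     if os.path.lexists(f'modeling_{model_type}.py'):
 20 |         # Remove any stale file or (possibly dangling) symlink first.
 21 |         os.remove(f'modeling_{model_type}.py')
 22 |     # Point the transformers modeling file at the selected implementation.
 23 |     os.symlink(linker_path, f'modeling_{model_type}.py')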
 24 |     os.chdir(curr_dir)
 25 | 
 26 | if __name__ == '__main__':
 27 | 
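 28 |     parser = argparse.ArgumentParser(
 29 |                         prog = 'run_lm_eval_harness.py',
 30 |                         description = 'Score JSONL prompts with a causal LM and dump per-token log-probabilities.',
 31 |                         epilog = 'Modes: full cache (default), --enable_quant, --enable_small_cache (H2O), --ours (InfiniGen).')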
 32 | 
 33 |     parser.add_argument('--input-path', type=str, default=None)
 34 |     parser.add_argument('--output-path', type=str, default=None)
 35 |     parser.add_argument('--model-name', type=str, default='facebook/opt-350m')
 36 |     parser.add_argument('--model-path', type=str, default=None)
 37 |     parser.add_argument('--model-type', type=str, default='opt')
 38 | 
 39 |     # Quant.
 40 |     parser.add_argument('--enable_quant', action='store_true')
 41 |     parser.add_argument("--qbits", type=int, default=8)
 42 | 
 43 |     # H2O
 44 |     parser.add_argument('--enable_small_cache', action='store_true')
 45 |     parser.add_argument("--heavy_ratio", type=float, default=0.1)
 46 |     parser.add_argument("--recent_ratio", type=float, default=0.1)
 47 | 
 48 |     # InfiniGen
 49 |     parser.add_argument('--ours', action='store_true')
 50 |     parser.add_argument("--partial_weight_ratio", type=float, default=0.1)
 51 |     parser.add_argument("--partial_weight_path", type=str)
 52 |     parser.add_argument("--skewing_matrix_path", type=str)
 53 |     parser.add_argument("--alpha",type=float, default=5)
 54 |     parser.add_argument("--capacity",type=float, default=1.0)
 55 |     parser.add_argument("--budget",type=float, default=0.2)
 56 |     args = parser.parse_args()
 57 |     
 58 |     if args.ours:
 59 |         set_symlink(args.model_type, f"modeling_{args.model_type}_ours.py")
 60 |     else:
 61 |         set_symlink(args.model_type, f"modeling_{args.model_type}_orig.py")
 62 | 
 63 | 
 64 |     input_path = args.input_path
 65 |     output_path = args.output_path
 66 |     model_name = args.model_name
 67 | 
 68 |     from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
 69 |     config = AutoConfig.from_pretrained(model_name)
 70 |     tokenizer = AutoTokenizer.from_pretrained(model_name)  # device_map/torch_dtype only apply to models
 71 |     if args.model_path is None:
 72 |         model = AutoModelForCausalLM.from_pretrained(model_name, device_map = 'auto', torch_dtype=torch.float16)
 73 |     else:
 74 |         model = AutoModelForCausalLM.from_pretrained(args.model_path)
 75 | 
 76 |     if args.enable_quant:
 77 |         if args.model_type == "opt":
 78 |             for i, layer in enumerate(model.model.decoder.layers):
 79 |                 if i>=2:
 80 |                     layer.self_attn.enable_quant = True
 81 |                     layer.self_attn.qbits = args.qbits
 82 |         if args.model_type == "llama":
 83 |             for i, layer in enumerate(model.model.layers):
 84 |                 if i>=2:
 85 |                     layer.self_attn.enable_quant = True
 86 |                     layer.self_attn.qbits = args.qbits
 87 | 
 88 |     elif args.enable_small_cache:
 89 |         from utils_lm_eval.modify_llama import convert_kvcache_llama_heavy_recent, LlamaAttention_heavy_hitter
 90 |         from utils_lm_eval.modify_gptneox import convert_kvcache_gpt_neox_heavy_recent, GPTNeoXAttention_Mask
 91 |         from utils_lm_eval.modify_opt import convert_kvcache_opt_heavy_recent, OPTAttention_Mask
 92 |         ENABLE_Heavy_Hitter_FUNCTIONS = {
 93 |             "llama": convert_kvcache_llama_heavy_recent,
 94 |             "opt": convert_kvcache_opt_heavy_recent,
 95 |             "gpt_neox": convert_kvcache_gpt_neox_heavy_recent,
 96 |         }
 97 |         print('Enable Small Cache Size')
 98 |         config.heavy_ratio = args.heavy_ratio
 99 |         config.recent_ratio = args.recent_ratio
100 |         base_path = os.path.basename(args.model_name)
101 |         if not os.path.exists(f"../h2o_model/{base_path}.pt"):
102 |             os.makedirs("../h2o_model", exist_ok=True)
103 |             checkpoint = copy.deepcopy(model.state_dict())
104 |             torch.save(checkpoint, f"../h2o_model/{base_path}.pt")
105 |         model = ENABLE_Heavy_Hitter_FUNCTIONS[args.model_type](model, config)
106 |         model.load_state_dict(torch.load(f"../h2o_model/{base_path}.pt"))
107 |         model = model.to(torch.float16)
108 |     
109 |     elif args.ours:
110 |         if args.model_type == "opt":
111 |             for layer in range(len(model.model.decoder.layers)):
112 |                 model.model.decoder.layers[layer].self_attn.partial_weight_ratio = args.partial_weight_ratio
113 |                 model.model.decoder.layers[layer].self_attn.partial_weight_q = torch.load(args.partial_weight_path + "/partial_weight_q_" + str(layer) + ".pt")
114 |                 model.model.decoder.layers[layer].self_attn.alpha = args.alpha
115 |                 model.model.decoder.layers[layer].self_attn.capacity = args.capacity
116 |                 model.model.decoder.layers[layer].self_attn.budget = args.budget
117 |         if args.model_type == "llama":
118 |             if args.skewing_matrix_path is not None:
119 |                 A = torch.load(args.skewing_matrix_path)
120 |             for layer in range(len(model.model.layers)):
121 |                 model.model.layers[layer].self_attn.partial_weight_ratio = args.partial_weight_ratio
122 |                 model.model.layers[layer].self_attn.partial_weight_q = torch.load(args.partial_weight_path + "/partial_weight_q_" + str(layer) + ".pt")
123 |                 model.model.layers[layer].self_attn.alpha = args.alpha
124 |                 model.model.layers[layer].self_attn.capacity = args.capacity
125 |                 model.model.layers[layer].self_attn.budget = args.budget
126 |                 if args.skewing_matrix_path is not None:
127 |                     model.model.layers[layer].self_attn.skewing_matrix = A[layer]
128 | 
129 |     model.half().eval().cuda()
130 | 
131 |     requests = []
132 |     with open(input_path, 'r') as f:
133 |         for line in f:
134 |             if line.strip() != '':
135 |                 requests.append(json.loads(line))
136 | 
137 |     results = []
138 |     density=[]
139 |     with torch.no_grad():
140 |         for request in tqdm.tqdm(requests):
141 |             result = {'request': request, 'result': {}}
142 |             prompt = request['prompt']
143 |             input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids.to(model.device)
144 | 
145 |             logits = model(input_ids).logits.log_softmax(dim=-1)
146 |             if args.ours:
147 |                 density.append(model.get_density())
148 | 
149 |             values, indices = logits.squeeze(0).topk(dim=-1, k=1)
150 |             tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze(0))
151 |             
152 |             gold_indices = input_ids[:, 1:] # skip first
153 |             logprobs = [None] + torch.gather(logits, -1, gold_indices.unsqueeze(-1)).squeeze(-1).squeeze(0).detach().cpu().tolist()
154 |             top_logprobs = [None] + [{tokenizer.convert_ids_to_tokens(i.item()): v.item()} for v, i in zip(values.squeeze(-1), indices.squeeze(-1))]
155 |             
156 |             result['result'] = {
157 |                 "choices": [
158 |                     {
159 |                         "text": prompt, 
160 |                         "logprobs": {
161 |                             "tokens": tokens, 
162 |                             "token_logprobs": logprobs, 
163 |                             "top_logprobs": top_logprobs, 
164 |                             "text_offset": []
165 |                         }, 
166 |                         "finish_reason": "length"
167 |                     }
168 |                 ], 
169 |                 "request_time": {
170 |                     "batch_time": 0, 
171 |                     "batch_size": 1}
172 |             }
173 |             
174 |             results.append(result)
175 |             
176 |             if args.ours:
177 |                 if args.model_type == "opt":
178 |                     for layer in model.model.decoder.layers:
179 |                         layer.self_attn.previous_hidden_states = None
180 |                 if args.model_type == "llama":
181 |                     for layer in model.model.layers:
182 |                         layer.self_attn.previous_hidden_states = None
183 | 
184 |     if args.ours and density:  # skip the summary when no requests were scored
185 |         density = sum(density) / len(density) * 100
186 |         retain_ratio = (1 - math.sqrt(1 - density / 100)) * 100
187 |         #print("\ndensity: %.2f"%(density))
188 |         print("retain ratio: %.2f\n"%(retain_ratio))
189 | 
190 |     with open(output_path, 'w') as f:
191 |         for result in results:
192 |             f.write(json.dumps(result) + '\n')
193 | 


--------------------------------------------------------------------------------
/accuracy/lm_eval/run_text_generation.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # coding=utf-8
  3 | # Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
  4 | # Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
  5 | #
  6 | # Licensed under the Apache License, Version 2.0 (the "License");
  7 | # you may not use this file except in compliance with the License.
  8 | # You may obtain a copy of the License at
  9 | #
 10 | #     http://www.apache.org/licenses/LICENSE-2.0
 11 | #
 12 | # Unless required by applicable law or agreed to in writing, software
 13 | # distributed under the License is distributed on an "AS IS" BASIS,
 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 15 | # See the License for the specific language governing permissions and
 16 | # limitations under the License.
 17 | """ Conditional text generation with the auto-regressive models
 18 | """
 19 | 
 20 | 
 21 | import argparse
 22 | import logging
 23 | 
 24 | import numpy as np
 25 | import torch
 26 | import json
 27 | import tqdm 
 28 | import copy 
 29 | 
 30 | from transformers import (
 31 |     CTRLLMHeadModel,
 32 |     CTRLTokenizer,
 33 |     GPT2LMHeadModel,
 34 |     GPT2Tokenizer,
 35 |     OpenAIGPTLMHeadModel,
 36 |     OpenAIGPTTokenizer,
 37 |     TransfoXLLMHeadModel,
 38 |     TransfoXLTokenizer,
 39 |     XLMTokenizer,
 40 |     XLMWithLMHeadModel,
 41 |     XLNetLMHeadModel,
 42 |     XLNetTokenizer,
 43 | )
 44 | 
 45 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
 46 | 
 47 | from utils_lm_eval.modify_llama import convert_kvcache_llama_heavy_recent, LlamaAttention_heavy_hitter
 48 | from utils_lm_eval.modify_gptneox import convert_kvcache_gpt_neox_heavy_recent, GPTNeoXAttention_Mask
 49 | from utils_lm_eval.modify_opt import convert_kvcache_opt_heavy_recent, OPTAttention_Mask
 50 | 
 51 | 
 52 | logging.basicConfig(
 53 |     format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
 54 |     datefmt="%m/%d/%Y %H:%M:%S",
 55 |     level=logging.INFO,
 56 | )
 57 | logger = logging.getLogger(__name__)
 58 | 
 59 | MAX_LENGTH = int(10000)  # Hardcoded max length to avoid infinite loop
 60 | 
 61 | MODEL_CLASSES = {
 62 |     "gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
 63 |     "ctrl": (CTRLLMHeadModel, CTRLTokenizer),
 64 |     "openai-gpt": (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
 65 |     "xlnet": (XLNetLMHeadModel, XLNetTokenizer),
 66 |     "transfo-xl": (TransfoXLLMHeadModel, TransfoXLTokenizer),
 67 |     "xlm": (XLMWithLMHeadModel, XLMTokenizer),
 68 | }
 69 | 
 70 | # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
 71 | # in https://github.com/rusiaaman/XLNet-gen#methodology
 72 | # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
 73 | PREFIX = """In 1991, the remains of Russian Tsar Nicholas II and his family
 74 | (except for Alexei and Maria) are discovered.
 75 | The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
 76 | remainder of the story. 1883 Western Siberia,
 77 | a young Grigori Rasputin is asked by his father and a group of men to perform magic.
 78 | Rasputin has a vision and denounces one of the men as a horse thief. Although his
 79 | father initially slaps him for making such an accusation, Rasputin watches as the
 80 | man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
 81 | the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
 82 | with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
 83 | 
 84 | 
 85 | def set_seed(args):
 86 |     np.random.seed(args.seed)
 87 |     torch.manual_seed(args.seed)
 88 |     if args.n_gpu > 0:
 89 |         torch.cuda.manual_seed_all(args.seed)
 90 | 
 91 | 
 92 | ENABLE_Heavy_Hitter_FUNCTIONS = {
 93 |     "llama": convert_kvcache_llama_heavy_recent,
 94 |     "opt": convert_kvcache_opt_heavy_recent,
 95 |     "gpt_neox": convert_kvcache_gpt_neox_heavy_recent,
 96 | }
 97 | 
 98 | 
 99 | def main():
100 |     parser = argparse.ArgumentParser()
101 | 
102 |     parser.add_argument("--model_arch", type=str, default='llama')
103 |     parser.add_argument("--model_name", type=str, default='huggyllama/llama-13b')
104 |     parser.add_argument("--cache_dir", type=str, default='../../checkpoint/')
105 | 
106 |     parser.add_argument("--heavy_ratio", type=float, default=0.1)
107 |     parser.add_argument("--recent_ratio", type=float, default=0.1)
108 | 
109 |     parser.add_argument("--length", type=int, default=64)
110 | 
111 |     parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
112 |     parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
113 |     parser.add_argument(
114 |         "--fp16",
115 |         action="store_true",
116 |         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
117 |     )
118 |     args = parser.parse_args()
119 | 
120 |     args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
121 |     args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
122 | 
123 |     logger.warning(f"device: {args.device}, n_gpu: {args.n_gpu}, 16-bits training: {args.fp16}")
124 |     set_seed(args)
125 | 
126 |     # Change to your custom prompt text
127 |     # prompt_text = 'In the year 2087, humanity has achieved remarkable technological advancements and established colonies on multiple planets within the Milky Way galaxy. Interstellar travel has become commonplace, with faster-than-light spacecraft enabling people to explore distant star systems. Earth has undergone significant changes due to sustainable development efforts, such as harnessing renewable energy sources and implementing widespread ecological restoration projects. However, alongside these triumphs, new challenges have emerged, including the rise of artificial intelligence, ethical dilemmas surrounding genetic engineering, and interplanetary political tensions. Against this backdrop, a team of intrepid scientists embarks on a mission to uncover the secrets of an ancient alien civilization, hidden deep within an uncharted exoplanet. As they navigate treacherous terrains and encounter otherworldly phenomena, they must confront their own fears and reconcile humanity\'s thirst for knowledge with the potential consequences of uncovering secrets that were better left buried. The fate of both their mission and the future of humanity hang in the balance.'
128 |     prompt_text = 'In a small, bustling cafe nestled in the heart of a vibrant city, a serendipitous event unfolded, leaving a lasting impression on all who witnessed it. As the patrons sat sipping their coffees and engaging in animated conversations, a talented street musician entered the cafe, carrying a weathered guitar and radiating an aura of creativity.'
129 | 
130 |     model_name = args.model_name
131 |     config = AutoConfig.from_pretrained(model_name, cache_dir=args.cache_dir)
132 |     config.heavy_ratio = args.heavy_ratio
133 |     config.recent_ratio = args.recent_ratio
134 | 
135 |     tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, cache_dir=args.cache_dir)
136 | 
137 |     ######## Generate with Full Cache
138 |     model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=args.cache_dir)
139 |     model.half().eval().cuda()
140 | 
141 |     # input_ids = tokenizer(prompt_text, return_tensors='pt').input_ids.to(model.device)
142 |     input_ids = tokenizer(prompt_text, add_special_tokens=False, return_tensors='pt').input_ids.to(model.device)
143 | 
144 |     generate_ids = model.generate(input_ids, max_new_tokens=args.length)
145 |     result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
146 |     print("################## Generated Context with Full Cache ###################")
147 |     print(result)
148 | 
149 | 
150 |     ######### Enable HH
151 |     checkpoint = copy.deepcopy(model.state_dict())
152 |     model = ENABLE_Heavy_Hitter_FUNCTIONS[args.model_arch](model, config)
153 |     model.load_state_dict(checkpoint)
154 |     model.half().eval().cuda()
155 | 
156 |     generate_ids_hh = model.generate(input_ids, max_new_tokens=args.length)
157 |     result_hh = tokenizer.batch_decode(generate_ids_hh, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
158 |     print("################## Generated Context with Heavy Hitter Oracle ###################")
159 |     print(result_hh)
160 | 
161 | 
162 | if __name__ == "__main__":
163 |     main()


--------------------------------------------------------------------------------
/accuracy/lm_eval/tasks/__init__.py:
--------------------------------------------------------------------------------
1 | from tasks.eval_harness import EvalHarnessAdaptor


--------------------------------------------------------------------------------
/accuracy/lm_eval/tasks/eval_harness.py:
--------------------------------------------------------------------------------
  1 | from functools import partial
  2 | 
  3 | import os
  4 | import transformers
  5 | from lm_eval.base import LM
  6 | from tqdm import tqdm
  7 | import numpy as np
  8 | 
  9 | from tasks.util import sample_batch, shrink_seq
 10 | import multiprocessing
 11 | import ftfy
 12 | 
 13 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
 14 | 
 15 | tokenizer = None
 16 | 
 17 | def process_init():
 18 |     global tokenizer
 19 |     model_name = os.environ.get('MODEL_NAME', 'facebook/opt-1.3b')
 20 | 
 21 |     if model_name == "EleutherAI/gpt-neox-20b":
 22 |         tokenizer = AutoTokenizer.from_pretrained(model_name)
 23 |         tokenizer.model_max_length = int(1e30)
 24 |         tokenizer.pad_token = "<|endoftext|>"
 25 |     elif model_name == 'huggyllama/llama-7b':
 26 |         tokenizer = AutoTokenizer.from_pretrained(model_name)
 27 |         tokenizer.model_max_length = int(1e30)
 28 |         tokenizer.pad_token = "<|endoftext|>"
 29 |     else:
 30 |         tokenizer = AutoTokenizer.from_pretrained(model_name)
 31 |         tokenizer.add_bos_token = False
 32 | 
 33 | def process_request(x, seq):
 34 |     global tokenizer
 35 | 
 36 |     ctx, cont = x
 37 | #     ctx_tokens = tokenizer.encode("<|endoftext|>" + ftfy.fix_text(ctx, normalization="NFKC"))
 38 |     ctx_text = ftfy.fix_text(ctx, normalization="NFKC")
 39 |     cont_text = ftfy.fix_text(cont, normalization="NFKC")
 40 |     all_text = ctx_text + cont_text
 41 | 
 42 |     ctx_tokens = tokenizer(ctx_text, add_special_tokens=False)['input_ids']
 43 |     cont_tokens = tokenizer(cont_text, add_special_tokens=False)['input_ids']
 44 | 
 45 |     all_tokens = ctx_tokens + cont_tokens
 46 |     all_tokens = np.array(all_tokens)[-seq:]  # truncate sequence at seq length
 47 | 
 48 |     provided_ctx = len(all_tokens) - 1
 49 |     pad_amount = seq - provided_ctx
 50 | 
 51 |     return {
 52 |         "obs": np.pad(all_tokens[:-1], ((0, pad_amount),), constant_values=tokenizer.pad_token_id),
 53 |         "target": np.pad(all_tokens[1:], ((0, pad_amount),), constant_values=tokenizer.pad_token_id),
 54 |         "ctx_length": seq,
 55 |         "eval_mask": np.logical_and(
 56 |             np.arange(0, seq) > len(all_tokens) - len(cont_tokens) - 2,
 57 |             np.arange(0, seq) < len(all_tokens) - 1
 58 |         ),
 59 |         "prompt": ctx_text,
 60 |         "target": cont_text,  # NOTE: duplicate key; this overrides the padded "target" array above
 61 |         "text": all_text,
 62 |     }
 63 | 
 64 | 
 65 | class EvalHarnessAdaptor(LM):
 66 |     def greedy_until(self, requests):
 67 |         raise NotImplementedError("greedy_until is not supported by this adaptor")
 68 | 
 69 |     def loglikelihood_rolling(self, requests):
 70 |         raise NotImplementedError("loglikelihood_rolling is not supported by this adaptor")
 71 | 
 72 |     def __init__(self, tpu_cluster, seq, batch, shrink, min_seq=None):
 73 |         super().__init__()
 74 |         self.tpu = tpu_cluster
 75 |         self.seq = seq
 76 |         self.batch = batch
 77 |         self.shrink = shrink
 78 |         self.min_seq = min_seq
 79 | 
 80 |         self.pool = multiprocessing.Pool(processes=1, initializer=process_init)
 81 |         # self.pool = multiprocessing.Pool(initializer=process_init)
 82 |         process_init()
 83 | 
 84 |     def convert_requests(self, requests):
 85 |         return self.pool.imap(partial(process_request, seq=self.seq), requests)
 86 | 
 87 |     def loglikelihood(self, requests):
 88 |         output = []
 89 | 
 90 |         r = self.convert_requests(requests)
 91 |         zero_example = process_request(requests[0], self.seq)
 92 | 
 93 |         for b in tqdm(sample_batch(r, self.batch, zero_example),
 94 |                       desc="LM eval harness",
 95 |                       total=len(requests) // self.batch):
 96 | 
 97 |             if self.shrink:
 98 |                 b = shrink_seq(b, min_seq=self.min_seq)
 99 | 
100 |             out = self.tpu.eval(b)
101 | 
102 |             for loss, correct in zip(out["mask_loss"], out["each_correct"]):
103 |                 output.append((float(-loss), bool(correct)))
104 | 
105 |         return output
106 | 
107 | 
108 | 
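
The `eval_mask` built in `process_request` selects the positions of `obs` whose next-token prediction is a continuation token. A minimal pure-Python sketch of the same index arithmetic follows. The helper name `eval_mask` and its arguments are hypothetical, and it assumes the combined sequence is not truncated (context plus continuation fits within `seq`).

```python
def eval_mask(num_ctx_tokens, num_cont_tokens, seq):
    """Return a boolean list marking positions that predict continuation tokens.

    Mirrors the two np.arange comparisons in process_request, assuming
    num_ctx_tokens + num_cont_tokens <= seq (no truncation).
    """
    total = num_ctx_tokens + num_cont_tokens
    lower = total - num_cont_tokens - 2  # last position that still predicts context
    upper = total - 1                    # obs drops the final token
    return [lower < i < upper for i in range(seq)]


# 3 context tokens, 2 continuation tokens, padded to length 8.
mask = eval_mask(3, 2, 8)
print([i for i, keep in enumerate(mask) if keep])
```

With three context tokens and two continuation tokens, positions 2 and 3 are selected: `obs[2]` is the last context token and predicts the first continuation token, and `obs[3]` predicts the second.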


--------------------------------------------------------------------------------
/accuracy/lm_eval/tasks/util.py:
--------------------------------------------------------------------------------
 1 | from itertools import zip_longest
 2 | 
 3 | import numpy as np
 4 | 
 5 | 
 6 | def grouper(n, iterable, fillvalue):
 7 |     "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
 8 |     args = [iter(iterable)] * n
 9 |     return zip_longest(fillvalue=fillvalue, *args)
10 | 
11 | 
12 | # divide the seq length by 2 until it would truncate actual context
13 | def shrink_seq(examples, min_seq=None):
14 |     length = examples["obs"].shape[-1]
15 | 
16 |     new_length = length // 2
17 | 
18 |     if min_seq is not None:
19 |         if new_length < min_seq:
20 |             return examples
21 | 
22 |     max_length = np.max(examples["eval_mask"] * np.arange(0, length)) + 1
23 | 
24 |     if max_length < new_length:
25 |         examples["obs"] = examples["obs"][:, :new_length]
26 |         examples["target"] = examples["target"][:, :new_length]
27 |         examples["eval_mask"] = examples["eval_mask"][:, :new_length]
28 | 
29 |         return shrink_seq(examples, min_seq=min_seq)
30 |     else:
31 |         return examples
32 | 
33 | 
34 | def sample_batch(examples, bs, zero_example_shape):
35 |     zero_example = {
36 |         "obs": np.zeros_like(zero_example_shape["obs"]),
37 |         "target": np.zeros_like(zero_example_shape["target"]),
38 |         "eval_mask": np.zeros_like(zero_example_shape["eval_mask"]),
39 |         "ctx_length": 0, "text": "",  # padding samples need "text" too; it is read for every sample below
40 |     }
41 | 
42 |     for batch in grouper(bs, examples, zero_example):
43 |         batch_flattened = {
44 |             "obs": [],
45 |             "target": [],
46 |             "eval_mask": [],
47 |             "ctx_length": [],
48 |             "text": [],
49 |         }
50 | 
51 |         for sample in batch:
52 |             batch_flattened["obs"].append(sample["obs"])
53 |             batch_flattened["target"].append(sample["target"])
54 |             batch_flattened["eval_mask"].append(sample["eval_mask"])
55 |             batch_flattened["ctx_length"].append(sample["ctx_length"])
56 |             batch_flattened["text"].append(sample["text"])
57 | 
58 |         batch_flattened["obs"] = np.array(batch_flattened["obs"])
59 |         batch_flattened["target"] = np.array(batch_flattened["target"])
60 |         batch_flattened["eval_mask"] = np.array(batch_flattened["eval_mask"])
61 |         batch_flattened["ctx_length"] = np.array(batch_flattened["ctx_length"])
62 | 
63 |         yield batch_flattened
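
`sample_batch` relies on `grouper` to cut the request stream into fixed-size batches and to pad the final batch with a fill value. The sketch below repeats `grouper` as defined above and runs it on a small hypothetical input (integers stand in for examples, `None` for the zero example) to show that padding behavior.

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue):
    """Collect items into tuples of length n, padding the last one with fillvalue."""
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)


# Seven items in batches of three: the last batch is padded with two fill values.
batches = [list(group) for group in grouper(3, range(7), None)]
print(batches)
```

Because the last batch is padded rather than dropped, the fill value has to carry every key that `sample_batch` reads from a real sample.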


--------------------------------------------------------------------------------
/accuracy/lm_eval/utils_lm_eval/modify_llama.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import pdb
  3 | import copy
  4 | import math
  5 | import numpy as np 
  6 | from dataclasses import dataclass
  7 | from typing import Optional, Tuple, Union
  8 | 
  9 | import torch
 10 | from torch import nn
 11 | import torch.utils.checkpoint
 12 | import torch.nn.functional as F
 13 | from torch.cuda.amp import autocast
 14 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 15 | 
 16 | 
 17 | from transformers.models.llama.configuration_llama import LlamaConfig
 18 | from transformers.models.llama.modeling_llama import LlamaRotaryEmbedding, LlamaAttention, apply_rotary_pos_emb
 19 | 
 20 | 
 21 | __all__ = ['convert_kvcache_llama_heavy_recent', 'LlamaAttention_heavy_hitter']
 22 | 
 23 | 
 24 | def local_heavy_hitter_mask(attn_weights, heavy_budget):
 25 | 
 26 |     # attn_weights (BS, head, query, keys)
 27 |     dtype_attn_weights = attn_weights.dtype
 28 |     seq_length = attn_weights.shape[-1]
 29 |     padding_length = 0
 30 | 
 31 |     offset = torch.finfo(attn_weights.dtype).min
 32 |     tmp_attn = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(dtype_attn_weights)
 33 | 
 34 |     accumulated_attention_score = torch.sum(tmp_attn[:,:,padding_length:heavy_budget+padding_length,:], dim=-2) #(head, keys)
 35 |     accumulated_attention_score[:,:,heavy_budget+padding_length:] = 0
 36 |     accumulated_attention_score[:,:,:padding_length] = 0
 37 | 
 38 |     mask_bottom = torch.zeros_like(attn_weights, dtype=torch.bool)
 39 |     mask_bottom[:,:, padding_length:heavy_budget+padding_length, padding_length:heavy_budget+padding_length] = True
 40 | 
 41 |     for token_index in range(heavy_budget+padding_length, seq_length):
 42 | 
 43 |         tmp_attn_index = nn.functional.softmax(attn_weights[:,:,token_index,:], dim=-1, dtype=torch.float32).to(dtype_attn_weights)
 44 |         _, tmp_topk_index = accumulated_attention_score.topk(k=heavy_budget-1, dim=-1)
 45 |         zeros_index = torch.zeros_like(tmp_attn_index, dtype=torch.bool)
 46 |         mask_bottom_index = zeros_index.scatter(-1, tmp_topk_index, True) #(head, keys)
 47 |         mask_bottom_index[:,:, token_index] = True
 48 | 
 49 |         mask_bottom[:,:,token_index,:] = mask_bottom_index
 50 |         accumulated_attention_score += tmp_attn_index
 51 |         accumulated_attention_score = accumulated_attention_score * mask_bottom_index
 52 | 
 53 |     return mask_bottom
 54 | 
 55 | 
 56 | class LlamaAttention_heavy_hitter(nn.Module):
  57 |     """Llama multi-headed attention with an H2O-style heavy-hitter + recent-token mask"""
 58 | 
 59 |     def __init__(self, config: LlamaConfig):
 60 |         super().__init__()
 61 |         self.config = config
 62 |         self.hidden_size = config.hidden_size
 63 |         self.num_heads = config.num_attention_heads
 64 |         self.head_dim = self.hidden_size // self.num_heads
 65 |         self.max_position_embeddings = config.max_position_embeddings
 66 | 
 67 |         if (self.head_dim * self.num_heads) != self.hidden_size:
 68 |             raise ValueError(
 69 |                 f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
 70 |                 f" and `num_heads`: {self.num_heads})."
 71 |             )
 72 |         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
 73 |         self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
 74 |         self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
 75 |         self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
 76 |         self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
 77 | 
 78 |         self.heavy_budget_ratio = config.heavy_ratio
 79 |         self.recent_budget_ratio = config.recent_ratio
 80 | 
 81 | 
 82 |     def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
 83 |         return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
 84 | 
 85 |     def forward(
 86 |         self,
 87 |         hidden_states: torch.Tensor,
 88 |         attention_mask: Optional[torch.Tensor] = None,
 89 |         position_ids: Optional[torch.LongTensor] = None,
 90 |         past_key_value: Optional[Tuple[torch.Tensor]] = None,
 91 |         output_attentions: bool = False,
 92 |         use_cache: bool = False,
 93 |     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
 94 |         bsz, q_len, _ = hidden_states.size()
 95 | 
 96 |         query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
 97 |         key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
 98 |         value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
 99 | 
100 |         kv_seq_len = key_states.shape[-2]
101 |         if past_key_value is not None:
102 |             kv_seq_len += past_key_value[0].shape[-2]
103 |         cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
104 |         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
105 |         # [bsz, nh, t, hd]
106 | 
107 |         if past_key_value is not None:
108 |             # reuse k, v, self_attention
109 |             key_states = torch.cat([past_key_value[0], key_states], dim=2)
110 |             value_states = torch.cat([past_key_value[1], value_states], dim=2)
111 | 
112 |         past_key_value = (key_states, value_states) if use_cache else None
113 | 
114 |         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
115 | 
116 |         if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
117 |             raise ValueError(
118 |                 f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
119 |                 f" {attn_weights.size()}"
120 |             )
121 | 
122 |         if attention_mask is not None:
123 |             if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
124 |                 raise ValueError(
125 |                     f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
126 |                 )
127 |             attn_weights = attn_weights + attention_mask
128 |             attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min))
129 | 
130 |         ### Heavy + Recent
131 |         heavy_budget = int(self.heavy_budget_ratio * attn_weights.shape[-1])
132 |         recent_budget = int(self.recent_budget_ratio * attn_weights.shape[-1])
133 | 
134 |         # Heavy Hitter Mask
135 |         if heavy_budget > 0:
136 |             mask_bottom = local_heavy_hitter_mask(attn_weights, heavy_budget) # Default: No padding applied to input
137 |         else:
138 |             mask_bottom = torch.zeros_like(attn_weights, dtype=torch.bool)
139 | 
140 |         ones = torch.ones_like(attn_weights, dtype=torch.bool)
141 |         ones = torch.triu(ones, diagonal=-recent_budget)
142 |         mask_bottom = torch.logical_or(mask_bottom, ones)
143 | 
144 |         mask_bottom = torch.tril(mask_bottom, diagonal=0)
145 | 
146 |         # mask_bottom = ones
 147 |         attn_weights[~mask_bottom] = torch.finfo(attn_weights.dtype).min  # also works when attention_mask is None
148 | 
149 |         # upcast attention to fp32
150 |         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
151 |         attn_output = torch.matmul(attn_weights, value_states)
152 | 
153 |         if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
154 |             raise ValueError(
155 |                 f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
156 |                 f" {attn_output.size()}"
157 |             )
158 | 
159 |         attn_output = attn_output.transpose(1, 2)
160 |         attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
161 | 
162 |         attn_output = self.o_proj(attn_output)
163 | 
164 |         if not output_attentions:
165 |             attn_weights = None
166 | 
167 |         return attn_output, attn_weights, past_key_value
168 | 
169 | 
170 | def convert_kvcache_llama_heavy_recent(model, config):
171 | 
172 |     #for name, module in reversed(model._modules.items()):
173 |     #    if len(list(module.children())) > 0:
174 |     #        model._modules[name] = convert_kvcache_llama_heavy_recent(module, config)
175 |     #    if isinstance(module, LlamaAttention):
176 |     #        model._modules[name] = LlamaAttention_heavy_hitter(config)
177 |     for idx, module in enumerate(model.model.layers):
178 |         if (idx != 0) and (idx!=1):
179 |         #if (idx != 32) and (idx!=33):
180 |             model.model.layers[idx].self_attn = LlamaAttention_heavy_hitter(config)
181 |         else:
182 |             print("skip layer: ", idx)
183 | 
184 |     return model
185 | 
186 | 
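
The recent-token part of the mask built in `LlamaAttention_heavy_hitter.forward` (`triu(..., diagonal=-recent_budget)` followed by `tril(..., diagonal=0)`) can be reproduced without tensors. The following is a minimal sketch, assuming a square, prefill-shaped score matrix (`q_len == kv_len`); `recent_window_mask` is a hypothetical helper for illustration and is not part of this repository.

```python
def recent_window_mask(seq_len, recent_budget):
    """Boolean mask matching tril(triu(ones, diagonal=-recent_budget), diagonal=0).

    Entry [i][j] is True when query i may attend to key j, that is when
    i - recent_budget <= j <= i.
    """
    return [
        [(j - i >= -recent_budget) and (j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = recent_window_mask(seq_len=5, recent_budget=1)
for row in mask:
    # 'x' marks a key that stays visible to the query in this row
    print("".join("x" if keep else "." for keep in row))
```

Each row keeps at most `recent_budget + 1` keys (the current token plus the `recent_budget` tokens before it). In the module above, the heavy-hitter mask is OR-ed in before the final `tril`, so heavy keys stay visible in addition to this window.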


--------------------------------------------------------------------------------
/accuracy/perplexity/datautils.py:
--------------------------------------------------------------------------------
 1 | from transformers import LlamaTokenizer, AutoTokenizer
 2 | from datasets import load_dataset
 3 | import numpy as np
 4 | import torch
 5 | import os
 6 | 
 7 | def set_symlink(model_type, fname):
 8 |     model_path = "../transformers/src/transformers/models/" + model_type
 9 |     linker_path = os.path.realpath("../src/" + fname)
10 |     if not os.path.exists(linker_path):
11 |         print(f"No file exists at {linker_path}")
12 |         raise SystemExit(1)  # non-zero status so callers can detect the failure
13 |     if not os.path.exists(model_path):
14 |         print(f"No file exists at {model_path}")
15 |         raise SystemExit(1)
16 |     curr_dir = os.getcwd()
17 |     os.chdir(model_path)
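18 |     if os.path.lexists(f'modeling_{model_type}.py'):
19 |         # lexists() is also true for a dangling symlink left by an earlier run
20 |         os.remove(f'modeling_{model_type}.py')
21 |     # Point modeling_<model_type>.py at the selected implementation
22 |     os.symlink(linker_path, f'modeling_{model_type}.py')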
23 |     os.chdir(curr_dir)
24 | 
25 | def set_seed(seed):
26 |     np.random.seed(seed)
27 |     torch.random.manual_seed(seed)
28 | 
29 | def get_wikitext2(nsamples, seed, seqlen, model):
30 |     testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
31 |     try:
32 |         tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
33 |     except Exception:
34 |         tokenizer = LlamaTokenizer.from_pretrained(model, use_fast=False)
35 |     testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')
36 |     return testenc
37 | 
38 | def get_ptb(nsamples, seed, seqlen, model):
39 |     valdata = load_dataset('ptb_text_only', 'penn_treebank', split='validation')
40 |     try:
41 |         tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
42 |     except Exception:
43 |         tokenizer = LlamaTokenizer.from_pretrained(model, use_fast=False)
44 |     testenc = tokenizer("\n\n".join(valdata['sentence']), return_tensors='pt')
45 |     return testenc
46 | 
47 | def get_loaders(
48 |     name, nsamples=128, seed=0, seqlen=2048, model=''
49 | ):
50 |     if 'wikitext2' in name:
51 |         return get_wikitext2(nsamples, seed, seqlen, model)
52 |     if 'ptb' in name:
53 |         return get_ptb(nsamples, seed, seqlen, model)
54 | 
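
`set_symlink` swaps which modeling file the local `transformers` checkout imports. The pattern can be tried in isolation with a temporary directory. This is a minimal sketch using only the standard library; `swap_symlink` and the file names are hypothetical and only illustrate the replace-then-link step.

```python
import os
import tempfile


def swap_symlink(target, link_path):
    """Point link_path at target, replacing any existing file or dangling link."""
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(target, link_path)


with tempfile.TemporaryDirectory() as tmp:
    orig = os.path.join(tmp, "modeling_orig.py")
    ours = os.path.join(tmp, "modeling_ours.py")
    for path in (orig, ours):
        with open(path, "w") as f:
            f.write(f"# {os.path.basename(path)}\n")

    link = os.path.join(tmp, "modeling.py")
    swap_symlink(orig, link)
    swap_symlink(ours, link)  # the second call replaces the first link
    resolved = os.path.basename(os.path.realpath(link))
    print(resolved)
```

Using `os.path.lexists` rather than `os.path.exists` matters when the previous target has been moved or deleted: `exists` follows the link and reports `False`, so the stale link would never be removed.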


--------------------------------------------------------------------------------
/accuracy/perplexity/figure12.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # InfiniGen
 4 | echo "== InfiniGen =="
 5 | partial=0.2
 6 | seqlen=2048
 7 | alpha=4.0
 8 | budget=0.2
 9 | echo opt-13b alpha $alpha budget $budget
10 | python opt.py --model "../setup/opt-model/opt-13b" \
11 |   --eval_dataset "wikitext2" \
12 |   --seq_len ${seqlen} \
13 |   --eval_samples 0 \
14 |   --model_name "opt-13b" \
15 |   --infinigen \
16 |   --print_blk_ppl \
17 |   --partial_weight_ratio ${partial} \
18 |   --partial_weight_path "../setup/weights/opt-13b_${partial}" \
19 |   --alpha ${alpha} \
20 |   --budget ${budget} \
21 |   --capacity 1.0 
22 | 
23 | partial=0.2
24 | seqlen=4096
25 | alpha=5.2
26 | budget=0.2
27 | echo llama-2-13b alpha $alpha budget $budget
28 | python llama.py --model "${LLAMA_PATH}/llama-2-13b" \
29 |   --eval_dataset "wikitext2" \
30 |   --seq_len ${seqlen} \
31 |   --eval_samples 0 \
32 |   --model_name "llama-2-13b" \
33 |   --infinigen \
34 |   --print_blk_ppl \
35 |   --partial_weight_ratio ${partial} \
36 |   --partial_weight_path "../setup/weights/llama-2-13b_${partial}" \
37 |   --skewing_matrix_path "../setup/skewing_matrix/llama-2-13b.pt" \
38 |   --alpha ${alpha} \
39 |   --budget ${budget} \
40 |   --capacity 1.0 
41 | 
42 | echo "==============="
43 | 
44 | # H2O 
45 | echo "==    H2O   =="
46 | partial=0.2
47 | seqlen=2048
48 | heavy=0.01875
49 | recent=0.01875
50 | echo opt-13b heavy $heavy recent $recent
51 | python opt.py --model "facebook/opt-13b" \
52 |   --eval_dataset "wikitext2" \
53 |   --seq_len ${seqlen} \
54 |   --eval_samples 0 \
55 |   --model_name "opt-13b" \
56 |   --print_blk_ppl \
57 |   --heavy_ratio ${heavy} \
58 |   --recent_ratio ${recent}
59 | 
60 | partial=0.2
61 | seqlen=4096
62 | heavy=0.01875
63 | recent=0.01875
64 | echo llama-2-13b heavy $heavy recent $recent
65 | python llama.py --model "${LLAMA_PATH}/llama-2-13b" \
66 |   --eval_dataset "wikitext2" \
67 |   --seq_len ${seqlen} \
68 |   --eval_samples 0 \
69 |   --model_name "llama-2-13b" \
70 |   --print_blk_ppl \
71 |   --heavy_ratio ${heavy} \
72 |   --recent_ratio ${recent}
73 | 
74 | echo "==============="
75 | 


--------------------------------------------------------------------------------
/accuracy/perplexity/llama.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import torch.nn as nn
  3 | import copy
  4 | import argparse
  5 | import math
  6 | from datautils import *
  7 | 
  8 | 
  9 | def get_llama(model, seqlen):
 10 |     def skip(*args, **kwargs):
 11 |         pass
 12 |     torch.nn.init.kaiming_uniform_ = skip
 13 |     torch.nn.init.uniform_ = skip
 14 |     torch.nn.init.normal_ = skip
 15 | 
 16 |     from transformers import LlamaForCausalLM
 17 |     model = LlamaForCausalLM.from_pretrained(model, device_map='cpu', torch_dtype=torch.float16)
 18 | 
 19 |     model.seqlen = seqlen
 20 |     return model
 21 | 
 22 | 
 23 | @torch.no_grad()
 24 | def llama_eval(model, testenc, dev, eval_sample, ours, print_chunk = False):
 25 |     print('Evaluating ...')
 26 | 
 27 |     testenc = testenc.input_ids
 28 |     if eval_sample:
 29 |         nsamples = eval_sample
 30 |     else:
 31 |         nsamples = min(1000, testenc.numel() // model.seqlen)
 32 |     print("nsamples: ", nsamples)
 33 | 
 34 |     use_cache = model.config.use_cache
 35 |     model.config.use_cache = False
 36 |     layers = model.model.layers
 37 | 
 38 |     model.model.embed_tokens = model.model.embed_tokens.to(dev)
 39 |     layers[0] = layers[0].to(dev)
 40 | 
 41 |     dtype = next(iter(model.parameters())).dtype
 42 |     inps = torch.zeros((nsamples, model.seqlen, model.config.hidden_size), dtype=dtype, device=dev)
 43 |     cache = {'i': 0, 'attention_mask': None}
 44 | 
 45 |     class Catcher(nn.Module):
 46 | 
 47 |         def __init__(self, module):
 48 |             super().__init__()
 49 |             self.module = module
 50 | 
 51 |         def forward(self, inp, **kwargs):
 52 |             inps[cache['i']] = inp
 53 |             cache['i'] += 1
 54 |             cache['attention_mask'] = kwargs['attention_mask']
 55 |             cache['position_ids'] = kwargs['position_ids']
 56 |             raise ValueError
 57 | 
 58 |     layers[0] = Catcher(layers[0])
 59 |     for i in range(nsamples):
 60 |         batch = testenc[:, (i * model.seqlen):((i + 1) * model.seqlen)].to(dev)
 61 |         try:
 62 |             model(batch)
 63 |         except ValueError:
 64 |             pass
 65 |     layers[0] = layers[0].module
 66 | 
 67 |     layers[0] = layers[0].cpu()
 68 |     model.model.embed_tokens = model.model.embed_tokens.cpu()
 69 |     torch.cuda.empty_cache()
 70 | 
 71 |     outs = torch.zeros_like(inps)
 72 |     attention_mask = cache['attention_mask']
 73 |     position_ids = cache['position_ids']
 74 |     prev_hidden = []
 75 | 
 76 |     density = []
 77 |     for i in range(len(layers)):
 78 |         layer = layers[i].to(dev)
 79 | 
 80 |         for j in range(nsamples):
 81 |             if ours:
 82 |                 if i >= 2:
 83 |                     layer.self_attn.previous_hidden_states = prev_hidden[j]
 84 |             outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
 85 |             if ours:
 86 |                 if i >= 1:
 87 |                     cur_bsz = layer.self_attn.current_hidden_states.shape[0]
 88 |                     cur_tgt_len = layer.self_attn.current_hidden_states.shape[1]
 89 |                     cur_device = layer.self_attn.current_hidden_states.device
 90 |                     cur_dtype = layer.self_attn.current_hidden_states.dtype
 91 |                     if i == 1:
 92 |                         prev_hidden.append(layer.self_attn.current_hidden_states)
 93 |                     else:
 94 |                         prev_hidden[j] = layer.self_attn.current_hidden_states
 95 |             if ours and layer.self_attn.density is not None:
 96 |                 density.append(layer.self_attn.density)
 97 |         
 98 |         layers[i] = layer.cpu()
 99 |         del layer
100 |         torch.cuda.empty_cache()
101 |         inps, outs = outs, inps
102 |         print(i, end=' ',flush=True)
103 |     print()
104 | 
105 |     if ours:
106 |         density = sum(density) / len(density) * 100
107 |         retain_ratio = (1 - math.sqrt(1 - (density/100))) * 100
108 |         #print("density %f"%(density))
109 |         print("retain ratio %f"%((retain_ratio)))
110 | 
111 |     if model.model.norm is not None:
112 |         model.model.norm = model.model.norm.to(dev)
113 |     model.lm_head = model.lm_head.to(dev)
114 | 
115 |     testenc = testenc.to(dev)
116 |     nlls = []
117 |     for i in range(nsamples):
118 |         hidden_states = inps[i].unsqueeze(0)
119 |         if model.model.norm is not None:
120 |             hidden_states = model.model.norm(hidden_states)
121 |         lm_logits = model.lm_head(hidden_states)
122 |         shift_logits = lm_logits[:, :-1, :].contiguous()
123 |         shift_labels = testenc[:, (i * model.seqlen):((i + 1) * model.seqlen)][:, 1:]
124 |         loss_fct = nn.CrossEntropyLoss(reduction='none')
125 | #        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
126 | #        neg_log_likelihood = loss.float() * model.seqlen
127 |         neg_log_likelihood = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)).to(torch.float)
128 |         nlls.append(neg_log_likelihood)
129 |     nlls = torch.stack(nlls)
130 |     for seqlen in range(int(model.seqlen/256)):
131 |         start = seqlen * 256
132 |         end = (seqlen+1)*256
133 |         per_token_ppl = torch.exp(nlls[:, start:end].sum(dim=0) / nsamples)
134 |         if seqlen == 0:
135 |             var = torch.var(per_token_ppl[25:], correction=0)
136 |         else:
137 |             var = torch.var(per_token_ppl, correction=0)
138 |         q1 = torch.quantile(per_token_ppl, 0.25, interpolation='nearest')
139 |         q2 = torch.quantile(per_token_ppl, 0.5, interpolation='nearest')
140 |         q3 = torch.quantile(per_token_ppl, 0.75, interpolation='nearest')
141 |         ppl = torch.exp(nlls[:, start:end].sum() / (nsamples*256))
142 |         if print_chunk:
143 |             print("seqlen: ", end)
144 |             print("perplexity, variance, q1, q2, q3: ", ppl.item(), var.item(), q1.item(), q2.item(), q3.item())
145 | 
146 |     print("Total")
147 |     ppl = torch.exp(nlls.sum() / (nsamples * model.seqlen))
148 |     print("Perplexity: ", ppl.item())
149 | 
150 | 
151 |     model.config.use_cache = use_cache
152 | 
153 | if __name__ == '__main__':
154 | 
155 |     parser = argparse.ArgumentParser()
156 | 
157 |     parser.add_argument('--model', type=str, help='llama model to load; pass `/path/to/llama`')
158 |     parser.add_argument('--eval_dataset', type=str, help='evaluation dataset (wikitext2 or ptb)')
159 |     parser.add_argument('--seq_len', type=int, help='model sequence length')
160 |     parser.add_argument('--eval_samples', type=int, default=0, help='number of samples to evaluate; 0 uses every full-length sample in the dataset, capped at 1000')
161 |     parser.add_argument('--model_name', type=str, help='name of the model')
162 |     parser.add_argument('--print_blk_ppl', action='store_true', help='print perplexity statistics for each 256-token block')
163 |     
164 |     ## H2O
165 |     parser.add_argument("--heavy_ratio", type=float, default=None)
166 |     parser.add_argument("--recent_ratio", type=float, default=None)
167 | 
168 |     ## InfiniGen 
169 |     parser.add_argument('--infinigen', action='store_true', help='use the InfiniGen attention implementation (modeling_llama_ours.py)')
170 |     parser.add_argument("--partial_weight_ratio", type=float, default=None)
171 |     parser.add_argument("--partial_weight_path", type=str)
172 |     parser.add_argument("--skewing_matrix_path", type=str)
173 |     parser.add_argument("--alpha", type=float, default=0.0)
174 |     parser.add_argument("--budget", type=float, default=0.0)
175 |     parser.add_argument("--capacity", type=float, default=0.0)
176 |     parser.add_argument("--eviction_policy", type=str, default="lru")
177 |     
178 |     args = parser.parse_args()
179 | 
180 |     if args.infinigen:
181 |         set_symlink("llama", "modeling_llama_ours.py")
182 |     else:
183 |         set_symlink("llama", "modeling_llama_orig.py")
184 |     
185 |     model = get_llama(args.model, args.seq_len)
186 |     
187 |     ## H2O
188 |     if args.heavy_ratio is not None:
189 |         import sys
190 |         sys.path.append("../lm_eval")
191 |         from utils_lm_eval.modify_llama import convert_kvcache_llama_heavy_recent, LlamaAttention_heavy_hitter
192 |         
193 |         model.config.heavy_ratio = args.heavy_ratio
194 |         model.config.recent_ratio = args.recent_ratio
195 |         if not os.path.exists(f"../h2o_model/{args.model_name}.pt"):
196 |             os.makedirs("../h2o_model", exist_ok=True)
197 |             checkpoint = copy.deepcopy(model.state_dict())
198 |             torch.save(checkpoint, f"../h2o_model/{args.model_name}.pt")
199 |         model = convert_kvcache_llama_heavy_recent(model, model.config)
200 |         model.load_state_dict(torch.load(f"../h2o_model/{args.model_name}.pt"))
201 |         model = model.to(torch.float16)
202 |     
203 |     ## InfiniGen
204 |     if args.infinigen:
205 |         A = torch.load(args.skewing_matrix_path)
206 |         for layer in range(len(model.model.layers)):
207 |             model.model.layers[layer].self_attn.partial_weight_ratio = args.partial_weight_ratio
208 |             model.model.layers[layer].self_attn.partial_weight_q = torch.load(args.partial_weight_path + "/partial_weight_q_" + str(layer) + ".pt")
209 |             model.model.layers[layer].self_attn.skewing_matrix = A[layer]
210 | 
211 |         for layer in range(len(model.model.layers)):
212 |             model.model.layers[layer].self_attn.alpha = args.alpha
213 |             model.model.layers[layer].self_attn.budget = args.budget
214 |             model.model.layers[layer].self_attn.capacity = args.capacity
215 |             model.model.layers[layer].self_attn.eviction_policy = args.eviction_policy
216 | 
217 |     model.eval()
218 |     
219 |     dataset = args.eval_dataset
220 |     testloader = get_loaders(dataset, model=args.model, seqlen=model.seqlen)
221 |     
222 |     llama_eval(model, testloader, 'cuda', args.eval_samples, args.infinigen, args.print_blk_ppl)
223 | 
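
The last loop of `llama_eval` turns per-token negative log-likelihoods into one perplexity per 256-token block plus a total. Below is a minimal pure-Python sketch of that arithmetic, assuming `nlls` is a list of equal-length per-sample lists of token NLLs; `block_perplexities` and the toy numbers are hypothetical.

```python
import math


def block_perplexities(nlls, block):
    """Return (per-block perplexities, total perplexity) for equal-length samples."""
    nsamples = len(nlls)
    seqlen = len(nlls[0])
    per_block = []
    # Only full blocks are reported, mirroring range(int(seqlen / 256)) in the script.
    for start in range(0, seqlen - seqlen % block, block):
        total = sum(sum(sample[start:start + block]) for sample in nlls)
        per_block.append(math.exp(total / (nsamples * block)))
    overall = math.exp(sum(map(sum, nlls)) / (nsamples * seqlen))
    return per_block, overall


# Two toy samples of four tokens each, split into blocks of two tokens.
nlls = [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0]]
per_block, overall = block_perplexities(nlls, block=2)
print(per_block, overall)
```

One difference from the script: `llama_eval` divides by `model.seqlen` even though each sample contributes `seqlen - 1` shifted targets, whereas the sketch divides by the number of values it is actually given.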


--------------------------------------------------------------------------------
/accuracy/perplexity/table2.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | partial=0.2
 4 | seqlen=2048
 5 | 
 6 | ## OPT
 7 | alpha=4
 8 | budget=0.2
 9 | 
10 | for size in 6.7b 13b 30b;do
11 |   for dataset in "wikitext2" "ptb";do
12 |     echo opt-$size ${dataset} 100% cache
13 |     python opt.py --model "../setup/opt-model/opt-${size}" \
14 |       --eval_dataset ${dataset} \
15 |       --seq_len ${seqlen} \
16 |       --eval_samples 0 \
17 |       --model_name "opt-${size}" \
18 |       --infinigen \
19 |       --partial_weight_ratio ${partial} \
20 |       --partial_weight_path "../setup/weights/opt-${size}_${partial}" \
21 |       --alpha ${alpha} \
22 |       --budget ${budget} \
23 |       --capacity 1.0
24 |   done
25 | done
26 | 
27 | for size in 6.7b 13b 30b;do
28 |   for dataset in "wikitext2" "ptb";do
29 |     for evict in fifo lru counter;do
30 |       echo opt-$size ${dataset} 80% cache evict ${evict}
31 |       python opt.py --model "../setup/opt-model/opt-${size}" \
32 |         --eval_dataset ${dataset} \
33 |         --seq_len ${seqlen} \
34 |         --eval_samples 0 \
35 |         --model_name "opt-${size}" \
36 |         --infinigen \
37 |         --partial_weight_ratio ${partial} \
38 |         --partial_weight_path "../setup/weights/opt-${size}_${partial}" \
39 |         --alpha ${alpha} \
40 |         --budget ${budget} \
41 |         --capacity 0.8 \
42 |         --eviction_policy ${evict}
43 |     done
44 |   done
45 | done
46 | 
47 | ## Llama-2
48 | alpha=5
49 | budget=0.2
50 | 
51 | for size in 7b 13b;do
52 |   for dataset in "wikitext2" "ptb";do
53 |     echo llama-2-${size} ${dataset} 100% cache
54 |     python llama.py --model "${LLAMA_PATH}/llama-2-${size}" \
55 |       --eval_dataset ${dataset} \
56 |       --seq_len ${seqlen} \
57 |       --eval_samples 0 \
58 |       --model_name "llama-${size}" \
59 |       --infinigen \
60 |       --partial_weight_ratio ${partial} \
61 |       --partial_weight_path "../setup/weights/llama-2-${size}_${partial}" \
62 |       --skewing_matrix_path "../setup/skewing_matrix/llama-2-${size}.pt" \
63 |       --alpha ${alpha} \
64 |       --budget ${budget} \
65 |       --capacity 1.0 
66 |   done
67 | done
68 | 
69 | for size in 7b 13b;do
70 |   for dataset in "wikitext2" "ptb";do
71 |     for evict in fifo lru counter;do
72 |       echo llama-2-${size} ${dataset} 80% cache evict ${evict}
73 |       python llama.py --model "${LLAMA_PATH}/llama-2-${size}" \
74 |         --eval_dataset ${dataset} \
75 |         --seq_len ${seqlen} \
76 |         --eval_samples 0 \
77 |         --model_name "llama-${size}" \
78 |         --infinigen \
79 |         --partial_weight_ratio ${partial} \
80 |         --partial_weight_path "../setup/weights/llama-2-${size}_${partial}" \
81 |         --skewing_matrix_path "../setup/skewing_matrix/llama-2-${size}.pt" \
82 |         --alpha ${alpha} \
83 |         --budget ${budget} \
84 |         --capacity 0.8 \
85 |         --eviction_policy ${evict}
86 |     done
87 |   done
88 | done
89 | 


--------------------------------------------------------------------------------
/accuracy/scripts/figure11/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | cwd=$PWD
4 | 
5 | cd ../../lm_eval
6 | python figure11.py
7 | cd $cwd
8 | 


--------------------------------------------------------------------------------
/accuracy/scripts/figure12/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | cwd=$PWD
4 | 
5 | cd ../../perplexity
6 | bash figure12.sh
7 | cd $cwd
8 | 


--------------------------------------------------------------------------------
/accuracy/scripts/figure13/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | cwd=$PWD
4 | 
5 | cd ../../lm_eval
6 | bash figure13.sh
7 | cd $cwd
8 | 


--------------------------------------------------------------------------------
/accuracy/scripts/figure17/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | cwd=$PWD
4 | 
5 | cd ../../lm_eval
6 | bash figure17.sh
7 | cd $cwd
8 | 


--------------------------------------------------------------------------------
/accuracy/scripts/run_all.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | for FIG in "figure11" "figure12" "figure13" "table2" "figure17"; do
4 |   cd $FIG
5 |   sh run.sh
6 |   cd ..
7 | done
8 | 


--------------------------------------------------------------------------------
/accuracy/scripts/table2/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | cwd=$PWD
4 | 
5 | cd ../../perplexity
6 | bash table2.sh
7 | cd $cwd
8 | 


--------------------------------------------------------------------------------
/accuracy/setup/gen_llama_skewing_matrix.py:
--------------------------------------------------------------------------------
 1 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
 2 | import argparse
 3 | import torch
 4 | import os
 5 | from utils import *
 6 | 
 7 | ### Parameters
 8 | 
 9 | def process_options():
10 |     parser = argparse.ArgumentParser(description="Llama-2 Model")
11 |     parser.add_argument("--model", required=True,
12 |                         help='Llama-2 model to load')
13 |     parser.add_argument("--output", required=True,
14 |                         help='output directory to store result')
15 |     return parser
16 | 
17 | def main():
18 |     parser = process_options()
19 |     args = parser.parse_args()
20 | 
21 |     ### Model load
22 |     set_symlink("llama", "modeling_llama_orig.py")
23 | 
24 |     model_name = os.path.basename(args.model)
25 |     config = AutoConfig.from_pretrained(args.model)
26 |     tokenizer = AutoTokenizer.from_pretrained(args.model)
27 |     model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype=torch.float16).cuda()
28 |     head_dim = model.model.layers[0].self_attn.head_dim
29 |     n_head = model.model.layers[0].self_attn.num_heads
30 |     n_layer = config.num_hidden_layers
31 | 
32 |     ### Generation
33 |     file_path = "./pg19_firstbook.txt"
34 | 
35 |     with open(file_path, 'r') as file:
36 |         prompt = file.read()
37 | 
38 |     input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()[:, :2048]
39 | 
40 |     print("Start Generation")
41 | 
42 |     generated_ids = model.generate(input_ids, max_new_tokens = 1, min_new_tokens = 1)
43 | 
44 |     print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
45 | 
46 |     query_v = {}
47 |     key_v = {}
48 | 
49 |     for i, layer in enumerate(model.model.layers):
50 |         query_v[str(i)] = layer.self_attn.rope_query
51 |         key_v[str(i)] = layer.self_attn.rope_key
52 | 
53 |     ### Gen Skewing Matrix A
54 |     A = torch.zeros(n_layer, n_head, head_dim, head_dim).to('cuda').to(torch.float16)
55 |     for name in query_v:
56 |         layer = int(name)
57 |         query = query_v[name]
58 |         key = key_v[name]
59 | 
60 |         for head in range(n_head):
61 |             in_q = query[0, head]
62 |             in_k = key[0, head]
63 |             uq, sq, vq = torch.svd(in_q.to(torch.float))
64 |             uk, sk, vk = torch.svd(in_k.to(torch.float))
65 |             s = sq * sk
66 |             a = torch.zeros(head_dim, head_dim).to('cuda')
67 |             _, ind = s.sort()
68 |             r,c = a.shape
69 |             A[layer, head] = a.scatter(-1, ind.unsqueeze(0).repeat(r,1), vq).to(torch.float16)
70 | 
71 |     save_dir = args.output
72 |     if not os.path.exists(save_dir):
73 |         os.makedirs(save_dir, exist_ok=True)
74 |     torch.save(A, save_dir + "/" + model_name + ".pt")
75 | 
76 | if __name__ == "__main__":
77 |     main()
78 | 
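
The `scatter` call that builds each skewing matrix reorders the columns of `vq` according to the ascending sort order of `s = sq * sk`. A minimal pure-Python sketch of that index arithmetic follows, assuming every row uses the same index vector (as `ind.unsqueeze(0).repeat(r, 1)` does); `scatter_columns` and the toy values are hypothetical.

```python
def scatter_columns(src, index):
    """Return out with out[i][index[j]] = src[i][j].

    This mirrors Tensor.scatter along the last dimension when every row
    shares the same index vector.
    """
    rows, cols = len(src), len(src[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][index[j]] = src[i][j]
    return out


s = [0.5, 3.0, 1.0]
# Ascending argsort, playing the role of the indices returned by s.sort()
ind = sorted(range(len(s)), key=s.__getitem__)
vq = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
A = scatter_columns(vq, ind)
print(ind, A)
```

Column `j` of `vq` ends up in column `ind[j]` of the result, so the output is a column permutation of `vq`.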


--------------------------------------------------------------------------------
/accuracy/setup/gen_opt_model.py:
--------------------------------------------------------------------------------
  1 | from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
  2 | import argparse
  3 | import torch
  4 | import os
  5 | from utils import *
  6 | 
  7 | def process_options():
  8 |     parser = argparse.ArgumentParser(description="OPT Model")
  9 |     parser.add_argument("--model", default="facebook/opt-6.7b",
 10 |                         help='OPT model to load')
 11 |     parser.add_argument("--output", required=True,
 12 |                         help='output directory to store result')
 13 |     parser.add_argument("--no_skewing", action='store_true',
 14 |                         help='disable weight skewing (skewing is applied by default)')
 15 |     return parser
 16 | 
 17 | def main():
 18 |     parser = process_options()
 19 |     args = parser.parse_args()
 20 | 
 21 |     ### Model load
 22 |     set_symlink("opt", "modeling_opt_orig.py")
 23 | 
 24 |     model_name = os.path.basename(args.model)
 25 |     config = AutoConfig.from_pretrained(args.model)
 26 |     tokenizer = AutoTokenizer.from_pretrained(args.model)
 27 |     model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype=torch.float16).cuda()
 28 |     head_dim = model.model.decoder.layers[0].self_attn.head_dim
 29 |     n_head = model.model.decoder.layers[0].self_attn.num_heads
 30 | 
 31 |     ### Add hook
 32 |     query_v = {}
 33 |     key_v = {}
 34 | 
 35 |     def get_query(name):
 36 |       def hook(model, input, output):
 37 |         query_v[name] = output
 38 |       return hook
 39 |     def get_key(name):
 40 |       def hook(model, input, output):
 41 |         key_v[name] = output
 42 |       return hook
 43 | 
 44 |     for i, layer in enumerate(model.model.decoder.layers):
 45 |         query = layer.self_attn.q_proj.register_forward_hook(get_query("%d"%(i)))
 46 |         key = layer.self_attn.k_proj.register_forward_hook(get_key("%d"%(i)))
 47 | 
 48 |     ### Generation
 49 |     file_path = "./pg19_firstbook.txt"
 50 | 
 51 |     with open(file_path, 'r') as file:
 52 |         prompt = file.read()
 53 | 
 54 |     input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()[:, :2048]
 55 | 
 56 |     print("Start Generation")
 57 | 
 58 |     generated_ids = model.generate(input_ids, max_new_tokens = 1, min_new_tokens = 1)
 59 | 
 60 |     print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
 61 | 
 62 |     ### New weight generation
 63 |     for name in query_v:
 64 |         layer = int(name)
 65 |         query = query_v[name][0]
 66 |         query = query * (head_dim ** -0.5)
 67 |         key = key_v[name][0]
 68 | 
 69 |         wq = model.model.decoder.layers[layer].self_attn.q_proj.weight.data
 70 |         bq = model.model.decoder.layers[layer].self_attn.q_proj.bias.data
 71 |         wk = model.model.decoder.layers[layer].self_attn.k_proj.weight.data
 72 |         bk = model.model.decoder.layers[layer].self_attn.k_proj.bias.data
 73 |         
 74 |         new_wq = torch.cat((wq.transpose(-1,-2), bq.unsqueeze(0)), dim = 0) * (head_dim**-0.5)
 75 |         new_wk = torch.cat((wk.transpose(-1,-2), bk.unsqueeze(0)), dim = 0)
 76 |         
 77 |         if not args.no_skewing:
 78 |             for h in range(n_head):
 79 |                 start = h * head_dim
 80 |                 end = (h+1) * head_dim
 81 |                 uq, sq, vq = torch.svd(query[:, start:end].to(torch.float))
 82 |                 uk, sk, vk = torch.svd(key[:, start:end].to(torch.float))
 83 |                 uq = uq.to(torch.float16)
 84 |                 sq = sq.to(torch.float16)
 85 |                 vq = vq.to(torch.float16)
 86 |                 uk = uk.to(torch.float16)
 87 |                 sk = sk.to(torch.float16)
 88 |                 vk = vk.to(torch.float16)
 89 |                 s = sq * sk
 90 | 
 91 |                 A = torch.zeros(head_dim, head_dim).to('cuda').to(torch.float16)
 92 |                 _, ind = s.sort()
 93 |                 r,c = A.shape
 94 |                 A = A.scatter(-1, ind.unsqueeze(0).repeat(r,1), vq) 
 95 |                 new_wq[:, start:end] = new_wq[:, start:end] @ A
 96 |                 new_wk[:, start:end] = new_wk[:, start:end] @ A
 97 | 
 98 |         model.model.decoder.layers[layer].self_attn.q_proj.weight.data = new_wq
 99 |         model.model.decoder.layers[layer].self_attn.k_proj.weight.data = new_wk
100 | 
101 |     save_dir = args.output + "/" + model_name
102 |     # create the output directory without shelling out (safe for paths with spaces)
103 |     os.makedirs(save_dir, exist_ok=True)
104 | 
105 |     model.save_pretrained(save_dir)
106 | 
107 | if __name__ == "__main__":
108 |     main()
109 | 


--------------------------------------------------------------------------------
/accuracy/setup/gen_partial_weight.py:
--------------------------------------------------------------------------------
 1 | from transformers import AutoModelForCausalLM, AutoTokenizer
 2 | import argparse
 3 | import torch
 4 | import os
 5 | from utils import *
 6 | 
 7 | def process_options():
 8 |   parser = argparse.ArgumentParser(description="Generate partial weight")
 9 |   parser.add_argument("--our_model_path", default=None, 
10 |                       help='our OPT model')
11 |   parser.add_argument("--skewing_matrix_path", default=None, 
12 |                       help='path to skewing matrix')
13 |   parser.add_argument("--model", default="facebook/opt-6.7b", 
14 |                       help='model')
15 |   parser.add_argument("--model_type", default = "opt", 
16 |                       help='model arch (opt, llama)')
17 |   parser.add_argument("--partial_weight_ratio", required=False, default=0.1, 
18 |                       help='Ours: partial weight ratio')
19 |   parser.add_argument("--output", required=True, 
20 |                       help='output directory to store result')
21 |   return parser
22 |     
23 | def main():
24 |     ## get arguments
25 |     parser = process_options()
26 |     args = parser.parse_args()
27 |     file_path = "./pg19_firstbook.txt"
28 | 
29 |     fname = f"modeling_{args.model_type}_ours_setup.py"
30 |     set_symlink(args.model_type, fname)
31 | 
32 |     if args.our_model_path is not None:
33 |         model = AutoModelForCausalLM.from_pretrained(args.our_model_path, torch_dtype=torch.float16).cuda()
34 |     else:
35 |         model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype=torch.float16).cuda()
36 | 
37 |     if args.skewing_matrix_path is not None:
38 |         A = torch.load(args.skewing_matrix_path).to('cuda').to(torch.float16)
39 |         if args.model_type == 'llama':
40 |             for layer_num, layer in enumerate(model.model.layers):
41 |                 layer.self_attn.skewing_matrix = A[layer_num]
42 | 
43 |     tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False)
44 | 
45 |     prompt = ["The bartender refused to serve the patron because the patron was drunk.\n\nThe girl politely declined the hamburger because she was a vegetarian.\n\nThe spy discovered the enemy's location because the spy bugged the enemy's phone.\n\nI tossed the ball upwards therefore the ball hit the ceiling.\n\nThe rider fell to the ground because the bull bucked the rider.\n\nThe pair of students came under scrutiny by the teacher because the students both received excellent grades."]
46 |     
47 |     input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
48 | 
49 |     if args.model_type == "opt":
50 |         for layer in model.model.decoder.layers:
51 |             layer.self_attn.partial_weight_ratio = float(args.partial_weight_ratio)
52 |     elif args.model_type == "llama":
53 |         for layer in model.model.layers:
54 |             layer.self_attn.partial_weight_ratio = float(args.partial_weight_ratio)
55 | 
56 |     print("Start Generation")
57 |     
58 |     generated_ids = model.generate(input_ids, max_new_tokens = 1, min_new_tokens = 1)
59 | 
60 |     print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
61 | 
62 |     basepath = args.output + "/" + os.path.basename(os.path.normpath(args.model)) + "_%s"%(args.partial_weight_ratio)
63 |     # create the output directory without shelling out (safe for paths with spaces)
64 |     os.makedirs(basepath, exist_ok=True)
65 | 
66 |     if args.model_type == "opt":
67 |         for layer in range(len(model.model.decoder.layers)):
68 |             partial_weight = model.model.decoder.layers[layer].self_attn.partial_weight_q
69 |             torch.save(partial_weight, "%s/partial_weight_q_"%(basepath) + str(layer) + ".pt")
70 |     elif args.model_type == "llama":
71 |         for layer in range(len(model.model.layers)):
72 |             partial_weight = model.model.layers[layer].self_attn.partial_weight_q
73 |             torch.save(partial_weight, "%s/partial_weight_q_"%(basepath) + str(layer) + ".pt")
74 | 
75 | if __name__ == "__main__":
76 |     main()
77 | 


--------------------------------------------------------------------------------
/accuracy/setup/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/accuracy/setup/setup.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | CWD=${PWD}
 4 | cd ../transformers/src/transformers/models
 5 | 
 6 | for model in llama opt;do
 7 |   mv ${model}/modeling_${model}.py ${model}/modeling_${model}_orig.py
 8 | done
 9 | 
10 | cd ${CWD}
11 | 
12 | # ========= InfiniGen ============
13 | # generate opt models w/skewing
14 | for size in 6.7b 13b 30b;do
15 |   python gen_opt_model.py \
16 |     --model "facebook/opt-${size}" \
17 |     --output "./opt-model"
18 | done
19 | 
20 | # generate skewing matrices for llama
21 | for size in 7b 13b;do
22 |   python gen_llama_skewing_matrix.py \
23 |     --model "${LLAMA_PATH}/llama-2-${size}" \
24 |     --output "./skewing_matrix" 
25 | done
26 | 
27 | 
28 | # generate partial weight matrices for prediction
29 | PARTIAL_RATIO=0.2
30 | # opt
31 | for size in 6.7b 13b 30b;do
32 |   python gen_partial_weight.py \
33 |     --our_model_path "./opt-model/opt-${size}" \
34 |     --model "facebook/opt-${size}" \
35 |     --model_type "opt" \
36 |     --partial_weight_ratio $PARTIAL_RATIO \
37 |     --output "./weights"
38 | done
39 | 
40 | # llama
41 | for size in 7b 13b;do
42 |   python gen_partial_weight.py \
43 |     --skewing_matrix_path "./skewing_matrix/llama-2-${size}.pt" \
44 |     --model "${LLAMA_PATH}/llama-2-${size}" \
45 |     --model_type "llama" \
46 |     --partial_weight_ratio $PARTIAL_RATIO \
47 |     --output "./weights"
48 | done
49 | 
50 | 
51 | # ========= w/o skewing (figure 13)
52 | PARTIAL_RATIO=0.1
53 | python gen_partial_weight.py \
54 |   --our_model_path "./opt-model/opt-6.7b" \
55 |   --model "facebook/opt-6.7b" \
56 |   --model_type "opt" \
57 |   --partial_weight_ratio $PARTIAL_RATIO \
58 |   --output "./weights"
59 | 
60 | python gen_opt_model.py \
61 |   --model "facebook/opt-6.7b" \
62 |   --output "./opt-model-no-skew" \
63 |   --no_skewing
64 | 
65 | python gen_partial_weight.py \
66 |   --our_model_path "./opt-model-no-skew/opt-6.7b" \
67 |   --model "facebook/opt-6.7b" \
68 |   --model_type "opt" \
69 |   --partial_weight_ratio $PARTIAL_RATIO \
70 |   --output "./weights-no-skew"
71 | 
72 | # ========= partial weight sweep (figure 17)
73 | for PARTIAL_RATIO in 0.1 0.4 0.6 0.8 1.0;do
74 |   python gen_partial_weight.py \
75 |     --our_model_path "./opt-model/opt-13b" \
76 |     --model "facebook/opt-13b" \
77 |     --model_type "opt" \
78 |     --partial_weight_ratio $PARTIAL_RATIO \
79 |     --output "./weights"
80 | done
81 | 


--------------------------------------------------------------------------------
/accuracy/setup/utils.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import sys
 3 | 
 4 | def set_symlink(model_type, fname):
 5 |     """Point transformers' modeling_<model_type>.py at ../src/<fname> via a symlink."""
 6 |     model_path = "../transformers/src/transformers/models/" + model_type
 7 |     linker_path = os.path.realpath("../src/" + fname)
 8 |     if not os.path.exists(linker_path):
 9 |         print(f"No file exists at {linker_path}")
10 |         sys.exit(1)  # non-zero status so callers can detect the failure
11 |     if not os.path.exists(model_path):
12 |         print(f"No file exists at {model_path}")
13 |         sys.exit(1)
14 |     target = os.path.join(model_path, f"modeling_{model_type}.py")
15 |     # lexists also catches a dangling symlink left over from a previous run
16 |     if os.path.lexists(target):
17 |         os.remove(target)
18 |     os.symlink(linker_path, target)
19 | 


--------------------------------------------------------------------------------
/figures/infinigen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/snu-comparch/InfiniGen/3754f52d131c30ab7a310e3c4cf356e8e3326ac3/figures/infinigen.png


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==2.0.1
2 | torchvision==0.15.2
3 | torchaudio==2.0.2
4 | sentencepiece
5 | lm-eval==0.3.0
6 | ftfy
7 | accelerate
8 | 


--------------------------------------------------------------------------------
/speedup/README.md:
--------------------------------------------------------------------------------
 1 | # Speedup Evaluation
 2 | This directory contains the source code for evaluating speedup. You can use it to reproduce the inference latency results in the paper. Some of the code is adapted from the FlexGen (ICML'23) GitHub repository.
 3 | - Getting Started (10 minutes)
 4 | - Run Experiments (7 hours)
 5 | 
 6 | ## Getting Started (10 minutes)
 7 | ```sh
 8 | sh install.sh
 9 | export CUDA_HOME=/path/to/cuda
10 | ```
11 | For a "Hello world"-sized example, please run the following command (10 minutes):
12 | ```sh
13 | python -m flexgen.flex_opt --model huggingface/opt-6.7b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 2 --num-gpu-batches 1 --prompt-len 384 --gen-len 128 --warmup-input-path flexgen/pg19_firstbook.txt --test-input-path flexgen/pg19_firstbook.txt --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 102
14 | ```
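
As a reading aid for the `--percent 100 0 0 100 100 0` argument in the command above: upstream FlexGen documents the six values as the percentage of weights on GPU, weights on CPU, attention (KV) cache on GPU, cache on CPU, activations on GPU, and activations on CPU, with any remainder placed on disk. The sketch below assumes this fork keeps that convention. `describe_percent` is a hypothetical helper written for illustration and is not part of the repository.

```python
def describe_percent(values):
    """Label a FlexGen-style --percent list.

    Assumed order (from upstream FlexGen): weight, cache, activation,
    each given as a (GPU %, CPU %) pair.
    """
    if len(values) != 6:
        raise ValueError("--percent expects exactly six integers")
    placement = {}
    for i, name in enumerate(("weight", "cache", "activation")):
        gpu, cpu = values[2 * i], values[2 * i + 1]
        if gpu < 0 or cpu < 0 or gpu + cpu > 100:
            raise ValueError(f"invalid split for {name}: {gpu} {cpu}")
        # whatever is not on GPU or CPU is assumed to be offloaded to disk
        placement[name] = {"gpu": gpu, "cpu": cpu, "disk": 100 - gpu - cpu}
    return placement


if __name__ == "__main__":
    for name, split in describe_percent([100, 0, 0, 100, 100, 0]).items():
        print(name, split)
```

Under that assumption, the example command keeps the weights and activations entirely on the GPU and places the whole KV cache in CPU memory.
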
15 | ## Run Experiments (7 hours)
16 | We provide scripts to reproduce the experiment results from Figure 14 to Figure 17. To reproduce all the results at once, please run the following commands (7 hours).
17 | ```sh
18 | cd scripts
19 | sh run_all.sh
20 | ```
21 | To reproduce the results for a specific figure, run `sh run.sh` in the corresponding directory. For example:
22 | ```sh
23 | cd scripts/figure14
24 | sh run.sh
25 | ```
26 | The following is the time required to run each experiment on our system (NVIDIA RTX A6000 GPU with 48GB of memory, Intel Xeon Gold 6136 processor with 96GB of DDR4-2666 memory, PCIe 3.0 x16 interconnect).
27 | - Figure 14: 80 minutes
28 | - Figure 15: 220 minutes
29 | - Figure 16a: 30 minutes
30 | - Figure 16b: 60 minutes
31 | - Figure 17a: 20 minutes
32 | - Figure 17b: 10 minutes
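
The per-figure times listed above can be checked against the 7-hour total quoted for `run_all.sh`. The snippet below is only a sanity check that sums the listed minutes; the variable names are illustrative.

```python
# Minutes per experiment, copied from the list above.
minutes = {
    "Figure 14": 80,
    "Figure 15": 220,
    "Figure 16a": 30,
    "Figure 16b": 60,
    "Figure 17a": 20,
    "Figure 17b": 10,
}

total = sum(minutes.values())
hours, remainder = divmod(total, 60)
print(f"total: {total} minutes = {hours} h {remainder} min")
# prints "total: 420 minutes = 7 h 0 min"
```
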
33 | 
34 | 


--------------------------------------------------------------------------------
/speedup/flexgen/flexgen/__init__.py:
--------------------------------------------------------------------------------
1 | 
2 | 


--------------------------------------------------------------------------------
/speedup/flexgen/flexgen/flex_opt.py:
--------------------------------------------------------------------------------
1 | ../infinigen/flex_opt.py


--------------------------------------------------------------------------------
/speedup/flexgen/flexgen/pytorch_backend.py:
--------------------------------------------------------------------------------
1 | ../infinigen/pytorch_backend.py


--------------------------------------------------------------------------------
/speedup/flexgen/flexgen/timer.py:
--------------------------------------------------------------------------------
 1 | """Global timer for profiling."""
 2 | from collections import namedtuple
 3 | import time
 4 | from typing import Callable, Any
 5 | 
 6 | 
 7 | class _Timer:
 8 |     """An internal timer."""
 9 | 
10 |     def __init__(self, name: str):
11 |         self.name = name
12 |         self.started = False
13 |         self.start_time = None
14 | 
15 |         # start-stop timestamp pairs
16 |         self.start_times = []
17 |         self.stop_times = []
18 |         self.costs = []
19 | 
20 |     def start(self, sync_func: Callable = None):
21 |         """Start the timer."""
22 |         assert not self.started, f"timer {self.name} has already been started."
23 |         if sync_func:
24 |             sync_func()
25 | 
26 |         self.start_time = time.perf_counter()
27 |         self.start_times.append(self.start_time)
28 |         self.started = True
29 | 
30 |     def stop(self, sync_func: Callable = None):
31 |         """Stop the timer."""
32 |         assert self.started, f"timer {self.name} is not started."
33 |         if sync_func:
34 |             sync_func()
35 | 
36 |         stop_time = time.perf_counter()
37 |         self.costs.append(stop_time - self.start_time)
38 |         self.stop_times.append(stop_time)
39 |         self.started = False
40 | 
41 |     def reset(self):
42 |         """Reset timer."""
43 |         self.started = False
44 |         self.start_time = None
45 |         self.start_times = []
46 |         self.stop_times = []
47 |         self.costs = []
48 | 
49 |     def elapsed(self, mode: str = "average"):
50 |         """Calculate the elapsed time."""
51 |         if not self.costs:
52 |             return 0.0
53 |         if mode == "average":
54 |             return sum(self.costs) / len(self.costs)
55 |         elif mode == "sum":
56 |             return sum(self.costs)
57 |         else:
58 |             raise RuntimeError("Supported mode is: average | sum")
59 | 
60 | 
61 | class Timers:
62 |     """A group of timers."""
63 | 
64 |     def __init__(self):
65 |         self.timers = {}
66 | 
67 |     def __call__(self, name: str):
68 |         if name not in self.timers:
69 |             self.timers[name] = _Timer(name)
70 |         return self.timers[name]
71 | 
72 |     def __contains__(self, name: str):
73 |         return name in self.timers
74 | 
75 | 
76 | timers = Timers()
77 | 
78 | Event = namedtuple("Event", ("tstamp", "name", "info"))
79 | 
80 | 
81 | class Tracer:
82 |     """An activity tracer."""
83 | 
84 |     def __init__(self):
85 |         self.events = []
86 | 
87 |     def log(self, name: str, info: Any, sync_func: Callable = None):
88 |         if sync_func:
89 |             sync_func()
90 | 
91 |         self.events.append(Event(time.perf_counter(), name, info))
92 | 
93 | 
94 | tracer = Tracer()
95 | 


--------------------------------------------------------------------------------
/speedup/flexgen/flexgen/utils.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import dataclasses
  3 | from attr import define, field
  4 | from attr.setters import frozen
  5 | import functools
  6 | import gc
  7 | import math
  8 | import os
  9 | from typing import Tuple, Union, Optional, Any, Sequence, List
 10 | 
 11 | import numpy as np
 12 | import torch
 13 | 
 14 | 
 15 | KB = 1 << 10
 16 | MB = 1 << 20
 17 | GB = 1 << 30
 18 | T = 1e12
 19 | 
 20 | 
 21 | @dataclasses.dataclass(frozen=True)
 22 | class Task:
 23 |     """A generation task."""
 24 |     inputs: Union[np.ndarray, List[List[int]]]
 25 |     prompt_len: int
 26 |     gen_len: int
 27 |     cut_gen_len: Optional[int]
 28 | 
 29 |     do_sample: bool
 30 |     temperature: float
 31 |     stop: Optional[int]
 32 | 
 33 | 
 34 | @dataclasses.dataclass(frozen=True)
 35 | class ExecutionEnv:
 36 |     """Hardware environment."""
 37 |     gpu: Any = None
 38 |     cpu: Any = None
 39 |     disk: Any = None
 40 |     mixed: Any = None
 41 | 
 42 |     @classmethod
 43 |     def create(cls, offload_dir):
 44 |         # fix recursive import
 45 |         from flexgen.pytorch_backend import TorchDevice, TorchDisk, TorchMixedDevice
 46 |         gpu = TorchDevice("cuda:0")
 47 |         cpu = TorchDevice("cpu")
 48 |         disk = TorchDisk(offload_dir)
 49 |         return cls(gpu=gpu, cpu=cpu, disk=disk, mixed=TorchMixedDevice([gpu, cpu, disk]))
 50 | 
 51 |     def close_copy_threads(self):
 52 |         self.disk.close_copy_threads()
 53 | 
 54 | 
 55 | @dataclasses.dataclass(frozen=True)
 56 | class BenchmarkResult:
 57 |     """Benchmark results."""
 58 |     prefill_latency: float
 59 |     prefill_throughput: float
 60 |     decode_latency: float
 61 |     decode_throughput: float
 62 |     total_latency: float
 63 |     total_throughput: float
 64 | 
 65 | 
 66 | np_dtype_to_torch_dtype = {
 67 |     np.float16: torch.float16, np.float32: torch.float32, np.uint8: torch.uint8,
 68 |     np.int8: torch.int8, np.int32: torch.int32, np.int64: torch.int64,
 69 |     bool: torch.bool,
 70 | }
 71 | 
 72 | torch_dtype_to_np_dtype = {
 73 |     torch.float16: np.float16, torch.float32: np.float32,
 74 |     torch.uint8: np.uint8, torch.int8: np.int8, torch.int32: np.int32,
 75 |     torch.int64: np.int64, torch.bool: bool,
 76 | }
 77 | 
 78 | torch_dtype_to_num_bytes = {
 79 |     torch.float16: 2, torch.float32: 4,
 80 |     torch.int8: 1, torch.uint8: 1, torch.int32: 4, torch.int64: 8,
 81 |     torch.bool: 1,
 82 | }
 83 | 
 84 | 
 85 | def piecewise_linear_func(xs, ys):
 86 |     """Return a function created by linear interpolation."""
 87 |     indices = np.argsort(xs)
 88 |     xs = [xs[i] for i in indices]
 89 |     ys = [ys[i] for i in indices]
 90 | 
 91 |     # pad left and right
 92 |     k = 1e5
 93 |     delta_x_left = xs[0] - xs[1]
 94 |     delta_y_left = ys[0] - ys[1]
 95 |     delta_x_right = xs[-1] - xs[-2]
 96 |     delta_y_right = ys[-1] - ys[-2]
 97 | 
 98 |     xs = [xs[0] + delta_x_left * k] + xs + [xs[-1] + delta_x_right * k]
 99 |     ys = [ys[0] + delta_y_left * k] + ys + [ys[-1] + delta_y_right * k]
100 | 
101 |     return functools.partial(piecewise_linear_func_ret_func, xs, ys)
102 | 
103 | 
104 | def piecewise_linear_func_ret_func(xs, ys, x):
105 |     assert x >= xs[0] and x <= xs[-1]
106 |     return np.interp(x, xs, ys)
107 | 
108 | 
109 | def sample_from_range(n, k):
110 |     assert n >= 1
111 | 
112 |     if k == -1:
113 |         ret = [1]
114 |         while ret[-1] * 2 < n:
115 |             ret.append(ret[-1] * 2)
116 |         return ret
117 |     else:
118 |         if k == 1: return [1]
119 |         step = (n - 1) // (k - 1)
120 |         return list(range(1, n + 1, step))
121 | 
122 | 
123 | def cpu_mem_stats():
124 |     objects = gc.get_objects()
125 |     tensors = [obj for obj in objects if torch.is_tensor(obj) and not obj.is_cuda]
126 | 
127 |     total_numel = 0
128 |     total_mem = 0
129 |     visited_data = set()
130 |     for tensor in tensors:
131 |         # a data_ptr indicates a memory block allocated
132 |         data_ptr = tensor.storage().data_ptr()
133 |         if data_ptr in visited_data:
134 |             continue
135 |         visited_data.add(data_ptr)
136 | 
137 |         numel = tensor.numel()
138 |         total_numel += numel
139 |         element_size = tensor.storage().element_size()
140 |         mem = numel * element_size
141 |         total_mem += mem
142 | 
143 |     return total_mem
144 | 
145 | 
146 | def torch_mem_stats():
147 |     objects = gc.get_objects()
148 |     tensors = [obj for obj in objects if torch.is_tensor(obj) and obj.is_cuda]
149 | 
150 |     total_numel = 0
151 |     total_mem = 0
152 |     visited_data = set()
153 |     for tensor in tensors:
154 |         # a data_ptr indicates a memory block allocated
155 |         data_ptr = tensor.storage().data_ptr()
156 |         if data_ptr in visited_data:
157 |             continue
158 |         visited_data.add(data_ptr)
159 | 
160 |         print(tensor.shape, tensor.data_ptr())
161 | 
162 |         numel = tensor.numel()
163 |         total_numel += numel
164 |         element_size = tensor.storage().element_size()
165 |         mem = numel * element_size
166 |         total_mem += mem
167 | 
168 |     return total_mem
169 | 
170 | 
171 | class ValueHolder:
172 |     def __init__(self):
173 |         self.val = None
174 | 
175 |     def store(self, val):
176 |         assert self.val is None
177 |         self.val = val
178 | 
179 |     def pop(self):
180 |         ret = self.val
181 |         self.val = None
182 |         return ret
183 | 
184 |     def clear(self):
185 |         self.val = None
186 | 
187 | 
188 | def array_1d(a, cls):
189 |     return [cls() for _ in range(a)]
190 | 
191 | 
192 | def array_2d(a, b, cls):
193 |     return [[cls() for _ in range(b)] for _ in range(a)]
194 | 
195 | 
196 | def array_3d(a, b, c, cls):
197 |     return [[[cls() for _ in range(c)] for _ in range(b)] for _ in range(a)]
198 | 
199 | 
200 | def array_4d(a, b, c, d, cls):
201 |     return [[[[cls() for _ in range(d)] for _ in range(c)] for _ in range(b)] for _ in range(a)]
202 | 
203 | 
204 | def vector_gather(vectors, indices):
205 |     """
206 |     Gathers (batched) vectors according to indices.
207 |     Arguments:
208 |         vectors: Tensor[S, B, H]
209 |         indices: Tensor[K, B]
210 |     Returns:
211 |         Tensor[K, B, H]
212 |     """
213 |     S, B, H = vectors.shape
214 |     K, B2 = indices.shape
215 |     assert B == B2
216 |     indices = indices.reshape(K, B, 1).expand(K, B, H)
217 |     out = vectors.gather(dim=0, index=indices)
218 |     return out
219 | 
220 | 
221 | def run_cmd(cmd):
222 |     print(cmd)
223 |     os.system(cmd)
224 | 
225 | 
226 | def str2bool(v):
227 |     if isinstance(v, bool):
228 |         return v
229 |     if v.lower() in ('yes', 'true', 't', 'y', '1'):
230 |         return True
231 |     elif v.lower() in ('no', 'false', 'f', 'n', '0'):
232 |         return False
233 |     else:
234 |         raise argparse.ArgumentTypeError('Boolean value expected.')
235 | 
236 | 
237 | def project_decode_latency(costs, prompt_len, gen_len):
238 |     decode_costs = costs[1:]
239 | 
240 |     if gen_len / prompt_len < 0.1:
241 |         warmup = 2
242 |         decode_latency = (sum(decode_costs[:warmup]) +
243 |             np.mean(decode_costs[warmup:]) * (gen_len - 1 - warmup))
244 |     else:
245 |         warmup = 2
246 |         decode_latency = (sum(decode_costs[:warmup]) +
247 |             np.mean(decode_costs[warmup:]) * (gen_len - 1 - warmup))
248 | 
249 |         #assert len(decode_costs) >= 4
250 |         #warmup = 2
251 |         #xs = np.arange(warmup, len(decode_costs))
252 |         #ys = np.asarray(decode_costs[warmup:])
253 |         #curve = np.poly1d(np.polyfit(xs, ys, deg=1))
254 |         #ys_pred = [curve(x) for x in range(gen_len-1)]
255 |         #decode_latency = sum(ys_pred)
256 | 
257 |         #print([round(x, 4) for x in decode_costs])
258 |         #print([round(x, 4) for x in ys_pred])
259 | 
260 |     return decode_latency
261 | 
262 | 
263 | def write_benchmark_log(filename, model_size, cache_size, hidden_size,
264 |         gpu_peak_mem, projected, prefill_latency, prefill_throughput,
265 |         decode_latency, decode_throughput, total_latency, total_throughput):
266 | 
267 |     log_str = (f"model size: {model_size/GB:.3f} GB\t"
268 |                f"cache size: {cache_size/GB:.3f} GB\t"
269 |                f"hidden size (p): {hidden_size/GB:.3f} GB\n"
270 |                f"peak gpu mem: {gpu_peak_mem / GB:.3f} GB\t"
271 |                f"projected: {projected}\n"
272 |                f"prefill latency: {prefill_latency:.3f} s\t"
273 |                f"prefill throughput: {prefill_throughput:.3f} token/s\n"
274 |                f"decode latency: {decode_latency:.3f} s\t"
275 |                f"decode throughput: {decode_throughput:.3f} token/s\n"
276 |                f"total latency: {total_latency:.3f} s\t"
277 |                f"total throughput: {total_throughput:.3f} token/s")
278 |     with open(filename, "a") as fout:
279 |         fout.write(log_str + "\n")
280 | 
281 |     return log_str
282 | 
283 | 
284 | def read_benchmark_log(filename):
285 |     with open(filename) as fin:
286 |         lines = fin.readlines()
287 | 
288 |     def extract(line):
289 |         a, b = line.split("\t")
290 |         latency = a[a.index(":") + 1:a.index(" s")]
291 |         throughput = b[b.index(":") + 1:b.index(" to")]
292 |         return float(latency), float(throughput)
293 | 
294 |     prefill_latency, prefill_throughput = extract(lines[2])
295 |     decode_latency, decode_throughput = extract(lines[3])
296 |     total_latency, total_throughput = extract(lines[4])
297 | 
298 |     return BenchmarkResult(
299 |         prefill_latency, prefill_throughput,
300 |         decode_latency, decode_throughput,
301 |         total_latency, total_throughput,
302 |     )
303 | 
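As a rough illustration of the arithmetic in `project_decode_latency` above, the sketch below restates it in plain Python. The first entry of `costs` is treated as the prefill step and skipped. The first `warmup` decode steps are charged at their measured cost, and the remaining `gen_len - 1 - warmup` steps are charged at the mean of the measured post-warmup costs. The helper name `project_decode_latency_sketch` and the sample timings are hypothetical and not part of the repository.

```python
from statistics import mean


def project_decode_latency_sketch(costs, gen_len, warmup=2):
    """Plain-Python restatement of the projection (hypothetical helper)."""
    decode_costs = costs[1:]  # costs[0] is assumed to be the prefill step
    measured_warmup = sum(decode_costs[:warmup])
    steady_state = mean(decode_costs[warmup:])
    # Remaining decode steps are extrapolated at the steady-state cost.
    return measured_warmup + steady_state * (gen_len - 1 - warmup)


# Made-up timings: one prefill step, two slow warmup steps, three steady steps.
costs = [5.0, 0.5, 0.25, 0.125, 0.125, 0.125]
projected = project_decode_latency_sketch(costs, gen_len=11)
print(f"projected decode latency: {projected:.3f} s")
```

With these numbers the two warmup steps contribute 0.75 s and the eight extrapolated steps contribute 1.0 s in total.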


--------------------------------------------------------------------------------
/speedup/flexgen/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/flexgen/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [build-system]
 2 | requires = ["setuptools>=61.0"]
 3 | build-backend = "setuptools.build_meta"
 4 | 
 5 | [project]
 6 | name = "flexgen"
 7 | version = "0.1.7"
 8 | description = "Running large language models like OPT-175B/GPT-3 on a single GPU. Focusing on high-throughput large-batch generation."
 9 | readme = "README.md"
10 | requires-python = ">=3.7"
11 | classifiers = [
12 |     "Programming Language :: Python :: 3",
13 |     "License :: OSI Approved :: Apache Software License",
14 | ]
15 | dependencies = [
16 |     "torch>=1.12", "transformers>=4.24",
17 |     "numpy", "tqdm", "pulp", "attrs",
18 | ]
19 | 
20 | [project.urls]
21 | "Homepage" = "https://github.com/FMInference/FlexGen"
22 | "Bug Tracker" = "https://github.com/FMInference/FlexGen/issues"
23 | 
24 | [tool.setuptools.packages.find]
25 | exclude = ["benchmark*", "dist*", "playground*", "scripts*"]
26 | 
27 | [tool.wheel]
28 | exclude = ["benchmark*", "dist*", "playground*", "scripts*"]
29 | 
30 | 
31 | 


--------------------------------------------------------------------------------
/speedup/infinigen/infinigen/kv_selection_controller.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | import torch.nn.functional as F
 3 | 
 4 | 
 5 | def select_kv(prefetch_idx, k_cache, v_cache):
 6 |     """Selects and aggregates critical KV caches using speculated indices
 7 | 
 8 |     On the decoding stage, aggregates the critical KV caches corresponding to
 9 |     the speculated prefetch index using embedding function.
10 | 
11 |     Args:
12 |         prefetch_idx: Indices of critical KV cache tokens for each head and batch (n', 1, bh)
13 |         k_cache: Key cache (n, bh, d)
14 |         v_cache: Value cache (n, bh, d)
15 | 
16 |     Returns:
17 |         selected_k: selected key cache (n', bh, d)
18 |         selected_v: selected value cache (n', bh, d)
19 |     """
20 | 
21 |     prefetch_idx = prefetch_idx.squeeze(1).to(k_cache.device)
22 |     ind = prefetch_idx * k_cache.shape[1] + torch.arange(k_cache.shape[1], device=k_cache.device)[None, :]
23 |     selected_k = F.embedding(ind, k_cache.reshape(-1, k_cache.shape[2]))
24 |     selected_v = F.embedding(ind, v_cache.reshape(-1, v_cache.shape[2]))
25 |     return selected_k, selected_v
26 | 
27 | 
28 | def speculate_attention(hidden, p_w_q, p_k_c, n_head, alpha, max_num_kv):
29 |     """Speculates the indices of the critical KV caches of next attention layer.
30 | 
31 |     On the decoding stage, speculates the attention score of the next layer
32 |     from the hidden states (layer i), partial query weight (layer i+1), and
33 |     partial key cache (layer i+1). Then counts the critical tokens and gets
34 |     the indices of the top-k KV cache tokens with high attention scores.
35 | 
36 |     Args:
37 |         hidden: Hidden states of layer i (b, 1, D)
38 |         p_w_q: Partial query weight (D', D), where D' == h * d'
39 |         p_k_c: Partial key cache (n, bh, d')
40 |         n_head: Number of heads which we refer to as h
41 |         alpha: Tokens scoring above (max score - alpha) are counted as critical
42 |         max_num_kv: Upper bound on the number of KV cache tokens to select
43 | 
44 |     Returns:
45 |         prefetch_idx: Indices of critical KV cache tokens for each head and batch (n', 1, bh)
46 |     """
47 |     b = hidden.shape[0]
48 |     p_q = F.linear(hidden, p_w_q, bias=None)
49 |     p_q = p_q.view(b, 1, n_head, -1)
50 |     p_q = p_q.permute(0, 2, 1, 3).reshape(b * n_head, 1, -1)
51 | 
52 |     p_attn = torch.bmm(p_q, p_k_c.permute(1, 2, 0))
53 |     max_ = torch.max(p_attn, dim=-1)[0]
54 |     thr_ = (max_ - alpha).unsqueeze(-1).repeat(1, 1, p_attn.shape[-1])
55 |     count = torch.where(
56 |         p_attn > thr_, torch.ones_like(p_attn), torch.zeros_like(p_attn)
57 |     )
58 |     mean = torch.mean(torch.sum(count, dim=-1)).item()
59 |     prefetch_idx = torch.topk(
60 |         p_attn.permute(2, 1, 0), min(int(mean), max_num_kv), dim=0
61 |     )[1]
62 | 
63 |     return prefetch_idx
64 | 
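The two functions above can be illustrated without tensors. The following is a minimal single-head sketch in plain Python, not the repository's implementation: `speculate_indices` mirrors the thresholding in `speculate_attention` (count the tokens whose score exceeds `max - alpha`, cap the count at `max_num_kv`, take that many top-scoring indices), and `select_rows` mirrors the flat indexing in `select_kv` (`token * bh + column` into a cache flattened to `n * bh` rows). Both helper names and all sample values are hypothetical. The real function averages the critical-token count over every head and batch entry, which a single-head sketch does not need.

```python
def speculate_indices(scores, alpha, max_num_kv):
    """Single-head sketch: count tokens within alpha of the max, then top-k."""
    threshold = max(scores) - alpha
    num_critical = sum(1 for s in scores if s > threshold)
    k = min(num_critical, max_num_kv)
    # Token indices ordered by speculated score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def select_rows(flat_cache, idx, bh):
    """Gather rows from a cache flattened from (n, bh, d) to (n * bh, d).

    idx[j][c] is the j-th selected token for batch-head column c.
    """
    return [
        [flat_cache[token * bh + c] for c, token in enumerate(row)]
        for row in idx
    ]


scores = [0.1, 2.0, 1.5, -3.0, 1.9]
print(speculate_indices(scores, alpha=0.6, max_num_kv=2))

# n = 3 tokens, bh = 2 columns, d = 1; the entry for (token t, column c) is [10 * t + c].
flat_cache = [[0], [1], [10], [11], [20], [21]]
selected = select_rows(flat_cache, idx=[[2, 0], [1, 1]], bh=2)
print(selected)
```

Each batch-head column picks its own tokens, which is why the index has one entry per column rather than one per token.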


--------------------------------------------------------------------------------
/speedup/infinigen/infinigen/partial_weight_generation_controller.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | import torch.nn.functional as F
 3 | 
 4 | 
 5 | def partial_weight_index_generation(query, n_head, head_dim, partial_weight_ratio):
 6 |     """Generates the indices of partial weight query and partial key cache.
 7 | 
 8 |     On the prefill stage, generates the indices of partial weight query and
 9 |     partial key cache using the query matrix. By comparing the absolute sum of
10 |     each column of the query matrix, gets the indices of top-k columns. These
11 |     columns correspond to the columns that strongly affect the attention score.
12 |     Thus, we use only those partial columns of query and key for speculation.
13 | 
14 |     Args:
15 |         query: Query matrix (b, n, D)
16 |         n_head: Number of heads which we refer to as h
17 |         head_dim: Hidden dimension of each head which we refer to as d
18 |         partial_weight_ratio: Ratio of the top-k columns
19 | 
20 |     Returns:
21 |         partial_weight_index: Indices of top-k columns (b, h, d')
22 |             where d' is d * (partial_weight_ratio).
23 |     """
24 | 
25 |     partial_weight_index = torch.zeros(n_head, int(head_dim * partial_weight_ratio)).to(
26 |         query.device
27 |     )
28 |     b = query.shape[0]
29 | 
30 |     for h_idx in range(n_head):
31 |         start = h_idx * head_dim
32 |         end = (h_idx + 1) * head_dim
33 |         _, ind = torch.topk(
34 |             torch.sum(torch.abs(query[0, :, start:end]), dim=-2),
35 |             int(head_dim * partial_weight_ratio),
36 |         )
37 |         partial_weight_index[h_idx] = ind
38 | 
39 |     return partial_weight_index.unsqueeze(0).repeat(b, 1, 1).to(torch.int64)
40 | 
41 | 
42 | def set_partial_cache(k_cache, partial_index, n_head, head_dim):
43 |     """Sets the partial key cache.
44 | 
45 |     On the prefill and decoding stages, generates the partial key cache
46 |     following the partial_index which indicates the indices of the important
47 |     columns.
48 | 
49 |     Args:
50 |         k_cache: Key cache (n, bh, d)
51 |         partial_index: Indices of top-k columns (b, h, d')
52 |         n_head: Number of heads which we refer to as h
53 |         head_dim: Hidden dimension of each head which we refer to as d
54 | 
55 |     Returns:
56 |         partial_cache: Partial key cache (n, bh, d')
57 |     """
58 | 
59 |     n, bh, _ = k_cache.shape
60 |     partial_cache = torch.gather(
61 |         k_cache.view(n, -1, n_head, head_dim),
62 |         3,
63 |         partial_index.unsqueeze(0).repeat(n, 1, 1, 1),
64 |     )
65 |     return partial_cache.view(n, bh, -1)
66 | 
67 | 
68 | def set_partial_weight(w_q, partial_index, n_head, head_dim):
69 |     """Sets the partial query weight.
70 | 
71 |     On the prefill stage, generates the partial query weight following the
72 |     partial_index which indicates the indices of the important columns.
73 | 
74 |     Args:
75 |         w_q: Query weight (D, D)
76 |         partial_index: Indices of top-k columns (b, h, d')
77 |         n_head: Number of heads which we refer to as h
78 |         head_dim: Hidden dimension of each head which we refer to as d
79 | 
80 |     Returns:
81 |         partial_weight: Partial query weight (D', D)
82 |     """
83 | 
84 |     partial_weight = F.embedding(
85 |         partial_index[0]
86 |         + torch.arange(n_head)[:, None].to(partial_index.device) * head_dim,
87 |         w_q.view(-1, w_q.shape[-1]),
88 |     )
89 |     return partial_weight.view(-1, w_q.shape[-1])
90 | 
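To make the column selection described in `partial_weight_index_generation` concrete, here is a small plain-Python sketch for a single head. It ranks columns by the sum of absolute query values, keeps the top `d * ratio` of them, and then compares a full query-key dot product with the one computed from the kept columns only. The helper names and the toy matrices are hypothetical, and the sketch uses nested lists instead of the tensors used above.

```python
def top_columns_by_abs_sum(query_head, ratio):
    """Return indices of the columns with the largest absolute column sums."""
    d = len(query_head[0])
    k = int(d * ratio)
    col_sums = [sum(abs(row[c]) for row in query_head) for c in range(d)]
    order = sorted(range(d), key=lambda c: col_sums[c], reverse=True)
    return order[:k]


def take_columns(rows, cols):
    """Keep only the given columns of every row (the 'partial' matrix)."""
    return [[row[c] for c in cols] for row in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy head with n = 2 tokens and d = 4; columns 1 and 3 dominate the query.
query_head = [[0.1, -4.0, 0.2, 3.0], [-0.2, 5.0, 0.1, -2.0]]
key_head = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

cols = top_columns_by_abs_sum(query_head, ratio=0.5)
partial_query = take_columns(query_head, cols)
partial_keys = take_columns(key_head, cols)

full_score = dot(query_head[0], key_head[0])
partial_score = dot(partial_query[0], partial_keys[0])
print(cols, full_score, partial_score)
```

In this toy case the partial score stays close to the full score because the dropped columns carry little of the query's magnitude. That is the property the speculation relies on, and it is not guaranteed for arbitrary inputs.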


--------------------------------------------------------------------------------
/speedup/infinigen/infinigen/skewing_controller.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | 
 3 | 
 4 | def weight_bias_concat(weight, bias, scaling=False, head_dim=1.0):
 5 |     """Concatenates the weight matrix and bias.
 6 | 
 7 |     On the warmup phase, concatenates the weight matrix and bias for skewing.
 8 |     This manipulation does not hurt the correctness.
 9 | 
10 |     Args:
11 |         weight: Weight matrix (D, D)
12 |         bias: Bias vector (D)
13 |         scaling: If true, scales the concatenated weight and bias to skip
14 |             the scaling after projection.
15 |         head_dim: Hidden dimension of each head which we refer to as d
16 | 
17 |     Returns:
18 |         concatenated weight and bias (D, D+1)
19 |     """
20 | 
21 |     if scaling:
22 |         return torch.cat((weight, bias.unsqueeze(1).to(weight.device)), dim=1) * (
23 |             head_dim**-0.5
24 |         )
25 |     else:
26 |         return torch.cat((weight, bias.unsqueeze(1).to(weight.device)), dim=1)
27 | 
28 | 
29 | def reform_hidden_states(hidden_states):
30 |     """Appends a column of ones to the hidden states.
31 | 
32 |     Concatenates the hidden states with a column of ones.
33 |     Combined with the concatenated weight and bias, this turns the linear
34 |     projection into a single matrix multiplication without bias addition.
35 | 
36 |     Args:
37 |         hidden_states: Hidden states (b, n, D)
38 | 
39 |     Returns:
40 |         reformed hidden states (b, n, D+1)
41 |     """
42 | 
43 |     return torch.cat(
44 |         (hidden_states, torch.ones_like(hidden_states[:, :, :1])), dim=-1
45 |     )
46 | 
47 | 
48 | def skew(query, key, wq, wk, n_head, head_dim):
49 |     """Manipulates the query/key weight matrix for skewing the query and key matrix.
50 | 
51 |     On the warmup phase, manipulates the query/key weight matrix for
52 |     skewing the query and key matrix. By doing so, a few columns of
53 |     the query and key matrix have become much more important. We use
54 |     the columns for attention speculation.
55 | 
56 |     Args:
57 |         query: Query matrix (b, n, h, d)
58 |         key: Key matrix (b, n, h, d)
59 |         wq: Concatenated query weight and bias (D, D+1)
60 |         wk: Concatenated key weight and bias (D, D+1)
61 |         n_head: Number of heads which we refer to as h
62 |         head_dim: Hidden dimension of each head which we refer to as d
63 | 
64 |     Returns:
65 |         wq: Manipulated wq (D, D+1)
66 |         wk: Manipulated wk (D, D+1)
67 | 
68 |     """
69 | 
70 |     for h_idx in range(n_head):
71 |         start = h_idx * head_dim
72 |         end = (h_idx + 1) * head_dim
73 |         _, sq, vq = torch.svd(query[0, :, h_idx].to(torch.float))
74 |         _, sk, _ = torch.svd(key[0, :, h_idx].to(torch.float))
75 |         sq = sq.to(torch.float16)
76 |         vq = vq.to(torch.float16)
77 |         sk = sk.to(torch.float16)
78 |         sq = sq * sk
79 |         A = torch.zeros(head_dim, head_dim).to(query.device).to(torch.float16)
80 |         _, ind = sq.sort()
81 |         A = A.scatter(-1, ind.unsqueeze(0).repeat(head_dim, 1), vq)
82 |         wq[start:end, :] = A.t() @ wq[start:end]
83 |         wk[start:end, :] = A.t() @ wk[start:end]
84 |     return wq, wk
85 | 
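Two identities sit behind the docstrings above, and both can be checked with small plain-Python examples. First, appending the bias as an extra weight column and a constant 1 to the input gives the same result as a separate bias addition (`weight_bias_concat` with `reform_hidden_states`). Second, multiplying query and key by the same orthogonal matrix leaves their dot product unchanged, which is why skewing need not change attention scores. The sketch assumes the matrix `A` built in `skew` is orthogonal, as a column-permuted `V` from an SVD would be in exact arithmetic. All helper names and values here are hypothetical.

```python
def matvec(mat, vec):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def vecmat(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# 1) Bias folding: W x + b equals [W | b] applied to [x; 1].
weight = [[1.0, 2.0], [3.0, 4.0]]
bias = [0.5, -1.0]
x = [2.0, -1.0]
with_bias = [y + b for y, b in zip(matvec(weight, x), bias)]
folded_weight = [row + [b] for row, b in zip(weight, bias)]
folded = matvec(folded_weight, x + [1.0])
print(with_bias, folded)

# 2) Orthogonal skewing: (q A) . (k A) equals q . k when A is orthogonal.
A = [[0.0, -1.0], [1.0, 0.0]]  # a 90-degree rotation, chosen for exact arithmetic
q = [1.0, 2.0]
k = [3.0, 4.0]
original_score = dot(q, k)
skewed_score = dot(vecmat(q, A), vecmat(k, A))
print(original_score, skewed_score)
```

The rotation changes which columns of the query and key carry the magnitude while keeping every pairwise score intact.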


--------------------------------------------------------------------------------
/speedup/infinigen/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 | 
3 | setup(name="infinigen", version="0.1.0", packages=find_packages(include=["infinigen"]))
4 | 


--------------------------------------------------------------------------------
/speedup/install.sh:
--------------------------------------------------------------------------------
1 | pip install -e infinigen
2 | pip install -e flexgen
3 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure14/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure14/run.sh:
--------------------------------------------------------------------------------
 1 | UVM_PATH=$PWD/../../uvm
 2 | export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
 3 | for SCHEME in "uvm" "uvm_h2o"
 4 | do
 5 |   g++ $UVM_PATH/allocate.cpp -o allocate.so --shared -fPIC -I$CUDA_HOME/include
 6 |   CMD="--embed_dim 5120 --ffn_dim 20480 --enable_bias --n_head 40 --do_layer_norm_before --n_layer 40 --bsz 20 --prompt_len 1920 --gen_len 128 --runs 1"
 7 |   
 8 |   if [ "$SCHEME" = "uvm_h2o" ]
 9 |   then 
10 |     CMD=$CMD" --is_h2o --h2o_ratio 0.2"
11 |   fi
12 |   python $UVM_PATH/transformer.py $CMD
13 |   rm allocate.so
14 | done
15 | 
16 | FLEXGEN_PATH=$PWD/../../flexgen
17 | for SCHEME in "original" "int4" "h2o" "infinigen"
18 | do
19 |   rm -f "$FLEXGEN_PATH/flexgen/flex_opt.py"
20 |   rm -f "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
21 |   if [ "$SCHEME" = "int4" ]
22 |   then
23 |     ln -s ../original/flex_opt.py "$FLEXGEN_PATH/flexgen/flex_opt.py"
24 |     ln -s ../original/pytorch_backend.py "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
25 |   else
26 |     ln -s "../$SCHEME/flex_opt.py" "$FLEXGEN_PATH/flexgen/flex_opt.py"
27 |     ln -s "../$SCHEME/pytorch_backend.py" "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
28 |   fi
29 | 
30 |   CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 20 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
31 |   if [ "$SCHEME" = "int4" ]
32 |   then
33 |     CMD=$CMD" --compress-cache"
34 |   elif [ "$SCHEME" = "h2o" ]
35 |   then
36 |     CMD=$CMD" --max-num-kv 415 --hh-ratio 0.1 --hh-all"
37 |   elif [ "$SCHEME" = "infinigen" ]
38 |   then
39 |     CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 400"
40 |   fi
41 |   python -m flexgen.flex_opt $CMD
42 | done
43 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure15/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure15/run.sh:
--------------------------------------------------------------------------------
 1 | UVM_PATH=$PWD/../../uvm
 2 | export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
 3 | for SCHEME in "uvm" "uvm_h2o"
 4 | do
 5 |   g++ $UVM_PATH/allocate.cpp -o allocate.so --shared -fPIC -I$CUDA_HOME/include
 6 |   for BSZ in 4 8 12 16 20
 7 |   do
 8 |     CMD="--embed_dim 5120 --ffn_dim 20480 --enable_bias --n_head 40 --do_layer_norm_before --n_layer 40 --bsz $BSZ --prompt_len 1920 --gen_len 128 --runs 1"
 9 |     
10 |     if [ "$SCHEME" = "uvm_h2o" ]
11 |     then 
12 |       CMD=$CMD" --is_h2o --h2o_ratio 0.2"
13 |     fi
14 |     python $UVM_PATH/transformer.py $CMD
15 |   done
16 |   rm allocate.so
17 | done
18 | 
19 | FLEXGEN_PATH=$PWD/../../flexgen
20 | for SCHEME in "original" "int4" "h2o" "infinigen"
21 | do
22 |   rm -f "$FLEXGEN_PATH/flexgen/flex_opt.py"
23 |   rm -f "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
24 |   if [ "$SCHEME" = "int4" ]
25 |   then
26 |     ln -s ../original/flex_opt.py "$FLEXGEN_PATH/flexgen/flex_opt.py"
27 |     ln -s ../original/pytorch_backend.py "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
28 |   else
29 |     ln -s "../$SCHEME/flex_opt.py" "$FLEXGEN_PATH/flexgen/flex_opt.py"
30 |     ln -s "../$SCHEME/pytorch_backend.py" "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
31 |   fi
32 | 
33 |   for BSZ in 4 8 12 16 20
34 |   do
35 |     CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size $BSZ --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
36 |     if [ "$SCHEME" = "int4" ]
37 |     then
38 |       CMD=$CMD" --compress-cache"
39 |     elif [ "$SCHEME" = "h2o" ]
40 |     then
41 |       CMD=$CMD" --max-num-kv 409 --hh-ratio 0.1 --hh-all"
42 |     elif [ "$SCHEME" = "infinigen" ]
43 |     then
44 |       CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 409"
45 |     fi
46 |     python -m flexgen.flex_opt $CMD
47 |   done
48 | done
49 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure16a/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure16a/run.sh:
--------------------------------------------------------------------------------
 1 | FLEXGEN_PATH=$PWD/../../flexgen
 2 | for SCHEME in "original" "int4" "h2o" "infinigen"
 3 | do
 4 |   rm -f "$FLEXGEN_PATH/flexgen/flex_opt.py"
 5 |   rm -f "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
 6 |   if [ "$SCHEME" = "int4" ]
 7 |   then
 8 |     ln -s ../original/flex_opt.py "$FLEXGEN_PATH/flexgen/flex_opt.py"
 9 |     ln -s ../original/pytorch_backend.py "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
10 |   else
11 |     ln -s "../$SCHEME/flex_opt.py" "$FLEXGEN_PATH/flexgen/flex_opt.py"
12 |     ln -s "../$SCHEME/pytorch_backend.py" "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
13 |   fi
14 | 
15 |   for PROMPT_LEN in 384 896 1408 1920
16 |   do
17 |     CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 8 --num-gpu-batches 1 --prompt-len $PROMPT_LEN --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
18 |     if [ "$SCHEME" = "int4" ]
19 |     then
20 |       CMD=$CMD" --compress-cache"
21 |     elif [ "$SCHEME" = "h2o" ]
22 |     then
23 |       CMD=$CMD" --max-num-kv $(( (PROMPT_LEN + 128) / 5 )) --hh-ratio 0.1 --hh-all"
24 |     elif [ "$SCHEME" = "infinigen" ]
25 |     then
26 |       CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv $(( (PROMPT_LEN + 128) / 5 ))"
27 |     fi
28 |     python -m flexgen.flex_opt $CMD
29 |   done
30 | done
31 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure16b/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure16b/run.sh:
--------------------------------------------------------------------------------
 1 | FLEXGEN_PATH=$PWD/../../flexgen
 2 | for SCHEME in "original" "int4" "h2o" "infinigen"
 3 | do
 4 |   rm -f "$FLEXGEN_PATH/flexgen/flex_opt.py"
 5 |   rm -f "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
 6 |   if [ "$SCHEME" = "int4" ]
 7 |   then
 8 |     ln -s ../original/flex_opt.py "$FLEXGEN_PATH/flexgen/flex_opt.py"
 9 |     ln -s ../original/pytorch_backend.py "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
10 |   else
11 |     ln -s "../$SCHEME/flex_opt.py" "$FLEXGEN_PATH/flexgen/flex_opt.py"
12 |     ln -s "../$SCHEME/pytorch_backend.py" "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
13 |   fi
14 | 
15 |   for MODEL in "opt-6.7b" "opt-13b" "opt-30b"
16 |   do
17 |     CMD="--model huggingface/$MODEL"
18 |     if [ "$MODEL" = "opt-30b" ]
19 |     then
20 |       CMD=$CMD" --percent 70 30 0 100 100 0"
21 |     else
22 |       CMD=$CMD" --percent 100 0 0 100 100 0"
23 |     fi
24 |     CMD=$CMD" --overlap false --gpu-batch-size 4 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
25 |     if [ "$SCHEME" = "int4" ]
26 |     then
27 |       CMD=$CMD" --compress-cache"
28 |     elif [ "$SCHEME" = "h2o" ]
29 |     then
30 |       CMD=$CMD" --max-num-kv 409 --hh-ratio 0.1 --hh-all"
31 |     elif [ "$SCHEME" = "infinigen" ]
32 |     then
33 |       CMD=$CMD" --alpha 4 --partial-weight-ratio 0.2 --max-num-kv 409"
34 |     fi
35 |     python -m flexgen.flex_opt $CMD
36 |   done
37 | done
38 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure17a/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure17a/run.sh:
--------------------------------------------------------------------------------
 1 | FLEXGEN_PATH=$PWD/../../flexgen
 2 | rm -f "$FLEXGEN_PATH/flexgen/flex_opt.py"
 3 | rm -f "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
 4 | ln -s ../infinigen/flex_opt.py "$FLEXGEN_PATH/flexgen/flex_opt.py"
 5 | ln -s ../infinigen/pytorch_backend.py "$FLEXGEN_PATH/flexgen/pytorch_backend.py"
 6 | 
 7 | for ALPHA in 1 2 3 4 5 6 7 8 9
 8 | do
 9 |   CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 8 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
10 |   CMD=$CMD" --alpha $ALPHA --partial-weight-ratio 0.2 --max-num-kv 409"
11 |   python -m flexgen.flex_opt $CMD
12 | done
13 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure17b/pg19_firstbook.txt:
--------------------------------------------------------------------------------
1 | Half-way down the Rue Saint-Denis, almost at the corner of the Rue du Petit-Lion, there stood formerly one of those delightful houses which enable historians to reconstruct old Paris by analogy. The threatening walls of this tumbledown abode seemed to have been decorated with hieroglyphics. For what other name could the passer-by give to the Xs and Vs which the horizontal or diagonal timbers traced on the front, outlined by little parallel cracks in the plaster? It was evident that every beam quivered in its mortices at the passing of the lightest vehicle. This venerable structure was crowned by a triangular roof of which no example will, ere long, be seen in Paris. This covering, warped by the extremes of the Paris climate, projected three feet over the roadway, as much to protect the threshold from the rainfall as to shelter the wall of a loft and its sill-less dormer-window. This upper story was built of planks, overlapping each other like slates, in order, no doubt, not to overweight the frail house.
2 | One rainy morning in the month of March, a young man, carefully wrapped in his cloak, stood under the awning of a shop opposite this old house, which he was studying with the enthusiasm of an antiquary. In point of fact, this relic of the civic life of the sixteenth century offered more than one problem to the consideration of an observer. Each story presented some singularity; on the first floor four tall, narrow windows, close together, were filled as to the lower panes with boards, so as to produce the doubtful light by which a clever salesman can ascribe to his goods the color his customers inquire for. The young man seemed very scornful of this part of the house; his eyes had not yet rested on it. The windows of the second floor, where the Venetian blinds were drawn up, revealing little dingy muslin curtains behind the large Bohemian glass panes, did not interest him either. His attention was attracted to the third floor, to the modest sash-frames of wood, so clumsily wrought that they might have found a place in the Museum of Arts and Crafts to illustrate the early efforts of French carpentry. These windows were glazed with small squares of glass so green that, but for his good eyes, the young man could not have seen the blue-checked cotton curtains which screened the mysteries of the room from profane eyes. Now and then the watcher, weary of his fruitless contemplation, or of the silence in which the house was buried, like the whole neighborhood, dropped his eyes towards the lower regions. An involuntary smile parted his lips each time he looked at the shop, where, in fact, there were some laughable details.
3 | A formidable wooden beam, resting on four pillars, which appeared to have bent under the weight of the decrepit house, had been encrusted with as many coats of different paint as there are of rouge on an old duchess’ cheek. In the middle of this broad and fantastically carved joist there was an old painting representing a cat playing rackets. This picture was what moved the young man to mirth. But it must be said that the wittiest of modern painters could not invent so comical a caricature. The animal held in one of its forepaws a racket as big as itself, and stood on its hind legs to aim at hitting an enormous ball, returned by a man in a fine embroidered coat. Drawing, color, and accessories, all were treated in such a way as to suggest that the artist had meant to make game of the shop-owner and of the passing observer. Time, while impairing this artless painting, had made it yet more grotesque by introducing some uncertain features which must have puzzled the conscientious idler. For instance, the cat’s tail had been eaten into in such a way that it might now have been taken for the figure of a spectator—so long, and thick, and furry were the tails of our forefathers’ cats. To the right of the picture, on an azure field which ill-disguised the decay of the wood, might be read the name “Guillaume,” and to the left, “Successor to Master Chevrel.” Sun and rain had worn away most of the gilding parsimoniously applied to the letters of this superscription, in which the Us and Vs had changed places in obedience to the laws of old-world orthography.
4 | To quench the pride of those who believe that the world is growing cleverer day by day, and that modern humbug surpasses everything, it may be observed that these signs, of which the origin seems so whimsical to many Paris merchants, are the dead pictures of once living pictures by which our roguish ancestors contrived to tempt customers into their houses. Thus the Spinning Sow, the Green Monkey, and others, were animals in cages whose skills astonished the passer-by, and whose accomplishments prove the patience of the fifteenth-century artisan. Such curiosities did more to enrich their fortunate owners than the signs of “Providence,” “Good-faith,” “Grace of God,” and “Decapitation of John the Baptist,” which may still be seen in the Rue Saint-Denis.
5 | However, our stranger was certainly not standing there to admire the cat, which a minute’s attention sufficed to stamp on his memory. The young man himself had his peculiarities. His cloak, folded after the manner of an antique drapery, showed a smart pair of shoes, all the more remarkable in the midst of the Paris mud, because he wore white silk stockings, on which the splashes betrayed his impatience. He had just come, no doubt, from a wedding or a ball; for at this early hour he had in his hand a pair of white gloves, and his black hair, now out of curl, and flowing over his shoulders, showed that it had been dressed a la Caracalla, a fashion introduced as much by David’s school of painting as by the mania for Greek and Roman styles which characterized the early years of this century.
6 | In spite of the noise made by a few market gardeners, who, being late, rattled past towards the great market-place at a gallop, the busy street lay in a stillness of which the magic charm is known only to those who have wandered through deserted Paris at the hours when its roar, hushed for a moment, rises and spreads in the distance like the great voice of the sea. This strange young man must have seemed as curious to the shopkeeping folk of the “Cat and Racket” as the “Cat and Racket” was to him. A dazzlingly white cravat made his anxious face look even paler than it really was. The fire that flashed in his black eyes, gloomy and sparkling by turns, was in harmony with the singular outline of his features, with his wide, flexible mouth, hardened into a smile. His forehead, knit with violent annoyance, had a stamp of doom. Is not the forehead the most prophetic feature of a man? When the stranger’s brow expressed passion the furrows formed in it were terrible in their strength and energy; but when he recovered his calmness, so easily upset, it beamed with a luminous grace which gave great attractiveness to a countenance in which joy, grief, love, anger, or scorn blazed out so contagiously that the coldest man could not fail to be impressed.
7 | He was so thoroughly vexed by the time when the dormer-window of the loft was suddenly flung open, that he did not observe the apparition of three laughing faces, pink and white and chubby, but as vulgar as the face of Commerce as it is seen in sculpture on certain monuments. These three faces, framed by the window, recalled the puffy cherubs floating among the clouds that surround God the Father. The apprentices snuffed up the exhalations of the street with an eagerness that showed how hot and poisonous the atmosphere of their garret must be. After pointing to the singular sentinel, the most jovial, as he seemed, of the apprentices retired and came back holding an instrument whose hard metal pipe is now superseded by a leather tube; and they all grinned with mischief as they looked down on the loiterer, and sprinkled him with a fine white shower of which the scent proved that three chins had just been shaved. Standing on tiptoe, in the farthest corner of their loft, to enjoy their victim’s rage, the lads ceased laughing on seeing the haughty indifference with which the young man shook his cloak, and the intense contempt expressed by his face as he glanced up at the empty window-frame.
8 | At this moment a slender white hand threw up the lower half of one of the clumsy windows on the third floor by the aid of the sash runners, of which the pulley so often suddenly gives way and releases the heavy panes it ought to hold up. The watcher was then rewarded for his long waiting. The face of a young girl appeared, as fresh as one of the white cups that bloom on the bosom of the waters, crowned by a frill of tumbled muslin, which gave her head a look of exquisite innocence. Though wrapped in brown stuff, her neck and shoulders gleamed here and there through little openings left by her movements in sleep. No expression of embarrassment detracted from the candor of her face, or the calm look of eyes immortalized long since in the sublime works of Raphael;here were the same grace, the same repose as in those Virgins, and now proverbial. There was a delightful contrast between the cheeks of that face on which sleep had, as it were, given high relief to a superabundance of life, and the antiquity of the heavy window with its clumsy shape and black sill. Like those day-blowing flowers, which in the early morning have not yet unfurled their cups, twisted by the chills of night, the girl, as yet hardly awake, let her blue eyes wander beyond the neighboring roofs to look at the sky; then, from habit, she cast them down on the gloomy depths of the street, where they immediately met those of her adorer. Vanity, no doubt, distressed her at being seen in undress; she started back, the worn pulley gave way, and the sash fell with the rapid run, which in our day has earned for this artless invention of our forefathers an odious name, Fenetre a la Guillotine. The vision had disappeared. To the young man the most radiant star of morning seemed to be hidden by a cloud.
9 | 


--------------------------------------------------------------------------------
/speedup/scripts/figure17b/run.sh:
--------------------------------------------------------------------------------
 1 | FLEXGEN_PATH=$PWD/../../flexgen
 2 | rm -f $FLEXGEN_PATH/flexgen/flex_opt.py
 3 | rm -f $FLEXGEN_PATH/flexgen/pytorch_backend.py
 4 | ln -s ../infinigen/flex_opt.py $FLEXGEN_PATH/flexgen/flex_opt.py
 5 | ln -s ../infinigen/pytorch_backend.py $FLEXGEN_PATH/flexgen/pytorch_backend.py
 6 | 
 7 | for PARTIAL_WEIGHT_RATIO in 0.1 0.2 0.4 0.6 0.8 1.0
 8 | do
 9 |   CMD="--model huggingface/opt-13b --percent 100 0 0 100 100 0 --overlap false --gpu-batch-size 8 --num-gpu-batches 1 --prompt-len 1920 --gen-len 128 --warmup-input-path pg19_firstbook.txt --test-input-path pg19_firstbook.txt"
10 |   CMD=$CMD" --alpha 4 --partial-weight-ratio $PARTIAL_WEIGHT_RATIO --max-num-kv 409"
11 |   python -m flexgen.flex_opt $CMD
12 | done
13 | 
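
Note on `--max-num-kv 409`: the script does not say how this value was chosen. One observation is that 409 equals the integer part of 0.2 x (prompt-len + gen-len) = 0.2 x 2048. The sketch below only checks that arithmetic; the helper name `kv_budget` and the 0.2 ratio are assumptions for illustration, not something the script states.

```python
def kv_budget(prompt_len: int, gen_len: int, ratio: float) -> int:
    """Number of KV entries kept when retaining `ratio` of the full sequence.

    Hypothetical helper: the ratio-based derivation is an assumption.
    """
    return int((prompt_len + gen_len) * ratio)


# Values taken from the run.sh command line above; 0.2 is assumed.
print(kv_budget(1920, 128, 0.2))  # 409
```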


--------------------------------------------------------------------------------
/speedup/scripts/run_all.sh:
--------------------------------------------------------------------------------
1 | for FIG in "14" "15" "16a" "16b" "17a" "17b"
2 | do
3 |   cd figure$FIG
4 |   sh run.sh
5 |   cd ..
6 | done
7 | 


--------------------------------------------------------------------------------
/speedup/uvm/allocate.cpp:
--------------------------------------------------------------------------------
 1 | #include <sys/types.h>
 2 | #include <cuda_runtime_api.h>
 3 | 
 4 | extern "C" {
 5 |   void* uvm_malloc(ssize_t size, int device, cudaStream_t stream) {
 6 |     void *ptr = nullptr;  // stays null if the managed allocation fails
 7 |     //cudaMalloc(&ptr, size);
 8 |     cudaMallocManaged(&ptr, size);
 9 |     return ptr;
10 |   }
11 | 
12 |   void uvm_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
13 |     cudaFree(ptr);
14 |   }
15 | }
16 | 


--------------------------------------------------------------------------------
/speedup/uvm/h2o_attention.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | from torch import nn
  3 | from typing import Tuple
  4 | 
  5 | class SelfAttention(nn.Module):
  6 |     """Multi-headed attention from 'Attention Is All You Need' paper"""
  7 | 
  8 |     def __init__(
  9 |         self,
 10 |         embed_dim: int,
 11 |         num_heads: int,
 12 |         h2o_ratio: float,
 13 |         bias: bool = True
 14 |     ):
 15 |         super().__init__()
 16 |         self.embed_dim = embed_dim
 17 |         self.num_heads = num_heads
 18 |         self.head_dim = embed_dim // num_heads
 19 | 
 20 |         if (self.head_dim * num_heads) != self.embed_dim:
 21 |             raise ValueError(
 22 |                 f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
 23 |                 f" and `num_heads`: {num_heads})."
 24 |             )
 25 |         self.scaling = self.head_dim**-0.5
 26 | 
 27 |         self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 28 |         self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 29 |         self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 30 |         self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 31 | 
 32 |         self.acc = None
 33 |         self.ratio = h2o_ratio
 34 |         self.i = 0
 35 |         self.past_key_value = None
 36 | 
 37 |     def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
 38 |         return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
 39 |     
 40 |     def _heavy_hitter_pruning(self, k, v, attn_weights, hh_k):
 41 |         # k, v: (s, b * n_head, head_dim)
 42 |         # attn_weights: (b * n_head, s, s)
 43 |         aggr_attn = torch.sum(attn_weights, 1)
 44 |         # (b * n_head, hh_k)
 45 |         _, topk_indices = aggr_attn[:, :].topk(
 46 |             min(hh_k, aggr_attn.shape[1]), dim=1)
 47 | 
 48 |         # select heavy-hitters
 49 |         # k, v: (b * n_head, s, head_dim)
 50 |         k_t = k.transpose(1, 0)
 51 |         v_t = v.transpose(1, 0)
 52 |         dim0_indices = torch.arange(k_t.size(0))[:, None]
 53 |         dim0_indices = dim0_indices.expand_as(topk_indices)
 54 |         # (b * n_head, hh_k, head_dim)
 55 |         k_hh_t = k_t[dim0_indices, topk_indices]
 56 |         v_hh_t = v_t[dim0_indices, topk_indices]
 57 |         # (hh_k, b * n_head, head_dim)
 58 |         k = k_hh_t.transpose(1, 0)
 59 |         v = v_hh_t.transpose(1, 0)
 60 |         # aggr_attn: (s, b * n_head)
 61 |         aggr_attn = aggr_attn.transpose(0, 1)
 62 |         dim1_indices = torch.arange(aggr_attn.size(1)).unsqueeze(0)
 63 |         # acc: (hh_k, b * n_head)
 64 |         acc = aggr_attn[topk_indices.transpose(0, 1), dim1_indices]
 65 |         return k, v, acc
 66 | 
 67 |     def forward(
 68 |         self,
 69 |         hidden_states: torch.Tensor
 70 |     ) -> torch.Tensor:
 71 |         """Input shape: Batch x Time x Channel"""
 72 | 
 73 |         bsz, tgt_len, _ = hidden_states.size()
 74 | 
 75 |         # get query proj
 76 |         query_states = self.q_proj(hidden_states) * self.scaling
 77 |         
 78 |         # get key/value proj
 79 |         if self.past_key_value is not None:
 80 |             # reuse k, v, self_attention
 81 |             k = self._shape(self.k_proj(hidden_states), -1, bsz).squeeze()
 82 |             v = self._shape(self.v_proj(hidden_states), -1, bsz).squeeze()
 83 |             key_states = self.past_key_value[0]
 84 |             key_states[:, :, -1] = k
 85 |             value_states = self.past_key_value[1]
 86 |             value_states[:, :, -1] = v
 87 |         else:
 88 |             # self_attention
 89 |             key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
 90 |             value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
 91 | 
 92 |         # update kv cache
 93 |         #past_key_value = (key_states, value_states)
 94 | 
 95 |         # reshape
 96 |         proj_shape = (bsz * self.num_heads, -1, self.head_dim)
 97 |         query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
 98 |         key_states = key_states.view(*proj_shape)
 99 |         value_states = value_states.view(*proj_shape)
100 | 
101 |         # qkt
102 |         src_len = key_states.size(1)
103 |         attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
104 | 
105 |         # masking
106 |         if self.i == 0: # prefill
107 |             mask = torch.triu(torch.ones(attn_weights.shape).to('cuda'), diagonal=1) * -10000
108 |             attn_weights = attn_weights + mask
109 | 
110 |         # softmax
111 |         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
112 | 
113 |         # sv
114 |         attn_output = torch.bmm(attn_weights, value_states)
115 | 
116 |         # reshape
117 |         attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
118 |         attn_output = attn_output.transpose(1, 2)
119 |         attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim)
120 | 
121 |         attn_output = self.out_proj(attn_output)
122 | 
123 | 
124 |         ##### h2o ####
125 |         if self.acc is None:
126 |             self.hh = int(attn_weights.shape[-1] * self.ratio)
127 |             key_states, value_states, self.acc = self._heavy_hitter_pruning(key_states.permute(1,0,2), value_states.permute(1,0,2), attn_weights, self.hh)
128 |             key_states = key_states.permute(1, 0, 2)
129 |             value_states = value_states.permute(1, 0, 2)
130 |             self.past_key_value = (torch.cat((key_states.reshape(bsz, self.num_heads, key_states.shape[-2], key_states.shape[-1]), torch.zeros(bsz, self.num_heads, 1, key_states.shape[-1]).to('cuda').to(torch.float16)), dim = -2),
131 |                               torch.cat((value_states.reshape(bsz, self.num_heads, value_states.shape[-2], value_states.shape[-1]), torch.zeros(bsz, self.num_heads, 1, key_states.shape[-1]).to('cuda').to(torch.float16)), dim = -2))
132 | 
133 |         else:
134 |             temp_attn = attn_weights.squeeze(1).transpose(0, 1)
135 |             self.acc = torch.cat((self.acc, torch.zeros(1, bsz * self.num_heads).to('cuda')), dim=0)
136 |             self.acc = self.acc + temp_attn
137 |             kick_ind = self.acc.argmin(dim=0).squeeze()
138 | 
139 |             # reduce accumulated result
140 |             indices = kick_ind.unsqueeze(0)
141 |             self.acc.scatter_(0, indices, self.acc[-1].unsqueeze(0).clone())
142 |             self.acc = self.acc[:-1]
143 | 
144 |             # modify kv cache
145 |             indices = kick_ind.view(-1, 1).expand(-1, self.head_dim).unsqueeze(1)
146 |             key_states.scatter_(1, indices, key_states[:, -1].unsqueeze(1))
147 |             value_states.scatter_(1, indices, value_states[:, -1].unsqueeze(1))
148 |             #key_states = key_states[:, :-1]
149 |             #value_states = value_states[:, :-1]
150 |             self.past_key_value = (key_states.reshape(bsz, self.num_heads, key_states.shape[-2], key_states.shape[-1]),
151 |                               value_states.reshape(bsz, self.num_heads, value_states.shape[-2], value_states.shape[-1]))
152 |             
153 |         self.i += 1
154 |         return attn_output
155 | 
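
The decode branch of `forward` above keeps the cache at a fixed size: it accumulates the new attention weights into `self.acc`, finds the entry with the smallest accumulated score, overwrites it with the newest entry, and drops the tail slot. A minimal single-head sketch of that bookkeeping follows, using plain Python lists as stand-ins for tensors. The function name `h2o_decode_step` and the string entries are hypothetical; this is an illustration under those assumptions, not the repository's implementation.

```python
from typing import List, Tuple


def h2o_decode_step(
    acc: List[float],
    cache: List[str],
    new_attn: List[float],
    new_entry: str,
) -> Tuple[List[float], List[str]]:
    """One decode-time eviction step for a single attention head.

    acc       -- accumulated attention score per cached token
    cache     -- cached entries (stand-ins for key/value rows)
    new_attn  -- attention weights of the new query over cache + [new_entry]
    new_entry -- the entry produced by the current token
    """
    assert len(new_attn) == len(acc) + 1 == len(cache) + 1
    # Add a zero slot for the new token, then accumulate this step's weights.
    scores = [a + w for a, w in zip(acc + [0.0], new_attn)]
    entries = cache + [new_entry]
    # Pick the entry with the smallest accumulated score as the victim.
    victim = min(range(len(scores)), key=scores.__getitem__)
    # Overwrite the victim's slot with the newest entry, then drop the tail,
    # so the cache size stays constant. If the newest entry is itself the
    # victim, it is simply dropped.
    scores[victim] = scores[-1]
    entries[victim] = entries[-1]
    return scores[:-1], entries[:-1]


acc, cache = h2o_decode_step(
    [3.0, 0.25, 2.0], ["k0", "k1", "k2"], [0.25, 0.125, 0.125, 0.5], "k3"
)
print(cache)  # ['k0', 'k3', 'k2']
```

Overwriting in place rather than compacting means the cached entries are no longer in positional order, which matches the scatter-based update in the listing.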


--------------------------------------------------------------------------------
/speedup/uvm/selfattention.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | from torch import nn
  3 | from typing import Tuple
  4 | 
  5 | class SelfAttention(nn.Module):
  6 |     """Multi-headed attention from 'Attention Is All You Need' paper"""
  7 | 
  8 |     def __init__(
  9 |         self,
 10 |         embed_dim: int,
 11 |         num_heads: int,
 12 |         bias: bool = True
 13 |     ):
 14 |         super().__init__()
 15 |         self.embed_dim = embed_dim
 16 |         self.num_heads = num_heads
 17 |         self.head_dim = embed_dim // num_heads
 18 | 
 19 |         if (self.head_dim * num_heads) != self.embed_dim:
 20 |             raise ValueError(
 21 |                 f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
 22 |                 f" and `num_heads`: {num_heads})."
 23 |             )
 24 |         self.scaling = self.head_dim**-0.5
 25 | 
 26 |         self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 27 |         self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 28 |         self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 29 |         self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
 30 |         
 31 |         self.past_key_value = None
 32 |         self.src_s = 0
 33 | 
 34 | 
 35 |     def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
 36 |         return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
 37 | 
 38 |     def forward(
 39 |         self,
 40 |         hidden_states: torch.Tensor,
 41 |     ) -> torch.Tensor:
 42 |         """Input shape: Batch x Time x Channel"""
 43 | 
 44 |         bsz, tgt_len, _ = hidden_states.size()
 45 |         if tgt_len > 1:
 46 |             self.src_s = tgt_len
 47 |         else:
 48 |             self.src_s += 1
 49 | 
 50 |         # get query proj
 51 |         query_states = self.q_proj(hidden_states) * self.scaling
 52 |         
 53 |         # get key/value proj
 54 |         if self.past_key_value is not None:
 55 |             # reuse k, v, self_attention
 56 |             k = self._shape(self.k_proj(hidden_states), -1, bsz).squeeze()
 57 |             v = self._shape(self.v_proj(hidden_states), -1, bsz).squeeze()
 58 |             key_states = self.past_key_value[0, :, :, :self.src_s]
 59 |             key_states[:, :, -1] = k
 60 |             self.past_key_value[0, :, :, self.src_s] = k
 61 |             value_states = self.past_key_value[1, :, :, :self.src_s]
 62 |             value_states[:, :, -1] = v
 63 |             self.past_key_value[1, :, :, self.src_s] = v
 64 |         else:
 65 |             # self_attention
 66 |             self.past_key_value = torch.zeros((2, bsz, self.num_heads, 2048, self.head_dim), dtype=torch.float16, device=torch.device('cuda')) 
 67 |             key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
 68 |             value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
 69 |             self.past_key_value[0, :, :, :tgt_len] = key_states
 70 |             self.past_key_value[1, :, :, :tgt_len] = value_states
 71 | 
 72 |         # update kv cache
 73 |         #self.past_key_value = (key_states, value_states)
 74 | 
 75 |         # reshape
 76 |         proj_shape = (bsz * self.num_heads, -1, self.head_dim)
 77 |         query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
 78 |         key_states = key_states.view(*proj_shape)
 79 |         value_states = value_states.view(*proj_shape)
 80 | 
 81 |         # qkt
 82 |         src_len = key_states.size(1)
 83 |         attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
 84 |         
 85 |         # masking
 86 |         if attn_weights.shape[1] > 1: # prefill
 87 |             mask = torch.triu(torch.ones(attn_weights.shape).to('cuda'), diagonal=1) * -10000
 88 |             attn_weights = attn_weights + mask
 89 | 
 90 |         # softmax
 91 |         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(torch.float16)
 92 | 
 93 |         # sv
 94 |         attn_output = torch.bmm(attn_weights, value_states)
 95 | 
 96 |         # reshape
 97 |         attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
 98 |         attn_output = attn_output.transpose(1, 2)
 99 |         attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim)
100 | 
101 |         attn_output = self.out_proj(attn_output)
102 | 
103 |         return attn_output
104 | 
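
The module above preallocates a fixed 2048-slot key/value buffer and tracks the number of valid entries in `self.src_s`. The sketch below shows the same idea with plain Python lists: one write position per append, a view over the valid prefix, and an explicit capacity check. The class name `PreallocatedCache` and the capacity check are assumptions added for illustration; the listing itself does not guard against overflow.

```python
from typing import Any, List, Optional, Tuple


class PreallocatedCache:
    """Fixed-capacity key/value store with a running length counter."""

    def __init__(self, capacity: int) -> None:
        self.keys: List[Optional[Any]] = [None] * capacity
        self.values: List[Optional[Any]] = [None] * capacity
        self.length = 0  # number of valid entries, analogous to src_s

    def append(self, key: Any, value: Any) -> None:
        # Guard against writing past the preallocated buffer.
        if self.length >= len(self.keys):
            raise IndexError("KV cache is full")
        # Keys and values are written to the same slot.
        self.keys[self.length] = key
        self.values[self.length] = value
        self.length += 1

    def view(self) -> Tuple[List[Any], List[Any]]:
        # Only the valid prefix takes part in attention.
        return self.keys[: self.length], self.values[: self.length]


cache = PreallocatedCache(capacity=4)
cache.append("k0", "v0")
cache.append("k1", "v1")
print(cache.view())  # (['k0', 'k1'], ['v0', 'v1'])
```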


--------------------------------------------------------------------------------
/speedup/uvm/transformer.py:
--------------------------------------------------------------------------------
  1 | import torch
  2 | import transformerlayer
  3 | import time
  4 | 
  5 | import argparse
  6 | parser = argparse.ArgumentParser()
  7 | 
  8 | parser.add_argument(
  9 |     '--embed_dim', type=int, default=4096,
 10 |     help='model dim.'
 11 | )
 12 | parser.add_argument(
 13 |     '--ffn_dim', type=int, default=4096*4,
 14 |     help='ffn dim'
 15 | )
 16 | parser.add_argument(
 17 |     '--enable_bias', action='store_true',
 18 |     help='enable bias for projections'
 19 | )
 20 | parser.add_argument(
 21 |     '--n_head', type=int, default=32,
 22 |     help='num heads'
 23 | )
 24 | parser.add_argument(
 25 |     '--do_layer_norm_before', action='store_true',
 26 |     help='do layernorm before attention/ffn'
 27 | )
 28 | parser.add_argument(
 29 |     '--n_layer', type=int, default=32,
 30 |     help='num layers'
 31 | )
 32 | parser.add_argument(
 33 |     '--bsz', type=int, default=4,
 34 |     help='batch size'
 35 | )
 36 | parser.add_argument(
 37 |     '--prompt_len', type=int, default=2048,
 38 |     help='length of input prompt'
 39 | )
 40 | parser.add_argument(
 41 |     '--gen_len', type=int, default=1024,
 42 |     help='length of output'
 43 | )
 44 | parser.add_argument(
 45 |     '--is_h2o', action='store_true',
 46 |     help='enable h2o'
 47 | )
 48 | parser.add_argument(
 49 |     '--h2o_ratio', type=float, default=0.2,
 50 |     help='ratio of heavy hitter'
 51 | )
 52 | parser.add_argument(
 53 |     '--runs', type=int, default=1,
 54 |     help='number of runs'
 55 | )
 56 | args = parser.parse_args()
 57 | 
 58 | ### Parameters ###
 59 | embed_dim = args.embed_dim
 60 | ffn_dim = args.ffn_dim
 61 | bias = args.enable_bias
 62 | n_head = args.n_head
 63 | do_layer_norm_before = args.do_layer_norm_before
 64 | n_layer = args.n_layer
 65 | 
 66 | bsz = args.bsz
 67 | prompt_len = args.prompt_len
 68 | gen_len = args.gen_len
 69 | ##################
 70 | 
 71 | new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
 72 |         'allocate.so', 'uvm_malloc', 'uvm_free')
 73 | torch.cuda.memory.change_current_allocator(new_alloc)
 74 | 
 75 | ##################
 76 | """
 77 | def prefetch(name):
 78 |   def hook(attn, input, output):
 79 |     ind = output[2]
 80 |     if int(name) < (n_layer - 1) and ind is not None:
 81 |         next_layer = transformer[int(name) + 1]
 82 |         torch.gather(next_layer.past_key_value[0], 2, ind)
 83 |         torch.gather(next_layer.past_key_value[1], 2, ind)
 84 | 
 85 |   return hook
 86 | 
 87 | for layer_num in range(n_layer):
 88 |     attn_in = transformer[layer_num].self_attn.register_forward_hook(prefetch("%d"%(layer_num)))
 89 | """
 90 | #################
 91 | 
 92 | total_prefill = 0.0
 93 | total_decode = 0.0
 94 | for run in range(args.runs):
 95 |     transformer = [transformerlayer.TransformerLayer(embed_dim, ffn_dim, bias, n_head, do_layer_norm_before, args.is_h2o, args.h2o_ratio) for _ in range(n_layer)]
 96 | 
 97 |     prompt = torch.rand(bsz, prompt_len, embed_dim).to(torch.float16).to('cuda')
 98 |     new_input = torch.rand(bsz, 1, embed_dim).to(torch.float16).to('cuda')
 99 | 
100 |     # Warmup
101 |     for i in range(n_layer):
102 |         prompt = transformer[i].forward(prompt)
103 | 
104 |     for i in range(n_layer):
105 |         transformer[i].self_attn.past_key_value = None
106 |         if args.is_h2o:
107 |             transformer[i].self_attn.i = 0
108 |             transformer[i].self_attn.acc = None
109 | 
110 |     start = time.time()
111 |     for i in range(n_layer):
112 |         prompt = transformer[i].forward(prompt)
113 | 
114 |     prefill_time = time.time() - start
115 |     start = time.time()
116 | 
117 |     for k in range(gen_len - 1):
118 |         for i in range(n_layer):
119 |             new_input = transformer[i].forward(new_input)
120 | 
121 |     decode_time = time.time() - start
122 | 
123 |     total_prefill += prefill_time
124 |     total_decode += decode_time
125 | 
126 |     del transformer
127 | 
128 | prefill_time = total_prefill / float(args.runs)
129 | decode_time = total_decode / float(args.runs)
130 | 
131 | print("+++++++++++++++++++++++++++++++++++++++++++++++++")
132 | if args.is_h2o:
133 |     print("UVM + H2O")
134 | else:
135 |     print("UVM")
136 | 
137 | print("input: " + str(prompt_len) + " output: " + str(gen_len) + " bsz: " + str(bsz))
138 | print("+++++++++++++++++++++++++++++++++++++++++++++++++")
139 | print("Total: " + str(prefill_time + decode_time) + " Prefill: " + str(prefill_time) + " Decode: " + str(decode_time))
140 | print("=================================================")
141 | 
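
The benchmark above measures prefill and decode with `time.time()` around the layer loops. CUDA kernels are launched asynchronously, so wall-clock measurements of GPU work are usually bracketed with a synchronization call such as `torch.cuda.synchronize()`. The sketch below shows one way to structure that; the helper `timed` is hypothetical, and it takes the synchronization function as a parameter so it also runs on a machine without a GPU.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(
    fn: Callable[[], T],
    sync: Callable[[], None] = lambda: None,
) -> Tuple[T, float]:
    """Run fn() and return (result, elapsed seconds).

    `sync` is called before starting and before stopping the clock; on a
    GPU one would pass torch.cuda.synchronize (assumed, not imported here).
    """
    sync()  # drain previously queued work so it is not billed to fn
    start = time.perf_counter()
    result = fn()
    sync()  # wait for the work launched by fn to finish
    return result, time.perf_counter() - start


result, elapsed = timed(lambda: sum(range(1000)))
print(result)  # 499500
```

With a GPU this might be called as `timed(run_prefill, sync=torch.cuda.synchronize)`, where `run_prefill` is a hypothetical function wrapping the per-layer loop.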


--------------------------------------------------------------------------------
/speedup/uvm/transformerlayer.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | from torch import nn
 3 | from typing import Tuple
 4 | 
 5 | class TransformerLayer(nn.Module):
 6 |     def __init__(
 7 |             self,
 8 |             embed_dim,
 9 |             ffn_dim,
10 |             bias,
11 |             n_head, 
12 |             do_layer_norm_before,
13 |             is_h2o,
14 |             h2o_ratio
15 |             ):
16 |         super().__init__()
17 |         self.embed_dim = embed_dim
18 |         if is_h2o:
19 |             import h2o_attention
20 |             self.self_attn = h2o_attention.SelfAttention(
21 |                 embed_dim=self.embed_dim,
22 |                 num_heads=n_head,
23 |                 bias=bias,
24 |                 h2o_ratio=h2o_ratio
25 |             )
26 |         else:
27 |             import selfattention
28 |             self.self_attn = selfattention.SelfAttention(
29 |                 embed_dim=self.embed_dim,
30 |                 num_heads=n_head,
31 |                 bias=bias,
32 |             )
33 | 
34 |         self.do_layer_norm_before = do_layer_norm_before
35 |         self.activation_fn = nn.ReLU()
36 | 
37 |         self.self_attn_layer_norm = nn.LayerNorm(
38 |             self.embed_dim, elementwise_affine=True, dtype=torch.float16, device=torch.device('cuda')
39 |         )
40 |         self.fc1 = nn.Linear(self.embed_dim, ffn_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
41 |         self.fc2 = nn.Linear(ffn_dim, self.embed_dim, bias=bias, dtype=torch.float16, device=torch.device('cuda'))
42 |         self.final_layer_norm = nn.LayerNorm(
43 |                 self.embed_dim, elementwise_affine=True, dtype=torch.float16, device=torch.device('cuda')
44 |         )
45 |         
46 |     def forward(
47 |         self,
48 |         hidden_states: torch.Tensor,
49 |     ) -> torch.Tensor:
50 |         
51 |         with torch.no_grad():
52 |             residual = hidden_states
53 | 
54 |             # OPT: 125m, 1.7B, ..., 175B applies layer norm BEFORE attention
55 |             if self.do_layer_norm_before:
56 |                 hidden_states = self.self_attn_layer_norm(hidden_states)
57 | 
58 |             # Self Attention
59 |             hidden_states = self.self_attn(
60 |                 hidden_states=hidden_states
61 |             )
62 |             hidden_states = residual + hidden_states
63 | 
64 |             # 350m applies layer norm AFTER attention
65 |             if not self.do_layer_norm_before:
66 |                 hidden_states = self.self_attn_layer_norm(hidden_states)
67 | 
68 |             # Fully Connected
69 |             hidden_states_shape = hidden_states.shape
70 |             hidden_states = hidden_states.reshape(-1, hidden_states.size(-1))
71 |             residual = hidden_states
72 | 
73 |             # 125m, 1.7B, ..., 175B applies layer norm BEFORE attention
74 |             if self.do_layer_norm_before:
75 |                 hidden_states = self.final_layer_norm(hidden_states)
76 | 
77 |             hidden_states = self.fc1(hidden_states)
78 |             hidden_states = self.activation_fn(hidden_states)
79 | 
80 |             hidden_states = self.fc2(hidden_states)
81 | 
82 |             hidden_states = (residual + hidden_states).view(hidden_states_shape)
83 | 
84 |             # 350m applies layer norm AFTER attention
85 |             if not self.do_layer_norm_before:
86 |                 hidden_states = self.final_layer_norm(hidden_states)
87 | 
88 |         return hidden_states
89 | 


--------------------------------------------------------------------------------