├── README.md
├── __init__.py
├── dataflow_search.py
├── deconv_exhaustive_searcher.py
├── deconv_optimizer.py
├── dnn_optimizer.py
├── dnns
│   ├── GC_net.txt
│   ├── PSMNet.txt
│   ├── flowNetC.txt
│   ├── flowNetS.txt
│   ├── test.txt
│   └── test3d.txt
├── fc_layer.py
├── layer3d_base_method.py
├── layer3d_exhaustive_searcher.py
├── layer3d_optimizer.py
├── layer3d_static_method.py
├── layer_base_method.py
├── layer_exhaustive_searcher.py
├── layer_optimizer.py
├── layer_static_method.py
├── multi_layer_perceptron.py
├── runner.sh
└── runner3d.sh


/README.md:
--------------------------------------------------------------------------------
# Systolic array dataflow optimizer

This is a DNN dataflow optimizer for a particular hardware accelerator, the systolic
array. It finds an optimal or approximately optimal dataflow for a
particular DNN given hardware constraints such as memory bandwidth and SRAM size.
This repository is the artifact of our paper [ASV: Accelerated Stereo Vision System](https://www.cs.rochester.edu/horizon/pubs/micro19-tigris.pdf) published at MICRO 2019.

## What's new

The goals of this optimizer are:

* First, to find a close-to-optimal configuration that minimizes latency and
  reduces data traffic at the same time.
* Second, to explore different search and optimization schemes, so that we can
  show the trade-offs between them.
* Third, to automatically apply a special optimization for deconvolutions in the
  DNN pipeline, with different levels of optimization to explore.

## What's inside

There are two main parts in this framework:
1. The overall framework to drive different optimizers.

   * `dataflow_search.py`
   * `dnn_optimizer.py`

2.
   The layer-level optimizers, which optimize the different kinds of layers.

   * `layer(3d)_base_method.py`
   * `layer(3d)_static_method.py`
   * `layer(3d)_optimizer.py`
   * `layer(3d)_exhaustive_searcher.py`
   * `deconv_exhaustive_searcher.py`

## How to use

Prerequisite packages: scipy and numpy (`sys`, `math`, and `pprint` are part of
the Python standard library). Note that `dnn_optimizer.py` also imports matplotlib.

To see the help information for the dataflow optimizer, run:

```
$ python dataflow_search.py -h
```

Given the configuration in the input options and the DNN you want to optimize,
the optimizer returns a dataflow scheme. The sample DNNs are in the `/dnns`
directory.

A simple example of using this tool on a DNN:
```
$ python dataflow_search.py --dnnfile dnns/flowNetC.txt \
        --model_type 2D \
        --search_method Constrained \
        --split True \
        --bufsize 1572864 \
        --bit_width 16 \
        --memory_bandwidth 25.6 \
        --sa_size 16 \
        --ifmap 960 576 6
```
This will load the DNN from `flowNetC.txt` and search the DNN dataflow
using *constrained optimization*. Use the `--search_method` option to specify
which search method to use. The two main options are `Constrained`, which uses
constrained optimization, and `Exhaustive`, which uses exhaustive search; a third
option, `Combined`, is described in the flag explanation below. This command also
specifies the ifmap size to be 960-576-6 (width-height-channel).

In this command, we also need to provide the hardware configuration. `bufsize` specifies that
the on-chip buffer size is *1572864* bytes. `memory_bandwidth` specifies that the memory
bandwidth is *25.6* GB/s. `sa_size` specifies that the systolic array size is *16-by-16*.
`bit_width` specifies the number of bits used to represent a single number.

For details of other flags, please see the explanation below.
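
One caveat about boolean options such as `--split True` in the command above: `dataflow_search.py` declares them with `type=bool`, and Python's `argparse` then converts the argument text with `bool(...)`, so any non-empty string, including `False`, is parsed as `True`. To disable such an option, leave it off the command line. The sketch below illustrates the behavior on a stand-alone parser; `str2bool` is a hypothetical helper that is not part of this repository and only shows one possible stricter alternative.

```python
import argparse


def str2bool(value):
    """Hypothetical helper (not in this repository): parse boolean text strictly."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


# Declared the same way as the --split option in dataflow_search.py.
as_shipped = argparse.ArgumentParser()
as_shipped.add_argument("--split", type=bool, default=False)

# Assumed alternative declaration using the hypothetical helper above.
stricter = argparse.ArgumentParser()
stricter.add_argument("--split", type=str2bool, default=False)

shipped_true = as_shipped.parse_args(["--split", "True"]).split
# bool("False") is True because the string is non-empty.
shipped_false = as_shipped.parse_args(["--split", "False"]).split
# Omitting the flag is the only way to get False with type=bool.
shipped_unset = as_shipped.parse_args([]).split
strict_false = stricter.parse_args(["--split", "False"]).split

print(shipped_true, shipped_false, shipped_unset, strict_false)
```

With the options as shipped, the practical rule is: pass `--split True` to enable the optimization and omit `--split` entirely to disable it. The same applies to `--static` and `--combine`.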

The dataflow optimizer prints the result in a JSON-like format (a Python dictionary
printed with `pprint`). An example result is shown below:

```
{'dnn': [{'Deconv?': False,                       <--------- DNN architecture
          'ifmap': [960, 576, 6],
          'kernel': [7, 7],
          'out_channel': 64,
          'stride': 2,
          'type': '2D'},
          ......

         {'Deconv?': True,
          'ifmap': [120.0, 72.0, 128],
          'kernel': [5, 5],
          'out_channel': 64,
          'stride': 1,
          'type': '2D'}],
 'dnn_result': [{'data': {'Deconv?': False,       <-------- optimization result
                          'ifmap': [960, 576, 6],
                          'kernel': [7, 7],
                          'ofmap': [480.0, 288.0, 64],
                          'out_channel': 64,
                          'stride': 2,
                          'type': '2D'},
                 'result': {'Bound': 'C',
                            'Tile size': [1.0, 8.0, 5.0],      # tiling schedule
                            'buffer_utilization': 0.7890045166015626,
                            'c_0, w_0, h_0': [64, 120, 115],   # tile size
                            'systolic_array_utilization': 1.0,
                            'total_cycle': 40642560,           # execution cycles
                            'total_transfer': 85029312}},      # DRAM access (in Bytes)
                 ......

                {'data': {'Deconv?': True,
                          'ifmap': [120.0, 72.0, 128],
                          'kernel': [5, 5],
                          'out_channel': 64,
                          'stride': 1,
                          'type': '2D'},
                 'result': [{'Bound': 'C',
                             'Tile size': [1.0, 2.0, 1.0],     # tiling schedule
                             'buffer_utilization': 0.6793619791666666,
                             'c_0, w_0, h_0': [64, 60, 72],    # tile size
                             'systolic_array_utilization': 1.0,
                             'total_cycle': 6912000,           # execution cycles
                             'total_transfer': 2690048}]}],    # DRAM access (in Bytes)
 'method': 'Constrained',
 'schedule': {'combine': True, 'split': True, 'static': False}, # optimization flags
 'system_info': {'bit_width': 16.0,
                 'bufsize': 1572864.0,
                 'memory_bandwidth': 25.6,
                 'sa_size': 16.0}}
```

The result contains the following fields:
* `method` : the method you specified for dataflow search.
* `schedule` : the optimization options you specified for deconvolution.
  Please check out our paper for more details on this.
* `system_info` : the hardware configuration.
* `dnn` : the architecture of your DNN.
* `dnn_result` : the optimization result for your DNN.

You can run more examples of dataflow optimization with `runner.sh`.

### How to specify a DNN configuration

We provide a simple way to specify the configuration (or architecture) of
each DNN layer; examples are in `/dnns`.

The layer parameters are separated by `,`. The order of the specification is:
ofmap channels, kernel height, kernel width, stride, and a flag indicating whether
the layer is a deconvolution layer.

A simple example for a 2D DNN is shown below:
```
# ofmap channels, kernel height, kernel width, stride, deconv?
64,7,7,2,False
128,5,5,2,False
...

128,5,5,1,True
64,5,5,1,True
```

A simple example for a 3D DNN is shown here:
```
# ofmap channels, kernel height, kernel width, kernel depth, stride, deconv?
32,3,3,3,1,False
32,3,3,3,1,False
...

64,3,3,3,1,True
32,3,3,3,1,True
```

## Explanation of each option flag

The flags fall into three groups:

1. input and output files
  * `--dnnfile` : the DNN file to optimize.
  * `--outfile` : the file to dump all the results.

2. search options
  * `--static` : enable static partitioning of the buffer. This flag fixes the
    SRAM partition and optimizes the entire DNN dataflow under that partition;
    you also need to specify `--buffer_partition`.
  * `--split` : enable our special optimization that splits a regular
    deconvolution kernel into small sub-kernels and effectively avoids redundant
    computation.
  * `--combine` : enable interleaving the computation of the split
    sub-kernels during convolution.
  * `--model_type` : DNN model convolution type: 2D or 3D.
  * `--ifmap` : the initial ifmap dimension, i.e., the input image, order: [W H C]
  * `--ifmap3d` : the initial ifmap dimension in a 3D DNN, order: [W H D C]
  * `--buffer_partition` : the buffer partition used with `--static`, order: [I O W]
  * `--search_method` : there are three search options: "Constrained",
    "Exhaustive", and "Combined". "Constrained" uses constrained optimization.
    "Exhaustive" is an exhaustive search combined with DP.
    "Combined" uses a static partition to set the initial guess values and
    then runs constrained optimization.

3. other hardware configurations
  * `--bufsize` : the buffer size, or SRAM size, in bytes, e.g. 1048576.0.
  * `--memory_bandwidth` : the DRAM bandwidth in GB/s, e.g. 25.6.
  * `--sa_size` : the systolic array dimension, e.g. 16 means a 16-by-16
    systolic array.
  * `--bit_width` : bit width of each value (typically 8-bit, 16-bit, or 32-bit).

## Citing

This project is an artifact of our 2019 MICRO paper:

Y. Feng, P. Whatmough, and Y. Zhu, "ASV: Accelerated Stereo Vision System", In Proc. of MICRO, 2019.

Please kindly consider citing this paper in your publications if it helps your research.
```
@inproceedings{yu2019asv,
  title={ASV: Accelerated Stereo Vision System},
  author={Feng, Yu and Whatmough, Paul and Zhu, Yuhao},
  booktitle={Proceedings of the 52nd International Symposium on Microarchitecture},
  year={2019},
  organization={ACM}
}
```

--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
import layer_base_method


--------------------------------------------------------------------------------
/dataflow_search.py:
--------------------------------------------------------------------------------

import argparse
import numpy as np
import scipy
import sys
import pprint
import json

# import my own modules
import dnn_optimizer

# setup the argument parser
argparser = argparse.ArgumentParser("dataflow_search.py")
# input dnn file and output result
argparser.add_argument("--dnnfile", required=True)
argparser.add_argument("--outfile", help="output file to dump all the results")

# other search options
# NOTE: with type=bool, argparse treats any non-empty string (even "False")
# as True; omit the flag to keep the default of False.
argparser.add_argument("--static", type=bool, default=False,
                       help="statically partition the buffer without dynamically changing it")
argparser.add_argument("--split", type=bool, default=False,
                       help="enable splitting the convolution kernel into small sub-kernels")
argparser.add_argument("--combine", type=bool, default=False,
                       help="enable combining the sub-kernels during compute")
argparser.add_argument("--model_type", default="2D", choices=["2D", "3D"],
                       help="DNN model convolution type: 2D or 3D.")
argparser.add_argument("--ifmap", nargs="+", type=int, required=False,
                       help="the ifmap dimension, order: [W H C]")
argparser.add_argument("--ifmap3d", nargs="+", type=int, required=False,
                       help="the ifmap dimension, order: [W H D C]")
argparser.add_argument("--buffer_partition", nargs="+", type=float, 32 | help="the ifmap dimemsion, order: [I O W]") 33 | argparser.add_argument("--search_method", default="Constrained", 34 | choices=["Constrained", "Exhaustive", "Combined"], 35 | help="Dataflow search methods: constraint optoimization" 36 | ", exhaustive search or combining both.") 37 | 38 | # other hardware configurations 39 | argparser.add_argument("--bufsize", type=float, default=1048576.0*1.5, 40 | help="in Btyes") 41 | argparser.add_argument("--memory_bandwidth", type=float, default=6.4*4, 42 | help="in GB/s") 43 | argparser.add_argument("--sa_size", type=float, default=16, 44 | help="Systolic array size") 45 | argparser.add_argument("--bit_width", type=float, default=16, 46 | help="Bit Width of each value (typically, 8-bit, 16-bit, 32-bit)") 47 | 48 | 49 | args = argparser.parse_args() 50 | 51 | # import dnn network descrtiption into the system; 52 | # the format for one 2D DNN layer is: 53 | # (width, height, in_channel, out_channel, 54 | # kenrel_width, kernel_height, stride, Deconv?) 55 | # 56 | # for 3D DNN layer is 57 | # (width, height, disparity, in_channel, out_channel, 58 | # kenrel_width, kernel_height, kernel_disp, stride, Deconv?) 59 | # 60 | # for MLP is 61 | # [in_channel, out_channel] 62 | def import_dnn(filename, ifmap_dim, ifmap3d_dim): 63 | # a list to store the dnn configuration 64 | dnn = [] 65 | weight_dim = [] 66 | 67 | # The weight input format as follows: 68 | # [out_channel,kenrel_width,kernel_height,stride,Deconv?] 69 | for line in open(filename): 70 | if len(line) <= 1: 71 | continue 72 | ls = line.strip().split(",") 73 | 74 | if len(ls) == 5: 75 | dnn.append({"ifmap" : ifmap_dim, 76 | "out_channel" : int(ls[0]), 77 | "kernel" : [int(ls[1]), int(ls[2])], 78 | "stride" : int(ls[3]), 79 | "Deconv?" 
: ls[4] == "True", 80 | "type" : "2D"}) 81 | 82 | prev_layer = dnn[-1] 83 | 84 | if prev_layer["Deconv?"]: 85 | # increase the deconv ofmap by two, as default, 86 | # we only consider stride of 1 87 | ifmap_dim = [ifmap_dim[0]*2/prev_layer["stride"], \ 88 | ifmap_dim[1]*2/prev_layer["stride"], \ 89 | prev_layer["out_channel"]] 90 | else: 91 | # if it is Conv, scale down the ifmap dimemsion by stride; 92 | ifmap_dim = [ifmap_dim[0]/prev_layer["stride"], \ 93 | ifmap_dim[1]/prev_layer["stride"], \ 94 | prev_layer["out_channel"]] 95 | 96 | elif len(ls) == 4: 97 | dnn.append({ 98 | "ifmap" : [int(ls[0])*int(ls[1]), int(ls[2])], 99 | "out_channel" : int(ls[3]), 100 | "type" : "MLP", 101 | "Deconv?" : False, 102 | "stride" : 1 103 | }) 104 | elif len(ls) == 3: 105 | dnn.append({ 106 | "ifmap" : [int(ls[1])], 107 | "in_channel" : int(ls[1]), 108 | "out_channel" : int(ls[2]), 109 | "num_of_layer" : int(ls[0]), 110 | "type" : "FC", 111 | "Deconv?" : False, 112 | "stride" : 1 113 | }) 114 | else: 115 | dnn.append({"ifmap" : ifmap3d_dim, 116 | "out_channel" : int(ls[0]), 117 | "kernel" : [int(ls[1]), int(ls[2]), int(ls[3])], 118 | "stride" : int(ls[4]), 119 | "Deconv?" : ls[5] == "True", 120 | "type" : "3D"}) 121 | 122 | prev_layer = dnn[-1] 123 | 124 | if prev_layer["Deconv?"]: 125 | # increase the deconv ofmap by two, as default, 126 | # we only consider stride of 1 127 | ifmap3d_dim = [ifmap3d_dim[0]*2/prev_layer["stride"], \ 128 | ifmap3d_dim[1]*2/prev_layer["stride"], \ 129 | ifmap3d_dim[2]*2/prev_layer["stride"], \ 130 | prev_layer["out_channel"]] 131 | else: 132 | # if it is Conv, scale down the ifmap dimemsion by stride; 133 | ifmap3d_dim = [ifmap3d_dim[0]/prev_layer["stride"], \ 134 | ifmap3d_dim[1]/prev_layer["stride"], \ 135 | ifmap3d_dim[2]/prev_layer["stride"], \ 136 | prev_layer["out_channel"]] 137 | 138 | return dnn 139 | 140 | # The hardware constraints are: 141 | # 1. the on-chip buffer size; 142 | # 2. the memory bandwidth; (Unit in bytes/cycle) 143 | # 3. 
the systolic array size; 144 | def hardware_constraints(sa_size=16.0, mem_bw=6.4*4, buf=1048576.0*1.5, bit_width=16.0): 145 | systolic_arr_size = sa_size; 146 | memory_bandwidth = mem_bw; 147 | buffer_size = buf; 148 | return [systolic_arr_size, memory_bandwidth, buffer_size, bit_width] 149 | 150 | def system_config(args, meta_data): 151 | # set up the search methods 152 | meta_data["method"] = args.search_method 153 | # setup the system configuration; 154 | meta_data["schedule"] = {} 155 | meta_data["schedule"]["static"] = args.static 156 | meta_data["schedule"]["split"] = args.split 157 | meta_data["schedule"]["combine"] = args.combine 158 | if args.buffer_partition: 159 | meta_data["buffer_partition"] = args.buffer_partition 160 | 161 | # setup the system; 162 | meta_data["system_info"] = {} 163 | meta_data["system_info"]["bufsize"] = args.bufsize 164 | meta_data["system_info"]["memory_bandwidth"] = args.memory_bandwidth 165 | meta_data["system_info"]["sa_size"] = args.sa_size 166 | meta_data["system_info"]["bit_width"] = args.bit_width 167 | 168 | return meta_data 169 | 170 | def calculate_overall_performance(meta_data): 171 | res = { 172 | "total_cycle" : 0.0, 173 | "each_layer_cycle" : [], 174 | "total_transfer" : 0.0, 175 | "each_layer_traffic" : [], 176 | } 177 | for data in meta_data["dnn_result"]: 178 | if isinstance(data['result'], list): 179 | for item in data['result']: 180 | res["total_cycle"] += item["total_cycle"] 181 | res["each_layer_cycle"].append(item["total_cycle"]) 182 | res["total_transfer"] += item["total_transfer"] 183 | res["each_layer_traffic"].append(item["total_transfer"]) 184 | else: 185 | res["total_cycle"] += data['result']["total_cycle"] 186 | res["each_layer_cycle"].append(data['result']["total_cycle"]) 187 | res["total_transfer"] += data['result']["total_transfer"] 188 | res["each_layer_traffic"].append(data['result']["total_transfer"]) 189 | 190 | return res 191 | 192 | 193 | if __name__== "__main__": 194 | # initialize the 
result data; 195 | meta_data = {} 196 | 197 | # setup system configuration; 198 | meta_data = system_config(args, meta_data) 199 | 200 | # import the dnn 201 | dnn = import_dnn(args.dnnfile, args.ifmap, args.ifmap3d) 202 | meta_data["dnn"] = dnn 203 | hw_constraints = hardware_constraints(sa_size=args.sa_size, 204 | mem_bw=args.memory_bandwidth, 205 | buf=args.bufsize, 206 | bit_width=args.bit_width) 207 | 208 | # start the optimization main routine 209 | meta_data["dnn_result"] = dnn_optimizer.opti_dnn(meta_data, hw_constraints) 210 | 211 | meta_data["overall_result"] = calculate_overall_performance(meta_data) 212 | 213 | pprint.pprint(meta_data) 214 | -------------------------------------------------------------------------------- /deconv_exhaustive_searcher.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | # my own module 8 | from layer_base_method import * 9 | 10 | ############################################################### 11 | # general process # 12 | ############################################################### 13 | class DeconvExhaustiveSearcher(LayerBaseMethod): 14 | 15 | # array to store the result from the four different results 16 | rets = [] 17 | 18 | """docstring for LayerExhaustiveSearcher""" 19 | def __init__(self, data, sys_info): 20 | super(DeconvExhaustiveSearcher, self).__init__(data, sys_info) 21 | self.rets = [] 22 | 23 | 24 | # compute buffer utilization 25 | def buffer_utilization(self, x, area): 26 | # buffer = ofmap + weights + ifmap 27 | total_buffer = self.Ci*(self.S*area[0]+2)*(self.S*area[1]+2) 28 | for i in range(len(x)): 29 | total_buffer += x[i]*area[0]*area[1]+ \ 30 | self.Ci*self.Subs[i][0]*self.Subs[i][1]*x[i] 31 | 32 | return total_buffer 33 | 34 | def compute_bound_cycle(self, i, util_rate, c_0): 35 | # total number of ops 36 | total_computation = (self.H*self.W*c_0)*\ 37 | 
(self.Ci*self.Subs[i][0]*self.Subs[i][1]) 38 | 39 | # systolic array calculation capacity 40 | comp_cap = (self.A*self.A) * util_rate 41 | 42 | return total_computation / comp_cap 43 | 44 | 45 | def process_parameter(self, x, area): 46 | area = list(map(lambda i: math.floor(i), area)) 47 | w_0 = min(self.W/math.ceil(self.W/round(area[0])), self.W) 48 | h_0 = min(self.H/math.ceil(self.H/round(area[1])), self.H) 49 | 50 | total_cycle = 0 51 | 52 | # calculate the total data transfer 53 | ifmap_tile_size = (self.S*h_0+2)*(self.S*w_0+2)*self.Ci 54 | 55 | # calculate the total batch 56 | total_batch = self.H*self.W/(h_0*w_0) 57 | 58 | # ifmap transfer 59 | total_transfer = ifmap_tile_size * total_batch 60 | 61 | util_sys_arr = 0 62 | util_cnt = 0 63 | 64 | for i in range(len(x)): 65 | if (round(x[i]) == 0): 66 | continue 67 | 68 | # compute the total number of elements needed to be updated in row-major. 69 | # ofmap and ifmap tile size 70 | ofmap_tile_size = h_0*w_0*x[i] 71 | 72 | # weight tile size 73 | kernel_tile_size = self.Subs[i][0]*self.Subs[i][0]*self.Ci*x[i] 74 | total_transfer += kernel_tile_size + ofmap_tile_size 75 | 76 | # compute the utilization of systolic array 77 | util_sys_arr += self.systolic_array_utilization(x[i], area) 78 | util_cnt += 1 79 | 80 | # compute the cycle for compute-/memory-bound 81 | comp_bound_cycle = self.compute_bound_cycle(i, util_sys_arr, x[i]) 82 | mem_bound_cycle = total_transfer/self.B 83 | 84 | # pick up the greater value as the actual cycle 85 | total_cycle += max(comp_bound_cycle, mem_bound_cycle) 86 | 87 | if (util_cnt > 0): 88 | util_sys_arr = util_sys_arr/util_cnt 89 | 90 | return (total_cycle, total_transfer, util_sys_arr) 91 | 92 | def fill_bufw(self, remain_subkernels): 93 | x0 = [0]*len(self.data["sub-kernels"]) 94 | sum_subs = 0 95 | for i in range(len(self.data["sub-kernels"])): 96 | sub_size = self.Subs[i][0]*self.Subs[i][1] 97 | # first, let's find the number of kernel we can put into buffer. 
98 | while sum_subs < self.bufw_size \ 99 | and x0[i] < remain_subkernels[i]: 100 | x0[i] = x0[i]+self.A 101 | sum_subs += self.A*sub_size*self.Ci 102 | 103 | if x0[i] > remain_subkernels[i]: 104 | x0[i] = remain_subkernels[i] 105 | 106 | return x0 107 | 108 | # heuristically decide the area dimenion. [W, H] 109 | def area_dimension(self, area): 110 | if area >= self.W * self.H: 111 | return [self.W, self.H] 112 | 113 | if math.sqrt(area) > self.H: 114 | tile_w = math.ceil(self.W/math.sqrt(area)) 115 | return [self.W/tile_w, self.H] 116 | 117 | tile_w = math.ceil(self.W/math.sqrt(area)) 118 | tile_h = math.ceil(self.H/math.sqrt(area)) 119 | return [self.W/tile_w, self.H/tile_h] 120 | 121 | # the main optimization routine; 122 | def opti_buffer(self): 123 | # check if the initial configuration can hold the minimum requirements 124 | if ((self.A*self.K_h*self.K_w*self.Ci > self.bufw_size) or 125 | (self.S*self.S*self.A*self.Ci > self.bufi_size)): 126 | return 127 | 128 | total_cycle = 0 129 | total_transfer = 0 130 | remain_subkernels = [self.data["out_channel"]]*len(self.data["sub-kernels"]) 131 | 132 | # set tile area; 133 | area = 0 134 | # next let's see how much ifmap can we fit into the buffer. 135 | while self.S*self.S*(area+self.A)*self.Ci < self.bufi_size: 136 | area = area+self.A 137 | 138 | round_result = [] 139 | result_cache = {} 140 | 141 | # no need to optimize the buffer for ofmap, because it is 142 | # bounded ifmap. 
143 | x1 = self.area_dimension(area) 144 | 145 | while not all([sub <= 0.0 for sub in remain_subkernels]): 146 | 147 | # set the initial guess; 148 | x0 = self.fill_bufw(remain_subkernels) 149 | 150 | util_buf = self.buffer_utilization(x0, x1)/self.buf_size 151 | 152 | # print(util_buf, x1, x0) 153 | if util_buf > 1.01: 154 | return 155 | 156 | (cycle, transfer, util_rate) = self.process_parameter(x0, x1) \ 157 | if str(x0 + x1) not in result_cache else result_cache[str(x0 + x1)] 158 | 159 | result_cache[str(x0 + x1)] = (cycle, transfer, util_rate) 160 | 161 | if cycle == -1 or transfer == -1: 162 | return 163 | 164 | total_transfer += transfer 165 | total_cycle += cycle 166 | 167 | remain_subkernels = np.subtract(remain_subkernels, x0) 168 | 169 | round_result.append({"kernels" :x0, 170 | "tiles" : x1, 171 | "systolic array utilization" : util_rate}) 172 | 173 | ret = { 174 | "total_transfer": round(total_transfer), 175 | "total_cycle": round(total_cycle), 176 | "partition" : { 177 | "bufi_size" : round(self.bufi_size), 178 | "bufw_size" : round(self.bufw_size), 179 | "bufo_size" : round(self.bufo_size), 180 | }, 181 | "round_result" : round_result, 182 | } 183 | self.rets.append(ret) 184 | 185 | # optimize one layer 186 | def optimize(self): 187 | self.init_setup() 188 | 189 | layer_info = self.data 190 | add_one = [(i+1)/2 for i in layer_info["kernel"]] 191 | sub_one = [i/2 for i in layer_info["kernel"]] 192 | self.data["sub-kernels"] = [ 193 | [add_one[0], add_one[1]], 194 | [add_one[0], sub_one[1]], 195 | [sub_one[0], add_one[1]], 196 | [sub_one[0], sub_one[1]]] 197 | 198 | self.Subs = self.data["sub-kernels"] 199 | 200 | # print("##[LAYER]##", self.W, self.H, self.Ci, self.Co, self.K_w, self.K_h) 201 | 202 | for i in range(1, 20): 203 | self.bufi_size = self.buf_size*i/20.0 204 | for j in range(1, 20): 205 | self.bufw_size = self.buf_size*j/20.0 206 | 207 | self.res = [] 208 | # if sum of bufi and bufw is over the self.buf_size 209 | # we should skip it. 
210 | if (self.bufi_size + self.bufw_size) > self.buf_size: 211 | continue 212 | 213 | # set ofmap size 214 | self.bufo_size = self.buf_size - self.bufi_size - self.bufw_size 215 | # both cases are possible; 216 | self.opti_buffer() 217 | 218 | 219 | ret = dict(self.rets[0]) 220 | 221 | for item in self.rets: 222 | if ret["total_cycle"] > item["total_cycle"]: 223 | ret = dict(item) 224 | if ret["total_cycle"] == item["total_cycle"] and \ 225 | ret["total_transfer"] > item["total_transfer"]: 226 | ret = dict(item) 227 | 228 | return ret 229 | -------------------------------------------------------------------------------- /deconv_optimizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | # my own module 8 | from layer_base_method import * 9 | 10 | ############################################################### 11 | # general process # 12 | ############################################################### 13 | class DeconvOptimizer(DeconvOptimizer): 14 | 15 | # array to store the result from the four different results 16 | rets = [] 17 | 18 | """docstring for LayerExhaustiveSearcher""" 19 | def __init__(self, data): 20 | super(DeconvOptimizer, self).__init__(data) 21 | self.rets = [] 22 | 23 | # compute buffer utilization 24 | def buffer_utilization(self, x, area): 25 | # buffer = ofmap + weights + ifmap 26 | total_buffer = self.Ci*(self.S*area[0]+2)*(self.S*area[1]+2) 27 | for i in range(len(x)): 28 | total_buffer += x[i]*area[0]*area[1]+ \ 29 | self.Ci*self.Subs[i][0]*self.Subs[i][1]*x[i] 30 | 31 | return total_buffer 32 | 33 | def process_parameter(self, x, area): 34 | area = list(map(lambda i: math.floor(i), area)) 35 | w_0 = min(self.W/math.ceil(self.W/round(area[0])), self.W) 36 | h_0 =min(self.H/math.ceil(self.H/round(area[1])), self.H) 37 | 38 | total_cycle = 0 39 | 40 | for i in range(len(x)): 41 | if (round(x[i]) == 0): 42 | 
continue 43 | # make the tile size even for every batch 44 | c_0 = min(self.Co/math.ceil(self.Co/round(x[i])), self.Co) 45 | 46 | # check the result 47 | # print(c_0, w_0, h_0, self.Co/c_0, self.W/w_0, self.H/h_0) 48 | # compute the total number of elements needed to be updated 49 | # if it is row-major. 50 | # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0 51 | total_transfer = (h_0*w_0*c_0+(self.S*h_0+2)*(self.S*w_0+2)*self.Ci) \ 52 | *self.Subs[i][0]*self.Subs[i][0]/(h_0*w_0) \ 53 | +(h_0*w_0*c_0+self.Subs[i][0]*self.Subs[i][0]*self.Ci*c_0) 54 | 55 | # compute the utilization of systolic array 56 | util_sys_arr = x[i]/(math.ceil(x[i]/self.A)*self.A) \ 57 | *area[0]*area[1]/(math.ceil(area[0]*area[1]/self.A)*self.A) 58 | 59 | # # compute the utilization of systolic array 60 | # util_buf += self.buffer_utilization([c_0, w_0, h_0])/self.buffer_size 61 | 62 | # if util_buf > 1.01: 63 | # return (-1, -1) 64 | 65 | comp_bound_cycle = (self.H*self.W*c_0)*(self.Ci*self.Subs[i][0]*self.Subs[i][0])\ 66 | /(self.A*self.A)/util_sys_arr 67 | 68 | mem_bound_cycle = total_transfer/self.B 69 | 70 | total_cycle += max(comp_bound_cycle, mem_bound_cycle) 71 | 72 | # print comp_bound_cycle, mem_bound_cycle 73 | 74 | return (total_cycle, total_transfer) 75 | 76 | # # print(x[0],(math.ceil(x[0]/A)*A), x[1]*x[2], (math.ceil(x[1]*x[2]/A)*A)) 77 | # ret = { 78 | # "total_transfer": round(total_transfer), 79 | # "total_cycle": round(total_cycle), 80 | # "systolic_array_utilization": util_sys_arr, 81 | # "buffer_utilization": util_buf, 82 | # "c_0, w_0, h_0": [c_0, w_0, h_0], 83 | # "Tile size" : [self.Co/c_0, self.W/w_0, self.H/h_0], 84 | # "Bound" : bound 85 | # } 86 | # self.res.append(ret) 87 | # return total_cycle 88 | 89 | def fill_bufw(self, remain_subkernels): 90 | x0 = [0]*len(self.data["sub-kernels"]) 91 | sum_subs = 0 92 | for i in range(len(self.data["sub-kernels"])): 93 | sub_size = self.Subs[i][0]*self.Subs[i][1] 94 | # first, let's find the number of kernel we can put 
into buffer. 95 | while sum_subs < self.bufw_size \ 96 | and x0[i] < remain_subkernels[i]: 97 | x0[i] = x0[i]+self.A 98 | sum_subs += self.A*sub_size*self.Ci 99 | 100 | return x0 101 | 102 | # the main optimization routine; 103 | def opti_buffer(self): 104 | # check if the initial configuration can hold the minimum requirements 105 | if ((self.A*self.K_h*self.K_w*self.Ci > self.bufw_size) or 106 | (self.S*self.S*self.A*self.Ci > self.bufi_size)): 107 | return 108 | 109 | total_cycle = 0 110 | total_transfer = 0 111 | remain_subkernels = [self.data["out_channel"]]*len(self.data["sub-kernels"]) 112 | 113 | # set tile area; 114 | area = 0 115 | # next let's see how much ifmap can we fit into the buffer. 116 | while self.S*self.S*(area+self.A)*self.Ci < self.bufi_size: 117 | area = area+self.A 118 | 119 | round_result = [] 120 | result_cache = {} 121 | while not all([sub == 0 for sub in remain_subkernels]): 122 | # set the initial guess; 123 | x0 = self.fill_bufw(remain_subkernels) 124 | 125 | # no need to optimize the buffer for ofmap, because it is 126 | # bounded ifmap. 
127 | x1 = [math.sqrt(area), math.sqrt(area)] 128 | 129 | util_buf = self.buffer_utilization(x0, x1)/self.buffer_size 130 | 131 | # print(util_buf, x1, x0) 132 | if util_buf > 1.01: 133 | return 134 | 135 | (cycle, transfer) = self.process_parameter(x0, x1) \ 136 | if str(x0 + x1) not in result_cache else result_cache[str(x0 + x1)] 137 | 138 | result_cache[str(x0 + x1)] = (cycle, transfer) 139 | 140 | if cycle == -1 or transfer == -1: 141 | return 142 | 143 | total_transfer += transfer 144 | total_cycle += cycle 145 | 146 | remain_subkernels = np.subtract(remain_subkernels, x0) 147 | 148 | round_result.append(x0) 149 | 150 | ret = { 151 | "total_transfer": round(total_transfer), 152 | "total_cycle": round(total_cycle), 153 | "partition" : { 154 | "bufi_size" : round(self.bufi_size), 155 | "bufw_size" : round(self.bufw_size), 156 | "bufo_size" : round(self.bufo_size), 157 | }, 158 | "round_result" : round_result, 159 | } 160 | self.rets.append(ret) 161 | 162 | # optimize one layer 163 | def optimize(self): 164 | global SysArr, Bandwith, BufferSize 165 | 166 | self.res = [] 167 | layer_info = self.data 168 | # set up the new layer information 169 | [self.W, self.H, self.Ci] = layer_info["ifmap"] 170 | self.Co = layer_info["out_channel"] 171 | [self.K_w, self.K_h] = layer_info["kernel"] 172 | self.S = layer_info["stride"] 173 | 174 | add_one = [(i+1)/2 for i in layer_info["kernel"]] 175 | sub_one = [i/2 for i in layer_info["kernel"]] 176 | self.data["sub-kernels"] = [ 177 | [add_one[0], add_one[1]], 178 | [add_one[0], sub_one[1]], 179 | [sub_one[0], add_one[1]], 180 | [sub_one[0], sub_one[1]]] 181 | 182 | self.Subs = self.data["sub-kernels"] 183 | 184 | # print("##[LAYER]##", self.W, self.H, self.Ci, self.Co, self.K_w, self.K_h) 185 | 186 | for i in range(1, 20): 187 | self.bufi_size = BufferSize*i/20.0 188 | for j in range(1, 20): 189 | self.bufw_size = BufferSize*j/20.0 190 | 191 | self.res = [] 192 | # if sum of bufi and bufw is over the BufferSize 193 | # we 
should skip it. 194 | if (self.bufi_size + self.bufw_size) > BufferSize: 195 | continue 196 | 197 | # set ofmap size 198 | self.bufo_size = BufferSize - self.bufi_size - self.bufw_size 199 | # both cases are possible; 200 | self.opti_buffer() 201 | 202 | 203 | ret = dict(self.rets[0]) 204 | 205 | for item in self.rets: 206 | if ret["total_cycle"] > item["total_cycle"]: 207 | ret = dict(item) 208 | if ret["total_cycle"] == item["total_cycle"] and \ 209 | ret["total_transfer"] > item["total_transfer"]: 210 | ret = dict(item) 211 | 212 | return ret 213 | -------------------------------------------------------------------------------- /dnn_optimizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | import matplotlib 3 | import matplotlib.pyplot as plt 4 | import matplotlib.patches as patches 5 | from matplotlib.ticker import FuncFormatter 6 | import numpy as np 7 | import scipy 8 | import sys 9 | 10 | 11 | # import my own modules 12 | import layer_optimizer 13 | import layer_static_method 14 | import layer_exhaustive_searcher 15 | import deconv_exhaustive_searcher 16 | from multi_layer_perceptron import MultiLayerPerceptron 17 | from fc_layer import FullyConnectedLayer 18 | 19 | import layer3d_optimizer 20 | import layer3d_exhaustive_searcher 21 | 22 | method = None 23 | buffer_partition = None 24 | enable = { 25 | "static" : False, 26 | "combine" : False, 27 | "split" : False, 28 | } 29 | 30 | def setup(meta_data, hardware_constraints): 31 | global enable, method, buffer_partition 32 | # define the search method 33 | method = meta_data["method"] 34 | 35 | if meta_data["schedule"]["static"] and \ 36 | "buffer_partition" not in meta_data: 37 | raise Exception("The static scheduling is not supported" 38 | " without specifying the buffer partition.") 39 | 40 | if "buffer_partition" in meta_data: 41 | buffer_partition = meta_data["buffer_partition"] 42 | 43 | # set the schedule policy 44 | enable["static"] = 
meta_data["schedule"]["static"] 45 | enable["combine"] = meta_data["schedule"]["combine"] 46 | enable["split"] = meta_data["schedule"]["split"] 47 | 48 | def single_layer_optimization(data, sys_info): 49 | global method, enable, buffer_partition 50 | if data["type"] == "MLP": 51 | return MultiLayerPerceptron(data, sys_info).optimize() 52 | 53 | if data["type"] == "FC": 54 | return FullyConnectedLayer(data, sys_info).optimize() 55 | 56 | # if "static" option is enabled, it will be prioritized 57 | if enable["static"]: 58 | return layer_static_method.\ 59 | LayerStaticMethod(data, sys_info, buffer_partition).optimize() 60 | 61 | # check the potential method we use here. 62 | if method == "Constrained": 63 | if data["type"] == "2D": 64 | return layer_optimizer.\ 65 | LayerOptimizer(data, sys_info).optimize() 66 | else: 67 | return layer3d_optimizer.\ 68 | Layer3dOptimizer(data, sys_info).optimize() 69 | elif method == "Exhaustive": 70 | if data["type"] == "2D": 71 | return layer_exhaustive_searcher.\ 72 | LayerExhaustiveSearcher(data, sys_info).optimize() 73 | else: 74 | return layer3d_exhaustive_searcher.\ 75 | Layer3dExhaustiveSearcher(data, sys_info).optimize() 76 | elif method == "Combined": 77 | return layer_optimizer.\ 78 | LayerOptimizer(data, sys_info, True).optimize() 79 | else: 80 | raise Exception("Unknown search method: {}".format(method)) 81 | 82 | def single_combine_optimization(data, sys_info): 83 | global method 84 | if method == "Constrained": 85 | return layer_optimizer.\ 86 | LayerOptimizer(data, sys_info).optimize() 87 | elif method == "Exhaustive": 88 | return deconv_exhaustive_searcher.\ 89 | DeconvExhaustiveSearcher(data, sys_info).optimize() 90 | elif method == "Combined": 91 | return layer_optimizer.\ 92 | LayerOptimizer(data, sys_info, True).optimize() 93 | else: 94 | raise Exception("Unknown search method: {}".format(method)) 95 | 96 | def sub_kernel_sizes(layer): 97 | add_one = [(i+1)/2 for i in layer["kernel"]] 98 | sub_one = [i/2 for i in 
layer["kernel"]] 99 | 100 | sizes = [[]] 101 | for i in range(len(layer["kernel"])): 102 | tmp = [] 103 | for j in sizes: 104 | e1 = list(j) + [add_one[i]] 105 | e2 = list(j) + [sub_one[i]] 106 | tmp += [e1, e2] 107 | 108 | sizes = tmp 109 | 110 | return sizes 111 | 112 | def single_split_optimization(layer, sys_info): 113 | subs = [] 114 | 115 | # iterate over all possible sub-kernels 116 | for sub_size in sub_kernel_sizes(layer): 117 | sub = dict(layer) 118 | sub["kernel"] = sub_size 119 | subs.append(single_layer_optimization(sub, sys_info)) 120 | 121 | return subs 122 | 123 | def opti_deconv(layer, sys_info): 124 | global method, enable 125 | # collect the individual results from the sub-kernels 126 | subs = [] 127 | 128 | # if the convolution size is odd; 129 | if layer["kernel"][0]%2 == 1: 130 | if enable["combine"]: 131 | subs.append(single_combine_optimization(layer, sys_info)) 132 | else: 133 | subs = single_split_optimization(layer, sys_info) 134 | # if the convolution size is even; 135 | else: 136 | sub = dict(layer) 137 | sub["kernel"][0] = sub["kernel"][0]/2 138 | sub["kernel"][1] = sub["kernel"][1]/2 139 | if enable["combine"]: 140 | # this will consider four same-size sub-kernels 141 | # as one sub-kernel with more channels 142 | sub["out_channel"] = sub["out_channel"]*4 143 | subs.append(single_layer_optimization(sub, sys_info)) 144 | else: 145 | # without combining sub-kernels 146 | res = single_layer_optimization(sub, sys_info) 147 | # multiply each individual sub-kernel's 148 | # memory traffic and cycles by 4. 149 | res["total_transfer"] = res["total_transfer"]*4 150 | res["total_cycle"] = res["total_cycle"]*4 151 | subs.append(res) 152 | 153 | return subs 154 | 155 | # the main routine of optimizing the dnn. 
156 | def opti_dnn(meta_data, hardware_constraints): 157 | # set up the configurations; 158 | setup(meta_data, hardware_constraints) 159 | dnn = meta_data["dnn"] 160 | sys_info = meta_data["system_info"] 161 | 162 | results = [] 163 | 164 | # optimize for each layer 165 | for i in range(len(dnn)): 166 | layer = dnn[i] 167 | # start to optimize ordinary Conv layer. 168 | data = dict(layer) 169 | 170 | # check if this layer is Deconv, True == YES 171 | if layer["Deconv?"] == True: 172 | if enable["split"]: 173 | # split the deconv into smaller sub-convolutions 174 | results.append({ 175 | "data" : data, 176 | "result" : opti_deconv(layer, sys_info) 177 | }) 178 | else: 179 | data["ofmap"] = [0] * len(data["ifmap"]) 180 | # scale up the ifmap and derive the ofmap based on the stride size. 181 | for j in range(len(data["ifmap"])-1): 182 | data["ifmap"][j] = layer["ifmap"][j]*2/layer["stride"] 183 | data["ofmap"][j] = layer["ifmap"][j]/layer["stride"] 184 | 185 | # the last element is ofmap channel, so treat it separately 186 | data["ofmap"][-1] = data["out_channel"] 187 | 188 | # add the result 189 | results.append({ 190 | "data" : data, 191 | "result" : single_layer_optimization(data, sys_info) 192 | }) 193 | else: 194 | data["ofmap"] = [0] * len(data["ifmap"]) 195 | # scale down the ifmap to the ofmap based on the stride size. 
196 | for j in range(len(data["ifmap"])-1): 197 | data["ofmap"][j] = layer["ifmap"][j]/layer["stride"] 198 | 199 | # the last element is ofmap channel, so treat it separately 200 | data["ofmap"][-1] = data["out_channel"] 201 | 202 | results.append({ 203 | "data" : data, 204 | "result" : single_layer_optimization(data, sys_info) 205 | }) 206 | 207 | return results 208 | -------------------------------------------------------------------------------- /dnns/GC_net.txt: -------------------------------------------------------------------------------- 1 | 32,5,5,2,False 2 | 32,3,3,1,False 3 | 32,3,3,1,False 4 | 32,3,3,1,False 5 | 32,3,3,1,False 6 | 32,3,3,1,False 7 | 32,3,3,1,False 8 | 9 | 32,3,3,3,1,False 10 | 32,3,3,3,1,False 11 | 64,3,3,3,2,False 12 | 64,3,3,3,1,False 13 | 64,3,3,3,1,False 14 | 64,3,3,3,2,False 15 | 64,3,3,3,1,False 16 | 64,3,3,3,1,False 17 | 64,3,3,3,2,False 18 | 64,3,3,3,1,False 19 | 64,3,3,3,1,False 20 | 128,3,3,3,2,False 21 | 128,3,3,3,1,False 22 | 128,3,3,3,1,False 23 | 64,3,3,3,1,True 24 | 64,3,3,3,1,True 25 | 64,3,3,3,1,True 26 | 32,3,3,3,1,True -------------------------------------------------------------------------------- /dnns/PSMNet.txt: -------------------------------------------------------------------------------- 1 | 32,3,3,2,False 2 | 32,3,3,1,False 3 | 32,3,3,1,False 4 | 64,3,3,2,False 5 | 64,3,3,1,False 6 | 128,3,3,2,False 7 | 128,3,3,1,False 8 | 9 | 32,3,3,3,1,False 10 | 32,3,3,3,1,False 11 | 32,3,3,3,1,False 12 | 64,3,3,3,2,False 13 | 64,3,3,3,1,False 14 | 64,3,3,3,2,False 15 | 64,3,3,3,1,False 16 | 64,3,3,3,1,True 17 | 32,3,3,3,1,True 18 | 64,3,3,3,2,False 19 | 64,3,3,3,1,False 20 | 64,3,3,3,2,False 21 | 64,3,3,3,1,False 22 | 64,3,3,3,1,True 23 | 32,3,3,3,1,True 24 | 64,3,3,3,2,False 25 | 64,3,3,3,1,False 26 | 64,3,3,3,2,False 27 | 64,3,3,3,1,False 28 | 64,3,3,3,1,True 29 | 32,3,3,3,1,True 30 | -------------------------------------------------------------------------------- /dnns/flowNetC.txt: 
-------------------------------------------------------------------------------- 1 | 64,7,7,2,False 2 | 128,5,5,2,False 3 | 256,5,5,2,False 4 | 441,1,1,1,False 5 | 256,3,3,1,False 6 | 512,3,3,2,False 7 | 512,3,3,1,False 8 | 512,3,3,2,False 9 | 512,3,3,1,False 10 | 1024,3,3,2,False 11 | 512,5,5,1,True 12 | 256,5,5,1,True 13 | 128,5,5,1,True 14 | 64,5,5,1,True -------------------------------------------------------------------------------- /dnns/flowNetS.txt: -------------------------------------------------------------------------------- 1 | 64,7,7,2,False 2 | 128,5,5,2,False 3 | 256,5,5,2,False 4 | 256,3,3,1,False 5 | 512,3,3,2,False 6 | 512,3,3,1,False 7 | 512,3,3,2,False 8 | 512,3,3,1,False 9 | 1024,3,3,2,False 10 | 512,5,5,1,True 11 | 256,5,5,1,True 12 | 128,5,5,1,True 13 | 64,5,5,1,True -------------------------------------------------------------------------------- /dnns/test.txt: -------------------------------------------------------------------------------- 1 | 64,7,7,2,False 2 | 128,5,5,1,True -------------------------------------------------------------------------------- /dnns/test3d.txt: -------------------------------------------------------------------------------- 1 | 32,3,3,3,1,False 2 | 32,3,3,3,1,False 3 | 64,3,3,3,2,False 4 | 64,3,3,3,1,False 5 | 64,3,3,3,1,False 6 | 64,3,3,3,2,False 7 | 64,3,3,3,1,False 8 | 64,3,3,3,1,False 9 | 64,3,3,3,2,False 10 | 64,3,3,3,1,False 11 | 64,3,3,3,1,False 12 | 128,3,3,3,2,False 13 | 128,3,3,3,1,False 14 | 128,3,3,3,1,False 15 | 64,3,3,3,1,True 16 | 64,3,3,3,1,True 17 | 64,3,3,3,1,True 18 | 32,3,3,3,1,True 19 | -------------------------------------------------------------------------------- /fc_layer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | class FullyConnectedLayer(object): 8 | # info for systolic array 9 | A = None # systolic array dimension 10 | 11 | # memory bandwith 
number of bytes can be transferred. 12 | B = None 13 | 14 | # on-chip buffer size 15 | buf_size = None 16 | 17 | # input layer dimension 18 | Ci = None # channels for ifmap 19 | Co = None # channels for ofmap 20 | Num = None # number of identical FC layers 21 | 22 | # on-chip buffer size 23 | bufi_size = None 24 | bufo_size = None 25 | bufw_size = None 26 | 27 | """docstring for FullyConnectedLayer""" 28 | def __init__(self, data, sys_info): 29 | self.data = data 30 | self.sys_info = sys_info 31 | self.A = sys_info["sa_size"] 32 | self.B = sys_info["memory_bandwidth"]/(sys_info["bit_width"]/8) 33 | self.buf_size = sys_info["bufsize"] 34 | 35 | def init_setup(self): 36 | layer_info = self.data 37 | 38 | # set up the new layer information 39 | self.Ci = layer_info["in_channel"] 40 | self.Co = layer_info["out_channel"] 41 | self.Num = layer_info["num_of_layer"] 42 | 43 | self.bufi_size = self.Ci 44 | 45 | ############################################################### 46 | # general process # 47 | ############################################################### 48 | 49 | # compute buffer utilization 50 | def buffer_utilization(self, x): 51 | # buffer = ofmap + weights + ifmap 52 | return (x + self.Ci*x + self.Ci) 53 | 54 | # (ofmap + weights)*total_batch + ifmap 55 | def data_transfer(self, x): 56 | # calculate the total batch 57 | total_batch = math.ceil(float(self.Co)/x) 58 | 59 | # ofmap, ifmap and kernel tile size 60 | ofmap_tile_size = x 61 | kernel_tile_size = x*self.Ci 62 | 63 | # ofmap + kernels transfer 64 | total_transfer = (ofmap_tile_size + kernel_tile_size) * total_batch 65 | 66 | # add additional ifmap data transfer 67 | total_transfer += self.Ci 68 | 69 | return total_transfer 70 | 71 | def systolic_array_utilization(self, x): 72 | A = self.A 73 | A_w_uiti = math.ceil(self.Co/math.ceil(float(self.Co)/A)) 74 | 75 | total_usage = x * self.Ci 76 | round_up_val = math.ceil(float(x)/A) * A \ 77 | * math.ceil(float(self.Ci)/A)*A 78 | 79 | # the pct 
of extra delay due to output-stationary 80 | delay_pct = float(self.Ci)/(self.Ci+A_w_uiti) 81 | 82 | return delay_pct * total_usage / round_up_val 83 | 84 | def compute_bound_cycle(self, util_rate): 85 | # total number of ops 86 | total_computation = (self.Ci*self.Co) 87 | 88 | # systolic array calculation capacity 89 | comp_cap = (self.A*self.A) * util_rate 90 | 91 | return total_computation / comp_cap 92 | 93 | def process_parameter(self, x): 94 | 95 | x = math.floor(x) 96 | bound = "C" 97 | # make the tile size even for every batch 98 | x_0 = min(self.Co/math.ceil(self.Co/round(x)), self.Co) 99 | 100 | # (ofmap + ifmap)*total_batch + weights 101 | total_transfer = self.data_transfer(x_0) 102 | 103 | # compute the utilization of systolic array 104 | util_sys_arr = self.systolic_array_utilization(x_0) 105 | 106 | # compute the utilization of buffer 107 | util_buf = float(self.buffer_utilization(x_0))/self.buf_size 108 | 109 | if util_buf > 1.01: 110 | print("ERROR: the utilization of buffer is over 100%") 111 | exit() 112 | 113 | # calculate the amount of cycles of computing all elements. 114 | if self.compute_bound_cycle(util_sys_arr) > total_transfer/self.B: 115 | bound = "C" 116 | total_cycle = self.compute_bound_cycle(util_sys_arr) 117 | else: 118 | bound = "M" 119 | total_cycle = total_transfer/self.B 120 | 121 | ret = { 122 | "total_transfer": round(total_transfer)*self.Num, 123 | "total_cycle": round(total_cycle)*self.Num, 124 | "systolic_array_utilization": util_sys_arr, 125 | "buffer_utilization": util_buf, 126 | "buffer-partition [I,W,O]": [int(self.bufi_size), 127 | int(self.bufw_size), 128 | int(self.bufo_size)], 129 | "x_0": math.floor(x_0), 130 | "Bound" : bound 131 | } 132 | 133 | return ret 134 | 135 | # optimize one layer 136 | def optimize(self): 137 | self.init_setup() 138 | 139 | # if sum of bufi and bufw is over the self.buf_size 140 | # we should skip it. 
141 | if self.bufi_size > self.buf_size: 142 | print("FAIL: the entire weight cannot be stored in buffer") 143 | exit() 144 | 145 | self.bufw_size = (self.buf_size - self.bufi_size)*self.Ci/(self.Ci+1) 146 | self.bufo_size = (self.buf_size - self.bufi_size)/(self.Ci+1) 147 | 148 | # set the initial guess; 149 | x0 = self.A 150 | 151 | # let's see what percentage of ifmap can we fit into the buffer. 152 | while x0 < self.Co and (x0+self.A)*self.Ci < self.bufw_size: 153 | x0 = x0 + self.A 154 | 155 | return self.process_parameter(x0) 156 | -------------------------------------------------------------------------------- /layer3d_base_method.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | # own library 8 | from layer_base_method import * 9 | 10 | class Layer3dBaseMethod(LayerBaseMethod): 11 | """docstring for Layer3dBaseMethod""" 12 | # info for systolic array 13 | A = None # systolic array dimension 14 | 15 | # memory bandwith number of bytes can be transferred. 
16 | B = None 17 | 18 | # on-chip buffer size 19 | buffer_size = None 20 | # info for weights 21 | K_w = None # kernel width 22 | K_h = None # kernel height 23 | K_d = None # kernel disparity 24 | S = None # stride size 25 | 26 | # input layer dimension 27 | H = None # height of ofmap 28 | W = None # width of ofmap 29 | D = None # disparity of ofmap 30 | Ci = None # channels for weights 31 | Co = None # channels for ofmap 32 | 33 | # on-chip buffer size 34 | bufi_size = None 35 | bufo_size = None 36 | bufw_size = None 37 | 38 | # array to store the results of the four different configurations 39 | res = [] 40 | 41 | def __init__(self, data, sys_info): 42 | super(Layer3dBaseMethod, self).__init__(data, sys_info) 43 | 44 | def init_setup(self): 45 | self.res = [] 46 | layer_info = self.data 47 | # set up the new layer information 48 | [self.W, self.H, self.D, self.Ci] = layer_info["ifmap"] 49 | self.Co = layer_info["out_channel"] 50 | [self.K_w, self.K_h, self.K_d] = layer_info["kernel"] 51 | self.S = layer_info["stride"] 52 | 53 | ############################################################### 54 | # general computations # 55 | ############################################################### 56 | def ofmap_tile(self, x): 57 | return x[0]*x[1]*x[2]*x[3] 58 | 59 | def weight_tile(self, num): 60 | return self.Ci*self.K_h*self.K_w*self.K_d*num 61 | 62 | def ifmap_tile(self, x): 63 | S_2 = (self.K_h+1) / 2 64 | return self.Ci*(self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2) 65 | 66 | def total_ofmap_size(self): 67 | return self.H*self.W*self.D*self.Co 68 | 69 | def total_weight_size(self): 70 | return self.weight_tile(self.Co) 71 | 72 | # variables for optimization 73 | # these are encoded as x[4] = {c_0, h_0, w_0, d_0}; 74 | # c_0 # number of channels per batch; 75 | # h_0, w_0, d_0 # the dimensions of tile per batch; 76 | ############################################################### 77 | # general process # 78 | 
############################################################### 79 | 80 | def buffer_utilization(self, x): 81 | # buffer = ofmap + weights + ifmap 82 | return (self.ofmap_tile(x) + 83 | self.weight_tile(x[0]) + 84 | self.ifmap_tile(x)) 85 | 86 | def row_major_data_transfer(self, h_0, w_0, d_0, c_0): 87 | # ofmap, ifmap and kernel tile size 88 | S_2 = (self.K_h+1) / 2 89 | ofmap_tile_size = h_0*w_0*d_0*c_0 90 | ifmap_tile_size = ((self.S*h_0+S_2) * 91 | (self.S*w_0+S_2) * 92 | (self.S*d_0+S_2) * self.Ci) 93 | kernel_tile_size = self.K_h*self.K_w*self.K_d*self.Ci*c_0 94 | 95 | # calculate the total batch 96 | total_batch = math.ceil((self.H*self.W*self.D*self.Co) / ofmap_tile_size) 97 | 98 | # ofmap + ifmap transfer 99 | total_transfer = ((ofmap_tile_size + ifmap_tile_size) * 100 | (total_batch - self.Co/c_0)) 101 | 102 | # add additional data transfer 103 | total_transfer += (ofmap_tile_size + kernel_tile_size) * (self.Co/c_0) 104 | 105 | return total_transfer 106 | 107 | def channel_major_data_transfer(self, h_0, w_0, d_0, c_0): 108 | S_2 = (self.K_h+1) / 2 109 | 110 | # ofmap and ifmap tile size 111 | ofmap_tile_size = h_0*w_0*d_0*c_0 112 | ifmap_tile_size = ((self.S*h_0+S_2) * 113 | (self.S*w_0+S_2) * 114 | (self.S*d_0+S_2) * self.Ci) 115 | kernel_tile_size = self.K_h*self.K_w*self.K_d*self.Ci*c_0 116 | 117 | # calculate the total batch 118 | total_batch = math.ceil((self.H*self.W*self.D*self.Co) / ofmap_tile_size) 119 | 120 | # ofmap + weight transfer 121 | total_transfer = (ofmap_tile_size + kernel_tile_size) * \ 122 | (total_batch - (self.H*self.W*self.D)/(h_0*w_0*d_0)) 123 | 124 | # add additional data transfer 125 | total_transfer += (ofmap_tile_size + ifmap_tile_size) \ 126 | * (self.H*self.W*self.D)/(h_0*w_0*d_0) 127 | 128 | return total_transfer 129 | 130 | def systolic_array_utilization(self, xi, area): 131 | area_size = area[0] * area[1] *area[2] 132 | A = self.A 133 | total_usage = xi * area_size 134 | round_up_val = (math.ceil(float(xi/A))*A) \ 135 | 
* (math.ceil(float(area_size)/A)*A) 136 | 137 | return total_usage / round_up_val 138 | 139 | def compute_bound_cycle(self, util_rate): 140 | # total number of ops 141 | total_computation = ((self.H*self.W*self.D*self.Co) 142 | * (self.Ci*self.K_h*self.K_w*self.K_d) 143 | / (self.S*self.S*self.S)) 144 | 145 | # systolic array calculation capacity 146 | comp_cap = (self.A*self.A) * util_rate 147 | 148 | return total_computation / comp_cap 149 | 150 | def process_parameter(self, x, row_major, comp_bound): 151 | bound = "C" 152 | # make the tile size even for every batch 153 | c_0 = min(self.Co/math.ceil(self.Co/round(x[0])), self.Co) 154 | w_0 = min(self.W/math.ceil(self.W/round(x[1])), self.W) 155 | h_0 = min(self.H/math.ceil(self.H/round(x[2])), self.H) 156 | d_0 = min(self.D/math.ceil(self.D/round(x[3])), self.D) 157 | 158 | # compute the total number of elements needed to be updated 159 | # if it is row-major. 160 | if row_major: 161 | # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0 162 | total_transfer = self.row_major_data_transfer(h_0, w_0, d_0, c_0) 163 | 164 | # compute the total number of elements needed to be updated 165 | # if it is channel-major. 166 | else: 167 | # (ofmap + weights)*total_batch + (ofmap+ifmap)*(H*W*D)/(h_0*w_0*d_0) 168 | total_transfer = self.channel_major_data_transfer(h_0, w_0, d_0, c_0) 169 | 170 | # compute the utilization of systolic array 171 | util_sys_arr = self.systolic_array_utilization(c_0, [w_0, h_0, d_0]) 172 | 173 | # compute the utilization of buffer 174 | util_buf = self.buffer_utilization([c_0, w_0, h_0, d_0])/self.buf_size 175 | 176 | if util_buf > 1.01: 177 | return 178 | # calculate the amount of cycles of computing all elements. 
179 | if comp_bound: 180 | bound = "C" 181 | total_cycle = self.compute_bound_cycle(util_sys_arr) 182 | else: 183 | bound = "M" 184 | total_cycle = total_transfer/self.B 185 | 186 | ret = { 187 | "total_transfer": round(total_transfer), 188 | "total_cycle": round(total_cycle), 189 | "systolic_array_utilization": util_sys_arr, 190 | "buffer_utilization": util_buf, 191 | "c_0, w_0, h_0, d_0": [round(c_0), round(w_0), round(h_0), round(d_0)], 192 | "Tile size" : [self.Co/c_0, self.W/w_0, self.H/h_0, self.D/d_0], 193 | "Bound" : bound 194 | } 195 | self.res.append(ret) 196 | return 197 | 198 | -------------------------------------------------------------------------------- /layer3d_exhaustive_searcher.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | # my own module 8 | from layer3d_static_method import * 9 | 10 | ############################################################### 11 | # general process # 12 | ############################################################### 13 | class Layer3dExhaustiveSearcher(Layer3dStaticMethod): 14 | 15 | # array to store the result from the four different results 16 | res = [] 17 | 18 | """docstring for LayerExhaustiveSearcher""" 19 | def __init__(self, data, sys_info): 20 | super(Layer3dExhaustiveSearcher, self).__init__(data, sys_info, None) 21 | self.rets = [] 22 | 23 | # optimize one layer 24 | def optimize(self): 25 | self.init_setup() 26 | 27 | for i in range(1, 20): 28 | self.bufi_size = self.buf_size*i/20.0 29 | for j in range(1, 20): 30 | self.bufw_size = self.buf_size*j/20.0 31 | # optimize one buffer partition 32 | self.res = [] 33 | self.optimize_one_buffer_partition() 34 | 35 | ret = dict(self.rets[0]) 36 | 37 | for item in self.rets: 38 | if ret["total_cycle"] > item["total_cycle"]: 39 | ret = dict(item) 40 | if ret["total_cycle"] == item["total_cycle"] and \ 41 | ret["total_transfer"] > 
item["total_transfer"]: 42 | ret = dict(item) 43 | 44 | return ret 45 | 46 | # the main optimization routine; 47 | def opti_buffer(self): 48 | # set the initial guess; 49 | x0 = [self.A, self.A] 50 | 51 | # check if the initial configuration can hold the minimum requirements 52 | if ((x0[0]*self.K_h*self.K_w*self.K_d*self.Ci > self.bufw_size) or 53 | (self.S*self.S*self.S*x0[1]*self.Ci > self.bufi_size)): 54 | return 55 | 56 | # first, let's find the number of kernels we can put into the buffer. 57 | while (x0[0]+self.A)*self.K_h*self.K_w*self.K_d*self.Ci < self.bufw_size: 58 | x0[0] = x0[0]+self.A 59 | 60 | # next, let's see how much of the ifmap we can fit into the buffer. 61 | while self.S*self.S*self.S*(x0[1]+self.A)*self.Ci < self.bufi_size: 62 | x0[1] = x0[1]+self.A 63 | 64 | # no need to optimize the buffer for ofmap, because it is 65 | # bounded by the ifmap. 66 | x = [x0[0], min(round(x0[1]**(1.0/3)), self.W), 67 | min(round(x0[1]**(1.0/3)), self.H), min(round(x0[1]**(1.0/3)), self.D)] 68 | self.process_parameter(x, False, False) 69 | self.process_parameter(x, False, True) 70 | self.process_parameter(x, True, False) 71 | self.process_parameter(x, True, True) 72 | 73 | -------------------------------------------------------------------------------- /layer3d_optimizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | from scipy.optimize import minimize 7 | 8 | # my own library 9 | from layer3d_base_method import * 10 | from layer_optimizer import * 11 | 12 | # threshold for bounds 13 | # if the constraint result is negative but within this threshold, 14 | # it is still considered a valid result. 
15 | Threshold = 10.0 16 | 17 | class Layer3dOptimizer(Layer3dBaseMethod, LayerOptimizer): 18 | """docstring for Layer3dOptimizer""" 19 | def __init__(self, data, sys_info, combined=False): 20 | super(Layer3dOptimizer, self).__init__(data, sys_info) 21 | 22 | # variables for optimization 23 | # this two has been encodes as x[3] = {c_0, h_0, w_0, d_0}; 24 | # c_0 # number of channels per batch; 25 | # h_0, w_0, d_0 # the dimensions of tile per batch; 26 | 27 | ############################################################### 28 | # general process # 29 | ############################################################### 30 | 31 | def init_guess(self): 32 | x0 = [min(self.A, self.Co), \ 33 | min(math.floor(math.sqrt(self.A)), self.H), \ 34 | min(math.floor(math.sqrt(self.A)), self.W), 1] 35 | return x0 36 | 37 | def variable_boundary(self): 38 | return ((min(self.A, self.Co), self.Co), 39 | (min(math.floor(math.sqrt(self.A)), self.H), self.H), \ 40 | (min(math.floor(math.sqrt(self.A)), self.W), self.W), \ 41 | (1, self.D)) 42 | 43 | ############################################################### 44 | # general constraints # 45 | ############################################################### 46 | # the low bound of buffer size; 47 | # make sure the buffer utilization is always larger than 0 48 | def buffer_constraint1(self, x): 49 | return self.buffer_utilization(x) 50 | 51 | # the upper bound of the buffer size; 52 | # make sure the buffer utilization is 53 | # always smaller than buffer size; 54 | def buffer_constraint2(self, x): 55 | return self.buf_size - self.buffer_constraint1(x) 56 | 57 | ############################################################### 58 | # row-major constraint solving obj and constraints # 59 | ############################################################### 60 | 61 | # the minimization objective of row-major 62 | # this objective is a simplified expression of 63 | # [h_0*w_0*d_0*c_0+(S*h_0+2)(S*w_0+2)(S*d_0+2)*Ci]*(H*W*D*Co)/(h_0*w_0*d_0*c_0) 64 
| # + [K^3*Ci+h_0*w_0*d_0*c_0]*Co/c_0 65 | def row_major_mem_obj(self, x): 66 | return (self.ofmap_tile(x) + self.ifmap_tile(x)) \ 67 | * (self.total_ofmap_size()/self.ofmap_tile(x) - self.Co/x[0]) \ 68 | + self.total_weight_size()/x[0] + x[1]*x[2]*x[3]*self.Co 69 | 70 | def row_major_comp_obj(self, x): 71 | return self.total_ofmap_size() / self.ofmap_tile(x) 72 | 73 | # make sure the load for row-major is always less than 74 | # load for channel-major, range : [0, +inf] 75 | def row_major_constraint(self, x): 76 | S_2 = (self.K_h+1)/2 77 | # simplified from K^3*Ci*c_0 > C*(S^3*h_0*w_0*d_0) 78 | return self.K_h*self.K_w*self.K_d*x[0] - \ 79 | (self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2) 80 | 81 | # make sure the process is always memory-bound; 82 | # i.e., the latency of memory access is always 83 | # greater than the latency of compute; 84 | # (c_0*(h_0*w_0*d_0)+C*((S*h_0+2)*(S*w_0+2)*(S*d_0+2))/B 85 | # >= (K^3*Ci/A^2)*c_0*w_0*d_0*h_0 86 | # range : [0, +inf] 87 | def row_major_mem_bound_constraint(self, x): 88 | return (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B \ 89 | - self.weight_tile(1)/(self.A*self.A)*self.ofmap_tile(x) 90 | 91 | # make sure the process is always compute-bound; 92 | # i.e., the latency of compute is always 93 | # greater than the latency of memory access; 94 | # (c_0*(h_0*w_0*d_0)+Ci*((S*h_0+2)*(S*w_0+2)*(S*d_0+2))/B 95 | # <= (K^3*Ci/A^2)*c_0*w_0*h_0*d_0 96 | # range : [0, +inf] 97 | def row_major_comp_bound_constraint(self, x): 98 | return self.weight_tile(1) / (self.A*self.A)*self.ofmap_tile(x) \ 99 | - (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B 100 | 101 | ############################################################### 102 | # channel-major constraint solving obj and constraints # 103 | ############################################################### 104 | 105 | # the minimization objective of channel-major 106 | # this is the simplified expression of 107 | 
(K^3*Ci*c_0+h_0*w_0*d_0*c_0)*(H*W*D*Co)/(h_0*w_0*d_0*c_0) 108 | # + [(S*h_0+2)(S*w_0+2)(S*d_0+2)*Ci + h_0*w_0*d_0*c_0]*(H*W*D)/(h_0*w_0*d_0) 109 | def channel_major_mem_obj(self, x): 110 | S_2 = (self.K_h+1)/2 111 | return (self.total_weight_size())/(x[1]*x[2]*x[3]) + \ 112 | (self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2)/\ 113 | (x[1]*x[2]*x[3]) 114 | 115 | def channel_major_comp_obj(self, x): 116 | return self.total_ofmap_size()/(x[1]*x[2]*x[0]*x[3]) 117 | 118 | # make sure the load for channel-major is always less than 119 | # load for row-major, range : [0, +inf] 120 | def channel_major_constraint(self, x): 121 | S_2 = (self.K_h+1)/2 122 | # simplified from K^3*Ci*c_0 <= Ci*((S*h_0+2)*(S*w_0+2)) 123 | return (self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2) \ 124 | - self.K_h*self.K_w*self.K_d*x[0] 125 | 126 | # make sure the process is always memory-bound; 127 | # i.e., the latency of memory access is always 128 | # greater than the latency of compute; 129 | # c_0*(h_0*w_0+K^3*C)/B >= (K^3*C/A^2)*c_0*(h_0*w_0) 130 | # range : [0, +inf] 131 | def channel_major_mem_bound_constraint(self, x): 132 | return (x[1]*x[2]*x[3]+self.weight_tile(1)) / self.B \ 133 | - self.weight_tile(1)/(self.A*self.A)*x[1]*x[2]*x[3] 134 | 135 | 136 | # make sure the process is always compute-bound; 137 | # i.e., the latency of compute is always 138 | # greater than the latency of memory access; 139 | # c_0*(h_0*w_0+K^3*C)/B <= (K^3*C/A^2)*c_0*(h_0*w_0*d_0) 140 | # range : [0, +inf] 141 | def channel_major_comp_bound_constraint(self, x): 142 | return (self.K_h*self.K_w*self.K_d*self.Co) \ 143 | / (self.A*self.A)*x[1]*x[2]*x[3] \ 144 | - (x[1]*x[2]+self.K_h*self.K_w*self.K_d*self.Co)/self.B 145 | 146 | 147 | -------------------------------------------------------------------------------- /layer3d_static_method.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as 
np 6 | 7 | # my own module 8 | from layer3d_base_method import * 9 | 10 | ############################################################### 11 | # general process # 12 | ############################################################### 13 | class Layer3dStaticMethod(Layer3dBaseMethod): 14 | 15 | # array to store the result from the four different results 16 | res = [] 17 | 18 | def __init__(self, data, sys_info, buffer_partition=None): 19 | super(Layer3dStaticMethod, self).__init__(data, sys_info) 20 | self.rets = [] 21 | if buffer_partition: 22 | # calculate buffer sizes 23 | self.bufi_size = self.buf_size* \ 24 | buffer_partition[0]/sum(buffer_partition) 25 | self.bufo_size = self.buf_size* \ 26 | buffer_partition[1]/sum(buffer_partition) 27 | self.bufw_size = self.buf_size* \ 28 | buffer_partition[2]/sum(buffer_partition) 29 | 30 | # the main optimization routine; 31 | def opti_buffer(self): 32 | # set the initial guess; 33 | x0 = [self.A, self.A] 34 | 35 | # check if the initial configuration can hold the minimum requirements 36 | if ((x0[0]*self.K_h*self.K_w*self.K_d*self.Ci > self.bufw_size) or 37 | (self.S*self.S*self.S*x0[1]*self.Ci > self.bufi_size)): 38 | return 39 | 40 | # first, let's find the number of kernel we can put into buffer. 41 | while (x0[0]+self.A)*self.K_h*self.K_w*self.K_d*self.Ci < self.bufw_size: 42 | x0[0] = x0[0]+self.A 43 | 44 | # next let's see what percentage of ifmap can we fit into the buffer. 45 | if self.K_h*self.K_w*(2*self.S+self.W) >= self.bufi_size: 46 | while self.K_h*self.K_w*(self.S+x0[1]+self.A)*self.Ci < self.bufi_size: 47 | x0[1] = x0[1]+self.A 48 | # no need to optimize the buffer for ofmap, because it is 49 | # bounded ifmap. 50 | x = [x0[0], x0[1], 1] 51 | else: 52 | d_0 = 1 53 | while (self.K_h*self.K_w)*(self.S+self.W)*(self.S+self.H) \ 54 | * (self.S+d_0+1)*self.Ci < self.bufi_size: 55 | d_0 += 1 56 | # no need to optimize the buffer for ofmap, because it is 57 | # bounded ifmap. 
58 | x = [x0[0], self.W, self.H, d_0] 59 | 60 | self.process_parameter(x, False, False) 61 | self.process_parameter(x, False, True) 62 | self.process_parameter(x, True, False) 63 | self.process_parameter(x, True, True) 64 | 65 | def optimize_one_buffer_partition(self): 66 | # if sum of bufi and bufw is over the self.buf_size 67 | # we should skip it. 68 | if (self.bufi_size + self.bufw_size) > self.buf_size: 69 | return 70 | 71 | self.bufo_size = self.buf_size - self.bufi_size - self.bufw_size 72 | # both cases are possible; 73 | self.opti_buffer() 74 | 75 | if len(self.res) == 0: 76 | return 77 | 78 | # choose the larger value as the bottleneck 79 | row_major_res = None 80 | if (self.res[0]["total_cycle"] < self.res[1]["total_cycle"]): 81 | row_major_res = self.res[1] 82 | else: 83 | row_major_res = self.res[0] 84 | 85 | # choose the larger value as the bottleneck 86 | channel_major_res = None 87 | if (self.res[2]["total_cycle"] < self.res[3]["total_cycle"]): 88 | channel_major_res = self.res[3] 89 | else: 90 | channel_major_res = self.res[2] 91 | 92 | # return the shortest value as the preferred compute ordering. 
93 | ret = None 94 | if (row_major_res["total_cycle"] < channel_major_res["total_cycle"]): 95 | ret = dict(row_major_res) 96 | else: 97 | ret = dict(channel_major_res) 98 | 99 | self.rets.append(ret) 100 | 101 | # optimize one layer 102 | def optimize(self): 103 | self.init_setup() 104 | 105 | # start the optimization 106 | self.res = [] 107 | self.optimize_one_buffer_partition() 108 | 109 | # find the best result 110 | ret = dict(self.rets[0]) 111 | 112 | for item in self.rets: 113 | if ret["total_cycle"] > item["total_cycle"]: 114 | ret = dict(item) 115 | if ret["total_cycle"] == item["total_cycle"] and \ 116 | ret["total_transfer"] > item["total_transfer"]: 117 | ret = dict(item) 118 | 119 | return ret 120 | -------------------------------------------------------------------------------- /layer_base_method.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | class LayerBaseMethod(object): 8 | """docstring for LayerBaseMethod""" 9 | # info for systolic array 10 | A = None # systolic array dimension 11 | 12 | # memory bandwith number of bytes can be transferred. 
13 | B = None 14 | 15 | # on-chip buffer size 16 | buf_size = None 17 | 18 | # info for weights 19 | K_w = None # kernel width 20 | K_h = None # kernel height 21 | S = None # stride size 22 | 23 | # layer dimensions 24 | H = None # height of ofmap 25 | W = None # width of ofmap 26 | Ci = None # channels for ifmap (and weights) 27 | Co = None # channels for ofmap 28 | 29 | # on-chip buffer partitions (ifmap, ofmap, weights) 30 | bufi_size = None 31 | bufo_size = None 32 | bufw_size = None 33 | 34 | # array to store the results of the four different configurations 35 | res = [] 36 | 37 | # constructor: cache the layer description and the system configuration 38 | def __init__(self, data, sys_info): 39 | self.data = data 40 | self.sys_info = sys_info 41 | self.A = sys_info["sa_size"] 42 | self.B = sys_info["memory_bandwidth"]/(sys_info["bit_width"]/8.0) 43 | self.buf_size = sys_info["bufsize"] 44 | self.res = [] 45 | 46 | def init_setup(self): 47 | self.res = [] 48 | layer_info = self.data 49 | # set up the new layer information 50 | [self.W, self.H, self.Ci] = layer_info["ifmap"] 51 | self.Co = layer_info["out_channel"] 52 | [self.K_w, self.K_h] = layer_info["kernel"] 53 | self.S = layer_info["stride"] 54 | self.W /= self.S 55 | self.H /= self.S 56 | 57 | ############################################################### 58 | # general process # 59 | ############################################################### 60 | 61 | # compute buffer utilization; x = [c_0, w_0, h_0] 62 | def buffer_utilization(self, x): 63 | # buffer = ofmap + weights + ifmap 64 | return (x[0]*x[1]*x[2] + self.Ci*self.K_h*self.K_w*x[0] 65 | + self.Ci*(self.S*x[1]+self.K_w/2)*(self.S*x[2]+self.K_h/2)) 66 | 67 | def total_batch_number(self, h_0, w_0, c_0): 68 | return math.ceil(float(self.H*self.W*self.Co) / (h_0*w_0*c_0)) 69 | 70 | # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0 71 | def row_major_data_transfer(self, h_0, w_0, c_0): 72 | # calculate the total batch 73 | total_batch = self.total_batch_number(h_0, w_0, c_0) 74 | 75 | # ofmap, ifmap and kernel tile size 76 | ofmap_tile_size = 
h_0*w_0*c_0 77 | ifmap_tile_size = (self.S*h_0+self.K_h/2)*(self.S*w_0+self.K_w/2)*self.Ci 78 | kernel_tile_size = self.K_h*self.K_w*self.Ci*c_0 79 | 80 | # ofmap + ifmap transfer 81 | total_transfer = (ofmap_tile_size + ifmap_tile_size) * total_batch 82 | 83 | # add additional data transfer 84 | total_transfer += (ofmap_tile_size + kernel_tile_size) * self.Co/c_0 85 | 86 | return total_transfer 87 | 88 | # (ofmap + weights)*total_batch + (ofmap+ifmap)*(H*W)/(h_0*w_0) 89 | def channel_major_data_transfer(self, h_0, w_0, c_0): 90 | # calculate the total batch 91 | total_batch = self.total_batch_number(h_0, w_0, c_0) 92 | 93 | # ofmap, ifmap and kernel tile size 94 | ofmap_tile_size = h_0*w_0*c_0 95 | ifmap_tile_size = (self.S*h_0+self.K_h/2)*(self.S*w_0+self.K_w/2)*self.Ci 96 | kernel_tile_size = self.K_h*self.K_w*self.Ci*c_0 97 | 98 | # ofmap + weight transfer 99 | total_transfer = (ofmap_tile_size + kernel_tile_size) * total_batch 100 | 101 | # add additional data transfer 102 | total_transfer += (ofmap_tile_size + ifmap_tile_size) \ 103 | * self.H*self.W / (h_0*w_0) 104 | 105 | return total_transfer 106 | 107 | def systolic_array_utilization(self, xi, area): 108 | area_size = area[0] * area[1] 109 | A = self.A 110 | total_usage = xi * area_size 111 | round_up_val = math.ceil(float(xi)/A)*A \ 112 | * math.ceil(float(area_size)/A)*A 113 | 114 | return total_usage / round_up_val 115 | 116 | def compute_bound_cycle(self, util_rate): 117 | # total number of ops 118 | total_computation = (self.H*self.W*self.Co) * \ 119 | (self.Ci*self.K_h*self.K_w) 120 | 121 | # systolic array calculation capacity 122 | comp_cap = (self.A*self.A) * util_rate 123 | 124 | return total_computation / comp_cap 125 | 126 | def process_parameter(self, x, row_major, comp_bound): 127 | 128 | x = [math.floor(i) for i in x] 129 | bound = "C" 130 | # make the tile size even for every batch 131 | c_0 = min(self.Co/math.ceil(self.Co/round(x[0])), self.Co) 132 | w_0 = 
min(self.W/math.ceil(self.W/round(x[1])), self.W) 133 | h_0 = min(self.H/math.ceil(self.H/round(x[2])), self.H) 134 | 135 | # compute the total number of elements needed to be updated 136 | # if it is row-major. 137 | if row_major: 138 | # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0 139 | total_transfer = self.row_major_data_transfer(h_0, w_0, c_0) 140 | 141 | # compute the total number of elements needed to be updated 142 | # if it is channel-major. 143 | else: 144 | # (ofmap + weights)*total_batch + (ofmap+ifmap)*(H*W)/(h_0*w_0) 145 | total_transfer = self.channel_major_data_transfer(h_0, w_0, c_0) 146 | 147 | # compute the utilization of systolic array 148 | util_sys_arr = self.systolic_array_utilization(c_0, [w_0, h_0]) 149 | 150 | # compute the utilization of the on-chip buffer 151 | util_buf = self.buffer_utilization([c_0, w_0, h_0])/self.buf_size 152 | 153 | if util_buf > 1.01: # reject tiles that overflow the buffer (1% tolerance) 154 | return 155 | # calculate the number of cycles needed to compute all elements. 156 | if comp_bound: 157 | bound = "C" 158 | total_cycle = self.compute_bound_cycle(util_sys_arr) 159 | else: 160 | bound = "M" 161 | total_cycle = total_transfer/self.B 162 | 163 | ret = { 164 | "total_transfer": round(total_transfer), 165 | "total_cycle": round(total_cycle), 166 | "systolic_array_utilization": util_sys_arr, 167 | "buffer_utilization": util_buf, 168 | "c_0, w_0, h_0": [round(c_0), round(w_0), round(h_0)], 169 | "Tile size" : [self.Co/c_0, self.W/w_0, self.H/h_0], 170 | "Bound" : bound 171 | } 172 | self.res.append(ret) 173 | return 174 | -------------------------------------------------------------------------------- /layer_exhaustive_searcher.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | # my own module 8 | from layer_static_method import * 9 | 10 | ############################################################### 11 | # general process # 12 | 
############################################################### 13 | class LayerExhaustiveSearcher(LayerStaticMethod): 14 | 15 | # array to store the result from the four different results 16 | res = [] 17 | 18 | """docstring for LayerExhaustiveSearcher""" 19 | def __init__(self, data, sys_info): 20 | super(LayerExhaustiveSearcher, self).__init__(data, sys_info, None) 21 | self.rets = [] 22 | 23 | # optimize one layer 24 | def optimize(self): 25 | self.init_setup() 26 | 27 | for i in range(1, 20): 28 | self.bufi_size = self.buf_size*i/20.0 29 | for j in range(1, 20): 30 | self.bufw_size = self.buf_size*j/20.0 31 | # optimize one buffer partition 32 | self.res = [] 33 | self.optimize_one_buffer_partition() 34 | 35 | ret = dict(self.rets[0]) 36 | 37 | for item in self.rets: 38 | if ret["total_cycle"] > item["total_cycle"]: 39 | ret = dict(item) 40 | if ret["total_cycle"] == item["total_cycle"] and \ 41 | ret["total_transfer"] > item["total_transfer"]: 42 | ret = dict(item) 43 | 44 | return ret 45 | 46 | # the main optimization routine; 47 | def opti_buffer(self): 48 | # set the initial guess; 49 | x0 = [self.A, self.A] 50 | 51 | # check if the initial configuration can hold the minimum requirements 52 | if ((x0[0]*self.K_h*self.K_w*self.Ci > self.bufw_size) or 53 | (self.S*self.S*x0[1]*self.Ci > self.bufi_size)): 54 | return 55 | 56 | # first, let's find the number of kernel we can put into buffer. 57 | while (x0[0]+self.A)*self.K_h*self.K_w*self.Ci < self.bufw_size: 58 | x0[0] = x0[0]+self.A 59 | 60 | # next let's see how much ifmap can we fit into the buffer. 61 | while self.S*self.S*(x0[1]+self.A)*self.Ci < self.bufi_size: 62 | x0[1] = x0[1]+self.A 63 | 64 | 65 | # no need to optimize the buffer for ofmap, because it is 66 | # bounded ifmap. 
67 | x = [x0[0], math.sqrt(x0[1]), math.sqrt(x0[1])] 68 | self.process_parameter(x, False, False) 69 | self.process_parameter(x, False, True) 70 | self.process_parameter(x, True, False) 71 | self.process_parameter(x, True, True) 72 | 73 | -------------------------------------------------------------------------------- /layer_optimizer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | from scipy.optimize import minimize 7 | 8 | # my own module 9 | from layer_base_method import * 10 | import layer_exhaustive_searcher 11 | 12 | # threshold for bounds 13 | # if the constraint result is negative but within this threshold, 14 | # it is still consider a valid result. 15 | Threshold = 500.0 16 | 17 | class LayerOptimizer(LayerBaseMethod): 18 | """docstring for LayerOptimizer""" 19 | def __init__(self, data, sys_info, combined=False): 20 | super(LayerOptimizer, self).__init__(data, sys_info) 21 | self.combined = combined 22 | 23 | # variables for optimization 24 | # this two has been encodes as x[3] = {c_0, h_0, w_0}; 25 | # c_0 # number of channels per batch; 26 | # h_0xw_0 # size of tile per batch; 27 | 28 | # calculate the latency for compute and memory; 29 | # l_com = (K_h*K_w*c_0*h_0*w_0)/(R*R) 30 | # # if row-major 31 | # l_mem_r = (c_0*h_0*w_0 + C*(h_0+2)*(w_0+2))/B 32 | # # if channel-major 33 | # l_mem_c = (c_0*h_0*w_0 + C*K_h*K_w*c_0)/B 34 | 35 | ############################################################### 36 | # general process # 37 | ############################################################### 38 | 39 | def optimize(self): 40 | self.init_setup() 41 | 42 | # print("##[LAYER]##", self.W, self.H, self.Ci, self.Co, self.K_w, self.K_h) 43 | # both cases are possible; 44 | # opti_mem() 45 | self.opti_comp() 46 | 47 | if len(self.res) == 0: 48 | self.opti_mem() 49 | 50 | if len(self.res) == 0: 51 | return None 52 | 53 | ret = 
dict(self.res[0]) 54 | 55 | for item in self.res: 56 | if ret["total_cycle"] > item["total_cycle"]: 57 | ret = dict(item) 58 | if ret["total_cycle"] == item["total_cycle"] and \ 59 | ret["total_transfer"] > item["total_transfer"]: 60 | ret = dict(item) 61 | 62 | return ret 63 | 64 | def opti_mem(self): 65 | # print("========================= Memory Bound ==========================") 66 | # optimization for row-major; 67 | self.opti_mem_row_major(); 68 | # optimization for channel-major; 69 | self.opti_mem_channel_major(); 70 | # print("\n") 71 | 72 | def opti_comp(self): 73 | # print("========================= Compute Bound =========================") 74 | # optimization for row-major; 75 | self.opti_comp_row_major(); 76 | # optimization for channel-major; 77 | self.opti_comp_channel_major(); 78 | # print("\n") 79 | 80 | # the main optimization of memory-bound and row-major case; 81 | def opti_mem_row_major(self): 82 | # set the initial guess; 83 | x0 = self.init_guess() 84 | # for row_major_constraint1 85 | con1 = {'type': 'ineq', 'fun': self.row_major_constraint} 86 | # for mem_bound_constraint 87 | con2 = {'type': 'ineq', 'fun': self.row_major_mem_bound_constraint} 88 | # for the buffer_constraint 89 | con3 = {'type': 'ineq', 'fun': self.buffer_constraint1} 90 | con4 = {'type': 'ineq', 'fun': self.buffer_constraint2} 91 | 92 | # summery all the bounds and constraints 93 | bnds = self.variable_boundary() 94 | cons = ([con1, con2, con3, con4]) 95 | 96 | # call the external solver to solve the solution 97 | solution = minimize(self.row_major_mem_obj, x0, method='SLSQP',\ 98 | bounds=bnds, constraints=cons) 99 | 100 | passed = True 101 | if np.any(np.isnan(solution.x)): 102 | passed = False 103 | # print("Solution with NaN, abort!") 104 | # check the validation 105 | if passed and self.row_major_constraint(solution.x) < -Threshold: 106 | passed = False 107 | # print("row major constraint", self.row_major_constraint(solution.x), "NOT PASSED.") 108 | if passed and 
self.buffer_constraint2(solution.x) < -Threshold: 109 | passed = False 110 | # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!") 111 | # print("buffer constraint", buffer_constraint2(solution.x)) 112 | if passed and self.row_major_mem_bound_constraint(solution.x) < -Threshold: 113 | passed = False 114 | # print("row-major memory-bound", self.row_major_mem_bound_constraint(solution.x), \ 115 | # " no longer bounded!") 116 | 117 | if passed: 118 | # print("Row-major memory-bound case PASSED!") 119 | self.process_parameter(solution.x, True, False) 120 | else: 121 | return None 122 | 123 | # the main optimization of compute-bound and row-major case; 124 | def opti_comp_row_major(self): 125 | # set the initial guess; 126 | x0 = self.init_guess() 127 | # for row_major_constraint1 128 | con1 = {'type': 'ineq', 'fun': self.row_major_constraint} 129 | # for mem_bound_constraint 130 | con2 = {'type': 'ineq', 'fun': self.row_major_comp_bound_constraint} 131 | # for the buffer_constraint 132 | con3 = {'type': 'ineq', 'fun': self.buffer_constraint1} 133 | con4 = {'type': 'ineq', 'fun': self.buffer_constraint2} 134 | 135 | # summery all the bounds and constraints 136 | bnds = self.variable_boundary() 137 | cons = ([con1, con2, con3, con4]) 138 | 139 | # call the external solver to solve the solution 140 | solution = minimize(self.row_major_comp_obj, x0, method='SLSQP',\ 141 | bounds=bnds, constraints=cons) 142 | 143 | passed = True 144 | if np.any(np.isnan(solution.x)): 145 | passed = False 146 | # print("Solution with NaN, abort!") 147 | # check the validation 148 | if passed and self.row_major_constraint(solution.x) < -Threshold: 149 | passed = False 150 | # print("row major constraint", self.row_major_constraint(solution.x), "NOT PASSED.") 151 | if passed and self.buffer_constraint2(solution.x) < -Threshold: 152 | passed = False 153 | # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!") 154 | if passed and 
self.row_major_comp_bound_constraint(solution.x) < -Threshold: 155 | passed = False 156 | # print("Row-major compute-bound", self.row_major_comp_bound_constraint(solution.x), \ 157 | # " no longer bounded!") 158 | 159 | if passed: 160 | # print("Row-major compute-bound case PASSED!") 161 | self.process_parameter(solution.x, True, True) 162 | else: 163 | return None 164 | 165 | 166 | 167 | # the main optimization of memory-bound and channel-major case; 168 | def opti_mem_channel_major(self): 169 | # set the initial guess; 170 | x0 = self.init_guess() 171 | # for row_major_constraint1 172 | con1 = {'type': 'ineq', 'fun': self.channel_major_constraint} 173 | # for mem_bound_constraint 174 | con2 = {'type': 'ineq', 'fun': self.channel_major_mem_bound_constraint} 175 | # for the buffer_constraint 176 | con3 = {'type': 'ineq', 'fun': self.buffer_constraint1} 177 | con4 = {'type': 'ineq', 'fun': self.buffer_constraint2} 178 | 179 | # summery all the bounds and constraints 180 | bnds = self.variable_boundary() 181 | cons = ([con1, con2, con3, con4]) 182 | 183 | # call the external solver to solve the solution 184 | solution = minimize(self.channel_major_mem_obj, x0, method='SLSQP',\ 185 | bounds=bnds, constraints=cons) 186 | 187 | passed = True 188 | if np.any(np.isnan(solution.x)): 189 | passed = False 190 | # print("Solution with NaN, abort!") 191 | # check the validation 192 | if passed and self.channel_major_constraint(solution.x) < -Threshold: 193 | passed = False 194 | # print("channel major constraint", self.channel_major_constraint(solution.x), "NOT PASSED.") 195 | if passed and self.buffer_constraint2(solution.x) < -Threshold: 196 | passed = False 197 | # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!") 198 | if passed and self.channel_major_mem_bound_constraint(solution.x) < -Threshold: 199 | passed = False 200 | # print("Channel-major memory-bound", self.channel_major_mem_bound_constraint(solution.x), \ 201 | # " no longer bounded!") 
202 | 203 | if passed: 204 | # print("Channel-major memory-bound case PASSED!") 205 | self.process_parameter(solution.x, False, False) 206 | else: 207 | return None 208 | 209 | 210 | # the main optimization of compute-bound and channel-major case; 211 | def opti_comp_channel_major(self): 212 | # set the initial guess; 213 | x0 = self.init_guess() 214 | # for row_major_constraint1 215 | con1 = {'type': 'ineq', 'fun': self.channel_major_constraint} 216 | # for mem_bound_constraint 217 | con2 = {'type': 'ineq', 'fun': self.channel_major_comp_bound_constraint} 218 | # for the buffer_constraint 219 | con3 = {'type': 'ineq', 'fun': self.buffer_constraint1} 220 | con4 = {'type': 'ineq', 'fun': self.buffer_constraint2} 221 | 222 | # summery all the bounds and constraints 223 | bnds = self.variable_boundary() 224 | cons = ([con1, con2, con3, con4]) 225 | 226 | # call the external solver to solve the solution 227 | solution = minimize(self.channel_major_comp_obj, x0, method='SLSQP',\ 228 | bounds=bnds, constraints=cons) 229 | 230 | passed = True 231 | if np.any(np.isnan(solution.x)): 232 | passed = False 233 | # print("Solution with NaN, abort!") 234 | # check the validation 235 | if passed and self.channel_major_constraint(solution.x) < -Threshold: 236 | passed = False 237 | # print("channel major constraint", self.channel_major_constraint(solution.x), "NOT PASSED.") 238 | if passed and self.buffer_constraint2(solution.x) < -Threshold: 239 | passed = False 240 | # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!") 241 | if passed and self.channel_major_comp_bound_constraint(solution.x) < -Threshold: 242 | passed = False 243 | # print("Channel-major compute-bound", self.channel_major_comp_bound_constraint(solution.x), \ 244 | # " no longer bounded!") 245 | 246 | if passed: 247 | # print("Channel-major compute-bound case PASSED!") 248 | self.process_parameter(solution.x, False, True) 249 | else: 250 | return None 251 | 252 | 
############################################################### 253 | # general computations # 254 | ############################################################### 255 | 256 | def ofmap_tile(self, x): 257 | return x[0]*x[1]*x[2] 258 | 259 | def weight_tile(self, num): 260 | return self.Ci*self.K_h*self.K_w*num 261 | 262 | def ifmap_tile(self, x): 263 | return self.Ci*(self.S*x[1]+2)*(self.S*x[2]+2) 264 | 265 | def total_ofmap_size(self): 266 | return self.H*self.W*self.Co 267 | 268 | def total_weight_size(self): 269 | return self.weight_tile(self.Co) 270 | 271 | ############################################################### 272 | # general constraints # 273 | ############################################################### 274 | # the lower bound of the buffer size; 275 | # make sure the buffer utilization is always larger than 0 276 | def buffer_constraint1(self, x): 277 | # buffer = ofmap + weights + ifmap 278 | return (self.ofmap_tile(x) + 279 | self.weight_tile(x[0]) + 280 | self.ifmap_tile(x)) 281 | 282 | # the upper bound of the buffer size; 283 | # make sure the buffer utilization is 284 | # always smaller than buffer size; 285 | def buffer_constraint2(self, x): 286 | return (self.buf_size - self.buffer_constraint1(x)) 287 | 288 | # set initial guess for constrained optimization 289 | def init_guess(self): 290 | # set the initial guess; 291 | x0 = [min(self.A, self.Co), \ 292 | min(math.floor(math.sqrt(self.A)), self.H), \ 293 | min(math.floor(math.sqrt(self.A)), self.W)] 294 | if self.combined: # seed from the static method, reached via layer_exhaustive_searcher's star import 295 | result = layer_exhaustive_searcher.\ 296 | LayerStaticMethod(self.data, self.sys_info, [3.0, 3.0, 4.0]).optimize() 297 | x0 = result["c_0, w_0, h_0"] 298 | 299 | return x0 300 | 301 | # set the bounds for the variables in the optimization 302 | def variable_boundary(self): 303 | return ((min(self.A, self.Co), self.Co), 304 | (min(math.floor(math.sqrt(self.A)), self.H), self.H), 305 | (min(math.floor(math.sqrt(self.A)), self.W), self.W)) 306 | 307 | 308 | 
############################################################### 309 | # row-major constraint solving obj and constraints # 310 | ############################################################### 311 | 312 | # the minimization objective of row-major 313 | # this objective is a simplified expression of 314 | # [h_0*w_0*c_0+(h_0+2)(w_0+2)*Ci]*(H*W*Co)/(h_0*w_0*c_0) 315 | # + [K^2*Ci+h_0*w_0*c_0]*Co/c_0 316 | # this expression finally reduces to: 317 | # H*W*Co/c_0 + 2(h_0+w_0)Ci*H*W*Co/(h_0*w_0*c_0)+h_0*w_0*Co/c_0 318 | def row_major_mem_obj(self, x): 319 | return (self.ofmap_tile(x) + self.ifmap_tile(x)) \ 320 | * (self.total_ofmap_size()/self.ofmap_tile(x) - self.Co/x[0]) \ 321 | + self.total_weight_size()/x[0] + x[1]*x[2]*self.Co 322 | 323 | def row_major_comp_obj(self, x): 324 | return self.total_ofmap_size() / self.ofmap_tile(x) 325 | 326 | # make sure the load for row-major is always less than 327 | # load for channel-major, range : [0, +inf] 328 | def row_major_constraint(self, x): 329 | # simplified from K^2*C*c_0 >= C*((S*h_0+2)*(S*w_0+2)) 330 | return self.K_h*self.K_w*x[0] - (self.S*x[1]+2)*(self.S*x[2]+2) 331 | 332 | # make sure the process is always memory-bound, 333 | # i.e., the latency of memory access is always 334 | # greater than the latency of compute; 335 | # (c_0*(h_0*w_0)+C*(S*h_0+2)*(S*w_0+2))/B >= (K^2*C/A^2)*c_0*w_0*h_0 336 | # range : [0, +inf] 337 | def row_major_mem_bound_constraint(self, x): 338 | return (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B \ 339 | - self.weight_tile(1)/(self.A*self.A)*self.ofmap_tile(x) 340 | 341 | # make sure the process is always compute-bound, 342 | # i.e., the latency of compute is always 343 | # greater than the latency of memory access; 344 | # (c_0*(h_0*w_0)+C*(S*h_0+2)*(S*w_0+2))/B <= (K^2*C/A^2)*c_0*w_0*h_0 345 | # range : [0, +inf] 346 | def row_major_comp_bound_constraint(self, x): 347 | return self.weight_tile(1)/(self.A*self.A)*self.ofmap_tile(x) \ 348 | - (self.ofmap_tile(x) + 
self.ifmap_tile(x)) / self.B 349 | 350 | ############################################################### 351 | # channel-major constraint solving obj and constraints # 352 | ############################################################### 353 | 354 | # the minimization objective of channel-major 355 | # this is the simplified expression of 356 | # (K^2*Ci*c_0+h_0*w_0*c_0)*(H*W*Co)/(h_0*w_0*c_0) 357 | # + [(h_0+2)(w_0+2)*Ci + h_0*w_0*c_0]*(H*W)/(h_0*w_0) 358 | def channel_major_mem_obj(self, x): 359 | return self.total_weight_size()/(x[1]*x[2]) + \ 360 | 2*(self.S*x[1]+self.S*x[2])*self.Co/(x[1]*x[2])+1.0/x[0] 361 | 362 | # the objective to minimize in the 363 | # channel-major compute-bound case 364 | def channel_major_comp_obj(self, x): 365 | return self.total_ofmap_size()/(x[1]*x[2]*x[0]) 366 | 367 | # make sure the load for channel-major is always less than 368 | # load for row-major, range : [0, +inf] 369 | def channel_major_constraint(self, x): 370 | # simplified from K^2*C*c_0 <= C*((S*h_0+2)*(S*w_0+2)) 371 | return (self.S*x[1]+2)*(self.S*x[2]+2) - self.K_h*self.K_w*x[0] 372 | 373 | # make sure the process is always memory-bound, 374 | # i.e., the latency of memory access is always 375 | # greater than the latency of compute; 376 | # c_0*(h_0*w_0+K^2*C)/B >= (K^2*C/A^2)*c_0*(h_0*w_0) 377 | # range : [0, +inf] 378 | def channel_major_mem_bound_constraint(self, x): 379 | return (x[1]*x[2] + self.weight_tile(1)) / self.B \ 380 | - self.weight_tile(1)/(self.A*self.A)*x[1]*x[2] 381 | 382 | # make sure the process is always compute-bound, 383 | # i.e., the latency of compute is always 384 | # greater than the latency of memory access; 385 | # c_0*(h_0*w_0+K^2*C)/B <= (K^2*C/A^2)*c_0*(h_0*w_0) 386 | # range : [0, +inf] 387 | def channel_major_comp_bound_constraint(self, x): 388 | return self.weight_tile(1)/(self.A*self.A)*x[1]*x[2] \ 389 | - (x[1]*x[2] + self.weight_tile(1)) / self.B 390 | 391 | 392 | 
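The row-major bound constraints in `layer_optimizer.py` compare two latencies for one tile: memory access, `(ofmap_tile + ifmap_tile) / B`, and compute, `K_h*K_w*Ci / A^2 * ofmap_tile`. The sketch below restates that comparison as a standalone function so the sign convention is easy to check by hand. The function name `row_major_bound` and the sample numbers are illustrative assumptions, not part of the repository; the fixed halo of 2 is copied from `ifmap_tile`, and `B` is taken to be elements per cycle as in `LayerBaseMethod.__init__`.

```python
def row_major_bound(c_0, h_0, w_0, Ci, K_h, K_w, S, A, B):
    """Classify one row-major tile as compute-bound ("C") or memory-bound ("M").

    Hypothetical helper: it mirrors row_major_comp_bound_constraint, whose
    value is >= 0 exactly when compute latency dominates memory latency.
    """
    # tile sizes, following ofmap_tile() and ifmap_tile() in LayerOptimizer
    ofmap_tile = c_0 * h_0 * w_0
    ifmap_tile = Ci * (S * h_0 + 2) * (S * w_0 + 2)

    # latency of moving the tile on and off chip
    mem_latency = (ofmap_tile + ifmap_tile) / float(B)
    # latency of computing the tile on an A-by-A systolic array
    comp_latency = (K_h * K_w * Ci) / float(A * A) * ofmap_tile

    slack = comp_latency - mem_latency
    return ("C" if slack >= 0 else "M"), slack


if __name__ == "__main__":
    # 3x3 kernel, 64 input channels: plenty of reuse per element moved
    print(row_major_bound(16, 8, 8, 64, 3, 3, 1, 16, 12.8))
    # 1x1 kernel, 4 input channels: very little compute per element moved
    print(row_major_bound(16, 8, 8, 4, 1, 1, 1, 16, 12.8))
```

A non-negative slack corresponds to a satisfied `'ineq'` constraint in the compute-bound case; negating it gives the memory-bound constraint, which is why each layer is solved once under each assumption.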
-------------------------------------------------------------------------------- /layer_static_method.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | # my own module 8 | from layer_base_method import * 9 | 10 | ############################################################### 11 | # general process # 12 | ############################################################### 13 | class LayerStaticMethod(LayerBaseMethod): 14 | 15 | # array to store the result from the four different results 16 | res = [] 17 | 18 | def __init__(self, data, sys_info, buffer_partition=None): 19 | super(LayerStaticMethod, self).__init__(data, sys_info) 20 | self.rets = [] 21 | if buffer_partition: 22 | # calculate buffer sizes 23 | self.bufi_size = self.buf_size* \ 24 | buffer_partition[0]/sum(buffer_partition) 25 | self.bufo_size = self.buf_size* \ 26 | buffer_partition[1]/sum(buffer_partition) 27 | self.bufw_size = self.buf_size* \ 28 | buffer_partition[2]/sum(buffer_partition) 29 | 30 | # the main optimization routine; 31 | def opti_buffer(self): 32 | # set the initial guess; 33 | x0 = [self.A, self.A] 34 | 35 | # check if the initial configuration can hold the minimum requirements 36 | if ((x0[0]*self.K_h*self.K_w*self.Ci > self.bufw_size) or 37 | (self.S*self.S*x0[1]*self.Ci > self.bufi_size)): 38 | return 39 | 40 | # first, let's find the number of kernel we can put into buffer. 41 | while (x0[0]+self.A)*self.K_h*self.K_w*self.Ci < self.bufw_size: 42 | x0[0] = x0[0]+self.A 43 | 44 | # next let's see what percentage of ifmap can we fit into the buffer. 45 | if self.K_h*(2*self.S+self.W) >= self.bufi_size: 46 | while self.K_h*(self.S+x0[1]+self.A)*self.Ci < self.bufi_size: 47 | x0[1] = x0[1]+self.A 48 | # no need to optimize the buffer for ofmap, because it is 49 | # bounded ifmap. 
50 | x = [x0[0], x0[1], 1] 51 | else: 52 | h_0 = 1 53 | while self.K_h*(self.S+self.W)*(self.S+h_0+1)*self.Ci < self.bufi_size: 54 | h_0 += 1 55 | # no need to optimize the buffer for ofmap, because it is 56 | # bounded by the ifmap. 57 | x = [x0[0], self.W, h_0] 58 | 59 | self.process_parameter(x, False, False) 60 | self.process_parameter(x, False, True) 61 | self.process_parameter(x, True, False) 62 | self.process_parameter(x, True, True) 63 | 64 | 65 | def optimize_one_buffer_partition(self): 66 | # if the sum of bufi and bufw exceeds self.buf_size, 67 | # skip this partition. 68 | if (self.bufi_size + self.bufw_size) > self.buf_size: 69 | return 70 | 71 | self.bufo_size = self.buf_size - self.bufi_size - self.bufw_size 72 | # both cases are possible; 73 | self.opti_buffer() 74 | 75 | if len(self.res) == 0: 76 | return 77 | 78 | # res[0:2] are the channel-major results (row_major=False); the larger cycle count is the bottleneck 79 | channel_major_res = None 80 | if (self.res[0]["total_cycle"] < self.res[1]["total_cycle"]): 81 | channel_major_res = self.res[1] 82 | else: 83 | channel_major_res = self.res[0] 84 | 85 | # res[2:4] are the row-major results (row_major=True); the larger cycle count is the bottleneck 86 | row_major_res = None 87 | if (self.res[2]["total_cycle"] < self.res[3]["total_cycle"]): 88 | row_major_res = self.res[3] 89 | else: 90 | row_major_res = self.res[2] 91 | 92 | # keep the ordering with the fewest cycles as the preferred compute ordering. 
93 | ret = None 94 | if (row_major_res["total_cycle"] < channel_major_res["total_cycle"]): 95 | ret = dict(row_major_res) 96 | else: 97 | ret = dict(channel_major_res) 98 | 99 | self.rets.append(ret) 100 | 101 | # optimize one layer 102 | def optimize(self): 103 | self.init_setup() 104 | 105 | # start the optimization 106 | self.res = [] 107 | self.optimize_one_buffer_partition() 108 | 109 | # find the best result 110 | ret = dict(self.rets[0]) 111 | 112 | for item in self.rets: 113 | if ret["total_cycle"] > item["total_cycle"]: 114 | ret = dict(item) 115 | if ret["total_cycle"] == item["total_cycle"] and \ 116 | ret["total_transfer"] > item["total_transfer"]: 117 | ret = dict(item) 118 | 119 | return ret 120 | -------------------------------------------------------------------------------- /multi_layer_perceptron.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | 3 | # public library 4 | import math 5 | import numpy as np 6 | 7 | class MultiLayerPerceptron(object): 8 | """docstring for MultiLayerPerceptron""" 9 | # info for systolic array 10 | A = None # systolic array dimension 11 | 12 | # memory bandwith number of bytes can be transferred. 
13 | B = None 14 | 15 | # on-chip buffer size 16 | buf_size = None 17 | 18 | # input layer dimension 19 | N = None # numbers of feature (NumberOfPoints x NumberOfFeature) 20 | Ci = None # channels for ifmap 21 | Co = None # channels for ofmap 22 | 23 | # on-chip buffer size 24 | bufi_size = None 25 | bufo_size = None 26 | bufw_size = None 27 | 28 | """docstring for MultiLayerPerceptron""" 29 | def __init__(self, data, sys_info): 30 | self.data = data 31 | self.sys_info = sys_info 32 | self.A = sys_info["sa_size"] 33 | self.B = sys_info["memory_bandwidth"]/(sys_info["bit_width"]/8) 34 | self.buf_size = sys_info["bufsize"] 35 | 36 | def init_setup(self): 37 | layer_info = self.data 38 | 39 | # set up the new layer information 40 | [self.N, self.Ci] = layer_info["ifmap"] 41 | self.Co = layer_info["out_channel"] 42 | 43 | self.bufw_size = self.Co * self.Ci 44 | 45 | ############################################################### 46 | # general process # 47 | ############################################################### 48 | 49 | # compute buffer utilization 50 | def buffer_utilization(self, x): 51 | # buffer = ofmap + weights + ifmap 52 | return (x*self.Co + self.Ci*self.Co + x*self.Ci) 53 | 54 | # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0 55 | def data_transfer(self, x): 56 | # calculate the total batch 57 | total_batch = math.ceil(float(self.N) / x) 58 | 59 | # ofmap, ifmap and kernel tile size 60 | ofmap_tile_size = self.Co * x 61 | ifmap_tile_size = self.Ci * x 62 | kernel_tile_size = self.Co*self.Ci 63 | 64 | # ofmap + ifmap transfer 65 | total_transfer = (ofmap_tile_size + ifmap_tile_size) * total_batch 66 | 67 | # add additional data transfer 68 | total_transfer += kernel_tile_size 69 | 70 | return total_transfer 71 | 72 | def systolic_array_utilization(self, x): 73 | A = self.A 74 | A_w_uiti = math.ceil(self.Co/math.ceil(float(self.Co)/A)) 75 | 76 | total_usage = x * self.Co 77 | round_up_val = math.ceil(float(x/A)) * A \ 78 | * 
math.ceil(float(self.Co)/A)*A 79 | 80 | # the pct of extra delay due to output-stationary 81 | delay_pct = float(self.Ci)/(self.Ci+A_w_uiti) 82 | 83 | return delay_pct * total_usage / round_up_val 84 | 85 | def compute_bound_cycle(self, util_rate): 86 | # total number of ops 87 | total_computation = (self.N*self.Ci*self.Co) 88 | 89 | # systolic array calculation capacity 90 | comp_cap = (self.A*self.A) * util_rate 91 | 92 | return total_computation / comp_cap 93 | 94 | def process_parameter(self, x): 95 | 96 | x = math.floor(x) 97 | bound = "C" 98 | # make the tile size even for every batch 99 | x_0 = min(self.N/math.ceil(self.N/round(x)), self.N) 100 | 101 | # (ofmap + ifmap)*total_batch + weights 102 | total_transfer = self.data_transfer(x_0) 103 | 104 | # compute the utilization of systolic array 105 | util_sys_arr = self.systolic_array_utilization(x_0) 106 | 107 | # compute the utilization of buffer 108 | util_buf = float(self.buffer_utilization(x_0))/self.buf_size 109 | 110 | if util_buf > 1.01: 111 | print("ERROR: the utilization of buffer is over 100%") 112 | exit() 113 | 114 | # calculate the amount of cycles of computing all elements. 115 | if self.compute_bound_cycle(util_sys_arr) > total_transfer/self.B: 116 | bound = "C" 117 | total_cycle = self.compute_bound_cycle(util_sys_arr) 118 | else: 119 | bound = "M" 120 | total_cycle = total_transfer/self.B 121 | 122 | ret = { 123 | "total_transfer": round(total_transfer), 124 | "total_cycle": round(total_cycle), 125 | "systolic_array_utilization": util_sys_arr, 126 | "buffer_utilization": util_buf, 127 | "buffer-partition [I,W,O]": [int(self.bufi_size), 128 | int(self.bufw_size), 129 | int(self.bufo_size)], 130 | "x_0": math.floor(x_0), 131 | "Bound" : bound 132 | } 133 | 134 | return ret 135 | 136 | # optimize one layer 137 | def optimize(self): 138 | self.init_setup() 139 | 140 | # if sum of bufi and bufw is over the self.buf_size 141 | # we should skip it. 
        if self.bufw_size > self.buf_size:
            print("FAIL: the entire weight cannot be stored in buffer")
            exit()

        # split the remaining buffer between ifmap and ofmap in proportion
        # to their channel counts
        self.bufi_size = (self.buf_size - self.bufw_size)*self.Ci/(self.Ci+self.Co)
        self.bufo_size = (self.buf_size - self.bufw_size)*self.Co/(self.Ci+self.Co)

        # initial guess: one systolic-array width of points
        x0 = self.A

        # grow the tile by one array width at a time while the ifmap tile
        # still fits in its buffer partition.
        while x0 < self.N and (x0+self.A)*self.Ci < self.bufi_size:
            x0 = x0 + self.A

        return self.process_parameter(x0)

--------------------------------------------------------------------------------
/runner.sh:
--------------------------------------------------------------------------------
#!/bin/bash

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6 \
    --static True \
    --buffer_partition 3 3 4

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --search_method Constrained \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --search_method Exhaustive \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --search_method Constrained \
    --split True \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --search_method Exhaustive \
    --split True \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --search_method Constrained \
    --split True \
    --combine True \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6

python dataflow_search.py --dnnfile dnns/flowNetC.txt \
    --model_type 2D \
    --search_method Exhaustive \
    --split True \
    --combine True \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap 960 576 6

--------------------------------------------------------------------------------
/runner3d.sh:
--------------------------------------------------------------------------------
#!/bin/bash

python dataflow_search.py --dnnfile dnns/test3d.txt \
    --model_type 3D \
    --search_method Constrained \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap3d 480 288 96 64

python dataflow_search.py --dnnfile dnns/test3d.txt \
    --model_type 3D \
    --search_method Exhaustive \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap3d 480 288 96 64

python dataflow_search.py --dnnfile dnns/test3d.txt \
    --model_type 3D \
    --search_method Constrained \
    --split True \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap3d 480 288 96 64

python dataflow_search.py --dnnfile dnns/test3d.txt \
    --model_type 3D \
    --search_method Exhaustive \
    --split True \
    --bufsize 1572864 \
    --bit_width 16 \
    --memory_bandwidth 25.6 \
    --sa_size 16 \
    --ifmap3d 480 288 96 64

--------------------------------------------------------------------------------
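
As a rough illustration of the tile cost model in `multi_layer_perceptron.py`, the sketch below recomputes the data transfer, the buffer utilization, and the compute-bound versus memory-bound decision as a standalone function. It is a simplified sketch, not the repository's code: `mlp_tile_cost` and its argument names are hypothetical, the systolic array is assumed to be fully utilized (the `systolic_array_utilization` correction is left out), and `bandwidth` is assumed to be given in elements per cycle, i.e. `memory_bandwidth / (bit_width / 8)` as in `__init__`. The layer sizes in the usage example are made up.

```python
import math


def mlp_tile_cost(n, ci, co, x, sa_size, bandwidth, buf_size):
    """Cost of an (n x ci) -> (n x co) layer processed in tiles of x points.

    Hypothetical helper: mirrors data_transfer, buffer_utilization and the
    bound check of process_parameter, assuming 100% array utilization.
    """
    # number of tiles needed to cover all n points
    total_batch = math.ceil(n / x)

    # ifmap and ofmap tiles move once per batch, the weights move once
    total_transfer = (co * x + ci * x) * total_batch + co * ci

    # share of the on-chip buffer used by one ofmap tile, the weights
    # and one ifmap tile
    buffer_util = (x * co + ci * co + x * ci) / buf_size

    # cycles if the sa_size x sa_size array were kept fully busy (assumption)
    compute_cycles = n * ci * co / (sa_size * sa_size)
    # cycles needed just to move the data
    memory_cycles = total_transfer / bandwidth

    bound = "C" if compute_cycles > memory_cycles else "M"
    return {
        "total_transfer": total_transfer,
        "buffer_utilization": buffer_util,
        "compute_cycles": compute_cycles,
        "memory_cycles": memory_cycles,
        "bound": bound,
    }


if __name__ == "__main__":
    # made-up layer: 1024 points, 64 -> 128 channels, tiles of 256 points,
    # with the hardware numbers used in runner.sh (25.6 / (16 / 8) = 12.8)
    print(mlp_tile_cost(1024, 64, 128, 256, 16, 12.8, 1572864))
```

With wide layers the compute term tends to dominate and the layer comes out compute-bound; with very few channels the same function reports a memory-bound layer, which is the trade-off the `Bound` field in the optimizer's result reflects.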