├── README.md
├── __init__.py
├── dataflow_search.py
├── deconv_exhaustive_searcher.py
├── deconv_optimizer.py
├── dnn_optimizer.py
├── dnns
    ├── GC_net.txt
    ├── PSMNet.txt
    ├── flowNetC.txt
    ├── flowNetS.txt
    ├── test.txt
    └── test3d.txt
├── fc_layer.py
├── layer3d_base_method.py
├── layer3d_exhaustive_searcher.py
├── layer3d_optimizer.py
├── layer3d_static_method.py
├── layer_base_method.py
├── layer_exhaustive_searcher.py
├── layer_optimizer.py
├── layer_static_method.py
├── multi_layer_perceptron.py
├── runner.sh
└── runner3d.sh


/README.md:
--------------------------------------------------------------------------------
  1 | # Systolic array dataflow optimizer
  2 | 
  3 | This is a DNN dataflow optimizer for a particular kind of hardware accelerator, the systolic
  4 | array. It finds an optimal or approximately optimal dataflow for a particular DNN
  5 | under given hardware constraints, such as memory bandwidth and SRAM size.
  6 | This repository is the artifact of our paper [ASV: Accelerated Stereo Vision System](https://www.cs.rochester.edu/horizon/pubs/micro19-tigris.pdf) published at MICRO 2019.
  7 | 
  8 | ## What's new
  9 | 
 10 | The goals of this optimizer are:
 11 | 
 12 | * First, this optimizer aims to find a close-to-optimal configuration that
 13 |   minimizes the latency and reduces the data traffic at the same time.
 14 | * Second, this optimizer explores different search and optimization schemes,
 15 |   so that we can show the trade-offs between them.
 16 | * Third, this optimizer can automatically apply a special optimization for
 17 |   deconvolutions in the DNN pipeline, with different levels of optimization
 18 |   to explore.
 19 | 
 20 | ## What's inside
 21 | 
 22 | There are two main parts in this framework:
 23 | 1. The overall framework to drive different optimizers.
 24 | 
 25 | * `dataflow_search.py`
 26 | * `dnn_optimizer.py`
 27 | 
 28 | 2. The layer-level optimizers.
 29 | 
 30 | * `layer(3d)_base_method.py`
 31 | * `layer(3d)_static_method.py`
 32 | * `layer(3d)_optimizer.py`
 33 | * `layer(3d)_exhaustive_searcher.py`
 34 | * `deconv_exhaustive_searcher.py`
 35 | 
 36 | ## How to use
 37 | 
 38 | Prerequisite packages: scipy and numpy (sys, math, and pprint ship with Python).
 39 | 
 40 | To see the help information for the dataflow optimizer, run:
 41 | 
 42 | ```
 43 |   $ python dataflow_search.py -h
 44 | ```
 45 | 
 46 | By specifying the configuration in the input options and the particular DNN
 47 | you want to optimize, the optimizer will return a dataflow scheme for you. The
 48 | sample DNNs are in the `/dnns` directory.
 49 | 
 50 | A simple example of using this tool on a DNN:
 51 | ```
 52 |   $ python dataflow_search.py --dnnfile dnns/flowNetC.txt \
 53 |         --model_type 2D \
 54 |         --search_method Constrained \
 55 |         --split True \
 56 |         --bufsize 1572864 \
 57 |         --bit_width 16 \
 58 |         --memory_bandwidth 25.6 \
 59 |         --sa_size 16 \
 60 |         --ifmap 960 576 6
 62 | ```
 63 | This will load the DNN from `flowNetC.txt` and search for its dataflow
 64 | using *constrained optimization*. You can use the `--search_method` option to specify
 65 | which search method to use. The options include `Constrained`, which uses
 66 | constrained optimization, and `Exhaustive`, which uses exhaustive search.
 67 | This command also specifies the ifmap size to be
 68 | 960-576-6 (width-height-channel).
 69 | 
 70 | In this command, we also need to provide the hardware configuration. `bufsize` specifies that
 71 | the on-chip buffer size is *1572864* bytes. `memory_bandwidth` specifies that the memory
 72 | bandwidth is *25.6* GB/s. `sa_size` specifies that the systolic array size is *16-by-16*.
 73 | `bit_width` specifies the number of bits used to represent
 74 | a single number.
 75 | 
 76 | For details of other flags, please see the explanation below.
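
As a quick sanity check on the hardware values used in the example command, the sketch below works out what they mean in more familiar units. The variable names are illustrative only, and whether the optimizer converts bytes to values in exactly this way internally is an assumption.

```python
# Hardware values taken from the example command above.
bufsize_bytes = 1572864
bit_width = 16
sa_size = 16

bytes_per_value = bit_width // 8                    # 2 bytes per 16-bit value
buffer_capacity = bufsize_bytes // bytes_per_value  # values that fit on chip
num_pes = sa_size * sa_size                         # processing elements in the array

print(bufsize_bytes / (1024 * 1024))  # 1.5 (MiB)
print(buffer_capacity)                # 786432
print(num_pes)                        # 256
```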
 77 | 
 78 | The dataflow optimizer prints the result in a JSON-like format. An example result
 79 | is shown below:
 80 | 
 81 | ```
 82 | {'dnn': [{'Deconv?': False,                                  <--------- DNN architecture
 83 |           'ifmap': [960, 576, 6],
 84 |           'kernel': [7, 7],
 85 |           'out_channel': 64,
 86 |           'stride': 2,
 87 |           'type': '2D'},
 88 |           ......
 89 |           
 90 |          {'Deconv?': True,
 91 |           'ifmap': [120.0, 72.0, 128],
 92 |           'kernel': [5, 5],
 93 |           'out_channel': 64,
 94 |           'stride': 1,
 95 |           'type': '2D'}],
 96 |  'dnn_result': [{'data': {'Deconv?': False,                 <-------- optimization result
 97 |                           'ifmap': [960, 576, 6],
 98 |                           'kernel': [7, 7],
 99 |                           'ofmap': [480.0, 288.0, 64],
100 |                           'out_channel': 64,
101 |                           'stride': 2,
102 |                           'type': '2D'},
103 |                  'result': {'Bound': 'C',
104 |                             'Tile size': [1.0, 8.0, 5.0],               # tiling schedule
105 |                             'buffer_utilization': 0.7890045166015626,
106 |                             'c_0, w_0, h_0': [64, 120, 115],            # tile size
107 |                             'systolic_array_utilization': 1.0,
108 |                             'total_cycle': 40642560,                    # execution cycles
109 |                             'total_transfer': 85029312}},               # DRAM access (in Bytes)
110 |                 ......
111 |                 
112 |                 {'data': {'Deconv?': True,
113 |                           'ifmap': [120.0, 72.0, 128],
114 |                           'kernel': [5, 5],
115 |                           'out_channel': 64,
116 |                           'stride': 1,
117 |                           'type': '2D'},
118 |                  'result': [{'Bound': 'C',
119 |                              'Tile size': [1.0, 2.0, 1.0],              # tiling schedule
120 |                              'buffer_utilization': 0.6793619791666666,
121 |                              'c_0, w_0, h_0': [64, 60, 72],             # tile size
122 |                              'systolic_array_utilization': 1.0,
123 |                              'total_cycle': 6912000,                    # execution cycles
124 |                              'total_transfer': 2690048}]}],             # DRAM access (in Bytes)
125 |  'method': 'Constrained',
126 |  'schedule': {'combine': True, 'split': True, 'static': False},         # optimization flags
127 |  'system_info': {'bit_width': 16.0,
128 |                  'bufsize': 1572864.0,
129 |                  'memory_bandwidth': 25.6,
130 |                  'sa_size': 16.0}}
131 | ```
132 | 
133 | The result contains the following fields:
134 |   * `method` : the method you specified for dataflow search.
135 |   * `schedule` : the optimization options you specified for deconvolution.
136 |     Please check out our paper for more details on this.
137 |   * `system_info` : the hardware configuration.
138 |   * `dnn` : the architecture of your DNN.
139 |   * `dnn_result` : the optimization result for your DNN.
140 | 
141 | You can run more examples of dataflow optimization by `runner.sh`.
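
If you want to post-process the printed result, the following is a minimal sketch of how the per-layer numbers in `dnn_result` could be summed. It assumes the structure shown in the example output, where `result` is either a single dictionary or a list of dictionaries; `summarize` is a hypothetical helper name, not a function exported by this repository.

```python
def summarize(meta_data):
    """Sum cycles and DRAM traffic over all layers of a result dictionary."""
    total_cycle = 0
    total_transfer = 0
    for layer in meta_data["dnn_result"]:
        result = layer["result"]
        # 'result' may be a single dict or a list of dicts (see the example output)
        entries = result if isinstance(result, list) else [result]
        for entry in entries:
            total_cycle += entry["total_cycle"]
            total_transfer += entry["total_transfer"]
    return total_cycle, total_transfer


# abbreviated version of the example output (only the two layers shown there)
example = {
    "dnn_result": [
        {"result": {"total_cycle": 40642560, "total_transfer": 85029312}},
        {"result": [{"total_cycle": 6912000, "total_transfer": 2690048}]},
    ]
}
cycles, transfer = summarize(example)
print(cycles, transfer)  # 47554560 87719360
```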
142 | 
143 | ### How to specify a DNN configuration
144 | 
145 | We provide a simple way to specify the configuration (or architecture) of
146 | each DNN layer. Examples are shown in `/dnns`.
147 | 
148 | The layer parameters are separated by `,`, in the following order:
149 | ofmap channels, kernel height, kernel width, stride, and a flag indicating whether
150 | the layer is a deconvolution layer.
151 | 
152 | A simple example for a 2D DNN is shown below:
153 | ```
154 | # ofmap channels, kernel height, kernel width, stride,deconv?
155 | 64,7,7,2,False
156 | 128,5,5,2,False
157 | ...
158 | 
159 | 128,5,5,1,True
160 | 64,5,5,1,True
161 | ```
162 | 
163 | A simple example for a 3D DNN is shown here:
164 | ```
165 | # ofmap channels, kernel height, kernel width, kernel depth, stride, deconv?
166 | 32,3,3,3,1,False
167 | 32,3,3,3,1,False
168 | ...
169 | 
170 | 64,3,3,3,1,True
171 | 32,3,3,3,1,True
172 | ```
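
To make the two formats concrete, here is a minimal sketch that turns one layer line into a dictionary, following the field order given in the example headers above. `parse_layer_line` is a hypothetical helper written for illustration; it is not the loader used by `dataflow_search.py`, and it does not handle comment or `...` lines.

```python
def parse_layer_line(line):
    """Parse one comma-separated layer line into a dict (hypothetical helper)."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) == 5:      # 2D: channels, kernel h, kernel w, stride, deconv?
        kernel, rest = fields[1:3], fields[3:]
    elif len(fields) == 6:    # 3D: channels, kernel h, kernel w, kernel d, stride, deconv?
        kernel, rest = fields[1:4], fields[4:]
    else:
        raise ValueError("unexpected number of fields: %d" % len(fields))
    return {
        "out_channel": int(fields[0]),
        "kernel": [int(k) for k in kernel],
        "stride": int(rest[0]),
        "Deconv?": rest[1] == "True",
        "type": "2D" if len(fields) == 5 else "3D",
    }


print(parse_layer_line("64,7,7,2,False"))
print(parse_layer_line("64,3,3,3,1,True"))
```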
173 | 
174 | ## Explanation of each option flag
175 | 
176 | The flags fall into three groups:
177 | 
178 | 1. input and output files
179 |   * `--dnnfile` : the DNN file to optimize.
180 |   * `--outfile` : the file to dump all the results.
181 | 
182 | 2. options related to the search
183 |   * `--static` : enable static partitioning of the buffer. This flag fixes the
184 |     SRAM partition and optimizes the dataflow of the entire DNN under that
185 |     partition. You also need to specify the `--buffer_partition`
186 |     flag.
187 |   * `--split` : enable our special optimization that splits a regular
188 |     deconvolution kernel into small sub-kernels and effectively avoids redundant
189 |     computation.
190 |   * `--combine` : enable interleaving the computation of the split
191 |     sub-kernels during convolution.
192 |   * `--model_type` : DNN model convolution type: 2D or 3D.
193 |   * `--ifmap` : the initial ifmap dimension, a.k.a. the input image, order: [W H C]
194 |   * `--ifmap3d` : the initial ifmap dimension in a 3D DNN, order: [W H D C]
195 |   * `--buffer_partition` : the buffer partition, order: [I O W]
196 |   * `--search_method` : there are three search options: "Constrained",
197 |      "Exhaustive", and "Combined". "Constrained" uses constrained optimization.
198 |      "Exhaustive" is an exhaustive search combined with DP.
199 |      "Combined" uses static partition to set the initial guess values for
200 |      constrained optimization and then runs constrained optimization.
201 | 
202 | 3. other hardware configurations
203 |   * `--bufsize` : the buffer size, or SRAM size, in bytes, e.g. 1048576.0.
204 |   * `--memory_bandwidth` : the DRAM bandwidth in GB/s, e.g. 25.6.
205 |   * `--sa_size` : the systolic array dimension, e.g. 16 means the systolic
206 |     array is 16 by 16.
207 |   * `--bit_width` : the bit width of each value (typically 8-bit, 16-bit, or 32-bit).
208 | 
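To give a feel for what the `--split` flag operates on, the sketch below computes the four sub-kernel shapes of a 2D deconvolution kernel. It mirrors our reading of the splitting step in `deconv_exhaustive_searcher.py`; the use of integer division and the helper name `sub_kernel_shapes` are assumptions made for illustration.

```python
def sub_kernel_shapes(kernel):
    """Return the four sub-kernel shapes of a 2D deconvolution kernel."""
    larger = [(k + 1) // 2 for k in kernel]   # ceil(k / 2)
    smaller = [k // 2 for k in kernel]        # floor(k / 2)
    return [
        [larger[0], larger[1]],
        [larger[0], smaller[1]],
        [smaller[0], larger[1]],
        [smaller[0], smaller[1]],
    ]


shapes = sub_kernel_shapes([5, 5])
print(shapes)                         # [[3, 3], [3, 2], [2, 3], [2, 2]]
print(sum(h * w for h, w in shapes))  # 25, the same number of weights as the 5x5 kernel
```
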
209 | ## Citing
210 | 
211 | This project is an artifact of our 2019 MICRO paper:
212 | 
213 | Y. Feng,  P. Whatmough, and Y. Zhu, "ASV: Accelerated Stereo Vision System", In Proc. of MICRO, 2019.
214 | 
215 | Please kindly consider citing this paper in your publications if it helps your research.
216 | ```
217 | @inproceedings{yu2019asv,
218 |   title={ASV: Accelerated Stereo Vision System},
219 |   author={Feng, Yu and Whatmough, Paul and Zhu, Yuhao},
220 |   booktitle={Proceedings of the 52nd International Symposium on Microarchitecture},
221 |   year={2019},
222 |   organization={ACM}
223 | }
224 | ```
225 | 


--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 | import layer_base_method


--------------------------------------------------------------------------------
/dataflow_search.py:
--------------------------------------------------------------------------------
  1 | 
  2 | import argparse
  3 | import numpy as np
  4 | import scipy
  5 | import sys
  6 | import pprint
  7 | import json
  8 | 
  9 | # import my own modules
 10 | import dnn_optimizer
 11 | 
 12 | # setup the argument parser
 13 | argparser = argparse.ArgumentParser("dataflow_search.py")
 14 | # input dnn file and output result
 15 | argparser.add_argument("--dnnfile", required=True)
 16 | argparser.add_argument("--outfile", help="output file to dump all the results")
 17 | 
 18 | # other search options; type=bool would parse any non-empty string (even "False") as True
 19 | argparser.add_argument("--static", type=lambda s: s.lower() == "true", default=False,
 20 |                         help="static partition the buffer without dynamically changing")
 21 | argparser.add_argument("--split", type=lambda s: s.lower() == "true", default=False,
 22 |                         help="enable to split the convolution kernel into small sub-kernels")
 23 | argparser.add_argument("--combine", type=lambda s: s.lower() == "true", default=False,
 24 |                         help="enable to combine the sub-kernels during compute")
 25 | argparser.add_argument("--model_type", default="2D", choices=["2D", "3D"],
 26 |                         help="DNN model convolution type: 2D or 3D.")
 27 | argparser.add_argument("--ifmap", nargs="+", type=int, required=False,
 28 |                         help="the ifmap dimension, order: [W H C]")
 29 | argparser.add_argument("--ifmap3d", nargs="+", type=int, required=False,
 30 |                         help="the ifmap dimension, order: [W H D C]")
 31 | argparser.add_argument("--buffer_partition", nargs="+", type=float,
 32 |                         help="the buffer partition, order: [I O W]")
 33 | argparser.add_argument("--search_method", default="Constrained",
 34 |                         choices=["Constrained", "Exhaustive", "Combined"],
 35 |                         help="Dataflow search methods: constrained optimization"
 36 |                         ", exhaustive search or combining both.")
 37 | 
 38 | # other hardware configurations
 39 | argparser.add_argument("--bufsize", type=float, default=1048576.0*1.5,
 40 |                         help="in Bytes")
 41 | argparser.add_argument("--memory_bandwidth", type=float, default=6.4*4,
 42 |                         help="in GB/s")
 43 | argparser.add_argument("--sa_size", type=float, default=16,
 44 |                         help="Systolic array size")
 45 | argparser.add_argument("--bit_width", type=float, default=16,
 46 |                         help="Bit Width of each value (typically, 8-bit, 16-bit, 32-bit)")
 47 | 
 48 | 
 49 | args = argparser.parse_args()
 50 | 
 51 | # import the dnn network description into the system;
 52 | # the format for one 2D DNN layer is:
 53 | # (width, height, in_channel, out_channel,
 54 | #  kernel_width, kernel_height, stride, Deconv?)
 55 | #
 56 | # for 3D DNN layer is
 57 | # (width, height, disparity, in_channel, out_channel,
 58 | #  kernel_width, kernel_height, kernel_disp, stride, Deconv?)
 59 | # 
 60 | # for MLP is 
 61 | # [in_channel, out_channel]
 62 | def import_dnn(filename, ifmap_dim, ifmap3d_dim):
 63 |     # a list to store the dnn configuration
 64 |     dnn = []
 65 |     weight_dim = []
 66 | 
 67 |     # The weight input format as follows:
 68 |     # [out_channel,kernel_width,kernel_height,stride,Deconv?]
 69 |     for line in open(filename):
 70 |         if len(line) <= 1 or line.lstrip().startswith("#"):
 71 |             continue
 72 |         ls = line.strip().split(",")
 73 | 
 74 |         if len(ls) == 5:
 75 |             dnn.append({"ifmap" : ifmap_dim,
 76 |                         "out_channel" : int(ls[0]),
 77 |                         "kernel" : [int(ls[1]), int(ls[2])],
 78 |                         "stride" : int(ls[3]),
 79 |                         "Deconv?" : ls[4] == "True",
 80 |                         "type" : "2D"})
 81 | 
 82 |             prev_layer = dnn[-1]
 83 | 
 84 |             if prev_layer["Deconv?"]:
 85 |                 # increase the deconv ofmap by two, as default,
 86 |                 # we only consider stride of 1
 87 |                 ifmap_dim = [ifmap_dim[0]*2/prev_layer["stride"], \
 88 |                              ifmap_dim[1]*2/prev_layer["stride"], \
 89 |                              prev_layer["out_channel"]]
 90 |             else:
 91 |                 # if it is Conv, scale down the ifmap dimension by stride;
 92 |                 ifmap_dim = [ifmap_dim[0]/prev_layer["stride"], \
 93 |                             ifmap_dim[1]/prev_layer["stride"], \
 94 |                             prev_layer["out_channel"]]
 95 | 
 96 |         elif len(ls) == 4:
 97 |             dnn.append({
 98 |                 "ifmap" : [int(ls[0])*int(ls[1]), int(ls[2])],
 99 |                 "out_channel" : int(ls[3]),
100 |                 "type" : "MLP",
101 |                 "Deconv?" : False,
102 |                 "stride" : 1
103 |                 })
104 |         elif len(ls) == 3:
105 |             dnn.append({
106 |                 "ifmap" : [int(ls[1])],
107 |                 "in_channel" : int(ls[1]),
108 |                 "out_channel" : int(ls[2]),
109 |                 "num_of_layer" : int(ls[0]),
110 |                 "type" : "FC",
111 |                 "Deconv?" : False,
112 |                 "stride" : 1
113 |                 })
114 |         else:
115 |             dnn.append({"ifmap" : ifmap3d_dim,
116 |                         "out_channel" : int(ls[0]),
117 |                         "kernel" : [int(ls[1]), int(ls[2]), int(ls[3])],
118 |                         "stride" : int(ls[4]),
119 |                         "Deconv?" : ls[5] == "True",
120 |                         "type" : "3D"})
121 | 
122 |             prev_layer = dnn[-1]
123 | 
124 |             if prev_layer["Deconv?"]:
125 |                 # increase the deconv ofmap by two, as default,
126 |                 # we only consider stride of 1
127 |                 ifmap3d_dim = [ifmap3d_dim[0]*2/prev_layer["stride"], \
128 |                                ifmap3d_dim[1]*2/prev_layer["stride"], \
129 |                                ifmap3d_dim[2]*2/prev_layer["stride"], \
130 |                                prev_layer["out_channel"]]
131 |             else:
132 |                 # if it is Conv, scale down the ifmap dimension by stride;
133 |                 ifmap3d_dim = [ifmap3d_dim[0]/prev_layer["stride"], \
134 |                                ifmap3d_dim[1]/prev_layer["stride"], \
135 |                                ifmap3d_dim[2]/prev_layer["stride"],  \
136 |                                prev_layer["out_channel"]]
137 | 
138 |     return dnn
139 | 
140 | # The hardware constraints are:
141 | #   1. the on-chip buffer size;
142 | #   2. the memory bandwidth; (Unit in bytes/cycle)
143 | #   3. the systolic array size;
144 | def hardware_constraints(sa_size=16.0, mem_bw=6.4*4, buf=1048576.0*1.5, bit_width=16.0):
145 |     systolic_arr_size = sa_size
146 |     memory_bandwidth = mem_bw
147 |     buffer_size = buf
148 |     return [systolic_arr_size, memory_bandwidth, buffer_size, bit_width]
149 | 
150 | def system_config(args, meta_data):
151 |     # set up the search methods
152 |     meta_data["method"] = args.search_method
153 |     # setup the system configuration;
154 |     meta_data["schedule"] = {}
155 |     meta_data["schedule"]["static"] = args.static
156 |     meta_data["schedule"]["split"] = args.split
157 |     meta_data["schedule"]["combine"] = args.combine
158 |     if args.buffer_partition:
159 |         meta_data["buffer_partition"] = args.buffer_partition
160 | 
161 |     # setup the system;
162 |     meta_data["system_info"] = {}
163 |     meta_data["system_info"]["bufsize"] = args.bufsize
164 |     meta_data["system_info"]["memory_bandwidth"] = args.memory_bandwidth
165 |     meta_data["system_info"]["sa_size"] = args.sa_size
166 |     meta_data["system_info"]["bit_width"] = args.bit_width
167 | 
168 |     return meta_data
169 | 
170 | def calculate_overall_performance(meta_data):
171 |     res = {
172 |         "total_cycle" : 0.0,
173 |         "each_layer_cycle" : [],
174 |         "total_transfer" : 0.0,
175 |         "each_layer_traffic" : [],
176 |     }
177 |     for data in meta_data["dnn_result"]:
178 |         if isinstance(data['result'], list):
179 |             for item in data['result']:
180 |                 res["total_cycle"] += item["total_cycle"]
181 |                 res["each_layer_cycle"].append(item["total_cycle"])
182 |                 res["total_transfer"] += item["total_transfer"]
183 |                 res["each_layer_traffic"].append(item["total_transfer"])
184 |         else:
185 |             res["total_cycle"] += data['result']["total_cycle"]
186 |             res["each_layer_cycle"].append(data['result']["total_cycle"])
187 |             res["total_transfer"] += data['result']["total_transfer"]
188 |             res["each_layer_traffic"].append(data['result']["total_transfer"])
189 | 
190 |     return res
191 | 
192 | 
193 | if __name__ == "__main__":
194 |     # initialize the result data;
195 |     meta_data = {}
196 | 
197 |     # setup system configuration;
198 |     meta_data = system_config(args, meta_data)
199 | 
200 |     # import the dnn
201 |     dnn = import_dnn(args.dnnfile, args.ifmap, args.ifmap3d)
202 |     meta_data["dnn"] = dnn
203 |     hw_constraints = hardware_constraints(sa_size=args.sa_size,
204 |                                           mem_bw=args.memory_bandwidth,
205 |                                           buf=args.bufsize,
206 |                                           bit_width=args.bit_width)
207 | 
208 |     # start the optimization main routine
209 |     meta_data["dnn_result"] = dnn_optimizer.opti_dnn(meta_data, hw_constraints)
210 | 
211 |     meta_data["overall_result"] = calculate_overall_performance(meta_data)
212 | 
213 |     pprint.pprint(meta_data)
214 | 


--------------------------------------------------------------------------------
/deconv_exhaustive_searcher.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | # my own module
  8 | from layer_base_method import *
  9 | 
 10 | ###############################################################
 11 | #                       general process                       #
 12 | ###############################################################
 13 | class DeconvExhaustiveSearcher(LayerBaseMethod):
 14 | 
 15 |     # array to store the result from the four different results
 16 |     rets = []
 17 | 
 18 |     """docstring for DeconvExhaustiveSearcher"""
 19 |     def __init__(self, data, sys_info):
 20 |         super(DeconvExhaustiveSearcher, self).__init__(data, sys_info)
 21 |         self.rets = []
 22 | 
 23 | 
 24 |     # compute buffer utilization
 25 |     def buffer_utilization(self, x, area):
 26 |         # buffer = ofmap + weights + ifmap
 27 |         total_buffer = self.Ci*(self.S*area[0]+2)*(self.S*area[1]+2)
 28 |         for i in range(len(x)):
 29 |             total_buffer += x[i]*area[0]*area[1]+ \
 30 |                     self.Ci*self.Subs[i][0]*self.Subs[i][1]*x[i]
 31 | 
 32 |         return total_buffer
 33 | 
 34 |     def compute_bound_cycle(self, i, util_rate, c_0):
 35 |         # total number of ops
 36 |         total_computation = (self.H*self.W*c_0)*\
 37 |             (self.Ci*self.Subs[i][0]*self.Subs[i][1])
 38 | 
 39 |         # systolic array calculation capacity
 40 |         comp_cap = (self.A*self.A) * util_rate
 41 | 
 42 |         return total_computation / comp_cap
 43 | 
 44 | 
 45 |     def process_parameter(self, x, area):
 46 |         area = list(map(lambda i: math.floor(i), area))
 47 |         w_0 = min(self.W/math.ceil(self.W/round(area[0])), self.W)
 48 |         h_0 = min(self.H/math.ceil(self.H/round(area[1])), self.H)
 49 | 
 50 |         total_cycle = 0
 51 | 
 52 |         # calculate the total data transfer
 53 |         ifmap_tile_size = (self.S*h_0+2)*(self.S*w_0+2)*self.Ci
 54 | 
 55 |         # calculate the total batch
 56 |         total_batch = self.H*self.W/(h_0*w_0)
 57 | 
 58 |         # ifmap transfer
 59 |         total_transfer = ifmap_tile_size * total_batch
 60 | 
 61 |         util_sys_arr = 0
 62 |         util_cnt = 0
 63 | 
 64 |         for i in range(len(x)):
 65 |             if (round(x[i]) == 0):
 66 |                 continue
 67 | 
 68 |             # compute the total number of elements needed to be updated in row-major.
 69 |             # ofmap and ifmap tile size
 70 |             ofmap_tile_size = h_0*w_0*x[i]
 71 | 
 72 |             # weight tile size
 73 |             kernel_tile_size = self.Subs[i][0]*self.Subs[i][1]*self.Ci*x[i]
 74 |             total_transfer += kernel_tile_size + ofmap_tile_size
 75 | 
 76 |             # compute the utilization of systolic array
 77 |             util_sys_arr += self.systolic_array_utilization(x[i], area)
 78 |             util_cnt += 1
 79 | 
 80 |             # compute the cycle for compute-/memory-bound
 81 |             comp_bound_cycle = self.compute_bound_cycle(i, util_sys_arr, x[i])
 82 |             mem_bound_cycle = total_transfer/self.B
 83 | 
 84 |             # pick up the greater value as the actual cycle
 85 |             total_cycle += max(comp_bound_cycle, mem_bound_cycle)
 86 | 
 87 |         if (util_cnt > 0):
 88 |             util_sys_arr = util_sys_arr/util_cnt
 89 | 
 90 |         return (total_cycle, total_transfer, util_sys_arr)
 91 | 
 92 |     def fill_bufw(self, remain_subkernels):
 93 |         x0 = [0]*len(self.data["sub-kernels"])
 94 |         sum_subs = 0
 95 |         for i in range(len(self.data["sub-kernels"])):
 96 |             sub_size = self.Subs[i][0]*self.Subs[i][1]
 97 |             # first, let's find the number of kernel we can put into buffer.
 98 |             while sum_subs < self.bufw_size \
 99 |                 and x0[i] < remain_subkernels[i]:
100 |                 x0[i] = x0[i]+self.A
101 |                 sum_subs += self.A*sub_size*self.Ci
102 | 
103 |             if x0[i] > remain_subkernels[i]:
104 |                 x0[i] = remain_subkernels[i]
105 | 
106 |         return x0
107 | 
108 |     # heuristically decide the area dimension. [W, H]
109 |     def area_dimension(self, area):
110 |         if area >= self.W * self.H:
111 |             return [self.W, self.H]
112 | 
113 |         if math.sqrt(area) > self.H:
114 |             tile_w = math.ceil(self.W/math.sqrt(area))
115 |             return [self.W/tile_w, self.H]
116 | 
117 |         tile_w = math.ceil(self.W/math.sqrt(area))
118 |         tile_h = math.ceil(self.H/math.sqrt(area))
119 |         return [self.W/tile_w, self.H/tile_h]
120 | 
121 |     # the main optimization routine;
122 |     def opti_buffer(self):
123 |         # check if the initial configuration can hold the minimum requirements
124 |         if ((self.A*self.K_h*self.K_w*self.Ci > self.bufw_size) or
125 |             (self.S*self.S*self.A*self.Ci > self.bufi_size)):
126 |             return
127 | 
128 |         total_cycle = 0
129 |         total_transfer = 0
130 |         remain_subkernels = [self.data["out_channel"]]*len(self.data["sub-kernels"])
131 | 
132 |         # set tile area;
133 |         area = 0
134 |         # next let's see how much ifmap can we fit into the buffer.
135 |         while self.S*self.S*(area+self.A)*self.Ci < self.bufi_size:
136 |             area = area+self.A
137 | 
138 |         round_result = []
139 |         result_cache = {}
140 | 
141 |         # no need to optimize the buffer for ofmap, because it is
142 |         # bounded ifmap.
143 |         x1 = self.area_dimension(area)
144 | 
145 |         while not all([sub <= 0.0 for sub in remain_subkernels]):
146 |     
147 |             # set the initial guess;
148 |             x0 = self.fill_bufw(remain_subkernels)
149 | 
150 |             util_buf = self.buffer_utilization(x0, x1)/self.buf_size
151 | 
152 |             # print(util_buf, x1, x0)
153 |             if util_buf > 1.01:
154 |                 return
155 | 
156 |             (cycle, transfer, util_rate) = self.process_parameter(x0, x1) \
157 |                 if str(x0 + x1) not in result_cache else result_cache[str(x0 + x1)]
158 | 
159 |             result_cache[str(x0 + x1)] = (cycle, transfer, util_rate)
160 | 
161 |             if cycle == -1 or transfer == -1:
162 |                 return
163 | 
164 |             total_transfer += transfer
165 |             total_cycle += cycle
166 | 
167 |             remain_subkernels = np.subtract(remain_subkernels, x0)
168 | 
169 |             round_result.append({"kernels" :x0,
170 |                                  "tiles" : x1,
171 |                                  "systolic array utilization" : util_rate})
172 | 
173 |         ret = {
174 |             "total_transfer": round(total_transfer),
175 |             "total_cycle": round(total_cycle),
176 |             "partition" :  {
177 |                 "bufi_size" : round(self.bufi_size),
178 |                 "bufw_size" : round(self.bufw_size),
179 |                 "bufo_size" : round(self.bufo_size),
180 |             },
181 |             "round_result" : round_result,
182 |         }
183 |         self.rets.append(ret)
184 | 
185 |     # optimize one layer
186 |     def optimize(self):
187 |         self.init_setup()
188 |   
189 |         layer_info = self.data
190 |         add_one = [(i+1)//2 for i in layer_info["kernel"]]
191 |         sub_one = [i//2 for i in layer_info["kernel"]]
192 |         self.data["sub-kernels"] = [
193 |             [add_one[0], add_one[1]],
194 |             [add_one[0], sub_one[1]],
195 |             [sub_one[0], add_one[1]],
196 |             [sub_one[0], sub_one[1]]]
197 | 
198 |         self.Subs = self.data["sub-kernels"]
199 | 
200 |         # print("##[LAYER]##", self.W, self.H, self.Ci, self.Co, self.K_w, self.K_h)
201 | 
202 |         for i in range(1, 20):
203 |             self.bufi_size = self.buf_size*i/20.0
204 |             for j in range(1, 20):
205 |                 self.bufw_size = self.buf_size*j/20.0
206 | 
207 |                 self.res = []
208 |                 # if sum of bufi and bufw is over the self.buf_size
209 |                 # we should skip it.
210 |                 if (self.bufi_size + self.bufw_size) > self.buf_size:
211 |                     continue
212 | 
213 |                 # set ofmap size
214 |                 self.bufo_size = self.buf_size - self.bufi_size - self.bufw_size
215 |                 # both cases are possible;
216 |                 self.opti_buffer()
217 | 
218 | 
219 |         ret  = dict(self.rets[0])
220 | 
221 |         for item in self.rets:
222 |             if ret["total_cycle"] > item["total_cycle"]:
223 |                 ret = dict(item)
224 |             if ret["total_cycle"] == item["total_cycle"] and \
225 |                 ret["total_transfer"] > item["total_transfer"]:
226 |                 ret = dict(item)
227 | 
228 |         return ret
229 | 


--------------------------------------------------------------------------------
/deconv_optimizer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | # my own module
  8 | from layer_base_method import *
  9 | 
 10 | ###############################################################
 11 | #                       general process                       #
 12 | ###############################################################
 13 | class DeconvOptimizer(LayerBaseMethod):
 14 | 
 15 |     # array to store the result from the four different results
 16 |     rets = []
 17 | 
 18 |     """docstring for DeconvOptimizer"""
 19 |     def __init__(self, data):
 20 |         super(DeconvOptimizer, self).__init__(data)
 21 |         self.rets = []
 22 | 
 23 |     # compute buffer utilization
 24 |     def buffer_utilization(self, x, area):
 25 |         # buffer = ofmap + weights + ifmap
 26 |         total_buffer = self.Ci*(self.S*area[0]+2)*(self.S*area[1]+2)
 27 |         for i in range(len(x)):
 28 |             total_buffer += x[i]*area[0]*area[1]+ \
 29 |                     self.Ci*self.Subs[i][0]*self.Subs[i][1]*x[i]
 30 | 
 31 |         return total_buffer
 32 | 
 33 |     def process_parameter(self, x, area):
 34 |         area = list(map(lambda i: math.floor(i), area))
 35 |         w_0 = min(self.W/math.ceil(self.W/round(area[0])), self.W)
 36 |         h_0 = min(self.H/math.ceil(self.H/round(area[1])), self.H)
 37 | 
 38 |         total_cycle = 0
 39 | 
 40 |         for i in range(len(x)):
 41 |             if (round(x[i]) == 0):
 42 |                 continue
 43 |             # make the tile size even for every batch
 44 |             c_0 = min(self.Co/math.ceil(self.Co/round(x[i])), self.Co)
 45 | 
 46 |             # check the result
 47 |             # print(c_0, w_0, h_0, self.Co/c_0, self.W/w_0, self.H/h_0)
 48 |             # compute the total number of elements needed to be updated 
 49 |             # if it is row-major.
 50 |             # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0
 51 |             total_transfer = (h_0*w_0*c_0+(self.S*h_0+2)*(self.S*w_0+2)*self.Ci) \
 52 |                                 *self.Subs[i][0]*self.Subs[i][0]/(h_0*w_0) \
 53 |                                 +(h_0*w_0*c_0+self.Subs[i][0]*self.Subs[i][0]*self.Ci*c_0)
 54 | 
 55 |             # compute the utilization of systolic array
 56 |             util_sys_arr = x[i]/(math.ceil(x[i]/self.A)*self.A) \
 57 |                                 *area[0]*area[1]/(math.ceil(area[0]*area[1]/self.A)*self.A)
 58 | 
 59 |             # # compute the utilization of systolic array
 60 |             # util_buf += self.buffer_utilization([c_0, w_0, h_0])/self.buffer_size
 61 | 
 62 |             # if util_buf > 1.01:
 63 |             #     return (-1, -1)
 64 | 
 65 |             comp_bound_cycle = (self.H*self.W*c_0)*(self.Ci*self.Subs[i][0]*self.Subs[i][0])\
 66 |                                 /(self.A*self.A)/util_sys_arr
 67 | 
 68 |             mem_bound_cycle = total_transfer/self.B
 69 | 
 70 |             total_cycle += max(comp_bound_cycle, mem_bound_cycle)
 71 | 
 72 |             # print comp_bound_cycle, mem_bound_cycle
 73 | 
 74 |         return (total_cycle, total_transfer)
 75 | 
 76 |         # # print(x[0],(math.ceil(x[0]/A)*A), x[1]*x[2], (math.ceil(x[1]*x[2]/A)*A))
 77 |         # ret = {
 78 |         #     "total_transfer": round(total_transfer),
 79 |         #     "total_cycle": round(total_cycle), 
 80 |         #     "systolic_array_utilization": util_sys_arr,
 81 |         #     "buffer_utilization": util_buf,
 82 |         #     "c_0, w_0, h_0": [c_0, w_0, h_0],
 83 |         #     "Tile size" : [self.Co/c_0, self.W/w_0, self.H/h_0],
 84 |         #     "Bound" : bound
 85 |         # }
 86 |         # self.res.append(ret)
 87 |         # return total_cycle
 88 | 
 89 |     def fill_bufw(self, remain_subkernels):
 90 |         x0 = [0]*len(self.data["sub-kernels"])
 91 |         sum_subs = 0
 92 |         for i in range(len(self.data["sub-kernels"])):
 93 |             sub_size = self.Subs[i][0]*self.Subs[i][1]
 94 |             # first, let's find the number of kernel we can put into buffer.
 95 |             while sum_subs < self.bufw_size \
 96 |                 and x0[i] < remain_subkernels[i]:
 97 |                 x0[i] = x0[i]+self.A
 98 |                 sum_subs += self.A*sub_size*self.Ci
 99 | 
100 |         return x0
101 | 
102 |     # the main optimization routine;
103 |     def opti_buffer(self):
104 |         # check if the initial configuration can hold the minimum requirements
105 |         if ((self.A*self.K_h*self.K_w*self.Ci > self.bufw_size) or
106 |             (self.S*self.S*self.A*self.Ci > self.bufi_size)):
107 |             return
108 | 
109 |         total_cycle = 0
110 |         total_transfer = 0
111 |         remain_subkernels = [self.data["out_channel"]]*len(self.data["sub-kernels"])
112 | 
113 |         # set tile area;
114 |         area = 0
115 |         # next let's see how much ifmap can we fit into the buffer.
116 |         while self.S*self.S*(area+self.A)*self.Ci < self.bufi_size:
117 |             area = area+self.A
118 | 
119 |         round_result = []
120 |         result_cache = {}
121 |         while not all([sub == 0 for sub in remain_subkernels]):
122 |             # set the initial guess;
123 |             x0 = self.fill_bufw(remain_subkernels)
124 | 
125 |             # no need to optimize the buffer for ofmap, because it is
126 |             # bounded by the ifmap.
127 |             x1 = [math.sqrt(area), math.sqrt(area)]
128 | 
129 |             util_buf = self.buffer_utilization(x0, x1)/self.buffer_size
130 | 
131 |             # print(util_buf, x1, x0)
132 |             if util_buf > 1.01:
133 |                 return
134 | 
135 |             (cycle, transfer) = self.process_parameter(x0, x1) \
136 |                 if str(x0 + x1) not in result_cache else result_cache[str(x0 + x1)]
137 | 
138 |             result_cache[str(x0 + x1)] = (cycle, transfer)
139 | 
140 |             if cycle == -1 or transfer == -1:
141 |                 return
142 | 
143 |             total_transfer += transfer
144 |             total_cycle += cycle
145 | 
146 |             remain_subkernels = np.subtract(remain_subkernels, x0)
147 | 
148 |             round_result.append(x0)
149 | 
150 |         ret = {
151 |             "total_transfer": round(total_transfer),
152 |             "total_cycle": round(total_cycle),
153 |             "partition" :  {
154 |                 "bufi_size" : round(self.bufi_size),
155 |                 "bufw_size" : round(self.bufw_size),
156 |                 "bufo_size" : round(self.bufo_size),
157 |             },
158 |             "round_result" : round_result,
159 |         }
160 |         self.rets.append(ret)
161 | 
162 |     # optimize one layer
163 |     def optimize(self):
164 |         global SysArr, Bandwith, BufferSize
165 | 
166 |         self.res = []
167 |         layer_info = self.data
168 |         # set up the new layer information
169 |         [self.W, self.H, self.Ci] = layer_info["ifmap"]
170 |         self.Co = layer_info["out_channel"]
171 |         [self.K_w, self.K_h] = layer_info["kernel"]
172 |         self.S = layer_info["stride"]
173 | 
174 |         add_one = [(i+1)/2 for i in layer_info["kernel"]]
175 |         sub_one = [i/2 for i in layer_info["kernel"]]
176 |         self.data["sub-kernels"] = [
177 |             [add_one[0], add_one[1]],
178 |             [add_one[0], sub_one[1]],
179 |             [sub_one[0], add_one[1]],
180 |             [sub_one[0], sub_one[1]]]
181 | 
182 |         self.Subs = self.data["sub-kernels"]
183 |         
184 |         # print("##[LAYER]##", self.W, self.H, self.Ci, self.Co, self.K_w, self.K_h)
185 | 
186 |         for i in range(1, 20):
187 |             self.bufi_size = BufferSize*i/20.0
188 |             for j in range(1, 20):
189 |                 self.bufw_size = BufferSize*j/20.0
190 | 
191 |                 self.res = []
192 |                 # if sum of bufi and bufw is over the BufferSize
193 |                 # we should skip it.
194 |                 if (self.bufi_size + self.bufw_size) > BufferSize:
195 |                     continue
196 | 
197 |                 # set ofmap size
198 |                 self.bufo_size = BufferSize - self.bufi_size - self.bufw_size
199 |                 # both cases are possible;
200 |                 self.opti_buffer()
201 | 
202 | 
203 |         ret  = dict(self.rets[0])
204 | 
205 |         for item in self.rets:
206 |             if ret["total_cycle"] > item["total_cycle"]:
207 |                 ret = dict(item)
208 |             if ret["total_cycle"] == item["total_cycle"] and \
209 |                 ret["total_transfer"] > item["total_transfer"]:
210 |                 ret = dict(item)
211 | 
212 |         return ret
213 | 


--------------------------------------------------------------------------------
/dnn_optimizer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | import matplotlib
  3 | import matplotlib.pyplot as plt
  4 | import matplotlib.patches as patches
  5 | from matplotlib.ticker import FuncFormatter
  6 | import numpy as np
  7 | import scipy
  8 | import sys
  9 | 
 10 | 
 11 | # import my own modules
 12 | import layer_optimizer
 13 | import layer_static_method
 14 | import layer_exhaustive_searcher
 15 | import deconv_exhaustive_searcher
 16 | from multi_layer_perceptron import MultiLayerPerceptron
 17 | from fc_layer import FullyConnectedLayer
 18 | 
 19 | import layer3d_optimizer
 20 | import layer3d_exhaustive_searcher
 21 | 
 22 | method = None
 23 | buffer_partition = None
 24 | enable = {
 25 |     "static" : False,
 26 |     "combine" : False,
 27 |     "split" : False,
 28 | }
 29 | 
 30 | def setup(meta_data, hardware_constraints):
 31 |     global enable, method, buffer_partition
 32 |     # define the search method
 33 |     method = meta_data["method"]
 34 | 
 35 |     if meta_data["schedule"]["static"] and \
 36 |         "buffer_partition" not in meta_data:
 37 |         raise Exception("The static scheduling is not supported"
 38 |             " without specifying the buffer partition.")
 39 | 
 40 |     if "buffer_partition" in meta_data:
 41 |         buffer_partition = meta_data["buffer_partition"]
 42 | 
 43 |     # set the schedule policy
 44 |     enable["static"] = meta_data["schedule"]["static"]
 45 |     enable["combine"] = meta_data["schedule"]["combine"]
 46 |     enable["split"] = meta_data["schedule"]["split"]
 47 | 
 48 | def single_layer_optimization(data, sys_info):
 49 |     global method, enable, buffer_partition
 50 |     if data["type"] == "MLP":
 51 |         return MultiLayerPerceptron(data, sys_info).optimize()
 52 | 
 53 |     if data["type"] == "FC":
 54 |         return FullyConnectedLayer(data, sys_info).optimize()
 55 | 
 56 |     # if "static" option is enabled, it will be prioritized
 57 |     if enable["static"]:
 58 |         return layer_static_method.\
 59 |             LayerStaticMethod(data, sys_info, buffer_partition).optimize()
 60 | 
 61 |     # check the potential method we use here.
 62 |     if method == "Constrained":
 63 |         if data["type"] == "2D":
 64 |             return layer_optimizer.\
 65 |                 LayerOptimizer(data, sys_info).optimize()
 66 |         else:
 67 |             return layer3d_optimizer.\
 68 |                 Layer3dOptimizer(data, sys_info).optimize()
 69 |     elif method == "Exhaustive":
 70 |         if data["type"] == "2D":
 71 |             return layer_exhaustive_searcher.\
 72 |                 LayerExhaustiveSearcher(data, sys_info).optimize()
 73 |         else:
 74 |             return layer3d_exhaustive_searcher.\
 75 |                 Layer3dExhaustiveSearcher(data, sys_info).optimize()
 76 |     elif method == "Combined":
 77 |         return layer_optimizer.\
 78 |             LayerOptimizer(data, sys_info, True).optimize()
 79 |     else:
 80 |         raise Exception("Unknown search method: {}".format(method))
 81 | 
 82 | def single_combine_optimization(data, sys_info):
 83 |     global method
 84 |     if method == "Constrained":
 85 |         return layer_optimizer.\
 86 |             LayerOptimizer(data, sys_info).optimize()
 87 |     elif method == "Exhaustive":
 88 |         return deconv_exhaustive_searcher.\
 89 |             DeconvExhaustiveSearcher(data, sys_info).optimize()
 90 |     elif method == "Combined":
 91 |         return layer_optimizer.\
 92 |             LayerOptimizer(data, sys_info, True).optimize()
 93 |     else:
 94 |         raise Exception("Unknown search method: {}".format(method))
 95 | 
 96 | def sub_kernel_sizes(layer):
 97 |     add_one = [(i+1)/2 for i in layer["kernel"]]
 98 |     sub_one = [i/2 for i in layer["kernel"]]
 99 | 
100 |     sizes = [[]]
101 |     for i in range(len(layer["kernel"])):
102 |         tmp = []
103 |         for j in sizes:
104 |             e1 = list(j) + [add_one[i]]
105 |             e2 = list(j) + [sub_one[i]]
106 |             tmp += [e1, e2]
107 | 
108 |         sizes = tmp
109 | 
110 |     return sizes
111 | 
112 | def single_split_optimization(layer, sys_info):
113 |     subs = []
114 | 
115 |     # iterate all possible sub_kernels
116 |     for sub_size in sub_kernel_sizes(layer):
117 |         sub = dict(layer)
118 |         sub["kernel"] = sub_size
119 |         subs.append(single_layer_optimization(sub, sys_info))
120 | 
121 |     return subs
122 | 
123 | def opti_deconv(layer, sys_info):
124 |     global method, enable
125 |     # collect individual result from sub_kernels
126 |     subs = []
127 | 
128 |     # if the convolution size is odd;
129 |     if layer["kernel"][0]%2 == 1:
130 |         if enable["combine"]:
131 |             subs.append(single_combine_optimization(layer, sys_info))
132 |         else:
133 |             subs = single_split_optimization(layer, sys_info)
134 |     # if the convolution size is even;
135 |     else:
136 |         sub = dict(layer)
137 |         # halve the first two kernel dims in a new list, leaving layer["kernel"] intact
138 |         sub["kernel"] = [k/2 for k in layer["kernel"][:2]] + list(layer["kernel"][2:])
139 |         if enable["combine"]:
140 |             # this will consider four same-size sub-kernels
141 |             # as one sub-kernel with more channels
142 |             sub["out_channel"] = sub["out_channel"]*4
143 |             subs.append(single_layer_optimization(sub, sys_info))
144 |         else:
145 |             # without combining sub-kernels
146 |             res = single_layer_optimization(sub, sys_info)
147 |             # multiply each individual sub-kernel's
148 |             # memory traffic and cycles by 4.
149 |             res["total_transfer"] = res["total_transfer"]*4
150 |             res["total_cycle"] = res["total_cycle"]*4
151 |             subs.append(res)
152 | 
153 |     return subs
154 | 
155 | # the main routine of optimizing the dnn.
156 | def opti_dnn(meta_data, hardware_constraints):
157 |     # set up the configurations;
158 |     setup(meta_data, hardware_constraints)
159 |     dnn = meta_data["dnn"]
160 |     sys_info = meta_data["system_info"]
161 | 
162 |     results = []
163 | 
164 |     # optimize for each layer
165 |     for i in range(len(dnn)):
166 |         layer = dnn[i]
167 |         # start to optimize ordinary Conv layer.
168 |         data = dict(layer)
169 | 
170 |         # check if this layer is Deconv, True == YES
171 |         if layer["Deconv?"] == True:
172 |             if enable["split"]:
173 |                 # if split the deconv into smaller ones
174 |                 results.append({
175 |                         "data" : data,
176 |                         "result" : opti_deconv(layer, sys_info)
177 |                         })
178 |             else:
179 |                 data["ofmap"] = [0] * len(data["ifmap"])
180 |                 # scale up the ifmap for the deconv, then derive the ofmap from the stride.
181 |                 for j in range(len(data["ifmap"])-1):
182 |                     data["ifmap"][j] = layer["ifmap"][j]*2/layer["stride"]
183 |                     data["ofmap"][j] = layer["ifmap"][j]/layer["stride"]
184 | 
185 |                 # the last element is ofmap channel, so treat it separately
186 |                 data["ofmap"][-1] = data["out_channel"]
187 | 
188 |                 # add the result
189 |                 results.append({
190 |                         "data" : data,
191 |                         "result" : single_layer_optimization(data, sys_info)
192 |                         })
193 |         else:
194 |             data["ofmap"] = [0] * len(data["ifmap"])
195 |             # scale the ifmap down to the ofmap based on the stride size.
196 |             for j in range(len(data["ifmap"])-1):
197 |                 data["ofmap"][j] = layer["ifmap"][j]/layer["stride"]
198 | 
199 |             # the last element is ofmap channel, so treat it separately
200 |             data["ofmap"][-1] = data["out_channel"]
201 | 
202 |             results.append({
203 |                         "data" : data,
204 |                         "result" : single_layer_optimization(data, sys_info)
205 |                         })
206 | 
207 |     return results
208 | 


--------------------------------------------------------------------------------
/dnns/GC_net.txt:
--------------------------------------------------------------------------------
 1 | 32,5,5,2,False
 2 | 32,3,3,1,False
 3 | 32,3,3,1,False
 4 | 32,3,3,1,False
 5 | 32,3,3,1,False
 6 | 32,3,3,1,False
 7 | 32,3,3,1,False
 8 | 
 9 | 32,3,3,3,1,False
10 | 32,3,3,3,1,False
11 | 64,3,3,3,2,False
12 | 64,3,3,3,1,False
13 | 64,3,3,3,1,False
14 | 64,3,3,3,2,False
15 | 64,3,3,3,1,False
16 | 64,3,3,3,1,False
17 | 64,3,3,3,2,False
18 | 64,3,3,3,1,False
19 | 64,3,3,3,1,False
20 | 128,3,3,3,2,False
21 | 128,3,3,3,1,False
22 | 128,3,3,3,1,False
23 | 64,3,3,3,1,True
24 | 64,3,3,3,1,True
25 | 64,3,3,3,1,True
26 | 32,3,3,3,1,True


--------------------------------------------------------------------------------
/dnns/PSMNet.txt:
--------------------------------------------------------------------------------
 1 | 32,3,3,2,False
 2 | 32,3,3,1,False
 3 | 32,3,3,1,False
 4 | 64,3,3,2,False
 5 | 64,3,3,1,False
 6 | 128,3,3,2,False
 7 | 128,3,3,1,False
 8 | 
 9 | 32,3,3,3,1,False
10 | 32,3,3,3,1,False
11 | 32,3,3,3,1,False
12 | 64,3,3,3,2,False
13 | 64,3,3,3,1,False
14 | 64,3,3,3,2,False
15 | 64,3,3,3,1,False
16 | 64,3,3,3,1,True
17 | 32,3,3,3,1,True
18 | 64,3,3,3,2,False
19 | 64,3,3,3,1,False
20 | 64,3,3,3,2,False
21 | 64,3,3,3,1,False
22 | 64,3,3,3,1,True
23 | 32,3,3,3,1,True
24 | 64,3,3,3,2,False
25 | 64,3,3,3,1,False
26 | 64,3,3,3,2,False
27 | 64,3,3,3,1,False
28 | 64,3,3,3,1,True
29 | 32,3,3,3,1,True
30 | 


--------------------------------------------------------------------------------
/dnns/flowNetC.txt:
--------------------------------------------------------------------------------
 1 | 64,7,7,2,False
 2 | 128,5,5,2,False
 3 | 256,5,5,2,False
 4 | 441,1,1,1,False
 5 | 256,3,3,1,False
 6 | 512,3,3,2,False
 7 | 512,3,3,1,False
 8 | 512,3,3,2,False
 9 | 512,3,3,1,False
10 | 1024,3,3,2,False
11 | 512,5,5,1,True
12 | 256,5,5,1,True
13 | 128,5,5,1,True
14 | 64,5,5,1,True


--------------------------------------------------------------------------------
/dnns/flowNetS.txt:
--------------------------------------------------------------------------------
 1 | 64,7,7,2,False
 2 | 128,5,5,2,False
 3 | 256,5,5,2,False
 4 | 256,3,3,1,False
 5 | 512,3,3,2,False
 6 | 512,3,3,1,False
 7 | 512,3,3,2,False
 8 | 512,3,3,1,False
 9 | 1024,3,3,2,False
10 | 512,5,5,1,True
11 | 256,5,5,1,True
12 | 128,5,5,1,True
13 | 64,5,5,1,True


--------------------------------------------------------------------------------
/dnns/test.txt:
--------------------------------------------------------------------------------
1 | 64,7,7,2,False
2 | 128,5,5,1,True


--------------------------------------------------------------------------------
/dnns/test3d.txt:
--------------------------------------------------------------------------------
 1 | 32,3,3,3,1,False
 2 | 32,3,3,3,1,False
 3 | 64,3,3,3,2,False
 4 | 64,3,3,3,1,False
 5 | 64,3,3,3,1,False
 6 | 64,3,3,3,2,False
 7 | 64,3,3,3,1,False
 8 | 64,3,3,3,1,False
 9 | 64,3,3,3,2,False
10 | 64,3,3,3,1,False
11 | 64,3,3,3,1,False
12 | 128,3,3,3,2,False
13 | 128,3,3,3,1,False
14 | 128,3,3,3,1,False
15 | 64,3,3,3,1,True
16 | 64,3,3,3,1,True
17 | 64,3,3,3,1,True
18 | 32,3,3,3,1,True
19 | 


--------------------------------------------------------------------------------
/fc_layer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | class FullyConnectedLayer(object):
  8 |     # info for systolic array
  9 |     A = None       # systolic array dimension
 10 | 
 11 |     # memory bandwidth, expressed as the number of elements that can be transferred.
 12 |     B = None
 13 | 
 14 |     # on-chip buffer size
 15 |     buf_size = None
 16 | 
 17 |     # input layer dimension
 18 |     Ci = None       # channels for ifmap
 19 |     Co = None       # channels for ofmap
 20 |     Num = None      # number of same FC layer
 21 | 
 22 |     # on-chip buffer size
 23 |     bufi_size = None
 24 |     bufo_size = None
 25 |     bufw_size = None
 26 | 
 27 |     """docstring for FullyConnectedLayer"""
 28 |     def __init__(self, data, sys_info):
 29 |         self.data = data
 30 |         self.sys_info = sys_info
 31 |         self.A = sys_info["sa_size"]
 32 |         self.B = sys_info["memory_bandwidth"]/(sys_info["bit_width"]/8)
 33 |         self.buf_size = sys_info["bufsize"]
 34 | 
 35 |     def init_setup(self):
 36 |         layer_info = self.data
 37 |        
 38 |         # set up the new layer information
 39 |         self.Ci = layer_info["in_channel"]
 40 |         self.Co = layer_info["out_channel"]
 41 |         self.Num = layer_info["num_of_layer"]
 42 | 
 43 |         self.bufi_size = self.Ci
 44 | 
 45 |     ###############################################################
 46 |     #                       general process                       #
 47 |     ###############################################################
 48 | 
 49 |     # compute buffer utilization
 50 |     def buffer_utilization(self, x):
 51 |         # buffer = ofmap + weights + ifmap
 52 |         return (x + self.Ci*x + self.Ci)
 53 | 
 54 |     # (ofmap + weights)*total_batch + ifmap
 55 |     def data_transfer(self, x):
 56 |         # calculate the total batch
 57 |         total_batch = math.ceil(float(self.Co)/x)
 58 | 
 59 |         # ofmap and kernel tile size
 60 |         ofmap_tile_size = x
 61 |         kernel_tile_size = x*self.Ci
 62 | 
 63 |         # ofmap + kernels transfer
 64 |         total_transfer = (ofmap_tile_size + kernel_tile_size) * total_batch
 65 | 
 66 |         # add additional ifmap data transfer
 67 |         total_transfer += self.Ci
 68 | 
 69 |         return total_transfer
 70 | 
 71 |     def systolic_array_utilization(self, x):
 72 |         A = self.A
 73 |         A_w_uiti = math.ceil(self.Co/math.ceil(float(self.Co)/A))
 74 | 
 75 |         total_usage = x * self.Ci
 76 |         round_up_val = math.ceil(float(x)/A) * A \
 77 |                      * math.ceil(float(self.Ci)/A)*A
 78 | 
 79 |         # the pct of extra delay due to output-stationary
 80 |         delay_pct = float(self.Ci)/(self.Ci+A_w_uiti)
 81 | 
 82 |         return delay_pct * total_usage / round_up_val
 83 | 
 84 |     def compute_bound_cycle(self, util_rate):
 85 |         # total number of ops
 86 |         total_computation = (self.Ci*self.Co)
 87 | 
 88 |         # systolic array calculation capacity
 89 |         comp_cap = (self.A*self.A) * util_rate
 90 | 
 91 |         return total_computation / comp_cap
 92 | 
 93 |     def process_parameter(self, x):
 94 | 
 95 |         x = math.floor(x)
 96 |         bound = "C"
 97 |         # make the tile size even for every batch
 98 |         x_0 = min(self.Co/math.ceil(self.Co/round(x)), self.Co)
 99 | 
100 |         # (ofmap + weights)*total_batch + ifmap
101 |         total_transfer = self.data_transfer(x_0)
102 | 
103 |         # compute the utilization of systolic array
104 |         util_sys_arr = self.systolic_array_utilization(x_0)
105 | 
106 |         # compute the utilization of buffer
107 |         util_buf = float(self.buffer_utilization(x_0))/self.buf_size
108 | 
109 |         if util_buf > 1.01:
110 |             print("ERROR: the utilization of buffer is over 100%")
111 |             exit()
112 | 
113 |         # calculate the amount of cycles of computing all elements.
114 |         if self.compute_bound_cycle(util_sys_arr) > total_transfer/self.B:
115 |             bound = "C"
116 |             total_cycle = self.compute_bound_cycle(util_sys_arr)
117 |         else:
118 |             bound = "M"
119 |             total_cycle = total_transfer/self.B
120 | 
121 |         ret = {
122 |             "total_transfer": round(total_transfer)*self.Num,
123 |             "total_cycle": round(total_cycle)*self.Num,
124 |             "systolic_array_utilization": util_sys_arr,
125 |             "buffer_utilization": util_buf,
126 |             "buffer-partition [I,W,O]": [int(self.bufi_size), 
127 |                                          int(self.bufw_size), 
128 |                                          int(self.bufo_size)], 
129 |             "x_0": math.floor(x_0),
130 |             "Bound" : bound
131 |         }
132 | 
133 |         return ret
134 | 
135 |     # optimize one layer
136 |     def optimize(self):
137 |         self.init_setup()
138 | 
139 |         # if the ifmap alone (bufi_size == Ci) exceeds self.buf_size,
140 |         # the layer cannot be scheduled.
141 |         if self.bufi_size > self.buf_size:
142 |             print("FAIL: the entire ifmap cannot be stored in buffer")
143 |             exit()
144 | 
145 |         self.bufw_size = (self.buf_size - self.bufi_size)*self.Ci/(self.Ci+1)
146 |         self.bufo_size = (self.buf_size - self.bufi_size)/(self.Ci+1)
147 | 
148 |         # set the initial guess;
149 |         x0 = self.A
150 | 
151 |         # grow the output tile while its weights still fit into the weight buffer.
152 |         while x0 < self.Co and (x0+self.A)*self.Ci < self.bufw_size:
153 |             x0 = x0 + self.A
154 | 
155 |         return self.process_parameter(x0)
156 | 


--------------------------------------------------------------------------------
/layer3d_base_method.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | # own library
  8 | from layer_base_method import *
  9 | 
 10 | class Layer3dBaseMethod(LayerBaseMethod):
 11 |     """docstring for Layer3dBaseMethod"""
 12 |     # info for systolic array
 13 |     A = None      # systolic array dimension
 14 | 
 15 |     # memory bandwidth: the number of bytes that can be transferred.
 16 |     B = None
 17 | 
 18 |     # on-chip buffer size
 19 |     buffer_size = None
 20 |     # info for weights
 21 |     K_w = None       # kernel width
 22 |     K_h = None       # kernel height
 23 |     K_d = None       # kernel disparity
 24 |     S = None         # stride size
 25 | 
 26 |     # input layer dimension
 27 |     H = None        # height of ofmap
 28 |     W = None        # width of ofmap
 29 |     D = None        # disparity of ofmap
 30 |     Ci = None       # channels for weights
 31 |     Co = None       # channels for ofmap
 32 | 
 33 |     # on-chip buffer size
 34 |     bufi_size = None
 35 |     bufo_size = None
 36 |     bufw_size = None
 37 | 
 38 |     # array to store the result from the four different results
 39 |     res = []
 40 | 
 41 |     def __init__(self, data, sys_info):
 42 |         super(Layer3dBaseMethod, self).__init__(data, sys_info)
 43 | 
 44 |     def init_setup(self):
 45 |         self.res = []
 46 |         layer_info = self.data
 47 |         # set up the new layer information
 48 |         [self.W, self.H, self.D, self.Ci] = layer_info["ifmap"]
 49 |         self.Co = layer_info["out_channel"]
 50 |         [self.K_w, self.K_h, self.K_d] = layer_info["kernel"]
 51 |         self.S = layer_info["stride"]
 52 | 
 53 |     ###############################################################
 54 |     #                     general computations                    #
 55 |     ###############################################################
 56 |     def ofmap_tile(self, x):
 57 |         return x[0]*x[1]*x[2]*x[3]
 58 | 
 59 |     def weight_tile(self, num):
 60 |         return self.Ci*self.K_h*self.K_w*self.K_d*num
 61 | 
 62 |     def ifmap_tile(self, x):
 63 |         S_2 = (self.K_h+1) / 2
 64 |         return self.Ci*(self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2)
 65 | 
 66 |     def total_ofmap_size(self):
 67 |         return self.H*self.W*self.D*self.Co
 68 | 
 69 |     def total_weight_size(self):
 70 |         return self.weight_tile(self.Co)
 71 | 
 72 |     # variables for optimization
 73 |     # these are encoded as x[4] = {c_0, w_0, h_0, d_0};
 74 |     # c_0  # number of channels per batch;
 75 |     # w_0, h_0, d_0 # the dimensions of tile per batch;
 76 |     ###############################################################
 77 |     #                       general process                       #
 78 |     ###############################################################
 79 | 
 80 |     def buffer_utilization(self, x):
 81 |         # buffer = ofmap + weights + ifmap
 82 |         return (self.ofmap_tile(x) +
 83 |                 self.weight_tile(x[0]) +
 84 |                 self.ifmap_tile(x))
 85 | 
 86 |     def row_major_data_transfer(self, h_0, w_0, d_0, c_0):
 87 |         # ofmap, ifmap and kernel tile size
 88 |         S_2 = (self.K_h+1) / 2
 89 |         ofmap_tile_size = h_0*w_0*d_0*c_0
 90 |         ifmap_tile_size = ((self.S*h_0+S_2) *
 91 |                            (self.S*w_0+S_2) *
 92 |                            (self.S*d_0+S_2) * self.Ci)
 93 |         kernel_tile_size = self.K_h*self.K_w*self.K_d*self.Ci*c_0
 94 | 
 95 |         # calculate the total batch
 96 |         total_batch = math.ceil((self.H*self.W*self.D*self.Co) / ofmap_tile_size)
 97 | 
 98 |         # ofmap + ifmap transfer
 99 |         total_transfer = ((ofmap_tile_size + ifmap_tile_size) *
100 |                           (total_batch - self.Co/c_0))
101 | 
102 |         # add additional data transfer
103 |         total_transfer += (ofmap_tile_size + kernel_tile_size) * (self.Co/c_0)
104 | 
105 |         return total_transfer
106 | 
107 |     def channel_major_data_transfer(self, h_0, w_0, d_0, c_0):
108 |         S_2 = (self.K_h+1) / 2
109 | 
110 |         # ofmap and ifmap tile size
111 |         ofmap_tile_size = h_0*w_0*d_0*c_0
112 |         ifmap_tile_size = ((self.S*h_0+S_2) *
113 |                            (self.S*w_0+S_2) *
114 |                            (self.S*d_0+S_2) * self.Ci)
115 |         kernel_tile_size = self.K_h*self.K_w*self.K_d*self.Ci*c_0
116 | 
117 |         # calculate the total batch
118 |         total_batch = math.ceil((self.H*self.W*self.D*self.Co) / ofmap_tile_size)
119 | 
120 |         # ofmap + weight transfer
121 |         total_transfer = (ofmap_tile_size + kernel_tile_size) * \
122 |             (total_batch - (self.H*self.W*self.D)/(h_0*w_0*d_0))
123 | 
124 |         # add additional data transfer
125 |         total_transfer += (ofmap_tile_size + ifmap_tile_size) \
126 |             * (self.H*self.W*self.D)/(h_0*w_0*d_0)
127 | 
128 |         return total_transfer
129 | 
130 |     def systolic_array_utilization(self, xi, area):
131 |         area_size = area[0] * area[1] *area[2]
132 |         A = self.A
133 |         total_usage = xi * area_size
134 |         round_up_val = (math.ceil(float(xi)/A)*A) \
135 |                         * (math.ceil(float(area_size)/A)*A)
136 | 
137 |         return total_usage / round_up_val
138 | 
139 |     def compute_bound_cycle(self, util_rate):
140 |         # total number of ops
141 |         total_computation = ((self.H*self.W*self.D*self.Co)
142 |                           * (self.Ci*self.K_h*self.K_w*self.K_d)
143 |                           / (self.S*self.S*self.S))
144 | 
145 |         # systolic array calculation capacity
146 |         comp_cap = (self.A*self.A) * util_rate
147 | 
148 |         return total_computation / comp_cap
149 | 
150 |     def process_parameter(self, x, row_major, comp_bound):
151 |         bound = "C"
152 |         # make the tile size even for every batch
153 |         c_0 = min(self.Co/math.ceil(self.Co/round(x[0])), self.Co)
154 |         w_0 = min(self.W/math.ceil(self.W/round(x[1])), self.W)
155 |         h_0 = min(self.H/math.ceil(self.H/round(x[2])), self.H)
156 |         d_0 = min(self.D/math.ceil(self.D/round(x[3])), self.D)
157 | 
158 |         # compute the total number of elements needed to be updated
159 |         # if it is row-major.
160 |         if row_major:
161 |             # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0
162 |             total_transfer = self.row_major_data_transfer(h_0, w_0, d_0, c_0)
163 | 
164 |         # compute the total number of elements needed to be updated
165 |         # if it is channel-major.
166 |         else:
167 |             # (ofmap + weights)*total_batch + (ofmap+ifmap)*(H*W*D)/(h_0*w_0*d_0)
168 |             total_transfer = self.channel_major_data_transfer(h_0, w_0, d_0, c_0)
169 | 
170 |         # compute the utilization of systolic array
171 |         util_sys_arr = self.systolic_array_utilization(c_0, [w_0, h_0, d_0])
172 | 
173 |         # compute the utilization of the on-chip buffer
174 |         util_buf = self.buffer_utilization([c_0, w_0, h_0, d_0])/self.buf_size
175 | 
176 |         if util_buf > 1.01:
177 |             return
178 |         # calculate the amount of cycles of computing all elements.
179 |         if comp_bound:
180 |             bound = "C"
181 |             total_cycle = self.compute_bound_cycle(util_sys_arr)
182 |         else:
183 |             bound = "M"
184 |             total_cycle = total_transfer/self.B
185 | 
186 |         ret = {
187 |             "total_transfer": round(total_transfer),
188 |             "total_cycle": round(total_cycle),
189 |             "systolic_array_utilization": util_sys_arr,
190 |             "buffer_utilization": util_buf,
191 |             "c_0, w_0, h_0, d_0": [round(c_0), round(w_0), round(h_0), round(d_0)],
192 |             "Tile size" : [self.Co/c_0, self.W/w_0, self.H/h_0, self.D/d_0],
193 |             "Bound" : bound
194 |         }
195 |         self.res.append(ret)
196 |         return
197 | 
198 | 
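
The utilization model in `systolic_array_utilization` above can be sketched as a stand-alone function. This is a minimal sketch, not the repository's code: it assumes Python 3 division semantics, and it takes the array dimension `A` as an explicit argument (a hypothetical parameter; the class reads it from `self.A`). Both the channel count and the spatial tile size are padded up to the next multiple of `A`, and utilization is the ratio of useful work to padded work.

```python
import math


def systolic_array_utilization(xi, area, A):
    """Fraction of an A x A systolic array kept busy by one tile.

    xi   -- number of output channels in the tile (c_0)
    area -- spatial tile dimensions, e.g. [w_0, h_0, d_0]
    A    -- systolic array dimension (hypothetical stand-alone parameter)
    """
    area_size = 1
    for dim in area:
        area_size *= dim
    total_usage = xi * area_size
    # Both operands are padded up to the next multiple of A.
    round_up_val = (math.ceil(float(xi) / A) * A) \
        * (math.ceil(float(area_size) / A) * A)
    return total_usage / round_up_val


# A tile that exactly fills a 16 x 16 array is fully utilized;
# halving the channel count halves the utilization.
full = systolic_array_utilization(16, [4, 4, 1], 16)
half = systolic_array_utilization(8, [4, 4, 1], 16)
```

Under this model, a tile with 17 channels on a 16-wide array pays for 32 channel slots, so utilization drops to roughly half even though only one extra channel was added.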


--------------------------------------------------------------------------------
/layer3d_exhaustive_searcher.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/python2.7
 2 | 
 3 | # public library
 4 | import math
 5 | import numpy as np
 6 | 
 7 | # my own module
 8 | from layer3d_static_method import *
 9 | 
10 | ###############################################################
11 | #                       general process                       #
12 | ###############################################################
13 | class Layer3dExhaustiveSearcher(Layer3dStaticMethod):
14 | 
15 |     # array to store the results of the four (ordering, bound) configurations
16 |     res = []
17 | 
18 |     """docstring for Layer3dExhaustiveSearcher"""
19 |     def __init__(self, data, sys_info):
20 |         super(Layer3dExhaustiveSearcher, self).__init__(data, sys_info, None)
21 |         self.rets = []
22 | 
23 |     # optimize one layer
24 |     def optimize(self):
25 |         self.init_setup()
26 | 
27 |         for i in range(1, 20):
28 |             self.bufi_size = self.buf_size*i/20.0
29 |             for j in range(1, 20):
30 |                 self.bufw_size = self.buf_size*j/20.0
31 |                 # optimize one buffer partition
32 |                 self.res = []
33 |                 self.optimize_one_buffer_partition()
34 | 
35 |         ret = dict(self.rets[0]) if self.rets else None
36 | 
37 |         for item in self.rets:
38 |             if ret["total_cycle"] > item["total_cycle"]:
39 |                 ret = dict(item)
40 |             if ret["total_cycle"] == item["total_cycle"] and \
41 |                 ret["total_transfer"] > item["total_transfer"]:
42 |                 ret = dict(item)
43 | 
44 |         return ret
45 | 
46 |     # the main optimization routine;
47 |     def opti_buffer(self):
48 |         # set the initial guess;
49 |         x0 = [self.A, self.A]
50 | 
51 |         # check if the initial configuration can hold the minimum requirements
52 |         if ((x0[0]*self.K_h*self.K_w*self.K_d*self.Ci > self.bufw_size) or
53 |             (self.S*self.S*self.S*x0[1]*self.Ci > self.bufi_size)):
54 |             return
55 | 
56 |         # first, let's find the number of kernel we can put into buffer.
57 |         while (x0[0]+self.A)*self.K_h*self.K_w*self.K_d*self.Ci < self.bufw_size:
58 |             x0[0] = x0[0]+self.A
59 | 
60 |         # next let's see how much ifmap can we fit into the buffer.
61 |         while self.S*self.S*self.S*(x0[1]+self.A)*self.Ci < self.bufi_size:
62 |             x0[1] = x0[1]+self.A
63 | 
64 |         # no need to optimize the buffer for ofmap, because it is
65 |         # bounded by the ifmap.
66 |         x = [x0[0], min(round(x0[1]**(1.0/3)), self.W),
67 |              min(round(x0[1]**(1.0/3)), self.H), min(round(x0[1]**(1.0/3)), self.D)]
68 |         self.process_parameter(x, False, False)
69 |         self.process_parameter(x, False, True)
70 |         self.process_parameter(x, True, False)
71 |         self.process_parameter(x, True, True)
72 | 
73 | 


--------------------------------------------------------------------------------
/layer3d_optimizer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | from scipy.optimize import minimize
  7 | 
  8 | # my own library
  9 | from layer3d_base_method import *
 10 | from layer_optimizer import *
 11 | 
 12 | # threshold for bounds
 13 | # if the constraint result is negative but within this threshold,
 14 | # it is still considered a valid result.
 15 | Threshold = 10.0
 16 | 
 17 | class Layer3dOptimizer(Layer3dBaseMethod, LayerOptimizer):
 18 |     """docstring for Layer3dOptimizer"""
 19 |     def __init__(self, data, sys_info, combined=False):
 20 |         super(Layer3dOptimizer, self).__init__(data, sys_info)
 21 | 
 22 |     # variables for optimization
 23 |     # these variables are encoded as x[4] = {c_0, h_0, w_0, d_0};
 24 |     # c_0  # number of channels per batch;
 25 |     # h_0, w_0, d_0 # the dimensions of tile per batch;
 26 | 
 27 |     ###############################################################
 28 |     #                       general process                       #
 29 |     ###############################################################
 30 | 
 31 |     def init_guess(self):
 32 |         x0 = [min(self.A, self.Co), \
 33 |               min(math.floor(math.sqrt(self.A)), self.H), \
 34 |               min(math.floor(math.sqrt(self.A)), self.W), 1]
 35 |         return x0
 36 | 
 37 |     def variable_boundary(self):
 38 |         return ((min(self.A, self.Co), self.Co),
 39 |                 (min(math.floor(math.sqrt(self.A)), self.H), self.H), \
 40 |                 (min(math.floor(math.sqrt(self.A)), self.W), self.W), \
 41 |                 (1, self.D))
 42 | 
 43 |     ###############################################################
 44 |     #                     general constraints                     #
 45 |     ###############################################################
 46 |     # the low bound of buffer size;
 47 |     # make sure the buffer utilization is always larger than 0
 48 |     def buffer_constraint1(self, x):
 49 |         return self.buffer_utilization(x)
 50 | 
 51 |     # the upper bound of the buffer size;
 52 |     # make sure the buffer utilization is
 53 |     # always smaller than buffer size;
 54 |     def buffer_constraint2(self, x):
 55 |         return self.buf_size - self.buffer_constraint1(x)
 56 | 
 57 |     ###############################################################
 58 |     #       row-major constraint solving obj and constraints      #
 59 |     ###############################################################
 60 | 
 61 |     # the minimization objective of row-major
 62 |     # this objective is a simplified expression of
 63 |     # [h_0*w_0*d_0*c_0+(S*h_0+2)(S*w_0+2)(S*d_0+2)*Ci]*(H*W*D*Co)/(h_0*w_0*d_0*c_0)
 64 |     # + [K^3*Ci+h_0*w_0*d_0*c_0]*Co/c_0
 65 |     def row_major_mem_obj(self, x):
 66 |       return (self.ofmap_tile(x) + self.ifmap_tile(x)) \
 67 |           * (self.total_ofmap_size()/self.ofmap_tile(x) - self.Co/x[0]) \
 68 |           + self.total_weight_size()/x[0] + x[1]*x[2]*x[3]*self.Co
 69 | 
 70 |     def row_major_comp_obj(self, x):
 71 |         return self.total_ofmap_size() / self.ofmap_tile(x)
 72 | 
 73 |     # make sure the load for row-major is always less than
 74 |     # load for channel-major, range : [0, +inf]
 75 |     def row_major_constraint(self, x):
 76 |         S_2 = (self.K_h+1)/2
 77 |         # simplified from K^3*Ci*c_0 > C*(S^3*h_0*w_0*d_0)
 78 |         return self.K_h*self.K_w*self.K_d*x[0] - \
 79 |             (self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2)
 80 | 
 81 |     # make sure the process is always memory-bound;
 82 |     # which is the latency for memory access is always
 83 |     # greater than latency of compute;
 84 |     # (c_0*(h_0*w_0*d_0)+C*((S*h_0+2)*(S*w_0+2)*(S*d_0+2))/B
 85 |     # >= (K^3*Ci/A^2)*c_0*w_0*d_0*h_0
 86 |     # range : [0, +inf]
 87 |     def row_major_mem_bound_constraint(self, x):
 88 |       return (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B \
 89 |           - self.weight_tile(1)/(self.A*self.A)*self.ofmap_tile(x)
 90 | 
 91 |     # make sure the process is always compute-bound;
 92 |     # which is the latency for compute is always
 93 |     # greater than latency of memory access;
 94 |     # (c_0*(h_0*w_0*d_0)+Ci*((S*h_0+2)*(S*w_0+2)*(S*d_0+2))/B
 95 |     # <= (K^3*Ci/A^2)*c_0*w_0*h_0*d_0
 96 |     # range : [0, +inf]
 97 |     def row_major_comp_bound_constraint(self, x):
 98 |         return self.weight_tile(1) / (self.A*self.A)*self.ofmap_tile(x) \
 99 |             - (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B
100 | 
101 |     ###############################################################
102 |     #     channel-major constraint solving obj and constraints    #
103 |     ###############################################################
104 | 
105 |     # the minimization objective of channel-major
106 |     # this is the simplified expression of
107 |     # (K^3*Ci*c_0+h_0*w_0*d_0*c_0)*(H*W*D*Co)/(h_0*w_0*d_0*c_0)
108 |     # + [(S*h_0+2)(S*w_0+2)(S*d_0+2)*Ci + h_0*w_0*d_0*c_0]*(H*W*D)/(h_0*w_0*d_0)
109 |     def channel_major_mem_obj(self, x):
110 |         S_2 = (self.K_h+1)/2
111 |         return (self.total_weight_size())/(x[1]*x[2]*x[3]) + \
112 |                 (self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2)/\
113 |                 (x[1]*x[2]*x[3])
114 | 
115 |     def channel_major_comp_obj(self, x):
116 |         return self.total_ofmap_size()/(x[1]*x[2]*x[0]*x[3])
117 | 
118 |     # make sure the load for channel-major is always less than
119 |     # load for row-major, range : [0, +inf]
120 |     def channel_major_constraint(self, x):
121 |         S_2 = (self.K_h+1)/2
122 |         # simplified from K^3*Ci*c_0 <= Ci*((S*h_0+2)*(S*w_0+2)*(S*d_0+2))
123 |         return (self.S*x[1]+S_2)*(self.S*x[2]+S_2)*(self.S*x[3]+S_2) \
124 |             - self.K_h*self.K_w*self.K_d*x[0]
125 | 
126 |     # make sure the process is always memory-bound;
127 |     # which is the latency for memory access is always
128 |     # greater than latency of compute;
129 |     # c_0*(h_0*w_0*d_0+K^3*C)/B >= (K^3*C/A^2)*c_0*(h_0*w_0*d_0)
130 |     # range : [0, +inf]
131 |     def channel_major_mem_bound_constraint(self, x):
132 |         return (x[1]*x[2]*x[3]+self.weight_tile(1)) / self.B \
133 |             - self.weight_tile(1)/(self.A*self.A)*x[1]*x[2]*x[3]
134 | 
135 | 
136 |     # make sure the process is always compute-bound;
137 |     # which is the latency for compute is always
138 |     # greater than latency of memory access;
139 |     # c_0*(h_0*w_0*d_0+K^3*C)/B <= (K^3*C/A^2)*c_0*(h_0*w_0*d_0)
140 |     # range : [0, +inf]
141 |     def channel_major_comp_bound_constraint(self, x):
142 |         return self.weight_tile(1) \
143 |             / (self.A*self.A)*x[1]*x[2]*x[3] \
144 |             - (x[1]*x[2]*x[3]+self.weight_tile(1))/self.B
145 | 
146 | 
147 | 


--------------------------------------------------------------------------------
/layer3d_static_method.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | # my own module
  8 | from layer3d_base_method import *
  9 | 
 10 | ###############################################################
 11 | #                       general process                       #
 12 | ###############################################################
 13 | class Layer3dStaticMethod(Layer3dBaseMethod):
 14 | 
 15 |     # array to store the results of the four (ordering, bound) configurations
 16 |     res = []
 17 | 
 18 |     def __init__(self, data, sys_info, buffer_partition=None):
 19 |         super(Layer3dStaticMethod, self).__init__(data, sys_info)
 20 |         self.rets = []
 21 |         if buffer_partition:
 22 |             # calculate buffer sizes
 23 |             self.bufi_size = self.buf_size* \
 24 |                 buffer_partition[0]/sum(buffer_partition)
 25 |             self.bufo_size = self.buf_size* \
 26 |                 buffer_partition[1]/sum(buffer_partition)
 27 |             self.bufw_size = self.buf_size* \
 28 |                 buffer_partition[2]/sum(buffer_partition)
 29 | 
 30 |     # the main optimization routine;
 31 |     def opti_buffer(self):
 32 |         # set the initial guess;
 33 |         x0 = [self.A, self.A]
 34 | 
 35 |         # check if the initial configuration can hold the minimum requirements
 36 |         if ((x0[0]*self.K_h*self.K_w*self.K_d*self.Ci > self.bufw_size) or
 37 |             (self.S*self.S*self.S*x0[1]*self.Ci > self.bufi_size)):
 38 |             return
 39 | 
 40 |         # first, let's find the number of kernel we can put into buffer.
 41 |         while (x0[0]+self.A)*self.K_h*self.K_w*self.K_d*self.Ci < self.bufw_size:
 42 |             x0[0] = x0[0]+self.A
 43 | 
 44 |         # next let's see what percentage of ifmap can we fit into the buffer.
 45 |         if self.K_h*self.K_w*(2*self.S+self.W) >= self.bufi_size:
 46 |             while self.K_h*self.K_w*(self.S+x0[1]+self.A)*self.Ci < self.bufi_size:
 47 |                 x0[1] = x0[1]+self.A
 48 |             # no need to optimize the buffer for ofmap, because it is
 49 |             # bounded by the ifmap.
 50 |             x = [x0[0], x0[1], 1, 1]
 51 |         else:
 52 |             d_0 = 1
 53 |             while (self.K_h*self.K_w)*(self.S+self.W)*(self.S+self.H) \
 54 |                 * (self.S+d_0+1)*self.Ci < self.bufi_size:
 55 |                 d_0 += 1
 56 |             # no need to optimize the buffer for ofmap, because it is
 57 |             # bounded by the ifmap.
 58 |             x = [x0[0], self.W, self.H, d_0]
 59 | 
 60 |         self.process_parameter(x, False, False)
 61 |         self.process_parameter(x, False, True)
 62 |         self.process_parameter(x, True, False)
 63 |         self.process_parameter(x, True, True)
 64 | 
 65 |     def optimize_one_buffer_partition(self):
 66 |         # if sum of bufi and bufw is over the self.buf_size
 67 |         # we should skip it.
 68 |         if (self.bufi_size + self.bufw_size) > self.buf_size:
 69 |             return
 70 | 
 71 |         self.bufo_size = self.buf_size - self.bufi_size - self.bufw_size
 72 |         # both cases are possible;
 73 |         self.opti_buffer()
 74 | 
 75 |         if len(self.res) == 0:
 76 |             return
 77 | 
 78 |         # res[0], res[1] are channel-major; the larger value is the bottleneck
 79 |         channel_major_res = None
 80 |         if (self.res[0]["total_cycle"] < self.res[1]["total_cycle"]):
 81 |             channel_major_res = self.res[1]
 82 |         else:
 83 |             channel_major_res = self.res[0]
 84 | 
 85 |         # res[2], res[3] are row-major; the larger value is the bottleneck
 86 |         row_major_res = None
 87 |         if (self.res[2]["total_cycle"] < self.res[3]["total_cycle"]):
 88 |             row_major_res = self.res[3]
 89 |         else:
 90 |             row_major_res = self.res[2]
 91 | 
 92 |         # return the shortest value as the preferred compute ordering.
 93 |         ret = None
 94 |         if (channel_major_res["total_cycle"] < row_major_res["total_cycle"]):
 95 |             ret = dict(channel_major_res)
 96 |         else:
 97 |             ret = dict(row_major_res)
 98 | 
 99 |         self.rets.append(ret)
100 | 
101 |     # optimize one layer
102 |     def optimize(self):
103 |         self.init_setup()
104 | 
105 |         # start the optimization
106 |         self.res = []
107 |         self.optimize_one_buffer_partition()
108 | 
109 |         # find the best result
110 |         ret = dict(self.rets[0]) if self.rets else None
111 | 
112 |         for item in self.rets:
113 |             if ret["total_cycle"] > item["total_cycle"]:
114 |                 ret = dict(item)
115 |             if ret["total_cycle"] == item["total_cycle"] and \
116 |                 ret["total_transfer"] > item["total_transfer"]:
117 |                 ret = dict(item)
118 | 
119 |         return ret
120 | 
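
The result selection at the end of `optimize` above (fewest cycles first, least data transfer as the tie-breaker) can be written more compactly. The following is a sketch under assumptions rather than the repository's code: `pick_best` is a hypothetical helper name, and returning `None` for an empty candidate list is an assumed convention borrowed from `LayerOptimizer.optimize`.

```python
def pick_best(rets):
    """Return a copy of the best candidate, or None if there is none.

    Candidates are ordered by total_cycle first and total_transfer
    second; min() keeps the earliest entry when both keys tie, which
    matches the strict comparisons used in the original loop.
    """
    if not rets:
        return None
    best = min(rets, key=lambda r: (r["total_cycle"], r["total_transfer"]))
    return dict(best)


candidates = [
    {"total_cycle": 120, "total_transfer": 900},
    {"total_cycle": 100, "total_transfer": 950},
    {"total_cycle": 100, "total_transfer": 800},
]
best = pick_best(candidates)
```

Here the second and third candidates tie on cycles, so the one with less data transfer is chosen.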


--------------------------------------------------------------------------------
/layer_base_method.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | class LayerBaseMethod(object):
  8 |     """docstring for LayerBaseMethod"""
  9 |     # info for systolic array
 10 |     A = None      # systolic array dimension
 11 | 
 12 |     # memory bandwidth: number of elements that can be transferred per cycle.
 13 |     B = None
 14 | 
 15 |     # on-chip buffer size
 16 |     buf_size = None
 17 | 
 18 |     # info for weights
 19 |     K_w = None       # kernel width
 20 |     K_h = None       # kernel height
 21 |     S = None         # stride size
 22 | 
 23 |     # input layer dimension
 24 |     H = None        # height of ofmap
 25 |     W = None        # width of ofmap
 26 |     Ci = None       # channels for weights
 27 |     Co = None       # channels for ofmap
 28 | 
 29 |     # on-chip buffer size
 30 |     bufi_size = None
 31 |     bufo_size = None
 32 |     bufw_size = None
 33 | 
 34 |     # array to store the results of the four (ordering, bound) configurations
 35 |     res = []
 36 | 
 37 |     """docstring for LayerBaseMethod"""
 38 |     def __init__(self, data, sys_info):
 39 |         self.data = data
 40 |         self.sys_info = sys_info
 41 |         self.A = sys_info["sa_size"]
 42 |         self.B = sys_info["memory_bandwidth"]/(sys_info["bit_width"]/8)
 43 |         self.buf_size = sys_info["bufsize"]
 44 |         self.res = []
 45 | 
 46 |     def init_setup(self):
 47 |         self.res = []
 48 |         layer_info = self.data
 49 |         # set up the new layer information
 50 |         [self.W, self.H, self.Ci] = layer_info["ifmap"]
 51 |         self.Co = layer_info["out_channel"]
 52 |         [self.K_w, self.K_h] = layer_info["kernel"]
 53 |         self.S = layer_info["stride"]
 54 |         self.W /= self.S
 55 |         self.H /= self.S
 56 | 
 57 |     ###############################################################
 58 |     #                       general process                       #
 59 |     ###############################################################
 60 | 
 61 |     # compute buffer utilization
 62 |     def buffer_utilization(self, x):
 63 |         # buffer = ofmap + weights + ifmap
 64 |         return (x[0]*x[1]*x[2] + self.Ci*self.K_h*self.K_w*x[0]
 65 |                 + self.Ci*(self.S*x[1]+self.K_h/2)*(self.S*x[2]+self.K_h/2))
 66 | 
 67 |     def total_batch_number(self, h_0, w_0, c_0):
 68 |         return math.ceil(float(self.H*self.W*self.Co) / (h_0*w_0*c_0))
 69 | 
 70 |     # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0
 71 |     def row_major_data_transfer(self, h_0, w_0, c_0):
 72 |         # calculate the total batch
 73 |         total_batch = self.total_batch_number(h_0, w_0, c_0)
 74 | 
 75 |         # ofmap, ifmap and kernel tile size
 76 |         ofmap_tile_size = h_0*w_0*c_0
 77 |         ifmap_tile_size = (self.S*h_0+self.K_h/2)*(self.S*w_0+self.K_w/2)*self.Ci
 78 |         kernel_tile_size = self.K_h*self.K_w*self.Ci*c_0
 79 | 
 80 |         # ofmap + ifmap transfer
 81 |         total_transfer = (ofmap_tile_size + ifmap_tile_size) * total_batch
 82 | 
 83 |         # add additional data transfer
 84 |         total_transfer += (ofmap_tile_size + kernel_tile_size) * self.Co/c_0
 85 | 
 86 |         return total_transfer
 87 | 
 88 |     # (ofmap + weights)*total_batch + (ofmap+ifmap)*(H*W)/(h_0*w_0)
 89 |     def channel_major_data_transfer(self, h_0, w_0, c_0):
 90 |         # calculate the total batch
 91 |         total_batch = self.total_batch_number(h_0, w_0, c_0)
 92 | 
 93 |         # ofmap and ifmap tile size
 94 |         ofmap_tile_size = h_0*w_0*c_0
 95 |         ifmap_tile_size = (self.S*h_0+self.K_h/2)*(self.S*w_0+self.K_w/2)*self.Ci
 96 |         kernel_tile_size = self.K_h*self.K_w*self.Ci*c_0
 97 | 
 98 |         # ofmap + weight transfer
 99 |         total_transfer = (ofmap_tile_size + kernel_tile_size) * total_batch
100 | 
101 |         # add additional data transfer
102 |         total_transfer += (ofmap_tile_size + ifmap_tile_size) \
103 |                         * self.H*self.W / (h_0*w_0)
104 | 
105 |         return total_transfer
106 | 
107 |     def systolic_array_utilization(self, xi, area):
108 |         area_size = area[0] * area[1]
109 |         A = self.A
110 |         total_usage = xi * area_size
111 |         round_up_val = math.ceil(float(xi)/A)*A \
112 |                      * math.ceil(float(area_size)/A)*A
113 | 
114 |         return total_usage / round_up_val
115 | 
116 |     def compute_bound_cycle(self, util_rate):
117 |         # total number of ops
118 |         total_computation = (self.H*self.W*self.Co) * \
119 |                             (self.Ci*self.K_h*self.K_w)
120 | 
121 |         # systolic array calculation capacity
122 |         comp_cap = (self.A*self.A) * util_rate
123 | 
124 |         return total_computation / comp_cap
125 | 
126 |     def process_parameter(self, x, row_major, comp_bound):
127 | 
128 |         x = list(map(lambda i: math.floor(i), x))
129 |         bound = "C"
130 |         # make the tile size even for every batch
131 |         c_0 = min(self.Co/math.ceil(self.Co/round(x[0])), self.Co)
132 |         w_0 = min(self.W/math.ceil(self.W/round(x[1])), self.W)
133 |         h_0 = min(self.H/math.ceil(self.H/round(x[2])), self.H)
134 | 
135 |         # compute the total number of elements needed to be updated
136 |         # if it is row-major.
137 |         if row_major:
138 |             # (ofmap + ifmap)*total_batch + (ofmap+weights)*Co/c_0
139 |             total_transfer = self.row_major_data_transfer(h_0, w_0, c_0)
140 | 
141 |         # compute the total number of elements needed to be updated
142 |         # if it is channel-major.
143 |         else:
144 |             # (ofmap + weights)*total_batch + (ofmap+ifmap)*(H*W)/(h_0*w_0)
145 |             total_transfer = self.channel_major_data_transfer(h_0, w_0, c_0)
146 | 
147 |         # compute the utilization of systolic array
148 |         util_sys_arr = self.systolic_array_utilization(c_0, [w_0, h_0])
149 | 
150 |         # compute the utilization of the on-chip buffer
151 |         util_buf = self.buffer_utilization([c_0, w_0, h_0])/self.buf_size
152 | 
153 |         if util_buf > 1.01:
154 |             return
155 |         # calculate the amount of cycles of computing all elements.
156 |         if comp_bound:
157 |             bound = "C"
158 |             total_cycle = self.compute_bound_cycle(util_sys_arr)
159 |         else:
160 |             bound = "M"
161 |             total_cycle = total_transfer/self.B
162 | 
163 |         ret = {
164 |             "total_transfer": round(total_transfer),
165 |             "total_cycle": round(total_cycle),
166 |             "systolic_array_utilization": util_sys_arr,
167 |             "buffer_utilization": util_buf,
168 |             "c_0, w_0, h_0": [round(c_0), round(w_0), round(h_0)],
169 |             "Tile size" : [self.Co/c_0, self.W/w_0, self.H/h_0],
170 |             "Bound" : bound
171 |         }
172 |         self.res.append(ret)
173 |         return
174 | 
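
The "make the tile size even for every batch" step in `process_parameter` above can be illustrated in isolation. This is a minimal sketch assuming Python 3 true division and a guess of at least 1; `even_tile` is a hypothetical helper name, not a function in the repository. The idea is to turn the solver's guess into a tile count, then divide the dimension evenly among that many tiles.

```python
import math


def even_tile(total, guess):
    """Shrink a tile-size guess so that `total` splits into equal tiles.

    total -- full extent of one dimension (e.g. Co, W or H)
    guess -- tile size proposed by the search (assumed >= 1)
    """
    # Number of tiles needed if every tile had the guessed size.
    n_tiles = math.ceil(total / round(guess))
    # Spread the dimension evenly over those tiles; never exceed `total`.
    return min(total / n_tiles, total)


c_0 = even_tile(100, 30)   # 4 tiles of 25 instead of 30, 30, 30, 10
w_0 = even_tile(64, 100)   # the guess exceeds the dimension: one tile of 64
```

Note that the result may be fractional when the dimension is not divisible by the tile count, which is why the result dictionary rounds the tile sizes before reporting them.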


--------------------------------------------------------------------------------
/layer_exhaustive_searcher.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/python2.7
 2 | 
 3 | # public library
 4 | import math
 5 | import numpy as np
 6 | 
 7 | # my own module
 8 | from layer_static_method import *
 9 | 
10 | ###############################################################
11 | #                       general process                       #
12 | ###############################################################
13 | class LayerExhaustiveSearcher(LayerStaticMethod):
14 | 
15 |     # array to store the results of the four (ordering, bound) configurations
16 |     res = []
17 | 
18 |     """docstring for LayerExhaustiveSearcher"""
19 |     def __init__(self, data, sys_info):
20 |         super(LayerExhaustiveSearcher, self).__init__(data, sys_info, None)
21 |         self.rets = []
22 | 
23 |     # optimize one layer
24 |     def optimize(self):
25 |         self.init_setup()
26 | 
27 |         for i in range(1, 20):
28 |             self.bufi_size = self.buf_size*i/20.0
29 |             for j in range(1, 20):
30 |                 self.bufw_size = self.buf_size*j/20.0
31 |                 # optimize one buffer partition
32 |                 self.res = []
33 |                 self.optimize_one_buffer_partition()
34 | 
35 |         ret = dict(self.rets[0]) if self.rets else None
36 | 
37 |         for item in self.rets:
38 |             if ret["total_cycle"] > item["total_cycle"]:
39 |                 ret = dict(item)
40 |             if ret["total_cycle"] == item["total_cycle"] and \
41 |                 ret["total_transfer"] > item["total_transfer"]:
42 |                 ret = dict(item)
43 | 
44 |         return ret
45 | 
46 |     # the main optimization routine;
47 |     def opti_buffer(self):
48 |         # set the initial guess;
49 |         x0 = [self.A, self.A]
50 | 
51 |         # check if the initial configuration can hold the minimum requirements
52 |         if ((x0[0]*self.K_h*self.K_w*self.Ci > self.bufw_size) or
53 |             (self.S*self.S*x0[1]*self.Ci > self.bufi_size)):
54 |             return
55 | 
56 |         # first, let's find the number of kernel we can put into buffer.
57 |         while (x0[0]+self.A)*self.K_h*self.K_w*self.Ci < self.bufw_size:
58 |             x0[0] = x0[0]+self.A
59 | 
60 |         # next let's see how much ifmap can we fit into the buffer.
61 |         while self.S*self.S*(x0[1]+self.A)*self.Ci < self.bufi_size:
62 |             x0[1] = x0[1]+self.A
63 | 
64 | 
65 |         # no need to optimize the buffer for ofmap, because it is
66 |         # bounded by the ifmap.
67 |         x = [x0[0], math.sqrt(x0[1]), math.sqrt(x0[1])]
68 |         self.process_parameter(x, False, False)
69 |         self.process_parameter(x, False, True)
70 |         self.process_parameter(x, True, False)
71 |         self.process_parameter(x, True, True)
72 | 
73 | 


--------------------------------------------------------------------------------
/layer_optimizer.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | from scipy.optimize import minimize
  7 | 
  8 | # my own module
  9 | from layer_base_method import *
 10 | import layer_exhaustive_searcher
 11 | 
 12 | # threshold for bounds
 13 | # if the constraint result is negative but within this threshold,
 14 | # it is still considered a valid result.
 15 | Threshold = 500.0
 16 | 
 17 | class LayerOptimizer(LayerBaseMethod):
 18 |     """docstring for LayerOptimizer"""
 19 |     def __init__(self, data, sys_info, combined=False):
 20 |         super(LayerOptimizer, self).__init__(data, sys_info)
 21 |         self.combined = combined
 22 | 
 23 |     # variables for optimization
 24 |     # these variables are encoded as x[3] = {c_0, h_0, w_0};
 25 |     # c_0  # number of channels per batch;
 26 |     # h_0xw_0 # size of tile per batch;
 27 | 
 28 |     # calculate the latency for compute and memory;
 29 |     # l_com = (K_h*K_w*c_0*h_0*w_0)/(R*R)
 30 |     # # if row-major
 31 |     # l_mem_r = (c_0*h_0*w_0 + C*(h_0+2)*(w_0+2))/B
 32 |     # # if channel-major
 33 |     # l_mem_c = (c_0*h_0*w_0 + C*K_h*K_w*c_0)/B
 34 | 
 35 |     ###############################################################
 36 |     #                       general process                       #
 37 |     ###############################################################
 38 | 
 39 |     def optimize(self):
 40 |         self.init_setup()
 41 | 
 42 |         # print("##[LAYER]##", self.W, self.H, self.Ci, self.Co, self.K_w, self.K_h)
 43 |         # both cases are possible;
 44 |         # opti_mem()
 45 |         self.opti_comp()
 46 | 
 47 |         if len(self.res) == 0:
 48 |             self.opti_mem()
 49 | 
 50 |         if len(self.res) == 0:
 51 |             return None
 52 | 
 53 |         ret  = dict(self.res[0])
 54 | 
 55 |         for item in self.res:
 56 |             if ret["total_cycle"] > item["total_cycle"]:
 57 |                 ret = dict(item)
 58 |             if ret["total_cycle"] == item["total_cycle"] and \
 59 |                 ret["total_transfer"] > item["total_transfer"]:
 60 |                 ret = dict(item)
 61 | 
 62 |         return ret
 63 | 
 64 |     def opti_mem(self):
 65 |         # print("=========================  Memory Bound  ==========================")
 66 |         # optimization for row-major;
 67 |         self.opti_mem_row_major()
 68 |         # optimization for channel-major;
 69 |         self.opti_mem_channel_major()
 70 |         # print("\n")
 71 | 
 72 |     def opti_comp(self):
 73 |         # print("=========================  Compute Bound  =========================")
 74 |         # optimization for row-major;
 75 |         self.opti_comp_row_major()
 76 |         # optimization for channel-major;
 77 |         self.opti_comp_channel_major()
 78 |         # print("\n")
 79 | 
 80 |     # the main optimization of memory-bound and row-major case;
 81 |     def opti_mem_row_major(self):
 82 |         # set the initial guess;
 83 |         x0 = self.init_guess()
 84 |         # for row_major_constraint1
 85 |         con1 = {'type': 'ineq', 'fun': self.row_major_constraint}
 86 |         # for mem_bound_constraint
 87 |         con2 = {'type': 'ineq', 'fun': self.row_major_mem_bound_constraint}
 88 |         # for the buffer_constraint
 89 |         con3 = {'type': 'ineq', 'fun': self.buffer_constraint1}
 90 |         con4 = {'type': 'ineq', 'fun': self.buffer_constraint2}
 91 | 
 92 |         # summarize all the bounds and constraints
 93 |         bnds = self.variable_boundary()
 94 |         cons = ([con1, con2, con3, con4])
 95 | 
 96 |         # call the external solver to find the solution
 97 |         solution = minimize(self.row_major_mem_obj, x0, method='SLSQP',\
 98 |                         bounds=bnds, constraints=cons)
 99 | 
100 |         passed = True
101 |         if np.any(np.isnan(solution.x)):
102 |             passed = False
103 |             # print("Solution with NaN, abort!")
104 |         # validate the solution against each constraint
105 |         if passed and self.row_major_constraint(solution.x) < -Threshold:
106 |             passed = False
107 |             # print("row major constraint", self.row_major_constraint(solution.x), "NOT PASSED.")
108 |         if passed and self.buffer_constraint2(solution.x) < -Threshold:
109 |             passed = False
110 |             # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!")
111 |             # print("buffer constraint", buffer_constraint2(solution.x))
112 |         if passed and self.row_major_mem_bound_constraint(solution.x) < -Threshold:
113 |             passed = False
114 |             # print("row-major memory-bound", self.row_major_mem_bound_constraint(solution.x), \
115 |             #      " no longer bounded!")
116 | 
117 |         if passed:
118 |             # print("Row-major memory-bound case PASSED!")
119 |             self.process_parameter(solution.x, True, False)
120 |         else:
121 |             return None
122 | 
123 |     # the main optimization of compute-bound and row-major case;
124 |     def opti_comp_row_major(self):
125 |         # set the initial guess;
126 |         x0 = self.init_guess()
127 |         # row-major ordering constraint
128 |         con1 = {'type': 'ineq', 'fun': self.row_major_constraint}
129 |         # compute-bound constraint
130 |         con2 = {'type': 'ineq', 'fun': self.row_major_comp_bound_constraint}
131 |         # buffer-size constraints (lower and upper bound)
132 |         con3 = {'type': 'ineq', 'fun': self.buffer_constraint1}
133 |         con4 = {'type': 'ineq', 'fun': self.buffer_constraint2}
134 | 
135 |         # collect all the bounds and constraints
136 |         bnds = self.variable_boundary()
137 |         cons = ([con1, con2, con3, con4])
138 | 
139 |         # call the external solver to solve the solution
140 |         solution = minimize(self.row_major_comp_obj, x0, method='SLSQP',\
141 |                         bounds=bnds, constraints=cons)
142 | 
143 |         passed = True
144 |         if np.any(np.isnan(solution.x)):
145 |             passed = False
146 |             # print("Solution with NaN, abort!")
147 |         # check the validation
148 |         if passed and self.row_major_constraint(solution.x) < -Threshold:
149 |             passed = False
150 |             # print("row major constraint", self.row_major_constraint(solution.x), "NOT PASSED.")
151 |         if passed and self.buffer_constraint2(solution.x) < -Threshold:
152 |             passed = False
153 |             # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!")
154 |         if passed and self.row_major_comp_bound_constraint(solution.x) < -Threshold:
155 |             passed = False
156 |             # print("Row-major compute-bound", self.row_major_comp_bound_constraint(solution.x), \
157 |             #     " no longer bounded!")
158 | 
159 |         if passed:
160 |             # print("Row-major compute-bound case PASSED!")
161 |             self.process_parameter(solution.x, True, True)
162 |         else:
163 |             return None
164 | 
165 | 
166 | 
167 |     # the main optimization of memory-bound and channel-major case;
168 |     def opti_mem_channel_major(self):
169 |         # set the initial guess;
170 |         x0 = self.init_guess()
171 |         # channel-major ordering constraint
172 |         con1 = {'type': 'ineq', 'fun': self.channel_major_constraint}
173 |         # memory-bound constraint
174 |         con2 = {'type': 'ineq', 'fun': self.channel_major_mem_bound_constraint}
175 |         # buffer-size constraints (lower and upper bound)
176 |         con3 = {'type': 'ineq', 'fun': self.buffer_constraint1}
177 |         con4 = {'type': 'ineq', 'fun': self.buffer_constraint2}
178 | 
179 |         # collect all the bounds and constraints
180 |         bnds = self.variable_boundary()
181 |         cons = ([con1, con2, con3, con4])
182 | 
183 |         # call the external solver to solve the solution
184 |         solution = minimize(self.channel_major_mem_obj, x0, method='SLSQP',\
185 |                         bounds=bnds, constraints=cons)
186 | 
187 |         passed = True
188 |         if np.any(np.isnan(solution.x)):
189 |             passed = False
190 |             # print("Solution with NaN, abort!")
191 |         # check the validation
192 |         if passed and self.channel_major_constraint(solution.x) < -Threshold:
193 |             passed = False
194 |             # print("channel major constraint", self.channel_major_constraint(solution.x), "NOT PASSED.")
195 |         if passed and self.buffer_constraint2(solution.x) < -Threshold:
196 |             passed = False
197 |             # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!")
198 |         if passed and self.channel_major_mem_bound_constraint(solution.x) < -Threshold:
199 |             passed = False
200 |             # print("Channel-major memory-bound", self.channel_major_mem_bound_constraint(solution.x), \
201 |             #     " no longer bounded!")
202 | 
203 |         if passed:
204 |             # print("Channel-major memory-bound case PASSED!")
205 |             self.process_parameter(solution.x, False, False)
206 |         else:
207 |             return None
208 | 
209 | 
210 |     # the main optimization of compute-bound and channel-major case;
211 |     def opti_comp_channel_major(self):
212 |         # set the initial guess;
213 |         x0 = self.init_guess()
214 |         # channel-major ordering constraint
215 |         con1 = {'type': 'ineq', 'fun': self.channel_major_constraint}
216 |         # compute-bound constraint
217 |         con2 = {'type': 'ineq', 'fun': self.channel_major_comp_bound_constraint}
218 |         # buffer-size constraints (lower and upper bound)
219 |         con3 = {'type': 'ineq', 'fun': self.buffer_constraint1}
220 |         con4 = {'type': 'ineq', 'fun': self.buffer_constraint2}
221 | 
222 |         # collect all the bounds and constraints
223 |         bnds = self.variable_boundary()
224 |         cons = ([con1, con2, con3, con4])
225 | 
226 |         # call the external solver to solve the solution
227 |         solution = minimize(self.channel_major_comp_obj, x0, method='SLSQP',\
228 |                         bounds=bnds, constraints=cons)
229 | 
230 |         passed = True
231 |         if np.any(np.isnan(solution.x)):
232 |             passed = False
233 |             # print("Solution with NaN, abort!")
234 |         # check the validation
235 |         if passed and self.channel_major_constraint(solution.x) < -Threshold:
236 |             passed = False
237 |             # print("channel major constraint", self.channel_major_constraint(solution.x), "NOT PASSED.")
238 |         if passed and self.buffer_constraint2(solution.x) < -Threshold:
239 |             passed = False
240 |             # print("buffer size", self.buffer_constraint1(solution.x), "is OVER limit!")
241 |         if passed and self.channel_major_comp_bound_constraint(solution.x) < -Threshold:
242 |             passed = False
243 |             # print("Channel-major compute-bound", self.channel_major_comp_bound_constraint(solution.x), \
244 |             #     " no longer bounded!")
245 | 
246 |         if passed:
247 |             # print("Channel-major compute-bound case PASSED!")
248 |             self.process_parameter(solution.x, False, True)
249 |         else:
250 |             return None
251 | 
252 |     ###############################################################
253 |     #                     general computations                    #
254 |     ###############################################################
255 | 
256 |     def ofmap_tile(self, x):
257 |         return x[0]*x[1]*x[2]
258 | 
259 |     def weight_tile(self, num):
260 |         return self.Ci*self.K_h*self.K_w*num
261 | 
262 |     def ifmap_tile(self, x):
263 |         return self.Ci*(self.S*x[1]+2)*(self.S*x[2]+2)
264 | 
265 |     def total_ofmap_size(self):
266 |         return self.H*self.W*self.Co
267 | 
268 |     def total_weight_size(self):
269 |         return self.weight_tile(self.Co)
270 | 
271 |     ###############################################################
272 |     #                     general constraints                     #
273 |     ###############################################################
274 |     # the lower bound of the buffer size;
275 |     # make sure the buffer utilization is always larger than 0
276 |     def buffer_constraint1(self, x):
277 |         # buffer = ofmap + weights + ifmap
278 |         return (self.ofmap_tile(x) +
279 |                 self.weight_tile(x[0]) +
280 |                 self.ifmap_tile(x))
281 | 
282 |     # the upper bound of the buffer size;
283 |     # make sure the buffer utilization is
284 |     # always smaller than buffer size;
285 |     def buffer_constraint2(self, x):
286 |         return (self.buf_size - self.buffer_constraint1(x))
287 | 
288 |     # set initial guess for constrained optimization
289 |     def init_guess(self):
290 |         # set the initial guess;
291 |         x0 = [min(self.A, self.Co), \
292 |               min(math.floor(math.sqrt(self.A)), self.H), \
293 |               min(math.floor(math.sqrt(self.A)), self.W)]
294 |         if self.combined:
295 |             # seed with the static result; assumes self.data and self.sys_info exist
296 |             x0 = layer_static_method.LayerStaticMethod(
297 |                 self.data, self.sys_info, [3.0, 3.0, 4.0]).optimize()["c_0, w_0, h_0"]
298 | 
299 |         return x0
300 | 
301 |     # set constraints for the variables in the optimization
302 |     def variable_boundary(self):
303 |       return ((min(self.A, self.Co), self.Co),
304 |               (min(math.floor(math.sqrt(self.A)), self.H), self.H),
305 |               (min(math.floor(math.sqrt(self.A)), self.W), self.W))
306 | 
307 | 
308 |     ###############################################################
309 |     #       row-major constraint solving obj and constraints      #
310 |     ###############################################################
311 | 
312 |     # the minimization objective of row-major
313 |     # this objective is a simplified expression of
314 |     # [h_0*w_0*c_0+(h_0+2)(w_0+2)*Ci]*(H*W*Co)/(h_0*w_0*c_0)
315 |     # + [K^2*Ci+h_0*w_0*c_0]*Co/c_0
316 |     # this expression finally reduces to:
317 |     #   H*W*Co/c_0 + 2(h_0+w_0)Ci*H*W*Co/(h_0*w_0*c_0) + h_0*w_0*Co/c_0
318 |     def row_major_mem_obj(self, x):
319 |       return (self.ofmap_tile(x) + self.ifmap_tile(x)) \
320 |           * (self.total_ofmap_size()/self.ofmap_tile(x) - self.Co/x[0]) \
321 |           + self.total_weight_size()/x[0] + x[1]*x[2]*self.Co
322 | 
323 |     def row_major_comp_obj(self, x):
324 |       return self.total_ofmap_size() / self.ofmap_tile(x)
325 | 
326 |     # make sure the load for row-major is always less than
327 |     # load for channel-major, range : [0, +inf]
328 |     def row_major_constraint(self, x):
329 |         # simplified from K^2*C*c_0 >= C*((S*h_0+2)*(S*w_0+2))
330 |         return self.K_h*self.K_w*x[0] - (self.S*x[1]+2)*(self.S*x[2]+2)
331 | 
332 |     # make sure the process is always memory-bound,
333 |     # i.e., the latency of memory access is always
334 |     # greater than the latency of compute;
335 |     # (c_0*(h_0*w_0)+C*((S*h_0+2)*(S*w_0+2)))/B >= (K^2*C/A^2)*c_0*w_0*h_0
336 |     # range : [0, +inf]
337 |     def row_major_mem_bound_constraint(self, x):
338 |       return (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B \
339 |           - self.weight_tile(1)/(self.A*self.A)*self.ofmap_tile(x)
340 | 
341 |     # make sure the process is always compute-bound,
342 |     # i.e., the latency of compute is always
343 |     # greater than the latency of memory access;
344 |     # (c_0*(h_0*w_0)+C*((S*h_0+2)*(S*w_0+2)))/B <= (K^2*C/A^2)*c_0*w_0*h_0
345 |     # range : [0, +inf]
346 |     def row_major_comp_bound_constraint(self, x):
347 |         return self.weight_tile(1)/(self.A*self.A)*self.ofmap_tile(x) \
348 |             - (self.ofmap_tile(x) + self.ifmap_tile(x)) / self.B
349 | 
350 |     ###############################################################
351 |     #     channel-major constraint solving obj and constraints    #
352 |     ###############################################################
353 | 
354 |     # the minimization objective of channel-major
355 |     # this is the simplified expression of
356 |     # (K^2*Ci*c_0+h_0*w_0*c_0)*(H*W*Co)/(h_0*w_0*c_0)
357 |     # + [(h_0+2)(w_0+2)*Ci + h_0*w_0*c_0]*(H*W)/(h_0*w_0)
358 |     def channel_major_mem_obj(self, x):
359 |         return self.total_weight_size()/(x[1]*x[2]) + \
360 |             2*(self.S*x[1]+self.S*x[2])*self.Co/(x[1]*x[2])+1/x[0]
361 | 
362 |     # the minimization function minimizes the
363 |     # channel-major compute-bound objective
364 |     def channel_major_comp_obj(self, x):
365 |         return self.total_ofmap_size()/(x[1]*x[2]*x[0])
366 | 
367 |     # make sure the load for channel-major is always less than
368 |     # load for row-major, range : [0, +inf]
369 |     def channel_major_constraint(self, x):
370 |         # simplified from K^2*C*c_0 <= C*((S*h_0+2)*(S*w_0+2))
371 |         return (self.S*x[1]+2)*(self.S*x[2]+2) - self.K_h*self.K_w*x[0]
372 | 
373 |     # make sure the process is always memory-bound,
374 |     # i.e., the latency of memory access is always
375 |     # greater than the latency of compute;
376 |     # c_0*(h_0*w_0+K^2*C)/B >= (K^2*C/A^2)*c_0*(h_0*w_0)
377 |     # range : [0, +inf]
378 |     def channel_major_mem_bound_constraint(self, x):
379 |         return (x[1]*x[2] + self.weight_tile(1)) / self.B \
380 |             - self.weight_tile(1)/(self.A*self.A)*x[1]*x[2]
381 | 
382 |     # make sure the process is always compute-bound,
383 |     # i.e., the latency of compute is always
384 |     # greater than the latency of memory access;
385 |     # c_0*(h_0*w_0+K^2*C)/B <= (K^2*C/A^2)*c_0*(h_0*w_0)
386 |     # range : [0, +inf]
387 |     def channel_major_comp_bound_constraint(self, x):
388 |         return self.K_h*self.K_w*self.Co/(self.A*self.A)*x[1]*x[2] \
389 |             - (x[1]*x[2]+self.K_h*self.K_w*self.Co)/self.B
390 | 
391 | 
392 | 
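
The tile-size helpers and buffer constraints of `layer_optimizer.py` can also be tried outside the class. The block below is a minimal standalone sketch, not the repository's code. The `Layer` tuple, the free functions, the `threshold` default, and the sample dimensions are hypothetical. Only the formulas are copied from `ofmap_tile`, `weight_tile`, `ifmap_tile`, and `buffer_constraint2`, and the feasibility check mirrors the `< -Threshold` tests in the `opti_*` methods.

```python
import math
from collections import namedtuple

# Hypothetical container; the real class keeps these as attributes on self.
Layer = namedtuple("Layer", ["Ci", "K_h", "K_w", "S", "buf_size"])

def ofmap_tile(x):
    # x = [c_0, h_0, w_0]: output channels, tile height, tile width
    return x[0] * x[1] * x[2]

def weight_tile(layer, num):
    # weights needed for `num` output channels
    return layer.Ci * layer.K_h * layer.K_w * num

def ifmap_tile(layer, x):
    # input tile with a 2-element halo in each spatial dimension
    return layer.Ci * (layer.S * x[1] + 2) * (layer.S * x[2] + 2)

def buffer_usage(layer, x):
    # buffer = ofmap + weights + ifmap
    return ofmap_tile(x) + weight_tile(layer, x[0]) + ifmap_tile(layer, x)

def buffer_slack(layer, x):
    # non-negative when the tiles fit, same sign convention as buffer_constraint2
    return layer.buf_size - buffer_usage(layer, x)

def constraints_hold(x, constraints, threshold=1e-6):
    # reject NaN solutions, then require every constraint to be >= -threshold
    if any(math.isnan(v) for v in x):
        return False
    return all(fun(x) >= -threshold for fun in constraints)

# Assumed sample layer: 64 input channels, 3x3 kernel, stride 1.
layer = Layer(Ci=64, K_h=3, K_w=3, S=1, buf_size=1572864)
x = [16, 4, 4]
usage = buffer_usage(layer, x)
slack = buffer_slack(layer, x)
feasible = constraints_hold(x, [lambda v: buffer_slack(layer, v)])
```

The four `opti_*` methods differ only in the objective and in two of the constraints, so a helper like `constraints_hold` is one way the repeated validation could be shared.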


--------------------------------------------------------------------------------
/layer_static_method.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | # my own module
  8 | from layer_base_method import *
  9 | 
 10 | ###############################################################
 11 | #                       general process                       #
 12 | ###############################################################
 13 | class LayerStaticMethod(LayerBaseMethod):
 14 | 
 15 |     # list to store the results of the four evaluated configurations
 16 |     res = []
 17 | 
 18 |     def __init__(self, data, sys_info, buffer_partition=None):
 19 |         super(LayerStaticMethod, self).__init__(data, sys_info)
 20 |         self.rets = []
 21 |         if buffer_partition:
 22 |             # calculate buffer sizes
 23 |             self.bufi_size = self.buf_size* \
 24 |                 buffer_partition[0]/sum(buffer_partition)
 25 |             self.bufo_size = self.buf_size* \
 26 |                 buffer_partition[1]/sum(buffer_partition)
 27 |             self.bufw_size = self.buf_size* \
 28 |                 buffer_partition[2]/sum(buffer_partition)
 29 | 
 30 |     # the main optimization routine;
 31 |     def opti_buffer(self):
 32 |         # set the initial guess;
 33 |         x0 = [self.A, self.A]
 34 | 
 35 |         # check if the initial configuration can hold the minimum requirements
 36 |         if ((x0[0]*self.K_h*self.K_w*self.Ci > self.bufw_size) or
 37 |             (self.S*self.S*x0[1]*self.Ci > self.bufi_size)):
 38 |             return
 39 | 
 40 |         # first, let's find the number of kernels we can put into the buffer.
 41 |         while (x0[0]+self.A)*self.K_h*self.K_w*self.Ci < self.bufw_size:
 42 |             x0[0] = x0[0]+self.A
 43 | 
 44 |         # next, let's see how much of the ifmap we can fit into the buffer.
 45 |         if self.K_h*(2*self.S+self.W) >= self.bufi_size:
 46 |             while self.K_h*(self.S+x0[1]+self.A)*self.Ci < self.bufi_size:
 47 |                 x0[1] = x0[1]+self.A
 48 |             # no need to optimize the buffer for ofmap, because it is
 49 |             # bounded by ifmap.
 50 |             x = [x0[0], x0[1], 1]
 51 |         else:
 52 |             h_0 = 1
 53 |             while self.K_h*(self.S+self.W)*(self.S+h_0+1)*self.Ci < self.bufi_size:
 54 |                 h_0 += 1
 55 |             # no need to optimize the buffer for ofmap, because it is
 56 |             # bounded by ifmap.
 57 |             x = [x0[0], self.W, h_0]
 58 | 
 59 |         self.process_parameter(x, False, False)
 60 |         self.process_parameter(x, False, True)
 61 |         self.process_parameter(x, True, False)
 62 |         self.process_parameter(x, True, True)
 63 | 
 64 | 
 65 |     def optimize_one_buffer_partition(self):
 66 |         # if sum of bufi and bufw is over the self.buf_size
 67 |         # we should skip it.
 68 |         if (self.bufi_size + self.bufw_size) > self.buf_size:
 69 |             return
 70 | 
 71 |         self.bufo_size = self.buf_size - self.bufi_size - self.bufw_size
 72 |         # both cases are possible;
 73 |         self.opti_buffer()
 74 | 
 75 |         if len(self.res) == 0:
 76 |             return
 77 | 
 78 |         # choose the larger value as the bottleneck
 79 |         row_major_res = None
 80 |         if (self.res[0]["total_cycle"] < self.res[1]["total_cycle"]):
 81 |             row_major_res = self.res[1]
 82 |         else:
 83 |             row_major_res = self.res[0]
 84 | 
 85 |         # choose the larger value as the bottleneck
 86 |         channel_major_res = None
 87 |         if (self.res[2]["total_cycle"] < self.res[3]["total_cycle"]):
 88 |             channel_major_res = self.res[3]
 89 |         else:
 90 |             channel_major_res = self.res[2]
 91 | 
 92 |         # return the result with fewer cycles as the preferred compute ordering.
 93 |         ret = None
 94 |         if (row_major_res["total_cycle"] < channel_major_res["total_cycle"]):
 95 |             ret = dict(row_major_res)
 96 |         else:
 97 |             ret = dict(channel_major_res)
 98 | 
 99 |         self.rets.append(ret)
100 | 
101 |     # optimize one layer
102 |     def optimize(self):
103 |         self.init_setup()
104 | 
105 |         # start the optimization
106 |         self.res = []
107 |         self.optimize_one_buffer_partition()
108 | 
109 |         # find the best result
110 |         ret  = dict(self.rets[0])
111 | 
112 |         for item in self.rets:
113 |             if ret["total_cycle"] > item["total_cycle"]:
114 |                 ret = dict(item)
115 |             if ret["total_cycle"] == item["total_cycle"] and \
116 |                 ret["total_transfer"] > item["total_transfer"]:
117 |                 ret = dict(item)
118 | 
119 |         return ret
120 | 
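
The two arithmetic steps of `LayerStaticMethod` (splitting the buffer by `buffer_partition`, then picking the best result) are easy to check in isolation. This is a small sketch under stated assumptions: `partition_buffer`, `pick_best`, and the candidate dictionaries are hypothetical names and data. The split follows `__init__` (entries map to ifmap, ofmap, and weight in that order), and `pick_best` should select the same entry as the loop in `optimize`: fewest cycles first, then least transfer.

```python
def partition_buffer(buf_size, buffer_partition):
    # split buf_size proportionally, in the order used by __init__:
    # [0] -> ifmap, [1] -> ofmap, [2] -> weight
    total = float(sum(buffer_partition))
    names = ("ifmap", "ofmap", "weight")
    return {n: buf_size * p / total for n, p in zip(names, buffer_partition)}

def pick_best(results):
    # lexicographic minimum: total_cycle first, total_transfer as tie-breaker
    return min(results, key=lambda r: (r["total_cycle"], r["total_transfer"]))

# Buffer size and partition taken from runner.sh.
sizes = partition_buffer(1572864, [3, 3, 4])

# Made-up candidates, only to exercise the tie-breaking rule.
candidates = [
    {"total_cycle": 1200, "total_transfer": 900},
    {"total_cycle": 1000, "total_transfer": 800},
    {"total_cycle": 1000, "total_transfer": 700},
]
best = pick_best(candidates)
```

Note that `min` raises `ValueError` on an empty list, just as `self.rets[0]` raises `IndexError` when no partition produced a result, so a caller may want to check for that case first.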


--------------------------------------------------------------------------------
/multi_layer_perceptron.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python2.7
  2 | 
  3 | # public library
  4 | import math
  5 | import numpy as np
  6 | 
  7 | class MultiLayerPerceptron(object):
  8 |     """docstring for MultiLayerPerceptron"""
  9 |     # info for systolic array
 10 |     A = None       # systolic array dimension
 11 | 
 12 |     # memory bandwidth: number of bytes that can be transferred.
 13 |     B = None
 14 | 
 15 |     # on-chip buffer size
 16 |     buf_size = None
 17 | 
 18 |     # input layer dimension
 19 |     N = None        # numbers of feature (NumberOfPoints x NumberOfFeature)
 20 |     Ci = None       # channels for ifmap
 21 |     Co = None       # channels for ofmap
 22 | 
 23 |     # on-chip buffer size
 24 |     bufi_size = None
 25 |     bufo_size = None
 26 |     bufw_size = None
 27 | 
 28 |     # store the layer description and the system configuration
 29 |     def __init__(self, data, sys_info):
 30 |         self.data = data
 31 |         self.sys_info = sys_info
 32 |         self.A = sys_info["sa_size"]
 33 |         self.B = sys_info["memory_bandwidth"]/(sys_info["bit_width"]/8)
 34 |         self.buf_size = sys_info["bufsize"]
 35 | 
 36 |     def init_setup(self):
 37 |         layer_info = self.data
 38 |        
 39 |         # set up the new layer information
 40 |         [self.N, self.Ci] = layer_info["ifmap"]
 41 |         self.Co = layer_info["out_channel"]
 42 | 
 43 |         self.bufw_size = self.Co * self.Ci
 44 | 
 45 |     ###############################################################
 46 |     #                       general process                       #
 47 |     ###############################################################
 48 | 
 49 |     # compute buffer utilization
 50 |     def buffer_utilization(self, x):
 51 |         # buffer = ofmap + weights + ifmap
 52 |         return (x*self.Co + self.Ci*self.Co + x*self.Ci)
 53 | 
 54 |     # (ofmap + ifmap)*total_batch + weights
 55 |     def data_transfer(self, x):
 56 |         # calculate the total batch
 57 |         total_batch = math.ceil(float(self.N) / x)
 58 | 
 59 |         # ofmap, ifmap and kernel tile size
 60 |         ofmap_tile_size = self.Co * x
 61 |         ifmap_tile_size = self.Ci * x
 62 |         kernel_tile_size = self.Co*self.Ci
 63 | 
 64 |         # ofmap + ifmap transfer
 65 |         total_transfer = (ofmap_tile_size + ifmap_tile_size) * total_batch
 66 | 
 67 |         # add the weight transfer, counted once
 68 |         total_transfer += kernel_tile_size
 69 | 
 70 |         return total_transfer
 71 | 
 72 |     def systolic_array_utilization(self, x):
 73 |         A = self.A
 74 |         A_w_uiti = math.ceil(self.Co/math.ceil(float(self.Co)/A))
 75 | 
 76 |         total_usage = x * self.Co
 77 |         round_up_val = math.ceil(float(x)/A) * A \
 78 |                      * math.ceil(float(self.Co)/A)*A
 79 | 
 80 |         # the pct of extra delay due to output-stationary
 81 |         delay_pct = float(self.Ci)/(self.Ci+A_w_uiti)
 82 | 
 83 |         return delay_pct * total_usage / round_up_val
 84 | 
 85 |     def compute_bound_cycle(self, util_rate):
 86 |         # total number of ops
 87 |         total_computation = (self.N*self.Ci*self.Co)
 88 | 
 89 |         # systolic array calculation capacity
 90 |         comp_cap = (self.A*self.A) * util_rate
 91 | 
 92 |         return total_computation / comp_cap
 93 | 
 94 |     def process_parameter(self, x):
 95 | 
 96 |         x = math.floor(x)
 97 |         bound = "C"
 98 |         # make the tile size even for every batch
 99 |         x_0 = min(self.N/math.ceil(self.N/round(x)), self.N)
100 | 
101 |         # (ofmap + ifmap)*total_batch + weights
102 |         total_transfer = self.data_transfer(x_0)
103 | 
104 |         # compute the utilization of systolic array
105 |         util_sys_arr = self.systolic_array_utilization(x_0)
106 | 
107 |         # compute the utilization of buffer
108 |         util_buf = float(self.buffer_utilization(x_0))/self.buf_size
109 | 
110 |         if util_buf > 1.01:
111 |             print("ERROR: the utilization of buffer is over 100%")
112 |             exit()
113 | 
114 |         # calculate the amount of cycles of computing all elements.
115 |         if self.compute_bound_cycle(util_sys_arr) > total_transfer/self.B:
116 |             bound = "C"
117 |             total_cycle = self.compute_bound_cycle(util_sys_arr)
118 |         else:
119 |             bound = "M"
120 |             total_cycle = total_transfer/self.B
121 | 
122 |         ret = {
123 |             "total_transfer": round(total_transfer),
124 |             "total_cycle": round(total_cycle),
125 |             "systolic_array_utilization": util_sys_arr,
126 |             "buffer_utilization": util_buf,
127 |             "buffer-partition [I,W,O]": [int(self.bufi_size), 
128 |                                          int(self.bufw_size), 
129 |                                          int(self.bufo_size)], 
130 |             "x_0": math.floor(x_0),
131 |             "Bound" : bound
132 |         }
133 | 
134 |         return ret
135 | 
136 |     # optimize one layer
137 |     def optimize(self):
138 |         self.init_setup()
139 | 
140 |         # the entire weight must fit in the buffer;
141 |         # otherwise we cannot proceed.
142 |         if self.bufw_size > self.buf_size:
143 |             print("FAIL: the entire weight cannot be stored in buffer")
144 |             exit()
145 | 
146 |         self.bufi_size = (self.buf_size - self.bufw_size)*self.Ci/(self.Ci+self.Co)
147 |         self.bufo_size = (self.buf_size - self.bufw_size)*self.Co/(self.Ci+self.Co)
148 | 
149 |         # set the initial guess;
150 |         x0 = self.A
151 | 
152 |         # let's see how large a batch of the ifmap we can fit into the buffer.
153 |         while x0 < self.N and (x0+self.A)*self.Ci < self.bufi_size:
154 |             x0 = x0 + self.A
155 | 
156 |         return self.process_parameter(x0)
157 | 
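
The compute-bound versus memory-bound decision in `process_parameter` comes down to comparing two cycle counts. The sketch below restates it as free functions. `classify_bound`, the layer dimensions, and `util_rate=1.0` are assumptions for illustration (the class derives the utilization from `systolic_array_utilization`), while the formulas follow `data_transfer` and `compute_bound_cycle`.

```python
import math

def data_transfer(N, Ci, Co, x):
    # (ofmap + ifmap) * total_batch + weights
    total_batch = math.ceil(float(N) / x)
    return (Co * x + Ci * x) * total_batch + Co * Ci

def classify_bound(N, Ci, Co, x, A, B, util_rate):
    # cycles needed by the systolic array at the given utilization
    compute_cycles = float(N * Ci * Co) / (A * A * util_rate)
    # cycles needed to move the data at B elements per cycle
    memory_cycles = data_transfer(N, Ci, Co, x) / B
    if compute_cycles > memory_cycles:
        return "C", compute_cycles
    return "M", memory_cycles

# As in __init__: memory_bandwidth / (bit_width / 8), values from runner.sh.
B = 25.6 / (16 / 8.0)

transfer = data_transfer(N=1024, Ci=64, Co=128, x=256)
bound, cycles = classify_bound(N=1024, Ci=64, Co=128, x=256,
                               A=16, B=B, util_rate=1.0)
```

With these sample numbers the layer comes out compute-bound, because the array needs about twice as many cycles as the memory transfer.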


--------------------------------------------------------------------------------
/runner.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
 4 |         --model_type 2D \
 5 |         --bufsize 1572864 \
 6 |         --bit_width 16 \
 7 |         --memory_bandwidth 25.6 \
 8 |         --sa_size 16 \
 9 |         --model_type 2D \
10 |         --ifmap 960 576 6 \
11 |         --static True \
12 |         --buffer_partition 3 3 4
13 | 
14 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
15 |         --model_type 2D \
16 |         --search_method Constrained \
17 |         --bufsize 1572864 \
18 |         --bit_width 16 \
19 |         --memory_bandwidth 25.6 \
20 |         --sa_size 16 \
21 |         --model_type 2D \
22 |         --ifmap 960 576 6
23 | 
24 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
25 |         --model_type 2D \
26 |         --search_method Exhaustive \
27 |         --bufsize 1572864 \
28 |         --bit_width 16 \
29 |         --memory_bandwidth 25.6 \
30 |         --sa_size 16 \
31 |         --model_type 2D \
32 |         --ifmap 960 576 6
33 | 
34 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
35 |         --model_type 2D \
36 |         --search_method Constrained \
37 |         --split True \
38 |         --bufsize 1572864 \
39 |         --bit_width 16 \
40 |         --memory_bandwidth 25.6 \
41 |         --sa_size 16 \
42 |         --model_type 2D \
43 |         --ifmap 960 576 6
44 | 
45 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
46 |         --model_type 2D \
47 |         --search_method Exhaustive \
48 |         --split True \
49 |         --bufsize 1572864 \
50 |         --bit_width 16 \
51 |         --memory_bandwidth 25.6 \
52 |         --sa_size 16 \
53 |         --model_type 2D \
54 |         --ifmap 960 576 6
55 | 
56 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
57 |         --model_type 2D \
58 |         --search_method Constrained \
59 |         --split True \
60 |         --combine True \
61 |         --bufsize 1572864 \
62 |         --bit_width 16 \
63 |         --memory_bandwidth 25.6 \
64 |         --sa_size 16 \
65 |         --model_type 2D \
66 |         --ifmap 960 576 6
67 | 
68 | python dataflow_search.py --dnnfile dnns/flowNetC.txt \
69 |         --model_type 2D \
70 |         --search_method Exhaustive \
71 |         --split True \
72 |         --combine True \
73 |         --bufsize 1572864 \
74 |         --bit_width 16 \
75 |         --memory_bandwidth 25.6 \
76 |         --sa_size 16 \
77 |         --model_type 2D \
78 |         --ifmap 960 576 6
79 | 


--------------------------------------------------------------------------------
/runner3d.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | python dataflow_search.py --dnnfile dnns/test3d.txt \
 4 |         --model_type 3D \
 5 |         --search_method Constrained \
 6 |         --bufsize 1572864 \
 7 |         --bit_width 16 \
 8 |         --memory_bandwidth 25.6 \
 9 |         --sa_size 16 \
10 |         --model_type 3D \
11 |         --ifmap3d 480 288 96 64
12 | 
13 | python dataflow_search.py --dnnfile dnns/test3d.txt \
14 |         --model_type 3D \
15 |         --search_method Exhaustive \
16 |         --bufsize 1572864 \
17 |         --bit_width 16 \
18 |         --memory_bandwidth 25.6 \
19 |         --sa_size 16 \
20 |         --model_type 3D \
21 |         --ifmap3d 480 288 96 64
22 | 
23 | python dataflow_search.py --dnnfile dnns/test3d.txt \
24 |         --model_type 3D \
25 |         --search_method Constrained \
26 |         --split True \
27 |         --bufsize 1572864 \
28 |         --bit_width 16 \
29 |         --memory_bandwidth 25.6 \
30 |         --sa_size 16 \
31 |         --model_type 3D \
32 |         --ifmap3d 480 288 96 64
33 | 
34 | python dataflow_search.py --dnnfile dnns/test3d.txt \
35 |         --model_type 3D \
36 |         --search_method Exhaustive \
37 |         --split True \
38 |         --bufsize 1572864 \
39 |         --bit_width 16 \
40 |         --memory_bandwidth 25.6 \
41 |         --sa_size 16 \
42 |         --model_type 3D \
43 |         --ifmap3d 480 288 96 64
44 | 
45 | 


--------------------------------------------------------------------------------