├── .gitignore
├── LICENSE
├── README.md
├── acl_test.cc
├── data
└── kernels.cl
├── install.sh
├── layer-test
├── README.md
├── conv2d.cc
├── depth.cc
├── gemm.cc
├── test_conv2d.py
├── test_dense.py
├── test_depth.py
├── util.cc
└── util.h
├── mali_imagenet_bench.py
├── mxnet_test.py
├── results.png
├── run_test.sh
└── spatial.png
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Lianmin Zheng
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Note: The data and scripts here are all stale. Please go to https://github.com/dmlc/tvm/wiki/Benchmark#mobile-gpu For the latest results.
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 | # Benchmarking Deep Neural Networks on ARM CPU/GPU
12 |
13 | This repo is the supporting material for [Optimizing Mobile Deep Learning on ARM GPU with TVM](http://tvmlang.org/2018/01/16/opt-mali-gpu.html)
14 |
15 | ## Inference Speed on ImageNet
16 | Tested on
17 | ```
18 | Firefly-RK3399 4G, CPU: dual-core Cortex-A72 + quad-core Cortex-A53, GPU: Mali-T860MP4
19 | Arm Compute Library: v17.12, MXNet: v1.0.1, Openblas: v0.2.18
20 | ```
21 |
22 | 
23 |
24 |
25 | ## Set Test Environment
26 | ```
27 | sudo /etc/init.d/lightdm stop
28 | sudo -i
29 | echo performance > /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor
30 | ```
31 | This can make the environment more stable.
32 |
33 | **Note**: You need more than 2.5GB of memory to run the following test.
34 | Otherwise, you must skip the test of vgg16 by replacing `--model all` with `--model resnet18` or `--model mobilenet`
35 | in the commond.
36 |
37 | ## Run Test for TVM/NNVM
38 | In TVM, we use [RPC](http://nnvm.tvmlang.org/tutorials/deploy_model_on_mali_gpu.html) to do test,
39 | so you should build TVM runtime and start a RPC server on your device.
40 | ```
41 | python -m tvm.exec.rpc_server --host 0.0.0.0 --port=9090
42 | ```
43 |
44 | Then in your host machine, run the test commond
45 | ``` bash
46 | python mali_imagenet_bench.py --target-host TARGET_HOST --host HOST --port PORT --model all
47 | ```
48 | Replace the `TARGET_HOST`, `HOST` and `PORT` with the corresponding values in your environment.
49 |
50 | For example, on my Firefly-RK3399, the commond is
51 | ``` bash
52 | python mali_imagenet_bench.py --target-host 'llvm -target=aarch64-linux-gnu -mattr=+neon' --host 10.42.0.96 --port 9090 --model all
53 | ```
54 |
55 | ## Run Test for MXNet + Openblas
56 | This test is executed locally on your device. So you need install the mxnet with openblas on your device first.
57 |
58 | ``` bash
59 | python mxnet_test.py --model all
60 | ```
61 |
62 | ## Run Test for Arm Compute Library
63 | Build ACL by cross-compile on host system.
64 | ``` bash
65 | scons Werror=1 neon=1 opencl=1 examples=1 benchmark_tests=1 os=linux arch=arm64-v8a embed_kernels=1 -j$(nproc)
66 | ```
67 |
68 | copy acl\_test.cc to the root directoy of ACL and build the acl\_test by
69 | ``` bash
70 | aarch64-linux-gnu-g++ acl_test.cc build/utils/*.o -O2 -std=c++11\
71 | -I. -Iinclude -Lbuild -Lbuild/opencl-1.2-stubs/\
72 | -larm_compute -larm_compute_graph -larm_compute_core -lOpenCL -o acl_test
73 | ```
74 |
75 | copy the binary file acl\_test to your device and run
76 | ```
77 | ./acl_test all
78 | cat result-acl.txt
79 | ```
80 | results are recored in `result-acl.txt`
81 |
82 | **Note** Some testcases (e.g. resnet) are missing because Arm Compute Library currently (v17.12) does not
83 | support skip connection in its graph runtime. Also some testcases are too slow so that be skipped.
84 |
85 | ## Result
86 | Paste the outputs on my board here.
87 |
88 | ### TVM/NNVM
89 | ```
90 | ============================================================
91 | model: vgg16, dtype: float32
92 | warm up..
93 | test..
94 | cost per image: 1.2926s
95 | ============================================================
96 | model: vgg16, dtype: float16
97 | warm up..
98 | test..
99 | cost per image: 0.6896s
100 | ============================================================
101 | model: resnet18, dtype: float32
102 | warm up..
103 | test..
104 | cost per image: 0.2041s
105 | ============================================================
106 | model: resnet18, dtype: float16
107 | warm up..
108 | test..
109 | cost per image: 0.1183s
110 | ============================================================
111 | model: mobilenet, dtype: float32
112 | warm up..
113 | test..
114 | cost per image: 0.0767s
115 | ============================================================
116 | model: mobilenet, dtype: float16
117 | warm up..
118 | test..
119 | cost per image: 0.0479s
120 | ```
121 |
122 | ### MXNet + Openblas
123 | ```
124 | ============================================================
125 | model: vgg16, dtype: float32
126 | warm up...
127 | test..
128 | cost per image: 3.0250s
129 | ============================================================
130 | model: resnet18, dtype: float32
131 | warm up...
132 | test..
133 | cost per image: 0.3977s
134 | ============================================================
135 | model: mobilenet, dtype: float32
136 | warm up...
137 | test..
138 | cost per image: 0.2914s
139 | ```
140 |
141 | ### ACL
142 | ```
143 | backend: cl model: vgg16 conv_method: gemm dtype: float32 cost: 1.64456
144 | backend: cl model: vgg16 conv_method: gemm dtype: float16 cost: 0.969372
145 | backend: cl model: vgg16 conv_method: direct dtype: float32 cost: 3.90031
146 | backend: cl model: vgg16 conv_method: direct dtype: float16 cost: 1.61179
147 | backend: cl model: mobilenet conv_method: gemm dtype: float32 cost: 0.170934
148 | backend: cl model: mobilenet conv_method: direct dtype: float32 cost: 0.173883
149 | backend: neon model: vgg16 conv_method: gemm dtype: float32 cost: 4.10269
150 | ```
151 |
152 |
--------------------------------------------------------------------------------
/acl_test.cc:
--------------------------------------------------------------------------------
1 | #include "arm_compute/graph/Graph.h"
2 | #include "arm_compute/graph/Nodes.h"
3 | #include "arm_compute/runtime/CL/CLScheduler.h"
4 | #include "arm_compute/runtime/CPP/CPPScheduler.h"
5 | #include "arm_compute/runtime/Scheduler.h"
6 | #include "support/ToolchainSupport.h"
7 | #include "utils/GraphUtils.h"
8 | #include "utils/Utils.h"
9 |
10 | #include
11 | #include
12 | #include
13 | #include
14 | #include
15 | #include
16 |
17 | using namespace arm_compute::graph;
18 | using namespace arm_compute::graph_utils;
19 |
20 | std::unique_ptr dummy() {
21 | return arm_compute::support::cpp14::make_unique(1);
22 | }
23 |
24 | void get_vgg16(Graph *graph, arm_compute::DataType type) {
25 | *graph << Tensor(TensorInfo(TensorShape(224U, 224U, 3U, 1U), 1, type))
26 | // Layer 1
27 | << ConvolutionLayer( 3U, 3U, 64U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
28 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
29 | // Layer 2
30 | << ConvolutionLayer( 3U, 3U, 64U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
31 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
32 | << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
33 | // Layer 3
34 | << ConvolutionLayer( 3U, 3U, 128U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
35 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
36 | // Layer 4
37 | << ConvolutionLayer( 3U, 3U, 128U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
38 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
39 | << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
40 | // Layer 5
41 | << ConvolutionLayer( 3U, 3U, 256U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
42 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
43 | // Layer 6
44 | << ConvolutionLayer( 3U, 3U, 256U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
45 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
46 | // Layer 7
47 | << ConvolutionLayer( 3U, 3U, 256U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
48 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
49 | << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
50 | // Layer 8
51 | << ConvolutionLayer( 3U, 3U, 512U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
52 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
53 | // Layer 9
54 | << ConvolutionLayer( 3U, 3U, 512U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
55 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
56 | // Layer 10
57 | << ConvolutionLayer( 3U, 3U, 512U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
58 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
59 | << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
60 | // Layer 11
61 | << ConvolutionLayer( 3U, 3U, 512U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
62 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
63 | // Layer 12
64 | << ConvolutionLayer( 3U, 3U, 512U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
65 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
66 | // Layer 13
67 | << ConvolutionLayer( 3U, 3U, 512U, dummy(), dummy(), PadStrideInfo(1, 1, 1, 1))
68 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
69 | << PoolingLayer(PoolingLayerInfo(PoolingType::MAX, 2, PadStrideInfo(2, 2, 0, 0)))
70 | // Layer 14
71 | << FullyConnectedLayer( 4096U, dummy(), dummy())
72 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
73 | // Layer 15
74 | << FullyConnectedLayer( 4096U, dummy(), dummy())
75 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU))
76 | // Layer 16
77 | << FullyConnectedLayer( 1000U, dummy(), dummy())
78 | // Softmax
79 | << SoftmaxLayer()
80 | << Tensor(TensorInfo(TensorShape(1000U), 1, type));
81 | }
82 |
83 | BranchLayer get_dwsc_node(const std::string &data_path, std::string &¶m_path,
84 | unsigned int conv_filt,
85 | PadStrideInfo dwc_pad_stride_info, PadStrideInfo conv_pad_stride_info)
86 | {
87 | std::string total_path = "/cnn_data/mobilenet_v1_model/" + param_path + "_";
88 | SubGraph sg;
89 | sg << DepthwiseConvolutionLayer(
90 | 3U, 3U, dummy(),
91 | std::unique_ptr(nullptr),
92 | dwc_pad_stride_info,
93 | true)
94 | << BatchNormalizationLayer(dummy(), dummy(), dummy(), dummy(), 0.001f)
95 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::BOUNDED_RELU, 6.f))
96 | << ConvolutionLayer( 1U, 1U, conv_filt, dummy(),
97 | std::unique_ptr(nullptr), conv_pad_stride_info)
98 | << BatchNormalizationLayer( dummy(), dummy(), dummy(), dummy(), 0.001f)
99 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::BOUNDED_RELU, 6.f));
100 |
101 | return BranchLayer(std::move(sg));
102 | }
103 |
104 | void get_mobilenet(Graph *graph, arm_compute::DataType type) {
105 | std::string data_path; /* Path to the trainable data */
106 |
107 | *graph << Tensor(TensorInfo(TensorShape(224U, 224U, 3U, 1U), 1, type))
108 | << ConvolutionLayer( 3U, 3U, 32U, dummy(),
109 | std::unique_ptr(nullptr),
110 | PadStrideInfo(2, 2, 0, 1, 0, 1, DimensionRoundingType::FLOOR))
111 | << BatchNormalizationLayer( dummy(), dummy(), dummy(), dummy(), 0.001f)
112 | << ActivationLayer(ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::BOUNDED_RELU, 6.f))
113 | << get_dwsc_node(data_path, "Conv2d_1", 64, PadStrideInfo(1, 1, 1, 1), PadStrideInfo(1, 1, 0, 0))
114 | << get_dwsc_node(data_path, "Conv2d_2", 128, PadStrideInfo(2, 2, 0, 1, 0, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
115 | << get_dwsc_node(data_path, "Conv2d_3", 128, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
116 | << get_dwsc_node(data_path, "Conv2d_4", 256, PadStrideInfo(2, 2, 0, 1, 0, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
117 | << get_dwsc_node(data_path, "Conv2d_5", 256, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
118 | << get_dwsc_node(data_path, "Conv2d_6", 512, PadStrideInfo(2, 2, 0, 1, 0, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
119 | << get_dwsc_node(data_path, "Conv2d_7", 512, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
120 | << get_dwsc_node(data_path, "Conv2d_8", 512, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
121 | << get_dwsc_node(data_path, "Conv2d_9", 512, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
122 | << get_dwsc_node(data_path, "Conv2d_10", 512, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
123 | << get_dwsc_node(data_path, "Conv2d_11", 512, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
124 | << get_dwsc_node(data_path, "Conv2d_12", 1024, PadStrideInfo(2, 2, 0, 1, 0, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
125 | << get_dwsc_node(data_path, "Conv2d_13", 1024, PadStrideInfo(1, 1, 1, 1, 1, 1, DimensionRoundingType::FLOOR), PadStrideInfo(1, 1, 0, 0))
126 | << PoolingLayer(PoolingLayerInfo(PoolingType::AVG))
127 | << ConvolutionLayer( 1U, 1U, 1000U, dummy(), dummy(), PadStrideInfo(1, 1, 0, 0))
128 | << ReshapeLayer(TensorShape(1000U))
129 | << SoftmaxLayer()
130 | << Tensor(TensorInfo(TensorShape(1000U), 1, type));
131 | }
132 |
133 | double measure(Graph *graph, int n_times) {
134 | arm_compute::CLScheduler::get().default_init();
135 | graph->run();
136 | arm_compute::CLScheduler::get().sync();
137 |
138 | auto tbegin = std::chrono::high_resolution_clock::now();
139 | for (int i = 0; i < n_times; i++) {
140 | graph->run();
141 | }
142 | arm_compute::CLScheduler::get().sync();
143 | auto tend = std::chrono::high_resolution_clock::now();
144 |
145 |
146 | double cost = std::chrono::duration_cast>(tend - tbegin).count();
147 | return cost / n_times;
148 | }
149 |
150 | double run_case(std::string backend, std::string model, std::string conv_method, std::string dtype) {
151 | TargetHint target_hint;
152 | ConvolutionMethodHint convolution_hint;
153 | arm_compute::DataType type;
154 |
155 | if (conv_method == "gemm") {
156 | convolution_hint = ConvolutionMethodHint::GEMM;
157 | } else {
158 | convolution_hint = ConvolutionMethodHint::DIRECT;
159 | }
160 |
161 | if (backend == "cl") {
162 | target_hint = TargetHint::OPENCL;
163 | } else {
164 | target_hint = TargetHint::NEON;
165 | }
166 |
167 | if (dtype == "float32") {
168 | type = DataType::F32;
169 | } else {
170 | type = DataType::F16;
171 | }
172 |
173 | Graph graph;
174 | graph << target_hint << convolution_hint;
175 |
176 | if (model == "mobilenet")
177 | get_mobilenet(&graph, type);
178 | else if (model == "vgg16")
179 | get_vgg16(&graph, type);
180 | else
181 | std::cout << "unknown model" << std::endl;
182 |
183 | int num_warmup, num_test;
184 |
185 | num_warmup = 10;
186 | num_test = 60;
187 |
188 | if (model == "mobilenet") { // mobilenet is fast, need more runs for stable measureament
189 | num_warmup *= 5;
190 | num_test *= 5;
191 | }
192 |
193 | // warm up
194 | measure(&graph, num_warmup);
195 |
196 | // test
197 | double cost = measure(&graph, num_test);
198 | return cost;
199 | }
200 |
201 | int main(int argc, const char **argv)
202 | {
203 | // Check if OpenCL is available and initialize the scheduler
204 | // Usage 1 : test all
205 | // Usage 2 : test [cl|neno] [mobilenet|vgg16] [gemm|direct] [float32|float16]
206 |
207 | std::ofstream fout("result-acl.txt", std::ios::app);
208 |
209 | if (strcmp(argv[1], "all") == 0) { // test all
210 | std::string backend[] = {"cl", "neon"};
211 | std::string model[] = {"vgg16", "mobilenet"};
212 | std::string conv_method[] = {"gemm", "direct"};
213 | std::string dtype[] = {"float32", "float16"};
214 |
215 | for (int i = 0; i < sizeof(backend)/sizeof(backend[0]); i++) {
216 | for (int j = 0; j < sizeof(model)/sizeof(model[0]); j++) {
217 | for (int k = 0; k < sizeof(conv_method)/sizeof(conv_method[0]); k++) {
218 | for (int l = 0; l < sizeof(dtype)/sizeof(dtype[0]); l++) {
219 |
220 | // skip some test for neon
221 | if (backend[i] == "neon" ) {
222 | continue;
223 | if (conv_method[k] == "direct") // this config is too slow, skip it
224 | continue;
225 | if (model[j] == "mobilenet") // too slow, skip it
226 | continue;
227 | if (dtype[l] == "float16") // skip the test of fp16 on CPU
228 | continue;
229 | } else {
230 | // ACL does not support FP16 depthwise conv
231 | if (model[j] == "mobilenet" && dtype[l] == "float16")
232 | continue;
233 | }
234 |
235 | double cost = run_case(backend[i], model[j], conv_method[k], dtype[l]);
236 |
237 | std::stringstream ss;
238 |
239 | std::string back_name;
240 | if (backend[i] == "cl")
241 | back_name = "mali";
242 | else
243 | back_name = "neon";
244 |
245 | ss << "backend: ARMComputeLib-" << back_name << "\tmodel: " << model[j]
246 | << "\tconv_method: " << conv_method[k] << "\tdtype: " << dtype[l]
247 | << "\tcost: " << cost;
248 | std::cout << ss.str() << std::endl;
249 | fout << ss.str() << std::endl;
250 | sleep(20);
251 | }
252 | }
253 | }
254 | }
255 | } else { // test single case
256 | std::string backend = argv[1];
257 | std::string model = argv[2];
258 | std::string conv_method = argv[3];
259 | std::string dtype = argv[4];
260 |
261 | double cost = run_case(backend, model, conv_method, dtype);
262 | std::stringstream ss;
263 | ss << "backend: " << backend << "\tmodel: " << model
264 | << "\tconv_method: " << conv_method << "\tdtype: " << dtype
265 | << "\tcost: " << cost;
266 | std::cout << ss.str() << std::endl;
267 | fout << ss.str() << std::endl;
268 | }
269 |
270 | return 0;
271 | }
272 |
273 |
--------------------------------------------------------------------------------
/data/kernels.cl:
--------------------------------------------------------------------------------
1 | // kernel for packing data
2 | __kernel void default_function__kernel0(__global void* restrict data_vec, __global float* restrict data) {
3 | for (int vh = 0; vh < 3; ++vh) {
4 | ((__global float*)data_vec)[(((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) * 256) + ((int)get_group_id(0))) * 3) + vh) * 6)] = ((((((1 - vh) <= ((int)get_group_id(2))) && (((int)get_group_id(2)) < (57 - vh))) && (1 <= ((int)get_group_id(1)))) && (((int)get_group_id(1)) < 15)) ? data[((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) + (((int)get_group_id(0)) * 784)) + (vh * 14)) * 4) + -57)] : 0.000000e+00f);
5 | ((__global float*)data_vec)[((((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) * 256) + ((int)get_group_id(0))) * 3) + vh) * 6) + 1)] = ((((((1 - vh) <= ((int)get_group_id(2))) && (((int)get_group_id(2)) < (57 - vh))) && (0 <= ((int)get_group_id(1)))) && (((int)get_group_id(1)) < 14)) ? data[((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) + (((int)get_group_id(0)) * 784)) + (vh * 14)) * 4) + -56)] : 0.000000e+00f);
6 | ((__global float*)data_vec)[((((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) * 256) + ((int)get_group_id(0))) * 3) + vh) * 6) + 2)] = ((((((1 - vh) <= ((int)get_group_id(2))) && (((int)get_group_id(2)) < (57 - vh))) && (0 <= ((int)get_group_id(1)))) && (((int)get_group_id(1)) < 14)) ? data[((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) + (((int)get_group_id(0)) * 784)) + (vh * 14)) * 4) + -55)] : 0.000000e+00f);
7 | ((__global float*)data_vec)[((((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) * 256) + ((int)get_group_id(0))) * 3) + vh) * 6) + 3)] = ((((((1 - vh) <= ((int)get_group_id(2))) && (((int)get_group_id(2)) < (57 - vh))) && (0 <= ((int)get_group_id(1)))) && (((int)get_group_id(1)) < 14)) ? data[((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) + (((int)get_group_id(0)) * 784)) + (vh * 14)) * 4) + -54)] : 0.000000e+00f);
8 | ((__global float*)data_vec)[((((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) * 256) + ((int)get_group_id(0))) * 3) + vh) * 6) + 4)] = ((((((1 - vh) <= ((int)get_group_id(2))) && (((int)get_group_id(2)) < (57 - vh))) && (0 <= ((int)get_group_id(1)))) && (((int)get_group_id(1)) < 14)) ? data[((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) + (((int)get_group_id(0)) * 784)) + (vh * 14)) * 4) + -53)] : 0.000000e+00f);
9 | ((__global float*)data_vec)[((((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) * 256) + ((int)get_group_id(0))) * 3) + vh) * 6) + 5)] = ((((((1 - vh) <= ((int)get_group_id(2))) && (((int)get_group_id(2)) < (57 - vh))) && (-1 <= ((int)get_group_id(1)))) && (((int)get_group_id(1)) < 13)) ? data[((((((((int)get_group_id(2)) * 14) + ((int)get_group_id(1))) + (((int)get_group_id(0)) * 784)) + (vh * 14)) * 4) + -52)] : 0.000000e+00f);
10 | }
11 | }
12 |
13 | // kernel for packing filter
14 | __kernel void default_function__kernel1(__global void* restrict kernel_vec, __global float* restrict weight) {
15 | float4 _1;
16 | int4 _2 = (int4)(((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9))+(2304*0), ((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9))+(2304*1), ((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9))+(2304*2), ((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9))+(2304*3));
17 | _1.s0 = weight[_2.s0];
18 | _1.s1 = weight[_2.s1];
19 | _1.s2 = weight[_2.s2];
20 | _1.s3 = weight[_2.s3];
21 | vstore4(_1, 0, (__global float*)kernel_vec + (((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36));
22 | float4 _3;
23 | int4 _4 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 1))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 1))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 1))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 1))+(2304*3));
24 | _3.s0 = weight[_4.s0];
25 | _3.s1 = weight[_4.s1];
26 | _3.s2 = weight[_4.s2];
27 | _3.s3 = weight[_4.s3];
28 | vstore4(_3, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 4));
29 | float4 _5;
30 | int4 _6 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 2))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 2))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 2))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 2))+(2304*3));
31 | _5.s0 = weight[_6.s0];
32 | _5.s1 = weight[_6.s1];
33 | _5.s2 = weight[_6.s2];
34 | _5.s3 = weight[_6.s3];
35 | vstore4(_5, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 8));
36 | float4 _7;
37 | int4 _8 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 3))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 3))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 3))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 3))+(2304*3));
38 | _7.s0 = weight[_8.s0];
39 | _7.s1 = weight[_8.s1];
40 | _7.s2 = weight[_8.s2];
41 | _7.s3 = weight[_8.s3];
42 | vstore4(_7, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 12));
43 | float4 _9;
44 | int4 _10 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 4))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 4))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 4))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 4))+(2304*3));
45 | _9.s0 = weight[_10.s0];
46 | _9.s1 = weight[_10.s1];
47 | _9.s2 = weight[_10.s2];
48 | _9.s3 = weight[_10.s3];
49 | vstore4(_9, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 16));
50 | float4 _11;
51 | int4 _12 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 5))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 5))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 5))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 5))+(2304*3));
52 | _11.s0 = weight[_12.s0];
53 | _11.s1 = weight[_12.s1];
54 | _11.s2 = weight[_12.s2];
55 | _11.s3 = weight[_12.s3];
56 | vstore4(_11, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 20));
57 | float4 _13;
58 | int4 _14 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 6))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 6))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 6))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 6))+(2304*3));
59 | _13.s0 = weight[_14.s0];
60 | _13.s1 = weight[_14.s1];
61 | _13.s2 = weight[_14.s2];
62 | _13.s3 = weight[_14.s3];
63 | vstore4(_13, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 24));
64 | float4 _15;
65 | int4 _16 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 7))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 7))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 7))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 7))+(2304*3));
66 | _15.s0 = weight[_16.s0];
67 | _15.s1 = weight[_16.s1];
68 | _15.s2 = weight[_16.s2];
69 | _15.s3 = weight[_16.s3];
70 | vstore4(_15, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 28));
71 | float4 _17;
72 | int4 _18 = (int4)((((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 8))+(2304*0), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 8))+(2304*1), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 8))+(2304*2), (((((((int)get_group_id(1)) * 1024) + ((int)get_group_id(0))) * 9) + 8))+(2304*3));
73 | _17.s0 = weight[_18.s0];
74 | _17.s1 = weight[_18.s1];
75 | _17.s2 = weight[_18.s2];
76 | _17.s3 = weight[_18.s3];
77 | vstore4(_17, 0, (__global float*)kernel_vec + ((((((int)get_group_id(1)) * 256) + ((int)get_group_id(0))) * 36) + 32));
78 | }
79 |
80 | // kernel for convolution
81 | __kernel void default_function__kernel2(__global void* restrict conv, __global void* restrict data_vec, __global void* restrict kernel_vec) {
82 | vstore4(((float4)(0.000000e+00f, 0.000000e+00f, 0.000000e+00f, 0.000000e+00f)), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
83 | vstore4(((float4)(0.000000e+00f, 0.000000e+00f, 0.000000e+00f, 0.000000e+00f)), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
84 | vstore4(((float4)(0.000000e+00f, 0.000000e+00f, 0.000000e+00f, 0.000000e+00f)), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
85 | vstore4(((float4)(0.000000e+00f, 0.000000e+00f, 0.000000e+00f, 0.000000e+00f)), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
86 | for (int ci = 0; ci < 256; ++ci) {
87 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[(((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18)], ((__global float*)data_vec)[(((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18)], ((__global float*)data_vec)[(((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18)], ((__global float*)data_vec)[(((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18)])) * vload4(0, (__global float*)kernel_vec + (((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
88 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)])) * vload4(0, (__global float*)kernel_vec + (((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
89 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)])) * vload4(0, (__global float*)kernel_vec + (((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
90 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)])) * vload4(0, (__global float*)kernel_vec + (((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
91 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 1)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 4)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
92 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 4)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
93 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 4)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
94 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 4)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
95 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 2)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 8)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
96 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 3)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 8)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
97 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 4)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 8)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
98 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 5)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 5)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 5)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 5)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 8)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
99 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 6)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 6)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 6)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 6)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 12)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
100 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 12)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
101 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 12)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
102 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 12)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
103 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 7)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 16)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
104 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 16)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
105 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 16)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
106 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 16)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
107 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 8)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 20)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
108 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 9)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 20)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
109 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 10)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 20)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
110 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 11)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 11)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 11)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 11)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 20)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
111 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 12)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 12)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 12)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 12)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 24)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
112 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 24)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
113 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 24)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
114 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 24)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
115 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 13)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 28)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
116 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 28)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
117 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 28)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
118 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 28)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
119 | vstore4((vload4(0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 14)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 32)))), 0, (__global float*)conv + (((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16));
120 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 15)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 32)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 4));
121 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 16)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 32)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 8));
122 | vstore4((vload4(0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12)) + (((float4)(((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 17)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 17)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 17)], ((__global float*)data_vec)[((((((((int)get_group_id(1)) * 14) + ((int)get_group_id(0))) * 256) + ci) * 18) + 17)])) * vload4(0, (__global float*)kernel_vec + ((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 256) + ci) * 36) + 32)))), 0, (__global float*)conv + ((((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 14) + ((int)get_group_id(0))) * 16) + 12));
123 | }
124 | }
125 |
126 | // kernel for unpacking the output
127 | __kernel void default_function__kernel3(__global float* restrict output_unpack, __global void* restrict conv) {
128 | output_unpack[((((((((int)get_group_id(2)) * 8) + ((int)get_local_id(2))) * 56) + ((int)get_group_id(1))) * 56) + ((int)get_group_id(0)))] = ((__global float*)conv)[((((((((int)get_group_id(2)) * 2) + (((int)get_local_id(2)) / 4)) * 12544) + (((int)get_local_id(2)) % 4)) + (((int)get_group_id(1)) * 224)) + (((int)get_group_id(0)) * 4))];
129 | }
130 |
131 |
--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
1 | sudo apt-get update
2 |
3 | sudo apt install llvm-4.0
4 | sudo apt install scons
5 | sudo apt install libopenblas-dev
6 | sudo apt-get -y install git cmake build-essential g++-4.8 c++-4.8 liblapack* libblas* libopencv*
7 |
8 | git clone --recursive https://github.com/dmlc/nnvm.git
9 | git clone https://github.com/ARM-software/ComputeLibrary.git --branch v17.12
10 | git clone --recursive https://github.com/apache/incubator-mxnet.git
11 |
12 | # build nnvm/tvm
13 | cd nnvm/tvm
14 | make USE_OPENCL=1 LLVM_CONFIG=llvm-config-4.0 -j4
15 | cd ..
16 | make
17 | cd ..
18 |
19 | # build arm compute library
20 | cd ComputeLibrary
21 | scons Werror=1 neon=1 opencl=1 examples=1 os=linux arch=arm64-v8a embed_kernels=1 build=native -j4
22 | cp ../acl_test.cc .
23 |
24 | g++ acl_test.cc build/utils/*.o -O2 -std=c++11 -I. -Iinclude -Lbuild -larm_compute -larm_compute_graph -larm_compute_core -lOpenCL -o acl_test
25 | cp acl_test ..
26 | cd ..
27 |
28 | # build mxnet
29 | cd incubator-mxnet
30 | make -j2 USE_OPENCV=0 USE_BLAS=openblas
31 | cd ..
32 |
33 |
--------------------------------------------------------------------------------
/layer-test/README.md:
--------------------------------------------------------------------------------
1 | # Test script for layer-wise benchmark
2 |
3 | to be documented...
4 |
5 |
--------------------------------------------------------------------------------
/layer-test/conv2d.cc:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #include
5 |
6 | #include "arm_compute/core/Types.h"
7 | #include "arm_compute/runtime/CL/CLFunctions.h"
8 | #include "arm_compute/runtime/CL/CLScheduler.h"
9 |
10 | #include "util.h"
11 |
12 | using namespace arm_compute;
13 |
14 | struct Workload {
15 | std::string in_dtype;
16 | std::string out_dtype;
17 | size_t batch;
18 | size_t height;
19 | size_t width;
20 | size_t in_filter;
21 | size_t out_filter;
22 | size_t hkernel;
23 | size_t wkernel;
24 | size_t hpad;
25 | size_t wpad;
26 | size_t hstride;
27 | size_t wstride;
28 | };
29 |
30 | // measure the cost and gflops of 2d convolution
31 | std::pair MeasureConv(const Workload &w, int times=30) {
32 | assert(w.in_dtype == w.out_dtype);
33 | Format format = DtypeToFormat(w.in_dtype);
34 |
35 | CLTensor input, weight, output;
36 | PadStrideInfo conv_info(w.wstride, w.hstride, w.wpad, w.hpad);
37 |
38 | // init OpenCL
39 | CLScheduler::get().default_init();
40 |
41 | // allocate tensors
42 | input.allocator()->init(TensorInfo(TensorShape(w.width, w.height, w.in_filter, w.batch), format));
43 | weight.allocator()->init(TensorInfo(TensorShape(w.wkernel, w.hkernel, w.in_filter, w.out_filter), format));
44 | size_t w_out = (w.width - w.wkernel + w.wpad * 2) / w.wstride + 1;
45 | size_t h_out = (w.height - w.hkernel + w.hpad * 2) / w.hstride + 1;
46 | output.allocator()->init(TensorInfo(TensorShape(w_out, h_out, w.out_filter, w.batch), format));
47 | input.allocator()->allocate();
48 | weight.allocator()->allocate();
49 | output.allocator()->allocate();
50 | CLScheduler::get().sync();
51 |
52 | // configure conv2d function
53 | CLConvolutionLayer conv2d;
54 | conv2d.configure(&input, &weight, nullptr, &output, conv_info);
55 |
56 | // run test
57 | conv2d.run();
58 | std::chrono::high_resolution_clock::time_point begin = std::chrono::high_resolution_clock::now();
59 |
60 | for (int i = 0; i < times; i++) {
61 | conv2d.run();
62 | }
63 | CLScheduler::get().sync();
64 |
65 | std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
66 |
67 | // calcuate gflops
68 | std::chrono::duration fp_ms = end - begin;
69 | double cost = fp_ms.count() / times;
70 | return std::make_pair(cost, 2.0 * w.batch * w_out * h_out * w.out_filter *
71 | w.hkernel * w.wkernel * w.in_filter / 1e9 / cost);
72 | }
73 |
74 |
75 | int main(int argc, const char **argv)
76 | {
77 | Workload to_test[] = {
78 | // vgg16
79 | // Workload{"float32", "float32", 1, 224, 224, 3, 64, 3, 3, 1, 1, 1, 1},
80 | // Workload{"float32", "float32", 1, 224, 224, 64, 64, 3, 3, 1, 1, 1, 1},
81 | // Workload{"float32", "float32", 1, 112, 112, 64, 128,3, 3, 1, 1, 1, 1},
82 | // Workload{"float32", "float32", 1, 112, 112,128, 128,3, 3, 1, 1, 1, 1},
83 | // Workload{"float32", "float32", 1, 56, 56, 128, 256, 3, 3, 1, 1, 1, 1},
84 | // Workload{"float32", "float32", 1, 56, 56, 256, 256, 3, 3, 1, 1, 1, 1},
85 | // Workload{"float32", "float32", 1, 28, 28, 256, 512, 3, 3, 1, 1, 1, 1},
86 | // Workload{"float32", "float32", 1, 28, 28, 512, 512, 3, 3, 1, 1, 1, 1},
87 | // Workload{"float32", "float32", 1, 14, 14, 512, 512, 3, 3, 1, 1, 1, 1},
88 |
89 | // resnet
90 | Workload{"float32", "float32", 1, 224, 224, 3, 64, 7, 7, 3, 3, 2, 2},
91 | Workload{"float32", "float32", 32, 224, 224, 3, 64, 7, 7, 3, 3, 2, 2},
92 | Workload{"float32", "float32", 1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1},
93 | Workload{"float32", "float32", 32, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1},
94 | Workload{"float32", "float32", 1, 56, 56, 64, 64, 1, 1, 0, 0, 1, 1},
95 | Workload{"float32", "float32", 32, 56, 56, 64, 64, 1, 1, 0, 0, 1, 1},
96 | Workload{"float32", "float32", 1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2},
97 | Workload{"float32", "float32", 32, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2},
98 | Workload{"float32", "float32", 1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2},
99 | Workload{"float32", "float32", 32, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2},
100 | Workload{"float32", "float32", 1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1},
101 | Workload{"float32", "float32", 32, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1},
102 | Workload{"float32", "float32", 1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2},
103 | Workload{"float32", "float32", 32, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2},
104 | Workload{"float32", "float32", 1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2},
105 | Workload{"float32", "float32", 32, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2},
106 | Workload{"float32", "float32", 1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1},
107 | Workload{"float32", "float32", 32, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1},
108 | Workload{"float32", "float32", 1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2},
109 | Workload{"float32", "float32", 32, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2},
110 | Workload{"float32", "float32", 1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2},
111 | Workload{"float32", "float32", 32, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2},
112 | Workload{"float32", "float32", 1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1},
113 | Workload{"float32", "float32", 32, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1},
114 |
115 | // // mobilenet
116 | // Workload{"float32", "float32", 1, 224, 224, 3, 32, 3, 3, 1, 1, 2, 2},
117 | // Workload{"float32", "float32", 1, 112, 112, 32, 64, 1, 1, 0, 0, 1, 1},
118 | // Workload{"float32", "float32", 1, 56, 56, 64, 128, 1, 1, 0, 0, 1, 1},
119 | // Workload{"float32", "float32", 1, 56, 56, 128, 128, 1, 1, 0, 0, 1, 1},
120 | // Workload{"float32", "float32", 1, 28, 28, 128, 256, 1, 1, 0, 0, 1, 1},
121 | // Workload{"float32", "float32", 1, 28, 28, 256, 256, 1, 1, 0, 0, 1, 1},
122 | // Workload{"float32", "float32", 1, 14, 14, 256, 512, 1, 1, 0, 0, 1, 1},
123 | // Workload{"float32", "float32", 1, 14, 14, 512, 512, 1, 1, 0, 0, 1, 1},
124 | // Workload{"float32", "float32", 1, 7, 7, 512, 1024, 1, 1, 0, 0, 1, 1},
125 | // Workload{"float32", "float32", 1, 7, 7, 1024,1024, 1, 1, 0, 0, 1, 1},
126 | };
127 |
128 | for (size_t i = 0; i < sizeof(to_test) / sizeof(to_test[0]); i++) {
129 | Workload &w = to_test[i];
130 | double cost, gflops;
131 | std::tie(cost, gflops) = MeasureConv(w);
132 |
133 | std::cout << std::fixed << std::setprecision(4);
134 | std::cout << w.height << "x" << w.width << 'x' << w.in_filter << "x" << w.out_filter
135 | << " " << w.hkernel << "\t";
136 | std::cout << "cost: " << cost << ", "
137 | << "GFLOPS: " << gflops << std::endl;
138 | }
139 |
140 | return 0;
141 | }
142 |
143 |
--------------------------------------------------------------------------------
/layer-test/depth.cc:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 | #include
5 |
6 | #include "arm_compute/core/Types.h"
7 | #include "arm_compute/runtime/CL/CLFunctions.h"
8 | #include "arm_compute/runtime/CL/CLScheduler.h"
9 |
10 | #include "util.h"
11 |
12 | using namespace arm_compute;
13 |
14 | struct Workload {
15 | std::string in_dtype;
16 | std::string out_dtype;
17 | size_t n;
18 | size_t height;
19 | size_t in_filter;
20 | int channel_m;
21 | size_t hkernel;
22 | size_t hpad;
23 | size_t hstride;
24 | };
25 |
26 | // measure the cost and gflops of 2d convolution
27 | std::pair MeasureConv(const Workload &w, int times=100) {
28 | assert(w.in_dtype == w.out_dtype);
29 | Format format = DtypeToFormat(w.in_dtype);
30 |
31 | CLTensor input, weight, output;
32 | PadStrideInfo conv_info(w.hstride, w.hstride, w.hpad, w.hpad);
33 |
34 | // init OpenCL
35 | CLScheduler::get().default_init();
36 |
37 | // allocate tensors
38 | input.allocator()->init(TensorInfo(TensorShape(w.height, w.height, w.in_filter), format));
39 | weight.allocator()->init(TensorInfo(TensorShape(w.hkernel, w.hkernel, w.in_filter), format));
40 | size_t h_out = (w.height - w.hkernel + w.hpad * 2) / w.hstride + 1;
41 | output.allocator()->init(TensorInfo(TensorShape(h_out, h_out, w.in_filter), format));
42 | input.allocator()->allocate();
43 | weight.allocator()->allocate();
44 | output.allocator()->allocate();
45 | CLScheduler::get().sync();
46 |
47 | // configure gemm function
48 | CLDepthwiseConvolutionLayer conv2d;
49 | conv2d.configure(&input, &weight, nullptr, &output, conv_info);
50 |
51 | // run test
52 | conv2d.run();
53 | CLScheduler::get().sync();
54 | std::chrono::high_resolution_clock::time_point begin = std::chrono::high_resolution_clock::now();
55 |
56 | for (int i = 0; i < times; i++) {
57 | conv2d.run();
58 | }
59 | CLScheduler::get().sync();
60 |
61 | std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
62 |
63 | // calcuate gflops
64 | std::chrono::duration fp_ms = end - begin;
65 | double cost = fp_ms.count() / times;
66 | return std::make_pair(cost, 2.0 * h_out * h_out *
67 | w.hkernel * w.hkernel * w.in_filter / 1e9 / cost);
68 | }
69 |
70 |
71 | int main(int argc, const char **argv)
72 | {
73 | Workload to_test[] = {
74 | // mobilenet
75 | Workload{"float32", "float32", 1, 112, 32, 1, 3, 1, 1},
76 | Workload{"float32", "float32", 1, 112, 64, 1, 3, 1, 2},
77 | Workload{"float32", "float32", 1, 56, 128, 1, 3, 1, 1},
78 | Workload{"float32", "float32", 1, 56, 128, 1, 3, 1, 2},
79 | Workload{"float32", "float32", 1, 28, 256, 1, 3, 1, 1},
80 | Workload{"float32", "float32", 1, 28, 256, 1, 3, 1, 2},
81 | Workload{"float32", "float32", 1, 14, 512, 1, 3, 1, 1},
82 | Workload{"float32", "float32", 1, 14, 512, 1, 3, 1, 2},
83 | Workload{"float32", "float32", 1, 7, 1024, 1, 3, 1, 1},
84 | };
85 |
86 | for (size_t i = 0; i < sizeof(to_test) / sizeof(to_test[0]); i++) {
87 | Workload &w = to_test[i];
88 | double cost, gflops;
89 | std::tie(cost, gflops) = MeasureConv(w);
90 |
91 | std::cout << std::fixed << std::setprecision(6);
92 | std::cout << w.height << "x" << w.height << 'x' << w.in_filter << "x" << w.in_filter
93 | << " " << w.hkernel << "\t";
94 | std::cout << "cost: " << cost << ", "
95 | << "GFLOPS: " << gflops << std::endl;
96 | }
97 |
98 | return 0;
99 | }
100 |
101 |
--------------------------------------------------------------------------------
/layer-test/gemm.cc:
--------------------------------------------------------------------------------
1 | #include
2 | #include
3 | #include
4 |
5 | #include "arm_compute/core/Types.h"
6 | #include "arm_compute/runtime/CL/CLFunctions.h"
7 | #include "arm_compute/runtime/CL/CLScheduler.h"
8 |
9 | #include "util.h"
10 |
11 | using namespace arm_compute;
12 |
13 | // measure the cost and gflops of gemm
14 | std::pair MeasureGemm(int n, int l, int m, std::string dtype, int times=30) {
15 | Format format = DtypeToFormat(dtype);
16 |
17 | CLTensor a, b, dst;
18 |
19 | // init OpenCL
20 | CLScheduler::get().default_init();
21 |
22 | // allocate tensors
23 | a.allocator()->init(TensorInfo(l, n, format));
24 | b.allocator()->init(TensorInfo(m, l, format));
25 | dst.allocator()->init(TensorInfo(m, n, format));
26 | a.allocator()->allocate();
27 | b.allocator()->allocate();
28 | dst.allocator()->allocate();
29 | CLScheduler::get().sync();
30 |
31 | // configure gemm function
32 | CLGEMM gemm;
33 | gemm.configure(&a, &b, nullptr, &dst, 1.0, 0.0);
34 |
35 | // run test
36 | gemm.run();
37 | std::chrono::high_resolution_clock::time_point begin = std::chrono::high_resolution_clock::now();
38 |
39 | for (int i = 0; i < times; i++) {
40 | gemm.run();
41 | }
42 | CLScheduler::get().sync();
43 |
44 | std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
45 |
46 | // calcuate gflops
47 | std::chrono::duration fp_ms = end - begin;
48 | double cost = fp_ms.count() / times;
49 | return std::make_pair(cost, 2.0 * n * l * m / (1e9) / cost);
50 | }
51 |
52 | int main(int argc, const char **argv)
53 | {
54 | size_t to_test[][3] = {
55 | {1024, 1024, 1024},
56 | {2048, 2048, 2048},
57 | };
58 |
59 | for (size_t i = 0; i < sizeof(to_test) / sizeof(to_test[0]); i++) {
60 | int n, l, m;
61 | double cost, gflops;
62 | n = to_test[i][0];
63 | l = to_test[i][1];
64 | m = to_test[i][2];
65 |
66 | std::tie(cost, gflops) = MeasureGemm(n, l, m, "float");
67 |
68 | std::cout << std::fixed << std::setprecision(4);
69 | std::cout << "size: " << i << ", " << "cost: " << cost << ", "
70 | << "GFLOPS: " << gflops << std::endl;
71 | }
72 |
73 | return 0;
74 | }
75 |
76 |
--------------------------------------------------------------------------------
/layer-test/test_conv2d.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 |
4 | import numpy as np
5 | import tvm
6 | import topi
7 | from tvm.contrib import rpc, util
8 | from topi.util import get_const_tuple
9 | import topi.testing
10 | from tvm.contrib.pickle_memoize import memoize
11 |
12 | dtype = 'float32'
13 |
14 | def convert_to_remote(func, remote):
15 | temp = util.tempdir()
16 | prefix = str(np.random.randint(1 << 31)) + "_"
17 | path_dso = temp.relpath(prefix + "tmp_func.tar")
18 | func.export_library(path_dso)
19 |
20 | remote.upload(path_dso)
21 | func = remote.load_module(prefix + "tmp_func.tar")
22 | return func
23 |
24 |
25 | def generate_tune_packs(item_list):
26 | ret = []
27 |
28 | now = {}
29 | def dfs(depth):
30 | if depth == len(item_list):
31 | ret.append(now.copy())
32 | return
33 |
34 | name = item_list[depth][0]
35 | for value in item_list[depth][1]:
36 | now[name] = value
37 | dfs(depth + 1)
38 |
39 | dfs(0)
40 |
41 | return ret
42 |
43 | USE_MANUAL_CODE = False
44 | @tvm.register_func
45 | def tvm_callback_opencl_postproc(code):
46 | if not os.path.exists("perf"):
47 | os.mkdir("perf")
48 | with open("generated.cl", 'w') as fout:
49 | fout.write(code)
50 | if USE_MANUAL_CODE:
51 | split = code.split("\n")
52 | code = '\n'.join(split)
53 | return code
54 |
55 |
56 |
57 | def tune_conv2d_nchw(batch, in_size, in_channel, num_filter, kernel, padding, stride, ctx,
58 | n_times=1, target_host=None, remote=None):
59 | in_height = in_width = in_size
60 |
61 | A = tvm.placeholder((batch, in_channel, in_height, in_width), name='data')
62 | W = tvm.placeholder((num_filter, in_channel, kernel, kernel), name='weight')
63 |
64 | # get verify data
65 | a_shape = get_const_tuple(A.shape)
66 | w_shape = get_const_tuple(W.shape)
67 | dtype = A.dtype
68 |
69 | @memoize("topi.tests.test_topi_conv2d.verify_conv2d")
70 | def get_ref_data():
71 | a_np = np.random.uniform(size=a_shape).astype(dtype)
72 | w_np = np.random.uniform(size=w_shape).astype(dtype)
73 | b_np = topi.testing.conv2d_nchw_python(a_np, w_np, stride, padding)
74 | return a_np, w_np, b_np
75 |
76 | a_np, w_np, b_np = get_ref_data()
77 | a = tvm.nd.array(a_np, ctx)
78 | w = tvm.nd.array(w_np, ctx)
79 | b = tvm.nd.array(np.zeros(b_np.shape, dtype=dtype), ctx)
80 |
81 | # generate static config
82 | #tune_pack = generate_tune_packs([
83 | # ["bn", [4]],
84 | # ["num_thread", [1, 2, 4, 8, 16]],
85 | # ["unroll_step", [1, 4, 16]],
86 | # ])
87 |
88 | tune_pack = generate_tune_packs([
89 | ["VH", [1]],
90 | ["VW", [1, 7]],
91 | ["VC", [1, 2, 4, 8, 16]],
92 | ["num_thread", [1, 2, 4, 8, 16, 32]],
93 | ])
94 |
95 | # search
96 | best_cost = 1e9
97 | best_config = None
98 | for config in reversed(tune_pack):
99 | with tvm.target.mali():
100 | tvm.target.current_target().tune_config = config
101 | B = topi.nn.conv2d(A, W, stride, padding)
102 | s = topi.generic.schedule_conv2d_nchw([B])
103 | func = tvm.build(s, [A, W, B], target_host=target_host)
104 |
105 | if remote is not None:
106 | func = convert_to_remote(func, remote)
107 |
108 | time_f = func.time_evaluator(func.entry_name, ctx, number=n_times)
109 | cost = time_f(a, w, b).mean
110 |
111 | try:
112 | np.testing.assert_allclose(b.asnumpy(), b_np, rtol=1e-4)
113 | except Exception as e:
114 | pass
115 |
116 | gflops = 2.0 * np.prod(b.shape) * kernel * kernel * in_channel /(1e9)/ cost
117 | print(config, cost, gflops)
118 | if cost < best_cost:
119 | best_cost = cost
120 | best_config = config
121 |
122 | return best_cost, 2.0 * np.prod(b.shape) * kernel * kernel * in_channel /(1e9)/ best_cost, best_config
123 |
124 |
125 | def verify_conv2d_nchw(batch, in_size, in_channel, num_filter, kernel, padding, stride, ctx,
126 | n_times=1, target=None, target_host=None, remote=None):
127 | in_height = in_width = in_size
128 |
129 | A = tvm.placeholder((batch, in_channel, in_height, in_width), dtype=dtype, name='data')
130 | W = tvm.placeholder((num_filter, in_channel, kernel, kernel), dtype=dtype, name='weight')
131 |
132 | with target:
133 | B = topi.nn.conv2d(A, W, stride, padding)
134 | s = topi.generic.schedule_conv2d_nchw([B])
135 | func = tvm.build(s, [A, W, B], target_host=target_host)
136 | #print(func.imported_modules[0].get_source())
137 |
138 | a_shape = get_const_tuple(A.shape)
139 | w_shape = get_const_tuple(W.shape)
140 |
141 | @memoize("topi.tests.test_topi_conv2d.verify_conv2d")
142 | def get_ref_data():
143 | a_np = np.random.uniform(size=a_shape)
144 | w_np = np.random.uniform(size=w_shape)
145 | b_np = topi.testing.conv2d_nchw_python(a_np, w_np, stride, padding)
146 | return a_np, w_np, b_np
147 |
148 | a_np, w_np, b_np = get_ref_data()
149 | a = tvm.nd.array(a_np.astype(dtype), ctx)
150 | w = tvm.nd.array(w_np.astype(dtype), ctx)
151 | b = tvm.nd.array(np.zeros(get_const_tuple(B.shape)).astype(B.dtype), ctx)
152 |
153 | if remote is not None:
154 | func = convert_to_remote(func, remote)
155 |
156 | time_f = func.time_evaluator(func.entry_name, ctx, number=n_times)
157 | cost = time_f(a, w, b).mean
158 |
159 | try:
160 | if dtype == 'float32':
161 | np.testing.assert_allclose(b.asnumpy(), b_np, rtol=1e-4)
162 | elif dtype == 'float16':
163 | np.testing.assert_allclose(b.asnumpy(), b_np, rtol=0.2)
164 | else:
165 | raise NotImplementedError
166 | except Exception as e:
167 | print(e)
168 |
169 | return cost, 2.0 * np.prod(b.shape) * kernel * kernel * in_channel / (1e9) / cost
170 |
171 | workloads = [
172 | # vgg16
173 | # (1, 224, 3, 64, 3, 1, 1),
174 | # (1, 224, 64, 64, 3, 1, 1),
175 | # (1, 112, 64, 128, 3, 1, 1),
176 | # (1, 112,128, 128, 3, 1, 1),
177 | # (1, 56, 128, 256, 3, 1, 1),
178 | # (1, 56, 256, 256, 3, 1, 1),
179 | # (1, 28, 256, 512, 3, 1, 1),
180 | # (1, 28, 512, 512, 3, 1, 1),
181 | # (1, 14, 512, 512, 3, 1, 1),
182 | # (1, 7, 512, 512, 3, 1, 1),
183 | #
184 | # # resnet-18
185 | # (1, 224, 3, 64, 7, 3, 2),
186 | # (1, 56, 64, 64, 3, 1, 1),
187 | # (1, 56, 64, 64, 1, 0, 1),
188 | # (1, 56, 64, 128, 3, 1, 2),
189 | # (1, 56, 64, 128, 1, 0, 2),
190 | (1, 28, 128, 128, 3, 1, 1),
191 | # (1, 28, 128, 256, 3, 1, 2),
192 | # (1, 28, 128, 256, 1, 0, 2),
193 | # (1, 14, 256, 256, 3, 1, 1),
194 | # (1, 14, 256, 512, 3, 1, 2),
195 | # (1, 14, 256, 512, 1, 0, 2),
196 | # (1, 7, 512, 512, 3, 1, 1),
197 | #
198 | # mobilenet
199 | # (1, 224, 3, 32, 3, 1, 2),
200 | # (1, 112, 32, 64, 1, 0, 1),
201 | # (1, 56, 64, 128, 1, 0, 1),
202 | # (1, 56, 128, 128, 1, 0, 1),
203 | # (1, 28, 128, 256, 1, 0, 1),
204 | # (1, 28, 256, 256, 1, 0, 1),
205 | # (1, 14, 256, 512, 1, 0, 1),
206 | # (1, 14, 512, 512, 1, 0, 1),
207 | # (1, 7, 512, 1024, 1, 0, 1),
208 | # (1, 7, 1024, 1024, 1, 0, 1),
209 | ]
210 |
211 | def verify_workloads(ctx, n_times=1, target=None, target_host=None, remote=None):
212 | for item in workloads:
213 | cost, gflops = verify_conv2d_nchw(*item, ctx=ctx, target=target,
214 | target_host=target_host, remote=remote)
215 | print("%-30s %.6f %.6f" % (item, cost, gflops))
216 |
217 | #def tune_workloads(ctx, n_times=1, target=None, target_host=None, remote=None):
218 | # ret = []
219 | # for item in workloads:
220 | # cost, gflops, config = tune_conv2d_nchw(*item, ctx=ctx, target_host=target_host, remote=remote)
221 | # print(item, cost, gflops, config)
222 | # ret.append([item, config])
223 | # for item in ret:
224 | # print(item, config)
225 |
226 | if __name__ == "__main__":
227 | host = os.environ["TVM_OPENCL_DEVICE_HOST"]
228 | port = 9090
229 | remote = rpc.connect(host, port)
230 | target_host = "llvm -target=aarch64-linux-gnu -mattr=+neon"
231 |
232 | verify_workloads(remote.cl(), 10, tvm.target.mali(), target_host, remote)
233 |
234 |
--------------------------------------------------------------------------------
/layer-test/test_dense.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 |
4 | import numpy as np
5 | import tvm
6 | import topi
7 | from tvm.contrib import rpc, util
8 | from topi.util import get_const_tuple
9 | from tvm.contrib.pickle_memoize import memoize
10 |
11 | dtype = 'float16'
12 |
13 | USE_MANUAL_CODE = False
14 | @tvm.register_func
15 | def tvm_callback_opencl_postproc(code):
16 | if not os.path.exists("perf"):
17 | os.mkdir("perf")
18 | with open("generated.cl", 'w') as fout:
19 | fout.write(code)
20 | if USE_MANUAL_CODE:
21 | with open("manual.cl") as fin:
22 | code = "\n".join(fin.readlines())
23 | print(code)
24 | return code
25 |
26 |
27 | def convert_to_remote(func, remote):
28 | temp = util.tempdir()
29 | prefix = str(np.random.randint(1 << 31)) + "_"
30 | path_dso = temp.relpath(prefix + "tmp_func.tar")
31 | func.export_library(path_dso)
32 |
33 | remote.upload(path_dso)
34 | func = remote.load_module(prefix + "tmp_func.tar")
35 | return func
36 |
37 |
38 | def generate_tune_packs(item_list):
39 | ret = []
40 |
41 | now = {}
42 | def dfs(depth):
43 | if depth == len(item_list):
44 | ret.append(now.copy())
45 | return
46 |
47 | name = item_list[depth][0]
48 | for value in item_list[depth][1]:
49 | now[name] = value
50 | dfs(depth + 1)
51 |
52 | dfs(0)
53 |
54 | return ret
55 |
56 |
57 | def tune_dense(batch, hidden, out, ctx,
58 | n_times=1, target_host=None, remote=None):
59 | A = tvm.placeholder((1, hidden), dtype=dtype, name='A')
60 | B = tvm.placeholder((out, hidden), dtype=dtype, name='B')
61 | BIAS = tvm.placeholder((out,), dtype=dtype, name='bias')
62 |
63 | # generate static config
64 | tune_pack = generate_tune_packs([
65 | # ["bn", [1, 2, 4, 8, 16]],
66 | # ["reuse", [1, 2, 4, 8]],
67 | ["num_thread", [1, 2, 4, 32, 64, 256]],
68 | ["unroll_step", [1, 2, 4, 5, 6, 16, 32]],
69 | ])
70 |
71 | a_shape = get_const_tuple(A.shape)
72 | b_shape = get_const_tuple(B.shape)
73 | bias_shape = get_const_tuple(BIAS.shape)
74 | c_shape = (1, out)
75 |
76 | a_np = np.random.uniform(size=a_shape)
77 | b_np = np.random.uniform(size=b_shape)
78 | bias_np = np.random.uniform(size=bias_shape)
79 | c_np = np.random.uniform(size=c_shape)
80 |
81 | # search
82 | tic = time.time()
83 | best_cost = 1e9
84 | best_config = None
85 | for i, config in enumerate(tune_pack):
86 | with tvm.target.mali():
87 | tvm.target.current_target().tune_config = config
88 | C = topi.nn.dense(A, B, BIAS)
89 | s = topi.generic.schedule_dense([C])
90 | func = tvm.build(s, [A, B, BIAS, C], target_host=target_host)
91 |
92 | a = tvm.nd.array(a_np.astype(dtype), ctx=ctx)
93 | b = tvm.nd.array(b_np.astype(dtype), ctx=ctx)
94 | bias = tvm.nd.array(bias_np.astype(dtype), ctx=ctx)
95 | c = tvm.nd.array(c_np.astype(dtype), ctx=ctx)
96 |
97 | if remote is not None:
98 | func = convert_to_remote(func, remote)
99 |
100 | time_f = func.time_evaluator(func.entry_name, ctx, number=n_times)
101 | cost = time_f(a, b, bias, c).mean
102 |
103 | gflops = 2.0 * np.prod(b.shape) / (1e9) / cost
104 | if cost < best_cost:
105 | print(config, cost, gflops)
106 | best_cost = cost
107 | best_config = config
108 |
109 | if i % 20 == 0:
110 | print(i, len(tune_pack), time.time()- tic, (time.time() - tic) / (i+1))
111 |
112 | try:
113 | np.testing.assert_allclose(np.dot(a_np, b_np.T) + bias_np, c.asnumpy(), rtol=1e-2)
114 | except Exception as e:
115 | pass
116 | print(e)
117 |
118 | return best_cost, 2.0 * np.prod(b.shape) / (1e9) / best_cost, best_config
119 |
120 | def verify_dense(batch, hidden, out, ctx,
121 | n_times=1, target_host=None, remote=None):
122 | A = tvm.placeholder((1, hidden), dtype=dtype, name='A')
123 | B = tvm.placeholder((out, hidden), dtype=dtype, name='B')
124 | bias = tvm.placeholder((out,), dtype=dtype, name='bias')
125 |
126 | with tvm.target.mali():
127 | C = topi.nn.dense(A, B, bias)
128 | s = topi.generic.schedule_dense([C])
129 | func = tvm.build(s, [A, B, bias, C], target_host=target_host)
130 |
131 | a_shape = get_const_tuple(A.shape)
132 | b_shape = get_const_tuple(B.shape)
133 | bias_shape = get_const_tuple(bias.shape)
134 | c_shape = get_const_tuple(C.shape)
135 |
136 | a_np = np.random.uniform(size=a_shape)
137 | b_np = np.random.uniform(size=b_shape)
138 | bias_np = np.random.uniform(size=bias_shape)
139 | c_np = np.random.uniform(size=c_shape)
140 |
141 | a = tvm.nd.array(a_np.astype(dtype), ctx=ctx)
142 | b = tvm.nd.array(b_np.astype(dtype), ctx=ctx)
143 | bias = tvm.nd.array(bias_np.astype(dtype), ctx=ctx)
144 | c = tvm.nd.array(np.zeros_like(c_np).astype(dtype), ctx=ctx)
145 |
146 | if remote is not None:
147 | func = convert_to_remote(func, remote)
148 |
149 | time_f = func.time_evaluator(func.entry_name, ctx, number=n_times)
150 | cost = time_f(a, b, bias, c).mean
151 |
152 | try:
153 | np.testing.assert_allclose(np.dot(a_np, b_np.T) + bias_np, c.asnumpy(), rtol=1e-1)
154 | except Exception as e:
155 | pass
156 | print(e)
157 |
158 | return cost, 2.0 * np.prod(b.shape) / (1e9) / cost
159 |
160 | workloads = [
161 | (1, 25088, 4096),
162 | # (1, 4096, 4096),
163 | # (1, 4096, 1000),
164 | # (1, 1024, 1000),
165 | ]
166 |
167 | def verify_workloads(ctx, n_times=1, target_host=None, remote=None):
168 | for item in workloads:
169 | cost, gflops = verify_dense(*item, ctx=ctx, target_host=target_host, remote=remote)
170 | print("%-30s %.6f %.6f" % (item, cost, gflops))
171 |
172 | def tune_workloads(ctx, n_times=1, target_host=None, remote=None):
173 | for item in workloads:
174 | cost, gflops, config = tune_dense(*item, ctx=ctx, target_host=target_host, remote=remote)
175 | print(item, cost, gflops, config)
176 |
177 | if __name__ == "__main__":
178 | host = os.environ["TVM_OPENCL_DEVICE_HOST"]
179 | port = 9090
180 | remote = rpc.connect(host, port)
181 | target_host = "llvm -target=aarch64-linux-gnu"
182 |
183 | verify_workloads(remote.cl(), 10, target_host, remote)
184 |
185 |
--------------------------------------------------------------------------------
/layer-test/test_depth.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import numpy as np
4 | import tvm
5 | import topi
6 | from tvm.contrib import rpc, util
7 | from topi.util import get_const_tuple
8 | from tvm.contrib.pickle_memoize import memoize
9 |
10 | dtype = 'float32'
11 |
12 | def convert_to_remote(func, remote):
13 | temp = util.tempdir()
14 | prefix = str(np.random.randint(1 << 31)) + "_"
15 | path_dso = temp.relpath(prefix + "tmp_func.tar")
16 | func.export_library(path_dso)
17 |
18 | remote.upload(path_dso)
19 | func = remote.load_module(prefix + "tmp_func.tar")
20 | return func
21 |
22 |
23 | def generate_tune_packs(item_list):
24 | ret = []
25 |
26 | now = {}
27 | def dfs(depth):
28 | if depth == len(item_list):
29 | ret.append(now.copy())
30 | return
31 |
32 | name = item_list[depth][0]
33 | for value in item_list[depth][1]:
34 | now[name] = value
35 | dfs(depth + 1)
36 |
37 | dfs(0)
38 |
39 | return ret
40 |
41 | USE_MANUAL_CODE = False
42 | @tvm.register_func
43 | def tvm_callback_opencl_postproc(code):
44 | if not os.path.exists("perf"):
45 | os.mkdir("perf")
46 | with open("generated.cl", 'w') as fout:
47 | fout.write(code)
48 | if USE_MANUAL_CODE:
49 | split = code.split("\n")
50 | code = '\n'.join(split)
51 | return code
52 |
53 |
54 |
55 | def tune_conv2d_nchw(batch, in_size, in_channel, num_filter, kernel, padding, stride, ctx,
56 | n_times=1, target_host=None, remote=None):
57 | in_height = in_width = in_size
58 |
59 | A = tvm.placeholder((batch, in_channel, in_height, in_width), dtype=dtype, name='data')
60 | W = tvm.placeholder((num_filter, in_channel, kernel, kernel), dtype=dtype, name='weight')
61 |
62 | # get verify data
63 | a_shape = get_const_tuple(A.shape)
64 | w_shape = get_const_tuple(W.shape)
65 |
66 | @memoize("topi.tests.test_topi_conv2d.verify_conv2d")
67 | def get_ref_data():
68 | a_np = np.random.uniform(size=a_shape)
69 | w_np = np.random.uniform(size=w_shape)
70 | b_np = topi.testing.conv2d_nchw_python(a_np, w_np, stride, padding)
71 | return a_np, w_np, b_np
72 |
73 | a_np, w_np, b_np = get_ref_data()
74 | a = tvm.nd.array(a_np.astype(dtype), ctx)
75 | w = tvm.nd.array(w_np.astype(dtype), ctx)
76 | b = tvm.nd.array(np.zeros(b_np.shape).astype(dtype), ctx)
77 |
78 | # generate static config
79 | #tune_pack = generate_tune_packs([
80 | # ["bn", [4]],
81 | # ["num_thread", [1, 2, 4, 8, 16]],
82 | # ["unroll_step", [1, 4, 16]],
83 | # ])
84 |
85 | tune_pack = generate_tune_packs([
86 | ["VH", [1, 2, 4]],
87 | ["VW", [1, 2, 4, 8]],
88 | ["VC", [1, 2, 4, 8]],
89 | ["num_thread", [1, 2, 4, 16, 32, 64]],
90 | ])
91 |
92 | # search
93 | best_cost = 1e9
94 | best_config = None
95 | for config in reversed(tune_pack):
96 | with tvm.target.mali():
97 | tvm.target.current_target().tune_config = config
98 | B = topi.nn.conv2d(A, W, stride, padding)
99 | s = topi.generic.schedule_conv2d_nchw([B])
100 | func = tvm.build(s, [A, W, B], target_host=target_host)
101 |
102 | if remote is not None:
103 | func = convert_to_remote(func, remote)
104 |
105 | time_f = func.time_evaluator(func.entry_name, ctx, number=n_times)
106 | cost = time_f(a, w, b).mean
107 |
108 | try:
109 | np.testing.assert_allclose(b.asnumpy(), b_np, rtol=1e-4)
110 | except Exception as e:
111 | pass
112 |
113 | gflops = 2.0 * np.prod(b.shape) * kernel * kernel * in_channel /(1e9)/ cost
114 | print(config, cost, gflops)
115 | if cost < best_cost:
116 | best_cost = cost
117 | best_config = config
118 |
119 | return best_cost, 2.0 * np.prod(b.shape) * kernel * kernel * in_channel /(1e9)/ best_cost, best_config
120 |
121 |
122 | def verify_conv2d_nchw(batch, in_size, in_channel, channel_multiplier, kernel, padding, stride, ctx,
123 | n_times=1, target_host=None, remote=None):
124 | in_height = in_width = in_size
125 |
126 | A = tvm.placeholder((batch, in_channel, in_height, in_width), dtype=dtype, name='data')
127 | W = tvm.placeholder((in_channel, channel_multiplier, kernel, kernel), dtype=dtype, name='weight')
128 |
129 | with tvm.target.mali():
130 | B = topi.nn.depthwise_conv2d_nchw(A, W, stride, padding)
131 | #B = topi.nn.relu(B)
132 | s = topi.generic.schedule_depthwise_conv2d_nchw([B])
133 | func = tvm.build(s, [A, W, B], target_host=target_host)
134 |
135 | a_shape = get_const_tuple(A.shape)
136 | w_shape = get_const_tuple(W.shape)
137 |
138 | @memoize("topi.tests.test_topi_depthconv.verify_depthconv")
139 | def get_ref_data():
140 | a_np = np.random.uniform(size=a_shape).astype('float32')
141 | w_np = np.random.uniform(size=w_shape).astype('float32')
142 | b_np = topi.testing.depthwise_conv2d_python_nchw(a_np, w_np, stride, padding)
143 | return a_np, w_np, b_np
144 |
145 | a_np, w_np, b_np = get_ref_data()
146 | a = tvm.nd.array(a_np.astype(dtype), ctx)
147 | w = tvm.nd.array(w_np.astype(dtype), ctx)
148 | b = tvm.nd.array(np.zeros(get_const_tuple(B.shape)).astype(dtype), ctx)
149 |
150 | if remote is not None:
151 | func = convert_to_remote(func, remote)
152 |
153 | time_f = func.time_evaluator(func.entry_name, ctx, number=n_times)
154 | cost = time_f(a, w, b).mean
155 |
156 | try:
157 | np.testing.assert_allclose(b.asnumpy(), b_np, rtol=1e-1)
158 | except Exception as e:
159 | print(e)
160 |
161 | return cost, 2.0 * np.prod(b.shape) * kernel * kernel / (1e9) / cost
162 |
163 | workloads = [
164 | # mobilenet
165 | (1, 112, 32, 1, 3, 1, 1),
166 | (1, 112, 64, 1, 3, 1, 2),
167 | (1, 56, 128, 1, 3, 1, 1),
168 | (1, 56, 128, 1, 3, 1, 2),
169 | (1, 28, 256, 1, 3, 1, 1),
170 | (1, 28, 256, 1, 3, 1, 2),
171 | (1, 14, 512, 1, 3, 1, 1),
172 | (1, 14, 512, 1, 3, 1, 2),
173 | (1, 7, 1024, 1, 3, 1, 1),
174 | ]
175 |
176 | def verify_workloads(ctx, n_times=1, target_host=None, remote=None):
177 | for item in workloads:
178 | cost, gflops = verify_conv2d_nchw(*item, ctx=ctx, target_host=target_host, remote=remote)
179 | print("%-30s %.6f %.6f" % (item, cost, gflops))
180 |
181 | def tune_workloads(ctx, n_times=1, target_host=None, remote=None):
182 | for item in workloads:
183 | cost, gflops, config = tune_conv2d_nchw(*item, ctx=ctx, target_host=target_host, remote=remote)
184 | print(item, cost, gflops, config)
185 |
186 | if __name__ == "__main__":
187 | host = os.environ["TVM_OPENCL_DEVICE_HOST"]
188 | port = 9090
189 | remote = rpc.connect(host, port)
190 | target_host = "llvm -target=aarch64-linux-gnu -mattr=+neon"
191 |
192 | verify_workloads(remote.cl(), 10000, target_host, remote)
193 |
194 |
--------------------------------------------------------------------------------
/layer-test/util.cc:
--------------------------------------------------------------------------------
1 | #include "util.h"
2 |
3 | // read data from CLTensor
4 | void ReadTensor(const CLTensor *tensor, void *to, size_t size) {
5 | cl::CommandQueue &queue = CLScheduler::get().queue();
6 | queue.enqueueReadBuffer(tensor->cl_buffer(), true, 0, size, to);
7 | queue.finish();
8 | }
9 |
10 | // write data to CLTensor
11 | void WriteTensor(CLTensor *tensor, const void *from, size_t size) {
12 | cl::CommandQueue &queue = CLScheduler::get().queue();
13 | queue.enqueueWriteBuffer(tensor->cl_buffer(), true, 0, size, from);
14 | queue.finish();
15 | }
16 |
17 | // transform dtype to format in arm compute
18 | Format DtypeToFormat(std::string dtype) {
19 | if (dtype == "float" || dtype == "float32")
20 | return Format::F32;
21 | else if (dtype == "float16")
22 | return Format::F16;
23 | else {
24 | std::cerr << "Unsupported type: " << dtype << std::endl;
25 | exit(-1);
26 | }
27 | }
28 |
--------------------------------------------------------------------------------
/layer-test/util.h:
--------------------------------------------------------------------------------
1 | #ifndef ARM_COMPUTE_UTIL_H_
2 | #define ARM_COMPUTE_UTIL_H_
3 |
4 | #include "arm_compute/runtime/CL/CLFunctions.h"
5 | #include "arm_compute/core/Types.h"
6 | #include
7 |
8 | using namespace arm_compute;
9 |
10 | // read data from CLTensor
11 | void ReadTensor(CLTensor &tensor, void *to, size_t size);
12 |
13 | // write data to CLTensor
14 | void WriteTensor(CLTensor &tensor, void *from, size_t size);
15 |
16 | // transform dtype to format in arm compute
17 | Format DtypeToFormat(std::string);
18 |
19 | #endif // ARM_COMPUTE_UTIL_H_
20 |
21 |
--------------------------------------------------------------------------------
/mali_imagenet_bench.py:
--------------------------------------------------------------------------------
1 | """
2 | Benchmark inference speed on ImageNet
3 | Example (run on Firefly RK3399):
4 | python mali_imagenet_bench.py --target-host 'llvm -target=aarch64-linux-gnu' --host 192.168.0.100 --port 9090 --model mobilenet
5 | """
6 |
7 | import time
8 | import argparse
9 | import numpy as np
10 | import tvm
11 | import nnvm.compiler
12 | import nnvm.testing
13 | from tvm.contrib import util, rpc
14 | from tvm.contrib import graph_runtime as runtime
15 |
16 | def run_case(model, dtype):
17 | # load model
18 | if model == 'vgg16':
19 | net, params = nnvm.testing.vgg.get_workload(num_layers=16,
20 | batch_size=1, image_shape=image_shape, dtype=dtype)
21 | elif model == 'resnet18':
22 | net, params = nnvm.testing.resnet.get_workload(num_layers=18,
23 | batch_size=1, image_shape=image_shape, dtype=dtype)
24 | elif model == 'mobilenet':
25 | net, params = nnvm.testing.mobilenet.get_workload(
26 | batch_size=1, image_shape=image_shape, dtype=dtype)
27 | else:
28 | raise ValueError('no benchmark prepared for {}.'.format(model))
29 |
30 | # compile
31 | opt_level = 2 if dtype == 'float32' else 1
32 | with nnvm.compiler.build_config(opt_level=opt_level):
33 | graph, lib, params = nnvm.compiler.build(
34 | net, tvm.target.mali(), shape={"data": data_shape}, params=params,
35 | dtype=dtype, target_host=args.target_host)
36 |
37 | # upload model to remote device
38 | tmp = util.tempdir()
39 | lib_fname = tmp.relpath('net.tar')
40 | lib.export_library(lib_fname)
41 |
42 | if args.host is not None:
43 | remote = rpc.connect(args.host, args.port)
44 | remote.upload(lib_fname)
45 |
46 | ctx = remote.cl(0)
47 | rlib = remote.load_module('net.tar')
48 | rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}
49 | else:
50 | ctx = tvm.cl(0)
51 | rlib = lib
52 | rparams = params
53 |
54 | # create graph runtime
55 | module = runtime.create(graph, rlib, ctx)
56 | module.set_input('data', tvm.nd.array(np.random.uniform(size=(data_shape)).astype(dtype)))
57 | module.set_input(**rparams)
58 |
59 | # benchmark
60 | # print("============================================================")
61 | # print("model: %s, dtype: %s" % (model, dtype))
62 |
63 | # the num of runs for warm up and test
64 | num_warmup = 10
65 | num_test = 60
66 | if model == 'mobilenet': # mobilenet is fast, need more runs for stable measureament
67 | num_warmup *= 5
68 | num_test *= 5
69 |
70 | # perform some warm up runs
71 | # print("warm up..")
72 | warm_up_timer = module.module.time_evaluator("run", ctx, num_warmup)
73 | warm_up_timer()
74 |
75 | # test
76 | # print("test..")
77 | ftimer = module.module.time_evaluator("run", ctx, num_test)
78 | prof_res = ftimer()
79 | # print("cost per image: %.4fs" % prof_res.mean)
80 |
81 | print("backend: TVM-mali\tmodel: %s\tdtype: %s\tcost:%.4f" % (model, dtype, prof_res.mean))
82 |
83 | if __name__ == '__main__':
84 | parser = argparse.ArgumentParser()
85 | parser.add_argument('--model', type=str, required=True, choices=['vgg16', 'resnet18', 'mobilenet', 'all'],
86 | help="The model type.")
87 | parser.add_argument('--dtype', type=str, default='float32', choices=['float16', 'float32'])
88 | parser.add_argument('--host', type=str, help="The host address of your arm device.", default=None)
89 | parser.add_argument('--port', type=int, help="The port number of your arm device", default=None)
90 | parser.add_argument('--target-host', type=str, help="The compilation target of host device.", default=None)
91 | args = parser.parse_args()
92 |
93 | # set parameter
94 | batch_size = 1
95 | num_classes = 1000
96 | image_shape = (3, 224, 224)
97 |
98 | # load model
99 | data_shape = (batch_size,) + image_shape
100 | out_shape = (batch_size, num_classes)
101 |
102 | if args.model == 'all': # test all
103 | for model in ['vgg16', 'resnet18', 'mobilenet']:
104 | for dtype in ['float32', 'float16']:
105 | run_case(model, dtype)
106 | time.sleep(10)
107 |
108 | else: # test single
109 | run_case(args.model, args.dtype)
110 |
111 |
--------------------------------------------------------------------------------
/mxnet_test.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import sys
4 | import argparse
5 |
6 | import numpy as np
7 |
8 | import mxnet as mx
9 | from mxnet import gluon
10 | from mxnet.gluon.model_zoo.vision import get_model
11 | from mxnet.gluon.utils import download
12 |
13 | input_size = 224
14 |
15 | def test_module(model, dtype):
16 | assert dtype == 'float32'
17 |
18 | if model == 'vgg16':
19 | model_block = mx.gluon.model_zoo.vision.get_vgg(16, pretrained=False)
20 | elif model == 'mobilenet':
21 | model_block = mx.gluon.model_zoo.vision.get_mobilenet(1.0, pretrained=False)
22 | elif model == 'resnet18':
23 | model_block = mx.gluon.model_zoo.vision.get_resnet(version=1, num_layers=18, pretrained=False)
24 | else:
25 | raise RuntimeError("invalid model model " + model)
26 | model_block.collect_params().initialize(mx.init.Xavier())
27 |
28 | # define input and test function
29 | x = mx.nd.array(np.zeros((1, 3, input_size, input_size)))
30 | def measure(n_time):
31 | out = model_block(x).asnumpy()
32 | tic = time.time()
33 | for i in range(n_time):
34 | out = model_block(x).asnumpy()
35 | cost = time.time() - tic
36 | return cost / n_time
37 |
38 | # benchmark
39 | # print("============================================================")
40 | # print("model: %s, dtype: %s" % (model, dtype))
41 |
42 | num_warmup = 15
43 | num_test = 80
44 | if model == 'mobilenet': # mobilenet is fast, need more runs for stable measureament
45 | num_warmup *= 4
46 | num_test *= 4
47 |
48 | # warm up
49 | # print("warm up...")
50 | measure(num_warmup)
51 |
52 | # print("test..")
53 | cost = measure(num_test)
54 | # print("cost per image: %.4fs" % cost)
55 |
56 | print("backend: MXNet+OpenBLAS\tmodel: %s\tdtype: %s\tcost:%.4f" % (model, dtype, cost))
57 |
58 | if __name__ == "__main__":
59 | parser = argparse.ArgumentParser()
60 | parser.add_argument('--model', type=str, required=True, choices=['vgg16', 'mobilenet', 'resnet18', 'all'])
61 | args = parser.parse_args()
62 |
63 | if args.model == 'all':
64 | for model in ['resnet18', 'mobilenet', 'vgg16']:
65 | test_module(model, 'float32')
66 | time.sleep(20)
67 | else:
68 | test_module(args.model, 'float32')
69 |
70 |
--------------------------------------------------------------------------------
/results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/merrymercy/tvm-mali/70f57626b45b484edcb9ef03e01e5ff3d63296ed/results.png
--------------------------------------------------------------------------------
/run_test.sh:
--------------------------------------------------------------------------------
1 | sudo /etc/init.d/lightdm stop
2 | sudo echo performance > /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor
3 |
4 | export PYTHONPATH=$(pwd)/nnvm/python:$(pwd)/nnvm/tvm/python:$(pwd)/nnvm/tvm/topi/python:$(pwd)/incubator-mxnet/python
5 |
6 | python mxnet_test.py --model all
7 | python mali_imagenet_bench.py --model all
8 | LD_LIBRARY_PATH=ComputeLibrary/build ./acl_test all
9 |
10 |
--------------------------------------------------------------------------------
/spatial.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/merrymercy/tvm-mali/70f57626b45b484edcb9ef03e01e5ff3d63296ed/spatial.png
--------------------------------------------------------------------------------