├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── data ├── data.sh ├── myplot.m ├── res_0.txt ├── res_1.txt ├── res_10.txt ├── res_11.txt ├── res_12.txt ├── res_13.txt ├── res_14.txt ├── res_15.txt ├── res_16.txt ├── res_17.txt ├── res_18.txt ├── res_2.txt ├── res_3.txt ├── res_4.txt ├── res_5.txt ├── res_6.txt ├── res_7.txt ├── res_8.txt └── res_9.txt ├── figures ├── Kernel1.png ├── Kernel10.png ├── Kernel11.png ├── Kernel12.png ├── Kernel13.png ├── Kernel14.png ├── Kernel15.png ├── Kernel16.png ├── Kernel17.png ├── Kernel18.png ├── Kernel2.png ├── Kernel3.png ├── Kernel4.png ├── Kernel5.png ├── Kernel6.png ├── Kernel7.png ├── Kernel8.png └── Kernel9.png ├── include ├── kernel1.h ├── kernel10.h ├── kernel11.h ├── kernel12.h ├── kernel13.h ├── kernel14.h ├── kernel15.h ├── kernel16.h ├── kernel17.h ├── kernel18.h ├── kernel19.h ├── kernel2.h ├── kernel3.h ├── kernel4.h ├── kernel5.h ├── kernel6.h ├── kernel7.h ├── kernel8.h └── kernel9.h ├── kernels.h ├── run.sh ├── test.c ├── utils.c └── utils.h /.gitignore: -------------------------------------------------------------------------------- 1 | # Prerequisites 2 | *.d 3 | 4 | # Object files 5 | *.o 6 | *.ko 7 | *.obj 8 | *.elf 9 | 10 | # Linker output 11 | *.ilk 12 | *.map 13 | *.exp 14 | 15 | # Precompiled Headers 16 | *.gch 17 | *.pch 18 | 19 | # Libraries 20 | *.lib 21 | *.a 22 | *.la 23 | *.lo 24 | back 25 | dgemm_x86 26 | 27 | # Shared objects (inc. 
Windows DLLs) 28 | *.dll 29 | *.so 30 | *.so.* 31 | *.dylib 32 | 33 | # Executables 34 | *.exe 35 | *.out 36 | *.app 37 | *.i*86 38 | *.x86_64 39 | *.hex 40 | 41 | # Debug files 42 | *.dSYM/ 43 | *.su 44 | *.idb 45 | *.pdb 46 | 47 | # Kernel Module Compile Results 48 | *.mod* 49 | *.cmd 50 | .tmp_versions/ 51 | modules.order 52 | Module.symvers 53 | Mkfile.old 54 | dkms.conf 55 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | BINARY_NAME = dgemm_x86 2 | CC = gcc 3 | CFLAGS = -O0 -march=skylake-avx512 -w -lpthread -fopenmp 4 | MKLPATH = /opt/intel/mkl 5 | LDFLAGS = -L$(MKLPATH)/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl -DMKL_ILP64 -m64 6 | INCFLAGS = -I$(MKLPATH)/include 7 | 8 | 9 | SRC = $(wildcard *.c) 10 | build : $(BINARY_NAME) 11 | 12 | $(BINARY_NAME): $(SRC) 13 | $(CC) $(CFLAGS) $(LDFLAGS) $(INCFLAGS) $(SRC) -o $(BINARY_NAME) 14 | 15 | clean: 16 | rm $(BINARY_NAME) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # How to optimize DGEMM on x86 CPU platforms 2 | 3 | General matrix/matrix multiplication (GEMM) is a core routine of many popular algorithms. On modern computing platforms with a hierarchical memory architecture, it is typically possible to reach near-optimal performance for GEMM. For example, on most x86 CPUs, Intel MKL, as well as other well-known BLAS implementations including OpenBLAS and BLIS, can provide >90% of the peak performance for GEMM. On the GPU side, cuBLAS, provided by NVIDIA, can also provide near-optimal performance for GEMM.
Though optimizing a serial implementation of GEMM on x86 platforms is not a new topic, a tutorial that covers optimizing GEMM with AVX512 instructions is missing from the learning resources available online. Additionally, as the data width has grown from SSE4, AVX and AVX2 to AVX512, the gap between peak computational capability and memory bandwidth keeps widening. This in turn requires programmers to design more delicate prefetching schemes to hide the memory latency. Compared with existing tutorials, ours is the first that not only covers an implementation leveraging AVX512 instructions but also provides step-wise optimizations with prefetching strategies. The DGEMM implementation eventually reaches performance comparable to Intel MKL. 4 | 5 | # Q & A 6 | I warmly welcome questions and discussions through pull requests or my personal email at yujiazhai94@gmail.com 7 | 8 | # Hardware platforms and software configurations 9 | 10 | * We require a CPU with the CPU flag ```avx512f``` to run all test cases in this tutorial. This can be checked in a terminal with the command: ```lscpu | grep "avx512f"```. 11 | * The experimental data shown were collected on an Intel Xeon W-2255 CPU (2 AVX512 units, base frequency 3.7 GHz, turbo boost frequency running AVX512 instructions on a single core: 4.0 GHz). This workstation is equipped with 4X8GB=32GB DRAM at 2933 MHz. The theoretical peak performance on a single core is: 2(FMA)*2(AVX512 units)*512(data width)/64(bits of an fp64 number)*4 GHz = 128 GFLOPS. 12 | * We compiled the program with ```gcc 7.3.0``` under Ubuntu 18.04.5 LTS. 13 | * Intel MKL version: oneMKL 2021.1 beta. 14 | 15 | # How to run 16 | Just three steps. 17 | * First, modify the path of MKL in ```Makefile```. 18 | * Second, type in ```make``` to compile. A binary executable ```dgemm_x86``` will be generated.
19 | * Third, run the binary using ```./dgemm_x86 [kernel_number]```, where ```kernel_number``` selects the kernel to benchmark. ```0``` represents Intel MKL and ```1-19``` represent 19 kernels demonstrating the optimization strategies. Here ```kernel18``` is the best serial version while ```kernel19``` is the best parallel version. Both of them reach performance comparable to or faster than Intel MKL on the latest Intel CPUs. 20 | 21 | # Related good GEMM tutorials/materials on x86-64 CPUs 22 | * https://github.com/flame/how-to-optimize-gemm 23 | * http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html 24 | * https://github.com/wjc404/GEMM_AVX512F 25 | * https://github.com/xianyi/OpenBLAS/tree/develop/kernel/x86_64 26 | 27 | # Step-wise Optimizations 28 | 29 | Here we take the column-major implementation for DGEMM. 30 | 31 | ## Kernel 1 (naive version) 32 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel1.h) 33 | 34 | Kernel 1 is the most naive implementation of DGEMM. 35 | 36 | ## Kernel 2 (register re-use) 37 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel2.h) 38 | 39 | Observing the innermost loop in [kernel1](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel1.h), ```C(i,j)``` does not depend on the innermost loop index ```k```. Therefore, one can load it into a register before entering the k-loop to avoid unnecessary memory accesses. 40 | 41 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel1.png) 42 | 43 | ## Kernel 3 (2x2 register blocking) 44 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel3.h) 45 | 46 | Since matrix multiplication performs O(N^3) operations on O(N^2) data, one should re-use data at the register and cache levels to improve the performance. In [kernel2](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel2.h), we cached a C element in a register but failed to re-use the data of matrices A and B at the register level.
Now we update a whole 2x2 block of the C matrix by loading a 2x1 slice of A and a 1x2 slice of B, and then performing an outer-product (rank-1) update on the 2x2 C block. This significantly improves the performance compared to the previous step. 47 | 48 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel2.png) 49 | 50 | ## Kernel 4 (4x4 register blocking) 51 | 52 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel4.h) 53 | 54 | Now we leverage the register-blocking strategy more aggressively: we update a 4x4 block of C each time. The performance improves further. 55 | 56 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel3.png) 57 | 58 | ## Kernel 5 (Kernel 4 + AVX2) 59 | 60 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel5.h) 61 | 62 | Instead of processing data with scalar instructions, we now process data using Single Instruction Multiple Data (SIMD) instructions. An AVX2 instruction, whose data width is 256 bits, can process 4 fp64 elements at a time. 63 | 64 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel4.png) 65 | 66 | ## Kernel 6 (Kernel 5 + loop unrolling x 4) 67 | 68 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel6.h) 69 | 70 | We unroll the loop by a factor of 4. This step slightly improves the performance. 71 | 72 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel5.png) 73 | 74 | ## Kernel 7 (8x4 kernel + AVX2 + loop unrolling x 4) 75 | 76 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel7.h) 77 | 78 | We changed the previous [kernel](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel6.h) from 4x4 to 8x4 to make better use of the 256-bit YMM registers (16 YMMs on SNB/HSW/BRW and 32 YMM/ZMM on SKL/CSL).
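The register-blocking idea that Kernels 3 through 7 build on can be sketched in portable scalar C. This is a minimal illustration and not the code in `kernel7.h`: the names `micro_kernel_8x4`, `dgemm_blocked` and `blocked_self_check` are hypothetical, the sketch computes only `C += A*B` (no alpha/beta scaling), it assumes `m` is a multiple of 8 and `n` a multiple of 4, and it leaves vectorization to the compiler rather than using AVX2 intrinsics. An 8x4 block of fp64 values is 32 elements, which corresponds to eight 256-bit registers when the accumulators are vectorized.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element access; lda/ldb/ldc are the leading dimensions. */
#define A(i, p) a[(size_t)(p) * lda + (i)]
#define B(p, j) b[(size_t)(j) * ldb + (p)]
#define C(i, j) c[(size_t)(j) * ldc + (i)]

enum { MR = 8, NR = 4 }; /* register-block shape (assumed, mirrors the 8x4 kernel) */

/* C(0:8, 0:4) += A(0:8, 0:k) * B(0:k, 0:4), as k rank-1 updates. */
static void micro_kernel_8x4(int k, const double *a, int lda,
                             const double *b, int ldb,
                             double *c, int ldc)
{
    double acc[NR][MR] = {{0.0}}; /* accumulators the compiler can keep in registers */
    for (int p = 0; p < k; ++p) {
        for (int j = 0; j < NR; ++j) {
            double bpj = B(p, j); /* one element of the 1x4 slice of B */
            for (int i = 0; i < MR; ++i)
                acc[j][i] += A(i, p) * bpj; /* 8x1 slice of A re-used 4 times */
        }
    }
    /* C is read and written once per block instead of once per k step. */
    for (int j = 0; j < NR; ++j)
        for (int i = 0; i < MR; ++i)
            C(i, j) += acc[j][i];
}

/* Driver: tiles C into 8x4 blocks and calls the micro kernel on each. */
static void dgemm_blocked(int m, int n, int k, const double *a, int lda,
                          const double *b, int ldb, double *c, int ldc)
{
    assert(m % MR == 0 && n % NR == 0);
    for (int j = 0; j < n; j += NR)
        for (int i = 0; i < m; i += MR)
            micro_kernel_8x4(k, &A(i, 0), lda, &B(0, j), ldb, &C(i, j), ldc);
}

/* Compares the blocked result with a naive triple loop; returns 1 on match. */
static int blocked_self_check(void)
{
    enum { M = 16, N = 8, K = 5 };
    double a[M * K], b[K * N], c[M * N], ref[M * N];
    /* Small integer values keep every sum exact in fp64. */
    for (int x = 0; x < M * K; ++x) a[x] = (double)(x % 7 - 3);
    for (int x = 0; x < K * N; ++x) b[x] = (double)(x % 5 - 2);
    for (int x = 0; x < M * N; ++x) c[x] = ref[x] = 1.0;

    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            for (int p = 0; p < K; ++p)
                ref[j * M + i] += a[p * M + i] * b[j * K + p];

    dgemm_blocked(M, N, K, a, M, b, K, c, M);
    for (int x = 0; x < M * N; ++x)
        if (c[x] != ref[x]) return 0;
    return 1;
}
```

Calling `blocked_self_check()` from a test harness is one way to confirm that the blocked loop order matches the naive result before replacing the inner loops with intrinsics.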
79 | 80 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel6.png) 81 | 82 | ## Kernel 8 (Kernel 7 + cache blocking) 83 | 84 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel8.h) 85 | 86 | We notice that the performance of around 30 GFLOPS cannot be maintained once the data no longer fits in cache. Therefore, we introduce cache blocking so that the performance seen at small matrix sizes is sustained at larger ones. 87 | 88 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel7.png) 89 | 90 | ## Kernel 9 (Kernel 8 + packing) 91 | 92 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel9.h) 93 | 94 | To avoid TLB misses when accessing the cache blocks, we pack the data blocks into contiguous memory before loading them. This strategy boosts the performance. 95 | 96 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel8.png) 97 | 98 | ## Kernel 10 (24x8 kernel + AVX512 + blocking + packing) 99 | 100 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel10.h) 101 | 102 | We update [kernel9](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel9.h) with a larger micro kernel (24x8 instead of 8x4) and a wider data width (AVX512 instead of AVX2). All other strategies remain unchanged. 103 | 104 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel9.png) 105 | 106 | ## Kernel 11 (Kernel 10 + discontiguous packing on B) 107 | 108 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel11.h) 109 | 110 | Previously in [kernel10](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel10.h) we packed the matrix B into fully contiguous memory. Recall that the L2 cache contains multiple cache line buffers, which allow the L2 hardware prefetcher to prefetch data from lower levels of the memory hierarchy. In this step, breaking up the contiguous memory access in B benefits the L2 hardware prefetcher, so the memory latency is hidden further.
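The packing step introduced in Kernel 9 can be sketched as follows. This is an illustrative layout under stated assumptions, not the one used in `kernel9.h` or `kernel11.h`: the names `pack_a`, `PACK_MR` and `pack_self_check` are hypothetical, the strip height of 8 rows is assumed to match an 8-row micro kernel, and `m` is assumed to be a multiple of that height. The point is that after packing, the micro kernel walks the buffer with unit stride instead of jumping by the leading dimension between columns.

```c
#include <assert.h>
#include <stddef.h>

enum { PACK_MR = 8 }; /* strip height; assumed to match an 8-row micro kernel */

/*
 * Copy an m x k block of column-major A (leading dimension lda) into buf.
 * Layout: strips of PACK_MR rows; inside a strip, the PACK_MR entries of
 * column p are contiguous and are followed directly by those of column p+1.
 */
static void pack_a(int m, int k, const double *a, int lda, double *buf)
{
    assert(m % PACK_MR == 0);
    for (int i = 0; i < m; i += PACK_MR)       /* one strip of rows */
        for (int p = 0; p < k; ++p)            /* columns in k order */
            for (int r = 0; r < PACK_MR; ++r)  /* rows inside the strip */
                *buf++ = a[(size_t)p * lda + i + r];
}

/* Verifies the documented layout on a small block; returns 1 on success. */
static int pack_self_check(void)
{
    enum { LDA = 20, M = 16, K = 3 };
    double a[LDA * K], buf[M * K];
    for (int x = 0; x < LDA * K; ++x) a[x] = (double)x;

    pack_a(M, K, a, LDA, buf);

    for (int i = 0; i < M; i += PACK_MR)
        for (int p = 0; p < K; ++p)
            for (int r = 0; r < PACK_MR; ++r) {
                /* strip offset + column offset inside the strip + row */
                size_t pos = (size_t)i * K + (size_t)p * PACK_MR + r;
                if (buf[pos] != a[(size_t)p * LDA + i + r]) return 0;
            }
    return 1;
}
```

With this layout a micro kernel can advance a single pointer by `PACK_MR` per `k` step, so the block it reads spans far fewer pages than the original strided columns do when `lda` is large.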
111 | 112 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel10.png) 113 | 114 | ## Kernel 12 (Kernel 11: from intrinsics to inline ASM) 115 | 116 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel12.h) 117 | 118 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel11.png) 119 | 120 | ## Kernel 13 (Kernel 12 + changing the whole macro kernel into inline ASM) 121 | 122 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel13.h) 123 | 124 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel12.png) 125 | 126 | ## Kernel 14 (Kernel 13 + software prefetching on A) 127 | 128 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel14.h) 129 | 130 | 131 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel13.png) 132 | 133 | ## Kernel 15 (Kernel 14 + software prefetching on B) 134 | 135 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel15.h) 136 | 137 | 138 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel14.png) 139 | 140 | 141 | ## Kernel 16 (Kernel 15 + software prefetching on C) 142 | 143 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel16.h) 144 | 145 | 146 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel15.png) 147 | 148 | ## Kernel 17 (Kernel 16 + fine-tuned matrix scaling routine on C) 149 | 150 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel17.h) 151 | 152 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel16.png) 153 | 154 | ## Kernel 18 (Kernel 17 + fine-grained packing for B to benefit the CPU frequency boosting) 155 | 156 | [source code](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/kernel18.h) 157 | 158 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel17.png) 159 | 160 | ## Kernel 18 comparison against Intel MKL 161 | 162 | ![image](https://www.cs.ucr.edu/~yzhai015/CPU_GEMM/Kernel18.png) 163 | -------------------------------------------------------------------------------- /data/data.sh:
-------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | for ((i=0;i<20;i+=1)) 3 | do 4 | input_file_name="res_${i}.txt" 5 | output_file_name="perf_${i}.txt" 6 | grep "elasped" ${input_file_name} | cut -c 52-60 &> ${output_file_name} 7 | done 8 | -------------------------------------------------------------------------------- /data/myplot.m: -------------------------------------------------------------------------------- 1 | %% CLEAN WORKSPACE 2 | 3 | clear; clc; close all; 4 | 5 | markers = ['p', 'o','*','x', '+']; 6 | colors = ['b', 'k','c','m', 'r']; 7 | 8 | 9 | %% LOAD DATA 10 | 11 | 12 | 13 | for kernel1 = 1:18 14 | if kernel1 ~= 18 15 | kernel2=kernel1+1; 16 | kernel2_name = "Kernel"+num2str(kernel2); 17 | else 18 | kernel2=0; 19 | kernel2_name = "Intel MKL"; 20 | end 21 | data_tmp = zeros(30, 2); 22 | array_size = [100:100:3000]; 23 | array_size = repmat(array_size, 2, 1); 24 | data_ref_1_path = "perf_"+num2str(kernel1)+".txt"; 25 | data_ref_2_path = "perf_"+num2str(kernel2)+".txt"; 26 | 27 | kernel1_name = "Kernel"+num2str(kernel1); 28 | 29 | 30 | data_ref_1 = load(data_ref_1_path); 31 | data_ref_2 = load(data_ref_2_path); 32 | 33 | min_len = min(length(data_ref_1),length(data_ref_2)); 34 | figure 35 | plot(array_size(1, 1:length(data_ref_1))', data_ref_1, strcat('-', markers(1), colors(1)), 'LineWidth',2); 36 | hold on 37 | plot(array_size(2, 1:length(data_ref_2))', data_ref_2, strcat('-', markers(2), colors(2)), 'LineWidth',2); 38 | 39 | if kernel1 <= 8 40 | legend(kernel1_name,kernel2_name,'Location','northeast', 'FontSize', 16) 41 | else 42 | legend(kernel1_name,kernel2_name,'Location','southeast', 'FontSize', 16) 43 | end 44 | xlabel('Matrix Sizes (m=n=k)', 'FontSize', 16, 'FontWeight', 'bold') 45 | ylabel('Performance (GFLOPS)', 'FontSize', 16, 'FontWeight', 'bold') 46 | if kernel1 ~= 18 47 | msg = "Comparison between: Kernel " + num2str(kernel1) + " and Kernel " + num2str(kernel2); 48 | else 49 | msg = 
"Comparison between: Kernel " + num2str(kernel1) + " and Intel MKL"; 50 | end 51 | if min_len==30 52 | xlim([100,3000]) 53 | xticks(100 : 100 : 3000); 54 | else 55 | xlim([100,1000]) 56 | xticks(100 : 100 : 1000); 57 | end 58 | ylim([0, 120]) 59 | a = get(gca,'XTickLabel'); 60 | set(gca,'XTickLabelRotation',45); 61 | set(gca,'XTickLabel',a,'fontsize',12, 'FontWeight', 'bold') 62 | title(sprintf(msg)); 63 | saveas(gca,kernel1_name+".png"); 64 | end 65 | 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /data/res_0.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000283 second, performance: 7.067067 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000265 second, performance: 60.295475 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000676 second, performance: 79.919695 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001292 second, performance: 99.041492 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.002640 second, performance: 94.710905 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.004277 second, performance: 101.005593 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.006528 second, performance: 105.091221 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.009826 second, performance: 104.216425 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.013612 second, performance: 107.113623 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.018406 second, performance: 108.662602 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.024448 second, performance: 108.885590 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.032563 second, performance: 106.133599 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.040203 second, performance: 109.295285 GFLOPS.
40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.050243 second, performance: 109.229880 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.061824 second, performance: 109.181455 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.076525 second, performance: 107.049577 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.093582 second, performance: 104.998831 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.110026 second, performance: 106.011343 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.130645 second, performance: 105.001820 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.158769 second, performance: 100.775358 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.178608 second, performance: 103.701812 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.200645 second, performance: 106.137512 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.226083 second, performance: 107.632867 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.261503 second, performance: 105.727413 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.286880 second, performance: 108.930587 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.321133 second, performance: 109.462442 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.371492 second, performance: 105.967179 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.407250 second, performance: 107.806014 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.447557 second, performance: 108.987300 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.488118 second, performance: 110.629056 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_1.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000933 second, performance: 2.143597 GFLOPS. 
4 | 5 | M=N=K=200: 6 | Average elasped time: 0.007938 second, performance: 2.015705 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.026525 second, performance: 2.035789 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.065517 second, performance: 1.953691 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.130875 second, performance: 1.910220 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.225835 second, performance: 1.912901 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.355149 second, performance: 1.931585 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.542870 second, performance: 1.886272 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.766656 second, performance: 1.901766 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 1.274845 second, performance: 1.568818 GFLOPS. 31 | -------------------------------------------------------------------------------- /data/res_10.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000064 second, performance: 31.418007 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000306 second, performance: 52.238348 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000901 second, performance: 59.934484 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.002038 second, performance: 62.806611 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003772 second, performance: 66.277479 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.005791 second, performance: 74.603269 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.009074 second, performance: 75.597534 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.013744 second, performance: 74.505042 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.017477 second, performance: 83.424138 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.024110 second, performance: 82.954228 GFLOPS. 
31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.032729 second, performance: 81.333831 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.043460 second, performance: 79.521451 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.055181 second, performance: 79.629278 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.067185 second, performance: 81.685281 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.079515 second, performance: 84.890004 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.098594 second, performance: 83.088195 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.119534 second, performance: 82.202542 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.140081 second, performance: 83.266105 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.166351 second, performance: 82.464029 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.194964 second, performance: 82.066428 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.230699 second, performance: 80.286440 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.261526 second, performance: 81.429852 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.295982 second, performance: 82.214469 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.337404 second, performance: 81.943325 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.378400 second, performance: 82.584565 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.421681 second, performance: 83.361529 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.475785 second, performance: 82.738994 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.526697 second, performance: 83.357224 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.579995 second, performance: 84.100679 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.637225 second, performance: 84.742439 GFLOPS. 
91 | -------------------------------------------------------------------------------- /data/res_11.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000065 second, performance: 30.802722 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000329 second, performance: 48.582672 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000951 second, performance: 56.802980 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001999 second, performance: 64.032630 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003754 second, performance: 66.601626 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.005858 second, performance: 73.741002 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.009178 second, performance: 74.746520 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.013823 second, performance: 74.077556 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.017610 second, performance: 82.793897 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.023804 second, performance: 84.020793 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.032614 second, performance: 81.622203 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.043649 second, performance: 79.177724 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.054823 second, performance: 80.148841 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.067059 second, performance: 81.838817 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.079649 second, performance: 84.747111 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.099467 second, performance: 82.358678 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.120063 second, performance: 81.840593 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.146997 second, performance: 79.348743 GFLOPS. 
55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.177816 second, performance: 77.147166 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.201322 second, performance: 79.474516 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.245129 second, performance: 75.560329 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.266079 second, performance: 80.036467 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.297082 second, performance: 81.910125 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.340270 second, performance: 81.253207 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.380985 second, performance: 82.024154 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.424677 second, performance: 82.773439 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.477802 second, performance: 82.389784 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.529265 second, performance: 82.952711 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.582464 second, performance: 83.744280 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.640414 second, performance: 84.320413 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_12.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000058 second, performance: 34.285864 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000298 second, performance: 53.629886 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000904 second, performance: 59.713265 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001889 second, performance: 67.772469 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003292 second, performance: 75.934246 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.005536 second, performance: 78.039164 GFLOPS. 
19 | 20 | M=N=K=700: 21 | Average elasped time: 0.008141 second, performance: 84.261120 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.012825 second, performance: 79.842000 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.016065 second, performance: 90.754274 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.021912 second, performance: 91.275766 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.028523 second, performance: 93.329335 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.040246 second, performance: 85.871867 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.049762 second, performance: 88.300855 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.060889 second, performance: 90.131332 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.071546 second, performance: 94.344480 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.089691 second, performance: 91.335482 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.108364 second, performance: 90.676115 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.127172 second, performance: 91.718594 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.150387 second, performance: 91.217817 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.176900 second, performance: 90.446423 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.209587 second, performance: 88.373660 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.237064 second, performance: 89.832265 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.268135 second, performance: 90.752688 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.306551 second, performance: 90.190545 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.343247 second, performance: 91.042306 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.381527 second, performance: 92.135039 GFLOPS. 
79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.430973 second, performance: 91.342138 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.476392 second, performance: 92.159392 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.524832 second, performance: 92.940159 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.575820 second, performance: 93.779246 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_13.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000059 second, performance: 34.053889 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000294 second, performance: 54.412592 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000849 second, performance: 63.603599 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001925 second, performance: 66.482817 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003290 second, performance: 75.996618 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.005400 second, performance: 80.005563 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.008134 second, performance: 84.333564 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.012615 second, performance: 81.175474 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.016363 second, performance: 89.105278 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.021639 second, performance: 92.424230 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.028478 second, performance: 93.475709 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.040081 second, performance: 86.224658 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.050043 second, performance: 87.805139 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.060564 second, performance: 90.614825 GFLOPS. 
43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.071847 second, performance: 93.949277 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.089860 second, performance: 91.163668 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.108863 second, performance: 90.260535 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.127357 second, performance: 91.585068 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.151008 second, performance: 90.843078 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.177623 second, performance: 90.078246 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.210417 second, performance: 88.025194 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.238027 second, performance: 89.468716 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.268719 second, performance: 90.555683 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.307661 second, performance: 89.865150 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.344584 second, performance: 90.688961 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.382851 second, performance: 91.816312 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.432441 second, performance: 91.032072 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.477571 second, performance: 91.931939 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.526324 second, performance: 92.676692 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.581062 second, performance: 92.933284 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_14.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000060 second, performance: 33.509752 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000326 second, performance: 49.080105 GFLOPS. 
7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000958 second, performance: 56.388153 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.002129 second, performance: 60.113191 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003945 second, performance: 63.365724 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.006076 second, performance: 71.102917 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.008297 second, performance: 82.676861 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.012535 second, performance: 81.689090 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.016743 second, performance: 87.079389 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.021617 second, performance: 92.521072 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.029235 second, performance: 91.055101 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.039266 second, performance: 88.015900 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.049351 second, performance: 89.035151 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.060206 second, performance: 91.153802 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.077524 second, performance: 87.069424 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.097098 second, performance: 84.368342 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.110075 second, performance: 89.266412 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.129813 second, performance: 89.852577 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.159820 second, performance: 85.834063 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.187903 second, performance: 85.150314 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.219915 second, performance: 84.223575 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.252710 second, performance: 84.270606 GFLOPS. 
67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.269823 second, performance: 90.184916 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.308235 second, performance: 89.697793 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.344587 second, performance: 90.688271 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.383487 second, performance: 91.664052 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.446984 second, performance: 88.070256 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.480523 second, performance: 91.367109 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.527015 second, performance: 92.555245 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.577827 second, performance: 93.453568 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_15.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000057 second, performance: 34.855712 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000291 second, performance: 54.932222 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000883 second, performance: 61.131556 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001874 second, performance: 68.301291 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003439 second, performance: 72.696617 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.005417 second, performance: 79.743831 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.008048 second, performance: 85.242168 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.012642 second, performance: 80.997384 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.016038 second, performance: 90.911227 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.021645 second, performance: 92.401494 GFLOPS. 
31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.028364 second, performance: 93.850234 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.039691 second, performance: 87.073396 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.049805 second, performance: 88.223360 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.060019 second, performance: 91.437687 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.071785 second, performance: 94.030717 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.089311 second, performance: 91.724379 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.108092 second, performance: 90.904053 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.126743 second, performance: 92.028752 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.150362 second, performance: 91.233148 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.176462 second, performance: 90.671072 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.209468 second, performance: 88.423854 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.236775 second, performance: 89.941807 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.267565 second, performance: 90.946229 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.306270 second, performance: 90.273182 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.342710 second, performance: 91.184877 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.380634 second, performance: 92.351262 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.431191 second, performance: 91.295908 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.475152 second, performance: 92.399841 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.523741 second, performance: 93.133762 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.574601 second, performance: 93.978202 GFLOPS. 
91 | -------------------------------------------------------------------------------- /data/res_16.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000058 second, performance: 34.663669 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000290 second, performance: 55.112672 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000805 second, performance: 67.082362 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001881 second, performance: 68.047350 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003222 second, performance: 77.591831 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.005033 second, performance: 85.827802 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.007901 second, performance: 86.824966 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.011912 second, performance: 85.963532 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.015416 second, performance: 94.575196 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.020434 second, performance: 97.877697 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.026713 second, performance: 99.653138 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.037478 second, performance: 92.214955 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.045707 second, performance: 96.133262 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.055403 second, performance: 99.056587 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.067530 second, performance: 99.955581 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.082408 second, performance: 99.407404 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.099079 second, performance: 99.173016 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.116157 second, performance: 100.415563 GFLOPS. 
55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.137511 second, performance: 99.759051 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.160636 second, performance: 99.603860 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.189914 second, performance: 97.528198 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.215767 second, performance: 98.698884 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.245491 second, performance: 99.123657 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.278228 second, performance: 99.371722 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.312101 second, performance: 100.127956 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.346999 second, performance: 101.302790 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.394363 second, performance: 99.821736 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.432181 second, performance: 101.586982 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.479907 second, performance: 101.640468 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.522457 second, performance: 103.357733 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_17.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000054 second, performance: 37.008565 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000271 second, performance: 59.109393 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000791 second, performance: 68.268587 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001618 second, performance: 79.095061 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003049 second, performance: 81.994735 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.004814 second, performance: 89.738468 GFLOPS. 
19 | 20 | M=N=K=700: 21 | Average elasped time: 0.007709 second, performance: 88.990264 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.011172 second, performance: 91.654647 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.014892 second, performance: 97.906862 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.019774 second, performance: 101.143119 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.026115 second, performance: 101.932435 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.036291 second, performance: 95.230318 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.044570 second, performance: 98.587265 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.053816 second, performance: 101.977708 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.064472 second, performance: 104.696548 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.076812 second, performance: 106.649521 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.091824 second, performance: 107.008627 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.106188 second, performance: 109.843270 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.124685 second, performance: 110.021561 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.146396 second, performance: 109.292365 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.168595 second, performance: 109.860912 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.190999 second, performance: 111.497762 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.218565 second, performance: 111.335469 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.247309 second, performance: 111.795235 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.279611 second, performance: 111.762305 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.311009 second, performance: 113.025549 GFLOPS. 
79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.356690 second, performance: 110.364838 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.388392 second, performance: 113.040442 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.434629 second, performance: 112.228957 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.471890 second, performance: 114.433513 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_18.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000060 second, performance: 33.509752 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000290 second, performance: 55.173086 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.000816 second, performance: 66.154926 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.001655 second, performance: 77.325495 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.003167 second, performance: 78.939222 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.004770 second, performance: 90.565269 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.007468 second, performance: 91.854869 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.011002 second, performance: 93.074120 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.014726 second, performance: 99.010673 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.019563 second, performance: 102.235265 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.025865 second, performance: 102.920273 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.036036 second, performance: 95.904057 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.043722 second, performance: 100.499350 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.053607 second, performance: 102.374715 GFLOPS. 
43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.064125 second, performance: 105.263710 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.076448 second, performance: 107.157746 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.090749 second, performance: 108.276645 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.105297 second, performance: 110.772043 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.123691 second, performance: 110.905399 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.146085 second, performance: 109.525318 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.167116 second, performance: 110.833190 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.189462 second, performance: 112.402519 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.216719 second, performance: 112.283488 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.245005 second, performance: 112.846690 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.277267 second, performance: 112.707255 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.308088 second, performance: 114.097394 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 0.353778 second, performance: 111.273282 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 0.384951 second, performance: 114.050869 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 0.430948 second, performance: 113.187748 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 0.467678 second, performance: 115.464139 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_2.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000800 second, performance: 2.500827 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.007446 second, performance: 2.148698 GFLOPS. 
7 | 8 | M=N=K=300: 9 | Average elasped time: 0.025822 second, performance: 2.091240 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.064178 second, performance: 1.994443 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.128827 second, performance: 1.940582 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.218946 second, performance: 1.973089 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.347952 second, performance: 1.971536 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.522607 second, performance: 1.959409 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.742905 second, performance: 1.962565 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 1.182570 second, performance: 1.691232 GFLOPS. 31 | -------------------------------------------------------------------------------- /data/res_3.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000272 second, performance: 7.351979 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.002189 second, performance: 7.310334 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.007662 second, performance: 7.047423 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.025840 second, performance: 4.953491 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.050516 second, performance: 4.948891 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.090048 second, performance: 4.797459 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.137898 second, performance: 4.974678 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.204789 second, performance: 5.000276 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.298631 second, performance: 4.882285 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.504119 second, performance: 3.967317 GFLOPS. 
31 | -------------------------------------------------------------------------------- /data/res_4.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000173 second, performance: 11.538663 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.001497 second, performance: 10.685558 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.005339 second, performance: 10.113677 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.013528 second, performance: 9.461837 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.026549 second, performance: 9.416684 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.045630 second, performance: 9.467394 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.073888 second, performance: 9.284324 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.110822 second, performance: 9.240019 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.161419 second, performance: 9.032376 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.243352 second, performance: 8.218536 GFLOPS. 31 | -------------------------------------------------------------------------------- /data/res_5.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000070 second, performance: 28.565067 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000661 second, performance: 24.218284 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.002594 second, performance: 20.817318 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.006802 second, performance: 18.817108 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.013837 second, performance: 18.067578 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.022642 second, performance: 19.079606 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.035220 second, performance: 19.477577 GFLOPS. 
22 | 23 | M=N=K=800: 24 | Average elasped time: 0.053280 second, performance: 19.219321 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.075098 second, performance: 19.414540 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.124883 second, performance: 16.014988 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.145598 second, performance: 18.283266 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.198915 second, performance: 17.374227 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.281015 second, performance: 15.636198 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.322418 second, performance: 17.021367 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.458154 second, performance: 14.733049 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.909394 second, performance: 9.008193 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.988013 second, performance: 9.945213 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 1.306908 second, performance: 8.924880 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 1.740841 second, performance: 7.880098 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 1.936918 second, performance: 8.260544 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 2.393219 second, performance: 7.739368 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 2.770924 second, performance: 7.685522 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 3.289765 second, performance: 7.396882 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 4.180771 second, performance: 6.613135 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 4.210540 second, performance: 7.421851 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 4.783332 second, performance: 7.348852 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 5.555718 second, performance: 7.085672 GFLOPS. 
82 | 83 | M=N=K=2800: 84 | Average elasped time: 6.108167 second, performance: 7.187754 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 6.818280 second, performance: 7.154004 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 7.822134 second, performance: 6.903487 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_6.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000079 second, performance: 25.420024 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000705 second, performance: 22.694915 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.002548 second, performance: 21.195909 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.007257 second, performance: 17.638180 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.014409 second, performance: 17.350667 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.024602 second, performance: 17.559764 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.035819 second, performance: 19.151648 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.060217 second, performance: 17.005057 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.080034 second, performance: 18.217326 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.136156 second, performance: 14.688999 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.157793 second, performance: 16.870173 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.217283 second, performance: 15.905500 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.288663 second, performance: 15.221899 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.337148 second, performance: 16.277732 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.474865 second, performance: 14.214567 GFLOPS. 
46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.933889 second, performance: 8.771918 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 1.002384 second, performance: 9.802633 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 1.332452 second, performance: 8.753788 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 1.750214 second, performance: 7.837899 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 1.978627 second, performance: 8.086416 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 2.480368 second, performance: 7.467440 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 2.801906 second, performance: 7.600540 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 3.264131 second, performance: 7.454970 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 4.092487 second, performance: 6.755794 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 4.246968 second, performance: 7.358190 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 4.792007 second, performance: 7.335549 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 5.470783 second, performance: 7.195679 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 6.118353 second, performance: 7.175787 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 6.808529 second, performance: 7.164249 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 7.767232 second, performance: 6.952283 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_7.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000077 second, performance: 26.078574 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000470 second, performance: 34.071178 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.002361 second, performance: 22.871861 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.005091 second, performance: 25.142646 GFLOPS. 
13 | 14 | M=N=K=500: 15 | Average elasped time: 0.010646 second, performance: 23.483819 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.016265 second, performance: 26.559458 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.026511 second, performance: 25.876407 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.038946 second, performance: 26.292821 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.059465 second, performance: 24.518489 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.088561 second, performance: 22.583391 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.114015 second, performance: 23.347800 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.147799 second, performance: 23.383169 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.204660 second, performance: 21.469753 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.246410 second, performance: 22.271797 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.339744 second, performance: 19.867881 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.611512 second, performance: 13.396297 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.602048 second, performance: 16.320949 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.730389 second, performance: 15.969578 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 1.032072 second, performance: 13.291713 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 1.174407 second, performance: 13.623901 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 1.396350 second, performance: 13.264580 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 1.553587 second, performance: 13.707635 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 1.807356 second, performance: 13.463864 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 2.383448 second, performance: 11.600003 GFLOPS. 
73 | 74 | M=N=K=2500: 75 | Average elasped time: 2.408157 second, performance: 12.976730 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 2.593178 second, performance: 13.555570 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 3.088232 second, performance: 12.747099 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 3.327420 second, performance: 13.194609 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 3.920907 second, performance: 12.440488 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 4.325503 second, performance: 12.484098 GFLOPS. 91 | -------------------------------------------------------------------------------- /data/res_8.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000084 second, performance: 23.718967 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000462 second, performance: 34.604089 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.002385 second, performance: 22.638677 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.005115 second, performance: 25.022725 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.010433 second, performance: 23.961609 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.017752 second, performance: 24.334834 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.028623 second, performance: 23.966786 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.041460 second, performance: 24.698719 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.064851 second, performance: 22.482428 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.087897 second, performance: 22.753910 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.113537 second, performance: 23.446037 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.145456 second, performance: 23.759815 GFLOPS. 
37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.186549 second, performance: 23.554130 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.229639 second, performance: 23.898349 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.286561 second, performance: 23.555219 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.352738 second, performance: 23.224016 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.420857 second, performance: 23.347596 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.499982 second, performance: 23.328838 GFLOPS. 55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.582780 second, performance: 23.538898 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.706954 second, performance: 22.632307 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.802024 second, performance: 23.094082 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.918903 second, performance: 23.175459 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 1.025997 second, performance: 23.717426 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 1.208814 second, performance: 22.872011 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 1.376855 second, performance: 22.696648 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 1.528901 second, performance: 22.991679 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 1.676683 second, performance: 23.478499 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 1.899364 second, performance: 23.115102 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 2.065415 second, performance: 23.616558 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 2.326072 second, performance: 23.215106 GFLOPS. 
91 | -------------------------------------------------------------------------------- /data/res_9.txt: -------------------------------------------------------------------------------- 1 | 2 | M=N=K=100: 3 | Average elasped time: 0.000083 second, performance: 24.082128 GFLOPS. 4 | 5 | M=N=K=200: 6 | Average elasped time: 0.000505 second, performance: 31.704975 GFLOPS. 7 | 8 | M=N=K=300: 9 | Average elasped time: 0.001601 second, performance: 33.736023 GFLOPS. 10 | 11 | M=N=K=400: 12 | Average elasped time: 0.003977 second, performance: 32.182004 GFLOPS. 13 | 14 | M=N=K=500: 15 | Average elasped time: 0.007186 second, performance: 34.789794 GFLOPS. 16 | 17 | M=N=K=600: 18 | Average elasped time: 0.012181 second, performance: 35.466098 GFLOPS. 19 | 20 | M=N=K=700: 21 | Average elasped time: 0.018289 second, performance: 37.509517 GFLOPS. 22 | 23 | M=N=K=800: 24 | Average elasped time: 0.027927 second, performance: 36.666606 GFLOPS. 25 | 26 | M=N=K=900: 27 | Average elasped time: 0.038607 second, performance: 37.765132 GFLOPS. 28 | 29 | M=N=K=1000: 30 | Average elasped time: 0.057678 second, performance: 34.675276 GFLOPS. 31 | 32 | M=N=K=1100: 33 | Average elasped time: 0.074458 second, performance: 35.751564 GFLOPS. 34 | 35 | M=N=K=1200: 36 | Average elasped time: 0.097549 second, performance: 35.428247 GFLOPS. 37 | 38 | M=N=K=1300: 39 | Average elasped time: 0.120748 second, performance: 36.389848 GFLOPS. 40 | 41 | M=N=K=1400: 42 | Average elasped time: 0.149370 second, performance: 36.740970 GFLOPS. 43 | 44 | M=N=K=1500: 45 | Average elasped time: 0.185390 second, performance: 36.409669 GFLOPS. 46 | 47 | M=N=K=1600: 48 | Average elasped time: 0.234730 second, performance: 34.899624 GFLOPS. 49 | 50 | M=N=K=1700: 51 | Average elasped time: 0.279096 second, performance: 35.206488 GFLOPS. 52 | 53 | M=N=K=1800: 54 | Average elasped time: 0.329090 second, performance: 35.443230 GFLOPS. 
55 | 56 | M=N=K=1900: 57 | Average elasped time: 0.386074 second, performance: 35.532020 GFLOPS. 58 | 59 | M=N=K=2000: 60 | Average elasped time: 0.466119 second, performance: 34.325971 GFLOPS. 61 | 62 | M=N=K=2100: 63 | Average elasped time: 0.530173 second, performance: 34.935767 GFLOPS. 64 | 65 | M=N=K=2200: 66 | Average elasped time: 0.597254 second, performance: 35.656524 GFLOPS. 67 | 68 | M=N=K=2300: 69 | Average elasped time: 0.674764 second, performance: 36.062979 GFLOPS. 70 | 71 | M=N=K=2400: 72 | Average elasped time: 0.785389 second, performance: 35.202936 GFLOPS. 73 | 74 | M=N=K=2500: 75 | Average elasped time: 0.877093 second, performance: 35.629061 GFLOPS. 76 | 77 | M=N=K=2600: 78 | Average elasped time: 0.983036 second, performance: 35.758596 GFLOPS. 79 | 80 | M=N=K=2700: 81 | Average elasped time: 1.115037 second, performance: 35.304647 GFLOPS. 82 | 83 | M=N=K=2800: 84 | Average elasped time: 1.236919 second, performance: 35.494656 GFLOPS. 85 | 86 | M=N=K=2900: 87 | Average elasped time: 1.363170 second, performance: 35.782780 GFLOPS. 88 | 89 | M=N=K=3000: 90 | Average elasped time: 1.511161 second, performance: 35.734114 GFLOPS. 
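The performance column in these result files is consistent with the usual DGEMM flop count of 2*M*N*K divided by the measured time. Below is a minimal sketch of that arithmetic, assuming this is how the benchmark derives the figure. The helper name `dgemm_gflops` is hypothetical and does not appear in the repository.

```c
/* Hypothetical helper (not part of the repository): converts a measured
 * DGEMM run time into GFLOPS, assuming 2*M*N*K floating-point operations
 * (one multiply and one add per inner-product term). */
static double dgemm_gflops(int m, int n, int k, double seconds) {
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds * 1e-9;
}
```

For the M=N=K=3000 row above, `dgemm_gflops(3000, 3000, 3000, 1.511161)` evaluates to roughly 35.73, in line with the logged value.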
91 | -------------------------------------------------------------------------------- /figures/Kernel1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel1.png -------------------------------------------------------------------------------- /figures/Kernel10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel10.png -------------------------------------------------------------------------------- /figures/Kernel11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel11.png -------------------------------------------------------------------------------- /figures/Kernel12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel12.png -------------------------------------------------------------------------------- /figures/Kernel13.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel13.png -------------------------------------------------------------------------------- /figures/Kernel14.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel14.png -------------------------------------------------------------------------------- /figures/Kernel15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel15.png -------------------------------------------------------------------------------- /figures/Kernel16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel16.png -------------------------------------------------------------------------------- /figures/Kernel17.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel17.png -------------------------------------------------------------------------------- /figures/Kernel18.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel18.png -------------------------------------------------------------------------------- /figures/Kernel2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel2.png -------------------------------------------------------------------------------- /figures/Kernel3.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel3.png -------------------------------------------------------------------------------- /figures/Kernel4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel4.png -------------------------------------------------------------------------------- /figures/Kernel5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel5.png -------------------------------------------------------------------------------- /figures/Kernel6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel6.png -------------------------------------------------------------------------------- /figures/Kernel7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel7.png -------------------------------------------------------------------------------- /figures/Kernel8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel8.png -------------------------------------------------------------------------------- /figures/Kernel9.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F/fd62398ae788ee3adbf96aa95e119d98c5d71779/figures/Kernel9.png -------------------------------------------------------------------------------- /include/kernel1.h: -------------------------------------------------------------------------------- 1 | #define A(i,j) A[(i)+(j)*LDA] 2 | #define B(i,j) B[(i)+(j)*LDB] 3 | #define C(i,j) C[(i)+(j)*LDC] 4 | 5 | void scale_c_k1(double *C,int M, int N, int LDC, double scalar){ 6 | int i,j; 7 | for (i=0;i 3 | #define A(i,j) A[(i)+(j)*LDA] 4 | #define B(i,j) B[(i)+(j)*LDB] 5 | #define C(i,j) C[(i)+(j)*LDC] 6 | #define M_BLOCKING 192 7 | #define N_BLOCKING 2048 8 | #define K_BLOCKING 384 9 | void scale_c_k12(double *C,int M, int N, int LDC, double scalar){ 10 | int i,j; 11 | for (i=0;i1;count_second+=2,count_sub-=2){ 24 | tosrc1=src+count_second*leading_dim;tosrc2=tosrc1+leading_dim; 25 | for (count_first=0;count_first0;count_second++,count_sub-=1){ 31 | tosrc1=src+count_second*leading_dim; 32 | for (count_first=0;count_first23;count_first+=24,count_sub-=24){ 46 | tosrc=src+count_first; 47 | for(count_second=0;count_second7;count_first+=8,count_sub-=8){ 57 | tosrc=src+count_first; 58 | for(count_second=0;count_second1;count_first+=2,count_sub-=2){ 65 | tosrc=src+count_first; 66 | for(count_second=0;count_second0;count_first+=1,count_sub-=1){ 73 | tosrc=src+count_first; 74 | for(count_second=0;count_second7;n_count_sub-=8,n_count+=8){ 446 | //call the m layer with n=8; 447 | a_ptr=a_buffer;b_ptr=b_buffer+n_count*k;c_ptr=C+n_count*LDC; 448 | for (m_count_sub=m,m_count=0;m_count_sub>23;m_count_sub-=24,m_count+=24){ 449 | //call the micro kernel: m24n8; 450 | a_ptr=a_buffer+m_count*K; 451 | b_ptr=b_buffer+n_count*k; 452 | c_ptr=C+n_count*LDC+m_count; 453 | micro_kernel_24x8_k12(a_ptr,b_ptr,c_ptr,ldc_in_bytes,K); 454 | } 455 | for 
(;m_count_sub>7;m_count_sub-=8,m_count+=8){ 456 | //call the micro kernel: m8n8; 457 | a_ptr=a_buffer+m_count*K; 458 | b_ptr=b_buffer+n_count*k; 459 | c_ptr=C+n_count*LDC+m_count; 460 | micro_kernel_8x8_k12(a_ptr,b_ptr,c_ptr,ldc_in_bytes,K); 461 | } 462 | for (;m_count_sub>1;m_count_sub-=2,m_count+=2){ 463 | //call the micro kernel: m2n8; 464 | a_ptr=a_buffer+m_count*K; 465 | b_ptr=b_buffer+n_count*k; 466 | c_ptr=C+n_count*LDC+m_count; 467 | micro_kernel_2x8_k12(a_ptr,b_ptr,c_ptr,ldc_in_bytes,K); 468 | } 469 | for (;m_count_sub>0;m_count_sub-=1,m_count+=1){ 470 | //call the micro kernel: m1n8; 471 | } 472 | } 473 | for (;n_count_sub>3;n_count_sub-=4,n_count+=4){ 474 | //call the m layer with n=4 475 | a_ptr=a_buffer;b_ptr=b_buffer+n_count*k;c_ptr=C+n_count*LDC; 476 | for (m_count_sub=m,m_count=0;m_count_sub>23;m_count_sub-=24,m_count+=24){ 477 | //call the micro kernel: m24n4; 478 | a_ptr=a_buffer+m_count*K; 479 | b_ptr=b_buffer+n_count*k; 480 | c_ptr=C+n_count*LDC+m_count; 481 | micro_kernel_24x4_k12(a_ptr,b_ptr,c_ptr,ldc_in_bytes,K); 482 | } 483 | for (;m_count_sub>7;m_count_sub-=8,m_count+=8){ 484 | //call the micro kernel: m8n4; 485 | a_ptr=a_buffer+m_count*K; 486 | b_ptr=b_buffer+n_count*k; 487 | c_ptr=C+n_count*LDC+m_count; 488 | micro_kernel_8x4_k12(a_ptr,b_ptr,c_ptr,ldc_in_bytes,K); 489 | } 490 | for (;m_count_sub>1;m_count_sub-=2,m_count+=2){ 491 | //call the micro kernel: m2n4; 492 | a_ptr=a_buffer+m_count*K; 493 | b_ptr=b_buffer+n_count*k; 494 | c_ptr=C+n_count*LDC+m_count; 495 | micro_kernel_2x4_k12(a_ptr,b_ptr,c_ptr,ldc_in_bytes,K); 496 | } 497 | for (;m_count_sub>0;m_count_sub-=1,m_count+=1){ 498 | //call the micro kernel: m1n4; 499 | } 500 | } 501 | for (;n_count_sub>1;n_count_sub-=2,n_count+=2){ 502 | //call the m layer with n=2 503 | } 504 | for (;n_count_sub>0;n_count_sub-=1,n_count+=1){ 505 | //call the m layer with n=1 506 | } 507 | } 508 | 509 | 510 | void mydgemm_cpu_v12(\ 511 | int M, \ 512 | int N, \ 513 | int K, \ 514 | double alpha, \ 515 
| double *A, \ 516 | int LDA, \ 517 | double *B, \ 518 | int LDB, \ 519 | double beta, \ 520 | double *C, \ 521 | int LDC)\ 522 | { 523 | if (beta != 1.0) scale_c_k12(C,M,N,LDC,beta); 524 | double *b_buffer = (double *)aligned_alloc(4096,K_BLOCKING*N_BLOCKING*sizeof(double)); 525 | double *a_buffer = (double *)aligned_alloc(4096,K_BLOCKING*M_BLOCKING*sizeof(double)); 526 | int m_count, n_count, k_count; 527 | int m_inc, n_inc, k_inc; 528 | for (n_count=0;n_countN_BLOCKING)?N_BLOCKING:N-n_count; 530 | for (k_count=0;k_countK_BLOCKING)?K_BLOCKING:K-k_count; 532 | packing_b_24x8_edge_version2_k12(B+k_count+n_count*LDB,b_buffer,LDB,k_inc,n_inc); 533 | for (m_count=0;m_countM_BLOCKING)?M_BLOCKING:N-m_count; 535 | packing_a_24x8_edge_k12(alpha, A+m_count+k_count*LDA,a_buffer,LDA,m_inc,k_inc); 536 | //macro kernel: to compute C += A_tilt * B_tilt 537 | macro_kernel_k12(a_buffer, b_buffer, m_inc, n_inc, k_inc, &C(m_count, n_count), LDC); 538 | } 539 | } 540 | } 541 | free(a_buffer);free(b_buffer); 542 | } -------------------------------------------------------------------------------- /include/kernel13.h: -------------------------------------------------------------------------------- 1 | //#define TIMER 1 2 | #include "immintrin.h" 3 | #include 4 | #ifdef TIMER 5 | #include "utils.h" 6 | #endif 7 | #define A(i,j) A[(i)+(j)*LDA] 8 | #define B(i,j) B[(i)+(j)*LDB] 9 | #define C(i,j) C[(i)+(j)*LDC] 10 | #define M_BLOCKING 192 11 | #define N_BLOCKING 2048 12 | #define K_BLOCKING 384 13 | 14 | 15 | void scale_c_k13(double *C,int M, int N, int LDC, double scalar){ 16 | int i,j; 17 | for (i=0;i1;count_second+=2,count_sub-=2){ 30 | tosrc1=src+count_second*leading_dim;tosrc2=tosrc1+leading_dim; 31 | for (count_first=0;count_first0;count_second++,count_sub-=1){ 37 | tosrc1=src+count_second*leading_dim; 38 | for (count_first=0;count_first23;count_first+=24,count_sub-=24){ 52 | tosrc=src+count_first; 53 | for(count_second=0;count_second7;count_first+=8,count_sub-=8){ 63 | 
tosrc=src+count_first; 64 | for(count_second=0;count_second1;count_first+=2,count_sub-=2){ 71 | tosrc=src+count_first; 72 | for(count_second=0;count_second0;count_first+=1,count_sub-=1){ 79 | tosrc=src+count_first; 80 | for(count_second=0;count_second7;n_count_sub-=8,n_count+=8){ 470 | //call the m layer with n=8; 471 | macro_n8 472 | } 473 | for (;n_count_sub>3;n_count_sub-=4,n_count+=4){ 474 | //call the m layer with n=4 475 | macro_n4 476 | } 477 | for (;n_count_sub>1;n_count_sub-=2,n_count+=2){ 478 | //call the m layer with n=2 479 | } 480 | for (;n_count_sub>0;n_count_sub-=1,n_count+=1){ 481 | //call the m layer with n=1 482 | } 483 | } 484 | 485 | 486 | void mydgemm_cpu_v13(\ 487 | int M, \ 488 | int N, \ 489 | int K, \ 490 | double alpha, \ 491 | double *A, \ 492 | int LDA, \ 493 | double *B, \ 494 | int LDB, \ 495 | double beta, \ 496 | double *C, \ 497 | int LDC)\ 498 | { 499 | if (beta != 1.0) scale_c_k13(C,M,N,LDC,beta); 500 | double *b_buffer = (double *)aligned_alloc(4096,K_BLOCKING*N_BLOCKING*sizeof(double)); 501 | double *a_buffer = (double *)aligned_alloc(4096,K_BLOCKING*M_BLOCKING*sizeof(double)); 502 | int m_count, n_count, k_count; 503 | int m_inc, n_inc, k_inc; 504 | for (n_count=0;n_count<N;n_count+=n_inc){ 505 | n_inc=(N-n_count>N_BLOCKING)?N_BLOCKING:N-n_count; 506 | for (k_count=0;k_count<K;k_count+=k_inc){ 507 | k_inc=(K-k_count>K_BLOCKING)?K_BLOCKING:K-k_count; 508 | packing_b_24x8_edge_version2_k13(B+k_count+n_count*LDB,b_buffer,LDB,k_inc,n_inc); 509 | for (m_count=0;m_count<M;m_count+=m_inc){ 510 | m_inc=(M-m_count>M_BLOCKING)?M_BLOCKING:M-m_count; 511 | packing_a_24x8_edge_k13(alpha, A+m_count+k_count*LDA,a_buffer,LDA,m_inc,k_inc); 512 | //macro kernel: to compute C += A_tilt * B_tilt 513 | macro_kernel_k13(a_buffer, b_buffer, m_inc, n_inc, k_inc, &C(m_count, n_count), LDC); 514 | } 515 | } 516 | } 517 | free(a_buffer);free(b_buffer); 518 | } -------------------------------------------------------------------------------- /include/kernel14.h: -------------------------------------------------------------------------------- 1 | //#define TIMER 1 2 | #include
"immintrin.h" 3 | #include 4 | #ifdef TIMER 5 | #include "utils.h" 6 | #endif 7 | #define A(i,j) A[(i)+(j)*LDA] 8 | #define B(i,j) B[(i)+(j)*LDB] 9 | #define C(i,j) C[(i)+(j)*LDC] 10 | #define M_BLOCKING 192 11 | #define N_BLOCKING 2048 12 | #define K_BLOCKING 384 13 | 14 | 15 | void scale_c_k14(double *C,int M, int N, int LDC, double scalar){ 16 | int i,j; 17 | for (i=0;i1;count_second+=2,count_sub-=2){ 30 | tosrc1=src+count_second*leading_dim;tosrc2=tosrc1+leading_dim; 31 | for (count_first=0;count_first0;count_second++,count_sub-=1){ 37 | tosrc1=src+count_second*leading_dim; 38 | for (count_first=0;count_first23;count_first+=24,count_sub-=24){ 52 | tosrc=src+count_first; 53 | for(count_second=0;count_second7;count_first+=8,count_sub-=8){ 63 | tosrc=src+count_first; 64 | for(count_second=0;count_second1;count_first+=2,count_sub-=2){ 71 | tosrc=src+count_first; 72 | for(count_second=0;count_second0;count_first+=1,count_sub-=1){ 79 | tosrc=src+count_first; 80 | for(count_second=0;count_second7;n_count_sub-=8,n_count+=8){ 470 | //call the m layer with n=8; 471 | macro_n8 472 | } 473 | for (;n_count_sub>3;n_count_sub-=4,n_count+=4){ 474 | //call the m layer with n=4 475 | macro_n4 476 | } 477 | for (;n_count_sub>1;n_count_sub-=2,n_count+=2){ 478 | //call the m layer with n=2 479 | } 480 | for (;n_count_sub>0;n_count_sub-=1,n_count+=1){ 481 | //call the m layer with n=1 482 | } 483 | } 484 | 485 | 486 | void mydgemm_cpu_v14(\ 487 | int M, \ 488 | int N, \ 489 | int K, \ 490 | double alpha, \ 491 | double *A, \ 492 | int LDA, \ 493 | double *B, \ 494 | int LDB, \ 495 | double beta, \ 496 | double *C, \ 497 | int LDC)\ 498 | { 499 | if (beta != 1.0) scale_c_k14(C,M,N,LDC,beta); 500 | double *b_buffer = (double *)aligned_alloc(4096,K_BLOCKING*N_BLOCKING*sizeof(double)); 501 | double *a_buffer = (double *)aligned_alloc(4096,K_BLOCKING*M_BLOCKING*sizeof(double)); 502 | int m_count, n_count, k_count; 503 | int m_inc, n_inc, k_inc; 504 | for 
(n_count=0;n_count<N;n_count+=n_inc){ 505 | n_inc=(N-n_count>N_BLOCKING)?N_BLOCKING:N-n_count; 506 | for (k_count=0;k_count<K;k_count+=k_inc){ 507 | k_inc=(K-k_count>K_BLOCKING)?K_BLOCKING:K-k_count; 508 | packing_b_24x8_edge_version2_k14(B+k_count+n_count*LDB,b_buffer,LDB,k_inc,n_inc); 509 | for (m_count=0;m_count<M;m_count+=m_inc){ 510 | m_inc=(M-m_count>M_BLOCKING)?M_BLOCKING:M-m_count; 511 | packing_a_24x8_edge_k14(alpha, A+m_count+k_count*LDA,a_buffer,LDA,m_inc,k_inc); 512 | //macro kernel: to compute C += A_tilt * B_tilt 513 | macro_kernel_k14(a_buffer, b_buffer, m_inc, n_inc, k_inc, &C(m_count, n_count), LDC); 514 | } 515 | } 516 | } 517 | free(a_buffer);free(b_buffer); 518 | } -------------------------------------------------------------------------------- /include/kernel15.h: -------------------------------------------------------------------------------- 1 | //#define TIMER 1 2 | #include "immintrin.h" 3 | #include 4 | #ifdef TIMER 5 | #include "utils.h" 6 | #endif 7 | #define A(i,j) A[(i)+(j)*LDA] 8 | #define B(i,j) B[(i)+(j)*LDB] 9 | #define C(i,j) C[(i)+(j)*LDC] 10 | #define M_BLOCKING 192 11 | #define N_BLOCKING 2048 12 | #define K_BLOCKING 384 13 | 14 | 15 | void scale_c_k15(double *C,int M, int N, int LDC, double scalar){ 16 | int i,j; 17 | for (i=0;i1;count_second+=2,count_sub-=2){ 30 | tosrc1=src+count_second*leading_dim;tosrc2=tosrc1+leading_dim; 31 | for (count_first=0;count_first0;count_second++,count_sub-=1){ 37 | tosrc1=src+count_second*leading_dim; 38 | for (count_first=0;count_first23;count_first+=24,count_sub-=24){ 52 | tosrc=src+count_first; 53 | for(count_second=0;count_second7;count_first+=8,count_sub-=8){ 63 | tosrc=src+count_first; 64 | for(count_second=0;count_second1;count_first+=2,count_sub-=2){ 71 | tosrc=src+count_first; 72 | for(count_second=0;count_second0;count_first+=1,count_sub-=1){ 79 | tosrc=src+count_first; 80 | for(count_second=0;count_second7;n_count_sub-=8,n_count+=8){ 473 | //call the m layer with n=8; 474 | macro_n8 475 | } 476 | for (;n_count_sub>3;n_count_sub-=4,n_count+=4){ 477 | //call the m layer with n=4 478 | macro_n4 479 | } 480 | for
(;n_count_sub>1;n_count_sub-=2,n_count+=2){ 481 | //call the m layer with n=2 482 | } 483 | for (;n_count_sub>0;n_count_sub-=1,n_count+=1){ 484 | //call the m layer with n=1 485 | } 486 | } 487 | 488 | 489 | void mydgemm_cpu_v15(\ 490 | int M, \ 491 | int N, \ 492 | int K, \ 493 | double alpha, \ 494 | double *A, \ 495 | int LDA, \ 496 | double *B, \ 497 | int LDB, \ 498 | double beta, \ 499 | double *C, \ 500 | int LDC)\ 501 | { 502 | if (beta != 1.0) scale_c_k15(C,M,N,LDC,beta); 503 | double *b_buffer = (double *)aligned_alloc(4096,K_BLOCKING*N_BLOCKING*sizeof(double)); 504 | double *a_buffer = (double *)aligned_alloc(4096,K_BLOCKING*M_BLOCKING*sizeof(double)); 505 | int m_count, n_count, k_count; 506 | int m_inc, n_inc, k_inc; 507 | for (n_count=0;n_count<N;n_count+=n_inc){ 508 | n_inc=(N-n_count>N_BLOCKING)?N_BLOCKING:N-n_count; 509 | for (k_count=0;k_count<K;k_count+=k_inc){ 510 | k_inc=(K-k_count>K_BLOCKING)?K_BLOCKING:K-k_count; 511 | packing_b_24x8_edge_version2_k15(B+k_count+n_count*LDB,b_buffer,LDB,k_inc,n_inc); 512 | for (m_count=0;m_count<M;m_count+=m_inc){ 513 | m_inc=(M-m_count>M_BLOCKING)?M_BLOCKING:M-m_count; 514 | packing_a_24x8_edge_k15(alpha, A+m_count+k_count*LDA,a_buffer,LDA,m_inc,k_inc); 515 | //macro kernel: to compute C += A_tilt * B_tilt 516 | macro_kernel_k15(a_buffer, b_buffer, m_inc, n_inc, k_inc, &C(m_count, n_count), LDC); 517 | } 518 | } 519 | } 520 | free(a_buffer);free(b_buffer); 521 | } -------------------------------------------------------------------------------- /include/kernel16.h: -------------------------------------------------------------------------------- 1 | //#define TIMER 1 2 | #include "immintrin.h" 3 | #include 4 | #ifdef TIMER 5 | #include "utils.h" 6 | #endif 7 | #define A(i,j) A[(i)+(j)*LDA] 8 | #define B(i,j) B[(i)+(j)*LDB] 9 | #define C(i,j) C[(i)+(j)*LDC] 10 | #define M_BLOCKING 192 11 | #define N_BLOCKING 2048 12 | #define K_BLOCKING 384 13 | 14 | 15 | void scale_c_k16(double *C,int M, int N, int LDC, double scalar){ 16 | int i,j; 17 | for (i=0;i1;count_second+=2,count_sub-=2){ 30 |
tosrc1=src+count_second*leading_dim;tosrc2=tosrc1+leading_dim; 31 | for (count_first=0;count_first0;count_second++,count_sub-=1){ 37 | tosrc1=src+count_second*leading_dim; 38 | for (count_first=0;count_first23;count_first+=24,count_sub-=24){ 52 | tosrc=src+count_first; 53 | for(count_second=0;count_second7;count_first+=8,count_sub-=8){ 63 | tosrc=src+count_first; 64 | for(count_second=0;count_second1;count_first+=2,count_sub-=2){ 71 | tosrc=src+count_first; 72 | for(count_second=0;count_second0;count_first+=1,count_sub-=1){ 79 | tosrc=src+count_first; 80 | for(count_second=0;count_second7;n_count_sub-=8,n_count+=8){ 484 | //call the m layer with n=8; 485 | macro_n8 486 | } 487 | for (;n_count_sub>3;n_count_sub-=4,n_count+=4){ 488 | //call the m layer with n=4 489 | macro_n4 490 | } 491 | for (;n_count_sub>1;n_count_sub-=2,n_count+=2){ 492 | //call the m layer with n=2 493 | } 494 | for (;n_count_sub>0;n_count_sub-=1,n_count+=1){ 495 | //call the m layer with n=1 496 | } 497 | } 498 | 499 | 500 | void mydgemm_cpu_v16(\ 501 | int M, \ 502 | int N, \ 503 | int K, \ 504 | double alpha, \ 505 | double *A, \ 506 | int LDA, \ 507 | double *B, \ 508 | int LDB, \ 509 | double beta, \ 510 | double *C, \ 511 | int LDC)\ 512 | { 513 | if (beta != 1.0) scale_c_k16(C,M,N,LDC,beta); 514 | double *b_buffer = (double *)aligned_alloc(4096,K_BLOCKING*N_BLOCKING*sizeof(double)); 515 | double *a_buffer = (double *)aligned_alloc(4096,K_BLOCKING*M_BLOCKING*sizeof(double)); 516 | int m_count, n_count, k_count; 517 | int m_inc, n_inc, k_inc; 518 | for (n_count=0;n_count<N;n_count+=n_inc){ 519 | n_inc=(N-n_count>N_BLOCKING)?N_BLOCKING:N-n_count; 520 | for (k_count=0;k_count<K;k_count+=k_inc){ 521 | k_inc=(K-k_count>K_BLOCKING)?K_BLOCKING:K-k_count; 522 | packing_b_24x8_edge_version2_k16(B+k_count+n_count*LDB,b_buffer,LDB,k_inc,n_inc); 523 | for (m_count=0;m_count<M;m_count+=m_inc){ 524 | m_inc=(M-m_count>M_BLOCKING)?M_BLOCKING:M-m_count; 525 | packing_a_24x8_edge_k16(alpha, A+m_count+k_count*LDA,a_buffer,LDA,m_inc,k_inc); 526 | //macro kernel: to compute C += A_tilt * B_tilt 527 | macro_kernel_k16(a_buffer, b_buffer,
m_inc, n_inc, k_inc, &C(m_count, n_count), LDC); 528 | } 529 | } 530 | } 531 | free(a_buffer);free(b_buffer); 532 | } -------------------------------------------------------------------------------- /include/kernel2.h: -------------------------------------------------------------------------------- 1 | #define A(i,j) A[(i)+(j)*LDA] 2 | #define B(i,j) B[(i)+(j)*LDB] 3 | #define C(i,j) C[(i)+(j)*LDC] 4 | 5 | void scale_c_k2(double *C,int M, int N, int LDC, double scalar){ 6 | int i,j; 7 | for (i=0;iN_BLOCKING)?N_BLOCKING:N-n_count; 98 | for (k_count=0;k_count<K;k_count+=k_inc){ 99 | k_inc=(K-k_count>K_BLOCKING)?K_BLOCKING:K-k_count; 100 | for (m_count=0;m_count<M;m_count+=m_inc){ 101 | m_inc=(M-m_count>M_BLOCKING)?M_BLOCKING:M-m_count; 102 | //macro kernel: to compute C += A_tilt * B_tilt 103 | macro_kernel_gemm_k8(m_inc,n_inc,k_inc,alpha,&A(m_count,k_count), LDA, &B(k_count,n_count), LDB, &C(m_count, n_count), LDC); 104 | } 105 | } 106 | } 107 | 108 | } 109 | -------------------------------------------------------------------------------- /include/kernel9.h: -------------------------------------------------------------------------------- 1 | #include "immintrin.h" 2 | #define A(i,j) A[(i)+(j)*LDA] 3 | #define B(i,j) B[(i)+(j)*LDB] 4 | #define C(i,j) C[(i)+(j)*LDC] 5 | #define M_BLOCKING 192 6 | #define N_BLOCKING 2048 7 | #define K_BLOCKING 384 8 | void scale_c_k9(double *C,int M, int N, int LDC, double scalar){ 9 | int i,j; 10 | for (i=0;i7;count_first+=8,count_sub-=8){ 23 | tosrc=src+count_first; 24 | for(count_second=0;count_second3;count_first+=4,count_sub-=4){ 31 | tosrc=src+count_first; 32 | for(count_second=0;count_secondN_BLOCKING)?N_BLOCKING:N-n_count; 174 | for (k_count=0;k_count<K;k_count+=k_inc){ 175 | k_inc=(K-k_count>K_BLOCKING)?K_BLOCKING:K-k_count; 176 | packing_b_k9(B+k_count+n_count*LDB,b_buffer,LDB,k_inc,n_inc); 177 | for (m_count=0;m_count<M;m_count+=m_inc){ 178 | m_inc=(M-m_count>M_BLOCKING)?M_BLOCKING:M-m_count; 179 | packing_a_k9(A+m_count+k_count*LDA,a_buffer,LDA,m_inc,k_inc); 180 | //macro kernel: to compute C += A_tilt * B_tilt 181 | macro_kernel_gemm_k9(m_inc,n_inc,k_inc,alpha,a_buffer, LDA, b_buffer, LDB,
&C(m_count, n_count), LDC); 182 | } 183 | } 184 | } 185 | free(a_buffer);free(b_buffer); 186 | } 187 | -------------------------------------------------------------------------------- /kernels.h: -------------------------------------------------------------------------------- 1 | #include "./include/kernel1.h" 2 | #include "./include/kernel2.h" 3 | #include "./include/kernel3.h" 4 | #include "./include/kernel4.h" 5 | #include "./include/kernel5.h" 6 | #include "./include/kernel6.h" 7 | #include "./include/kernel7.h" 8 | #include "./include/kernel8.h" 9 | #include "./include/kernel9.h" 10 | #include "./include/kernel10.h" 11 | #include "./include/kernel11.h" 12 | #include "./include/kernel12.h" 13 | #include "./include/kernel13.h" 14 | #include "./include/kernel14.h" 15 | #include "./include/kernel15.h" 16 | #include "./include/kernel16.h" 17 | #include "./include/kernel17.h" 18 | #include "./include/kernel18.h" 19 | #include "./include/kernel19.h" -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | rm res* 2 | for i in {0..19..1} 3 | do 4 | export MKL_NUM_THREADS=1 5 | export OMP_NUM_THREADS=1 6 | file_name="res_${i}.txt" 7 | ./dgemm_x86 $i >> ${file_name} 8 | done 9 | -------------------------------------------------------------------------------- /test.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include "utils.h" 5 | #include "mkl.h" 6 | 7 | int main(int argc, char *argv[]){ 8 | if (argc != 2) { 9 | printf("Please select a kernel (range 0 - 19, here 0 is for Intel MKL).\n"); 10 | exit(-1); 11 | } 12 | int SIZE[30]={100,200,300,400,500,600,700,800,900,1000,1100,\ 13 | 1200,1300,1400,1500,1600,1700,1800,1900,2000,\ 14 | 2100,2200,2300,2400,2500,2600,2700,2800,2900,3000};//testing 100-3000 square matrices 15 | int kernel_num=atoi(argv[1]); 16 | if 
(kernel_num<0||kernel_num>19) { 17 | printf("Please enter a valid kernel number (0-19).\n"); 18 | exit(-2); 19 | } 20 | int m, n, k,max_size=3000; 21 | int n_count,N=3,upper_limit; 22 | if (kernel_num<=4&&kernel_num!=0) upper_limit=10; 23 | else upper_limit=30; 24 | double *A=NULL,*B=NULL,*C=NULL,*C_ref=NULL; 25 | double alpha = 2.0, beta = 0.;//two arbitrary input parameters 26 | double t0,t1; 27 | A=(double *)malloc(sizeof(double)*max_size*max_size); 28 | B=(double *)malloc(sizeof(double)*max_size*max_size); 29 | C=(double *)malloc(sizeof(double)*max_size*max_size); 30 | C_ref=(double *)malloc(sizeof(double)*max_size*max_size); 31 | 32 | randomize_matrix(A,max_size,max_size);randomize_matrix(B,max_size,max_size);randomize_matrix(C,max_size,max_size);copy_matrix(C,C_ref,max_size*max_size); 33 | for (int i_count=0;i_count 2 | #include 3 | #include "utils.h" 4 | #include 5 | #include 6 | #include 7 | #include "kernels.h" 8 | #include "mkl.h" 9 | void print_vector(double *vec, int n){ 10 | int i; 11 | for (i = 0; i < n; i++){ 12 | printf(" %5.2f, ", vec[i]); 13 | } 14 | printf("\n"); 15 | } 16 | 17 | void print_matrix_lda(const double *A, int lda, int m, int n) 18 | { 19 | int i,j; 20 | double *a_curr_pos=A; 21 | printf("["); 22 | for (i = 0; i < n; i++){ 23 | for (j=0;j 1e-2) { 80 | printf("error.
%5.2f,%5.2f,%d\n", mat1[i],mat2[i],i); 81 | return false; 82 | } 83 | } 84 | return true; 85 | } 86 | 87 | void test_mkl(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 88 | cblas_dgemm(CblasColMajor, CblasNoTrans,CblasNoTrans,m,n,k,alpha,A,m,B,k,beta,C,m); 89 | } 90 | 91 | void test_mydgemm_v1(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 92 | mydgemm_cpu_v1(m,n,k,alpha,A,m,B,k,beta,C,m); 93 | } 94 | 95 | void test_mydgemm_v2(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 96 | mydgemm_cpu_v2(m,n,k,alpha,A,m,B,k,beta,C,m); 97 | } 98 | 99 | void test_mydgemm_v3(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 100 | mydgemm_cpu_v3(m,n,k,alpha,A,m,B,k,beta,C,m); 101 | } 102 | 103 | void test_mydgemm_v4(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 104 | mydgemm_cpu_v4(m,n,k,alpha,A,m,B,k,beta,C,m); 105 | } 106 | 107 | void test_mydgemm_v5(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 108 | mydgemm_cpu_v5(m,n,k,alpha,A,m,B,k,beta,C,m); 109 | } 110 | 111 | void test_mydgemm_v6(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 112 | mydgemm_cpu_v6(m,n,k,alpha,A,m,B,k,beta,C,m); 113 | } 114 | 115 | void test_mydgemm_v7(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 116 | mydgemm_cpu_v7(m,n,k,alpha,A,m,B,k,beta,C,m); 117 | } 118 | 119 | void test_mydgemm_v8(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 120 | mydgemm_cpu_v8(m,n,k,alpha,A,m,B,k,beta,C,m); 121 | } 122 | 123 | void test_mydgemm_v9(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 124 | mydgemm_cpu_v9(m,n,k,alpha,A,m,B,k,beta,C,m); 125 | } 126 | 127 | void test_mydgemm_v10(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 128 | mydgemm_cpu_v10(m,n,k,alpha,A,m,B,k,beta,C,m); 129 | } 130 | 131 | void 
test_mydgemm_v11(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 132 | mydgemm_cpu_v11(m,n,k,alpha,A,m,B,k,beta,C,m); 133 | } 134 | 135 | void test_mydgemm_v12(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 136 | mydgemm_cpu_v12(m,n,k,alpha,A,m,B,k,beta,C,m); 137 | } 138 | 139 | void test_mydgemm_v13(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 140 | mydgemm_cpu_v13(m,n,k,alpha,A,m,B,k,beta,C,m); 141 | } 142 | 143 | void test_mydgemm_v14(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 144 | mydgemm_cpu_v14(m,n,k,alpha,A,m,B,k,beta,C,m); 145 | } 146 | 147 | void test_mydgemm_v15(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 148 | mydgemm_cpu_v15(m,n,k,alpha,A,m,B,k,beta,C,m); 149 | } 150 | 151 | void test_mydgemm_v16(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 152 | mydgemm_cpu_v16(m,n,k,alpha,A,m,B,k,beta,C,m); 153 | } 154 | 155 | void test_mydgemm_v17(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 156 | mydgemm_cpu_v17(m,n,k,alpha,A,m,B,k,beta,C,m); 157 | } 158 | 159 | void test_mydgemm_v18(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 160 | mydgemm_cpu_v18(m,n,k,alpha,A,m,B,k,beta,C,m); 161 | } 162 | 163 | void test_mydgemm_v19(int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 164 | mydgemm_cpu_v19(m,n,k,alpha,A,m,B,k,beta,C,m); 165 | } 166 | 167 | void test_kernel(int kernel_num,int m,int n,int k,double alpha,double *A,double *B,double beta,double *C){ 168 | switch (kernel_num){ 169 | case 0: test_mkl(m,n,k,alpha,A,B,beta,C); break; 170 | case 1: test_mydgemm_v1(m,n,k,alpha,A,B,beta,C); break; 171 | case 2: test_mydgemm_v2(m,n,k,alpha,A,B,beta,C); break; 172 | case 3: test_mydgemm_v3(m,n,k,alpha,A,B,beta,C); break; 173 | case 4: test_mydgemm_v4(m,n,k,alpha,A,B,beta,C); break; 174 | case 5: 
test_mydgemm_v5(m,n,k,alpha,A,B,beta,C); break; 175 | case 6: test_mydgemm_v6(m,n,k,alpha,A,B,beta,C); break; 176 | case 7: test_mydgemm_v7(m,n,k,alpha,A,B,beta,C); break; 177 | case 8: test_mydgemm_v8(m,n,k,alpha,A,B,beta,C); break; 178 | case 9: test_mydgemm_v9(m,n,k,alpha,A,B,beta,C); break; 179 | case 10: test_mydgemm_v10(m,n,k,alpha,A,B,beta,C); break; 180 | case 11: test_mydgemm_v11(m,n,k,alpha,A,B,beta,C); break; 181 | case 12: test_mydgemm_v12(m,n,k,alpha,A,B,beta,C); break; 182 | case 13: test_mydgemm_v13(m,n,k,alpha,A,B,beta,C); break; 183 | case 14: test_mydgemm_v14(m,n,k,alpha,A,B,beta,C); break; 184 | case 15: test_mydgemm_v15(m,n,k,alpha,A,B,beta,C); break; 185 | case 16: test_mydgemm_v16(m,n,k,alpha,A,B,beta,C); break; 186 | case 17: test_mydgemm_v17(m,n,k,alpha,A,B,beta,C); break; 187 | case 18: test_mydgemm_v18(m,n,k,alpha,A,B,beta,C); break; 188 | case 19: test_mydgemm_v19(m,n,k,alpha,A,B,beta,C); break; 189 | default: break; 190 | } 191 | } -------------------------------------------------------------------------------- /utils.h: -------------------------------------------------------------------------------- 1 | #ifndef _UTIL_H_ 2 | #define _UTIL_H_ 3 | #include "sys/time.h" 4 | #include <stdbool.h> 5 | 6 | void randomize_matrix(double *A, int m, int n); 7 | double get_sec(); 8 | void print_matrix(const double *A, int m, int n); 9 | void print_vector(double *vec, int n); 10 | void copy_matrix(double *src, double *dest, int n); 11 | bool verify_matrix(double *mat1, double *mat2, int n); 12 | void test_kernel(int kernel_num,int m,int n,int k,double alpha,double *A,double *B,double beta,double *C); 13 | #endif // _UTIL_H_ --------------------------------------------------------------------------------
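The `mydgemm_cpu_v*` drivers above all share the same three-level cache-blocking loop nest (N, then K, then M), with each block size clamped at the matrix edge. The sketch below isolates that loop structure in plain scalar C, assuming column-major storage as in the `A(i,j)` macros. It is a simplified stand-in and not the repository's code: there is no packing and no AVX-512 micro kernel, and the names `blocked_dgemm_sketch`, `SKETCH_MB`, `SKETCH_NB`, `SKETCH_KB` and `min_int` are hypothetical.

```c
#include <stddef.h>

/* Illustrative block sizes, chosen to mirror the M_BLOCKING / N_BLOCKING /
 * K_BLOCKING values used by the kernels (an assumption for this sketch). */
enum { SKETCH_MB = 192, SKETCH_NB = 2048, SKETCH_KB = 384 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical scalar reference: C = alpha*A*B + beta*C, column-major.
 * Only the blocking and edge clamping are modelled here. */
void blocked_dgemm_sketch(int M, int N, int K, double alpha,
                          const double *A, int LDA,
                          const double *B, int LDB,
                          double beta, double *C, int LDC) {
    /* Scale C by beta once, before accumulating any block products. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            C[i + (size_t)j * LDC] *= beta;

    for (int n0 = 0; n0 < N; n0 += SKETCH_NB) {
        int n_inc = min_int(SKETCH_NB, N - n0);
        for (int k0 = 0; k0 < K; k0 += SKETCH_KB) {
            int k_inc = min_int(SKETCH_KB, K - k0);
            for (int m0 = 0; m0 < M; m0 += SKETCH_MB) {
                /* The M edge is clamped against M, not N, so that
                 * non-square problems stay within bounds. */
                int m_inc = min_int(SKETCH_MB, M - m0);
                /* Stand-in for the macro kernel: C_block += alpha*A_block*B_block. */
                for (int j = n0; j < n0 + n_inc; j++)
                    for (int p = k0; p < k0 + k_inc; p++) {
                        double b = alpha * B[p + (size_t)j * LDB];
                        for (int i = m0; i < m0 + m_inc; i++)
                            C[i + (size_t)j * LDC] += A[i + (size_t)p * LDA] * b;
                    }
            }
        }
    }
}
```

A call such as `blocked_dgemm_sketch(m,n,k,alpha,A,m,B,k,beta,C,m)` follows the same argument order as the `test_mydgemm_v*` wrappers, so a routine like this could serve as a slow cross-check for rectangular shapes that the square-only benchmark does not exercise.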