├── README.md ├── figures ├── compare_MMult-1x4-3_MMult-1x4-4.png ├── compare_MMult-1x4-3_MMult-1x4-5.png ├── compare_MMult-1x4-3_MMult-4x4-3.png ├── compare_MMult-1x4-4_MMult-1x4-5.png ├── compare_MMult-1x4-4_MMult-4x4-4.png ├── compare_MMult-1x4-5_MMult-1x4-6.png ├── compare_MMult-1x4-5_MMult-4x4-5.png ├── compare_MMult-1x4-6_MMult-1x4-7.png ├── compare_MMult-1x4-6_MMult-4x4-6.png ├── compare_MMult-1x4-7_MMult-1x4-8.png ├── compare_MMult-1x4-7_MMult-4x4-7.png ├── compare_MMult-1x4-8_MMult-1x4-9.png ├── compare_MMult-1x4-9_MMult-4x4-10.png ├── compare_MMult-4x4-10_MMult-4x4-11.png ├── compare_MMult-4x4-11_MMult-4x4-12.png ├── compare_MMult-4x4-11_MMult-4x4-13.png ├── compare_MMult-4x4-12_MMult-4x4-13.png ├── compare_MMult-4x4-13_MMult-4x4-14.png ├── compare_MMult-4x4-13_MMult-4x4-15.png ├── compare_MMult-4x4-13_MMult_4x4_15.png ├── compare_MMult-4x4-14_MMult-4x4-15.png ├── compare_MMult-4x4-3_MMult-4x4-4.png ├── compare_MMult-4x4-4_MMult-4x4-5.png ├── compare_MMult-4x4-5_MMult-4x4-6.png ├── compare_MMult-4x4-6_MMult-4x4-7.png ├── compare_MMult-4x4-7_MMult-4x4-8.png ├── compare_MMult-4x4-8_MMult-4x4-9.png ├── compare_MMult-4x4-9_MMult-4x4-10.png ├── compare_MMult0_MMult-1x4-5.png ├── compare_MMult0_MMult-1x4-9.png ├── compare_MMult0_MMult-4x4-10.png ├── compare_MMult0_MMult-4x4-11.png ├── compare_MMult0_MMult-4x4-13.png ├── compare_MMult0_MMult-4x4-15.png ├── compare_MMult0_MMult-4x4-5.png ├── compare_MMult0_MMult0.png ├── compare_MMult0_MMult1.png ├── compare_MMult0_MMult2.png ├── compare_MMult0_MMult_4x4_15.png ├── compare_MMult0_vs_MMult0.png ├── compare_MMult1_MMult2.png ├── compare_MMult2_MMult-1x4-3.png ├── compare_MMult2_MMult-4x4-3.png ├── graph_10_vs_11.png ├── graph_1_vs_2.png ├── graph_2_vs_3.png ├── graph_3_vs_4.png ├── graph_4_vs_5.png ├── graph_5_vs_6.png ├── graph_6_vs_7.png ├── graph_7_vs_8.png ├── graph_7_vs_9.png ├── graph_8_vs_10.png └── graph_8_vs_9.png └── src ├── HowToOptimizeGemm.tar.gz ├── HowToOptimizeGemm ├── MMult0.c ├── MMult1.c ├── PlotAll.m ├── PlotAll.py ├── REF_MMult.c ├── compare_matrices.c ├── copy_matrix.c ├── dclock.c ├── makefile ├── parameters.h ├── print_matrix.c ├── proc_parameters.m ├── random_matrix.c └── test_MMult.c ├── MMult1.c ├── MMult2.c ├── MMult_1x4_3.c ├── MMult_1x4_4.c ├── MMult_1x4_5.c ├── MMult_1x4_6.c ├── MMult_1x4_7.c ├── MMult_1x4_8.c ├── MMult_1x4_9.c ├── MMult_4x4_10.c ├── MMult_4x4_11.c ├── MMult_4x4_12.c ├── MMult_4x4_13.c ├── MMult_4x4_14.c ├── MMult_4x4_15.c ├── MMult_4x4_3.c ├── MMult_4x4_4.c ├── MMult_4x4_5.c ├── MMult_4x4_6.c ├── MMult_4x4_7.c ├── MMult_4x4_8.c └── MMult_4x4_9.c /README.md: -------------------------------------------------------------------------------- 1 | # How To Optimize Gemm wiki pages 2 | https://github.com/flame/how-to-optimize-gemm/wiki 3 | 4 | Copyright by Prof. Robert van de Geijn (rvdg@cs.utexas.edu). 5 | 6 | Adapted to Github Markdown Wiki by Jianyu Huang (jianyu@cs.utexas.edu). 
7 | 8 | # Table of contents 9 | 10 | * [The GotoBLAS/BLIS Approach to Optimizing Matrix-Matrix Multiplication - Step-by-Step](../../wiki#the-gotoblasblis-approach-to-optimizing-matrix-matrix-multiplication---step-by-step) 11 | * [NOTICE ON ACADEMIC HONESTY](../../wiki#notice-on-academic-honesty) 12 | * [References](../../wiki#references) 13 | * [Set Up](../../wiki#set-up) 14 | * [Step-by-step optimizations](../../wiki#step-by-step-optimizations) 15 | * [Computing four elements of C at a time](../../wiki#computing-four-elements-of-c-at-a-time) 16 | * [Hiding computation in a subroutine](../../wiki#hiding-computation-in-a-subroutine) 17 | * [Computing four elements at a time](../../wiki#computing-four-elements-at-a-time) 18 | * [Further optimizing](../../wiki#further-optimizing) 19 | * [Computing a 4 x 4 block of C at a time](../../wiki#computing-a-4-x-4-block-of-c-at-a-time) 20 | * [Repeating the same optimizations](../../wiki#repeating-the-same-optimizations) 21 | * [Further optimizing](../../wiki#further-optimizing-1) 22 | * [Blocking to maintain performance](../../wiki#blocking-to-maintain-performance) 23 | * [Packing into contiguous memory](../../wiki#packing-into-contiguous-memory) 24 | * [Acknowledgement](../../wiki#acknowledgement) 25 | 26 | # Related Links 27 | * [BLISlab: A Sandbox for Optimizing GEMM](https://github.com/flame/blislab) 28 | * [GEMM: From Pure C to SSE Optimized Micro Kernels](http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/) 29 | 30 | # Acknowledgement 31 | This material was partially sponsored by grants from the National Science Foundation (Awards ACI-1148125/1340293 and ACI-1550493). 32 | 33 | _Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF)._ 34 | -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-3_MMult-1x4-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-3_MMult-1x4-4.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-3_MMult-1x4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-3_MMult-1x4-5.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-3_MMult-4x4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-3_MMult-4x4-3.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-4_MMult-1x4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-4_MMult-1x4-5.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-4_MMult-4x4-4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-4_MMult-4x4-4.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-5_MMult-1x4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-5_MMult-1x4-6.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-5_MMult-4x4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-5_MMult-4x4-5.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-6_MMult-1x4-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-6_MMult-1x4-7.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-6_MMult-4x4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-6_MMult-4x4-6.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-7_MMult-1x4-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-7_MMult-1x4-8.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-7_MMult-4x4-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-7_MMult-4x4-7.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-8_MMult-1x4-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-8_MMult-1x4-9.png -------------------------------------------------------------------------------- /figures/compare_MMult-1x4-9_MMult-4x4-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-1x4-9_MMult-4x4-10.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-10_MMult-4x4-11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-10_MMult-4x4-11.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-11_MMult-4x4-12.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-11_MMult-4x4-12.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-11_MMult-4x4-13.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-11_MMult-4x4-13.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-12_MMult-4x4-13.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-12_MMult-4x4-13.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-13_MMult-4x4-14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-13_MMult-4x4-14.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-13_MMult-4x4-15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-13_MMult-4x4-15.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-13_MMult_4x4_15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-13_MMult_4x4_15.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-14_MMult-4x4-15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-14_MMult-4x4-15.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-3_MMult-4x4-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-3_MMult-4x4-4.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-4_MMult-4x4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-4_MMult-4x4-5.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-5_MMult-4x4-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-5_MMult-4x4-6.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-6_MMult-4x4-7.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-6_MMult-4x4-7.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-7_MMult-4x4-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-7_MMult-4x4-8.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-8_MMult-4x4-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-8_MMult-4x4-9.png -------------------------------------------------------------------------------- /figures/compare_MMult-4x4-9_MMult-4x4-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult-4x4-9_MMult-4x4-10.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-1x4-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-1x4-5.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-1x4-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-1x4-9.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-4x4-10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-4x4-10.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-4x4-11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-4x4-11.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-4x4-13.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-4x4-13.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-4x4-15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-4x4-15.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult-4x4-5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult-4x4-5.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult0.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult1.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult2.png -------------------------------------------------------------------------------- /figures/compare_MMult0_MMult_4x4_15.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_MMult_4x4_15.png -------------------------------------------------------------------------------- /figures/compare_MMult0_vs_MMult0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult0_vs_MMult0.png -------------------------------------------------------------------------------- /figures/compare_MMult1_MMult2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult1_MMult2.png -------------------------------------------------------------------------------- /figures/compare_MMult2_MMult-1x4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult2_MMult-1x4-3.png -------------------------------------------------------------------------------- /figures/compare_MMult2_MMult-4x4-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/compare_MMult2_MMult-4x4-3.png -------------------------------------------------------------------------------- /figures/graph_10_vs_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_10_vs_11.png -------------------------------------------------------------------------------- /figures/graph_1_vs_2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_1_vs_2.png -------------------------------------------------------------------------------- /figures/graph_2_vs_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_2_vs_3.png -------------------------------------------------------------------------------- /figures/graph_3_vs_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_3_vs_4.png -------------------------------------------------------------------------------- /figures/graph_4_vs_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_4_vs_5.png -------------------------------------------------------------------------------- /figures/graph_5_vs_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_5_vs_6.png -------------------------------------------------------------------------------- /figures/graph_6_vs_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_6_vs_7.png -------------------------------------------------------------------------------- /figures/graph_7_vs_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_7_vs_8.png -------------------------------------------------------------------------------- /figures/graph_7_vs_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_7_vs_9.png -------------------------------------------------------------------------------- /figures/graph_8_vs_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_8_vs_10.png -------------------------------------------------------------------------------- /figures/graph_8_vs_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/figures/graph_8_vs_9.png -------------------------------------------------------------------------------- /src/HowToOptimizeGemm.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flame/how-to-optimize-gemm/40b1ec685db23e63ed92f3f60b0fef4cd193455d/src/HowToOptimizeGemm.tar.gz -------------------------------------------------------------------------------- /src/HowToOptimizeGemm/MMult0.c: -------------------------------------------------------------------------------- 1 | /* Create macros so that the matrices are stored in 
column-major order */ 2 | 3 | #define A(i,j) a[ (j)*lda + (i) ] 4 | #define B(i,j) b[ (j)*ldb + (i) ] 5 | #define C(i,j) c[ (j)*ldc + (i) ] 6 | 7 | /* Routine for computing C = A * B + C */ 8 | 9 | void MY_MMult( int m, int n, int k, double *a, int lda, 10 | double *b, int ldb, 11 | double *c, int ldc ) 12 | { 13 | int i, j, p; 14 | 15 | for ( i=0; i None: 23 | self.attrs = {} 24 | with open(file_name) as file: 25 | self.toks = file.read().split() 26 | self.toksi = 0 27 | file.close() 28 | self.attrs = self.parse() 29 | 30 | def next(self): 31 | tok = self.toks[self.toksi] 32 | self.toksi += 1 33 | return tok 34 | 35 | def get_var_name(self): 36 | return self.next() 37 | 38 | def get_symbol(self, sym): 39 | tok = self.next() 40 | assert(tok == sym) 41 | return tok 42 | 43 | def get_value(self): 44 | value = None 45 | tok = self.next() 46 | if tok == '[': 47 | # list 48 | value = [] 49 | tok = self.next() 50 | while not tok.startswith(']'): 51 | value.append(float(tok)) 52 | tok = self.next() 53 | elif tok.startswith("'"): 54 | value = tok[1:-2] 55 | 56 | assert value != None 57 | return value 58 | 59 | def parse(self): 60 | res = {} 61 | while self.toksi < len(self.toks): 62 | var = self.get_var_name() 63 | self.get_symbol('=') 64 | val = self.get_value() 65 | res[var] = val 66 | return res 67 | 68 | def __getattr__(self, name): 69 | return self.attrs[name] 70 | 71 | old = Parser("output_old.m") 72 | new = Parser("output_new.m") 73 | 74 | #print(old) 75 | #print(new) 76 | 77 | old_data = np.array(old.MY_MMult).reshape(-1, 3) 78 | new_data = np.array(new.MY_MMult).reshape(-1, 3) 79 | 80 | max_gflops = nflops_per_cycle * nprocessors * GHz_of_processor; 81 | 82 | fig, ax = plt.subplots() 83 | ax.plot(old_data[:,0], old_data[:,1], 'bo-.', label='old:' + old.version) 84 | ax.plot(new_data[:,0], new_data[:,1], 'r-*', label='new:' + new.version) 85 | 86 | ax.set(xlabel='m = n = k', ylabel='GFLOPS/sec.', 87 | title="OLD = {}, NEW = {}".format(old.version, new.version)) 88 | ax.grid() 89 | ax.legend() 90 | 91 | ax.set_xlim([old_data[0,0], old_data[-1,0]]) 92 | ax.set_ylim([0, max_gflops]) 93 | 94 | # fig.savefig("test.png") 95 | plt.show() -------------------------------------------------------------------------------- /src/HowToOptimizeGemm/REF_MMult.c: -------------------------------------------------------------------------------- 1 | /* Create macros so that the matrices are stored in column-major order */ 2 | 3 | #define A(i,j) a[ (j)*lda + (i) ] 4 | #define B(i,j) b[ (j)*ldb + (i) ] 5 | #define C(i,j) c[ (j)*ldc + (i) ] 6 | 7 | /* Routine for computing C = A * B + C */ 8 | 9 | void REF_MMult( int m, int n, int k, double *a, int lda, 10 | double *b, int ldb, 11 | double *c, int ldc ) 12 | { 13 | int i, j, p; 14 | 15 | for ( i=0; i max_diff ? 
diff : max_diff ); 14 | } 15 | 16 | return max_diff; 17 | } 18 | 19 | -------------------------------------------------------------------------------- /src/HowToOptimizeGemm/copy_matrix.c: -------------------------------------------------------------------------------- 1 | #define A( i, j ) a[ (j)*lda + (i) ] 2 | #define B( i, j ) b[ (j)*ldb + (i) ] 3 | 4 | void copy_matrix( int m, int n, double *a, int lda, double *b, int ldb ) 5 | { 6 | int i, j; 7 | 8 | for ( j=0; j 2 | #include 3 | 4 | static double gtod_ref_time_sec = 0.0; 5 | 6 | /* Adapted from the bl2_clock() routine in the BLIS library */ 7 | 8 | double dclock() 9 | { 10 | double the_time, norm_sec; 11 | struct timeval tv; 12 | 13 | gettimeofday( &tv, NULL ); 14 | 15 | if ( gtod_ref_time_sec == 0.0 ) 16 | gtod_ref_time_sec = ( double ) tv.tv_sec; 17 | 18 | norm_sec = ( double ) tv.tv_sec - gtod_ref_time_sec; 19 | 20 | the_time = norm_sec + tv.tv_usec * 1.0e-6; 21 | 22 | return the_time; 23 | } 24 | 25 | -------------------------------------------------------------------------------- /src/HowToOptimizeGemm/makefile: -------------------------------------------------------------------------------- 1 | OLD := MMult0 2 | NEW := MMult0 3 | # 4 | # sample makefile 5 | # 6 | 7 | CC := gcc 8 | LINKER := $(CC) 9 | CFLAGS := -O2 -Wall -msse3 10 | LDFLAGS := -lm 11 | 12 | UTIL := copy_matrix.o \ 13 | compare_matrices.o \ 14 | random_matrix.o \ 15 | dclock.o \ 16 | REF_MMult.o \ 17 | print_matrix.o 18 | 19 | TEST_OBJS := test_MMult.o $(NEW).o 20 | 21 | %.o: %.c 22 | $(CC) $(CFLAGS) -c $< -o $@ 23 | %.o: %.c 24 | $(CC) $(CFLAGS) -c $< -o $@ 25 | 26 | all: 27 | make clean; 28 | make test_MMult.x 29 | 30 | test_MMult.x: $(TEST_OBJS) $(UTIL) parameters.h 31 | $(LINKER) $(TEST_OBJS) $(UTIL) $(LDFLAGS) \ 32 | $(BLAS_LIB) -o $(TEST_BIN) $@ 33 | 34 | run: 35 | make all 36 | export OMP_NUM_THREADS=1 37 | export GOTO_NUM_THREADS=1 38 | echo "version = '$(NEW)';" > output_$(NEW).m 39 | ./test_MMult.x >> output_$(NEW).m 40 | cp output_$(OLD).m output_old.m 41 | cp output_$(NEW).m output_new.m 42 | 43 | clean: 44 | rm -f *.o *~ core *.x 45 | 46 | cleanall: 47 | rm -f *.o *~ core *.x output*.m *.eps *.png 48 | -------------------------------------------------------------------------------- /src/HowToOptimizeGemm/parameters.h: -------------------------------------------------------------------------------- 1 | /* 2 | In the test driver, there is a loop "for ( p=PFIRST; p<= PLAST; p+= PINC )" 3 | The below parameters set this range of values that p takes on 4 | */ 5 | #define PFIRST 40 6 | #define PLAST 800 7 | #define PINC 40 8 | 9 | /* 10 | In the test driver, the m, n, and k dimensions are set to the below 11 | values. If the value equals "-1" then that dimension is bound to the 12 | index p, given above. 13 | */ 14 | 15 | #define M -1 16 | #define N -1 17 | #define K -1 18 | 19 | /* 20 | In the test driver, each experiment is repeated NREPEATS times and 21 | the best time from these repeats is used to compute the performance 22 | */ 23 | 24 | #define NREPEATS 2 25 | 26 | /* 27 | Matrices A, B, and C are stored in two dimensional arrays with 28 | row dimensions that are greater than or equal to the row dimension 29 | of the matrix. This row dimension of the array is known as the 30 | "leading dimension" and determines the stride (the number of 31 | double precision numbers) when one goes from one element in a row 32 | to the next. Having this number larger than the row dimension of 33 | the matrix tends to adversely affect performance. 
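For example, with this column-major convention an element ( i, j ) of a matrix stored in array x with leading dimension LDX sits at x[ j*LDX + i ], which is exactly what the A( i, j ), B( i, j ), and C( i, j ) macros used throughout these sources compute, so LDX must be at least the row dimension of the matrix.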
LDX equals the 34 | leading dimension of the array that stores matrix X. If LDX=-1 35 | then the leading dimension is set to the row dimension of matrix X. 36 | */ 37 | 38 | #define LDA 1000 39 | #define LDB 1000 40 | #define LDC 1000 41 | -------------------------------------------------------------------------------- /src/HowToOptimizeGemm/print_matrix.c: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | #define A( i, j ) a[ (j)*lda + (i) ] 4 | 5 | void print_matrix( int m, int n, double *a, int lda ) 6 | { 7 | int i, j; 8 | 9 | for ( j=0; j 2 | 3 | #define A( i,j ) a[ (j)*lda + (i) ] 4 | 5 | void random_matrix( int m, int n, double *a, int lda ) 6 | { 7 | double drand48(); 8 | int i,j; 9 | 10 | for ( j=0; j 2 | // #include 3 | #include 4 | 5 | #include "parameters.h" 6 | 7 | void REF_MMult(int, int, int, double *, int, double *, int, double *, int ); 8 | void MY_MMult(int, int, int, double *, int, double *, int, double *, int ); 9 | void copy_matrix(int, int, double *, int, double *, int ); 10 | void random_matrix(int, int, double *, int); 11 | double compare_matrices( int, int, double *, int, double *, int ); 12 | 13 | double dclock(); 14 | 15 | int main() 16 | { 17 | int 18 | p, 19 | m, n, k, 20 | lda, ldb, ldc, 21 | rep; 22 | 23 | double 24 | dtime, dtime_best, 25 | gflops, 26 | diff; 27 | 28 | double 29 | *a, *b, *c, *cref, *cold; 30 | 31 | printf( "MY_MMult = [\n" ); 32 | 33 | for ( p=PFIRST; p<=PLAST; p+=PINC ){ 34 | m = ( M == -1 ? p : M ); 35 | n = ( N == -1 ? p : N ); 36 | k = ( K == -1 ? p : K ); 37 | 38 | gflops = 2.0 * m * n * k * 1.0e-09; 39 | 40 | lda = ( LDA == -1 ? m : LDA ); 41 | ldb = ( LDB == -1 ? k : LDB ); 42 | ldc = ( LDC == -1 ? m : LDC ); 43 | 44 | /* Allocate space for the matrices */ 45 | /* Note: I create an extra column in A to make sure that 46 | prefetching beyond the matrix does not cause a segfault */ 47 | a = ( double * ) malloc( lda * (k+1) * sizeof( double ) ); 48 | b = ( double * ) malloc( ldb * n * sizeof( double ) ); 49 | c = ( double * ) malloc( ldc * n * sizeof( double ) ); 50 | cold = ( double * ) malloc( ldc * n * sizeof( double ) ); 51 | cref = ( double * ) malloc( ldc * n * sizeof( double ) ); 52 | 53 | /* Generate random matrices A, B, Cold */ 54 | random_matrix( m, k, a, lda ); 55 | random_matrix( k, n, b, ldb ); 56 | random_matrix( m, n, cold, ldc ); 57 | 58 | copy_matrix( m, n, cold, ldc, cref, ldc ); 59 | 60 | /* Run the reference implementation so the answers can be compared */ 61 | 62 | REF_MMult( m, n, k, a, lda, b, ldb, cref, ldc ); 63 | 64 | /* Time the "optimized" implementation */ 65 | for ( rep=0; rep 29 | #include // SSE 30 | #include // SSE2 31 | #include // SSE3 32 | 33 | typedef union 34 | { 35 | __m128d v; 36 | double d[2]; 37 | } v2df_t; 38 | 39 | void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) 40 | { 41 | /* So, this routine computes a 4x4 block of matrix A 42 | 43 | C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ). 44 | C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ). 45 | C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ). 46 | C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ). 
47 | 48 | Notice that this routine is called with c = C( i, j ) in the 49 | previous routine, so these are actually the elements 50 | 51 | C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 ) 52 | C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 53 | C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 54 | C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 55 | 56 | in the original matrix C 57 | 58 | And now we use vector registers and instructions */ 59 | 60 | int p; 61 | 62 | v2df_t 63 | c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg, 64 | c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg, 65 | a_0p_a_1p_vreg, 66 | a_2p_a_3p_vreg, 67 | b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 68 | 69 | double 70 | /* Point to the current elements in the four columns of B */ 71 | *b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr; 72 | 73 | b_p0_pntr = &B( 0, 0 ); 74 | b_p1_pntr = &B( 0, 1 ); 75 | b_p2_pntr = &B( 0, 2 ); 76 | b_p3_pntr = &B( 0, 3 ); 77 | 78 | c_00_c_10_vreg.v = _mm_setzero_pd(); 79 | c_01_c_11_vreg.v = _mm_setzero_pd(); 80 | c_02_c_12_vreg.v = _mm_setzero_pd(); 81 | c_03_c_13_vreg.v = _mm_setzero_pd(); 82 | c_20_c_30_vreg.v = _mm_setzero_pd(); 83 | c_21_c_31_vreg.v = _mm_setzero_pd(); 84 | c_22_c_32_vreg.v = _mm_setzero_pd(); 85 | c_23_c_33_vreg.v = _mm_setzero_pd(); 86 | 87 | for ( p=0; p 52 | #include // SSE 53 | #include // SSE2 54 | #include // SSE3 55 | 56 | typedef union 57 | { 58 | __m128d v; 59 | double d[2]; 60 | } v2df_t; 61 | 62 | void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) 63 | { 64 | /* So, this routine computes a 4x4 block of matrix A 65 | 66 | C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ). 67 | C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ). 68 | C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ). 69 | C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ). 
70 | 71 | Notice that this routine is called with c = C( i, j ) in the 72 | previous routine, so these are actually the elements 73 | 74 | C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 ) 75 | C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 76 | C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 77 | C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 78 | 79 | in the original matrix C 80 | 81 | And now we use vector registers and instructions */ 82 | 83 | int p; 84 | v2df_t 85 | c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg, 86 | c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg, 87 | a_0p_a_1p_vreg, 88 | a_2p_a_3p_vreg, 89 | b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 90 | 91 | double 92 | /* Point to the current elements in the four columns of B */ 93 | *b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr; 94 | 95 | b_p0_pntr = &B( 0, 0 ); 96 | b_p1_pntr = &B( 0, 1 ); 97 | b_p2_pntr = &B( 0, 2 ); 98 | b_p3_pntr = &B( 0, 3 ); 99 | 100 | c_00_c_10_vreg.v = _mm_setzero_pd(); 101 | c_01_c_11_vreg.v = _mm_setzero_pd(); 102 | c_02_c_12_vreg.v = _mm_setzero_pd(); 103 | c_03_c_13_vreg.v = _mm_setzero_pd(); 104 | c_20_c_30_vreg.v = _mm_setzero_pd(); 105 | c_21_c_31_vreg.v = _mm_setzero_pd(); 106 | c_22_c_32_vreg.v = _mm_setzero_pd(); 107 | c_23_c_33_vreg.v = _mm_setzero_pd(); 108 | 109 | for ( p=0; p 70 | #include // SSE 71 | #include // SSE2 72 | #include // SSE3 73 | 74 | typedef union 75 | { 76 | __m128d v; 77 | double d[2]; 78 | } v2df_t; 79 | 80 | void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) 81 | { 82 | /* So, this routine computes a 4x4 block of matrix A 83 | 84 | C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ). 85 | C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ). 86 | C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ). 87 | C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ). 
88 | 89 | Notice that this routine is called with c = C( i, j ) in the 90 | previous routine, so these are actually the elements 91 | 92 | C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 ) 93 | C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 94 | C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 95 | C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 96 | 97 | in the original matrix C 98 | 99 | And now we use vector registers and instructions */ 100 | 101 | int p; 102 | v2df_t 103 | c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg, 104 | c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg, 105 | a_0p_a_1p_vreg, 106 | a_2p_a_3p_vreg, 107 | b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 108 | 109 | double 110 | /* Point to the current elements in the four columns of B */ 111 | *b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr; 112 | 113 | b_p0_pntr = &B( 0, 0 ); 114 | b_p1_pntr = &B( 0, 1 ); 115 | b_p2_pntr = &B( 0, 2 ); 116 | b_p3_pntr = &B( 0, 3 ); 117 | 118 | c_00_c_10_vreg.v = _mm_setzero_pd(); 119 | c_01_c_11_vreg.v = _mm_setzero_pd(); 120 | c_02_c_12_vreg.v = _mm_setzero_pd(); 121 | c_03_c_13_vreg.v = _mm_setzero_pd(); 122 | c_20_c_30_vreg.v = _mm_setzero_pd(); 123 | c_21_c_31_vreg.v = _mm_setzero_pd(); 124 | c_22_c_32_vreg.v = _mm_setzero_pd(); 125 | c_23_c_33_vreg.v = _mm_setzero_pd(); 126 | 127 | for ( p=0; p 70 | #include // SSE 71 | #include // SSE2 72 | #include // SSE3 73 | 74 | typedef union 75 | { 76 | __m128d v; 77 | double d[2]; 78 | } v2df_t; 79 | 80 | void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) 81 | { 82 | /* So, this routine computes a 4x4 block of matrix A 83 | 84 | C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ). 85 | C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ). 86 | C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ). 87 | C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ). 
88 | 89 | Notice that this routine is called with c = C( i, j ) in the 90 | previous routine, so these are actually the elements 91 | 92 | C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 ) 93 | C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 94 | C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 95 | C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 96 | 97 | in the original matrix C 98 | 99 | And now we use vector registers and instructions */ 100 | 101 | int p; 102 | v2df_t 103 | c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg, 104 | c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg, 105 | a_0p_a_1p_vreg, 106 | a_2p_a_3p_vreg, 107 | b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 108 | 109 | double 110 | /* Point to the current elements in the four columns of B */ 111 | *b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr; 112 | 113 | b_p0_pntr = &B( 0, 0 ); 114 | b_p1_pntr = &B( 0, 1 ); 115 | b_p2_pntr = &B( 0, 2 ); 116 | b_p3_pntr = &B( 0, 3 ); 117 | 118 | c_00_c_10_vreg.v = _mm_setzero_pd(); 119 | c_01_c_11_vreg.v = _mm_setzero_pd(); 120 | c_02_c_12_vreg.v = _mm_setzero_pd(); 121 | c_03_c_13_vreg.v = _mm_setzero_pd(); 122 | c_20_c_30_vreg.v = _mm_setzero_pd(); 123 | c_21_c_31_vreg.v = _mm_setzero_pd(); 124 | c_22_c_32_vreg.v = _mm_setzero_pd(); 125 | c_23_c_33_vreg.v = _mm_setzero_pd(); 126 | 127 | for ( p=0; p 90 | #include // SSE 91 | #include // SSE2 92 | #include // SSE3 93 | 94 | typedef union 95 | { 96 | __m128d v; 97 | double d[2]; 98 | } v2df_t; 99 | 100 | void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) 101 | { 102 | /* So, this routine computes a 4x4 block of matrix A 103 | 104 | C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ). 105 | C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ). 106 | C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ). 107 | C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ). 108 | 109 | Notice that this routine is called with c = C( i, j ) in the 110 | previous routine, so these are actually the elements 111 | 112 | C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 ) 113 | C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 114 | C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 115 | C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 116 | 117 | in the original matrix C 118 | 119 | And now we use vector registers and instructions */ 120 | 121 | int p; 122 | v2df_t 123 | c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg, 124 | c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg, 125 | a_0p_a_1p_vreg, 126 | a_2p_a_3p_vreg, 127 | b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 128 | 129 | c_00_c_10_vreg.v = _mm_setzero_pd(); 130 | c_01_c_11_vreg.v = _mm_setzero_pd(); 131 | c_02_c_12_vreg.v = _mm_setzero_pd(); 132 | c_03_c_13_vreg.v = _mm_setzero_pd(); 133 | c_20_c_30_vreg.v = _mm_setzero_pd(); 134 | c_21_c_31_vreg.v = _mm_setzero_pd(); 135 | c_22_c_32_vreg.v = _mm_setzero_pd(); 136 | c_23_c_33_vreg.v = _mm_setzero_pd(); 137 | 138 | for ( p=0; p 94 | #include // SSE 95 | #include // SSE2 96 | #include // SSE3 97 | 98 | typedef union 99 | { 100 | __m128d v; 101 | double d[2]; 102 | } v2df_t; 103 | 104 | void AddDot4x4( int k, double *a, int lda, double *b, int ldb, double *c, int ldc ) 105 | { 106 | /* So, this routine computes a 4x4 block of matrix A 107 | 108 | C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ). 109 | C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ). 110 | C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ). 111 | C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ). 
112 | 113 | Notice that this routine is called with c = C( i, j ) in the 114 | previous routine, so these are actually the elements 115 | 116 | C( i , j ), C( i , j+1 ), C( i , j+2 ), C( i , j+3 ) 117 | C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 118 | C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 119 | C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 120 | 121 | in the original matrix C 122 | 123 | And now we use vector registers and instructions */ 124 | 125 | int p; 126 | v2df_t 127 | c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg, 128 | c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg, 129 | a_0p_a_1p_vreg, 130 | a_2p_a_3p_vreg, 131 | b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 132 | 133 | c_00_c_10_vreg.v = _mm_setzero_pd(); 134 | c_01_c_11_vreg.v = _mm_setzero_pd(); 135 | c_02_c_12_vreg.v = _mm_setzero_pd(); 136 | c_03_c_13_vreg.v = _mm_setzero_pd(); 137 | c_20_c_30_vreg.v = _mm_setzero_pd(); 138 | c_21_c_31_vreg.v = _mm_setzero_pd(); 139 | c_22_c_32_vreg.v = _mm_setzero_pd(); 140 | c_23_c_33_vreg.v = _mm_setzero_pd(); 141 | 142 | for ( p=0; p<k; p++ ){
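For reference, the sketch below shows one common way a kernel of the form declared above is completed: the k-loop that accumulates the 4 x 4 block of C in vector registers, followed by the write-back into C. It assumes, as the later 4x4 steps do, that the 4 x k sliver of A and the k x 4 sliver of B have been packed into contiguous, 16-byte-aligned buffers, so a and b simply advance by four doubles per iteration. The name AddDot4x4_sketch and its simplified argument list are illustrative only; this is a sketch of the technique, not necessarily line-for-line identical to the MMult_4x4_*.c sources in this repository.

#include <mmintrin.h>
#include <xmmintrin.h>  /* SSE  */
#include <emmintrin.h>  /* SSE2 */
#include <pmmintrin.h>  /* SSE3 */

#define C( i, j ) c[ (j)*ldc + (i) ]

typedef union
{
  __m128d v;
  double d[2];
} v2df_t;

/* Illustrative sketch: C( 0:3, 0:3 ) += A( 0:3, 0:k-1 ) * B( 0:k-1, 0:3 ),
   assuming a points at a packed, 16-byte-aligned 4 x k sliver of A and
   b at a packed k x 4 sliver of B. */
void AddDot4x4_sketch( int k, double *a, double *b, double *c, int ldc )
{
  int p;
  v2df_t
    c_00_c_10_vreg, c_01_c_11_vreg, c_02_c_12_vreg, c_03_c_13_vreg,
    c_20_c_30_vreg, c_21_c_31_vreg, c_22_c_32_vreg, c_23_c_33_vreg,
    a_0p_a_1p_vreg, a_2p_a_3p_vreg,
    b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg;

  c_00_c_10_vreg.v = _mm_setzero_pd();   c_01_c_11_vreg.v = _mm_setzero_pd();
  c_02_c_12_vreg.v = _mm_setzero_pd();   c_03_c_13_vreg.v = _mm_setzero_pd();
  c_20_c_30_vreg.v = _mm_setzero_pd();   c_21_c_31_vreg.v = _mm_setzero_pd();
  c_22_c_32_vreg.v = _mm_setzero_pd();   c_23_c_33_vreg.v = _mm_setzero_pd();

  for ( p=0; p<k; p++ ){
    /* Two loads bring in A( 0, p ), A( 1, p ) and A( 2, p ), A( 3, p ) */
    a_0p_a_1p_vreg.v = _mm_load_pd( a );
    a_2p_a_3p_vreg.v = _mm_load_pd( a+2 );
    a += 4;

    /* Load and duplicate B( p, 0 ) ... B( p, 3 ) (SSE3 loaddup) */
    b_p0_vreg.v = _mm_loaddup_pd( b );
    b_p1_vreg.v = _mm_loaddup_pd( b+1 );
    b_p2_vreg.v = _mm_loaddup_pd( b+2 );
    b_p3_vreg.v = _mm_loaddup_pd( b+3 );
    b += 4;

    /* Rows 0 and 1 of the block */
    c_00_c_10_vreg.v = _mm_add_pd( c_00_c_10_vreg.v, _mm_mul_pd( a_0p_a_1p_vreg.v, b_p0_vreg.v ) );
    c_01_c_11_vreg.v = _mm_add_pd( c_01_c_11_vreg.v, _mm_mul_pd( a_0p_a_1p_vreg.v, b_p1_vreg.v ) );
    c_02_c_12_vreg.v = _mm_add_pd( c_02_c_12_vreg.v, _mm_mul_pd( a_0p_a_1p_vreg.v, b_p2_vreg.v ) );
    c_03_c_13_vreg.v = _mm_add_pd( c_03_c_13_vreg.v, _mm_mul_pd( a_0p_a_1p_vreg.v, b_p3_vreg.v ) );

    /* Rows 2 and 3 of the block */
    c_20_c_30_vreg.v = _mm_add_pd( c_20_c_30_vreg.v, _mm_mul_pd( a_2p_a_3p_vreg.v, b_p0_vreg.v ) );
    c_21_c_31_vreg.v = _mm_add_pd( c_21_c_31_vreg.v, _mm_mul_pd( a_2p_a_3p_vreg.v, b_p1_vreg.v ) );
    c_22_c_32_vreg.v = _mm_add_pd( c_22_c_32_vreg.v, _mm_mul_pd( a_2p_a_3p_vreg.v, b_p2_vreg.v ) );
    c_23_c_33_vreg.v = _mm_add_pd( c_23_c_33_vreg.v, _mm_mul_pd( a_2p_a_3p_vreg.v, b_p3_vreg.v ) );
  }

  /* Write the sixteen accumulated values back into C */
  C( 0, 0 ) += c_00_c_10_vreg.d[0];  C( 0, 1 ) += c_01_c_11_vreg.d[0];
  C( 0, 2 ) += c_02_c_12_vreg.d[0];  C( 0, 3 ) += c_03_c_13_vreg.d[0];

  C( 1, 0 ) += c_00_c_10_vreg.d[1];  C( 1, 1 ) += c_01_c_11_vreg.d[1];
  C( 1, 2 ) += c_02_c_12_vreg.d[1];  C( 1, 3 ) += c_03_c_13_vreg.d[1];

  C( 2, 0 ) += c_20_c_30_vreg.d[0];  C( 2, 1 ) += c_21_c_31_vreg.d[0];
  C( 2, 2 ) += c_22_c_32_vreg.d[0];  C( 2, 3 ) += c_23_c_33_vreg.d[0];

  C( 3, 0 ) += c_20_c_30_vreg.d[1];  C( 3, 1 ) += c_21_c_31_vreg.d[1];
  C( 3, 2 ) += c_22_c_32_vreg.d[1];  C( 3, 3 ) += c_23_c_33_vreg.d[1];
}

Because the sixteen values of the 4 x 4 block stay in eight vector registers for the whole k-loop, C is read and written only once at the end, and each duplicated element of B updates two rows of C with a single SSE multiply; this is the idea behind the MMult_4x4 sequence of steps.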