├── AUTHORS.md
├── LICENSE.md
├── README.md
├── interface.md
├── operations.md
├── proposal1.md
├── proposal2a.md
├── proposal2b.md
├── src
│   ├── tapp.h
│   └── tapp
│       ├── attributes.h
│       ├── datatype.h
│       ├── error.h
│       ├── executor.h
│       ├── handle.h
│       ├── product.h
│       ├── status.h
│       └── tensor.h
└── terminology.md

--------------------------------------------------------------------------------
/AUTHORS.md:
--------------------------------------------------------------------------------
Copyright holders:
Devin Matthews, Southern Methodist University, TX, USA - damatthews at mail.smu.edu
Paolo Bientinesi, Department of Computing Science, Umeå University, Sweden - pauldj at cs.umu.se

Other contributors:
Niklas Hörnblad, Department of Computing Science, Umeå University, Sweden
Jan Brandejs, Toulouse III - Paul Sabatier University, France - jbrandejs at irsamc.ups-tlse.fr
Edward Valeev, Virginia Tech, USA
Jeff Hammond, NVIDIA, Finland
Alexander Heinecke, Intel Corporation, USA
Christopher Millette, Advanced Micro Devices, Inc., USA

--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
BSD 3-Clause License

Copyright (c) 2025, the respective contributors, as shown by the AUTHORS.md file.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
TAPP: Tensor Algebra Processing Primitives
===
TAPP is a draft of a low-level standard tensor interface, similar to the BLAS interface for matrices.

TAPP defines the interface itself and clarifies the related technicalities.
A reference implementation is available [here](https://github.com/TAPPorg/reference-implementation).

TAPP formulates the basic tensor contraction operation as
```
D(idx_D) = alpha A(idx_A) * B(idx_B) + beta C(idx_C),
```
where A, B, C, and D are tensors and alpha and beta are scalars. idx_* is an array of indices of the corresponding tensor using the Einstein summation convention, e.g.
```
int64_t idx_D[3] = {'a', 'd', 'e'};
int64_t idx_A[3] = {'a', 'b', 'c'};
int64_t idx_B[4] = {'c', 'd', 'e', 'b'};
int64_t idx_C[3] = {'a', 'd', 'e'};
```
The function TAPP_create_tensor_product in src/tapp/product.h uses this information to create a contraction plan, which is then executed by TAPP_execute_product.

TAPP was devised following the CECAM Workshop on Tensor Contraction Library Standardization, which took place in Toulouse, May 22-24, 2024. Since then, a working group has been meeting regularly to maintain TAPP.

[tensor.sciencesconf.org](https://tensor.sciencesconf.org/)

Please submit all technical questions, suggestions, and queries to Issues.
For other inquiries, you can contact:
Devin Matthews damatthews at mail.smu.edu
Paolo Bientinesi pauldj at cs.umu.se
Jan Brandejs jbrandejs at irsamc.ups-tlse.fr

--------------------------------------------------------------------------------
/interface.md:
--------------------------------------------------------------------------------
# Interface Issues

- Integer size:
  - Fixed at 32-bit, 64-bit, or configurable?
  - Same for lengths and strides?
  - Signed or unsigned?
- Type for index string:
  - char or integral (char allows string literals in C/C++).
- Pointers or pass-by-value for scalars?
- Return values (norm, dot, etc.) or output parameters?
- Does reduce (idamax, etc.) return the index, the value, or both?
- Dedicated reductions (max, min, norm2, etc.) or a single function with an operation tag?
- For a mixed-precision interface:
  - Separate functions?
  - Type tags?
  - What is the type of alpha and beta?
  - Internal computation type?
  - Accumulation type?
- Storage format for complex:
  - Separate real/imaginary storage is nice for some applications.
- High-level vs. low-level interfaces:
  - Low-level interface *must* be in ANSI C.
    - Structs are probably a no-no
    - Language compatibility (esp. with Fortran?)
    - Doesn't necessarily need to be concise and intuitive
    - 1st concern
  - High-level interfaces
    - Can be tuned to language and/or usage
    - Use existing interfaces as well (Eigen, MATLAB, NumPy, etc.)
    - 2nd concern
- Error checking:
  - When does it happen?
  - How much?
  - Can it be turned off?
  - What happens on error?
- Hardware:
  - CPU only?
  - Separate interface for GPU or combined? Heterogeneous?
  - Out-of-core
  - Threading
  - Distributed parallelism (maybe)?

## "Plans"

Paul Springer incorporates the notion of a "plan" in [HPTT](https://github.com/springer13/hptt), similar to how it is used in e.g. FFTW. The basic idea is that instead of just calling an interface function, one creates an object for the operation. This object can then be executed (possibly more than once) and manipulated to change the operation parameters.
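For concreteness, here is a minimal sketch of the plan pattern using the TAPP API defined in src/tapp/product.h (handle/executor and tensor-info creation are implementation-defined and omitted, as is error checking):

```C
#include "tapp.h"

/* Sketch: D[a,d,e] = alpha * A[a,b,c] * B[c,d,e] + beta * C[a,d,e] */
void contract_once(TAPP_handle handle, TAPP_executor exec,
                   TAPP_tensor_info A_info, TAPP_tensor_info B_info,
                   TAPP_tensor_info C_info, TAPP_tensor_info D_info,
                   const double* A, const double* B, const double* C, double* D)
{
    int64_t idx_A[3] = {'a', 'b', 'c'};
    int64_t idx_B[3] = {'c', 'd', 'e'};
    int64_t idx_C[3] = {'a', 'd', 'e'};
    int64_t idx_D[3] = {'a', 'd', 'e'};
    double alpha = 1.0, beta = 0.0;

    TAPP_tensor_product plan;
    TAPP_status status;

    /* Create the plan once... */
    TAPP_create_tensor_product(&plan, handle,
                               TAPP_IDENTITY, A_info, idx_A,
                               TAPP_IDENTITY, B_info, idx_B,
                               TAPP_IDENTITY, C_info, idx_C,
                               TAPP_IDENTITY, D_info, idx_D,
                               TAPP_DEFAULT_PREC);

    /* ...then execute it, possibly many times with different pointers. */
    TAPP_execute_product(plan, exec, &status, &alpha, A, B, &beta, C, D);

    TAPP_destroy_tensor_product(plan);
}
```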

Examples of why this is helpful:
- Reducing overhead by calling the object multiple times with different pointers
  - This can also increase the amount of allowable overhead, opening up new analysis/optimization opportunities
- Passing around operations for task scheduling, etc. (Lambdas work for this too.)
- Creating an isolated execution environment
  - E.g. specifying threading, precision, algorithm, etc. without being affected by global changes

--------------------------------------------------------------------------------
/operations.md:
--------------------------------------------------------------------------------
# Classifying and Specifying Tensor Operations

- *n*-ary operations
  - Do we count all tensors (operands, maybe including lone scalars?) on the RHS? Just tensors being "multiplied"? All but the LHS?
  - How are `norm`, `reduce`, and `dot` classified?
  - Is it enough to consider just "binary" operations?
- Existing methods for specifying operations (contraction at least):
  1. Reduction to canonical form. Example: specify input permutations of A and B and always contract the last *k* indices of A and the first *k* indices of B, then permute the output into C.

     ```C[abcd] = A[afce]*B[edbf] -> perm_A = [0,2,3,1], perm_B = [0,3,2,1], perm_C = [0,2,1,3]```

     (Note that the permutations are not unique.)

  2. Pick dimensions by position. Example: specify which dimensions are contracted by listing their positions in A and B. The dimensions of C are all those left over (some interfaces have no way to choose the ordering in C!). E.g. Tensor Toolbox and Eigen.

     ```C[abcd] = A[afce]*B[edbf] -> ctr_A = [3,1], ctr_B = [0,3], + explicit permutation of C```

  3. Pick dimensions using Einstein summation. Example: specify index strings for each operand. Indices repeated in A and B are contracted. Relative positions determine the necessary permutations. I think this is everybody's current favorite.

     ```C[abcd] = A[afce]*B[edbf] -> idx_A = "afce", idx_B = "edbf", idx_C = "abcd"```

  4. Specify the shape of the operation, not the shape of the operands. Example: generalize the GEMM interface by specifying multiple *m*, *n*, and *k* dimensions and shapes, with two stride vectors for each operand. This is actually highly useful for the implementation!

     ```C++
     contract(ndim_M, shape_M, ndim_N, shape_N, ndim_K, shape_K,
              alpha, A, stride_A_M, stride_A_K,
                     B, stride_B_K, stride_B_N,
              beta,  C, stride_C_M, stride_C_N);
     ```
- Permutations: "comes-from" or "goes-to" notation?
- Beyond contraction:
  - Generalized Einstein summation: sum over all indices that don't appear on the LHS. Allows: batched operations (weighting), single-index trace, extended contraction (CP decomposition, etc.).
  - What are these operations called? Possibility:
    - Appears in only A or B but not both: trace.
    - Appears in A and B but not C: contract.
    - Appears in A or B (not both) and in C: free.
    - Appears in only C: replicate (or broadcast).
    - Appears in all 3: weight.
  - More than 3 operands: who knows?
- "Batched" operations:
  - When batch operands are regularly strided in memory, the batch can be encoded as a higher-dimensional non-batched operation. This is true for batched matrix operations as well, but it doesn't fit in the matrix-based interface.
  - Otherwise (batch operands are randomly located), is it better to have a batched interface or more complicated tensor layouts? E.g. [TBLIS](https://github.com/devinamatthews/tblis) can do operations over tensors and matrices without regular strides (I call it a scatter layout).
  - Why do we want batched operations?
    - Block sparsity!
    - Others?
- Threading: should this be explicit in the interface?
- Mixed/extended/reduced precision and mixed-domain. Integers too.
  - Internal computation in single precision may significantly speed many things up without changing the calling program.
  - Good complex/mixed-domain support is very important in many calculations.
- Notation:
  - free vs. uncontracted vs. external, and bound vs. contracted vs. internal
  - permutation vs. transpose vs. shuffle vs. etc.
  - contraction vs. multiplication (should multiplication be something more general?)

--------------------------------------------------------------------------------
/proposal1.md:
--------------------------------------------------------------------------------
This is an initial proposal for a mixed-precision tensor contraction interface:

```C++
typedef char mode_type;
typedef int64_t stride_type;
typedef int64_t extent_type;
```

```C++
enum error_t
{
    SUCCESS,
    INVALID_ARGUMENTS,
    INTERNAL_ERROR,
    NOT_SUPPORTED
};
```

```C++
enum data_type_t
{
    TYPE_FP16,
    TYPE_FP32,
    TYPE_FP64,
    TYPE_INT16,
    TYPE_INT32,
    TYPE_INT64,
    TYPE_FCOMPLEX,
    TYPE_DCOMPLEX
};
```

```C++
/**
 * \brief This routine computes the tensor contraction C = alpha * op(A) * op(B) + beta * op(C)
 *
 * \f[ \mathcal{C}_{\text{modes}_\mathcal{C}} \gets \alpha\, op(\mathcal{A}_{\text{modes}_\mathcal{A}})\, op(\mathcal{B}_{\text{modes}_\mathcal{B}}) + \beta\, op(\mathcal{C}_{\text{modes}_\mathcal{C}}), \f]
 * where op(X) = X or op(X) = complex conjugate(X).
 *
 *
 * \param[in] alpha Scaling for A*B (data type is determined by 'typeCompute')
 * \param[in] A Pointer to the data corresponding to A (data type is determined by 'typeA')
 * \param[in] typeA Datatype of A, e.g. TYPE_FP32, TYPE_FP64, TYPE_FCOMPLEX, or TYPE_DCOMPLEX
 * \param[in] conjA Indicates if the entries of A should be conjugated (only applies to complex types)
 * \param[in] nmodeA Number of modes of A
 * \param[in] extentA Array with 'nmodeA' values that represents the extent of A (e.g., extentA[] = {4,8,12} represents an order-3 tensor of size 4x8x12).
 * \param[in] strideA Array with 'nmodeA' values that represents the strides of
 *            A with respect to each mode. The following inequality must be obeyed:
 *            (strideA[i] == 0) or (strideA[i] >= s * extentA[i-1], if i > 0, where s
 *            is the last strideA[j] that is larger than 0, with j < i).
 *            strideA[i] == 0 indicates that this dimension will be broadcast.
 *
 *            This argument is optional and may be NULL; in this case a compact
 *            tensor is assumed.
 * \param[in] modeA Array with 'nmodeA' values that represent the modes of A.
 * \param[in] B Pointer to the data corresponding to B (data type is determined by 'typeB')
 * \param[in] typeB Datatype of B (see typeA)
 * \param[in] conjB Indicates if the entries of B should be conjugated (only applies to complex types)
 * \param[in] nmodeB Number of modes of B
 * \param[in] extentB Array with 'nmodeB' values that represents the extent of B.
 * \param[in] strideB Array with 'nmodeB' values that represents the strides of B with respect to each mode (see strideA).
 * \param[in] beta Scaling for C (data type is determined by 'typeCompute')
 * \param[in,out] C Pointer to the data corresponding to C (data type is determined by 'typeC')
 * \param[in] typeC Datatype of C (see typeA)
 * \param[in] conjC Indicates if the initial entries of C should be conjugated (only applies to complex types)
 * \param[in] nmodeC Number of modes of C
 * \param[in] extentC Array with 'nmodeC' values that represents the extent of C.
 * \param[in] strideC Array with 'nmodeC' values that represents the strides of C with respect to each mode (see strideA).
 * \param[in] typeCompute Datatype for the intermediate computation T = A * B
 *
 *
 * Example:
 *
 * The tensor contraction C[a,b,c,d] = 1.3 * A[b,e,d,f] * B[f,e,a,c],
 * where C, A, and B respectively are double-precision tensors of size E_a x E_b x E_c x E_d,
 * E_b x E_e x E_d x E_f, and E_f x E_e x E_a x E_c, can be computed as follows:
 *
 *     double alpha = 1.3;
 *     double beta = 0.0;
 *     extent_type extentC[] = {E_a, E_b, E_c, E_d};
 *     extent_type extentA[] = {E_b, E_e, E_d, E_f};
 *     extent_type extentB[] = {E_f, E_e, E_a, E_c};
 *     stride_type strideC[] = {1, E_a, E_a*E_b, E_a*E_b*E_c}; // optional
 *     stride_type strideA[] = {1, E_b, E_b*E_e, E_b*E_e*E_d}; // optional
 *     stride_type strideB[] = {1, E_f, E_f*E_e, E_f*E_e*E_a}; // optional
 *     mode_type modeC[] = {'a','b','c','d'};
 *     mode_type modeA[] = {'b','e','d','f'};
 *     mode_type modeB[] = {'f','e','a','c'};
 *     int nmodeA = 4;
 *     int nmodeB = 4;
 *     int nmodeC = 4;
 *     data_type_t typeA = TYPE_FP64;
 *     data_type_t typeB = TYPE_FP64;
 *     data_type_t typeC = TYPE_FP64;
 *     data_type_t typeCompute = TYPE_FP64;
 *
 *     error_t error = tensorMult(&alpha, A, typeA, false, nmodeA, extentA, NULL, modeA,
 *                                        B, typeB, false, nmodeB, extentB, NULL, modeB,
 *                                &beta,  C, typeC, false, nmodeC, extentC, NULL, modeC, typeCompute);
 *
 */
error_t tensorMult(const void* alpha, const void *A, data_type_t typeA, bool conjA, int nmodeA, const extent_type *extentA, const stride_type *strideA, const mode_type* modeA,
                                      const void *B, data_type_t typeB, bool conjB, int nmodeB, const extent_type *extentB, const stride_type *strideB, const mode_type* modeB,
                   const void* beta,        void *C, data_type_t typeC, bool conjC, int nmodeC, const extent_type *extentC, const stride_type *strideC, const mode_type* modeC, data_type_t typeCompute);
```

--------------------------------------------------------------------------------
/proposal2a.md:
--------------------------------------------------------------------------------
`XXX` is an appropriate namespace, TBD.

```C
typedef int64_t XXX_extent;
typedef int64_t XXX_stride;
```
These types should almost certainly be signed. 64-bit seems like a fair assumption these days.
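One concrete payoff of signed strides: reversed (or otherwise exotic) views become expressible without copying. A small illustration, not part of the proposal, of how an element's linear offset is computed under a general-stride layout:

```C
#include <stdint.h>

typedef int64_t XXX_stride;

/* Linear offset of element (idx[0], ..., idx[nmode-1]) relative to the origin
   pointer under a general-stride layout; a negative stride simply walks that
   mode backwards in memory. */
int64_t element_offset(int nmode, const int64_t* idx, const XXX_stride* stride)
{
    int64_t off = 0;
    for (int m = 0; m < nmode; m++)
        off += idx[m] * stride[m];
    return off;
}

/* Example: a 4x5 column-major matrix viewed with its first mode reversed has
   strides {-1, 4} and an origin pointer at element (3, 0) of the parent. */
```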

```C
typedef int32_t XXX_index;
```
This can probably be just about any integral type.

```C
typedef enum
{
    XXX_TYPE_F32,
    XXX_TYPE_F64,
    XXX_TYPE_C32,
    XXX_TYPE_C64,
    ...
} XXX_datatype;

typedef enum
{
    XXX_TYPE_F32_F32_ACCUM_F32 = XXX_TYPE_F32,
    ...
} XXX_comp_datatype;
```
Enumerations for the supported storage and computational datatypes. Not all combinations are required to be supported.

```C
typedef /* unspecified */ XXX_error; // Should be a trivial type, e.g. "int"

int XXX_error_check(XXX_error err); // return non-zero on error

const char* XXX_error_explain(XXX_error err);

void XXX_error_clear(XXX_error err);
```
Error handling --- implementation-defined.

```C
typedef /* unspecified */ XXX_attr; // Requires initialization. E.g. "struct XXX_attr_internal*"
typedef int32_t XXX_key; // Some values should be reserved for standardization

XXX_error XXX_attr_init(XXX_attr* attr);

XXX_error XXX_attr_destroy(XXX_attr* attr);

XXX_error XXX_attr_set(XXX_attr* attr, XXX_key key, void* value);

XXX_error XXX_attr_get(XXX_attr* attr, XXX_key key, void** value);

XXX_error XXX_attr_clear(XXX_attr* attr, XXX_key key);
```
Implementation-defined (and maybe some standard) attributes, loosely based on MPI.

```C
// Unary and binary element-wise operations (transpose, scale, norm, reduction, etc.) should also be defined!

// Compute D_{idx_D} = alpha * A_{idx_A} * B_{idx_B} + beta * C_{idx_C}

XXX_error
XXX_contract(const void* alpha,
             XXX_datatype type_alpha,
             const void* A,
             XXX_datatype type_A,
             int nmode_A,
             const XXX_extent* shape_A,
             const XXX_stride* stride_A,
             const XXX_index* idx_A,
             const void* B,
             XXX_datatype type_B,
             int nmode_B,
             const XXX_extent* shape_B,
             const XXX_stride* stride_B,
             const XXX_index* idx_B,
             const void* beta,
             XXX_datatype type_beta,
             const void* C,
             XXX_datatype type_C,
             int nmode_C,
             const XXX_extent* shape_C,
             const XXX_stride* stride_C,
             const XXX_index* idx_C,
             void* D,
             XXX_datatype type_D,
             int nmode_D,
             const XXX_extent* shape_D,
             const XXX_stride* stride_D,
             const XXX_index* idx_D,
             XXX_comp_datatype comp_type,
             XXX_attr attr);
```

--------------------------------------------------------------------------------
/proposal2b.md:
--------------------------------------------------------------------------------
See [Proposal 2a](proposal2a.md) for definitions of the basic types. This "very-low-level" interface and Proposal 2a could coexist.
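To see how the two layers relate, the contraction `C[abcd] = A[afce]*B[edbf]` from operations.md would be expressed here with the following mode grouping (extents written `E_x` as in proposal1.md; this mapping is illustrative, not normative):

```C
/* C[abcd] = A[afce] * B[edbf]
 *   M modes (shared by A and C):  a, c  ->  nmode_M = 2, shape_M = {E_a, E_c}
 *   N modes (shared by B and C):  b, d  ->  nmode_N = 2, shape_N = {E_b, E_d}
 *   K modes (shared by A and B):  e, f  ->  nmode_K = 2, shape_K = {E_e, E_f}
 *   L modes (shared by all, i.e. batch): none  ->  nmode_L = 0
 *
 * Each operand then carries one stride vector per mode group: e.g. stride_A_M
 * holds the strides of modes a and c within A, in the order given by shape_M.
 */
```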

```C
// Compute D_{MNL} = alpha * \sum_K A_{MKL} B_{KNL} + beta * C_{MNL}

XXX_error
XXX_contract( int nmode_M,
              const XXX_extent* shape_M,
              int nmode_N,
              const XXX_extent* shape_N,
              int nmode_K,
              const XXX_extent* shape_K,
              int nmode_L,
              const XXX_extent* shape_L,
              const void* alpha,
              XXX_datatype type_alpha,
              const void* A,
              XXX_datatype type_A,
              const XXX_stride* stride_A_M,
              const XXX_stride* stride_A_K,
              const XXX_stride* stride_A_L,
              const void* B,
              XXX_datatype type_B,
              const XXX_stride* stride_B_K,
              const XXX_stride* stride_B_N,
              const XXX_stride* stride_B_L,
              const void* beta,
              XXX_datatype type_beta,
              const void* C,
              XXX_datatype type_C,
              const XXX_stride* stride_C_M,
              const XXX_stride* stride_C_N,
              const XXX_stride* stride_C_L,
              void* D,
              XXX_datatype type_D,
              const XXX_stride* stride_D_M,
              const XXX_stride* stride_D_N,
              const XXX_stride* stride_D_L,
              XXX_comp_datatype type_comp,
              XXX_attr attr);

// Batched tensor contraction (TBD)

XXX_error
XXX_contract_batch( ???? );
```

--------------------------------------------------------------------------------
/src/tapp.h:
--------------------------------------------------------------------------------
#ifndef TAPP_TAPP_H_
#define TAPP_TAPP_H_

#include "tapp/error.h"
#include "tapp/attributes.h"
#include "tapp/datatype.h"
#include "tapp/handle.h"
#include "tapp/executor.h"
#include "tapp/status.h"
#include "tapp/tensor.h"
#include "tapp/product.h"

#endif /* TAPP_TAPP_H_ */

--------------------------------------------------------------------------------
/src/tapp/attributes.h:
--------------------------------------------------------------------------------
#ifndef TAPP_ATTRIBUTES_H_
#define TAPP_ATTRIBUTES_H_

#include <stdint.h>

#include "error.h"

typedef intptr_t TAPP_attr;
typedef int TAPP_key;

//TODO: predefined attributes? error conditions?

TAPP_error TAPP_attr_set(TAPP_attr attr, TAPP_key key, void* value);

TAPP_error TAPP_attr_get(TAPP_attr attr, TAPP_key key, void** value);

TAPP_error TAPP_attr_clear(TAPP_attr attr, TAPP_key key);

#endif /* TAPP_ATTRIBUTES_H_ */

--------------------------------------------------------------------------------
/src/tapp/datatype.h:
--------------------------------------------------------------------------------
#ifndef TAPP_DATATYPE_H_
#define TAPP_DATATYPE_H_

/*
 * Storage data types:
 *
 * The storage data type is an integer enumeration which indicates the numerical format of tensor data as passed
 * into or returned from the library. Each tensor has a single storage datatype which describes all elements of
 * the tensor. Negative values and values less than 0x1000 are reserved by the standard, but values greater than
 * or equal to 0x1000 may be used by implementations for additional data types.
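 *
 * For example (hypothetical, not part of the standard), an implementation could
 * expose NVIDIA's tf32 format as an extension value such as TAPP_EXT_TF32 = 0x1000
 * without colliding with any standard datatype value.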
 */

typedef int TAPP_datatype;

enum
{
    /* IEEE754 float32: 1 sign bit, 8 exponent bits, 23 explicit significand bits */
    TAPP_F32 = 0,

    /* IEEE754 float64: 1 sign bit, 11 exponent bits, 52 explicit significand bits */
    TAPP_F64 = 1,

    /* Complex IEEE754 float32, stored with consecutive real and imaginary parts packed into 8 bytes */
    TAPP_C32 = 2,

    /* Complex IEEE754 float64, stored with consecutive real and imaginary parts packed into 16 bytes */
    TAPP_C64 = 3,

    /* IEEE754 float16: 1 sign bit, 5 exponent bits, 10 explicit significand bits */
    TAPP_F16 = 4,

    /* bfloat16: 1 sign bit, 8 exponent bits, 7 explicit significand bits */
    TAPP_BF16 = 5,

    /* Aliases */
    TAPP_FLOAT = TAPP_F32,
    TAPP_DOUBLE = TAPP_F64,
    TAPP_SCOMPLEX = TAPP_C32,
    TAPP_DCOMPLEX = TAPP_C64,
};

/*
 * Computational precision types:
 *
 * The computational precision determines the number of correct significant digits in the multiplication and
 * accumulation of scalar floating-point types. The computational precision also typically determines the conversion
 * of data from storage data types into an internal representation which may or may not be another storage data
 * type. The names of the precision type values are of the form TAPP_XXXYYY_ACCUM_ZZZ, where XXX and YYY indicate
 * the precision of the input scalars before multiplication, and ZZZ indicates the precision of the product after
 * accumulation. Note that when fused multiply-add (FMA) instructions are available, the precision of the intermediate
 * product is "infinite". Low-precision computations, e.g. float16, may be performed in a higher precision when
 * hardware support is not available. The default precision TAPP_DEFAULT_PREC indicates that a computational precision
 * should be used which is not less than the lowest precision of the input and output operands, but typically not
 * greater than the highest precision of any input or output operand (except for low-precision types as noted). For
 * example, the multiplication of float32 and float64 scalars accumulated into a float64 scalar could be performed
 * in either f32f32_accum_f32 or f64f64_accum_f64 precision. The first option would require conversion of
 * float64->float32 for one input operand and float32->float64 during accumulation, while the second option would
 * require float32->float64 conversion of the other input operand.
 *
 * The computational precision indicates the precision of accumulation for a single scalar product, which may or may
 * not be maintained across a large number of scalar operations. Any additional accumulation, e.g. from registers
 * to memory, or additional scalar multiplications such as multiplication by beta in a tensor contraction, will be
 * performed in a precision greater than or equal to the precision of the storage data type of the output operand,
 * even if this precision is less than the indicated computational precision. For example, register-to-memory
 * accumulation for an output of type float16 may use float16 arithmetic even if the computational precision is
 * f16f16_accum_f32.
 */

typedef int TAPP_prectype;

enum
{
    /* The computational precision is determined as *at least* the lowest precision of the storage data types
       of all input and output operands. */
    TAPP_DEFAULT_PREC = -1,

    /* IEEE754 float32 with or without singly-rounded FMA */
    TAPP_F32F32_ACCUM_F32 = TAPP_F32, /* = 0 */

    /* IEEE754 float64 with or without singly-rounded FMA */
    TAPP_F64F64_ACCUM_F64 = TAPP_F64, /* = 1 */

    /* IEEE754 float16 with or without singly-rounded FMA */
    /* Implementations may compute final or intermediate results in a higher precision */
    TAPP_F16F16_ACCUM_F16 = TAPP_F16, /* = 4 */

    /* float16 with wide accumulation */
    /* Implementations may compute intermediate results in a higher precision */
    TAPP_F16F16_ACCUM_F32 = 5,

    /* bfloat16 with wide accumulation */
    TAPP_BF16BF16_ACCUM_F32 = 6,
};

#endif /* TAPP_DATATYPE_H_ */

--------------------------------------------------------------------------------
/src/tapp/error.h:
--------------------------------------------------------------------------------
#ifndef TAPP_ERROR_H_
#define TAPP_ERROR_H_

#include <stdbool.h>
#include <stddef.h>

typedef int TAPP_error;

/* Return true if the error code indicates success and false otherwise. */
bool TAPP_check_success(TAPP_error error);

/*
 * Fill a user-provided buffer with an implementation-defined string explaining the error code. No more than maxlen-1
 * characters will be written. If maxlen is greater than zero, then a terminating null character is also
 * written. The actual number of characters written is returned, not including the terminating null character.
 * If maxlen is zero, then no characters are written and instead the length of the full string which would have been
 * written is returned, not including the terminating null character. This means that the message written will always
 * be null-terminated.
 *
 * TODO: should the null character be included in the return value?
 */
size_t TAPP_explain_error(TAPP_error error,
                          size_t maxlen,
                          char* message);

#endif /* TAPP_ERROR_H_ */

--------------------------------------------------------------------------------
/src/tapp/executor.h:
--------------------------------------------------------------------------------
#ifndef TAPP_EXECUTOR_H_
#define TAPP_EXECUTOR_H_

#include <stdint.h>

#include "error.h"

typedef intptr_t TAPP_executor;

/*
 * TODO: implementation-defined creation of executors or "wrapper" to get all implementations and select one?
 * devices probably can't be enumerated until you have a handle....
 */

TAPP_error TAPP_destroy_executor(TAPP_executor exec);

#endif /* TAPP_EXECUTOR_H_ */

--------------------------------------------------------------------------------
/src/tapp/handle.h:
--------------------------------------------------------------------------------
#ifndef TAPP_HANDLE_H_
#define TAPP_HANDLE_H_

#include <stdint.h>

#include "error.h"

typedef intptr_t TAPP_handle;

/*
 * TODO: implementation-defined creation of handles or "wrapper" to get all implementations and select one?
 * devices probably can't be enumerated until you have a handle....
 */

//TODO: get string describing implementation?

//TODO: API versioning and query

//TODO: optional APIs with feature test macros

TAPP_error TAPP_destroy_handle(TAPP_handle handle);

#endif /* TAPP_HANDLE_H_ */

--------------------------------------------------------------------------------
/src/tapp/product.h:
--------------------------------------------------------------------------------
#ifndef TAPP_PRODUCT_H_
#define TAPP_PRODUCT_H_

#include <stdint.h>

#include "error.h"
#include "handle.h"
#include "executor.h"
#include "datatype.h"
#include "status.h"
#include "tensor.h"

//TODO: where should this go?
typedef int TAPP_element_op;

enum
{
    TAPP_IDENTITY = 0,
    TAPP_CONJUGATE = 1,
};

//TODO: where should this go?
#define TAPP_IN_PLACE NULL

/*
 * TODO: what are the required error conditions?
 *
 * TODO: must C and D info be the same? (should they just be the same variable?)
 * JB: Can this be implemented efficiently with different data types of C and D?
 * Let’s say D is complex and C real. Then it should be possible with a different "stride".
 * In such cases we might want to support different C and D info. If D info is null, they
 * are assumed identical.
 */

typedef intptr_t TAPP_tensor_product;

// D_{idx_D} = alpha * A_{idx_A} * B_{idx_B} + beta * C_{idx_C}
// each idx is an array of einsum-style index "characters", e.g. {'a', 'b', 'i', 'j'}
TAPP_error TAPP_create_tensor_product(TAPP_tensor_product* plan,
                                      TAPP_handle handle,
                                      TAPP_element_op op_A,
                                      TAPP_tensor_info A,
                                      const int64_t* idx_A,
                                      TAPP_element_op op_B,
                                      TAPP_tensor_info B,
                                      const int64_t* idx_B,
                                      TAPP_element_op op_C,
                                      TAPP_tensor_info C,
                                      const int64_t* idx_C,
                                      TAPP_element_op op_D,
                                      TAPP_tensor_info D,
                                      const int64_t* idx_D,
                                      TAPP_prectype prec);

TAPP_error TAPP_destroy_tensor_product(TAPP_tensor_product plan);

//TODO: in-place operation: set C = NULL or TAPP_IN_PLACE?

TAPP_error TAPP_execute_product(TAPP_tensor_product plan,
                                TAPP_executor exec,
                                TAPP_status* status,
                                const void* alpha,
                                const void* A,
                                const void* B,
                                const void* beta,
                                const void* C,
                                void* D);

//TODO: is it always OK to pass NULL for exec?
//TODO: can C be NULL/TAPP_IN_PLACE (in addition to array entries being NULL)?

TAPP_error TAPP_execute_batched_product(TAPP_tensor_product plan,
                                        TAPP_executor exec,
                                        TAPP_status* status,
                                        int num_batches,
                                        const void* alpha,
                                        const void** A,
                                        const void** B,
                                        const void* beta,
                                        const void** C,
                                        void** D);

//JB: TODO: variable-sized batching (DM suggests a plan of plans)

#endif /* TAPP_PRODUCT_H_ */

--------------------------------------------------------------------------------
/src/tapp/status.h:
--------------------------------------------------------------------------------
#ifndef TAPP_STATUS_H_
#define TAPP_STATUS_H_

#include <stdint.h>

#include "error.h"

typedef intptr_t TAPP_status;

/*
 * Status objects are created by execution functions (e.g. TAPP_execute_product).
 *
 * TODO: how to get data out? using attributes or a separate standardized interface? implementation-defined?
 */

TAPP_error TAPP_destroy_status(TAPP_status status);

#endif /* TAPP_STATUS_H_ */

--------------------------------------------------------------------------------
/src/tapp/tensor.h:
--------------------------------------------------------------------------------
#ifndef TAPP_TENSOR_H_
#define TAPP_TENSOR_H_

#include <stdint.h>

#include "error.h"
#include "datatype.h"

typedef intptr_t TAPP_tensor_info;

/*
 * TODO: what are the required error conditions? What is explicitly allowed (esp. regarding strides)?
 */

TAPP_error TAPP_create_tensor_info(TAPP_tensor_info* info,
                                   TAPP_datatype type,
                                   int nmode,
                                   const int64_t* extents,
                                   const int64_t* strides);

TAPP_error TAPP_destroy_tensor_info(TAPP_tensor_info info);

int TAPP_get_nmodes(TAPP_tensor_info info);

TAPP_error TAPP_set_nmodes(TAPP_tensor_info info,
                           int nmodes);

void TAPP_get_extents(TAPP_tensor_info info,
                      int64_t* extents);

TAPP_error TAPP_set_extents(TAPP_tensor_info info,
                            const int64_t* extents);

void TAPP_get_strides(TAPP_tensor_info info,
                      int64_t* strides);

TAPP_error TAPP_set_strides(TAPP_tensor_info info,
                            const int64_t* strides);

#endif /* TAPP_TENSOR_H_ */

--------------------------------------------------------------------------------
/terminology.md:
--------------------------------------------------------------------------------
# Terminology

## Tensor

A **dense tensor** is a multi-dimensional array of arithmetic values. A tensor has a positive number, *n*, of **modes** (i.e. it is an *n*-mode tensor).[1](#foot1) Each **mode** has a non-negative **extent**[2](#foot2), which is the number of distinct values that the mode can have. As an example, consider a 4-mode tensor *A* with real elements and extents of 4, 5, 9, and 3:

![](https://latex.codecogs.com/gif.latex?\mathcal{A}\in{}\mathbb{R}^{4\otimes{}5\otimes9{}\otimes{}3})

In a programming language such as C, this would correspond to:

```C
double A[4][5][9][3];
```

(or `float`, etc.).

1) Also: *n*-dimensional, *n*-way, order-*n*, *n*-ary, rank-*n*, *n*-adic, *n*-fold, *n*-index, *n* subspaces.

2) Also: length, dimension, size.

## Indexing

Individual tensor elements are referred to by **indexing**. A *d*-mode tensor is indexed by *d* **indices**.[3](#foot3) Each index may take on a definite integral value in the range `[0,n)`,[4](#foot4) where *n* is the extent of the mode being indexed. If an index appears multiple times then it takes on the same value in each case. For example, we may refer to elements of the tensor *A* above using indices *i*, *j*, *k*, and *l*:

![](https://latex.codecogs.com/gif.latex?\mathcal{A}_{ijkl}\in{}\mathbb{R})

In this case it must be that `0 <= i < 4`, `0 <= j < 5`, `0 <= k < 9`, and `0 <= l < 3`.

3) Also: labels, symbols.

4) 0-based (C-style) indexing is used here, but 1-based indexing (Fortran- or Matlab-style) is also possible. The distinction between the two is only relevant when referencing a single element or sub-range of elements. Operations such as tensor contraction do not generally need to explicitly depend on any indexing method.

## Shape

A tensor **shape**[5](#foot5) is an ordered set of non-negative integers.
The *i*th entry, `shape[i]`, gives the extent of the *i*th mode of the tensor. The shape of a particular tensor *A* is denoted `shape_A`. Mode *i* in tensor *A* is **compatible** with mode *j* in tensor *B* if `shape_A[i] == shape_B[j]`. Only compatible modes may share the same index.

5) Also: size, structure.

## Layout

When the values of a tensor are placed in a linear storage medium (e.g. main memory), additional information is necessary to map values referred to by indices to linear locations. A tensor **layout** for a *d*-mode tensor is such a map from *d* indices to a linear location:

![](https://latex.codecogs.com/gif.latex?\mathrm{layout}\\,\colon\mathbb{N}^d\to\mathbb{N})

The most useful layout for dense tensors is the **general stride** layout:

![](https://latex.codecogs.com/gif.latex?\mathrm{layout}\\,\colon(i_0,\ldots,i_{d-1})\mapsto\sum_{k=0}^{d-1}i_k\cdot{s_k})

The ordered set of *s* values is the **stride**[6](#foot6) of the tensor, denoted for some tensor *A* by `stride_A`. In general a stride value may be negative, but the strides must obey the condition that no two elements with distinct indices share the same linear location. Two special cases of the general stride layout are of particular importance: the **column-major** layout and the **row-major** layout. These are defined by:

![](https://latex.codecogs.com/gif.latex?\begin{align*}\mathrm{layout_{col}}\\,\colon(i_0,\ldots,i_{d-1})\mapsto{}&\sum_{k=0}^{d-1}i_k\prod_{l=0}^{k-1}n_l\\\\{}\mathrm{layout_{row}}\\,\colon(i_0,\ldots,i_{d-1})\mapsto{}&\sum_{k=0}^{d-1}i_k\prod_{l=k+1}^{d-1}n_l\end{align*})

where *n* are the extents of the tensor. Since these layouts depend only on the tensor shape, an additional stride vector is not required.

6) Also: shape, leading dimension (similar but not identical idea).

## Tensor Specification

A tensor *A* is fully specified by:

1. Its number of modes `nmode_A`.
2. Its shape `shape_A`.
3. Its layout (restricting ourselves to general stride layouts) `stride_A`, or an assumption of column- or row-major storage.
4. A pointer to the origin element `A`. This is the location of ![](https://latex.codecogs.com/gif.latex?\mathcal{A}_{0\ldots{}0}).
5. Its data type (real, complex, precision, etc.), unless assumed from the context.

## Other Terms

The **size**[7](#foot7) of a tensor is the number of elements, given by the product of the extents. The **span**[8](#foot8) of a tensor is the difference between the lowest and highest linear location over all elements, plus one. In a general stride layout this is given by:

![](https://latex.codecogs.com/gif.latex?\mathrm{span}=1+\sum_{i=0}^{d-1}(n_i-1)\cdot|s_i|)

A tensor is **contiguous** if its span is equal to its size. The *i*th and *j*th modes are **sequentially contiguous** if `stride[j] == stride[i]*shape[i]` (note this requires `stride[j] >= stride[i]`, and means that sequential contiguity is not commutative). An ordered set of modes is sequentially contiguous if each consecutive pair of modes is sequentially contiguous.

A sequentially contiguous set of modes may be replaced (**folded**[9](#foot9)) by a single mode whose extent is equal to the product of the extents of the original modes, and whose stride is equal to the smallest stride of the original modes.
The tensor formed from folding contains the same elements at the same linear locations as the original. A contiguous tensor may always be folded to a vector (or, in general, to a tensor with any smaller number of modes).

7) Also: length.

8) Also: extent.

9) Also: linearized, unfolded (confusingly).
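As a check on the definitions above, here is a small self-contained sketch (assuming 0-based indexing and the general-stride layout; the function names are illustrative only) of size, span, contiguity, and sequential contiguity:

```C
#include <stdbool.h>
#include <stdint.h>

/* Size: the number of elements, i.e. the product of the extents. */
int64_t tensor_size(int nmode, const int64_t* shape)
{
    int64_t size = 1;
    for (int m = 0; m < nmode; m++)
        size *= shape[m];
    return size;
}

/* Span: 1 + sum_i (n_i - 1)*|s_i| for a general-stride layout. */
int64_t tensor_span(int nmode, const int64_t* shape, const int64_t* stride)
{
    int64_t span = 1;
    for (int m = 0; m < nmode; m++)
        span += (shape[m] - 1) * (stride[m] < 0 ? -stride[m] : stride[m]);
    return span;
}

/* A tensor is contiguous if its span equals its size. */
bool is_contiguous(int nmode, const int64_t* shape, const int64_t* stride)
{
    return tensor_span(nmode, shape, stride) == tensor_size(nmode, shape);
}

/* Modes i and j are sequentially contiguous if stride[j] == stride[i]*shape[i];
   such a pair may be folded into one mode of extent shape[i]*shape[j] with
   stride stride[i]. */
bool sequentially_contiguous(int i, int j, const int64_t* shape, const int64_t* stride)
{
    return stride[j] == stride[i] * shape[i];
}
```
--------------------------------------------------------------------------------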