├── float_params.m
├── license.txt
└── readme.md


/float_params.m:
--------------------------------------------------------------------------------
function [u,xmins,xmin,xmax,p,emins,emin,emax] = float_params(prec)
%FLOAT_PARAMS   Parameters for floating-point arithmetic.
%   [u,xmins,xmin,xmax,p,emins,emin,emax] = FLOAT_PARAMS(prec) returns
%     u:     the unit roundoff,
%     xmins: the smallest positive (subnormal) floating-point number,
%     xmin:  the smallest positive normalized floating-point number,
%     xmax:  the largest floating-point number,
%     p:     the number of binary digits in the significand (including the
%            implicit leading bit),
%     emins: exponent of xmins,
%     emin:  exponent of xmin,
%     emax:  exponent of xmax,
%   where prec is one of
%    'q43', 'fp8-e4m3'         - NVIDIA quarter precision (4 exponent bits,
%                                3 significand bits),
%    'q52', 'fp8-e5m2'         - NVIDIA quarter precision (5 exponent bits,
%                                2 significand bits),
%    'b', 'bfloat16'           - bfloat16,
%    'h', 'half', 'fp16'       - IEEE half precision,
%    't', 'tf32'               - NVIDIA tf32,
%    's', 'single', 'fp32'     - IEEE single precision,
%    'd', 'double', 'fp64'     - IEEE double precision (the default),
%    'q', 'quadruple', 'fp128' - IEEE quadruple precision.
%   For all these arithmetics the floating-point numbers have the form
%   s * 2^e * d_0.d_1d_2...d_{p-1} where s = 1 or -1, e is the exponent
%   and each d_i is 0 or 1, with d_0 = 1 for normalized numbers.
%   With no input and output arguments, FLOAT_PARAMS prints a table showing
%   all the parameters for all the precisions.
%   Note: xmins, xmin and xmax are not representable in double precision
%   for 'quadruple'.

% Author: Nicholas J. Higham.

% References:
%  [1] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019
%      (revision of IEEE Std 754-2008), The Institute of Electrical and
%      Electronics Engineers, 2019.
%  [2] Intel Corporation, BFLOAT16---hardware numerics definition, Nov. 2018,
%      White paper. Document number 338302-001US.
%  [3] Shibo Wang and Pankaj Kanwar, BFloat16: The Secret To High Performance
%      on Cloud TPUs, 2019.
%  [4] https://en.wikipedia.org/wiki/Bfloat16_floating-point_format.
%  [5] NVIDIA Corporation, NVIDIA A100 Tensor Core GPU Architecture, 2020.
%  [6] NVIDIA Corporation, NVIDIA Hopper Architecture In-Depth, 2022.
%      https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

% No inputs and no outputs: print a table of all parameters.
if nargin < 1 && nargout < 1
   precs = 'bhtsdq';
   fprintf(['             u     xmins      xmin      xmax   p ' ...
            '  emins    emin   emax\n'])
   fprintf(' ---------------------------------------------------------')
   fprintf('---------------\n')
   for j = -1:length(precs)
      switch j
        case -1, prec = 'q43';
        case 0, prec = 'q52';
        otherwise, prec = precs(j);
      end
      [u,xmins,xmin,xmax,p,emins,emin,emax] = float_params(prec);
      fprintf('%s: %9.2e %9.2e %9.2e %9.2e %3.0f %7.0f %7.0f %6.0f\n',...
              pad(prec,3,'left'),u,xmins,xmin,xmax,p,emins,emin,emax)
   end
   clear u, return   % Clear u so that no output is returned.
end

if nargin < 1, prec = 'd'; end

if ismember(prec, {'q43','fp8-e4m3'})
    % Significand: 3 bits plus 1 hidden. Exponent: 4 bits.
    p = 4; emax = 7;
elseif ismember(prec, {'q52','fp8-e5m2'})
    % Significand: 2 bits plus 1 hidden. Exponent: 5 bits.
    p = 3; emax = 15;
elseif ismember(prec, {'b','bfloat16'})
    % Significand: 7 bits plus 1 hidden. Exponent: 8 bits.
    p = 8; emax = 127;
elseif ismember(prec, {'h','half','fp16'})
    % Significand: 10 bits plus 1 hidden. Exponent: 5 bits.
    p = 11; emax = 15;
elseif ismember(prec, {'t','tf32'})
    % Significand: 10 bits plus 1 hidden. Exponent: 8 bits.
    p = 11; emax = 127;
elseif ismember(prec, {'s','single','fp32'})
    % Significand: 23 bits plus 1 hidden. Exponent: 8 bits.
    p = 24; emax = 127;
elseif ismember(prec, {'d','double','fp64'})
    % Significand: 52 bits plus 1 hidden. Exponent: 11 bits.
    p = 53; emax = 1023;
elseif ismember(prec, {'q','quadruple','fp128'})
    % Significand: 112 bits plus 1 hidden. Exponent: 15 bits.
    p = 113; emax = 16383;
else
    error('Unrecognized argument')
end

emin = 1-emax;            % Exponent of smallest normal number.
emins = emin + 1 - p;     % Exponent of smallest subnormal number.
xmins = 2^emins;
xmin = 2^emin;
xmax = 2^emax * (2-2^(1-p));
u = 2^(-p);

--------------------------------------------------------------------------------
/license.txt:
--------------------------------------------------------------------------------
Copyright (c) 2018, Nicholas J. Higham
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
`Floating-Point Parameters` - MATLAB Code for Parameters of Floating-Point Arithmetics
==========

About
-----

`float_params` is a MATLAB function for obtaining the parameters of several
floating-point arithmetics. The parameters are derived from constants built
into the code (the precision p and the maximum exponent emax) rather than by
probing the arithmetic at run time.

The parameters are

- the unit roundoff, u,
- the smallest positive (subnormal) floating-point number, xmins,
- the smallest positive normalized floating-point number, xmin,
- the largest floating-point number, xmax,
- the number of binary digits in the significand (including the
  implicit leading bit), p,
- the exponent of xmins, emins,
- the exponent of xmin, emin,
- the exponent of xmax, emax,

and the arithmetics supported are

- NVIDIA quarter precision (fp8-e4m3, fp8-e5m2),
- bfloat16,
- IEEE half precision (fp16),
- NVIDIA tf32,
- IEEE single precision (fp32),
- IEEE double precision (fp64),
- IEEE quadruple precision (fp128).

The code was developed in MATLAB R2020a and works with versions at least
back to R2016b.

License
-------

See `license.txt` for licensing information.
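
Example
-------

The relations that `float_params.m` uses to derive every parameter from p and
emax can be checked against MATLAB's built-in double precision constants. The
following is a minimal sketch, not part of the original package. It assumes
only the values p = 53 and emax = 1023 that the code hard-codes for `'d'`; the
names `u2`, `xmins2` and so on are illustrative. The final comparison runs only
if `float_params.m` is found on the MATLAB path.

```matlab
% Double precision: p and emax as hard-coded in float_params.m.
p = 53; emax = 1023;

emin  = 1 - emax;                 % exponent of smallest normalized number
emins = emin + 1 - p;             % exponent of smallest subnormal number
u     = 2^(-p);                   % unit roundoff
xmins = 2^emins;                  % smallest positive subnormal number
xmin  = 2^emin;                   % smallest positive normalized number
xmax  = 2^emax * (2 - 2^(1-p));   % largest finite number

% Compare with MATLAB's built-in double precision constants.
assert(u == eps/2)
assert(xmins == eps(0))
assert(xmin == realmin)
assert(xmax == realmax)

% If float_params.m is on the path, it should return the same values.
if exist('float_params','file') == 2
    [u2,xmins2,xmin2,xmax2,p2,emins2,emin2,emax2] = float_params('d');
    assert(isequal([u2,xmins2,xmin2,xmax2,p2,emins2,emin2,emax2], ...
                   [u,xmins,xmin,xmax,p,emins,emin,emax]))
end
```

The same check does not carry over to `'quadruple'`, because xmins, xmin and
xmax for that format lie outside the range of double precision.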

--------------------------------------------------------------------------------