├── LICENCE ├── README.md ├── SConscript ├── qfplib-m0-full.h └── qfplib-m0-full_gcc.S /LICENCE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 2, June 1991 3 | 4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc., 5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 8 | 9 | Preamble 10 | 11 | The licenses for most software are designed to take away your 12 | freedom to share and change it. By contrast, the GNU General Public 13 | License is intended to guarantee your freedom to share and change free 14 | software--to make sure the software is free for all its users. This 15 | General Public License applies to most of the Free Software 16 | Foundation's software and to any other program whose authors commit to 17 | using it. (Some other Free Software Foundation software is covered by 18 | the GNU Lesser General Public License instead.) You can apply it to 19 | your programs, too. 20 | 21 | When we speak of free software, we are referring to freedom, not 22 | price. Our General Public Licenses are designed to make sure that you 23 | have the freedom to distribute copies of free software (and charge for 24 | this service if you wish), that you receive source code or can get it 25 | if you want it, that you can change the software or use pieces of it 26 | in new free programs; and that you know you can do these things. 27 | 28 | To protect your rights, we need to make restrictions that forbid 29 | anyone to deny you these rights or to ask you to surrender the rights. 30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 
32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. 
The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 
97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 
128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. 
However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. 
Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. 
Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. 
For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 
279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 292 | 293 | 294 | Copyright (C) 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 
319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 331 | 332 | , 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 340 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Qfplib-M0-full for RT-Thread 2 | 3 | ## A free, fast and compact ARM Cortex-M0 floating-point library 4 | 5 | ## Introduction 6 | 7 | Qfplib-M0-full is a library of IEEE 754 single- and double-precision floating-point arithmetic routines for microcontrollers based on the ARM Cortex-M0 core (ARMv6-M architecture). It should also run on Cortex-M3 and Cortex-M4 microcontrollers and will give reasonable performance, but it is not optimised for these devices. 
8 | 9 | It provides correctly rounded (to nearest, even-on-tie) addition, subtraction, multiplication, division and square root operations, and sine, cosine, tangent, arctangent, logarithm and exponential functions that give a high degree of accuracy. There are also conversion functions between floating-point values and signed or unsigned integer or fixed-point values. The library occupies less than 6 kbyte of program memory. 10 | 11 | Qfplib-M0-full does not use any static storage. Stack use is parsimonious and statically analysable; recursion is not used. 12 | 13 | ## Licence 14 | 15 | Qfplib-M0-full is open source, licensed under version 2 of the [GNU GPL](http://www.gnu.org/licenses/). Use at your own risk. If you wish to enquire about alternative licensing please use the e-mail address on the [home page](https://www.quinapalus.com/index.html). 16 | 17 | ## How to Obtain 18 | 19 | ``` 20 | RT-Thread online packages ---> 21 | system packages ---> 22 | acceleration: Assembly language or algorithmic acceleration packages ---> 23 | [*] Qfplib-M0-full: a free, fast and compact ARM Cortex-M0 floating-point library 24 | ``` 25 | 26 | ## Speed 27 | 28 | The following table compares cycle counts for Qfplib-M0-full against other libraries. Qfplib-M0-full and GCC library results are average values for non-exceptional arguments to the functions, include calling overhead, and are approximate. They were measured using an LPC11U68 microcontroller with single-cycle flash memory. Results for the Micro Digital ‘GoFast’ library—presumably optimised for speed rather than size, judging by its name—are inferred from the timings given on [this page](http://www.smxrtos.com/ussw/gofast/gofast_arm_gnu.htm) for an ARM7TDMI-based processor. 
The comparison here may not be strictly fair to Qfplib-M0-full as it is not clear from their description whether Micro Digital’s library exploits features available on that processor but not on the Cortex-M0: for example, ARM mode is considerably faster and more flexible than Thumb mode, and the long multiply instructions can be used to advantage in several of the routines, especially in double precision. Micro Digital do not appear to provide public information on the code size of their library. The implementation of the basic functions does not appear to be IEEE 754 compliant with regard to rounding. 29 | 30 | | Function | **Qfplib-M0-full cycles** | GCC library cycles | ‘GoFast’ library cycles | 31 | | ------------ | ------------------------- | ------------------ | ----------------------- | 32 | | `qfp_fadd` | **76** | 102 | 182 | 33 | | `qfp_fsub` | **78** | 108 | 181 | 34 | | `qfp_fmul` | **62** | 166 | 144 | 35 | | `qfp_fdiv` | **83** | 475 | 799 | 36 | | `qfp_fcos` | **595** | 3350 | 393 | 37 | | `qfp_fsin` | **584** | 3300 | 394 | 38 | | `qfp_ftan` | **671** | 6140 | 1090 | 39 | | `qfp_fatan2` | **673** | 4930 | 2041 | 40 | | `qfp_fexp` | **261** | 1930 | 372 | 41 | | `qfp_fln` | **277** | 3960 | 1321 | 42 | | `qfp_fsqrt` | **67** | 460 | 1590 | 43 | | `qfp_dadd` | **94** | 168 | 231 | 44 | | `qfp_dsub` | **100** | 167 | 243 | 45 | | `qfp_dmul` | **163** | 377 | 224 | 46 | | `qfp_ddiv` | **200** | 1190 | 1557 | 47 | | `qfp_dcos` | **1623** | 6162 | 951 | 48 | | `qfp_dsin` | **1624** | 5854 | 966 | 49 | | `qfp_dtan` | **1906** | 11371 | 2541 | 50 | | `qfp_datan2` | **2187** | 9973 | 4487 | 51 | | `qfp_dexp` | **811** | 6655 | 1178 | 52 | | `qfp_dln` | **464** | 4375 | 2798 | 53 | | `qfp_dsqrt` | **174** | 1305 | 3042 | 54 | 55 | Note that in every case the Qfplib-M0-full **double**-precision implementation is faster than the corresponding GCC **single**-precision implementation, sometimes by a very large factor. 
56 | 57 | The ARM CMSIS implementations of the scientific functions, despite their name ‘FastMath’, appear to be many times slower than Qfplib-M0-full. For example, the average execution time for ARM's single-precision cosine function (compiled using GCC) is about 3880 cycles, virtually independent of the optimisation flags used. 58 | 59 | ## Limitations and deviations from the IEEE 754 standard 60 | 61 | On input and output, NaNs are converted to infinities and denormals are flushed to zero. 62 | 63 | ## Function ranges and accuracy 64 | 65 | Subject to the limitations and deviations mentioned above, the addition, subtraction, multiplication, division and square root functions all produce correctly rounded (to nearest, even-on-tie) results. This has been verified using many billions of test cases, both random and contrived. 66 | 67 | Other functions generally give results accurate to approximately 1 ulp (‘unit in last place’). Accuracy is poorer where a tiny change in an argument results in a change in the result of a large number of ulps, such as when taking the logarithm of a value near 1 or the sine of a value near a multiple of π. Accurate handling of such cases consumes a large amount of code space and is seldom if ever needed. 68 | 69 | The single-precision trigonometric functions require an argument between –128 and +128; the double-precision trigonometric functions require an argument between –1024 and +1024. 70 | 71 | The comparison functions return zero if their arguments are equal (negative zero is equal to positive zero), or plus or minus one if the first argument is respectively greater than or less than the second. 72 | 73 | ## Conversion functions 74 | 75 | A comprehensive range of functions is provided to convert between floating-point data and signed and unsigned fixed-point and integer data. They are as follows. 
76 | 77 | - `qfp_float2int` 78 | - `qfp_float2fix` 79 | - `qfp_float2uint` 80 | - `qfp_float2ufix` 81 | - `qfp_int2float` 82 | - `qfp_fix2float` 83 | - `qfp_uint2float` 84 | - `qfp_ufix2float` 85 | - `qfp_int642float` 86 | - `qfp_fix642float` 87 | - `qfp_uint642float` 88 | - `qfp_ufix642float` 89 | - `qfp_float2int64` 90 | - `qfp_float2fix64` 91 | - `qfp_float2uint64` 92 | - `qfp_float2ufix64` 93 | - `qfp_double2int` 94 | - `qfp_double2fix` 95 | - `qfp_double2uint` 96 | - `qfp_double2ufix` 97 | - `qfp_double2int64` 98 | - `qfp_double2fix64` 99 | - `qfp_double2uint64` 100 | - `qfp_double2ufix64` 101 | - `qfp_int2double` 102 | - `qfp_fix2double` 103 | - `qfp_uint2double` 104 | - `qfp_ufix2double` 105 | - `qfp_int642double` 106 | - `qfp_fix642double` 107 | - `qfp_uint642double` 108 | - `qfp_ufix642double` 109 | - `qfp_double2float` 110 | - `qfp_float2double` 111 | 112 | ## Other functions 113 | 114 | You may also be interested in the `qfp_float2str` and `qfp_str2float` functions provided as part of the [Qfplib-M0-tiny](https://github.com/mysterywolf/Qfplib-M0-tiny) library. 115 | 116 | Visit http://www.quinapalus.com/qfplib.html for more information. 
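The return convention of the comparison functions, described under "Function ranges and accuracy", can be modelled on a host machine with the short sketch below. It is only an illustration: `ref_cmp` is a hypothetical helper, not part of Qfplib-M0-full, and it uses the host compiler's own floating-point comparisons. It does not reproduce the library's treatment of NaNs (converted to infinities) or denormals (flushed to zero).

```c
/* Hypothetical host-side model of the documented return convention of
   qfp_fcmp / qfp_dcmp; this function is NOT part of Qfplib-M0-full. */
int ref_cmp(double x, double y)
{
    /* Each relational expression evaluates to 1 or 0, so the difference is
       +1 when x > y, -1 when x < y, and 0 when the arguments are equal.
       In C, -0.0 == 0.0, which matches "negative zero is equal to
       positive zero". */
    return (x > y) - (x < y);
}
```

On the target, the corresponding call would be `qfp_fcmp(x, y)` or `qfp_dcmp(x, y)`, as declared in `qfplib-m0-full.h`.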
117 | -------------------------------------------------------------------------------- /SConscript: -------------------------------------------------------------------------------- 1 | Import('rtconfig') 2 | from building import * 3 | 4 | cwd = GetCurrentDir() 5 | src = Glob('*.c') 6 | 7 | if rtconfig.PLATFORM == 'armcc': 8 | src += Glob('*_rvds.S') 9 | 10 | if rtconfig.PLATFORM == 'gcc': 11 | src += Glob('*_gcc.S') 12 | 13 | if rtconfig.PLATFORM == 'iar': 14 | src += Glob('*_iar.S') 15 | 16 | CPPPATH = [cwd] 17 | 18 | group = DefineGroup('Qfplib-M0-full', src, depend = ['PKG_USING_QFPLIB_M0_FULL'], CPPPATH = CPPPATH) 19 | 20 | Return('group') 21 | -------------------------------------------------------------------------------- /qfplib-m0-full.h: -------------------------------------------------------------------------------- 1 | #ifndef __QFPLIB_M0_FULL_H__ 2 | #define __QFPLIB_M0_FULL_H__ 3 | 4 | /* 5 | Copyright 2019-2020 Mark Owen 6 | http://www.quinapalus.com 7 | E-mail: qfp@quinapalus.com 8 | 9 | This file is free software: you can redistribute it and/or modify 10 | it under the terms of version 2 of the GNU General Public License 11 | as published by the Free Software Foundation. 12 | 13 | This file is distributed in the hope that it will be useful, 14 | but WITHOUT ANY WARRANTY; without even the implied warranty of 15 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 16 | GNU General Public License for more details. 17 | 18 | You should have received a copy of the GNU General Public License 19 | along with this file. If not, see or 20 | write to the Free Software Foundation, Inc., 51 Franklin Street, 21 | Fifth Floor, Boston, MA 02110-1301, USA. 
22 | */ 23 | 24 | typedef unsigned int ui32; 25 | typedef int i32; 26 | typedef unsigned long long int ui64; 27 | typedef long long int i64; 28 | 29 | extern float qfp_fadd (float x,float y); 30 | extern float qfp_fsub (float x,float y); 31 | extern float qfp_fmul (float x,float y); 32 | extern float qfp_fdiv (float x,float y); 33 | extern int qfp_fcmp (float x,float y); 34 | extern float qfp_fsqrt (float x); 35 | extern i32 qfp_float2int (float x); 36 | extern i32 qfp_float2fix (float x,int f); 37 | extern ui32 qfp_float2uint (float x); 38 | extern ui32 qfp_float2ufix (float x,int f); 39 | extern float qfp_int2float (i32 x); 40 | extern float qfp_fix2float (i32 x,int f); 41 | extern float qfp_uint2float (ui32 x); 42 | extern float qfp_ufix2float (ui32 x,int f); 43 | extern float qfp_int642float (i64 x); 44 | extern float qfp_fix642float (i64 x,int f); 45 | extern float qfp_uint642float (ui64 x); 46 | extern float qfp_ufix642float (ui64 x,int f); 47 | extern float qfp_fcos (float x); 48 | extern float qfp_fsin (float x); 49 | extern float qfp_ftan (float x); 50 | extern float qfp_fatan2 (float y,float x); 51 | extern float qfp_fexp (float x); 52 | extern float qfp_fln (float x); 53 | 54 | extern double qfp_dadd (double x,double y); 55 | extern double qfp_dsub (double x,double y); 56 | extern double qfp_dmul (double x,double y); 57 | extern double qfp_ddiv (double x,double y); 58 | extern double qfp_dsqrt (double x); 59 | extern double qfp_dcos (double x); 60 | extern double qfp_dsin (double x); 61 | extern double qfp_dtan (double x); 62 | extern double qfp_datan2 (double y,double x); 63 | extern double qfp_dexp (double x); 64 | extern double qfp_dln (double x); 65 | extern int qfp_dcmp (double x,double y); 66 | 67 | extern i64 qfp_float2int64 (float x); 68 | extern i64 qfp_float2fix64 (float x,int f); 69 | extern ui64 qfp_float2uint64 (float x); 70 | extern ui64 qfp_float2ufix64 (float x,int f); 71 | 72 | extern i32 qfp_double2int (double x); 73 | extern i32 
qfp_double2fix (double x,int f); 74 | extern ui32 qfp_double2uint (double x); 75 | extern ui32 qfp_double2ufix (double x,int f); 76 | extern i64 qfp_double2int64 (double x); 77 | extern i64 qfp_double2fix64 (double x,int f); 78 | extern ui64 qfp_double2uint64(double x); 79 | extern ui64 qfp_double2ufix64(double x,int f); 80 | 81 | extern double qfp_int2double (i32 x); 82 | extern double qfp_fix2double (i32 x,int f); 83 | extern double qfp_uint2double (ui32 x); 84 | extern double qfp_ufix2double (ui32 x,int f); 85 | extern double qfp_int642double (i64 x); 86 | extern double qfp_fix642double (i64 x,int f); 87 | extern double qfp_uint642double(ui64 x); 88 | extern double qfp_ufix642double(ui64 x,int f); 89 | 90 | extern float qfp_double2float (double x); 91 | extern double qfp_float2double (float x); 92 | 93 | #endif 94 | -------------------------------------------------------------------------------- /qfplib-m0-full_gcc.S: -------------------------------------------------------------------------------- 1 | @ Copyright 2019-2020 Mark Owen 2 | @ http://www.quinapalus.com 3 | @ E-mail: qfp@quinapalus.com 4 | @ 5 | @ This file is free software: you can redistribute it and/or modify 6 | @ it under the terms of version 2 of the GNU General Public License 7 | @ as published by the Free Software Foundation. 8 | @ 9 | @ This file is distributed in the hope that it will be useful, 10 | @ but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | @ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | @ GNU General Public License for more details. 13 | @ 14 | @ You should have received a copy of the GNU General Public License 15 | @ along with this file. If not, see or 16 | @ write to the Free Software Foundation, Inc., 51 Franklin Street, 17 | @ Fifth Floor, Boston, MA 02110-1301, USA. 
18 | 19 | .syntax unified 20 | .cpu cortex-m0plus 21 | .thumb 22 | 23 | @ exported symbols 24 | 25 | .global qfp_fadd 26 | .global qfp_fsub 27 | .global qfp_fmul 28 | .global qfp_fdiv 29 | .global qfp_fcmp 30 | .global qfp_fsqrt 31 | .global qfp_float2int 32 | .global qfp_float2fix 33 | .global qfp_float2uint 34 | .global qfp_float2ufix 35 | .global qfp_int2float 36 | .global qfp_fix2float 37 | .global qfp_uint2float 38 | .global qfp_ufix2float 39 | .global qfp_int642float 40 | .global qfp_fix642float 41 | .global qfp_uint642float 42 | .global qfp_ufix642float 43 | .global qfp_fcos 44 | .global qfp_fsin 45 | .global qfp_ftan 46 | .global qfp_fatan2 47 | .global qfp_fexp 48 | .global qfp_fln 49 | 50 | .global qfp_dadd 51 | .global qfp_dsub 52 | .global qfp_dmul 53 | .global qfp_ddiv 54 | .global qfp_dsqrt 55 | .global qfp_dcos 56 | .global qfp_dsin 57 | .global qfp_dtan 58 | .global qfp_datan2 59 | .global qfp_dexp 60 | .global qfp_dln 61 | .global qfp_dcmp 62 | 63 | .global qfp_float2int64 64 | .global qfp_float2fix64 65 | .global qfp_float2uint64 66 | .global qfp_float2ufix64 67 | 68 | .global qfp_double2int 69 | .global qfp_double2fix 70 | .global qfp_double2uint 71 | .global qfp_double2ufix 72 | .global qfp_double2int64 73 | .global qfp_double2fix64 74 | .global qfp_double2uint64 75 | .global qfp_double2ufix64 76 | 77 | .global qfp_int2double 78 | .global qfp_fix2double 79 | .global qfp_uint2double 80 | .global qfp_ufix2double 81 | .global qfp_int642double 82 | .global qfp_fix642double 83 | .global qfp_uint642double 84 | .global qfp_ufix642double 85 | 86 | .global qfp_double2float 87 | .global qfp_float2double 88 | 89 | qfp_lib_start: 90 | 91 | @ exchange r0<->r1, r2<->r3 92 | xchxy: 93 | push {r0,r2,r14} 94 | mov r0,r1 95 | mov r2,r3 96 | pop {r1,r3,r15} 97 | 98 | @ IEEE single in r0-> signed (two's complement) mantissa in r0 9Q23 (24 significant bits), signed exponent (bias removed) in r2 99 | @ trashes r4; zero, denormal -> mantissa=+/-1, exponent=-380; Inf, 
NaN -> mantissa=+/-1, exponent=+640 100 | unpackx: 101 | lsrs r2,r0,#23 @ save exponent and sign 102 | lsls r0,#9 @ extract mantissa 103 | lsrs r0,#9 104 | movs r4,#1 105 | lsls r4,#23 106 | orrs r0,r4 @ reinstate implied leading 1 107 | cmp r2,#255 @ test sign bit 108 | uxtb r2,r2 @ clear it 109 | bls 1f @ branch on positive 110 | rsbs r0,#0 @ negate mantissa 111 | 1: 112 | subs r2,#1 113 | cmp r2,#254 @ zero/denormal/Inf/NaN? 114 | bhs 2f 115 | subs r2,#126 @ remove exponent bias: can now be -126..+127 116 | bx r14 117 | 118 | 2: @ here with special-case values 119 | cmp r0,#0 120 | mov r0,r4 @ set mantissa to +1 121 | bpl 3f 122 | rsbs r0,#0 @ zero/denormal/Inf/NaN: mantissa=+/-1 123 | 3: 124 | subs r2,#126 @ zero/denormal: exponent -> -127; Inf, NaN: exponent -> 128 125 | lsls r2,#2 @ zero/denormal: exponent -> -508; Inf, NaN: exponent -> 512 126 | adds r2,#128 @ zero/denormal: exponent -> -380; Inf, NaN: exponent -> 640 127 | bx r14 128 | 129 | @ normalise and pack signed mantissa in r0 nominally 3Q29, signed exponent in r2-> IEEE single in r0 130 | @ trashes r4, preserves r1,r3 131 | @ r5: "sticky bits", must be zero iff all result bits below r0 are zero for correct rounding 132 | packx: 133 | lsrs r4,r0,#31 @ save sign bit 134 | lsls r4,r4,#31 @ sign now in b31 135 | bpl 2f @ skip if positive 136 | cmp r5,#0 137 | beq 11f 138 | adds r0,#1 @ fiddle carry in to following rsb if sticky bits are non-zero 139 | 11: 140 | rsbs r0,#0 @ can now treat r0 as unsigned 141 | packx0: 142 | bmi 3f @ catch r0=0x80000000 case 143 | 2: 144 | subs r2,#1 @ normalisation loop 145 | adds r0,r0 146 | beq 1f @ zero? special case 147 | bpl 2b @ normalise so leading "1" in bit 31 148 | 3: 149 | adds r2,#129 @ (mis-)offset exponent 150 | bne 12f @ special case: highest denormal can round to lowest normal 151 | adds r0,#0x80 @ in special case, need to add 256 to r0 for rounding 152 | bcs 4f @ tripped carry? 
then have leading 1 in C as required 153 | 12: 154 | adds r0,#0x80 @ rounding 155 | bcs 4f @ tripped carry? then have leading 1 in C as required (and result is even so can ignore sticky bits) 156 | cmp r5,#0 157 | beq 7f @ sticky bits zero? 158 | 8: 159 | lsls r0,#1 @ remove leading 1 160 | 9: 161 | subs r2,#1 @ compensate exponent on this path 162 | 4: 163 | cmp r2,#254 164 | bge 5f @ overflow? 165 | adds r2,#1 @ correct exponent offset 166 | ble 10f @ denormal/underflow? 167 | lsrs r0,#9 @ align mantissa 168 | lsls r2,#23 @ align exponent 169 | orrs r0,r2 @ assemble exponent and mantissa 170 | 6: 171 | orrs r0,r4 @ apply sign 172 | 1: 173 | bx r14 174 | 175 | 5: 176 | movs r0,#0xff @ create infinity 177 | lsls r0,#23 178 | b 6b 179 | 180 | 10: 181 | movs r0,#0 @ create zero 182 | bx r14 183 | 184 | 7: @ sticky bit rounding case 185 | lsls r5,r0,#24 @ check bottom 8 bits of r0 186 | bne 8b @ in rounding-tie case? 187 | lsrs r0,#9 @ ensure even result 188 | lsls r0,#10 189 | b 9b 190 | 191 | .align 2 192 | .ltorg 193 | 194 | @ signed multiply r0 1Q23 by r1 4Q23, result in r0 7Q25, sticky bits in r5 195 | @ trashes r3,r4 196 | mul0: 197 | uxth r3,r0 @ Q23 198 | asrs r4,r1,#16 @ Q7 199 | muls r3,r4 @ L*H, Q30 signed 200 | asrs r4,r0,#16 @ Q7 201 | uxth r5,r1 @ Q23 202 | muls r4,r5 @ H*L, Q30 signed 203 | adds r3,r4 @ sum of middle partial products 204 | uxth r4,r0 205 | muls r4,r5 @ L*L, Q46 unsigned 206 | lsls r5,r4,#16 @ initialise sticky bits from low half of low partial product 207 | lsrs r4,#16 @ Q25 208 | adds r3,r4 @ add high half of low partial product to sum of middle partial products 209 | @ (cannot generate carry by limits on input arguments) 210 | asrs r0,#16 @ Q7 211 | asrs r1,#16 @ Q7 212 | muls r0,r1 @ H*H, Q14 signed 213 | lsls r0,#11 @ high partial product Q25 214 | lsls r1,r3,#27 @ sticky 215 | orrs r5,r1 @ collect further sticky bits 216 | asrs r1,r3,#5 @ middle partial products Q25 217 | adds r0,r1 @ final result 218 | bx r14 219 | 220 | 
.thumb_func 221 | qfp_fcmp: 222 | lsls r2,r0,#1 223 | lsrs r2,#24 224 | beq 1f 225 | cmp r2,#0xff 226 | bne 2f 227 | 1: 228 | lsrs r0,#23 @ clear mantissa if NaN or denormal 229 | lsls r0,#23 230 | 2: 231 | lsls r2,r1,#1 232 | lsrs r2,#24 233 | beq 1f 234 | cmp r2,#0xff 235 | bne 2f 236 | 1: 237 | lsrs r1,#23 @ clear mantissa if NaN or denormal 238 | lsls r1,#23 239 | 2: 240 | movs r2,#1 @ initialise result 241 | eors r1,r0 242 | bmi 4f @ opposite signs? then can proceed on basis of sign of x 243 | eors r1,r0 @ restore y 244 | bpl 1f 245 | rsbs r2,#0 @ both negative? flip comparison 246 | 1: 247 | cmp r0,r1 248 | bgt 2f 249 | blt 3f 250 | 5: 251 | movs r2,#0 252 | 3: 253 | rsbs r2,#0 254 | 2: 255 | subs r0,r2,#0 256 | bx r14 257 | 4: 258 | orrs r1,r0 259 | adds r1,r1 260 | beq 5b 261 | cmp r0,#0 262 | bge 2b 263 | b 3b 264 | 265 | @ convert float to signed int, rounding towards -Inf, clamping 266 | .thumb_func 267 | qfp_float2int: 268 | movs r1,#0 @ fall through 269 | 270 | @ convert float in r0 to signed fixed point in r0, clamping 271 | .thumb_func 272 | qfp_float2fix: 273 | push {r4,r14} 274 | bl unpackx 275 | movs r3,r2 276 | adds r3,#130 277 | bmi 6f @ -0? 278 | add r2,r1 @ incorporate binary point position into exponent 279 | subs r2,#23 @ r2 is now amount of left shift required 280 | blt 1f @ requires right shift? 281 | cmp r2,#7 @ overflow? 282 | ble 4f 283 | 3: @ overflow 284 | asrs r1,r0,#31 @ +ve:0 -ve:0xffffffff 285 | mvns r1,r1 @ +ve:0xffffffff -ve:0 286 | movs r0,#1 287 | lsls r0,#31 288 | 5: 289 | eors r0,r1 @ +ve:0x7fffffff -ve:0x80000000 (unsigned path: 0xffffffff) 290 | pop {r4,r15} 291 | 1: 292 | rsbs r2,#0 @ right shift for r0, >0 293 | cmp r2,#32 294 | blt 2f @ more than 32 bits of right shift? 
295 | movs r2,#32 296 | 2: 297 | asrs r0,r0,r2 298 | pop {r4,r15} 299 | 6: 300 | movs r0,#0 301 | pop {r4,r15} 302 | 303 | @ unsigned version 304 | .thumb_func 305 | qfp_float2uint: 306 | movs r1,#0 @ fall through 307 | .thumb_func 308 | qfp_float2ufix: 309 | push {r4,r14} 310 | bl unpackx 311 | add r2,r1 @ incorporate binary point position into exponent 312 | movs r1,r0 313 | bmi 5b @ negative? return zero 314 | subs r2,#23 @ r2 is now amount of left shift required 315 | blt 1b @ requires right shift? 316 | mvns r1,r0 @ ready to return 0xffffffff 317 | cmp r2,#8 @ overflow? 318 | bgt 5b 319 | 4: 320 | lsls r0,r0,r2 @ result fits, left shifted 321 | pop {r4,r15} 322 | 323 | 324 | @ convert uint64 to float, rounding 325 | .thumb_func 326 | qfp_uint642float: 327 | movs r2,#0 @ fall through 328 | 329 | @ convert unsigned 64-bit fix to float, rounding; number of r0:r1 bits after point in r2 330 | .thumb_func 331 | qfp_ufix642float: 332 | push {r4,r5,r14} 333 | cmp r1,#0 334 | bpl 3f @ positive? we can use signed code 335 | lsls r5,r1,#31 @ contribution to sticky bits 336 | orrs r5,r0 337 | lsrs r0,r1,#1 338 | subs r2,#1 339 | b 4f 340 | 341 | @ convert int64 to float, rounding 342 | .thumb_func 343 | qfp_int642float: 344 | movs r2,#0 @ fall through 345 | 346 | @ convert signed 64-bit fix to float, rounding; number of r0:r1 bits after point in r2 347 | .thumb_func 348 | qfp_fix642float: 349 | push {r4,r5,r14} 350 | 3: 351 | movs r5,r0 352 | orrs r5,r1 353 | beq ret_pop45 @ zero? return +0 354 | asrs r5,r1,#31 @ sign bits 355 | 2: 356 | asrs r4,r1,#24 @ try shifting 7 bits at a time 357 | cmp r4,r5 358 | bne 1f @ next shift will overflow? 
359 | lsls r1,#7 360 | lsrs r4,r0,#25 361 | orrs r1,r4 362 | lsls r0,#7 363 | adds r2,#7 364 | b 2b 365 | 1: 366 | movs r5,r0 367 | movs r0,r1 368 | 4: 369 | rsbs r2,#0 370 | adds r2,#32+29 371 | b packret 372 | 373 | @ convert signed int to float, rounding 374 | .thumb_func 375 | qfp_int2float: 376 | movs r1,#0 @ fall through 377 | 378 | @ convert signed fix to float, rounding; number of r0 bits after point in r1 379 | .thumb_func 380 | qfp_fix2float: 381 | push {r4,r5,r14} 382 | 1: 383 | movs r2,#29 384 | subs r2,r1 @ fix exponent 385 | packretns: @ pack and return, sticky bits=0 386 | movs r5,#0 387 | packret: @ common return point: "pack and return" 388 | bl packx 389 | ret_pop45: 390 | pop {r4,r5,r15} 391 | 392 | 393 | @ unsigned version 394 | .thumb_func 395 | qfp_uint2float: 396 | movs r1,#0 @ fall through 397 | .thumb_func 398 | qfp_ufix2float: 399 | push {r4,r5,r14} 400 | cmp r0,#0 401 | bge 1b @ treat <2^31 as signed 402 | movs r2,#30 403 | subs r2,r1 @ fix exponent 404 | lsls r5,r0,#31 @ one sticky bit 405 | lsrs r0,#1 406 | b packret 407 | 408 | @ All the scientific functions are implemented using the CORDIC algorithm. For notation, 409 | @ details not explained in the comments below, and a good overall survey see 410 | @ "50 Years of CORDIC: Algorithms, Architectures, and Applications" by Meher et al., 411 | @ IEEE Transactions on Circuits and Systems Part I, Volume 56 Issue 9. 
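@ Note: the rotation-mode iteration that cordic_rstep and cordic_rot implement below can be
@ sketched in C as follows. This is only an illustration under stated assumptions, not part of
@ the library: the names sincos_q29 and cordic_sincos_q29 are hypothetical, the angle constants
@ are the tab_cc entries with their two flag bits stripped, 0x136e9db4 is the pre-scaled start
@ value that qfp_fsin loads, and >> on a negative value is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical result type: cosine and sine in Q29. */
typedef struct { int32_t cos_q29; int32_t sin_q29; } sincos_q29;

/* atan(2^-i) in Q29 for i = 0..12 (tab_cc values without the flag bits). */
static const int32_t atan_q29[13] = {
    0x1921fb54, 0x0ed63383, 0x07d6dd7e, 0x03fab753, 0x01ff55bb,
    0x00ffeaae, 0x007ffd55, 0x003fffab, 0x001ffff5, 0x000fffff,
    0x0007ffff, 0x00040000, 0x00020000
};

/* Rotate the pre-scaled vector (1/K, 0) by z, given in Q29 radians with
   |z| <= pi/2. Assumes arithmetic right shift of negative integers. */
static sincos_q29 cordic_sincos_q29(int32_t z)
{
    int32_t x = 0x136e9db4;          /* 1/K in Q29, K = CORDIC gain */
    int32_t y = 0;
    for (int i = 0; i < 13; i++) {
        int32_t xs = x >> i;         /* x>>i */
        int32_t ys = y >> i;         /* y>>i */
        if (z >= 0) {                /* positive rotation */
            x -= ys; y += xs; z -= atan_q29[i];
        } else {                     /* negative rotation */
            x += ys; y -= xs; z += atan_q29[i];
        }
    }
    /* The residual angle is now small, so the remaining rotation is
       linearised: x -= y*z, y += x*z (Q29 * Q29 >> 29 = Q29). */
    int32_t dx = (int32_t)(((int64_t)y * z) >> 29);
    int32_t dy = (int32_t)(((int64_t)x * z) >> 29);
    sincos_q29 r = { x - dx, y + dy };
    return r;
}
```

@ For example, cordic_sincos_q29(1 << 28) evaluates the pair at 0.5 rad; dividing each field by
@ 2^29 gives approximately cos(0.5) and sin(0.5). The sketch uses 64-bit products for the final
@ linearised step, whereas the assembly pre-shifts its operands to stay within 32 bits.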
412 | 413 | @ Register use: 414 | @ r0: x 415 | @ r1: y 416 | @ r2: z/omega 417 | @ r3: coefficient pointer 418 | @ r4,r12: m 419 | @ r5: i (shift) 420 | 421 | cordic_start: @ initialisation 422 | movs r5,#0 @ initial shift=0 423 | mov r12,r4 424 | b 5f 425 | 426 | cordic_vstep: @ one step of algorithm in vector mode 427 | cmp r1,#0 @ check sign of y 428 | bgt 4f 429 | b 1f 430 | cordic_rstep: @ one step of algorithm in rotation mode 431 | cmp r2,#0 @ check sign of angle 432 | bge 1f 433 | 4: 434 | subs r1,r6 @ negative rotation: y=y-(x>>i) 435 | rsbs r7,#0 436 | adds r2,r4 @ accumulate angle 437 | b 2f 438 | 1: 439 | adds r1,r6 @ positive rotation: y=y+(x>>i) 440 | subs r2,r4 @ accumulate angle 441 | 2: 442 | mov r4,r12 443 | muls r7,r4 @ apply sign from m 444 | subs r0,r7 @ finish rotation: x=x{+/-}(y>>i) 445 | 5: 446 | ldmia r3!,{r4} @ fetch next angle from table and bump pointer 447 | lsrs r4,#1 @ repeated angle? 448 | bcs 3f 449 | adds r5,#1 @ adjust shift if not 450 | 3: 451 | mov r6,r0 452 | asrs r6,r5 @ x>>i 453 | mov r7,r1 454 | asrs r7,r5 @ y>>i 455 | lsrs r4,#1 @ shift end flag into carry 456 | bx r14 457 | 458 | @ CORDIC rotation mode 459 | cordic_rot: 460 | push {r6,r7,r14} 461 | bl cordic_start @ initialise 462 | 1: 463 | bl cordic_rstep 464 | bcc 1b @ step until table finished 465 | asrs r6,r0,#14 @ remaining small rotations can be linearised: see IV.B of paper referenced above 466 | asrs r7,r1,#14 467 | asrs r2,#3 468 | muls r6,r2 @ all remaining CORDIC steps in a multiplication 469 | muls r7,r2 470 | mov r4,r12 471 | muls r7,r4 472 | asrs r6,#12 473 | asrs r7,#12 474 | subs r0,r7 @ x=x{+/-}(yz>>k) 475 | adds r1,r6 @ y=y+(xz>>k) 476 | cordic_exit: 477 | pop {r6,r7,r15} 478 | 479 | @ CORDIC vector mode 480 | cordic_vec: 481 | push {r6,r7,r14} 482 | bl cordic_start @ initialise 483 | 1: 484 | bl cordic_vstep 485 | bcc 1b @ step until table finished 486 | 4: 487 | cmp r1,#0 @ continue as in cordic_vstep but without using table; x is not affected as y 
is small 488 | bgt 2f @ check sign of y 489 | adds r1,r6 @ positive rotation: y=y+(x>>i) 490 | subs r2,r4 @ accumulate angle 491 | b 3f 492 | 2: 493 | subs r1,r6 @ negative rotation: y=y-(x>>i) 494 | adds r2,r4 @ accumulate angle 495 | 3: 496 | asrs r6,#1 497 | asrs r4,#1 @ next "table entry" 498 | bne 4b 499 | b cordic_exit 500 | 501 | .thumb_func 502 | qfp_fsin: @ calculate sin and cos using CORDIC rotation method 503 | push {r4,r5,r14} 504 | movs r1,#24 505 | bl qfp_float2fix @ range reduction by repeated subtraction/addition in fixed point 506 | ldr r4,pi_q29 507 | lsrs r4,#4 @ 2pi Q24 508 | 1: 509 | subs r0,r4 510 | bge 1b 511 | 1: 512 | adds r0,r4 513 | bmi 1b @ now in range 0..2pi 514 | lsls r2,r0,#2 @ z Q26 515 | lsls r5,r4,#1 @ pi Q26 (r4=pi/2 Q26) 516 | ldr r0,=#0x136e9db4 @ initialise CORDIC x,y with scaling 517 | movs r1,#0 518 | 1: 519 | cmp r2,r4 @ >pi/2? 520 | blt 2f 521 | subs r2,r5 @ reduce range to -pi/2..pi/2 522 | rsbs r0,#0 @ rotate vector by pi 523 | b 1b 524 | 2: 525 | lsls r2,#3 @ Q29 526 | adr r3,tab_cc @ circular coefficients 527 | movs r4,#1 @ m=1 528 | bl cordic_rot 529 | adds r1,#9 @ fiddle factor to make sin(0)==0 530 | movs r2,#0 @ exponents to zero 531 | movs r3,#0 532 | movs r5,#0 @ no sticky bits 533 | bl clampx 534 | bl packx @ pack cosine 535 | bl xchxy 536 | bl clampx 537 | b packretns @ pack sine 538 | 539 | .thumb_func 540 | qfp_fcos: 541 | push {r14} 542 | bl qfp_fsin 543 | mov r0,r1 @ extract cosine result 544 | pop {r15} 545 | 546 | @ force r0 to lie in range [-1,1] Q29 547 | clampx: 548 | movs r4,#1 549 | lsls r4,#29 550 | cmp r0,r4 551 | bgt 1f 552 | rsbs r4,#0 553 | cmp r0,r4 554 | ble 1f 555 | bx r14 556 | 1: 557 | movs r0,r4 558 | bx r14 559 | 560 | .thumb_func 561 | qfp_ftan: 562 | push {r4,r5,r6,r14} 563 | bl qfp_fsin @ sine in r0/r2, cosine in r1/r3 564 | b fdiv_n @ sin/cos 565 | 566 | .thumb_func 567 | qfp_fexp: 568 | push {r4,r5,r14} 569 | movs r1,#24 570 | bl qfp_float2fix @ Q24: covers entire valid input range 
571 | asrs r1,r0,#16 @ Q8 572 | ldr r2,=#5909 @ log_2(e) Q12 573 | muls r2,r1 @ estimate exponent of result Q20 (always an underestimate) 574 | asrs r2,#20 @ Q0 575 | lsls r1,r0,#6 @ Q30 576 | ldr r0,=#0x2c5c85fe @ ln(2) Q30 577 | muls r0,r2 @ accurate contribution of estimated exponent 578 | subs r1,r0 @ residual to be exponentiated, guaranteed ≥0, < about 0.75 Q30 579 | 580 | @ here 581 | @ r1: mantissa to exponentiate, 0...~0.75 Q30 582 | @ r2: first exponent estimate 583 | 584 | movs r5,#1 @ shift 585 | adr r3,ftab_exp @ could use alternate words from dtab_exp to save space if required 586 | movs r0,#1 587 | lsls r0,#29 @ x=1 Q29 588 | 589 | 3: 590 | ldmia r3!,{r4} 591 | subs r4,r1,r4 592 | bmi 1f 593 | movs r1,r4 @ keep result of subtraction 594 | movs r4,r0 595 | lsrs r4,r5 596 | adcs r0,r4 @ x+=x>>i with rounding 597 | 598 | 1: 599 | adds r5,#1 600 | cmp r5,#15 601 | bne 3b 602 | 603 | @ here 604 | @ r0: exp a Q29 1..2+ 605 | @ r1: ε (residual x where x=a+ε), < 2^-14 Q30 606 | @ r2: first exponent estimate 607 | @ and we wish to calculate exp x=exp a exp ε~(exp a)(1+ε) 608 | 609 | lsrs r3,r0,#15 @ exp a Q14 610 | muls r3,r1 @ ε exp a Q44 611 | lsrs r3,#15 @ ε exp a Q29 612 | adcs r0,r3 @ (1+ε) exp a Q29 with rounding 613 | 614 | b packretns @ pack result 615 | 616 | .thumb_func 617 | qfp_fln: 618 | push {r4,r5,r14} 619 | asrs r1,r0,#23 620 | bmi 3f @ -ve argument? 621 | beq 3f @ 0 argument? 
622 | cmp r1,#0xff 623 | beq 4f @ +Inf/NaN 624 | bl unpackx 625 | adds r2,#1 626 | ldr r3,=#0x2c5c85fe @ ln(2) Q30 627 | lsrs r1,r3,#14 @ ln(2) Q16 628 | muls r1,r2 @ result estimate Q16 629 | asrs r1,#16 @ integer contribution to result 630 | muls r3,r2 631 | lsls r4,r1,#30 632 | subs r3,r4 @ fractional contribution to result Q30, signed 633 | lsls r0,#8 @ Q31 634 | 635 | @ here 636 | @ r0: mantissa Q31 637 | @ r1: integer contribution to result 638 | @ r3: fractional contribution to result Q30, signed 639 | 640 | movs r5,#1 @ shift 641 | adr r4,ftab_exp @ could use alternate words from dtab_exp to save space if required 642 | 643 | 2: 644 | movs r2,r0 645 | lsrs r2,r5 646 | adcs r2,r0 @ x+(x>>i) with rounding 647 | bcs 1f @ >=2? 648 | movs r0,r2 @ keep result 649 | ldr r2,[r4] 650 | subs r3,r2 651 | 1: 652 | adds r4,#4 653 | adds r5,#1 654 | cmp r5,#15 655 | bne 2b 656 | 657 | @ here 658 | @ r0: residual x, nearly 2 Q31 659 | @ r1: integer contribution to result 660 | @ r3: fractional part of result Q30 661 | 662 | asrs r0,#2 663 | adds r0,r3,r0 664 | 665 | cmp r1,#0 666 | bne 2f 667 | 668 | asrs r0,#1 669 | lsls r1,#29 670 | adds r0,r1 671 | movs r2,#0 672 | b packretns 673 | 674 | 2: 675 | lsls r1,#24 676 | asrs r0,#6 @ Q24 677 | adcs r0,r1 @ with rounding 678 | movs r2,#5 679 | b packretns 680 | 681 | 3: 682 | ldr r0,=#0xff800000 @ -Inf 683 | pop {r4,r5,r15} 684 | 4: 685 | ldr r0,=#0x7f800000 @ +Inf 686 | pop {r4,r5,r15} 687 | 688 | .align 2 689 | ftab_exp: 690 | .word 0x19f323ed @ log 1+2^-1 Q30 691 | .word 0x0e47fbe4 @ log 1+2^-2 Q30 692 | .word 0x0789c1dc @ log 1+2^-3 Q30 693 | .word 0x03e14618 @ log 1+2^-4 Q30 694 | .word 0x01f829b1 @ log 1+2^-5 Q30 695 | .word 0x00fe0546 @ log 1+2^-6 Q30 696 | .word 0x007f80aa @ log 1+2^-7 Q30 697 | .word 0x003fe015 @ log 1+2^-8 Q30 698 | .word 0x001ff803 @ log 1+2^-9 Q30 699 | .word 0x000ffe00 @ log 1+2^-10 Q30 700 | .word 0x0007ff80 @ log 1+2^-11 Q30 701 | .word 0x0003ffe0 @ log 1+2^-12 Q30 702 | .word 0x0001fff8 @ log 
1+2^-13 Q30 703 | .word 0x0000fffe @ log 1+2^-14 Q30 704 | 705 | .thumb_func 706 | qfp_fatan2: 707 | push {r4,r5,r14} 708 | 709 | @ unpack arguments and shift one down to have common exponent 710 | bl unpackx 711 | bl xchxy 712 | bl unpackx 713 | lsls r0,r0,#5 @ Q28 714 | lsls r1,r1,#5 @ Q28 715 | adds r4,r2,r3 @ this is -760 if both arguments are 0 and at least -380-126=-506 otherwise 716 | asrs r4,#9 717 | adds r4,#1 718 | bmi 2f @ force y to 0 proper, so result will be zero 719 | subs r4,r2,r3 @ calculate shift 720 | bge 1f @ ex>=ey? 721 | rsbs r4,#0 @ make shift positive 722 | asrs r0,r4 723 | cmp r4,#28 724 | blo 3f 725 | asrs r0,#31 726 | b 3f 727 | 1: 728 | asrs r1,r4 729 | cmp r4,#28 730 | blo 3f 731 | 2: 732 | @ here |x|>>|y| or both x and y are ±0 733 | cmp r0,#0 734 | bge 4f @ x positive, return signed 0 735 | ldr r0,pi_q29 @ x negative, return +/- pi 736 | asrs r1,#31 737 | eors r0,r1 738 | b 7f 739 | 4: 740 | asrs r0,r1,#31 741 | b 7f 742 | 3: 743 | movs r2,#0 @ initial angle 744 | cmp r0,#0 @ x negative 745 | bge 5f 746 | rsbs r0,#0 @ rotate to 1st/4th quadrants 747 | rsbs r1,#0 748 | ldr r2,pi_q29 @ pi Q29 749 | 5: 750 | adr r3,tab_cc @ circular coefficients 751 | movs r4,#1 @ m=1 752 | bl cordic_vec @ also produces magnitude (with scaling factor 1.646760119), which is discarded 753 | mov r0,r2 @ result here is -pi/2..3pi/2 Q29 754 | @ asrs r2,#29 755 | @ subs r0,r2 756 | ldr r2,pi_q29 @ pi Q29 757 | adds r4,r0,r2 @ attempt to fix -3pi/2..-pi case 758 | bcs 6f @ -pi/2..0? 
leave result as is 759 | subs r4,r0,r2 @ <pi? leave result as is 760 | bmi 6f 761 | subs r0,r4,r2 @ >pi: take off 2pi 762 | 6: 763 | subs r0,#1 @ fiddle factor so atan2(0,1)==0 764 | 7: 765 | movs r2,#0 @ exponent for pack 766 | b packretns 767 | 768 | .align 2 769 | .ltorg 770 | 771 | @ first entry in following table is pi Q29 772 | pi_q29: 773 | @ circular CORDIC coefficients: atan(2^-i), b0=flag for preventing shift, b1=flag for end of table 774 | tab_cc: 775 | .word 0x1921fb54*4+1 @ no shift before first iteration 776 | .word 0x0ed63383*4+0 777 | .word 0x07d6dd7e*4+0 778 | .word 0x03fab753*4+0 779 | .word 0x01ff55bb*4+0 780 | .word 0x00ffeaae*4+0 781 | .word 0x007ffd55*4+0 782 | .word 0x003fffab*4+0 783 | .word 0x001ffff5*4+0 784 | .word 0x000fffff*4+0 785 | .word 0x0007ffff*4+0 786 | .word 0x00040000*4+0 787 | .word 0x00020000*4+0+2 @ +2 marks end 788 | 789 | .align 2 790 | .thumb_func 791 | qfp_fsub: 792 | ldr r2,=#0x80000000 793 | eors r1,r2 @ flip sign on second argument 794 | @ drop into fadd, on .align2:ed boundary 795 | 796 | .thumb_func 797 | qfp_fadd: 798 | push {r4,r5,r6,r14} 799 | asrs r4,r0,#31 800 | lsls r2,r0,#1 801 | lsrs r2,#24 @ x exponent 802 | beq fa_xe0 803 | cmp r2,#255 804 | beq fa_xe255 805 | fa_xe: 806 | asrs r5,r1,#31 807 | lsls r3,r1,#1 808 | lsrs r3,#24 @ y exponent 809 | beq fa_ye0 810 | cmp r3,#255 811 | beq fa_ye255 812 | fa_ye: 813 | ldr r6,=#0x007fffff 814 | ands r0,r0,r6 @ extract mantissa bits 815 | ands r1,r1,r6 816 | adds r6,#1 @ r6=0x00800000 817 | orrs r0,r0,r6 @ set implied 1 818 | orrs r1,r1,r6 819 | eors r0,r0,r4 @ complement... 820 | eors r1,r1,r5 821 | subs r0,r0,r4 @ ... and add 1 if sign bit is set: 2's complement 822 | subs r1,r1,r5 823 | subs r5,r3,r2 @ ye-xe 824 | subs r4,r2,r3 @ xe-ye 825 | bmi fa_ygtx 826 | @ here xe>=ye 827 | cmp r4,#30 828 | bge fa_xmgty @ xe much greater than ye? 
829 | adds r5,#32 830 | movs r3,r2 @ save exponent 831 | @ here y in r1 must be shifted down r4 places to align with x in r0 832 | movs r2,r1 833 | lsls r2,r2,r5 @ keep the bits we will shift off the bottom of r1 834 | asrs r1,r1,r4 835 | b fa_0 836 | 837 | .ltorg 838 | 839 | fa_ymgtx: 840 | movs r2,#0 @ result is just y 841 | movs r0,r1 842 | b fa_1 843 | fa_xmgty: 844 | movs r3,r2 @ result is just x 845 | movs r2,#0 846 | b fa_1 847 | 848 | fa_ygtx: 849 | @ here ye>xe 850 | cmp r5,#30 851 | bge fa_ymgtx @ ye much greater than xe? 852 | adds r4,#32 853 | @ here x in r0 must be shifted down r5 places to align with y in r1 854 | movs r2,r0 855 | lsls r2,r2,r4 @ keep the bits we will shift off the bottom of r0 856 | asrs r0,r0,r5 857 | fa_0: 858 | adds r0,r1 @ result is now in r0:r2, possibly highly denormalised or zero; exponent in r3 859 | beq fa_9 @ if zero, inputs must have been of identical magnitude and opposite sign, so return +0 860 | fa_1: 861 | lsrs r1,r0,#31 @ sign bit 862 | beq fa_8 863 | mvns r0,r0 864 | rsbs r2,r2,#0 865 | bne fa_8 866 | adds r0,#1 867 | fa_8: 868 | adds r6,r6 869 | @ r6=0x01000000 870 | cmp r0,r6 871 | bhs fa_2 872 | fa_3: 873 | adds r2,r2 @ normalisation loop 874 | adcs r0,r0 875 | subs r3,#1 @ adjust exponent 876 | cmp r0,r6 877 | blo fa_3 878 | fa_2: 879 | @ here r0:r2 is the result mantissa 0x01000000<=r0<0x02000000, r3 the exponent, and r1 the sign bit 880 | lsrs r0,#1 881 | bcc fa_4 882 | @ rounding bits here are 1:r2 883 | adds r0,#1 @ round up 884 | cmp r2,#0 885 | beq fa_5 @ sticky bits all zero? 886 | fa_4: 887 | cmp r3,#254 888 | bhs fa_6 @ exponent too large or negative? 889 | lsls r1,#31 @ pack everything 890 | add r0,r1 891 | lsls r3,#23 892 | add r0,r3 893 | fa_end: 894 | pop {r4,r5,r6,r15} 895 | 896 | fa_9: 897 | cmp r2,#0 @ result zero? 
898 | beq fa_end @ return +0 899 | b fa_1 900 | 901 | fa_5: 902 | lsrs r0,#1 903 | lsls r0,#1 @ round to even 904 | b fa_4 905 | 906 | fa_6: 907 | bge fa_7 908 | @ underflow 909 | @ can handle denormals here 910 | lsls r0,r1,#31 @ result is signed zero 911 | pop {r4,r5,r6,r15} 912 | fa_7: 913 | @ overflow 914 | lsls r0,r1,#8 915 | adds r0,#255 916 | lsls r0,#23 @ result is signed infinity 917 | pop {r4,r5,r6,r15} 918 | 919 | 920 | fa_xe0: 921 | @ can handle denormals here 922 | subs r2,#32 923 | adds r2,r4 @ exponent -32 for +0, -33 for -0 924 | b fa_xe 925 | 926 | fa_xe255: 927 | @ can handle NaNs here 928 | lsls r2,#8 929 | add r2,r2,r4 @ exponent ~64k for +Inf, ~64k-1 for -Inf 930 | b fa_xe 931 | 932 | fa_ye0: 933 | @ can handle denormals here 934 | subs r3,#32 935 | adds r3,r5 @ exponent -32 for +0, -33 for -0 936 | b fa_ye 937 | 938 | fa_ye255: 939 | @ can handle NaNs here 940 | lsls r3,#8 941 | add r3,r3,r5 @ exponent ~64k for +Inf, ~64k-1 for -Inf 942 | b fa_ye 943 | 944 | 945 | .align 2 946 | .thumb_func 947 | qfp_fmul: 948 | push {r7,r14} 949 | mov r2,r0 950 | eors r2,r1 @ sign of result 951 | lsrs r2,#31 952 | lsls r2,#31 953 | mov r14,r2 954 | lsls r0,#1 955 | lsls r1,#1 956 | lsrs r2,r0,#24 @ xe 957 | beq fm_xe0 958 | cmp r2,#255 959 | beq fm_xe255 960 | fm_xe: 961 | lsrs r3,r1,#24 @ ye 962 | beq fm_ye0 963 | cmp r3,#255 964 | beq fm_ye255 965 | fm_ye: 966 | adds r7,r2,r3 @ exponent of result (will possibly be incremented) 967 | subs r7,#128 @ adjust bias for packing 968 | lsls r0,#8 @ x mantissa 969 | lsls r1,#8 @ y mantissa 970 | lsrs r0,#9 971 | lsrs r1,#9 972 | 973 | adds r2,r0,r1 @ for later 974 | mov r12,r2 975 | lsrs r2,r0,#7 @ x[22..7] Q16 976 | lsrs r3,r1,#7 @ y[22..7] Q16 977 | muls r2,r2,r3 @ result [45..14] Q32: never an overestimate and worst case error is 2*(2^7-1)*(2^23-2^7)+(2^7-1)^2 = 2130690049 < 2^31 978 | muls r0,r0,r1 @ result [31..0] Q46 979 | lsrs r2,#18 @ result [45..32] Q14 980 | bcc 1f 981 | cmp r0,#0 982 | bmi 1f 983 | 
adds r2,#1 @ fix error in r2 984 | 1: 985 | lsls r3,r0,#9 @ bits off bottom of result 986 | lsrs r0,#23 @ Q23 987 | lsls r2,#9 988 | adds r0,r2 @ cut'n'shut 989 | add r0,r12 @ implied 1*(x+y) to compensate for no insertion of implied 1s 990 | @ result-1 in r3:r0 Q23+32, i.e., in range [0,3) 991 | 992 | lsrs r1,r0,#23 993 | bne fm_0 @ branch if we need to shift down one place 994 | @ here 1<=result<2 995 | cmp r7,#254 996 | bhs fm_3a @ catches both underflow and overflow 997 | lsls r3,#1 @ sticky bits at top of R3, rounding bit in carry 998 | bcc fm_1 @ no rounding 999 | beq fm_2 @ rounding tie? 1000 | adds r0,#1 @ round up 1001 | fm_1: 1002 | adds r7,#1 @ for implied 1 1003 | lsls r7,#23 @ pack result 1004 | add r0,r7 1005 | add r0,r14 1006 | pop {r7,r15} 1007 | fm_2: @ rounding tie 1008 | adds r0,#1 1009 | fm_3: 1010 | lsrs r0,#1 1011 | lsls r0,#1 @ clear bottom bit 1012 | b fm_1 1013 | 1014 | @ here 1<=result-1<3 1015 | fm_0: 1016 | adds r7,#1 @ increment exponent 1017 | cmp r7,#254 1018 | bhs fm_3b @ catches both underflow and overflow 1019 | lsrs r0,#1 @ shift mantissa down 1020 | bcc fm_1a @ no rounding 1021 | adds r0,#1 @ assume we will round up 1022 | cmp r3,#0 @ sticky bits 1023 | beq fm_3c @ rounding tie? 1024 | fm_1a: 1025 | adds r7,r7 1026 | adds r7,#1 @ for implied 1 1027 | lsls r7,#22 @ pack result 1028 | add r0,r7 1029 | add r0,r14 1030 | pop {r7,r15} 1031 | 1032 | fm_3c: 1033 | lsrs r0,#1 1034 | lsls r0,#1 @ clear bottom bit 1035 | b fm_1a 1036 | 1037 | fm_xe0: 1038 | subs r2,#16 1039 | fm_xe255: 1040 | lsls r2,#8 1041 | b fm_xe 1042 | fm_ye0: 1043 | subs r3,#16 1044 | fm_ye255: 1045 | lsls r3,#8 1046 | b fm_ye 1047 | 1048 | @ here the result is under- or overflowing 1049 | fm_3b: 1050 | bge fm_4 @ branch on overflow 1051 | @ trap case where result is denormal 0x007fffff + 0.5ulp or more 1052 | adds r7,#1 @ exponent=-1? 
1053 | bne fm_5 1054 | @ corrected mantissa will be >= 3.FFFFFC (0x1fffffe Q23) 1055 | @ so r0 >= 2.FFFFFC (0x17ffffe Q23) 1056 | adds r0,#2 1057 | lsrs r0,#23 1058 | cmp r0,#3 1059 | bne fm_5 1060 | b fm_6 1061 | 1062 | fm_3a: 1063 | bge fm_4 @ branch on overflow 1064 | @ trap case where result is denormal 0x007fffff + 0.5ulp or more 1065 | adds r7,#1 @ exponent=-1? 1066 | bne fm_5 1067 | adds r0,#1 @ mantissa=0xffffff (i.e., r0=0x7fffff)? 1068 | lsrs r0,#23 1069 | beq fm_5 1070 | fm_6: 1071 | movs r0,#1 @ return smallest normal 1072 | lsls r0,#23 1073 | add r0,r14 1074 | pop {r7,r15} 1075 | 1076 | fm_5: 1077 | mov r0,r14 1078 | pop {r7,r15} 1079 | fm_4: 1080 | movs r0,#0xff 1081 | lsls r0,#23 1082 | add r0,r14 1083 | pop {r7,r15} 1084 | 1085 | @ This version of the division algorithm uses external divider hardware to estimate the 1086 | @ reciprocal of the divisor to about 14 bits; then a multiplication step to get a first 1087 | @ quotient estimate; then the remainder based on this estimate is used to calculate a 1088 | @ correction to the quotient. The result is good to about 27 bits and so we only need 1089 | @ to calculate the exact remainder when close to a rounding boundary. 
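@ Note: the quotient refinement described above (reciprocal estimate, first quotient, signed
@ remainder, correction) can be sketched in C as follows. This is an illustration under stated
@ assumptions, not the library's code: the name fdiv_mantissa_q28 is hypothetical, a plain
@ integer division stands in for the reciprocal estimate, and rounding, exponent and sign
@ handling are omitted.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: y and x are mantissas in Q23 with the implied 1 set
 * (0x00800000..0x00ffffff); the return value approximates y/x in Q28.
 */
static uint32_t fdiv_mantissa_q28(uint32_t y, uint32_t x)
{
    /* Assumption: an integer division stands in for the reciprocal
       estimate used by the assembly; u approximates 1/x in Q16. */
    uint32_t u = (uint32_t)((UINT64_C(1) << 39) / x);

    /* First quotient estimate: Q15 * Q16 >> 16 = Q15. */
    uint32_t qu0 = ((y >> 8) * u) >> 16;

    /* Remainder in Q38. Both terms wrap modulo 2^32, but their true
       difference is small, so the low 32 bits hold it exactly. */
    int32_t re0 = (int32_t)((y << 15) - qu0 * x);

    /* Quotient correction: Q28 * Q16 >> 16 = Q28. A 64-bit product is
       used here so the sketch cannot overflow. */
    int32_t qu1 = (int32_t)(((int64_t)(re0 >> 10) * u) >> 16);

    /* Combine: (qu0 << 13) is Q28, plus the correction. */
    return (qu0 << 13) + (uint32_t)qu1;
}
```

@ For example, fdiv_mantissa_q28(0x00C00000, 0x00800000) divides 1.5 by 1.0 and returns
@ 0x18000000, which is 1.5 in Q28.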
1090 | .align 2 1091 | .thumb_func 1092 | qfp_fdiv: 1093 | push {r4,r5,r6,r14} 1094 | fdiv_n: 1095 | 1096 | movs r4,#1 1097 | lsls r4,#23 @ implied 1 position 1098 | lsls r2,r1,#9 @ clear out sign and exponent 1099 | lsrs r2,r2,#9 1100 | orrs r2,r2,r4 @ divisor mantissa Q23 with implied 1 1101 | 1102 | @ here 1103 | @ r0=packed dividend 1104 | @ r1=packed divisor 1105 | @ r2=divisor mantissa Q23 1106 | @ r4=1<<23 1107 | 1108 | // see divtest.c 1109 | lsrs r3,r2,#18 @ x2=x>>18; // Q5 32..63 1110 | adr r5,rcpapp-32 1111 | ldrb r3,[r5,r3] @ u=lut5[x2-32]; // Q8 1112 | lsls r5,r2,#5 1113 | muls r5,r5,r3 1114 | asrs r5,#14 @ e=(i32)(u*(x<<5))>>14; // Q22 1115 | asrs r6,r5,#11 1116 | muls r6,r6,r6 @ e2=(e>>11)*(e>>11); // Q22 1117 | subs r5,r6 1118 | muls r5,r5,r3 @ c=(e-e2)*u; // Q30 1119 | lsls r6,r3,#8 1120 | asrs r5,#13 1121 | adds r5,#1 1122 | asrs r5,#1 1123 | subs r5,r6,r5 @ u0=(u<<8)-((c+0x2000)>>14); // Q16 1124 | 1125 | @ here 1126 | @ r0=packed dividend 1127 | @ r1=packed divisor 1128 | @ r2=divisor mantissa Q23 1129 | @ r4=1<<23 1130 | @ r5=reciprocal estimate Q16 1131 | 1132 | lsrs r6,r0,#23 1133 | uxtb r3,r6 @ dividend exponent 1134 | lsls r0,#9 1135 | lsrs r0,#9 1136 | orrs r0,r0,r4 @ dividend mantissa Q23 1137 | 1138 | lsrs r1,#23 1139 | eors r6,r1 @ sign of result in bit 8 1140 | lsrs r6,#8 1141 | lsls r6,#31 @ sign of result in bit 31, other bits clear 1142 | 1143 | @ here 1144 | @ r0=dividend mantissa Q23 1145 | @ r1=divisor sign+exponent 1146 | @ r2=divisor mantissa Q23 1147 | @ r3=dividend exponent 1148 | @ r5=reciprocal estimate Q16 1149 | @ r6b31=sign of result 1150 | 1151 | uxtb r1,r1 @ divisor exponent 1152 | cmp r1,#0 1153 | beq retinf 1154 | cmp r1,#255 1155 | beq 20f @ divisor is infinite 1156 | cmp r3,#0 1157 | beq retzero 1158 | cmp r3,#255 1159 | beq retinf 1160 | subs r3,r1 @ initial result exponent (no bias) 1161 | adds r3,#125 @ add bias 1162 | 1163 | lsrs r1,r0,#8 @ dividend mantissa Q15 1164 | 1165 | @ here 1166 | @ r0=dividend 
mantissa Q23 1167 | @ r1=dividend mantissa Q15 1168 | @ r2=divisor mantissa Q23 1169 | @ r3=initial result exponent 1170 | @ r5=reciprocal estimate Q16 1171 | @ r6b31=sign of result 1172 | 1173 | muls r1,r5 1174 | 1175 | lsrs r1,#16 @ Q15 qu0=(q15)(u*y0); 1176 | lsls r0,r0,#15 @ dividend Q38 1177 | movs r4,r2 1178 | muls r4,r1 @ Q38 qu0*x 1179 | subs r4,r0,r4 @ Q38 re0=(y<<15)-qu0*x; note this remainder is signed 1180 | asrs r4,#10 1181 | muls r4,r5 @ Q44 qu1=(re0>>10)*u; this quotient correction is also signed 1182 | asrs r4,#16 @ Q28 1183 | lsls r1,#13 1184 | adds r1,r1,r4 @ Q28 qu=(qu0<<13)+(qu1>>16); 1185 | 1186 | @ here 1187 | @ r0=dividend mantissa Q38 1188 | @ r1=quotient Q28 1189 | @ r2=divisor mantissa Q23 1190 | @ r3=initial result exponent 1191 | @ r6b31=sign of result 1192 | 1193 | lsrs r4,r1,#28 1194 | bne 1f 1195 | @ here the quotient is less than 1<<28 (i.e., result mantissa <1.0) 1196 | 1197 | adds r1,#5 1198 | lsrs r4,r1,#4 @ rounding + small reduction in systematic bias 1199 | bcc 2f @ skip if we are not near a rounding boundary 1200 | lsrs r1,#3 @ quotient Q25 1201 | lsls r0,#10 @ dividend mantissa Q48 1202 | muls r1,r1,r2 @ quotient*divisor Q48 1203 | subs r0,r0,r1 @ remainder Q48 1204 | bmi 2f 1205 | b 3f 1206 | 1207 | 1: 1208 | @ here the quotient is at least 1<<28 (i.e., result mantissa >=1.0) 1209 | 1210 | adds r3,#1 @ bump exponent (and shift mantissa down one more place) 1211 | adds r1,#9 1212 | lsrs r4,r1,#5 @ rounding + small reduction in systematic bias 1213 | bcc 2f @ skip if we are not near a rounding boundary 1214 | 1215 | lsrs r1,#4 @ quotient Q24 1216 | lsls r0,#9 @ dividend mantissa Q47 1217 | muls r1,r1,r2 @ quotient*divisor Q47 1218 | subs r0,r0,r1 @ remainder Q47 1219 | bmi 2f 1220 | 3: 1221 | adds r4,#1 @ increment quotient as we are above the rounding boundary 1222 | 1223 | @ here 1224 | @ r3=result exponent 1225 | @ r4=correctly rounded quotient Q23 in range [1,2] *note closed interval* 1226 | @ r6b31=sign of result 1227 | 
2:
 cmp r3,#254
 bhs 10f @ this catches both underflow and overflow
 lsls r1,r3,#23
 adds r0,r4,r1
 adds r0,r6
 pop {r4,r5,r6,r15}

@ here divisor is infinite; dividend exponent in r3
20:
 cmp r3,#255
 bne retzero

retinf:
 movs r0,#255
21:
 lsls r0,#23
 orrs r0,r6
 pop {r4,r5,r6,r15}

10:
 bge retinf @ overflow?
 adds r1,r3,#1
 bne retzero @ exponent <-1? return 0
@ here exponent is exactly -1
 lsrs r1,r4,#25
 bcc retzero @ mantissa is not 01000000?
@ return minimum normal
 movs r0,#1
 lsls r0,#23
 orrs r0,r6
 pop {r4,r5,r6,r15}

retzero:
 movs r0,r6
 pop {r4,r5,r6,r15}

@ x2=[32:1:63]/32;
@ round(256 ./(x2+1/64))
.align 2
rcpapp:
 .byte 252,245,237,231,224,218,213,207,202,197,193,188,184,180,176,172
 .byte 169,165,162,159,156,153,150,148,145,142,140,138,135,133,131,129

@ The square root routine uses an initial approximation to the reciprocal of the square root of the argument based
@ on the top four bits of the mantissa (possibly shifted one place to make the exponent even). It then performs two
@ Newton-Raphson iterations, resulting in about 14 bits of accuracy. This reciprocal is then multiplied by
@ the original argument to produce an approximation to the result, again with about 14 bits of accuracy.
@ Then a remainder is calculated, and multiplied by the reciprocal estimate to generate a correction term
@ giving a final answer to about 28 bits of accuracy. A final remainder calculation rounds to the correct
@ result if necessary.
@ Again, the fixed-point calculation is carefully implemented to preserve accuracy, and similar comments to those
@ made above on the fast division routine apply.
@ The reciprocal square root calculation has been tested for all possible (possibly shifted) input mantissa values.
.align 2
.thumb_func
qfp_fsqrt:
 push {r4}
 lsls r1,r0,#1
 bcs sq_0 @ negative?
 lsls r1,#8
 lsrs r1,#9 @ mantissa
 movs r2,#1
 lsls r2,#23
 adds r1,r2 @ insert implied 1
 lsrs r2,r0,#23 @ extract exponent
 beq sq_2 @ zero?
 cmp r2,#255 @ infinite?
 beq sq_1
 adds r2,#125 @ correction for packing
 asrs r2,#1 @ exponent/2, LSB into carry
 bcc 1f
 lsls r1,#1 @ was even: double mantissa; mantissa y now 1..4 Q23
1:
 adr r4,rsqrtapp-4 @ first four table entries are never accessed because of the mantissa's leading 1
 lsrs r3,r1,#21 @ y Q2
 ldrb r4,[r4,r3] @ initial approximation to reciprocal square root a0 Q8

 lsrs r0,r1,#7 @ y Q16: first Newton-Raphson iteration
 muls r0,r4 @ a0*y Q24
 muls r0,r4 @ r0=p0=a0*a0*y Q32
 asrs r0,#12 @ r0 Q20
 muls r0,r4 @ dy0=a0*r0 Q28
 asrs r0,#13 @ dy0 Q15
 lsls r4,#8 @ a0 Q16
 subs r4,r0 @ a1=a0-dy0/2 Q16-Q15/2 -> Q16
 adds r4,#170 @ mostly remove systematic error in this approximation: gains approximately 1 bit

 movs r0,r4 @ second Newton-Raphson iteration
 muls r0,r0 @ a1*a1 Q32
 lsrs r0,#15 @ a1*a1 Q17
 lsrs r3,r1,#8 @ y Q15
 muls r0,r3 @ r1=p1=a1*a1*y Q32
 asrs r0,#12 @ r1 Q20
 muls r0,r4 @ dy1=a1*r1 Q36
 asrs r0,#21 @ dy1 Q15
 subs r4,r0 @ a2=a1-dy1/2 Q16-Q15/2 -> Q16

 muls r3,r4 @ a3=y*a2 Q31
 lsrs r3,#15 @ a3 Q16
@ here a2 is an approximation to the reciprocal square root
@ and a3 is an approximation to the square root
 movs r0,r3
 muls r0,r0 @ a3*a3 Q32
 lsls r1,#9 @ y Q32
 subs r0,r1,r0 @ r2=y-a3*a3 Q32 remainder
 asrs r0,#5 @ r2 Q27
 muls r4,r0 @ r2*a2 Q43
 lsls r3,#7 @ a3 Q23
 asrs r0,r4,#15 @ r2*a2 Q28
 adds r0,#16 @ rounding to Q24
 asrs r0,r0,#6 @ r2*a2 Q22
 add r3,r0 @ a4 Q23: candidate final result
 bcc sq_3 @ near rounding boundary? skip if no rounding needed
 mov r4,r3
 adcs r4,r4 @ a4+0.5ulp Q24
 muls r4,r4 @ Q48
 lsls r1,#16 @ y Q48
 subs r1,r4 @ remainder Q48
 bmi sq_3
 adds r3,#1 @ round up
sq_3:
 lsls r2,#23 @ pack exponent
 adds r0,r2,r3
sq_6:
 pop {r4}
 bx r14

sq_0:
 lsrs r1,#24
 beq sq_2 @ -0: return it
@ here negative and not -0: return -Inf
 asrs r0,#31
sq_5:
 lsls r0,#23
 b sq_6
sq_1: @ +Inf
 lsrs r0,#23
 b sq_5
sq_2:
 lsrs r0,#31
 lsls r0,#31
 b sq_6

@ round(sqrt(2^22./[72:16:248]))
rsqrtapp:
 .byte 0xf1,0xda,0xc9,0xbb, 0xb0,0xa6,0x9e,0x97, 0x91,0x8b,0x86,0x82



@ Notation:
@ rx:ry means the concatenation of rx and ry with rx having the less significant bits

@ IEEE double in ra:rb ->
@ mantissa in ra:rb 12Q52 (53 significant bits) with implied 1 set
@ exponent in re
@ sign in rs
@ trashes rt
.macro mdunpack ra,rb,re,rs,rt
 lsrs \re,\rb,#20 @ extract sign and exponent
 subs \rs,\re,#1
 lsls \rs,#20
 subs \rb,\rs @ clear sign and exponent in mantissa; insert implied 1
 lsrs \rs,\re,#11 @ sign
 lsls \re,#21
 lsrs \re,#21 @ exponent
 beq l\@_1 @ zero exponent?
 adds \rt,\re,#1
 lsrs \rt,#11
 beq l\@_2 @ exponent != 0x7ff? then done
l\@_1:
 movs \ra,#0
 movs \rb,#1
 lsls \rb,#20
 subs \re,#128
 lsls \re,#12
l\@_2:
.endm

@ IEEE double in ra:rb ->
@ signed mantissa in ra:rb 12Q52 (53 significant bits) with implied 1
@ exponent in re
@ trashes rt0 and rt1
@ +zero, +denormal -> exponent=-0x80000
@ -zero, -denormal -> exponent=-0x80000
@ +Inf, +NaN -> exponent=+0x77f000
@ -Inf, -NaN -> exponent=+0x77e000
.macro mdunpacks ra,rb,re,rt0,rt1
 lsrs \re,\rb,#20 @ extract sign and exponent
 lsrs \rt1,\rb,#31 @ sign only
 subs \rt0,\re,#1
 lsls \rt0,#20
 subs \rb,\rt0 @ clear sign and exponent in mantissa; insert implied 1
 lsls \re,#21
 bcc l\@_1 @ skip on positive
 mvns \rb,\rb @ negate mantissa
 rsbs \ra,#0
 bcc l\@_1
 adds \rb,#1
l\@_1:
 lsrs \re,#21
 beq l\@_2 @ zero exponent?
 adds \rt0,\re,#1
 lsrs \rt0,#11
 beq l\@_3 @ exponent != 0x7ff? then done
 subs \re,\rt1
l\@_2:
 movs \ra,#0
 lsls \rt1,#1 @ +ve: 0 -ve: 2
 adds \rb,\rt1,#1 @ +ve: 1 -ve: 3
 lsls \rb,#30 @ create +/-1 mantissa
 asrs \rb,#10
 subs \re,#128
 lsls \re,#12
l\@_3:
.endm

.align 2
.thumb_func
qfp_dsub:
 push {r4-r7,r14}
 movs r4,#1
 lsls r4,#31
 eors r3,r4 @ flip sign on second argument
 b da_entry @ continue in dadd

.align 2
.thumb_func
qfp_dadd:
 push {r4-r7,r14}
da_entry:
 mdunpacks r0,r1,r4,r6,r7
 mdunpacks r2,r3,r5,r6,r7
 subs r7,r5,r4 @ ye-xe
 subs r6,r4,r5 @ xe-ye
 bmi da_ygtx
@ here xe>=ye: need to shift y down r6 places
 mov r12,r4 @ save exponent
 cmp r6,#32
 bge da_xrgty @ xe rather greater than ye?
 adds r7,#32
 movs r4,r2
 lsls r4,r4,r7 @ rounding bit + sticky bits
da_xgty0:
 movs r5,r3
 lsls r5,r5,r7
 lsrs r2,r6
 asrs r3,r6
 orrs r2,r5
da_add:
 adds r0,r2
 adcs r1,r3
da_pack:
@ here unnormalised signed result (possibly 0) is in r0:r1 with exponent r12, rounding + sticky bits in r4
@ Note that if a large normalisation shift is required then the arguments were close in magnitude and so we
@ cannot have gone via the xrgty/yrgtx paths. There will therefore always be enough high bits in r4
@ to provide a correct continuation of the exact result.
@ now pack result back up
 lsrs r3,r1,#31 @ get sign bit
 beq 1f @ skip on positive
 mvns r1,r1 @ negate mantissa
 mvns r0,r0
 movs r2,#0
 rsbs r4,#0
 adcs r0,r2
 adcs r1,r2
1:
 mov r2,r12 @ get exponent
 lsrs r5,r1,#21
 bne da_0 @ shift down required?
 lsrs r5,r1,#20
 bne da_1 @ normalised?
 cmp r0,#0
 beq da_5 @ could mantissa be zero?
da_2:
 adds r4,r4
 adcs r0,r0
 adcs r1,r1
 subs r2,#1 @ adjust exponent
 lsrs r5,r1,#20
 beq da_2
da_1:
 lsls r4,#1 @ check rounding bit
 bcc da_3
da_4:
 adds r0,#1 @ round up
 bcc 2f
 adds r1,#1
2:
 cmp r4,#0 @ sticky bits zero?
 bne da_3
 lsrs r0,#1 @ round to even
 lsls r0,#1
da_3:
 subs r2,#1
 bmi da_6
 adds r4,r2,#2 @ check if exponent is overflowing
 lsrs r4,#11
 bne da_7
 lsls r2,#20 @ pack exponent and sign
 add r1,r2
 lsls r3,#31
 add r1,r3
 pop {r4-r7,r15}

da_7:
@ here exponent overflow: return signed infinity
 lsls r1,r3,#31
 ldr r3,=#0x7ff00000
 orrs r1,r3
 b 1f
da_6:
@ here exponent underflow: return signed zero
 lsls r1,r3,#31
1:
 movs r0,#0
 pop {r4-r7,r15}

da_5:
@ here mantissa could be zero
 cmp r1,#0
 bne da_2
 cmp r4,#0
 bne da_2
@ inputs must have been of identical magnitude and opposite sign, so return +0
 pop {r4-r7,r15}

da_0:
@ here a shift down by one place is required for normalisation
 adds r2,#1 @ adjust exponent
 lsls r6,r0,#31 @ save rounding bit
 lsrs r0,#1
 lsls r5,r1,#31
 orrs r0,r5
 lsrs r1,#1
 cmp r6,#0
 beq da_3
 b da_4

da_xrgty: @ xe>ye and shift>=32 places
 cmp r6,#60
 bge da_xmgty @ xe much greater than ye?
 subs r6,#32
 adds r7,#64

 movs r4,r2
 lsls r4,r4,r7 @ these would be shifted off the bottom of the sticky bits
 beq 1f
 movs r4,#1
1:
 lsrs r2,r2,r6
 orrs r4,r2
 movs r2,r3
 lsls r3,r3,r7
 orrs r4,r3
 asrs r3,r2,#31 @ propagate sign bit
 b da_xgty0

da_ygtx:
@ here ye>xe: need to shift x down r7 places
 mov r12,r5 @ save exponent
 cmp r7,#32
 bge da_yrgtx @ ye rather greater than xe?
1591 | adds r6,#32 1592 | movs r4,r0 1593 | lsls r4,r4,r6 @ rounding bit + sticky bits 1594 | da_ygtx0: 1595 | movs r5,r1 1596 | lsls r5,r5,r6 1597 | lsrs r0,r7 1598 | asrs r1,r7 1599 | orrs r0,r5 1600 | b da_add 1601 | 1602 | da_yrgtx: 1603 | cmp r7,#60 1604 | bge da_ymgtx @ ye much greater than xe? 1605 | subs r7,#32 1606 | adds r6,#64 1607 | 1608 | movs r4,r0 1609 | lsls r4,r4,r6 @ these would be shifted off the bottom of the sticky bits 1610 | beq 1f 1611 | movs r4,#1 1612 | 1: 1613 | lsrs r0,r0,r7 1614 | orrs r4,r0 1615 | movs r0,r1 1616 | lsls r1,r1,r6 1617 | orrs r4,r1 1618 | asrs r1,r0,#31 @ propagate sign bit 1619 | b da_ygtx0 1620 | 1621 | da_ymgtx: @ result is just y 1622 | movs r0,r2 1623 | movs r1,r3 1624 | da_xmgty: @ result is just x 1625 | movs r4,#0 @ clear sticky bits 1626 | b da_pack 1627 | 1628 | .ltorg 1629 | 1630 | @ equivalent of UMULL 1631 | @ needs five temporary registers 1632 | @ can have rt3==rx, in which case rx trashed 1633 | @ can have rt4==ry, in which case ry trashed 1634 | @ can have rzl==rx 1635 | @ can have rzh==ry 1636 | @ can have rzl,rzh==rt3,rt4 1637 | .macro mul32_32_64 rx,ry,rzl,rzh,rt0,rt1,rt2,rt3,rt4 1638 | @ t0 t1 t2 t3 t4 1639 | @ (x) (y) 1640 | uxth \rt0,\rx @ xl 1641 | uxth \rt1,\ry @ yl 1642 | muls \rt0,\rt1 @ xlyl=L 1643 | lsrs \rt2,\rx,#16 @ xh 1644 | muls \rt1,\rt2 @ xhyl=M0 1645 | lsrs \rt4,\ry,#16 @ yh 1646 | muls \rt2,\rt4 @ xhyh=H 1647 | uxth \rt3,\rx @ xl 1648 | muls \rt3,\rt4 @ xlyh=M1 1649 | adds \rt1,\rt3 @ M0+M1=M 1650 | bcc l\@_1 @ addition of the two cross terms can overflow, so add carry into H 1651 | movs \rt3,#1 @ 1 1652 | lsls \rt3,#16 @ 0x10000 1653 | adds \rt2,\rt3 @ H' 1654 | l\@_1: 1655 | @ t0 t1 t2 t3 t4 1656 | @ (zl) (zh) 1657 | lsls \rzl,\rt1,#16 @ ML 1658 | lsrs \rzh,\rt1,#16 @ MH 1659 | adds \rzl,\rt0 @ ZL 1660 | adcs \rzh,\rt2 @ ZH 1661 | .endm 1662 | 1663 | @ SUMULL: x signed, y unsigned 1664 | @ in table below ¯ means signed variable 1665 | @ needs five temporary registers 1666 | @ can 
have rt3==rx, in which case rx trashed 1667 | @ can have rt4==ry, in which case ry trashed 1668 | @ can have rzl==rx 1669 | @ can have rzh==ry 1670 | @ can have rzl,rzh==rt3,rt4 1671 | .macro muls32_32_64 rx,ry,rzl,rzh,rt0,rt1,rt2,rt3,rt4 1672 | @ t0 t1 t2 t3 t4 1673 | @ ¯(x) (y) 1674 | uxth \rt0,\rx @ xl 1675 | uxth \rt1,\ry @ yl 1676 | muls \rt0,\rt1 @ xlyl=L 1677 | asrs \rt2,\rx,#16 @ ¯xh 1678 | muls \rt1,\rt2 @ ¯xhyl=M0 1679 | lsrs \rt4,\ry,#16 @ yh 1680 | muls \rt2,\rt4 @ ¯xhyh=H 1681 | uxth \rt3,\rx @ xl 1682 | muls \rt3,\rt4 @ xlyh=M1 1683 | asrs \rt4,\rt1,#31 @ M0sx (M1 sign extension is zero) 1684 | adds \rt1,\rt3 @ M0+M1=M 1685 | movs \rt3,#0 @ 0 1686 | adcs \rt4,\rt3 @ ¯Msx 1687 | lsls \rt4,#16 @ ¯Msx<<16 1688 | adds \rt2,\rt4 @ H' 1689 | 1690 | @ t0 t1 t2 t3 t4 1691 | @ (zl) (zh) 1692 | lsls \rzl,\rt1,#16 @ M~ 1693 | lsrs \rzh,\rt1,#16 @ M~ 1694 | adds \rzl,\rt0 @ ZL 1695 | adcs \rzh,\rt2 @ ¯ZH 1696 | .endm 1697 | 1698 | @ SSMULL: x signed, y signed 1699 | @ in table below ¯ means signed variable 1700 | @ needs five temporary registers 1701 | @ can have rt3==rx, in which case rx trashed 1702 | @ can have rt4==ry, in which case ry trashed 1703 | @ can have rzl==rx 1704 | @ can have rzh==ry 1705 | @ can have rzl,rzh==rt3,rt4 1706 | .macro muls32_s32_64 rx,ry,rzl,rzh,rt0,rt1,rt2,rt3,rt4 1707 | @ t0 t1 t2 t3 t4 1708 | @ ¯(x) (y) 1709 | uxth \rt0,\rx @ xl 1710 | uxth \rt1,\ry @ yl 1711 | muls \rt0,\rt1 @ xlyl=L 1712 | asrs \rt2,\rx,#16 @ ¯xh 1713 | muls \rt1,\rt2 @ ¯xhyl=M0 1714 | asrs \rt4,\ry,#16 @ ¯yh 1715 | muls \rt2,\rt4 @ ¯xhyh=H 1716 | uxth \rt3,\rx @ xl 1717 | muls \rt3,\rt4 @ ¯xlyh=M1 1718 | adds \rt1,\rt3 @ ¯M0+M1=M 1719 | asrs \rt3,\rt1,#31 @ Msx 1720 | bvc l\@_1 @ 1721 | mvns \rt3,\rt3 @ ¯Msx flip sign extension bits if overflow 1722 | l\@_1: 1723 | lsls \rt3,#16 @ ¯Msx<<16 1724 | adds \rt2,\rt3 @ H' 1725 | 1726 | @ t0 t1 t2 t3 t4 1727 | @ (zl) (zh) 1728 | lsls \rzl,\rt1,#16 @ M~ 1729 | lsrs \rzh,\rt1,#16 @ M~ 1730 | adds \rzl,\rt0 @ ZL 1731 | 
adcs \rzh,\rt2 @ ¯ZH 1732 | .endm 1733 | 1734 | @ can have rt2==rx, in which case rx trashed 1735 | @ can have rzl==rx 1736 | @ can have rzh==rt1 1737 | .macro square32_64 rx,rzl,rzh,rt0,rt1,rt2 1738 | @ t0 t1 t2 zl zh 1739 | uxth \rt0,\rx @ xl 1740 | muls \rt0,\rt0 @ xlxl=L 1741 | uxth \rt1,\rx @ xl 1742 | lsrs \rt2,\rx,#16 @ xh 1743 | muls \rt1,\rt2 @ xlxh=M 1744 | muls \rt2,\rt2 @ xhxh=H 1745 | lsls \rzl,\rt1,#17 @ ML 1746 | lsrs \rzh,\rt1,#15 @ MH 1747 | adds \rzl,\rt0 @ ZL 1748 | adcs \rzh,\rt2 @ ZH 1749 | .endm 1750 | 1751 | .align 2 1752 | .thumb_func 1753 | qfp_dmul: 1754 | push {r4-r7,r14} 1755 | mdunpack r0,r1,r4,r6,r5 1756 | mov r12,r4 1757 | mdunpack r2,r3,r4,r7,r5 1758 | eors r7,r6 @ sign of result 1759 | add r4,r12 @ exponent of result 1760 | push {r0-r2,r4,r7} 1761 | 1762 | @ accumulate full product in r12:r5:r6:r7 1763 | mul32_32_64 r0,r2, r0,r5, r4,r6,r7,r0,r5 @ XL*YL 1764 | mov r12,r0 @ save LL bits 1765 | 1766 | mul32_32_64 r1,r3, r6,r7, r0,r2,r4,r6,r7 @ XH*YH 1767 | 1768 | pop {r0} @ XL 1769 | mul32_32_64 r0,r3, r0,r3, r1,r2,r4,r0,r3 @ XL*YH 1770 | adds r5,r0 1771 | adcs r6,r3 1772 | movs r0,#0 1773 | adcs r7,r0 1774 | 1775 | pop {r1,r2} @ XH,YL 1776 | mul32_32_64 r1,r2, r1,r2, r0,r3,r4, r1,r2 @ XH*YL 1777 | adds r5,r1 1778 | adcs r6,r2 1779 | movs r0,#0 1780 | adcs r7,r0 1781 | 1782 | @ here r5:r6:r7 holds the product [1..4) in Q(104-32)=Q72, with extra LSBs in r12 1783 | pop {r3,r4} @ exponent in r3, sign in r4 1784 | lsls r1,r7,#11 1785 | lsrs r2,r6,#21 1786 | orrs r1,r2 1787 | lsls r0,r6,#11 1788 | lsrs r2,r5,#21 1789 | orrs r0,r2 1790 | lsls r5,#11 @ now r5:r0:r1 Q83=Q(51+32), extra LSBs in r12 1791 | lsrs r2,r1,#20 1792 | bne 1f @ skip if in range [2..4) 1793 | adds r5,r5 @ shift up so always [2..4) Q83, i.e. 
[1..2) Q84=Q(52+32) 1794 | adcs r0,r0 1795 | adcs r1,r1 1796 | subs r3,#1 @ correct exponent 1797 | 1: 1798 | ldr r6,=#0x3ff 1799 | subs r3,r6 @ correct for exponent bias 1800 | lsls r6,#1 @ 0x7fe 1801 | cmp r3,r6 1802 | bhs dm_0 @ exponent over- or underflow 1803 | lsls r5,#1 @ rounding bit to carry 1804 | bcc 1f @ result is correctly rounded 1805 | adds r0,#1 1806 | movs r6,#0 1807 | adcs r1,r6 @ round up 1808 | mov r6,r12 @ remaining sticky bits 1809 | orrs r5,r6 1810 | bne 1f @ some sticky bits set? 1811 | lsrs r0,#1 1812 | lsls r0,#1 @ round to even 1813 | 1: 1814 | lsls r3,#20 1815 | adds r1,r3 1816 | dm_2: 1817 | lsls r4,#31 1818 | add r1,r4 1819 | pop {r4-r7,r15} 1820 | 1821 | @ here for exponent over- or underflow 1822 | dm_0: 1823 | bge dm_1 @ overflow? 1824 | adds r3,#1 @ would-be zero exponent? 1825 | bne 1f 1826 | adds r0,#1 1827 | bne 1f @ all-ones mantissa? 1828 | adds r1,#1 1829 | lsrs r7,r1,#21 1830 | beq 1f 1831 | lsrs r1,#1 1832 | b dm_2 1833 | 1: 1834 | lsls r1,r4,#31 1835 | movs r0,#0 1836 | pop {r4-r7,r15} 1837 | 1838 | @ here for exponent overflow 1839 | dm_1: 1840 | adds r6,#1 @ 0x7ff 1841 | lsls r1,r6,#20 1842 | movs r0,#0 1843 | b dm_2 1844 | 1845 | .ltorg 1846 | 1847 | @ Approach to division y/x is as follows. 1848 | @ 1849 | @ First generate u1, an approximation to 1/x to about 29 bits. Multiply this by the top 1850 | @ 32 bits of y to generate a0, a first approximation to the result (good to 28 bits or so). 1851 | @ Calculate the exact remainder r0=y-a0*x, which will be about 0. Calculate a correction 1852 | @ d0=r0*u1, and then write a1=a0+d0. If near a rounding boundary, compute the exact 1853 | @ remainder r1=y-a1*x (which can be done using r0 as a basis) to determine whether to 1854 | @ round up or down. 1855 | @ 1856 | @ The calculation of 1/x is as given in dreciptest.c. That code verifies exhaustively 1857 | @ that | u1*x-1 | < 10*2^-32. 
1858 | @ 1859 | @ More precisely: 1860 | @ 1861 | @ x0=(q16)x; 1862 | @ x1=(q30)x; 1863 | @ y0=(q31)y; 1864 | @ u0=(q15~)"(0xffffffffU/(unsigned int)roundq(x/x_ulp))/powq(2,16)"(x0); // q15 approximation to 1/x; "~" denotes rounding rather than truncation 1865 | @ v=(q30)(u0*x1-1); 1866 | @ u1=(q30)u0-(q30~)(u0*v); 1867 | @ 1868 | @ a0=(q30)(u1*y0); 1869 | @ r0=(q82)y-a0*x; 1870 | @ r0x=(q57)r0; 1871 | @ d0=r0x*u1; 1872 | @ a1=d0+a0; 1873 | @ 1874 | @ Error analysis 1875 | @ 1876 | @ Use Greek letters to represent the errors introduced by rounding and truncation. 1877 | @ 1878 | @ r₀ = y - a₀x 1879 | @ = y - [ u₁ ( y - α ) - β ] x where 0 ≤ α < 2^-31, 0 ≤ β < 2^-30 1880 | @ = y ( 1 - u₁x ) + ( u₁α + β ) x 1881 | @ 1882 | @ Hence 1883 | @ 1884 | @ | r₀ / x | < 2 * 10*2^-32 + 2^-31 + 2^-30 1885 | @ = 26*2^-32 1886 | @ 1887 | @ r₁ = y - a₁x 1888 | @ = y - a₀x - d₀x 1889 | @ = r₀ - d₀x 1890 | @ = r₀ - u₁ ( r₀ - γ ) x where 0 ≤ γ < 2^-57 1891 | @ = r₀ ( 1 - u₁x ) + u₁γx 1892 | @ 1893 | @ Hence 1894 | @ 1895 | @ | r₁ / x | < 26*2^-32 * 10*2^-32 + 2^-57 1896 | @ = (260+128)*2^-64 1897 | @ < 2^-55 1898 | @ 1899 | @ Empirically it seems to be nearly twice as good as this. 1900 | @ 1901 | @ To determine correctly whether the exact remainder calculation can be skipped we need a result 1902 | @ accurate to < 0.25ulp. In the case where x>y the quotient will be shifted up one place for normalisation 1903 | @ and so 1ulp is 2^-53 and so the calculation above suffices. 1904 | 1905 | .align 2 1906 | .thumb_func 1907 | qfp_ddiv: 1908 | push {r4-r7,r14} 1909 | ddiv0: @ entry point from dtan 1910 | mdunpack r2,r3,r4,r7,r6 @ unpack divisor 1911 | 1912 | @ unpack dividend by hand to save on register use 1913 | lsrs r6,r1,#31 1914 | adds r6,r7 1915 | mov r12,r6 @ result sign in r12b0; r12b1 trashed 1916 | lsls r1,#1 1917 | lsrs r7,r1,#21 @ exponent 1918 | beq 1f @ zero exponent? 1919 | adds r6,r7,#1 1920 | lsrs r6,#11 1921 | beq 2f @ exponent != 0x7ff? 
then done 1922 | 1: 1923 | movs r0,#0 1924 | movs r1,#0 1925 | subs r7,#64 @ less drastic fiddling of exponents to get 0/0, Inf/Inf correct 1926 | lsls r7,#12 1927 | 2: 1928 | subs r6,r7,r4 1929 | lsls r6,#2 1930 | add r12,r12,r6 @ (signed) exponent in r12[31..8] 1931 | subs r7,#1 @ implied 1 1932 | lsls r7,#21 1933 | subs r1,r7 1934 | lsrs r1,#1 1935 | 1936 | // see dreciptest-boxc.c 1937 | lsrs r4,r3,#15 @ x2=x>>15; // Q5 32..63 1938 | ldr r5,=#(rcpapp-32) 1939 | ldrb r4,[r5,r4] @ u=lut5[x2-32]; // Q8 1940 | lsls r5,r3,#8 1941 | muls r5,r5,r4 1942 | asrs r5,#14 @ e=(i32)(u*(x<<8))>>14; // Q22 1943 | asrs r6,r5,#11 1944 | muls r6,r6,r6 @ e2=(e>>11)*(e>>11); // Q22 1945 | subs r5,r6 1946 | muls r5,r5,r4 @ c=(e-e2)*u; // Q30 1947 | lsls r6,r4,#7 1948 | asrs r5,#14 1949 | adds r5,#1 1950 | asrs r5,#1 1951 | subs r6,r5 @ u0=(u<<7)-((c+0x4000)>>15); // Q15 1952 | 1953 | @ here 1954 | @ r0:r1 y mantissa 1955 | @ r2:r3 x mantissa 1956 | @ r6 u0, first approximation to 1/x Q15 1957 | @ r12: result sign, exponent 1958 | 1959 | lsls r4,r3,#10 1960 | lsrs r5,r2,#22 1961 | orrs r5,r4 @ x1=(q30)x 1962 | muls r5,r6 @ u0*x1 Q45 1963 | asrs r5,#15 @ v=u0*x1-1 Q30 1964 | muls r5,r6 @ u0*v Q45 1965 | asrs r5,#14 1966 | adds r5,#1 1967 | asrs r5,#1 @ round u0*v to Q30 1968 | lsls r6,#15 1969 | subs r6,r5 @ u1 Q30 1970 | 1971 | @ here 1972 | @ r0:r1 y mantissa 1973 | @ r2:r3 x mantissa 1974 | @ r6 u1, second approximation to 1/x Q30 1975 | @ r12: result sign, exponent 1976 | 1977 | push {r2,r3} 1978 | lsls r4,r1,#11 1979 | lsrs r5,r0,#21 1980 | orrs r4,r5 @ y0=(q31)y 1981 | mul32_32_64 r4,r6, r4,r5, r2,r3,r7,r4,r5 @ y0*u1 Q61 1982 | adds r4,r4 1983 | adcs r5,r5 @ a0=(q30)(y0*u1) 1984 | 1985 | @ here 1986 | @ r0:r1 y mantissa 1987 | @ r5 a0, first approximation to y/x Q30 1988 | @ r6 u1, second approximation to 1/x Q30 1989 | @ r12 result sign, exponent 1990 | 1991 | ldr r2,[r13,#0] @ xL 1992 | mul32_32_64 r2,r5, r2,r3, r1,r4,r7,r2,r3 @ xL*a0 1993 | ldr r4,[r13,#4] @ xH 1994 | muls 
r4,r5 @ xH*a0 1995 | adds r3,r4 @ r2:r3 now x*a0 Q82 1996 | lsrs r2,#25 1997 | lsls r1,r3,#7 1998 | orrs r2,r1 @ r2 now x*a0 Q57; r7:r2 is x*a0 Q89 1999 | lsls r4,r0,#5 @ y Q57 2000 | subs r0,r4,r2 @ r0x=y-x*a0 Q57 (signed) 2001 | 2002 | @ here 2003 | @ r0 r0x Q57 2004 | @ r5 a0, first approximation to y/x Q30 2005 | @ r4 yL Q57 2006 | @ r6 u1 Q30 2007 | @ r12 result sign, exponent 2008 | 2009 | muls32_32_64 r0,r6, r7,r6, r1,r2,r3, r7,r6 @ r7:r6 r0x*u1 Q87 2010 | asrs r3,r6,#25 2011 | adds r5,r3 2012 | lsls r3,r6,#7 @ r3:r5 a1 Q62 (but bottom 7 bits are zero so 55 bits of precision after binary point) 2013 | @ here we could recover another 7 bits of precision (but not accuracy) from the top of r7 2014 | @ but these bits are thrown away in the rounding and conversion to Q52 below 2015 | 2016 | @ here 2017 | @ r3:r5 a1 Q62 candidate quotient [0.5,2) or so 2018 | @ r4 yL Q57 2019 | @ r12 result sign, exponent 2020 | 2021 | movs r6,#0 2022 | adds r3,#128 @ for initial rounding to Q53 2023 | adcs r5,r5,r6 2024 | lsrs r1,r5,#30 2025 | bne dd_0 2026 | @ here candidate quotient a1 is in range [0.5,1) 2027 | @ so 30 significant bits in r5 2028 | 2029 | lsls r4,#1 @ y now Q58 2030 | lsrs r1,r5,#9 @ to Q52 2031 | lsls r0,r5,#23 2032 | lsrs r3,#9 @ 0.5ulp-significance bit in carry: if this is 1 we may need to correct result 2033 | orrs r0,r3 2034 | bcs dd_1 2035 | b dd_2 2036 | dd_0: 2037 | @ here candidate quotient a1 is in range [1,2) 2038 | @ so 31 significant bits in r5 2039 | 2040 | movs r2,#4 2041 | add r12,r12,r2 @ fix exponent; r3:r5 now effectively Q61 2042 | adds r3,#128 @ complete rounding to Q53 2043 | adcs r5,r5,r6 2044 | lsrs r1,r5,#10 2045 | lsls r0,r5,#22 2046 | lsrs r3,#10 @ 0.5ulp-significance bit in carry: if this is 1 we may need to correct result 2047 | orrs r0,r3 2048 | bcc dd_2 2049 | dd_1: 2050 | 2051 | @ here 2052 | @ r0:r1 rounded result Q53 [0.5,1) or Q52 [1,2), but may not be correctly rounded-to-nearest 2053 | @ r4 yL Q58 or Q57 2054 | @ r12 result 
sign, exponent 2055 | @ carry set 2056 | 2057 | adcs r0,r0,r0 2058 | adcs r1,r1,r1 @ z Q53 with 1 in LSB 2059 | lsls r4,#16 @ Q105-32=Q73 2060 | ldr r2,[r13,#0] @ xL Q52 2061 | ldr r3,[r13,#4] @ xH Q20 2062 | 2063 | movs r5,r1 @ zH Q21 2064 | muls r5,r2 @ zH*xL Q73 2065 | subs r4,r5 2066 | muls r3,r0 @ zL*xH Q73 2067 | subs r4,r3 2068 | mul32_32_64 r2,r0, r2,r3, r5,r6,r7,r2,r3 @ xL*zL 2069 | rsbs r2,#0 @ borrow from low half? 2070 | sbcs r4,r3 @ y-xz Q73 (remainder bits 52..73) 2071 | 2072 | cmp r4,#0 2073 | 2074 | bmi 1f 2075 | movs r2,#0 @ round up 2076 | adds r0,#1 2077 | adcs r1,r2 2078 | 1: 2079 | lsrs r0,#1 @ shift back down to Q52 2080 | lsls r2,r1,#31 2081 | orrs r0,r2 2082 | lsrs r1,#1 2083 | dd_2: 2084 | add r13,#8 2085 | mov r2,r12 2086 | lsls r7,r2,#31 @ result sign 2087 | asrs r2,#2 @ result exponent 2088 | ldr r3,=#0x3fd 2089 | adds r2,r3 2090 | ldr r3,=#0x7fe 2091 | cmp r2,r3 2092 | bhs dd_3 @ over- or underflow? 2093 | lsls r2,#20 2094 | adds r1,r2 @ pack exponent 2095 | dd_5: 2096 | adds r1,r7 @ pack sign 2097 | pop {r4-r7,r15} 2098 | 2099 | dd_3: 2100 | movs r0,#0 2101 | cmp r2,#0 2102 | bgt dd_4 @ overflow? 2103 | movs r1,r7 2104 | pop {r4-r7,r15} 2105 | 2106 | dd_4: 2107 | adds r3,#1 @ 0x7ff 2108 | lsls r1,r3,#20 2109 | b dd_5 2110 | 2111 | /* 2112 | Approach to square root x=sqrt(y) is as follows. 2113 | 2114 | First generate a3, an approximation to 1/sqrt(y) to about 30 bits. Multiply this by y 2115 | to give a4~sqrt(y) to about 28 bits and a remainder r4=y-a4^2. Then, because 2116 | d sqrt(y) / dy = 1 / (2 sqrt(y)) let d4=r4*a3/2 and then the value a5=a4+d4 is 2117 | a better approximation to sqrt(y). If this is near a rounding boundary we 2118 | compute an exact remainder y-a5*a5 to decide whether to round up or down. 2119 | 2120 | The calculation of a3 and a4 is as given in dsqrttest.c. That code verifies exhaustively 2121 | that | 1 - a3a4 | < 10*2^-32, | r4 | < 40*2^-32 and | r4/y | < 20*2^-32. 
2122 | 2123 | More precisely, with "y" representing y truncated to 30 binary places: 2124 | 2125 | u=(q3)y; // 24-entry table 2126 | a0=(q8~)"1/sqrtq(x+x_ulp/2)"(u); // first approximation from table 2127 | p0=(q16)(a0*a0) * (q16)y; 2128 | r0=(q20)(p0-1); 2129 | dy0=(q15)(r0*a0); // Newton-Raphson correction term 2130 | a1=(q16)a0-dy0/2; // good to ~9 bits 2131 | 2132 | p1=(q19)(a1*a1)*(q19)y; 2133 | r1=(q23)(p1-1); 2134 | dy1=(q15~)(r1*a1); // second Newton-Raphson correction 2135 | a2x=(q16)a1-dy1/2; // good to ~16 bits 2136 | a2=a2x-a2x/1t16; // prevent overflow of a2*a2 in 32 bits 2137 | 2138 | p2=(a2*a2)*(q30)y; // Q62 2139 | r2=(q36)(p2-1+1t-31); 2140 | dy2=(q30)(r2*a2); // Q52->Q30 2141 | a3=(q31)a2-dy2/2; // good to about 30 bits 2142 | a4=(q30)(a3*(q30)y+1t-31); // good to about 28 bits 2143 | 2144 | Error analysis 2145 | 2146 | r₄ = y - a₄² 2147 | d₄ = 1/2 a₃r₄ 2148 | a₅ = a₄ + d₄ 2149 | r₅ = y - a₅² 2150 | = y - ( a₄ + d₄ )² 2151 | = y - a₄² - a₃a₄r₄ - 1/4 a₃²r₄² 2152 | = r₄ - a₃a₄r₄ - 1/4 a₃²r₄² 2153 | 2154 | | r₅ | < | r₄ | | 1 - a₃a₄ | + 1/4 r₄² 2155 | 2156 | a₅ = √y √( 1 - r₅/y ) 2157 | = √y ( 1 - 1/2 r₅/y + ... ) 2158 | 2159 | So to first order (second order being very tiny) 2160 | 2161 | √y - a₅ = 1/2 r₅/y 2162 | 2163 | and 2164 | 2165 | | √y - a₅ | < 1/2 ( | r₄/y | | 1 - a₃a₄ | + 1/4 r₄²/y ) 2166 | 2167 | From dsqrttest.c (conservatively): 2168 | 2169 | < 1/2 ( 20*2^-32 * 10*2^-32 + 1/4 * 40*2^-32*20*2^-32 ) 2170 | = 1/2 ( 200 + 200 ) * 2^-64 2171 | < 2^-56 2172 | 2173 | Empirically we see about 1ulp worst-case error including rounding at Q57. 2174 | 2175 | To determine correctly whether the exact remainder calculation can be skipped we need a result 2176 | accurate to < 0.25ulp at Q52, or 2^-54. 2177 | */ 2178 | 2179 | dq_2: 2180 | bge dq_3 @ +Inf? 2181 | movs r1,#0 2182 | b dq_4 2183 | 2184 | dq_0: 2185 | lsrs r1,#31 2186 | lsls r1,#31 @ preserve sign bit 2187 | lsrs r2,#21 @ extract exponent 2188 | beq dq_4 @ -0? 
return it 2189 | asrs r1,#11 @ make -Inf 2190 | b dq_4 2191 | 2192 | dq_3: 2193 | ldr r1,=#0x7ff 2194 | lsls r1,#20 @ return +Inf 2195 | dq_4: 2196 | movs r0,#0 2197 | dq_1: 2198 | bx r14 2199 | 2200 | .align 2 2201 | .thumb_func 2202 | qfp_dsqrt: 2203 | lsls r2,r1,#1 2204 | bcs dq_0 @ negative? 2205 | lsrs r2,#21 @ extract exponent 2206 | subs r2,#1 2207 | ldr r3,=#0x7fe 2208 | cmp r2,r3 2209 | bhs dq_2 @ catches 0 and +Inf 2210 | push {r4-r7,r14} 2211 | lsls r4,r2,#20 2212 | subs r1,r4 @ insert implied 1 2213 | lsrs r2,#1 2214 | bcc 1f @ even exponent? skip 2215 | adds r0,r0,r0 @ odd exponent: shift up mantissa 2216 | adcs r1,r1,r1 2217 | 1: 2218 | lsrs r3,#2 2219 | adds r2,r3 2220 | lsls r2,#20 2221 | mov r12,r2 @ save result exponent 2222 | 2223 | @ here 2224 | @ r0:r1 y mantissa Q52 [1,4) 2225 | @ r12 result exponent 2226 | 2227 | adr r4,drsqrtapp-8 @ first eight table entries are never accessed because of the mantissa's leading 1 2228 | lsrs r2,r1,#17 @ y Q3 2229 | ldrb r2,[r4,r2] @ initial approximation to reciprocal square root a0 Q8 2230 | lsrs r3,r1,#4 @ first Newton-Raphson iteration 2231 | muls r3,r2 2232 | muls r3,r2 @ i32 p0=a0*a0*(y>>14); // Q32 2233 | asrs r3,r3,#12 @ i32 r0=p0>>12; // Q20 2234 | muls r3,r2 2235 | asrs r3,#13 @ i32 dy0=(r0*a0)>>13; // Q15 2236 | lsls r2,#8 2237 | subs r2,r3 @ i32 a1=(a0<<8)-dy0; // Q16 2238 | 2239 | movs r3,r2 2240 | muls r3,r3 2241 | lsrs r3,#13 2242 | lsrs r4,r1,#1 2243 | muls r3,r4 @ i32 p1=((a1*a1)>>11)*(y>>11); // Q19*Q19=Q38 2244 | asrs r3,#15 @ i32 r1=p1>>15; // Q23 2245 | muls r3,r2 2246 | asrs r3,#23 2247 | adds r3,#1 2248 | asrs r3,#1 @ i32 dy1=(r1*a1+(1<<23))>>24; // Q23*Q16=Q39; Q15 2249 | subs r2,r3 @ i32 a2=a1-dy1; // Q16 2250 | lsrs r3,r2,#16 2251 | subs r2,r3 @ if(a2>=0x10000) a2=0xffff; to prevent overflow of a2*a2 2252 | 2253 | @ here 2254 | @ r0:r1 y mantissa 2255 | @ r2 a2 ~ 1/sqrt(y) Q16 2256 | @ r12 result exponent 2257 | 2258 | movs r3,r2 2259 | muls r3,r3 2260 | lsls r1,#10 2261 | lsrs 
r4,r0,#22 2262 | orrs r1,r4 @ y Q30 2263 | mul32_32_64 r1,r3, r4,r3, r5,r6,r7,r4,r3 @ i64 p2=(ui64)(a2*a2)*(ui64)y; // Q62 r4:r3 2264 | lsls r5,r3,#6 2265 | lsrs r4,#26 2266 | orrs r4,r5 2267 | adds r4,#0x20 @ i32 r2=(p2>>26)+0x20; // Q36 r4 2268 | uxth r5,r4 2269 | muls r5,r2 2270 | asrs r4,#16 2271 | muls r4,r2 2272 | lsrs r5,#16 2273 | adds r4,r5 2274 | asrs r4,#6 @ i32 dy2=((i64)r2*(i64)a2)>>22; // Q36*Q16=Q52; Q30 2275 | lsls r2,#15 2276 | subs r2,r4 2277 | 2278 | @ here 2279 | @ r0 y low bits 2280 | @ r1 y Q30 2281 | @ r2 a3 ~ 1/sqrt(y) Q31 2282 | @ r12 result exponent 2283 | 2284 | mul32_32_64 r2,r1, r3,r4, r5,r6,r7,r3,r4 2285 | adds r3,r3,r3 2286 | adcs r4,r4,r4 2287 | adds r3,r3,r3 2288 | movs r3,#0 2289 | adcs r3,r4 @ ui32 a4=((ui64)a3*(ui64)y+(1U<<31))>>31; // Q30 2290 | 2291 | @ here 2292 | @ r0 y low bits 2293 | @ r1 y Q30 2294 | @ r2 a3 Q31 ~ 1/sqrt(y) 2295 | @ r3 a4 Q30 ~ sqrt(y) 2296 | @ r12 result exponent 2297 | 2298 | square32_64 r3, r4,r5, r6,r5,r7 2299 | lsls r6,r0,#8 2300 | lsrs r7,r1,#2 2301 | subs r6,r4 2302 | sbcs r7,r5 @ r4=(q60)y-a4*a4 2303 | 2304 | @ by exhaustive testing, r4 = fffffffc0e134fdc .. 
00000003c2bf539c Q60 2305 | 2306 | lsls r5,r7,#29 2307 | lsrs r6,#3 2308 | adcs r6,r5 @ r4 Q57 with rounding 2309 | muls32_32_64 r6,r2, r6,r2, r4,r5,r7,r6,r2 @ d4=a3*r4/2 Q89 2310 | @ r4+d4 is correct to 1ULP at Q57, tested on ~9bn cases including all extreme values of r4 for each possible y Q30 2311 | 2312 | adds r2,#8 2313 | asrs r2,#5 @ d4 Q52, rounded to Q53 with spare bit in carry 2314 | 2315 | @ here 2316 | @ r0 y low bits 2317 | @ r1 y Q30 2318 | @ r2 d4 Q52, rounded to Q53 2319 | @ C flag contains d4_b53 2320 | @ r3 a4 Q30 2321 | 2322 | bcs dq_5 2323 | 2324 | lsrs r5,r3,#10 @ a4 Q52 2325 | lsls r4,r3,#22 2326 | 2327 | asrs r1,r2,#31 2328 | adds r0,r2,r4 2329 | adcs r1,r5 @ a4+d4 2330 | 2331 | add r1,r12 @ pack exponent 2332 | pop {r4-r7,r15} 2333 | 2334 | .ltorg 2335 | 2336 | 2337 | @ round(sqrt(2^22./[68:8:252])) 2338 | drsqrtapp: 2339 | .byte 0xf8,0xeb,0xdf,0xd6,0xcd,0xc5,0xbe,0xb8 2340 | .byte 0xb2,0xad,0xa8,0xa4,0xa0,0x9c,0x99,0x95 2341 | .byte 0x92,0x8f,0x8d,0x8a,0x88,0x85,0x83,0x81 2342 | 2343 | dq_5: 2344 | @ here we are near a rounding boundary, C is set 2345 | adcs r2,r2,r2 @ d4 Q53+1ulp 2346 | lsrs r5,r3,#9 2347 | lsls r4,r3,#23 @ r4:r5 a4 Q53 2348 | asrs r1,r2,#31 2349 | adds r4,r2,r4 2350 | adcs r5,r1 @ r4:r5 a5=a4+d4 Q53+1ulp 2351 | movs r3,r5 2352 | muls r3,r4 2353 | square32_64 r4,r1,r2,r6,r2,r7 2354 | adds r2,r3 2355 | adds r2,r3 @ r1:r2 a5^2 Q106 2356 | lsls r0,#22 @ y Q84 2357 | 2358 | rsbs r1,#0 2359 | sbcs r0,r2 @ remainder y-a5^2 2360 | bmi 1f @ y 2373 | @ also set flags accordingly 2374 | .thumb_func 2375 | qfp_dcmp: 2376 | push {r4,r6,r7,r14} 2377 | ldr r7,=#0x7ff @ flush NaNs and denormals 2378 | lsls r4,r1,#1 2379 | lsrs r4,#21 2380 | beq 1f 2381 | cmp r4,r7 2382 | bne 2f 2383 | 1: 2384 | movs r0,#0 2385 | lsrs r1,#20 2386 | lsls r1,#20 2387 | 2: 2388 | lsls r4,r3,#1 2389 | lsrs r4,#21 2390 | beq 1f 2391 | cmp r4,r7 2392 | bne 2f 2393 | 1: 2394 | movs r2,#0 2395 | lsrs r3,#20 2396 | lsls r3,#20 2397 | 2: 2398 | dcmp_fast_entry: 2399 
| movs r6,#1 2400 | eors r3,r1 2401 | bmi 4f @ opposite signs? then can proceed on basis of sign of x 2402 | eors r3,r1 @ restore r3 2403 | bpl 1f 2404 | rsbs r6,#0 @ negative? flip comparison 2405 | 1: 2406 | cmp r1,r3 2407 | bne 1f 2408 | cmp r0,r2 2409 | bhi 2f 2410 | blo 3f 2411 | 5: 2412 | movs r6,#0 @ equal? result is 0 2413 | 1: 2414 | bgt 2f 2415 | 3: 2416 | rsbs r6,#0 2417 | 2: 2418 | subs r0,r6,#0 @ copy and set flags 2419 | pop {r4,r6,r7,r15} 2420 | 4: 2421 | orrs r3,r1 @ make -0==+0 2422 | adds r3,r3 2423 | orrs r3,r0 2424 | orrs r3,r2 2425 | beq 5b 2426 | cmp r1,#0 2427 | bge 2b 2428 | b 3b 2429 | 2430 | 2431 | @ "scientific" functions start here 2432 | 2433 | .thumb_func 2434 | push_r8_r11: 2435 | mov r4,r8 2436 | mov r5,r9 2437 | mov r6,r10 2438 | mov r7,r11 2439 | push {r4-r7} 2440 | bx r14 2441 | 2442 | .thumb_func 2443 | pop_r8_r11: 2444 | pop {r4-r7} 2445 | mov r8,r4 2446 | mov r9,r5 2447 | mov r10,r6 2448 | mov r11,r7 2449 | bx r14 2450 | 2451 | @ double-length CORDIC rotation step 2452 | 2453 | @ r0:r1 ω 2454 | @ r6 32-i (complementary shift) 2455 | @ r7 i (shift) 2456 | @ r8:r9 x 2457 | @ r10:r11 y 2458 | @ r12 coefficient pointer 2459 | 2460 | @ an option in rotation mode would be to compute the sequence of σ values 2461 | @ in one pass, rotate the initial vector by the residual ω and then run a 2462 | @ second pass to compute the final x and y. This would relieve pressure 2463 | @ on registers and hence possibly be faster. The same trick does not work 2464 | @ in vectoring mode (but perhaps one could work to single precision in 2465 | @ a first pass and then double precision in a second pass?). 
2466 | 2467 | .thumb_func 2468 | dcordic_vec_step: 2469 | mov r2,r12 2470 | ldmia r2!,{r3,r4} 2471 | mov r12,r2 2472 | mov r2,r11 2473 | cmp r2,#0 2474 | blt 1f 2475 | b 2f 2476 | 2477 | .thumb_func 2478 | dcordic_rot_step: 2479 | mov r2,r12 2480 | ldmia r2!,{r3,r4} 2481 | mov r12,r2 2482 | cmp r1,#0 2483 | bge 1f 2484 | 2: 2485 | @ ω<0 / y>=0 2486 | @ ω+=dω 2487 | @ x+=y>>i, y-=x>>i 2488 | adds r0,r3 2489 | adcs r1,r4 2490 | 2491 | mov r3,r11 2492 | asrs r3,r7 2493 | mov r4,r11 2494 | lsls r4,r6 2495 | mov r2,r10 2496 | lsrs r2,r7 2497 | orrs r2,r4 @ r2:r3 y>>i, rounding in carry 2498 | mov r4,r8 2499 | mov r5,r9 @ r4:r5 x 2500 | adcs r2,r4 2501 | adcs r3,r5 @ r2:r3 x+(y>>i) 2502 | mov r8,r2 2503 | mov r9,r3 2504 | 2505 | mov r3,r5 2506 | lsls r3,r6 2507 | asrs r5,r7 2508 | lsrs r4,r7 2509 | orrs r4,r3 @ r4:r5 x>>i, rounding in carry 2510 | mov r2,r10 2511 | mov r3,r11 2512 | sbcs r2,r4 2513 | sbcs r3,r5 @ r2:r3 y-(x>>i) 2514 | mov r10,r2 2515 | mov r11,r3 2516 | bx r14 2517 | 2518 | 2519 | @ ω>0 / y<0 2520 | @ ω-=dω 2521 | @ x-=y>>i, y+=x>>i 2522 | 1: 2523 | subs r0,r3 2524 | sbcs r1,r4 2525 | 2526 | mov r3,r9 2527 | asrs r3,r7 2528 | mov r4,r9 2529 | lsls r4,r6 2530 | mov r2,r8 2531 | lsrs r2,r7 2532 | orrs r2,r4 @ r2:r3 x>>i, rounding in carry 2533 | mov r4,r10 2534 | mov r5,r11 @ r4:r5 y 2535 | adcs r2,r4 2536 | adcs r3,r5 @ r2:r3 y+(x>>i) 2537 | mov r10,r2 2538 | mov r11,r3 2539 | 2540 | mov r3,r5 2541 | lsls r3,r6 2542 | asrs r5,r7 2543 | lsrs r4,r7 2544 | orrs r4,r3 @ r4:r5 y>>i, rounding in carry 2545 | mov r2,r8 2546 | mov r3,r9 2547 | sbcs r2,r4 2548 | sbcs r3,r5 @ r2:r3 x-(y>>i) 2549 | mov r8,r2 2550 | mov r9,r3 2551 | bx r14 2552 | 2553 | ret_dzero: 2554 | movs r0,#0 2555 | movs r1,#0 2556 | bx r14 2557 | 2558 | @ convert packed double in r0:r1 to signed/unsigned 32/64-bit integer/fixed-point value in r0:r1 [with r2 places after point], with rounding towards -Inf 2559 | @ fixed-point versions only work with reasonable values in r2 because of the way 
dunpacks works 2560 | 2561 | .thumb_func 2562 | qfp_double2int: 2563 | movs r2,#0 @ and fall through 2564 | .thumb_func 2565 | qfp_double2fix: 2566 | push {r14} 2567 | adds r2,#32 2568 | bl qfp_double2fix64 2569 | movs r0,r1 2570 | pop {r15} 2571 | 2572 | .thumb_func 2573 | qfp_double2uint: 2574 | movs r2,#0 @ and fall through 2575 | .thumb_func 2576 | qfp_double2ufix: 2577 | push {r14} 2578 | adds r2,#32 2579 | bl qfp_double2ufix64 2580 | movs r0,r1 2581 | pop {r15} 2582 | 2583 | .thumb_func 2584 | qfp_float2int64: 2585 | movs r1,#0 @ and fall through 2586 | .thumb_func 2587 | qfp_float2fix64: 2588 | push {r14} 2589 | bl f2fix 2590 | b d2f64_a 2591 | 2592 | .thumb_func 2593 | qfp_float2uint64: 2594 | movs r1,#0 @ and fall through 2595 | .thumb_func 2596 | qfp_float2ufix64: 2597 | asrs r3,r0,#23 @ negative? return 0 2598 | bmi ret_dzero 2599 | @ and fall through 2600 | 2601 | @ convert float in r0 to signed fixed point in r0:r1:r3, r1 places after point, rounding towards -Inf 2602 | @ result clamped so that r3 can only be 0 or -1 2603 | @ trashes r12 2604 | .thumb_func 2605 | f2fix: 2606 | push {r4,r14} 2607 | mov r12,r1 2608 | asrs r3,r0,#31 2609 | lsls r0,#1 2610 | lsrs r2,r0,#24 2611 | beq 1f @ zero? 2612 | cmp r2,#0xff @ Inf? 2613 | beq 2f 2614 | subs r1,r2,#1 2615 | subs r2,#0x7f @ remove exponent bias 2616 | lsls r1,#24 2617 | subs r0,r1 @ insert implied 1 2618 | eors r0,r3 2619 | subs r0,r3 @ top two's complement 2620 | asrs r1,r0,#4 @ convert to double format 2621 | lsls r0,#28 2622 | b d2fix_a 2623 | 1: 2624 | movs r0,#0 2625 | movs r1,r0 2626 | movs r3,r0 2627 | pop {r4,r15} 2628 | 2: 2629 | mvns r0,r3 @ return max/min value 2630 | mvns r1,r3 2631 | pop {r4,r15} 2632 | 2633 | .thumb_func 2634 | qfp_double2int64: 2635 | movs r2,#0 @ and fall through 2636 | .thumb_func 2637 | qfp_double2fix64: 2638 | push {r14} 2639 | bl d2fix 2640 | d2f64_a: 2641 | asrs r2,r1,#31 2642 | cmp r2,r3 2643 | bne 1f @ sign extension bits fail to match sign of result? 
2644 | pop {r15} 2645 | 1: 2646 | mvns r0,r3 2647 | movs r1,#1 2648 | lsls r1,#31 2649 | eors r1,r1,r0 @ generate extreme fixed-point values 2650 | pop {r15} 2651 | 2652 | .thumb_func 2653 | qfp_double2uint64: 2654 | movs r2,#0 @ and fall through 2655 | .thumb_func 2656 | qfp_double2ufix64: 2657 | asrs r3,r1,#20 @ negative? return 0 2658 | bmi ret_dzero 2659 | @ and fall through 2660 | 2661 | @ convert double in r0:r1 to signed fixed point in r0:r1:r3, r2 places after point, rounding towards -Inf 2662 | @ result clamped so that r3 can only be 0 or -1 2663 | @ trashes r12 2664 | .thumb_func 2665 | d2fix: 2666 | push {r4,r14} 2667 | mov r12,r2 2668 | bl dunpacks 2669 | asrs r4,r2,#16 2670 | adds r4,#1 2671 | bge 1f 2672 | movs r1,#0 @ -0 -> +0 2673 | 1: 2674 | asrs r3,r1,#31 2675 | d2fix_a: 2676 | @ here 2677 | @ r0:r1 two's complement mantissa 2678 | @ r2 unbiased exponent 2679 | @ r3 mantissa sign extension bits 2680 | add r2,r12 @ exponent plus offset for required binary point position 2681 | subs r2,#52 @ required shift 2682 | bmi 1f @ shift down? 2683 | @ here a shift up by r2 places 2684 | cmp r2,#12 @ will clamp? 2685 | bge 2f 2686 | movs r4,r0 2687 | lsls r1,r2 2688 | lsls r0,r2 2689 | rsbs r2,#0 2690 | adds r2,#32 @ complementary shift 2691 | lsrs r4,r2 2692 | orrs r1,r4 2693 | pop {r4,r15} 2694 | 2: 2695 | mvns r0,r3 2696 | mvns r1,r3 @ overflow: clamp to extreme fixed-point values 2697 | pop {r4,r15} 2698 | 1: 2699 | @ here a shift down by -r2 places 2700 | adds r2,#32 2701 | bmi 1f @ long shift? 2702 | mov r4,r1 2703 | lsls r4,r2 2704 | rsbs r2,#0 2705 | adds r2,#32 @ complementary shift 2706 | asrs r1,r2 2707 | lsrs r0,r2 2708 | orrs r0,r4 2709 | pop {r4,r15} 2710 | 1: 2711 | @ here a long shift down 2712 | movs r0,r1 2713 | asrs r1,#31 @ shift down 32 places 2714 | adds r2,#32 2715 | bmi 1f @ very long shift? 
2716 | rsbs r2,#0 2717 | adds r2,#32 2718 | asrs r0,r2 2719 | pop {r4,r15} 2720 | 1: 2721 | movs r0,r3 @ result very near zero: use sign extension bits 2722 | movs r1,r3 2723 | pop {r4,r15} 2724 | 2725 | @ float <-> double conversions 2726 | .thumb_func 2727 | qfp_float2double: 2728 | lsrs r3,r0,#31 @ sign bit 2729 | lsls r3,#31 2730 | lsls r1,r0,#1 2731 | lsrs r2,r1,#24 @ exponent 2732 | beq 1f @ zero? 2733 | cmp r2,#0xff @ Inf? 2734 | beq 2f 2735 | lsrs r1,#4 @ exponent and top 20 bits of mantissa 2736 | ldr r2,=#(0x3ff-0x7f)<<20 @ difference in exponent offsets 2737 | adds r1,r2 2738 | orrs r1,r3 2739 | lsls r0,#29 @ bottom 3 bits of mantissa 2740 | bx r14 2741 | 1: 2742 | movs r1,r3 @ return signed zero 2743 | 3: 2744 | movs r0,#0 2745 | bx r14 2746 | 2: 2747 | ldr r1,=#0x7ff00000 @ return signed infinity 2748 | adds r1,r3 2749 | b 3b 2750 | 2751 | .thumb_func 2752 | qfp_double2float: 2753 | lsls r2,r1,#1 2754 | lsrs r2,#21 @ exponent 2755 | ldr r3,=#0x3ff-0x7f 2756 | subs r2,r3 @ fix exponent bias 2757 | ble 1f @ underflow or zero 2758 | cmp r2,#0xff 2759 | bge 2f @ overflow or infinity 2760 | lsls r2,#23 @ position exponent of result 2761 | lsrs r3,r1,#31 2762 | lsls r3,#31 2763 | orrs r2,r3 @ insert sign 2764 | lsls r3,r0,#3 @ rounding bits 2765 | lsrs r0,#29 2766 | lsls r1,#12 2767 | lsrs r1,#9 2768 | orrs r0,r1 @ assemble mantissa 2769 | orrs r0,r2 @ insert exponent and sign 2770 | lsls r3,#1 2771 | bcc 3f @ no rounding 2772 | beq 4f @ all sticky bits 0? 2773 | 5: 2774 | adds r0,#1 2775 | 3: 2776 | bx r14 2777 | 4: 2778 | lsrs r3,r0,#1 @ odd? then round up 2779 | bcs 5b 2780 | bx r14 2781 | 1: 2782 | beq 6f @ check case where value is just less than smallest normal 2783 | 7: 2784 | lsrs r0,r1,#31 2785 | lsls r0,#31 2786 | bx r14 2787 | 6: 2788 | lsls r2,r1,#12 @ 20 1:s at top of mantissa? 2789 | asrs r2,#12 2790 | adds r2,#1 2791 | bne 7b 2792 | lsrs r2,r0,#29 @ and 3 more 1:s? 
2793 | cmp r2,#7 2794 | bne 7b 2795 | movs r2,#1 @ return smallest normal with correct sign 2796 | b 8f 2797 | 2: 2798 | movs r2,#0xff 2799 | 8: 2800 | lsrs r0,r1,#31 @ return signed infinity 2801 | lsls r0,#8 2802 | adds r0,r2 2803 | lsls r0,#23 2804 | bx r14 2805 | 2806 | @ convert signed/unsigned 32/64-bit integer/fixed-point value in r0:r1 [with r2 places after point] to packed double in r0:r1, with rounding 2807 | 2808 | .thumb_func 2809 | qfp_uint2double: 2810 | movs r1,#0 @ and fall through 2811 | .thumb_func 2812 | qfp_ufix2double: 2813 | movs r2,r1 2814 | movs r1,#0 2815 | b qfp_ufix642double 2816 | 2817 | .thumb_func 2818 | qfp_int2double: 2819 | movs r1,#0 @ and fall through 2820 | .thumb_func 2821 | qfp_fix2double: 2822 | movs r2,r1 2823 | asrs r1,r0,#31 @ sign extend 2824 | b qfp_fix642double 2825 | 2826 | .thumb_func 2827 | qfp_uint642double: 2828 | movs r2,#0 @ and fall through 2829 | .thumb_func 2830 | qfp_ufix642double: 2831 | movs r3,#0 2832 | b uf2d 2833 | 2834 | .thumb_func 2835 | qfp_int642double: 2836 | movs r2,#0 @ and fall through 2837 | .thumb_func 2838 | qfp_fix642double: 2839 | asrs r3,r1,#31 @ sign bit across all bits 2840 | eors r0,r3 2841 | eors r1,r3 2842 | subs r0,r3 2843 | sbcs r1,r3 2844 | uf2d: 2845 | push {r4,r5,r14} 2846 | ldr r4,=#0x432 2847 | subs r2,r4,r2 @ form biased exponent 2848 | @ here 2849 | @ r0:r1 unnormalised mantissa 2850 | @ r2 -Q (will become exponent) 2851 | @ r3 sign across all bits 2852 | cmp r1,#0 2853 | bne 1f @ short normalising shift? 2854 | movs r1,r0 2855 | beq 2f @ zero? return it 2856 | movs r0,#0 2857 | subs r2,#32 @ fix exponent 2858 | 1: 2859 | asrs r4,r1,#21 2860 | bne 3f @ will need shift down (and rounding?) 2861 | bcs 4f @ normalised already? 2862 | 5: 2863 | subs r2,#1 2864 | adds r0,r0 @ shift up 2865 | adcs r1,r1 2866 | lsrs r4,r1,#21 2867 | bcc 5b 2868 | 4: 2869 | ldr r4,=#0x7fe 2870 | cmp r2,r4 2871 | bhs 6f @ over/underflow? 
return signed zero/infinity 2872 | 7: 2873 | lsls r2,#20 @ pack and return 2874 | adds r1,r2 2875 | lsls r3,#31 2876 | adds r1,r3 2877 | 2: 2878 | pop {r4,r5,r15} 2879 | 6: @ return signed zero/infinity according to unclamped exponent in r2 2880 | mvns r2,r2 2881 | lsrs r2,#21 2882 | movs r0,#0 2883 | movs r1,#0 2884 | b 7b 2885 | 2886 | 3: 2887 | @ here we need to shift down to normalise and possibly round 2888 | bmi 1f @ already normalised to Q63? 2889 | 2: 2890 | subs r2,#1 2891 | adds r0,r0 @ shift up 2892 | adcs r1,r1 2893 | bpl 2b 2894 | 1: 2895 | @ here we have a 1 in b63 of r0:r1 2896 | adds r2,#11 @ correct exponent for subsequent shift down 2897 | lsls r4,r0,#21 @ save bits for rounding 2898 | lsrs r0,#11 2899 | lsls r5,r1,#21 2900 | orrs r0,r5 2901 | lsrs r1,#11 2902 | lsls r4,#1 2903 | beq 1f @ sticky bits are zero? 2904 | 8: 2905 | movs r4,#0 2906 | adcs r0,r4 2907 | adcs r1,r4 2908 | b 4b 2909 | 1: 2910 | bcc 4b @ sticky bits are zero but not on rounding boundary 2911 | lsrs r4,r0,#1 @ increment if odd (force round to even) 2912 | b 8b 2913 | 2914 | 2915 | .ltorg 2916 | 2917 | .thumb_func 2918 | dunpacks: 2919 | mdunpacks r0,r1,r2,r3,r4 2920 | ldr r3,=#0x3ff 2921 | subs r2,r3 @ exponent without offset 2922 | bx r14 2923 | 2924 | @ r0:r1 signed mantissa Q52 2925 | @ r2 unbiased exponent < 10 (i.e., |x|<2^10) 2926 | @ r4 pointer to: 2927 | @ - divisor reciprocal approximation r=1/d Q15 2928 | @ - divisor d Q62 0..20 2929 | @ - divisor d Q62 21..41 2930 | @ - divisor d Q62 42..62 2931 | @ returns: 2932 | @ r0:r1 reduced result y Q62, -0.6 d < y < 0.6 d (better in practice) 2933 | @ r2 quotient q (number of reductions) 2934 | @ if exponent >=10, returns r0:r1=0, r2=1024*mantissa sign 2935 | @ designed to work for 0.5=0: in quadrant 0 3186 | cmp r1,r3 3187 | ble 2f @ y<~x so 0≤θ<~π/4: skip 3188 | adds r6,#1 3189 | eors r1,r5 @ negate x 3190 | b 3f @ and exchange x and y = rotate by -π/2 3191 | 1: 3192 | cmp r3,r7 3193 | bge 2f @ -y<~x so -π/4<~θ≤0: skip 
3194 | subs r6,#1 3195 | eors r3,r5 @ negate y and ... 3196 | 3: 3197 | movs r7,r0 @ exchange x and y 3198 | movs r0,r2 3199 | movs r2,r7 3200 | movs r7,r1 3201 | movs r1,r3 3202 | movs r3,r7 3203 | 2: 3204 | @ here -π/4<~θ<~π/4 3205 | @ r6 has quadrant offset 3206 | push {r6} 3207 | cmp r2,#0 3208 | bne 1f 3209 | cmp r3,#0 3210 | beq 10f @ x==0 going into division? 3211 | lsls r4,r3,#1 3212 | asrs r4,#21 3213 | adds r4,#1 3214 | bne 1f @ x==Inf going into division? 3215 | lsls r4,r1,#1 3216 | asrs r4,#21 3217 | adds r4,#1 @ y also ±Inf? 3218 | bne 10f 3219 | subs r1,#1 @ make them both just finite 3220 | subs r3,#1 3221 | b 1f 3222 | 3223 | 10: 3224 | movs r0,#0 3225 | movs r1,#0 3226 | b 12f 3227 | 3228 | 1: 3229 | bl qfp_ddiv 3230 | movs r2,#62 3231 | bl qfp_double2fix64 3232 | @ r0:r1 y/x 3233 | mov r10,r0 3234 | mov r11,r1 3235 | movs r0,#0 @ ω=0 3236 | movs r1,#0 3237 | mov r8,r0 3238 | movs r2,#1 3239 | lsls r2,#30 3240 | mov r9,r2 @ x=1 3241 | 3242 | adr r4,dtab_cc 3243 | mov r12,r4 3244 | movs r7,#1 3245 | movs r6,#31 3246 | 1: 3247 | bl dcordic_vec_step 3248 | adds r7,#1 3249 | subs r6,#1 3250 | cmp r7,#33 3251 | bne 1b 3252 | @ r0:r1 atan(y/x) Q62 3253 | @ r8:r9 x residual Q62 3254 | @ r10:r11 y residual Q62 3255 | mov r2,r9 3256 | mov r3,r10 3257 | subs r2,#12 @ this makes atan(0)==0 3258 | @ the following is basically a division residual y/x ~ atan(residual y/x) 3259 | movs r4,#1 3260 | lsls r4,#29 3261 | movs r7,#0 3262 | 2: 3263 | lsrs r2,#1 3264 | movs r3,r3 @ preserve carry 3265 | bmi 1f 3266 | sbcs r3,r2 3267 | adds r0,r4 3268 | adcs r1,r7 3269 | lsrs r4,#1 3270 | bne 2b 3271 | b 3f 3272 | 1: 3273 | adcs r3,r2 3274 | subs r0,r4 3275 | sbcs r1,r7 3276 | lsrs r4,#1 3277 | bne 2b 3278 | 3: 3279 | lsls r6,r1,#31 3280 | asrs r1,#1 3281 | lsrs r0,#1 3282 | orrs r0,r6 @ Q61 3283 | 3284 | 12: 3285 | pop {r6} 3286 | 3287 | cmp r6,#0 3288 | beq 1f 3289 | ldr r4,=#0x885A308D @ π/2 Q61 3290 | ldr r5,=#0x3243F6A8 3291 | bpl 2f 3292 | mvns r4,r4 @ negative 
quadrant offset 3293 | mvns r5,r5 3294 | 2: 3295 | lsls r6,#31 3296 | bne 2f @ skip if quadrant offset is ±1 3297 | adds r0,r4 3298 | adcs r1,r5 3299 | 2: 3300 | adds r0,r4 3301 | adcs r1,r5 3302 | 1: 3303 | movs r2,#61 3304 | bl qfp_fix642double 3305 | 3306 | bl pop_r8_r11 3307 | pop {r4-r7,r15} 3308 | 3309 | .ltorg 3310 | 3311 | dtab_cc: 3312 | .word 0x61bb4f69, 0x1dac6705 @ atan 2^-1 Q62 3313 | .word 0x96406eb1, 0x0fadbafc @ atan 2^-2 Q62 3314 | .word 0xab0bdb72, 0x07f56ea6 @ atan 2^-3 Q62 3315 | .word 0xe59fbd39, 0x03feab76 @ atan 2^-4 Q62 3316 | .word 0xba97624b, 0x01ffd55b @ atan 2^-5 Q62 3317 | .word 0xdddb94d6, 0x00fffaaa @ atan 2^-6 Q62 3318 | .word 0x56eeea5d, 0x007fff55 @ atan 2^-7 Q62 3319 | .word 0xaab7776e, 0x003fffea @ atan 2^-8 Q62 3320 | .word 0x5555bbbc, 0x001ffffd @ atan 2^-9 Q62 3321 | .word 0xaaaaadde, 0x000fffff @ atan 2^-10 Q62 3322 | .word 0xf555556f, 0x0007ffff @ atan 2^-11 Q62 3323 | .word 0xfeaaaaab, 0x0003ffff @ atan 2^-12 Q62 3324 | .word 0xffd55555, 0x0001ffff @ atan 2^-13 Q62 3325 | .word 0xfffaaaab, 0x0000ffff @ atan 2^-14 Q62 3326 | .word 0xffff5555, 0x00007fff @ atan 2^-15 Q62 3327 | .word 0xffffeaab, 0x00003fff @ atan 2^-16 Q62 3328 | .word 0xfffffd55, 0x00001fff @ atan 2^-17 Q62 3329 | .word 0xffffffab, 0x00000fff @ atan 2^-18 Q62 3330 | .word 0xfffffff5, 0x000007ff @ atan 2^-19 Q62 3331 | .word 0xffffffff, 0x000003ff @ atan 2^-20 Q62 3332 | .word 0x00000000, 0x00000200 @ atan 2^-21 Q62 @ consider optimising these 3333 | .word 0x00000000, 0x00000100 @ atan 2^-22 Q62 3334 | .word 0x00000000, 0x00000080 @ atan 2^-23 Q62 3335 | .word 0x00000000, 0x00000040 @ atan 2^-24 Q62 3336 | .word 0x00000000, 0x00000020 @ atan 2^-25 Q62 3337 | .word 0x00000000, 0x00000010 @ atan 2^-26 Q62 3338 | .word 0x00000000, 0x00000008 @ atan 2^-27 Q62 3339 | .word 0x00000000, 0x00000004 @ atan 2^-28 Q62 3340 | .word 0x00000000, 0x00000002 @ atan 2^-29 Q62 3341 | .word 0x00000000, 0x00000001 @ atan 2^-30 Q62 3342 | .word 0x80000000, 0x00000000 @ atan 2^-31 
Q62 3343 | .word 0x40000000, 0x00000000 @ atan 2^-32 Q62 3344 | 3345 | .thumb_func 3346 | qfp_dexp: 3347 | push {r4-r7,r14} 3348 | bl dunpacks 3349 | adr r4,dreddata1 3350 | bl dreduce 3351 | cmp r1,#0 3352 | bge 1f 3353 | ldr r4,=#0xF473DE6B 3354 | ldr r5,=#0x2C5C85FD @ ln2 Q62 3355 | adds r0,r4 3356 | adcs r1,r5 3357 | subs r2,#1 3358 | 1: 3359 | push {r2} 3360 | movs r7,#1 @ shift 3361 | adr r6,dtab_exp 3362 | movs r2,#0 3363 | movs r3,#1 3364 | lsls r3,#30 @ x=1 Q62 3365 | 3366 | 3: 3367 | ldmia r6!,{r4,r5} 3368 | mov r12,r6 3369 | subs r0,r4 3370 | sbcs r1,r5 3371 | bmi 1f 3372 | 3373 | rsbs r6,r7,#0 3374 | adds r6,#32 @ complementary shift 3375 | movs r5,r3 3376 | asrs r5,r7 3377 | movs r4,r3 3378 | lsls r4,r6 3379 | movs r6,r2 3380 | lsrs r6,r7 @ rounding bit in carry 3381 | orrs r4,r6 3382 | adcs r2,r4 3383 | adcs r3,r5 @ x+=x>>i 3384 | b 2f 3385 | 3386 | 1: 3387 | adds r0,r4 @ restore argument 3388 | adcs r1,r5 3389 | 2: 3390 | mov r6,r12 3391 | adds r7,#1 3392 | cmp r7,#33 3393 | bne 3b 3394 | 3395 | @ here 3396 | @ r0:r1 ε (residual x, where x=a+ε) Q62, |ε|≤2^-32 (so fits in r0) 3397 | @ r2:r3 exp a Q62 3398 | @ and we wish to calculate exp x=exp a exp ε~(exp a)(1+ε) 3399 | muls32_32_64 r0,r3, r4,r1, r5,r6,r7,r4,r1 3400 | @ r4:r1 ε exp a Q(62+62-32)=Q92 3401 | lsrs r4,#30 3402 | lsls r0,r1,#2 3403 | orrs r0,r4 3404 | asrs r1,#30 3405 | adds r0,r2 3406 | adcs r1,r3 3407 | 3408 | pop {r2} 3409 | rsbs r2,#0 3410 | adds r2,#62 3411 | bl qfp_fix642double @ in principle we can pack faster than this because we know the exponent 3412 | pop {r4-r7,r15} 3413 | 3414 | .ltorg 3415 | 3416 | .thumb_func 3417 | qfp_dln: 3418 | push {r4-r7,r14} 3419 | lsls r7,r1,#1 3420 | bcs 5f @ <0 ... 3421 | asrs r7,#21 3422 | beq 5f @ ... or =0? return -Inf 3423 | adds r7,#1 3424 | beq 6f @ Inf/NaN? 
return +Inf 3425 | bl dunpacks 3426 | push {r2} 3427 | lsls r1,#9 3428 | lsrs r2,r0,#23 3429 | orrs r1,r2 3430 | lsls r0,#9 3431 | @ r0:r1 m Q61 = m/2 Q62 0.5≤m/2<1 3432 | 3433 | movs r7,#1 @ shift 3434 | adr r6,dtab_exp 3435 | mov r12,r6 3436 | movs r2,#0 3437 | movs r3,#0 @ y=0 Q62 3438 | 3439 | 3: 3440 | rsbs r6,r7,#0 3441 | adds r6,#32 @ complementary shift 3442 | movs r5,r1 3443 | asrs r5,r7 3444 | movs r4,r1 3445 | lsls r4,r6 3446 | movs r6,r0 3447 | lsrs r6,r7 3448 | orrs r4,r6 @ x>>i, rounding bit in carry 3449 | adcs r4,r0 3450 | adcs r5,r1 @ x+(x>>i) 3451 | 3452 | lsrs r6,r5,#30 3453 | bne 1f @ x+(x>>i)>1? 3454 | movs r0,r4 3455 | movs r1,r5 @ x+=x>>i 3456 | mov r6,r12 3457 | ldmia r6!,{r4,r5} 3458 | subs r2,r4 3459 | sbcs r3,r5 3460 | 3461 | 1: 3462 | movs r4,#8 3463 | add r12,r4 3464 | adds r7,#1 3465 | cmp r7,#33 3466 | bne 3b 3467 | @ here: 3468 | @ r0:r1 residual x, nearly 1 Q62 3469 | @ r2:r3 y ~ ln m/2 = ln m - ln2 Q62 3470 | @ result is y + ln2 + ln x ~ y + ln2 + (x-1) 3471 | lsls r1,#2 3472 | asrs r1,#2 @ x-1 3473 | adds r2,r0 3474 | adcs r3,r1 3475 | 3476 | pop {r7} 3477 | @ here: 3478 | @ r2:r3 ln m/2 = ln m - ln2 Q62 3479 | @ r7 unbiased exponent 3480 | 3481 | adr r4,dreddata1+4 3482 | ldmia r4,{r0,r1,r4} 3483 | adds r7,#1 3484 | muls r0,r7 @ Q62 3485 | muls r1,r7 @ Q41 3486 | muls r4,r7 @ Q20 3487 | lsls r7,r1,#21 3488 | asrs r1,#11 3489 | asrs r5,r1,#31 3490 | adds r0,r7 3491 | adcs r1,r5 3492 | lsls r7,r4,#10 3493 | asrs r4,#22 3494 | asrs r5,r1,#31 3495 | adds r1,r7 3496 | adcs r4,r5 3497 | @ r0:r1:r4 exponent*ln2 Q62 3498 | asrs r5,r3,#31 3499 | adds r0,r2 3500 | adcs r1,r3 3501 | adcs r4,r5 3502 | @ r0:r1:r4 result Q62 3503 | movs r2,#62 3504 | 1: 3505 | asrs r5,r1,#31 3506 | cmp r4,r5 3507 | beq 2f @ r4 a sign extension of r1? 
3508 | lsrs r0,#4 @ no: shift down 4 places and try again 3509 | lsls r6,r1,#28 3510 | orrs r0,r6 3511 | lsrs r1,#4 3512 | lsls r6,r4,#28 3513 | orrs r1,r6 3514 | asrs r4,#4 3515 | subs r2,#4 3516 | b 1b 3517 | 2: 3518 | bl qfp_fix642double 3519 | pop {r4-r7,r15} 3520 | 3521 | 5: 3522 | ldr r1,=#0xfff00000 3523 | movs r0,#0 3524 | pop {r4-r7,r15} 3525 | 3526 | 6: 3527 | ldr r1,=#0x7ff00000 3528 | movs r0,#0 3529 | pop {r4-r7,r15} 3530 | 3531 | 3532 | .ltorg 3533 | 3534 | .align 2 3535 | dreddata1: 3536 | .word 0x0000B8AA @ 1/ln2 Q15 3537 | .word 0x0013DE6B @ ln2 Q62 Q62=2C5C85FDF473DE6B split into 21-bit pieces 3538 | .word 0x000FEFA3 3539 | .word 0x000B1721 3540 | 3541 | dtab_exp: 3542 | .word 0xbf984bf3, 0x19f323ec @ log 1+2^-1 Q62 3543 | .word 0xcd4d10d6, 0x0e47fbe3 @ log 1+2^-2 Q62 3544 | .word 0x8abcb97a, 0x0789c1db @ log 1+2^-3 Q62 3545 | .word 0x022c54cc, 0x03e14618 @ log 1+2^-4 Q62 3546 | .word 0xe7833005, 0x01f829b0 @ log 1+2^-5 Q62 3547 | .word 0x87e01f1e, 0x00fe0545 @ log 1+2^-6 Q62 3548 | .word 0xac419e24, 0x007f80a9 @ log 1+2^-7 Q62 3549 | .word 0x45621781, 0x003fe015 @ log 1+2^-8 Q62 3550 | .word 0xa9ab10e6, 0x001ff802 @ log 1+2^-9 Q62 3551 | .word 0x55455888, 0x000ffe00 @ log 1+2^-10 Q62 3552 | .word 0x0aa9aac4, 0x0007ff80 @ log 1+2^-11 Q62 3553 | .word 0x01554556, 0x0003ffe0 @ log 1+2^-12 Q62 3554 | .word 0x002aa9ab, 0x0001fff8 @ log 1+2^-13 Q62 3555 | .word 0x00055545, 0x0000fffe @ log 1+2^-14 Q62 3556 | .word 0x8000aaaa, 0x00007fff @ log 1+2^-15 Q62 3557 | .word 0xe0001555, 0x00003fff @ log 1+2^-16 Q62 3558 | .word 0xf80002ab, 0x00001fff @ log 1+2^-17 Q62 3559 | .word 0xfe000055, 0x00000fff @ log 1+2^-18 Q62 3560 | .word 0xff80000b, 0x000007ff @ log 1+2^-19 Q62 3561 | .word 0xffe00001, 0x000003ff @ log 1+2^-20 Q62 3562 | .word 0xfff80000, 0x000001ff @ log 1+2^-21 Q62 3563 | .word 0xfffe0000, 0x000000ff @ log 1+2^-22 Q62 3564 | .word 0xffff8000, 0x0000007f @ log 1+2^-23 Q62 3565 | .word 0xffffe000, 0x0000003f @ log 1+2^-24 Q62 3566 | .word 
0xfffff800, 0x0000001f @ log 1+2^-25 Q62 3567 | .word 0xfffffe00, 0x0000000f @ log 1+2^-26 Q62 3568 | .word 0xffffff80, 0x00000007 @ log 1+2^-27 Q62 3569 | .word 0xffffffe0, 0x00000003 @ log 1+2^-28 Q62 3570 | .word 0xfffffff8, 0x00000001 @ log 1+2^-29 Q62 3571 | .word 0xfffffffe, 0x00000000 @ log 1+2^-30 Q62 3572 | .word 0x80000000, 0x00000000 @ log 1+2^-31 Q62 3573 | .word 0x40000000, 0x00000000 @ log 1+2^-32 Q62 3574 | 3575 | qfp_lib_end: 3576 | --------------------------------------------------------------------------------