├── LICENCE
├── README.md
├── SConscript
├── qfplib-m0-full.h
└── qfplib-m0-full_gcc.S


/LICENCE:
--------------------------------------------------------------------------------
  1 | 		    GNU GENERAL PUBLIC LICENSE
  2 | 		       Version 2, June 1991
  3 | 
  4 |  Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
  5 |  51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  6 |  Everyone is permitted to copy and distribute verbatim copies
  7 |  of this license document, but changing it is not allowed.
  8 | 
  9 | 			    Preamble
 10 | 
 11 |   The licenses for most software are designed to take away your
 12 | freedom to share and change it.  By contrast, the GNU General Public
 13 | License is intended to guarantee your freedom to share and change free
 14 | software--to make sure the software is free for all its users.  This
 15 | General Public License applies to most of the Free Software
 16 | Foundation's software and to any other program whose authors commit to
 17 | using it.  (Some other Free Software Foundation software is covered by
 18 | the GNU Lesser General Public License instead.)  You can apply it to
 19 | your programs, too.
 20 | 
 21 |   When we speak of free software, we are referring to freedom, not
 22 | price.  Our General Public Licenses are designed to make sure that you
 23 | have the freedom to distribute copies of free software (and charge for
 24 | this service if you wish), that you receive source code or can get it
 25 | if you want it, that you can change the software or use pieces of it
 26 | in new free programs; and that you know you can do these things.
 27 | 
 28 |   To protect your rights, we need to make restrictions that forbid
 29 | anyone to deny you these rights or to ask you to surrender the rights.
 30 | These restrictions translate to certain responsibilities for you if you
 31 | distribute copies of the software, or if you modify it.
 32 | 
 33 |   For example, if you distribute copies of such a program, whether
 34 | gratis or for a fee, you must give the recipients all the rights that
 35 | you have.  You must make sure that they, too, receive or can get the
 36 | source code.  And you must show them these terms so they know their
 37 | rights.
 38 | 
 39 |   We protect your rights with two steps: (1) copyright the software, and
 40 | (2) offer you this license which gives you legal permission to copy,
 41 | distribute and/or modify the software.
 42 | 
 43 |   Also, for each author's protection and ours, we want to make certain
 44 | that everyone understands that there is no warranty for this free
 45 | software.  If the software is modified by someone else and passed on, we
 46 | want its recipients to know that what they have is not the original, so
 47 | that any problems introduced by others will not reflect on the original
 48 | authors' reputations.
 49 | 
 50 |   Finally, any free program is threatened constantly by software
 51 | patents.  We wish to avoid the danger that redistributors of a free
 52 | program will individually obtain patent licenses, in effect making the
 53 | program proprietary.  To prevent this, we have made it clear that any
 54 | patent must be licensed for everyone's free use or not licensed at all.
 55 | 
 56 |   The precise terms and conditions for copying, distribution and
 57 | modification follow.
 58 | 
 59 | 		    GNU GENERAL PUBLIC LICENSE
 60 |    TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
 61 | 
 62 |   0. This License applies to any program or other work which contains
 63 | a notice placed by the copyright holder saying it may be distributed
 64 | under the terms of this General Public License.  The "Program", below,
 65 | refers to any such program or work, and a "work based on the Program"
 66 | means either the Program or any derivative work under copyright law:
 67 | that is to say, a work containing the Program or a portion of it,
 68 | either verbatim or with modifications and/or translated into another
 69 | language.  (Hereinafter, translation is included without limitation in
 70 | the term "modification".)  Each licensee is addressed as "you".
 71 | 
 72 | Activities other than copying, distribution and modification are not
 73 | covered by this License; they are outside its scope.  The act of
 74 | running the Program is not restricted, and the output from the Program
 75 | is covered only if its contents constitute a work based on the
 76 | Program (independent of having been made by running the Program).
 77 | Whether that is true depends on what the Program does.
 78 | 
 79 |   1. You may copy and distribute verbatim copies of the Program's
 80 | source code as you receive it, in any medium, provided that you
 81 | conspicuously and appropriately publish on each copy an appropriate
 82 | copyright notice and disclaimer of warranty; keep intact all the
 83 | notices that refer to this License and to the absence of any warranty;
 84 | and give any other recipients of the Program a copy of this License
 85 | along with the Program.
 86 | 
 87 | You may charge a fee for the physical act of transferring a copy, and
 88 | you may at your option offer warranty protection in exchange for a fee.
 89 | 
 90 |   2. You may modify your copy or copies of the Program or any portion
 91 | of it, thus forming a work based on the Program, and copy and
 92 | distribute such modifications or work under the terms of Section 1
 93 | above, provided that you also meet all of these conditions:
 94 | 
 95 |     a) You must cause the modified files to carry prominent notices
 96 |     stating that you changed the files and the date of any change.
 97 | 
 98 |     b) You must cause any work that you distribute or publish, that in
 99 |     whole or in part contains or is derived from the Program or any
100 |     part thereof, to be licensed as a whole at no charge to all third
101 |     parties under the terms of this License.
102 | 
103 |     c) If the modified program normally reads commands interactively
104 |     when run, you must cause it, when started running for such
105 |     interactive use in the most ordinary way, to print or display an
106 |     announcement including an appropriate copyright notice and a
107 |     notice that there is no warranty (or else, saying that you provide
108 |     a warranty) and that users may redistribute the program under
109 |     these conditions, and telling the user how to view a copy of this
110 |     License.  (Exception: if the Program itself is interactive but
111 |     does not normally print such an announcement, your work based on
112 |     the Program is not required to print an announcement.)
113 | 
114 | These requirements apply to the modified work as a whole.  If
115 | identifiable sections of that work are not derived from the Program,
116 | and can be reasonably considered independent and separate works in
117 | themselves, then this License, and its terms, do not apply to those
118 | sections when you distribute them as separate works.  But when you
119 | distribute the same sections as part of a whole which is a work based
120 | on the Program, the distribution of the whole must be on the terms of
121 | this License, whose permissions for other licensees extend to the
122 | entire whole, and thus to each and every part regardless of who wrote it.
123 | 
124 | Thus, it is not the intent of this section to claim rights or contest
125 | your rights to work written entirely by you; rather, the intent is to
126 | exercise the right to control the distribution of derivative or
127 | collective works based on the Program.
128 | 
129 | In addition, mere aggregation of another work not based on the Program
130 | with the Program (or with a work based on the Program) on a volume of
131 | a storage or distribution medium does not bring the other work under
132 | the scope of this License.
133 | 
134 |   3. You may copy and distribute the Program (or a work based on it,
135 | under Section 2) in object code or executable form under the terms of
136 | Sections 1 and 2 above provided that you also do one of the following:
137 | 
138 |     a) Accompany it with the complete corresponding machine-readable
139 |     source code, which must be distributed under the terms of Sections
140 |     1 and 2 above on a medium customarily used for software interchange; or,
141 | 
142 |     b) Accompany it with a written offer, valid for at least three
143 |     years, to give any third party, for a charge no more than your
144 |     cost of physically performing source distribution, a complete
145 |     machine-readable copy of the corresponding source code, to be
146 |     distributed under the terms of Sections 1 and 2 above on a medium
147 |     customarily used for software interchange; or,
148 | 
149 |     c) Accompany it with the information you received as to the offer
150 |     to distribute corresponding source code.  (This alternative is
151 |     allowed only for noncommercial distribution and only if you
152 |     received the program in object code or executable form with such
153 |     an offer, in accord with Subsection b above.)
154 | 
155 | The source code for a work means the preferred form of the work for
156 | making modifications to it.  For an executable work, complete source
157 | code means all the source code for all modules it contains, plus any
158 | associated interface definition files, plus the scripts used to
159 | control compilation and installation of the executable.  However, as a
160 | special exception, the source code distributed need not include
161 | anything that is normally distributed (in either source or binary
162 | form) with the major components (compiler, kernel, and so on) of the
163 | operating system on which the executable runs, unless that component
164 | itself accompanies the executable.
165 | 
166 | If distribution of executable or object code is made by offering
167 | access to copy from a designated place, then offering equivalent
168 | access to copy the source code from the same place counts as
169 | distribution of the source code, even though third parties are not
170 | compelled to copy the source along with the object code.
171 | 
172 |   4. You may not copy, modify, sublicense, or distribute the Program
173 | except as expressly provided under this License.  Any attempt
174 | otherwise to copy, modify, sublicense or distribute the Program is
175 | void, and will automatically terminate your rights under this License.
176 | However, parties who have received copies, or rights, from you under
177 | this License will not have their licenses terminated so long as such
178 | parties remain in full compliance.
179 | 
180 |   5. You are not required to accept this License, since you have not
181 | signed it.  However, nothing else grants you permission to modify or
182 | distribute the Program or its derivative works.  These actions are
183 | prohibited by law if you do not accept this License.  Therefore, by
184 | modifying or distributing the Program (or any work based on the
185 | Program), you indicate your acceptance of this License to do so, and
186 | all its terms and conditions for copying, distributing or modifying
187 | the Program or works based on it.
188 | 
189 |   6. Each time you redistribute the Program (or any work based on the
190 | Program), the recipient automatically receives a license from the
191 | original licensor to copy, distribute or modify the Program subject to
192 | these terms and conditions.  You may not impose any further
193 | restrictions on the recipients' exercise of the rights granted herein.
194 | You are not responsible for enforcing compliance by third parties to
195 | this License.
196 | 
197 |   7. If, as a consequence of a court judgment or allegation of patent
198 | infringement or for any other reason (not limited to patent issues),
199 | conditions are imposed on you (whether by court order, agreement or
200 | otherwise) that contradict the conditions of this License, they do not
201 | excuse you from the conditions of this License.  If you cannot
202 | distribute so as to satisfy simultaneously your obligations under this
203 | License and any other pertinent obligations, then as a consequence you
204 | may not distribute the Program at all.  For example, if a patent
205 | license would not permit royalty-free redistribution of the Program by
206 | all those who receive copies directly or indirectly through you, then
207 | the only way you could satisfy both it and this License would be to
208 | refrain entirely from distribution of the Program.
209 | 
210 | If any portion of this section is held invalid or unenforceable under
211 | any particular circumstance, the balance of the section is intended to
212 | apply and the section as a whole is intended to apply in other
213 | circumstances.
214 | 
215 | It is not the purpose of this section to induce you to infringe any
216 | patents or other property right claims or to contest validity of any
217 | such claims; this section has the sole purpose of protecting the
218 | integrity of the free software distribution system, which is
219 | implemented by public license practices.  Many people have made
220 | generous contributions to the wide range of software distributed
221 | through that system in reliance on consistent application of that
222 | system; it is up to the author/donor to decide if he or she is willing
223 | to distribute software through any other system and a licensee cannot
224 | impose that choice.
225 | 
226 | This section is intended to make thoroughly clear what is believed to
227 | be a consequence of the rest of this License.
228 | 
229 |   8. If the distribution and/or use of the Program is restricted in
230 | certain countries either by patents or by copyrighted interfaces, the
231 | original copyright holder who places the Program under this License
232 | may add an explicit geographical distribution limitation excluding
233 | those countries, so that distribution is permitted only in or among
234 | countries not thus excluded.  In such case, this License incorporates
235 | the limitation as if written in the body of this License.
236 | 
237 |   9. The Free Software Foundation may publish revised and/or new versions
238 | of the General Public License from time to time.  Such new versions will
239 | be similar in spirit to the present version, but may differ in detail to
240 | address new problems or concerns.
241 | 
242 | Each version is given a distinguishing version number.  If the Program
243 | specifies a version number of this License which applies to it and "any
244 | later version", you have the option of following the terms and conditions
245 | either of that version or of any later version published by the Free
246 | Software Foundation.  If the Program does not specify a version number of
247 | this License, you may choose any version ever published by the Free Software
248 | Foundation.
249 | 
250 |   10. If you wish to incorporate parts of the Program into other free
251 | programs whose distribution conditions are different, write to the author
252 | to ask for permission.  For software which is copyrighted by the Free
253 | Software Foundation, write to the Free Software Foundation; we sometimes
254 | make exceptions for this.  Our decision will be guided by the two goals
255 | of preserving the free status of all derivatives of our free software and
256 | of promoting the sharing and reuse of software generally.
257 | 
258 | 			    NO WARRANTY
259 | 
260 |   11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
268 | REPAIR OR CORRECTION.
269 | 
270 |   12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
278 | POSSIBILITY OF SUCH DAMAGES.
279 | 
280 | 		     END OF TERMS AND CONDITIONS
281 | 
282 | 	    How to Apply These Terms to Your New Programs
283 | 
284 |   If you develop a new program, and you want it to be of the greatest
285 | possible use to the public, the best way to achieve this is to make it
286 | free software which everyone can redistribute and change under these terms.
287 | 
288 |   To do so, attach the following notices to the program.  It is safest
289 | to attach them to the start of each source file to most effectively
290 | convey the exclusion of warranty; and each file should have at least
291 | the "copyright" line and a pointer to where the full notice is found.
292 | 
293 |     <one line to give the program's name and a brief idea of what it does.>
294 |     Copyright (C) <year>  <name of author>
295 | 
296 |     This program is free software; you can redistribute it and/or modify
297 |     it under the terms of the GNU General Public License as published by
298 |     the Free Software Foundation; either version 2 of the License, or
299 |     (at your option) any later version.
300 | 
301 |     This program is distributed in the hope that it will be useful,
302 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
303 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
304 |     GNU General Public License for more details.
305 | 
306 |     You should have received a copy of the GNU General Public License along
307 |     with this program; if not, write to the Free Software Foundation, Inc.,
308 |     51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
309 | 
310 | Also add information on how to contact you by electronic and paper mail.
311 | 
312 | If the program is interactive, make it output a short notice like this
313 | when it starts in an interactive mode:
314 | 
315 |     Gnomovision version 69, Copyright (C) year name of author
316 |     Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
317 |     This is free software, and you are welcome to redistribute it
318 |     under certain conditions; type `show c' for details.
319 | 
320 | The hypothetical commands `show w' and `show c' should show the appropriate
321 | parts of the General Public License.  Of course, the commands you use may
322 | be called something other than `show w' and `show c'; they could even be
323 | mouse-clicks or menu items--whatever suits your program.
324 | 
325 | You should also get your employer (if you work as a programmer) or your
326 | school, if any, to sign a "copyright disclaimer" for the program, if
327 | necessary.  Here is a sample; alter the names:
328 | 
329 |   Yoyodyne, Inc., hereby disclaims all copyright interest in the program
330 |   `Gnomovision' (which makes passes at compilers) written by James Hacker.
331 | 
332 |   <signature of Ty Coon>, 1 April 1989
333 |   Ty Coon, President of Vice
334 | 
335 | This General Public License does not permit incorporating your program into
336 | proprietary programs.  If your program is a subroutine library, you may
337 | consider it more useful to permit linking proprietary applications with the
338 | library.  If this is what you want to do, use the GNU Lesser General
339 | Public License instead of this License.
340 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Qfplib-M0-full for RT-Thread
  2 | 
  3 | ## A free, fast and compact ARM Cortex-M0 floating-point library
  4 | 
  5 | ## Introduction
  6 | 
  7 | Qfplib-M0-full is a library of IEEE 754 single- and double-precision floating-point arithmetic routines for microcontrollers based on the ARM Cortex-M0 core (ARMv6-M architecture). It should also run on Cortex-M3 and Cortex-M4 microcontrollers and will give reasonable performance, but it is not optimised for these devices.
  8 | 
  9 | It provides correctly rounded (to nearest, even-on-tie) addition, subtraction, multiplication, division and square root operations, and sine, cosine, tangent, arctangent, logarithm and exponential functions that give a high degree of accuracy. There are also conversion functions between floating-point values and signed or unsigned integer or fixed-point values. The library occupies less than 6 kbyte of program memory.
 10 | 
 11 | Qfplib-M0-full does not use any static storage. Stack use is parsimonious and statically analysable; recursion is not used.
 12 | 
 13 | ## Licence
 14 | 
 15 | Qfplib-M0-full is open source, licensed under version 2 of the [GNU GPL](http://www.gnu.org/licenses/). Use at your own risk. If you wish to enquire about alternative licensing please use the e-mail address on the [home page](https://www.quinapalus.com/index.html).
 16 | 
 17 | ## How to Obtain
 18 | 
 19 | ```
 20 |  RT-Thread online packages  --->
 21 |     system packages  --->
 22 |         acceleration: Assembly language or algorithmic acceleration packages  --->
 23 |             [*] Qfplib-M0-full: a free, fast and compact ARM Cortex-M0 floating-point library
 24 | ```
 25 | 
 26 | ## Speed
 27 | 
 28 | The following table compares cycle counts for Qfplib-M0-full against other libraries. Qfplib-M0-full and GCC library results are average values for non-exceptional arguments to the functions, include calling overhead, and are approximate. They were measured using an LPC11U68 microcontroller with single-cycle flash memory. Results for the Micro Digital ‘GoFast’ library—presumably optimised for speed rather than size, judging by its name—are inferred from the timings given on [this page](http://www.smxrtos.com/ussw/gofast/gofast_arm_gnu.htm) for an ARM7TDMI-based processor. The comparison here may not be not strictly fair to Qfplib-M0-full as it is not clear from their description whether Micro Digital’s library exploits features available on that processor but not on the Cortex-M0: for example, ARM mode is considerably faster and more flexible than Thumb mode, and the long multiply instructions can be used to advantage in several of the routines, especially in double precision. Micro Digital do not appear to provide public information on the code size of their library. The implementation of the basic functions does not appear to be IEEE 754 compliant with regard to rounding.
 29 | 
 30 | | Function     | **Qfplib-M0-full cycles** | GCC library cycles | ‘GoFast’ library cycles |
 31 | | ------------ | ------------------------- | ------------------ | ----------------------- |
 32 | | `qfp_fadd`   | **76**                    | 102                | 182                     |
 33 | | `qfp_fsub`   | **78**                    | 108                | 181                     |
 34 | | `qfp_fmul`   | **62**                    | 166                | 144                     |
 35 | | `qfp_fdiv`   | **83**                    | 475                | 799                     |
 36 | | `qfp_fcos`   | **595**                   | 3350               | 393                     |
 37 | | `qfp_fsin`   | **584**                   | 3300               | 394                     |
 38 | | `qfp_ftan`   | **671**                   | 6140               | 1090                    |
 39 | | `qfp_fatan2` | **673**                   | 4930               | 2041                    |
 40 | | `qfp_fexp`   | **261**                   | 1930               | 372                     |
 41 | | `qfp_fln`    | **277**                   | 3960               | 1321                    |
 42 | | `qfp_fsqrt`  | **67**                    | 460                | 1590                    |
 43 | | `qfp_dadd`   | **94**                    | 168                | 231                     |
 44 | | `qfp_dsub`   | **100**                   | 167                | 243                     |
 45 | | `qfp_dmul`   | **163**                   | 377                | 224                     |
 46 | | `qfp_ddiv`   | **200**                   | 1190               | 1557                    |
 47 | | `qfp_dcos`   | **1623**                  | 6162               | 951                     |
 48 | | `qfp_dsin`   | **1624**                  | 5854               | 966                     |
 49 | | `qfp_dtan`   | **1906**                  | 11371              | 2541                    |
 50 | | `qfp_datan2` | **2187**                  | 9973               | 4487                    |
 51 | | `qfp_dexp`   | **811**                   | 6655               | 1178                    |
 52 | | `qfp_dln`    | **464**                   | 4375               | 2798                    |
 53 | | `qfp_dsqrt`  | **174**                   | 1305               | 3042                    |
 54 | 
 55 | Note that in every case the Qfplib-M0-full **double**-precision implementation is faster than the corresponding GCC **single**-precision implementation, sometimes by a very large factor.
 56 | 
 57 | The ARM CMSIS implementations of the scientific functions, despite their name ‘FastMath’, appear to be many times slower than Qfplib-M0-full. For example, the average execution time for ARM's single-precision cosine function (compiled using GCC) is about 3880 cycles, virtually independent of the optimisation flags used.
 58 | 
 59 | ## Limitations and deviations from the IEEE 754 standard
 60 | 
 61 | On input and output NaNs are converted to infinities and denormals are flushed to zero.
 62 | 
 63 | ## Function ranges and accuracy
 64 | 
 65 | Subject to the limitations and deviations mentioned above, the addition, subtraction, multiplication, division and square root functions all produce correctly rounded (to nearest, even-on-tie) results. This has been verified using many billions of test cases, both random and contrived.
 66 | 
 67 | Other functions generally give results accurate to approximately 1 ulp (‘unit in last place’). Accuracy is poorer where a tiny change in an argument results in a change in the result of a large number of ulps, such as when taking the logarithm of a value near 1 or the sine of a value near a multiple of π. Accurate handling of such cases consumes a large amount of code space and is seldom if ever needed.
 68 | 
 69 | The single-precision trigonometric functions require an argument between –128 and +128; the double-precision trigonometric functions require an argument between –1024 and +1024.
 70 | 
 71 | The comparison functions return zero if its arguments are equal (negative zero is equal to positive zero) or plus or minus one if its first argument is respectively greater than or less than its second.
 72 | 
 73 | ## Conversion functions
 74 | 
 75 | A comprehensive range of functions is provided to convert between floating-point data and signed and unsigned fixed-point and integer data. They are as follows.
 76 | 
 77 | - `qfp_float2int`
 78 | - `qfp_float2fix`
 79 | - `qfp_float2uint`
 80 | - `qfp_float2ufix`
 81 | - `qfp_int2float`
 82 | - `qfp_fix2float`
 83 | - `qfp_uint2float`
 84 | - `qfp_ufix2float`
 85 | - `qfp_int642float`
 86 | - `qfp_fix642float`
 87 | - `qfp_uint642float`
 88 | - `qfp_ufix642float`
 89 | - `qfp_float2int64`
 90 | - `qfp_float2fix64`
 91 | - `qfp_float2uint64`
 92 | - `qfp_float2ufix64`
 93 | - `qfp_double2int`
 94 | - `qfp_double2fix`
 95 | - `qfp_double2uint`
 96 | - `qfp_double2ufix`
 97 | - `qfp_double2int64`
 98 | - `qfp_double2fix64`
 99 | - `qfp_double2uint64`
100 | - `qfp_double2ufix64`
101 | - `qfp_int2double`
102 | - `qfp_fix2double`
103 | - `qfp_uint2double`
104 | - `qfp_ufix2double`
105 | - `qfp_int642double`
106 | - `qfp_fix642double`
107 | - `qfp_uint642double`
108 | - `qfp_ufix642double`
109 | - `qfp_double2float`
110 | - `qfp_float2double`
111 | 
112 | ## Other functions
113 | 
114 | You may also be interested in the `qfp_float2str` and `qfp_str2float` functions provided as part of [Qfplib-M0-tiny](https://github.com/mysterywolf/Qfplib-M0-tiny) library.
115 | 
116 | Visit http://www.quinapalus.com/qfplib.html for more information.
117 | 


--------------------------------------------------------------------------------
/SConscript:
--------------------------------------------------------------------------------
 1 | Import('rtconfig')
 2 | from building import *
 3 | 
 4 | cwd = GetCurrentDir()
 5 | src	= Glob('*.c')
 6 | 
 7 | if rtconfig.PLATFORM == 'armcc':
 8 |     src += Glob('*_rvds.S')
 9 | 
10 | if rtconfig.PLATFORM == 'gcc':
11 |     src += Glob('*_gcc.S')
12 | 
13 | if rtconfig.PLATFORM == 'iar':
14 |     src += Glob('*_iar.S')
15 | 
16 | CPPPATH = [cwd]
17 | 
18 | group = DefineGroup('Qfplib-M0-full', src, depend = ['PKG_USING_QFPLIB_M0_FULL'], CPPPATH = CPPPATH)
19 | 
20 | Return('group')
21 | 


--------------------------------------------------------------------------------
/qfplib-m0-full.h:
--------------------------------------------------------------------------------
 1 | #ifndef __QFPLIB_M0_FULL_H__
 2 | #define __QFPLIB_M0_FULL_H__
 3 | 
 4 | /*
 5 | Copyright 2019-2020 Mark Owen
 6 | http://www.quinapalus.com
 7 | E-mail: qfp@quinapalus.com
 8 | 
 9 | This file is free software: you can redistribute it and/or modify
10 | it under the terms of version 2 of the GNU General Public License
11 | as published by the Free Software Foundation.
12 | 
13 | This file is distributed in the hope that it will be useful,
14 | but WITHOUT ANY WARRANTY; without even the implied warranty of
15 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
16 | GNU General Public License for more details.
17 | 
18 | You should have received a copy of the GNU General Public License
19 | along with this file.  If not, see <http://www.gnu.org/licenses/> or
20 | write to the Free Software Foundation, Inc., 51 Franklin Street,
21 | Fifth Floor, Boston, MA  02110-1301, USA.
22 | */
23 | 
24 | typedef unsigned           int ui32;
25 | typedef                    int i32;
26 | typedef unsigned long long int ui64;
27 | typedef          long long int i64;
28 | 
29 | extern float  qfp_fadd         (float x,float y);
30 | extern float  qfp_fsub         (float x,float y);
31 | extern float  qfp_fmul         (float x,float y);
32 | extern float  qfp_fdiv         (float x,float y);
33 | extern int    qfp_fcmp         (float x,float y);
34 | extern float  qfp_fsqrt        (float x);
35 | extern i32    qfp_float2int    (float x);
36 | extern i32    qfp_float2fix    (float x,int f);
37 | extern ui32   qfp_float2uint   (float x);
38 | extern ui32   qfp_float2ufix   (float x,int f);
39 | extern float  qfp_int2float    (i32 x);
40 | extern float  qfp_fix2float    (i32 x,int f);
41 | extern float  qfp_uint2float   (ui32 x);
42 | extern float  qfp_ufix2float   (ui32 x,int f);
43 | extern float  qfp_int642float  (i64 x);
44 | extern float  qfp_fix642float  (i64 x,int f);
45 | extern float  qfp_uint642float (ui64 x);
46 | extern float  qfp_ufix642float (ui64 x,int f);
47 | extern float  qfp_fcos         (float x);
48 | extern float  qfp_fsin         (float x);
49 | extern float  qfp_ftan         (float x);
50 | extern float  qfp_fatan2       (float y,float x);
51 | extern float  qfp_fexp         (float x);
52 | extern float  qfp_fln          (float x);
53 | 
54 | extern double qfp_dadd         (double x,double y);
55 | extern double qfp_dsub         (double x,double y);
56 | extern double qfp_dmul         (double x,double y);
57 | extern double qfp_ddiv         (double x,double y);
58 | extern double qfp_dsqrt        (double x);
59 | extern double qfp_dcos         (double x);
60 | extern double qfp_dsin         (double x);
61 | extern double qfp_dtan         (double x);
62 | extern double qfp_datan2       (double y,double x);
63 | extern double qfp_dexp         (double x);
64 | extern double qfp_dln          (double x);
65 | extern int    qfp_dcmp         (double x,double y);
66 | 
67 | extern i64    qfp_float2int64  (float x);
68 | extern i64    qfp_float2fix64  (float x,int f);
69 | extern ui64   qfp_float2uint64 (float x);
70 | extern ui64   qfp_float2ufix64 (float x,int f);
71 | 
72 | extern i32    qfp_double2int   (double x);
73 | extern i32    qfp_double2fix   (double x,int f);
74 | extern ui32   qfp_double2uint  (double x);
75 | extern ui32   qfp_double2ufix  (double x,int f);
76 | extern i64    qfp_double2int64 (double x);
77 | extern i64    qfp_double2fix64 (double x,int f);
78 | extern ui64   qfp_double2uint64(double x);
79 | extern ui64   qfp_double2ufix64(double x,int f);
80 | 
81 | extern double qfp_int2double   (i32  x);
82 | extern double qfp_fix2double   (i32  x,int f);
83 | extern double qfp_uint2double  (ui32 x);
84 | extern double qfp_ufix2double  (ui32 x,int f);
85 | extern double qfp_int642double (i64  x);
86 | extern double qfp_fix642double (i64  x,int f);
87 | extern double qfp_uint642double(ui64 x);
88 | extern double qfp_ufix642double(ui64 x,int f);
89 | 
90 | extern float  qfp_double2float (double x);
91 | extern double qfp_float2double (float x);
92 | 
93 | #endif
94 | 


--------------------------------------------------------------------------------
/qfplib-m0-full_gcc.S:
--------------------------------------------------------------------------------
   1 | @ Copyright 2019-2020 Mark Owen
   2 | @ http://www.quinapalus.com
   3 | @ E-mail: qfp@quinapalus.com
   4 | @
   5 | @ This file is free software: you can redistribute it and/or modify
   6 | @ it under the terms of version 2 of the GNU General Public License
   7 | @ as published by the Free Software Foundation.
   8 | @
   9 | @ This file is distributed in the hope that it will be useful,
  10 | @ but WITHOUT ANY WARRANTY; without even the implied warranty of
  11 | @ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  12 | @ GNU General Public License for more details.
  13 | @
  14 | @ You should have received a copy of the GNU General Public License
  15 | @ along with this file.  If not, see <http://www.gnu.org/licenses/> or
  16 | @ write to the Free Software Foundation, Inc., 51 Franklin Street,
  17 | @ Fifth Floor, Boston, MA  02110-1301, USA.
  18 | 
  19 | .syntax unified
  20 | .cpu cortex-m0plus
  21 | .thumb
  22 | 
  23 | @ exported symbols
  24 | 
  25 | .global qfp_fadd
  26 | .global qfp_fsub
  27 | .global qfp_fmul
  28 | .global qfp_fdiv
  29 | .global qfp_fcmp
  30 | .global qfp_fsqrt
  31 | .global qfp_float2int
  32 | .global qfp_float2fix
  33 | .global qfp_float2uint
  34 | .global qfp_float2ufix
  35 | .global qfp_int2float
  36 | .global qfp_fix2float
  37 | .global qfp_uint2float
  38 | .global qfp_ufix2float
  39 | .global qfp_int642float
  40 | .global qfp_fix642float
  41 | .global qfp_uint642float
  42 | .global qfp_ufix642float
  43 | .global qfp_fcos
  44 | .global qfp_fsin
  45 | .global qfp_ftan
  46 | .global qfp_fatan2
  47 | .global qfp_fexp
  48 | .global qfp_fln
  49 | 
  50 | .global qfp_dadd
  51 | .global qfp_dsub
  52 | .global qfp_dmul
  53 | .global qfp_ddiv
  54 | .global qfp_dsqrt
  55 | .global qfp_dcos
  56 | .global qfp_dsin
  57 | .global qfp_dtan
  58 | .global qfp_datan2
  59 | .global qfp_dexp
  60 | .global qfp_dln
  61 | .global qfp_dcmp
  62 | 
  63 | .global qfp_float2int64
  64 | .global qfp_float2fix64
  65 | .global qfp_float2uint64
  66 | .global qfp_float2ufix64
  67 | 
  68 | .global qfp_double2int
  69 | .global qfp_double2fix
  70 | .global qfp_double2uint
  71 | .global qfp_double2ufix
  72 | .global qfp_double2int64
  73 | .global qfp_double2fix64
  74 | .global qfp_double2uint64
  75 | .global qfp_double2ufix64
  76 | 
  77 | .global qfp_int2double
  78 | .global qfp_fix2double
  79 | .global qfp_uint2double
  80 | .global qfp_ufix2double
  81 | .global qfp_int642double
  82 | .global qfp_fix642double
  83 | .global qfp_uint642double
  84 | .global qfp_ufix642double
  85 | 
  86 | .global qfp_double2float
  87 | .global qfp_float2double
  88 | 
  89 | qfp_lib_start:
  90 | 
  91 | @ exchange r0<->r1, r2<->r3
  92 | xchxy:
  93 |  push {r0,r2,r14}
  94 |  mov r0,r1
  95 |  mov r2,r3
  96 |  pop {r1,r3,r15}
  97 | 
  98 | @ IEEE single in r0-> signed (two's complemennt) mantissa in r0 9Q23 (24 significant bits), signed exponent (bias removed) in r2
  99 | @ trashes r4; zero, denormal -> mantissa=+/-1, exponent=-380; Inf, NaN -> mantissa=+/-1, exponent=+640
 100 | unpackx:
 101 |  lsrs r2,r0,#23 @ save exponent and sign
 102 |  lsls r0,#9     @ extract mantissa
 103 |  lsrs r0,#9
 104 |  movs r4,#1
 105 |  lsls r4,#23
 106 |  orrs r0,r4     @ reinstate implied leading 1
 107 |  cmp r2,#255    @ test sign bit
 108 |  uxtb r2,r2     @ clear it
 109 |  bls 1f         @ branch on positive
 110 |  rsbs r0,#0     @ negate mantissa
 111 | 1:
 112 |  subs r2,#1
 113 |  cmp r2,#254    @ zero/denormal/Inf/NaN?
 114 |  bhs 2f
 115 |  subs r2,#126   @ remove exponent bias: can now be -126..+127
 116 |  bx r14
 117 | 
 118 | 2:              @ here with special-case values
 119 |  cmp r0,#0
 120 |  mov r0,r4      @ set mantissa to +1
 121 |  bpl 3f
 122 |  rsbs r0,#0     @ zero/denormal/Inf/NaN: mantissa=+/-1
 123 | 3:
 124 |  subs r2,#126   @ zero/denormal: exponent -> -127; Inf, NaN: exponent -> 128
 125 |  lsls r2,#2     @ zero/denormal: exponent -> -508; Inf, NaN: exponent -> 512
 126 |  adds r2,#128   @ zero/denormal: exponent -> -380; Inf, NaN: exponent -> 640
 127 |  bx r14
 128 | 
 129 | @ normalise and pack signed mantissa in r0 nominally 3Q29, signed exponent in r2-> IEEE single in r0
 130 | @ trashes r4, preserves r1,r3
 131 | @ r5: "sticky bits", must be zero iff all result bits below r0 are zero for correct rounding
 132 | packx:
 133 |  lsrs r4,r0,#31 @ save sign bit
 134 |  lsls r4,r4,#31 @ sign now in b31
 135 |  bpl 2f         @ skip if positive
 136 |  cmp r5,#0
 137 |  beq 11f
 138 |  adds r0,#1     @ fiddle carry in to following rsb if sticky bits are non-zero
 139 | 11:
 140 |  rsbs r0,#0     @ can now treat r0 as unsigned
 141 | packx0:
 142 |  bmi 3f         @ catch r0=0x80000000 case
 143 | 2:
 144 |  subs r2,#1     @ normalisation loop
 145 |  adds r0,r0
 146 |  beq 1f         @ zero? special case
 147 |  bpl 2b         @ normalise so leading "1" in bit 31
 148 | 3:
 149 |  adds r2,#129   @ (mis-)offset exponent
 150 |  bne 12f        @ special case: highest denormal can round to lowest normal
 151 |  adds r0,#0x80  @ in special case, need to add 256 to r0 for rounding
 152 |  bcs 4f         @ tripped carry? then have leading 1 in C as required
 153 | 12:
 154 |  adds r0,#0x80  @ rounding
 155 |  bcs 4f         @ tripped carry? then have leading 1 in C as required (and result is even so can ignore sticky bits)
 156 |  cmp r5,#0
 157 |  beq 7f         @ sticky bits zero?
 158 | 8:
 159 |  lsls r0,#1     @ remove leading 1
 160 | 9:
 161 |  subs r2,#1     @ compensate exponent on this path
 162 | 4:
 163 |  cmp r2,#254
 164 |  bge 5f         @ overflow?
 165 |  adds r2,#1     @ correct exponent offset
 166 |  ble 10f        @ denormal/underflow?
 167 |  lsrs r0,#9     @ align mantissa
 168 |  lsls r2,#23    @ align exponent
 169 |  orrs r0,r2     @ assemble exponent and mantissa
 170 | 6:
 171 |  orrs r0,r4     @ apply sign
 172 | 1:
 173 |  bx r14
 174 | 
 175 | 5:
 176 |  movs r0,#0xff  @ create infinity
 177 |  lsls r0,#23
 178 |  b 6b
 179 | 
 180 | 10:
 181 |  movs r0,#0     @ create zero
 182 |  bx r14
 183 | 
 184 | 7:              @ sticky bit rounding case
 185 |  lsls r5,r0,#24 @ check bottom 8 bits of r0
 186 |  bne 8b         @ in rounding-tie case?
 187 |  lsrs r0,#9     @ ensure even result
 188 |  lsls r0,#10
 189 |  b 9b
 190 | 
 191 | .align 2
 192 | .ltorg
 193 | 
 194 | @ signed multiply r0 1Q23 by r1 4Q23, result in r0 7Q25, sticky bits in r5
 195 | @ trashes r3,r4
 196 | mul0:
 197 |  uxth r3,r0      @ Q23
 198 |  asrs r4,r1,#16  @ Q7
 199 |  muls r3,r4      @ L*H, Q30 signed
 200 |  asrs r4,r0,#16  @ Q7
 201 |  uxth r5,r1      @ Q23
 202 |  muls r4,r5      @ H*L, Q30 signed
 203 |  adds r3,r4      @ sum of middle partial products
 204 |  uxth r4,r0
 205 |  muls r4,r5      @ L*L, Q46 unsigned
 206 |  lsls r5,r4,#16  @ initialise sticky bits from low half of low partial product
 207 |  lsrs r4,#16     @ Q25
 208 |  adds r3,r4      @ add high half of low partial product to sum of middle partial products
 209 |                  @ (cannot generate carry by limits on input arguments)
 210 |  asrs r0,#16     @ Q7
 211 |  asrs r1,#16     @ Q7
 212 |  muls r0,r1      @ H*H, Q14 signed
 213 |  lsls r0,#11     @ high partial product Q25
 214 |  lsls r1,r3,#27  @ sticky
 215 |  orrs r5,r1      @ collect further sticky bits
 216 |  asrs r1,r3,#5   @ middle partial products Q25
 217 |  adds r0,r1      @ final result
 218 |  bx r14
 219 | 
 220 | .thumb_func
 221 | qfp_fcmp:
 222 |  lsls r2,r0,#1
 223 |  lsrs r2,#24
 224 |  beq 1f
 225 |  cmp r2,#0xff
 226 |  bne 2f
 227 | 1:
 228 |  lsrs r0,#23     @ clear mantissa if NaN or denormal
 229 |  lsls r0,#23
 230 | 2:
 231 |  lsls r2,r1,#1
 232 |  lsrs r2,#24
 233 |  beq 1f
 234 |  cmp r2,#0xff
 235 |  bne 2f
 236 | 1:
 237 |  lsrs r1,#23     @ clear mantissa if NaN or denormal
 238 |  lsls r1,#23
 239 | 2:
 240 |  movs r2,#1      @ initialise result
 241 |  eors r1,r0
 242 |  bmi 4f          @ opposite signs? then can proceed on basis of sign of x
 243 |  eors r1,r0      @ restore y
 244 |  bpl 1f
 245 |  rsbs r2,#0      @ both negative? flip comparison
 246 | 1:
 247 |  cmp r0,r1
 248 |  bgt 2f
 249 |  blt 3f
 250 | 5:
 251 |  movs r2,#0
 252 | 3:
 253 |  rsbs r2,#0
 254 | 2:
 255 |  subs r0,r2,#0
 256 |  bx r14
 257 | 4:
 258 |  orrs r1,r0
 259 |  adds r1,r1
 260 |  beq 5b
 261 |  cmp r0,#0
 262 |  bge 2b
 263 |  b 3b
 264 | 
 265 | @ convert float to signed int, rounding towards -Inf, clamping
 266 | .thumb_func
 267 | qfp_float2int:
 268 |  movs r1,#0      @ fall through
 269 | 
 270 | @ convert float in r0 to signed fixed point in r0, clamping
 271 | .thumb_func
 272 | qfp_float2fix:
 273 |  push {r4,r14}
 274 |  bl unpackx
 275 |  movs r3,r2
 276 |  adds r3,#130
 277 |  bmi 6f          @ -0?
 278 |  add r2,r1       @ incorporate binary point position into exponent
 279 |  subs r2,#23     @ r2 is now amount of left shift required
 280 |  blt 1f          @ requires right shift?
 281 |  cmp r2,#7       @ overflow?
 282 |  ble 4f
 283 | 3:               @ overflow
 284 |  asrs r1,r0,#31  @ +ve:0 -ve:0xffffffff
 285 |  mvns r1,r1      @ +ve:0xffffffff -ve:0
 286 |  movs r0,#1
 287 |  lsls r0,#31
 288 | 5:
 289 |  eors r0,r1      @ +ve:0x7fffffff -ve:0x80000000 (unsigned path: 0xffffffff)
 290 |  pop {r4,r15}
 291 | 1:
 292 |  rsbs r2,#0      @ right shift for r0, >0
 293 |  cmp r2,#32
 294 |  blt 2f          @ more than 32 bits of right shift?
 295 |  movs r2,#32
 296 | 2:
 297 |  asrs r0,r0,r2
 298 |  pop {r4,r15}
 299 | 6:
 300 |  movs r0,#0
 301 |  pop {r4,r15}
 302 | 
 303 | @ unsigned version
 304 | .thumb_func
 305 | qfp_float2uint:
 306 |  movs r1,#0      @ fall through
 307 | .thumb_func
 308 | qfp_float2ufix:
 309 |  push {r4,r14}
 310 |  bl unpackx
 311 |  add r2,r1       @ incorporate binary point position into exponent
 312 |  movs r1,r0
 313 |  bmi 5b          @ negative? return zero
 314 |  subs r2,#23     @ r2 is now amount of left shift required
 315 |  blt 1b          @ requires right shift?
 316 |  mvns r1,r0      @ ready to return 0xffffffff
 317 |  cmp r2,#8       @ overflow?
 318 |  bgt 5b
 319 | 4:
 320 |  lsls r0,r0,r2   @ result fits, left shifted
 321 |  pop {r4,r15}
 322 | 
 323 | 
 324 | @ convert uint64 to float, rounding
 325 | .thumb_func
 326 | qfp_uint642float:
 327 |  movs r2,#0       @ fall through
 328 | 
 329 | @ convert unsigned 64-bit fix to float, rounding; number of r0:r1 bits after point in r2
 330 | .thumb_func
 331 | qfp_ufix642float:
 332 |  push {r4,r5,r14}
 333 |  cmp r1,#0
 334 |  bpl 3f          @ positive? we can use signed code
 335 |  lsls r5,r1,#31  @ contribution to sticky bits
 336 |  orrs r5,r0
 337 |  lsrs r0,r1,#1
 338 |  subs r2,#1
 339 |  b 4f
 340 | 
 341 | @ convert int64 to float, rounding
 342 | .thumb_func
 343 | qfp_int642float:
 344 |  movs r2,#0       @ fall through
 345 | 
 346 | @ convert signed 64-bit fix to float, rounding; number of r0:r1 bits after point in r2
 347 | .thumb_func
 348 | qfp_fix642float:
 349 |  push {r4,r5,r14}
 350 | 3:
 351 |  movs r5,r0
 352 |  orrs r5,r1
 353 |  beq ret_pop45   @ zero? return +0
 354 |  asrs r5,r1,#31  @ sign bits
 355 | 2:
 356 |  asrs r4,r1,#24  @ try shifting 7 bits at a time
 357 |  cmp r4,r5
 358 |  bne 1f          @ next shift will overflow?
 359 |  lsls r1,#7
 360 |  lsrs r4,r0,#25
 361 |  orrs r1,r4
 362 |  lsls r0,#7
 363 |  adds r2,#7
 364 |  b 2b
 365 | 1:
 366 |  movs r5,r0
 367 |  movs r0,r1
 368 | 4:
 369 |  rsbs r2,#0
 370 |  adds r2,#32+29
 371 |  b packret
 372 | 
 373 | @ convert signed int to float, rounding
 374 | .thumb_func
 375 | qfp_int2float:
 376 |  movs r1,#0      @ fall through
 377 | 
 378 | @ convert signed fix to float, rounding; number of r0 bits after point in r1
 379 | .thumb_func
 380 | qfp_fix2float:
 381 |  push {r4,r5,r14}
 382 | 1:
 383 |  movs r2,#29
 384 |  subs r2,r1      @ fix exponent
 385 | packretns:       @ pack and return, sticky bits=0
 386 |  movs r5,#0
 387 | packret:         @ common return point: "pack and return"
 388 |  bl packx
 389 | ret_pop45:
 390 |  pop {r4,r5,r15}
 391 | 
 392 | 
 393 | @ unsigned version
 394 | .thumb_func
 395 | qfp_uint2float:
 396 |  movs r1,#0      @ fall through
 397 | .thumb_func
 398 | qfp_ufix2float:
 399 |  push {r4,r5,r14}
 400 |  cmp r0,#0
 401 |  bge 1b          @ treat <2^31 as signed
 402 |  movs r2,#30
 403 |  subs r2,r1      @ fix exponent
 404 |  lsls r5,r0,#31  @ one sticky bit
 405 |  lsrs r0,#1
 406 |  b packret
 407 | 
 408 | @ All the scientific functions are implemented using the CORDIC algorithm. For notation,
 409 | @ details not explained in the comments below, and a good overall survey see
 410 | @ "50 Years of CORDIC: Algorithms, Architectures, and Applications" by Meher et al.,
 411 | @ IEEE Transactions on Circuits and Systems Part I, Volume 56 Issue 9.
 412 | 
 413 | @ Register use:
 414 | @ r0: x
 415 | @ r1: y
 416 | @ r2: z/omega
 417 | @ r3: coefficient pointer
 418 | @ r4,r12: m
 419 | @ r5: i (shift)
 420 | 
 421 | cordic_start: @ initialisation
 422 |  movs r5,#0   @ initial shift=0
 423 |  mov r12,r4
 424 |  b 5f
 425 | 
 426 | cordic_vstep: @ one step of algorithm in vector mode
 427 |  cmp r1,#0    @ check sign of y
 428 |  bgt 4f
 429 |  b 1f
 430 | cordic_rstep: @ one step of algorithm in rotation mode
 431 |  cmp r2,#0    @ check sign of angle
 432 |  bge 1f
 433 | 4:
 434 |  subs r1,r6   @ negative rotation: y=y-(x>>i)
 435 |  rsbs r7,#0
 436 |  adds r2,r4   @ accumulate angle
 437 |  b 2f
 438 | 1:
 439 |  adds r1,r6   @ positive rotation: y=y+(x>>i)
 440 |  subs r2,r4   @ accumulate angle
 441 | 2:
 442 |  mov r4,r12
 443 |  muls r7,r4   @ apply sign from m
 444 |  subs r0,r7   @ finish rotation: x=x{+/-}(y>>i)
 445 | 5:
 446 |  ldmia r3!,{r4} @ fetch next angle from table and bump pointer
 447 |  lsrs r4,#1   @ repeated angle?
 448 |  bcs 3f
 449 |  adds r5,#1   @ adjust shift if not
 450 | 3:
 451 |  mov r6,r0
 452 |  asrs r6,r5   @ x>>i
 453 |  mov r7,r1
 454 |  asrs r7,r5   @ y>>i
 455 |  lsrs r4,#1   @ shift end flag into carry
 456 |  bx r14
 457 | 
 458 | @ CORDIC rotation mode
 459 | cordic_rot:
 460 |  push {r6,r7,r14}
 461 |  bl cordic_start   @ initialise
 462 | 1:
 463 |  bl cordic_rstep
 464 |  bcc 1b            @ step until table finished
 465 |  asrs r6,r0,#14    @ remaining small rotations can be linearised: see IV.B of paper referenced above
 466 |  asrs r7,r1,#14
 467 |  asrs r2,#3
 468 |  muls r6,r2        @ all remaining CORDIC steps in a multiplication
 469 |  muls r7,r2
 470 |  mov r4,r12
 471 |  muls r7,r4
 472 |  asrs r6,#12
 473 |  asrs r7,#12
 474 |  subs r0,r7        @ x=x{+/-}(yz>>k)
 475 |  adds r1,r6        @ y=y+(xz>>k)
 476 | cordic_exit:
 477 |  pop {r6,r7,r15}
 478 | 
 479 | @ CORDIC vector mode
 480 | cordic_vec:
 481 |  push {r6,r7,r14}
 482 |  bl cordic_start   @ initialise
 483 | 1:
 484 |  bl cordic_vstep
 485 |  bcc 1b            @ step until table finished
 486 | 4:
 487 |  cmp r1,#0         @ continue as in cordic_vstep but without using table; x is not affected as y is small
 488 |  bgt 2f            @ check sign of y
 489 |  adds r1,r6        @ positive rotation: y=y+(x>>i)
 490 |  subs r2,r4        @ accumulate angle
 491 |  b 3f
 492 | 2:
 493 |  subs r1,r6        @ negative rotation: y=y-(x>>i)
 494 |  adds r2,r4        @ accumulate angle
 495 | 3:
 496 |  asrs r6,#1
 497 |  asrs r4,#1        @ next "table entry"
 498 |  bne 4b
 499 |  b cordic_exit
 500 | 
 501 | .thumb_func
 502 | qfp_fsin:            @ calculate sin and cos using CORDIC rotation method
 503 |  push {r4,r5,r14}
 504 |  movs r1,#24
 505 |  bl qfp_float2fix    @ range reduction by repeated subtraction/addition in fixed point
 506 |  ldr r4,pi_q29
 507 |  lsrs r4,#4          @ 2pi Q24
 508 | 1:
 509 |  subs r0,r4
 510 |  bge 1b
 511 | 1:
 512 |  adds r0,r4
 513 |  bmi 1b              @ now in range 0..2pi
 514 |  lsls r2,r0,#2       @ z Q26
 515 |  lsls r5,r4,#1       @ pi Q26 (r4=pi/2 Q26)
 516 |  ldr r0,=#0x136e9db4 @ initialise CORDIC x,y with scaling
 517 |  movs r1,#0
 518 | 1:
 519 |  cmp r2,r4           @ >pi/2?
 520 |  blt 2f
 521 |  subs r2,r5          @ reduce range to -pi/2..pi/2
 522 |  rsbs r0,#0          @ rotate vector by pi
 523 |  b 1b
 524 | 2:
 525 |  lsls r2,#3          @ Q29
 526 |  adr r3,tab_cc       @ circular coefficients
 527 |  movs r4,#1          @ m=1
 528 |  bl cordic_rot
 529 |  adds r1,#9          @ fiddle factor to make sin(0)==0
 530 |  movs r2,#0          @ exponents to zero
 531 |  movs r3,#0
 532 |  movs r5,#0          @ no sticky bits
 533 |  bl clampx
 534 |  bl packx            @ pack cosine
 535 |  bl xchxy
 536 |  bl clampx
 537 |  b packretns         @ pack sine
 538 | 
 539 | .thumb_func
 540 | qfp_fcos:
 541 |  push {r14}
 542 |  bl qfp_fsin
 543 |  mov r0,r1           @ extract cosine result
 544 |  pop {r15}
 545 | 
 546 | @ force r0 to lie in range [-1,1] Q29
 547 | clampx:
 548 |  movs r4,#1
 549 |  lsls r4,#29
 550 |  cmp r0,r4
 551 |  bgt 1f
 552 |  rsbs r4,#0
 553 |  cmp r0,r4
 554 |  ble 1f
 555 |  bx r14
 556 | 1:
 557 |  movs r0,r4
 558 |  bx r14
 559 | 
 560 | .thumb_func
 561 | qfp_ftan:
 562 |  push {r4,r5,r6,r14}
 563 |  bl qfp_fsin         @ sine in r0/r2, cosine in r1/r3
 564 |  b fdiv_n            @ sin/cos
 565 | 
 566 | .thumb_func
 567 | qfp_fexp:
 568 |  push {r4,r5,r14}
 569 |  movs r1,#24
 570 |  bl qfp_float2fix    @ Q24: covers entire valid input range
 571 |  asrs r1,r0,#16      @ Q8
 572 |  ldr r2,=#5909       @ log_2(e) Q12
 573 |  muls r2,r1          @ estimate exponent of result Q20 (always an underestimate)
 574 |  asrs r2,#20         @ Q0
 575 |  lsls r1,r0,#6       @ Q30
 576 |  ldr r0,=#0x2c5c85fe @ ln(2) Q30
 577 |  muls r0,r2          @ accurate contribution of estimated exponent
 578 |  subs r1,r0          @ residual to be exponentiated, guaranteed ≥0, < about 0.75 Q30
 579 | 
 580 | @ here
 581 | @ r1: mantissa to exponentiate, 0...~0.75 Q30
 582 | @ r2: first exponent estimate
 583 | 
 584 |  movs r5,#1          @ shift
 585 |  adr r3,ftab_exp     @ could use alternate words from dtab_exp to save space if required
 586 |  movs r0,#1
 587 |  lsls r0,#29         @ x=1 Q29
 588 | 
 589 | 3:
 590 |  ldmia r3!,{r4}
 591 |  subs r4,r1,r4
 592 |  bmi 1f
 593 |  movs r1,r4          @ keep result of subtraction
 594 |  movs r4,r0
 595 |  lsrs r4,r5
 596 |  adcs r0,r4          @ x+=x>>i with rounding
 597 | 
 598 | 1:
 599 |  adds r5,#1
 600 |  cmp r5,#15
 601 |  bne 3b
 602 | 
 603 | @ here
 604 | @ r0: exp a Q29 1..2+
 605 | @ r1: ε (residual x where x=a+ε), < 2^-14 Q30
 606 | @ r2: first exponent estimate
 607 | @ and we wish to calculate exp x=exp a exp ε~(exp a)(1+ε)
 608 | 
 609 |  lsrs r3,r0,#15      @ exp a Q14
 610 |  muls r3,r1          @ ε exp a Q44
 611 |  lsrs r3,#15         @ ε exp a Q29
 612 |  adcs r0,r3          @ (1+ε) exp a Q29 with rounding
 613 | 
 614 |  b packretns         @ pack result
 615 | 
 616 | .thumb_func
 617 | qfp_fln:
 618 |  push {r4,r5,r14}
 619 |  asrs r1,r0,#23
 620 |  bmi 3f              @ -ve argument?
 621 |  beq 3f              @ 0 argument?
 622 |  cmp r1,#0xff
 623 |  beq 4f              @ +Inf/NaN
 624 |  bl unpackx
 625 |  adds r2,#1
 626 |  ldr r3,=#0x2c5c85fe @ ln(2) Q30
 627 |  lsrs r1,r3,#14      @ ln(2) Q16
 628 |  muls r1,r2          @ result estimate Q16
 629 |  asrs r1,#16         @ integer contribution to result
 630 |  muls r3,r2
 631 |  lsls r4,r1,#30
 632 |  subs r3,r4          @ fractional contribution to result Q30, signed
 633 |  lsls r0,#8          @ Q31
 634 | 
 635 | @ here
 636 | @ r0: mantissa Q31
 637 | @ r1: integer contribution to result
 638 | @ r3: fractional contribution to result Q30, signed
 639 | 
 640 |  movs r5,#1          @ shift
 641 |  adr r4,ftab_exp     @ could use alternate words from dtab_exp to save space if required
 642 | 
 643 | 2:
 644 |  movs r2,r0
 645 |  lsrs r2,r5
 646 |  adcs r2,r0          @ x+(x>>i) with rounding
 647 |  bcs 1f              @ >=2?
 648 |  movs r0,r2          @ keep result
 649 |  ldr r2,[r4]
 650 |  subs r3,r2
 651 | 1:
 652 |  adds r4,#4
 653 |  adds r5,#1
 654 |  cmp r5,#15
 655 |  bne 2b
 656 | 
 657 | @ here
 658 | @ r0: residual x, nearly 2 Q31
 659 | @ r1: integer contribution to result
 660 | @ r3: fractional part of result Q30
 661 | 
 662 |  asrs r0,#2
 663 |  adds r0,r3,r0
 664 | 
 665 |  cmp r1,#0
 666 |  bne 2f
 667 | 
 668 |  asrs r0,#1
 669 |  lsls r1,#29
 670 |  adds r0,r1
 671 |  movs r2,#0
 672 |  b packretns
 673 | 
 674 | 2:
 675 |  lsls r1,#24
 676 |  asrs r0,#6          @ Q24
 677 |  adcs r0,r1          @ with rounding
 678 |  movs r2,#5
 679 |  b packretns
 680 | 
 681 | 3:
 682 |  ldr r0,=#0xff800000 @ -Inf
 683 |  pop {r4,r5,r15}
 684 | 4:
 685 |  ldr r0,=#0x7f800000 @ +Inf
 686 |  pop {r4,r5,r15}
 687 | 
 688 | .align 2
 689 | ftab_exp:
 690 | .word 0x19f323ed   @ log 1+2^-1 Q30
 691 | .word 0x0e47fbe4   @ log 1+2^-2 Q30
 692 | .word 0x0789c1dc   @ log 1+2^-3 Q30
 693 | .word 0x03e14618   @ log 1+2^-4 Q30
 694 | .word 0x01f829b1   @ log 1+2^-5 Q30
 695 | .word 0x00fe0546   @ log 1+2^-6 Q30
 696 | .word 0x007f80aa   @ log 1+2^-7 Q30
 697 | .word 0x003fe015   @ log 1+2^-8 Q30
 698 | .word 0x001ff803   @ log 1+2^-9 Q30
 699 | .word 0x000ffe00   @ log 1+2^-10 Q30
 700 | .word 0x0007ff80   @ log 1+2^-11 Q30
 701 | .word 0x0003ffe0   @ log 1+2^-12 Q30
 702 | .word 0x0001fff8   @ log 1+2^-13 Q30
 703 | .word 0x0000fffe   @ log 1+2^-14 Q30
 704 | 
 705 | .thumb_func
 706 | qfp_fatan2:
 707 |  push {r4,r5,r14}
 708 | 
 709 | @ unpack arguments and shift one down to have common exponent
 710 |  bl unpackx
 711 |  bl xchxy
 712 |  bl unpackx
 713 |  lsls r0,r0,#5  @ Q28
 714 |  lsls r1,r1,#5  @ Q28
 715 |  adds r4,r2,r3  @ this is -760 if both arguments are 0 and at least -380-126=-506 otherwise
 716 |  asrs r4,#9
 717 |  adds r4,#1
 718 |  bmi 2f         @ force y to 0 proper, so result will be zero
 719 |  subs r4,r2,r3  @ calculate shift
 720 |  bge 1f         @ ex>=ey?
 721 |  rsbs r4,#0     @ make shift positive
 722 |  asrs r0,r4
 723 |  cmp r4,#28
 724 |  blo 3f
 725 |  asrs r0,#31
 726 |  b 3f
 727 | 1:
 728 |  asrs r1,r4
 729 |  cmp r4,#28
 730 |  blo 3f
 731 | 2:
 732 | @ here |x|>>|y| or both x and y are ±0
 733 |  cmp r0,#0
 734 |  bge 4f         @ x positive, return signed 0
 735 |  ldr r0,pi_q29  @ x negative, return +/- pi
 736 |  asrs r1,#31
 737 |  eors r0,r1
 738 |  b 7f
 739 | 4:
 740 |  asrs r0,r1,#31
 741 |  b 7f
 742 | 3:
 743 |  movs r2,#0              @ initial angle
 744 |  cmp r0,#0               @ x negative
 745 |  bge 5f
 746 |  rsbs r0,#0              @ rotate to 1st/4th quadrants
 747 |  rsbs r1,#0
 748 |  ldr r2,pi_q29           @ pi Q29
 749 | 5:
 750 |  adr r3,tab_cc           @ circular coefficients
 751 |  movs r4,#1              @ m=1
 752 |  bl cordic_vec           @ also produces magnitude (with scaling factor 1.646760119), which is discarded
 753 |  mov r0,r2               @ result here is -pi/2..3pi/2 Q29
 754 | @ asrs r2,#29
 755 | @ subs r0,r2
 756 |  ldr r2,pi_q29           @ pi Q29
 757 |  adds r4,r0,r2           @ attempt to fix -3pi/2..-pi case
 758 |  bcs 6f                  @ -pi/2..0? leave result as is
 759 |  subs r4,r0,r2           @ <pi? leave as is
 760 |  bmi 6f
 761 |  subs r0,r4,r2           @ >pi: take off 2pi
 762 | 6:
 763 |  subs r0,#1              @ fiddle factor so atan2(0,1)==0
 764 | 7:
 765 |  movs r2,#0              @ exponent for pack
 766 |  b packretns
 767 | 
 768 | .align 2
 769 | .ltorg
 770 | 
 771 | @ first entry in following table is pi Q29
 772 | pi_q29:
 773 | @ circular CORDIC coefficients: atan(2^-i), b0=flag for preventing shift, b1=flag for end of table
 774 | tab_cc:
 775 | .word 0x1921fb54*4+1     @ no shift before first iteration
 776 | .word 0x0ed63383*4+0
 777 | .word 0x07d6dd7e*4+0
 778 | .word 0x03fab753*4+0
 779 | .word 0x01ff55bb*4+0
 780 | .word 0x00ffeaae*4+0
 781 | .word 0x007ffd55*4+0
 782 | .word 0x003fffab*4+0
 783 | .word 0x001ffff5*4+0
 784 | .word 0x000fffff*4+0
 785 | .word 0x0007ffff*4+0
 786 | .word 0x00040000*4+0
 787 | .word 0x00020000*4+0+2   @ +2 marks end
 788 | 
 789 | .align 2
 790 | .thumb_func
 791 | qfp_fsub:
 792 |  ldr r2,=#0x80000000
 793 |  eors r1,r2    @ flip sign on second argument
 794 | @ drop into fadd, on .align2:ed boundary
 795 | 
 796 | .thumb_func
 797 | qfp_fadd:
 798 |  push {r4,r5,r6,r14}
 799 |  asrs r4,r0,#31
 800 |  lsls r2,r0,#1
 801 |  lsrs r2,#24     @ x exponent
 802 |  beq fa_xe0
 803 |  cmp r2,#255
 804 |  beq fa_xe255
 805 | fa_xe:
 806 |  asrs r5,r1,#31
 807 |  lsls r3,r1,#1
 808 |  lsrs r3,#24     @ y exponent
 809 |  beq fa_ye0
 810 |  cmp r3,#255
 811 |  beq fa_ye255
 812 | fa_ye:
 813 |  ldr r6,=#0x007fffff
 814 |  ands r0,r0,r6   @ extract mantissa bits
 815 |  ands r1,r1,r6
 816 |  adds r6,#1      @ r6=0x00800000
 817 |  orrs r0,r0,r6   @ set implied 1
 818 |  orrs r1,r1,r6
 819 |  eors r0,r0,r4   @ complement...
 820 |  eors r1,r1,r5
 821 |  subs r0,r0,r4   @ ... and add 1 if sign bit is set: 2's complement
 822 |  subs r1,r1,r5
 823 |  subs r5,r3,r2   @ ye-xe
 824 |  subs r4,r2,r3   @ xe-ye
 825 |  bmi fa_ygtx
 826 | @ here xe>=ye
 827 |  cmp r4,#30
 828 |  bge fa_xmgty    @ xe much greater than ye?
 829 |  adds r5,#32
 830 |  movs r3,r2      @ save exponent
 831 | @ here y in r1 must be shifted down r4 places to align with x in r0
 832 |  movs r2,r1
 833 |  lsls r2,r2,r5   @ keep the bits we will shift off the bottom of r1
 834 |  asrs r1,r1,r4
 835 |  b fa_0
 836 | 
 837 | .ltorg
 838 |  
 839 | fa_ymgtx:
 840 |  movs r2,#0      @ result is just y
 841 |  movs r0,r1
 842 |  b fa_1
 843 | fa_xmgty:
 844 |  movs r3,r2      @ result is just x
 845 |  movs r2,#0
 846 |  b fa_1
 847 | 
 848 | fa_ygtx:
 849 | @ here ye>xe
 850 |  cmp r5,#30
 851 |  bge fa_ymgtx    @ ye much greater than xe?
 852 |  adds r4,#32
 853 | @ here x in r0 must be shifted down r5 places to align with y in r1
 854 |  movs r2,r0
 855 |  lsls r2,r2,r4   @ keep the bits we will shift off the bottom of r0
 856 |  asrs r0,r0,r5
 857 | fa_0:
 858 |  adds r0,r1      @ result is now in r0:r2, possibly highly denormalised or zero; exponent in r3
 859 |  beq fa_9        @ if zero, inputs must have been of identical magnitude and opposite sign, so return +0
 860 | fa_1: 
 861 |  lsrs r1,r0,#31  @ sign bit
 862 |  beq fa_8
 863 |  mvns r0,r0
 864 |  rsbs r2,r2,#0
 865 |  bne fa_8
 866 |  adds r0,#1
 867 | fa_8:
 868 |  adds r6,r6
 869 | @ r6=0x01000000
 870 |  cmp r0,r6
 871 |  bhs fa_2
 872 | fa_3:
 873 |  adds r2,r2      @ normalisation loop
 874 |  adcs r0,r0
 875 |  subs r3,#1      @ adjust exponent
 876 |  cmp r0,r6
 877 |  blo fa_3
 878 | fa_2:
 879 | @ here r0:r2 is the result mantissa 0x01000000<=r0<0x02000000, r3 the exponent, and r1 the sign bit
 880 |  lsrs r0,#1
 881 |  bcc fa_4
 882 | @ rounding bits here are 1:r2
 883 |  adds r0,#1      @ round up
 884 |  cmp r2,#0
 885 |  beq fa_5        @ sticky bits all zero?
 886 | fa_4:
 887 |  cmp r3,#254
 888 |  bhs fa_6        @ exponent too large or negative?
 889 |  lsls r1,#31     @ pack everything
 890 |  add r0,r1
 891 |  lsls r3,#23
 892 |  add r0,r3
 893 | fa_end:
 894 |  pop {r4,r5,r6,r15}
 895 | 
 896 | fa_9:
 897 |  cmp r2,#0       @ result zero?
 898 |  beq fa_end      @ return +0
 899 |  b fa_1
 900 | 
 901 | fa_5:
 902 |  lsrs r0,#1
 903 |  lsls r0,#1      @ round to even
 904 |  b fa_4
 905 | 
 906 | fa_6:
 907 |  bge fa_7
 908 | @ underflow
 909 | @ can handle denormals here
 910 |  lsls r0,r1,#31  @ result is signed zero
 911 |  pop {r4,r5,r6,r15}
 912 | fa_7:
 913 | @ overflow
 914 |  lsls r0,r1,#8
 915 |  adds r0,#255
 916 |  lsls r0,#23     @ result is signed infinity
 917 |  pop {r4,r5,r6,r15}
 918 | 
 919 | 
 920 | fa_xe0:
 921 | @ can handle denormals here
 922 |  subs r2,#32
 923 |  adds r2,r4       @ exponent -32 for +Inf, -33 for -Inf
 924 |  b fa_xe
 925 | 
 926 | fa_xe255:
 927 | @ can handle NaNs here
 928 |  lsls r2,#8
 929 |  add r2,r2,r4 @ exponent ~64k for +Inf, ~64k-1 for -Inf
 930 |  b fa_xe
 931 | 
 932 | fa_ye0:
 933 | @ can handle denormals here
 934 |  subs r3,#32
 935 |  adds r3,r5       @ exponent -32 for +Inf, -33 for -Inf
 936 |  b fa_ye
 937 | 
 938 | fa_ye255:
 939 | @ can handle NaNs here
 940 |  lsls r3,#8
 941 |  add r3,r3,r5 @ exponent ~64k for +Inf, ~64k-1 for -Inf
 942 |  b fa_ye
 943 | 
 944 | 
 945 | .align 2
 946 | .thumb_func
 947 | qfp_fmul:
 948 |  push {r7,r14}
 949 |  mov r2,r0
 950 |  eors r2,r1       @ sign of result
 951 |  lsrs r2,#31
 952 |  lsls r2,#31
 953 |  mov r14,r2
 954 |  lsls r0,#1
 955 |  lsls r1,#1
 956 |  lsrs r2,r0,#24   @ xe
 957 |  beq fm_xe0
 958 |  cmp r2,#255
 959 |  beq fm_xe255
 960 | fm_xe:
 961 |  lsrs r3,r1,#24   @ ye
 962 |  beq fm_ye0
 963 |  cmp r3,#255
 964 |  beq fm_ye255
 965 | fm_ye:
 966 |  adds r7,r2,r3    @ exponent of result (will possibly be incremented)
 967 |  subs r7,#128     @ adjust bias for packing
 968 |  lsls r0,#8       @ x mantissa
 969 |  lsls r1,#8       @ y mantissa
 970 |  lsrs r0,#9
 971 |  lsrs r1,#9
 972 | 
 973 |  adds r2,r0,r1    @ for later
 974 |  mov r12,r2
 975 |  lsrs r2,r0,#7    @ x[22..7] Q16
 976 |  lsrs r3,r1,#7    @ y[22..7] Q16
 977 |  muls r2,r2,r3    @ result [45..14] Q32: never an overestimate and worst case error is 2*(2^7-1)*(2^23-2^7)+(2^7-1)^2 = 2130690049 < 2^31
 978 |  muls r0,r0,r1    @ result [31..0] Q46
 979 |  lsrs r2,#18      @ result [45..32] Q14
 980 |  bcc 1f
 981 |  cmp r0,#0
 982 |  bmi 1f
 983 |  adds r2,#1       @ fix error in r2
 984 | 1:
 985 |  lsls r3,r0,#9    @ bits off bottom of result
 986 |  lsrs r0,#23      @ Q23
 987 |  lsls r2,#9
 988 |  adds r0,r2       @ cut'n'shut
 989 |  add r0,r12       @ implied 1*(x+y) to compensate for no insertion of implied 1s
 990 | @ result-1 in r3:r0 Q23+32, i.e., in range [0,3)
 991 | 
 992 |  lsrs r1,r0,#23
 993 |  bne fm_0         @ branch if we need to shift down one place
 994 | @ here 1<=result<2
 995 |  cmp r7,#254
 996 |  bhs fm_3a        @ catches both underflow and overflow
 997 |  lsls r3,#1       @ sticky bits at top of R3, rounding bit in carry
 998 |  bcc fm_1         @ no rounding
 999 |  beq fm_2         @ rounding tie?
1000 |  adds r0,#1       @ round up
1001 | fm_1:
1002 |  adds r7,#1       @ for implied 1
1003 |  lsls r7,#23      @ pack result
1004 |  add r0,r7
1005 |  add r0,r14
1006 |  pop {r7,r15}
1007 | fm_2:             @ rounding tie
1008 |  adds r0,#1
1009 | fm_3:
1010 |  lsrs r0,#1
1011 |  lsls r0,#1       @ clear bottom bit
1012 |  b fm_1
1013 | 
1014 | @ here 1<=result-1<3
1015 | fm_0:
1016 |  adds r7,#1       @ increment exponent
1017 |  cmp r7,#254
1018 |  bhs fm_3b        @ catches both underflow and overflow
1019 |  lsrs r0,#1       @ shift mantissa down
1020 |  bcc fm_1a        @ no rounding
1021 |  adds r0,#1       @ assume we will round up
1022 |  cmp r3,#0        @ sticky bits
1023 |  beq fm_3c        @ rounding tie?
1024 | fm_1a:
1025 |  adds r7,r7
1026 |  adds r7,#1       @ for implied 1
1027 |  lsls r7,#22      @ pack result
1028 |  add r0,r7
1029 |  add r0,r14
1030 |  pop {r7,r15}
1031 | 
1032 | fm_3c:
1033 |  lsrs r0,#1
1034 |  lsls r0,#1       @ clear bottom bit
1035 |  b fm_1a
1036 | 
1037 | fm_xe0:
1038 |  subs r2,#16
1039 | fm_xe255:
1040 |  lsls r2,#8
1041 |  b fm_xe
1042 | fm_ye0:
1043 |  subs r3,#16
1044 | fm_ye255:
1045 |  lsls r3,#8
1046 |  b fm_ye
1047 | 
1048 | @ here the result is under- or overflowing
1049 | fm_3b:
1050 |  bge fm_4        @ branch on overflow
1051 | @ trap case where result is denormal 0x007fffff + 0.5ulp or more
1052 |  adds r7,#1      @ exponent=-1?
1053 |  bne fm_5
1054 | @ corrected mantissa will be >= 3.FFFFFC (0x1fffffe Q23)
1055 | @ so r0 >= 2.FFFFFC (0x17ffffe Q23)
1056 |  adds r0,#2
1057 |  lsrs r0,#23
1058 |  cmp r0,#3
1059 |  bne fm_5
1060 |  b fm_6
1061 | 
1062 | fm_3a:
1063 |  bge fm_4        @ branch on overflow
1064 | @ trap case where result is denormal 0x007fffff + 0.5ulp or more
1065 |  adds r7,#1      @ exponent=-1?
1066 |  bne fm_5
1067 |  adds r0,#1      @ mantissa=0xffffff (i.e., r0=0x7fffff)?
1068 |  lsrs r0,#23
1069 |  beq fm_5
1070 | fm_6:
1071 |  movs r0,#1      @ return smallest normal
1072 |  lsls r0,#23
1073 |  add r0,r14
1074 |  pop {r7,r15}
1075 | 
1076 | fm_5:
1077 |  mov r0,r14
1078 |  pop {r7,r15}
1079 | fm_4:
1080 |  movs r0,#0xff
1081 |  lsls r0,#23
1082 |  add r0,r14
1083 |  pop {r7,r15}
1084 | 
1085 | @ This version of the division algorithm uses external divider hardware to estimate the
1086 | @ reciprocal of the divisor to about 14 bits; then a multiplication step to get a first
1087 | @ quotient estimate; then the remainder based on this estimate is used to calculate a
1088 | @ correction to the quotient. The result is good to about 27 bits and so we only need
1089 | @ to calculate the exact remainder when close to a rounding boundary.
1090 | .align 2
1091 | .thumb_func
1092 | qfp_fdiv:
1093 |  push {r4,r5,r6,r14}
1094 | fdiv_n:
1095 | 
1096 |  movs r4,#1
1097 |  lsls r4,#23   @ implied 1 position
1098 |  lsls r2,r1,#9 @ clear out sign and exponent
1099 |  lsrs r2,r2,#9
1100 |  orrs r2,r2,r4 @ divisor mantissa Q23 with implied 1
1101 | 
1102 | @ here
1103 | @ r0=packed dividend
1104 | @ r1=packed divisor
1105 | @ r2=divisor mantissa Q23
1106 | @ r4=1<<23
1107 | 
1108 | // see divtest.c
1109 |  lsrs r3,r2,#18 @ x2=x>>18; // Q5 32..63
1110 |  adr r5,rcpapp-32
1111 |  ldrb r3,[r5,r3] @ u=lut5[x2-32]; // Q8
1112 |  lsls r5,r2,#5
1113 |  muls r5,r5,r3
1114 |  asrs r5,#14 @ e=(i32)(u*(x<<5))>>14; // Q22
1115 |  asrs r6,r5,#11
1116 |  muls r6,r6,r6 @ e2=(e>>11)*(e>>11); // Q22
1117 |  subs r5,r6
1118 |  muls r5,r5,r3 @ c=(e-e2)*u; // Q30
1119 |  lsls r6,r3,#8
1120 |  asrs r5,#13
1121 |  adds r5,#1
1122 |  asrs r5,#1
1123 |  subs r5,r6,r5 @ u0=(u<<8)-((c+0x2000)>>14); // Q16
1124 | 
1125 | @ here
1126 | @ r0=packed dividend
1127 | @ r1=packed divisor
1128 | @ r2=divisor mantissa Q23
1129 | @ r4=1<<23
1130 | @ r5=reciprocal estimate Q16
1131 | 
1132 |  lsrs r6,r0,#23
1133 |  uxtb r3,r6        @ dividend exponent
1134 |  lsls r0,#9
1135 |  lsrs r0,#9
1136 |  orrs r0,r0,r4     @ dividend mantissa Q23
1137 | 
1138 |  lsrs r1,#23
1139 |  eors r6,r1        @ sign of result in bit 8
1140 |  lsrs r6,#8
1141 |  lsls r6,#31       @ sign of result in bit 31, other bits clear
1142 | 
1143 | @ here
1144 | @ r0=dividend mantissa Q23
1145 | @ r1=divisor sign+exponent
1146 | @ r2=divisor mantissa Q23
1147 | @ r3=dividend exponent
1148 | @ r5=reciprocal estimate Q16
1149 | @ r6b31=sign of result
1150 | 
1151 |  uxtb r1,r1        @ divisor exponent
1152 |  cmp r1,#0
1153 |  beq retinf
1154 |  cmp r1,#255
1155 |  beq 20f           @ divisor is infinite
1156 |  cmp r3,#0
1157 |  beq retzero
1158 |  cmp r3,#255
1159 |  beq retinf
1160 |  subs r3,r1        @ initial result exponent (no bias)
1161 |  adds r3,#125      @ add bias
1162 | 
1163 |  lsrs r1,r0,#8     @ dividend mantissa Q15
1164 | 
1165 | @ here
1166 | @ r0=dividend mantissa Q23
1167 | @ r1=dividend mantissa Q15
1168 | @ r2=divisor mantissa Q23
1169 | @ r3=initial result exponent
1170 | @ r5=reciprocal estimate Q16
1171 | @ r6b31=sign of result
1172 | 
1173 |  muls r1,r5
1174 | 
1175 |  lsrs r1,#16       @ Q15 qu0=(q15)(u*y0);
1176 |  lsls r0,r0,#15    @ dividend Q38
1177 |  movs r4,r2
1178 |  muls r4,r1        @ Q38 qu0*x
1179 |  subs r4,r0,r4     @ Q38 re0=(y<<15)-qu0*x; note this remainder is signed
1180 |  asrs r4,#10
1181 |  muls r4,r5        @ Q44 qu1=(re0>>10)*u; this quotient correction is also signed
1182 |  asrs r4,#16       @ Q28
1183 |  lsls r1,#13
1184 |  adds r1,r1,r4     @ Q28 qu=(qu0<<13)+(qu1>>16);
1185 | 
1186 | @ here
1187 | @ r0=dividend mantissa Q38
1188 | @ r1=quotient Q28
1189 | @ r2=divisor mantissa Q23
1190 | @ r3=initial result exponent
1191 | @ r6b31=sign of result
1192 | 
1193 |  lsrs r4,r1,#28
1194 |  bne 1f
1195 | @ here the quotient is less than 1<<28 (i.e., result mantissa <1.0)
1196 | 
1197 |  adds r1,#5
1198 |  lsrs r4,r1,#4     @ rounding + small reduction in systematic bias
1199 |  bcc 2f            @ skip if we are not near a rounding boundary
1200 |  lsrs r1,#3        @ quotient Q25
1201 |  lsls r0,#10       @ dividend mantissa Q48
1202 |  muls r1,r1,r2     @ quotient*divisor Q48
1203 |  subs r0,r0,r1     @ remainder Q48
1204 |  bmi 2f
1205 |  b 3f
1206 | 
1207 | 1:
1208 | @ here the quotient is at least 1<<28 (i.e., result mantissa >=1.0)
1209 | 
1210 |  adds r3,#1        @ bump exponent (and shift mantissa down one more place)
1211 |  adds r1,#9
1212 |  lsrs r4,r1,#5     @ rounding + small reduction in systematic bias
1213 |  bcc 2f            @ skip if we are not near a rounding boundary
1214 | 
1215 |  lsrs r1,#4        @ quotient Q24
1216 |  lsls r0,#9        @ dividend mantissa Q47
1217 |  muls r1,r1,r2     @ quotient*divisor Q47
1218 |  subs r0,r0,r1     @ remainder Q47
1219 |  bmi 2f
1220 | 3:
1221 |  adds r4,#1        @ increment quotient as we are above the rounding boundary
1222 | 
1223 | @ here
1224 | @ r3=result exponent
1225 | @ r4=correctly rounded quotient Q23 in range [1,2] *note closed interval*
1226 | @ r6b31=sign of result
1227 | 
1228 | 2:
1229 |  cmp r3,#254
1230 |  bhs 10f           @ this catches both underflow and overflow
1231 |  lsls r1,r3,#23
1232 |  adds r0,r4,r1
1233 |  adds r0,r6
1234 |  pop {r4,r5,r6,r15}
1235 | 
1236 | @ here divisor is infinite; dividend exponent in r3
1237 | 20:
1238 |  cmp r3,#255
1239 |  bne retzero
1240 | 
1241 | retinf:
1242 |  movs r0,#255
1243 | 21:
1244 |  lsls r0,#23
1245 |  orrs r0,r6
1246 |  pop {r4,r5,r6,r15}
1247 | 
1248 | 10:
1249 |  bge retinf       @ overflow?
1250 |  adds r1,r3,#1
1251 |  bne retzero      @ exponent <-1? return 0
1252 | @ here exponent is exactly -1
1253 |  lsrs r1,r4,#25
1254 |  bcc retzero      @ mantissa is not 01000000?
1255 | @ return minimum normal
1256 |  movs r0,#1
1257 |  lsls r0,#23
1258 |  orrs r0,r6
1259 |  pop {r4,r5,r6,r15}
1260 | 
1261 | retzero:
1262 |  movs r0,r6
1263 |  pop {r4,r5,r6,r15}
1264 | 
1265 | @ x2=[32:1:63]/32;
1266 | @ round(256 ./(x2+1/64))
1267 | .align 2
1268 | rcpapp:
1269 | .byte 252,245,237,231,224,218,213,207,202,197,193,188,184,180,176,172
1270 | .byte 169,165,162,159,156,153,150,148,145,142,140,138,135,133,131,129
1271 | 
1272 | @ The square root routine uses an initial approximation to the reciprocal of the square root of the argument based
1273 | @ on the top four bits of the mantissa (possibly shifted one place to make the exponent even). It then performs two
1274 | @ Newton-Raphson iterations, resulting in about 14 bits of accuracy. This reciprocal is then multiplied by
1275 | @ the original argument to produce an approximation to the result, again with about 14 bits of accuracy.
1276 | @ Then a remainder is calculated, and multiplied by the reciprocal estiamte to generate a correction term
1277 | @ giving a final answer to about 28 bits of accuracy. A final remainder calculation rounds to the correct
1278 | @ result if necessary.
1279 | @ Again, the fixed-point calculation is carefully implemented to preserve accuracy, and similar comments to those
1280 | @ made above on the fast division routine apply.
1281 | @ The reciprocal square root calculation has been tested for all possible (possibly shifted) input mantissa values.
1282 | .align 2
1283 | .thumb_func
1284 | qfp_fsqrt:
1285 |  push {r4}
1286 |  lsls r1,r0,#1
1287 |  bcs sq_0         @ negative?
1288 |  lsls r1,#8
1289 |  lsrs r1,#9       @ mantissa
1290 |  movs r2,#1
1291 |  lsls r2,#23
1292 |  adds r1,r2       @ insert implied 1
1293 |  lsrs r2,r0,#23   @ extract exponent
1294 |  beq sq_2         @ zero?
1295 |  cmp r2,#255      @ infinite?
1296 |  beq sq_1
1297 |  adds r2,#125     @ correction for packing
1298 |  asrs r2,#1       @ exponent/2, LSB into carry
1299 |  bcc 1f
1300 |  lsls r1,#1       @ was even: double mantissa; mantissa y now 1..4 Q23
1301 | 1:
1302 |  adr r4,rsqrtapp-4@ first four table entries are never accessed because of the mantissa's leading 1
1303 |  lsrs r3,r1,#21   @ y Q2
1304 |  ldrb r4,[r4,r3]  @ initial approximation to reciprocal square root a0 Q8
1305 | 
1306 |  lsrs r0,r1,#7    @ y Q16: first Newton-Raphson iteration
1307 |  muls r0,r4       @ a0*y Q24
1308 |  muls r0,r4       @ r0=p0=a0*y*y Q32
1309 |  asrs r0,#12      @ r0 Q20
1310 |  muls r0,r4       @ dy0=a0*r0 Q28
1311 |  asrs r0,#13      @ dy0 Q15
1312 |  lsls r4,#8       @ a0 Q16
1313 |  subs r4,r0       @ a1=a0-dy0/2 Q16-Q15/2 -> Q16
1314 |  adds r4,#170     @ mostly remove systematic error in this approximation: gains approximately 1 bit
1315 | 
1316 |  movs r0,r4       @ second Newton-Raphson iteration
1317 |  muls r0,r0       @ a1*a1 Q32
1318 |  lsrs r0,#15      @ a1*a1 Q17
1319 |  lsrs r3,r1,#8    @ y Q15
1320 |  muls r0,r3       @ r1=p1=a1*a1*y Q32
1321 |  asrs r0,#12      @ r1 Q20
1322 |  muls r0,r4       @ dy1=a1*r1 Q36
1323 |  asrs r0,#21      @ dy1 Q15
1324 |  subs r4,r0       @ a2=a1-dy1/2 Q16-Q15/2 -> Q16
1325 | 
1326 |  muls r3,r4       @ a3=y*a2 Q31
1327 |  lsrs r3,#15      @ a3 Q16
1328 | @ here a2 is an approximation to the reciprocal square root
1329 | @ and a3 is an approximation to the square root
1330 |  movs r0,r3
1331 |  muls r0,r0       @ a3*a3 Q32
1332 |  lsls r1,#9       @ y Q32
1333 |  subs r0,r1,r0    @ r2=y-a3*a3 Q32 remainder
1334 |  asrs r0,#5       @ r2 Q27
1335 |  muls r4,r0       @ r2*a2 Q43
1336 |  lsls r3,#7       @ a3 Q23
1337 |  asrs r0,r4,#15   @ r2*a2 Q28
1338 |  adds r0,#16      @ rounding to Q24
1339 |  asrs r0,r0,#6    @ r2*a2 Q22
1340 |  add r3,r0        @ a4 Q23: candidate final result
1341 |  bcc sq_3         @ near rounding boundary? skip if no rounding needed
1342 |  mov r4,r3
1343 |  adcs r4,r4       @ a4+0.5ulp Q24
1344 |  muls r4,r4       @ Q48
1345 |  lsls r1,#16      @ y Q48
1346 |  subs r1,r4       @ remainder Q48
1347 |  bmi sq_3
1348 |  adds r3,#1       @ round up
1349 | sq_3:
1350 |  lsls r2,#23      @ pack exponent
1351 |  adds r0,r2,r3
1352 | sq_6:
1353 |  pop {r4}
1354 |  bx r14
1355 | 
1356 | sq_0:
1357 |  lsrs r1,#24
1358 |  beq sq_2         @ -0: return it
1359 | @ here negative and not -0: return -Inf
1360 |  asrs r0,#31
1361 | sq_5:
1362 |  lsls r0,#23
1363 |  b sq_6
1364 | sq_1:             @ +Inf
1365 |  lsrs r0,#23
1366 |  b sq_5
1367 | sq_2:
1368 |  lsrs r0,#31
1369 |  lsls r0,#31
1370 |  b sq_6
1371 | 
1372 | @ round(sqrt(2^22./[72:16:248]))
1373 | rsqrtapp:
1374 | .byte 0xf1,0xda,0xc9,0xbb, 0xb0,0xa6,0x9e,0x97, 0x91,0x8b,0x86,0x82
1375 | 
1376 | 
1377 | 
1378 | @ Notation:
1379 | @ rx:ry means the concatenation of rx and ry with rx having the less significant bits
1380 | 
1381 | @ IEEE double in ra:rb ->
1382 | @ mantissa in ra:rb 12Q52 (53 significant bits) with implied 1 set
1383 | @ exponent in re
1384 | @ sign in rs
1385 | @ trashes rt
1386 | .macro mdunpack ra,rb,re,rs,rt
1387 |  lsrs \re,\rb,#20              @ extract sign and exponent
1388 |  subs \rs,\re,#1
1389 |  lsls \rs,#20
1390 |  subs \rb,\rs                  @ clear sign and exponent in mantissa; insert implied 1
1391 |  lsrs \rs,\re,#11              @ sign
1392 |  lsls \re,#21
1393 |  lsrs \re,#21                  @ exponent
1394 |  beq l\@_1                     @ zero exponent?
1395 |  adds \rt,\re,#1
1396 |  lsrs \rt,#11
1397 |  beq l\@_2                     @ exponent != 0x7ff? then done
1398 | l\@_1:
1399 |  movs \ra,#0
1400 |  movs \rb,#1
1401 |  lsls \rb,#20
1402 |  subs \re,#128
1403 |  lsls \re,#12
1404 | l\@_2:
1405 | .endm
1406 | 
1407 | @ IEEE double in ra:rb ->
1408 | @ signed mantissa in ra:rb 12Q52 (53 significant bits) with implied 1
1409 | @ exponent in re
1410 | @ trashes rt0 and rt1
1411 | @ +zero, +denormal -> exponent=-0x80000
1412 | @ -zero, -denormal -> exponent=-0x80000
1413 | @ +Inf, +NaN -> exponent=+0x77f000
1414 | @ -Inf, -NaN -> exponent=+0x77e000
1415 | .macro mdunpacks ra,rb,re,rt0,rt1
1416 |  lsrs \re,\rb,#20              @ extract sign and exponent
1417 |  lsrs \rt1,\rb,#31             @ sign only
1418 |  subs \rt0,\re,#1
1419 |  lsls \rt0,#20
1420 |  subs \rb,\rt0                 @ clear sign and exponent in mantissa; insert implied 1
1421 |  lsls \re,#21
1422 |  bcc l\@_1                     @ skip on positive
1423 |  mvns \rb,\rb                  @ negate mantissa
1424 |  rsbs \ra,#0
1425 |  bcc l\@_1
1426 |  adds \rb,#1
1427 | l\@_1:
1428 |  lsrs \re,#21
1429 |  beq l\@_2                     @ zero exponent?
1430 |  adds \rt0,\re,#1
1431 |  lsrs \rt0,#11
1432 |  beq l\@_3                     @ exponent != 0x7ff? then done
1433 |  subs \re,\rt1
1434 | l\@_2:
1435 |  movs \ra,#0
1436 |  lsls \rt1,#1                  @ +ve: 0  -ve: 2
1437 |  adds \rb,\rt1,#1              @ +ve: 1  -ve: 3
1438 |  lsls \rb,#30                  @ create +/-1 mantissa
1439 |  asrs \rb,#10
1440 |  subs \re,#128
1441 |  lsls \re,#12
1442 | l\@_3:
1443 | .endm
1444 | 
1445 | .align 2
1446 | .thumb_func
1447 | qfp_dsub:
1448 |  push {r4-r7,r14}
1449 |  movs r4,#1
1450 |  lsls r4,#31
1451 |  eors r3,r4                    @ flip sign on second argument
1452 |  b da_entry                    @ continue in dadd
1453 | 
1454 | .align 2
1455 | .thumb_func
1456 | qfp_dadd:
1457 |  push {r4-r7,r14}
1458 | da_entry:
1459 |  mdunpacks r0,r1,r4,r6,r7
1460 |  mdunpacks r2,r3,r5,r6,r7
1461 |  subs r7,r5,r4                 @ ye-xe
1462 |  subs r6,r4,r5                 @ xe-ye
1463 |  bmi da_ygtx
1464 | @ here xe>=ye: need to shift y down r6 places
1465 |  mov r12,r4                    @ save exponent
1466 |  cmp r6,#32
1467 |  bge da_xrgty                  @ xe rather greater than ye?
1468 |  adds r7,#32
1469 |  movs r4,r2
1470 |  lsls r4,r4,r7                 @ rounding bit + sticky bits
1471 | da_xgty0:
1472 |  movs r5,r3
1473 |  lsls r5,r5,r7
1474 |  lsrs r2,r6
1475 |  asrs r3,r6
1476 |  orrs r2,r5
1477 | da_add:
1478 |  adds r0,r2
1479 |  adcs r1,r3
1480 | da_pack:
1481 | @ here unnormalised signed result (possibly 0) is in r0:r1 with exponent r12, rounding + sticky bits in r4
1482 | @ Note that if a large normalisation shift is required then the arguments were close in magnitude and so we
1483 | @ cannot have not gone via the xrgty/yrgtx paths. There will therefore always be enough high bits in r4
1484 | @ to provide a correct continuation of the exact result.
1485 | @ now pack result back up
1486 |  lsrs r3,r1,#31                @ get sign bit
1487 |  beq 1f                        @ skip on positive
1488 |  mvns r1,r1                    @ negate mantissa
1489 |  mvns r0,r0
1490 |  movs r2,#0
1491 |  rsbs r4,#0
1492 |  adcs r0,r2
1493 |  adcs r1,r2
1494 | 1:
1495 |  mov r2,r12                    @ get exponent
1496 |  lsrs r5,r1,#21
1497 |  bne da_0                      @ shift down required?
1498 |  lsrs r5,r1,#20
1499 |  bne da_1                      @ normalised?
1500 |  cmp r0,#0
1501 |  beq da_5                      @ could mantissa be zero?
1502 | da_2:
1503 |  adds r4,r4
1504 |  adcs r0,r0
1505 |  adcs r1,r1
1506 |  subs r2,#1                    @ adjust exponent
1507 |  lsrs r5,r1,#20
1508 |  beq da_2
1509 | da_1:
1510 |  lsls r4,#1                    @ check rounding bit
1511 |  bcc da_3
1512 | da_4:
1513 |  adds r0,#1                    @ round up
1514 |  bcc 2f
1515 |  adds r1,#1
1516 | 2:
1517 |  cmp r4,#0                     @ sticky bits zero?
1518 |  bne da_3
1519 |  lsrs r0,#1                    @ round to even
1520 |  lsls r0,#1
1521 | da_3:
1522 |  subs r2,#1
1523 |  bmi da_6
1524 |  adds r4,r2,#2                 @ check if exponent is overflowing
1525 |  lsrs r4,#11
1526 |  bne da_7
1527 |  lsls r2,#20                   @ pack exponent and sign
1528 |  add r1,r2
1529 |  lsls r3,#31
1530 |  add r1,r3
1531 |  pop {r4-r7,r15}
1532 | 
1533 | da_7:
1534 | @ here exponent overflow: return signed infinity
1535 |  lsls r1,r3,#31
1536 |  ldr r3,=#0x7ff00000
1537 |  orrs r1,r3
1538 |  b 1f
1539 | da_6:
1540 | @ here exponent underflow: return signed zero
1541 |  lsls r1,r3,#31
1542 | 1:
1543 |  movs r0,#0
1544 |  pop {r4-r7,r15}
1545 | 
1546 | da_5:
1547 | @ here mantissa could be zero
1548 |  cmp r1,#0
1549 |  bne da_2
1550 |  cmp r4,#0
1551 |  bne da_2
1552 | @ inputs must have been of identical magnitude and opposite sign, so return +0
1553 |  pop {r4-r7,r15}
1554 | 
1555 | da_0:
1556 | @ here a shift down by one place is required for normalisation
1557 |  adds r2,#1                    @ adjust exponent
1558 |  lsls r6,r0,#31                @ save rounding bit
1559 |  lsrs r0,#1
1560 |  lsls r5,r1,#31
1561 |  orrs r0,r5
1562 |  lsrs r1,#1
1563 |  cmp r6,#0
1564 |  beq da_3
1565 |  b da_4
1566 | 
1567 | da_xrgty:                      @ xe>ye and shift>=32 places
1568 |  cmp r6,#60
1569 |  bge da_xmgty                  @ xe much greater than ye?
1570 |  subs r6,#32
1571 |  adds r7,#64
1572 | 
1573 |  movs r4,r2
1574 |  lsls r4,r4,r7                 @ these would be shifted off the bottom of the sticky bits
1575 |  beq 1f
1576 |  movs r4,#1
1577 | 1:
1578 |  lsrs r2,r2,r6
1579 |  orrs r4,r2
1580 |  movs r2,r3
1581 |  lsls r3,r3,r7
1582 |  orrs r4,r3
1583 |  asrs r3,r2,#31                @ propagate sign bit
1584 |  b da_xgty0
1585 | 
1586 | da_ygtx:
1587 | @ here ye>xe: need to shift x down r7 places
1588 |  mov r12,r5                    @ save exponent
1589 |  cmp r7,#32
1590 |  bge da_yrgtx                  @ ye rather greater than xe?
1591 |  adds r6,#32
1592 |  movs r4,r0
1593 |  lsls r4,r4,r6                 @ rounding bit + sticky bits
1594 | da_ygtx0:
1595 |  movs r5,r1
1596 |  lsls r5,r5,r6
1597 |  lsrs r0,r7
1598 |  asrs r1,r7
1599 |  orrs r0,r5
1600 |  b da_add
1601 | 
1602 | da_yrgtx:
1603 |  cmp r7,#60
1604 |  bge da_ymgtx                  @ ye much greater than xe?
1605 |  subs r7,#32
1606 |  adds r6,#64
1607 | 
1608 |  movs r4,r0
1609 |  lsls r4,r4,r6                 @ these would be shifted off the bottom of the sticky bits
1610 |  beq 1f
1611 |  movs r4,#1
1612 | 1:
1613 |  lsrs r0,r0,r7
1614 |  orrs r4,r0
1615 |  movs r0,r1
1616 |  lsls r1,r1,r6
1617 |  orrs r4,r1
1618 |  asrs r1,r0,#31                @ propagate sign bit
1619 |  b da_ygtx0
1620 | 
1621 | da_ymgtx:                      @ result is just y
1622 |  movs r0,r2
1623 |  movs r1,r3
1624 | da_xmgty:                      @ result is just x
1625 |  movs r4,#0                    @ clear sticky bits
1626 |  b da_pack
1627 | 
1628 | .ltorg
1629 | 
1630 | @ equivalent of UMULL
1631 | @ needs five temporary registers
1632 | @ can have rt3==rx, in which case rx trashed
1633 | @ can have rt4==ry, in which case ry trashed
1634 | @ can have rzl==rx
1635 | @ can have rzh==ry
1636 | @ can have rzl,rzh==rt3,rt4
1637 | .macro mul32_32_64 rx,ry,rzl,rzh,rt0,rt1,rt2,rt3,rt4
1638 |                                @   t0   t1   t2   t3   t4
1639 |                                @                  (x)  (y)
1640 |  uxth \rt0,\rx                 @   xl
1641 |  uxth \rt1,\ry                 @        yl
1642 |  muls \rt0,\rt1                @  xlyl=L
1643 |  lsrs \rt2,\rx,#16             @             xh
1644 |  muls \rt1,\rt2                @       xhyl=M0
1645 |  lsrs \rt4,\ry,#16             @                       yh
1646 |  muls \rt2,\rt4                @           xhyh=H
1647 |  uxth \rt3,\rx                 @                   xl
1648 |  muls \rt3,\rt4                @                  xlyh=M1
1649 |  adds \rt1,\rt3                @      M0+M1=M
1650 |  bcc l\@_1                     @ addition of the two cross terms can overflow, so add carry into H
1651 |  movs \rt3,#1                  @                   1
1652 |  lsls \rt3,#16                 @                0x10000
1653 |  adds \rt2,\rt3                @             H'
1654 | l\@_1:
1655 |                                @   t0   t1   t2   t3   t4
1656 |                                @                 (zl) (zh)
1657 |  lsls \rzl,\rt1,#16            @                  ML
1658 |  lsrs \rzh,\rt1,#16            @                       MH
1659 |  adds \rzl,\rt0                @                  ZL
1660 |  adcs \rzh,\rt2                @                       ZH
1661 | .endm
1662 | 
1663 | @ SUMULL: x signed, y unsigned
1664 | @ in table below ¯ means signed variable
1665 | @ needs five temporary registers
1666 | @ can have rt3==rx, in which case rx trashed
1667 | @ can have rt4==ry, in which case ry trashed
1668 | @ can have rzl==rx
1669 | @ can have rzh==ry
1670 | @ can have rzl,rzh==rt3,rt4
1671 | .macro muls32_32_64 rx,ry,rzl,rzh,rt0,rt1,rt2,rt3,rt4
1672 |                                @   t0   t1   t2   t3   t4
1673 |                                @                 ¯(x)  (y)
1674 |  uxth \rt0,\rx                 @   xl
1675 |  uxth \rt1,\ry                 @        yl
1676 |  muls \rt0,\rt1                @  xlyl=L
1677 |  asrs \rt2,\rx,#16             @            ¯xh
1678 |  muls \rt1,\rt2                @      ¯xhyl=M0
1679 |  lsrs \rt4,\ry,#16             @                       yh
1680 |  muls \rt2,\rt4                @          ¯xhyh=H
1681 |  uxth \rt3,\rx                 @                   xl
1682 |  muls \rt3,\rt4                @                 xlyh=M1
1683 |  asrs \rt4,\rt1,#31            @                      M0sx   (M1 sign extension is zero)
1684 |  adds \rt1,\rt3                @      M0+M1=M 
1685 |  movs \rt3,#0                  @                    0
1686 |  adcs \rt4,\rt3                @                      ¯Msx
1687 |  lsls \rt4,#16                 @                    ¯Msx<<16
1688 |  adds \rt2,\rt4                @             H'
1689 | 
1690 |                                @   t0   t1   t2   t3   t4
1691 |                                @                 (zl) (zh)
1692 |  lsls \rzl,\rt1,#16            @                  M~
1693 |  lsrs \rzh,\rt1,#16            @                       M~
1694 |  adds \rzl,\rt0                @                  ZL
1695 |  adcs \rzh,\rt2                @                      ¯ZH
1696 | .endm
1697 | 
1698 | @ SSMULL: x signed, y signed
1699 | @ in table below ¯ means signed variable
1700 | @ needs five temporary registers
1701 | @ can have rt3==rx, in which case rx trashed
1702 | @ can have rt4==ry, in which case ry trashed
1703 | @ can have rzl==rx
1704 | @ can have rzh==ry
1705 | @ can have rzl,rzh==rt3,rt4
1706 | .macro muls32_s32_64 rx,ry,rzl,rzh,rt0,rt1,rt2,rt3,rt4
1707 |                                @   t0   t1   t2   t3   t4
1708 |                                @                 ¯(x)  (y)
1709 |  uxth \rt0,\rx                 @   xl
1710 |  uxth \rt1,\ry                 @        yl
1711 |  muls \rt0,\rt1                @  xlyl=L
1712 |  asrs \rt2,\rx,#16             @            ¯xh
1713 |  muls \rt1,\rt2                @      ¯xhyl=M0
1714 |  asrs \rt4,\ry,#16             @                      ¯yh
1715 |  muls \rt2,\rt4                @          ¯xhyh=H
1716 |  uxth \rt3,\rx                 @                   xl
1717 |  muls \rt3,\rt4                @                ¯xlyh=M1
1718 |  adds \rt1,\rt3                @     ¯M0+M1=M
1719 |  asrs \rt3,\rt1,#31            @                  Msx
1720 |  bvc l\@_1                     @
1721 |  mvns \rt3,\rt3                @                 ¯Msx        flip sign extension bits if overflow
1722 | l\@_1:
1723 |  lsls \rt3,#16                 @                    ¯Msx<<16
1724 |  adds \rt2,\rt3                @             H'
1725 | 
1726 |                                @   t0   t1   t2   t3   t4
1727 |                                @                 (zl) (zh)
1728 |  lsls \rzl,\rt1,#16            @                  M~
1729 |  lsrs \rzh,\rt1,#16            @                       M~
1730 |  adds \rzl,\rt0                @                  ZL
1731 |  adcs \rzh,\rt2                @                      ¯ZH
1732 | .endm
1733 | 
1734 | @ can have rt2==rx, in which case rx trashed
1735 | @ can have rzl==rx
1736 | @ can have rzh==rt1
1737 | .macro square32_64 rx,rzl,rzh,rt0,rt1,rt2
1738 |                                @   t0   t1   t2   zl   zh
1739 |  uxth \rt0,\rx                 @   xl
1740 |  muls \rt0,\rt0                @ xlxl=L 
1741 |  uxth \rt1,\rx                 @        xl
1742 |  lsrs \rt2,\rx,#16             @             xh
1743 |  muls \rt1,\rt2                @      xlxh=M
1744 |  muls \rt2,\rt2                @           xhxh=H
1745 |  lsls \rzl,\rt1,#17            @                  ML
1746 |  lsrs \rzh,\rt1,#15            @                       MH
1747 |  adds \rzl,\rt0                @                  ZL
1748 |  adcs \rzh,\rt2                @                       ZH
1749 | .endm
1750 | 
1751 | .align 2
1752 | .thumb_func
1753 | qfp_dmul:
1754 |  push {r4-r7,r14}
1755 |  mdunpack r0,r1,r4,r6,r5
1756 |  mov r12,r4
1757 |  mdunpack r2,r3,r4,r7,r5
1758 |  eors r7,r6                    @ sign of result
1759 |  add r4,r12                    @ exponent of result
1760 |  push {r0-r2,r4,r7}
1761 | 
1762 | @ accumulate full product in r12:r5:r6:r7
1763 |  mul32_32_64 r0,r2, r0,r5, r4,r6,r7,r0,r5    @ XL*YL
1764 |  mov r12,r0                    @ save LL bits
1765 | 
1766 |  mul32_32_64 r1,r3, r6,r7, r0,r2,r4,r6,r7    @ XH*YH
1767 | 
1768 |  pop {r0}                      @ XL
1769 |  mul32_32_64 r0,r3, r0,r3, r1,r2,r4,r0,r3    @ XL*YH
1770 |  adds r5,r0
1771 |  adcs r6,r3
1772 |  movs r0,#0
1773 |  adcs r7,r0
1774 | 
1775 |  pop {r1,r2}                   @ XH,YL
1776 |  mul32_32_64 r1,r2, r1,r2, r0,r3,r4, r1,r2   @ XH*YL
1777 |  adds r5,r1
1778 |  adcs r6,r2
1779 |  movs r0,#0
1780 |  adcs r7,r0
1781 | 
1782 | @ here r5:r6:r7 holds the product [1..4) in Q(104-32)=Q72, with extra LSBs in r12
1783 |  pop {r3,r4}                   @ exponent in r3, sign in r4
1784 |  lsls r1,r7,#11
1785 |  lsrs r2,r6,#21
1786 |  orrs r1,r2
1787 |  lsls r0,r6,#11
1788 |  lsrs r2,r5,#21
1789 |  orrs r0,r2
1790 |  lsls r5,#11                   @ now r5:r0:r1 Q83=Q(51+32), extra LSBs in r12
1791 |  lsrs r2,r1,#20
1792 |  bne 1f                        @ skip if in range [2..4)
1793 |  adds r5,r5                    @ shift up so always [2..4) Q83, i.e. [1..2) Q84=Q(52+32)
1794 |  adcs r0,r0
1795 |  adcs r1,r1
1796 |  subs r3,#1                    @ correct exponent
1797 | 1:
1798 |  ldr r6,=#0x3ff
1799 |  subs r3,r6                    @ correct for exponent bias
1800 |  lsls r6,#1                    @ 0x7fe
1801 |  cmp r3,r6
1802 |  bhs dm_0                      @ exponent over- or underflow
1803 |  lsls r5,#1                    @ rounding bit to carry
1804 |  bcc 1f                        @ result is correctly rounded
1805 |  adds r0,#1
1806 |  movs r6,#0
1807 |  adcs r1,r6                    @ round up
1808 |  mov r6,r12                    @ remaining sticky bits
1809 |  orrs r5,r6
1810 |  bne 1f                        @ some sticky bits set?
1811 |  lsrs r0,#1
1812 |  lsls r0,#1                    @ round to even
1813 | 1:
1814 |  lsls r3,#20
1815 |  adds r1,r3
1816 | dm_2:
1817 |  lsls r4,#31
1818 |  add r1,r4
1819 |  pop {r4-r7,r15}
1820 | 
1821 | @ here for exponent over- or underflow
1822 | dm_0:
1823 |  bge dm_1                      @ overflow?
1824 |  adds r3,#1                    @ would-be zero exponent?
1825 |  bne 1f
1826 |  adds r0,#1
1827 |  bne 1f                        @ all-ones mantissa?
1828 |  adds r1,#1
1829 |  lsrs r7,r1,#21
1830 |  beq 1f
1831 |  lsrs r1,#1
1832 |  b dm_2
1833 | 1:
1834 |  lsls r1,r4,#31
1835 |  movs r0,#0
1836 |  pop {r4-r7,r15}
1837 | 
1838 | @ here for exponent overflow
1839 | dm_1:
1840 |  adds r6,#1                    @ 0x7ff
1841 |  lsls r1,r6,#20
1842 |  movs r0,#0
1843 |  b dm_2
1844 | 
1845 | .ltorg
1846 | 
1847 | @ Approach to division y/x is as follows.
1848 | @
1849 | @ First generate u1, an approximation to 1/x to about 29 bits. Multiply this by the top
1850 | @ 32 bits of y to generate a0, a first approximation to the result (good to 28 bits or so).
1851 | @ Calculate the exact remainder r0=y-a0*x, which will be about 0. Calculate a correction
1852 | @ d0=r0*u1, and then write a1=a0+d0. If near a rounding boundary, compute the exact
1853 | @ remainder r1=y-a1*x (which can be done using r0 as a basis) to determine whether to
1854 | @ round up or down.
1855 | @
1856 | @ The calculation of 1/x is as given in dreciptest.c. That code verifies exhaustively
1857 | @ that | u1*x-1 | < 10*2^-32.
1858 | @
1859 | @ More precisely:
1860 | @
1861 | @ x0=(q16)x;
1862 | @ x1=(q30)x;
1863 | @ y0=(q31)y;
1864 | @ u0=(q15~)"(0xffffffffU/(unsigned int)roundq(x/x_ulp))/powq(2,16)"(x0); // q15 approximation to 1/x; "~" denotes rounding rather than truncation
1865 | @ v=(q30)(u0*x1-1);
1866 | @ u1=(q30)u0-(q30~)(u0*v);
1867 | @
1868 | @ a0=(q30)(u1*y0);
1869 | @ r0=(q82)y-a0*x;
1870 | @ r0x=(q57)r0;
1871 | @ d0=r0x*u1;
1872 | @ a1=d0+a0;
1873 | @
1874 | @ Error analysis
1875 | @
1876 | @ Use Greek letters to represent the errors introduced by rounding and truncation.
1877 | @
1878 | @               r₀ = y - a₀x
1879 | @                  = y - [ u₁ ( y - α ) - β ] x    where 0 ≤ α < 2^-31, 0 ≤ β < 2^-30
1880 | @                  = y ( 1 - u₁x ) + ( u₁α + β ) x
1881 | @
1882 | @     Hence
1883 | @
1884 | @       | r₀ / x | < 2 * 10*2^-32 + 2^-31 + 2^-30
1885 | @                  = 26*2^-32
1886 | @
1887 | @               r₁ = y - a₁x
1888 | @                  = y - a₀x - d₀x
1889 | @                  = r₀ - d₀x
1890 | @                  = r₀ - u₁ ( r₀ - γ ) x    where 0 ≤ γ < 2^-57
1891 | @                  = r₀ ( 1 - u₁x ) + u₁γx
1892 | @
1893 | @     Hence
1894 | @
1895 | @       | r₁ / x | < 26*2^-32 * 10*2^-32 + 2^-57
1896 | @                  = (260+128)*2^-64
1897 | @                  < 2^-55
1898 | @
1899 | @ Empirically it seems to be nearly twice as good as this.
1900 | @
1901 | @ To determine correctly whether the exact remainder calculation can be skipped we need a result
1902 | @ accurate to < 0.25ulp. In the case where x>y the quotient will be shifted up one place for normalisation
1903 | @ and so 1ulp is 2^-53 and so the calculation above suffices.
1904 | 
1905 | .align 2
1906 | .thumb_func
1907 | qfp_ddiv:
1908 |  push {r4-r7,r14}
1909 | ddiv0:                         @ entry point from dtan
1910 |  mdunpack r2,r3,r4,r7,r6       @ unpack divisor
1911 | 
1912 | @ unpack dividend by hand to save on register use
1913 |  lsrs r6,r1,#31
1914 |  adds r6,r7
1915 |  mov r12,r6                    @ result sign in r12b0; r12b1 trashed
1916 |  lsls r1,#1
1917 |  lsrs r7,r1,#21                @ exponent
1918 |  beq 1f                        @ zero exponent?
1919 |  adds r6,r7,#1
1920 |  lsrs r6,#11
1921 |  beq 2f                        @ exponent != 0x7ff? then done
1922 | 1:
1923 |  movs r0,#0
1924 |  movs r1,#0
1925 |  subs r7,#64                   @ less drastic fiddling of exponents to get 0/0, Inf/Inf correct
1926 |  lsls r7,#12
1927 | 2:
1928 |  subs r6,r7,r4
1929 |  lsls r6,#2
1930 |  add r12,r12,r6                @ (signed) exponent in r12[31..8]
1931 |  subs r7,#1                    @ implied 1
1932 |  lsls r7,#21
1933 |  subs r1,r7
1934 |  lsrs r1,#1
1935 | 
1936 | // see dreciptest-boxc.c
1937 |  lsrs r4,r3,#15 @ x2=x>>15; // Q5 32..63
1938 |  ldr r5,=#(rcpapp-32)
1939 |  ldrb r4,[r5,r4] @ u=lut5[x2-32]; // Q8
1940 |  lsls r5,r3,#8
1941 |  muls r5,r5,r4
1942 |  asrs r5,#14 @ e=(i32)(u*(x<<8))>>14; // Q22
1943 |  asrs r6,r5,#11
1944 |  muls r6,r6,r6 @ e2=(e>>11)*(e>>11); // Q22
1945 |  subs r5,r6
1946 |  muls r5,r5,r4 @ c=(e-e2)*u; // Q30
1947 |  lsls r6,r4,#7
1948 |  asrs r5,#14
1949 |  adds r5,#1
1950 |  asrs r5,#1
1951 |  subs r6,r5 @ u0=(u<<7)-((c+0x4000)>>15); // Q15
1952 | 
1953 | @ here
1954 | @ r0:r1 y mantissa
1955 | @ r2:r3 x mantissa
1956 | @ r6    u0, first approximation to 1/x Q15
1957 | @ r12: result sign, exponent
1958 | 
1959 |  lsls r4,r3,#10
1960 |  lsrs r5,r2,#22
1961 |  orrs r5,r4                    @ x1=(q30)x
1962 |  muls r5,r6                    @ u0*x1 Q45
1963 |  asrs r5,#15                   @ v=u0*x1-1 Q30
1964 |  muls r5,r6                    @ u0*v Q45
1965 |  asrs r5,#14
1966 |  adds r5,#1
1967 |  asrs r5,#1                    @ round u0*v to Q30
1968 |  lsls r6,#15
1969 |  subs r6,r5                    @ u1 Q30
1970 | 
1971 | @ here
1972 | @ r0:r1 y mantissa
1973 | @ r2:r3 x mantissa
1974 | @ r6    u1, second approximation to 1/x Q30
1975 | @ r12: result sign, exponent
1976 | 
1977 |  push {r2,r3}
1978 |  lsls r4,r1,#11
1979 |  lsrs r5,r0,#21
1980 |  orrs r4,r5                    @ y0=(q31)y
1981 |  mul32_32_64 r4,r6, r4,r5, r2,r3,r7,r4,r5  @ y0*u1 Q61
1982 |  adds r4,r4
1983 |  adcs r5,r5                    @ a0=(q30)(y0*u1)
1984 | 
1985 | @ here
1986 | @ r0:r1 y mantissa
1987 | @ r5    a0, first approximation to y/x Q30
1988 | @ r6    u1, second approximation to 1/x Q30
1989 | @ r12   result sign, exponent
1990 | 
1991 |  ldr r2,[r13,#0]               @ xL
1992 |  mul32_32_64 r2,r5, r2,r3, r1,r4,r7,r2,r3  @ xL*a0
1993 |  ldr r4,[r13,#4]               @ xH
1994 |  muls r4,r5                    @ xH*a0
1995 |  adds r3,r4                    @ r2:r3 now x*a0 Q82
1996 |  lsrs r2,#25
1997 |  lsls r1,r3,#7
1998 |  orrs r2,r1                    @ r2 now x*a0 Q57; r7:r2 is x*a0 Q89
1999 |  lsls r4,r0,#5                 @ y Q57
2000 |  subs r0,r4,r2                 @ r0x=y-x*a0 Q57 (signed)
2001 | 
2002 | @ here
2003 | @ r0  r0x Q57
2004 | @ r5  a0, first approximation to y/x Q30
2005 | @ r4  yL  Q57
2006 | @ r6  u1 Q30
2007 | @ r12 result sign, exponent
2008 | 
2009 |  muls32_32_64 r0,r6, r7,r6, r1,r2,r3, r7,r6   @ r7:r6 r0x*u1 Q87
2010 |  asrs r3,r6,#25
2011 |  adds r5,r3
2012 |  lsls r3,r6,#7                 @ r3:r5 a1 Q62 (but bottom 7 bits are zero so 55 bits of precision after binary point)
2013 | @ here we could recover another 7 bits of precision (but not accuracy) from the top of r7
2014 | @ but these bits are thrown away in the rounding and conversion to Q52 below
2015 | 
2016 | @ here
2017 | @ r3:r5  a1 Q62 candidate quotient [0.5,2) or so
2018 | @ r4     yL Q57
2019 | @ r12    result sign, exponent
2020 | 
2021 |  movs r6,#0
2022 |  adds r3,#128                  @ for initial rounding to Q53
2023 |  adcs r5,r5,r6
2024 |  lsrs  r1,r5,#30
2025 |  bne dd_0
2026 | @ here candidate quotient a1 is in range [0.5,1)
2027 | @ so 30 significant bits in r5
2028 | 
2029 |  lsls r4,#1                    @ y now Q58
2030 |  lsrs r1,r5,#9                 @ to Q52
2031 |  lsls r0,r5,#23
2032 |  lsrs r3,#9                    @ 0.5ulp-significance bit in carry: if this is 1 we may need to correct result
2033 |  orrs r0,r3
2034 |  bcs dd_1
2035 |  b dd_2
2036 | dd_0:
2037 | @ here candidate quotient a1 is in range [1,2)
2038 | @ so 31 significant bits in r5
2039 | 
2040 |  movs r2,#4
2041 |  add r12,r12,r2                @ fix exponent; r3:r5 now effectively Q61
2042 |  adds r3,#128                  @ complete rounding to Q53
2043 |  adcs r5,r5,r6
2044 |  lsrs r1,r5,#10
2045 |  lsls r0,r5,#22
2046 |  lsrs r3,#10                   @ 0.5ulp-significance bit in carry: if this is 1 we may need to correct result
2047 |  orrs r0,r3
2048 |  bcc dd_2
2049 | dd_1:
2050 | 
2051 | @ here
2052 | @ r0:r1  rounded result Q53 [0.5,1) or Q52 [1,2), but may not be correctly rounded-to-nearest
2053 | @ r4     yL Q58 or Q57
2054 | @ r12    result sign, exponent
2055 | @ carry set
2056 | 
2057 |  adcs r0,r0,r0
2058 |  adcs r1,r1,r1                 @ z Q53 with 1 in LSB
2059 |  lsls r4,#16                   @ Q105-32=Q73
2060 |  ldr r2,[r13,#0]               @ xL Q52
2061 |  ldr r3,[r13,#4]               @ xH Q20
2062 | 
2063 |  movs r5,r1                    @ zH Q21
2064 |  muls r5,r2                    @ zH*xL Q73
2065 |  subs r4,r5
2066 |  muls r3,r0                    @ zL*xH Q73
2067 |  subs r4,r3
2068 |  mul32_32_64 r2,r0, r2,r3, r5,r6,r7,r2,r3  @ xL*zL
2069 |  rsbs r2,#0                    @ borrow from low half?
2070 |  sbcs r4,r3                    @ y-xz Q73 (remainder bits 52..73)
2071 | 
2072 |  cmp r4,#0
2073 | 
2074 |  bmi 1f
2075 |  movs r2,#0                    @ round up
2076 |  adds r0,#1
2077 |  adcs r1,r2
2078 | 1:
2079 |  lsrs r0,#1                    @ shift back down to Q52
2080 |  lsls r2,r1,#31
2081 |  orrs r0,r2
2082 |  lsrs r1,#1
2083 | dd_2:
2084 |  add r13,#8
2085 |  mov r2,r12
2086 |  lsls r7,r2,#31                @ result sign
2087 |  asrs r2,#2                    @ result exponent
2088 |  ldr r3,=#0x3fd
2089 |  adds r2,r3
2090 |  ldr r3,=#0x7fe
2091 |  cmp r2,r3
2092 |  bhs dd_3                      @ over- or underflow?
2093 |  lsls r2,#20
2094 |  adds r1,r2                    @ pack exponent
2095 | dd_5:
2096 |  adds r1,r7                    @ pack sign
2097 |  pop {r4-r7,r15}
2098 | 
2099 | dd_3:
2100 |  movs r0,#0
2101 |  cmp r2,#0
2102 |  bgt dd_4                      @ overflow?
2103 |  movs r1,r7
2104 |  pop {r4-r7,r15}
2105 | 
2106 | dd_4:
2107 |  adds r3,#1                    @ 0x7ff
2108 |  lsls r1,r3,#20
2109 |  b dd_5
2110 | 
2111 | /*
2112 | Approach to square root x=sqrt(y) is as follows.
2113 | 
2114 | First generate a3, an approximation to 1/sqrt(y) to about 30 bits. Multiply this by y
2115 | to give a4~sqrt(y) to about 28 bits and a remainder r4=y-a4^2. Then, because
2116 | d sqrt(y) / dy = 1 / (2 sqrt(y)) let d4=r4*a3/2 and then the value a5=a4+d4 is
2117 | a better approximation to sqrt(y). If this is near a rounding boundary we
2118 | compute an exact remainder y-a5*a5 to decide whether to round up or down.
2119 | 
2120 | The calculation of a3 and a4 is as given in dsqrttest.c. That code verifies exhaustively
2121 | that | 1 - a3a4 | < 10*2^-32, | r4 | < 40*2^-32 and | r4/y | < 20*2^-32.
2122 | 
2123 | More precisely, with "y" representing y truncated to 30 binary places:
2124 | 
2125 | u=(q3)y;                          // 24-entry table
2126 | a0=(q8~)"1/sqrtq(x+x_ulp/2)"(u);  // first approximation from table
2127 | p0=(q16)(a0*a0) * (q16)y;
2128 | r0=(q20)(p0-1);
2129 | dy0=(q15)(r0*a0);                 // Newton-Raphson correction term
2130 | a1=(q16)a0-dy0/2;                 // good to ~9 bits
2131 | 
2132 | p1=(q19)(a1*a1)*(q19)y;
2133 | r1=(q23)(p1-1);
2134 | dy1=(q15~)(r1*a1);                // second Newton-Raphson correction
2135 | a2x=(q16)a1-dy1/2;                // good to ~16 bits
2136 | a2=a2x-a2x/1t16;                  // prevent overflow of a2*a2 in 32 bits
2137 | 
2138 | p2=(a2*a2)*(q30)y;                // Q62
2139 | r2=(q36)(p2-1+1t-31);
2140 | dy2=(q30)(r2*a2);                 // Q52->Q30
2141 | a3=(q31)a2-dy2/2;                 // good to about 30 bits
2142 | a4=(q30)(a3*(q30)y+1t-31);        // good to about 28 bits
2143 | 
2144 | Error analysis
2145 | 
2146 |           r₄ = y - a₄²
2147 |           d₄ = 1/2 a₃r₄
2148 |           a₅ = a₄ + d₄
2149 |           r₅ = y - a₅²
2150 |              = y - ( a₄ + d₄ )²
2151 |              = y - a₄² - a₃a₄r₄ - 1/4 a₃²r₄²
2152 |              = r₄ - a₃a₄r₄ - 1/4 a₃²r₄²
2153 | 
2154 |       | r₅ | < | r₄ | | 1 - a₃a₄ | + 1/4 r₄²
2155 | 
2156 |           a₅ = √y √( 1 - r₅/y )
2157 |              = √y ( 1 - 1/2 r₅/y + ... )
2158 | 
2159 | So to first order (second order being very tiny)
2160 | 
2161 |      √y - a₅ = 1/2 r₅/y
2162 | 
2163 | and
2164 | 
2165 |  | √y - a₅ | < 1/2 ( | r₄/y | | 1 - a₃a₄ | + 1/4 r₄²/y )
2166 | 
2167 | From dsqrttest.c (conservatively):
2168 | 
2169 |              < 1/2 ( 20*2^-32 * 10*2^-32 + 1/4 * 40*2^-32*20*2^-32 )
2170 |              = 1/2 ( 200 + 200 ) * 2^-64
2171 |              < 2^-56
2172 | 
2173 | Empirically we see about 1ulp worst-case error including rounding at Q57.
2174 | 
2175 | To determine correctly whether the exact remainder calculation can be skipped we need a result
2176 | accurate to < 0.25ulp at Q52, or 2^-54.
2177 | */
2178 | 
2179 | dq_2:
2180 |  bge dq_3                      @ +Inf?
2181 |  movs r1,#0
2182 |  b dq_4
2183 | 
2184 | dq_0:
2185 |  lsrs r1,#31
2186 |  lsls r1,#31                   @ preserve sign bit
2187 |  lsrs r2,#21                   @ extract exponent
2188 |  beq dq_4                      @ -0? return it
2189 |  asrs r1,#11                   @ make -Inf
2190 |  b dq_4
2191 | 
2192 | dq_3:
2193 |  ldr r1,=#0x7ff
2194 |  lsls r1,#20                   @ return +Inf
2195 | dq_4:
2196 |  movs r0,#0
2197 | dq_1:
2198 |  bx r14
2199 | 
2200 | .align 2
2201 | .thumb_func
2202 | qfp_dsqrt:
2203 |  lsls r2,r1,#1
2204 |  bcs dq_0                      @ negative?
2205 |  lsrs r2,#21                   @ extract exponent
2206 |  subs r2,#1
2207 |  ldr r3,=#0x7fe
2208 |  cmp r2,r3
2209 |  bhs dq_2                      @ catches 0 and +Inf
2210 |  push {r4-r7,r14}
2211 |  lsls r4,r2,#20
2212 |  subs r1,r4                    @ insert implied 1
2213 |  lsrs r2,#1
2214 |  bcc 1f                        @ even exponent? skip
2215 |  adds r0,r0,r0                 @ odd exponent: shift up mantissa
2216 |  adcs r1,r1,r1
2217 | 1:
2218 |  lsrs r3,#2
2219 |  adds r2,r3
2220 |  lsls r2,#20
2221 |  mov r12,r2                    @ save result exponent
2222 | 
2223 | @ here
2224 | @ r0:r1  y mantissa Q52 [1,4)
2225 | @ r12    result exponent
2226 | 
2227 |  adr r4,drsqrtapp-8            @ first eight table entries are never accessed because of the mantissa's leading 1
2228 |  lsrs r2,r1,#17                @ y Q3
2229 |  ldrb r2,[r4,r2]               @ initial approximation to reciprocal square root a0 Q8
2230 |  lsrs r3,r1,#4                 @ first Newton-Raphson iteration
2231 |  muls r3,r2
2232 |  muls r3,r2                    @  i32 p0=a0*a0*(y>>14);          // Q32
2233 |  asrs r3,r3,#12                @  i32 r0=p0>>12;                 // Q20
2234 |  muls r3,r2
2235 |  asrs r3,#13                   @  i32 dy0=(r0*a0)>>13;           // Q15
2236 |  lsls r2,#8
2237 |  subs r2,r3                    @  i32 a1=(a0<<8)-dy0;         // Q16
2238 | 
2239 |  movs r3,r2
2240 |  muls r3,r3
2241 |  lsrs r3,#13
2242 |  lsrs r4,r1,#1
2243 |  muls r3,r4                    @  i32 p1=((a1*a1)>>11)*(y>>11);  // Q19*Q19=Q38
2244 |  asrs r3,#15                   @  i32 r1=p1>>15;                 // Q23
2245 |  muls r3,r2
2246 |  asrs r3,#23
2247 |  adds r3,#1
2248 |  asrs r3,#1                    @  i32 dy1=(r1*a1+(1<<23))>>24;   // Q23*Q16=Q39; Q15
2249 |  subs r2,r3                    @  i32 a2=a1-dy1;                 // Q16
2250 |  lsrs r3,r2,#16
2251 |  subs r2,r3                    @  if(a2>=0x10000) a2=0xffff; to prevent overflow of a2*a2
2252 | 
2253 | @ here
2254 | @ r0:r1 y mantissa
2255 | @ r2    a2 ~ 1/sqrt(y) Q16
2256 | @ r12   result exponent
2257 | 
2258 |  movs r3,r2
2259 |  muls r3,r3
2260 |  lsls r1,#10
2261 |  lsrs r4,r0,#22
2262 |  orrs r1,r4                    @ y Q30
2263 |  mul32_32_64 r1,r3, r4,r3, r5,r6,r7,r4,r3   @  i64 p2=(ui64)(a2*a2)*(ui64)y;  // Q62 r4:r3
2264 |  lsls r5,r3,#6
2265 |  lsrs r4,#26
2266 |  orrs r4,r5
2267 |  adds r4,#0x20                 @  i32 r2=(p2>>26)+0x20;          // Q36 r4
2268 |  uxth r5,r4
2269 |  muls r5,r2
2270 |  asrs r4,#16
2271 |  muls r4,r2
2272 |  lsrs r5,#16
2273 |  adds r4,r5
2274 |  asrs r4,#6                    @ i32 dy2=((i64)r2*(i64)a2)>>22; // Q36*Q16=Q52; Q30
2275 |  lsls r2,#15
2276 |  subs r2,r4
2277 | 
2278 | @ here
2279 | @ r0    y low bits
2280 | @ r1    y Q30
2281 | @ r2    a3 ~ 1/sqrt(y) Q31
2282 | @ r12   result exponent
2283 | 
2284 |  mul32_32_64 r2,r1, r3,r4, r5,r6,r7,r3,r4
2285 |  adds r3,r3,r3
2286 |  adcs r4,r4,r4
2287 |  adds r3,r3,r3
2288 |  movs r3,#0
2289 |  adcs r3,r4                    @ ui32 a4=((ui64)a3*(ui64)y+(1U<<31))>>31; // Q30
2290 | 
2291 | @ here
2292 | @ r0    y low bits
2293 | @ r1    y Q30
2294 | @ r2    a3 Q31 ~ 1/sqrt(y)
2295 | @ r3    a4 Q30 ~ sqrt(y)
2296 | @ r12   result exponent
2297 | 
2298 |  square32_64 r3, r4,r5, r6,r5,r7
2299 |  lsls r6,r0,#8
2300 |  lsrs r7,r1,#2
2301 |  subs r6,r4
2302 |  sbcs r7,r5                    @ r4=(q60)y-a4*a4
2303 | 
2304 | @ by exhaustive testing, r4 = fffffffc0e134fdc .. 00000003c2bf539c Q60
2305 | 
2306 |  lsls r5,r7,#29
2307 |  lsrs r6,#3
2308 |  adcs r6,r5                    @ r4 Q57 with rounding
2309 |  muls32_32_64 r6,r2, r6,r2, r4,r5,r7,r6,r2    @ d4=a3*r4/2 Q89
2310 | @ r4+d4 is correct to 1ULP at Q57, tested on ~9bn cases including all extreme values of r4 for each possible y Q30
2311 | 
2312 |  adds r2,#8
2313 |  asrs r2,#5                    @ d4 Q52, rounded to Q53 with spare bit in carry
2314 | 
2315 | @ here
2316 | @ r0    y low bits
2317 | @ r1    y Q30
2318 | @ r2    d4 Q52, rounded to Q53
2319 | @ C flag contains d4_b53
2320 | @ r3    a4 Q30
2321 | 
2322 |  bcs dq_5
2323 | 
2324 |  lsrs r5,r3,#10                @ a4 Q52
2325 |  lsls r4,r3,#22
2326 | 
2327 |  asrs r1,r2,#31
2328 |  adds r0,r2,r4
2329 |  adcs r1,r5                    @ a4+d4
2330 | 
2331 |  add r1,r12                    @ pack exponent
2332 |  pop {r4-r7,r15}
2333 | 
2334 | .ltorg
2335 |  
2336 | 
2337 | @ round(sqrt(2^22./[68:8:252]))
2338 | drsqrtapp:
2339 | .byte 0xf8,0xeb,0xdf,0xd6,0xcd,0xc5,0xbe,0xb8
2340 | .byte 0xb2,0xad,0xa8,0xa4,0xa0,0x9c,0x99,0x95
2341 | .byte 0x92,0x8f,0x8d,0x8a,0x88,0x85,0x83,0x81
2342 | 
2343 | dq_5:
2344 | @ here we are near a rounding boundary, C is set
2345 |  adcs r2,r2,r2                 @ d4 Q53+1ulp
2346 |  lsrs r5,r3,#9
2347 |  lsls r4,r3,#23                @ r4:r5 a4 Q53
2348 |  asrs r1,r2,#31
2349 |  adds r4,r2,r4
2350 |  adcs r5,r1                    @ r4:r5 a5=a4+d4 Q53+1ulp
2351 |  movs r3,r5
2352 |  muls r3,r4
2353 |  square32_64 r4,r1,r2,r6,r2,r7
2354 |  adds r2,r3
2355 |  adds r2,r3                    @ r1:r2 a5^2 Q106
2356 |  lsls r0,#22                   @ y Q84
2357 | 
2358 |  rsbs r1,#0
2359 |  sbcs r0,r2                    @ remainder y-a5^2
2360 |  bmi 1f                        @ y<a5^2: no need to increment a5
2361 |  movs r3,#0
2362 |  adds r4,#1
2363 |  adcs r5,r3                    @ bump a5 if over rounding boundary
2364 | 1:
2365 |  lsrs r0,r4,#1
2366 |  lsrs r1,r5,#1
2367 |  lsls r5,#31
2368 |  orrs r0,r5
2369 |  add r1,r12
2370 |  pop {r4-r7,r15}
2371 | 
2372 | @ compare r0:r1 against r2:r3, returning -1/0/1 for <, =, >
2373 | @ also set flags accordingly
2374 | .thumb_func
2375 | qfp_dcmp:
2376 |  push {r4,r6,r7,r14}
2377 |  ldr r7,=#0x7ff                @ flush NaNs and denormals
2378 |  lsls r4,r1,#1
2379 |  lsrs r4,#21
2380 |  beq 1f
2381 |  cmp r4,r7
2382 |  bne 2f
2383 | 1:
2384 |  movs r0,#0
2385 |  lsrs r1,#20
2386 |  lsls r1,#20
2387 | 2:
2388 |  lsls r4,r3,#1
2389 |  lsrs r4,#21
2390 |  beq 1f
2391 |  cmp r4,r7
2392 |  bne 2f
2393 | 1:
2394 |  movs r2,#0
2395 |  lsrs r3,#20
2396 |  lsls r3,#20
2397 | 2:
2398 | dcmp_fast_entry:
2399 |  movs r6,#1
2400 |  eors r3,r1
2401 |  bmi 4f                        @ opposite signs? then can proceed on basis of sign of x
2402 |  eors r3,r1                    @ restore r3
2403 |  bpl 1f
2404 |  rsbs r6,#0                    @ negative? flip comparison
2405 | 1:
2406 |  cmp r1,r3
2407 |  bne 1f
2408 |  cmp r0,r2
2409 |  bhi 2f
2410 |  blo 3f
2411 | 5:
2412 |  movs r6,#0                    @ equal? result is 0
2413 | 1:
2414 |  bgt 2f
2415 | 3:
2416 |  rsbs r6,#0
2417 | 2:
2418 |  subs r0,r6,#0                 @ copy and set flags
2419 |  pop {r4,r6,r7,r15}
2420 | 4:
2421 |  orrs r3,r1                    @ make -0==+0
2422 |  adds r3,r3
2423 |  orrs r3,r0
2424 |  orrs r3,r2
2425 |  beq 5b
2426 |  cmp r1,#0
2427 |  bge 2b
2428 |  b 3b
2429 |  
2430 | 
2431 | @ "scientific" functions start here
2432 | 
2433 | .thumb_func
2434 | push_r8_r11:
2435 |  mov r4,r8
2436 |  mov r5,r9
2437 |  mov r6,r10
2438 |  mov r7,r11
2439 |  push {r4-r7}
2440 |  bx r14
2441 | 
2442 | .thumb_func
2443 | pop_r8_r11:
2444 |  pop {r4-r7}
2445 |  mov r8,r4
2446 |  mov r9,r5
2447 |  mov r10,r6
2448 |  mov r11,r7
2449 |  bx r14
2450 | 
2451 | @ double-length CORDIC rotation step
2452 | 
2453 | @ r0:r1   ω
2454 | @ r6      32-i (complementary shift)
2455 | @ r7      i (shift)
2456 | @ r8:r9   x
2457 | @ r10:r11 y
2458 | @ r12     coefficient pointer
2459 | 
2460 | @ an option in rotation mode would be to compute the sequence of σ values
2461 | @ in one pass, rotate the initial vector by the residual ω and then run a
2462 | @ second pass to compute the final x and y. This would relieve pressure
2463 | @ on registers and hence possibly be faster. The same trick does not work
2464 | @ in vectoring mode (but perhaps one could work to single precision in
2465 | @ a first pass and then double precision in a second pass?).
2466 | 
2467 | .thumb_func
2468 | dcordic_vec_step:
2469 |  mov r2,r12
2470 |  ldmia r2!,{r3,r4}
2471 |  mov r12,r2
2472 |  mov r2,r11
2473 |  cmp r2,#0
2474 |  blt 1f
2475 |  b 2f
2476 | 
2477 | .thumb_func
2478 | dcordic_rot_step:
2479 |  mov r2,r12
2480 |  ldmia r2!,{r3,r4}
2481 |  mov r12,r2
2482 |  cmp r1,#0
2483 |  bge 1f
2484 | 2:
2485 | @ ω<0 / y>=0
2486 | @ ω+=dω
2487 | @ x+=y>>i, y-=x>>i
2488 |  adds r0,r3
2489 |  adcs r1,r4
2490 | 
2491 |  mov r3,r11
2492 |  asrs r3,r7
2493 |  mov r4,r11
2494 |  lsls r4,r6
2495 |  mov r2,r10
2496 |  lsrs r2,r7
2497 |  orrs r2,r4                    @ r2:r3 y>>i, rounding in carry
2498 |  mov r4,r8
2499 |  mov r5,r9                     @ r4:r5 x
2500 |  adcs r2,r4
2501 |  adcs r3,r5                    @ r2:r3 x+(y>>i)
2502 |  mov r8,r2
2503 |  mov r9,r3
2504 | 
2505 |  mov r3,r5
2506 |  lsls r3,r6
2507 |  asrs r5,r7
2508 |  lsrs r4,r7
2509 |  orrs r4,r3                    @ r4:r5 x>>i, rounding in carry
2510 |  mov r2,r10
2511 |  mov r3,r11
2512 |  sbcs r2,r4
2513 |  sbcs r3,r5                    @ r2:r3 y-(x>>i)
2514 |  mov r10,r2
2515 |  mov r11,r3
2516 |  bx r14
2517 | 
2518 | 
2519 | @ ω>0 / y<0
2520 | @ ω-=dω
2521 | @ x-=y>>i, y+=x>>i
2522 | 1:
2523 |  subs r0,r3
2524 |  sbcs r1,r4
2525 | 
2526 |  mov r3,r9
2527 |  asrs r3,r7
2528 |  mov r4,r9
2529 |  lsls r4,r6
2530 |  mov r2,r8
2531 |  lsrs r2,r7
2532 |  orrs r2,r4                    @ r2:r3 x>>i, rounding in carry
2533 |  mov r4,r10
2534 |  mov r5,r11                    @ r4:r5 y
2535 |  adcs r2,r4
2536 |  adcs r3,r5                    @ r2:r3 y+(x>>i)
2537 |  mov r10,r2
2538 |  mov r11,r3
2539 | 
2540 |  mov r3,r5
2541 |  lsls r3,r6
2542 |  asrs r5,r7
2543 |  lsrs r4,r7
2544 |  orrs r4,r3                    @ r4:r5 y>>i, rounding in carry
2545 |  mov r2,r8
2546 |  mov r3,r9
2547 |  sbcs r2,r4
2548 |  sbcs r3,r5                    @ r2:r3 x-(y>>i)
2549 |  mov r8,r2
2550 |  mov r9,r3
2551 |  bx r14
2552 | 
2553 | ret_dzero:
2554 |  movs r0,#0
2555 |  movs r1,#0
2556 |  bx r14
2557 | 
2558 | @ convert packed double in r0:r1 to signed/unsigned 32/64-bit integer/fixed-point value in r0:r1 [with r2 places after point], with rounding towards -Inf
2559 | @ fixed-point versions only work with reasonable values in r2 because of the way dunpacks works
2560 | 
2561 | .thumb_func
2562 | qfp_double2int:
2563 |  movs r2,#0                    @ and fall through
2564 | .thumb_func
2565 | qfp_double2fix:
2566 |  push {r14}
2567 |  adds r2,#32
2568 |  bl qfp_double2fix64
2569 |  movs r0,r1
2570 |  pop {r15}
2571 | 
2572 | .thumb_func
2573 | qfp_double2uint:
2574 |  movs r2,#0                    @ and fall through
2575 | .thumb_func
2576 | qfp_double2ufix:
2577 |  push {r14}
2578 |  adds r2,#32
2579 |  bl qfp_double2ufix64
2580 |  movs r0,r1
2581 |  pop {r15}
2582 | 
2583 | .thumb_func
2584 | qfp_float2int64:
2585 |  movs r1,#0                    @ and fall through
2586 | .thumb_func
2587 | qfp_float2fix64:
2588 |  push {r14}
2589 |  bl f2fix
2590 |  b d2f64_a
2591 | 
2592 | .thumb_func
2593 | qfp_float2uint64:
2594 |  movs r1,#0                    @ and fall through
2595 | .thumb_func
2596 | qfp_float2ufix64:
2597 |  asrs r3,r0,#23                @ negative? return 0
2598 |  bmi ret_dzero
2599 | @ and fall through
2600 | 
2601 | @ convert float in r0 to signed fixed point in r0:r1:r3, r1 places after point, rounding towards -Inf
2602 | @ result clamped so that r3 can only be 0 or -1
2603 | @ trashes r12
2604 | .thumb_func
2605 | f2fix:
2606 |  push {r4,r14}
2607 |  mov r12,r1
2608 |  asrs r3,r0,#31
2609 |  lsls r0,#1
2610 |  lsrs r2,r0,#24
2611 |  beq 1f                        @ zero?
2612 |  cmp r2,#0xff                  @ Inf?
2613 |  beq 2f
2614 |  subs r1,r2,#1
2615 |  subs r2,#0x7f                 @ remove exponent bias
2616 |  lsls r1,#24
2617 |  subs r0,r1                    @ insert implied 1
2618 |  eors r0,r3
2619 |  subs r0,r3                    @ top two's complement
2620 |  asrs r1,r0,#4                 @ convert to double format
2621 |  lsls r0,#28
2622 |  b d2fix_a
2623 | 1:
2624 |  movs r0,#0
2625 |  movs r1,r0
2626 |  movs r3,r0
2627 |  pop {r4,r15}
2628 | 2:
2629 |  mvns r0,r3                    @ return max/min value
2630 |  mvns r1,r3
2631 |  pop {r4,r15}
2632 | 
2633 | .thumb_func
2634 | qfp_double2int64:
2635 |  movs r2,#0                    @ and fall through
2636 | .thumb_func
2637 | qfp_double2fix64:
2638 |  push {r14}
2639 |  bl d2fix
2640 | d2f64_a:
2641 |  asrs r2,r1,#31
2642 |  cmp r2,r3
2643 |  bne 1f                        @ sign extension bits fail to match sign of result?
2644 |  pop {r15}
2645 | 1:
2646 |  mvns r0,r3
2647 |  movs r1,#1
2648 |  lsls r1,#31
2649 |  eors r1,r1,r0                 @ generate extreme fixed-point values
2650 |  pop {r15}
2651 | 
2652 | .thumb_func
2653 | qfp_double2uint64:
2654 |  movs r2,#0                    @ and fall through
2655 | .thumb_func
2656 | qfp_double2ufix64:
2657 |  asrs r3,r1,#20                @ negative? return 0
2658 |  bmi ret_dzero
2659 | @ and fall through
2660 | 
2661 | @ convert double in r0:r1 to signed fixed point in r0:r1:r3, r2 places after point, rounding towards -Inf
2662 | @ result clamped so that r3 can only be 0 or -1
2663 | @ trashes r12
2664 | .thumb_func
2665 | d2fix:
2666 |  push {r4,r14}
2667 |  mov r12,r2
2668 |  bl dunpacks
2669 |  asrs r4,r2,#16
2670 |  adds r4,#1
2671 |  bge 1f
2672 |  movs r1,#0                    @ -0 -> +0
2673 | 1:
2674 |  asrs r3,r1,#31
2675 | d2fix_a:
2676 | @ here
2677 | @ r0:r1 two's complement mantissa
2678 | @ r2    unbaised exponent
2679 | @ r3    mantissa sign extension bits
2680 |  add r2,r12                    @ exponent plus offset for required binary point position
2681 |  subs r2,#52                   @ required shift
2682 |  bmi 1f                        @ shift down?
2683 | @ here a shift up by r2 places
2684 |  cmp r2,#12                    @ will clamp?
2685 |  bge 2f
2686 |  movs r4,r0
2687 |  lsls r1,r2
2688 |  lsls r0,r2
2689 |  rsbs r2,#0
2690 |  adds r2,#32                   @ complementary shift
2691 |  lsrs r4,r2
2692 |  orrs r1,r4
2693 |  pop {r4,r15}
2694 | 2:
2695 |  mvns r0,r3
2696 |  mvns r1,r3                    @ overflow: clamp to extreme fixed-point values
2697 |  pop {r4,r15}
2698 | 1:
2699 | @ here a shift down by -r2 places
2700 |  adds r2,#32
2701 |  bmi 1f                        @ long shift?
2702 |  mov r4,r1
2703 |  lsls r4,r2
2704 |  rsbs r2,#0
2705 |  adds r2,#32                   @ complementary shift
2706 |  asrs r1,r2
2707 |  lsrs r0,r2
2708 |  orrs r0,r4
2709 |  pop {r4,r15}
2710 | 1:
2711 | @ here a long shift down
2712 |  movs r0,r1
2713 |  asrs r1,#31                   @ shift down 32 places
2714 |  adds r2,#32
2715 |  bmi 1f                        @ very long shift?
2716 |  rsbs r2,#0
2717 |  adds r2,#32
2718 |  asrs r0,r2
2719 |  pop {r4,r15}
2720 | 1:
2721 |  movs r0,r3                    @ result very near zero: use sign extension bits
2722 |  movs r1,r3
2723 |  pop {r4,r15}
2724 | 
2725 | @ float <-> double conversions
2726 | .thumb_func
2727 | qfp_float2double:
2728 |  lsrs r3,r0,#31                @ sign bit
2729 |  lsls r3,#31
2730 |  lsls r1,r0,#1
2731 |  lsrs r2,r1,#24                @ exponent
2732 |  beq 1f                        @ zero?
2733 |  cmp r2,#0xff                  @ Inf?
2734 |  beq 2f
2735 |  lsrs r1,#4                    @ exponent and top 20 bits of mantissa
2736 |  ldr r2,=#(0x3ff-0x7f)<<20     @ difference in exponent offsets
2737 |  adds r1,r2
2738 |  orrs r1,r3
2739 |  lsls r0,#29                   @ bottom 3 bits of mantissa
2740 |  bx r14
2741 | 1:
2742 |  movs r1,r3                    @ return signed zero
2743 | 3:
2744 |  movs r0,#0
2745 |  bx r14
2746 | 2:
2747 |  ldr r1,=#0x7ff00000           @ return signed infinity
2748 |  adds r1,r3
2749 |  b 3b
2750 | 
2751 | .thumb_func
2752 | qfp_double2float:
2753 |  lsls r2,r1,#1
2754 |  lsrs r2,#21                   @ exponent
2755 |  ldr r3,=#0x3ff-0x7f
2756 |  subs r2,r3                    @ fix exponent bias
2757 |  ble 1f                        @ underflow or zero
2758 |  cmp r2,#0xff
2759 |  bge 2f                        @ overflow or infinity
2760 |  lsls r2,#23                   @ position exponent of result
2761 |  lsrs r3,r1,#31
2762 |  lsls r3,#31
2763 |  orrs r2,r3                    @ insert sign
2764 |  lsls r3,r0,#3                 @ rounding bits
2765 |  lsrs r0,#29
2766 |  lsls r1,#12
2767 |  lsrs r1,#9
2768 |  orrs r0,r1                    @ assemble mantissa
2769 |  orrs r0,r2                    @ insert exponent and sign
2770 |  lsls r3,#1
2771 |  bcc 3f                        @ no rounding
2772 |  beq 4f                        @ all sticky bits 0?
2773 | 5:
2774 |  adds r0,#1
2775 | 3:
2776 |  bx r14
2777 | 4:
2778 |  lsrs r3,r0,#1                 @ odd? then round up
2779 |  bcs 5b
2780 |  bx r14
2781 | 1:
2782 |  beq 6f                        @ check case where value is just less than smallest normal
2783 | 7:
2784 |  lsrs r0,r1,#31
2785 |  lsls r0,#31
2786 |  bx r14
2787 | 6:
2788 |  lsls r2,r1,#12                @ 20 1:s at top of mantissa?
2789 |  asrs r2,#12
2790 |  adds r2,#1
2791 |  bne 7b
2792 |  lsrs r2,r0,#29                @ and 3 more 1:s?
2793 |  cmp r2,#7
2794 |  bne 7b
2795 |  movs r2,#1                    @ return smallest normal with correct sign
2796 |  b 8f
2797 | 2:
2798 |  movs r2,#0xff
2799 | 8:
2800 |  lsrs r0,r1,#31                @ return signed infinity
2801 |  lsls r0,#8
2802 |  adds r0,r2
2803 |  lsls r0,#23
2804 |  bx r14
2805 | 
2806 | @ convert signed/unsigned 32/64-bit integer/fixed-point value in r0:r1 [with r2 places after point] to packed double in r0:r1, with rounding
2807 | 
2808 | .thumb_func
2809 | qfp_uint2double:
2810 |  movs r1,#0                    @ and fall through
2811 | .thumb_func
2812 | qfp_ufix2double:
2813 |  movs r2,r1
2814 |  movs r1,#0
2815 |  b qfp_ufix642double
2816 | 
2817 | .thumb_func
2818 | qfp_int2double:
2819 |  movs r1,#0                    @ and fall through
2820 | .thumb_func
2821 | qfp_fix2double:
2822 |  movs r2,r1
2823 |  asrs r1,r0,#31                @ sign extend
2824 |  b qfp_fix642double
2825 | 
2826 | .thumb_func
2827 | qfp_uint642double:
2828 |  movs r2,#0                    @ and fall through
2829 | .thumb_func
2830 | qfp_ufix642double:
2831 |  movs r3,#0
2832 |  b uf2d
2833 | 
2834 | .thumb_func
2835 | qfp_int642double:
2836 |  movs r2,#0                    @ and fall through
2837 | .thumb_func
2838 | qfp_fix642double:
2839 |  asrs r3,r1,#31                @ sign bit across all bits
2840 |  eors r0,r3
2841 |  eors r1,r3
2842 |  subs r0,r3
2843 |  sbcs r1,r3
2844 | uf2d:
2845 |  push {r4,r5,r14}
2846 |  ldr r4,=#0x432
2847 |  subs r2,r4,r2                 @ form biased exponent
2848 | @ here
2849 | @ r0:r1 unnormalised mantissa
2850 | @ r2 -Q (will become exponent)
2851 | @ r3 sign across all bits
2852 |  cmp r1,#0
2853 |  bne 1f                        @ short normalising shift?
2854 |  movs r1,r0
2855 |  beq 2f                        @ zero? return it
2856 |  movs r0,#0
2857 |  subs r2,#32                   @ fix exponent
2858 | 1:
2859 |  asrs r4,r1,#21
2860 |  bne 3f                        @ will need shift down (and rounding?)
2861 |  bcs 4f                        @ normalised already?
2862 | 5:
2863 |  subs r2,#1
2864 |  adds r0,r0                    @ shift up
2865 |  adcs r1,r1
2866 |  lsrs r4,r1,#21
2867 |  bcc 5b
2868 | 4:
2869 |  ldr r4,=#0x7fe
2870 |  cmp r2,r4
2871 |  bhs 6f                        @ over/underflow? return signed zero/infinity
2872 | 7:
2873 |  lsls r2,#20                   @ pack and return
2874 |  adds r1,r2
2875 |  lsls r3,#31
2876 |  adds r1,r3
2877 | 2:
2878 |  pop {r4,r5,r15}
2879 | 6:                             @ return signed zero/infinity according to unclamped exponent in r2
2880 |  mvns r2,r2
2881 |  lsrs r2,#21
2882 |  movs r0,#0
2883 |  movs r1,#0
2884 |  b 7b
2885 | 
2886 | 3:
2887 | @ here we need to shift down to normalise and possibly round
2888 |  bmi 1f                        @ already normalised to Q63?
2889 | 2:
2890 |  subs r2,#1
2891 |  adds r0,r0                    @ shift up
2892 |  adcs r1,r1
2893 |  bpl 2b
2894 | 1:
2895 | @ here we have a 1 in b63 of r0:r1
2896 |  adds r2,#11                   @ correct exponent for subsequent shift down
2897 |  lsls r4,r0,#21                @ save bits for rounding
2898 |  lsrs r0,#11
2899 |  lsls r5,r1,#21
2900 |  orrs r0,r5
2901 |  lsrs r1,#11
2902 |  lsls r4,#1
2903 |  beq 1f                        @ sticky bits are zero?
2904 | 8:
2905 |  movs r4,#0
2906 |  adcs r0,r4
2907 |  adcs r1,r4
2908 |  b 4b
2909 | 1:
2910 |  bcc 4b                        @ sticky bits are zero but not on rounding boundary
2911 |  lsrs r4,r0,#1                 @ increment if odd (force round to even)
2912 |  b 8b
2913 | 
2914 | 
2915 | .ltorg
2916 | 
2917 | .thumb_func
2918 | dunpacks:
2919 |  mdunpacks r0,r1,r2,r3,r4
2920 |  ldr r3,=#0x3ff
2921 |  subs r2,r3                    @ exponent without offset
2922 |  bx r14
2923 | 
2924 | @ r0:r1  signed mantissa Q52
2925 | @ r2     unbiased exponent < 10 (i.e., |x|<2^10)
2926 | @ r4     pointer to:
2927 | @          - divisor reciprocal approximation r=1/d Q15
2928 | @          - divisor d Q62  0..20
2929 | @          - divisor d Q62 21..41
2930 | @          - divisor d Q62 42..62
2931 | @ returns:
2932 | @ r0:r1  reduced result y Q62, -0.6 d < y < 0.6 d (better in practice)
2933 | @ r2     quotient q (number of reductions)
2934 | @ if exponent >=10, returns r0:r1=0, r2=1024*mantissa sign
2935 | @ designed to work for 0.5<d<2, in particular d=ln2 (~0.7) and d=π/2 (~1.6)
2936 | .thumb_func
2937 | dreduce:
2938 |  adds r2,#2                    @ e+2
2939 |  bmi 1f                        @ |x|<0.25, too small to need adjustment
2940 |  cmp r2,#12
2941 |  bge 4f
2942 | 2:
2943 |  movs r5,#17
2944 |  subs r5,r2                    @ 15-e
2945 |  movs r3,r1                    @ Q20
2946 |  asrs r3,r5                    @ x Q5
2947 |  adds r2,#8                    @ e+10
2948 |  adds r5,#7                    @ 22-e = 32-(e+10)
2949 |  movs r6,r0
2950 |  lsrs r6,r5
2951 |  lsls r0,r2
2952 |  lsls r1,r2
2953 |  orrs r1,r6                    @ r0:r1 x Q62
2954 |  ldmia r4,{r4-r7}
2955 |  muls r3,r4                    @ rx Q20
2956 |  asrs r2,r3,#20
2957 |  movs r3,#0
2958 |  adcs r2,r3                    @ rx Q0 rounded = q; for e.g. r=1.5 |q|<1.5*2^10
2959 |  muls r5,r2                    @ qd in pieces: L Q62
2960 |  muls r6,r2                    @               M Q41
2961 |  muls r7,r2                    @               H Q20
2962 |  lsls r7,#10
2963 |  asrs r4,r6,#11
2964 |  lsls r6,#21
2965 |  adds r6,r5
2966 |  adcs r7,r4
2967 |  asrs r5,#31
2968 |  adds r7,r5                    @ r6:r7 qd Q62
2969 |  subs r0,r6
2970 |  sbcs r1,r7                    @ remainder Q62
2971 |  bx r14
2972 | 4:
2973 |  movs r2,#12                   @ overflow: clamp to +/-1024
2974 |  movs r0,#0
2975 |  asrs r1,#31
2976 |  lsls r1,#1
2977 |  adds r1,#1
2978 |  lsls r1,#20
2979 |  b 2b
2980 | 
2981 | 1:
2982 |  lsls r1,#8
2983 |  lsrs r3,r0,#24
2984 |  orrs r1,r3
2985 |  lsls r0,#8                    @ r0:r1 Q60, to be shifted down -r2 places
2986 |  rsbs r3,r2,#0
2987 |  adds r2,#32                   @ shift down in r3, complementary shift in r2
2988 |  bmi 1f                        @ long shift?
2989 | 2:
2990 |  movs r4,r1
2991 |  asrs r1,r3
2992 |  lsls r4,r2
2993 |  lsrs r0,r3
2994 |  orrs r0,r4
2995 |  movs r2,#0                    @ rounding
2996 |  adcs r0,r2
2997 |  adcs r1,r2
2998 |  bx r14
2999 | 
3000 | 1:
3001 |  movs r0,r1                    @ down 32 places
3002 |  asrs r1,#31
3003 |  subs r3,#32
3004 |  adds r2,#32
3005 |  bpl 2b
3006 |  movs r0,#0                    @ very long shift? return 0
3007 |  movs r1,#0
3008 |  movs r2,#0
3009 |  bx r14
3010 | 
3011 | .thumb_func
3012 | qfp_dtan:
3013 |  push {r4-r7,r14}
3014 |  bl push_r8_r11
3015 |  bl dsincos
3016 |  mov r12,r0                    @ save ε
3017 |  bl dcos_finish
3018 |  push {r0,r1}
3019 |  mov r0,r12
3020 |  bl dsin_finish
3021 |  pop {r2,r3}
3022 |  bl pop_r8_r11
3023 |  b ddiv0                       @ compute sin θ/cos θ
3024 | 
3025 | .thumb_func
3026 | qfp_dcos:
3027 |  push {r4-r7,r14}
3028 |  bl push_r8_r11
3029 |  bl dsincos
3030 |  bl dcos_finish
3031 |  b 1f
3032 | 
3033 | .thumb_func
3034 | qfp_dsin:
3035 |  push {r4-r7,r14}
3036 |  bl push_r8_r11
3037 |  bl dsincos
3038 |  bl dsin_finish
3039 | 1:
3040 |  bl pop_r8_r11
3041 |  pop {r4-r7,r15}
3042 | 
3043 | 
3044 | @ unpack double θ in r0:r1, range reduce and calculate ε, cos α and sin α such that
3045 | @ θ=α+ε and |ε|≤2^-32
3046 | @ on return:
3047 | @ r0:r1   ε (residual ω, where θ=α+ε) Q62, |ε|≤2^-32 (so fits in r0)
3048 | @ r8:r9   cos α Q62
3049 | @ r10:r11 sin α Q62
3050 | .thumb_func
3051 | dsincos:
3052 |  push {r14}
3053 |  bl dunpacks
3054 |  adr r4,dreddata0
3055 |  bl dreduce
3056 | 
3057 |  movs r4,#0
3058 |  ldr r5,=#0x9df04dbb           @ this value compensates for the non-unity scaling of the CORDIC rotations
3059 |  ldr r6,=#0x36f656c5
3060 |  lsls r2,#31
3061 |  bcc 1f
3062 | @ quadrant 2 or 3
3063 |  mvns r6,r6
3064 |  rsbs r5,r5,#0
3065 |  adcs r6,r4
3066 | 1:
3067 |  lsls r2,#1
3068 |  bcs 1f
3069 | @ even quadrant
3070 |  mov r10,r4
3071 |  mov r11,r4
3072 |  mov r8,r5
3073 |  mov r9,r6
3074 |  b 2f
3075 | 1:
3076 | @ odd quadrant
3077 |  mov r8,r4
3078 |  mov r9,r4
3079 |  mov r10,r5
3080 |  mov r11,r6
3081 | 2:
3082 |  adr r4,dtab_cc
3083 |  mov r12,r4
3084 |  movs r7,#1
3085 |  movs r6,#31
3086 | 1:
3087 |  bl dcordic_rot_step
3088 |  adds r7,#1
3089 |  subs r6,#1
3090 |  cmp r7,#33
3091 |  bne 1b
3092 |  pop {r15}
3093 | 
3094 | dcos_finish:
3095 | @ here
3096 | @ r0:r1   ε (residual ω, where θ=α+ε) Q62, |ε|≤2^-32 (so fits in r0)
3097 | @ r8:r9   cos α Q62
3098 | @ r10:r11 sin α Q62
3099 | @ and we wish to calculate cos θ=cos(α+ε)~cos α - ε sin α
3100 |  mov r1,r11
3101 | @ mov r2,r10
3102 | @ lsrs r2,#31
3103 | @ adds r1,r2                    @ rounding improves accuracy very slightly
3104 |  muls32_s32_64 r0,r1, r2,r3, r4,r5,r6,r2,r3
3105 | @ r2:r3   ε sin α Q(62+62-32)=Q92
3106 |  mov r0,r8
3107 |  mov r1,r9
3108 |  lsls r5,r3,#2
3109 |  asrs r3,r3,#30
3110 |  lsrs r2,r2,#30
3111 |  orrs r2,r5
3112 |  sbcs r0,r2                    @ include rounding
3113 |  sbcs r1,r3
3114 |  movs r2,#62
3115 |  b qfp_fix642double
3116 | 
3117 | dsin_finish:
3118 | @ here
3119 | @ r0:r1   ε (residual ω, where θ=α+ε) Q62, |ε|≤2^-32 (so fits in r0)
3120 | @ r8:r9   cos α Q62
3121 | @ r10:r11 sin α Q62
3122 | @ and we wish to calculate sin θ=sin(α+ε)~sin α + ε cos α
3123 |  mov r1,r9
3124 |  muls32_s32_64 r0,r1, r2,r3, r4,r5,r6,r2,r3
3125 | @ r2:r3   ε cos α Q(62+62-32)=Q92
3126 |  mov r0,r10
3127 |  mov r1,r11
3128 |  lsls r5,r3,#2
3129 |  asrs r3,r3,#30
3130 |  lsrs r2,r2,#30
3131 |  orrs r2,r5
3132 |  adcs r0,r2                    @ include rounding
3133 |  adcs r1,r3
3134 |  movs r2,#62
3135 |  b qfp_fix642double
3136 | 
3137 | .ltorg
3138 | .align 2
3139 | dreddata0:
3140 | .word 0x0000517d               @ 2/π Q15
3141 | .word 0x0014611A               @ π/2 Q62=6487ED5110B4611A split into 21-bit pieces
3142 | .word 0x000A8885
3143 | .word 0x001921FB
3144 | 
3145 | .thumb_func
3146 | qfp_datan2:
3147 | @ r0:r1 y
3148 | @ r2:r3 x
3149 |  push {r4-r7,r14}
3150 |  bl push_r8_r11
3151 |  ldr r5,=#0x7ff00000
3152 |  movs r4,r1
3153 |  ands r4,r5                    @ y==0?
3154 |  beq 1f
3155 |  cmp r4,r5                     @ or Inf/NaN?
3156 |  bne 2f
3157 | 1:
3158 |  lsrs r1,#20                   @ flush
3159 |  lsls r1,#20
3160 |  movs r0,#0
3161 | 2:
3162 |  movs r4,r3
3163 |  ands r4,r5                    @ x==0?
3164 |  beq 1f
3165 |  cmp r4,r5                     @ or Inf/NaN?
3166 |  bne 2f
3167 | 1:
3168 |  lsrs r3,#20                   @ flush
3169 |  lsls r3,#20
3170 |  movs r2,#0
3171 | 2:
3172 |  movs r6,#0                    @ quadrant offset
3173 |  lsls r5,#11                   @ constant 0x80000000
3174 |  cmp r3,#0
3175 |  bpl 1f                        @ skip if x positive
3176 |  movs r6,#2
3177 |  eors r3,r5
3178 |  eors r1,r5
3179 |  bmi 1f                        @ quadrant offset=+2 if y was positive
3180 |  rsbs r6,#0                    @ quadrant offset=-2 if y was negative
3181 | 1:
3182 | @ now in quadrant 0 or 3
3183 |  adds r7,r1,r5                 @ r7=-r1
3184 |  bpl 1f
3185 | @ y>=0: in quadrant 0
3186 |  cmp r1,r3
3187 |  ble 2f                        @ y<~x so 0≤θ<~π/4: skip
3188 |  adds r6,#1
3189 |  eors r1,r5                    @ negate x
3190 |  b 3f                          @ and exchange x and y = rotate by -π/2
3191 | 1:
3192 |  cmp r3,r7
3193 |  bge 2f                        @ -y<~x so -π/4<~θ≤0: skip
3194 |  subs r6,#1
3195 |  eors r3,r5                    @ negate y and ...
3196 | 3:
3197 |  movs r7,r0                    @ exchange x and y
3198 |  movs r0,r2
3199 |  movs r2,r7
3200 |  movs r7,r1
3201 |  movs r1,r3
3202 |  movs r3,r7
3203 | 2:
3204 | @ here -π/4<~θ<~π/4
3205 | @ r6 has quadrant offset
3206 |  push {r6}
3207 |  cmp r2,#0
3208 |  bne 1f
3209 |  cmp r3,#0
3210 |  beq 10f                       @ x==0 going into division?
3211 |  lsls r4,r3,#1
3212 |  asrs r4,#21
3213 |  adds r4,#1
3214 |  bne 1f                        @ x==Inf going into division?
3215 |  lsls r4,r1,#1
3216 |  asrs r4,#21
3217 |  adds r4,#1                    @ y also ±Inf?
3218 |  bne 10f
3219 |  subs r1,#1                    @ make them both just finite
3220 |  subs r3,#1
3221 |  b 1f
3222 | 
3223 | 10:
3224 |  movs r0,#0
3225 |  movs r1,#0
3226 |  b 12f
3227 |  
3228 | 1:
3229 |  bl qfp_ddiv
3230 |  movs r2,#62
3231 |  bl qfp_double2fix64
3232 | @ r0:r1 y/x
3233 |  mov r10,r0
3234 |  mov r11,r1
3235 |  movs r0,#0                    @ ω=0
3236 |  movs r1,#0
3237 |  mov r8,r0
3238 |  movs r2,#1
3239 |  lsls r2,#30
3240 |  mov r9,r2                     @ x=1
3241 | 
3242 |  adr r4,dtab_cc
3243 |  mov r12,r4
3244 |  movs r7,#1
3245 |  movs r6,#31
3246 | 1:
3247 |  bl dcordic_vec_step
3248 |  adds r7,#1
3249 |  subs r6,#1
3250 |  cmp r7,#33
3251 |  bne 1b
3252 | @ r0:r1   atan(y/x) Q62
3253 | @ r8:r9   x residual Q62
3254 | @ r10:r11 y residual Q62
3255 |  mov r2,r9
3256 |  mov r3,r10
3257 |  subs r2,#12                   @ this makes atan(0)==0
3258 | @ the following is basically a division residual y/x ~ atan(residual y/x)
3259 |  movs r4,#1
3260 |  lsls r4,#29
3261 |  movs r7,#0
3262 | 2:
3263 |  lsrs r2,#1
3264 |  movs r3,r3                    @ preserve carry
3265 |  bmi 1f
3266 |  sbcs r3,r2
3267 |  adds r0,r4
3268 |  adcs r1,r7
3269 |  lsrs r4,#1
3270 |  bne 2b
3271 |  b 3f
3272 | 1:
3273 |  adcs r3,r2
3274 |  subs r0,r4
3275 |  sbcs r1,r7
3276 |  lsrs r4,#1
3277 |  bne 2b
3278 | 3:
3279 |  lsls r6,r1,#31
3280 |  asrs r1,#1
3281 |  lsrs r0,#1
3282 |  orrs r0,r6                    @ Q61
3283 | 
3284 | 12:
3285 |  pop {r6}
3286 | 
3287 |  cmp r6,#0
3288 |  beq 1f
3289 |  ldr r4,=#0x885A308D           @ π/2 Q61
3290 |  ldr r5,=#0x3243F6A8
3291 |  bpl 2f
3292 |  mvns r4,r4                    @ negative quadrant offset
3293 |  mvns r5,r5
3294 | 2:
3295 |  lsls r6,#31
3296 |  bne 2f                        @ skip if quadrant offset is ±1
3297 |  adds r0,r4
3298 |  adcs r1,r5
3299 | 2:
3300 |  adds r0,r4
3301 |  adcs r1,r5
3302 | 1:
3303 |  movs r2,#61
3304 |  bl qfp_fix642double
3305 | 
3306 |  bl pop_r8_r11
3307 |  pop {r4-r7,r15}
3308 | 
3309 | .ltorg
3310 | 
3311 | dtab_cc:
3312 | .word 0x61bb4f69, 0x1dac6705   @ atan 2^-1 Q62
3313 | .word 0x96406eb1, 0x0fadbafc   @ atan 2^-2 Q62
3314 | .word 0xab0bdb72, 0x07f56ea6   @ atan 2^-3 Q62
3315 | .word 0xe59fbd39, 0x03feab76   @ atan 2^-4 Q62
3316 | .word 0xba97624b, 0x01ffd55b   @ atan 2^-5 Q62
3317 | .word 0xdddb94d6, 0x00fffaaa   @ atan 2^-6 Q62
3318 | .word 0x56eeea5d, 0x007fff55   @ atan 2^-7 Q62
3319 | .word 0xaab7776e, 0x003fffea   @ atan 2^-8 Q62
3320 | .word 0x5555bbbc, 0x001ffffd   @ atan 2^-9 Q62
3321 | .word 0xaaaaadde, 0x000fffff   @ atan 2^-10 Q62
3322 | .word 0xf555556f, 0x0007ffff   @ atan 2^-11 Q62
3323 | .word 0xfeaaaaab, 0x0003ffff   @ atan 2^-12 Q62
3324 | .word 0xffd55555, 0x0001ffff   @ atan 2^-13 Q62
3325 | .word 0xfffaaaab, 0x0000ffff   @ atan 2^-14 Q62
3326 | .word 0xffff5555, 0x00007fff   @ atan 2^-15 Q62
3327 | .word 0xffffeaab, 0x00003fff   @ atan 2^-16 Q62
3328 | .word 0xfffffd55, 0x00001fff   @ atan 2^-17 Q62
3329 | .word 0xffffffab, 0x00000fff   @ atan 2^-18 Q62
3330 | .word 0xfffffff5, 0x000007ff   @ atan 2^-19 Q62
3331 | .word 0xffffffff, 0x000003ff   @ atan 2^-20 Q62
3332 | .word 0x00000000, 0x00000200   @ atan 2^-21 Q62 @ consider optimising these
3333 | .word 0x00000000, 0x00000100   @ atan 2^-22 Q62
3334 | .word 0x00000000, 0x00000080   @ atan 2^-23 Q62
3335 | .word 0x00000000, 0x00000040   @ atan 2^-24 Q62
3336 | .word 0x00000000, 0x00000020   @ atan 2^-25 Q62
3337 | .word 0x00000000, 0x00000010   @ atan 2^-26 Q62
3338 | .word 0x00000000, 0x00000008   @ atan 2^-27 Q62
3339 | .word 0x00000000, 0x00000004   @ atan 2^-28 Q62
3340 | .word 0x00000000, 0x00000002   @ atan 2^-29 Q62
3341 | .word 0x00000000, 0x00000001   @ atan 2^-30 Q62
3342 | .word 0x80000000, 0x00000000   @ atan 2^-31 Q62
3343 | .word 0x40000000, 0x00000000   @ atan 2^-32 Q62
3344 | 
3345 | .thumb_func
3346 | qfp_dexp:
3347 |  push {r4-r7,r14}
3348 |  bl dunpacks
3349 |  adr r4,dreddata1
3350 |  bl dreduce
3351 |  cmp r1,#0
3352 |  bge 1f
3353 |  ldr r4,=#0xF473DE6B
3354 |  ldr r5,=#0x2C5C85FD           @ ln2 Q62
3355 |  adds r0,r4
3356 |  adcs r1,r5
3357 |  subs r2,#1
3358 | 1:
3359 |  push {r2}
3360 |  movs r7,#1                    @ shift
3361 |  adr r6,dtab_exp
3362 |  movs r2,#0
3363 |  movs r3,#1
3364 |  lsls r3,#30                   @ x=1 Q62
3365 | 
3366 | 3:
3367 |  ldmia r6!,{r4,r5}
3368 |  mov r12,r6
3369 |  subs r0,r4
3370 |  sbcs r1,r5
3371 |  bmi 1f
3372 | 
3373 |  rsbs r6,r7,#0
3374 |  adds r6,#32                   @ complementary shift
3375 |  movs r5,r3
3376 |  asrs r5,r7
3377 |  movs r4,r3
3378 |  lsls r4,r6
3379 |  movs r6,r2
3380 |  lsrs r6,r7                    @ rounding bit in carry
3381 |  orrs r4,r6
3382 |  adcs r2,r4
3383 |  adcs r3,r5                    @ x+=x>>i
3384 |  b 2f
3385 | 
3386 | 1:
3387 |  adds r0,r4                    @ restore argument
3388 |  adcs r1,r5
3389 | 2:
3390 |  mov r6,r12
3391 |  adds r7,#1
3392 |  cmp r7,#33
3393 |  bne 3b
3394 | 
3395 | @ here
3396 | @ r0:r1   ε (residual x, where x=a+ε) Q62, |ε|≤2^-32 (so fits in r0)
3397 | @ r2:r3   exp a Q62
3398 | @ and we wish to calculate exp x=exp a exp ε~(exp a)(1+ε)
3399 |  muls32_32_64 r0,r3, r4,r1, r5,r6,r7,r4,r1
3400 | @ r4:r1 ε exp a Q(62+62-32)=Q92
3401 |  lsrs r4,#30
3402 |  lsls r0,r1,#2
3403 |  orrs r0,r4
3404 |  asrs r1,#30
3405 |  adds r0,r2
3406 |  adcs r1,r3
3407 | 
3408 |  pop {r2}
3409 |  rsbs r2,#0
3410 |  adds r2,#62
3411 |  bl qfp_fix642double                 @ in principle we can pack faster than this because we know the exponent
3412 |  pop {r4-r7,r15}
3413 | 
3414 | .ltorg
3415 | 
3416 | .thumb_func
3417 | qfp_dln:
3418 |  push {r4-r7,r14}
3419 |  lsls r7,r1,#1
3420 |  bcs 5f                        @ <0 ...
3421 |  asrs r7,#21
3422 |  beq 5f                        @ ... or =0? return -Inf
3423 |  adds r7,#1
3424 |  beq 6f                        @ Inf/NaN? return +Inf
3425 |  bl dunpacks
3426 |  push {r2}
3427 |  lsls r1,#9
3428 |  lsrs r2,r0,#23
3429 |  orrs r1,r2
3430 |  lsls r0,#9
3431 | @ r0:r1 m Q61 = m/2 Q62 0.5≤m/2<1
3432 | 
3433 |  movs r7,#1                    @ shift
3434 |  adr r6,dtab_exp
3435 |  mov r12,r6
3436 |  movs r2,#0
3437 |  movs r3,#0                    @ y=0 Q62
3438 | 
3439 | 3:
3440 |  rsbs r6,r7,#0
3441 |  adds r6,#32                   @ complementary shift
3442 |  movs r5,r1
3443 |  asrs r5,r7
3444 |  movs r4,r1
3445 |  lsls r4,r6
3446 |  movs r6,r0
3447 |  lsrs r6,r7
3448 |  orrs r4,r6                    @ x>>i, rounding bit in carry
3449 |  adcs r4,r0
3450 |  adcs r5,r1                    @ x+(x>>i)
3451 | 
3452 |  lsrs r6,r5,#30
3453 |  bne 1f                        @ x+(x>>i)>1?
3454 |  movs r0,r4
3455 |  movs r1,r5                    @ x+=x>>i
3456 |  mov r6,r12
3457 |  ldmia r6!,{r4,r5}
3458 |  subs r2,r4
3459 |  sbcs r3,r5
3460 | 
3461 | 1:
3462 |  movs r4,#8
3463 |  add r12,r4
3464 |  adds r7,#1
3465 |  cmp r7,#33
3466 |  bne 3b
3467 | @ here:
3468 | @ r0:r1 residual x, nearly 1 Q62
3469 | @ r2:r3 y ~ ln m/2 = ln m - ln2 Q62
3470 | @ result is y + ln2 + ln x ~ y + ln2 + (x-1)
3471 |  lsls r1,#2
3472 |  asrs r1,#2                    @ x-1
3473 |  adds r2,r0
3474 |  adcs r3,r1
3475 | 
3476 |  pop {r7}
3477 | @ here:
3478 | @ r2:r3 ln m/2 = ln m - ln2 Q62
3479 | @ r7    unbiased exponent
3480 | 
3481 |  adr r4,dreddata1+4
3482 |  ldmia r4,{r0,r1,r4}
3483 |  adds r7,#1
3484 |  muls r0,r7                    @ Q62
3485 |  muls r1,r7                    @ Q41
3486 |  muls r4,r7                    @ Q20
3487 |  lsls r7,r1,#21
3488 |  asrs r1,#11
3489 |  asrs r5,r1,#31
3490 |  adds r0,r7
3491 |  adcs r1,r5
3492 |  lsls r7,r4,#10
3493 |  asrs r4,#22
3494 |  asrs r5,r1,#31
3495 |  adds r1,r7
3496 |  adcs r4,r5
3497 | @ r0:r1:r4 exponent*ln2 Q62
3498 |  asrs r5,r3,#31
3499 |  adds r0,r2
3500 |  adcs r1,r3
3501 |  adcs r4,r5
3502 | @ r0:r1:r4 result Q62
3503 |  movs r2,#62
3504 | 1:
3505 |  asrs r5,r1,#31
3506 |  cmp r4,r5
3507 |  beq 2f                        @ r4 a sign extension of r1?
3508 |  lsrs r0,#4                    @ no: shift down 4 places and try again
3509 |  lsls r6,r1,#28
3510 |  orrs r0,r6
3511 |  lsrs r1,#4
3512 |  lsls r6,r4,#28
3513 |  orrs r1,r6
3514 |  asrs r4,#4
3515 |  subs r2,#4
3516 |  b 1b
3517 | 2:
3518 |  bl qfp_fix642double
3519 |  pop {r4-r7,r15}
3520 | 
3521 | 5:
3522 |  ldr r1,=#0xfff00000
3523 |  movs r0,#0
3524 |  pop {r4-r7,r15}
3525 | 
3526 | 6:
3527 |  ldr r1,=#0x7ff00000
3528 |  movs r0,#0
3529 |  pop {r4-r7,r15}
3530 | 
3531 | 
3532 | .ltorg
3533 | 
3534 | .align 2
3535 | dreddata1:
3536 | .word 0x0000B8AA               @ 1/ln2 Q15
3537 | .word 0x0013DE6B               @ ln2 Q62 Q62=2C5C85FDF473DE6B split into 21-bit pieces
3538 | .word 0x000FEFA3
3539 | .word 0x000B1721
3540 | 
3541 | dtab_exp:
3542 | .word 0xbf984bf3, 0x19f323ec   @ log 1+2^-1 Q62
3543 | .word 0xcd4d10d6, 0x0e47fbe3   @ log 1+2^-2 Q62
3544 | .word 0x8abcb97a, 0x0789c1db   @ log 1+2^-3 Q62
3545 | .word 0x022c54cc, 0x03e14618   @ log 1+2^-4 Q62
3546 | .word 0xe7833005, 0x01f829b0   @ log 1+2^-5 Q62
3547 | .word 0x87e01f1e, 0x00fe0545   @ log 1+2^-6 Q62
3548 | .word 0xac419e24, 0x007f80a9   @ log 1+2^-7 Q62
3549 | .word 0x45621781, 0x003fe015   @ log 1+2^-8 Q62
3550 | .word 0xa9ab10e6, 0x001ff802   @ log 1+2^-9 Q62
3551 | .word 0x55455888, 0x000ffe00   @ log 1+2^-10 Q62
3552 | .word 0x0aa9aac4, 0x0007ff80   @ log 1+2^-11 Q62
3553 | .word 0x01554556, 0x0003ffe0   @ log 1+2^-12 Q62
3554 | .word 0x002aa9ab, 0x0001fff8   @ log 1+2^-13 Q62
3555 | .word 0x00055545, 0x0000fffe   @ log 1+2^-14 Q62
3556 | .word 0x8000aaaa, 0x00007fff   @ log 1+2^-15 Q62
3557 | .word 0xe0001555, 0x00003fff   @ log 1+2^-16 Q62
3558 | .word 0xf80002ab, 0x00001fff   @ log 1+2^-17 Q62
3559 | .word 0xfe000055, 0x00000fff   @ log 1+2^-18 Q62
3560 | .word 0xff80000b, 0x000007ff   @ log 1+2^-19 Q62
3561 | .word 0xffe00001, 0x000003ff   @ log 1+2^-20 Q62
3562 | .word 0xfff80000, 0x000001ff   @ log 1+2^-21 Q62
3563 | .word 0xfffe0000, 0x000000ff   @ log 1+2^-22 Q62
3564 | .word 0xffff8000, 0x0000007f   @ log 1+2^-23 Q62
3565 | .word 0xffffe000, 0x0000003f   @ log 1+2^-24 Q62
3566 | .word 0xfffff800, 0x0000001f   @ log 1+2^-25 Q62
3567 | .word 0xfffffe00, 0x0000000f   @ log 1+2^-26 Q62
3568 | .word 0xffffff80, 0x00000007   @ log 1+2^-27 Q62
3569 | .word 0xffffffe0, 0x00000003   @ log 1+2^-28 Q62
3570 | .word 0xfffffff8, 0x00000001   @ log 1+2^-29 Q62
3571 | .word 0xfffffffe, 0x00000000   @ log 1+2^-30 Q62
3572 | .word 0x80000000, 0x00000000   @ log 1+2^-31 Q62
3573 | .word 0x40000000, 0x00000000   @ log 1+2^-32 Q62
3574 | 
3575 | qfp_lib_end:
3576 | 


--------------------------------------------------------------------------------