├── .gitignore ├── LICENSE ├── README.md ├── code_of_conduct.md ├── examples ├── README.md ├── anaconda.md ├── arkouda.md ├── gromacs.md ├── hpl-cpu │ ├── ACfL.patch │ ├── GCC_BLIS.patch │ ├── HPL.dat │ ├── NVIDIA_HPC_SDK.patch │ ├── hpl-cpu.md │ └── hplgen.py ├── motorBike.png ├── openfoam.md ├── stream-cpu.md ├── tensorflow-cpu.md ├── tensorflow-gpu.md ├── velox.md └── wrf.md ├── isv.md ├── known_issues.md ├── languages ├── README.md ├── c-c++.md ├── dotnet.md ├── fortran.md ├── golang.md ├── java.md ├── python.md └── rust.md ├── optimization ├── README.md ├── atomics.md ├── cpudetect.md └── vectorization.md ├── slack.png ├── software ├── README.md ├── compilers.md ├── containers.md ├── cuda.md ├── mathlibs.md ├── ml.md ├── mpi.md └── os.md └── transition-guide.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Attribution-ShareAlike 4.0 International Public 2 | License 3 | 4 | By exercising the Licensed Rights (defined below), You accept and agree 5 | to be bound by the terms and conditions of this Creative Commons 6 | Attribution-ShareAlike 4.0 International Public License ("Public 7 | License"). To the extent this Public License may be interpreted as a 8 | contract, You are granted the Licensed Rights in consideration of Your 9 | acceptance of these terms and conditions, and the Licensor grants You 10 | such rights in consideration of benefits the Licensor receives from 11 | making the Licensed Material available under these terms and 12 | conditions. 13 | 14 | 15 | Section 1 -- Definitions. 16 | 17 | a. Adapted Material means material subject to Copyright and Similar 18 | Rights that is derived from or based upon the Licensed Material 19 | and in which the Licensed Material is translated, altered, 20 | arranged, transformed, or otherwise modified in a manner requiring 21 | permission under the Copyright and Similar Rights held by the 22 | Licensor. For purposes of this Public License, where the Licensed 23 | Material is a musical work, performance, or sound recording, 24 | Adapted Material is always produced where the Licensed Material is 25 | synched in timed relation with a moving image. 26 | 27 | b. Adapter's License means the license You apply to Your Copyright 28 | and Similar Rights in Your contributions to Adapted Material in 29 | accordance with the terms and conditions of this Public License. 30 | 31 | c. BY-SA Compatible License means a license listed at 32 | creativecommons.org/compatiblelicenses, approved by Creative 33 | Commons as essentially the equivalent of this Public License. 34 | 35 | d. Copyright and Similar Rights means copyright and/or similar rights 36 | closely related to copyright including, without limitation, 37 | performance, broadcast, sound recording, and Sui Generis Database 38 | Rights, without regard to how the rights are labeled or 39 | categorized. For purposes of this Public License, the rights 40 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 41 | Rights. 42 | 43 | e. Effective Technological Measures means those measures that, in the 44 | absence of proper authority, may not be circumvented under laws 45 | fulfilling obligations under Article 11 of the WIPO Copyright 46 | Treaty adopted on December 20, 1996, and/or similar international 47 | agreements. 48 | 49 | f. 
Exceptions and Limitations means fair use, fair dealing, and/or 50 | any other exception or limitation to Copyright and Similar Rights 51 | that applies to Your use of the Licensed Material. 52 | 53 | g. License Elements means the license attributes listed in the name 54 | of a Creative Commons Public License. The License Elements of this 55 | Public License are Attribution and ShareAlike. 56 | 57 | h. Licensed Material means the artistic or literary work, database, 58 | or other material to which the Licensor applied this Public 59 | License. 60 | 61 | i. Licensed Rights means the rights granted to You subject to the 62 | terms and conditions of this Public License, which are limited to 63 | all Copyright and Similar Rights that apply to Your use of the 64 | Licensed Material and that the Licensor has authority to license. 65 | 66 | j. Licensor means the individual(s) or entity(ies) granting rights 67 | under this Public License. 68 | 69 | k. Share means to provide material to the public by any means or 70 | process that requires permission under the Licensed Rights, such 71 | as reproduction, public display, public performance, distribution, 72 | dissemination, communication, or importation, and to make material 73 | available to the public including in ways that members of the 74 | public may access the material from a place and at a time 75 | individually chosen by them. 76 | 77 | l. Sui Generis Database Rights means rights other than copyright 78 | resulting from Directive 96/9/EC of the European Parliament and of 79 | the Council of 11 March 1996 on the legal protection of databases, 80 | as amended and/or succeeded, as well as other essentially 81 | equivalent rights anywhere in the world. 82 | 83 | m. You means the individual or entity exercising the Licensed Rights 84 | under this Public License. Your has a corresponding meaning. 85 | 86 | 87 | Section 2 -- Scope. 88 | 89 | a. License grant. 90 | 91 | 1. Subject to the terms and conditions of this Public License, 92 | the Licensor hereby grants You a worldwide, royalty-free, 93 | non-sublicensable, non-exclusive, irrevocable license to 94 | exercise the Licensed Rights in the Licensed Material to: 95 | 96 | a. reproduce and Share the Licensed Material, in whole or 97 | in part; and 98 | 99 | b. produce, reproduce, and Share Adapted Material. 100 | 101 | 2. Exceptions and Limitations. For the avoidance of doubt, where 102 | Exceptions and Limitations apply to Your use, this Public 103 | License does not apply, and You do not need to comply with 104 | its terms and conditions. 105 | 106 | 3. Term. The term of this Public License is specified in Section 107 | 6(a). 108 | 109 | 4. Media and formats; technical modifications allowed. The 110 | Licensor authorizes You to exercise the Licensed Rights in 111 | all media and formats whether now known or hereafter created, 112 | and to make technical modifications necessary to do so. The 113 | Licensor waives and/or agrees not to assert any right or 114 | authority to forbid You from making technical modifications 115 | necessary to exercise the Licensed Rights, including 116 | technical modifications necessary to circumvent Effective 117 | Technological Measures. For purposes of this Public License, 118 | simply making modifications authorized by this Section 2(a) 119 | (4) never produces Adapted Material. 120 | 121 | 5. Downstream recipients. 122 | 123 | a. Offer from the Licensor -- Licensed Material. 
Every 124 | recipient of the Licensed Material automatically 125 | receives an offer from the Licensor to exercise the 126 | Licensed Rights under the terms and conditions of this 127 | Public License. 128 | 129 | b. Additional offer from the Licensor -- Adapted Material. 130 | Every recipient of Adapted Material from You 131 | automatically receives an offer from the Licensor to 132 | exercise the Licensed Rights in the Adapted Material 133 | under the conditions of the Adapter's License You apply. 134 | 135 | c. No downstream restrictions. You may not offer or impose 136 | any additional or different terms or conditions on, or 137 | apply any Effective Technological Measures to, the 138 | Licensed Material if doing so restricts exercise of the 139 | Licensed Rights by any recipient of the Licensed 140 | Material. 141 | 142 | 6. No endorsement. Nothing in this Public License constitutes or 143 | may be construed as permission to assert or imply that You 144 | are, or that Your use of the Licensed Material is, connected 145 | with, or sponsored, endorsed, or granted official status by, 146 | the Licensor or others designated to receive attribution as 147 | provided in Section 3(a)(1)(A)(i). 148 | 149 | b. Other rights. 150 | 151 | 1. Moral rights, such as the right of integrity, are not 152 | licensed under this Public License, nor are publicity, 153 | privacy, and/or other similar personality rights; however, to 154 | the extent possible, the Licensor waives and/or agrees not to 155 | assert any such rights held by the Licensor to the limited 156 | extent necessary to allow You to exercise the Licensed 157 | Rights, but not otherwise. 158 | 159 | 2. Patent and trademark rights are not licensed under this 160 | Public License. 161 | 162 | 3. To the extent possible, the Licensor waives any right to 163 | collect royalties from You for the exercise of the Licensed 164 | Rights, whether directly or through a collecting society 165 | under any voluntary or waivable statutory or compulsory 166 | licensing scheme. In all other cases the Licensor expressly 167 | reserves any right to collect such royalties. 168 | 169 | 170 | Section 3 -- License Conditions. 171 | 172 | Your exercise of the Licensed Rights is expressly made subject to the 173 | following conditions. 174 | 175 | a. Attribution. 176 | 177 | 1. If You Share the Licensed Material (including in modified 178 | form), You must: 179 | 180 | a. retain the following if it is supplied by the Licensor 181 | with the Licensed Material: 182 | 183 | i. identification of the creator(s) of the Licensed 184 | Material and any others designated to receive 185 | attribution, in any reasonable manner requested by 186 | the Licensor (including by pseudonym if 187 | designated); 188 | 189 | ii. a copyright notice; 190 | 191 | iii. a notice that refers to this Public License; 192 | 193 | iv. a notice that refers to the disclaimer of 194 | warranties; 195 | 196 | v. a URI or hyperlink to the Licensed Material to the 197 | extent reasonably practicable; 198 | 199 | b. indicate if You modified the Licensed Material and 200 | retain an indication of any previous modifications; and 201 | 202 | c. indicate the Licensed Material is licensed under this 203 | Public License, and include the text of, or the URI or 204 | hyperlink to, this Public License. 205 | 206 | 2. You may satisfy the conditions in Section 3(a)(1) in any 207 | reasonable manner based on the medium, means, and context in 208 | which You Share the Licensed Material. 
For example, it may be 209 | reasonable to satisfy the conditions by providing a URI or 210 | hyperlink to a resource that includes the required 211 | information. 212 | 213 | 3. If requested by the Licensor, You must remove any of the 214 | information required by Section 3(a)(1)(A) to the extent 215 | reasonably practicable. 216 | 217 | b. ShareAlike. 218 | 219 | In addition to the conditions in Section 3(a), if You Share 220 | Adapted Material You produce, the following conditions also apply. 221 | 222 | 1. The Adapter's License You apply must be a Creative Commons 223 | license with the same License Elements, this version or 224 | later, or a BY-SA Compatible License. 225 | 226 | 2. You must include the text of, or the URI or hyperlink to, the 227 | Adapter's License You apply. You may satisfy this condition 228 | in any reasonable manner based on the medium, means, and 229 | context in which You Share Adapted Material. 230 | 231 | 3. You may not offer or impose any additional or different terms 232 | or conditions on, or apply any Effective Technological 233 | Measures to, Adapted Material that restrict exercise of the 234 | rights granted under the Adapter's License You apply. 235 | 236 | 237 | Section 4 -- Sui Generis Database Rights. 238 | 239 | Where the Licensed Rights include Sui Generis Database Rights that 240 | apply to Your use of the Licensed Material: 241 | 242 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 243 | to extract, reuse, reproduce, and Share all or a substantial 244 | portion of the contents of the database; 245 | 246 | b. if You include all or a substantial portion of the database 247 | contents in a database in which You have Sui Generis Database 248 | Rights, then the database in which You have Sui Generis Database 249 | Rights (but not its individual contents) is Adapted Material, 250 | 251 | including for purposes of Section 3(b); and 252 | c. You must comply with the conditions in Section 3(a) if You Share 253 | all or a substantial portion of the contents of the database. 254 | 255 | For the avoidance of doubt, this Section 4 supplements and does not 256 | replace Your obligations under this Public License where the Licensed 257 | Rights include other Copyright and Similar Rights. 258 | 259 | 260 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 261 | 262 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 263 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 264 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 265 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 266 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 267 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 268 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 269 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 270 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 271 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 272 | 273 | b. 
TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 274 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 275 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 276 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 277 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 278 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 279 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 280 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 281 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 282 | 283 | c. The disclaimer of warranties and limitation of liability provided 284 | above shall be interpreted in a manner that, to the extent 285 | possible, most closely approximates an absolute disclaimer and 286 | waiver of all liability. 287 | 288 | 289 | Section 6 -- Term and Termination. 290 | 291 | a. This Public License applies for the term of the Copyright and 292 | Similar Rights licensed here. However, if You fail to comply with 293 | this Public License, then Your rights under this Public License 294 | terminate automatically. 295 | 296 | b. Where Your right to use the Licensed Material has terminated under 297 | Section 6(a), it reinstates: 298 | 299 | 1. automatically as of the date the violation is cured, provided 300 | it is cured within 30 days of Your discovery of the 301 | violation; or 302 | 303 | 2. upon express reinstatement by the Licensor. 304 | 305 | For the avoidance of doubt, this Section 6(b) does not affect any 306 | right the Licensor may have to seek remedies for Your violations 307 | of this Public License. 308 | 309 | c. For the avoidance of doubt, the Licensor may also offer the 310 | Licensed Material under separate terms or conditions or stop 311 | distributing the Licensed Material at any time; however, doing so 312 | will not terminate this Public License. 313 | 314 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 315 | License. 316 | 317 | 318 | Section 7 -- Other Terms and Conditions. 319 | 320 | a. The Licensor shall not be bound by any additional or different 321 | terms or conditions communicated by You unless expressly agreed. 322 | 323 | b. Any arrangements, understandings, or agreements regarding the 324 | Licensed Material not stated herein are separate from and 325 | independent of the terms and conditions of this Public License. 326 | 327 | 328 | Section 8 -- Interpretation. 329 | 330 | a. For the avoidance of doubt, this Public License does not, and 331 | shall not be interpreted to, reduce, limit, restrict, or impose 332 | conditions on any use of the Licensed Material that could lawfully 333 | be made without permission under this Public License. 334 | 335 | b. To the extent possible, if any provision of this Public License is 336 | deemed unenforceable, it shall be automatically reformed to the 337 | minimum extent necessary to make it enforceable. If the provision 338 | cannot be reformed, it shall be severed from this Public License 339 | without affecting the enforceability of the remaining terms and 340 | conditions. 341 | 342 | c. No term or condition of this Public License will be waived and no 343 | failure to comply consented to unless expressly agreed to by the 344 | Licensor. 345 | 346 | d. 
Nothing in this Public License constitutes or may be interpreted 347 | as a limitation upon, or waiver of, any privileges and immunities 348 | that apply to the Licensor or You, including from the legal 349 | processes of any jurisdiction or authority. 350 | 351 | 352 | ======================================================================= 353 | 354 | Creative Commons is not a party to its public 355 | licenses. Notwithstanding, Creative Commons may elect to apply one of 356 | its public licenses to material it publishes and in those instances 357 | will be considered the “Licensor.” The text of the Creative Commons 358 | public licenses is dedicated to the public domain under the CC0 Public 359 | Domain Dedication. Except for the limited purpose of indicating that 360 | material is shared under a Creative Commons public license or as 361 | otherwise permitted by the Creative Commons policies published at 362 | creativecommons.org/policies, Creative Commons does not authorize the 363 | use of the trademark "Creative Commons" or any other trademark or logo 364 | of Creative Commons without its prior written consent including, 365 | without limitation, in connection with any unauthorized modifications 366 | to any of its public licenses or any other arrangements, 367 | understandings, or agreements concerning use of licensed material. For 368 | the avoidance of doubt, this paragraph does not form part of the 369 | public licenses. 370 | 371 | Creative Commons may be contacted at creativecommons.org. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Getting started with HPC on Arm64 2 | This guide includes how-to guides, sample code, recommendations, and technical best practices to help new users get started with Arm-based systems like the NVIDIA Arm HPC Developer Kit. While it is intended for the users and administrators of NVIDIA's Arm-based platforms, this guide is also generically useful for anyone running HPC applications on Arm CPUs, with or without GPUs. The focus is mostly on the CPU since Arm-hosted GPUs are just the same as GPUs hosted by any other CPUs. 3 | 4 | 5 | ## Contents 6 | * [Introduction to Arm64 and the NVIDIA Arm HPC Developer Kit](#introduction-to-arm64-and-the-nvidia-hpc-developer-kit) 7 | * [Transitioning Workloads to Arm64](transition-guide.md) 8 | * [Commercial Software (ISV)](isv.md) 9 | * [Examples](examples/README.md) 10 | * [Benchmarks and Health Tests](examples/README.md#benchmarks-and-health-tests) 11 | * [HPL on CPU](examples/hpl-cpu/hpl-cpu.md) 12 | * [STREAM on CPU](examples/stream-cpu.md) 13 | * [Modeling and Simulation](examples/README.md#modeling-and-simulation) 14 | * [OpenFOAM](examples/openfoam.md) 15 | * [WRF](examples/wrf.md) 16 | * [... see all mod-sim examples](examples/README.md#modeling-and-simulation) 17 | * [Machine Learning](examples/README.md#machine-learning) 18 | * [TensorFlow GPU-accelerated Training and Inference](examples/tensorflow-gpu.md) 19 | * [Tensorflow On-CPU Inference](examples/tensorflow-cpu.md) 20 | * [... see all ML examples](examples/README.md#machine-learning) 21 | * [Data Science](examples/README.md#data-science) 22 | * [Anaconda, Miniconda, Conda, Mamba](examples/anaconda.md) 23 | * [Arkouda](examples/arkouda.md) 24 | * [... see all DS examples](examples/README.md#data-science) 25 | * ... and more! 
[See the full list of examples here](examples/README.md) 26 | * [Supported Software](software/README.md) (See [the full list](software/README.md)) 27 | * [Machine Learning](software/ml.md) 28 | * [Compilers](software/compilers.md) and [CUDA](software/cuda.md) 29 | * [Message Passing (MPI)](software/mpi.md) 30 | * [Math Libraries](software/mathlibs.md) 31 | * [Containers](software/containers.md) 32 | * [Operating Systems](software/os.md) 33 | * ... and more! [See here for more details](software/README.md) 34 | * [Optimizing for Arm64](optimization/README.md) 35 | * [Arm SIMD Instructions: SVE and NEON](optimization/vectorization.md) 36 | * [Arm Atomic Instructions: LSE](optimization/atomics.md) 37 | * [Language-specific Considerations](languages/README.md) 38 | * [C/C++](languages/c-c++.md) 39 | * [Fortran](languages/fortran.md) 40 | * [Python](languages/python.md) 41 | * [Rust](languages/rust.md) 42 | * [Go](languages/golang.md) 43 | * [Java](languages/java.md) 44 | * [.NET](languages/dotnet.md) 45 | * [Additional Resources](#additional-resources) 46 | * [Acknowledgements](#acknowledgements) 47 | 48 | 49 | ## Join the NVIDIA Arm Community! 50 | 51 | [![Join us on Slack](slack.png)](https://join.slack.com/t/nvidia-arm-hpc/shared_invite/zt-1cbn4fksk-P6n5zBX5Mz4RnaTj9t3axg) 52 | 53 | The easiest way to find help and talk to the experts is to join the NVIDIA Arm HPC Slack workspace. 54 | 55 | 56 | ## Introduction to Arm64 and the NVIDIA HPC Developer Kit 57 | The NVIDIA Arm HPC Developer Kit (simply "DevKit" in this guide) is an integrated hardware and software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications on a heterogeneous GPU- and CPU-accelerated computing system. The kit includes an Arm CPU, dual NVIDIA A100 Tensor Core GPUs, dual NVIDIA BlueField-2 DPUs, and the NVIDIA HPC SDK suite of tools. [See the product page for more information.](https://developer.nvidia.com/arm-hpc-devkit) 58 | 59 | This validated platform provides quick and easy bring-up and a stable environment for accelerated code execution and evaluation, performance analysis, system experimentation, and system characterization. 60 | * Delivers a validated system for quick and easy bring-up in familiar HPC environments 61 | * Offers a stable hardware and software platform for development and performance analysis of accelerated HPC, AI, and scientific computing applications 62 | * Enables experimentation and characterization of high-performance, NVIDIA-accelerated, Arm server-based system architectures 63 | 64 | Hardware | Specification 65 | -------- | -------- 66 | Model | GIGABYTE G242-P32, 2U server 67 | CPU | 1x Ampere Altra Q80-30 (Arm processor) 68 | GPU | 2x NVIDIA A100 GPU 69 | Memory | 512G DDR4 memory 70 | Storage | 6TB SAS/ SATA 3.5″ 71 | Network | 2x NVIDIA BlueField-2 E-Series DPU: 200GbE/HDR single-port QSFP56 72 | 73 | The DevKit CPU uses the Arm architecture. The Arm architecture powers over *two hundred billion* chips across practically all computing domains, so the term "Arm" is somewhat overloaded. Various communities refer to the architecture as "Arm", "ARM", "Arm64", "AArch64", "arm64", etc. You may also find the term "SBSA" used to refer to server-class Arm CPUs. For simplicity, this guide will use the term **"Arm64"** to refer to any CPU built on the Armv8 or Armv9 standards and implementing [Arm's Server Base System Architecture (SBSA)](https://developer.arm.com/documentation/den0029/latest). 
This includes CPUs like: 74 | 75 | * [Ampere Altra](https://amperecomputing.com/processors/ampere-altra/) (NVIDIA Arm HPC Developer Kit) 76 | * [NVIDIA Grace](https://www.nvidia.com/en-us/data-center/grace-cpu/) 77 | * [AWS Graviton](https://aws.amazon.com/ec2/graviton/) 78 | * [Alibaba Yitian](https://fuse.wikichip.org/news/tag/yitian-710/) 79 | 80 | This guide will call out differences between Arm64 CPUs as needed. Note that this guide is not intended for mobile and embedded Arm CPUs e.g. NVIDIA Tegra. While many of the general principles and approaches presented here will hold true for mobile and embedded Arm platforms, this guide is focused on server-class platforms. 81 | 82 | 83 | ## Additional resources 84 | * [NVIDIA Arm HPC Developer Kit](https://developer.nvidia.com/arm-hpc-devkit) 85 | * [Neoverse N1 Software Optimization Guide](https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd) 86 | * [Armv8 reference manual](https://documentation-service.arm.com/static/60119835773bb020e3de6fee) 87 | * [Package repository search tool](https://pkgs.org/) 88 | 89 | 90 | # License 91 | 92 | [![CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](http://creativecommons.org/licenses/by-sa/4.0/) 93 | 94 | Unless otherwise indicated, this work is licensed under a 95 | [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). Individual examples or attached source code may be under a different license. Check the related README or LICENSE files. 96 | 97 | 98 | # Acknowledgements 99 | This guide was inspired by and borrows from the excellent [AWS Graviton Getting Started Guide](https://github.com/aws/aws-graviton-getting-started). The authors of this guide gratefully acknowledge the work of the AWS engineers and thank AWS for freely providing this valuable information in the public domain. 100 | 101 | 102 | **Feedback?** jlinford@nvidia.com 103 | 104 | -------------------------------------------------------------------------------- /code_of_conduct.md: -------------------------------------------------------------------------------- 1 | # Arm HPC Developer Kit Community Code of Conduct 2 | 3 | This Code of Conduct governs the Arm HPC Developer Kit Community's Community Slack, virtual and in-person events and any other online discussions. 4 | 5 | ## Introduction 6 | * Diversity and inclusion make our community strong. We encourage participation from the most varied and diverse backgrounds possible and want to be very clear about where we stand. 7 | * Our goal is to maintain a safe, helpful and friendly community for everyone, regardless of experience, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, nationality, or other defining characteristic. 8 | * This code and related procedures apply to unacceptable behavior occurring in all community venues, including behavior outside the scope of community activities — online and in-person— as well as in all one-on-one communications, and anywhere such behavior has the potential to adversely affect the safety and well-being of community members. 9 | 10 | ## Expected Behavior 11 | * Be welcoming. 12 | * Be kind. 13 | * Look out for each other. 14 | 15 | ## Unacceptable Behavior 16 | * Conduct or speech which might be considered sexist, racist, homophobic, transphobic, ableist or otherwise discriminatory or offensive in nature. 
17 | * Do not use unwelcome, suggestive, derogatory or inappropriate nicknames or terms. 18 | * Do not show disrespect towards others (jokes, innuendo, dismissive attitudes). 19 | * Intimidation or harassment (online or in-person). 20 | * Disrespect towards differences of opinion. 21 | * Inappropriate attention or contact. Be aware of how your actions affect others. If it makes someone uncomfortable, stop. 22 | * Not understanding the differences between constructive criticism and disparagement. 23 | * Sustained disruptions. 24 | * Violence, threats of violence or violent language. 25 | 26 | ## Enforcement 27 | * Understand that speech and actions have consequences, and unacceptable behavior will not be tolerated. 28 | * If violations occur, organizers will take any action they deem appropriate for the infraction, up to and including expulsion. 29 | 30 | If you are the subject of, or witness to, any violations of this Code of Conduct, please contact us via email. -------------------------------------------------------------------------------- /examples/README.md: -------------------------------------------------------------------------------- 1 | # Example Applications 2 | Following these step-by-step instructions is a great way to get started with your new Arm64 system. The codes presented here represent major HPC application areas and are a good starting point for running similar applications on Arm64. Per-code recommendations and best practices are also provided to give a sense of how applications from these areas generally perform on Arm64. 3 | 4 | ## Benchmarks and Health Tests 5 | 6 | These benchmarks were generated from a known-good NVIDIA Arm HPC DevKit and provide a _lower bound_ for expected out-of-the-box performance. They can be used to determine if your system is configured correctly and operating properly. It's possible you may exceed these numbers. **They are not intended for use in any competitive analysis.** 7 | 8 | * [HPL on CPU](hpl-cpu/hpl-cpu.md) 9 | * [STREAM on CPU](stream-cpu.md) 10 | 11 | ## Modeling and Simulation 12 | 13 | The high memory bandwidth of the Ampere Altra CPU makes it an excellent platform for CPU-only HPC applications. 14 | 15 | * [GROMACS](gromacs.md) 16 | * [OpenFOAM](openfoam.md) 17 | * [WRF](wrf.md) 18 | 19 | ## Machine Learning 20 | 21 | * [TensorFlow GPU-accelerated Training and Inference](tensorflow-gpu.md) 22 | * [TensorFlow On-CPU Inference](tensorflow-cpu.md) 23 | 24 | ## Data Science 25 | 26 | * [Anaconda, Miniconda, Conda, Mamba](anaconda.md) 27 | * [Arkouda](arkouda.md) 28 | 29 | ## ... and more! 30 | 31 | * [Velox](velox.md) 32 | -------------------------------------------------------------------------------- /examples/anaconda.md: -------------------------------------------------------------------------------- 1 | # Anaconda, Miniconda, Conda, and Mamba on Arm64 2 | Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. 3 | 4 | 5 | ## Anaconda 6 | Anaconda announced [support for Arm64 via AWS Graviton 2 on May 14, 2021](https://www.anaconda.com/blog/anaconda-aws-graviton2). The Ampere Altra CPU found in the NVIDIA Arm HPC DevKit is based on the same Arm Neoverse N1 core as the AWS Graviton2, so Anaconda also supports the Ampere Altra. *IMPORTANT*: if you encounter errors about missing libraries, see [the dependency information below](#a-quick-note-on-dependencies).
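Before downloading an installer, it can be worth confirming that the host really is an Arm64 (aarch64) machine so you pick up the right build; for example:
```bash
# Check the machine architecture; this should print "aarch64" on Arm64 systems
uname -m
```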
7 | 8 | ```bash 9 | # Download Anaconda installer 10 | wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-aarch64.sh 11 | # Run the installer 12 | bash Anaconda3-2022.05-Linux-aarch64.sh 13 | ``` 14 | 15 | Additional installation instructions can be found at https://docs.anaconda.com/anaconda/install/graviton2/. 16 | 17 | 18 | ## Miniconda Example 19 | Anaconda also offers a lightweight version called [Miniconda](https://docs.conda.io/en/latest/miniconda.html) which is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. 20 | 21 | Here is an example of how to use Miniconda to install [numpy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) for Python 3.9. The resulting installation has a much smaller footprint than the full Anaconda. 22 | 23 | The first step is to install conda: 24 | ```bash 25 | # Download Miniconda installer 26 | wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh 27 | # Run the installer 28 | bash Miniconda3-latest-Linux-aarch64.sh 29 | ``` 30 | 31 | Once installed, you can either use the `conda` command directly to install packages, or write an environment definition file and create the corresponding environment. 32 | 33 | Here's an example to install [numpy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) (`arm64-example.yml`): 34 | ```yaml 35 | name: arm64-example 36 | dependencies: 37 | - numpy 38 | - pandas 39 | ``` 40 | 41 | The next step is to instantiate the environment from that definition: 42 | ```bash 43 | conda env create -f arm64-example.yml 44 | ``` 45 | And you can now use numpy and pandas. 46 | 47 | 48 | ## Mamba 49 | 50 | Mamba is a fast, robust, and cross-platform package manager that is fully compatible with conda packages and supports most of conda’s commands. It fully supports Arm64. See https://mamba.readthedocs.io/en/latest/# for details. 51 | 52 | 53 | ## A quick note on dependencies 54 | 55 | This isn't really an Arm64 requirement since it applies to all platforms. The installers mentioned in this document don't pull in distro-provided dependencies. If you see an error like this: 56 | ``` 57 | /data/jlinford/anaconda3/bin/gtk-query-immodules-3.0: error while loading shared libraries: libXfixes.so.3: cannot open shared object file: No such file or directory 58 | ``` 59 | then search for the missing library and install the appropriate package. For example, on Ubuntu Server 20.04: 60 | ```bash 61 | sudo apt-get install libxi6 libgconf-2-4 libxfixes3 libxcursor1 62 | ``` 63 | It may take a few tries to get all the dependencies sorted, especially if you're working on a minimal installation of the OS. -------------------------------------------------------------------------------- /examples/arkouda.md: -------------------------------------------------------------------------------- 1 | # Arkouda Server and Client on Arm64 2 | 3 | ![Arkouda Logo](https://github.com/Bears-R-Us/arkouda/raw/master/pictures/arkouda_wide_marker1.png) 4 | 5 | Arkouda allows a user to interactively issue massively parallel computations on distributed data using functions and syntax that mimic NumPy, the underlying computational library used in the vast majority of Python data science workflows.
The computational heart of Arkouda is a Chapel interpreter that accepts a pre-defined set of commands from a client and uses Chapel's built-in machinery for multi-locale and multithreaded execution. Arkouda has benefited greatly from Chapel's distinctive features and has also helped guide the development of the language. For more details see https://github.com/Bears-R-Us/arkouda. 6 | 7 | ## Initial Configuration 8 | 9 | Clone the Arkouda repo. The following steps will take place inside the repo directory. 10 | ```bash 11 | git clone https://github.com/Bears-R-Us/arkouda.git 12 | cd arkouda 13 | ``` 14 | We recommend using GCC version 11 or later. For example, Spack is an easy way to install GCC 12.1.0: 15 | ```bash 16 | # Install Spack, if you haven't already 17 | git clone https://github.com/spack/spack.git 18 | 19 | # Install gcc+binutils, if you haven't already 20 | spack install gcc@12.1.0+binutils 21 | spack load gcc@12.1.0+binutils 22 | spack compiler find 23 | ``` 24 | 25 | ## Install Arkouda Client 26 | We will use [Miniconda](https://docs.conda.io/en/latest/miniconda.html) to provide a Python environment and manage Python dependencies. 27 | ```bash 28 | # Install Miniconda3 29 | wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh 30 | chmod +x Miniconda3-latest-Linux-aarch64.sh 31 | ./Miniconda3-latest-Linux-aarch64.sh -b 32 | 33 | # Add Miniconda3 to the environment 34 | source ~/.bashrc 35 | 36 | # Make sure you're in the Arkouda repo directory. 37 | cd arkouda 38 | 39 | # Developer conda env 40 | conda env create -f arkouda-env-dev.yml 41 | conda activate arkouda-dev 42 | 43 | # User conda env 44 | conda env create -f arkouda-env.yml 45 | conda activate arkouda 46 | 47 | # Install client and server dependencies 48 | conda install -y jupyter 49 | conda install -y "cmake>=3.11.0"   # quoted so the shell doesn't treat '>' as a redirect 50 | 51 | # Install the Arkouda Client Package 52 | pip install -e .
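# Optional sanity check (not part of the original instructions): with the
# conda environment still active, the Arkouda client should import cleanly.
python3 -c "import arkouda as ak; print('arkouda client imported OK')"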
53 | ``` 54 | 55 | 56 | ## Install Chapel 57 | ```bash 58 | # Download and unpack the Chapel source in the location where you wish to install Chapel 59 | curl -L https://github.com/chapel-lang/chapel/releases/download/1.27.0/chapel-1.27.0.tar.gz | tar xvzf - 60 | 61 | cd chapel-1.27.0 62 | 63 | # Set CHPL_HOME 64 | export CHPL_HOME=$PWD 65 | 66 | # Add chpl to PATH 67 | source $CHPL_HOME/util/setchplenv.bash 68 | 69 | # Set remaining env variables and execute make 70 | # Arkouda documentation recommends adding these variables to a ~/.chplconfig file to prevent having to export them again 71 | cat > ~/.chplconfig <: Unsupported system page size 175 | INFO:root:Running client "['python3', '/global/home/groups/amp/arkouda/tests/check.py', 'localhost', '5555']" 176 | _ _ _ 177 | / \ _ __| | _____ _ _ __| | __ _ 178 | / _ \ | '__| |/ / _ \| | | |/ _` |/ _` | 179 | / ___ \| | | < (_) | |_| | (_| | (_| | 180 | /_/ \_\_| |_|\_\___/ \__,_|\__,_|\__,_| 181 | 182 | 183 | Client Version: v2022.07.08+6.g9d84d1ea.dirty 184 | : Unsupported system page size 185 | >>> Sanity checks on the arkouda_server 186 | connected to arkouda server tcp://*:5555 187 | check boolean : Passed 188 | check arange : Passed 189 | check linspace : Passed 190 | check ones : Passed 191 | check zeros : Passed 192 | check argsort : Passed 193 | check coargsort : Passed 194 | check sort : Passed 195 | check get slice [::2] : Passed 196 | check set slice [::2] = value: Passed 197 | check set slice [::2] = pda: Passed 198 | check (compressing) get bool iv : Passed 199 | check (expanding) set bool iv = value: Passed 200 | check (expanding) set bool iv = pda: Passed 201 | check (gather) get integer iv: Passed 202 | check (scatter) set integer iv = value: Passed 203 | check (scatter) set integer iv = pda: Passed 204 | check get integer idx : Passed 205 | check set integer idx = value: Passed 206 | disconnected from arkouda server tcp://*:5555 207 | INFO:root:Running client "['python3', '/global/home/groups/amp/arkouda/util/test/shutdown.py', 'localhost', '5555']" 208 | : Unsupported system page size 209 | connected to arkouda server tcp://*:5555 210 | Success running checks 211 | ``` -------------------------------------------------------------------------------- /examples/gromacs.md: -------------------------------------------------------------------------------- 1 | # GROMACS on Arm64 CPU with NVIDIA GPU 2 | 3 | GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles and is a community-driven project. See https://www.gromacs.org/. 4 | 5 | One of GROMACS' attractive features is a high level of optimization for multiple devices, including both Arm64 CPUs and NVIDIA GPUs. This example will show how to run GROMACS on either the CPU, the GPU, or both! 6 | 7 | ## Benchmark Files 8 | 9 | We'll use the standard Gromacs benchmark data set, ADH. 10 | ```bash 11 | mkdir gromacs 12 | cd gromacs 13 | wget ftp://ftp.gromacs.org/benchmarks/ADH_bench_systems.tar.gz 14 | tar xvzf ADH_bench_systems.tar.gz 15 | ``` 16 | 17 | ## NVIDIA NGC Container 18 | 19 | The easiest way to run GROMACS is to use the optimized GROMACS containers available via NVIDIA NGC. See [the containers page](../software/containers.md) for detailed instructions on how to enable NGC on your NVIDIA DevKit. For more general information, see https://docs.nvidia.com/ngc/ for prerequisites and setup steps for all HPC containers and instructions for pulling NGC containers. 
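If you prefer to separate the one-time image download from the benchmark runs, you can pull the GROMACS image up front; the tag below is the same one used in all of the commands that follow:
```bash
# Pre-pull the NGC GROMACS image used throughout this example
docker pull nvcr.io/hpc/gromacs:2022.1
```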
20 | 21 | We'll be using the GROMACS container available at https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs. Start by preprocessing the ADH benchmark files: 22 | ```bash 23 | # Navigate to the ADH benchmarks directory 24 | cd gromacs/ADH 25 | # Run the GROMACS preprocessor 26 | docker run -ti --runtime nvidia \ 27 | -v /dev/infiniband:/dev/infiniband \ 28 | -v $(pwd)/adh_cubic:/benchmark \ 29 | --workdir /benchmark \ 30 | nvcr.io/hpc/gromacs:2022.1 sh -c "gmx grompp -f pme_verlet.mdp" 31 | ``` 32 | 33 | ### Running on GPU and CPU 34 | We'll want to map the MPI ranks (`ntmpi`) and OpenMP threads (`ntomp`) such that `ntmpi * ntomp` equals the total number of CPU cores, and `ntmpi` is a multiple of the number of GPUs available. The NVIDIA Arm HPC DevKit has 80 CPU cores and two A100 GPUs, and keeping `ntomp` at about 8 generally produces the best performance. Therefore we use: 35 | * `-ntmpi 10`: Ten MPI ranks, five GPU tasks per GPU 36 | * `-ntomp 8`: Eight OpenMP threads per MPI rank. 37 | 38 | To enable GPU acceleration, we need to indicate which parts of the computation will execute on the CPU, and which will execute on the GPU. GROMACS provides functionality to account for a wide range of different types of force calculations. For most simulations, the three most important classes (in terms of computational expense) are specified with these command line options: 39 | * `-nb`: Non-bonded short-range forces 40 | * `-bonded`: Bonded short-range forces 41 | * `-pme`: Particle Mesh Ewald (PME) long-range forces 42 | 43 | For example, to calculate non-bonded short-range forces on the GPU, use `-nb gpu`. Or, to calculate these forces on the CPU, use `-nb cpu`. Note that calculating on the CPU only will take _dramatically_ longer than calculating on the GPU. For more information about GROMACS command line options, see [the GROMACS manual](https://manual.gromacs.org/current/index.html). 44 | 45 | ```bash 46 | # Run GROMACS 47 | docker run -ti --runtime nvidia -v /dev/infiniband:/dev/infiniband -v $(pwd)/adh_cubic:/benchmark --workdir /benchmark nvcr.io/hpc/gromacs:2022.1 sh -c "gmx mdrun -v -nsteps 100000 -resetstep 90000 -noconfout -ntmpi 10 -ntomp 8 -nb gpu -bonded gpu -pme gpu -npme 1 -nstlist 400 -s topol.tpr" 48 | ``` 49 | 50 | Using two A100 GPUs, you should see a score of about 240 ns/day: 51 | ``` 52 | Core t (s) Wall t (s) (%) 53 | Time: 576.388 7.206 7999.1 54 | (ns/day) (hour/ns) 55 | Performance: 239.835 0.100 56 | ``` 57 | 58 | ### Additional optimizations for NVIDIA GPU 59 | 60 | This [technical blog post from NVIDIA](https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/) lists three environment variables that can be set to further improve performance in this benchmark. Note that there are some trade-offs in using these options. 61 | * `GMX_GPU_DD_COMMS=true`: enable halo exchange communications between PP tasks 62 | * `GMX_GPU_PME_PP_COMMS=true`: enable communications between PME and PP tasks 63 | * `GMX_FORCE_UPDATE_DEFAULT_GPU=true`: enable the update and constraints part of the timestep for multi-GPU 64 | 65 | The combination of these settings triggers all optimizations, including dependencies such as GPU-acceleration of buffer operations. When using these options, it's best to keep a low number of MPI ranks and increase the number of OpenMP threads. The suggested layout for these options with two GPUs and 80 CPU cores is four MPI ranks with 20 OpenMP threads each.
Use the `--env` flag to set the environment variables in the container: 66 | ```bash 67 | docker run -ti --runtime nvidia \ 68 | -v /dev/infiniband:/dev/infiniband \ 69 | -v $(pwd)/adh_cubic:/benchmark \ 70 | --workdir /benchmark \ 71 | --env GMX_GPU_DD_COMMS=true \ 72 | --env GMX_GPU_PME_PP_COMMS=true \ 73 | --env GMX_FORCE_UPDATE_DEFAULT_GPU=true \ 74 | nvcr.io/hpc/gromacs:2022.1 \ 75 | sh -c "gmx mdrun -v -noconfout \ 76 | -nsteps 100000 -resetstep 90000 \ 77 | -ntmpi 4 -ntomp 20 -npme 1 \ 78 | -nb gpu -bonded gpu -pme gpu \ 79 | -pin on \ 80 | -nstlist 400 \ 81 | -s topol.tpr" 82 | ``` 83 | 84 | With these additional environment variables, you should see about 320 ns/day: 85 | ``` 86 | Core t (s) Wall t (s) (%) 87 | Time: 434.716 5.437 7996.2 88 | (ns/day) (hour/ns) 89 | Performance: 317.881 0.076 90 | ``` 91 | 92 | 93 | ### Interactive shell in GROMACS container 94 | 95 | The following command will launch an interactive shell in the GROMACS container using `nvidia-docker`, mounting `$HOME/data` from the underlying system as `/data` in the container: 96 | 97 | ``` 98 | $ docker run -it --rm --runtime nvidia --privileged -v $HOME/data:/data nvcr.io/hpc/gromacs:2022.1 99 | ``` 100 | The command line options are: 101 | * `-it`: start container with an interactive terminal (short for `--interactive --tty`) 102 | * `--rm`: make container ephemeral (removes container on exit) 103 | * `-v host_path:/data`: bind mount `host_path` into the container as `/data` 104 | * `--runtime nvidia`: allow NVIDIA GPUs 105 | * `--privileged`: allow other devices like InfiniBand 106 | * See also: [How-to: Deploy RDMA accelerated Docker container over InfiniBand fabric](https://docs.nvidia.com/networking/pages/releaseview.action?pageId=15049785) 107 | 108 | This should produce a root prompt within the container. 109 | 110 | 111 | ## Installing from source 112 | 113 | Spack is the easiest way to build GROMACS from source. 114 | 115 | ```bash 116 | # Clone the Spack repo, if you haven't already 117 | git clone https://github.com/spack/spack.git 118 | # Recommended but optional: use the latest GCC from Spack. You may be able to use other compilers, but this is known to work well.
119 | spack install gcc+binutils+piclibs 120 | # Add the new GCC installation to your environment 121 | spack load gcc 122 | # Update Spack's compiler configuration 123 | spack compiler find 124 | 125 | # Install GROMACS with GPU acceleration 126 | spack install -j80 gromacs+cuda %gcc@12.1.0 127 | # If no GPUs available, install unaccelerated GROMACS 128 | spack install -j80 gromacs %gcc@12.1.0 129 | ``` 130 | -------------------------------------------------------------------------------- /examples/hpl-cpu/ACfL.patch: -------------------------------------------------------------------------------- 1 | --- Make.ACfL 2022-07-19 13:11:08.954180000 -0400 2 | +++ Make.ACfL.patched 2022-07-11 16:45:10.935651000 -0400 3 | @@ -61,13 +61,13 @@ 4 | # - Platform identifier ------------------------------------------------ 5 | # ---------------------------------------------------------------------- 6 | # 7 | -ARCH = UNKNOWN 8 | +ARCH = ACfL 9 | # 10 | # ---------------------------------------------------------------------- 11 | # - HPL Directory Structure / HPL library ------------------------------ 12 | # ---------------------------------------------------------------------- 13 | # 14 | -TOPdir = $(HOME)/hpl 15 | +TOPdir = $(HOME)/benchmarks/hpl-2.3 16 | INCdir = $(TOPdir)/include 17 | BINdir = $(TOPdir)/bin/$(ARCH) 18 | LIBdir = $(TOPdir)/lib/$(ARCH) 19 | @@ -94,7 +94,7 @@ 20 | # 21 | LAdir = 22 | LAinc = 23 | -LAlib = -lblas 24 | +LAlib = 25 | # 26 | # ---------------------------------------------------------------------- 27 | # - F77 / C interface -------------------------------------------------- 28 | @@ -156,7 +156,7 @@ 29 | # *) call the BLAS Fortran 77 interface, 30 | # *) not display detailed timing information. 31 | # 32 | -HPL_OPTS = 33 | +HPL_OPTS = -DHPL_CALL_CBLAS 34 | # 35 | # ---------------------------------------------------------------------- 36 | # 37 | @@ -167,11 +167,11 @@ 38 | # ---------------------------------------------------------------------- 39 | # 40 | CC = mpicc 41 | -CCNOOPT = $(HPL_DEFS) 42 | -CCFLAGS = $(HPL_DEFS) 43 | +CCNOOPT = $(HPL_DEFS) -O0 44 | +CCFLAGS = $(HPL_DEFS) -Ofast -mcpu=neoverse-n1 -armpl 45 | # 46 | -LINKER = mpif77 47 | -LINKFLAGS = 48 | +LINKER = mpicc 49 | +LINKFLAGS = -armpl 50 | # 51 | ARCHIVER = ar 52 | ARFLAGS = r 53 | -------------------------------------------------------------------------------- /examples/hpl-cpu/GCC_BLIS.patch: -------------------------------------------------------------------------------- 1 | --- Make.GCC_BLIS 2022-07-19 12:28:40.830337000 -0400 2 | +++ Make.GCC_BLIS.patched 2022-07-19 13:03:47.729051000 -0400 3 | @@ -61,13 +61,13 @@ 4 | # - Platform identifier ------------------------------------------------ 5 | # ---------------------------------------------------------------------- 6 | # 7 | -ARCH = UNKNOWN 8 | +ARCH = GCC_BLIS 9 | # 10 | # ---------------------------------------------------------------------- 11 | # - HPL Directory Structure / HPL library ------------------------------ 12 | # ---------------------------------------------------------------------- 13 | # 14 | -TOPdir = $(HOME)/hpl 15 | +TOPdir = $(HOME)/benchmarks/hpl-2.3 16 | INCdir = $(TOPdir)/include 17 | BINdir = $(TOPdir)/bin/$(ARCH) 18 | LIBdir = $(TOPdir)/lib/$(ARCH) 19 | @@ -92,9 +92,9 @@ 20 | # header files, LAlib is defined to be the name of the library to be 21 | # used. The variable LAdir is only used for defining LAinc and LAlib. 
22 | # 23 | -LAdir = 24 | -LAinc = 25 | -LAlib = -lblas 26 | +LAdir = $(HOME)/benchmarks/blis_gcc-11.2.0_thunderx2 27 | +LAinc = -I$(LAdir)/include/blis 28 | +LAlib = -L$(LAdir)/lib -lblis 29 | # 30 | # ---------------------------------------------------------------------- 31 | # - F77 / C interface -------------------------------------------------- 32 | @@ -156,7 +156,7 @@ 33 | # *) call the BLAS Fortran 77 interface, 34 | # *) not display detailed timing information. 35 | # 36 | -HPL_OPTS = 37 | +HPL_OPTS = -DHPL_CALL_CBLAS 38 | # 39 | # ---------------------------------------------------------------------- 40 | # 41 | @@ -167,10 +167,10 @@ 42 | # ---------------------------------------------------------------------- 43 | # 44 | CC = mpicc 45 | -CCNOOPT = $(HPL_DEFS) 46 | -CCFLAGS = $(HPL_DEFS) 47 | +CCNOOPT = $(HPL_DEFS) -O0 48 | +CCFLAGS = $(HPL_DEFS) -Ofast -mcpu=neoverse-n1 49 | # 50 | -LINKER = mpif77 51 | +LINKER = mpicc 52 | LINKFLAGS = 53 | # 54 | ARCHIVER = ar 55 | -------------------------------------------------------------------------------- /examples/hpl-cpu/HPL.dat: -------------------------------------------------------------------------------- 1 | HPLinpack benchmark input file 2 | Innovative Computing Laboratory, University of Tennessee 3 | HPL.out output file name (if any) 4 | 6 device out (6=stdout,7=stderr,file) 5 | 1 # of problems sizes (N) 6 | 166656 Ns 7 | 1 # of NBs 8 | 192 NBs 9 | 0 PMAP process mapping (0=Row-,1=Column-major) 10 | 1 # of process grids (P x Q) 11 | 8 Ps 12 | 10 Qs 13 | 16.0 threshold 14 | 1 # of panel fact 15 | 2 PFACTs (0=left, 1=Crout, 2=Right) 16 | 1 # of recursive stopping criterium 17 | 4 NBMINs (>= 1) 18 | 1 # of panels in recursion 19 | 2 NDIVs 20 | 1 # of recursive panel fact. 21 | 1 RFACTs (0=left, 1=Crout, 2=Right) 22 | 1 # of broadcast 23 | 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 24 | 1 # of lookahead depth 25 | 1 DEPTHs (>=0) 26 | 2 SWAP (0=bin-exch,1=long,2=mix) 27 | 64 swapping threshold 28 | 0 L1 in (0=transposed,1=no-transposed) form 29 | 0 U in (0=transposed,1=no-transposed) form 30 | 1 Equilibration (0=no,1=yes) 31 | 8 memory alignment in double (> 0) 32 | ##### This line (no. 32) is ignored (it serves as a separator). ###### 33 | 0 Number of additional problem sizes for PTRANS 34 | 1200 10000 30000 values of N 35 | 0 number of additional blocking sizes for PTRANS 36 | 40 9 8 13 13 20 16 32 64 values of NB -------------------------------------------------------------------------------- /examples/hpl-cpu/NVIDIA_HPC_SDK.patch: -------------------------------------------------------------------------------- 1 | --- Make.NVIDIA_HPC_SDK 2022-07-19 13:13:56.121925000 -0400 2 | +++ Make.NVIDIA_HPC_SDK.patched 2022-07-19 13:14:16.536185000 -0400 3 | @@ -61,13 +61,13 @@ 4 | # - Platform identifier ------------------------------------------------ 5 | # ---------------------------------------------------------------------- 6 | # 7 | -ARCH = UNKNOWN 8 | +ARCH = NVIDIA_HPC_SDK 9 | # 10 | # ---------------------------------------------------------------------- 11 | # - HPL Directory Structure / HPL library ------------------------------ 12 | # ---------------------------------------------------------------------- 13 | # 14 | -TOPdir = $(HOME)/hpl 15 | +TOPdir = $(HOME)/benchmarks/hpl-2.3 16 | INCdir = $(TOPdir)/include 17 | BINdir = $(TOPdir)/bin/$(ARCH) 18 | LIBdir = $(TOPdir)/lib/$(ARCH) 19 | @@ -92,9 +92,9 @@ 20 | # header files, LAlib is defined to be the name of the library to be 21 | # used. 
The variable LAdir is only used for defining LAinc and LAlib. 22 | # 23 | -LAdir = 24 | -LAinc = 25 | -LAlib = -lblas 26 | +LAdir = $(NVHPC_ROOT)/compilers 27 | +LAinc = -I$(LAdir)/include 28 | +LAlib = -L$(LAdir)/lib -lblas 29 | # 30 | # ---------------------------------------------------------------------- 31 | # - F77 / C interface -------------------------------------------------- 32 | @@ -156,7 +156,7 @@ 33 | # *) call the BLAS Fortran 77 interface, 34 | # *) not display detailed timing information. 35 | # 36 | -HPL_OPTS = 37 | +HPL_OPTS = -DHPL_CALL_CBLAS 38 | # 39 | # ---------------------------------------------------------------------- 40 | # 41 | @@ -167,10 +167,10 @@ 42 | # ---------------------------------------------------------------------- 43 | # 44 | CC = mpicc 45 | -CCNOOPT = $(HPL_DEFS) 46 | -CCFLAGS = $(HPL_DEFS) 47 | +CCNOOPT = $(HPL_DEFS) -O0 -Kieee 48 | +CCFLAGS = $(HPL_DEFS) -O3 -fast -Minline=saxpy,sscal -Minfo 49 | # 50 | -LINKER = mpif77 51 | +LINKER = mpicc 52 | LINKFLAGS = 53 | # 54 | ARCHIVER = ar 55 | -------------------------------------------------------------------------------- /examples/hpl-cpu/hplgen.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # 3 | # John Linford 4 | # 5 | 6 | import os 7 | import math 8 | import multiprocessing 9 | 10 | 11 | # Fraction of node memory to use 12 | _MEMORY_FRACTION = 0.9 13 | 14 | # Matrix element size in bytes 15 | _SIZEOF_ELEMENT = 8 16 | 17 | _TEMPLATE = """\ 18 | HPLinpack benchmark input file 19 | Innovative Computing Laboratory, University of Tennessee 20 | HPL.out output file name (if any) 21 | 6 device out (6=stdout,7=stderr,file) 22 | 1 # of problems sizes (N) 23 | %(N)d Ns 24 | 1 # of NBs 25 | %(NB)d NBs 26 | 0 PMAP process mapping (0=Row-,1=Column-major) 27 | 1 # of process grids (P x Q) 28 | %(P)d Ps 29 | %(Q)d Qs 30 | 16.0 threshold 31 | 1 # of panel fact 32 | 2 PFACTs (0=left, 1=Crout, 2=Right) 33 | 1 # of recursive stopping criterium 34 | 4 NBMINs (>= 1) 35 | 1 # of panels in recursion 36 | 2 NDIVs 37 | 1 # of recursive panel fact. 38 | 1 RFACTs (0=left, 1=Crout, 2=Right) 39 | 1 # of broadcast 40 | 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 41 | 1 # of lookahead depth 42 | 1 DEPTHs (>=0) 43 | 2 SWAP (0=bin-exch,1=long,2=mix) 44 | 64 swapping threshold 45 | 0 L1 in (0=transposed,1=no-transposed) form 46 | 0 U in (0=transposed,1=no-transposed) form 47 | 1 Equilibration (0=no,1=yes) 48 | 8 memory alignment in double (> 0) 49 | ##### This line (no. 32) is ignored (it serves as a separator). 
###### 50 | 0 Number of additional problem sizes for PTRANS 51 | 1200 10000 30000 values of N 52 | 0 number of additional blocking sizes for PTRANS 53 | 40 9 8 13 13 20 16 32 64 values of NB 54 | """ 55 | 56 | def int_factor(n): 57 | p = math.ceil(math.sqrt(n)) 58 | while p: 59 | q = int(n/p) 60 | if p*q == n: 61 | return p, q 62 | p -= 1 63 | raise RuntimeError 64 | 65 | 66 | def int_input(prompt, default=None): 67 | if default: 68 | default = str(default) 69 | val = input("%s [%s]: " % (prompt, default)) or default 70 | else: 71 | val = input("%s: " % prompt) 72 | return int(val) 73 | 74 | 75 | def get_node_memory(): 76 | try: 77 | return int((os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')) / (1024*1024)) 78 | except ValueError: 79 | return 65536 80 | 81 | 82 | def main(): 83 | default_nodes = 1 84 | default_cores = multiprocessing.cpu_count() 85 | default_memory = get_node_memory() 86 | default_block = 192 87 | 88 | nodes = int_input("Number of nodes", default_nodes) 89 | cores_per_node = int_input("Cores per node", default_cores) 90 | memory_per_node = int_input("Memory per node (MB)", default_memory) 91 | block_size = int_input("Block size", default_block) 92 | 93 | cores = nodes * cores_per_node 94 | memory = nodes * memory_per_node 95 | 96 | N = int((_MEMORY_FRACTION * math.sqrt(memory * 1024**2 / _SIZEOF_ELEMENT)) / block_size) * block_size 97 | NB = block_size 98 | P, Q = int_factor(cores) 99 | 100 | fname = "HPL.dat_N%d_NB%d_P%d_Q%d" % (N, NB, P, Q) 101 | with open(fname, "w") as fout: 102 | fout.write(_TEMPLATE % {"N": N, "NB": NB, "P": P, "Q": Q}) 103 | 104 | 105 | if __name__ == "__main__": 106 | main() 107 | -------------------------------------------------------------------------------- /examples/motorBike.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arm-hpc-devkit/nvidia-arm-hpc-devkit-users-guide/77ad9c72ab9c19e2e333184e8839d97e891fbaa0/examples/motorBike.png -------------------------------------------------------------------------------- /examples/stream-cpu.md: -------------------------------------------------------------------------------- 1 | # STREAM on Arm64 CPU 2 | 3 | **IMPORTANT**: 4 | These benchmarks provide a _lower bound_ for expected out-of-the-box performance. They can be used to determine if your system is configured correctly and operating properly. It's possible you may exceed these numbers. **They are not indented for use in any competitive analysis.** 5 | 6 | * [NVIDIA HPC SDK](#nvidia-hpc-sdk) 7 | * [GNU Compilers](#gnu-compilers) 8 | * [Arm Compiler for Linux (ACfL)](#arm-compiler-for-linux-acfl) 9 | 10 | 11 | ## Initial Configuration 12 | STREAM is a defacto standard for measuring memory bandwidth. 13 | The benchmark includes four kernels that operate on 1D arrays `a`, `b`, and `c`, with scalar `x`: 14 | * **COPY**: `c = a` 15 | * **SCALE**: `b = x*a` 16 | * **ADD**: `c = a + b` 17 | * **TRIAD**: `a = b + x*c` 18 | 19 | The kernels are executed in sequence in a loop. Two parameters configure STREAM: 20 | * `STREAM_ARRAY_SIZE`: The number of double-precision elements in each array. 21 | It is critical to select a sufficiently large array size when measuring 22 | bandwidth to/from main memory. 23 | * `NTIMES`: The number of iterations of the test loop. 24 | 25 | There are many versions of STREAM, but they all use the same four fundamental kernels. For simplicity we recommend you use this version: https://github.com/jlinford/stream. 
This implementation is restructured to demonstrate the performance benefits of OpenMP, 26 | effective use of NUMA, and features of the Arm architecture. It uses a version of 27 | stream that has been modified to enable dynamic memory allocation and to separate 28 | the kernel implementation from the benchmark driver. This makes the code easier 29 | to read and facilitates the use of external tools to measure the performance of 30 | each kernel. 31 | 32 | ```bash 33 | # Clone the repo 34 | cd $HOME/benchmarks 35 | git clone https://github.com/jlinford/stream.git 36 | # That's it! 37 | ``` 38 | 39 | The Makefile for https://github.com/jlinford/stream uses the `COMPILER` variable to select good default flags for various compilers. At the time of this writing, GCC, NVIDIA, ACfL, and Fujitsu compilers are supported. 40 | 41 | ## NVIDIA HPC SDK 42 | 43 | To build with the NVIDIA compilers and run with OpenMP: 44 | 45 | ```bash 46 | cd $HOME/benchmarks/stream 47 | make clean run COMPILER=nvidia 48 | ``` 49 | 50 | TRIAD should report approximately 170GB/s. Please remember that this number is provided for reference only and should not be used in any competitive analysis. 51 | ``` 52 | ------------------------------------------------------------- 53 | Function Best Rate MB/s Avg time Min time Max time 54 | Copy: 168917.9 0.203749 0.203411 0.204257 55 | Scale: 168201.7 0.204544 0.204277 0.204850 56 | Add: 170183.6 0.303268 0.302847 0.303862 57 | Triad: 170309.1 0.303220 0.302624 0.303913 58 | ------------------------------------------------------------- 59 | ``` 60 | 61 | ## GNU Compilers 62 | 63 | To build with GCC and run with OpenMP: 64 | 65 | ```bash 66 | cd $HOME/benchmarks/stream 67 | make clean run COMPILER=gnu 68 | ``` 69 | 70 | TRIAD should report approximately 168GB/s. Please remember that this number is provided for reference only and should not be used in any competitive analysis. 71 | ``` 72 | ------------------------------------------------------------- 73 | Function Best Rate MB/s Avg time Min time Max time 74 | Copy: 165916.1 0.207446 0.207091 0.207838 75 | Scale: 168685.6 0.204652 0.203691 0.205092 76 | Add: 167601.9 0.307983 0.307512 0.308406 77 | Triad: 167755.8 0.307578 0.307230 0.307824 78 | ------------------------------------------------------------- 79 | ``` 80 | 81 | ## Arm Compiler for Linux (ACfL) 82 | 83 | To build with ACfL and run with OpenMP: 84 | 85 | ```bash 86 | cd $HOME/benchmarks/stream 87 | make clean run COMPILER=acfl 88 | ``` 89 | 90 | TRIAD should report approximately 168GB/s. Please remember that this number is provided for reference only and should not be used in any competitive analysis. 91 | ``` 92 | ------------------------------------------------------------- 93 | Function Best Rate MB/s Avg time Min time Max time 94 | Copy: 164732.6 0.209005 0.208579 0.209475 95 | Scale: 166109.4 0.207151 0.206850 0.207815 96 | Add: 169040.6 0.305355 0.304895 0.305726 97 | Triad: 168924.2 0.305628 0.305105 0.306131 98 | ------------------------------------------------------------- 99 | ``` 100 | 101 | -------------------------------------------------------------------------------- /examples/tensorflow-cpu.md: -------------------------------------------------------------------------------- 1 | # ML Inference on Arm64 CPUs with TensorFlow 2 | 3 | TensorFlow is an open-source software library for machine learning and artificial intelligence. It can be used across training and inference of deep neural networks.
This document covers how to use TensorFlow-based machine learning inference on Arm64 CPUs, what runtime configurations are important, and how to debug any performance issues. The document also covers instructions for source builds and how to enable some of the downstream features. 4 | 5 | ## Installing TensorFlow 6 | 7 | There are multiple levels of software package abstractions available including Python wheel (easiest option), Docker container (comes with the wheel, additional packages and benchmarks), and building from source. Examples of using each method are below. 8 | 9 | ### Python Wheel 10 | The TensorFlow wheel supports an optimized onednn+acl backend for Arm64 CPUs. 11 | ``` 12 | pip install tensorflow-cpu 13 | ``` 14 | 15 | ### Docker Hub Container 16 | ``` 17 | # pull the tensorflow docker container with onednn-acl optimizations enabled 18 | docker pull armswdev/tensorflow-arm-neoverse 19 | 20 | # launch the docker image 21 | docker run -it --rm -v /home/ubuntu/:/hostfs armswdev/tensorflow-arm-neoverse 22 | ``` 23 | 24 | ### Building from Source 25 | 26 | While the packages for python wheel/docker container/DLAMI provide a stable baseline for ML application development and production, they lack the latest fixes and optimizations from the development branches. This section provides instructions for building TensorFlow from source, either to build the master branch or to incorporate the downstream optimizations. 27 | ```bash 28 | # Install bazel for aarch64 29 | mkdir bazel 30 | cd bazel 31 | wget https://github.com/bazelbuild/bazel/releases/download/5.1.1/bazel-5.1.1-linux-arm64 32 | mv bazel-5.1.1-linux-arm64 bazel 33 | chmod a+x bazel 34 | export PATH=/home/ubuntu/bazel/:$PATH 35 | 36 | # Clone the tensorflow repository 37 | git clone https://github.com/tensorflow/tensorflow.git 38 | cd tensorflow 39 | # Optionally check out the stable version if needed 40 | git checkout 41 | 42 | # Set the build configuration 43 | export HOST_C_COMPILER=$(which gcc) 44 | export HOST_CXX_COMPILER=$(which g++) 45 | export PYTHON_BIN_PATH=$(which python) 46 | export USE_DEFAULT_PYTHON_LIB_PATH=1 47 | export TF_ENABLE_XLA=1 48 | export TF_DOWNLOAD_CLANG=0 49 | export TF_SET_ANDROID_WORKSPACE=0 50 | export TF_NEED_MPI=0 51 | export TF_NEED_ROCM=0 52 | export TF_NEED_GCP=0 53 | export TF_NEED_S3=0 54 | export TF_NEED_OPENCL_SYCL=0 55 | export TF_NEED_CUDA=0 56 | export TF_NEED_HDFS=0 57 | export TF_NEED_OPENCL=0 58 | export TF_NEED_JEMALLOC=1 59 | export TF_NEED_VERBS=0 60 | export TF_NEED_AWS=0 61 | export TF_NEED_GDR=0 62 | export TF_NEED_OPENCL_SYCL=0 63 | export TF_NEED_COMPUTECPP=0 64 | export TF_NEED_KAFKA=0 65 | export TF_NEED_TENSORRT=0 66 | ./configure 67 | 68 | # Issue bazel build command with 'mkl_aarch64' config to enable onednn+acl backend 69 | bazel build --verbose_failures -s --config=mkl_aarch64 //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_cc.so //tensorflow:install_headers 70 | 71 | # Create and install the wheel 72 | ./bazel-bin/tensorflow/tools/pip_package/build_pip_package ./wheel-TF2.9.0-py3.8-aarch64 73 | 74 | # The output wheel is generated in /home/ubuntu/tensorflow/wheel-TF2.9.0-py3.8-aarch64 75 | pip install 76 | ``` 77 | 78 | ## Runtime Configuration for Optimal Performance 79 | Once the TensorFlow setup is ready, enable the below runtime configurations to achieve the best performance.
80 | ``` 81 | # The default runtime backend for tensorflow is Eigen, but typically onednn+acl provides better performance and this can be enabled by setting the below TF environment variable 82 | export TF_ENABLE_ONEDNN_OPTS=1 83 | 84 | # If the CPU supports the BF16 format for ML acceleration, it can be enabled in oneDNN by setting the below environment variable 85 | grep -q bf16 /proc/cpuinfo && export DNNL_DEFAULT_FPMATH_MODE=BF16 86 | 87 | # Make sure the OpenMP threads are distributed across all the processes for multi-process applications to avoid oversubscription of the vCPUs. 88 | num_vcpus=80 89 | num_processes= 90 | export OMP_NUM_THREADS=$((1 > ($num_vcpus/$num_processes) ? 1 : ($num_vcpus/$num_processes))) 91 | export OMP_PROC_BIND=false 92 | export OMP_PLACES=cores 93 | ``` 94 | 95 | ```python 96 | # TensorFlow inter and intra_op_parallelism_thread settings are critical for optimal workload parallelization in a multi-threaded system. 97 | # Set the inter- and intra-op thread counts during session creation; an example snippet is given below. 98 | session = Session( 99 | config=ConfigProto( 100 | intra_op_parallelism_threads=80, 101 | inter_op_parallelism_threads=1, 102 | ) 103 | ) 104 | ``` 105 | TensorFlow recommends the graph optimization pass for inference to remove training-specific nodes, fold batch norms, and fuse operators. This is a generic optimization across CPU, GPU, or TPU inference, and the optimization script is part of the TensorFlow python [tools](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference_lib.py). For a detailed description, please refer to the TensorFlow [Grappler](https://www.tensorflow.org/guide/graph_optimization) documentation. Below is a snippet of what libraries to import and how to invoke the Grappler passes for inference. 106 | ```python 107 | from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference 108 | 109 | graph_def = tf.compat.v1.GraphDef() 110 | with tf.compat.v1.gfile.FastGFile(model_path, "rb") as f: 111 | graph_def.ParseFromString(f.read()) 112 | 113 | optimized_graph_def = optimize_for_inference( 114 | graph_def, 115 | [item.split(':')[0] for item in inputs], 116 | [item.split(':')[0] for item in outputs], 117 | dtypes.float32.as_datatype_enum, 118 | False) 119 | g = tf.compat.v1.import_graph_def(optimized_graph_def, name='') 120 | ``` 121 | Note: While the Grappler optimizer covers the majority of networks, there are a few scenarios where either the Grappler optimizer can't optimize the generic graph or the runtime kernel launch overhead is simply not acceptable. XLA addresses these gaps by providing an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Please refer to the **Enable XLA optimizations** section below to achieve the best performance with the downstream XLA optimizations. 122 | 123 | ## Evaluate performance with the standard MLPerf inference benchmarks 124 | 125 | 1. Set up the MLPerf inference benchmarks and the required tools. 126 | ``` 127 | sudo apt install -y build-essential cmake libgl1-mesa-glx libglib2.0-0 libsm6 libxrender1 libxext6 python3-pip 128 | 129 | git clone https://github.com/mlcommons/inference.git --recursive 130 | cd inference 131 | git checkout 5ec3ac922556107ce0f6ca63c175d379017ba3d8 132 | cd loadgen 133 | CFLAGS="-std=c++14" python3 setup.py bdist_wheel 134 | pip install 135 | ``` 136 | 137 | 2.
Benchmark image classification with Resnet50 138 | ``` 139 | sudo apt install python3-ck 140 | ck pull repo:ck-env 141 | 142 | # Download ImageNet's validation set 143 | # These will be installed to ${HOME}/CK-TOOLS/ 144 | # Select option 1: val-min data set. 145 | ck install package --tags=image-classification,dataset,imagenet,aux 146 | ck install package --tags=image-classification,dataset,imagenet,val 147 | 148 | # Copy the labels into the image location 149 | cp ${HOME}/CK-TOOLS/dataset-imagenet-ilsvrc2012-aux/val.txt ${HOME}/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min/val_map.txt 150 | 151 | cd inference/vision/classification_and_detection 152 | wget https://zenodo.org/record/2535873/files/resnet50_v1.pb 153 | 154 | # Install the additional packages required for resnet50 inference 155 | pip install opencv-python 156 | pip install pycocotools 157 | pip install psutil 158 | pip install tqdm 159 | 160 | # Set the data and model path 161 | export DATA_DIR=${HOME}/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min 162 | export MODEL_DIR=${HOME}/inference/vision/classification_and_detection 163 | 164 | # Setup the tensorflow thread pool parameters via MLPerf env variables 165 | export MLPERF_NUM_INTER_THREADS=1 166 | 167 | num_vcpus=$(getconf _NPROCESSORS_ONLN) 168 | num_processes= 169 | export MLPERF_NUM_INTRA_THREADS=$((1 > ($num_vcpus/$num_processes) ? 1 : ($num_vcpus/$num_processes))) 170 | 171 | ./run_local.sh tf resnet50 cpu --scenario=Offline 172 | ``` 173 | 174 | 3. Benchmark natural language processing with Bert 175 | ``` 176 | cd inference/language/bert 177 | make setup 178 | python3 run.py --backend=tf --scenario=Offline 179 | ``` 180 | 181 | ## Troubleshooting performance issues 182 | 183 | The steps below help debug performance issues with any inference application. 184 | 185 | 1. Run inference with DNNL and OpenMP verbose logs enabled to understand which backend is used for the tensor ops execution. 186 | ``` 187 | export DNNL_VERBOSE=1 188 | export OMP_DISPLAY_ENV=VERBOSE 189 | ``` 190 | If there are no OneDNN logs on the terminal, this could mean the ops are executed with either the Eigen or the XLA backend. To switch from Eigen to the OneDNN+ACL backend, set `TF_ENABLE_ONEDNN_OPTS=1` and rerun the model inference. For non-XLA compiled graphs, there should be a flow of oneDNN logs with details about the shapes, prop kinds, and execution times. Inspect the logs to see if there are any ops and shapes not executed with the ACL gemm kernel but instead executed by the C++ reference kernel. See the example dnnl logs below to understand what the ACL gemm and reference C++ kernel execution traces look like. 191 | ``` 192 | # ACL gemm kernel 193 | dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:acdb:f0 wei_f32::blocked:acdb:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,post_ops:'eltwise_relu;';,alg:convolution_direct,mb1_ic256oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0 194 | 195 | # OneDNN cpp reference kernel 196 | dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,post_ops:'eltwise_bounded_relu:6;';,alg:convolution_direct,mb1_g64ic64oc64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 197 | ``` 198 | If there are any shapes not going to ACL gemm kernels, the first step is to make sure the graph has been optimized for inference via Grappler passes.
199 | ```python 200 | from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference 201 | 202 | graph_def = tf.compat.v1.GraphDef() 203 | with tf.compat.v1.gfile.FastGFile(model_path, "rb") as f: 204 | graph_def.ParseFromString(f.read()) 205 | 206 | optimized_graph_def = optimize_for_inference( 207 | graph_def, 208 | [item.split(':')[0] for item in inputs], 209 | [item.split(':')[0] for item in outputs], 210 | dtypes.float32.as_datatype_enum, 211 | False) 212 | g = tf.compat.v1.import_graph_def(optimized_graph_def, name='') 213 | ``` 214 | If the tensor ops and shapes are still not executed with ACL gemm kernels, please raise an issue on [ACL github](https://github.com/ARM-software/ComputeLibrary) with the operator and shape details. 215 | 216 | 2. Once the tensor ops are executed with ACL gemm kernels, enable fast math mode, `export DNNL_DEFAULT_FPMATH_MODE=BF16`, to pick bfloat16 hybrid gemm kernels. 217 | 218 | 3. Verify the TensorFlow inter- and intra- thread pool settings are optimal as recommended in the runtime configurations section. Then, inspect the OMP environment to make sure the CPU cores are not oversubscribed for multi process applications. A typical openmp environment for a 80 thread, single process application looks like the one below. 219 | ``` 220 | OPENMP DISPLAY ENVIRONMENT BEGIN 221 | _OPENMP = '201511' 222 | OMP_DYNAMIC = 'FALSE' 223 | OMP_NESTED = 'FALSE' 224 | OMP_NUM_THREADS = '80' 225 | OMP_SCHEDULE = 'DYNAMIC' 226 | OMP_PROC_BIND = 'FALSE' 227 | OMP_PLACES = '' 228 | OMP_STACKSIZE = '0' 229 | OMP_WAIT_POLICY = 'PASSIVE' 230 | OMP_THREAD_LIMIT = '4294967295' 231 | OMP_MAX_ACTIVE_LEVELS = '1' 232 | OMP_CANCELLATION = 'FALSE' 233 | OMP_DEFAULT_DEVICE = '0' 234 | OMP_MAX_TASK_PRIORITY = '0' 235 | OMP_DISPLAY_AFFINITY = 'FALSE' 236 | OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A' 237 | OMP_ALLOCATOR = 'omp_default_mem_alloc' 238 | OMP_TARGET_OFFLOAD = 'DEFAULT' 239 | GOMP_CPU_AFFINITY = '' 240 | GOMP_STACKSIZE = '0' 241 | GOMP_SPINCOUNT = '300000' 242 | OPENMP DISPLAY ENVIRONMENT END 243 | ``` 244 | 245 | ## Enable TensorFlow Serving with onednn+acl backend 246 | 247 | [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data. 248 | 249 | As of 2.9.0 version, TensorFlow Serving for aarch64 supports TensorFlow with Eigen backend. For the best performance, with onednn+acl backend, please follow the below instructions to cherrypick the PRs and to rebuild the TensorFlow Serving docker image. 250 | ```bash 251 | # Clone the tf serving repository 252 | git clone https://github.com/tensorflow/serving.git 253 | cd serving 254 | 255 | # Pull https://github.com/tensorflow/serving/pull/1953 256 | git fetch origin pull/1953/head:tfs_aarch64 257 | 258 | # Pull https://github.com/tensorflow/serving/pull/1954 259 | git fetch origin pull/1954/head:tfs_docker_aarch64 260 | 261 | # Merge them 262 | git checkout tfs_aarch64 263 | git merge tfs_docker_aarch64 264 | 265 | # Invoke the docker build script to trigger mkl aarch64 config build 266 | docker build -f tensorflow_serving/tools/docker/Dockerfile.devel-mkl-aarch64 -t tfs:mkl_aarch64 . 
267 | 268 | # Command to launch the serving api with onednn+acl backend, and BF16 kernels for a resnet model 269 | docker run -p 8501:8501 --name tfserving_resnet --mount type=bind,source=/tmp/resnet,target=/models/resnet -e MODEL_NAME=resnet -e TF_ENABLE_ONEDNN_OPTS=1 -e DNNL_DEFAULT_FPMATH_MODE=BF16 -t tfs:mkl_aarch64 270 | ``` 271 | 272 | ## Enable XLA optimizations 273 | 274 | While the Grappler optimizer covers the majority of networks, there are a few scenarios where either the Grappler optimizer can't optimize the generic graph or the runtime kernel launch overhead is simply not acceptable. XLA addresses these gaps by providing an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. TensorFlow 2.9.0 supports the aarch64 XLA backend with the Eigen runtime. For the best performance please cherry-pick the [PR](https://github.com/tensorflow/tensorflow/pull/55534) and rebuild the TensorFlow libraries. 275 | ```bash 276 | # Clone the tensorflow repository 277 | git clone https://github.com/tensorflow/tensorflow.git 278 | cd tensorflow 279 | 280 | # Pull https://github.com/tensorflow/tensorflow/pull/55534 281 | git fetch origin pull/55534/head:xla_acl 282 | git checkout xla_acl 283 | 284 | # Set the build configuration 285 | export HOST_C_COMPILER=$(which gcc) 286 | export HOST_CXX_COMPILER=$(which g++) 287 | export PYTHON_BIN_PATH=$(which python) 288 | export USE_DEFAULT_PYTHON_LIB_PATH=1 289 | export TF_ENABLE_XLA=1 290 | export TF_DOWNLOAD_CLANG=0 291 | export TF_SET_ANDROID_WORKSPACE=0 292 | export TF_NEED_MPI=0 293 | export TF_NEED_ROCM=0 294 | export TF_NEED_GCP=0 295 | export TF_NEED_S3=0 296 | export TF_NEED_OPENCL_SYCL=0 297 | export TF_NEED_CUDA=0 298 | export TF_NEED_HDFS=0 299 | export TF_NEED_OPENCL=0 300 | export TF_NEED_JEMALLOC=1 301 | export TF_NEED_VERBS=0 302 | export TF_NEED_AWS=0 303 | export TF_NEED_GDR=0 304 | export TF_NEED_OPENCL_SYCL=0 305 | export TF_NEED_COMPUTECPP=0 306 | export TF_NEED_KAFKA=0 307 | export TF_NEED_TENSORRT=0 308 | ./configure 309 | 310 | # Issue bazel build command with 'mkl_aarch64' config to enable onednn+acl backend 311 | bazel build --verbose_failures -s --config=mkl_aarch64 //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_cc.so //tensorflow:install_headers 312 | 313 | # Create and install the wheel 314 | ./bazel-bin/tensorflow/tools/pip_package/build_pip_package ./wheel-TF2.9.0-py3.8-aarch64 315 | 316 | # The wheel is generated in /home/ubuntu/tensorflow/wheel-TF2.9.0-py3.8-aarch64 317 | pip install 318 | ``` 319 | A simple way to start using XLA in TensorFlow models without any changes is to enable auto-clustering, which automatically finds clusters (connected subgraphs) within the TensorFlow functions which can be compiled and executed using XLA. Auto-clustering on CPU can be enabled either through the session config or by setting the TF_XLA_FLAGS environment variable as shown below: 320 | ```python 321 | # Set the jit level for the current session via the config 322 | jit_level = tf_compat_v1.OptimizerOptions.ON_1 323 | config.graph_options.optimizer_options.global_jit_level = jit_level 324 | ``` 325 | ``` 326 | # Enable auto clustering for CPU backend 327 | export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" 328 | ``` 329 | 330 | ### Troubleshooting XLA Performance Issues 331 | 332 | 1.
If XLA performance improvements are not as expected, the first step is to dump and inspect the XLA optimized graph and ensure there are not many op duplications resulting from the op fusion and other optimization passes. XLA provides a detailed logging mechanism to dump the state at different checkpoints during the graph optimization passes. At a high level, inspecting the graphs generated before and after the XLA pass is sufficient to understand whether XLA compilation is the correct optimization for the current graph. Please refer to the instructions below for enabling auto-clustering along with .dot generation (using MLPerf Bert inference in SingleStream mode as the example here), and also the commands to generate .svg versions for easier visualization of the XLA generated graphs. 333 | ```bash 334 | # To enable XLA auto clustering, and to generate .dot files 335 | XLA_FLAGS="--xla_dump_to=/tmp/generated --xla_dump_hlo_as_dot" TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" python run.py --backend=tf --scenario=SingleStream 336 | 337 | # To convert the .dot file into .svg 338 | sudo apt install graphviz 339 | dot -Tsvg <.dot file to be converted> > 340 | 341 | # e.g. 342 | dot -Tsvg 1645569101166784.module_0000.cluster_51__XlaCompiledKernel_true__XlaHasReferenceVars_false__XlaNumConstantArgs_0__XlaNumResourceArgs_0_.80.before_optimizations.dot > module0_before_opts.svg 343 | 344 | dot -Tsvg 1645569101166784.module_0000.cluster_51__XlaCompiledKernel_true__XlaHasReferenceVars_false__XlaNumConstantArgs_0__XlaNumResourceArgs_0_.80.cpu_after_optimizations.dot > module0_after_opts.svg 345 | ``` 346 | 347 | 2. Once the XLA graph looks as expected (without too many duplicated nodes), check how the ops are emitted. Currently XLA framework logging is under the same TF CPP logging, and level 1 is sufficient to get most of the info traces. 348 | ```bash 349 | # Enable TF CPP framework logging 350 | export TF_CPP_MAX_VLOG_LEVEL=1 351 | ``` 352 | Then look for the emitter level traces to understand how each op for a given shape is lowered to LLVM IR. The traces below show whether the XLA ops are emitted to the ACL or Eigen runtime. 353 | ```bash 354 | # ACL runtime traces 355 | __xla_cpu_runtime_ACLBatchMatMulF32 356 | 357 | __xla_cpu_runtime_ACLConv2DF32 358 | 359 | # Eigen runtime traces 360 | __xla_cpu_runtime_EigenBatchMatMulF32 361 | 362 | __xla_cpu_runtime_EigenConv2DF32 363 | ``` 364 | If the shapes are not emitted by the ACL runtime, check the source build configuration to make sure the 'mkl_aarch64' bazel config is enabled (which internally enables the '--define=build_with_acl=true' bazel configuration). If the LLVM IR still doesn't emit the ACL runtime, please raise an issue on [ACL github](https://github.com/ARM-software/ComputeLibrary) with the operator and shape details. 365 | -------------------------------------------------------------------------------- /examples/tensorflow-gpu.md: -------------------------------------------------------------------------------- 1 | # TensorFlow on A100 GPUs and Arm64 CPUs 2 | 3 | TensorFlow is an open source platform for machine learning. It provides 4 | comprehensive tools and libraries in a flexible architecture allowing easy 5 | deployment across a variety of platforms and devices. NGC Containers are the easiest 6 | way to get started with TensorFlow.
The TensorFlow NGC Container comes with all 7 | dependencies included, providing an easy place to start developing common 8 | applications, such as conversational AI, natural language processing (NLP), 9 | recommenders, and computer vision. 10 | 11 | 12 | ## NVIDIA TensorFlow NGC Container 13 | The NVIDIA NGC container registry includes GPU-accelerated Arm64 builds of TensorFlow. For most cases, this will be your best option both in terms of ease of use and performance. 14 | 15 | The [TensorFlow NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) is optimized for GPU acceleration, and contains a 16 | validated set of libraries that enable and optimize GPU performance. This container 17 | may also contain modifications to the TensorFlow source code in order to maximize 18 | performance and compatibility. This container also contains software for 19 | accelerating ETL (DALI, RAPIDS), Training (cuDNN, NCCL), and Inference (TensorRT) 20 | workloads. See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow for 21 | more information. 22 | 23 | 24 | ### Example: Mask-RCNN For TensorFlow 2 25 | Mask R-CNN is a convolution-based neural network for the task of object instance segmentation. The paper describing the model can be found [here](https://arxiv.org/abs/1703.06870). NVIDIA’s Mask R-CNN is an optimized version of [Google's TPU implementation](https://github.com/tensorflow/tpu/tree/master/models/official/mask_rcnn), leveraging mixed precision arithmetic using Tensor Cores while maintaining target accuracy. 26 | 27 | This model is trained with mixed precision using Tensor Cores on the NVIDIA Ampere GPU. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. See the [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN) for more details. 28 | 29 | 1. Download Mask-RCNN example files 30 | ```bash 31 | # Clone the NVIDIA Deep Learning Examples repository 32 | git clone https://github.com/NVIDIA/DeepLearningExamples.git 33 | 34 | # Go to the MaskRCNN example for TensorFlow 35 | cd DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN 36 | ``` 37 | 38 | 2. Update the Dockerfile to fix references to outdated software versions. Newer versions of these packages support Arm64. 39 | ```bash 40 | # Update software versions in Dockerfile to enable Arm64 support 41 | cat | patch -p0 << "EOF" 42 | --- Dockerfile 2022-07-07 15:35:29.824463000 -0700 43 | +++ Dockerfile.patched 2022-07-07 15:36:47.041457000 -0700 44 | @@ -12,7 +12,7 @@ 45 | # See the License for the specific language governing permissions and 46 | # limitations under the License. 47 | 48 | -ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:21.02-tf2-py3 49 | +ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:22.06-tf2-py3 50 | FROM ${FROM_IMAGE_NAME} 51 | 52 | LABEL model="MaskRCNN" 53 | @@ -30,10 +30,10 @@ 54 | cd /opt/pybind11 && cmake . && make install && pip install . 
55 | 56 | 57 | -# update protobuf 3 to 3.3.0 58 | +# update protobuf 3 to 3.18.2 59 | RUN \ 60 | - curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip && \ 61 | - unzip -u protoc-3.3.0-linux-x86_64.zip -d protoc3 && \ 62 | + curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.18.2/protoc-3.18.2-linux-aarch_64.zip && \ 63 | + unzip -u protoc-3.18.2-linux-aarch_64.zip -d protoc3 && \ 64 | mv protoc3/bin/* /usr/local/bin/ && \ 65 | mv protoc3/include/* /usr/local/include/ 66 | EOF 67 | ``` 68 | 69 | 3. Update example files to support TensorFlow2. 70 | ```bash 71 | cat | patch -p0 <<"EOF" 72 | --- mrcnn_tf2/runtime/run.py 2022-07-07 16:01:33.565361000 -0700 73 | +++ mrcnn_tf2/runtime/run.py.patched 2022-07-07 16:01:27.167354000 -0700 74 | @@ -112,11 +112,9 @@ 75 | logging.info('XLA is activated') 76 | 77 | if params.amp: 78 | - policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16", loss_scale="dynamic") 79 | - tf.keras.mixed_precision.experimental.set_policy(policy) 80 | + tf.keras.mixed_precision.set_global_policy("mixed_float16") 81 | logging.info('AMP is activated') 82 | 83 | - 84 | def create_model(params): 85 | model = MaskRCNN( 86 | params=params, 87 | EOF 88 | ``` 89 | 90 | 4. Build the container image 91 | ```bash 92 | nvidia-docker build -t nvidia_mrcnn_tf2 . 93 | ``` 94 | 95 | 5. Benchmark model training 96 | ```bash 97 | # Start an interactive session in the NGC container 98 | # Use bind mounts to retain data between runs 99 | docker run --gpus all -it --rm \ 100 | --shm-size=2g \ 101 | --ulimit memlock=-1 \ 102 | --ulimit stack=67108864 \ 103 | -v /tmp/mask-rcnn/data:/data \ 104 | -v /tmp/mask-rcnn/weights:/weights \ 105 | nvidia_mrcnn_tf2 106 | 107 | # Note: skip this step if you already have the preprocessed data 108 | # Download and preprocess the dataset (approximately 25GB) 109 | cd /workspace/mrcnn_tf2/dataset 110 | bash download_and_preprocess_coco.sh /data 111 | 112 | # Note: skip this step if you already have the preprocessed data 113 | # Download the pre-trained ResNet-50 weights. 114 | python scripts/download_weights.py --save_dir=/weights 115 | 116 | # Start training on 2 GPUs 117 | cd /workspace/mrcnn_tf2 118 | # Optional: if you see an error message like this: 119 | # ImportError: /usr/lib/aarch64-linux-gnu/libgomp.so.1: cannot allocate memory in static TLS block 120 | # ... 
then set LD_PRELOAD like this: 121 | # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1 122 | python scripts/train.py --gpus 2 --batch_size 24 123 | ``` 124 | 125 | 126 | ### Example: F-LM on Arm64 with 2xA100 GPUs 127 | 128 | ```bash 129 | # Start an interactive session in the NGC container 130 | # Use bind mounts in $HOME/big_lstm to retain data 131 | docker run --gpus all -it --rm \ 132 | -v $HOME/big_lstm/data:/data \ 133 | -v $HOME/big_lstm/logs:/logs \ 134 | --ipc=host \ 135 | --ulimit memlock=-1 \ 136 | --ulimit stack=67108864 \ 137 | nvcr.io/nvidia/tensorflow:23.01-tf1-py3 138 | 139 | # Go to the example directory 140 | cd /workspace/nvidia-examples/big_lstm 141 | 142 | # Download the training data 143 | ./download_1b_words_data.sh 144 | 145 | # Training for up to 180 seconds 146 | python single_lm_train.py \ 147 | --mode=train \ 148 | --logdir=/logs \ 149 | --num_gpus=2 \ 150 | --datadir=/data/1-billion-word-language-modeling-benchmark-r13output \ 151 | --hpconfig run_profiler=False,max_time=180,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=512 152 | 153 | # Evaluation 154 | python single_lm_train.py \ 155 | --mode=eval_full \ 156 | --logdir=/logs \ 157 | --num_gpus=2 \ 158 | --datadir=/data/1-billion-word-language-modeling-benchmark-r13output \ 159 | --hpconfig run_profiler=False,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=512 160 | ``` 161 | 162 | -------------------------------------------------------------------------------- /examples/velox.md: -------------------------------------------------------------------------------- 1 | # Velox on Arm64 2 | 3 | Velox is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox was created by Facebook and it is currently developed in partnership with Intel, ByteDance, and Ahana. See https://github.com/facebookincubator/velox for more information. 4 | 5 | What makes Velox interesting on Arm64 is its use of the xsimd library. xsimd provides a unified means for using SIMD features for library authors. Namely, it enables manipulation of batches of numbers with the same arithmetic operators as for single values. It also provides accelerated implementations of common mathematical functions operating on batches. See https://github.com/xtensor-stack/xsimd. 6 | 7 | ## Build from Source 8 | 9 | Note: the standard Velox installation assumes you are running Ubuntu 22.04 LTS and you have root access. You'll need to modify Velox's setup scripts if this is not the case. 10 | 11 | ```bash 12 | # Install flex and bison 13 | sudo apt install -y flex bison 14 | 15 | # Get the source 16 | git clone --recursive https://github.com/facebookincubator/velox.git 17 | cd velox 18 | git submodule sync --recursive 19 | git submodule update --init --recursive 20 | ``` 21 | 22 | We need to edit `scripts/setup-helper-functions.sh` to fix a bug and update the compiler flags for Arm64.
23 | ```bash 24 | # Add missing sudo to ninja command 25 | sed -i 's/ninja -C/sudo ninja -C/' scripts/setup-helper-functions.sh 26 | 27 | # Fix the compiler flags for Arm 28 | sed -i 's/-march=armv8-a+crc+crypto/-mcpu=neoverse-n1 -flax-vector-conversions -fsigned-char/' scripts/setup-helper-functions.sh 29 | ``` 30 | 31 | What are these flags and why do we need them? 32 | * `-flax-vector-conversions`: Let GCC permit conversions between vectors with differing element types. This flag is needed because NEON differentiates between vectors of signed and unsigned types. Without this flag, the compiler will not permit implicit cast operations between such vectors. Velox relies on such implicit casts in several places so we must relax the compiler's vector conversion rules. 33 | * `-fsigned-char`: The C++ standard does not specify if `char` is signed or unsigned. GCC on Arm implements it as unsigned, but there are places in Velox where it is assumed to be signed. This flag makes all occurrences of `char` be signed, like `signed char`. 34 | * `-mcpu=neoverse-n1`: Optional, but this enables the compiler to use all available architecture features and also perform micro-architectural tuning for the Neoverse N1 CPU core. Using `-mcpu` should produce a more tuned binary than using the `-march` flag. 35 | 36 | Now, compile Velox: 37 | ```bash 38 | export CPU_TARGET=aarch64 39 | ./scripts/setup-ubuntu.sh 40 | make TREAT_WARNINGS_AS_ERRORS=0 ENABLE_WALL=0 CPU_TARGET=aarch64 41 | ``` 42 | -------------------------------------------------------------------------------- /examples/wrf.md: -------------------------------------------------------------------------------- 1 | # WRF on Arm64 2 | 3 | The [Weather Research and Forecasting (WRF) Model](https://www.mmm.ucar.edu/weather-research-and-forecasting-model) is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. 4 | 5 | Arm64 is supported by the standard WRF distribution as of WRF 4.3.3. The following is an example of how to perform the standard procedure to build and execute on Arm64. See [http://www2.mmm.ucar.edu/wrf/users/download/get_source.html](http://www2.mmm.ucar.edu/wrf/users/download/get_source.html) for more details. 6 | 7 | ## Running the CONUS (Contiguous US) test 8 | Build WRF using one of the methods described below (i.e. via Spack or manually) and download a CONUS (Contiguous US) test deck from www2.mmm.ucar.edu: 9 | - 12km case, about 1.8GB: https://www2.mmm.ucar.edu/wrf/src/conus12km.tar.gz 10 | - 2.5km case, about 18GB: https://www2.mmm.ucar.edu/wrf/src/conus2.5km.tar.gz 11 | 12 | The 12km and 2.5km cases are run in the same way.
13 | 14 | ```bash 15 | cd $BUILD_DIR/WRFV4.4.2 16 | 17 | # Copy the run directory template 18 | cp -a run run_CONUS12km 19 | cd run_CONUS12km 20 | 21 | # Download the test case files and merge them into the run directory 22 | curl -L https://www2.mmm.ucar.edu/wrf/src/conus12km.tar.gz | tar xvzf - --strip-components=1 23 | 24 | # Configure stack limits 25 | ulimit -s unlimited 26 | export OMP_STACKSIZE=1G 27 | 28 | # Run with 8 MPI ranks, each having 10 OpenMP threads 29 | OMP_NUM_THREADS=10 OMP_PLACES=cores OMP_PROC_BIND=close mpirun -np 8 -map-by socket:PE=10 ./wrf.exe 30 | 31 | # Track progress by watching the Rank 0 output file: 32 | tail -f rsl.out.0000 33 | 34 | # Quickly calculate the average elapsed seconds per domain as a figure-of-merit 35 | cat rsl.out.* | grep 'Timing for main:' | awk '{print $9}' | jq -s add/length 36 | ``` 37 | 38 | 39 | ## Quick Start with Spack 40 | One of the easiest ways to use WRF on Arm64 is to install it via Spack. Simply executing `spack install wrf` will install the latest version of WRF. However, we recommend you also install a recent version of GCC and then use that GCC installation to build WRF to get the best performance. For example, 41 | ```bash 42 | spack install gcc@12.1.0 43 | spack load gcc@12.1.0 44 | spack compiler find 45 | spack install wrf %gcc@12.1.0 46 | ``` 47 | 48 | ## Manually Build from Source 49 | 50 | ### Dependencies 51 | WRF depends on the NetCDF Fortran library, which in turn requires the NetCDF C library and HDF5. All these packages support Arm64 and build easily with GCC. Be sure to set the environment variables `HDFDIR` and `NETCDF` to be the location of the HDF5 and NetCDF installations _Note: it is assumed that the Fortran NetCDF interface has been installed at the same location as the C library, i.e. they share the same `lib` and `include` directories._ 52 | 53 | ```bash 54 | # Create a build directory to hold WRF and all its dependencies 55 | mkdir WRF 56 | cd WRF 57 | 58 | # Configure build environment 59 | export BUILD_DIR=$PWD 60 | export HDFDIR=$BUILD_DIR/opt 61 | export HDF5=$BUILD_DIR/opt 62 | export NETCDF=$BUILD_DIR/opt 63 | 64 | # HDF5 65 | curl -L https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.0/src/hdf5-1.14.0.tar.gz | tar xvzf - 66 | cd hdf5-1.14.0 67 | ./configure --prefix=$HDFDIR --enable-fortran --enable-parallel 68 | make -j && make install 69 | cd $BUILD_DIR 70 | 71 | # Only set these _after_ HDF5 has compiled and installed 72 | # Note that LDFLAGS only includes HDF5, not NETCDF 73 | export CC=`which mpicc` 74 | export CXX=`which mpicxx` 75 | export FC=`which mpifort` 76 | export CPPFLAGS="-I$HDFDIR/include" 77 | export CFLAGS="-I$HDFDIR/include" 78 | export FFLAGS="-I$HDFDIR/include" 79 | export LDFLAGS="-L$HDFDIR/lib -lhdf5_hl -lhdf5 -lz" 80 | export PATH=$NETCDF/bin:$PATH 81 | export LD_LIBRARY_PATH=$BUILD_DIR/opt/lib:$LD_LIBRARY_PATH 82 | 83 | # NetCDF-c 84 | curl -L https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.0.tar.gz | tar xvzf - 85 | cd netcdf-c-4.9.0 86 | ./configure --prefix=$NETCDF 87 | make -j && make install 88 | cd $BUILD_DIR 89 | 90 | # NetCDF-f 91 | curl -L https://github.com/Unidata/netcdf-fortran/archive/refs/tags/v4.6.0.tar.gz | tar xvzf - 92 | cd netcdf-fortran-4.6.0/ 93 | ./configure --prefix=$NETCDF 94 | make -j && make install 95 | cd $BUILD_DIR 96 | ``` 97 | 98 | ### Download WRF Source 99 | The latest WRF source code can be downloaded [here](https://github.com/wrf-model/WRF/releases). 
100 | ```bash 101 | cd $BUILD_DIR 102 | curl -L https://github.com/wrf-model/WRF/releases/download/v4.4.2/v4.4.2.tar.gz | tar xvzf - 103 | cd WRFV4.4.2 104 | ``` 105 | 106 | ### Configure and Compile WRF with GCC 107 | Run WRF's configure script and select from the provided options. 108 | It's best to select an option from the "Aarch64" row as this will add 109 | `-mcpu=native` and other important compiler flags to the build line. 110 | Other rows in the configuration options may not achieve the best performance. 111 | Choose the default option for nesting (i.e. `1`). 112 | 113 | Make sure `configure` outputs messages like these: 114 | - `Will use NETCDF in dir: /home/jlinford/workspace/benchmarks/WRF/opt` 115 | - `Will use HDF5 in dir: /home/jlinford/workspace/benchmarks/WRF/opt` 116 | If you see errors finding NetCDF or HDF5, resolve those errors before proceeding. 117 | 118 | ```bash 119 | ./configure 120 | checking for perl5... no 121 | checking for perl... found /usr/bin/perl (perl) 122 | Will use NETCDF in dir: /home/jlinford/workspace/benchmarks/WRF/opt 123 | Will use HDF5 in dir: /home/jlinford/workspace/benchmarks/WRF/opt 124 | PHDF5 not set in environment. Will configure WRF for use without. 125 | Will use 'time' to report timing information 126 | $JASPERLIB or $JASPERINC not found in environment, configuring to build without grib2 I/O... 127 | ------------------------------------------------------------------------ 128 | Please select from among the following Linux aarch64 options: 129 | 130 | 1. (serial) 2. (smpar) 3. (dmpar) 4. (dm+sm) GNU (gfortran/gcc) 131 | 5. (serial) 6. (smpar) 7. (dmpar) 8. (dm+sm) GNU (gfortran/gcc) 132 | 9. (serial) 10. (smpar) 11. (dmpar) 12. (dm+sm) GCC (gfortran/gcc): Aarch64 133 | 134 | Enter selection [1-12] : 11 135 | ------------------------------------------------------------------------ 136 | Compile for nesting? (1=basic, 2=preset moves, 3=vortex following) [default 1]: 137 | 138 | Configuration successful!
139 | ``` 140 | 141 | Then use WRF's compile script to build the target executable: 142 | ```bash 143 | # Reset build environment to include `-lnetcdf` in LDFLAGS 144 | export CC=`which mpicc` 145 | export CXX=`which mpicxx` 146 | export FC=`which mpifort` 147 | export CPPFLAGS="-I$HDFDIR/include" 148 | export CFLAGS="-I$HDFDIR/include" 149 | export FFLAGS="-I$HDFDIR/include" 150 | export LDFLAGS="-L$HDFDIR/lib -lnetcdf -lhdf5_hl -lhdf5 -lz" 151 | export PATH=$NETCDF/bin:$PATH 152 | export LD_LIBRARY_PATH=$BUILD_DIR/opt/lib:$LD_LIBRARY_PATH 153 | 154 | # Start compilation on 80 cores 155 | ./compile -j 80 em_real >& compile.log & 156 | # Watch compilation progress by following the log file: 157 | tail -f compile.log 158 | ``` 159 | 160 | Look for this message at the end of the compilation log: 161 | ``` 162 | ========================================================================== 163 | build started: Wed Feb 1 03:35:15 UTC 2023 164 | build completed: Wed Feb 1 04:00:21 UTC 2023 165 | 166 | ---> Executables successfully built <--- 167 | 168 | -rwxrwxr-x 1 jlinford jlinford 38182176 Feb 1 04:00 main/ndown.exe 169 | -rwxrwxr-x 1 jlinford jlinford 38284448 Feb 1 04:00 main/real.exe 170 | -rwxrwxr-x 1 jlinford jlinford 37837912 Feb 1 04:00 main/tc.exe 171 | -rwxrwxr-x 1 jlinford jlinford 45049424 Feb 1 03:58 main/wrf.exe 172 | 173 | ========================================================================== 174 | ``` 175 | 176 | ### Optional: Compiling with Arm Compiler for Linux 177 | If you plan to use the Arm Compiler for Linux (ACfL) instead of GCC, note that the stanza for ACfL is not yet part of the standard WRF package. Run WRF's configure script then choose an option 1-4 and select default for nesting. This will produce the `configure.wrf' file: 178 | ``` 179 | ./configure 180 | ``` 181 | The `configure.wrf' file then needs modifying as follows. 182 | ``` 183 | sed -i 's/gcc/armclang/g' configure.wrf 184 | sed -i 's/gfortran/armflang/g' configure.wrf 185 | sed -i 's/mpicc/mpicc -DMPI2_SUPPORT/g' configure.wrf 186 | sed -i 's/ -ftree-vectorize//g' configure.wrf 187 | sed -i 's/length-none/length-0/g' configure.wrf 188 | sed -i 's/-frecord-marker\=4/ /g' configure.wrf 189 | sed -i 's/\-w \-O3 \-c/-mcpu=native \-w \-O3 \-c/g' configure.wrf 190 | sed -i 's/\# \-g $(FCNOOPT).*/\-g/g' configure.wrf 191 | sed -i 's/$(FCBASEOPTS_NO_G)/-mcpu=native $(OMP) $(FCBASEOPTS_NO_G)/g' configure.wrf 192 | ``` 193 | ### Optional: Compiling with the NVIDIA HPC SDK 194 | If you plan to use the Compilers that come with the NVIDIA HPC SDK, just like the Arm Compiler for Linux (ACfL), the stanza is not yet part of the standard WRF package. Run WRF's configure script then choose an option 1-4 and select default for nesting. This will produce the `configure.wrf' file: 195 | ``` 196 | ./configure 197 | ``` 198 | The `configure.wrf' file then needs modifying as follows. 
199 | ``` 200 | sed -i 's/gcc/nvc/g' configure.wrf 201 | sed -i 's/gfortran/nvfortran/g' configure.wrf 202 | sed -i 's/ -ftree-vectorize//g' configure.wrf 203 | sed -i 's/ -fopenmp/-mp/g' configure.wrf 204 | sed -i 's/#-fdefault-real-8/-i4/g' configure.wrf 205 | sed -i 's/ -O3/ -O3 -fast -Minfo/g' configure.wrf 206 | sed -i 's/-funroll-loops//g' configure.wrf 207 | sed -i 's/-ffree-form -ffree-line-length-none/-Mfree/g' configure.wrf 208 | sed -i 's/-fconvert=big-endian -frecord-marker=4/-byteswapio/g' configure.wrf 209 | sed -i 's/\# \-g $(FCNOOPT).*/\-g/g' configure.wrf 210 | sed -i 's/$(FCBASEOPTS_NO_G)/\-march=native $(OMP) $(FCBASEOPTS_NO_G)/g' configure.wrf 211 | ``` 212 | 213 | -------------------------------------------------------------------------------- /known_issues.md: -------------------------------------------------------------------------------- 1 | # Recent Updates, Known Issues, and Workarounds 2 | 3 | There is a huge amount of activity in the Arm software ecosystem and improvements are being 4 | made on a daily basis. As a general rule, later versions of compilers and language runtimes 5 | should be used whenever possible. 6 | 7 | ## Recent Updates 8 | The table below includes known recent changes to popular packages that improve performance. If you know of others please let us know. CONTACT 9 | 10 | Package | Version | Improvements 11 | --------|:-:|------------- 12 | bazel | [3.4.1+](https://github.com/bazelbuild/bazel/releases) | Pre-built bazel binary for Arm64. [See below](#bazel-on-linux) for installation. 13 | ffmpeg | 4.3+ | Improved performance of libswscale by 50% with better NEON vectorization which improves the performance and scalability of ffmpeg multi-thread encoders. The changes are available in FFMPEG version 4.3. 14 | HAProxy | 2.4+ | A [serious bug](https://github.com/haproxy/haproxy/issues/958) was fixed. Additionally, building with `CPU=armv81` improves HAProxy performance by 4x so please rebuild your code with this flag. 15 | mongodb | 4.2.15+ / 4.4.7+ / 5.0.0+ | Improved performance on Arm64, especially for internal JS engine. LSE support added in [SERVER-56347](https://jira.mongodb.org/browse/SERVER-56347). 16 | MySQL | 8.0.23+ | Improved spinlock behavior, compiled with -moutline-atomics if compiler supports it. 17 | .NET | [5+](https://dotnet.microsoft.com/download/dotnet/5.0) | [.NET 5 significantly improved performance for ARM64](https://devblogs.microsoft.com/dotnet/Arm64-performance-in-net-5/). Here's an [AWS Blog](https://aws.amazon.com/blogs/compute/powering-net-5-with-aws-graviton2-benchmark-results/) with some performance results. 18 | OpenH264 | [2.1.1+](https://github.com/cisco/openh264/releases/tag/v2.1.1) | Pre-built Cisco OpenH264 binary for Arm64. 19 | PCRE2 | 10.34+ | Added NEON vectorization to PCRE's JIT to match first and pairs of characters. This may improve performance of matching by up to 8x. This fixed version of the library now is shipping with Ubuntu 20.04 and PHP 8. 20 | PHP | 7.4+ | PHP 7.4 includes a number of performance improvements that increase perf by up to 30% 21 | pip | 19.3+ | Enable installation of Python wheel binaries on Arm64. 22 | PyTorch | 1.7+ | Enable Arm64 compilation and NEON optimization for fp32. Install from source. **Note:** *Requires GCC9 or later.* 23 | ruby | 3.0+ | Enable Arm64 optimizations that improve performance by as much as 40%. These changes have also been back-ported to the Ruby shipping with AmazonLinux2, Fedora, and Ubuntu 20.04. 
24 | zlib | 1.2.8+ | For the best performance on Arm64 please use [zlib-cloudflare](https://github.com/cloudflare/zlib). 25 | 26 | 27 | ## Known Issues 28 | 29 | ### Postgres 30 | Postgres performance can be heavily impacted by not using [LSE](langs/c-c++.md#large-system-extensions-lse). 31 | Today, postgres binaries from distributions (e.g. Ubuntu) are not built with `-moutline-atomics` or `-march=armv8.2-a` which would enable LSE. 32 | 33 | In November 2021 PostgreSQL started to distribute Ubuntu 20.04 packages optimized with `-moutline-atomics`. 34 | For Ubuntu 20.04, we recommend using the PostgreSQL PPA instead of the packages distributed by Ubuntu Focal. 35 | Please follow [the instructions to set up the PostgreSQL PPA.](https://www.postgresql.org/download/linux/ubuntu/) 36 | 37 | ### Python Installation on Some Linux Distros 38 | The default installation of pip on some Linux distributions is too old \(<19.3\) to install binary wheel packages released for Arm64. To work around this, it is recommended to upgrade your pip installation using: 39 | ``` 40 | sudo python3 -m pip install --upgrade pip 41 | ``` 42 | 43 | ## Bazel on Linux 44 | The [Bazel build tool](https://www.bazel.build/) now releases a pre-built binary for Arm64. As of October 2020, this is not available in their custom Debian repo, and Bazel does not officially provide an RPM. Instead, we recommend using the [Bazelisk installer](https://docs.bazel.build/versions/master/install-bazelisk.html), which will replace your `bazel` command and [keep bazel up to date](https://github.com/bazelbuild/bazelisk/blob/master/README.md). 45 | 46 | Below is an example using the [latest Arm64 binary release of Bazelisk](https://github.com/bazelbuild/bazelisk/releases/latest) as of October 2020: 47 | ``` 48 | wget https://github.com/bazelbuild/bazelisk/releases/download/v1.7.1/bazelisk-linux-arm64 49 | chmod +x bazelisk-linux-arm64 50 | sudo mv bazelisk-linux-arm64 /usr/local/bin/bazel 51 | bazel 52 | ``` 53 | 54 | Bazelisk itself should not require further updates as its only purpose is to keep Bazel updated. 55 | 56 | ## zlib on Linux 57 | Linux distributions, in general, use the original zlib without any optimizations. zlib-cloudflare has been updated to provide better and faster compression on Arm and x86. To use zlib-cloudflare: 58 | ``` 59 | git clone https://github.com/cloudflare/zlib.git 60 | cd zlib 61 | ./configure --prefix=$HOME 62 | make 63 | make install 64 | ``` 65 | Make sure to have the full path to your lib at $HOME/lib in /etc/ld.so.conf and run ldconfig. 66 | 67 | For users of OpenJDK, which is dynamically linked to the system zlib, you can set LD_LIBRARY_PATH to point to the directory where your newly built version of zlib-cloudflare is located or load that library with LD_PRELOAD. 68 | 69 | You can check the libz that JDK is dynamically linked against with: 70 | ``` 71 | $ ldd /Java/jdk-11.0.8/lib/libzip.so | grep libz 72 | libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7783000) 73 | ``` 74 | -------------------------------------------------------------------------------- /languages/README.md: -------------------------------------------------------------------------------- 1 | # Language-specific Considerations 2 | This guide contains language specific sections with recommendations. If there is no language specific section, it is because there is no specific guidance beyond using a suitably current version of the language. Simply proceed as you would on any other CPUs, Arm-based or otherwise. 
3 | 4 | Broadly speaking, applications built using interpreted or JIT'd languages ([Python](python.md), [Java](java.md), PHP, Node.js, etc.) should run as-is on Arm64. Applications using compiled languages including [C/C++](c-c++.md), [Fortran](fortran.md), [Go](golang.md), and [Rust](rust.md) need to be compiled for the Arm64 architecture. Most modern build systems (Make, CMake, Ninja, etc.) will "just work" on Arm64. Again, if there is no specific guidance it's because everything works _exactly_ the same on Arm64 as on other platforms. 5 | 6 | ## Language Guides 7 | * [C/C++](c-c++.md) 8 | * [Fortran](fortran.md) 9 | * [Python](python.md) 10 | * [Rust](rust.md) 11 | * [Go](golang.md) 12 | * [Java](java.md) 13 | * [.NET](dotnet.md) 14 | 15 | -------------------------------------------------------------------------------- /languages/c-c++.md: -------------------------------------------------------------------------------- 1 | # C/C++ on Arm64 2 | 3 | ## Availability and Standards Compliance 4 | There are many C/C++ compilers available for Arm64 including: 5 | * [NVIDIA HPC Compiler](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html) 6 | * Cray/HPE Compiler 7 | * GCC 8 | * LLVM 9 | * Arm Compiler for Linux 10 | 11 | The NVIDIA HPC Compiler is a direct descendant of the popular PGI C/C++ compiler, so it accepts the same compiler flags. GCC and LLVM operate more-or-less the same on Arm64 as on other architectures except for the `-mcpu` flag, as described below. The Arm Compiler for Linux is based on LLVM and includes additional optimizations for Arm Neoverse cores that make it a highly performant compiler for CPU-only applications. However, it is missing some Fortran 2008 and OpenMP 4.5+ features that may be desirable for GPU-accelerated applications. 12 | 13 | 14 | ## Enabling Arm Architecture Specific Features 15 | For GCC on Arm64, `-mcpu=` acts as both the architecture and tuning flag, and it's generally better to use it vs `-march` or `-mtune` if you're building for a specific CPU. You can find additional details in this [presentation from Arm Inc. to Stony Brook University](https://www.stonybrook.edu/commcms/ookami/_pdf/Linford_OokamiUGM_2022.pdf). 16 | 17 | CPU | Flag | GCC version | LLVM version 18 | ----------|---------|-------------------|------------- 19 | Any Arm64 | `-mcpu=native` | GCC-9+ | Clang/LLVM 10+ 20 | Ampere Altra | `-mcpu=neoverse-n1` | GCC-9+ | Clang/LLVM 10+ 21 | 22 | ## Compilers 23 | Whenever possible, please use the latest compiler version available on your system. Newer compilers provide better support and optimizations for Arm64 processors. Many codes will demonstrate at least 15% better performance when using GCC 10 vs. GCC 7. The table below shows GCC and LLVM compiler versions available in Linux distributions. A starred version marks the default system compiler. 24 | 25 | Distribution | GCC | Clang/LLVM 26 | ----------------|----------------------|------------- 27 | Ubuntu 22.04 | 9, 10, 11*, 12 | 11, 12, 13, 14* 28 | Ubuntu 20.04 | 7, 8, 9*, 10 | 6, 7, 8, 9, 10, 11, 12 29 | Ubuntu 18.04 | 4.8, 5, 6, 7*, 8 | 3.9, 4, 5, 6, 7, 8, 9, 10 30 | Debian 10 | 7, 8* | 6, 7, 8 31 | Red Hat EL8 | 8*, 9, 10 | 10 32 | SUSE Linux ES15 | 7*, 9, 10 | 7 33 | 34 | 35 | ## Large-System Extensions (LSE) 36 | All server-class Arm64 processors support low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes.
On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See [Locks, Synchronization, and Atomics](../optimization/atomics.md) for details. 37 | 38 | ### Enabling LSE 39 | 40 | GCC's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. If you're cross compiling, use the appropriate `-mcpu` option for your target CPU, e.g. `-mcpu=neoverse-n1` for the Ampere Altra CPU. You can check which Arm features GCC will enable with the `-mcpu=native` flag by using this command: 41 | ```bash 42 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 43 | ``` 44 | For example, on the Ampere Altra CPU with GCC 9.4, we see "`__ARM_FEATURE_ATOMICS 1 45 | `" indicating that LSE atomics are enabled: 46 | ```c 47 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 48 | #define __ARM_FEATURE_ATOMICS 1 49 | #define __ARM_FEATURE_UNALIGNED 1 50 | #define __ARM_FEATURE_AES 1 51 | #define __ARM_FEATURE_IDIV 1 52 | #define __ARM_FEATURE_QRDMX 1 53 | #define __ARM_FEATURE_DOTPROD 1 54 | #define __ARM_FEATURE_CRYPTO 1 55 | #define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1 56 | #define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1 57 | #define __ARM_FEATURE_FMA 1 58 | #define __ARM_FEATURE_CLZ 1 59 | #define __ARM_FEATURE_SHA2 1 60 | #define __ARM_FEATURE_CRC32 1 61 | #define __ARM_FEATURE_NUMERIC_MAXMIN 1 62 | ``` 63 | 64 | ### Checking for LSE in a binary 65 | To confirm that LSE instructions are used, the output of the `objdump` command line utility should contain LSE instructions: 66 | ```bash 67 | $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l 68 | ``` 69 | To check whether the application binary contains load and store exclusives: 70 | ```bash 71 | $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l 72 | ``` 73 | 74 | ## Porting SSE or AVX Intrinsics 75 | To quickly get a prototype running on Arm64, one can use 76 | https://github.com/DLTcollab/sse2neon a translator of x64 intrinsics to NEON. 77 | sse2neon provides a quick starting point in porting performance critical codes 78 | to Arm. It shortens the time needed to get an Arm working program that then 79 | can be used to extract profiles and to identify hot paths in the code. A header 80 | file `sse2neon.h` contains several of the functions provided by standard x64 81 | include files like `xmmintrin.h`, only implemented with NEON instructions to 82 | produce the exact semantics of the x64 intrinsic. Once a profile is 83 | established, the hot paths can be rewritten directly with NEON intrinsics to 84 | avoid the overhead of the generic sse2neon translation. 85 | 86 | Note that GCC's `__sync` built-ins are outdated and may be biased towards the x86 memory model. Use `__atomic` versions of these functions instead of the `__sync` versions. See https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html for more details. 87 | 88 | 89 | ## Signed vs. Unsigned char 90 | The C standard doesn't specify the signedness of char. On x86 char is signed by 91 | default while on Arm it is unsigned by default. This can be addressed by using 92 | standard int types that explicitly specify the signedness (e.g. `uint8_t` and `int8_t`) 93 | or compile with `-fsigned-char`. 
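To see the difference in practice, here is a minimal, hypothetical test program (not taken from this guide's examples) that prints whether plain `char` is signed or unsigned on the platform it was compiled for. Building it unchanged on x86 and on Arm64 typically produces different output, while adding `-fsigned-char` (or using explicit `int8_t`/`uint8_t` types) makes the behavior identical on both:

```c
#include <stdio.h>

int main(void) {
    /* 0xFF is 255 as an unsigned char, but -1 when plain char is signed. */
    char c = (char)0xFF;

    if (c < 0) {
        /* Typical result on x86, or on Arm64 when built with -fsigned-char. */
        printf("plain char is signed here (c = %d)\n", c);
    } else {
        /* Typical result on Arm64 with default compiler settings. */
        printf("plain char is unsigned here (c = %d)\n", c);
    }
    return 0;
}
```

Code that stores negative values in a plain `char`, or compares a plain `char` against a negative sentinel such as `EOF`, is the usual source of this class of bug when porting from x86 to Arm64.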
94 | 
95 | 
96 | ## Arm Instructions for Machine Learning
97 | Many Arm64 CPUs support [Arm dot-product instructions](https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions), commonly used for quantized Machine Learning inference workloads, and [half-precision floating point (FP16)](https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-intrinsics). These features enable performant and power-efficient machine learning by doubling the number of operations per second and reducing the memory footprint compared to single-precision floating point (float32), all while still enjoying a large dynamic range. Compiling with `-mcpu=native` will enable these features, when available. See [the examples page](../examples/README.md) for examples of how to use these features in ML frameworks like TensorFlow and PyTorch.
98 | 
-------------------------------------------------------------------------------- /languages/dotnet.md: --------------------------------------------------------------------------------
1 | # .NET on Arm64
2 | .NET 5 is an open-source platform for writing different types of applications. Software engineers can write .NET-based applications in multiple languages such as C#, F#, and Visual Basic. .NET applications are compiled into Common Intermediate Language (CIL). When an application is executed, the Common Language Runtime (CLR) loads the application binary and uses a just-in-time (JIT) compiler to generate machine code for the architecture it is running on. For more information, please see [what is .NET](https://dotnet.microsoft.com/learn/dotnet/what-is-dotnet).
3 | 
4 | 
5 | ## .NET Versions
6 | 
7 | Version | Linux Arm64 | Notes
8 | ------------------|-----------|-------------
9 | [.NET 6](https://dotnet.microsoft.com/download/dotnet/6.0) | Yes | Preview releases provide early access to features that are currently under development. These releases are generally not supported for production use.
10 | [.NET 5](https://dotnet.microsoft.com/download/dotnet/5.0) | Yes | Arm64-specific optimizations in the .NET libraries and the code produced by RyuJIT. [Arm64 Performance in .NET 5](https://devblogs.microsoft.com/dotnet/arm64-performance-in-net-5/)
11 | [.NET Framework 4.x](https://dotnet.microsoft.com/learn/dotnet/what-is-dotnet-framework) | No | The original implementation of the .NET Framework does not support Linux hosts, and Windows is not supported on the NVIDIA Arm HPC Developer Kit.
12 | [.NET Core 3.1](https://dotnet.microsoft.com/download/dotnet/3.1) | Yes | .NET Core 3.0 added support for [Arm64 for Linux](https://docs.microsoft.com/en-us/dotnet/core/whats-new/dotnet-core-3-0#linux-improvements).
13 | 
14 | 
15 | ## .NET 5
16 | With .NET 5, Microsoft made specific Arm64 architecture optimizations. These optimizations were made in the .NET libraries as well as in the machine code output by the JIT process.
17 | 
18 | * AWS DevOps Blog [Build and Deploy .NET web applications to ARM-powered AWS Graviton 2 Amazon ECS Clusters using AWS CDK](https://aws.amazon.com/blogs/devops/build-and-deploy-net-web-applications-to-arm-powered-aws-graviton-2-amazon-ecs-clusters-using-aws-cdk/)
19 | * AWS Compute Blog [Powering .NET 5 with AWS Graviton: Benchmarks](https://aws.amazon.com/blogs/compute/powering-net-5-with-aws-graviton2-benchmark-results/)
20 | * Microsoft .NET Blog [ARM64 Performance in .NET 5](https://devblogs.microsoft.com/dotnet/arm64-performance-in-net-5/)
21 | 
22 | 
23 | ## Building & Publishing for Linux Arm64
24 | The .NET SDK supports choosing a [Runtime Identifier (RID)](https://docs.microsoft.com/en-us/dotnet/core/rid-catalog) used to target the platforms where the application runs. These RIDs are used by .NET dependencies (NuGet packages) to represent platform-specific resources in NuGet packages. The following values are examples of RIDs: linux-arm64, linux-x64, ubuntu.14.04-x64, win7-x64, or osx.10.12-x64. For NuGet packages with native dependencies, the RID designates on which platforms the package can be restored.
25 | 
26 | You can build and publish on any host operating system. As an example, you can develop on Windows and build locally to target Arm64, or you can use a CI server like Jenkins on Linux. The commands are the same.
27 | 
28 | ```bash
29 | dotnet build -r linux-arm64
30 | dotnet publish -c Release -r linux-arm64
31 | ```
32 | 
33 | For more information about [publishing .NET apps with the .NET CLI](https://docs.microsoft.com/en-us/dotnet/core/deploying/deploy-with-cli), please see the official documentation.
34 | 
35 | 
-------------------------------------------------------------------------------- /languages/fortran.md: --------------------------------------------------------------------------------
1 | # Fortran on Arm64
2 | 
3 | ## Availability and Standards Compliance
4 | There are many Fortran compilers available for Arm64 including:
5 | * [NVIDIA HPC Compiler](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html)
6 | * Cray/HPE Compiler
7 | * GCC
8 | * LLVM
9 | * Arm Compiler for Linux
10 | 
11 | The NVIDIA HPC Compiler is a direct descendant of the popular PGI Fortran compiler, so it accepts the same compiler flags. GCC and LLVM operate more-or-less the same on Arm64 as on other architectures except for the `-mcpu` flag, as described below. The Arm Compiler for Linux is based on LLVM and includes additional optimizations for Arm Neoverse cores that make it a highly performant compiler for CPU-only applications. However, it is missing some Fortran 2008 and OpenMP 4.5+ features that may be desirable for GPU-accelerated applications.
12 | 
13 | 
14 | ## Enabling Arm Architecture Specific Features
15 | For gfortran on Arm64, `-mcpu=` acts as both an architecture and a tuning specification, so it's generally better to use it instead of `-march` or `-mtune` when building for a specific CPU. You can find additional details in this [presentation from Arm Inc. to Stony Brook University](https://www.stonybrook.edu/commcms/ookami/_pdf/Linford_OokamiUGM_2022.pdf).
16 | 
17 | CPU | Flag | gfortran version | LLVM version
18 | ----------|---------|-------------------|-------------
19 | Any Arm64 | `-mcpu=native` | gfortran-9+ | flang/LLVM 10+
20 | Ampere Altra | `-mcpu=neoverse-n1` | gfortran-9+ | flang/LLVM 10+
21 | 
22 | 
23 | ## Compilers
24 | Whenever possible, please use the latest compiler version available on your system.
Newer compilers provide better support and optimizations for Arm64 processors. Many codes will demonstrate at least 15% better performance when using GCC 10 vs. GCC 7. The table below shows GCC and LLVM compiler versions available in Linux distributions. Starred version marks the default system compiler. 25 | 26 | Distribution | GCC | Clang/LLVM 27 | ----------------|----------------------|------------- 28 | Ubuntu 22.04 | 9, 10, 11*, 12 | 11, 12, 13, 14* 29 | Ubuntu 20.04 | 7, 8, 9*, 10 | 6, 7, 8, 9, 10, 11, 12 30 | Ubuntu 18.04 | 4.8, 5, 6, 7*, 8 | 3.9, 4, 5, 6, 7, 8, 9, 10 31 | Debian10 | 7, 8* | 6, 7, 8 32 | Red Hat EL8 | 8*, 9, 10 | 10 33 | SUSE Linux ES15 | 7*, 9, 10 | 7 34 | 35 | 36 | ## Large-System Extensions (LSE) 37 | All server-class Arm64 processors support low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See [Locks, Synchronization, and Atomics](../optimization/atomics.md) for details. 38 | 39 | 40 | ### Enabling LSE 41 | gfortran's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. If you're cross compiling, use the appropriate `-mcpu` option for your target CPU, e.g. `-mcpu=neoverse-n1` for the Ampere Altra CPU. You can check which Arm features gfortran will enable with the `-mcpu=native` flag by using this command: 42 | ```bash 43 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 44 | ``` 45 | For example, on the Ampere Altra CPU with GCC 9.4, we see "`__ARM_FEATURE_ATOMICS 1 46 | `" indicating that LSE atomics are enabled: 47 | ```c 48 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 49 | #define __ARM_FEATURE_ATOMICS 1 50 | #define __ARM_FEATURE_UNALIGNED 1 51 | #define __ARM_FEATURE_AES 1 52 | #define __ARM_FEATURE_IDIV 1 53 | #define __ARM_FEATURE_QRDMX 1 54 | #define __ARM_FEATURE_DOTPROD 1 55 | #define __ARM_FEATURE_CRYPTO 1 56 | #define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1 57 | #define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1 58 | #define __ARM_FEATURE_FMA 1 59 | #define __ARM_FEATURE_CLZ 1 60 | #define __ARM_FEATURE_SHA2 1 61 | #define __ARM_FEATURE_CRC32 1 62 | #define __ARM_FEATURE_NUMERIC_MAXMIN 1 63 | ``` 64 | 65 | ### Checking for LSE in a binary 66 | To confirm that LSE instructions are used, the output of the `objdump` command line utility should contain LSE instructions: 67 | ```bash 68 | $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l 69 | ``` 70 | To check whether the application binary contains load and store exclusives: 71 | ```bash 72 | $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l 73 | ``` 74 | -------------------------------------------------------------------------------- /languages/golang.md: -------------------------------------------------------------------------------- 1 | # Go on Arm64 2 | 3 | Go is a statically typed, compiled programming language originally designed at Google. Go supports Arm64 out of the box and is available in all common Linux distributions. Make sure to use the latest version of the Go compiler and toolchain to get the best performance on Arm64. 
4 | 5 | Here are some noteworthy performance upgrades: 6 | 7 | ## Go 1.18 \[released 2022/03/14\] 8 | The main implementation of the Go compiler, [golang/go](https://github.com/golang/go), has improved 9 | performance on Arm64 by implementing a new way of passing function arguments and results using registers instead of the stack. This change has been available on x86-64 since 1.17, where it brought performance improvements of about 5%. On Arm this change typically gives even higher performance improvements of 10% or more. 10 | 11 | To learn more about the use cases benefiting from Go 1.18's performance improvements, check the blog post: [Making your Go workloads up to 20% faster with Go 1.18 and AWS Graviton](https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/). 12 | 13 | 14 | ## Go 1.17 \[released 2021/08/16\] 15 | The main implementation of the Go compiler, [golang/go](https://github.com/golang/go), has improved 16 | performance for the following standard library packages: 17 | 18 | - crypto/ed25519 - the package has been rewritten, and all operations are now approximately twice as fast on Arm64. 19 | - crypto/elliptic - CurveParams methods now automatically invoke faster and safer dedicated implementations for known curves (P-224, P-256, and P-521) when available. The P521 curve implementation has also been rewritten and is now constant-time and three times faster on Arm64. 20 | 21 | 22 | ## Go 1.16 \[released 2021/02/16\] 23 | The main implementation of the Go compiler, [golang/go](https://github.com/golang/go), has improved 24 | performance on Arm with couple of changes listed below. Building your project with Go 1.16 will give you these improvements: 25 | 26 | * [ARMv8.1-A LSE atomic instructions](https://go-review.googlesource.com/c/go/+/234217), which dramatically improve mutex fairness and speed on modern Arm cores with v8.1 and newer instruction set. 27 | * [copy performance improvements](https://go-review.googlesource.com/c/go/+/243357), especially when the addresses are unaligned. 28 | 29 | 30 | ## Recently updated packages 31 | Changes to commonly used packages that improve performance on Arm64 can make a noticeable difference in 32 | some cases. Here is a partial list of packages to be aware of. 33 | 34 | Package | Version | Improvements 35 | ----------|-----------|------------- 36 | [Snappy](https://github.com/golang/snappy) | as of commit [196ae77](https://github.com/golang/snappy/commit/196ae77b8a26000fa30caa8b2b541e09674dbc43) | assembly implementations of the hot path functions were ported from arm64 37 | 38 | -------------------------------------------------------------------------------- /languages/java.md: -------------------------------------------------------------------------------- 1 | # Java on Arm64 2 | 3 | Java is a general-purpose programming language. Compiled Java code can run on all platforms that support Java, without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture. _[Wikipedia](https://en.wikipedia.org/wiki/Java_(programming_language))_ 4 | 5 | Java is well supported and generally performant out-of-the-box on Arm64. While Java 8 is fully supported on Arm64, some customers haven't been able to obtain the CPU's full performance benefit until they switched to Java 11. 6 | 7 | This page includes specific details about building and tuning Java applications on Arm64. 
8 | 9 | ## Java JVM Options 10 | There are numerous options that control the JVM and may lead to better performance. Two that have shown large (1.5x) improvements in some Java workloads are: 11 | 12 | * Eliminating tiered compilation: `-XX:-TieredCompilation` 13 | * Restricting the size of the code cache to enable Arm64 cores to better predict branches: `-XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M` 14 | 15 | These are helpful on some workloads but can hurt on others so testing with and without them is essential. 16 | 17 | ## Java Stack Size 18 | For some JVMs, the default stack size for Java threads (i.e. `ThreadStackSize`) is 2MB on Arm64 instead of the 1MB used on x86_64. You can check the default with: 19 | ``` 20 | $ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize 21 | intx CompilerThreadStackSize = 2048 {pd product} {default} 22 | intx ThreadStackSize = 2048 {pd product} {default} 23 | intx VMThreadStackSize = 2048 {pd product} {default} 24 | ``` 25 | The default can be easily changed on the command line with either `-XX:ThreadStackSize=` or `-Xss`. Notice that `-XX:ThreadStackSize` interprets its argument as kilobytes whereas `-Xss` interprets it as bytes. So `-XX:ThreadStackSize=1024` and `-Xss1m` will both set the stack size for Java threads to 1 megabyte: 26 | ``` 27 | $ java -Xss1m -XX:+PrintFlagsFinal -version | grep ThreadStackSize 28 | intx CompilerThreadStackSize = 2048 {pd product} {default} 29 | intx ThreadStackSize = 1024 {pd product} {command line} 30 | intx VMThreadStackSize = 2048 {pd product} {default} 31 | ``` 32 | 33 | Usually, there's no need to change the default because the thread stack will be committed lazily as it grows. No matter the default, the thread will always only commit as much stack as it really uses (at page size granularity). However, there's one exception to this rule if [Transparent Huge Pages](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html) (THP) are turned on by default on a system. In that case, the THP page size of 2MB matches exactly with the 2MB default stack size on Arm64 and most stacks will be backed up by a single huge page of 2MB. This means that the stack will be completely committed to memory right from the start. If you're using hundreds or even thousands of threads, this memory overhead can be considerable. 34 | 35 | To mitigate this issue, you can either manually change the stack size on the command line (as described above) or you can change the default for THP from `always` to `madvise`: 36 | ``` 37 | # cat /sys/kernel/mm/transparent_hugepage/enabled 38 | [always] madvise never 39 | # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled 40 | # cat /sys/kernel/mm/transparent_hugepage/enabled 41 | always [madvise] never 42 | ``` 43 | 44 | Notice that even if the the default is changed from `always` to `madvise`, the JVM can still use THP for the Java heap and code cache if you specify `-XX:+UseTransparentHugePages` on the command line. 45 | 46 | ## Looking for x86 shared-objects in JARs 47 | Java JARs can include shared-objects that are architecture specific. Some Java libraries check 48 | if these shared objects are found and if they are they use a JNI to call to the native library 49 | instead of relying on a generic Java implementation of the function. While the code might work, 50 | without the JNI the performance can suffer. 
51 | 52 | A quick way to check if a JAR contains such shared objects is to simply unzip it and check if 53 | any of the resulting files are shared-objects and if an Arm64 shared-object is missing: 54 | ``` 55 | $ unzip foo.jar 56 | $ find . -name "*.so" -exec file {} \; 57 | ``` 58 | For each x86-64 ELF file, check there is a corresponding aarch64 ELF file in the binaries. With some common packages (e.g. commons-crypto) we've seen that even though a JAR can be built supporting Arm manually, artifact repositories such as Maven don't have updated versions. 59 | 60 | ## Building multi-arch JARs 61 | Java is meant to be a write once, and run anywhere language. When building Java artifacts that 62 | contain native code, it is important to build those libraries for each major architecture to provide 63 | a seamless and optimally performing experience for all consumers. Code that runs well on both Arm64 and x86 64 | increases the package's utility. 65 | 66 | There is nominally a multi-step process to build the native shared objects for each supported architecture before doing the final packaging with Maven, SBT, Gradle etc. Below is an example of how to create your JAR using Maven that contains shared libraries for multiple distributions and architectures: 67 | 68 | ``` 69 | # You will need access to two systems: one x86 and one Arm64. 70 | 71 | # On the x86 system: 72 | $ cd java-lib 73 | $ mvn package 74 | $ find target/ -name "*.so" -type f -print 75 | 76 | # Note the directory this so file is in. It will be in a directory 77 | # such as: target/classes/org/your/class/hierarchy/native/OS/ARCH/lib.so 78 | 79 | # Now, log into the Arm64 system 80 | $ cd java-lib 81 | $ mvn package 82 | 83 | # Repeat the below two steps for each OS and ARCH combination you want to release 84 | $ mkdir target/classes/org/your/class/hierarchy/native/OS/ARCH 85 | $ scp other-system:~/your-java-lib/target/classes/org/your/class/hierarchy/native/OS/ARCH/lib.so target/classes/org/your/class/hierarchy/native/OS/ARCH/ 86 | 87 | # Create the jar packaging with maven. It will include the additional 88 | # native libraries even though they were not built directly by this maven process. 89 | $ mvn package 90 | 91 | # When creating a single Jar for all platform native libraries, 92 | # the release plugin's configuration must be modified to specify 93 | # the plugin's `preparationGoals` to not include the clean goal. 94 | # See http://maven.apache.org/maven-release/maven-release-plugin/prepare-mojo.html#preparationGoals 95 | # For more details. 96 | 97 | # To do a release to Maven Central and/or Sonatype Nexus: 98 | $ mvn release:prepare 99 | $ mvn release:perform 100 | ``` 101 | 102 | This is one way to do the JAR packaging with all the libraries in a single JAR. To build all the JARs, we recommend to build on native machines, but it can also be done via Docker using the buildx plug-in, or by cross-compiling inside your build-environment. 103 | 104 | There are two additional options for releasing jars with native code. You can use a manager plugin such as the [nar maven plugin](https://maven-nar.github.io/) to manage each platform specific Jar. Or, you can release individual architecture specific jars, and then use one system to download these released jars and package them into a combined Jar with a final `mvn release:perform`. An example of this method can be found in the [Leveldbjni-native](https://github.com/fusesource/leveldbjni) `pom.xml` files. 
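
To make the packaging flow above concrete, here is a hedged sketch of the kind of per-architecture native source such a JAR would carry. The package, class, and method names (`org.example.NativeLib.add`) are purely illustrative:

```c
#include <jni.h>

/*
 * Illustrative native implementation for a hypothetical Java class:
 *
 *   package org.example;
 *   public class NativeLib { public static native int add(int a, int b); }
 *
 * This file is compiled separately on each target (e.g. linux-x86_64 and
 * linux-aarch64), and each resulting shared object is placed in its own
 * OS/ARCH directory inside the JAR as described above.
 */
JNIEXPORT jint JNICALL
Java_org_example_NativeLib_add(JNIEnv *env, jclass clazz, jint a, jint b)
{
    (void)env;   /* unused in this trivial example */
    (void)clazz;
    return a + b;
}
```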
105 | 106 | 107 | ## Profiling Java applications 108 | For languages that rely on a JIT (such an Java), the symbol information that is captured is lacking, making it difficult to understand where runtime is being consumed. Similar to the code profiling example above, `libperf-jvmti.so` can be used to dump symbols for JITed code as the JVM runs. 109 | 110 | ```bash 111 | # Compile your Java application with -g 112 | 113 | # find where libperf-jvmti.so is on your distribution 114 | 115 | # Run your java app with -agentpath:/path/to/libperf-jvmti.so added to the command line 116 | # Launch perf record on the system 117 | $ perf record -g -k 1 -a -o perf.data sleep 5 118 | 119 | # Inject the generated methods information into the perf.data file 120 | $ perf inject -j -i perf.data -o perf.data.jit 121 | 122 | # Process the new file, for instance via Brendan Gregg's Flamegraph tools 123 | $ perf script -i perf.data.jit | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > ./flamegraph.svg 124 | ``` -------------------------------------------------------------------------------- /languages/python.md: -------------------------------------------------------------------------------- 1 | # Python on Arm64 2 | 3 | Python is an interpreted, high-level, general-purpose programming language, with interpreters available for many operating systems and architectures, including Arm64. _[Read more on Wikipedia](https://en.wikipedia.org/wiki/Python_(programming_language))_ 4 | 5 | ## Installing Python packages 6 | 7 | When *pip* (the standard package installer for Python) is used, it pulls the packages from [Python Package Index](https://pypi.org) and other indexes. 8 | 9 | In the case *pip* could not find a pre-compiled package, it automatically downloads, compiles, and builds the package from source code. 10 | Normally it may take a few more minutes to install the package from source code than from pre-built. For some large packages (i.e. *pandas*) 11 | it may take up to 20 minutes. In some cases, compilation may fail due to missing dependencies. 12 | 13 | ### Prerequisites for installing Python packages from source 14 | 15 | For installing common Python packages from source code, we need to install the following development tools: 16 | 17 | On **RedHat**: 18 | ```bash 19 | sudo yum install "@Development tools" python3-pip python3-devel blas-devel gcc-gfortran lapack-devel 20 | python3 -m pip install --user --upgrade pip 21 | ``` 22 | 23 | On **Debian/Ubuntu**: 24 | ```bash 25 | sudo apt update 26 | sudo apt-get install build-essential python3-pip python3-dev libblas-dev gfortran liblapack-dev 27 | python3 -m pip install --user --upgrade pip 28 | ``` 29 | 30 | On all distributions, additional compile time dependencies might be needed depending on the Python modules you are trying to install. 31 | 32 | ### Recommended versions 33 | 34 | When adopting Arm64, it is recommended to use recent software versions as much as possible, and Python is no exception. 35 | 36 | Python 2.7 is EOL since January the 1st 2020, so it is definitely recommended to upgrade to a Python 3.x version before moving to Arm64. 37 | 38 | Python 3.6 will reach [EOL in December, 2021](https://www.python.org/dev/peps/pep-0494/#lifespan), so when starting to port an application to Arm64, it is recommended to target at least Python 3.7. 
39 | 40 | ### Python on Centos 8 and RHEL 8 41 | 42 | Centos 8 and RHEL 8 distribute Python 3.6 which is 43 | [scheduled for end of life in December, 2021](https://www.python.org/dev/peps/pep-0494/#lifespan). 44 | However as of May 2021, some package maintainers have already begun dropping support for 45 | Python 3.6 by omitting prebuilt wheels published to [pypi.org](https://pypi.org). 46 | For some packages, it is still possible to use Python 3.6 by using the distribution 47 | from the package manager. For example `numpy` no longer publishes Python 3.6 wheels, 48 | but can be installed from the package manager: `yum install python3-numpy`. 49 | 50 | Another option is to use Python 3.8 instead of the default Python package. You can 51 | install Python 3.8 with pip: `yum install python38-pip`. Then use pip to install 52 | the latest versions of packages: `pip3 install numpy`. 53 | 54 | Some common Python packages that are distributed by the package manager are: 55 | * python3-numpy 56 | * python3-markupsafe 57 | * python3-pillow 58 | 59 | To see a full list run `yum search python3` 60 | 61 | 62 | ## Scientific and numerical applications (NumPy, SciPy, BLAS, etc) 63 | 64 | Python relies on native code to achieve high performance. For scientific and 65 | numerical applications, NumPy and SciPy provide an interface to high performance 66 | computing libraries such as ATLAS, BLAS, BLIS, OpenBLAS, etc. These libraries 67 | contain code tuned for Arm64 processors, and especially the Arm Neoverse-N1 core 68 | which powers the Ampere Altra CPU in the NVIDIA Arm HPC Developer Kit. 69 | 70 | It is recommended to use the latest software versions as much as possible. If the latest 71 | version migration is not feasible, please ensure that it is at least the minimum version 72 | recommended below. Multiple fixes related to data precision and correctness on 73 | Arm64 went into OpenBLAS between v0.3.9 and v0.3.17 and the below SciPy and NumPy versions 74 | upgraded OpenBLAS from v0.3.9 to OpenBLAS v0.3.17. 75 | * OpenBLAS: >= v0.3.17 76 | * SciPy: >= v1.7.2 77 | * NumPy: >= 1.21.1 78 | 79 | ### BLIS may be a faster BLAS 80 | 81 | The default SciPy and NumPy binary installations with `pip3 install numpy scipy` 82 | are configured to use OpenBLAS. The default installations of SciPy and NumPy 83 | are easy to setup and well tested. 84 | 85 | Some workloads will benefit from using BLIS. Benchmarking SciPy and NumPy 86 | workloads with BLIS might allow to identify additional performance improvement. 87 | 88 | ### Install NumPy and SciPy with BLIS on Ubuntu and Debian 89 | 90 | On Ubuntu and Debian `apt install python3-numpy python3-scipy` will install NumPy 91 | and SciPy with BLAS and LAPACK libraries. To install SciPy and NumPy with BLIS 92 | and OpenBLAS on Ubuntu and Debian: 93 | ``` 94 | sudo apt -y install python3-scipy python3-numpy libopenblas-dev libblis-dev 95 | sudo update-alternatives --set libblas.so.3-aarch64-linux-gnu \ 96 | /usr/lib/aarch64-linux-gnu/blis-openmp/libblas.so.3 97 | ``` 98 | 99 | To switch between available alternatives: 100 | ``` 101 | sudo update-alternatives --config libblas.so.3-aarch64-linux-gnu 102 | sudo update-alternatives --config liblapack.so.3-aarch64-linux-gnu 103 | ``` 104 | 105 | ### Install NumPy and SciPy with BLIS on RedHat 106 | 107 | As of June 20th, 2020, NumPy now [provides binaries](https://pypi.org/project/numpy/#files) for Arm64. 
108 | 109 | Prerequisites to build SciPy and NumPy with BLIS on RedHat: 110 | ```bash 111 | # Install RedHat prerequisites 112 | sudo yum install "@Development tools" python3-pip python3-devel blas-devel gcc-gfortran 113 | 114 | # Install BLIS 115 | git clone https://github.com/flame/blis $HOME/blis 116 | cd $HOME/blis; 117 | ./configure --enable-threading=openmp --enable-cblas --prefix=/usr cortexa57 118 | make -j; sudo make install 119 | 120 | # Install OpenBLAS 121 | git clone https://github.com/xianyi/OpenBLAS.git $HOME/OpenBLAS 122 | cd $HOME/OpenBLAS 123 | make -j BINARY=64 FC=gfortran USE_OPENMP=1 124 | sudo make PREFIX=/usr install 125 | ``` 126 | 127 | To build and install NumPy and SciPy with BLIS and OpenBLAS: 128 | ```bash 129 | git clone https://github.com/numpy/numpy/ $HOME/numpy 130 | cd $HOME/numpy 131 | pip3 install . 132 | 133 | git clone https://github.com/scipy/scipy/ $HOME/scipy 134 | cd $HOME/scipy 135 | pip3 install . 136 | ``` 137 | 138 | When NumPy and SciPy detect the presence of the BLIS library at build time, they 139 | will use BLIS in priority over the same functionality from BLAS and 140 | OpenBLAS. OpenBLAS or LAPACK libraries need to be installed along BLIS to 141 | provide LAPACK functionality. To change the library dependencies, one can set 142 | environment variables `NPY_BLAS_ORDER` and `NPY_LAPACK_ORDER` before building numpy 143 | and scipy. The default is: 144 | `NPY_BLAS_ORDER=mkl,blis,openblas,atlas,accelerate,blas` and 145 | `NPY_LAPACK_ORDER=mkl,openblas,libflame,atlas,accelerate,lapack`. 146 | 147 | ### Testing NumPy and SciPy installation 148 | 149 | To test that the installed NumPy and SciPy are built with BLIS and OpenBLAS, the 150 | following commands will print native library dependencies: 151 | ```bash 152 | python3 -c "import numpy as np; np.__config__.show()" 153 | python3 -c "import scipy as sp; sp.__config__.show()" 154 | ``` 155 | 156 | In the case of Ubuntu and Debian these commands will print `blas` and `lapack` 157 | which are symbolic links managed by `update-alternatives`. 158 | 159 | 160 | ### Improving BLIS and OpenBLAS performance with multi-threading 161 | When OpenBLAS is built with `USE_OPENMP=1` it will use OpenMP to parallelize the 162 | computations. The environment variable `OMP_NUM_THREADS` can be set to specify 163 | the maximum number of threads. If this variable is not set, the default is to 164 | use a single thread. 165 | 166 | To enable parallelism with BLIS, one needs to both configure with 167 | `--enable-threading=openmp` and set the environment variable `BLIS_NUM_THREADS` 168 | to the number of threads to use, the default is to use a single thread. 169 | 170 | 171 | ### Arm64 support in Conda / Anaconda 172 | Anaconda is a distribution of the Python and R programming languages for scientific computing, that aims to simplify package management and deployment. 173 | 174 | Anaconda has announced [support for Arm64 via AWS Graviton 2 on May 14, 2021](https://www.anaconda.com/blog/anaconda-aws-graviton2). The Ampere Altra CPU found in the NVIDIA Arm HPC DevKit is based on the same Arm Neoverse-N1 core as the AWS Gravition 2, so Anaconda also supports the Ampere Altra. Instructions to install the full Anaconda package installer can be found at https://docs.anaconda.com/anaconda/install/graviton2/. 
175 | 
176 | Anaconda also offers a lightweight version called [Miniconda](https://docs.conda.io/en/latest/miniconda.html) which is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others.
177 | 
178 | See [the Anaconda, Miniconda, Conda, Mamba example](../examples/anaconda.md) for details.
179 | 
180 | 
181 | ## Other common Python packages
182 | 
183 | ### Pillow
184 | Pillow 8.x and later provide binary wheels for all Arm64 platforms, including OSes with 64kB pages like RedHat/CentOS 8.
185 | ```bash
186 | pip3 install --user pillow
187 | ```
188 | should work across all platforms we tested.
189 | 
190 | 
191 | ## Machine Learning Python packages
192 | 
193 | ### PyTorch
194 | PyTorch wheels for nightly builds (CPU builds) are available for Arm64 since PyTorch 1.8.
195 | ```bash
196 | pip install numpy
197 | pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
198 | ```
199 | 
200 | ### DGL
201 | Install PyTorch as described above, then follow the [install from source instructions](https://github.com/dmlc/dgl/blob/master/docs/source/install/index.rst#install-from-source).
202 | 
203 | 
204 | ### Sentencepiece
205 | Install PyTorch as described above. Then:
206 | ```
207 | # get the source and build/install the libraries
208 | git clone https://github.com/google/sentencepiece
209 | cd sentencepiece
210 | mkdir build
211 | cd build
212 | cmake ..
213 | make -j
214 | sudo make install
215 | sudo ldconfig -v
216 | 
217 | cd ../python
218 | vi make_py_wheel.sh
219 | # change the manylinux1_{$1} to manylinux2014_{$1}
220 | 
221 | sudo python3 setup.py install
222 | ```
223 | 
224 | With the above steps, the wheel should be installed.
225 | 
226 | *Important*: Before calling any Python script or starting Python, one of the libraries needs to be set as a preload for Python:
227 | 
228 | ```
229 | export LD_PRELOAD=/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/$LD_PRELOAD
230 | python3
231 | ```
232 | 
233 | ### Morfeusz
234 | 
235 | ```
236 | # download the source
237 | wget http://download.sgjp.pl/morfeusz/20200913/morfeusz-src-20200913.tar.gz
238 | tar -xf morfeusz-src-20200913.tar.gz
239 | cd Morfeusz/
240 | sudo apt install cmake zip build-essential autotools-dev \
241 | python3-stdeb python3-pip python3-all-dev python3-pyparsing devscripts \
242 | libcppunit-dev acl default-jdk swig python3-all-dev python3-stdeb
243 | export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
244 | mkdir build
245 | cd build
246 | cmake ..
247 | sudo make install
248 | sudo ldconfig -v
249 | sudo PYTHONPATH=/usr/local/lib/python make install-builder
250 | ```
251 | 
252 | If you run into issues with the last command (_make install-builder_), please try:
253 | ```
254 | sudo PYTHONPATH=`which python3` make install-builder
255 | ```
256 | 
-------------------------------------------------------------------------------- /languages/rust.md: --------------------------------------------------------------------------------
1 | # Rust on Arm64
2 | 
3 | Rust is supported on Linux/Arm64 systems as a Tier 1 platform, just like x86.
4 | 
5 | ### Large-System Extensions (LSE)
6 | 
7 | LSE improves system throughput for CPU-to-CPU communication, locks, and mutexes.
8 | The improvement can be up to an order of magnitude when using LSE instead of
9 | load/store exclusives.
LSE can be enabled in Rust and we've seen cases on 10 | larger machines where performance is improved by over 3x by setting the `RUSTFLAGS` 11 | environment variable and rebuilding your project. 12 | 13 | ``` 14 | export RUSTFLAGS="-Ctarget-cpu=neoverse-n1" 15 | cargo build --release 16 | ``` 17 | -------------------------------------------------------------------------------- /optimization/README.md: -------------------------------------------------------------------------------- 1 | # Optimizing for Arm64 2 | 3 | ## Detecting Arm Hardware Capabilities at Runtime 4 | 5 | There are several ways to determine the available Arm CPU resources and topology at runtime, including: 6 | 7 | * CPU architecture and supported instructions 8 | * CPU manufacturer 9 | * Number of CPU sockets 10 | * CPU cores per socket 11 | * Number of NUMA nodes 12 | * Number of NUMA nodes per socket 13 | * CPU cores per NUMA node 14 | 15 | See [Runtime CPU Detection](cpudetect.md) for more details and example code. 16 | 17 | 18 | ## Debugging Problems 19 | 20 | It's possible that incorrect code will work fine on an existing system, but 21 | produce an incorrect result when using a new compiler. This could be because 22 | it relies on undefined behavior in the language (e.g. assuming char is signed in C/C++, 23 | or the behavior of signed integer overflow), contains memory management bugs that 24 | happen to be exposed by aggressive compiler optimizations, or incorrect ordering. 25 | Below are some techniques / tools we have used to find issues 26 | while migrating our internal services to newer compilers and Arm64. 27 | 28 | ### Using Sanitizers 29 | The compiler may generate code and layout data slightly differently on Arm64 30 | compared to an x86 system and this could expose latent memory bugs that were previously 31 | hidden. On GCC, the easiest way to look for these bugs is to compile with the 32 | memory sanitizers by adding the below to standard compiler flags: 33 | 34 | ``` 35 | CFLAGS += -fsanitize=address -fsanitize=undefined 36 | LDFLAGS += -fsanitize=address -fsanitize=undefined 37 | ``` 38 | 39 | Run the resulting binary, and any bugs detected by the sanitizers will cause 40 | the program to exit immediately and print helpful stack traces and other 41 | information. 42 | 43 | ### Ordering issues 44 | Arm is weakly ordered, similar to POWER and other modern architectures, while 45 | x86 is a variant of total-store-ordering (TSO). 46 | Code that relies on TSO may lack barriers to properly order memory references. 47 | Arm64 systems are [weakly ordered multi-copy-atomic](https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf). 48 | 49 | While TSO allows reads to occur out-of-order with writes and a processor to 50 | observe its own write before it is visible to others, the Armv8 memory model has 51 | further relaxations for performance and power efficiency. 52 | **Code relying on pthread mutexes or locking abstractions 53 | found in C++, Java or other languages shouldn't notice any difference.** Code that 54 | has a bespoke implementation of lockless data structures or implements its own 55 | synchronization primitives will have to use the proper intrinsics and 56 | barriers to correctly order memory transactions. See [Locks, Synchronization, and Atomics](atomics.md) for more information. 57 | 58 | 59 | ### Architecture specific optimization 60 | Sometimes code will have architecture specific optimizations. 
These can take many forms: 61 | sometimes the code is optimized in assembly using specific instructions for 62 | [CRC](https://github.com/php/php-src/commit/2a535a9707c89502df8bc0bd785f2e9192929422), 63 | other times the code could be enabling a [feature](https://github.com/lz4/lz4/commit/605d811e6cc94736dd609c644404dd24c013fd6f) 64 | that has been shown to work well on particular architectures. A quick way to see if any optimizations 65 | are missing for Arm64 is to grep the code for "`__x86_64__`" preprocessor branches (`ifdef`) and see if there 66 | is corresponding Arm64 code there too. If not, that might be something to improve. 67 | 68 | ### Lock/Synchronization intensive workload 69 | All server-class Arm64 processors support low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See [Locks, Synchronization, and Atomics](atomics.md) for details. 70 | 71 | ## Profiling the code 72 | If you aren't getting the performance you expect, one of the best ways to understand what is 73 | going on in the system is to compare profiles of execution and understand where the CPU cores are 74 | spending time. This will frequently point to a hot function that could be optimized. 75 | 76 | Install the Linux perf tool: 77 | ```bash 78 | # Redhat 79 | sudo yum install perf 80 | 81 | # Ubuntu 82 | sudo apt-get install linux-tools-$(uname -r) 83 | ``` 84 | 85 | Record a profile: 86 | ```bash 87 | # If the program is run interactively 88 | sudo perf record -g -F99 -o perf.data ./your_program 89 | 90 | # If the program is a service, sample all cpus (-a) and run for 60 seconds while the system is loaded 91 | sudo perf record -ag -F99 -o perf.data sleep 60 92 | ``` 93 | 94 | Look at the profile: 95 | ```bash 96 | perf report 97 | ``` 98 | 99 | Additionally, there is a tool that will generate a visual representation of the output which can sometimes 100 | be more useful: 101 | ```bash 102 | git clone https://github.com/brendangregg/FlameGraph.git 103 | perf script -i perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg 104 | ``` -------------------------------------------------------------------------------- /optimization/atomics.md: -------------------------------------------------------------------------------- 1 | # Locks, Synchronization, and Atomics 2 | 3 | Synchronization is a hot topic during the software migration process. Arm64 systems typically have more CPU cores than other architectures, so efficient synchronization is critical to achieving good performance. Synchronization is also a complex and nuanced topic. If you're looking for a high level overview, you'll find it here. If you would like a more detailed look, Arm Inc. have published an excellent whitepaper on this topic. You can deep dive this topic by reading [Synchronization Overview and Case Study on Arm Architecture 4 | ](https://developer.arm.com/documentation/107630/1-0/?lang=en). 5 | 6 | 7 | ## The Arm Memory Model 8 | One of the most significant differences between Arm and X86 CPUs is their memory model: the Arm architecture has a weak memory model that differs from the x86 architecture TSO (Total Store Order) model. Different memory models can cause low-level codes to function well on one architecture but encounter performance problem or failure on the other. 
The good news is that Arm's more relaxed memory model allows for more compiler and hardware optimization to boost system performance. 9 | 10 | **Generally speaking, you should only care about the Arm memory model if you are writing low level code, e.g. assembly language.** The details of Arm's memory model lie well below the application level and will be completely invisible to most users. If you are writing in a high level language like C/C++ or Fortran, you do not need to know the nuances of Arm's memory model. The one exception to this general rule is code that uses boutique synchronization constructs instead of standard best practices, e.g. using `volatile` as a means of thread synchronization. Deviating from established standards or ignoring best practices results in code that is almost guaranteed to be broken. It should be rewritten using system-provided locks, conditions, etc. and the `stdatomic` tools. Here's an example of such a bug: https://github.com/ParRes/Kernels/issues/611 11 | 12 | Arm isn't the only architecture using a weak memory model. If your application already runs on CPUs that aren't x86-based, you're likely to encounter fewer bugs related to the weak memory model. In particular, if your application has been ported to a CPU implementing the POWER architecture (e.g. IBM POWER9) then it is likely to work perfectly on the Arm memory model. 13 | 14 | 15 | ## Large-System Extension (LSE) Atomic Instructions 16 | All server-class Arm64 processors have support for the Large-System Extension (LSE) which was first introduced in Armv8.1. LSE provides low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. Note that this is not generally true for older Arm64 CPUs like the Marvell ThunderX2 or the Fujitsu A64FX. Please see [these slides from the ISC 2022 AHUG Workshop](https://agenda.isc-hpc.com/media/slides_pdf/0900_Arm_HPC_User_Group_at_ISC22_wxIExtw.pdf) for more details. 17 | 18 | You'll get the best performance from a POSIX threads library that uses LSE atomic instructions. LSE atomics are important for locking and thread synchronization routines. Many Linux distributions provide a libc compiled with LSE instructions. For example: 19 | - RHEL 8.4 and later 20 | - Ubuntu 20.04 and later 21 | 22 | Some distributions need an additional package to support LSE. For example, Ubuntu 18.04 needs `apt install libc6-lse`. Please see [the operating systems page](../software/os.md) for more details. 23 | 24 | Using atomics is not just a good idea, it's basically required for writing highly parallel code. High core count systems using exclusive instructions instead of atomics are not guaranteed to make progress. One core complex can monopolize a resource while the other starves. 25 | 26 | When building an application from source, the compiler needs to generate LSE atomic instructions for applications that use atomic operations. For example, the code of databases like PostgreSQL contain atomic constructs; c++11 code with std::atomic statements translate into atomic operations. Since GCC 9.4, GCC's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. 
To confirm that LSE instructions are created, the output of `objdump` command line utility should contain LSE instructions: 27 | ```bash 28 | $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l 29 | ``` 30 | To check whether the application binary contains load and store exclusives: 31 | ```bash 32 | $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l 33 | ``` 34 | 35 | ## Language-specific Guidance 36 | 37 | Check the [Languages](../languages/README.md) page for any language-specific guidance related to LSE, locking, synchronization, and atomics. If no guide is provided then there are no Arm-related specific issues for that language. Just proceed as you would on any other platform. 38 | -------------------------------------------------------------------------------- /optimization/cpudetect.md: -------------------------------------------------------------------------------- 1 | # Runtime CPU Detection 2 | 3 | There are several ways to determine the available Arm CPU resources and topology at runtime, including: 4 | 5 | * CPU architecture and supported instructions 6 | * CPU manufacturer 7 | * Number of CPU sockets 8 | * CPU cores per socket 9 | * Number of NUMA nodes 10 | * Number of NUMA nodes per socket 11 | * CPU cores per NUMA node 12 | 13 | Well-established portable libraries like libnuma and hwloc are a great choice on Arm64. You can also use Arm's CPUID registers or query OS files. Since many of these methods serve the same function, you should choose the method that best fits your application. 14 | 15 | If you're implementing your own approach, then please look at the Arm Architecture Registers, and especially the Main ID Register (MIDR_EL1): https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register 16 | 17 | The source code for the `lscpu` utility is a great example of how to retrieve and use these registers. For example, here's how to translate the CPU part number in the MIDR_EL1 register to a human-readable string: https://github.com/util-linux/util-linux/blob/master/sys-utils/lscpu-arm.c 18 | 19 | Here's the output of `lscpu` on an AWS Graviton 3. 20 | ``` 21 | [jlinford@c7g-16xlarge-dy-c7g16xlarge-1 ~]$ lscpu 22 | Architecture: aarch64 23 | Byte Order: Little Endian 24 | CPU(s): 64 25 | On-line CPU(s) list: 0-63 26 | Thread(s) per core: 1 27 | Core(s) per socket: 64 28 | Socket(s): 1 29 | NUMA node(s): 1 30 | Model: 1 31 | BogoMIPS: 2100.00 32 | L1d cache: 64K 33 | L1i cache: 64K 34 | L2 cache: 1024K 35 | L3 cache: 32768K 36 | NUMA node0 CPU(s): 0-63 37 | Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng 38 | ``` 39 | 40 | ## CPU Hardware Capabilities 41 | 42 | To make your binaries more portable across various Arm64 CPUs, you can use Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core compliant with Armv8.4 must support dot-product, but dot-products are optional in Armv8.2 and Armv8.3. A developer wanting to build an application or library that can detect the supported instructions in runtime, can follow this example: 43 | 44 | ```c 45 | #include 46 | ...... 47 | uint64_t hwcaps = getauxval(AT_HWCAP); 48 | has_crc_feature = hwcaps & HWCAP_CRC32 ? 
true : false; 49 | has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false; 50 | has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false; 51 | has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false; 52 | has_sve_feature = hwcaps & HWCAP_SVE ? true : false; 53 | ``` 54 | 55 | The full list of Arm64 hardware capabilities is defined in [glibc header file](https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h) and in the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/hwcap.h). 56 | 57 | ## Example Source Code 58 | 59 | Here's a complete yet simple example code that higlights some of the methods mentioned above. 60 | 61 | ```c 62 | #include 63 | #include 64 | #include 65 | 66 | // https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register 67 | typedef union 68 | { 69 | struct { 70 | unsigned int revision : 4; 71 | unsigned int part : 12; 72 | unsigned int arch : 4; 73 | unsigned int variant : 4; 74 | unsigned int implementer : 8; 75 | unsigned int _RES0 : 32; 76 | }; 77 | unsigned long bits; 78 | } MIDR_EL1; 79 | 80 | static MIDR_EL1 read_MIDR_EL1() 81 | { 82 | MIDR_EL1 reg; 83 | asm("mrs %0, MIDR_EL1" : "=r" (reg.bits)); 84 | return reg; 85 | } 86 | 87 | 88 | static const char * get_implementer_name(MIDR_EL1 midr) 89 | { 90 | switch(midr.implementer) 91 | { 92 | case 0xC0: return "Ampere"; 93 | case 0x41: return "Arm"; 94 | case 0x42: return "Broadcom"; 95 | case 0x43: return "Cavium"; 96 | case 0x44: return "DEC"; 97 | case 0x46: return "Fujitsu"; 98 | case 0x48: return "HiSilicon"; 99 | case 0x49: return "Infineon"; 100 | case 0x4D: return "Motorola"; 101 | case 0x4E: return "NVIDIA"; 102 | case 0x50: return "Applied Micro"; 103 | case 0x51: return "Qualcomm"; 104 | case 0x56: return "Marvell"; 105 | case 0x69: return "Intel"; 106 | default: return "Unknown"; 107 | } 108 | } 109 | 110 | 111 | static const char * get_part_name(MIDR_EL1 midr) 112 | { 113 | switch(midr.implementer) 114 | { 115 | case 0x41: // Arm Ltd. 
116 | switch (midr.part) { 117 | case 0xd03: return "Cortex A53"; 118 | case 0xd07: return "Cortex A57"; 119 | case 0xd08: return "Cortex A72"; 120 | case 0xd09: return "Cortex A73"; 121 | case 0xd0c: return "Neoverse N1"; 122 | case 0xd40: return "Neoverse V1"; 123 | default: return "Unknown"; 124 | } 125 | case 0x42: // Broadcom 126 | switch (midr.part) { 127 | case 0x516: return "Vulcan"; 128 | default: return "Unknown"; 129 | } 130 | case 0x43: // Cavium 131 | switch (midr.part) { 132 | case 0x0a1: return "ThunderX"; 133 | case 0x0af: return "ThunderX2"; 134 | default: return "Unknown"; 135 | } 136 | case 0x46: // Fujitsu 137 | switch (midr.part) { 138 | case 0x001: return "A64FX"; 139 | default: return "Unknown"; 140 | } 141 | case 0x4E: // NVIDIA 142 | switch (midr.part) { 143 | case 0x000: return "Denver"; 144 | case 0x003: return "Denver 2"; 145 | case 0x004: return "Carmel"; 146 | default: return "Unknown"; 147 | } 148 | case 0x50: // Applied Micro 149 | switch (midr.part) { 150 | case 0x000: return "EMAG 8180"; 151 | default: return "Unknown"; 152 | } 153 | default: return "Unknown"; 154 | } 155 | } 156 | 157 | 158 | 159 | 160 | int main(void) 161 | { 162 | // Main ID register 163 | MIDR_EL1 midr = read_MIDR_EL1(); 164 | 165 | // CPU ISA capabilities 166 | unsigned long hwcaps = getauxval(AT_HWCAP); 167 | 168 | printf("CPU revision : 0x%x\n", midr.revision); 169 | printf("CPU part number : 0x%x (%s)\n", midr.part, get_part_name(midr)); 170 | printf("CPU architecture: 0x%x\n", midr.arch); 171 | printf("CPU variant : 0x%x\n", midr.variant); 172 | printf("CPU implementer : 0x%x (%s)\n", midr.implementer, get_implementer_name(midr)); 173 | printf("CPU LSE atomics : %sSupported\n", (hwcaps & HWCAP_ATOMICS) ? "" : "Not "); 174 | printf("CPU NEON SIMD : %sSupported\n", (hwcaps & HWCAP_ASIMD) ? "" : "Not "); 175 | printf("CPU SVE SIMD : %sSupported\n", (hwcaps & HWCAP_SVE) ? "" : "Not "); 176 | printf("CPU Dot-product : %sSupported\n", (hwcaps & HWCAP_ASIMDDP) ? "" : "Not "); 177 | printf("CPU FP16 : %sSupported\n", (hwcaps & HWCAP_FPHP) ? "" : "Not "); 178 | printf("CPU BF16 : %sSupported\n", (hwcaps & HWCAP2_BF16) ? "" : "Not "); 179 | 180 | if (numa_available() == -1) { 181 | printf("libnuma not available\n"); 182 | } 183 | printf("CPU NUMA nodes : %d\n", numa_num_configured_nodes()); 184 | printf("CPU Cores : %d\n", numa_num_configured_cpus()); 185 | 186 | return 0; 187 | } 188 | ``` -------------------------------------------------------------------------------- /optimization/vectorization.md: -------------------------------------------------------------------------------- 1 | # Arm SIMD Instructions: SVE and NEON 2 | 3 | The Arm architecture provides two single-instruction-multiple-data (SIMD) instruction extensions: NEON and SVE. 4 | 5 | Arm Advanced SIMD Instructions (a.k.a. "NEON") is the most common SIMD ISA for Arm64. It is a fixed-length SIMD ISA that supports vectors of 128 bits. The first Arm-based supercomputer to appear on the Top500 Supercomputers list ([_Astra_](https://www.sandia.gov/labnews/2018/11/21/astra-2/)) used NEON to accelerate linear algebra, and many applications and libraries are already taking advantage of NEON. The Ampere Altra CPU found in the NVIDIA Arm HPC Developer Kit supports NEON vectorization. 6 | 7 | More recently, Arm64 CPUs have started supporting [Arm Scalable Vector Extensions (SVE)](https://developer.arm.com/documentation/102476/latest/). SVE is a length-agnostic SIMD ISA that supports more datatypes than NEON (e.g. 
FP16), offers more powerful instructions (e.g. gather/scatter), and supports vector lengths of more than 128 bits. SVE is currently found in the AWS Graviton 3, Fujitsu A64FX, and the Alibaba Yitian 710. SVE is not a new version of NEON, but an entirely new SIMD ISA in its own right. Most SVE-capable CPUs also support NEON.
8 | 
9 | Here's a quick summary of the SIMD capabilities of some of the currently available Arm64 CPUs:
10 | 
11 | |                        | Alibaba Yitian 710 | AWS Graviton3 | Fujitsu A64FX  | AWS Graviton2 | Ampere Altra |
12 | | ---------------------- | ------------------ | ------------- | -------------- | ------------- | ------------ |
13 | | CPU Core               | Neoverse N2        | Neoverse V1   | A64FX          | Neoverse N1   | Neoverse N1  |
14 | | SIMD ISA               | SVE2 & NEON        | SVE & NEON    | SVE & NEON     | NEON only     | NEON only    |
15 | | NEON Configuration     | 2x128              | 4x128         | 2x128          | 2x128         | 2x128        |
16 | | SVE Configuration      | 2x128              | 2x256         | 2x512          | N/A           | N/A          |
17 | | SVE Version            | 2                  | 1             | 1              | N/A           | N/A          |
18 | | NEON FMLA FP64 TPeak   | 8                  | 16            | 8              | 8             | 8            |
19 | | SVE FMLA FP64 TPeak    | 8                  | 16            | 32             | N/A           | N/A          |
20 | 
21 | Note that recent Arm64 CPUs provide the same peak theoretical performance for both NEON and SVE. For example, the Graviton3 can either retire four 128-bit NEON operations or two 256-bit SVE operations. The Yitian 710 takes this one step further and provides both NEON and SVE in the same configuration (2x128). On paper, the peak performance of both SVE and NEON is the same for these CPUs, which means there's no intrinsic performance advantage for SVE vs. NEON, or vice versa. (Note: there are micro-architectural details that can give one ISA a performance advantage over the other in certain conditions, but the upper limit on performance is always the same.)
22 | 
23 | However, SVE ([and especially SVE2](https://developer.arm.com/documentation/102340/0001/Introducing-SVE2)) is a much more capable SIMD ISA with support for complex datatypes and advanced features that enable vectorization of complicated code. In practice, kernels that can't be vectorized in NEON _can_ be vectorized with SVE. So while SVE won't beat NEON in a performance drag race, it can dramatically improve the performance of the application overall by vectorizing loops that would otherwise have executed with scalar instructions.
24 | 
25 | Fortunately, auto-vectorizing compilers are usually the best choice when programming Arm SIMD ISAs. The compiler will generally make the best decision on when to use SVE or NEON, and it will take advantage of SVE's advanced auto-vectorization features more easily than a human coding in intrinsics or assembly is likely to do. **Generally speaking, you should not write SVE or NEON intrinsics.** Instead, use the appropriate command line options with your auto-vectorizing compiler to realize the best performance for a given loop. You may need to use compiler directives or make changes in the high level code to facilitate autovectorization, but this will be much easier and more maintainable than writing intrinsics. Leave the finer details to the compiler and focus on code patterns that auto-vectorize well.
26 | 
27 | 
28 | ## Compiler-driven auto-vectorization
29 | The key to maximizing auto-vectorization is to allow the compiler to take full advantage of the available hardware features. By default, GCC and LLVM compilers take a conservative approach and do not enable advanced features unless explicitly told to do so. The easiest way to enable all available features for GCC or LLVM is to use the `-mcpu` compiler flag.
If you're compiling on the same CPU that the code will run on, use `-mcpu=native`. Otherwise you can use `-mcpu=<cpu>` where `<cpu>` is one of the supported CPU identifiers, e.g. `-mcpu=neoverse-n1`. The NVIDIA compilers take a more aggressive approach. By default, they assume the machine you are compiling on is the one you will run on and so will enable all available hardware features detected at compile time. And whenever possible, use the most recent version of your compiler. For example, GCC 9 supported auto-vectorization fairly well, GCC 10 shows impressive improvements over GCC 9 in most cases, and GCC 12 further improves auto-vectorization. 30 | 31 | The second key compiler feature is the compiler vectorization report. GCC uses the `-fopt-info` flags to report on auto-vectorization success or failure. You can use the generated informational messages to guide code annotations or transformations that will facilitate auto-vectorization. For example, compiling with `-fopt-info-vec-missed` will report on which loops were not vectorized in a code like this: 32 | ```c 33 | 1 // test.c 34 | ... 35 | 5 float a[1024*1024]; 36 | 6 float b[1024*1024]; 37 | 7 float c[1024*1024]; 38 | ... 39 | 37 for (j=0; j<128;j++) { // outer loop, not expected to be vectorized 40 | 38   for (i=0; i<1024*1024;i++) { // inner loop, expected to be vectorized 41 | ... 42 | ``` 43 | 44 | Note that GCC will not implicitly convert between NEON intrinsic vector types, even when the two types have the same size: 45 | 46 | ```c 47 | #include <arm_neon.h> 94 | ... 95 | uint64x2_t u64x2; 96 | int64x2_t s64x2; 97 | // Error: cannot convert 'int64x2_t' to 'uint64x2_t' in assignment 98 | u64x2 = s64x2; 99 | ``` 100 | 101 | To perform the cast, you must use NEON's `vreinterpretq` functions: 102 | ```c 103 | u64x2 = vreinterpretq_u64_s64(s64x2); 104 | ``` 105 | 106 | Unfortunately, some codes written for other SIMD ISAs rely on these kinds of implicit conversions (see the [Velox example](../examples/velox.md)). If you see errors about "no known conversion" in a code that builds for AVX but doesn't build for NEON, then you may need to relax GCC's vector conversion rules: 107 | ``` 108 | /tmp/velox/third_party/xsimd/include/xsimd/types/xsimd_batch.hpp:35:11: note: no known conversion for argument 1 from ‘xsimd::batch’ to ‘const xsimd::batch&’ 109 | ``` 110 | To allow implicit conversions between vectors with differing numbers of elements and/or incompatible element types, use the `-flax-vector-conversions` flag. This flag should be fine for legacy code, but it should not be used for new code. The safest option is to use the appropriate `vreinterpretq` calls. 111 | 112 | 113 | ## Runtime detection of supported SIMD instructions 114 | To make your binaries more portable across various Arm64 CPUs, you can use Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core compliant with Armv8.4 must support dot-product, but dot-products are optional in Armv8.2 and Armv8.3. A developer wanting to build an application or library that can detect the supported instructions at runtime can follow this example: 115 | 116 | ```c 117 | #include <sys/auxv.h> 118 | ...... 119 | uint64_t hwcaps = getauxval(AT_HWCAP); 120 | has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false; 121 | has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false; 122 | has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false; 123 | has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false; 124 | has_sve_feature = hwcaps & HWCAP_SVE ? true : false; 125 | ``` 126 | 127 | The full list of Arm64 hardware capabilities is defined in the [glibc header file](https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h) and in the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/hwcap.h). 128 | 129 | ## Porting codes with SSE/AVX intrinsics to NEON 130 | 131 | ### Detecting Arm64 systems 132 | Projects may fail to build on Arm64 with `error: unrecognized command-line 133 | option '-msse2'`, or `-mavx`, `-mssse3`, etc. These compiler flags enable x86 134 | vector instructions. The presence of this error means that the build system may 135 | be missing the detection of the target system, and continues to use the x86 136 | target features compiler flags when compiling for Arm64. 137 | 138 | To detect an Arm64 system, the build system can use: 139 | ```bash 140 | (test $(uname -m) = "aarch64" && echo "arm64 system") || echo "other system" 141 | ``` 142 | 143 | Another way to detect an arm64 system is to compile, run, and check the return 144 | value of a C program: 145 | ```c 146 | # cat << EOF > check-arm64.c 147 | int main () { 148 | #ifdef __aarch64__ 149 | return 0; 150 | #else 151 | return 1; 152 | #endif 153 | } 154 | EOF 155 | 156 | # gcc check-arm64.c -o check-arm64 157 | # (./check-arm64 && echo "arm64 system") || echo "other system" 158 | ``` 159 | 160 | ### Translating x86 intrinsics to NEON 161 | When programs contain code with x86 intrinsics, drop-in intrinsic translation tools like [SIMDe](https://github.com/simd-everywhere/simde) or [sse2neon](https://github.com/DLTcollab/sse2neon) can be used to quickly obtain a working program on Arm64. This is a good starting point for rewriting the x86 intrinsics in either NEON or SVE and will quickly get a prototype up and running. For example, to port code using AVX2 intrinsics with SIMDe: 162 | ```c 163 | #define SIMDE_ENABLE_NATIVE_ALIASES 164 | #include "simde/x86/avx2.h" 165 | ``` 166 | 167 | SIMDe provides a quick starting point to port performance critical codes 168 | to Arm64. It shortens the time needed to get a working program that then 169 | can be used to extract profiles and to identify hot paths in the code. 170 | Once a profile is established, the hot paths can be rewritten to avoid the overhead of the generic translation. 171 | 172 | Since you're rewriting your x86 intrinsics, you might want to take this opportunity to create a more portable version. Here are some suggestions to consider: 173 | 174 | 1. Rewrite in native C/C++, Fortran, or another high-level compiled language. Compilers are constantly improving, and technologies like Arm SVE enable the autovectorization of codes that formerly wouldn't vectorize. You may be able to avoid platform-specific intrinsics entirely and let the compiler do all the work. 175 | 2. If your application is written in C++, use `std::experimental::simd` from the C++ Parallelism Technical Specification V2 via the `<experimental/simd>` header. 176 | 3. Use the [SLEEF Vectorized Math Library](https://sleef.org/) as a header-based set of "portable intrinsics". 177 | 4. Instead of Time Stamp Counter (TSC) RDTSC intrinsics, use standards-compliant portable timers, e.g., `std::chrono` (C++), `clock_gettime` (C/POSIX), `omp_get_wtime` (OpenMP), `MPI_Wtime` (MPI), etc. (see the sketch after this list).
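To make the last point concrete, here is a minimal sketch of a portable wall-clock timer built on `clock_gettime`; the file and function names are illustrative and the work being timed is left as a placeholder:

```c
// portable_timer.c -- portable timing without x86 RDTSC (illustrative sketch)
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

// Seconds from a monotonic clock; behaves the same on Arm64 and x86_64
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    double t0 = wall_seconds();
    /* ... kernel being timed goes here ... */
    double t1 = wall_seconds();
    printf("elapsed: %.9f s\n", t1 - t0);
    return 0;
}
```

The same measurement could be expressed with `std::chrono::steady_clock` in C++ or `omp_get_wtime` in OpenMP code; the point is simply to avoid architecture-specific cycle counters.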
178 | 179 | 180 | ## Additional resources 181 | 182 | * [Neon Intrinsics](https://developer.arm.com/architectures/instruction-sets/intrinsics/) 183 | * [Coding for Neon](https://developer.arm.com/documentation/102159/latest/) 184 | * [Neon Programmer's Guide for Armv8-A](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a) 185 | -------------------------------------------------------------------------------- /slack.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arm-hpc-devkit/nvidia-arm-hpc-devkit-users-guide/77ad9c72ab9c19e2e333184e8839d97e891fbaa0/slack.png -------------------------------------------------------------------------------- /software/README.md: -------------------------------------------------------------------------------- 1 | # Arm64 Software Ecosystem 2 | 3 | ## Compilers 4 | Many commercial and open source compilers now support Arm64. See [the compilers page](compilers.md) for details, recommendations, and best practices. We also recommend you check the [language-specific considerations](../languages/README.md#language-specific-considerations). 5 | 6 | ## Message Passing (MPI) 7 | Practically all MPI libraries support Arm64 with the notable exception of Intel MPI. See [the MPI page](mpi.md) for details, recommendations, and best practices. 8 | 9 | ## Math Libraries 10 | Many math libraries support Arm64 CPUs. In many cases, math libraries supporting Arm64 (e.g. BLIS, OpenBLAS, FFTW, etc.) can be substituted for libraries that have not yet announced support for Arm64 (e.g. Intel MKL). Arm community members like NVIDIA, AWS, and Oracle are actively contributing to enabling Arm64 support wherever possible. See [the math libraries page](mathlibs.md) for details, recommendations, and best practices. 11 | 12 | All NVIDIA GPU math libraries work perfectly on Arm-hosted GPUs. In this case, the architecture of the host CPU is irrelevant. If your application uses GPU-accelerated math libraries, proceed exactly as you would on any other platform. 13 | 14 | ## Package Managers e.g. Spack and EasyBuild 15 | Package managers like Spack and EasyBuild are a great way to get started with Arm64. Spack is very well supported on Arm64 and can build full software stacks for CPU-only or CPU+GPU applications using GCC, LLVM, Arm, or NVIDIA compilers and math libraries. Since the Ampere Altra CPU found in the NVIDIA Arm HPC Developer Kit is architecturally similar to the AWS Graviton 2, we recommend you use [AWS's Spack Rolling Binary Cache](https://aws.amazon.com/blogs/hpc/introducing-the-spack-rolling-binary-cache/) to accelerate your Spack software stack deployments. [See this AWS blog post for more details](https://aws.amazon.com/blogs/hpc/introducing-the-spack-rolling-binary-cache/). 16 | 17 | ## Containers 18 | Docker, Kubernetes, Singularity, Charliecloud, Sarus, and many more container engines and frameworks run with excellent performance on Arm64. Please refer [here](containers.md) for information about running container-based workloads. 19 | 20 | ## Operating Systems 21 | Please check [here](os.md) for more information about which operating system to run on the NVIDIA Arm HPC Developer Kit. 22 | 23 | ## Recent Updates, Known Issues, and Workarounds 24 | Please see [here](known_issues.md) for recently identified issues and their solutions.
Please also check the [AWS Graviton Getting Started Guide](https://github.com/aws/aws-graviton-getting-started) for known issues and workarounds. -------------------------------------------------------------------------------- /software/compilers.md: -------------------------------------------------------------------------------- 1 | # Compilers for Arm64 2 | 3 | Many commercial and open source compilers now support Arm64. This page provides details, recommendations, and best practices. We also recommend you check the [language-specific considerations](../languages/README.md#language-specific-considerations). 4 | 5 | ## NVIDIA HPC Compilers 6 | 7 | The NVIDIA HPC SDK includes proven compilers, libraries and software tools essential to maximizing developer productivity and the performance and portability of HPC applications. The NVIDIA HPC SDK compilers enable cross-platform C, C++, and Fortran programming for NVIDIA GPUs and multicore Arm, OpenPOWER, or x86-64 CPUs. They are ideal for HPC modeling and simulation applications written in C, C++, or Fortran with OpenMP, OpenACC, and CUDA. 8 | 9 | These compilers are also fully interoperable with NVIDIA’s optimized math libraries, communication libraries, and performance tuning and debugging tools. The NVIDIA HPC SDK’s accelerated math libraries maximize performance on common HPC algorithms, and the optimized communications libraries enable standards-based scalable systems programming. The integrated performance profiling and debugging tools simplify porting and optimization of HPC applications, and the containerization tools enable easy deployment on-premises or in the cloud. In short, the HPC SDK provides all the tools you need to build HPC applications for GPUs, CPUs, or the cloud. 10 | 11 | 12 | ## GNU Compilers 13 | 14 | When using the GNU compilers, we strongly recommend GCC version 11 or later. Arm Inc. is a long-standing contributor to the GNU compilers, so much so that Arm and Arm's partners currently contribute a large share of the updates to GCC's AArch64 support. Quite old GNU compilers will work on Arm64 CPUs, but you should always use the latest versions for best performance. 15 | 16 | ## LLVM Compilers 17 | 18 | LLVM compilers support Arm64 CPUs quite well, though mainly for C and C++ (clang and clang++). LLVM's Fortran front-end (flang) is not widely used and is still maturing. Arm Inc. provides multiple LLVM-based compilers, both for server-class Arm64 systems and for embedded and mobile devices. The Arm Compiler for Linux is based on LLVM, and Arm is committed to upstreaming all LLVM patches it develops. 19 | 20 | ## Arm Compiler for Linux 21 | 22 | The Arm Compiler for Linux is tailored to the development of HPC applications. It is a free-to-use compiler built on LLVM 13. For more details see [this blog post about the launch of Arm Compiler for Linux version 22.0.](https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/arm-compiler-for-linux-and-arm-performance-libraries-22-0) 23 | 24 | ## HPE/Cray, Fujitsu, and Other Vendor Compilers 25 | 26 | Arm vendors like HPE/Cray and Fujitsu also provide compilers that target their own Arm-based products. Code generated by these vendor compilers tends to be very highly tuned for the target platform and therefore will not run well, if at all, on the NVIDIA Arm HPC Developer Kit.
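Whichever compiler you settle on, it is worth a quick sanity check that the versions on your `PATH` are the ones you expect before starting a large build. The commands below are only a sketch and assume a typical Linux installation; the NVIDIA HPC SDK compilers may need to be added to the `PATH` first (for example via the environment modules shipped with the SDK):

```bash
# GNU toolchain -- GCC 11 or later is recommended on Arm64 (see above)
gcc --version | head -n1
g++ --version | head -n1
gfortran --version | head -n1

# LLVM, if installed
clang --version | head -n1

# NVIDIA HPC SDK compilers, if installed
command -v nvc nvc++ nvfortran
```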
27 | 28 | -------------------------------------------------------------------------------- /software/containers.md: -------------------------------------------------------------------------------- 1 | # Containers on Arm64 2 | 3 | Containerization has long been of interest to the Arm community. Today, Arm64 CPUs are ideal for container-based workloads. 4 | 5 | ## NVIDIA Container Toolkit 6 | 7 | The [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) allows users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs. Follow this installation guide to get started: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html 8 | 9 | As an example, here are the steps for installing NVIDIA Container Toolkit on Ubuntu 20.04 with Docker. Note that other container frameworks like Podman are also supported. 10 | 11 | ```bash 12 | # Install Docker dependencies 13 | sudo apt-get install ca-certificates curl gnupg lsb-release 14 | 15 | # Add Docker repo 16 | echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null 17 | 18 | # Install Docker 19 | sudo apt-get update 20 | sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin 21 | 22 | # Enable docker service 23 | sudo systemctl --now enable docker 24 | 25 | # Add the NVIDIA Container Toolkit repository 26 | distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \ 27 | && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ 28 | && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \ 29 | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ 30 | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list 31 | 32 | # Install NVIDIA Container Toolkit 33 | sudo apt update 34 | sudo apt install nvidia-docker2 35 | 36 | # Restart docker services to enable GPU support 37 | sudo systemctl restart docker 38 | 39 | # Run a simple test using the CUDA multi-arch container 40 | sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu18.04 nvidia-smi 41 | Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu18.04' locally 42 | 11.0.3-base-ubuntu18.04: Pulling from nvidia/cuda 43 | e196da37f904: Pull complete 44 | 0b7ba59c359b: Pull complete 45 | 84bc5f8689bc: Pull complete 46 | b926124172ef: Pull complete 47 | fef6c6f16e98: Pull complete 48 | Digest: sha256:f7b595695b06ad8590aed1accd6437ba068ca44e71c5cf9c11c8cb799c2d8335 49 | Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu18.04 50 | Thu Jul 7 17:57:04 2022 51 | +-----------------------------------------------------------------------------+ 52 | | NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 | 53 | |-------------------------------+----------------------+----------------------+ 54 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 55 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 56 | | | | MIG M. | 57 | |===============================+======================+======================| 58 | | 0 NVIDIA A100 80G... 
Off | 0000000C:01:00.0 Off | 0 | 59 | | N/A 44C P0 65W / 300W | 0MiB / 81920MiB | 0% Default | 60 | | | | Disabled | 61 | +-------------------------------+----------------------+----------------------+ 62 | | 1 NVIDIA A100 80G... Off | 0000000D:01:00.0 Off | 0 | 63 | | N/A 36C P0 63W / 300W | 0MiB / 81920MiB | 2% Default | 64 | | | | Disabled | 65 | +-------------------------------+----------------------+----------------------+ 66 | 67 | +-----------------------------------------------------------------------------+ 68 | | Processes: | 69 | | GPU GI CI PID Type Process name GPU Memory | 70 | | ID ID Usage | 71 | |=============================================================================| 72 | | No running processes found | 73 | +-----------------------------------------------------------------------------+ 74 | ``` 75 | 76 | 77 | 78 | 79 | ## Preparing for Arm64 80 | 81 | The first step for leveraging the benefits of Arm64 systems as container hosts is to ensure all production software dependencies support the Arm64 architecture, as one cannot run images built for an x86_64 host on an Arm64 host, and vice versa. 82 | 83 | Most of the container ecosystem supports both architectures, and often does so transparently through [multiple-architecture (multi-arch)](https://www.docker.com/blog/multi-platform-docker-builds/) images, where the correct image for the host architecture is deployed automatically. 84 | 85 | The major container image repositories, including [Dockerhub](https://hub.docker.com), [Quay](https://www.quay.io), and [Amazon Elastic Container Registry (ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) all support [multi-arch](https://aws.amazon.com/blogs/containers/introducing-multi-architecture-container-images-for-amazon-ecr/) images. 86 | 87 | ### Creating Multi-arch container images 88 | 89 | While most images already support multi-arch (i.e. arm64 and x86_64/amd64), we describe couple of ways for developers to to create a multi-arch image if needed. 90 | 91 | 1. [Docker Buildx](https://github.com/docker/buildx#getting-started) 92 | 2. Using a CI/CD Build Pipeline such as [Amazon CodePipeline](https://github.com/aws-samples/aws-multiarch-container-build-pipeline) to coordinate native build and manifest generation. 93 | 94 | ### Deploying to Arm64 95 | 96 | Most container orchestration platforms support both arm64 and x86_64 hosts. As an example, here is an _incomplete, non-exhaustive_ list of popular software within the container ecosystem that explicitly supports Arm64. 97 | 98 | | Name | URL | Comment | 99 | | :----- |:----- | :----- | 100 | | Tensorflow | https://hub.docker.com/r/armswdev/tensorflow-arm-neoverse | | 101 | | PyTorch | https://hub.docker.com/r/armswdev/pytorch-arm-neoverse |Use tags with *-openblas* for performance reasons until confirmed otherwise| 102 | | Istio | https://github.com/istio/istio/releases/ | 1) arm64 binaries as of 1.6.x release series
2) [Istio container build instructions](https://github.com/aws/aws-graviton-getting-started/blob/main/containers-workarounds.md#Istio)| 103 | | Envoy | https://www.envoyproxy.io/docs/envoy/v1.18.3/start/docker || 104 | | Traefik | https://github.com/containous/traefik/releases || 105 | | Flannel | https://github.com/coreos/flannel/releases || 106 | | Helm | https://github.com/helm/helm/releases/tag/v2.16.9 || 107 | | Jaeger | https://github.com/jaegertracing/jaeger/pull/2176 | | 108 | | Fluent-bit |https://github.com/fluent/fluent-bit/releases/ | compile from source | 109 | | core-dns |https://github.com/coredns/coredns/releases/ | | 110 | | external-dns | https://github.com/kubernetes-sigs/external-dns/blob/master/docs/faq.md#which-architectures-are-supported | support from 0.7.5+ | 111 | | Prometheus | https://prometheus.io/download/ | | 112 | |containerd | https://github.com/containerd/containerd/issues/3664 | nightly builds provided for arm64 | 113 | | kube-state-metrics | https://github.com/kubernetes/kube-state-metrics/issues/1037 | use k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-beta for arm64 | 114 | | cluster-autoscaler | https://github.com/kubernetes/autoscaler/pull/3714 | arm64 support as of v1.20.0 | 115 | |gRPC | https://github.com/protocolbuffers/protobuf/releases/ | protoc/protobuf support | 116 | |Nats | https://github.com/nats-io/nats-server/releases/ | | 117 | |CNI | https://github.com/containernetworking/plugins/releases/ | | 118 | |Cri-o | https://github.com/cri-o/cri-o/blob/master/README.md#installing-crio | tested on Ubuntu 18.04 and 20.04 | 119 | |Trivy | https://github.com/aquasecurity/trivy/releases/ | | 120 | |Argo | https://github.com/argoproj/argo-cd/releases | arm64 images published as of 2.3.0 | 121 | |Cilium | https://docs.cilium.io/en/stable/contributing/development/images/ | Multi arch supported from v 1.10.0 | 122 | |Calico | https://hub.docker.com/r/calico/node/tags?page=1&ordering=last_updated | Multi arch supported on master | 123 | |Tanka | https://github.com/grafana/tanka/releases | | 124 | |Consul | https://www.consul.io/downloads | | 125 | |Nomad | https://www.nomadproject.io/downloads | | 126 | |Packer | https://www.packer.io/downloads | | 127 | |Vault | https://www.vaultproject.io/downloads | | 128 | |Terraform | https://github.com/hashicorp/terraform/issues/14474 | arm64 support as of v0.14.0 | 129 | |Flux | https://github.com/fluxcd/flux/releases/ | | 130 | |Pulumi | https://github.com/pulumi/pulumi/issues/4868 | arm64 support as of v2.23.0 | 131 | |New Relic | https://download.newrelic.com/infrastructure_agent/binaries/linux/arm64/ | | 132 | |Datadog - EC2 | https://www.datadoghq.com/blog/datadog-arm-agent/ || 133 | |Datadog - Docker | https://hub.docker.com/r/datadog/agent-arm64 || 134 | |Dynatrace | https://www.dynatrace.com/news/blog/get-out-of-the-box-visibility-into-your-arm-platform-early-adopter/ || 135 | |Grafana | https://grafana.com/grafana/download?platform=arm || 136 | |Loki | https://github.com/grafana/loki/releases || 137 | |kube-bench | https://github.com/aquasecurity/kube-bench/releases/tag/v0.3.1 || 138 | |metrics-server | https://github.com/kubernetes-sigs/metrics-server/releases/tag/v0.3.7 | docker image is multi-arch from v.0.3.7 | 139 | | Flatcar Container Linux | https://www.flatcar.org | arm64 support in Stable channel as of 3033.2.0 | 140 | 141 | **If your software isn't listed above, it doesn't mean it won't work!** 142 | 143 | Many products work on arm64 but don't explicitly distribute arm64 binaries or 
build multi-arch images *(yet)*. NVIDIA, AWS, Arm, and many developers in the community are working with maintainers and contributing expertise and code to enable full binary or multi-arch support. 144 | 145 | ## Kubernetes 146 | 147 | Kubernetes fully supports Arm64. 148 | 149 | If all of your containerized workloads support Arm64, then you can run your cluster with Arm64 nodes exclusively. However, if you have some workloads that can only run on x86, or if you just want to be able to run both x86 and Arm64 nodes in the same cluster, then there are a couple of ways to accomplish that: 150 | 151 | * **Multiarch Images**: 152 | If you are able to use multiarch images (see above) for all containers in your cluster, then you can simply run a mix of x86 and Arm64 nodes without any further action. The multiarch image manifest will ensure that the correct image layers are pulled for a given node's architecture. 153 | 154 | * **Built-in labels**: 155 | You can schedule pods on nodes according to the `kubernetes.io/arch` [label](https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetes-io-arch). This label is automatically added to nodes by Kubernetes and allows you to schedule pods accordingly with a [node selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) like this: 156 | ``` 157 | nodeSelector: 158 | kubernetes.io/arch: amd64 159 | ``` 160 | * **Using taints**: 161 | Taints are especially helpful if adding Arm64 nodes to an existing cluster with mostly x86-only containers. While using the built-in `kubernetes.io/arch` label requires you to explicitly use a node selector to place x86-only containers on the right instances, tainting Arm64 instances prevents Kubernetes from scheduling incompatible containers on them without requiring you to change any existing configuration. For example, you can do this with a managed node group using eksctl by adding `--kubelet-extra-args '--register-with-taints=arm=true:NoSchedule'` to the kubelet startup arguments as documented [here](https://eksctl.io/usage/eks-managed-nodes/). Note that if you only taint Arm64 instances and don't specify any node selectors, then you will need to ensure that the images you build for Arm64 are multiarch images that can also run on x86 instance types. Alternatively, you can build Arm64-only images and ensure that they are only scheduled onto Arm64 images using node selectors. 162 | 163 | ## Further Reading 164 | 165 | * [Building multi-arch docker images with buildx](https://tech.smartling.com/building-multi-architecture-docker-images-on-arm-64-c3e6f8d78e1c) 166 | * [Unifying Arm software development with Docker](https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/unifying-arm-software-development-with-docker) 167 | * [Modern multi-arch builds with docker](https://duske.me/posts/modern-multiarch-builds-with-docker/) 168 | -------------------------------------------------------------------------------- /software/cuda.md: -------------------------------------------------------------------------------- 1 | # CUDA on Arm64 2 | 3 | CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on GPUs. NVIDIA has officially supported CUDA on Arm64 starting with CUDA 11. There is no specific guidance for CUDA on Arm64. Simply use it as you normally would on any other platform. 
CUDA is [provided with the NVIDIA HPC SDK](https://developer.nvidia.com/hpc-sdk), or you can [download the CUDA Toolkit for Arm](https://developer.nvidia.com/cuda-downloads). -------------------------------------------------------------------------------- /software/mathlibs.md: -------------------------------------------------------------------------------- 1 | # Math Libraries on Arm64 2 | 3 | Many math libraries support Arm64 CPUs. In many cases, open source math libraries like BLIS, OpenBLAS, FFTW, etc. can be substituted for libraries that have not yet announced support for Arm64 (e.g. Intel MKL). Arm community members like NVIDIA, AWS, and Oracle are actively contributing to enabling Arm64 support wherever possible. 4 | 5 | ## GPU Math Libraries 6 | 7 | All NVIDIA GPU math libraries work perfectly on Arm-hosted GPUs. In this case, the architecture of the host CPU is irrelevant. If your application uses GPU-accelerated math libraries, proceed exactly as you would on any other platform. 8 | 9 | ## Multi-node Math Libraries 10 | 11 | Generally speaking, all the multi-node math libraries you expect work well on Arm64. Trilinos, PETSc, Hypre, SuperLU, and ParMETIS have been used at scale on the Astra system at Sandia (Marvell ThunderX2 with EDR InfiniBand) and at scale on AWS Graviton 2 with EFA. GPU support has been tested in most of these same libraries at smaller scales. Spack is often a good option for installing these libraries. 12 | -------------------------------------------------------------------------------- /software/ml.md: -------------------------------------------------------------------------------- 1 | # AI, ML, and DL Frameworks 2 | 3 | Many AI, ML, and DL frameworks work well on Arm64-based platforms. In most cases, you will want to use the Arm-hosted GPU for training or inference. You may also wish to use the CPU for inference. See [the examples page](../examples/README.md) for more information. 4 | 5 | The following are known to work well on the NVIDIA Arm HPC Developer Kit: 6 | * TensorRT: An SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. 7 | * NVIDIA Triton Inference Server: An open-source inference serving software that helps standardize model deployment and execution, delivering fast and scalable AI in production. 8 | * PyTorch: A GPU accelerated tensor computational framework. Functionality can be extended with common Python libraries such as NumPy and SciPy. 9 | * TensorFlow: An open source platform for machine learning, providing comprehensive tools and libraries in a flexible architecture allowing easy deployment across a variety of platforms and devices. 10 | * Example: [GPU-accelerated training with TensorFlow](../examples/tensorflow-gpu.md) 11 | * Example: [on-CPU inference with TensorFlow](../examples/tensorflow-cpu.md) -------------------------------------------------------------------------------- /software/mpi.md: -------------------------------------------------------------------------------- 1 | # Message Passing Interface (MPI) Implementations on Arm64 2 | 3 | Practically all MPI libraries support Arm64 with the notable exception of Intel MPI. 4 | 5 | ## OpenMPI 6 | 7 | There is no specific guidance for Arm64 for OpenMPI. It functions on Arm64 exactly as it does on all other architectures. Simply install and use it as you would normally. 8 | 9 | ## MVAPICH2 10 | 11 | There is no specific guidance for Arm64 for MVAPICH2. 
It functions on Arm64 exactly as it does on all other architectures. Simply install and use it as you would normally. 12 | 13 | ## MPICH 14 | 15 | There is no specific guidance for Arm64 for MPICH. It functions on Arm64 exactly as it does on all other architectures. Simply install and use it as you would normally. 16 | 17 | -------------------------------------------------------------------------------- /software/os.md: -------------------------------------------------------------------------------- 1 | # Operating Systems on Arm64 2 | 3 | Current versions of popular Linux distributions (Ubuntu, RedHat, etc.) support Arm64 to the same level as other architectures. The NVIDIA Arm HPC Developer Kit has been internally tested and qualified using Ubuntu 20.04 and RHEL 8.4 operating systems, but other distributions may also work. 4 | 5 | 6 | ## RedHat Enterprise Linux and Derivatives 7 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 8 | ------ | ------ | ----- | ----- | ----- | ----- 9 | RHEL | 9.0 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-9.0-aarch64-dvd.iso) | 10 | RHEL | 8.6 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-8.6-aarch64-dvd.iso) | 11 | RHEL | 8.4 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-8.4-aarch64-dvd.iso) | NVIDIA tested and qualified on DevKit 12 | RHEL | 8.2 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-8.2-aarch64-dvd.iso) | 13 | Rocky Linux | 8.4 or later | Yes | 64KB | [ISO](https://download.rockylinux.org/pub/rocky/8/isos/aarch64/Rocky-8.6-aarch64-dvd1.iso) | 14 | CentOS Stream | 9 | No | 64KB | [ISO](https://mirrors.centos.org/mirrorlist?path=/9-stream/BaseOS/aarch64/iso/CentOS-Stream-9-latest-aarch64-dvd1.iso&redirect=1&protocol=https) | 15 | CentOS Stream | 8 | No | 64KB | [Mirror List](http://isoredirect.centos.org/centos/8-stream/isos/aarch64/) | 16 | CentOS | 8.2 or later | No | 64KB | [ISO](http://bay.uchicago.edu/centos-vault/8.2.2004/isos/aarch64/CentOS-8.2.2004-aarch64-dvd1.iso) | 17 | 18 | 19 | ## Ubuntu 20 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 21 | ------ | ------ | ----- | ----- | ----- | ----- 22 | Ubuntu | 22.04 LTS | Yes | 4KB | [ISO](https://cdimage.ubuntu.com/releases/22.04/release/ubuntu-22.04-live-server-arm64.iso) | 23 | Ubuntu | 20.04 LTS | Yes | 4KB | [ISO](https://cdimage.ubuntu.com/releases/20.04/release/ubuntu-20.04.4-live-server-arm64.iso) | NVIDIA tested and qualified on DevKit 24 | Ubuntu | 18.04 LTS | Yes (*) | 4KB | [ISO](https://cdimage.ubuntu.com/releases/18.04/release/ubuntu-18.04.6-server-arm64.iso) | (*) needs `apt install libc6-lse` 25 | 26 | 27 | ## SUSE Linux Enterprise Server 28 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 29 | ------ | ------ | ----- | ----- | ----- | ----- 30 | SLES | 15 SP2 or later | Planned | 4KB | [SUSE Download](https://www.suse.com/download/sles/) | 31 | 32 | 33 | ## Others 34 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 35 | ------ | ------ | ----- | ----- | ----- | ----- 36 | AlmaLinux | 9.0 | Yes | 64KB | [Mirror List](https://mirrors.almalinux.org/isos/aarch64/9.0.html) | 37 | AlmaLinux | 8.6 | Yes | 
64KB | [Mirror List](https://mirrors.almalinux.org/isos/aarch64/8.6.html) | 38 | Alpine Linux | 3.12.7 or later | Yes (*) | 4KB | [ISO](https://dl-cdn.alpinelinux.org/alpine/v3.16/releases/aarch64/alpine-standard-3.16.0-aarch64.iso) | (*) LSE enablement checked in version 3.14 | 39 | Debian | 11 (Bullseye) | Yes | 4KB | [ISO](https://cdimage.debian.org/debian-cd/current/arm64/iso-dvd/debian-11.3.0-arm64-DVD-1.iso) | 40 | Debian | 10 (Buster) | Yes (*) | 4KB | [ISO](https://cdimage.debian.org/cdimage/archive/10.12.0/arm64/iso-dvd/debian-10.12.0-arm64-DVD-1.iso) | LSE supported as of Debian 10.7 (2020-12-07) 41 | FreeBSD | 13.0 or later | Yes | 4KB | [ISO](https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/13.1/FreeBSD-13.1-RELEASE-arm64-aarch64-disc1.iso) | Some DevKit hardware features are not supported 42 | 43 | 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /transition-guide.md: -------------------------------------------------------------------------------- 1 | # Considerations when Transitioning Workloads to Arm64 2 | 3 | Today, Arm CPUs power application servers, micro-services, high-performance computing, CPU-based machine learning inference, video encoding, electronic design automation, gaming, open-source databases, and in-memory caches. In most cases transitioning to Arm64 CPUs is simple and straightforward. This transition guide provides a step-by-step approach to assess your workload to identify and address any potential software changes that might be needed. 4 | 5 | ## Introduction - Identifying Target Workloads 6 | 7 | The quickest and easiest workloads to transition are Linux-based, and built using open-source components or in-house applications where you control the source code. Many open source projects already support Arm64, and having access to the source code allows you to build from source if pre-built artifacts do not already exist. There is also a large and growing set of Independent Software Vendor (ISV) software available for Arm64 (e.g. a non-exhaustive list can be found [here](isv.md)). However if you license software you’ll want to check with the respective ISV to ensure they already, or have plans to, support Arm. 8 | 9 | The following transition guide is organized into a logical sequence of steps as follows: 10 | 11 | * [Learning and exploring](#learning-and-exploring) 12 | * Step 1 - [Optional] Understand the NVIDIA Arm HPC Developer Kit and review key documentation 13 | * Step 2 - Explore your workload, and inventory your current software stack 14 | * [Plan your workload transition](#plan-your-workload-transition) 15 | * Step 3 - Install and configure your application environment 16 | * Step 4 - [Optional] Build your application(s) and/or container images 17 | * [Test and optimize your workload](#test-and-optimize-your-workload) 18 | * Step 5 - Testing and optimizing your workload 19 | * Step 6 - Performance testing 20 | 21 | ### Learning and Exploring 22 | 23 | **Step 1 - [Optional] Understand the NVIDIA Arm HPC Developer Kit and review key documentation** 24 | 25 | * [Optional] Start by reviewing [the NVIDIA Arm HPC Developer Kit product page](https://developer.nvidia.com/arm-hpc-devkit). 
26 | * [Optional] Watch these recommended presentations to learn more about getting started, porting and tuning applications, and expected performance for key applications: 27 | * [First hands-on experiences using the NVIDIA Arm HPC Developer Kit](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41624/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 28 | * [Getting started with ARM software development: 86 the x86 dependencies in your code](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41702/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 29 | * [Port, Profile, and Tune HPC Applications for Arm-based Supercomputers](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41788/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 30 | * [Introducing Developer Tools for Arm and NVIDIA systems](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32163/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 31 | 32 | **Step 2 - Explore your workload, and inventory your current software stack** 33 | 34 | Before starting the transition, you will need to inventory your current software stack so you can identify the path to equivalent software versions that support Arm64. At this stage it can be useful to think in terms of software you download (e.g. open source packages, container images, libraries), software you build and software you procure/license. Areas to review: 35 | 36 | * [Operating system](os.md): pay attention to specific versions that support Arm64 (usually more recent are better) 37 | * If your workload is container based, check container images you consume for Arm64 support. Keep in mind, many container images now support multiple architectures which simplifies consumption of those images in a mixed-architecture environment. 38 | * All the libraries, frameworks and runtimes used by the application. 39 | * Tools used to build, deploy and test your application (e.g. compilers, test suites, CI/CD pipelines, provisioning tools and scripts). Note there are language specific sections in this getting started guide with useful pointers to getting the best performance from Arm64 processors. 40 | * Tools and/or agents used to deploy and manage the application in production (e.g. monitoring tools or security agents) 41 | * This guide contains language specifics sections where you'll find additional per-language guidance: 42 | * [C/C++](languages/c-c++.md) 43 | * [Go](languages/golang.md) 44 | * [Java](languages/java.md) 45 | * [.NET](languages/dotnet.md) 46 | * [Python](languages/python.md) 47 | * [Rust](languages/rust.md) 48 | 49 | As a rule, the more current your software environment the more likely you will obtain the full performance entitlement from Arm64. 50 | 51 | For each component of your software stack, check for Arm64 support. A large portion of this can be done using existing system configuration and deployment scripts. As your scripts run and install packages, you will get messages for any missing components. Some may build from source automatically while others will cause the script to fail. Pay attention to software versions: more current software is easier to transition and will deliver the best performance. If you do need to perform upgrades prior to adopting Arm64, you might consider doing that using an existing x86 environment to minimize the number of changed variables. 
We have seen examples where upgrading OS version on x86 was far more involved and time consuming than transitioning to Arm64 after the upgrade. For more details on checking for software support please see Appendix A. 52 | 53 | Note: When locating software be aware that some tools, including GCC, refer to the architecture as AArch64, others including the Linux Kernel, call it arm64. When checking packages across various repositories, you’ll find those different naming conventions. 54 | 55 | 56 | ### Plan your workload transition 57 | 58 | **Step 3- Install and configure your application environment** 59 | 60 | Complete the installation of your software stack based on the inventory created in Step 2. In many cases your installation scripts can be used as-is or with minor modifications to reference architecture specific versions of components where necessary. The first time through this may be an iterative process as you resolve any remaining dependencies. 61 | 62 | **Step 4 - Build your application(s) and/or container images** 63 | 64 | Applications built using interpreted or JIT'd languages (Python, Java, PHP, Node.js, etc.) should run as-is. This guide contains language specific sections with recommendations e.g. [Java](java.md) and [Python](python.md). If there is no language specific section, it is because there is no specific guidance beyond using a suitably current version of the language. Simply proceed as you would on any other CPUs, Arm-based or otherwise. 65 | 66 | Applications using compiled languages including C, C++ or Go, need to be compiled for the Arm64 architecture. Most modern builds (e.g. using Make) will just work when run natively on Arm64. You’ll find language specific compiler recommendations in this repository: [C/C++](c-c++.md), [Go](golang.md), and [Rust](rust.md). Again , if there is no specific guidance it's because everything works _exactly_ the same on Arm64 as on other platforms. 67 | 68 | Just like an operating system, container images are architecture specific. You will need to build Arm64 container images. You might wish to build multi-arch container images that can run automatically on either x86-64 or Arm64. Check out the [container section](containers.md) of this guide for more details. 69 | 70 | You will also need to review any functional and unit test suite(s) to ensure you can test the new build artifacts with the same test coverage you have already for x86 artifacts. 71 | 72 | ### Test and optimize your workload 73 | 74 | **Step 5 - Testing and optimizing your workload** 75 | 76 | Now that you have your application stack on Aarch64, you should run your test suite to ensure all regular unit and functional tests pass. Resolve any test failures in the application(s) or test suites until you are satisfied everything is working as expected. Most errors should be related to the modifications and updated software versions you have installed during the transition. (Tip: when upgrading software versions, first test them using an existing x86 environment to minimize the number of variables changed at once. If issues occur then resolve them using the current x86 environment before continuing with the new Arm64 environment). If you suspect architecture specific issues then please have a look to our [C/C++ section ](c-c++.md) which gives advice on how to solve them. 77 | 78 | **Step 6 - Performance testing** 79 | 80 | With your fully functional application its time to establish a performance baseline on Arm64. 
In most cases, you should expect performance parity, or even gains. This guide has sections dedicated to [Optimization](optimization/README.md) and a [Performance Runbook](perfrunbook/grace_perfrunbook.md) for you to follow during this stage. 81 | 82 | ### _Appendix A - locating packages for Arm64_ 83 | 84 | Remember: when locating software, be aware that some tools, including GCC, refer to the architecture as AArch64, while others, including the Linux kernel, call it arm64. When checking packages across various repositories, you'll find those different naming conventions, and in some cases just "ARM". 85 | 86 | The main ways to check, and the places to look, are: 87 | 88 | * Package repositories of your chosen Linux distribution. Arm64 support within Linux distributions is largely complete: for example, Debian, which has the largest package repository, has over 98% of its packages built for the Arm64 architecture. 89 | * Container image registries. Amazon ECR now offers [public repositories](https://docs.aws.amazon.com/AmazonECR/latest/public/public-repositories.html) that you can search for [arm64 images](https://gallery.ecr.aws/?architectures=ARM+64&page=1). DockerHub allows you to search for a specific architecture ([e.g. arm64](https://hub.docker.com/search?type=image&architecture=arm64)). 90 | * Note: specific to containers, you may find that an amd64 (x86-64) container image you currently use has transitioned to a multi-architecture container image when adding Arm64 support. This means you may not find an explicit Arm64 container, so be sure to check for both: some projects choose to vend discrete images for x86-64 and Arm64, while other projects choose to vend a multi-arch image supporting both architectures. 91 | * On GitHub, you can check for Arm64 versions in the release section. However, some projects don't use the release section, or only release source archives, so you may need to visit the main project webpage and check the download section. You can also search the GitHub project for "arm64" or "AArch64" to see whether the project has any Arm64 code contributions or issues. Even if a project does not currently produce builds for Arm64, in many cases an Arm64 version of those packages will be available through Linux distributions or additional package repositories (e.g. [EPEL](https://www.redhat.com/en/blog/whats-epel-and-how-do-i-use-it)). You can search for packages using a package search tool such as [pkgs.org](https://pkgs.org/). 92 | * The download section or platform support matrix of your software vendors: look for references to Arm64, AArch64, AWS Graviton, Ampere Altra, or NVIDIA Grace. 93 | --------------------------------------------------------------------------------