├── .gitignore ├── LICENSE ├── README.md ├── code_of_conduct.md ├── examples ├── README.md ├── anaconda.md ├── arkouda.md ├── gromacs.md ├── hpl-cpu │ ├── ACfL.patch │ ├── GCC_BLIS.patch │ ├── HPL.dat │ ├── NVIDIA_HPC_SDK.patch │ ├── hpl-cpu.md │ └── hplgen.py ├── motorBike.png ├── openfoam.md ├── stream-cpu.md ├── tensorflow-cpu.md ├── tensorflow-gpu.md ├── velox.md └── wrf.md ├── isv.md ├── known_issues.md ├── languages ├── README.md ├── c-c++.md ├── dotnet.md ├── fortran.md ├── golang.md ├── java.md ├── python.md └── rust.md ├── optimization ├── README.md ├── atomics.md ├── cpudetect.md └── vectorization.md ├── slack.png ├── software ├── README.md ├── compilers.md ├── containers.md ├── cuda.md ├── mathlibs.md ├── ml.md ├── mpi.md └── os.md └── transition-guide.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Attribution-ShareAlike 4.0 International Public 2 | License 3 | 4 | By exercising the Licensed Rights (defined below), You accept and agree 5 | to be bound by the terms and conditions of this Creative Commons 6 | Attribution-ShareAlike 4.0 International Public License ("Public 7 | License"). To the extent this Public License may be interpreted as a 8 | contract, You are granted the Licensed Rights in consideration of Your 9 | acceptance of these terms and conditions, and the Licensor grants You 10 | such rights in consideration of benefits the Licensor receives from 11 | making the Licensed Material available under these terms and 12 | conditions. 13 | 14 | 15 | Section 1 -- Definitions. 16 | 17 | a. Adapted Material means material subject to Copyright and Similar 18 | Rights that is derived from or based upon the Licensed Material 19 | and in which the Licensed Material is translated, altered, 20 | arranged, transformed, or otherwise modified in a manner requiring 21 | permission under the Copyright and Similar Rights held by the 22 | Licensor. For purposes of this Public License, where the Licensed 23 | Material is a musical work, performance, or sound recording, 24 | Adapted Material is always produced where the Licensed Material is 25 | synched in timed relation with a moving image. 26 | 27 | b. Adapter's License means the license You apply to Your Copyright 28 | and Similar Rights in Your contributions to Adapted Material in 29 | accordance with the terms and conditions of this Public License. 30 | 31 | c. BY-SA Compatible License means a license listed at 32 | creativecommons.org/compatiblelicenses, approved by Creative 33 | Commons as essentially the equivalent of this Public License. 34 | 35 | d. Copyright and Similar Rights means copyright and/or similar rights 36 | closely related to copyright including, without limitation, 37 | performance, broadcast, sound recording, and Sui Generis Database 38 | Rights, without regard to how the rights are labeled or 39 | categorized. For purposes of this Public License, the rights 40 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 41 | Rights. 42 | 43 | e. Effective Technological Measures means those measures that, in the 44 | absence of proper authority, may not be circumvented under laws 45 | fulfilling obligations under Article 11 of the WIPO Copyright 46 | Treaty adopted on December 20, 1996, and/or similar international 47 | agreements. 48 | 49 | f. 
Exceptions and Limitations means fair use, fair dealing, and/or 50 | any other exception or limitation to Copyright and Similar Rights 51 | that applies to Your use of the Licensed Material. 52 | 53 | g. License Elements means the license attributes listed in the name 54 | of a Creative Commons Public License. The License Elements of this 55 | Public License are Attribution and ShareAlike. 56 | 57 | h. Licensed Material means the artistic or literary work, database, 58 | or other material to which the Licensor applied this Public 59 | License. 60 | 61 | i. Licensed Rights means the rights granted to You subject to the 62 | terms and conditions of this Public License, which are limited to 63 | all Copyright and Similar Rights that apply to Your use of the 64 | Licensed Material and that the Licensor has authority to license. 65 | 66 | j. Licensor means the individual(s) or entity(ies) granting rights 67 | under this Public License. 68 | 69 | k. Share means to provide material to the public by any means or 70 | process that requires permission under the Licensed Rights, such 71 | as reproduction, public display, public performance, distribution, 72 | dissemination, communication, or importation, and to make material 73 | available to the public including in ways that members of the 74 | public may access the material from a place and at a time 75 | individually chosen by them. 76 | 77 | l. Sui Generis Database Rights means rights other than copyright 78 | resulting from Directive 96/9/EC of the European Parliament and of 79 | the Council of 11 March 1996 on the legal protection of databases, 80 | as amended and/or succeeded, as well as other essentially 81 | equivalent rights anywhere in the world. 82 | 83 | m. You means the individual or entity exercising the Licensed Rights 84 | under this Public License. Your has a corresponding meaning. 85 | 86 | 87 | Section 2 -- Scope. 88 | 89 | a. License grant. 90 | 91 | 1. Subject to the terms and conditions of this Public License, 92 | the Licensor hereby grants You a worldwide, royalty-free, 93 | non-sublicensable, non-exclusive, irrevocable license to 94 | exercise the Licensed Rights in the Licensed Material to: 95 | 96 | a. reproduce and Share the Licensed Material, in whole or 97 | in part; and 98 | 99 | b. produce, reproduce, and Share Adapted Material. 100 | 101 | 2. Exceptions and Limitations. For the avoidance of doubt, where 102 | Exceptions and Limitations apply to Your use, this Public 103 | License does not apply, and You do not need to comply with 104 | its terms and conditions. 105 | 106 | 3. Term. The term of this Public License is specified in Section 107 | 6(a). 108 | 109 | 4. Media and formats; technical modifications allowed. The 110 | Licensor authorizes You to exercise the Licensed Rights in 111 | all media and formats whether now known or hereafter created, 112 | and to make technical modifications necessary to do so. The 113 | Licensor waives and/or agrees not to assert any right or 114 | authority to forbid You from making technical modifications 115 | necessary to exercise the Licensed Rights, including 116 | technical modifications necessary to circumvent Effective 117 | Technological Measures. For purposes of this Public License, 118 | simply making modifications authorized by this Section 2(a) 119 | (4) never produces Adapted Material. 120 | 121 | 5. Downstream recipients. 122 | 123 | a. Offer from the Licensor -- Licensed Material. 
Every 124 | recipient of the Licensed Material automatically 125 | receives an offer from the Licensor to exercise the 126 | Licensed Rights under the terms and conditions of this 127 | Public License. 128 | 129 | b. Additional offer from the Licensor -- Adapted Material. 130 | Every recipient of Adapted Material from You 131 | automatically receives an offer from the Licensor to 132 | exercise the Licensed Rights in the Adapted Material 133 | under the conditions of the Adapter's License You apply. 134 | 135 | c. No downstream restrictions. You may not offer or impose 136 | any additional or different terms or conditions on, or 137 | apply any Effective Technological Measures to, the 138 | Licensed Material if doing so restricts exercise of the 139 | Licensed Rights by any recipient of the Licensed 140 | Material. 141 | 142 | 6. No endorsement. Nothing in this Public License constitutes or 143 | may be construed as permission to assert or imply that You 144 | are, or that Your use of the Licensed Material is, connected 145 | with, or sponsored, endorsed, or granted official status by, 146 | the Licensor or others designated to receive attribution as 147 | provided in Section 3(a)(1)(A)(i). 148 | 149 | b. Other rights. 150 | 151 | 1. Moral rights, such as the right of integrity, are not 152 | licensed under this Public License, nor are publicity, 153 | privacy, and/or other similar personality rights; however, to 154 | the extent possible, the Licensor waives and/or agrees not to 155 | assert any such rights held by the Licensor to the limited 156 | extent necessary to allow You to exercise the Licensed 157 | Rights, but not otherwise. 158 | 159 | 2. Patent and trademark rights are not licensed under this 160 | Public License. 161 | 162 | 3. To the extent possible, the Licensor waives any right to 163 | collect royalties from You for the exercise of the Licensed 164 | Rights, whether directly or through a collecting society 165 | under any voluntary or waivable statutory or compulsory 166 | licensing scheme. In all other cases the Licensor expressly 167 | reserves any right to collect such royalties. 168 | 169 | 170 | Section 3 -- License Conditions. 171 | 172 | Your exercise of the Licensed Rights is expressly made subject to the 173 | following conditions. 174 | 175 | a. Attribution. 176 | 177 | 1. If You Share the Licensed Material (including in modified 178 | form), You must: 179 | 180 | a. retain the following if it is supplied by the Licensor 181 | with the Licensed Material: 182 | 183 | i. identification of the creator(s) of the Licensed 184 | Material and any others designated to receive 185 | attribution, in any reasonable manner requested by 186 | the Licensor (including by pseudonym if 187 | designated); 188 | 189 | ii. a copyright notice; 190 | 191 | iii. a notice that refers to this Public License; 192 | 193 | iv. a notice that refers to the disclaimer of 194 | warranties; 195 | 196 | v. a URI or hyperlink to the Licensed Material to the 197 | extent reasonably practicable; 198 | 199 | b. indicate if You modified the Licensed Material and 200 | retain an indication of any previous modifications; and 201 | 202 | c. indicate the Licensed Material is licensed under this 203 | Public License, and include the text of, or the URI or 204 | hyperlink to, this Public License. 205 | 206 | 2. You may satisfy the conditions in Section 3(a)(1) in any 207 | reasonable manner based on the medium, means, and context in 208 | which You Share the Licensed Material. 
For example, it may be 209 | reasonable to satisfy the conditions by providing a URI or 210 | hyperlink to a resource that includes the required 211 | information. 212 | 213 | 3. If requested by the Licensor, You must remove any of the 214 | information required by Section 3(a)(1)(A) to the extent 215 | reasonably practicable. 216 | 217 | b. ShareAlike. 218 | 219 | In addition to the conditions in Section 3(a), if You Share 220 | Adapted Material You produce, the following conditions also apply. 221 | 222 | 1. The Adapter's License You apply must be a Creative Commons 223 | license with the same License Elements, this version or 224 | later, or a BY-SA Compatible License. 225 | 226 | 2. You must include the text of, or the URI or hyperlink to, the 227 | Adapter's License You apply. You may satisfy this condition 228 | in any reasonable manner based on the medium, means, and 229 | context in which You Share Adapted Material. 230 | 231 | 3. You may not offer or impose any additional or different terms 232 | or conditions on, or apply any Effective Technological 233 | Measures to, Adapted Material that restrict exercise of the 234 | rights granted under the Adapter's License You apply. 235 | 236 | 237 | Section 4 -- Sui Generis Database Rights. 238 | 239 | Where the Licensed Rights include Sui Generis Database Rights that 240 | apply to Your use of the Licensed Material: 241 | 242 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 243 | to extract, reuse, reproduce, and Share all or a substantial 244 | portion of the contents of the database; 245 | 246 | b. if You include all or a substantial portion of the database 247 | contents in a database in which You have Sui Generis Database 248 | Rights, then the database in which You have Sui Generis Database 249 | Rights (but not its individual contents) is Adapted Material, 250 | 251 | including for purposes of Section 3(b); and 252 | c. You must comply with the conditions in Section 3(a) if You Share 253 | all or a substantial portion of the contents of the database. 254 | 255 | For the avoidance of doubt, this Section 4 supplements and does not 256 | replace Your obligations under this Public License where the Licensed 257 | Rights include other Copyright and Similar Rights. 258 | 259 | 260 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 261 | 262 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 263 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 264 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 265 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 266 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 267 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 268 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 269 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 270 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 271 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 272 | 273 | b. 
TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 274 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 275 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 276 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 277 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 278 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 279 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 280 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 281 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 282 | 283 | c. The disclaimer of warranties and limitation of liability provided 284 | above shall be interpreted in a manner that, to the extent 285 | possible, most closely approximates an absolute disclaimer and 286 | waiver of all liability. 287 | 288 | 289 | Section 6 -- Term and Termination. 290 | 291 | a. This Public License applies for the term of the Copyright and 292 | Similar Rights licensed here. However, if You fail to comply with 293 | this Public License, then Your rights under this Public License 294 | terminate automatically. 295 | 296 | b. Where Your right to use the Licensed Material has terminated under 297 | Section 6(a), it reinstates: 298 | 299 | 1. automatically as of the date the violation is cured, provided 300 | it is cured within 30 days of Your discovery of the 301 | violation; or 302 | 303 | 2. upon express reinstatement by the Licensor. 304 | 305 | For the avoidance of doubt, this Section 6(b) does not affect any 306 | right the Licensor may have to seek remedies for Your violations 307 | of this Public License. 308 | 309 | c. For the avoidance of doubt, the Licensor may also offer the 310 | Licensed Material under separate terms or conditions or stop 311 | distributing the Licensed Material at any time; however, doing so 312 | will not terminate this Public License. 313 | 314 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 315 | License. 316 | 317 | 318 | Section 7 -- Other Terms and Conditions. 319 | 320 | a. The Licensor shall not be bound by any additional or different 321 | terms or conditions communicated by You unless expressly agreed. 322 | 323 | b. Any arrangements, understandings, or agreements regarding the 324 | Licensed Material not stated herein are separate from and 325 | independent of the terms and conditions of this Public License. 326 | 327 | 328 | Section 8 -- Interpretation. 329 | 330 | a. For the avoidance of doubt, this Public License does not, and 331 | shall not be interpreted to, reduce, limit, restrict, or impose 332 | conditions on any use of the Licensed Material that could lawfully 333 | be made without permission under this Public License. 334 | 335 | b. To the extent possible, if any provision of this Public License is 336 | deemed unenforceable, it shall be automatically reformed to the 337 | minimum extent necessary to make it enforceable. If the provision 338 | cannot be reformed, it shall be severed from this Public License 339 | without affecting the enforceability of the remaining terms and 340 | conditions. 341 | 342 | c. No term or condition of this Public License will be waived and no 343 | failure to comply consented to unless expressly agreed to by the 344 | Licensor. 345 | 346 | d. 
Nothing in this Public License constitutes or may be interpreted 347 | as a limitation upon, or waiver of, any privileges and immunities 348 | that apply to the Licensor or You, including from the legal 349 | processes of any jurisdiction or authority. 350 | 351 | 352 | ======================================================================= 353 | 354 | Creative Commons is not a party to its public 355 | licenses. Notwithstanding, Creative Commons may elect to apply one of 356 | its public licenses to material it publishes and in those instances 357 | will be considered the “Licensor.” The text of the Creative Commons 358 | public licenses is dedicated to the public domain under the CC0 Public 359 | Domain Dedication. Except for the limited purpose of indicating that 360 | material is shared under a Creative Commons public license or as 361 | otherwise permitted by the Creative Commons policies published at 362 | creativecommons.org/policies, Creative Commons does not authorize the 363 | use of the trademark "Creative Commons" or any other trademark or logo 364 | of Creative Commons without its prior written consent including, 365 | without limitation, in connection with any unauthorized modifications 366 | to any of its public licenses or any other arrangements, 367 | understandings, or agreements concerning use of licensed material. For 368 | the avoidance of doubt, this paragraph does not form part of the 369 | public licenses. 370 | 371 | Creative Commons may be contacted at creativecommons.org. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Getting started with HPC on Arm64 2 | This guide includes how-to guides, sample code, recommendations, and technical best practices to help new users get started with Arm-based systems like the NVIDIA Arm HPC Developer Kit. While it is intended for the users and administrators of NVIDIA's Arm-based platforms, this guide is also generically useful for anyone running HPC applications on Arm CPUs, with or without GPUs. The focus is mostly on the CPU since Arm-hosted GPUs are just the same as GPUs hosted by any other CPUs. 3 | 4 | 5 | ## Contents 6 | * [Introduction to Arm64 and the NVIDIA Arm HPC Developer Kit](#introduction-to-arm64-and-the-nvidia-hpc-developer-kit) 7 | * [Transitioning Workloads to Arm64](transition-guide.md) 8 | * [Commercial Software (ISV)](isv.md) 9 | * [Examples](examples/README.md) 10 | * [Benchmarks and Health Tests](examples/README.md#benchmarks-and-health-tests) 11 | * [HPL on CPU](examples/hpl-cpu/hpl-cpu.md) 12 | * [STREAM on CPU](examples/stream-cpu.md) 13 | * [Modeling and Simulation](examples/README.md#modeling-and-simulation) 14 | * [OpenFOAM](examples/openfoam.md) 15 | * [WRF](examples/wrf.md) 16 | * [... see all mod-sim examples](examples/README.md#modeling-and-simulation) 17 | * [Machine Learning](examples/README.md#machine-learning) 18 | * [TensorFlow GPU-accelerated Training and Inference](examples/tensorflow-gpu.md) 19 | * [Tensorflow On-CPU Inference](examples/tensorflow-cpu.md) 20 | * [... see all ML examples](examples/README.md#machine-learning) 21 | * [Data Science](examples/README.md#data-science) 22 | * [Anaconda, Miniconda, Conda, Mamba](examples/anaconda.md) 23 | * [Arkouda](examples/arkouda.md) 24 | * [... see all DS examples](examples/README.md#data-science) 25 | * ... and more! 
[See the full list of examples here](examples/README.md) 26 | * [Supported Software](software/README.md) (See [the full list](software/README.md)) 27 | * [Machine Learning](software/ml.md) 28 | * [Compilers](software/compilers.md) and [CUDA](software/cuda.md) 29 | * [Message Passing (MPI)](software/mpi.md) 30 | * [Math Libraries](software/mathlibs.md) 31 | * [Containers](software/containers.md) 32 | * [Operating Systems](software/os.md) 33 | * ... and more! [See here for more details](software/README.md) 34 | * [Optimizing for Arm64](optimization/README.md) 35 | * [Arm SIMD Instructions: SVE and NEON](optimization/vectorization.md) 36 | * [Arm Atomic Instructions: LSE](optimization/atomics.md) 37 | * [Language-specific Considerations](languages/README.md) 38 | * [C/C++](languages/c-c++.md) 39 | * [Fortran](languages/fortran.md) 40 | * [Python](languages/python.md) 41 | * [Rust](languages/rust.md) 42 | * [Go](languages/golang.md) 43 | * [Java](languages/java.md) 44 | * [.NET](languages/dotnet.md) 45 | * [Additional Resources](#additional-resources) 46 | * [Acknowledgements](#acknowledgements) 47 | 48 | 49 | ## Join the NVIDIA Arm Community! 50 | 51 | [![Join us on Slack](slack.png)](https://join.slack.com/t/nvidia-arm-hpc/shared_invite/zt-1cbn4fksk-P6n5zBX5Mz4RnaTj9t3axg) 52 | 53 | The easiest way to find help and talk to the experts is to join the NVIDIA Arm HPC Slack workspace. 54 | 55 | 56 | ## Introduction to Arm64 and the NVIDIA HPC Developer Kit 57 | The NVIDIA Arm HPC Developer Kit (simply "DevKit" in this guide) is an integrated hardware and software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications on a heterogeneous GPU- and CPU-accelerated computing system. The kit includes an Arm CPU, dual NVIDIA A100 Tensor Core GPUs, dual NVIDIA BlueField-2 DPUs, and the NVIDIA HPC SDK suite of tools. [See the product page for more information.](https://developer.nvidia.com/arm-hpc-devkit) 58 | 59 | This validated platform provides quick and easy bring-up and a stable environment for accelerated code execution and evaluation, performance analysis, system experimentation, and system characterization. 60 | * Delivers a validated system for quick and easy bring-up in familiar HPC environments 61 | * Offers a stable hardware and software platform for development and performance analysis of accelerated HPC, AI, and scientific computing applications 62 | * Enables experimentation and characterization of high-performance, NVIDIA-accelerated, Arm server-based system architectures 63 | 64 | Hardware | Specification 65 | -------- | -------- 66 | Model | GIGABYTE G242-P32, 2U server 67 | CPU | 1x Ampere Altra Q80-30 (Arm processor) 68 | GPU | 2x NVIDIA A100 GPU 69 | Memory | 512G DDR4 memory 70 | Storage | 6TB SAS/ SATA 3.5″ 71 | Network | 2x NVIDIA BlueField-2 E-Series DPU: 200GbE/HDR single-port QSFP56 72 | 73 | The DevKit CPU uses the Arm architecture. The Arm architecture powers over *two hundred billion* chips across practically all computing domains, so the term "Arm" is somewhat overloaded. Various communities refer to the architecture as "Arm", "ARM", "Arm64", "AArch64", "arm64", etc. You may also find the term "SBSA" used to refer to server-class Arm CPUs. For simplicity, this guide will use the term **"Arm64"** to refer to any CPU built on the Armv8 or Armv9 standards and implementing [Arm's Server Base System Architecture (SBSA)](https://developer.arm.com/documentation/den0029/latest). 
This includes CPUs like: 74 | 75 | * [Ampere Altra](https://amperecomputing.com/processors/ampere-altra/) (NVIDIA Arm HPC Developer Kit) 76 | * [NVIDIA Grace](https://www.nvidia.com/en-us/data-center/grace-cpu/) 77 | * [AWS Graviton](https://aws.amazon.com/ec2/graviton/) 78 | * [Alibaba Yitian](https://fuse.wikichip.org/news/tag/yitian-710/) 79 | 80 | This guide will call out differences between Arm64 CPUs as needed. Note that this guide is not intended for mobile and embedded Arm CPUs e.g. NVIDIA Tegra. While many of the general principles and approaches presented here will hold true for mobile and embedded Arm platforms, this guide is focused on server-class platforms. 81 | 82 | 83 | ## Additional resources 84 | * [NVIDIA Arm HPC Developer Kit](https://developer.nvidia.com/arm-hpc-devkit) 85 | * [Neoverse N1 Software Optimization Guide](https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd) 86 | * [Armv8 reference manual](https://documentation-service.arm.com/static/60119835773bb020e3de6fee) 87 | * [Package repository search tool](https://pkgs.org/) 88 | 89 | 90 | # License 91 | 92 | [![CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](http://creativecommons.org/licenses/by-sa/4.0/) 93 | 94 | Unless otherwise indicated, this work is licensed under a 95 | [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/). Individual examples or attached source code may be under a different license. Check the related README or LICENSE files. 96 | 97 | 98 | # Acknowledgements 99 | This guide was inspired by and borrows from the excellent [AWS Graviton Getting Started Guide](https://github.com/aws/aws-graviton-getting-started). The authors of this guide gratefully acknowledge the work of the AWS engineers and thank AWS for freely providing this valuable information in the public domain. 100 | 101 | 102 | **Feedback?** jlinford@nvidia.com 103 | 104 | -------------------------------------------------------------------------------- /code_of_conduct.md: -------------------------------------------------------------------------------- 1 | # Arm HPC Developer Kit Community Code of Conduct 2 | 3 | This Code of Conduct governs the Arm HPC Developer Kit Community's Community Slack, virtual and in-person events and any other online discussions. 4 | 5 | ## Introduction 6 | * Diversity and inclusion make our community strong. We encourage participation from the most varied and diverse backgrounds possible and want to be very clear about where we stand. 7 | * Our goal is to maintain a safe, helpful and friendly community for everyone, regardless of experience, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, nationality, or other defining characteristic. 8 | * This code and related procedures apply to unacceptable behavior occurring in all community venues, including behavior outside the scope of community activities — online and in-person— as well as in all one-on-one communications, and anywhere such behavior has the potential to adversely affect the safety and well-being of community members. 9 | 10 | ## Expected Behavior 11 | * Be welcoming. 12 | * Be kind. 13 | * Look out for each other. 14 | 15 | ## Unacceptable Behavior 16 | * Conduct or speech which might be considered sexist, racist, homophobic, transphobic, ableist or otherwise discriminatory or offensive in nature. 
17 | * Do not use unwelcome, suggestive, derogatory or inappropriate nicknames or terms. 18 | * Do not show disrespect towards others (jokes, innuendo, dismissive attitudes). 19 | * Intimidation or harassment (online or in-person). 20 | * Disrespect towards differences of opinion. 21 | * Inappropriate attention or contact. Be aware of how your actions affect others. If it makes someone uncomfortable, stop. 22 | * Not understanding the differences between constructive criticism and disparagement. 23 | * Sustained disruptions. 24 | * Violence, threats of violence or violent language. 25 | 26 | ## Enforcement 27 | * Understand that speech and actions have consequences, and unacceptable behavior will not be tolerated. 28 | * If violations occur, organizers will take any action they deem appropriate for the infraction, up to and including expulsion. 29 | 30 | If you are the subject of, or witness to, any violations of this Code of Conduct, please contact us via email. -------------------------------------------------------------------------------- /examples/README.md: -------------------------------------------------------------------------------- 1 | # Example Applications 2 | Following these step-by-step instructions is a great way to get started with your new Arm64 system. The codes presented here represent major HPC application areas and are a good starting point for running similar applications on Arm64. Per-code recommendations and best practices are also provided to give a sense of how applications from these areas generally perform on Arm64. 3 | 4 | ## Benchmarks and Health Tests 5 | 6 | These benchmarks were generated from a known-good NVIDIA Arm HPC DevKit and provide a _lower bound_ for expected out-of-the-box performance. They can be used to determine if your system is configured correctly and operating properly. It's possible you may exceed these numbers. **They are not intended for use in any competitive analysis.** 7 | 8 | * [HPL on CPU](hpl-cpu/hpl-cpu.md) 9 | * [STREAM on CPU](stream-cpu.md) 10 | 11 | ## Modeling and Simulation 12 | 13 | The high memory bandwidth of the Ampere Altra CPU makes it an excellent platform for CPU-only HPC applications. 14 | 15 | * [GROMACS](gromacs.md) 16 | * [OpenFOAM](openfoam.md) 17 | * [WRF](wrf.md) 18 | 19 | ## Machine Learning 20 | 21 | * [TensorFlow GPU-accelerated Training and Inference](tensorflow-gpu.md) 22 | * [TensorFlow On-CPU Inference](tensorflow-cpu.md) 23 | 24 | ## Data Science 25 | 26 | * [Anaconda, Miniconda, Conda, Mamba](anaconda.md) 27 | * [Arkouda](arkouda.md) 28 | 29 | ## ... and more! 30 | 31 | * [Velox](velox.md) 32 | -------------------------------------------------------------------------------- /examples/anaconda.md: -------------------------------------------------------------------------------- 1 | # Anaconda, Miniconda, Conda, and Mamba on Arm64 2 | Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. 3 | 4 | 5 | ## Anaconda 6 | Anaconda announced [support for Arm64 via AWS Graviton 2 on May 14, 2021](https://www.anaconda.com/blog/anaconda-aws-graviton2). The Ampere Altra CPU found in the NVIDIA Arm HPC DevKit is based on the same Arm Neoverse N1 core as the AWS Graviton2, so Anaconda also supports the Ampere Altra. *IMPORTANT*: if you encounter errors about missing libraries, see [the dependency information below](#a-quick-note-on-dependencies).
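Before downloading an installer, it can be worth confirming that the host really is an Arm64 (aarch64) machine so you pick up the right build; for example:
```bash
# Check the machine architecture; this should print "aarch64" on Arm64 systems
uname -m
```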
7 | 8 | ```bash 9 | # Download Anaconda installer 10 | wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-aarch64.sh 11 | # Run the installer 12 | bash Anaconda3-2022.05-Linux-aarch64.sh 13 | ``` 14 | 15 | Additional installation instructions can be found at https://docs.anaconda.com/anaconda/install/graviton2/. 16 | 17 | 18 | ## Miniconda Example 19 | Anaconda also offers a lightweight version called [Miniconda](https://docs.conda.io/en/latest/miniconda.html) which is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. 20 | 21 | Here is an example of how to use Miniconda to install [numpy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) for Python 3.9. The resulting installation has a much smaller footprint than the full Anaconda. 22 | 23 | The first step is to install conda: 24 | ```bash 25 | # Download Miniconda installer 26 | wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh 27 | # Run the installer 28 | bash Miniconda3-latest-Linux-aarch64.sh 29 | ``` 30 | 31 | Once installed, you can either use the `conda` command directly to install packages, or write an environment definition file and create the corresponding environment. 32 | 33 | Here's an example to install [numpy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) (`arm64-example.yml`): 34 | ```yaml 35 | name: arm64-example 36 | dependencies: 37 | - numpy 38 | - pandas 39 | ``` 40 | 41 | The next step is to instantiate the environment from that definition: 42 | ```bash 43 | conda env create -f arm64-example.yml 44 | ``` 45 | And you can now use numpy and pandas. 46 | 47 | 48 | ## Mamba 49 | 50 | Mamba is a fast, robust, and cross-platform package manager that is fully compatible with conda packages and supports most of conda’s commands. It fully supports Arm64. See https://mamba.readthedocs.io/en/latest/# for details. 51 | 52 | 53 | ## A quick note on dependencies 54 | 55 | This isn't really an Arm64 requirement since it applies to all platforms. The installers mentioned in this document don't pull in distro-provided dependencies. If you see an error like this: 56 | ``` 57 | /data/jlinford/anaconda3/bin/gtk-query-immodules-3.0: error while loading shared libraries: libXfixes.so.3: cannot open shared object file: No such file or directory 58 | ``` 59 | then search for the missing library and install the appropriate package. For example, on Ubuntu Server 20.04: 60 | ```bash 61 | sudo apt-get install libxi6 libgconf-2-4 libxfixes3 libxcursor1 62 | ``` 63 | It may take a few tries to get all the dependencies sorted, especially if you're working on a minimal installation of the OS. -------------------------------------------------------------------------------- /examples/arkouda.md: -------------------------------------------------------------------------------- 1 | # Arkouda Server and Client on Arm64 2 | 3 | ![Arkouda Logo](https://github.com/Bears-R-Us/arkouda/raw/master/pictures/arkouda_wide_marker1.png) 4 | 5 | Arkouda allows a user to interactively issue massively parallel computations on distributed data using functions and syntax that mimic NumPy, the underlying computational library used in the vast majority of Python data science workflows.
The computational heart of Arkouda is a Chapel interpreter that accepts a pre-defined set of commands from a client and uses Chapel's built-in machinery for multi-locale and multithreaded execution. Arkouda has benefited greatly from Chapel's distinctive features and has also helped guide the development of the language. For more details see https://github.com/Bears-R-Us/arkouda. 6 | 7 | ## Initial Configuration 8 | 9 | Clone the Arkouda repo. The following steps will take place inside the repo directory. 10 | ```bash 11 | git clone https://github.com/Bears-R-Us/arkouda.git 12 | cd arkouda 13 | ``` 14 | We recommend using GCC version 11 or later. For example, Spack is an easy way to install GCC 12.1.0: 15 | ```bash 16 | # Install Spack, if you haven't already 17 | git clone https://github.com/spack/spack.git 18 | 19 | # Install gcc+binutils, if you haven't already 20 | spack install gcc@12.1.0+binutils 21 | spack load gcc@12.1.0+binutils 22 | spack compiler find 23 | ``` 24 | 25 | ## Install Arkouda Client 26 | We will use [Miniconda](https://docs.conda.io/en/latest/miniconda.html) to provide a Python environment and manage Python dependencies. 27 | ```bash 28 | # Install Miniconda3 29 | wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh 30 | chmod +x Miniconda3-latest-Linux-aarch64.sh 31 | ./Miniconda3-latest-Linux-aarch64.sh -b 32 | 33 | # Add Miniconda3 to the environment 34 | source ~/.bashrc 35 | 36 | # Make sure you're in the Arkouda repo directory. 37 | cd arkouda 38 | 39 | # Developer conda env 40 | conda env create -f arkouda-env-dev.yml 41 | conda activate arkouda-dev 42 | 43 | # User conda env 44 | conda env create -f arkouda-env.yml 45 | conda activate arkouda 46 | 47 | # Install client and server dependencies 48 | conda install -y jupyter 49 | conda install -y "cmake>=3.11.0"   # quoted so the shell doesn't treat '>' as a redirect 50 | 51 | # Install the Arkouda Client Package 52 | pip install -e .
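# Optional sanity check (not part of the original instructions): with the
# conda environment still active, the Arkouda client should import cleanly.
python3 -c "import arkouda as ak; print('arkouda client imported OK')"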
53 | ``` 54 | 55 | 56 | ## Install Chapel 57 | ```bash 58 | # Download and unpack the Chapel source in the location where you wish to install Chapel 59 | curl -L https://github.com/chapel-lang/chapel/releases/download/1.27.0/chapel-1.27.0.tar.gz | tar xvzf - 60 | 61 | cd chapel-1.27.0 62 | 63 | # Set CHPL_HOME 64 | export CHPL_HOME=$PWD 65 | 66 | # Add chpl to PATH 67 | source $CHPL_HOME/util/setchplenv.bash 68 | 69 | # Set remaining env variables and execute make 70 | # Arkouda documentation recommends adding these variables to a ~/.chplconfig file to prevent having to export them again 71 | cat > ~/.chplconfig <: Unsupported system page size 175 | INFO:root:Running client "['python3', '/global/home/groups/amp/arkouda/tests/check.py', 'localhost', '5555']" 176 | _ _ _ 177 | / \ _ __| | _____ _ _ __| | __ _ 178 | / _ \ | '__| |/ / _ \| | | |/ _` |/ _` | 179 | / ___ \| | | < (_) | |_| | (_| | (_| | 180 | /_/ \_\_| |_|\_\___/ \__,_|\__,_|\__,_| 181 | 182 | 183 | Client Version: v2022.07.08+6.g9d84d1ea.dirty 184 | : Unsupported system page size 185 | >>> Sanity checks on the arkouda_server 186 | connected to arkouda server tcp://*:5555 187 | check boolean : Passed 188 | check arange : Passed 189 | check linspace : Passed 190 | check ones : Passed 191 | check zeros : Passed 192 | check argsort : Passed 193 | check coargsort : Passed 194 | check sort : Passed 195 | check get slice [::2] : Passed 196 | check set slice [::2] = value: Passed 197 | check set slice [::2] = pda: Passed 198 | check (compressing) get bool iv : Passed 199 | check (expanding) set bool iv = value: Passed 200 | check (expanding) set bool iv = pda: Passed 201 | check (gather) get integer iv: Passed 202 | check (scatter) set integer iv = value: Passed 203 | check (scatter) set integer iv = pda: Passed 204 | check get integer idx : Passed 205 | check set integer idx = value: Passed 206 | disconnected from arkouda server tcp://*:5555 207 | INFO:root:Running client "['python3', '/global/home/groups/amp/arkouda/util/test/shutdown.py', 'localhost', '5555']" 208 | : Unsupported system page size 209 | connected to arkouda server tcp://*:5555 210 | Success running checks 211 | ``` -------------------------------------------------------------------------------- /examples/gromacs.md: -------------------------------------------------------------------------------- 1 | # GROMACS on Arm64 CPU with NVIDIA GPU 2 | 3 | GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles and is a community-driven project. See https://www.gromacs.org/. 4 | 5 | One of GROMACS' attractive features is a high level of optimization for multiple devices, including both Arm64 CPUs and NVIDIA GPUs. This example will show how to run GROMACS on either the CPU, the GPU, or both! 6 | 7 | ## Benchmark Files 8 | 9 | We'll use the standard Gromacs benchmark data set, ADH. 10 | ```bash 11 | mkdir gromacs 12 | cd gromacs 13 | wget ftp://ftp.gromacs.org/benchmarks/ADH_bench_systems.tar.gz 14 | tar xvzf ADH_bench_systems.tar.gz 15 | ``` 16 | 17 | ## NVIDIA NGC Container 18 | 19 | The easiest way to run GROMACS is to use the optimized GROMACS containers available via NVIDIA NGC. See [the containers page](../software/containers.md) for detailed instructions on how to enable NGC on your NVIDIA DevKit. For more general information, see https://docs.nvidia.com/ngc/ for prerequisites and setup steps for all HPC containers and instructions for pulling NGC containers. 
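If you prefer to separate the one-time image download from the benchmark runs, you can pull the GROMACS image up front; the tag below is the same one used in all of the commands that follow:
```bash
# Pre-pull the NGC GROMACS image used throughout this example
docker pull nvcr.io/hpc/gromacs:2022.1
```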
20 | 21 | We'll be using the GROMACS container available at https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs. Start by preprocessing the ADH benchmark files: 22 | ```bash 23 | # Navigate to the ADH benchmarks directory 24 | cd gromacs/ADH 25 | # Run the GROMACS preprocessor 26 | docker run -ti --runtime nvidia \ 27 | -v /dev/infiniband:/dev/infiniband \ 28 | -v $(pwd)/adh_cubic:/benchmark \ 29 | --workdir /benchmark \ 30 | nvcr.io/hpc/gromacs:2022.1 sh -c "gmx grompp -f pme_verlet.mdp" 31 | ``` 32 | 33 | ### Running on GPU and CPU 34 | We'll want to map the MPI ranks (`ntmpi`) and OpenMP threads (`ntomp`) such that `ntmpi * ntomp` equals the total number of CPU cores, and `ntmpi` is a multiple of the number of GPUs available. The NVIDIA Arm HPC DevKit has 80 CPU cores and two A100 GPUs, and keeping `ntomp` at about 8 generally produces the best performance. Therefore we use: 35 | * `-ntmpi 10`: Ten MPI ranks, five GPU tasks per GPU 36 | * `-ntomp 8`: Eight OpenMP threads per MPI rank. 37 | 38 | To enable GPU acceleration, we need to indicate which parts of the computation will execute on the CPU, and which will execute on the GPU. GROMACS provides functionality to account for a wide range of different types of force calculations. For most simulations, the three most important classes (in terms of computational expense) are specified with these command line options: 39 | * `-nb`: Non-bonded short-range forces 40 | * `-bonded`: Bonded short-range forces 41 | * `-pme`: Particle Mesh Ewald (PME) long-range forces 42 | 43 | For example, to calculate non-bonded short-range forces on the GPU, use `-nb gpu`. Or, to calculate these forces on the CPU, use `-nb cpu`. Note that calculating on the CPU only will take _dramatically_ longer than calculating on the GPU. For more information about GROMACS command line options, see [the GROMACS manual](https://manual.gromacs.org/current/index.html). 44 | 45 | ```bash 46 | # Run GROMACS 47 | docker run -ti --runtime nvidia -v /dev/infiniband:/dev/infiniband -v $(pwd)/adh_cubic:/benchmark --workdir /benchmark nvcr.io/hpc/gromacs:2022.1 sh -c "gmx mdrun -v -nsteps 100000 -resetstep 90000 -noconfout -ntmpi 10 -ntomp 8 -nb gpu -bonded gpu -pme gpu -npme 1 -nstlist 400 -s topol.tpr" 48 | ``` 49 | 50 | Using two A100 GPUs, you should see a score of about 240 ns/day: 51 | ``` 52 | Core t (s) Wall t (s) (%) 53 | Time: 576.388 7.206 7999.1 54 | (ns/day) (hour/ns) 55 | Performance: 239.835 0.100 56 | ``` 57 | 58 | ### Additional optimizations for NVIDIA GPU 59 | 60 | This [technical blog post from NVIDIA](https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/) lists three environment variables that can be set to further improve performance in this benchmark. Note that there are some trade-offs in using these options. 61 | * `GMX_GPU_DD_COMMS=true`: enable halo exchange communications between PP tasks 62 | * `GMX_GPU_PME_PP_COMMS=true`: enable communications between PME and PP tasks 63 | * `GMX_FORCE_UPDATE_DEFAULT_GPU=true`: enable the update and constraints part of the timestep for multi-GPU 64 | 65 | The combination of these settings triggers all optimizations, including dependencies such as GPU-acceleration of buffer operations. When using these options, it's best to keep a low number of MPI ranks and increase the number of OpenMP threads. The suggested layout for these options with two GPUs and 80 CPU cores is four MPI ranks with 20 OpenMP threads each.
Use the `--env` flag to set the environment variables in the container: 66 | ```bash 67 | docker run -ti --runtime nvidia \ 68 | -v /dev/infiniband:/dev/infiniband \ 69 | -v $(pwd)/adh_cubic:/benchmark \ 70 | --workdir /benchmark \ 71 | --env GMX_GPU_DD_COMMS=true \ 72 | --env GMX_GPU_PME_PP_COMMS=true \ 73 | --env GMX_FORCE_UPDATE_DEFAULT_GPU=true \ 74 | nvcr.io/hpc/gromacs:2022.1 \ 75 | sh -c "gmx mdrun -v -noconfout \ 76 | -nsteps 100000 -resetstep 90000 \ 77 | -ntmpi 4 -ntomp 20 -npme 1 \ 78 | -nb gpu -bonded gpu -pme gpu \ 79 | -pin on \ 80 | -nstlist 400 \ 81 | -s topol.tpr" 82 | ``` 83 | 84 | With these additional environment variables, you should see about 320 ns/day: 85 | ``` 86 | Core t (s) Wall t (s) (%) 87 | Time: 434.716 5.437 7996.2 88 | (ns/day) (hour/ns) 89 | Performance: 317.881 0.076 90 | ``` 91 | 92 | 93 | ### Interactive shell in GROMACS container 94 | 95 | The following command will launch an interactive shell in the GROMACS container using `nvidia-docker`, mounting `$HOME/data` from the underlying system as `/data` in the container: 96 | 97 | ``` 98 | $ docker run -it --rm --runtime nvidia --privileged -v $HOME/data:/data nvcr.io/hpc/gromacs:2022.1 99 | ``` 100 | The command line options are: 101 | * `-it`: start container with an interactive terminal (short for `--interactive --tty`) 102 | * `--rm`: make container ephemeral (removes container on exit) 103 | * `-v host_path:/data`: bind mount `host_path` into the container as `/data` 104 | * `--runtime nvidia`: allow NVIDIA GPUs 105 | * `--privileged`: allow other devices like InfiniBand 106 | * See also: [How-to: Deploy RDMA accelerated Docker container over InfiniBand fabric](https://docs.nvidia.com/networking/pages/releaseview.action?pageId=15049785) 107 | 108 | This should produce a root prompt within the container. 109 | 110 | 111 | ## Installing from source 112 | 113 | Spack is the easiest way to build GROMACS from source. 114 | 115 | ```bash 116 | # Clone the Spack repo, if you haven't already 117 | git clone https://github.com/spack/spack.git 118 | # Recommended but optional: use the latest GCC from Spack. You may be able to use other compilers, but this is known to work well.
119 | spack install gcc+binutils+piclibs 120 | # Add the new GCC installation to your environment 121 | spack load gcc 122 | # Update Spack's compiler configuration 123 | spack compiler find 124 | 125 | # Install GROMACS with GPU acceleration 126 | spack install -j80 gromacs+cuda %gcc@12.1.0 127 | # If no GPUs available, install unaccelerated GROMACS 128 | spack install -j80 gromacs %gcc@12.1.0 129 | ``` 130 | -------------------------------------------------------------------------------- /examples/hpl-cpu/ACfL.patch: -------------------------------------------------------------------------------- 1 | --- Make.ACfL 2022-07-19 13:11:08.954180000 -0400 2 | +++ Make.ACfL.patched 2022-07-11 16:45:10.935651000 -0400 3 | @@ -61,13 +61,13 @@ 4 | # - Platform identifier ------------------------------------------------ 5 | # ---------------------------------------------------------------------- 6 | # 7 | -ARCH = UNKNOWN 8 | +ARCH = ACfL 9 | # 10 | # ---------------------------------------------------------------------- 11 | # - HPL Directory Structure / HPL library ------------------------------ 12 | # ---------------------------------------------------------------------- 13 | # 14 | -TOPdir = $(HOME)/hpl 15 | +TOPdir = $(HOME)/benchmarks/hpl-2.3 16 | INCdir = $(TOPdir)/include 17 | BINdir = $(TOPdir)/bin/$(ARCH) 18 | LIBdir = $(TOPdir)/lib/$(ARCH) 19 | @@ -94,7 +94,7 @@ 20 | # 21 | LAdir = 22 | LAinc = 23 | -LAlib = -lblas 24 | +LAlib = 25 | # 26 | # ---------------------------------------------------------------------- 27 | # - F77 / C interface -------------------------------------------------- 28 | @@ -156,7 +156,7 @@ 29 | # *) call the BLAS Fortran 77 interface, 30 | # *) not display detailed timing information. 31 | # 32 | -HPL_OPTS = 33 | +HPL_OPTS = -DHPL_CALL_CBLAS 34 | # 35 | # ---------------------------------------------------------------------- 36 | # 37 | @@ -167,11 +167,11 @@ 38 | # ---------------------------------------------------------------------- 39 | # 40 | CC = mpicc 41 | -CCNOOPT = $(HPL_DEFS) 42 | -CCFLAGS = $(HPL_DEFS) 43 | +CCNOOPT = $(HPL_DEFS) -O0 44 | +CCFLAGS = $(HPL_DEFS) -Ofast -mcpu=neoverse-n1 -armpl 45 | # 46 | -LINKER = mpif77 47 | -LINKFLAGS = 48 | +LINKER = mpicc 49 | +LINKFLAGS = -armpl 50 | # 51 | ARCHIVER = ar 52 | ARFLAGS = r 53 | -------------------------------------------------------------------------------- /examples/hpl-cpu/GCC_BLIS.patch: -------------------------------------------------------------------------------- 1 | --- Make.GCC_BLIS 2022-07-19 12:28:40.830337000 -0400 2 | +++ Make.GCC_BLIS.patched 2022-07-19 13:03:47.729051000 -0400 3 | @@ -61,13 +61,13 @@ 4 | # - Platform identifier ------------------------------------------------ 5 | # ---------------------------------------------------------------------- 6 | # 7 | -ARCH = UNKNOWN 8 | +ARCH = GCC_BLIS 9 | # 10 | # ---------------------------------------------------------------------- 11 | # - HPL Directory Structure / HPL library ------------------------------ 12 | # ---------------------------------------------------------------------- 13 | # 14 | -TOPdir = $(HOME)/hpl 15 | +TOPdir = $(HOME)/benchmarks/hpl-2.3 16 | INCdir = $(TOPdir)/include 17 | BINdir = $(TOPdir)/bin/$(ARCH) 18 | LIBdir = $(TOPdir)/lib/$(ARCH) 19 | @@ -92,9 +92,9 @@ 20 | # header files, LAlib is defined to be the name of the library to be 21 | # used. The variable LAdir is only used for defining LAinc and LAlib. 
22 | # 23 | -LAdir = 24 | -LAinc = 25 | -LAlib = -lblas 26 | +LAdir = $(HOME)/benchmarks/blis_gcc-11.2.0_thunderx2 27 | +LAinc = -I$(LAdir)/include/blis 28 | +LAlib = -L$(LAdir)/lib -lblis 29 | # 30 | # ---------------------------------------------------------------------- 31 | # - F77 / C interface -------------------------------------------------- 32 | @@ -156,7 +156,7 @@ 33 | # *) call the BLAS Fortran 77 interface, 34 | # *) not display detailed timing information. 35 | # 36 | -HPL_OPTS = 37 | +HPL_OPTS = -DHPL_CALL_CBLAS 38 | # 39 | # ---------------------------------------------------------------------- 40 | # 41 | @@ -167,10 +167,10 @@ 42 | # ---------------------------------------------------------------------- 43 | # 44 | CC = mpicc 45 | -CCNOOPT = $(HPL_DEFS) 46 | -CCFLAGS = $(HPL_DEFS) 47 | +CCNOOPT = $(HPL_DEFS) -O0 48 | +CCFLAGS = $(HPL_DEFS) -Ofast -mcpu=neoverse-n1 49 | # 50 | -LINKER = mpif77 51 | +LINKER = mpicc 52 | LINKFLAGS = 53 | # 54 | ARCHIVER = ar 55 | -------------------------------------------------------------------------------- /examples/hpl-cpu/HPL.dat: -------------------------------------------------------------------------------- 1 | HPLinpack benchmark input file 2 | Innovative Computing Laboratory, University of Tennessee 3 | HPL.out output file name (if any) 4 | 6 device out (6=stdout,7=stderr,file) 5 | 1 # of problems sizes (N) 6 | 166656 Ns 7 | 1 # of NBs 8 | 192 NBs 9 | 0 PMAP process mapping (0=Row-,1=Column-major) 10 | 1 # of process grids (P x Q) 11 | 8 Ps 12 | 10 Qs 13 | 16.0 threshold 14 | 1 # of panel fact 15 | 2 PFACTs (0=left, 1=Crout, 2=Right) 16 | 1 # of recursive stopping criterium 17 | 4 NBMINs (>= 1) 18 | 1 # of panels in recursion 19 | 2 NDIVs 20 | 1 # of recursive panel fact. 21 | 1 RFACTs (0=left, 1=Crout, 2=Right) 22 | 1 # of broadcast 23 | 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 24 | 1 # of lookahead depth 25 | 1 DEPTHs (>=0) 26 | 2 SWAP (0=bin-exch,1=long,2=mix) 27 | 64 swapping threshold 28 | 0 L1 in (0=transposed,1=no-transposed) form 29 | 0 U in (0=transposed,1=no-transposed) form 30 | 1 Equilibration (0=no,1=yes) 31 | 8 memory alignment in double (> 0) 32 | ##### This line (no. 32) is ignored (it serves as a separator). ###### 33 | 0 Number of additional problem sizes for PTRANS 34 | 1200 10000 30000 values of N 35 | 0 number of additional blocking sizes for PTRANS 36 | 40 9 8 13 13 20 16 32 64 values of NB -------------------------------------------------------------------------------- /examples/hpl-cpu/NVIDIA_HPC_SDK.patch: -------------------------------------------------------------------------------- 1 | --- Make.NVIDIA_HPC_SDK 2022-07-19 13:13:56.121925000 -0400 2 | +++ Make.NVIDIA_HPC_SDK.patched 2022-07-19 13:14:16.536185000 -0400 3 | @@ -61,13 +61,13 @@ 4 | # - Platform identifier ------------------------------------------------ 5 | # ---------------------------------------------------------------------- 6 | # 7 | -ARCH = UNKNOWN 8 | +ARCH = NVIDIA_HPC_SDK 9 | # 10 | # ---------------------------------------------------------------------- 11 | # - HPL Directory Structure / HPL library ------------------------------ 12 | # ---------------------------------------------------------------------- 13 | # 14 | -TOPdir = $(HOME)/hpl 15 | +TOPdir = $(HOME)/benchmarks/hpl-2.3 16 | INCdir = $(TOPdir)/include 17 | BINdir = $(TOPdir)/bin/$(ARCH) 18 | LIBdir = $(TOPdir)/lib/$(ARCH) 19 | @@ -92,9 +92,9 @@ 20 | # header files, LAlib is defined to be the name of the library to be 21 | # used. 
The variable LAdir is only used for defining LAinc and LAlib. 22 | # 23 | -LAdir = 24 | -LAinc = 25 | -LAlib = -lblas 26 | +LAdir = $(NVHPC_ROOT)/compilers 27 | +LAinc = -I$(LAdir)/include 28 | +LAlib = -L$(LAdir)/lib -lblas 29 | # 30 | # ---------------------------------------------------------------------- 31 | # - F77 / C interface -------------------------------------------------- 32 | @@ -156,7 +156,7 @@ 33 | # *) call the BLAS Fortran 77 interface, 34 | # *) not display detailed timing information. 35 | # 36 | -HPL_OPTS = 37 | +HPL_OPTS = -DHPL_CALL_CBLAS 38 | # 39 | # ---------------------------------------------------------------------- 40 | # 41 | @@ -167,10 +167,10 @@ 42 | # ---------------------------------------------------------------------- 43 | # 44 | CC = mpicc 45 | -CCNOOPT = $(HPL_DEFS) 46 | -CCFLAGS = $(HPL_DEFS) 47 | +CCNOOPT = $(HPL_DEFS) -O0 -Kieee 48 | +CCFLAGS = $(HPL_DEFS) -O3 -fast -Minline=saxpy,sscal -Minfo 49 | # 50 | -LINKER = mpif77 51 | +LINKER = mpicc 52 | LINKFLAGS = 53 | # 54 | ARCHIVER = ar 55 | -------------------------------------------------------------------------------- /examples/hpl-cpu/hplgen.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # 3 | # John Linford 4 | # 5 | 6 | import os 7 | import math 8 | import multiprocessing 9 | 10 | 11 | # Fraction of node memory to use 12 | _MEMORY_FRACTION = 0.9 13 | 14 | # Matrix element size in bytes 15 | _SIZEOF_ELEMENT = 8 16 | 17 | _TEMPLATE = """\ 18 | HPLinpack benchmark input file 19 | Innovative Computing Laboratory, University of Tennessee 20 | HPL.out output file name (if any) 21 | 6 device out (6=stdout,7=stderr,file) 22 | 1 # of problems sizes (N) 23 | %(N)d Ns 24 | 1 # of NBs 25 | %(NB)d NBs 26 | 0 PMAP process mapping (0=Row-,1=Column-major) 27 | 1 # of process grids (P x Q) 28 | %(P)d Ps 29 | %(Q)d Qs 30 | 16.0 threshold 31 | 1 # of panel fact 32 | 2 PFACTs (0=left, 1=Crout, 2=Right) 33 | 1 # of recursive stopping criterium 34 | 4 NBMINs (>= 1) 35 | 1 # of panels in recursion 36 | 2 NDIVs 37 | 1 # of recursive panel fact. 38 | 1 RFACTs (0=left, 1=Crout, 2=Right) 39 | 1 # of broadcast 40 | 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 41 | 1 # of lookahead depth 42 | 1 DEPTHs (>=0) 43 | 2 SWAP (0=bin-exch,1=long,2=mix) 44 | 64 swapping threshold 45 | 0 L1 in (0=transposed,1=no-transposed) form 46 | 0 U in (0=transposed,1=no-transposed) form 47 | 1 Equilibration (0=no,1=yes) 48 | 8 memory alignment in double (> 0) 49 | ##### This line (no. 32) is ignored (it serves as a separator). 
###### 50 | 0 Number of additional problem sizes for PTRANS 51 | 1200 10000 30000 values of N 52 | 0 number of additional blocking sizes for PTRANS 53 | 40 9 8 13 13 20 16 32 64 values of NB 54 | """ 55 | 56 | def int_factor(n): 57 | p = math.ceil(math.sqrt(n)) 58 | while p: 59 | q = int(n/p) 60 | if p*q == n: 61 | return p, q 62 | p -= 1 63 | raise RuntimeError 64 | 65 | 66 | def int_input(prompt, default=None): 67 | if default: 68 | default = str(default) 69 | val = input("%s [%s]: " % (prompt, default)) or default 70 | else: 71 | val = input("%s: " % prompt) 72 | return int(val) 73 | 74 | 75 | def get_node_memory(): 76 | try: 77 | return int((os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')) / (1024*1024)) 78 | except ValueError: 79 | return 65536 80 | 81 | 82 | def main(): 83 | default_nodes = 1 84 | default_cores = multiprocessing.cpu_count() 85 | default_memory = get_node_memory() 86 | default_block = 192 87 | 88 | nodes = int_input("Number of nodes", default_nodes) 89 | cores_per_node = int_input("Cores per node", default_cores) 90 | memory_per_node = int_input("Memory per node (MB)", default_memory) 91 | block_size = int_input("Block size", default_block) 92 | 93 | cores = nodes * cores_per_node 94 | memory = nodes * memory_per_node 95 | 96 | N = int((_MEMORY_FRACTION * math.sqrt(memory * 1024**2 / _SIZEOF_ELEMENT)) / block_size) * block_size 97 | NB = block_size 98 | P, Q = int_factor(cores) 99 | 100 | fname = "HPL.dat_N%d_NB%d_P%d_Q%d" % (N, NB, P, Q) 101 | with open(fname, "w") as fout: 102 | fout.write(_TEMPLATE % {"N": N, "NB": NB, "P": P, "Q": Q}) 103 | 104 | 105 | if __name__ == "__main__": 106 | main() 107 | -------------------------------------------------------------------------------- /examples/motorBike.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arm-hpc-devkit/nvidia-arm-hpc-devkit-users-guide/77ad9c72ab9c19e2e333184e8839d97e891fbaa0/examples/motorBike.png -------------------------------------------------------------------------------- /examples/stream-cpu.md: -------------------------------------------------------------------------------- 1 | # STREAM on Arm64 CPU 2 | 3 | **IMPORTANT**: 4 | These benchmarks provide a _lower bound_ for expected out-of-the-box performance. They can be used to determine if your system is configured correctly and operating properly. It's possible you may exceed these numbers. **They are not indented for use in any competitive analysis.** 5 | 6 | * [NVIDIA HPC SDK](#nvidia-hpc-sdk) 7 | * [GNU Compilers](#gnu-compilers) 8 | * [Arm Compiler for Linux (ACfL)](#arm-compiler-for-linux-acfl) 9 | 10 | 11 | ## Initial Configuration 12 | STREAM is a defacto standard for measuring memory bandwidth. 13 | The benchmark includes four kernels that operate on 1D arrays `a`, `b`, and `c`, with scalar `x`: 14 | * **COPY**: `c = a` 15 | * **SCALE**: `b = x*a` 16 | * **ADD**: `c = a + b` 17 | * **TRIAD**: `a = b + x*c` 18 | 19 | The kernels are executed in sequence in a loop. Two parameters configure STREAM: 20 | * `STREAM_ARRAY_SIZE`: The number of double-precision elements in each array. 21 | It is critical to select a sufficiently large array size when measuring 22 | bandwidth to/from main memory. 23 | * `NTIMES`: The number of iterations of the test loop. 24 | 25 | There are many versions of STREAM, but they all use the same four fundamental kernels. For simplicity we recommend you use this version: https://github.com/jlinford/stream. 
This implementation is restructured to demonstrate the performance benefits of OpenMP, 26 | effective use of NUMA, and features of the Arm architecture. It uses a version of 27 | stream that has been modified to enable dynamic memory allocation and to separate 28 | the kernel implementation from the benchmark driver. This makes the code easier 29 | to read and facilitates the use of external tools to measure the performance of 30 | each kernel. 31 | 32 | ```bash 33 | # Clone the repo 34 | cd $HOME/benchmarks 35 | git clone https://github.com/jlinford/stream.git 36 | # That's it! 37 | ``` 38 | 39 | The Makefile for https://github.com/jlinford/stream uses the `COMPILER` variable to select good default flags for various compilers. At the time of this writing, GCC, NVIDIA, ACfL, and Fujitsu compilers are supported. 40 | 41 | ## NVIDIA HPC SDK 42 | 43 | To build with the NVIDIA compilers and run with OpenMP: 44 | 45 | ```bash 46 | cd $HOME/benchmarks/stream 47 | make clean run COMPILER=nvidia 48 | ``` 49 | 50 | TRIAD should report approximately 170GB/s. Please remember that this number is provided for reference only and should not be used in any competitive analysis. 51 | ``` 52 | ------------------------------------------------------------- 53 | Function Best Rate MB/s Avg time Min time Max time 54 | Copy: 168917.9 0.203749 0.203411 0.204257 55 | Scale: 168201.7 0.204544 0.204277 0.204850 56 | Add: 170183.6 0.303268 0.302847 0.303862 57 | Triad: 170309.1 0.303220 0.302624 0.303913 58 | ------------------------------------------------------------- 59 | ``` 60 | 61 | ## GNU Compilers 62 | 63 | To build with GCC and run with OpenMP: 64 | 65 | ```bash 66 | cd $HOME/benchmarks/stream 67 | make clean run COMPILER=gnu 68 | ``` 69 | 70 | TRIAD should report approximately 168GB/s. Please remember that this number is provided for reference only and should not be used in any competitive analysis. 71 | ``` 72 | ------------------------------------------------------------- 73 | Function Best Rate MB/s Avg time Min time Max time 74 | Copy: 165916.1 0.207446 0.207091 0.207838 75 | Scale: 168685.6 0.204652 0.203691 0.205092 76 | Add: 167601.9 0.307983 0.307512 0.308406 77 | Triad: 167755.8 0.307578 0.307230 0.307824 78 | ------------------------------------------------------------- 79 | ``` 80 | 81 | ## Arm Compiler for Linux (ACfL) 82 | 83 | To build with ACfL and run with OpenMP: 84 | 85 | ```bash 86 | cd $HOME/benchmarks/stream 87 | make clean run COMPILER=acfl 88 | ``` 89 | 90 | TRIAD should report approximately 168GB/s. Please remember that this number is provided for reference only and should not be used in any competitive analysis. 91 | ``` 92 | ------------------------------------------------------------- 93 | Function Best Rate MB/s Avg time Min time Max time 94 | Copy: 164732.6 0.209005 0.208579 0.209475 95 | Scale: 166109.4 0.207151 0.206850 0.207815 96 | Add: 169040.6 0.305355 0.304895 0.305726 97 | Triad: 168924.2 0.305628 0.305105 0.306131 98 | ------------------------------------------------------------- 99 | ``` 100 | 101 | -------------------------------------------------------------------------------- /examples/tensorflow-cpu.md: -------------------------------------------------------------------------------- 1 | # ML Inference on Arm64 CPUs with TensorFlow 2 | 3 | TensorFlow is an open-source software library for machine learning and artificial intelligence. It can be used across training and inference of deep neural networks.
This document covers how to use TensorFlow-based machine learning inference on Arm64 CPUs, what runtime configurations are important, and how to debug any performance issues. The document also covers instructions for source builds and how to enable some of the downstream features. 4 | 5 | ## Installing TensorFlow 6 | 7 | There are multiple levels of software package abstractions available including Python wheel (easiest option), Docker container (comes with the wheel, additional packages and benchmarks), and building from source. Examples of using each method are below. 8 | 9 | ### Python Wheel 10 | The TensorFlow wheel supports an optimized onednn+acl backend for Arm64 CPUs. 11 | ``` 12 | pip install tensorflow-cpu 13 | ``` 14 | 15 | ### Docker Hub Container 16 | ``` 17 | # pull the tensorflow docker container with onednn-acl optimizations enabled 18 | docker pull armswdev/tensorflow-arm-neoverse 19 | 20 | # launch the docker image 21 | docker run -it --rm -v /home/ubuntu/:/hostfs armswdev/tensorflow-arm-neoverse 22 | ``` 23 | 24 | ### Building from Source 25 | 26 | While the packages for python wheel/docker container/DLAMI provide a stable baseline for ML application development and production, they lack the latest fixes and optimizations from the development branches. This section provides instructions for building TensorFlow from source, either to build the master branch or to incorporate the downstream optimizations. 27 | ```bash 28 | # Install bazel for aarch64 29 | mkdir bazel 30 | cd bazel 31 | wget https://github.com/bazelbuild/bazel/releases/download/5.1.1/bazel-5.1.1-linux-arm64 32 | mv bazel-5.1.1-linux-arm64 bazel 33 | chmod a+x bazel 34 | export PATH=/home/ubuntu/bazel/:$PATH 35 | 36 | # Clone the tensorflow repository 37 | git clone https://github.com/tensorflow/tensorflow.git 38 | cd tensorflow 39 | # Optionally check out the stable version if needed 40 | git checkout 41 | 42 | # Set the build configuration 43 | export HOST_C_COMPILER=$(which gcc) 44 | export HOST_CXX_COMPILER=$(which g++) 45 | export PYTHON_BIN_PATH=$(which python) 46 | export USE_DEFAULT_PYTHON_LIB_PATH=1 47 | export TF_ENABLE_XLA=1 48 | export TF_DOWNLOAD_CLANG=0 49 | export TF_SET_ANDROID_WORKSPACE=0 50 | export TF_NEED_MPI=0 51 | export TF_NEED_ROCM=0 52 | export TF_NEED_GCP=0 53 | export TF_NEED_S3=0 54 | export TF_NEED_OPENCL_SYCL=0 55 | export TF_NEED_CUDA=0 56 | export TF_NEED_HDFS=0 57 | export TF_NEED_OPENCL=0 58 | export TF_NEED_JEMALLOC=1 59 | export TF_NEED_VERBS=0 60 | export TF_NEED_AWS=0 61 | export TF_NEED_GDR=0 62 | export TF_NEED_OPENCL_SYCL=0 63 | export TF_NEED_COMPUTECPP=0 64 | export TF_NEED_KAFKA=0 65 | export TF_NEED_TENSORRT=0 66 | ./configure 67 | 68 | # Issue bazel build command with 'mkl_aarch64' config to enable onednn+acl backend 69 | bazel build --verbose_failures -s --config=mkl_aarch64 //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_cc.so //tensorflow:install_headers 70 | 71 | # Create and install the wheel 72 | ./bazel-bin/tensorflow/tools/pip_package/build_pip_package ./wheel-TF2.9.0-py3.8-aarch64 73 | 74 | # The output wheel is generated in /home/ubuntu/tensorflow/wheel-TF2.9.0-py3.8-aarch64 75 | pip install 76 | ``` 77 | 78 | ## Runtime Configuration for Optimal Performance 79 | Once the TensorFlow setup is ready, enable the below runtime configurations to achieve the best performance.
80 | ``` 81 | # The default runtime backend for tensorflow is Eigen, but typically onednn+acl provides better performance and this can be enabled by setting the below TF environment variable 82 | export TF_ENABLE_ONEDNN_OPTS=1 83 | 84 | # If the CPU supports the BF16 format for ML acceleration, it can be enabled in oneDNN by setting the below environment variable 85 | grep -q bf16 /proc/cpuinfo && export DNNL_DEFAULT_FPMATH_MODE=BF16 86 | 87 | # Make sure the OpenMP threads are distributed across all the processes for multi-process applications to avoid oversubscription of the vCPUs. 88 | num_vcpus=80 89 | num_processes= 90 | export OMP_NUM_THREADS=$((1 > ($num_vcpus/$num_processes) ? 1 : ($num_vcpus/$num_processes))) 91 | export OMP_PROC_BIND=false 92 | export OMP_PLACES=cores 93 | ``` 94 | 95 | ```python 96 | # TensorFlow inter and intra_op_parallelism_thread settings are critical for optimal workload parallelization in a multi-threaded system. 97 | # Set the inter- and intra-op thread counts during session creation; an example snippet is given below. 98 | session = Session( 99 | config=ConfigProto( 100 | intra_op_parallelism_threads=80, 101 | inter_op_parallelism_threads=1, 102 | ) 103 | ) 104 | ``` 105 | TensorFlow recommends the graph optimization pass for inference to remove training-specific nodes, fold batch norms, and fuse operators. This is a generic optimization across CPU, GPU, or TPU inference, and the optimization script is part of the TensorFlow python [tools](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference_lib.py). For a detailed description, please refer to the TensorFlow [Grappler](https://www.tensorflow.org/guide/graph_optimization) documentation. Below is a snippet of what libraries to import and how to invoke the Grappler passes for inference. 106 | ```python 107 | from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference 108 | 109 | graph_def = tf.compat.v1.GraphDef() 110 | with tf.compat.v1.gfile.FastGFile(model_path, "rb") as f: 111 | graph_def.ParseFromString(f.read()) 112 | 113 | optimized_graph_def = optimize_for_inference( 114 | graph_def, 115 | [item.split(':')[0] for item in inputs], 116 | [item.split(':')[0] for item in outputs], 117 | dtypes.float32.as_datatype_enum, 118 | False) 119 | g = tf.compat.v1.import_graph_def(optimized_graph_def, name='') 120 | ``` 121 | Note: While the Grappler optimizer covers the majority of networks, there are a few scenarios where either the Grappler optimizer can't optimize the generic graph or the runtime kernel launch overhead is simply not acceptable. XLA addresses these gaps by providing an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Please refer to the **Enable XLA optimizations** section below to achieve the best performance with the downstream XLA optimizations. 122 | 123 | ## Evaluate performance with the standard MLPerf inference benchmarks 124 | 125 | 1. Set up the MLPerf inference benchmarks and the required tools. 126 | ``` 127 | sudo apt install -y build-essential cmake libgl1-mesa-glx libglib2.0-0 libsm6 libxrender1 libxext6 python3-pip 128 | 129 | git clone https://github.com/mlcommons/inference.git --recursive 130 | cd inference 131 | git checkout 5ec3ac922556107ce0f6ca63c175d379017ba3d8 132 | cd loadgen 133 | CFLAGS="-std=c++14" python3 setup.py bdist_wheel 134 | pip install 135 | ``` 136 | 137 | 2.
Benchmark image classification with Resnet50 138 | ``` 139 | sudo apt install python3-ck 140 | ck pull repo:ck-env 141 | 142 | # Download ImageNet's validation set 143 | # These will be installed to ${HOME}/CK-TOOLS/ 144 | # Select option 1: val-min data set. 145 | ck install package --tags=image-classification,dataset,imagenet,aux 146 | ck install package --tags=image-classification,dataset,imagenet,val 147 | 148 | # Copy the labels into the image location 149 | cp ${HOME}/CK-TOOLS/dataset-imagenet-ilsvrc2012-aux/val.txt ${HOME}/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min/val_map.txt 150 | 151 | cd inference/vision/classification_and_detection 152 | wget https://zenodo.org/record/2535873/files/resnet50_v1.pb 153 | 154 | # Install the additional packages required for resnet50 inference 155 | pip install opencv-python 156 | pip install pycocotools 157 | pip install psutil 158 | pip install tqdm 159 | 160 | # Set the data and model path 161 | export DATA_DIR=${HOME}/CK-TOOLS/dataset-imagenet-ilsvrc2012-val-min 162 | export MODEL_DIR=${HOME}/inference/vision/classification_and_detection 163 | 164 | # Setup the tensorflow thread pool parameters via MLPerf env variables 165 | export MLPERF_NUM_INTER_THREADS=1 166 | 167 | num_vcpus=$(getconf _NPROCESSORS_ONLN) 168 | num_processes= 169 | export MLPERF_NUM_INTRA_THREADS=$((1 > ($num_vcpus/$num_processes) ? 1 : ($num_vcpus/$num_processes))) 170 | 171 | ./run_local.sh tf resnet50 cpu --scenario=Offline 172 | ``` 173 | 174 | 3. Benchmark natural language processing with Bert 175 | ``` 176 | cd inference/language/bert 177 | make setup 178 | python3 run.py --backend=tf --scenario=Offline 179 | ``` 180 | 181 | ## Troubleshooting performance issues 182 | 183 | The steps below help debug performance issues with any inference application. 184 | 185 | 1. Run inference with DNNL and OpenMP verbose logs enabled to understand which backend is used for the tensor ops execution. 186 | ``` 187 | export DNNL_VERBOSE=1 188 | export OMP_DISPLAY_ENV=VERBOSE 189 | ``` 190 | If there are no OneDNN logs on the terminal, this could mean the ops are executed with either the Eigen or the XLA backend. To switch from Eigen to the OneDNN+ACL backend, set `TF_ENABLE_ONEDNN_OPTS=1` and rerun the model inference. For non-XLA compiled graphs, there should be a flow of oneDNN logs with details about the shapes, prop kinds, and execution times. Inspect the logs to see if there are any ops and shapes not executed with the ACL gemm kernel but instead executed by the C++ reference kernel. See the example dnnl logs below to understand what the ACL gemm and reference C++ kernel execution traces look like. 191 | ``` 192 | # ACL gemm kernel 193 | dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:acdb:f0 wei_f32::blocked:acdb:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,post_ops:'eltwise_relu;';,alg:convolution_direct,mb1_ic256oc64_ih56oh56kh1sh1dh0ph0_iw56ow56kw1sw1dw0pw0 194 | 195 | # OneDNN cpp reference kernel 196 | dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,post_ops:'eltwise_bounded_relu:6;';,alg:convolution_direct,mb1_g64ic64oc64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 197 | ``` 198 | If there are any shapes not going to ACL gemm kernels, the first step is to make sure the graph has been optimized for inference via Grappler passes.
199 | ```python 200 | from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference 201 | 202 | graph_def = tf.compat.v1.GraphDef() 203 | with tf.compat.v1.gfile.FastGFile(model_path, "rb") as f: 204 | graph_def.ParseFromString(f.read()) 205 | 206 | optimized_graph_def = optimize_for_inference( 207 | graph_def, 208 | [item.split(':')[0] for item in inputs], 209 | [item.split(':')[0] for item in outputs], 210 | dtypes.float32.as_datatype_enum, 211 | False) 212 | g = tf.compat.v1.import_graph_def(optimized_graph_def, name='') 213 | ``` 214 | If the tensor ops and shapes are still not executed with ACL gemm kernels, please raise an issue on [ACL github](https://github.com/ARM-software/ComputeLibrary) with the operator and shape details. 215 | 216 | 2. Once the tensor ops are executed with ACL gemm kernels, enable fast math mode, `export DNNL_DEFAULT_FPMATH_MODE=BF16`, to pick bfloat16 hybrid gemm kernels. 217 | 218 | 3. Verify the TensorFlow inter- and intra- thread pool settings are optimal as recommended in the runtime configurations section. Then, inspect the OMP environment to make sure the CPU cores are not oversubscribed for multi process applications. A typical openmp environment for a 80 thread, single process application looks like the one below. 219 | ``` 220 | OPENMP DISPLAY ENVIRONMENT BEGIN 221 | _OPENMP = '201511' 222 | OMP_DYNAMIC = 'FALSE' 223 | OMP_NESTED = 'FALSE' 224 | OMP_NUM_THREADS = '80' 225 | OMP_SCHEDULE = 'DYNAMIC' 226 | OMP_PROC_BIND = 'FALSE' 227 | OMP_PLACES = '' 228 | OMP_STACKSIZE = '0' 229 | OMP_WAIT_POLICY = 'PASSIVE' 230 | OMP_THREAD_LIMIT = '4294967295' 231 | OMP_MAX_ACTIVE_LEVELS = '1' 232 | OMP_CANCELLATION = 'FALSE' 233 | OMP_DEFAULT_DEVICE = '0' 234 | OMP_MAX_TASK_PRIORITY = '0' 235 | OMP_DISPLAY_AFFINITY = 'FALSE' 236 | OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A' 237 | OMP_ALLOCATOR = 'omp_default_mem_alloc' 238 | OMP_TARGET_OFFLOAD = 'DEFAULT' 239 | GOMP_CPU_AFFINITY = '' 240 | GOMP_STACKSIZE = '0' 241 | GOMP_SPINCOUNT = '300000' 242 | OPENMP DISPLAY ENVIRONMENT END 243 | ``` 244 | 245 | ## Enable TensorFlow Serving with onednn+acl backend 246 | 247 | [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data. 248 | 249 | As of 2.9.0 version, TensorFlow Serving for aarch64 supports TensorFlow with Eigen backend. For the best performance, with onednn+acl backend, please follow the below instructions to cherrypick the PRs and to rebuild the TensorFlow Serving docker image. 250 | ```bash 251 | # Clone the tf serving repository 252 | git clone https://github.com/tensorflow/serving.git 253 | cd serving 254 | 255 | # Pull https://github.com/tensorflow/serving/pull/1953 256 | git fetch origin pull/1953/head:tfs_aarch64 257 | 258 | # Pull https://github.com/tensorflow/serving/pull/1954 259 | git fetch origin pull/1954/head:tfs_docker_aarch64 260 | 261 | # Merge them 262 | git checkout tfs_aarch64 263 | git merge tfs_docker_aarch64 264 | 265 | # Invoke the docker build script to trigger mkl aarch64 config build 266 | docker build -f tensorflow_serving/tools/docker/Dockerfile.devel-mkl-aarch64 -t tfs:mkl_aarch64 . 
267 | 268 | # Command to launch the serving api with onednn+acl backend, and BF16 kernels for a resnet model 269 | docker run -p 8501:8501 --name tfserving_resnet --mount type=bind,source=/tmp/resnet,target=/models/resnet -e MODEL_NAME=resnet -e TF_ENABLE_ONEDNN_OPTS=1 -e DNNL_DEFAULT_FPMATH_MODE=BF16 -t tfs:mkl_aarch64 270 | ``` 271 | 272 | ## Enable XLA optimizations 273 | 274 | While the Grappler optimizer covers the majority of networks, there are a few scenarios where either the Grappler optimizer can't optimize the generic graph or the runtime kernel launch overhead is simply not acceptable. XLA addresses these gaps by providing an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. TensorFlow 2.9.0 supports the aarch64 XLA backend with the Eigen runtime. For the best performance please cherry-pick the [PR](https://github.com/tensorflow/tensorflow/pull/55534) and rebuild the TensorFlow libraries. 275 | ```bash 276 | # Clone the tensorflow repository 277 | git clone https://github.com/tensorflow/tensorflow.git 278 | cd tensorflow 279 | 280 | # Pull https://github.com/tensorflow/tensorflow/pull/55534 281 | git fetch origin pull/55534/head:xla_acl 282 | git checkout xla_acl 283 | 284 | # Set the build configuration 285 | export HOST_C_COMPILER=$(which gcc) 286 | export HOST_CXX_COMPILER=$(which g++) 287 | export PYTHON_BIN_PATH=$(which python) 288 | export USE_DEFAULT_PYTHON_LIB_PATH=1 289 | export TF_ENABLE_XLA=1 290 | export TF_DOWNLOAD_CLANG=0 291 | export TF_SET_ANDROID_WORKSPACE=0 292 | export TF_NEED_MPI=0 293 | export TF_NEED_ROCM=0 294 | export TF_NEED_GCP=0 295 | export TF_NEED_S3=0 296 | export TF_NEED_OPENCL_SYCL=0 297 | export TF_NEED_CUDA=0 298 | export TF_NEED_HDFS=0 299 | export TF_NEED_OPENCL=0 300 | export TF_NEED_JEMALLOC=1 301 | export TF_NEED_VERBS=0 302 | export TF_NEED_AWS=0 303 | export TF_NEED_GDR=0 304 | export TF_NEED_OPENCL_SYCL=0 305 | export TF_NEED_COMPUTECPP=0 306 | export TF_NEED_KAFKA=0 307 | export TF_NEED_TENSORRT=0 308 | ./configure 309 | 310 | # Issue bazel build command with 'mkl_aarch64' config to enable onednn+acl backend 311 | bazel build --verbose_failures -s --config=mkl_aarch64 //tensorflow/tools/pip_package:build_pip_package //tensorflow:libtensorflow_cc.so //tensorflow:install_headers 312 | 313 | # Create and install the wheel 314 | ./bazel-bin/tensorflow/tools/pip_package/build_pip_package ./wheel-TF2.9.0-py3.8-aarch64 315 | 316 | # The wheel is generated in /home/ubuntu/tensorflow/wheel-TF2.9.0-py3.8-aarch64 317 | pip install 318 | ``` 319 | A simple way to start using XLA in TensorFlow models without any changes is to enable auto-clustering, which automatically finds clusters (connected subgraphs) within the TensorFlow functions which can be compiled and executed using XLA. Auto-clustering on CPU can be enabled either through the session config or by setting the TF_XLA_FLAGS environment variable as shown below: 320 | ```python 321 | # Set the jit level for the current session via the config 322 | jit_level = tf_compat_v1.OptimizerOptions.ON_1 323 | config.graph_options.optimizer_options.global_jit_level = jit_level 324 | ``` 325 | ``` 326 | # Enable auto clustering for CPU backend 327 | export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" 328 | ``` 329 | 330 | ### Troubleshooting XLA Performance Issues 331 | 332 | 1.
If XLA performance improvements are not as expected, the first step is to dump and inspect the XLA optimized graph and ensure there are not many op duplications resulting from the op fusion and other optimization passes. XLA provides a detailed logging mechanism to dump the state at different checkpoints during the graph optimization passes. At a high level, inspecting the graphs generated before and after the XLA pass is sufficient to understand whether XLA compilation is the correct optimization for the current graph. Please refer to the instructions below for enabling auto-clustering along with .dot generation (using MLPerf Bert inference in SingleStream mode as the example here), and also the commands to generate .svg versions for easier visualization of the XLA generated graphs. 333 | ```bash 334 | # To enable XLA auto clustering, and to generate .dot files 335 | XLA_FLAGS="--xla_dump_to=/tmp/generated --xla_dump_hlo_as_dot" TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit" python run.py --backend=tf --scenario=SingleStream 336 | 337 | # To convert the .dot file into .svg 338 | sudo apt install graphviz 339 | dot -Tsvg <.dot file to be converted> > 340 | 341 | # e.g. 342 | dot -Tsvg 1645569101166784.module_0000.cluster_51__XlaCompiledKernel_true__XlaHasReferenceVars_false__XlaNumConstantArgs_0__XlaNumResourceArgs_0_.80.before_optimizations.dot > module0_before_opts.svg 343 | 344 | dot -Tsvg 1645569101166784.module_0000.cluster_51__XlaCompiledKernel_true__XlaHasReferenceVars_false__XlaNumConstantArgs_0__XlaNumResourceArgs_0_.80.cpu_after_optimizations.dot > module0_after_opts.svg 345 | ``` 346 | 347 | 2. Once the XLA graph looks as expected (without too many duplicated nodes), check how the ops are emitted. Currently XLA framework logging is under the same TF CPP logging, and level 1 is sufficient to get most of the info traces. 348 | ```bash 349 | # Enable TF CPP framework logging 350 | export TF_CPP_MAX_VLOG_LEVEL=1 351 | ``` 352 | Then look for the emitter level traces to understand how each op for a given shape is lowered to LLVM IR. The traces below show whether the XLA ops are emitted to the ACL or Eigen runtime. 353 | ```bash 354 | # ACL runtime traces 355 | __xla_cpu_runtime_ACLBatchMatMulF32 356 | 357 | __xla_cpu_runtime_ACLConv2DF32 358 | 359 | # Eigen runtime traces 360 | __xla_cpu_runtime_EigenBatchMatMulF32 361 | 362 | __xla_cpu_runtime_EigenConv2DF32 363 | ``` 364 | If the shapes are not emitted by the ACL runtime, check the source build configuration to make sure the 'mkl_aarch64' bazel config is enabled (which internally enables the '--define=build_with_acl=true' bazel configuration). If the LLVM IR still doesn't emit the ACL runtime, please raise an issue on [ACL github](https://github.com/ARM-software/ComputeLibrary) with the operator and shape details. 365 | -------------------------------------------------------------------------------- /examples/tensorflow-gpu.md: -------------------------------------------------------------------------------- 1 | # TensorFlow on A100 GPUs and Arm64 CPUs 2 | 3 | TensorFlow is an open source platform for machine learning. It provides 4 | comprehensive tools and libraries in a flexible architecture allowing easy 5 | deployment across a variety of platforms and devices. NGC Containers are the easiest 6 | way to get started with TensorFlow.
The TensorFlow NGC Container comes with all 7 | dependencies included, providing an easy place to start developing common 8 | applications, such as conversational AI, natural language processing (NLP), 9 | recommenders, and computer vision. 10 | 11 | 12 | ## NVIDIA TensorFlow NGC Container 13 | The NVIDIA NGC container registry includes GPU-accelerated Arm64 builds of TensorFlow. For most cases, this will be your best option both in terms of ease of use and performance. 14 | 15 | The [TensorFlow NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) is optimized for GPU acceleration, and contains a 16 | validated set of libraries that enable and optimize GPU performance. This container 17 | may also contain modifications to the TensorFlow source code in order to maximize 18 | performance and compatibility. This container also contains software for 19 | accelerating ETL (DALI, RAPIDS), Training (cuDNN, NCCL), and Inference (TensorRT) 20 | workloads. See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow for 21 | more information. 22 | 23 | 24 | ### Example: Mask-RCNN For TensorFlow 2 25 | Mask R-CNN is a convolution-based neural network for the task of object instance segmentation. The paper describing the model can be found [here](https://arxiv.org/abs/1703.06870). NVIDIA’s Mask R-CNN is an optimized version of [Google's TPU implementation](https://github.com/tensorflow/tpu/tree/master/models/official/mask_rcnn), leveraging mixed precision arithmetic using Tensor Cores while maintaining target accuracy. 26 | 27 | This model is trained with mixed precision using Tensor Cores on the NVIDIA Ampere GPU. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. See the [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN) for more details. 28 | 29 | 1. Download Mask-RCNN example files 30 | ```bash 31 | # Clone the NVIDIA Deep Learning Examples repository 32 | git clone https://github.com/NVIDIA/DeepLearningExamples.git 33 | 34 | # Go to the MaskRCNN example for TensorFlow 35 | cd DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN 36 | ``` 37 | 38 | 2. Update the Dockerfile to fix references to outdated software versions. Newer versions of these packages support Arm64. 39 | ```bash 40 | # Update software versions in Dockerfile to enable Arm64 support 41 | cat | patch -p0 << "EOF" 42 | --- Dockerfile 2022-07-07 15:35:29.824463000 -0700 43 | +++ Dockerfile.patched 2022-07-07 15:36:47.041457000 -0700 44 | @@ -12,7 +12,7 @@ 45 | # See the License for the specific language governing permissions and 46 | # limitations under the License. 47 | 48 | -ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:21.02-tf2-py3 49 | +ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:22.06-tf2-py3 50 | FROM ${FROM_IMAGE_NAME} 51 | 52 | LABEL model="MaskRCNN" 53 | @@ -30,10 +30,10 @@ 54 | cd /opt/pybind11 && cmake . && make install && pip install . 
55 | 56 | 57 | -# update protobuf 3 to 3.3.0 58 | +# update protobuf 3 to 3.18.2 59 | RUN \ 60 | - curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip && \ 61 | - unzip -u protoc-3.3.0-linux-x86_64.zip -d protoc3 && \ 62 | + curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.18.2/protoc-3.18.2-linux-aarch_64.zip && \ 63 | + unzip -u protoc-3.18.2-linux-aarch_64.zip -d protoc3 && \ 64 | mv protoc3/bin/* /usr/local/bin/ && \ 65 | mv protoc3/include/* /usr/local/include/ 66 | EOF 67 | ``` 68 | 69 | 3. Update example files to support TensorFlow2. 70 | ```bash 71 | cat | patch -p0 <<"EOF" 72 | --- mrcnn_tf2/runtime/run.py 2022-07-07 16:01:33.565361000 -0700 73 | +++ mrcnn_tf2/runtime/run.py.patched 2022-07-07 16:01:27.167354000 -0700 74 | @@ -112,11 +112,9 @@ 75 | logging.info('XLA is activated') 76 | 77 | if params.amp: 78 | - policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16", loss_scale="dynamic") 79 | - tf.keras.mixed_precision.experimental.set_policy(policy) 80 | + tf.keras.mixed_precision.set_global_policy("mixed_float16") 81 | logging.info('AMP is activated') 82 | 83 | - 84 | def create_model(params): 85 | model = MaskRCNN( 86 | params=params, 87 | EOF 88 | ``` 89 | 90 | 4. Build the container image 91 | ```bash 92 | nvidia-docker build -t nvidia_mrcnn_tf2 . 93 | ``` 94 | 95 | 5. Benchmark model training 96 | ```bash 97 | # Start an interactive session in the NGC container 98 | # Use bind mounts to retain data between runs 99 | docker run --gpus all -it --rm \ 100 | --shm-size=2g \ 101 | --ulimit memlock=-1 \ 102 | --ulimit stack=67108864 \ 103 | -v /tmp/mask-rcnn/data:/data \ 104 | -v /tmp/mask-rcnn/weights:/weights \ 105 | nvidia_mrcnn_tf2 106 | 107 | # Note: skip this step if you already have the preprocessed data 108 | # Download and preprocess the dataset (approximately 25GB) 109 | cd /workspace/mrcnn_tf2/dataset 110 | bash download_and_preprocess_coco.sh /data 111 | 112 | # Note: skip this step if you already have the preprocessed data 113 | # Download the pre-trained ResNet-50 weights. 114 | python scripts/download_weights.py --save_dir=/weights 115 | 116 | # Start training on 2 GPUs 117 | cd /workspace/mrcnn_tf2 118 | # Optional: if you see an error message like this: 119 | # ImportError: /usr/lib/aarch64-linux-gnu/libgomp.so.1: cannot allocate memory in static TLS block 120 | # ... 
then set LD_PRELOAD like this: 121 | # export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1 122 | python scripts/train.py --gpus 2 --batch_size 24 123 | ``` 124 | 125 | 126 | ### Example: F-LM on Arm64 with 2xA100 GPUs 127 | 128 | ```bash 129 | # Start an interactive session in the NGC container 130 | # Use bind mounts in $HOME/big_lstm to retain data 131 | docker run --gpus all -it --rm \ 132 | -v $HOME/big_lstm/data:/data \ 133 | -v $HOME/big_lstm/logs:/logs \ 134 | --ipc=host \ 135 | --ulimit memlock=-1 \ 136 | --ulimit stack=67108864 \ 137 | nvcr.io/nvidia/tensorflow:23.01-tf1-py3 138 | 139 | # Go to the example directory 140 | cd /workspace/nvidia-examples/big_lstm 141 | 142 | # Download the training data 143 | ./download_1b_words_data.sh 144 | 145 | # Training for up to 180 seconds 146 | python single_lm_train.py \ 147 | --mode=train \ 148 | --logdir=/logs \ 149 | --num_gpus=2 \ 150 | --datadir=/data/1-billion-word-language-modeling-benchmark-r13output \ 151 | --hpconfig run_profiler=False,max_time=180,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=512 152 | 153 | # Evaluation 154 | python single_lm_train.py \ 155 | --mode=eval_full \ 156 | --logdir=/logs \ 157 | --num_gpus=2 \ 158 | --datadir=/data/1-billion-word-language-modeling-benchmark-r13output \ 159 | --hpconfig run_profiler=False,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=512 160 | ``` 161 | 162 | -------------------------------------------------------------------------------- /examples/velox.md: -------------------------------------------------------------------------------- 1 | # Velox on Arm64 2 | 3 | Velox is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox was created by Facebook and it is currently developed in partnership with Intel, ByteDance, and Ahana. See https://github.com/facebookincubator/velox for more information. 4 | 5 | What makes Velox interesting on Arm64 is its use of the xsimd library. xsimd provides a unified means for using SIMD features for library authors. Namely, it enables manipulation of batches of numbers with the same arithmetic operators as for single values. It also provides accelerated implementations of common mathematical functions operating on batches. See https://github.com/xtensor-stack/xsimd. 6 | 7 | ## Build from Source 8 | 9 | Note: the standard Velox installation assumes you are running Ubuntu 22.04 LTS and you have root access. You'll need to modify Velox's setup scripts if this is not the case. 10 | 11 | ```bash 12 | # Install flex and bison 13 | sudo apt install -y flex bison 14 | 15 | # Get the source 16 | git clone --recursive https://github.com/facebookincubator/velox.git 17 | cd velox 18 | git submodule sync --recursive 19 | git submodule update --init --recursive 20 | ``` 21 | 22 | We need to edit `scripts/setup-helper-functions.sh` to fix a bug and update the compiler flags for Arm64.
23 | ```bash 24 | # Add missing sudo to ninja command 25 | sed -i 's/ninja -C/sudo ninja -C/' scripts/setup-helper-functions.sh 26 | 27 | # Fix the compiler flags for Arm 28 | sed -i 's/-march=armv8-a+crc+crypto/-mcpu=neoverse-n1 -flax-vector-conversions -fsigned-char/' scripts/setup-helper-functions.sh 29 | ``` 30 | 31 | What are these flags and why do we need them? 32 | * `-flax-vector-conversions`: Let GCC permit conversions between vectors with differing element types. This flag is needed because NEON differentiates between vectors of signed and unsigned types. Without this flag, the compiler will not permit implicit cast operations between such vectors. Velox relies on such implicit casts in several places so we must relax the compiler's vector conversion rules. 33 | * `-fsigned-char`: The C++ standard does not specify if `char` is signed or unsigned. GCC on Arm implements it as unsigned, but there are places in Velox where it is assumed to be signed. This flag makes all occurrences of `char` be signed, like `signed char`. 34 | * `-mcpu=neoverse-n1`: Optional, but this enables the compiler to use all available architecture features and also perform micro-architectural tuning for the Neoverse N1 CPU core. Using `-mcpu` should produce a more tuned binary than using the `-march` flag. 35 | 36 | Now, compile Velox: 37 | ```bash 38 | export CPU_TARGET=aarch64 39 | ./scripts/setup-ubuntu.sh 40 | make TREAT_WARNINGS_AS_ERRORS=0 ENABLE_WALL=0 CPU_TARGET=aarch64 41 | ``` 42 | -------------------------------------------------------------------------------- /examples/wrf.md: -------------------------------------------------------------------------------- 1 | # WRF on Arm64 2 | 3 | The [Weather Research and Forecasting (WRF) Model](https://www.mmm.ucar.edu/weather-research-and-forecasting-model) is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. 4 | 5 | Arm64 is supported by the standard WRF distribution as of WRF 4.3.3. The following is an example of how to perform the standard procedure to build and execute on Arm64. See [http://www2.mmm.ucar.edu/wrf/users/download/get_source.html](http://www2.mmm.ucar.edu/wrf/users/download/get_source.html) for more details. 6 | 7 | ## Running the CONUS (Contiguous US) test 8 | Build WRF using one of the methods described below (i.e. via Spack or manually) and download a CONUS (Contiguous US) test deck from www2.mmm.ucar.edu: 9 | - 12km case, about 1.8GB: https://www2.mmm.ucar.edu/wrf/src/conus12km.tar.gz 10 | - 2.5km case, about 18GB: https://www2.mmm.ucar.edu/wrf/src/conus2.5km.tar.gz 11 | 12 | The 12km and 2.5km cases are run in the same way.
13 | 14 | ```bash 15 | cd $BUILD_DIR/WRFV4.4.2 16 | 17 | # Copy the run directory template 18 | cp -a run run_CONUS12km 19 | cd run_CONUS12km 20 | 21 | # Download the test case files and merge them into the run directory 22 | curl -L https://www2.mmm.ucar.edu/wrf/src/conus12km.tar.gz | tar xvzf - --strip-components=1 23 | 24 | # Configure stack limits 25 | ulimit -s unlimited 26 | export OMP_STACKSIZE=1G 27 | 28 | # Run with 8 MPI ranks, each having 10 OpenMP threads 29 | OMP_NUM_THREADS=10 OMP_PLACES=cores OMP_PROC_BIND=close mpirun -np 8 -map-by socket:PE=10 ./wrf.exe 30 | 31 | # Track progress by watching the Rank 0 output file: 32 | tail -f rsl.out.0000 33 | 34 | # Quickly calculate the average elapsed seconds per domain as a figure-of-merit 35 | cat rsl.out.* | grep 'Timing for main:' | awk '{print $9}' | jq -s add/length 36 | ``` 37 | 38 | 39 | ## Quick Start with Spack 40 | One of the easiest ways to use WRF on Arm64 is to install it via Spack. Simply executing `spack install wrf` will install the latest version of WRF. However, we recommend you also install a recent version of GCC and then use that GCC installation to build WRF to get the best performance. For example, 41 | ```bash 42 | spack install gcc@12.1.0 43 | spack load gcc@12.1.0 44 | spack compiler find 45 | spack install wrf %gcc@12.1.0 46 | ``` 47 | 48 | ## Manually Build from Source 49 | 50 | ### Dependencies 51 | WRF depends on the NetCDF Fortran library, which in turn requires the NetCDF C library and HDF5. All these packages support Arm64 and build easily with GCC. Be sure to set the environment variables `HDFDIR` and `NETCDF` to be the location of the HDF5 and NetCDF installations _Note: it is assumed that the Fortran NetCDF interface has been installed at the same location as the C library, i.e. they share the same `lib` and `include` directories._ 52 | 53 | ```bash 54 | # Create a build directory to hold WRF and all its dependencies 55 | mkdir WRF 56 | cd WRF 57 | 58 | # Configure build environment 59 | export BUILD_DIR=$PWD 60 | export HDFDIR=$BUILD_DIR/opt 61 | export HDF5=$BUILD_DIR/opt 62 | export NETCDF=$BUILD_DIR/opt 63 | 64 | # HDF5 65 | curl -L https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.0/src/hdf5-1.14.0.tar.gz | tar xvzf - 66 | cd hdf5-1.14.0 67 | ./configure --prefix=$HDFDIR --enable-fortran --enable-parallel 68 | make -j && make install 69 | cd $BUILD_DIR 70 | 71 | # Only set these _after_ HDF5 has compiled and installed 72 | # Note that LDFLAGS only includes HDF5, not NETCDF 73 | export CC=`which mpicc` 74 | export CXX=`which mpicxx` 75 | export FC=`which mpifort` 76 | export CPPFLAGS="-I$HDFDIR/include" 77 | export CFLAGS="-I$HDFDIR/include" 78 | export FFLAGS="-I$HDFDIR/include" 79 | export LDFLAGS="-L$HDFDIR/lib -lhdf5_hl -lhdf5 -lz" 80 | export PATH=$NETCDF/bin:$PATH 81 | export LD_LIBRARY_PATH=$BUILD_DIR/opt/lib:$LD_LIBRARY_PATH 82 | 83 | # NetCDF-c 84 | curl -L https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.0.tar.gz | tar xvzf - 85 | cd netcdf-c-4.9.0 86 | ./configure --prefix=$NETCDF 87 | make -j && make install 88 | cd $BUILD_DIR 89 | 90 | # NetCDF-f 91 | curl -L https://github.com/Unidata/netcdf-fortran/archive/refs/tags/v4.6.0.tar.gz | tar xvzf - 92 | cd netcdf-fortran-4.6.0/ 93 | ./configure --prefix=$NETCDF 94 | make -j && make install 95 | cd $BUILD_DIR 96 | ``` 97 | 98 | ### Download WRF Source 99 | The latest WRF source code can be downloaded [here](https://github.com/wrf-model/WRF/releases). 
100 | ```bash 101 | cd $BUILD_DIR 102 | curl -L https://github.com/wrf-model/WRF/releases/download/v4.4.2/v4.4.2.tar.gz | tar xvzf - 103 | cd WRFV4.4.2 104 | ``` 105 | 106 | ### Configure and Compile WRF with GCC 107 | Run WRF's configure script and select from the provided options. 108 | It's best to select an option from the "Aarch64" row as this will add 109 | `-mcpu=native` and other important compiler flags to the build line. 110 | Other rows in the configuration options may not achieve the best performance. 111 | Choose the default option for nesting (i.e. `1`). 112 | 113 | Make sure `configure` outputs messages like these: 114 | - `Will use NETCDF in dir: /home/jlinford/workspace/benchmarks/WRF/opt` 115 | - `Will use HDF5 in dir: /home/jlinford/workspace/benchmarks/WRF/opt` 116 | If you see errors finding NetCDF or HDF5, resolve those errors before proceeding. 117 | 118 | ```bash 119 | ./configure 120 | checking for perl5... no 121 | checking for perl... found /usr/bin/perl (perl) 122 | Will use NETCDF in dir: /home/jlinford/workspace/benchmarks/WRF/opt 123 | Will use HDF5 in dir: /home/jlinford/workspace/benchmarks/WRF/opt 124 | PHDF5 not set in environment. Will configure WRF for use without. 125 | Will use 'time' to report timing information 126 | $JASPERLIB or $JASPERINC not found in environment, configuring to build without grib2 I/O... 127 | ------------------------------------------------------------------------ 128 | Please select from among the following Linux aarch64 options: 129 | 130 | 1. (serial) 2. (smpar) 3. (dmpar) 4. (dm+sm) GNU (gfortran/gcc) 131 | 5. (serial) 6. (smpar) 7. (dmpar) 8. (dm+sm) GNU (gfortran/gcc) 132 | 9. (serial) 10. (smpar) 11. (dmpar) 12. (dm+sm) GCC (gfortran/gcc): Aarch64 133 | 134 | Enter selection [1-12] : 11 135 | ------------------------------------------------------------------------ 136 | Compile for nesting? (1=basic, 2=preset moves, 3=vortex following) [default 1]: 137 | 138 | Configuration successful!
139 | ``` 140 | 141 | Then use WRF's compile script to build the target executable: 142 | ```bash 143 | # Reset build environment to include `-lnetcdf` in LDFLAGS 144 | export CC=`which mpicc` 145 | export CXX=`which mpicxx` 146 | export FC=`which mpifort` 147 | export CPPFLAGS="-I$HDFDIR/include" 148 | export CFLAGS="-I$HDFDIR/include" 149 | export FFLAGS="-I$HDFDIR/include" 150 | export LDFLAGS="-L$HDFDIR/lib -lnetcdf -lhdf5_hl -lhdf5 -lz" 151 | export PATH=$NETCDF/bin:$PATH 152 | export LD_LIBRARY_PATH=$BUILD_DIR/opt/lib:$LD_LIBRARY_PATH 153 | 154 | # Start compilation on 80 cores 155 | ./compile -j 80 em_real >& compile.log & 156 | # Watch compilation progress by following the log file: 157 | tail -f compile.log 158 | ``` 159 | 160 | Look for this message at the end of the compilation log: 161 | ``` 162 | ========================================================================== 163 | build started: Wed Feb 1 03:35:15 UTC 2023 164 | build completed: Wed Feb 1 04:00:21 UTC 2023 165 | 166 | ---> Executables successfully built <--- 167 | 168 | -rwxrwxr-x 1 jlinford jlinford 38182176 Feb 1 04:00 main/ndown.exe 169 | -rwxrwxr-x 1 jlinford jlinford 38284448 Feb 1 04:00 main/real.exe 170 | -rwxrwxr-x 1 jlinford jlinford 37837912 Feb 1 04:00 main/tc.exe 171 | -rwxrwxr-x 1 jlinford jlinford 45049424 Feb 1 03:58 main/wrf.exe 172 | 173 | ========================================================================== 174 | ``` 175 | 176 | ### Optional: Compiling with Arm Compiler for Linux 177 | If you plan to use the Arm Compiler for Linux (ACfL) instead of GCC, note that the stanza for ACfL is not yet part of the standard WRF package. Run WRF's configure script then choose an option 1-4 and select default for nesting. This will produce the `configure.wrf' file: 178 | ``` 179 | ./configure 180 | ``` 181 | The `configure.wrf' file then needs modifying as follows. 182 | ``` 183 | sed -i 's/gcc/armclang/g' configure.wrf 184 | sed -i 's/gfortran/armflang/g' configure.wrf 185 | sed -i 's/mpicc/mpicc -DMPI2_SUPPORT/g' configure.wrf 186 | sed -i 's/ -ftree-vectorize//g' configure.wrf 187 | sed -i 's/length-none/length-0/g' configure.wrf 188 | sed -i 's/-frecord-marker\=4/ /g' configure.wrf 189 | sed -i 's/\-w \-O3 \-c/-mcpu=native \-w \-O3 \-c/g' configure.wrf 190 | sed -i 's/\# \-g $(FCNOOPT).*/\-g/g' configure.wrf 191 | sed -i 's/$(FCBASEOPTS_NO_G)/-mcpu=native $(OMP) $(FCBASEOPTS_NO_G)/g' configure.wrf 192 | ``` 193 | ### Optional: Compiling with the NVIDIA HPC SDK 194 | If you plan to use the Compilers that come with the NVIDIA HPC SDK, just like the Arm Compiler for Linux (ACfL), the stanza is not yet part of the standard WRF package. Run WRF's configure script then choose an option 1-4 and select default for nesting. This will produce the `configure.wrf' file: 195 | ``` 196 | ./configure 197 | ``` 198 | The `configure.wrf' file then needs modifying as follows. 
199 | ``` 200 | sed -i 's/gcc/nvc/g' configure.wrf 201 | sed -i 's/gfortran/nvfortran/g' configure.wrf 202 | sed -i 's/ -ftree-vectorize//g' configure.wrf 203 | sed -i 's/ -fopenmp/-mp/g' configure.wrf 204 | sed -i 's/#-fdefault-real-8/-i4/g' configure.wrf 205 | sed -i 's/ -O3/ -O3 -fast -Minfo/g' configure.wrf 206 | sed -i 's/-funroll-loops//g' configure.wrf 207 | sed -i 's/-ffree-form -ffree-line-length-none/-Mfree/g' configure.wrf 208 | sed -i 's/-fconvert=big-endian -frecord-marker=4/-byteswapio/g' configure.wrf 209 | sed -i 's/\# \-g $(FCNOOPT).*/\-g/g' configure.wrf 210 | sed -i 's/$(FCBASEOPTS_NO_G)/\-march=native $(OMP) $(FCBASEOPTS_NO_G)/g' configure.wrf 211 | ``` 212 | 213 | -------------------------------------------------------------------------------- /known_issues.md: -------------------------------------------------------------------------------- 1 | # Recent Updates, Known Issues, and Workarounds 2 | 3 | There is a huge amount of activity in the Arm software ecosystem and improvements are being 4 | made on a daily basis. As a general rule, later versions of compilers and language runtimes 5 | should be used whenever possible. 6 | 7 | ## Recent Updates 8 | The table below includes known recent changes to popular packages that improve performance. If you know of others please let us know. CONTACT 9 | 10 | Package | Version | Improvements 11 | --------|:-:|------------- 12 | bazel | [3.4.1+](https://github.com/bazelbuild/bazel/releases) | Pre-built bazel binary for Arm64. [See below](#bazel-on-linux) for installation. 13 | ffmpeg | 4.3+ | Improved performance of libswscale by 50% with better NEON vectorization which improves the performance and scalability of ffmpeg multi-thread encoders. The changes are available in FFMPEG version 4.3. 14 | HAProxy | 2.4+ | A [serious bug](https://github.com/haproxy/haproxy/issues/958) was fixed. Additionally, building with `CPU=armv81` improves HAProxy performance by 4x so please rebuild your code with this flag. 15 | mongodb | 4.2.15+ / 4.4.7+ / 5.0.0+ | Improved performance on Arm64, especially for internal JS engine. LSE support added in [SERVER-56347](https://jira.mongodb.org/browse/SERVER-56347). 16 | MySQL | 8.0.23+ | Improved spinlock behavior, compiled with -moutline-atomics if compiler supports it. 17 | .NET | [5+](https://dotnet.microsoft.com/download/dotnet/5.0) | [.NET 5 significantly improved performance for ARM64](https://devblogs.microsoft.com/dotnet/Arm64-performance-in-net-5/). Here's an [AWS Blog](https://aws.amazon.com/blogs/compute/powering-net-5-with-aws-graviton2-benchmark-results/) with some performance results. 18 | OpenH264 | [2.1.1+](https://github.com/cisco/openh264/releases/tag/v2.1.1) | Pre-built Cisco OpenH264 binary for Arm64. 19 | PCRE2 | 10.34+ | Added NEON vectorization to PCRE's JIT to match first and pairs of characters. This may improve performance of matching by up to 8x. This fixed version of the library now is shipping with Ubuntu 20.04 and PHP 8. 20 | PHP | 7.4+ | PHP 7.4 includes a number of performance improvements that increase perf by up to 30% 21 | pip | 19.3+ | Enable installation of Python wheel binaries on Arm64. 22 | PyTorch | 1.7+ | Enable Arm64 compilation and NEON optimization for fp32. Install from source. **Note:** *Requires GCC9 or later.* 23 | ruby | 3.0+ | Enable Arm64 optimizations that improve performance by as much as 40%. These changes have also been back-ported to the Ruby shipping with AmazonLinux2, Fedora, and Ubuntu 20.04. 
24 | zlib | 1.2.8+ | For the best performance on Arm64 please use [zlib-cloudflare](https://github.com/cloudflare/zlib). 25 | 26 | 27 | ## Known Issues 28 | 29 | ### Postgres 30 | Postgres performance can be heavily impacted by not using [LSE](langs/c-c++.md#large-system-extensions-lse). 31 | Today, postgres binaries from distributions (e.g. Ubuntu) are not built with `-moutline-atomics` or `-march=armv8.2-a` which would enable LSE. 32 | 33 | In November 2021 PostgreSQL started to distribute Ubuntu 20.04 packages optimized with `-moutline-atomics`. 34 | For Ubuntu 20.04, we recommend using the PostgreSQL PPA instead of the packages distributed by Ubuntu Focal. 35 | Please follow [the instructions to set up the PostgreSQL PPA.](https://www.postgresql.org/download/linux/ubuntu/) 36 | 37 | ### Python Installation on Some Linux Distros 38 | The default installation of pip on some Linux distributions is too old \(<19.3\) to install binary wheel packages released for Arm64. To work around this, it is recommended to upgrade your pip installation using: 39 | ``` 40 | sudo python3 -m pip install --upgrade pip 41 | ``` 42 | 43 | ## Bazel on Linux 44 | The [Bazel build tool](https://www.bazel.build/) now releases a pre-built binary for Arm64. As of October 2020, this is not available in their custom Debian repo, and Bazel does not officially provide an RPM. Instead, we recommend using the [Bazelisk installer](https://docs.bazel.build/versions/master/install-bazelisk.html), which will replace your `bazel` command and [keep bazel up to date](https://github.com/bazelbuild/bazelisk/blob/master/README.md). 45 | 46 | Below is an example using the [latest Arm64 binary release of Bazelisk](https://github.com/bazelbuild/bazelisk/releases/latest) as of October 2020: 47 | ``` 48 | wget https://github.com/bazelbuild/bazelisk/releases/download/v1.7.1/bazelisk-linux-arm64 49 | chmod +x bazelisk-linux-arm64 50 | sudo mv bazelisk-linux-arm64 /usr/local/bin/bazel 51 | bazel 52 | ``` 53 | 54 | Bazelisk itself should not require further updates as its only purpose is to keep Bazel updated. 55 | 56 | ## zlib on Linux 57 | Linux distributions, in general, use the original zlib without any optimizations. zlib-cloudflare has been updated to provide better and faster compression on Arm and x86. To use zlib-cloudflare: 58 | ``` 59 | git clone https://github.com/cloudflare/zlib.git 60 | cd zlib 61 | ./configure --prefix=$HOME 62 | make 63 | make install 64 | ``` 65 | Make sure to have the full path to your lib at $HOME/lib in /etc/ld.so.conf and run ldconfig. 66 | 67 | For users of OpenJDK, which is dynamically linked to the system zlib, you can set LD_LIBRARY_PATH to point to the directory where your newly built version of zlib-cloudflare is located or load that library with LD_PRELOAD. 68 | 69 | You can check the libz that JDK is dynamically linked against with: 70 | ``` 71 | $ ldd /Java/jdk-11.0.8/lib/libzip.so | grep libz 72 | libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7783000) 73 | ``` 74 | -------------------------------------------------------------------------------- /languages/README.md: -------------------------------------------------------------------------------- 1 | # Language-specific Considerations 2 | This guide contains language specific sections with recommendations. If there is no language specific section, it is because there is no specific guidance beyond using a suitably current version of the language. Simply proceed as you would on any other CPUs, Arm-based or otherwise. 
3 | 4 | Broadly speaking, applications built using interpreted or JIT'd languages ([Python](python.md), [Java](java.md), PHP, Node.js, etc.) should run as-is on Arm64. Applications using compiled languages including [C/C++](c-c++.md), [Fortran](fortran.md), [Go](golang.md), and [Rust](rust.md) need to be compiled for the Arm64 architecture. Most modern build systems (Make, CMake, Ninja, etc.) will "just work" on Arm64. Again, if there is no specific guidance it's because everything works _exactly_ the same on Arm64 as on other platforms. 5 | 6 | ## Language Guides 7 | * [C/C++](c-c++.md) 8 | * [Fortran](fortran.md) 9 | * [Python](python.md) 10 | * [Rust](rust.md) 11 | * [Go](golang.md) 12 | * [Java](java.md) 13 | * [.NET](dotnet.md) 14 | 15 | -------------------------------------------------------------------------------- /languages/c-c++.md: -------------------------------------------------------------------------------- 1 | # C/C++ on Arm64 2 | 3 | ## Availability and Standards Compliance 4 | There are many C/C++ compilers available for Arm64 including: 5 | * [NVIDIA HPC Compiler](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html) 6 | * Cray/HPE Compiler 7 | * GCC 8 | * LLVM 9 | * Arm Compiler for Linux 10 | 11 | The NVIDIA HPC Compiler is a direct descendant of the popular PGI C/C++ compiler, so it accepts the same compiler flags. GCC and LLVM operate more-or-less the same on Arm64 as on other architectures except for the `-mcpu` flag, as described below. The Arm Compiler for Linux is based on LLVM and includes additional optimizations for Arm Neoverse cores that make it a highly performant compiler for CPU-only applications. However, it is missing some Fortran 2008 and OpenMP 4.5+ features that may be desirable for GPU-accelerated applications. 12 | 13 | 14 | ## Enabling Arm Architecture Specific Features 15 | For GCC on Arm64, `-mcpu=` acts as both the architecture and tuning flag, and it's generally better to use it vs `-march` or `-mtune` if you're building for a specific CPU. You can find additional details in this [presentation from Arm Inc. to Stony Brook University](https://www.stonybrook.edu/commcms/ookami/_pdf/Linford_OokamiUGM_2022.pdf). 16 | 17 | CPU | Flag | GCC version | LLVM version 18 | ----------|---------|-------------------|------------- 19 | Any Arm64 | `-mcpu=native` | GCC-9+ | Clang/LLVM 10+ 20 | Ampere Altra | `-mcpu=neoverse-n1` | GCC-9+ | Clang/LLVM 10+ 21 | 22 | ## Compilers 23 | Whenever possible, please use the latest compiler version available on your system. Newer compilers provide better support and optimizations for Arm64 processors. Many codes will demonstrate at least 15% better performance when using GCC 10 vs. GCC 7. The table below shows GCC and LLVM compiler versions available in Linux distributions. A starred version marks the default system compiler. 24 | 25 | Distribution | GCC | Clang/LLVM 26 | ----------------|----------------------|------------- 27 | Ubuntu 22.04 | 9, 10, 11*, 12 | 11, 12, 13, 14* 28 | Ubuntu 20.04 | 7, 8, 9*, 10 | 6, 7, 8, 9, 10, 11, 12 29 | Ubuntu 18.04 | 4.8, 5, 6, 7*, 8 | 3.9, 4, 5, 6, 7, 8, 9, 10 30 | Debian 10 | 7, 8* | 6, 7, 8 31 | Red Hat EL8 | 8*, 9, 10 | 10 32 | SUSE Linux ES15 | 7*, 9, 10 | 7 33 | 34 | 35 | ## Large-System Extensions (LSE) 36 | All server-class Arm64 processors support low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes.
On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See [Locks, Synchronization, and Atomics](../optimization/atomics.md) for details. 37 | 38 | ### Enabling LSE 39 | 40 | GCC's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. If you're cross compiling, use the appropriate `-mcpu` option for your target CPU, e.g. `-mcpu=neoverse-n1` for the Ampere Altra CPU. You can check which Arm features GCC will enable with the `-mcpu=native` flag by using this command: 41 | ```bash 42 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 43 | ``` 44 | For example, on the Ampere Altra CPU with GCC 9.4, we see "`__ARM_FEATURE_ATOMICS 1 45 | `" indicating that LSE atomics are enabled: 46 | ```c 47 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 48 | #define __ARM_FEATURE_ATOMICS 1 49 | #define __ARM_FEATURE_UNALIGNED 1 50 | #define __ARM_FEATURE_AES 1 51 | #define __ARM_FEATURE_IDIV 1 52 | #define __ARM_FEATURE_QRDMX 1 53 | #define __ARM_FEATURE_DOTPROD 1 54 | #define __ARM_FEATURE_CRYPTO 1 55 | #define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1 56 | #define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1 57 | #define __ARM_FEATURE_FMA 1 58 | #define __ARM_FEATURE_CLZ 1 59 | #define __ARM_FEATURE_SHA2 1 60 | #define __ARM_FEATURE_CRC32 1 61 | #define __ARM_FEATURE_NUMERIC_MAXMIN 1 62 | ``` 63 | 64 | ### Checking for LSE in a binary 65 | To confirm that LSE instructions are used, the output of the `objdump` command line utility should contain LSE instructions: 66 | ```bash 67 | $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l 68 | ``` 69 | To check whether the application binary contains load and store exclusives: 70 | ```bash 71 | $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l 72 | ``` 73 | 74 | ## Porting SSE or AVX Intrinsics 75 | To quickly get a prototype running on Arm64, one can use 76 | https://github.com/DLTcollab/sse2neon a translator of x64 intrinsics to NEON. 77 | sse2neon provides a quick starting point in porting performance critical codes 78 | to Arm. It shortens the time needed to get an Arm working program that then 79 | can be used to extract profiles and to identify hot paths in the code. A header 80 | file `sse2neon.h` contains several of the functions provided by standard x64 81 | include files like `xmmintrin.h`, only implemented with NEON instructions to 82 | produce the exact semantics of the x64 intrinsic. Once a profile is 83 | established, the hot paths can be rewritten directly with NEON intrinsics to 84 | avoid the overhead of the generic sse2neon translation. 85 | 86 | Note that GCC's `__sync` built-ins are outdated and may be biased towards the x86 memory model. Use `__atomic` versions of these functions instead of the `__sync` versions. See https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html for more details. 87 | 88 | 89 | ## Signed vs. Unsigned char 90 | The C standard doesn't specify the signedness of char. On x86 char is signed by 91 | default while on Arm it is unsigned by default. This can be addressed by using 92 | standard int types that explicitly specify the signedness (e.g. `uint8_t` and `int8_t`) 93 | or compile with `-fsigned-char`. 
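To see the difference in practice, here is a minimal, hypothetical test program (not taken from this guide's examples) that prints whether plain `char` is signed or unsigned on the platform it was compiled for. Building it unchanged on x86 and on Arm64 typically produces different output, while adding `-fsigned-char` (or using explicit `int8_t`/`uint8_t` types) makes the behavior identical on both:

```c
#include <stdio.h>

int main(void) {
    /* 0xFF is 255 as an unsigned char, but -1 when plain char is signed. */
    char c = (char)0xFF;

    if (c < 0) {
        /* Typical result on x86, or on Arm64 when built with -fsigned-char. */
        printf("plain char is signed here (c = %d)\n", c);
    } else {
        /* Typical result on Arm64 with default compiler settings. */
        printf("plain char is unsigned here (c = %d)\n", c);
    }
    return 0;
}
```

Code that stores negative values in a plain `char`, or compares a plain `char` against a negative sentinel such as `EOF`, is the usual source of this class of bug when porting from x86 to Arm64.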
94 | 
95 | 
96 | ## Arm Instructions for Machine Learning
97 | Many Arm64 CPUs support [Arm dot-product instructions](https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions), commonly used for quantized Machine Learning inference workloads, and [half-precision floating point (FP16)](https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-intrinsics). These features enable performant and power-efficient machine learning by doubling the number of operations per second and reducing the memory footprint compared to single-precision floating point (float32), all while still enjoying a large dynamic range. Compiling with `-mcpu=native` will enable these features, when available. See [the examples page](../examples/README.md) for examples of how to use these features in ML frameworks like TensorFlow and PyTorch.
98 | 
-------------------------------------------------------------------------------- /languages/dotnet.md: --------------------------------------------------------------------------------
1 | # .NET on Arm64
2 | .NET 5 is an open-source platform for writing different types of applications. Software engineers can write .NET-based applications in multiple languages such as C#, F#, and Visual Basic. .NET applications are compiled into Common Intermediate Language (CIL). When an application is executed, the Common Language Runtime (CLR) loads the application binary and uses a just-in-time (JIT) compiler to generate machine code for the architecture it is running on. For more information, please see [what is .NET](https://dotnet.microsoft.com/learn/dotnet/what-is-dotnet).
3 | 
4 | 
5 | ## .NET Versions
6 | 
7 | Version | Linux Arm64 | Notes
8 | ------------------|-----------|-------------
9 | [.NET 6](https://dotnet.microsoft.com/download/dotnet/6.0) | Yes | Preview releases provide early access to features that are currently under development. These releases are generally not supported for production use.
10 | [.NET 5](https://dotnet.microsoft.com/download/dotnet/5.0) | Yes | Arm64-specific optimizations in the .NET libraries and the code produced by RyuJIT. [Arm64 Performance in .NET 5](https://devblogs.microsoft.com/dotnet/arm64-performance-in-net-5/)
11 | [.NET Framework 4.x](https://dotnet.microsoft.com/learn/dotnet/what-is-dotnet-framework) | No | The original implementation of the .NET Framework does not support Linux hosts, and Windows is not supported on the NVIDIA Arm HPC Developer Kit.
12 | [.NET Core 3.1](https://dotnet.microsoft.com/download/dotnet/3.1) | Yes | .NET Core 3.0 added support for [Arm64 for Linux](https://docs.microsoft.com/en-us/dotnet/core/whats-new/dotnet-core-3-0#linux-improvements).
13 | 
14 | 
15 | ## .NET 5
16 | With .NET 5, Microsoft made specific Arm64 architecture optimizations. These optimizations were made in the .NET libraries as well as in the machine code output by the JIT process.
17 | 
18 | * AWS DevOps Blog [Build and Deploy .NET web applications to ARM-powered AWS Graviton 2 Amazon ECS Clusters using AWS CDK](https://aws.amazon.com/blogs/devops/build-and-deploy-net-web-applications-to-arm-powered-aws-graviton-2-amazon-ecs-clusters-using-aws-cdk/)
19 | * AWS Compute Blog [Powering .NET 5 with AWS Graviton: Benchmarks](https://aws.amazon.com/blogs/compute/powering-net-5-with-aws-graviton2-benchmark-results/)
20 | * Microsoft .NET Blog [ARM64 Performance in .NET 5](https://devblogs.microsoft.com/dotnet/arm64-performance-in-net-5/)
21 | 
22 | 
23 | ## Building & Publishing for Linux Arm64
24 | The .NET SDK supports choosing a [Runtime Identifier (RID)](https://docs.microsoft.com/en-us/dotnet/core/rid-catalog) used to target the platforms where the application runs. These RIDs are used by .NET dependencies (NuGet packages) to represent platform-specific resources in NuGet packages. The following values are examples of RIDs: linux-arm64, linux-x64, ubuntu.14.04-x64, win7-x64, or osx.10.12-x64. For NuGet packages with native dependencies, the RID designates on which platforms the package can be restored.
25 | 
26 | You can build and publish on any host operating system. As an example, you can develop on Windows and build locally to target Arm64, or you can use a CI server like Jenkins on Linux. The commands are the same.
27 | 
28 | ```bash
29 | dotnet build -r linux-arm64
30 | dotnet publish -c Release -r linux-arm64
31 | ```
32 | 
33 | For more information about [publishing .NET apps with the .NET CLI](https://docs.microsoft.com/en-us/dotnet/core/deploying/deploy-with-cli), please see the official documentation.
34 | 
35 | 
-------------------------------------------------------------------------------- /languages/fortran.md: --------------------------------------------------------------------------------
1 | # Fortran on Arm64
2 | 
3 | ## Availability and Standards Compliance
4 | There are many Fortran compilers available for Arm64 including:
5 | * [NVIDIA HPC Compiler](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html)
6 | * Cray/HPE Compiler
7 | * GCC
8 | * LLVM
9 | * Arm Compiler for Linux
10 | 
11 | The NVIDIA HPC Compiler is a direct descendant of the popular PGI Fortran compiler, so it accepts the same compiler flags. GCC and LLVM operate more-or-less the same on Arm64 as on other architectures except for the `-mcpu` flag, as described below. The Arm Compiler for Linux is based on LLVM and includes additional optimizations for Arm Neoverse cores that make it a highly performant compiler for CPU-only applications. However, it is missing some Fortran 2008 and OpenMP 4.5+ features that may be desirable for GPU-accelerated applications.
12 | 
13 | 
14 | ## Enabling Arm Architecture Specific Features
15 | For gfortran on Arm64, `-mcpu=` acts as both an architecture and a tuning specification, so it's generally better to use it instead of `-march` or `-mtune` when building for a specific CPU. You can find additional details in this [presentation from Arm Inc. to Stony Brook University](https://www.stonybrook.edu/commcms/ookami/_pdf/Linford_OokamiUGM_2022.pdf).
16 | 
17 | CPU | Flag | gfortran version | LLVM version
18 | ----------|---------|-------------------|-------------
19 | Any Arm64 | `-mcpu=native` | gfortran-9+ | flang/LLVM 10+
20 | Ampere Altra | `-mcpu=neoverse-n1` | gfortran-9+ | flang/LLVM 10+
21 | 
22 | 
23 | ## Compilers
24 | Whenever possible, please use the latest compiler version available on your system.
Newer compilers provide better support and optimizations for Arm64 processors. Many codes will demonstrate at least 15% better performance when using GCC 10 vs. GCC 7. The table below shows GCC and LLVM compiler versions available in Linux distributions. Starred version marks the default system compiler. 25 | 26 | Distribution | GCC | Clang/LLVM 27 | ----------------|----------------------|------------- 28 | Ubuntu 22.04 | 9, 10, 11*, 12 | 11, 12, 13, 14* 29 | Ubuntu 20.04 | 7, 8, 9*, 10 | 6, 7, 8, 9, 10, 11, 12 30 | Ubuntu 18.04 | 4.8, 5, 6, 7*, 8 | 3.9, 4, 5, 6, 7, 8, 9, 10 31 | Debian10 | 7, 8* | 6, 7, 8 32 | Red Hat EL8 | 8*, 9, 10 | 10 33 | SUSE Linux ES15 | 7*, 9, 10 | 7 34 | 35 | 36 | ## Large-System Extensions (LSE) 37 | All server-class Arm64 processors support low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See [Locks, Synchronization, and Atomics](../optimization/atomics.md) for details. 38 | 39 | 40 | ### Enabling LSE 41 | gfortran's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. If you're cross compiling, use the appropriate `-mcpu` option for your target CPU, e.g. `-mcpu=neoverse-n1` for the Ampere Altra CPU. You can check which Arm features gfortran will enable with the `-mcpu=native` flag by using this command: 42 | ```bash 43 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 44 | ``` 45 | For example, on the Ampere Altra CPU with GCC 9.4, we see "`__ARM_FEATURE_ATOMICS 1 46 | `" indicating that LSE atomics are enabled: 47 | ```c 48 | gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE 49 | #define __ARM_FEATURE_ATOMICS 1 50 | #define __ARM_FEATURE_UNALIGNED 1 51 | #define __ARM_FEATURE_AES 1 52 | #define __ARM_FEATURE_IDIV 1 53 | #define __ARM_FEATURE_QRDMX 1 54 | #define __ARM_FEATURE_DOTPROD 1 55 | #define __ARM_FEATURE_CRYPTO 1 56 | #define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1 57 | #define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1 58 | #define __ARM_FEATURE_FMA 1 59 | #define __ARM_FEATURE_CLZ 1 60 | #define __ARM_FEATURE_SHA2 1 61 | #define __ARM_FEATURE_CRC32 1 62 | #define __ARM_FEATURE_NUMERIC_MAXMIN 1 63 | ``` 64 | 65 | ### Checking for LSE in a binary 66 | To confirm that LSE instructions are used, the output of the `objdump` command line utility should contain LSE instructions: 67 | ```bash 68 | $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l 69 | ``` 70 | To check whether the application binary contains load and store exclusives: 71 | ```bash 72 | $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l 73 | ``` 74 | -------------------------------------------------------------------------------- /languages/golang.md: -------------------------------------------------------------------------------- 1 | # Go on Arm64 2 | 3 | Go is a statically typed, compiled programming language originally designed at Google. Go supports Arm64 out of the box and is available in all common Linux distributions. Make sure to use the latest version of the Go compiler and toolchain to get the best performance on Arm64. 
4 | 5 | Here are some noteworthy performance upgrades: 6 | 7 | ## Go 1.18 \[released 2022/03/14\] 8 | The main implementation of the Go compiler, [golang/go](https://github.com/golang/go), has improved 9 | performance on Arm64 by implementing a new way of passing function arguments and results using registers instead of the stack. This change has been available on x86-64 since 1.17, where it brought performance improvements of about 5%. On Arm this change typically gives even higher performance improvements of 10% or more. 10 | 11 | To learn more about the use cases benefiting from Go 1.18's performance improvements, check the blog post: [Making your Go workloads up to 20% faster with Go 1.18 and AWS Graviton](https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/). 12 | 13 | 14 | ## Go 1.17 \[released 2021/08/16\] 15 | The main implementation of the Go compiler, [golang/go](https://github.com/golang/go), has improved 16 | performance for the following standard library packages: 17 | 18 | - crypto/ed25519 - the package has been rewritten, and all operations are now approximately twice as fast on Arm64. 19 | - crypto/elliptic - CurveParams methods now automatically invoke faster and safer dedicated implementations for known curves (P-224, P-256, and P-521) when available. The P521 curve implementation has also been rewritten and is now constant-time and three times faster on Arm64. 20 | 21 | 22 | ## Go 1.16 \[released 2021/02/16\] 23 | The main implementation of the Go compiler, [golang/go](https://github.com/golang/go), has improved 24 | performance on Arm with couple of changes listed below. Building your project with Go 1.16 will give you these improvements: 25 | 26 | * [ARMv8.1-A LSE atomic instructions](https://go-review.googlesource.com/c/go/+/234217), which dramatically improve mutex fairness and speed on modern Arm cores with v8.1 and newer instruction set. 27 | * [copy performance improvements](https://go-review.googlesource.com/c/go/+/243357), especially when the addresses are unaligned. 28 | 29 | 30 | ## Recently updated packages 31 | Changes to commonly used packages that improve performance on Arm64 can make a noticeable difference in 32 | some cases. Here is a partial list of packages to be aware of. 33 | 34 | Package | Version | Improvements 35 | ----------|-----------|------------- 36 | [Snappy](https://github.com/golang/snappy) | as of commit [196ae77](https://github.com/golang/snappy/commit/196ae77b8a26000fa30caa8b2b541e09674dbc43) | assembly implementations of the hot path functions were ported from arm64 37 | 38 | -------------------------------------------------------------------------------- /languages/java.md: -------------------------------------------------------------------------------- 1 | # Java on Arm64 2 | 3 | Java is a general-purpose programming language. Compiled Java code can run on all platforms that support Java, without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture. _[Wikipedia](https://en.wikipedia.org/wiki/Java_(programming_language))_ 4 | 5 | Java is well supported and generally performant out-of-the-box on Arm64. While Java 8 is fully supported on Arm64, some customers haven't been able to obtain the CPU's full performance benefit until they switched to Java 11. 6 | 7 | This page includes specific details about building and tuning Java applications on Arm64. 
8 | 9 | ## Java JVM Options 10 | There are numerous options that control the JVM and may lead to better performance. Two that have shown large (1.5x) improvements in some Java workloads are: 11 | 12 | * Eliminating tiered compilation: `-XX:-TieredCompilation` 13 | * Restricting the size of the code cache to enable Arm64 cores to better predict branches: `-XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M` 14 | 15 | These are helpful on some workloads but can hurt on others so testing with and without them is essential. 16 | 17 | ## Java Stack Size 18 | For some JVMs, the default stack size for Java threads (i.e. `ThreadStackSize`) is 2MB on Arm64 instead of the 1MB used on x86_64. You can check the default with: 19 | ``` 20 | $ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize 21 | intx CompilerThreadStackSize = 2048 {pd product} {default} 22 | intx ThreadStackSize = 2048 {pd product} {default} 23 | intx VMThreadStackSize = 2048 {pd product} {default} 24 | ``` 25 | The default can be easily changed on the command line with either `-XX:ThreadStackSize=` or `-Xss`. Notice that `-XX:ThreadStackSize` interprets its argument as kilobytes whereas `-Xss` interprets it as bytes. So `-XX:ThreadStackSize=1024` and `-Xss1m` will both set the stack size for Java threads to 1 megabyte: 26 | ``` 27 | $ java -Xss1m -XX:+PrintFlagsFinal -version | grep ThreadStackSize 28 | intx CompilerThreadStackSize = 2048 {pd product} {default} 29 | intx ThreadStackSize = 1024 {pd product} {command line} 30 | intx VMThreadStackSize = 2048 {pd product} {default} 31 | ``` 32 | 33 | Usually, there's no need to change the default because the thread stack will be committed lazily as it grows. No matter the default, the thread will always only commit as much stack as it really uses (at page size granularity). However, there's one exception to this rule if [Transparent Huge Pages](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html) (THP) are turned on by default on a system. In that case, the THP page size of 2MB matches exactly with the 2MB default stack size on Arm64 and most stacks will be backed up by a single huge page of 2MB. This means that the stack will be completely committed to memory right from the start. If you're using hundreds or even thousands of threads, this memory overhead can be considerable. 34 | 35 | To mitigate this issue, you can either manually change the stack size on the command line (as described above) or you can change the default for THP from `always` to `madvise`: 36 | ``` 37 | # cat /sys/kernel/mm/transparent_hugepage/enabled 38 | [always] madvise never 39 | # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled 40 | # cat /sys/kernel/mm/transparent_hugepage/enabled 41 | always [madvise] never 42 | ``` 43 | 44 | Notice that even if the the default is changed from `always` to `madvise`, the JVM can still use THP for the Java heap and code cache if you specify `-XX:+UseTransparentHugePages` on the command line. 45 | 46 | ## Looking for x86 shared-objects in JARs 47 | Java JARs can include shared-objects that are architecture specific. Some Java libraries check 48 | if these shared objects are found and if they are they use a JNI to call to the native library 49 | instead of relying on a generic Java implementation of the function. While the code might work, 50 | without the JNI the performance can suffer. 
51 | 52 | A quick way to check if a JAR contains such shared objects is to simply unzip it and check if 53 | any of the resulting files are shared-objects and if an Arm64 shared-object is missing: 54 | ``` 55 | $ unzip foo.jar 56 | $ find . -name "*.so" -exec file {} \; 57 | ``` 58 | For each x86-64 ELF file, check there is a corresponding aarch64 ELF file in the binaries. With some common packages (e.g. commons-crypto) we've seen that even though a JAR can be built supporting Arm manually, artifact repositories such as Maven don't have updated versions. 59 | 60 | ## Building multi-arch JARs 61 | Java is meant to be a write once, and run anywhere language. When building Java artifacts that 62 | contain native code, it is important to build those libraries for each major architecture to provide 63 | a seamless and optimally performing experience for all consumers. Code that runs well on both Arm64 and x86 64 | increases the package's utility. 65 | 66 | There is nominally a multi-step process to build the native shared objects for each supported architecture before doing the final packaging with Maven, SBT, Gradle etc. Below is an example of how to create your JAR using Maven that contains shared libraries for multiple distributions and architectures: 67 | 68 | ``` 69 | # You will need access to two systems: one x86 and one Arm64. 70 | 71 | # On the x86 system: 72 | $ cd java-lib 73 | $ mvn package 74 | $ find target/ -name "*.so" -type f -print 75 | 76 | # Note the directory this so file is in. It will be in a directory 77 | # such as: target/classes/org/your/class/hierarchy/native/OS/ARCH/lib.so 78 | 79 | # Now, log into the Arm64 system 80 | $ cd java-lib 81 | $ mvn package 82 | 83 | # Repeat the below two steps for each OS and ARCH combination you want to release 84 | $ mkdir target/classes/org/your/class/hierarchy/native/OS/ARCH 85 | $ scp other-system:~/your-java-lib/target/classes/org/your/class/hierarchy/native/OS/ARCH/lib.so target/classes/org/your/class/hierarchy/native/OS/ARCH/ 86 | 87 | # Create the jar packaging with maven. It will include the additional 88 | # native libraries even though they were not built directly by this maven process. 89 | $ mvn package 90 | 91 | # When creating a single Jar for all platform native libraries, 92 | # the release plugin's configuration must be modified to specify 93 | # the plugin's `preparationGoals` to not include the clean goal. 94 | # See http://maven.apache.org/maven-release/maven-release-plugin/prepare-mojo.html#preparationGoals 95 | # For more details. 96 | 97 | # To do a release to Maven Central and/or Sonatype Nexus: 98 | $ mvn release:prepare 99 | $ mvn release:perform 100 | ``` 101 | 102 | This is one way to do the JAR packaging with all the libraries in a single JAR. To build all the JARs, we recommend to build on native machines, but it can also be done via Docker using the buildx plug-in, or by cross-compiling inside your build-environment. 103 | 104 | There are two additional options for releasing jars with native code. You can use a manager plugin such as the [nar maven plugin](https://maven-nar.github.io/) to manage each platform specific Jar. Or, you can release individual architecture specific jars, and then use one system to download these released jars and package them into a combined Jar with a final `mvn release:perform`. An example of this method can be found in the [Leveldbjni-native](https://github.com/fusesource/leveldbjni) `pom.xml` files. 
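
To make the packaging flow above concrete, here is a hedged sketch of the kind of per-architecture native source such a JAR would carry. The package, class, and method names (`org.example.NativeLib.add`) are purely illustrative:

```c
#include <jni.h>

/*
 * Illustrative native implementation for a hypothetical Java class:
 *
 *   package org.example;
 *   public class NativeLib { public static native int add(int a, int b); }
 *
 * This file is compiled separately on each target (e.g. linux-x86_64 and
 * linux-aarch64), and each resulting shared object is placed in its own
 * OS/ARCH directory inside the JAR as described above.
 */
JNIEXPORT jint JNICALL
Java_org_example_NativeLib_add(JNIEnv *env, jclass clazz, jint a, jint b)
{
    (void)env;   /* unused in this trivial example */
    (void)clazz;
    return a + b;
}
```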
105 | 106 | 107 | ## Profiling Java applications 108 | For languages that rely on a JIT (such an Java), the symbol information that is captured is lacking, making it difficult to understand where runtime is being consumed. Similar to the code profiling example above, `libperf-jvmti.so` can be used to dump symbols for JITed code as the JVM runs. 109 | 110 | ```bash 111 | # Compile your Java application with -g 112 | 113 | # find where libperf-jvmti.so is on your distribution 114 | 115 | # Run your java app with -agentpath:/path/to/libperf-jvmti.so added to the command line 116 | # Launch perf record on the system 117 | $ perf record -g -k 1 -a -o perf.data sleep 5 118 | 119 | # Inject the generated methods information into the perf.data file 120 | $ perf inject -j -i perf.data -o perf.data.jit 121 | 122 | # Process the new file, for instance via Brendan Gregg's Flamegraph tools 123 | $ perf script -i perf.data.jit | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > ./flamegraph.svg 124 | ``` -------------------------------------------------------------------------------- /languages/python.md: -------------------------------------------------------------------------------- 1 | # Python on Arm64 2 | 3 | Python is an interpreted, high-level, general-purpose programming language, with interpreters available for many operating systems and architectures, including Arm64. _[Read more on Wikipedia](https://en.wikipedia.org/wiki/Python_(programming_language))_ 4 | 5 | ## Installing Python packages 6 | 7 | When *pip* (the standard package installer for Python) is used, it pulls the packages from [Python Package Index](https://pypi.org) and other indexes. 8 | 9 | In the case *pip* could not find a pre-compiled package, it automatically downloads, compiles, and builds the package from source code. 10 | Normally it may take a few more minutes to install the package from source code than from pre-built. For some large packages (i.e. *pandas*) 11 | it may take up to 20 minutes. In some cases, compilation may fail due to missing dependencies. 12 | 13 | ### Prerequisites for installing Python packages from source 14 | 15 | For installing common Python packages from source code, we need to install the following development tools: 16 | 17 | On **RedHat**: 18 | ```bash 19 | sudo yum install "@Development tools" python3-pip python3-devel blas-devel gcc-gfortran lapack-devel 20 | python3 -m pip install --user --upgrade pip 21 | ``` 22 | 23 | On **Debian/Ubuntu**: 24 | ```bash 25 | sudo apt update 26 | sudo apt-get install build-essential python3-pip python3-dev libblas-dev gfortran liblapack-dev 27 | python3 -m pip install --user --upgrade pip 28 | ``` 29 | 30 | On all distributions, additional compile time dependencies might be needed depending on the Python modules you are trying to install. 31 | 32 | ### Recommended versions 33 | 34 | When adopting Arm64, it is recommended to use recent software versions as much as possible, and Python is no exception. 35 | 36 | Python 2.7 is EOL since January the 1st 2020, so it is definitely recommended to upgrade to a Python 3.x version before moving to Arm64. 37 | 38 | Python 3.6 will reach [EOL in December, 2021](https://www.python.org/dev/peps/pep-0494/#lifespan), so when starting to port an application to Arm64, it is recommended to target at least Python 3.7. 
39 | 40 | ### Python on Centos 8 and RHEL 8 41 | 42 | Centos 8 and RHEL 8 distribute Python 3.6 which is 43 | [scheduled for end of life in December, 2021](https://www.python.org/dev/peps/pep-0494/#lifespan). 44 | However as of May 2021, some package maintainers have already begun dropping support for 45 | Python 3.6 by omitting prebuilt wheels published to [pypi.org](https://pypi.org). 46 | For some packages, it is still possible to use Python 3.6 by using the distribution 47 | from the package manager. For example `numpy` no longer publishes Python 3.6 wheels, 48 | but can be installed from the package manager: `yum install python3-numpy`. 49 | 50 | Another option is to use Python 3.8 instead of the default Python package. You can 51 | install Python 3.8 with pip: `yum install python38-pip`. Then use pip to install 52 | the latest versions of packages: `pip3 install numpy`. 53 | 54 | Some common Python packages that are distributed by the package manager are: 55 | * python3-numpy 56 | * python3-markupsafe 57 | * python3-pillow 58 | 59 | To see a full list run `yum search python3` 60 | 61 | 62 | ## Scientific and numerical applications (NumPy, SciPy, BLAS, etc) 63 | 64 | Python relies on native code to achieve high performance. For scientific and 65 | numerical applications, NumPy and SciPy provide an interface to high performance 66 | computing libraries such as ATLAS, BLAS, BLIS, OpenBLAS, etc. These libraries 67 | contain code tuned for Arm64 processors, and especially the Arm Neoverse-N1 core 68 | which powers the Ampere Altra CPU in the NVIDIA Arm HPC Developer Kit. 69 | 70 | It is recommended to use the latest software versions as much as possible. If the latest 71 | version migration is not feasible, please ensure that it is at least the minimum version 72 | recommended below. Multiple fixes related to data precision and correctness on 73 | Arm64 went into OpenBLAS between v0.3.9 and v0.3.17 and the below SciPy and NumPy versions 74 | upgraded OpenBLAS from v0.3.9 to OpenBLAS v0.3.17. 75 | * OpenBLAS: >= v0.3.17 76 | * SciPy: >= v1.7.2 77 | * NumPy: >= 1.21.1 78 | 79 | ### BLIS may be a faster BLAS 80 | 81 | The default SciPy and NumPy binary installations with `pip3 install numpy scipy` 82 | are configured to use OpenBLAS. The default installations of SciPy and NumPy 83 | are easy to setup and well tested. 84 | 85 | Some workloads will benefit from using BLIS. Benchmarking SciPy and NumPy 86 | workloads with BLIS might allow to identify additional performance improvement. 87 | 88 | ### Install NumPy and SciPy with BLIS on Ubuntu and Debian 89 | 90 | On Ubuntu and Debian `apt install python3-numpy python3-scipy` will install NumPy 91 | and SciPy with BLAS and LAPACK libraries. To install SciPy and NumPy with BLIS 92 | and OpenBLAS on Ubuntu and Debian: 93 | ``` 94 | sudo apt -y install python3-scipy python3-numpy libopenblas-dev libblis-dev 95 | sudo update-alternatives --set libblas.so.3-aarch64-linux-gnu \ 96 | /usr/lib/aarch64-linux-gnu/blis-openmp/libblas.so.3 97 | ``` 98 | 99 | To switch between available alternatives: 100 | ``` 101 | sudo update-alternatives --config libblas.so.3-aarch64-linux-gnu 102 | sudo update-alternatives --config liblapack.so.3-aarch64-linux-gnu 103 | ``` 104 | 105 | ### Install NumPy and SciPy with BLIS on RedHat 106 | 107 | As of June 20th, 2020, NumPy now [provides binaries](https://pypi.org/project/numpy/#files) for Arm64. 
108 | 109 | Prerequisites to build SciPy and NumPy with BLIS on RedHat: 110 | ```bash 111 | # Install RedHat prerequisites 112 | sudo yum install "@Development tools" python3-pip python3-devel blas-devel gcc-gfortran 113 | 114 | # Install BLIS 115 | git clone https://github.com/flame/blis $HOME/blis 116 | cd $HOME/blis; 117 | ./configure --enable-threading=openmp --enable-cblas --prefix=/usr cortexa57 118 | make -j; sudo make install 119 | 120 | # Install OpenBLAS 121 | git clone https://github.com/xianyi/OpenBLAS.git $HOME/OpenBLAS 122 | cd $HOME/OpenBLAS 123 | make -j BINARY=64 FC=gfortran USE_OPENMP=1 124 | sudo make PREFIX=/usr install 125 | ``` 126 | 127 | To build and install NumPy and SciPy with BLIS and OpenBLAS: 128 | ```bash 129 | git clone https://github.com/numpy/numpy/ $HOME/numpy 130 | cd $HOME/numpy 131 | pip3 install . 132 | 133 | git clone https://github.com/scipy/scipy/ $HOME/scipy 134 | cd $HOME/scipy 135 | pip3 install . 136 | ``` 137 | 138 | When NumPy and SciPy detect the presence of the BLIS library at build time, they 139 | will use BLIS in priority over the same functionality from BLAS and 140 | OpenBLAS. OpenBLAS or LAPACK libraries need to be installed along BLIS to 141 | provide LAPACK functionality. To change the library dependencies, one can set 142 | environment variables `NPY_BLAS_ORDER` and `NPY_LAPACK_ORDER` before building numpy 143 | and scipy. The default is: 144 | `NPY_BLAS_ORDER=mkl,blis,openblas,atlas,accelerate,blas` and 145 | `NPY_LAPACK_ORDER=mkl,openblas,libflame,atlas,accelerate,lapack`. 146 | 147 | ### Testing NumPy and SciPy installation 148 | 149 | To test that the installed NumPy and SciPy are built with BLIS and OpenBLAS, the 150 | following commands will print native library dependencies: 151 | ```bash 152 | python3 -c "import numpy as np; np.__config__.show()" 153 | python3 -c "import scipy as sp; sp.__config__.show()" 154 | ``` 155 | 156 | In the case of Ubuntu and Debian these commands will print `blas` and `lapack` 157 | which are symbolic links managed by `update-alternatives`. 158 | 159 | 160 | ### Improving BLIS and OpenBLAS performance with multi-threading 161 | When OpenBLAS is built with `USE_OPENMP=1` it will use OpenMP to parallelize the 162 | computations. The environment variable `OMP_NUM_THREADS` can be set to specify 163 | the maximum number of threads. If this variable is not set, the default is to 164 | use a single thread. 165 | 166 | To enable parallelism with BLIS, one needs to both configure with 167 | `--enable-threading=openmp` and set the environment variable `BLIS_NUM_THREADS` 168 | to the number of threads to use, the default is to use a single thread. 169 | 170 | 171 | ### Arm64 support in Conda / Anaconda 172 | Anaconda is a distribution of the Python and R programming languages for scientific computing, that aims to simplify package management and deployment. 173 | 174 | Anaconda has announced [support for Arm64 via AWS Graviton 2 on May 14, 2021](https://www.anaconda.com/blog/anaconda-aws-graviton2). The Ampere Altra CPU found in the NVIDIA Arm HPC DevKit is based on the same Arm Neoverse-N1 core as the AWS Gravition 2, so Anaconda also supports the Ampere Altra. Instructions to install the full Anaconda package installer can be found at https://docs.anaconda.com/anaconda/install/graviton2/. 
175 | 
176 | Anaconda also offers a lightweight version called [Miniconda](https://docs.conda.io/en/latest/miniconda.html) which is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others.
177 | 
178 | See [the Anaconda, Miniconda, Conda, Mamba example](../examples/anaconda.md) for details.
179 | 
180 | 
181 | ## Other common Python packages
182 | 
183 | ### Pillow
184 | Pillow 8.x and later provide binary wheels for all Arm64 platforms, including OSes with 64kB pages like RedHat/CentOS 8.
185 | ```bash
186 | pip3 install --user pillow
187 | ```
188 | should work across all platforms we tested.
189 | 
190 | 
191 | ## Machine Learning Python packages
192 | 
193 | ### PyTorch
194 | PyTorch wheels for nightly builds (CPU builds) are available for Arm64 since PyTorch 1.8.
195 | ```bash
196 | pip install numpy
197 | pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
198 | ```
199 | 
200 | ### DGL
201 | Install PyTorch as described above, then follow the [install from source instructions](https://github.com/dmlc/dgl/blob/master/docs/source/install/index.rst#install-from-source).
202 | 
203 | 
204 | ### Sentencepiece
205 | Install PyTorch as described above. Then:
206 | ```
207 | # get the source and build/install the libraries
208 | git clone https://github.com/google/sentencepiece
209 | cd sentencepiece
210 | mkdir build
211 | cd build
212 | cmake ..
213 | make -j
214 | sudo make install
215 | sudo ldconfig -v
216 | 
217 | cd ../python
218 | vi make_py_wheel.sh
219 | # change the manylinux1_{$1} to manylinux2014_{$1}
220 | 
221 | sudo python3 setup.py install
222 | ```
223 | 
224 | With the above steps, the wheel should be installed.
225 | 
226 | *Important*: Before calling any Python script or starting Python, one of the libraries needs to be set as a preload for Python:
227 | 
228 | ```
229 | export LD_PRELOAD=/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/$LD_PRELOAD
230 | python3
231 | ```
232 | 
233 | ### Morfeusz
234 | 
235 | ```
236 | # download the source
237 | wget http://download.sgjp.pl/morfeusz/20200913/morfeusz-src-20200913.tar.gz
238 | tar -xf morfeusz-src-20200913.tar.gz
239 | cd Morfeusz/
240 | sudo apt install cmake zip build-essential autotools-dev \
241 | python3-stdeb python3-pip python3-all-dev python3-pyparsing devscripts \
242 | libcppunit-dev acl default-jdk swig python3-all-dev python3-stdeb
243 | export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
244 | mkdir build
245 | cd build
246 | cmake ..
247 | sudo make install
248 | sudo ldconfig -v
249 | sudo PYTHONPATH=/usr/local/lib/python make install-builder
250 | ```
251 | 
252 | If you run into issues with the last command (_make install-builder_), please try:
253 | ```
254 | sudo PYTHONPATH=`which python3` make install-builder
255 | ```
256 | 
-------------------------------------------------------------------------------- /languages/rust.md: --------------------------------------------------------------------------------
1 | # Rust on Arm64
2 | 
3 | Rust is supported on Linux/Arm64 systems as a Tier 1 platform, just like x86.
4 | 
5 | ### Large-System Extensions (LSE)
6 | 
7 | LSE improves system throughput for CPU-to-CPU communication, locks, and mutexes.
8 | The improvement can be up to an order of magnitude when using LSE instead of
9 | load/store exclusives.
LSE can be enabled in Rust and we've seen cases on 10 | larger machines where performance is improved by over 3x by setting the `RUSTFLAGS` 11 | environment variable and rebuilding your project. 12 | 13 | ``` 14 | export RUSTFLAGS="-Ctarget-cpu=neoverse-n1" 15 | cargo build --release 16 | ``` 17 | -------------------------------------------------------------------------------- /optimization/README.md: -------------------------------------------------------------------------------- 1 | # Optimizing for Arm64 2 | 3 | ## Detecting Arm Hardware Capabilities at Runtime 4 | 5 | There are several ways to determine the available Arm CPU resources and topology at runtime, including: 6 | 7 | * CPU architecture and supported instructions 8 | * CPU manufacturer 9 | * Number of CPU sockets 10 | * CPU cores per socket 11 | * Number of NUMA nodes 12 | * Number of NUMA nodes per socket 13 | * CPU cores per NUMA node 14 | 15 | See [Runtime CPU Detection](cpudetect.md) for more details and example code. 16 | 17 | 18 | ## Debugging Problems 19 | 20 | It's possible that incorrect code will work fine on an existing system, but 21 | produce an incorrect result when using a new compiler. This could be because 22 | it relies on undefined behavior in the language (e.g. assuming char is signed in C/C++, 23 | or the behavior of signed integer overflow), contains memory management bugs that 24 | happen to be exposed by aggressive compiler optimizations, or incorrect ordering. 25 | Below are some techniques / tools we have used to find issues 26 | while migrating our internal services to newer compilers and Arm64. 27 | 28 | ### Using Sanitizers 29 | The compiler may generate code and layout data slightly differently on Arm64 30 | compared to an x86 system and this could expose latent memory bugs that were previously 31 | hidden. On GCC, the easiest way to look for these bugs is to compile with the 32 | memory sanitizers by adding the below to standard compiler flags: 33 | 34 | ``` 35 | CFLAGS += -fsanitize=address -fsanitize=undefined 36 | LDFLAGS += -fsanitize=address -fsanitize=undefined 37 | ``` 38 | 39 | Run the resulting binary, and any bugs detected by the sanitizers will cause 40 | the program to exit immediately and print helpful stack traces and other 41 | information. 42 | 43 | ### Ordering issues 44 | Arm is weakly ordered, similar to POWER and other modern architectures, while 45 | x86 is a variant of total-store-ordering (TSO). 46 | Code that relies on TSO may lack barriers to properly order memory references. 47 | Arm64 systems are [weakly ordered multi-copy-atomic](https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf). 48 | 49 | While TSO allows reads to occur out-of-order with writes and a processor to 50 | observe its own write before it is visible to others, the Armv8 memory model has 51 | further relaxations for performance and power efficiency. 52 | **Code relying on pthread mutexes or locking abstractions 53 | found in C++, Java or other languages shouldn't notice any difference.** Code that 54 | has a bespoke implementation of lockless data structures or implements its own 55 | synchronization primitives will have to use the proper intrinsics and 56 | barriers to correctly order memory transactions. See [Locks, Synchronization, and Atomics](atomics.md) for more information. 57 | 58 | 59 | ### Architecture specific optimization 60 | Sometimes code will have architecture specific optimizations. 
These can take many forms: 61 | sometimes the code is optimized in assembly using specific instructions for 62 | [CRC](https://github.com/php/php-src/commit/2a535a9707c89502df8bc0bd785f2e9192929422), 63 | other times the code could be enabling a [feature](https://github.com/lz4/lz4/commit/605d811e6cc94736dd609c644404dd24c013fd6f) 64 | that has been shown to work well on particular architectures. A quick way to see if any optimizations 65 | are missing for Arm64 is to grep the code for "`__x86_64__`" preprocessor branches (`ifdef`) and see if there 66 | is corresponding Arm64 code there too. If not, that might be something to improve. 67 | 68 | ### Lock/Synchronization intensive workload 69 | All server-class Arm64 processors support low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See [Locks, Synchronization, and Atomics](atomics.md) for details. 70 | 71 | ## Profiling the code 72 | If you aren't getting the performance you expect, one of the best ways to understand what is 73 | going on in the system is to compare profiles of execution and understand where the CPU cores are 74 | spending time. This will frequently point to a hot function that could be optimized. 75 | 76 | Install the Linux perf tool: 77 | ```bash 78 | # Redhat 79 | sudo yum install perf 80 | 81 | # Ubuntu 82 | sudo apt-get install linux-tools-$(uname -r) 83 | ``` 84 | 85 | Record a profile: 86 | ```bash 87 | # If the program is run interactively 88 | sudo perf record -g -F99 -o perf.data ./your_program 89 | 90 | # If the program is a service, sample all cpus (-a) and run for 60 seconds while the system is loaded 91 | sudo perf record -ag -F99 -o perf.data sleep 60 92 | ``` 93 | 94 | Look at the profile: 95 | ```bash 96 | perf report 97 | ``` 98 | 99 | Additionally, there is a tool that will generate a visual representation of the output which can sometimes 100 | be more useful: 101 | ```bash 102 | git clone https://github.com/brendangregg/FlameGraph.git 103 | perf script -i perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg 104 | ``` -------------------------------------------------------------------------------- /optimization/atomics.md: -------------------------------------------------------------------------------- 1 | # Locks, Synchronization, and Atomics 2 | 3 | Synchronization is a hot topic during the software migration process. Arm64 systems typically have more CPU cores than other architectures, so efficient synchronization is critical to achieving good performance. Synchronization is also a complex and nuanced topic. If you're looking for a high level overview, you'll find it here. If you would like a more detailed look, Arm Inc. have published an excellent whitepaper on this topic. You can deep dive this topic by reading [Synchronization Overview and Case Study on Arm Architecture 4 | ](https://developer.arm.com/documentation/107630/1-0/?lang=en). 5 | 6 | 7 | ## The Arm Memory Model 8 | One of the most significant differences between Arm and X86 CPUs is their memory model: the Arm architecture has a weak memory model that differs from the x86 architecture TSO (Total Store Order) model. Different memory models can cause low-level codes to function well on one architecture but encounter performance problem or failure on the other. 
The good news is that Arm's more relaxed memory model allows for more compiler and hardware optimization to boost system performance. 9 | 10 | **Generally speaking, you should only care about the Arm memory model if you are writing low level code, e.g. assembly language.** The details of Arm's memory model lie well below the application level and will be completely invisible to most users. If you are writing in a high level language like C/C++ or Fortran, you do not need to know the nuances of Arm's memory model. The one exception to this general rule is code that uses boutique synchronization constructs instead of standard best practices, e.g. using `volatile` as a means of thread synchronization. Deviating from established standards or ignoring best practices results in code that is almost guaranteed to be broken. It should be rewritten using system-provided locks, conditions, etc. and the `stdatomic` tools. Here's an example of such a bug: https://github.com/ParRes/Kernels/issues/611 11 | 12 | Arm isn't the only architecture using a weak memory model. If your application already runs on CPUs that aren't x86-based, you're likely to encounter fewer bugs related to the weak memory model. In particular, if your application has been ported to a CPU implementing the POWER architecture (e.g. IBM POWER9) then it is likely to work perfectly on the Arm memory model. 13 | 14 | 15 | ## Large-System Extension (LSE) Atomic Instructions 16 | All server-class Arm64 processors have support for the Large-System Extension (LSE) which was first introduced in Armv8.1. LSE provides low-cost atomic operations which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. Note that this is not generally true for older Arm64 CPUs like the Marvell ThunderX2 or the Fujitsu A64FX. Please see [these slides from the ISC 2022 AHUG Workshop](https://agenda.isc-hpc.com/media/slides_pdf/0900_Arm_HPC_User_Group_at_ISC22_wxIExtw.pdf) for more details. 17 | 18 | You'll get the best performance from a POSIX threads library that uses LSE atomic instructions. LSE atomics are important for locking and thread synchronization routines. Many Linux distributions provide a libc compiled with LSE instructions. For example: 19 | - RHEL 8.4 and later 20 | - Ubuntu 20.04 and later 21 | 22 | Some distributions need an additional package to support LSE. For example, Ubuntu 18.04 needs `apt install libc6-lse`. Please see [the operating systems page](../software/os.md) for more details. 23 | 24 | Using atomics is not just a good idea, it's basically required for writing highly parallel code. High core count systems using exclusive instructions instead of atomics are not guaranteed to make progress. One core complex can monopolize a resource while the other starves. 25 | 26 | When building an application from source, the compiler needs to generate LSE atomic instructions for applications that use atomic operations. For example, the code of databases like PostgreSQL contain atomic constructs; c++11 code with std::atomic statements translate into atomic operations. Since GCC 9.4, GCC's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. 
To confirm that LSE instructions are created, the output of `objdump` command line utility should contain LSE instructions: 27 | ```bash 28 | $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l 29 | ``` 30 | To check whether the application binary contains load and store exclusives: 31 | ```bash 32 | $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l 33 | ``` 34 | 35 | ## Language-specific Guidance 36 | 37 | Check the [Languages](../languages/README.md) page for any language-specific guidance related to LSE, locking, synchronization, and atomics. If no guide is provided then there are no Arm-related specific issues for that language. Just proceed as you would on any other platform. 38 | -------------------------------------------------------------------------------- /optimization/cpudetect.md: -------------------------------------------------------------------------------- 1 | # Runtime CPU Detection 2 | 3 | There are several ways to determine the available Arm CPU resources and topology at runtime, including: 4 | 5 | * CPU architecture and supported instructions 6 | * CPU manufacturer 7 | * Number of CPU sockets 8 | * CPU cores per socket 9 | * Number of NUMA nodes 10 | * Number of NUMA nodes per socket 11 | * CPU cores per NUMA node 12 | 13 | Well-established portable libraries like libnuma and hwloc are a great choice on Arm64. You can also use Arm's CPUID registers or query OS files. Since many of these methods serve the same function, you should choose the method that best fits your application. 14 | 15 | If you're implementing your own approach, then please look at the Arm Architecture Registers, and especially the Main ID Register (MIDR_EL1): https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register 16 | 17 | The source code for the `lscpu` utility is a great example of how to retrieve and use these registers. For example, here's how to translate the CPU part number in the MIDR_EL1 register to a human-readable string: https://github.com/util-linux/util-linux/blob/master/sys-utils/lscpu-arm.c 18 | 19 | Here's the output of `lscpu` on an AWS Graviton 3. 20 | ``` 21 | [jlinford@c7g-16xlarge-dy-c7g16xlarge-1 ~]$ lscpu 22 | Architecture: aarch64 23 | Byte Order: Little Endian 24 | CPU(s): 64 25 | On-line CPU(s) list: 0-63 26 | Thread(s) per core: 1 27 | Core(s) per socket: 64 28 | Socket(s): 1 29 | NUMA node(s): 1 30 | Model: 1 31 | BogoMIPS: 2100.00 32 | L1d cache: 64K 33 | L1i cache: 64K 34 | L2 cache: 1024K 35 | L3 cache: 32768K 36 | NUMA node0 CPU(s): 0-63 37 | Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng 38 | ``` 39 | 40 | ## CPU Hardware Capabilities 41 | 42 | To make your binaries more portable across various Arm64 CPUs, you can use Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core compliant with Armv8.4 must support dot-product, but dot-products are optional in Armv8.2 and Armv8.3. A developer wanting to build an application or library that can detect the supported instructions in runtime, can follow this example: 43 | 44 | ```c 45 | #include 46 | ...... 47 | uint64_t hwcaps = getauxval(AT_HWCAP); 48 | has_crc_feature = hwcaps & HWCAP_CRC32 ? 
true : false; 49 | has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false; 50 | has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false; 51 | has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false; 52 | has_sve_feature = hwcaps & HWCAP_SVE ? true : false; 53 | ``` 54 | 55 | The full list of Arm64 hardware capabilities is defined in [glibc header file](https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h) and in the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/hwcap.h). 56 | 57 | ## Example Source Code 58 | 59 | Here's a complete yet simple example code that higlights some of the methods mentioned above. 60 | 61 | ```c 62 | #include 63 | #include 64 | #include 65 | 66 | // https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register 67 | typedef union 68 | { 69 | struct { 70 | unsigned int revision : 4; 71 | unsigned int part : 12; 72 | unsigned int arch : 4; 73 | unsigned int variant : 4; 74 | unsigned int implementer : 8; 75 | unsigned int _RES0 : 32; 76 | }; 77 | unsigned long bits; 78 | } MIDR_EL1; 79 | 80 | static MIDR_EL1 read_MIDR_EL1() 81 | { 82 | MIDR_EL1 reg; 83 | asm("mrs %0, MIDR_EL1" : "=r" (reg.bits)); 84 | return reg; 85 | } 86 | 87 | 88 | static const char * get_implementer_name(MIDR_EL1 midr) 89 | { 90 | switch(midr.implementer) 91 | { 92 | case 0xC0: return "Ampere"; 93 | case 0x41: return "Arm"; 94 | case 0x42: return "Broadcom"; 95 | case 0x43: return "Cavium"; 96 | case 0x44: return "DEC"; 97 | case 0x46: return "Fujitsu"; 98 | case 0x48: return "HiSilicon"; 99 | case 0x49: return "Infineon"; 100 | case 0x4D: return "Motorola"; 101 | case 0x4E: return "NVIDIA"; 102 | case 0x50: return "Applied Micro"; 103 | case 0x51: return "Qualcomm"; 104 | case 0x56: return "Marvell"; 105 | case 0x69: return "Intel"; 106 | default: return "Unknown"; 107 | } 108 | } 109 | 110 | 111 | static const char * get_part_name(MIDR_EL1 midr) 112 | { 113 | switch(midr.implementer) 114 | { 115 | case 0x41: // Arm Ltd. 
116 | switch (midr.part) { 117 | case 0xd03: return "Cortex A53"; 118 | case 0xd07: return "Cortex A57"; 119 | case 0xd08: return "Cortex A72"; 120 | case 0xd09: return "Cortex A73"; 121 | case 0xd0c: return "Neoverse N1"; 122 | case 0xd40: return "Neoverse V1"; 123 | default: return "Unknown"; 124 | } 125 | case 0x42: // Broadcom 126 | switch (midr.part) { 127 | case 0x516: return "Vulcan"; 128 | default: return "Unknown"; 129 | } 130 | case 0x43: // Cavium 131 | switch (midr.part) { 132 | case 0x0a1: return "ThunderX"; 133 | case 0x0af: return "ThunderX2"; 134 | default: return "Unknown"; 135 | } 136 | case 0x46: // Fujitsu 137 | switch (midr.part) { 138 | case 0x001: return "A64FX"; 139 | default: return "Unknown"; 140 | } 141 | case 0x4E: // NVIDIA 142 | switch (midr.part) { 143 | case 0x000: return "Denver"; 144 | case 0x003: return "Denver 2"; 145 | case 0x004: return "Carmel"; 146 | default: return "Unknown"; 147 | } 148 | case 0x50: // Applied Micro 149 | switch (midr.part) { 150 | case 0x000: return "EMAG 8180"; 151 | default: return "Unknown"; 152 | } 153 | default: return "Unknown"; 154 | } 155 | } 156 | 157 | 158 | 159 | 160 | int main(void) 161 | { 162 | // Main ID register 163 | MIDR_EL1 midr = read_MIDR_EL1(); 164 | 165 | // CPU ISA capabilities 166 | unsigned long hwcaps = getauxval(AT_HWCAP); 167 | 168 | printf("CPU revision : 0x%x\n", midr.revision); 169 | printf("CPU part number : 0x%x (%s)\n", midr.part, get_part_name(midr)); 170 | printf("CPU architecture: 0x%x\n", midr.arch); 171 | printf("CPU variant : 0x%x\n", midr.variant); 172 | printf("CPU implementer : 0x%x (%s)\n", midr.implementer, get_implementer_name(midr)); 173 | printf("CPU LSE atomics : %sSupported\n", (hwcaps & HWCAP_ATOMICS) ? "" : "Not "); 174 | printf("CPU NEON SIMD : %sSupported\n", (hwcaps & HWCAP_ASIMD) ? "" : "Not "); 175 | printf("CPU SVE SIMD : %sSupported\n", (hwcaps & HWCAP_SVE) ? "" : "Not "); 176 | printf("CPU Dot-product : %sSupported\n", (hwcaps & HWCAP_ASIMDDP) ? "" : "Not "); 177 | printf("CPU FP16 : %sSupported\n", (hwcaps & HWCAP_FPHP) ? "" : "Not "); 178 | printf("CPU BF16 : %sSupported\n", (hwcaps & HWCAP2_BF16) ? "" : "Not "); 179 | 180 | if (numa_available() == -1) { 181 | printf("libnuma not available\n"); 182 | } 183 | printf("CPU NUMA nodes : %d\n", numa_num_configured_nodes()); 184 | printf("CPU Cores : %d\n", numa_num_configured_cpus()); 185 | 186 | return 0; 187 | } 188 | ``` -------------------------------------------------------------------------------- /optimization/vectorization.md: -------------------------------------------------------------------------------- 1 | # Arm SIMD Instructions: SVE and NEON 2 | 3 | The Arm architecture provides two single-instruction-multiple-data (SIMD) instruction extensions: NEON and SVE. 4 | 5 | Arm Advanced SIMD Instructions (a.k.a. "NEON") is the most common SIMD ISA for Arm64. It is a fixed-length SIMD ISA that supports vectors of 128 bits. The first Arm-based supercomputer to appear on the Top500 Supercomputers list ([_Astra_](https://www.sandia.gov/labnews/2018/11/21/astra-2/)) used NEON to accelerate linear algebra, and many applications and libraries are already taking advantage of NEON. The Ampere Altra CPU found in the NVIDIA Arm HPC Developer Kit supports NEON vectorization. 6 | 7 | More recently, Arm64 CPUs have started supporting [Arm Scalable Vector Extensions (SVE)](https://developer.arm.com/documentation/102476/latest/). SVE is a length-agnostic SIMD ISA that supports more datatypes than NEON (e.g. 
FP16), offers more powerful instructions (e.g. gather/scatter), and supports vector lengths of more than 128 bits. SVE is currently found in the AWS Graviton 3, Fujitsu A64FX, and the Alibaba Yitian 710. SVE is not a new version of NEON, but an entirely new SIMD ISA in its own right. Most SVE-capable CPUs also support NEON.
8 | 
9 | Here's a quick summary of the SIMD capabilities of some of the currently available Arm64 CPUs:
10 | 
11 | |                        | Alibaba Yitian 710 | AWS Graviton3 | Fujitsu A64FX  | AWS Graviton2 | Ampere Altra |
12 | | ---------------------- | ------------------ | ------------- | -------------- | ------------- | ------------ |
13 | | CPU Core               | Neoverse N2        | Neoverse V1   | A64FX          | Neoverse N1   | Neoverse N1  |
14 | | SIMD ISA               | SVE2 & NEON        | SVE & NEON    | SVE & NEON     | NEON only     | NEON only    |
15 | | NEON Configuration     | 2x128              | 4x128         | 2x128          | 2x128         | 2x128        |
16 | | SVE Configuration      | 2x128              | 2x256         | 2x512          | N/A           | N/A          |
17 | | SVE Version            | 2                  | 1             | 1              | N/A           | N/A          |
18 | | NEON FMLA FP64 TPeak   | 8                  | 16            | 8              | 8             | 8            |
19 | | SVE FMLA FP64 TPeak    | 8                  | 16            | 32             | N/A           | N/A          |
20 | 
21 | Note that recent Arm64 CPUs provide the same peak theoretical performance for both NEON and SVE. For example, the Graviton3 can either retire four 128-bit NEON operations or two 256-bit SVE operations. The Yitian 710 takes this one step further and provides both NEON and SVE in the same configuration (2x128). On paper, the peak performance of both SVE and NEON is the same for these CPUs, which means there's no intrinsic performance advantage for SVE vs. NEON, or vice versa. (Note: there are micro-architectural details that can give one ISA a performance advantage over the other in certain conditions, but the upper limit on performance is always the same.)
22 | 
23 | However, SVE ([and especially SVE2](https://developer.arm.com/documentation/102340/0001/Introducing-SVE2)) is a much more capable SIMD ISA with support for complex datatypes and advanced features that enable vectorization of complicated code. In practice, kernels that can't be vectorized in NEON _can_ be vectorized with SVE. So while SVE won't beat NEON in a performance drag race, it can dramatically improve the performance of the application overall by vectorizing loops that would otherwise have executed with scalar instructions.
24 | 
25 | Fortunately, auto-vectorizing compilers are usually the best choice when programming Arm SIMD ISAs. The compiler will generally make the best decision on when to use SVE or NEON, and it will take advantage of SVE's advanced auto-vectorization features more easily than a human coding in intrinsics or assembly is likely to do. **Generally speaking, you should not write SVE or NEON intrinsics.** Instead, use the appropriate command line options with your auto-vectorizing compiler to realize the best performance for a given loop. You may need to use compiler directives or make changes in the high level code to facilitate autovectorization, but this will be much easier and more maintainable than writing intrinsics. Leave the finer details to the compiler and focus on code patterns that auto-vectorize well.
26 | 
27 | 
28 | ## Compiler-driven auto-vectorization
29 | The key to maximizing auto-vectorization is to allow the compiler to take full advantage of the available hardware features. By default, GCC and LLVM compilers take a conservative approach and do not enable advanced features unless explicitly told to do so. The easiest way to enable all available features for GCC or LLVM is to use the `-mcpu` compiler flag.
If you're compiling on the same CPU that the code will run on, use `-mcpu=native`. Otherwise you can use `-mcpu=<cpu>` where `<cpu>` is one of the supported CPU identifiers, e.g. `-mcpu=neoverse-n1`. The NVIDIA compilers take a more aggressive approach. By default, they assume the machine you are compiling on is the one you will run on and so will enable all available hardware features detected at compile time. And whenever possible, use the most recent version of your compiler. For example, GCC 9 supported auto-vectorization fairly well, GCC 10 shows impressive improvements over GCC 9 in most cases, and GCC 12 further improves auto-vectorization. 30 | 31 | The second key compiler feature is the compiler vectorization report. GCC uses the `-fopt-info` flags to report on auto-vectorization success or failure. You can use the generated informational messages to guide code annotations or transformations that will facilitate auto-vectorization. For example, compiling with `-fopt-info-vec-missed` will report on which loops were not vectorized in a code like this: 32 | ```c 33 | 1 // test.c 34 | ... 35 | 5 float a[1024*1024]; 36 | 6 float b[1024*1024]; 37 | 7 float c[1024*1024]; 38 | ... 39 | 37 for (j=0; j<128;j++) { // outer loop, not expected to be vectorized 40 | 38   for (i=0; i<1024*1024;i++) { // inner loop, expected to be vectorized 41 | ... 42 | ``` 43 | 44 | Note that GCC will not implicitly convert between NEON intrinsic vector types, even when the two types have the same size: 45 | 46 | ```c 47 | #include <arm_neon.h> 94 | ... 95 | uint64x2_t u64x2; 96 | int64x2_t s64x2; 97 | // Error: cannot convert 'int64x2_t' to 'uint64x2_t' in assignment 98 | u64x2 = s64x2; 99 | ``` 100 | 101 | To perform the cast, you must use NEON's `vreinterpretq` functions: 102 | ```c 103 | u64x2 = vreinterpretq_u64_s64(s64x2); 104 | ``` 105 | 106 | Unfortunately, some codes written for other SIMD ISAs rely on these kinds of implicit conversions (see the [Velox example](../examples/velox.md)). If you see errors about "no known conversion" in a code that builds for AVX but doesn't build for NEON, then you may need to relax GCC's vector conversion rules: 107 | ``` 108 | /tmp/velox/third_party/xsimd/include/xsimd/types/xsimd_batch.hpp:35:11: note: no known conversion for argument 1 from ‘xsimd::batch’ to ‘const xsimd::batch&’ 109 | ``` 110 | To allow implicit conversions between vectors with differing numbers of elements and/or incompatible element types, use the `-flax-vector-conversions` flag. This flag should be fine for legacy code, but it should not be used for new code. The safest option is to use the appropriate `vreinterpretq` calls. 111 | 112 | 113 | ## Runtime detection of supported SIMD instructions 114 | To make your binaries more portable across various Arm64 CPUs, you can use Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core compliant with Armv8.4 must support dot-product, but dot-products are optional in Armv8.2 and Armv8.3. A developer wanting to build an application or library that can detect the supported instructions at runtime can follow this example: 115 | 116 | ```c 117 | #include <sys/auxv.h> 118 | ...... 119 | uint64_t hwcaps = getauxval(AT_HWCAP); 120 | has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false; 121 | has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false; 122 | has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false; 123 | has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false; 124 | has_sve_feature = hwcaps & HWCAP_SVE ? true : false; 125 | ``` 126 | 127 | The full list of Arm64 hardware capabilities is defined in the [glibc header file](https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h) and in the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/hwcap.h). 128 | 129 | ## Porting codes with SSE/AVX intrinsics to NEON 130 | 131 | ### Detecting Arm64 systems 132 | Projects may fail to build on Arm64 with `error: unrecognized command-line 133 | option '-msse2'`, or `-mavx`, `-mssse3`, etc. These compiler flags enable x86 134 | vector instructions. The presence of this error means that the build system may 135 | be missing the detection of the target system, and continues to use the x86 136 | target features compiler flags when compiling for Arm64. 137 | 138 | To detect an Arm64 system, the build system can use: 139 | ```bash 140 | (test $(uname -m) = "aarch64" && echo "arm64 system") || echo "other system" 141 | ``` 142 | 143 | Another way to detect an arm64 system is to compile, run, and check the return 144 | value of a C program: 145 | ```c 146 | # cat << EOF > check-arm64.c 147 | int main () { 148 | #ifdef __aarch64__ 149 | return 0; 150 | #else 151 | return 1; 152 | #endif 153 | } 154 | EOF 155 | 156 | # gcc check-arm64.c -o check-arm64 157 | # (./check-arm64 && echo "arm64 system") || echo "other system" 158 | ``` 159 | 160 | ### Translating x86 intrinsics to NEON 161 | When programs contain code with x86 intrinsics, drop-in intrinsic translation tools like [SIMDe](https://github.com/simd-everywhere/simde) or [sse2neon](https://github.com/DLTcollab/sse2neon) can be used to quickly obtain a working program on Arm64. This is a good starting point for rewriting the x86 intrinsics in either NEON or SVE and will quickly get a prototype up and running. For example, to port code using AVX2 intrinsics with SIMDe: 162 | ```c 163 | #define SIMDE_ENABLE_NATIVE_ALIASES 164 | #include "simde/x86/avx2.h" 165 | ``` 166 | 167 | SIMDe provides a quick starting point to port performance critical codes 168 | to Arm64. It shortens the time needed to get a working program that then 169 | can be used to extract profiles and to identify hot paths in the code. 170 | Once a profile is established, the hot paths can be rewritten to avoid the overhead of the generic translation. 171 | 172 | Since you're rewriting your x86 intrinsics, you might want to take this opportunity to create a more portable version. Here are some suggestions to consider: 173 | 174 | 1. Rewrite in native C/C++, Fortran, or another high-level compiled language. Compilers are constantly improving, and technologies like Arm SVE enable the autovectorization of codes that formerly wouldn't vectorize. You may be able to avoid platform-specific intrinsics entirely and let the compiler do all the work. 175 | 2. If your application is written in C++, use `std::experimental::simd` from the C++ Parallelism Technical Specification V2 via the `<experimental/simd>` header. 176 | 3. Use the [SLEEF Vectorized Math Library](https://sleef.org/) as a header-based set of "portable intrinsics". 177 | 4. Instead of Time Stamp Counter (TSC) RDTSC intrinsics, use standards-compliant portable timers, e.g., `std::chrono` (C++), `clock_gettime` (C/POSIX), `omp_get_wtime` (OpenMP), `MPI_Wtime` (MPI), etc. (see the sketch after this list).
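To make the last point concrete, here is a minimal sketch of a portable wall-clock timer built on `clock_gettime`; the file and function names are illustrative and the work being timed is left as a placeholder:

```c
// portable_timer.c -- portable timing without x86 RDTSC (illustrative sketch)
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

// Seconds from a monotonic clock; behaves the same on Arm64 and x86_64
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    double t0 = wall_seconds();
    /* ... kernel being timed goes here ... */
    double t1 = wall_seconds();
    printf("elapsed: %.9f s\n", t1 - t0);
    return 0;
}
```

The same measurement could be expressed with `std::chrono::steady_clock` in C++ or `omp_get_wtime` in OpenMP code; the point is simply to avoid architecture-specific cycle counters.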
178 | 179 | 180 | ## Additional resources 181 | 182 | * [Neon Intrinsics](https://developer.arm.com/architectures/instruction-sets/intrinsics/) 183 | * [Coding for Neon](https://developer.arm.com/documentation/102159/latest/) 184 | * [Neon Programmer's Guide for Armv8-A](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a) 185 | -------------------------------------------------------------------------------- /slack.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arm-hpc-devkit/nvidia-arm-hpc-devkit-users-guide/77ad9c72ab9c19e2e333184e8839d97e891fbaa0/slack.png -------------------------------------------------------------------------------- /software/README.md: -------------------------------------------------------------------------------- 1 | # Arm64 Software Ecosystem 2 | 3 | ## Compilers 4 | Many commercial and open source compilers now support Arm64. See [the compilers page](compilers.md) for details, recommendations, and best practices. We also recommend you check the [language-specific considerations](../languages/README.md#language-specific-considerations). 5 | 6 | ## Message Passing (MPI) 7 | Practically all MPI libraries support Arm64 with the notable exception of Intel MPI. See [the MPI page](mpi.md) for details, recommendations, and best practices. 8 | 9 | ## Math Libraries 10 | Many math libraries support Arm64 CPUs. In many cases, math libraries supporting Arm64 (e.g. BLIS, OpenBLAS, FFTW, etc.) can be substituted for libraries that have not yet announced support for Arm64 (e.g. Intel MKL). Arm community members like NVIDIA, AWS, and Oracle are actively contributing to enabling Arm64 support wherever possible. See [the math libraries page](mathlibs.md) for details, recommendations, and best practices. 11 | 12 | All NVIDIA GPU math libraries work perfectly on Arm-hosted GPUs. In this case, the architecture of the host CPU is irrelevant. If your application uses GPU-accelerated math libraries, proceed exactly as you would on any other platform. 13 | 14 | ## Package Managers e.g. Spack and EasyBuild 15 | Package managers like Spack and EasyBuild are a great way to get started with Arm64. Spack is very well supported on Arm64 and can build full software stacks for CPU-only or CPU+GPU applications using GCC, LLVM, Arm, or NVIDIA compilers and math libraries. Since the Ampere Altra CPU found in the NVIDIA Arm HPC Developer Kit is architecturally similar to the AWS Graviton 2, we recommend you use [AWS's Spack Rolling Binary Cache](https://aws.amazon.com/blogs/hpc/introducing-the-spack-rolling-binary-cache/) to accelerate your Spack software stack deployments. [See this AWS blog post for more details](https://aws.amazon.com/blogs/hpc/introducing-the-spack-rolling-binary-cache/). 16 | 17 | ## Containers 18 | Docker, Kubernetes, Singularity, Charliecloud, Sarus, and many more container engines and frameworks run with excellent performance on Arm64. Please refer [here](containers.md) for information about running container-based workloads. 19 | 20 | ## Operating Systems 21 | Please check [here](os.md) for more information about which operating system to run on the NVIDIA Arm HPC Developer Kit. 22 | 23 | ## Recent Updates, Known Issues, and Workarounds 24 | Please see [here](known_issues.md) for recently identified issues and their solutions.
Please also check the [AWS Graviton Getting Started Guide](https://github.com/aws/aws-graviton-getting-started) for known issues and workarounds. -------------------------------------------------------------------------------- /software/compilers.md: -------------------------------------------------------------------------------- 1 | # Compilers for Arm64 2 | 3 | Many commercial and open source compilers now support Arm64. This page provides details, recommendations, and best practices. We also recommend you check the [language-specific considerations](../languages/README.md#language-specific-considerations). 4 | 5 | ## NVIDIA HPC Compilers 6 | 7 | The NVIDIA HPC SDK includes proven compilers, libraries and software tools essential to maximizing developer productivity and the performance and portability of HPC applications. The NVIDIA HPC SDK compilers enable cross-platform C, C++, and Fortran programming for NVIDIA GPUs and multicore Arm, OpenPOWER, or x86-64 CPUs. They are ideal for HPC modeling and simulation applications written in C, C++, or Fortran with OpenMP, OpenACC, and CUDA. 8 | 9 | These compilers are also fully interoperable with NVIDIA’s optimized math libraries, communication libraries, and performance tuning and debugging tools. The NVIDIA HPC SDK’s accelerated math libraries maximize performance on common HPC algorithms, and the optimized communications libraries enable standards-based scalable systems programming. The integrated performance profiling and debugging tools simplify porting and optimization of HPC applications, and the containerization tools enable easy deployment on-premises or in the cloud. In short, the HPC SDK provides all the tools you need to build HPC applications for GPUs, CPUs, or the cloud. 10 | 11 | 12 | ## GNU Compilers 13 | 14 | When using the GNU compilers, we strongly recommend GCC version 11 or later. Arm Inc. is a long-standing contributor to the GNU compilers, so much so that Arm and Arm's partners currently contribute a large share of the updates to GCC's AArch64 support. Quite old GNU compilers will work on Arm64 CPUs, but you should always use the latest versions for best performance. 15 | 16 | ## LLVM Compilers 17 | 18 | LLVM compilers support Arm64 CPUs quite well, though mainly for C and C++ (clang and clang++). LLVM's Fortran front-end (flang) is not widely used and is still maturing. Arm Inc. provides multiple LLVM-based compilers, both for server-class Arm64 systems and for embedded and mobile devices. The Arm Compiler for Linux is based on LLVM, and Arm is committed to upstreaming all LLVM patches it develops. 19 | 20 | ## Arm Compiler for Linux 21 | 22 | The Arm Compiler for Linux is tailored to the development of HPC applications. It is a free-to-use compiler built on LLVM 13. For more details see [this blog post about the launch of Arm Compiler for Linux version 22.0.](https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/arm-compiler-for-linux-and-arm-performance-libraries-22-0) 23 | 24 | ## HPE/Cray, Fujitsu, and Other Vendor Compilers 25 | 26 | Arm vendors like HPE/Cray and Fujitsu also provide compilers that target their own Arm-based products. Code generated by these vendor compilers tends to be very highly tuned for the target platform and therefore will not run well, if at all, on the NVIDIA Arm HPC Developer Kit.
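Whichever compiler you settle on, it is worth a quick sanity check that the versions on your `PATH` are the ones you expect before starting a large build. The commands below are only a sketch and assume a typical Linux installation; the NVIDIA HPC SDK compilers may need to be added to the `PATH` first (for example via the environment modules shipped with the SDK):

```bash
# GNU toolchain -- GCC 11 or later is recommended on Arm64 (see above)
gcc --version | head -n1
g++ --version | head -n1
gfortran --version | head -n1

# LLVM, if installed
clang --version | head -n1

# NVIDIA HPC SDK compilers, if installed
command -v nvc nvc++ nvfortran
```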
27 | 28 | -------------------------------------------------------------------------------- /software/containers.md: -------------------------------------------------------------------------------- 1 | # Containers on Arm64 2 | 3 | Containerization has long been of interest to the Arm community. Today, Arm64 CPUs are ideal for container-based workloads. 4 | 5 | ## NVIDIA Container Toolkit 6 | 7 | The [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) allows users to build and run GPU accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to leverage NVIDIA GPUs. Follow this installation guide to get started: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html 8 | 9 | As an example, here are the steps for installing NVIDIA Container Toolkit on Ubuntu 20.04 with Docker. Note that other container frameworks like Podman are also supported. 10 | 11 | ```bash 12 | # Install Docker dependencies 13 | sudo apt-get install ca-certificates curl gnupg lsb-release 14 | 15 | # Add Docker repo 16 | echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null 17 | 18 | # Install Docker 19 | sudo apt-get update 20 | sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin 21 | 22 | # Enable docker service 23 | sudo systemctl --now enable docker 24 | 25 | # Add the NVIDIA Container Toolkit repository 26 | distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \ 27 | && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ 28 | && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \ 29 | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ 30 | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list 31 | 32 | # Install NVIDIA Container Toolkit 33 | sudo apt update 34 | sudo apt install nvidia-docker2 35 | 36 | # Restart docker services to enable GPU support 37 | sudo systemctl restart docker 38 | 39 | # Run a simple test using the CUDA multi-arch container 40 | sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu18.04 nvidia-smi 41 | Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu18.04' locally 42 | 11.0.3-base-ubuntu18.04: Pulling from nvidia/cuda 43 | e196da37f904: Pull complete 44 | 0b7ba59c359b: Pull complete 45 | 84bc5f8689bc: Pull complete 46 | b926124172ef: Pull complete 47 | fef6c6f16e98: Pull complete 48 | Digest: sha256:f7b595695b06ad8590aed1accd6437ba068ca44e71c5cf9c11c8cb799c2d8335 49 | Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu18.04 50 | Thu Jul 7 17:57:04 2022 51 | +-----------------------------------------------------------------------------+ 52 | | NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 | 53 | |-------------------------------+----------------------+----------------------+ 54 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 55 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 56 | | | | MIG M. | 57 | |===============================+======================+======================| 58 | | 0 NVIDIA A100 80G... 
Off | 0000000C:01:00.0 Off | 0 | 59 | | N/A 44C P0 65W / 300W | 0MiB / 81920MiB | 0% Default | 60 | | | | Disabled | 61 | +-------------------------------+----------------------+----------------------+ 62 | | 1 NVIDIA A100 80G... Off | 0000000D:01:00.0 Off | 0 | 63 | | N/A 36C P0 63W / 300W | 0MiB / 81920MiB | 2% Default | 64 | | | | Disabled | 65 | +-------------------------------+----------------------+----------------------+ 66 | 67 | +-----------------------------------------------------------------------------+ 68 | | Processes: | 69 | | GPU GI CI PID Type Process name GPU Memory | 70 | | ID ID Usage | 71 | |=============================================================================| 72 | | No running processes found | 73 | +-----------------------------------------------------------------------------+ 74 | ``` 75 | 76 | 77 | 78 | 79 | ## Preparing for Arm64 80 | 81 | The first step for leveraging the benefits of Arm64 systems as container hosts is to ensure all production software dependencies support the Arm64 architecture, as one cannot run images built for an x86_64 host on an Arm64 host, and vice versa. 82 | 83 | Most of the container ecosystem supports both architectures, and often does so transparently through [multiple-architecture (multi-arch)](https://www.docker.com/blog/multi-platform-docker-builds/) images, where the correct image for the host architecture is deployed automatically. 84 | 85 | The major container image repositories, including [Dockerhub](https://hub.docker.com), [Quay](https://www.quay.io), and [Amazon Elastic Container Registry (ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) all support [multi-arch](https://aws.amazon.com/blogs/containers/introducing-multi-architecture-container-images-for-amazon-ecr/) images. 86 | 87 | ### Creating Multi-arch container images 88 | 89 | While most images already support multi-arch (i.e. arm64 and x86_64/amd64), we describe couple of ways for developers to to create a multi-arch image if needed. 90 | 91 | 1. [Docker Buildx](https://github.com/docker/buildx#getting-started) 92 | 2. Using a CI/CD Build Pipeline such as [Amazon CodePipeline](https://github.com/aws-samples/aws-multiarch-container-build-pipeline) to coordinate native build and manifest generation. 93 | 94 | ### Deploying to Arm64 95 | 96 | Most container orchestration platforms support both arm64 and x86_64 hosts. As an example, here is an _incomplete, non-exhaustive_ list of popular software within the container ecosystem that explicitly supports Arm64. 97 | 98 | | Name | URL | Comment | 99 | | :----- |:----- | :----- | 100 | | Tensorflow | https://hub.docker.com/r/armswdev/tensorflow-arm-neoverse | | 101 | | PyTorch | https://hub.docker.com/r/armswdev/pytorch-arm-neoverse |Use tags with *-openblas* for performance reasons until confirmed otherwise| 102 | | Istio | https://github.com/istio/istio/releases/ | 1) arm64 binaries as of 1.6.x release series
2) [Istio container build instructions](https://github.com/aws/aws-graviton-getting-started/blob/main/containers-workarounds.md#Istio)| 103 | | Envoy | https://www.envoyproxy.io/docs/envoy/v1.18.3/start/docker || 104 | | Traefik | https://github.com/containous/traefik/releases || 105 | | Flannel | https://github.com/coreos/flannel/releases || 106 | | Helm | https://github.com/helm/helm/releases/tag/v2.16.9 || 107 | | Jaeger | https://github.com/jaegertracing/jaeger/pull/2176 | | 108 | | Fluent-bit |https://github.com/fluent/fluent-bit/releases/ | compile from source | 109 | | core-dns |https://github.com/coredns/coredns/releases/ | | 110 | | external-dns | https://github.com/kubernetes-sigs/external-dns/blob/master/docs/faq.md#which-architectures-are-supported | support from 0.7.5+ | 111 | | Prometheus | https://prometheus.io/download/ | | 112 | |containerd | https://github.com/containerd/containerd/issues/3664 | nightly builds provided for arm64 | 113 | | kube-state-metrics | https://github.com/kubernetes/kube-state-metrics/issues/1037 | use k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-beta for arm64 | 114 | | cluster-autoscaler | https://github.com/kubernetes/autoscaler/pull/3714 | arm64 support as of v1.20.0 | 115 | |gRPC | https://github.com/protocolbuffers/protobuf/releases/ | protoc/protobuf support | 116 | |Nats | https://github.com/nats-io/nats-server/releases/ | | 117 | |CNI | https://github.com/containernetworking/plugins/releases/ | | 118 | |Cri-o | https://github.com/cri-o/cri-o/blob/master/README.md#installing-crio | tested on Ubuntu 18.04 and 20.04 | 119 | |Trivy | https://github.com/aquasecurity/trivy/releases/ | | 120 | |Argo | https://github.com/argoproj/argo-cd/releases | arm64 images published as of 2.3.0 | 121 | |Cilium | https://docs.cilium.io/en/stable/contributing/development/images/ | Multi arch supported from v 1.10.0 | 122 | |Calico | https://hub.docker.com/r/calico/node/tags?page=1&ordering=last_updated | Multi arch supported on master | 123 | |Tanka | https://github.com/grafana/tanka/releases | | 124 | |Consul | https://www.consul.io/downloads | | 125 | |Nomad | https://www.nomadproject.io/downloads | | 126 | |Packer | https://www.packer.io/downloads | | 127 | |Vault | https://www.vaultproject.io/downloads | | 128 | |Terraform | https://github.com/hashicorp/terraform/issues/14474 | arm64 support as of v0.14.0 | 129 | |Flux | https://github.com/fluxcd/flux/releases/ | | 130 | |Pulumi | https://github.com/pulumi/pulumi/issues/4868 | arm64 support as of v2.23.0 | 131 | |New Relic | https://download.newrelic.com/infrastructure_agent/binaries/linux/arm64/ | | 132 | |Datadog - EC2 | https://www.datadoghq.com/blog/datadog-arm-agent/ || 133 | |Datadog - Docker | https://hub.docker.com/r/datadog/agent-arm64 || 134 | |Dynatrace | https://www.dynatrace.com/news/blog/get-out-of-the-box-visibility-into-your-arm-platform-early-adopter/ || 135 | |Grafana | https://grafana.com/grafana/download?platform=arm || 136 | |Loki | https://github.com/grafana/loki/releases || 137 | |kube-bench | https://github.com/aquasecurity/kube-bench/releases/tag/v0.3.1 || 138 | |metrics-server | https://github.com/kubernetes-sigs/metrics-server/releases/tag/v0.3.7 | docker image is multi-arch from v.0.3.7 | 139 | | Flatcar Container Linux | https://www.flatcar.org | arm64 support in Stable channel as of 3033.2.0 | 140 | 141 | **If your software isn't listed above, it doesn't mean it won't work!** 142 | 143 | Many products work on arm64 but don't explicitly distribute arm64 binaries or 
build multi-arch images *(yet)*. NVIDIA, AWS, Arm, and many developers in the community are working with maintainers and contributing expertise and code to enable full binary or multi-arch support. 144 | 145 | ## Kubernetes 146 | 147 | Kubernetes fully supports Arm64. 148 | 149 | If all of your containerized workloads support Arm64, then you can run your cluster with Arm64 nodes exclusively. However, if you have some workloads that can only run on x86, or if you just want to be able to run both x86 and Arm64 nodes in the same cluster, then there are a couple of ways to accomplish that: 150 | 151 | * **Multiarch Images**: 152 | If you are able to use multiarch images (see above) for all containers in your cluster, then you can simply run a mix of x86 and Arm64 nodes without any further action. The multiarch image manifest will ensure that the correct image layers are pulled for a given node's architecture. 153 | 154 | * **Built-in labels**: 155 | You can schedule pods on nodes according to the `kubernetes.io/arch` [label](https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetes-io-arch). This label is automatically added to nodes by Kubernetes and allows you to schedule pods accordingly with a [node selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) like this: 156 | ``` 157 | nodeSelector: 158 | kubernetes.io/arch: amd64 159 | ``` 160 | * **Using taints**: 161 | Taints are especially helpful if adding Arm64 nodes to an existing cluster with mostly x86-only containers. While using the built-in `kubernetes.io/arch` label requires you to explicitly use a node selector to place x86-only containers on the right instances, tainting Arm64 instances prevents Kubernetes from scheduling incompatible containers on them without requiring you to change any existing configuration. For example, you can do this with a managed node group using eksctl by adding `--kubelet-extra-args '--register-with-taints=arm=true:NoSchedule'` to the kubelet startup arguments as documented [here](https://eksctl.io/usage/eks-managed-nodes/). Note that if you only taint Arm64 instances and don't specify any node selectors, then you will need to ensure that the images you build for Arm64 are multiarch images that can also run on x86 instance types. Alternatively, you can build Arm64-only images and ensure that they are only scheduled onto Arm64 images using node selectors. 162 | 163 | ## Further Reading 164 | 165 | * [Building multi-arch docker images with buildx](https://tech.smartling.com/building-multi-architecture-docker-images-on-arm-64-c3e6f8d78e1c) 166 | * [Unifying Arm software development with Docker](https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/unifying-arm-software-development-with-docker) 167 | * [Modern multi-arch builds with docker](https://duske.me/posts/modern-multiarch-builds-with-docker/) 168 | -------------------------------------------------------------------------------- /software/cuda.md: -------------------------------------------------------------------------------- 1 | # CUDA on Arm64 2 | 3 | CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on GPUs. NVIDIA has officially supported CUDA on Arm64 starting with CUDA 11. There is no specific guidance for CUDA on Arm64. Simply use it as you normally would on any other platform. 
CUDA is [provided with the NVIDIA HPC SDK](https://developer.nvidia.com/hpc-sdk), or you can [download the CUDA Toolkit for Arm](https://developer.nvidia.com/cuda-downloads). -------------------------------------------------------------------------------- /software/mathlibs.md: -------------------------------------------------------------------------------- 1 | # Math Libraries on Arm64 2 | 3 | Many math libraries support Arm64 CPUs. In many cases, open source math libraries like BLIS, OpenBLAS, FFTW, etc. can be substituted for libraries that have not yet announced support for Arm64 (e.g. Intel MKL). Arm community members like NVIDIA, AWS, and Oracle are actively contributing to enabling Arm64 support wherever possible. 4 | 5 | ## GPU Math Libraries 6 | 7 | All NVIDIA GPU math libraries work perfectly on Arm-hosted GPUs. In this case, the architecture of the host CPU is irrelevant. If your application uses GPU-accelerated math libraries, proceed exactly as you would on any other platform. 8 | 9 | ## Multi-node Math Libraries 10 | 11 | Generally speaking, all the multi-node math libraries you expect work well on Arm64. Trilinos, PETSc, Hypre, SuperLU, and ParMETIS have been used at scale on the Astra system at Sandia (Marvell ThunderX2 with EDR InfiniBand) and at scale on AWS Graviton 2 with EFA. GPU support has been tested in most of these same libraries at smaller scales. Spack is often a good option for installing these libraries. 12 | -------------------------------------------------------------------------------- /software/ml.md: -------------------------------------------------------------------------------- 1 | # AI, ML, and DL Frameworks 2 | 3 | Many AI, ML, and DL frameworks work well on Arm64-based platforms. In most cases, you will want to use the Arm-hosted GPU for training or inference. You may also wish to use the CPU for inference. See [the examples page](../examples/README.md) for more information. 4 | 5 | The following are known to work well on the NVIDIA Arm HPC Developer Kit: 6 | * TensorRT: An SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. 7 | * NVIDIA Triton Inference Server: An open-source inference serving software that helps standardize model deployment and execution, delivering fast and scalable AI in production. 8 | * PyTorch: A GPU accelerated tensor computational framework. Functionality can be extended with common Python libraries such as NumPy and SciPy. 9 | * TensorFlow: An open source platform for machine learning, providing comprehensive tools and libraries in a flexible architecture allowing easy deployment across a variety of platforms and devices. 10 | * Example: [GPU-accelerated training with TensorFlow](../examples/tensorflow-gpu.md) 11 | * Example: [on-CPU inference with TensorFlow](../examples/tensorflow-cpu.md) -------------------------------------------------------------------------------- /software/mpi.md: -------------------------------------------------------------------------------- 1 | # Message Passing Interface (MPI) Implementations on Arm64 2 | 3 | Practically all MPI libraries support Arm64 with the notable exception of Intel MPI. 4 | 5 | ## OpenMPI 6 | 7 | There is no specific guidance for Arm64 for OpenMPI. It functions on Arm64 exactly as it does on all other architectures. Simply install and use it as you would normally. 8 | 9 | ## MVAPICH2 10 | 11 | There is no specific guidance for Arm64 for MVAPICH2. 
It functions on Arm64 exactly as it does on all other architectures. Simply install and use it as you would normally. 12 | 13 | ## MPICH 14 | 15 | There is no specific guidance for Arm64 for MPICH. It functions on Arm64 exactly as it does on all other architectures. Simply install and use it as you would normally. 16 | 17 | -------------------------------------------------------------------------------- /software/os.md: -------------------------------------------------------------------------------- 1 | # Operating Systems on Arm64 2 | 3 | Current versions of popular Linux distributions (Ubuntu, RedHat, etc.) support Arm64 to the same level as other architectures. The NVIDIA Arm HPC Developer Kit has been internally tested and qualified using Ubuntu 20.04 and RHEL 8.4 operating systems, but other distributions may also work. 4 | 5 | 6 | ## RedHat Enterprise Linux and Derivatives 7 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 8 | ------ | ------ | ----- | ----- | ----- | ----- 9 | RHEL | 9.0 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-9.0-aarch64-dvd.iso) | 10 | RHEL | 8.6 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-8.6-aarch64-dvd.iso) | 11 | RHEL | 8.4 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-8.4-aarch64-dvd.iso) | NVIDIA tested and qualified on DevKit 12 | RHEL | 8.2 | Yes | 64KB | [ISO](https://developers.redhat.com/content-gateway/file/rhel-8.2-aarch64-dvd.iso) | 13 | Rocky Linux | 8.4 or later | Yes | 64KB | [ISO](https://download.rockylinux.org/pub/rocky/8/isos/aarch64/Rocky-8.6-aarch64-dvd1.iso) | 14 | CentOS Stream | 9 | No | 64KB | [ISO](https://mirrors.centos.org/mirrorlist?path=/9-stream/BaseOS/aarch64/iso/CentOS-Stream-9-latest-aarch64-dvd1.iso&redirect=1&protocol=https) | 15 | CentOS Stream | 8 | No | 64KB | [Mirror List](http://isoredirect.centos.org/centos/8-stream/isos/aarch64/) | 16 | CentOS | 8.2 or later | No | 64KB | [ISO](http://bay.uchicago.edu/centos-vault/8.2.2004/isos/aarch64/CentOS-8.2.2004-aarch64-dvd1.iso) | 17 | 18 | 19 | ## Ubuntu 20 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 21 | ------ | ------ | ----- | ----- | ----- | ----- 22 | Ubuntu | 22.04 LTS | Yes | 4KB | [ISO](https://cdimage.ubuntu.com/releases/22.04/release/ubuntu-22.04-live-server-arm64.iso) | 23 | Ubuntu | 20.04 LTS | Yes | 4KB | [ISO](https://cdimage.ubuntu.com/releases/20.04/release/ubuntu-20.04.4-live-server-arm64.iso) | NVIDIA tested and qualified on DevKit 24 | Ubuntu | 18.04 LTS | Yes (*) | 4KB | [ISO](https://cdimage.ubuntu.com/releases/18.04/release/ubuntu-18.04.6-server-arm64.iso) | (*) needs `apt install libc6-lse` 25 | 26 | 27 | ## SUSE Linux Enterprise Server 28 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 29 | ------ | ------ | ----- | ----- | ----- | ----- 30 | SLES | 15 SP2 or later | Planned | 4KB | [SUSE Download](https://www.suse.com/download/sles/) | 31 | 32 | 33 | ## Others 34 | Name | Version | [LSE Support](../optimization/README.md#locksynchronization-intensive-workload) | Kernel page size | Download | Comment 35 | ------ | ------ | ----- | ----- | ----- | ----- 36 | AlmaLinux | 9.0 | Yes | 64KB | [Mirror List](https://mirrors.almalinux.org/isos/aarch64/9.0.html) | 37 | AlmaLinux | 8.6 | Yes | 
64KB | [Mirror List](https://mirrors.almalinux.org/isos/aarch64/8.6.html) | 38 | Alpine Linux | 3.12.7 or later | Yes (*) | 4KB | [ISO](https://dl-cdn.alpinelinux.org/alpine/v3.16/releases/aarch64/alpine-standard-3.16.0-aarch64.iso) | (*) LSE enablement checked in version 3.14 | 39 | Debian | 11 (Bullseye) | Yes | 4KB | [ISO](https://cdimage.debian.org/debian-cd/current/arm64/iso-dvd/debian-11.3.0-arm64-DVD-1.iso) | 40 | Debian | 10 (Buster) | Yes (*) | 4KB | [ISO](https://cdimage.debian.org/cdimage/archive/10.12.0/arm64/iso-dvd/debian-10.12.0-arm64-DVD-1.iso) | LSE supported as of Debian 10.7 (2020-12-07) 41 | FreeBSD | 13.0 or later | Yes | 4KB | [ISO](https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/13.1/FreeBSD-13.1-RELEASE-arm64-aarch64-disc1.iso) | Some DevKit hardware features are not supported 42 | 43 | 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /transition-guide.md: -------------------------------------------------------------------------------- 1 | # Considerations when Transitioning Workloads to Arm64 2 | 3 | Today, Arm CPUs power application servers, micro-services, high-performance computing, CPU-based machine learning inference, video encoding, electronic design automation, gaming, open-source databases, and in-memory caches. In most cases transitioning to Arm64 CPUs is simple and straightforward. This transition guide provides a step-by-step approach to assess your workload to identify and address any potential software changes that might be needed. 4 | 5 | ## Introduction - Identifying Target Workloads 6 | 7 | The quickest and easiest workloads to transition are Linux-based, and built using open-source components or in-house applications where you control the source code. Many open source projects already support Arm64, and having access to the source code allows you to build from source if pre-built artifacts do not already exist. There is also a large and growing set of Independent Software Vendor (ISV) software available for Arm64 (e.g. a non-exhaustive list can be found [here](isv.md)). However if you license software you’ll want to check with the respective ISV to ensure they already, or have plans to, support Arm. 8 | 9 | The following transition guide is organized into a logical sequence of steps as follows: 10 | 11 | * [Learning and exploring](#learning-and-exploring) 12 | * Step 1 - [Optional] Understand the NVIDIA Arm HPC Developer Kit and review key documentation 13 | * Step 2 - Explore your workload, and inventory your current software stack 14 | * [Plan your workload transition](#plan-your-workload-transition) 15 | * Step 3 - Install and configure your application environment 16 | * Step 4 - [Optional] Build your application(s) and/or container images 17 | * [Test and optimize your workload](#test-and-optimize-your-workload) 18 | * Step 5 - Testing and optimizing your workload 19 | * Step 6 - Performance testing 20 | 21 | ### Learning and Exploring 22 | 23 | **Step 1 - [Optional] Understand the NVIDIA Arm HPC Developer Kit and review key documentation** 24 | 25 | * [Optional] Start by reviewing [the NVIDIA Arm HPC Developer Kit product page](https://developer.nvidia.com/arm-hpc-devkit). 
26 | * [Optional] Watch these recommended presentations to learn more about getting started, porting and tuning applications, and expected performance for key applications: 27 | * [First hands-on experiences using the NVIDIA Arm HPC Developer Kit](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41624/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 28 | * [Getting started with ARM software development: 86 the x86 dependencies in your code](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41702/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 29 | * [Port, Profile, and Tune HPC Applications for Arm-based Supercomputers](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41788/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 30 | * [Introducing Developer Tools for Arm and NVIDIA systems](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32163/?playlistId=playList-de66fcc9-9c4e-423e-8b03-01e229c610e0) 31 | 32 | **Step 2 - Explore your workload, and inventory your current software stack** 33 | 34 | Before starting the transition, you will need to inventory your current software stack so you can identify the path to equivalent software versions that support Arm64. At this stage it can be useful to think in terms of software you download (e.g. open source packages, container images, libraries), software you build and software you procure/license. Areas to review: 35 | 36 | * [Operating system](os.md): pay attention to specific versions that support Arm64 (usually more recent are better) 37 | * If your workload is container based, check container images you consume for Arm64 support. Keep in mind, many container images now support multiple architectures which simplifies consumption of those images in a mixed-architecture environment. 38 | * All the libraries, frameworks and runtimes used by the application. 39 | * Tools used to build, deploy and test your application (e.g. compilers, test suites, CI/CD pipelines, provisioning tools and scripts). Note there are language specific sections in this getting started guide with useful pointers to getting the best performance from Arm64 processors. 40 | * Tools and/or agents used to deploy and manage the application in production (e.g. monitoring tools or security agents) 41 | * This guide contains language specifics sections where you'll find additional per-language guidance: 42 | * [C/C++](languages/c-c++.md) 43 | * [Go](languages/golang.md) 44 | * [Java](languages/java.md) 45 | * [.NET](languages/dotnet.md) 46 | * [Python](languages/python.md) 47 | * [Rust](languages/rust.md) 48 | 49 | As a rule, the more current your software environment the more likely you will obtain the full performance entitlement from Arm64. 50 | 51 | For each component of your software stack, check for Arm64 support. A large portion of this can be done using existing system configuration and deployment scripts. As your scripts run and install packages, you will get messages for any missing components. Some may build from source automatically while others will cause the script to fail. Pay attention to software versions: more current software is easier to transition and will deliver the best performance. If you do need to perform upgrades prior to adopting Arm64, you might consider doing that using an existing x86 environment to minimize the number of changed variables. 
We have seen examples where upgrading OS version on x86 was far more involved and time consuming than transitioning to Arm64 after the upgrade. For more details on checking for software support please see Appendix A. 52 | 53 | Note: When locating software be aware that some tools, including GCC, refer to the architecture as AArch64, others including the Linux Kernel, call it arm64. When checking packages across various repositories, you’ll find those different naming conventions. 54 | 55 | 56 | ### Plan your workload transition 57 | 58 | **Step 3- Install and configure your application environment** 59 | 60 | Complete the installation of your software stack based on the inventory created in Step 2. In many cases your installation scripts can be used as-is or with minor modifications to reference architecture specific versions of components where necessary. The first time through this may be an iterative process as you resolve any remaining dependencies. 61 | 62 | **Step 4 - Build your application(s) and/or container images** 63 | 64 | Applications built using interpreted or JIT'd languages (Python, Java, PHP, Node.js, etc.) should run as-is. This guide contains language specific sections with recommendations e.g. [Java](java.md) and [Python](python.md). If there is no language specific section, it is because there is no specific guidance beyond using a suitably current version of the language. Simply proceed as you would on any other CPUs, Arm-based or otherwise. 65 | 66 | Applications using compiled languages including C, C++ or Go, need to be compiled for the Arm64 architecture. Most modern builds (e.g. using Make) will just work when run natively on Arm64. You’ll find language specific compiler recommendations in this repository: [C/C++](c-c++.md), [Go](golang.md), and [Rust](rust.md). Again , if there is no specific guidance it's because everything works _exactly_ the same on Arm64 as on other platforms. 67 | 68 | Just like an operating system, container images are architecture specific. You will need to build Arm64 container images. You might wish to build multi-arch container images that can run automatically on either x86-64 or Arm64. Check out the [container section](containers.md) of this guide for more details. 69 | 70 | You will also need to review any functional and unit test suite(s) to ensure you can test the new build artifacts with the same test coverage you have already for x86 artifacts. 71 | 72 | ### Test and optimize your workload 73 | 74 | **Step 5 - Testing and optimizing your workload** 75 | 76 | Now that you have your application stack on Aarch64, you should run your test suite to ensure all regular unit and functional tests pass. Resolve any test failures in the application(s) or test suites until you are satisfied everything is working as expected. Most errors should be related to the modifications and updated software versions you have installed during the transition. (Tip: when upgrading software versions, first test them using an existing x86 environment to minimize the number of variables changed at once. If issues occur then resolve them using the current x86 environment before continuing with the new Arm64 environment). If you suspect architecture specific issues then please have a look to our [C/C++ section ](c-c++.md) which gives advice on how to solve them. 77 | 78 | **Step 6 - Performance testing** 79 | 80 | With your fully functional application its time to establish a performance baseline on Arm64. 
In most cases, you should expect performance parity, or even gains. This guide has sections dedicated to [Optimization](optimization/README.md) and a [Performance Runbook](perfrunbook/grace_perfrunbook.md) for you to follow during this stage. 81 | 82 | ### _Appendix A - locating packages for Arm64_ 83 | 84 | Remember: when locating software, be aware that some tools, including GCC, refer to the architecture as AArch64, while others, including the Linux kernel, call it arm64. When checking packages across various repositories, you'll find those different naming conventions, and in some cases just "ARM". 85 | 86 | The main ways to check, and the places to look, are: 87 | 88 | * Package repositories of your chosen Linux distribution. Arm64 support within Linux distributions is largely complete: for example, Debian, which has the largest package repository, has over 98% of its packages built for the Arm64 architecture. 89 | * Container image registries. Amazon ECR now offers [public repositories](https://docs.aws.amazon.com/AmazonECR/latest/public/public-repositories.html) that you can search for [arm64 images](https://gallery.ecr.aws/?architectures=ARM+64&page=1). DockerHub allows you to search for a specific architecture ([e.g. arm64](https://hub.docker.com/search?type=image&architecture=arm64)). 90 | * Note: specific to containers, you may find that an amd64 (x86-64) container image you currently use has transitioned to a multi-architecture container image when adding Arm64 support. This means you may not find an explicit Arm64 container, so be sure to check for both: some projects choose to vend discrete images for x86-64 and Arm64, while other projects choose to vend a multi-arch image supporting both architectures. 91 | * On GitHub, you can check for Arm64 versions in the release section. However, some projects don't use the release section, or only release source archives, so you may need to visit the main project webpage and check the download section. You can also search the GitHub project for "arm64" or "AArch64" to see whether the project has any Arm64 code contributions or issues. Even if a project does not currently produce builds for Arm64, in many cases an Arm64 version of those packages will be available through Linux distributions or additional package repositories (e.g. [EPEL](https://www.redhat.com/en/blog/whats-epel-and-how-do-i-use-it)). You can search for packages using a package search tool such as [pkgs.org](https://pkgs.org/). 92 | * The download section or platform support matrix of your software vendors: look for references to Arm64, AArch64, AWS Graviton, Ampere Altra, or NVIDIA Grace. 93 | --------------------------------------------------------------------------------