├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE-CC-BY-SA ├── README.md ├── debug ├── NicerTrace.py ├── README.md ├── printflock.py └── torch-distributed-gpu-test.py ├── dtype └── README.md ├── hparams └── README.md ├── instabilities └── README.md ├── parallelism └── README.md ├── resources └── README.md ├── slurm ├── README.md ├── cron-daily.slurm └── cron-hourly.slurm └── throughput ├── README.md └── all_reduce_bench.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | feedback@huggingface.co. 65 | All complaints will be reviewed and investigated promptly and fairly. 
66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 16 | 17 | # Contribute to the Large Language Model Training Playbook 18 | 19 | The Large Language Model Training Playbook is a living document. 
We anticipate regular improvements, so please watch the repository to be notified about these. 20 | 21 | Everyone is welcome to contribute, and we value everybody's contribution. Writing new 22 | content is not the only way to help. Answering questions in issues, helping 23 | others in pull requests, and improving the existing writing are also often valuable. 24 | 25 | However, please don't file a pull request without first coordinating via the issue system (see below), as (1) the content might go beyond what the playbook is intended to cover, or (2) someone else might already be working on it. 26 | 27 | Also feel free to spread the word! You can reference the playbook in blog posts, shout out on Twitter whenever it has helped you, or simply ⭐️ the repository to say thank you. 28 | 29 | However you choose to contribute, please be mindful and respect our 30 | [code of conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md). 31 | 32 | **This guide was inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).** 33 | 34 | ## Ways to contribute 35 | 36 | There are several ways you can contribute to the "Large Language Model Training Playbook": 37 | 38 | * Propose a new section or propose to add more content to an existing section. 39 | * Submit issues about inaccuracies or lack of clarity in the current content. 40 | * Read and comment on a pull request proposing new content or correcting the existing content. 41 | 42 | If you don't know where to start, there might be a special [Good First 43 | Issue](https://github.com/huggingface/large_language_model_training_playbook/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open source. Just comment in the issue that you'd like to work on it. 44 | 45 | > All contributions are equally valuable to the community. 🥰 46 | 47 | ## Propose a new section and/or additional content 48 | 49 | If you would like to add a new section or content to an existing section, please **open an issue first to discuss the matter** before creating a pull request. 50 | 51 | Even though the project aims to integrate as much input from contributors as possible, we can't guarantee that every topic or contribution will be accepted, so it's always better to get approval before spending a significant amount of time writing a section. 52 | 53 | ## Submit issues about inaccuracies or lack of clarity in the current content 54 | 55 | When submitting an issue about inaccuracies or lack of clarity in the current content, please keep our 56 | [code of conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md) in mind, as we prohibit certain behaviors and types of communication. In particular, we try to build a positive environment for our 57 | community by being respectful of differing opinions, viewpoints, and experiences, and by giving and gracefully accepting constructive feedback. In a nutshell: don't forget there is a human just like you on the other side who has likely spent time and effort writing the content you are now commenting on.
58 | 59 | The repo maintainers will be very strict regarding any action they deem in violation of this Code of Conduct (see the [Enforcement Guidelines section of the Code of Conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md#Enforcement-Guidelines)) 60 | 61 | ## Create a Pull Request 62 | 63 | Before writing any section or content, we strongly advise you to search through the existing PRs or 64 | issues to make sure nobody is already working on the same thing. If you are 65 | unsure, it is always a good idea to open an issue to get some feedback. 66 | 67 | You will need basic `git` proficiency to contribute to the 68 | 🤗 Large Language Model Training Playbook. While `git` is not the easiest tool to use, it has the greatest 69 | manual. Type `git --help` in a shell and enjoy! If you prefer books, [Pro 70 | Git](https://git-scm.com/book/en/v2) is a very good reference. 71 | 72 | Follow the steps below to start contributing: 73 | 74 | 1. Fork the [repository](https://github.com/huggingface/large_language_model_training_playbook) by 75 | clicking on the **[Fork](https://github.com/huggingface/large_language_model_training_playbook/fork)** button on the repository's page. This creates a copy of the code 76 | under your GitHub user account. 77 | 78 | 2. Clone your fork to your local disk, and add the base repository as a remote: 79 | 80 | ```bash 81 | $ git clone git@github.com:/large_language_model_training_playbook.git 82 | $ cd large_language_model_training_playbook 83 | $ git remote add upstream https://github.com/huggingface/large_language_model_training_playbook.git 84 | ``` 85 | 86 | 3. Create a new branch to hold your development changes: 87 | 88 | ```bash 89 | $ git checkout -b a-descriptive-name-for-my-changes 90 | ``` 91 | 92 | 🚨 **Do not** work on the `main` branch! 93 | 94 | 4. Write the content in your branch. 95 | 96 | You can now write the new content or the correction you wanted to submit. 97 | 98 | Once you're happy with your changes, add changed files with `git add` and 99 | record your changes locally with `git commit`: 100 | 101 | ```bash 102 | $ git add modified_file.md 103 | $ git commit 104 | ``` 105 | 106 | Please remember to write [good commit 107 | messages](https://chris.beams.io/posts/git-commit/) to clearly communicate the changes you made! 108 | 109 | To keep your copy of the code up to date with the original 110 | repository, rebase your branch on `upstream/branch` *before* you open a pull request or if requested by a maintainer: 111 | 112 | ```bash 113 | $ git fetch upstream 114 | $ git rebase upstream/main 115 | ``` 116 | 117 | Push your changes to your branch: 118 | 119 | ```bash 120 | $ git push -u origin a-descriptive-name-for-my-changes 121 | ``` 122 | 123 | If you've already opened a pull request, you'll need to force push with the `--force` flag. Otherwise, if the pull request hasn't been opened yet, you can just push your changes normally. 124 | 125 | 5. Now you can go to your fork of the repository on GitHub and click on **Pull request** to open a pull request. When you're ready, you can send your changes to the project maintainers for review. 126 | 127 | 6. It's ok if maintainers request changes, it happens to our core contributors 128 | too! So everyone can see the changes in the pull request, work in your local 129 | branch and push the changes to your fork. They will automatically appear in 130 | the pull request. 
131 | 132 | ### Develop on Windows 133 | 134 | On Windows (unless you're working in [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) or WSL), you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings: 135 | 136 | ```bash 137 | git config core.autocrlf input 138 | ``` 139 | 140 | One way to run the `make` command on Windows is with MSYS2: 141 | 142 | 1. [Download MSYS2](https://www.msys2.org/), and we assume it's installed in `C:\msys64`. 143 | 2. Open the command line `C:\msys64\msys2.exe` (it should be available from the **Start** menu). 144 | 3. Run in the shell: `pacman -Syu` and install `make` with `pacman -S make`. 145 | 4. Add `C:\msys64\usr\bin` to your PATH environment variable. 146 | 147 | You can now use `make` from any terminal (Powershell, cmd.exe, etc.)! 🎉 148 | 149 | ### Sync a forked repository with upstream main (the Hugging Face repository) 150 | 151 | When updating the main branch of a forked repository, please follow these steps to avoid pinging the upstream repository which adds reference notes to each upstream PR, and sends unnecessary notifications to the developers involved in these PRs. 152 | 153 | 1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. 154 | 2. If a PR is absolutely necessary, use the following steps after checking out your branch: 155 | 156 | ```bash 157 | $ git checkout -b your-branch-for-syncing 158 | $ git pull --squash --no-commit upstream main 159 | $ git commit -m '' 160 | $ git push --set-upstream origin your-branch-for-syncing 161 | ``` 162 | -------------------------------------------------------------------------------- /LICENSE-CC-BY-SA: -------------------------------------------------------------------------------- 1 | Attribution-ShareAlike 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 
30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-ShareAlike 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-ShareAlike 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. BY-SA Compatible License means a license listed at 88 | creativecommons.org/compatiblelicenses, approved by Creative 89 | Commons as essentially the equivalent of this Public License. 90 | 91 | d. 
Copyright and Similar Rights means copyright and/or similar rights 92 | closely related to copyright including, without limitation, 93 | performance, broadcast, sound recording, and Sui Generis Database 94 | Rights, without regard to how the rights are labeled or 95 | categorized. For purposes of this Public License, the rights 96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 97 | Rights. 98 | 99 | e. Effective Technological Measures means those measures that, in the 100 | absence of proper authority, may not be circumvented under laws 101 | fulfilling obligations under Article 11 of the WIPO Copyright 102 | Treaty adopted on December 20, 1996, and/or similar international 103 | agreements. 104 | 105 | f. Exceptions and Limitations means fair use, fair dealing, and/or 106 | any other exception or limitation to Copyright and Similar Rights 107 | that applies to Your use of the Licensed Material. 108 | 109 | g. License Elements means the license attributes listed in the name 110 | of a Creative Commons Public License. The License Elements of this 111 | Public License are Attribution and ShareAlike. 112 | 113 | h. Licensed Material means the artistic or literary work, database, 114 | or other material to which the Licensor applied this Public 115 | License. 116 | 117 | i. Licensed Rights means the rights granted to You subject to the 118 | terms and conditions of this Public License, which are limited to 119 | all Copyright and Similar Rights that apply to Your use of the 120 | Licensed Material and that the Licensor has authority to license. 121 | 122 | j. Licensor means the individual(s) or entity(ies) granting rights 123 | under this Public License. 124 | 125 | k. Share means to provide material to the public by any means or 126 | process that requires permission under the Licensed Rights, such 127 | as reproduction, public display, public performance, distribution, 128 | dissemination, communication, or importation, and to make material 129 | available to the public including in ways that members of the 130 | public may access the material from a place and at a time 131 | individually chosen by them. 132 | 133 | l. Sui Generis Database Rights means rights other than copyright 134 | resulting from Directive 96/9/EC of the European Parliament and of 135 | the Council of 11 March 1996 on the legal protection of databases, 136 | as amended and/or succeeded, as well as other essentially 137 | equivalent rights anywhere in the world. 138 | 139 | m. You means the individual or entity exercising the Licensed Rights 140 | under this Public License. Your has a corresponding meaning. 141 | 142 | 143 | Section 2 -- Scope. 144 | 145 | a. License grant. 146 | 147 | 1. Subject to the terms and conditions of this Public License, 148 | the Licensor hereby grants You a worldwide, royalty-free, 149 | non-sublicensable, non-exclusive, irrevocable license to 150 | exercise the Licensed Rights in the Licensed Material to: 151 | 152 | a. reproduce and Share the Licensed Material, in whole or 153 | in part; and 154 | 155 | b. produce, reproduce, and Share Adapted Material. 156 | 157 | 2. Exceptions and Limitations. For the avoidance of doubt, where 158 | Exceptions and Limitations apply to Your use, this Public 159 | License does not apply, and You do not need to comply with 160 | its terms and conditions. 161 | 162 | 3. Term. The term of this Public License is specified in Section 163 | 6(a). 164 | 165 | 4. Media and formats; technical modifications allowed. 
The 166 | Licensor authorizes You to exercise the Licensed Rights in 167 | all media and formats whether now known or hereafter created, 168 | and to make technical modifications necessary to do so. The 169 | Licensor waives and/or agrees not to assert any right or 170 | authority to forbid You from making technical modifications 171 | necessary to exercise the Licensed Rights, including 172 | technical modifications necessary to circumvent Effective 173 | Technological Measures. For purposes of this Public License, 174 | simply making modifications authorized by this Section 2(a) 175 | (4) never produces Adapted Material. 176 | 177 | 5. Downstream recipients. 178 | 179 | a. Offer from the Licensor -- Licensed Material. Every 180 | recipient of the Licensed Material automatically 181 | receives an offer from the Licensor to exercise the 182 | Licensed Rights under the terms and conditions of this 183 | Public License. 184 | 185 | b. Additional offer from the Licensor -- Adapted Material. 186 | Every recipient of Adapted Material from You 187 | automatically receives an offer from the Licensor to 188 | exercise the Licensed Rights in the Adapted Material 189 | under the conditions of the Adapter's License You apply. 190 | 191 | c. No downstream restrictions. You may not offer or impose 192 | any additional or different terms or conditions on, or 193 | apply any Effective Technological Measures to, the 194 | Licensed Material if doing so restricts exercise of the 195 | Licensed Rights by any recipient of the Licensed 196 | Material. 197 | 198 | 6. No endorsement. Nothing in this Public License constitutes or 199 | may be construed as permission to assert or imply that You 200 | are, or that Your use of the Licensed Material is, connected 201 | with, or sponsored, endorsed, or granted official status by, 202 | the Licensor or others designated to receive attribution as 203 | provided in Section 3(a)(1)(A)(i). 204 | 205 | b. Other rights. 206 | 207 | 1. Moral rights, such as the right of integrity, are not 208 | licensed under this Public License, nor are publicity, 209 | privacy, and/or other similar personality rights; however, to 210 | the extent possible, the Licensor waives and/or agrees not to 211 | assert any such rights held by the Licensor to the limited 212 | extent necessary to allow You to exercise the Licensed 213 | Rights, but not otherwise. 214 | 215 | 2. Patent and trademark rights are not licensed under this 216 | Public License. 217 | 218 | 3. To the extent possible, the Licensor waives any right to 219 | collect royalties from You for the exercise of the Licensed 220 | Rights, whether directly or through a collecting society 221 | under any voluntary or waivable statutory or compulsory 222 | licensing scheme. In all other cases the Licensor expressly 223 | reserves any right to collect such royalties. 224 | 225 | 226 | Section 3 -- License Conditions. 227 | 228 | Your exercise of the Licensed Rights is expressly made subject to the 229 | following conditions. 230 | 231 | a. Attribution. 232 | 233 | 1. If You Share the Licensed Material (including in modified 234 | form), You must: 235 | 236 | a. retain the following if it is supplied by the Licensor 237 | with the Licensed Material: 238 | 239 | i. identification of the creator(s) of the Licensed 240 | Material and any others designated to receive 241 | attribution, in any reasonable manner requested by 242 | the Licensor (including by pseudonym if 243 | designated); 244 | 245 | ii. a copyright notice; 246 | 247 | iii. 
a notice that refers to this Public License; 248 | 249 | iv. a notice that refers to the disclaimer of 250 | warranties; 251 | 252 | v. a URI or hyperlink to the Licensed Material to the 253 | extent reasonably practicable; 254 | 255 | b. indicate if You modified the Licensed Material and 256 | retain an indication of any previous modifications; and 257 | 258 | c. indicate the Licensed Material is licensed under this 259 | Public License, and include the text of, or the URI or 260 | hyperlink to, this Public License. 261 | 262 | 2. You may satisfy the conditions in Section 3(a)(1) in any 263 | reasonable manner based on the medium, means, and context in 264 | which You Share the Licensed Material. For example, it may be 265 | reasonable to satisfy the conditions by providing a URI or 266 | hyperlink to a resource that includes the required 267 | information. 268 | 269 | 3. If requested by the Licensor, You must remove any of the 270 | information required by Section 3(a)(1)(A) to the extent 271 | reasonably practicable. 272 | 273 | b. ShareAlike. 274 | 275 | In addition to the conditions in Section 3(a), if You Share 276 | Adapted Material You produce, the following conditions also apply. 277 | 278 | 1. The Adapter's License You apply must be a Creative Commons 279 | license with the same License Elements, this version or 280 | later, or a BY-SA Compatible License. 281 | 282 | 2. You must include the text of, or the URI or hyperlink to, the 283 | Adapter's License You apply. You may satisfy this condition 284 | in any reasonable manner based on the medium, means, and 285 | context in which You Share Adapted Material. 286 | 287 | 3. You may not offer or impose any additional or different terms 288 | or conditions on, or apply any Effective Technological 289 | Measures to, Adapted Material that restrict exercise of the 290 | rights granted under the Adapter's License You apply. 291 | 292 | 293 | Section 4 -- Sui Generis Database Rights. 294 | 295 | Where the Licensed Rights include Sui Generis Database Rights that 296 | apply to Your use of the Licensed Material: 297 | 298 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 299 | to extract, reuse, reproduce, and Share all or a substantial 300 | portion of the contents of the database; 301 | 302 | b. if You include all or a substantial portion of the database 303 | contents in a database in which You have Sui Generis Database 304 | Rights, then the database in which You have Sui Generis Database 305 | Rights (but not its individual contents) is Adapted Material, 306 | 307 | including for purposes of Section 3(b); and 308 | c. You must comply with the conditions in Section 3(a) if You Share 309 | all or a substantial portion of the contents of the database. 310 | 311 | For the avoidance of doubt, this Section 4 supplements and does not 312 | replace Your obligations under this Public License where the Licensed 313 | Rights include other Copyright and Similar Rights. 314 | 315 | 316 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 317 | 318 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 319 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 320 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 321 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 322 | IMPLIED, STATUTORY, OR OTHER. 
THIS INCLUDES, WITHOUT LIMITATION, 323 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 324 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 325 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 326 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 327 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 328 | 329 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 330 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 331 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 332 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 333 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 334 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 335 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 336 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 337 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 338 | 339 | c. The disclaimer of warranties and limitation of liability provided 340 | above shall be interpreted in a manner that, to the extent 341 | possible, most closely approximates an absolute disclaimer and 342 | waiver of all liability. 343 | 344 | 345 | Section 6 -- Term and Termination. 346 | 347 | a. This Public License applies for the term of the Copyright and 348 | Similar Rights licensed here. However, if You fail to comply with 349 | this Public License, then Your rights under this Public License 350 | terminate automatically. 351 | 352 | b. Where Your right to use the Licensed Material has terminated under 353 | Section 6(a), it reinstates: 354 | 355 | 1. automatically as of the date the violation is cured, provided 356 | it is cured within 30 days of Your discovery of the 357 | violation; or 358 | 359 | 2. upon express reinstatement by the Licensor. 360 | 361 | For the avoidance of doubt, this Section 6(b) does not affect any 362 | right the Licensor may have to seek remedies for Your violations 363 | of this Public License. 364 | 365 | c. For the avoidance of doubt, the Licensor may also offer the 366 | Licensed Material under separate terms or conditions or stop 367 | distributing the Licensed Material at any time; however, doing so 368 | will not terminate this Public License. 369 | 370 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 371 | License. 372 | 373 | 374 | Section 7 -- Other Terms and Conditions. 375 | 376 | a. The Licensor shall not be bound by any additional or different 377 | terms or conditions communicated by You unless expressly agreed. 378 | 379 | b. Any arrangements, understandings, or agreements regarding the 380 | Licensed Material not stated herein are separate from and 381 | independent of the terms and conditions of this Public License. 382 | 383 | 384 | Section 8 -- Interpretation. 385 | 386 | a. For the avoidance of doubt, this Public License does not, and 387 | shall not be interpreted to, reduce, limit, restrict, or impose 388 | conditions on any use of the Licensed Material that could lawfully 389 | be made without permission under this Public License. 390 | 391 | b. To the extent possible, if any provision of this Public License is 392 | deemed unenforceable, it shall be automatically reformed to the 393 | minimum extent necessary to make it enforceable. If the provision 394 | cannot be reformed, it shall be severed from this Public License 395 | without affecting the enforceability of the remaining terms and 396 | conditions. 397 | 398 | c. 
No term or condition of this Public License will be waived and no 399 | failure to comply consented to unless expressly agreed to by the 400 | Licensor. 401 | 402 | d. Nothing in this Public License constitutes or may be interpreted 403 | as a limitation upon, or waiver of, any privileges and immunities 404 | that apply to the Licensor or You, including from the legal 405 | processes of any jurisdiction or authority. 406 | 407 | 408 | ======================================================================= 409 | 410 | Creative Commons is not a party to its public 411 | licenses. Notwithstanding, Creative Commons may elect to apply one of 412 | its public licenses to material it publishes and in those instances 413 | will be considered the “Licensor.” The text of the Creative Commons 414 | public licenses is dedicated to the public domain under the CC0 Public 415 | Domain Dedication. Except for the limited purpose of indicating that 416 | material is shared under a Creative Commons public license or as 417 | otherwise permitted by the Creative Commons policies published at 418 | creativecommons.org/policies, Creative Commons does not authorize the 419 | use of the trademark "Creative Commons" or any other trademark or logo 420 | of Creative Commons without its prior written consent including, 421 | without limitation, in connection with any unauthorized modifications 422 | to any of its public licenses or any other arrangements, 423 | understandings, or agreements concerning use of licensed material. For 424 | the avoidance of doubt, this paragraph does not form part of the 425 | public licenses. 426 | 427 | Creative Commons may be contacted at creativecommons.org. 428 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 📖 The Large Language Model Training Handbook 2 | 3 | An open collection of methodologies to help with successful training of large language models. 4 | 5 | This is technical material suitable for LLM training engineers and operators. That is the content here contains lots of scripts and copy-n-paste commands to enable you to quickly solve your problems. 6 | 7 | If you are not interested in technical details but want more of a detailed overview and concepts please refer to the sister [The Large Language Model Training Playbook](https://github.com/huggingface/large_language_model_training_playbook) instead. 8 | 9 | note: The list of topics will expand over time - at the moment filling in only a subset 10 | 11 | ## [Model parallelism](./parallelism/) 12 | 13 | ## [Maximizing throughput](./throughput/) 14 | 15 | ## [Tensor precision / Data types](./dtype/) 16 | 17 | ## [Training hyper-parameters and model initializations](./hparams/) 18 | 19 | ## [Instabilities](./instabilities/) 20 | 21 | ## [Debugging software and hardware failures](./debug/) 22 | 23 | ## [SLURM](./slurm/) 24 | 25 | ## [Resources](./resources/) 26 | 27 | ## License 28 | 29 | The content of this site is distributed under [Attribution-ShareAlike 4.0 International](./LICENSE-CC-BY-SA). 30 | 31 | Unless specified otherwise the code in this repo is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
32 | -------------------------------------------------------------------------------- /debug/NicerTrace.py: -------------------------------------------------------------------------------- 1 | """ License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """ 2 | 3 | """ NicerTrace - an improved Trace package """ 4 | 5 | """ 6 | To try it in action and to get a sense of how it can help you just run: 7 | python debug/NicerTrace.py 8 | """ 9 | 10 | 11 | import datetime 12 | import os 13 | import socket 14 | import sys 15 | import sysconfig 16 | import time 17 | import trace 18 | 19 | 20 | class NicerTrace(trace.Trace): 21 | # as the 2 paths overlap, the longer one with site-packages needs to be first 22 | py_dirs = [sysconfig.get_paths().get(k) for k in ["purelib", "stdlib"]] 23 | site_packages_dir = sysconfig.get_paths()["purelib"] 24 | stdlib_dir = sysconfig.get_paths()["stdlib"] 25 | 26 | def __init__(self, *args, packages_to_include=None, log_pids=False, **kwargs): 27 | """normal init plus added package/dir exclusion overrides: 28 | 29 | While preserving the original behavior a new optional arg is added `packages_to_include` 30 | with the following behavior: 31 | 32 | 1. if ignoredirs is a list the original trace behavior is used - only those dirs and subdirs will be excluded 33 | 2. if ignoredirs is None and packages_to_include is None - everything is included 34 | 3. if packages_to_include="uninstalled" all packages found under /.../site-packages will be excluded. I couldn't find a way to exclude core python packages under /.../lib/python3.8 since it'd then exclude site-packages as well 35 | 4. if packages_to_include=["PIL", "numpy", "pytorch"] all packages found under /.../site-packages, and /.../lib/python3.8 will be excluded except the packages that were listed to be included - use top-level package name here 36 | 5.
if packages_to_include=None, everything under /.../site-packages, and /.../lib/python3.8 will be excluded and any packages that are installed via `pip install -e .` will be included 37 | 38 | """ 39 | ignoredirs = kwargs.get("ignoredirs", None) 40 | 41 | if ignoredirs is not None and len(ignoredirs) > 1: 42 | if packages_to_include is not None: 43 | raise ValueError("can't have both ignoredirs and packages_to_include not None") 44 | kwargs["ignoredirs"] = ignoredirs 45 | elif packages_to_include is None: 46 | kwargs["ignoredirs"] = None 47 | elif packages_to_include == "uninstalled": 48 | kwargs["ignoredirs"] = self.stdlib_dir # everything including python core packages 49 | else: 50 | # exclude all of /.../lib/python3.8 and sub-paths from /.../site-packages, and 51 | packages = os.listdir(self.site_packages_dir) 52 | packages_to_exclude = set(packages) - set(packages_to_include) 53 | dirs_to_exclude = [ 54 | f"{self.site_packages_dir}/{dir}" for dir in sorted(packages_to_exclude) if not dir.endswith("-info") 55 | ] 56 | # note, no way to exclude python core packages in this situation because 57 | # sysconfig.get_paths()'s' purelib is a subset of stdlib :(, so excluding only site-packages 58 | kwargs["ignoredirs"] = dirs_to_exclude 59 | 60 | # not packages, but final module names like Image from Image.py 61 | # mods_to_exclude = [] 62 | 63 | # print("\n".join(kwargs["ignoredirs"])) 64 | 65 | super().__init__(*args, **kwargs) 66 | self.log_pids = log_pids 67 | 68 | def strip_py_dirs(self, path): 69 | """strips python path prefix like /.../site-packages, and /.../lib/python3.8 if any matches""" 70 | for prefix in self.py_dirs: 71 | if path.startswith(prefix): 72 | return path.replace(prefix + "/", "") 73 | return path 74 | 75 | def globaltrace_lt(self, frame, why, arg): 76 | """Handler for call events. 77 | If the code block being entered is to be ignored, returns `None', 78 | else returns self.localtrace. 79 | 80 | This is an override to properly show full package names: 81 | 1. if it's under site-packages or core python dir - convert to package name 82 | 2. otherwise show full path to the python file - usually uninstalled packages 83 | 84 | Additionally enter frames now include the line number since some packages have multiple 85 | methods that have the same name and there is no telling which one of them was called. 86 | 87 | It was written against https://github.com/python/cpython/blob/3.8/Lib/trace.py. 
If you're 88 | using a different python version you may have to adapt it should the core implementation 89 | change (but it's unlikely) 90 | 91 | """ 92 | if why == "call": 93 | code = frame.f_code 94 | # print(f"\n\n{frame.f_code=}") 95 | # print(dir(code)) 96 | 97 | filename = frame.f_globals.get("__file__", None) 98 | if filename: 99 | lineno = code.co_firstlineno 100 | # python's trace fails to get the full package name - let's fix it 101 | # strip the common path of python library 102 | modulename = self.strip_py_dirs(filename) 103 | if filename != modulename: 104 | # the package was installed under /.../site-packages, /.../lib/python3.8 105 | modulename, ext = os.path.splitext(modulename) 106 | modulename = modulename.replace("/", ".") 107 | else: 108 | # still full path, because the package is not installed 109 | modulename = filename 110 | 111 | if modulename is not None: 112 | # XXX: ignoremods may not work now as before 113 | ignore_it = self.ignore.names(filename, modulename) 114 | if not ignore_it: 115 | if self.trace: 116 | if self.log_pids: 117 | print(os.getpid(), end=" ") 118 | 119 | print(f" {modulename}:{lineno} {code.co_name}") 120 | return self.localtrace 121 | else: 122 | return None 123 | 124 | def localtrace_trace_and_count(self, frame, why, arg): 125 | """ 126 | Overriding the default method. 127 | 128 | Using hh:mm:ss format for timestamps (instead of secs) as it's more readable when the trace is run for hours 129 | 130 | XXX: ideally it would be nice not to repeat the same module name on every line, but when I tried 131 | that I discovered that globaltrace_lt doesn't necessarily frame all the local calls, since 132 | localtrace_trace_and_count may continue printing local calls from an earlier frame w/o 133 | notifying that the context has changed. So we are forced to reprint the module name on each 134 | line to keep at least the incomplete context. 135 | 136 | Ideally there should an indication of a frame change before all the local prints 137 | 138 | Read the disclaimer in globaltrace_lt that this was tested with py-3.8 139 | 140 | """ 141 | if why == "line": 142 | # record the file name and line number of every trace 143 | filename = frame.f_code.co_filename 144 | lineno = frame.f_lineno 145 | key = filename, lineno 146 | self.counts[key] = self.counts.get(key, 0) + 1 147 | basename = os.path.basename(filename) 148 | if self.log_pids: 149 | print(os.getpid(), end=" ") 150 | if self.start_time: 151 | delta_time = trace._time() - self.start_time 152 | delta_time = str(datetime.timedelta(seconds=delta_time)).split(".")[0] 153 | print(delta_time, end=" ") 154 | print(f"{basename}:{lineno:>6}: {trace.linecache.getline(filename, lineno)}", end="") 155 | return self.localtrace 156 | 157 | # -------------------------------- # 158 | 159 | 160 | class Tee: 161 | """ 162 | A helper class to tee print's output into a file. 
163 | Usage: 164 | sys.stdout = Tee(filename) 165 | """ 166 | 167 | def __init__(self, filename): 168 | self.stdout = sys.stdout 169 | self.file = open(filename, "a") 170 | 171 | def __getattr__(self, attr): 172 | return getattr(self.stdout, attr) 173 | 174 | def write(self, msg): 175 | # comment out the next line if you don't want to write to stdout 176 | self.stdout.write(msg) 177 | self.file.write(msg) 178 | self.file.flush() 179 | 180 | def flush(self): 181 | # comment out the next line if you don't want to write to stdout 182 | self.stdout.flush() 183 | self.file.flush() 184 | 185 | 186 | # -------------------------------- # 187 | 188 | import time 189 | 190 | from PIL import Image 191 | 192 | def main(): 193 | img = Image.new("RGB", (4, 4)) 194 | time.sleep(1) 195 | img1 = img.convert("RGB") 196 | 197 | # or if you want to try another version of main: 198 | 199 | # from transformers import AutoConfig 200 | # def main(): 201 | # c = AutoConfig.from_pretrained("t5-small") 202 | 203 | if __name__ == "__main__": 204 | # enable the trace 205 | if 1: 206 | cwd = os.path.realpath(".") 207 | pid = os.getpid() 208 | hostname = socket.gethostname() 209 | local_rank = int(os.environ.get("LOCAL_RANK", 0)) 210 | trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt" 211 | 212 | # run the new command using the given tracer 213 | sys.stdout = Tee(trace_output_file) 214 | 215 | # create a Trace object, telling it what to ignore, and whether to 216 | # do tracing or line-counting or both. 217 | # tracer = trace.Trace( 218 | tracer = NicerTrace( 219 | # ignoredirs=dirs_to_exclude, # don't set this one if you use packages_to_include 220 | # ignoremods=mods_to_exclude, 221 | trace=1, 222 | count=1, 223 | timing=True, 224 | # log_pids=True, useful if you fork workers and want to tell which process the trace belongs to 225 | packages_to_include=["PIL"], 226 | ) 227 | 228 | # string with commands to run - passed to exec() 229 | tracer.run("main()") 230 | # or to use the function interface to call main with args, kwargs 231 | # tracer.runfunc(main, *args, **kwds)) 232 | else: 233 | main() 234 | -------------------------------------------------------------------------------- /debug/README.md: -------------------------------------------------------------------------------- 1 | # Debugging Software And Hardware Failures 2 | 3 | XXX: I concat'ed 2 docs I wrote elsewhere so might need to restructure them into a more coherent doc. 4 | 5 | ## Debugging PyTorch programs 6 | 7 | ### Prefixing logs with `node:rank`, interleaved asserts 8 | 9 | When you have warnings and asserts (or debug prints), it helps a lot to prefix each log with its hostname:rank 10 | 11 | ``` 12 | python -m torch.distributed.run --role $(hostname -s): --tee 3 --nnodes 1 --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 13 | ``` 14 | 15 | Now each log line will be prefixed with `[hostname:rank]` 16 | 17 | Note that the colon `:` at the end of `--role` entry is important, that's how you get `hostname:rank` prefix. But you can add any other separator there, e.g if you use `-`, you will end up with `hostname-rank` prefix. 
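If you also want to `grep` the prefixed output after the run finishes, capture it to a file. Here is a minimal sketch, assuming the same single-node command as above and a placeholder `log.txt` file name:

```
# capture the [hostname:rank]-prefixed output to a file for later per-rank grep'ing;
# 2>&1 merges stderr into stdout, since tracebacks are printed to stderr
python -m torch.distributed.run --role $(hostname -s): --tee 3 --nnodes 1 --nproc_per_node 2 \
    torch-distributed-gpu-test.py 2>&1 | tee log.txt
```

A log file captured like this is the kind of file the `grep "[host1:0]" log.txt` example further below operates on.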
18 | 19 | If you're in a SLURM environment the above command line becomes: 20 | 21 | ``` 22 | srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ 23 | --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ 24 | --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 25 | --role $(hostname -s): --tee 3 \ 26 | torch-distributed-gpu-test.py' 27 | ``` 28 | 29 | Of course adjust your environment variables to match, this was just an example. 30 | 31 | Important! Note, that I'm using a single quoted string of commands passed to `bash -c`. This way `hostname -s` command is delayed until it's run on each of the nodes. If you'd use double quotes above, `hostname -s` will get executed on the starting node and then all nodes will get the same hostname as the prefix, which defeats the purpose of using these flags. So if you use double quotes you need to rewrite the above like so: 32 | 33 | 34 | ``` 35 | srun --jobid $SLURM_JOBID bash -c "python -m torch.distributed.run \ 36 | --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank \$SLURM_PROCID \ 37 | --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 38 | --role \$(hostname -s): --tee 3 \ 39 | torch-distributed-gpu-test.py" 40 | ``` 41 | 42 | `$SLURM_PROCID` is escaped too as it needs to be specific to each node and it's unknown during the launch of the slurm job on the main node. So there are 2 `\$` escapes in this version of the code. 43 | 44 | This prefixing functionality is also super-helpful when one gets the distributed program fail and which often results in interleaved assert messages that are very difficult to interpret. So by `grep`ing for one `node:rank` string of choice, it's now possible to reconstruct the real error message. 45 | 46 | For example, if you get a traceback that looks like: 47 | 48 | ``` 49 | File "/path/to/training/dataset.py", line 785, in __init__ 50 | File "/path/to/training/dataset.py", line 785, in __init__ 51 | if self.dataset_proba.sum() != 1: 52 | AttributeError: 'list' object has no attribute 'sum' 53 | if self.dataset_proba.sum() != 1: 54 | File "/path/to/training/dataset.py", line 785, in __init__ 55 | File "/path/to/training/dataset.py", line 785, in __init__ 56 | if self.dataset_proba.sum() != 1: 57 | if self.dataset_proba.sum() != 1: 58 | AttributeError: 'list' object has no attribute 'sum' 59 | AttributeError: 'list' object has no attribute 'sum' 60 | AttributeError: 'list' object has no attribute 'sum' 61 | ``` 62 | 63 | and when it's dozens of frames over 8 nodes it can't be made sense of, but the above `-tee` + `--role` will generate: 64 | 65 | ``` 66 | [host1:0] File "/path/to/training/dataset.py", line 785, in __init__ 67 | [host1:1] File "/path/to/training/dataset.py", line 785, in __init__ 68 | [host1:0] if self.dataset_proba.sum() != 1: 69 | [host1:0]AttributeError: 'list' object has no attribute 'sum' 70 | [host1:1] if self.dataset_proba.sum() != 1: 71 | [host1:2] File "/path/to/training/dataset.py", line 785, in __init__ 72 | [host1:3] File "/path/to/training/dataset.py", line 785, in __init__ 73 | [host1:3] if self.dataset_proba.sum() != 1: 74 | [host1:2] if self.dataset_proba.sum() != 1: 75 | [host1:1]AttributeError: 'list' object has no attribute 'sum' 76 | [host1:2]AttributeError: 'list' object has no attribute 'sum' 77 | [host1:3]AttributeError: 'list' object has no attribute 'sum' 78 | ``` 79 | and you can `grep` this output for just one `host:rank` prefix, which gives us: 80 | 81 | ``` 82 | $ grep "[host1:0]" log.txt 83 | [host1:0] 
File "/path/to/training/dataset.py", line 785, in __init__ 84 | [host1:0] if self.dataset_proba.sum() != 1: 85 | [host1:0]AttributeError: 'list' object has no attribute 'sum' 86 | ``` 87 | 88 | and voila, you can now tell what really happened. And as I mentioned earlier there can be easily a few hundred interleaved assert lines there. I was demo'ing a small example. 89 | 90 | Also, if you have just one node, you can just pass `-tee 3` and there is no need to pass `--role`. 91 | 92 | And of course if you're doing debug prints, then to solve this exact issue you can use [`printflock`](./torch-distributed-hanging-solutions.md#good-old-print). 93 | 94 | 95 | 96 | 97 | ### Dealing with Async CUDA bugs 98 | 99 | When using CUDA, failing pytorch programs very often produce a python traceback that makes no sense or can't be acted upon. This is because due to CUDA's async nature - when a CUDA kernel is executed, the program has already moved on and when the error happened the context of the program isn't there. The async functionality is there to make things faster, so that while the GPU is churning some `matmul` the program on CPU could already start doing something else. 100 | 101 | At other times some parts of the system will actually tell you that they couldn't generate the correct traceback, as in this error: 102 | 103 | ``` 104 | [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the 105 | asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/ 106 | incomplete data. To avoid this inconsistency, we are taking the entire process down. 107 | ``` 108 | 109 | There are a few solutions. 110 | 111 | If the failure is instant and can be reproduced on CPU (not all programs work on CPU), simply re-rerun it after hiding your GPUs. This is how you do it: 112 | 113 | ``` 114 | CUDA_VISIBLE_DEVICES="" python my-pytorch-program.py 115 | ``` 116 | 117 | The env var `CUDA_VISIBLE_DEVICES` is used to manually limit the visibility of GPUs to the executed program. So for example if you have 8 gpus and you want to run program1.py with first 4 gpus and program2.py with the remaining 2 gpus you can do: 118 | 119 | ``` 120 | CUDA_VISIBLE_DEVICES="0,1,2,3" python my-pytorch-program1.py 121 | CUDA_VISIBLE_DEVICES="4,5,6,7" python my-pytorch-program2.py 122 | ``` 123 | and the second program won't be the wiser that it's not using GPUs 0-3. 124 | 125 | But in the case of debug we are hiding all GPUs, by setting `CUDA_VISIBLE_DEVICES=""`. 126 | 127 | Now the program runs on CPU and you will get a really nice traceback and will fix the problem in no time. 128 | 129 | But, of course, if you your program requires multiple GPUs this won't work. And so here is another solution. 130 | 131 | Rerun your program after setting this environment variable: 132 | 133 | ``` 134 | CUDA_LAUNCH_BLOCKING=1 python my-pytorch-program.py 135 | ``` 136 | 137 | This variable tells pytorch (or any other CUDA-based program) to turn its async nature off everywhere and now all operations will be synchronous. So when the program crashes you should now get a perfect traceback and you will know exactly what ails your program. 138 | 139 | In theory enabling this variable should make everything run really slow, but in reality it really depends on your software. 
We did the whole of BLOOM-176B training using `CUDA_LAUNCH_BLOCKING=1` with [`Megatron-Deepspeed`](https://github.com/bigscience-workshop/Megatron-DeepSpeed) and had zero slowdown - we had to use it as pytorch was hanging without it and we had no time to figure out the hanging. 140 | 141 | So, yes, switching from async to sync execution can hide some subtle race conditions, so there are times when a hang disappears, as in the example I shared above. So measure your throughput with and without this flag - sometimes it might not only help with getting an in-context traceback but actually solve your problem altogether. 142 | 143 | Note: [NCCL==2.14.3 coming with `pytorch==1.13` hangs](https://github.com/NVIDIA/nccl/issues/750) when `CUDA_LAUNCH_BLOCKING=1` is used. So don't use it with that version of pytorch. The issue has been fixed in `nccl>=2.17` which should be included in `pytorch==2.0`. 144 | 145 | 146 | 147 | 148 | ### segfaults and getting a backtrace from a core file 149 | 150 | It's not uncommon for a complex pytorch program to segfault and drop a core file. Especially if 151 | you're using complex extensions like NCCL. 152 | 153 | The core file is what the program generates when it crashes at a low level - e.g. when using a python extension - such as a CUDA kernel or really any library that is coded directly in some variant of C or another language and made accessible in python through some binding API. The most common cause of a segfault is such software accessing memory it has not allocated - for example, a program may try to free memory it hasn't allocated. But there could be many other reasons. 154 | 155 | When a segfault event happens Python can't do anything, as the proverbial carpet is pulled out from under its feet, so it can't generate an exception or even write anything to the output. 156 | 157 | In these situations one must go and analyse the libC-level calls that led to the segfault, which are luckily saved in the core file. 158 | 159 | If your program crashed, you will often find a file that will look something like: `core-python-3097667-6` 160 | 161 | 162 | Before we continue make sure you have `gdb` installed: 163 | ``` 164 | sudo apt-get install gdb 165 | ``` 166 | 167 | Now make sure you know the path to the python executable that was used to run the program that crashed. If you have multiple python environments you have to activate the right environment first. If you don't, `gdb` may fail to unpack the core file. 168 | 169 | So typically I'd go: 170 | 171 | ``` 172 | conda activate my-env 173 | gdb python core-python-3097667-6 174 | ``` 175 | - adjust `my-env` to whatever env you use, or instead of conda use whatever way you use to activate your python environment - and perhaps you're using the system-wide python and then you don't need to activate anything. 176 | - adjust the name of the core file to the file you have gotten - it's possible that there are many - pick the latest one. 177 | 178 | Now `gdb` will churn for a bit and will give you a prompt where you type: `bt`.
We will use an actual core file here: 179 | 180 | ``` 181 | (gdb) bt 182 | #0 0x0000147539887a9f in raise () from /lib64/libc.so.6 183 | #1 0x000014753985ae05 in abort () from /lib64/libc.so.6 184 | #2 0x000014751b85a09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6 185 | #3 0x000014751b86053c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6 186 | #4 0x000014751b860597 in std::terminate() () from /lib64/libstdc++.so.6 187 | #5 0x000014751b86052e in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib64/libstdc++.so.6 188 | #6 0x000014750bb007ef in c10d::ProcessGroupNCCL::WorkNCCL::handleNCCLGuard() () 189 | from .../python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so 190 | #7 0x000014750bb04c69 in c10d::ProcessGroupNCCL::workCleanupLoop() () 191 | from.../python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so 192 | #8 0x000014751b88cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6 193 | #9 0x000014753a3901cf in start_thread () from /lib64/libpthread.so.0 194 | #10 0x0000147539872dd3 in clone () from /lib64/libc.so.6 195 | ``` 196 | 197 | and there you go. How do you make sense of it? 198 | 199 | Well, you go from the bottom of the stack to the top. You can tell that a `clone` call was made in `libc` which then called `start_thread` in `libpthread` and then if you keep going there are a bunch of calls in the torch libraries and finally we can see that the program terminated itself, completing with `raise` from `libc` which told the Linux kernel to kill the program and create the core file. 200 | 201 | This wasn't an easy to understand backtrace. 202 | 203 | footnote: Yes, python calls it a *traceback* and elsewhere it's called a *backtrace* - it's confusing, but it's more or less the same thing. 204 | 205 | Actually I had to ask pytorch devs for help and received: 206 | 207 | - PyTorch `ProcessGroup` watchdog thread caught an asynchronous error from NCCL 208 | - This error is an `“unhandled system error”` which in this particular case turned out to be an IB-OPA error 209 | - The `ProcessGroup`’s `WorkCleanUp` thread rethrew the error so that the main process would crash and the user would get notified (otherwise this async error would not surface) 210 | 211 | Trust me there are times when even if you're inexperienced the backtrace can give you enough of a hint to where you should look for troubleshooting. 212 | 213 | But fear not - most of the time you won't need to understand the traceback. Ideally you'd just attach the core file to your filed Issue. But it can easily be 5GB large. So the developers that will be trying to help you will ask you to generate a `gdb` backtrace and now you know how to do that. 214 | 215 | I didn't promise it'll be easy, I just showed you where to start. 216 | 217 | Now another useful details is that many programs these days run multiple threads. And `bt` only shows the main thread of the process. But, often, it can be helpful to see where other threads in the process were when segfault has happened. For that you simply type 2 commands at the `(gdb)` prompt: 218 | 219 | ``` 220 | (gdb) thread apply all bt 221 | (gdb) bt 222 | ``` 223 | 224 | and this time around you typically will get a massive report, one backtrace per thread. 225 | 226 | 227 | 228 | ### strace 229 | 230 | Similar to [py-spy](./torch-distributed-hanging-solutions.md#py-spy), `strace` is a super-useful tool which traces any running application at the low-level system calls - e.g. 
`libC` and alike. 231 | 232 | For example, run: 233 | ``` 234 | strace python -c "print('strace')" 235 | ``` 236 | and you will see everything that is done at the system call level as the above program runs. 237 | 238 | But usually it's more useful when you have a stuck program that spins all CPU cores at 100% but nothing happens and you want to see what's it doing. In this situation you simply attached to the running program like so: 239 | 240 | ``` 241 | strace --pid PID 242 | ``` 243 | where you get the PID for example from the output of `top` or `ps`. Typically I just copy-n-paste the PID of the program that consumes the most CPU - `top` usually shows it at the very top of its listing. 244 | 245 | Same as `py-spy` you may need `sudo` perms to attached to an already running process - it all depends on your system setup. But you can always start a program with `strace` as I have shown in the original example. 246 | 247 | Let's look at a small sub-snippet of the output of `strace python -c "print('strace')"` 248 | 249 | ``` 250 | write(1, "strace\n", 7strace 251 | ) = 7 252 | ``` 253 | Here we can see that a write call was executed on filedescriptor `1`, which almost always is `stdout` (`stdin` being 0, and `stderr` being 2). 254 | 255 | If you're not sure what a filedescriptor is pointing to, normally you can tell from `strace`'s output itself. But you can also do: 256 | 257 | ``` 258 | ls -l /proc/PID/fd 259 | ``` 260 | where PID is the pid of the currently running program you're trying to investigate. 261 | 262 | For example, when I run the above while running a pytest test with gpus, I got (partial output): 263 | ``` 264 | l-wx------ 1 stas stas 64 Mar 1 17:22 5 -> /dev/null 265 | lr-x------ 1 stas stas 64 Mar 1 17:22 6 -> /dev/urandom 266 | lrwx------ 1 stas stas 64 Mar 1 17:22 7 -> /dev/nvidiactl 267 | lrwx------ 1 stas stas 64 Mar 1 17:22 8 -> /dev/nvidia0 268 | lr-x------ 1 stas stas 64 Mar 1 17:22 9 -> /dev/nvidia-caps/nvidia-cap2 269 | ``` 270 | so you can see that a device `/dev/null` is open as FD (file descriptor) 5, `/dev/urandom` as FD 6, etc. 271 | 272 | Now let's go look at another snippet from our `strace` run. 273 | 274 | ``` 275 | access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 276 | ``` 277 | Here it tried to see if file `/etc/ld.so.preload` exists, but as we can see it doesn't - this can be useful if some shared library is missing - you can see where it's trying to load it from. 278 | 279 | Let's try another one: 280 | ``` 281 | openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3 282 | read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832 283 | newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=21448, ...}, AT_EMPTY_PATH) = 0 284 | mmap(NULL, 16424, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f8028807000 285 | mmap(0x7f8028808000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f8028808000 286 | mmap(0x7f8028809000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f8028809000 287 | mmap(0x7f802880a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f802880a000 288 | close(3) 289 | ``` 290 | here we can see that it opens `/lib/x86_64-linux-gnu/libpthread.so.0` and assigns it FD 3, it then reads 832 chars from FD 3, (we can also see that the first chars are ELF - which stands for a shared library format), then memory maps it and closes that file. 
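
As an aside, you can do the same file descriptor resolution from python - here is a tiny helper sketch of my own (Linux only; the `pid` and `fd` values in the usage comment are made up):

```
import os

def fd_target(pid: int, fd: int) -> str:
    """Return the path that a given file descriptor of a running process points to."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# e.g. fd_target(1234, 3) might return '/lib/x86_64-linux-gnu/libpthread.so.0'
```
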
291 | 292 | In this following example, we see a python cached file is opened, its filepointer is moved to 0, and then it's read and closed. 293 | ``` 294 | openat(AT_FDCWD, "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/__pycache__/abc.cpython-38.pyc", O_RDONLY|O_CLOEXEC) = 3 295 | fstat(3, {st_mode=S_IFREG|0664, st_size=5329, ...}) = 0 296 | lseek(3, 0, SEEK_CUR) = 0 297 | lseek(3, 0, SEEK_CUR) = 0 298 | fstat(3, {st_mode=S_IFREG|0664, st_size=5329, ...}) = 0 299 | brk(0x23bf000) = 0x23bf000 300 | read(3, "U\r\r\n\0\0\0\0\24\216\177c\211\21\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5330) = 5329 301 | read(3, "", 1) = 0 302 | close(3) 303 | ``` 304 | It's important to notice that file descriptors are re-used, so we have seen the same FD 3 twice, but each time it was open to a different file. 305 | 306 | If your program is for example trying to reach to the Internet, you can also tell these calls from `strace` as the program would be reading from a socket file descriptor. 307 | 308 | So let's run an example on a program that downloads files from the HF hub: 309 | ``` 310 | strace python -c 'import sys; from transformers import AutoConfig; AutoConfig.from_pretrained(sys.argv[1])' t5-small 311 | ``` 312 | 313 | here is some relevant to this discussion snippet: 314 | ``` 315 | socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 3 316 | setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 317 | ioctl(3, FIONBIO, [1]) = 0 318 | connect(3, {sa_family=AF_INET6, sin6_port=htons(443), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2600:1f18:147f:e850:e203:c458:10cd:fc3c 319 | ", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress) 320 | poll([{fd=3, events=POLLOUT|POLLERR}], 1, 10000) = 1 ([{fd=3, revents=POLLOUT}]) 321 | getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 322 | [...] 323 | write(3, "\26\3\3\0F\20\0\0BA\4\373m\244\16\354/\334\205\361j\225\356\202m*\305\332\275\251\17J"..., 126) = 126 324 | read(3, 0x2f05c13, 5) = -1 EAGAIN (Resource temporarily unavailable) 325 | poll([{fd=3, events=POLLIN}], 1, 9903) = 1 ([{fd=3, revents=POLLIN}]) 326 | read(3, "\24\3\3\0\1", 5) = 5 327 | read(3, "\1", 1) = 1 328 | read(3, "\26\3\3\0(", 5) = 5 329 | read(3, "\0\0\0\0\0\0\0\0\344\v\273\225`\4\24m\234~\371\332%l\364\254\34\3472<\0356s\313"..., 40) = 40 330 | ioctl(3, FIONBIO, [1]) = 0 331 | poll([{fd=3, events=POLLOUT}], 1, 10000) = 1 ([{fd=3, revents=POLLOUT}]) 332 | write(3, "\27\3\3\1.\0\374$\361\217\337\377\264g\215\364\345\256\260\211$\326pkR\345\276,\321\221`-"..., 307) = 307 333 | ioctl(3, FIONBIO, [1]) = 0 334 | read(3, 0x2ef7283, 5) = -1 EAGAIN (Resource temporarily unavailable) 335 | poll([{fd=3, events=POLLIN}], 1, 10000) = 1 ([{fd=3, revents=POLLIN}]) 336 | ``` 337 | 338 | You can see where that again it uses FD 3 but this time it opens a INET6 socket instead of a file. You can see that it then connects to that socket, polls, reads and writes from it. 339 | 340 | There are many other super useful understandings one can derive from using this tool. 341 | 342 | BTW, if you don't want to scroll up-down, you can also save the output to a file: 343 | ``` 344 | strace -o strace.txt python -c "print('strace')" 345 | ``` 346 | 347 | 348 | ## Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs 349 | 350 | While the methodologies found in this article were developed while working with multi-node multi-gpu pytorch-based training, they, of course, can help with any multi-process multi-node Python programs. 
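
To make the failure mode concrete, here is a minimal contrived sketch of my own of the kind of bug this section is about: one rank enters a collective that another rank never joins, so the job just hangs (until it eventually times out) with no error message. The gloo backend is used here only to keep the demo free of GPU requirements:

```
# run with: python -m torch.distributed.run --nproc_per_node 2 this-script.py
import torch.distributed as dist

dist.init_process_group("gloo")

if dist.get_rank() == 0:
    dist.barrier()  # rank 0 blocks here, since rank 1 never calls the matching barrier()

print(f"{dist.get_rank()}: this line is reached only by rank 1")
```
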
351 | 352 | ### Helper tools 353 | 354 | Try to use the following script [torch-distributed-gpu-test.py](./torch-distributed-gpu-test.py) to diagnose the situation. 355 | 356 | This will help primarily with discovering network-related issues. And also to quickly understand how multi-gpu communications work. 357 | 358 | For code-related issues read the rest of this document. 359 | 360 | 361 | ### Approaches to diagnosing multi-gpu hanging / deadlocks 362 | 363 | #### py-spy 364 | 365 | First do `pip install py-spy`. 366 | 367 | Now you can attach to each process with: 368 | 369 | ``` 370 | py-spy dump -n -p PID 371 | ``` 372 | and it will tell you where the process hangs (very often it's a nccl collective function or a `barrier`). 373 | 374 | - `PID` is the process id of the hanging python process. 375 | - `-n` is useful if you want to see strack traces from python extensions written in C, C++, etc., as the program may hang in one of the extensions 376 | - you may need to add `sudo` before the command - for more details see [this note](https://github.com/benfred/py-spy#when-do-you-need-to-run-as-sudo). 377 | 378 | 379 | Here is an example of such a stack trace: 380 | ``` 381 | Thread 835995 (active): "MainThread" 382 | broadcast (torch/distributed/distributed_c10d.py:1191) 383 | _aggregate_total_loss (deepspeed/runtime/pipe/engine.py:540) 384 | train_batch (deepspeed/runtime/pipe/engine.py:330) 385 | train_step (megatron/training.py:436) 386 | train (megatron/training.py:851) 387 | pretrain (megatron/training.py:187) 388 | (pretrain_gpt.py:239) 389 | ``` 390 | The very first line is where the program is stuck. 391 | 392 | ##### multi-process py-spy 393 | 394 | Now, how do you do it for multiple processes. Doing it one-by-one is too slow. So let's do it at once. 395 | 396 | If the launch command was `python`, what you do is: 397 | 398 | ``` 399 | pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {} 400 | ``` 401 | 402 | if `deepspeed`: 403 | 404 | ``` 405 | pgrep -P $(pgrep -o deepspeed) | xargs -I {} py-spy dump --pid {} 406 | ``` 407 | 408 | for `accelerate`: 409 | 410 | 411 | ``` 412 | pgrep -P $(pgrep -o accelerate) | xargs -I {} py-spy dump --pid {} 413 | ``` 414 | 415 | you get the idea. 416 | 417 | This particular approach will only analyse the main processes and not various other sub-processes/threads spawned by these processes. So if you have 8 gpus and 8 processes, the above will generate 8 stack traces. 418 | 419 | If you want all processes and their subprocesses, then you'd just run: 420 | 421 | 422 | ``` 423 | pgrep -f python | xargs -I {} py-spy dump --pid {} 424 | ``` 425 | (and as before replace `python` with the name of the launcher program if it's not `python`) 426 | 427 | 428 | ##### multi-node py-spy 429 | 430 | What if you have multiple nodes? 431 | 432 | You can of course `ssh` to each node interactively and dump the stack traces. 433 | 434 | If you're using the SLURM environment you can use `srun` to do it on all nodes for you. 
435 | 436 | 437 | Now in another console get the `SLURM_JOBID` (or get it from `salloc` log): 438 | ``` 439 | squeue -u `whoami` -o "%.16i %.9P %.26j %.8T %.10M %.8l %.6D %.20S %R" 440 | ``` 441 | 442 | Now use the following `srun` command after adjusting jobid with `SLURM_JOBID` from the outcome of the command above this sentence: 443 | ``` 444 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed" 445 | ``` 446 | 447 | Notes: 448 | - One must use `--gres=gpu:0` for the monitor `srun` or otherwise it will block until the main `srun` (the one running the training) exits. 449 | - Each node will generate its unique log file named `trace-nodename.out` - so this would help to identify which node(s) are problematic. You can remove `--output=trace-%N.out` if you want it all being dumped to stdout 450 | - In some SLURM versions you may also need to add `--overlap` 451 | - In some SLURM versions the jobid might not match that of reported in `squeue`, so you have to get the correct `SLURM_JOB_ID` from the logs of the job you're trying to "attach" to - i.e. your `srun` job that allocated the GPUs. 452 | - Sometimes `bash` doesn't work, but `sh` does. I think it has to do with what dot files get `source`d 453 | - You might need to also activate a custom python environment, which you can do like so: 454 | ``` 455 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'conda activate myenvname; ps auxc | ... ' || echo "failed" 456 | ``` 457 | or you can do it inside `~/.bashrc` or whatever shell's rc file you decide to use. 458 | 459 | As mentioned before if you want just the main processes you'd use this instead: 460 | ``` 461 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' || echo "failed" 462 | ``` 463 | Adjust `python` if need be as explained in the multi-gpu section above. 464 | 465 | The previous longer command will deliver traces for all python processes. 466 | 467 | If you're not getting anything, start with the basic debug like: 468 | 469 | ``` 470 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'date' 471 | ``` 472 | once you know you're talking to all the nodes, then you can progressively unravel the depth of calls, as in: 473 | 474 | ``` 475 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'date' 476 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -o python' 477 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) ' 478 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' 479 | ``` 480 | and at each stage check that the output makes sense - e.g. the 2nd and 3rd call you should be getting the PIDs of the processes. 481 | 482 | The following notes require `pip install deepspeed`. 483 | 484 | In one SLURM environment I also attempted using `pdsh` via `ds_ssh`, but somehow I wasn't able to run `py-spy` remotely - the main issue was that remote `ssh` command wasn't giving the same env as when I was logged in interactively via `ssh`. 
But if you have `sudo` access on the compute nodes then you could do: 485 | 486 | First prepare `hostfile`: 487 | ``` 488 | function makehostfile() { 489 | perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"}; 490 | $slots=8 if $slots==0; # workaround 8 gpu machines 491 | @nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}]; 492 | print map { "$b$_ slots=$slots\n" } @nodes' 493 | } 494 | makehostfile > hostfile 495 | ``` 496 | Adapt `$slots` to the number of gpus per node. You may have to adapt this script if your `scontrol` produces a different output. 497 | 498 | Now run the `py-spy` extraction command over all participating nodes: 499 | ``` 500 | ds_ssh -f hostfile "source ~/.pdshrc; ps aux | grep python | grep -v grep | grep `whoami` | awk '{print \$2}' | xargs -I {} sudo py-spy dump --pid {} " 501 | ``` 502 | 503 | 504 | 505 | #### Network-level hanging 506 | 507 | The hanging could be happening at the network level. `NCCL_DEBUG=INFO` can help here. 508 | 509 | Run the script with `NCCL_DEBUG=INFO` env var and try to study the outcome for obvious errors. It will tell you which device it's using, e.g.: 510 | ``` 511 | DeepWhite:21288:21288 [0] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0> 512 | ``` 513 | So it's using interface `enp67s0` over `192.168.50.21` 514 | 515 | Is your `192.168.50.21` firewalled? or is it somehow a misconfigured network device? 516 | 517 | Does it work if you use a loopback device `127.0.0.1`? 518 | ``` 519 | NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py 520 | ``` 521 | 522 | if not, see what other local network devices you have via `ifconfig` - try that instead of `lo` if any. 523 | 524 | It's currently using `enp67s0` in the above example. 525 | 526 | 527 | #### Isolate problematic GPUs 528 | 529 | You can also try to see if only some GPUs fail 530 | 531 | For example, does it work if you use the first 2 or the last 2 gpus: 532 | 533 | ``` 534 | CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 535 | ``` 536 | then the 2nd pair: 537 | ``` 538 | CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 539 | ``` 540 | 541 | 542 | #### python `trace` 543 | 544 | Now what happens when the training doesn't just hang, but the hanging process stops responding? e.g. this happens when there is a serious hardware issue. But what if it is recurrent and `py-spy` won't help here, since it won't be able to attach to a process that is not responding. 545 | 546 | So next came the idea of tracing all calls like one does with `strace(1)`, I researched python calls tracing facilities and have discovered that python has a `trace` sub-system. 547 | 548 | The following code will trace all python calls and log them to the console and into a dedicated per process log file, via a custom `Tee` module I added. 549 | 550 | This then can help to understand where some processes stopped responding, since we will have the log of the last call and all the previous calls before it went unresponsive. 551 | 552 | ``` 553 | $ cat train.py 554 | [...] 555 | 556 | def main(): 557 | # [...] 558 | train() 559 | 560 | import re 561 | class Tee: 562 | """ 563 | A helper class to tee print's output into a file. 
564 | Usage: 565 | sys.stdout = Tee(filename) 566 | """ 567 | 568 | def __init__(self, filename): 569 | self.stdout = sys.stdout 570 | self.file = open(filename, "a") 571 | 572 | def __getattr__(self, attr): 573 | return getattr(self.stdout, attr) 574 | 575 | def write(self, msg): 576 | self.stdout.write(msg) 577 | self.file.write(msg) 578 | self.file.flush() 579 | 580 | def flush(self): 581 | self.stdout.flush() 582 | self.file.flush() 583 | 584 | if __name__ == "__main__": 585 | 586 | import sys 587 | import trace 588 | import socket 589 | import os 590 | 591 | # enable the trace 592 | if 0: 593 | cwd = os.path.realpath('.') 594 | pid = os.getpid() 595 | hostname = socket.gethostname() 596 | local_rank = int(os.environ["LOCAL_RANK"]) 597 | trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt" 598 | 599 | # create a Trace object, telling it what to ignore, and whether to 600 | # do tracing or line-counting or both. 601 | tracer = trace.Trace( 602 | ignoredirs=[sys.prefix, sys.exec_prefix], 603 | trace=1, 604 | count=1, 605 | timing=True, 606 | ) 607 | 608 | # run the new command using the given tracer 609 | sys.stdout = Tee(trace_output_file) 610 | tracer.run('main()') 611 | else: 612 | main() 613 | 614 | ``` 615 | 616 | This code doesn't require any special handing other than enabling the trace by changing `if 0` to `if 1`. 617 | 618 | If you don't set `ignoredirs`, this will now dump all python calls. Which means expect a lot of GBs of data logged, especially if you have hundreds of GPUs. 619 | 620 | Of course, you don't have to start tracing from `main` - if you suspect a specific are you can start tracing there instead and it'll be much faster and less data to save. 621 | 622 | I wish I could tell `trace` which packages to follow, but alas it only supports dirs to ignore, which is much more difficult to set, and thus you end up with a lot more data than needrf. But still this is a super useful tool for debugging hanging processes. 623 | 624 | Also, your code will now run much much slower and the more packages you trace the slower it will become. 625 | 626 | ##### NicerTrace 627 | 628 | As `Trace` proved to provide very limited usability when debugging a complex multi-node multi-hour run crash, I have started on working on a better version of the `trace` 629 | 630 | You can find it here: [NicerTrace](./NicerTrace.py) 631 | 632 | I added multiple additional flags to the constructor and made the output much more useful. You fill find a full working example in that same file, just run: 633 | 634 | ``` 635 | python NicerTrace.py 636 | ``` 637 | and you should see: 638 | 639 | ``` 640 | trace/NicerTrace.py:1 641 | 0:00:00 : 1: trace/NicerTrace.py:185 main 642 | 0:00:00 NicerTrace.py: 186: img = Image.new("RGB", (4, 4)) 643 | PIL.Image:2896 new 644 | 0:00:00 Image.py: 2912: _check_size(size) 645 | PIL.Image:2875 _check_size 646 | 0:00:00 Image.py: 2883: if not isinstance(size, (list, tuple)): 647 | 0:00:00 Image.py: 2886: if len(size) != 2: 648 | 0:00:00 Image.py: 2889: if size[0] < 0 or size[1] < 0: 649 | ``` 650 | as you will see in the example I set: 651 | 652 | ``` 653 | packages_to_include=["PIL"], 654 | ``` 655 | so it'll trace `PIL` plus anything that is not under `site-packages`. If you need to trace another package, just add it to that list. 656 | 657 | This is a very fresh work-in-progress package, so it's evolving as we are trying to make it help us resolve a very complex crashing situation. 
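
Meanwhile, if all you need is call-level tracing restricted to a few packages, a rough `sys.settrace`-based sketch like the following may be enough (the package list is just an example and needs adapting):

```
import sys
import time

PACKAGES_TO_INCLUDE = ["torch/distributed", "deepspeed"]  # adapt to your needs
start = time.time()

def call_tracer(frame, event, arg):
    if event == "call":
        path = frame.f_code.co_filename
        if any(p in path for p in PACKAGES_TO_INCLUDE):
            print(f"{time.time()-start:9.3f}s {path}:{frame.f_lineno} {frame.f_code.co_name}", flush=True)
    return None  # returning None skips per-line tracing, which keeps the overhead lower

sys.settrace(call_tracer)
# main()  # now run your entry point
```
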
658 | 659 | 660 | ##### Working with generated trace files 661 | 662 | When the per-node-rank trace files has been generated the following might be helpful to quickly analyse the situation: 663 | 664 | 665 | - grep for a specific match and also print the file and line number where it was found: 666 | 667 | ``` 668 | grep -n "backward" trace* 669 | ``` 670 | 671 | - show `tail -1` of all trace files followed by the name of each file: 672 | 673 | ``` 674 | find . -name "trace*" -exec sh -c 'echo "$1: $(tail -3 "$1")"' _ {} \; 675 | ``` 676 | 677 | - or similar to the above, but print 5 last lines with the leading filename and some vertical white space for an easier reading: 678 | 679 | ``` 680 | find . -name "trace*" -exec sh -c 'echo; echo $1; echo "$(tail -5 "$1")"' _ {} \; 681 | ``` 682 | 683 | - count how many times grep matched a given pattern in each ifle and print the matched file (in this example matching the pattern `backward`): 684 | 685 | ``` 686 | find . -name "trace*" -exec sh -c 'echo "$1: $(grep "backward" $1 | wc -l)"' _ {} \; 687 | ``` 688 | 689 | 690 | #### good old `print` 691 | 692 | Now once you discovered where the hanging happens to further understand why this is happening, a debugger would ideally be used, but more often than not debugging multi-process (multi-node) issues can be very difficult. 693 | 694 | In such situations a good old `print` works. You just need to add some debug prints before the calls where things hang, things that would help understand what lead to the deadlock. For example, some `barrier` was missing and one or a few processes skipped some code and while the rest of processes are still blocking waiting for everybody to send some data (for example in NCCL collective functions like `gather` or `reduce`). 695 | 696 | You of course, want to prefix each print with the rank of the process so that you could tell which is which. For example: 697 | 698 | ``` 699 | import torch.distributed as dist 700 | print(f"{dist.get_rank()}: passed stage 0") 701 | ``` 702 | 703 | What you will quickly discover is that if you have multiple GPUs these prints will be badly interleaved and you will have a hard time making sense of the debug data. So let's fix this. We are going to override `print` with a custom version of the same, but which uses `flock` to ensure that only one process can write to stdout at the same time. 704 | 705 | The helper module `printflock.py` is included [here](./printflock.py). To activate it just run this at the top of the module you're debugging: 706 | 707 | ``` 708 | from printflock import printflock as print 709 | ``` 710 | 711 | and now all your `print` calls in that module will magically be non-iterleaved. You can of course, just use `printflock` directly: 712 | 713 | ``` 714 | from printflock import printflock 715 | import torch.distributed as dist 716 | printflock(f"{dist.get_rank()}: passed stage 0") 717 | ``` 718 | 719 | 720 | #### Code loops 721 | 722 | Code loops can be tricky to debug in hanging scenarios. If you have code like the following: 723 | 724 | ``` 725 | for i, d in enumerate(data): 726 | some_hanging_call(d) 727 | ``` 728 | 729 | it's possible that one process hangs in the first iteration, and another process in the second iteration, which makes things very confusing. But the stack trace won't give such indication, as the line numbers would be the same, even though the processes aren't in the same place code progression-wise. 
730 | 731 | In such situations unroll the loop to be: 732 | ``` 733 | d_iter = iter(data) 734 | some_hanging_call(next(d_iter) 735 | some_hanging_call(next(d_iter) 736 | ``` 737 | and now when you run `py-spy` the line numbers will be correct. The processes hanging in the first iteration will report the first `some_hanging_call` and those in the second iteration in the second call - as each now has its own line. 738 | 739 | 740 | 741 | 742 | ## Hardware-specific issues 743 | 744 | Some AMD users may need to [Disable IOMMU](https://github.com/stas00/toolbox/issues/1#issuecomment-1076830400) 745 | -------------------------------------------------------------------------------- /debug/printflock.py: -------------------------------------------------------------------------------- 1 | """ License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """ 2 | 3 | # If you have ever done multi-gpu work and tried to `print` for debugging you quickly discovered 4 | # that some messages get interleaved and are impossible to make sense of. Especially so if you're 5 | # using `print` to debug values. 6 | # 7 | # This simple solution that uses the good old `flock` solves the interleaving problem. To use this 8 | # version of print you can either do: 9 | # 10 | # from printflock import printflock 11 | # import torch.distributed as dist 12 | # printflock(f"{dist.get_rank()}: my long debug message") 13 | # 14 | # or you can override `print` with a better one: 15 | # 16 | # from printflock import printflock as print 17 | # import torch.distributed as dist 18 | # print(f"{dist.get_rank()}: my long debug message") 19 | # 20 | 21 | import builtins 22 | import fcntl 23 | 24 | def printflock(*args, **kwargs): 25 | """ 26 | This is a wrapper around the built-in Python `print` which calls `flock` before calling 27 | `print` and unlocks it immediately after. This wrapper is useful for when each rank needs to 28 | print a message without getting it interleaved with prints from other ranks. 29 | The lock file is the file this wrapper is defined in. 30 | The output order will be random per rank. 31 | 32 | Example: 33 | >>> # assuming 4 GPUs 34 | >>> world_size = dist.get_world_size() 35 | >>> rank = dist.get_rank() 36 | >>> printflock(f"This is a very long message from rank {rank}/{world_size}") 37 | This is a very long message from rank 0/4 38 | This is a very long message from rank 2/4 39 | This is a very long message from rank 3/4 40 | This is a very long message from rank 1/4 41 | 42 | It can also be used to override normal `print`: 43 | 44 | from printflock import printflock as print 45 | 46 | and then you don't need to change anything in your code. 47 | """ 48 | 49 | with open(__file__, "r") as fh: 50 | fcntl.flock(fh, fcntl.LOCK_EX) 51 | try: 52 | builtins.print(*args, **kwargs) 53 | finally: 54 | fcntl.flock(fh, fcntl.LOCK_UN) 55 | -------------------------------------------------------------------------------- /debug/torch-distributed-gpu-test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """ 4 | 5 | # 6 | # This a `torch.distributed` diagnostics script that checks that all GPUs in the cluster (one or 7 | # many nodes) can talk to each other via nccl and allocate gpu memory. 
8 | # 9 | # To run first adjust the number of processes and nodes: 10 | # 11 | # python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 12 | # 13 | # You may need to add --master_addr $MASTER_ADDR --master_port $MASTER_PORT if using a custom addr:port 14 | # 15 | # You can also use the rdzv API: --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d 16 | # 17 | # use torch.distributed.launch instead of torch.distributed.run for torch < 1.9 18 | # 19 | # If you get a hanging in `barrier` calls you have some network issues, you may try to debug this with: 20 | # 21 | # NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 22 | # 23 | # which should tell you what's going on behind the scenes. 24 | # 25 | # 26 | # This script can be run via `srun` in the SLURM environment as well. Here is a SLURM script that 27 | # runs on 2 nodes of 4 gpus per node: 28 | 29 | # #!/bin/bash 30 | # #SBATCH --job-name=test-nodes # name 31 | # #SBATCH --nodes=2 # nodes 32 | # #SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! 33 | # #SBATCH --cpus-per-task=10 # number of cores per tasks 34 | # #SBATCH --gres=gpu:4 # number of gpus 35 | # #SBATCH --time 0:05:00 # maximum execution time (HH:MM:SS) 36 | # #SBATCH --output=%x-%j.out # output file name 37 | # 38 | # export GPUS_PER_NODE=4 39 | # export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) 40 | # export MASTER_PORT=6000 41 | # 42 | # srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ 43 | # --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ 44 | # --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 45 | # torch-distributed-gpu-test.py' 46 | # 47 | # can also add this for automatic prefixing of all logs with [hostname:rank] (in addition to `--master_addr` etc) 48 | # --role `hostname -s`: --tee 3 \ 49 | # 50 | 51 | import builtins 52 | import fcntl 53 | import os 54 | import socket 55 | import torch 56 | import torch.distributed as dist 57 | 58 | def print(*args, **kwargs): 59 | """ solves multi-process interleaved print problem """ 60 | with open(__file__, "r") as fh: 61 | fcntl.flock(fh, fcntl.LOCK_EX) 62 | try: 63 | builtins.print(*args, **kwargs) 64 | finally: 65 | fcntl.flock(fh, fcntl.LOCK_UN) 66 | 67 | local_rank = int(os.environ["LOCAL_RANK"]) 68 | torch.cuda.set_device(local_rank) 69 | device = torch.device("cuda", local_rank) 70 | hostname = socket.gethostname() 71 | 72 | gpu = f"[{hostname}-{local_rank}]" 73 | 74 | try: 75 | # test distributed 76 | dist.init_process_group("nccl") 77 | 78 | # global rank 79 | rank = dist.get_rank() 80 | world_size = dist.get_world_size() 81 | 82 | # reduction test 83 | t = torch.ones(1, device=device) 84 | dist.all_reduce(t, op=dist.ReduceOp.SUM) 85 | dist.barrier() 86 | print(f"{gpu} Reduction op=sum result: {t.item()}") 87 | 88 | # test cuda is available and can allocate memory 89 | torch.cuda.is_available() 90 | torch.ones(1).cuda(local_rank) 91 | 92 | print(f"{gpu} is OK (global rank: {rank}/{world_size})") 93 | 94 | dist.barrier() 95 | if rank == 0: 96 | print(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}") 97 | print(f"device compute capabilities={torch.cuda.get_device_capability()}") 98 | print(f"pytorch compute capabilities={torch.cuda.get_arch_list()}") 99 | 100 | except Exception: 101 | print(f"{gpu} is broken") 102 | raise 103 | 
-------------------------------------------------------------------------------- /dtype/README.md: -------------------------------------------------------------------------------- 1 | # Tensor precision / Data types 2 | 3 | ## Half and Mixed Precision 4 | 5 | fp16 6 | 7 | bf16 8 | 9 | mixed fp16 10 | 11 | mixed bf16 12 | 13 | 14 | ### General OPs 15 | 16 | `LayerNorm`-like operations must not do their work in half-precision, or they may lose a lot of data. Therefore when these operations are implemented correctly they do efficient internal work in fp32 and then their outputs are downcast to half-precision. Very often it's just the accumulation that is done in fp32, since adding up half-precision numbers is very lossy. 17 | 18 | example: 19 | 20 | ### Reduction collectives 21 | 22 | fp16: ok to do in fp16 if loss scaling is in place 23 | 24 | bf16: only ok in fp32 25 | 26 | ### Gradient accumulation 27 | 28 | best done in fp32 for both, but definitely for bf16 29 | 30 | 31 | ### Optimizer step / Vanishing gradients 32 | 33 | when adding a tiny gradient to a large number, that addition is often nullified 34 | 35 | fp32 master weights and fp32 optim states 36 | 37 | bf16 master weights and optim states can be done when using Kahan Summation and/or Stochastic rounding 38 | 39 | 40 | ## Using fp16-pretrained model in bf16 regime 41 | 42 | usually fails 43 | 44 | ## Using bf16-pretrained model in fp16 regime 45 | 46 | will lose some performance on conversion, but should work - best to finetune a bit 47 | -------------------------------------------------------------------------------- /hparams/README.md: -------------------------------------------------------------------------------- 1 | # Selecting Training Hyper-Parameters And Model Initializations 2 | 3 | ## Glossary 4 | 5 | Training jargon uses a multitude of abbreviations and terms, so here are some important for this chapter. 6 | 7 | - BS: Batch Size - here we mean batch size per gpu, often it is also referred to as MBS (micro-batch-size) 8 | - GBS: Global Batch Size - total batch size per iteration - may include gradient accumulation 9 | - GAS: Gradient Accumulation Steps - how many forward/backward cycles to perform before one full iteration is complete 10 | - TFLOPs: Trillion FLOPs per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS) 11 | - PP: Pipeline Parallelism 12 | 13 | ## Global Batch Size Ramp Up 14 | 15 | If you intend to train with a very large GBS, with say 1024, or 2048 samples and even higher, when you just start training, it's very wasteful to feed such large batch sizes to the model. At this point it's totally random and can't benefit from having too refined data. Therefore to save data and resources, one often ramps up the global batch size over some period of time. 16 | 17 | It's also important to not start with GBS that is too small, since otherwise the progress won't be efficient. When there is too little data the compute (TFLOPS) is inefficient and will slow everything down. This is especially so when Pipeline Parallelism (PP) is used, since the most important thing about PP tuneup is a small GPU idleness bubble, and the smaller the GBS the larger the bubble is. 18 | 19 | For example, for BLOOM-176B, where we did use PP, after doing throughput benchmarking we found that starting with GBS=16 was incredibly slow (8 TFLOPs), so we eventually started with GBS=192 (73 TFLOPs) and then we ramped up to GBS=2048 (150 TFLOPs) - we increased GBS by 16 every 9_765_625 samples. 
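
To make the ramp-up schedule concrete, here is a small sketch using the BLOOM-176B numbers quoted above (the function name and signature are mine):

```
def global_batch_size(consumed_samples, start_gbs=192, target_gbs=2048,
                      increment=16, samples_per_increment=9_765_625):
    """Return the GBS to use after `consumed_samples` samples have been seen."""
    steps = consumed_samples // samples_per_increment
    return min(target_gbs, start_gbs + steps * increment)

# e.g. global_batch_size(0) -> 192, global_batch_size(20_000_000) -> 224
```
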
20 | 21 | 22 | 23 | ### STD Init 24 | 25 | This hyper parameter is super-important and it requires math to get it right. For details see [STD Init](../instabilities#std-init). 26 | -------------------------------------------------------------------------------- /instabilities/README.md: -------------------------------------------------------------------------------- 1 | # Avoiding, Recovering From and Understanding Instabilities 2 | 3 | ## STD Init 4 | 5 | Correctly initializing the initial distribution of the tensors can have a tremendous impact on training's stability. The `std` value isn't fixed and depends on the hidden dimension size. 6 | 7 | This proved to be a very crucial setting in our pre-BLOOM 104B experiments and we couldn't break past the first few thousands iterations until we figured out that the 0.02 default `--init-method-std` in Megatron-LM was a way too big for our model. 8 | 9 | We referred to these two sources: 10 | 11 | 1. "Transformers without Tears" paper https://arxiv.org/abs/1910.05895 prescribes: `sqrt(2/(NHIDDEN*5))` 12 | 13 | 2. The 530B training paper https://arxiv.org/abs/2201.11990 they used an even smaller init formula: `sqrt(1/(NHIDDEN*3))` 14 | 15 | and decided to go with the 530B one as it leads to an even smaller init value. 16 | 17 | To make it easier to compare the two formulas, they can be rewritten as: 18 | 1. `sqrt(0.4000/NHIDDEN)` 19 | 2. `sqrt(0.3333/NHIDDEN)` 20 | 21 | Thus for `NHIDDEN=14336` the math was `sqrt(1/(14336*3)) = 0.00482` and that's what we used. It surely wasn't the only reason why we had no stability issues during BLOOM-176B training, but I think it was one of the crucial ones. 22 | 23 | 24 | ## Numerical instabilities 25 | 26 | Certain mathematical operations could be unstable when dealing with low precision numbers. 27 | 28 | For example, please see this very interesting [PyTorch guide on numerical stability](https://pytorch.org/docs/stable/notes/numerical_accuracy.html). 29 | 30 | Now let's look at a specific example of this concept in action. 31 | 32 | During 104B training experiments where fp16 mixed precision was used - the following improvement was proposed by [Corby Rosset](https://github.com/corbyrosset) to make [self-attention more stable](https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/118). 33 | 34 | Specifically this [line](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c839a8aa30731f71b3738d56009be9668508e366/megatron/model/transformer.py#L303) shows that the `norm_factor` may be multiplied after the Query * Key matrix multiplication. If the dim of Q and K are very large, the output may blow up and the `norm_factor` won't be able to save it. 35 | 36 | Proposal: move the `norm_factor` inward, so Q and K are scaled down before matrix multiply: 37 | ``` 38 | matmul_result = torch.baddbmm( 39 | matmul_result, 40 | 1.0/math.sqrt(self.norm_factor) * query_layer.transpose(0, 1), # [b * np, sq, hn] 41 | 1.0/math.sqrt(self.norm_factor) * key_layer.transpose(0, 1).transpose(1, 2), # [b * np, hn, sk] 42 | beta=0.0 if alibi is None else 1.0, alpha=1.0) 43 | 44 | # change view to [b, np, sq, sk] 45 | attention_scores = matmul_result.view(*output_size) 46 | ``` 47 | 48 | To make the operation mathematically equivalent, moving the norm factor inward requires taking sqrt again 49 | if n is a scalar, A and B matrices: 50 | ``` 51 | n * (A dot B) === (sqrt(n) * A) dot (sqrt(n) * B) 52 | ``` 53 | 54 | Now A and B dimensions can be significantly larger. 
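
Here is a tiny fp16 demo of that identity in action (assuming a CUDA GPU is available): scaling after the matmul overflows the half-precision range, whereas scaling the inputs first keeps the same mathematical result well within range:

```
import torch

n = 256
A = torch.full((1, n), 60.0, dtype=torch.float16, device="cuda")
B = torch.full((n, 1), 60.0, dtype=torch.float16, device="cuda")
alpha = 1.0 / n

print((A @ B) * alpha)                      # inf: 256*60*60=921600 overflows fp16 before scaling
print((alpha**0.5 * A) @ (alpha**0.5 * B))  # ~3600: same math, but the inputs are scaled first
```
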
55 | 56 | For CUDA kernel writers [CuBlas](https://docs.nvidia.com/cuda/cublas/index.html)'s `GemmStridedBatchedEx` at the time of this writing has a similar issue. It is defined as: 57 | 58 | ``` 59 | C+i*strideC=αop(A+i*strideA)op(B+i*strideB)+β(C+i*strideC), for i ∈[0,batchCount−1] 60 | ``` 61 | 62 | The issue is that `alpha` is multiplied after the matrix-matrix multiplication is done so it can cause instability. 63 | -------------------------------------------------------------------------------- /parallelism/README.md: -------------------------------------------------------------------------------- 1 | # Model Parallelism 2 | 3 | ## TP 4 | 5 | TP degree shouldn't span across nodes. 6 | -------------------------------------------------------------------------------- /resources/README.md: -------------------------------------------------------------------------------- 1 | # Resources 2 | 3 | 4 | ## Publicly available training logbooks 5 | 6 | The listing is in no particular order other than the year. 7 | 8 | ### 2021 9 | 10 | - BigScience pre-BLOOM 108B training experiments (2021): 11 | [chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide/chronicles.md) | 12 | [the full spec and discussions](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide) 13 | 14 | ### 2022 15 | 16 | - BigScience BLOOM-176B (2022): 17 | [chronicles-prequel](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles-prequel.md) | 18 | [chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md) | 19 | [the full spec and discussions](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/) 20 | 21 | - Meta OPT-175B (2022): 22 | [logbook](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles) | 23 | [Video](https://www.youtube.com/watch?v=p9IxoSkvZ-M) 24 | 25 | - THUDM GLM-130B (2022): [en logbook](https://github.com/THUDM/GLM-130B/blob/main/logs/main-log-en.md) | [Mandarin version](https://github.com/THUDM/GLM-130B/blob/main/logs/main-log.md) 26 | 27 | ### 2023 28 | 29 | - HuggingFace m4 80B multimodal (Flamingo repro) (2023): [Learning log](https://docs.google.com/document/d/1ZNGyVWYFUbzV0xuei4SED2QAakGjMpaaQALcKYQm46U/edit) | [Training Logbook](https://github.com/huggingface/m4-logbook/tree/main/tr-190-80b) 30 | 31 | - BloombergGPT 50B LLM - section C in [BloombergGPT: A Large Language Model for Finance](https://arxiv.org/abs/2303.17564) 32 | -------------------------------------------------------------------------------- /slurm/README.md: -------------------------------------------------------------------------------- 1 | # Working in SLURM Environment 2 | 3 | Unless you're lucky and you have a dedicated cluster that is completely under your control chances are that you will have to use SLURM to timeshare the GPUs with others. But, often, if you train at HPC, and you're given a dedicated partition you still will have to use SLURM. 4 | 5 | This document will not try to teach you SLURM as there are many manuals out there, but we will cover some specific nuances that are useful to help in the training process. 6 | 7 | 8 | ## Crontab Emulation 9 | 10 | One of the most important Unix tools is the crontab, which is essential for being able to schedule various jobs. It however usually is absent from SLURM environment. Therefore one must emulate it. Here is how. 
11 | 12 | For this presentation we are going to use `$WORK/cron/` as the base directory. And that you have an exported environment variable `WORK` pointing to some location on your filesystem - if you use Bash you can set it up in your `~/.bash_profile` or if a different shell is used use whatever startup equivalent file is. 13 | 14 | 15 | ### 1. A self-perpetuating scheduler job 16 | 17 | We will use `$WORK/cron/scheduler` dir for scheduler jobs, `$WORK/cron/cron.daily` for daily jobs and `$WORK/cron/cron.hourly` for hourly jobs: 18 | 19 | ``` 20 | $ mkdir -p $WORK/cron/scheduler 21 | $ mkdir -p $WORK/cron/cron.daily 22 | $ mkdir -p $WORK/cron/cron.hourly 23 | ``` 24 | 25 | Now copy these two slurm script in `$WORK/cron/scheduler`: 26 | - [cron-daily.slurm](cron-daily.slurm) 27 | - [cron-hourly.slurm](cron-hourly.slurm) 28 | 29 | after editing those to fit your specific environment's account and partition information. 30 | 31 | Now you can launch the crontab scheduler jobs: 32 | 33 | ``` 34 | $ cd $WORK/cron/scheduler 35 | $ sbatch cron-hourly.slurm 36 | $ sbatch cron-daily.slurm 37 | ``` 38 | 39 | This is it, these jobs will now self-perpetuate and usually you don't need to think about it again unless there is an even that makes SLURM lose all its jobs. 40 | 41 | ### 2. Daily and Hourly Cronjobs 42 | 43 | Now whenever you want some job to run once a day, you simply create a slurm job and put it into the `$WORK/cron/cron.daily` dir. 44 | 45 | Here is an example job that runs daily to update the `mlocate` file index: 46 | ``` 47 | $ cat $WORK/cron/cron.daily/mlocate-update.slurm 48 | #!/bin/bash 49 | #SBATCH --job-name=mlocate-update # job name 50 | #SBATCH --ntasks=1 # number of MP tasks 51 | #SBATCH --nodes=1 52 | #SBATCH --hint=nomultithread # we get physical cores not logical 53 | #SBATCH --time=1:00:00 # maximum execution time (HH:MM:SS) 54 | #SBATCH --output=%x-%j.out # output file name 55 | #SBATCH --partition=PARTITION # edit me 56 | #SBATCH --account=GROUP@PARTITION # edit me 57 | 58 | set -e 59 | date 60 | echo "updating mlocate db" 61 | /usr/bin/updatedb -o $WORK/lib/mlocate/work.db -U $WORK --require-visibility 0 62 | ``` 63 | 64 | This builds an index of the files under `$WORK` which you can then quickly query with: 65 | ``` 66 | /usr/bin/locate -d $WORK/lib/mlocate/work.db pattern 67 | ``` 68 | 69 | To stop running this job, just move it out of the `$WORK/cron/cron.daily` dir. 70 | 71 | The same principle applies to jobs placed into the `$WORK/cron/cron.hourly` dir. These are useful for running something every hour. 72 | 73 | Please note that this crontab implementation is approximate timing-wise, due to various delays in SLURM scheduling they will run approximately every hour and every day. You can recode these to ask SLURM to start something at a more precise time if you have to, but most of the time the just presented method works fine. 74 | 75 | Additionally, you can code your own variations to meet specific needs of your project, e.g., every-30min or every-12h jobs. 76 | 77 | 78 | ### 3. Cleanup 79 | 80 | Finally, since every cron launcher job will leave behind a log file (which is useful if for some reason things don't work), you want to create a cronjob to clean up these logs. Otherwise you may run out of inodes - these logs files are tiny, but there could be tens of thousands of those. 81 | 82 | You could use something like this in a daily job. 

```
find $WORK/cron -name "*.out" -mtime +7 -exec rm -f {} +
```
Please note that it's set to only delete files that are older than 7 days, in case you need the latest logs for diagnostics.


### Nuances

The scheduler runs with the Unix permissions of the person who launched the SLURM cron scheduler job, and so do all the other SLURM scripts launched by that cron job.



## Overcoming The lack of group SLURM job ownership

SLURM runs on Unix, but surprisingly its designers haven't adopted the concept of group ownership with regard to SLURM jobs. So if a member of your team started an array of 10 jobs 20h each, and went on vacation - unless you have `sudo` access you now can't do anything to stop those jobs if something is wrong.

I've yet to find out why this is so, but so far we have been using a kill switch workaround. You have to code it in your framework. For example, see how it was implemented in [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/e52bdabbde3c6895aceb76c1bced295c2646121f/megatron/training.py#L104) (Meg-DS). The program polls for a path on the filesystem that was pre-configured at startup, and if it finds a file there, it exits.

So if we start Meg-DS with `--kill-switch-path $WORK/tmp/training17-kill-switch`, then at any point we need to kill the SLURM job, we simply do:

```
touch $WORK/tmp/training17-kill-switch
```
and the next time the program gets to check for this file it'll detect the event and will exit voluntarily. If you have a job array, well, you will have to wait until each job starts, detects the kill switch and exits.

Of course, don't forget to remove it when you're done stopping the jobs.
```
rm $WORK/tmp/training17-kill-switch
```

Now, this doesn't always work. If the job is hanging, it'll never come to the point of checking for the kill switch, and then the only solution is to contact the sysadmins to kill the job for you. If the hang is a simple one, pytorch's distributed setup will typically auto-exit after a preset timeout of 30 minutes, but that doesn't always work either.
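
For reference, the check on the training side can be as simple as this sketch (the function name is mine, not Meg-DS's exact API):

```
import os
import sys

def exit_if_kill_switch_set(kill_switch_path):
    """Call this periodically, e.g. once per training iteration."""
    if kill_switch_path and os.path.exists(kill_switch_path):
        print(f"detected kill switch at {kill_switch_path}, exiting", flush=True)
        sys.exit(0)

# inside the training loop:
# exit_if_kill_switch_set(args.kill_switch_path)
```
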
116 | -------------------------------------------------------------------------------- /slurm/cron-daily.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cron-daily # job name 3 | #SBATCH --ntasks=1 # number of MP tasks 4 | #SBATCH --nodes=1 5 | #SBATCH --hint=nomultithread # we get physical cores not logical 6 | #SBATCH --time=0:30:00 # maximum execution time (HH:MM:SS) 7 | #SBATCH --output=%x-%j.out # output file name 8 | #SBATCH --partition=PARTITION # edit me 9 | #SBATCH --account=GROUP@PARTITION # edit me 10 | 11 | # do not set -e - we must run all of it 12 | # set -x -e 13 | 14 | cd $WORK/cron/scheduler 15 | 16 | # ensure to restart self first 17 | sbatch --begin=now+24hour cron-daily.slurm 18 | 19 | # now launch any slurm scripts in cron.daily 20 | cd $WORK/cron/cron.daily 21 | for f in *.slurm; do 22 | sbatch "$f" 23 | done 24 | -------------------------------------------------------------------------------- /slurm/cron-hourly.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cron-hourly # job name 3 | #SBATCH --ntasks=1 # number of MP tasks 4 | #SBATCH --nodes=1 5 | #SBATCH --hint=nomultithread # we get physical cores not logical 6 | #SBATCH --time=0:30:00 # maximum execution time (HH:MM:SS) 7 | #SBATCH --output=%x-%j.out # output file name 8 | #SBATCH --partition=PARTITION # edit me 9 | #SBATCH --account=GROUP@PARTITION # edit me 10 | 11 | # do not set -e - we must run all of it 12 | # set -x -e 13 | 14 | cd $WORK/cron/scheduler 15 | 16 | # ensure to restart self first 17 | sbatch --begin=now+1hour cron-hourly.slurm 18 | 19 | # now launch any slurm scripts in cron.hourly 20 | cd $WORK/cron/cron.hourly 21 | for f in *.slurm; do 22 | sbatch "$f" 23 | done 24 | -------------------------------------------------------------------------------- /throughput/README.md: -------------------------------------------------------------------------------- 1 | # How to Maximize Training Throughput 2 | 3 | The faster you can make your model to train the sooner the model will finish training, which is important not only to being first to publish something, but also potentially saving a lot of money. 4 | 5 | In general maximizing throughput is all about running many experiments and measuring the outcome and chosing the one that is superior. 6 | 7 | In certain situations your modeling team may ask you to choose some hyper parameters that will be detrimental to throughput but overall beneficial for the overall model's success. 8 | 9 | ## Crucial reproducibility requirements 10 | 11 | The most important requirements for a series of successful experiments is to be able to reproduce the experiment environment again and again while changing only one or a few setup variables. 12 | 13 | Therefore when you try to figure out whether some change will improve performance or make it worse, you must figure out how to keep things stable. 14 | 15 | For example, you need to find a way to prevent the network usage from fluctuations. When we were doing performance optimizations for [108B pre-BLOOM experiments](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr8-104B-wide) it was close to impossible to perform, since we were on a shared internode network and the exact same setup would yield different throughput depending on how many other users used the network. It was not working. 
During BLOOM-176B we were given a dedicated SLURM partition with an isolated network where the only traffic was ours. Doing the performance optimization in such environment was just perfect. 16 | 17 | ## Network throughput 18 | 19 | It's critical to understand your particular model size and framework requirements with regard to network bandwidth, throughput and latency. If you underpay for network you will end up having idle gpus and thus you wasted money and time. If you overpay for very fast network, but your gpus are slow, then again you wasted money and time. 20 | 21 | If your network is very slow, your training is likely to be network-bound and many improvements in the training setup will not help with the improving performance. 22 | 23 | Here is a simple all-reduce benchmark that you can use to quickly measure the throughput of your internode network: 24 | 25 | [all_reduce_bench.py](./all_reduce_bench.py) 26 | 27 | Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of the nodes. 28 | 29 | To run it on 4 nodes 30 | 31 | ``` 32 | python -m torch.distributed.run --nproc_per_node=4 all_reduce_bench.py 33 | ``` 34 | 35 | You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps. 36 | 37 | Frameworks that shard weights and optim stages like [Deepspeed](https://github.com/microsoft/DeepSpeed) w/ ZeRO Stage-3 do a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run. 38 | 39 | Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage in parallel runs computations. So as long as the communication overhead is smaller than compute the network requirements are satisfied and don't have to be super fantastic. 40 | 41 | To get reasonable GPU throughput when training at scale (64+GPUs) with DeepSpeed ZeRO Stage 3: 42 | 43 | 1. 100Gbps is not enough 44 | 2. 200-400 Gbps is ok 45 | 3. 800-1000 Gbps is ideal 46 | 47 | [full details](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491) 48 | 49 | 50 | ## TFLOPs as a performance metric 51 | 52 | Before you start optimizing the performance of your training setup you need a metric that you can use to see whether the throughput is improving or not. You can measure seconds per iteration, or iterations per second, or some other such timing, but there is a more useful metric that measures TFLOPs. 53 | 54 | footnote: TFLOPs: Trillion FLOPs per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS) 55 | 56 | Measuring TFLOPs is superior because without it you don't know whether you are close to the best performance that can be achieved or not. This measurement gives you an indication of how far you're from the peak performance reported by the hardware manufacturer. 57 | 58 | In this section I will use BLOOM's training for the examplification. We use 80GB A100 NVIDIA GPUs and we trained in mixed bf16 regime. 

## TFLOPS as a performance metric

Before you start optimizing the performance of your training setup you need a metric that tells you whether the throughput is improving or not. You can measure seconds per iteration, or iterations per second, or some other such timing, but there is a more useful metric: achieved TFLOPS.

footnote: TFLOPS: trillion floating-point operations per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS)

Measuring TFLOPS is superior because without it you don't know whether you are close to the best achievable performance or not. This measurement gives you an indication of how far you are from the peak performance reported by the hardware manufacturer.

In this section I will use BLOOM's training as the example. We used 80GB A100 NVIDIA GPUs and trained in the mixed bf16 regime. So let's look at the [A100 spec](https://www.nvidia.com/en-us/data-center/a100/) which tells us:

```
BFLOAT16 Tensor Core 312 TFLOPS
```

Therefore we now know that if we were to only run `matmul` on huge bf16 matrices without copying to and from the device we should get around 312 TFLOPS max.

Practically though, due to disk IO, communications, the overhead of copying data from GPU memory to the GPU compute units, and because we can't do everything in bf16 and at times have to do math in fp32 (or tf32), we can realistically expect about half of that. So ~155 TFLOPS would be an amazing sustainable throughput for a complex training setup spanning hundreds of GPUs.

When we first started tuning things up we were at <100 TFLOPS, and a few weeks later when we launched the training we managed to get 150 TFLOPS.

The important thing to notice here is that we knew we couldn't push it much further, and so there was no point in trying to optimize it even more.

So a general rule of thumb: if your training setup gets about 1/2 of the advertised peak performance, you're doing great. Don't let that stop you from beating this suggestion and getting even more efficient, though.

When calculating TFLOPS it's important to remember that the math is different if [gradient checkpointing](#gradient-checkpointing) is enabled, since when it's activated more compute is used and it needs to be taken into account.

For transformer models the following is an estimation formula which slightly under-reports the real TFLOPS:

TFLOPS: `model_size_in_B * 4 * 2 * seqlen * global_batch_size / (time_in_sec_per_iteration * total_gpus * 1e3)`

The factor of 4 applies when activation checkpointing is used, otherwise it is 3; for 100B+ models, activation checkpointing will always be on.

So the `3*2` is often called "model FLOPs" and the `4*2` - "hardware FLOPs".

```
perl -le '$ng=64; $ms=52; $gbs=1024; $sp=127; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
(ng = total gpus, ms = model size in B, gbs = global batch size, sp = seconds per iteration; a Python version follows at the end of this section)

The same formula with bash env vars, with GBS broken down into `mbs*dp*gas` (gas = pp_chunks); note that this variant assumes 4 GPUs per node, i.e. `$NNODES*4` = total gpus:
```
echo "($MSIZE*4*2*$SEQLEN*$MICRO_BATCH_SIZE*$DP_SIZE*$GAS)/($THROUGHPUT*$NNODES*4*1000)" | bc -l
```

The exact formula is in Equation 3 of Section 5.1 of the [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) paper. You can see the code [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/251).

footnote: for inference only it'd be: `24Bsh^2 + 4Bs^2h` floating point operations per layer.
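
Since the one-liners above are easy to mistype, here is the same estimate as a small Python helper (a convenience sketch, not one of the playbook's scripts; the function name is made up):

```
def tflops_per_gpu(model_size_in_B, seqlen, global_batch_size,
                   time_in_sec_per_iteration, total_gpus, checkpointing=True):
    """Estimated achieved TFLOPS per GPU; slightly under-reports the real number."""
    factor = 4 if checkpointing else 3   # 4*2 = "hardware FLOPs", 3*2 = "model FLOPs"
    flops_per_iteration = model_size_in_B * 1e9 * factor * 2 * seqlen * global_batch_size
    return flops_per_iteration / (time_in_sec_per_iteration * total_gpus * 1e12)

# the same numbers as the perl example: 52B model, 64 GPUs, GBS=1024, seqlen=2048,
# 127 secs/iteration -> ~107 TFLOPS per GPU
print(f"{tflops_per_gpu(52, 2048, 1024, 127, 64):.1f}")
```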

## Gradient checkpointing

This is only relevant for training.

Enabling gradient checkpointing allows one to trade speed for GPU memory. When this feature is activated, instead of remembering the outputs of, say, transformer blocks until the backward pass is done, those outputs are dropped. This frees up huge amounts of GPU memory. But, of course, a backward pass is not possible without the outputs of the forward pass, and thus they have to be recalculated.

This, of course, can vary from model to model, but typically one pays with about a 20-25% decrease in throughput. Since a huge amount of GPU memory is liberated, though, one can now increase the per-GPU batch size and thus improve the overall effective throughput of the system. In some cases this allows you to double or quadruple the batch size if you were already able to fit a small batch size without OOM.

Activation checkpointing and gradient checkpointing are two terms for the same methodology.

For example, in HF Transformers models you call `model.gradient_checkpointing_enable()` to activate it in your own trainer, or if you use the HF Trainer you'd activate it with `--gradient_checkpointing 1`.


## Gradient accumulation

Depending on the situation, using a large gradient accumulation can increase throughput: even though the only thing skipped on the intermediate micro-batches is the optimizer `step` (it runs only at the accumulation boundary), the saving can be quite significant (a toy sketch of the pattern follows at the end of this section). e.g. in this particular small setup I clocked a 20-30% speedup:

- [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004592231)
- [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537)

When using pipeline parallelism a very large gradient accumulation is a must to keep the [pipeline's bubble to the minimum](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
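
Here is a toy, self-contained sketch of what gradient accumulation boils down to (real trainers such as the HF Trainer implement this for you; under DDP the intermediate micro-batches would additionally be wrapped in `no_sync()` so gradients are only synchronized at the boundary):

```
import torch
from torch import nn

# toy model and synthetic data, just to make the sketch runnable
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
accumulation_steps = 16  # i.e. GAS; effective batch = micro_batch * GAS * DP degree

for step in range(64):
    inputs = torch.randn(8, 512)    # one micro-batch
    targets = torch.randn(8, 512)
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # keep gradient scale right
    loss.backward()                 # gradients accumulate in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()            # the expensive, skippable part runs only here
        optimizer.zero_grad()
```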

## Vector and matrix size divisibility


### Tile and wave quantization

XXX


### Number/size of Attention heads

XXX
--------------------------------------------------------------------------------
/throughput/all_reduce_bench.py:
--------------------------------------------------------------------------------
""" License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """

# this version has been derived from @jeffra's gist: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36
# which in turn is derived from https://github.com/NVIDIA/nccl-tests
#
# to run on 2 gpus of a single node:
# python -m torch.distributed.run --nproc_per_node=2 all_reduce_bench.py
# (for a multi-node launch see the accompanying README.md)
#
# the printed results are already n_gpu-agnostic (i.e. averaged for the world size)

import argparse
import fcntl
import os
import socket
import time
import torch
import torch.distributed as dist

TRIALS = 5

N = 500000
M = 2000

def printflock(*msgs):
    """ solves the interleaved-print problem: serializes prints from concurrent ranks via an exclusive file lock """
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

def timed_allreduce(mat, id):
    pre = time.perf_counter()
    dist.all_reduce(mat)
    # reading a value forces a device sync, so the async all_reduce is guaranteed
    # to have completed before the duration is measured
    printflock(f"ignore me {int(mat[0][0])}")
    duration = time.perf_counter() - pre
    tput = ((M*N*4*2)/duration)*8 # *2 is for send + receive, *8 converts bytes to bits
    size = M * N * 4 # 4 is the size of fp32 in bytes
    n = dist.get_world_size()
    busbw = (size / duration) * (2 * (n - 1) / n) * 8
    printflock(f"{id}:\n",
               f"duration: {duration:.4f} sec\n",
               f"algo throughput: {tput:.4f} bps, {tput/1e9:.4f} Gbps\n",
               f"busbw: {busbw / 1e9:.4f} Gbps"
    )

def run(local_rank):
    hostname = socket.gethostname()
    id = f"{hostname}:{local_rank}"
    global_rank = dist.get_rank()

    printflock(f"{id} data size: {M*N*4/1e9} GB")
    mat = torch.rand(N, M, dtype=torch.float32).cuda(local_rank)

    for i in range(TRIALS):
        dist.barrier()
        if global_rank == 0:
            print(f"\n\n\n-----------trial-{i}----------------")
        timed_allreduce(mat, id)

def init_processes(local_rank, fn, backend='nccl'):
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend)
    fn(local_rank)


if __name__ == "__main__":
    rank = int(os.environ["LOCAL_RANK"])
    printflock("local_rank: %d" % rank)
    init_processes(local_rank=rank, fn=run)
--------------------------------------------------------------------------------