├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── architecture └── README.md ├── assets └── wikipedia_precision.png ├── debug ├── NicerTrace.py ├── README.md ├── printflock.py └── torch-distributed-gpu-test.py ├── hparams └── README.md ├── instabilities └── README.md ├── parallelism └── README.md ├── resources └── README.md └── throughput ├── README.md └── all_reduce_bench.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | feedback@huggingface.co. 65 | All complaints will be reviewed and investigated promptly and fairly. 
66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 16 | 17 | # Contribute to the Large Language Model Training Playbook 18 | 19 | The Large Language Model Training Playbook is a living document. 
We anticipate regular improvements, so please watch the repository to be notified about them.
20 | 
21 | Everyone is welcome to contribute, and we value everybody's contribution. Writing new content
22 | is not the only way to help. Answering questions in issues, helping
23 | others in pull requests, and improving the existing writing are also often valuable.
24 | 
25 | That said, please don't file a pull request without first coordinating via the issue system (see below), as (1) the content might go beyond what the playbook is intended to cover, or (2) someone else might already be working on it.
26 | 
27 | Also feel free to spread the word! You can reference the playbook in blog posts, give it a shout-out on Twitter whenever it has helped you, or simply ⭐️ the repository to say thank you.
28 | 
29 | However you choose to contribute, please be mindful of and respect our
30 | [code of conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md).
31 | 
32 | **This guide was inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).**
33 | 
34 | ## Ways to contribute
35 | 
36 | There are several ways you can contribute to the "Large Language Model Training Playbook":
37 | 
38 | * Propose a new section or propose adding more content to an existing section.
39 | * Submit issues about inaccuracies or unclear passages in the current content.
40 | * Read and comment on a pull request proposing new content or correcting the existing content.
41 | 
42 | If you don't know where to start, there might be a special [Good First
43 | Issue](https://github.com/huggingface/large_language_model_training_playbook/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open-source. Just comment in the issue that you'd like to work on it.
44 | 
45 | > All contributions are equally valuable to the community. 🥰
46 | 
47 | ## Propose a new section and/or additional content
48 | 
49 | If you would like to add a new section or content to an existing section, please **open an issue first to discuss the matter** before creating a pull request.
50 | 
51 | Even though the project aims to integrate as much input from contributors as possible, we can't guarantee that we'll accept every topic or contribution, so it's always better to get approval before spending a significant amount of time writing a section.
52 | 
53 | ## Submit issues about inaccuracies or clarity of the current content
54 | 
55 | When submitting an issue about inaccuracies or clarity of the current content, please keep in mind our
56 | [code of conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md), as we prohibit certain behaviors and types of communication. In particular, we try to build a positive environment for our
57 | community by being respectful of differing opinions, viewpoints, and experiences, and by giving and gracefully accepting constructive feedback. In a nutshell: don't forget there is a human just like you on the other side who has likely spent time and effort writing the content you are now commenting on.
58 | 59 | The repo maintainers will be very strict regarding any action they deem in violation of this Code of Conduct (see the [Enforcement Guidelines section of the Code of Conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md#Enforcement-Guidelines)) 60 | 61 | ## Create a Pull Request 62 | 63 | Before writing any section or content, we strongly advise you to search through the existing PRs or 64 | issues to make sure nobody is already working on the same thing. If you are 65 | unsure, it is always a good idea to open an issue to get some feedback. 66 | 67 | You will need basic `git` proficiency to contribute to the 68 | 🤗 Large Language Model Training Playbook. While `git` is not the easiest tool to use, it has the greatest 69 | manual. Type `git --help` in a shell and enjoy! If you prefer books, [Pro 70 | Git](https://git-scm.com/book/en/v2) is a very good reference. 71 | 72 | Follow the steps below to start contributing: 73 | 74 | 1. Fork the [repository](https://github.com/huggingface/large_language_model_training_playbook) by 75 | clicking on the **[Fork](https://github.com/huggingface/large_language_model_training_playbook/fork)** button on the repository's page. This creates a copy of the code 76 | under your GitHub user account. 77 | 78 | 2. Clone your fork to your local disk, and add the base repository as a remote: 79 | 80 | ```bash 81 | $ git clone git@github.com:/large_language_model_training_playbook.git 82 | $ cd large_language_model_training_playbook 83 | $ git remote add upstream https://github.com/huggingface/large_language_model_training_playbook.git 84 | ``` 85 | 86 | 3. Create a new branch to hold your development changes: 87 | 88 | ```bash 89 | $ git checkout -b a-descriptive-name-for-my-changes 90 | ``` 91 | 92 | 🚨 **Do not** work on the `main` branch! 93 | 94 | 4. Write the content in your branch. 95 | 96 | You can now write the new content or the correction you wanted to submit. 97 | 98 | Once you're happy with your changes, add changed files with `git add` and 99 | record your changes locally with `git commit`: 100 | 101 | ```bash 102 | $ git add modified_file.md 103 | $ git commit 104 | ``` 105 | 106 | Please remember to write [good commit 107 | messages](https://chris.beams.io/posts/git-commit/) to clearly communicate the changes you made! 108 | 109 | To keep your copy of the code up to date with the original 110 | repository, rebase your branch on `upstream/branch` *before* you open a pull request or if requested by a maintainer: 111 | 112 | ```bash 113 | $ git fetch upstream 114 | $ git rebase upstream/main 115 | ``` 116 | 117 | Push your changes to your branch: 118 | 119 | ```bash 120 | $ git push -u origin a-descriptive-name-for-my-changes 121 | ``` 122 | 123 | If you've already opened a pull request, you'll need to force push with the `--force` flag. Otherwise, if the pull request hasn't been opened yet, you can just push your changes normally. 124 | 125 | 5. Now you can go to your fork of the repository on GitHub and click on **Pull request** to open a pull request. When you're ready, you can send your changes to the project maintainers for review. 126 | 127 | 6. It's ok if maintainers request changes, it happens to our core contributors 128 | too! So everyone can see the changes in the pull request, work in your local 129 | branch and push the changes to your fork. They will automatically appear in 130 | the pull request. 
131 | 132 | ### Develop on Windows 133 | 134 | On Windows (unless you're working in [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) or WSL), you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings: 135 | 136 | ```bash 137 | git config core.autocrlf input 138 | ``` 139 | 140 | One way to run the `make` command on Windows is with MSYS2: 141 | 142 | 1. [Download MSYS2](https://www.msys2.org/), and we assume it's installed in `C:\msys64`. 143 | 2. Open the command line `C:\msys64\msys2.exe` (it should be available from the **Start** menu). 144 | 3. Run in the shell: `pacman -Syu` and install `make` with `pacman -S make`. 145 | 4. Add `C:\msys64\usr\bin` to your PATH environment variable. 146 | 147 | You can now use `make` from any terminal (Powershell, cmd.exe, etc.)! 🎉 148 | 149 | ### Sync a forked repository with upstream main (the Hugging Face repository) 150 | 151 | When updating the main branch of a forked repository, please follow these steps to avoid pinging the upstream repository which adds reference notes to each upstream PR, and sends unnecessary notifications to the developers involved in these PRs. 152 | 153 | 1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. 154 | 2. If a PR is absolutely necessary, use the following steps after checking out your branch: 155 | 156 | ```bash 157 | $ git checkout -b your-branch-for-syncing 158 | $ git pull --squash --no-commit upstream main 159 | $ git commit -m '' 160 | $ git push --set-upstream origin your-branch-for-syncing 161 | ``` 162 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 📖 The Large Language Model Training Playbook 2 | 3 | This playbook is a companion to the [LLM Training Handbook](https://github.com/huggingface/llm_training_handbook) which contains a lot more details and scripts. 4 | 5 | An open collection of implementation tips, tricks and resources for training large language models. 
6 | 
7 | The following covers questions on various topics that are interesting or challenging when training large language models.
8 | 
9 | ## [Deciding on a model architecture](./architecture/)
10 | 
11 | ## Deciding on a model parallelism strategy
12 | 
13 | ## Deciding on the model size
14 | 
15 | #### Scaling laws
16 | 
17 | #### Trade-offs of large language model sizes
18 | 
19 | ## Issues and questions related to tensor precision
20 | 
21 | ### What to choose between fp32, fp16, bf16
22 | 
23 | ### Mixed precision for optimizers, weights, specific modules
24 | 
25 | ### How to finetune and integrate a model trained in one precision into another precision
26 | 
27 | ## [Selecting training hyper-parameters and model initializations](./hparams)
28 | 
29 | ### Learning rate and learning rate schedules
30 | 
31 | ### Questions on batch size
32 | 
33 | ## [Maximizing throughput](./throughput)
34 | 
35 | ## [Avoiding, recovering from and understanding instabilities](./instabilities)
36 | 
37 | ### Detecting instabilities early
38 | 
39 | ### Training tips to reduce instabilities
40 | 
41 | ## Issues with data and data processing
42 | 
43 | ## [Debugging software and hardware failures](./debug/)
44 | 
45 | ## Tips on what metrics to follow during the training
46 | 
47 | ## [Resources](./resources/)
48 | 
--------------------------------------------------------------------------------
/architecture/README.md:
--------------------------------------------------------------------------------
1 | # Deciding on a model architecture
2 | 
3 | ## A standard architecture
4 | 
5 | ## Activations
6 | 
7 | ## Positional embeddings
8 | 
9 | ## Frequently seen modifications
10 | 
11 | ### Parallel feed-forward and attention
12 | 
13 | ### Additional layer-norms
14 | 
--------------------------------------------------------------------------------
/assets/wikipedia_precision.png:
--------------------------------------------------------------------------------
 https://raw.githubusercontent.com/huggingface/large_language_model_training_playbook/efa7884290d9e8b942c61b83c03c80932b9dcf1b/assets/wikipedia_precision.png
--------------------------------------------------------------------------------
/debug/NicerTrace.py:
--------------------------------------------------------------------------------
1 | """ NicerTrace - an improved Trace package """
2 | 
3 | """
4 | To try it in action and to get a sense of how it can help you just run:
5 | python debug/NicerTrace.py
6 | """
7 | 
8 | 
9 | import datetime
10 | import os
11 | import socket
12 | import sys
13 | import sysconfig
14 | import time
15 | import trace
16 | 
17 | 
18 | class NicerTrace(trace.Trace):
19 |     # as the 2 paths overlap, the longer one with site-packages needs to be first
20 |     py_dirs = [sysconfig.get_paths().get(k) for k in ["purelib", "stdlib"]]
21 |     site_packages_dir = sysconfig.get_paths()["purelib"]
22 |     stdlib_dir = sysconfig.get_paths()["stdlib"]
23 | 
24 |     def __init__(self, *args, packages_to_include=None, log_pids=False, **kwargs):
25 |         """normal init plus added package/dir exclusion overrides:
26 | 
27 |         While preserving the original behavior a new optional arg is added `packages_to_include`
28 |         with the following behavior:
29 | 
30 |         1. if ignoredirs is a list the original trace behavior is used - only those dirs and subdirs will be excluded
31 |         2. if ignoredirs is None and packages_to_include is None - everything is included
32 |         3. if packages_to_include="uninstalled" all packages found under /.../site-packages will be excluded.
I couldn't find a way to exclude core python packages under /.../lib/python3.8 since it'd then exclude site-packages as well 33 | 3. if packages_to_include=["PIL", "numpy", "pytorch"] all packages found under /.../site-packages, and /.../lib/python3.8 will be excluded except the packages that were listed to be included - use top-level package name here 34 | 4. if packages_to_include=None, everything under /.../site-packages, and /.../lib/python3.8 will be excluded and any packages that are installed via `pip install -e .` will be included 35 | 36 | """ 37 | ignoredirs = kwargs.get("ignoredirs", None) 38 | 39 | if ignoredirs is not None and len(ignoredirs) > 1: 40 | if packages_to_include is not None: 41 | raise ValueError("can't have both ignoredirs and packages_to_include not None") 42 | kwargs["ignoredirs"] = ignoredirs 43 | elif packages_to_include is None: 44 | kwargs["ignoredirs"] = None 45 | elif packages_to_include == "uninstalled": 46 | kwargs["ignoredirs"] = self.stdlib_dir # everything including python core packages 47 | else: 48 | # exclude all of /.../lib/python3.8 and sub-paths from /.../site-packages, and 49 | packages = os.listdir(self.site_packages_dir) 50 | packages_to_exclude = set(packages) - set(packages_to_include) 51 | dirs_to_exclude = [ 52 | f"{self.site_packages_dir}/{dir}" for dir in sorted(packages_to_exclude) if not dir.endswith("-info") 53 | ] 54 | # note, no way to exclude python core packages in this situation because 55 | # sysconfig.get_paths()'s' purelib is a subset of stdlib :(, so excluding only site-packages 56 | kwargs["ignoredirs"] = dirs_to_exclude 57 | 58 | # not packages, but final module names like Image from Image.py 59 | # mods_to_exclude = [] 60 | 61 | # print("\n".join(kwargs["ignoredirs"])) 62 | 63 | super().__init__(*args, **kwargs) 64 | self.log_pids = log_pids 65 | 66 | def strip_py_dirs(self, path): 67 | """strips python path prefix like /.../site-packages, and /.../lib/python3.8 if any matches""" 68 | for prefix in self.py_dirs: 69 | if path.startswith(prefix): 70 | return path.replace(prefix + "/", "") 71 | return path 72 | 73 | def globaltrace_lt(self, frame, why, arg): 74 | """Handler for call events. 75 | If the code block being entered is to be ignored, returns `None', 76 | else returns self.localtrace. 77 | 78 | This is an override to properly show full package names: 79 | 1. if it's under site-packages or core python dir - convert to package name 80 | 2. otherwise show full path to the python file - usually uninstalled packages 81 | 82 | Additionally enter frames now include the line number since some packages have multiple 83 | methods that have the same name and there is no telling which one of them was called. 84 | 85 | It was written against https://github.com/python/cpython/blob/3.8/Lib/trace.py. 
If you're 86 | using a different python version you may have to adapt it should the core implementation 87 | change (but it's unlikely) 88 | 89 | """ 90 | if why == "call": 91 | code = frame.f_code 92 | # print(f"\n\n{frame.f_code=}") 93 | # print(dir(code)) 94 | 95 | filename = frame.f_globals.get("__file__", None) 96 | if filename: 97 | lineno = code.co_firstlineno 98 | # python's trace fails to get the full package name - let's fix it 99 | # strip the common path of python library 100 | modulename = self.strip_py_dirs(filename) 101 | if filename != modulename: 102 | # the package was installed under /.../site-packages, /.../lib/python3.8 103 | modulename, ext = os.path.splitext(modulename) 104 | modulename = modulename.replace("/", ".") 105 | else: 106 | # still full path, because the package is not installed 107 | modulename = filename 108 | 109 | if modulename is not None: 110 | # XXX: ignoremods may not work now as before 111 | ignore_it = self.ignore.names(filename, modulename) 112 | if not ignore_it: 113 | if self.trace: 114 | if self.log_pids: 115 | print(os.getpid(), end=" ") 116 | 117 | print(f" {modulename}:{lineno} {code.co_name}") 118 | return self.localtrace 119 | else: 120 | return None 121 | 122 | def localtrace_trace_and_count(self, frame, why, arg): 123 | """ 124 | Overriding the default method. 125 | 126 | Using hh:mm:ss format for timestamps (instead of secs) as it's more readable when the trace is run for hours 127 | 128 | XXX: ideally it would be nice not to repeat the same module name on every line, but when I tried 129 | that I discovered that globaltrace_lt doesn't necessarily frame all the local calls, since 130 | localtrace_trace_and_count may continue printing local calls from an earlier frame w/o 131 | notifying that the context has changed. So we are forced to reprint the module name on each 132 | line to keep at least the incomplete context. 133 | 134 | Ideally there should an indication of a frame change before all the local prints 135 | 136 | Read the disclaimer in globaltrace_lt that this was tested with py-3.8 137 | 138 | """ 139 | if why == "line": 140 | # record the file name and line number of every trace 141 | filename = frame.f_code.co_filename 142 | lineno = frame.f_lineno 143 | key = filename, lineno 144 | self.counts[key] = self.counts.get(key, 0) + 1 145 | basename = os.path.basename(filename) 146 | if self.log_pids: 147 | print(os.getpid(), end=" ") 148 | if self.start_time: 149 | delta_time = trace._time() - self.start_time 150 | delta_time = str(datetime.timedelta(seconds=delta_time)).split(".")[0] 151 | print(delta_time, end=" ") 152 | print(f"{basename}:{lineno:>6}: {trace.linecache.getline(filename, lineno)}", end="") 153 | return self.localtrace 154 | 155 | # -------------------------------- # 156 | 157 | 158 | class Tee: 159 | """ 160 | A helper class to tee print's output into a file. 
161 | Usage: 162 | sys.stdout = Tee(filename) 163 | """ 164 | 165 | def __init__(self, filename): 166 | self.stdout = sys.stdout 167 | self.file = open(filename, "a") 168 | 169 | def __getattr__(self, attr): 170 | return getattr(self.stdout, attr) 171 | 172 | def write(self, msg): 173 | # comment out the next line if you don't want to write to stdout 174 | self.stdout.write(msg) 175 | self.file.write(msg) 176 | self.file.flush() 177 | 178 | def flush(self): 179 | # comment out the next line if you don't want to write to stdout 180 | self.stdout.flush() 181 | self.file.flush() 182 | 183 | 184 | # -------------------------------- # 185 | 186 | import time 187 | 188 | from PIL import Image 189 | 190 | def main(): 191 | img = Image.new("RGB", (4, 4)) 192 | time.sleep(1) 193 | img1 = img.convert("RGB") 194 | 195 | # or if you want to try another version of main: 196 | 197 | # from transformers import AutoConfig 198 | # def main(): 199 | # c = AutoConfig.from_pretrained("t5-small") 200 | 201 | if __name__ == "__main__": 202 | # enable the trace 203 | if 1: 204 | cwd = os.path.realpath(".") 205 | pid = os.getpid() 206 | hostname = socket.gethostname() 207 | local_rank = int(os.environ.get("LOCAL_RANK", 0)) 208 | trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt" 209 | 210 | # run the new command using the given tracer 211 | sys.stdout = Tee(trace_output_file) 212 | 213 | # create a Trace object, telling it what to ignore, and whether to 214 | # do tracing or line-counting or both. 215 | # tracer = trace.Trace( 216 | tracer = NicerTrace( 217 | # ignoredirs=dirs_to_exclude, # don't set this one if you use packages_to_include 218 | # ignoremods=mods_to_exclude, 219 | trace=1, 220 | count=1, 221 | timing=True, 222 | # log_pids=True, useful if you fork workers and want to tell which process the trace belongs to 223 | packages_to_include=["PIL"], 224 | ) 225 | 226 | # string with commands to run - passed to exec() 227 | tracer.run("main()") 228 | # or to use the function interface to call main with args, kwargs 229 | # tracer.runfunc(main, *args, **kwds)) 230 | else: 231 | main() 232 | -------------------------------------------------------------------------------- /debug/README.md: -------------------------------------------------------------------------------- 1 | # Debugging Software And Hardware Failures 2 | 3 | XXX: I concat'ed 2 docs I wrote elsewhere so might need to restructure them into a more coherent doc. 4 | 5 | ## Debugging PyTorch programs 6 | 7 | ### Prefixing logs with `node:rank`, interleaved asserts 8 | 9 | When you have warnings and asserts (or debug prints), it helps a lot to prefix each log with its hostname:rank 10 | 11 | ``` 12 | python -m torch.distributed.run --role $(hostname -s): --tee 3 --nnodes 1 --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 13 | ``` 14 | 15 | Now each log line will be prefixed with `[hostname:rank]` 16 | 17 | Note that the colon `:` at the end of `--role` entry is important, that's how you get `hostname:rank` prefix. But you can add any other separator there, e.g if you use `-`, you will end up with `hostname-rank` prefix. 
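The same `[hostname:rank]` idea is worth applying to your own debug prints too, where there is a second problem: prints coming from many ranks at once tend to interleave mid-line. The repo ships [`printflock.py`](./printflock.py) for that purpose; below is a minimal sketch of the idea (an assumed simplified version, not necessarily the repo's exact code), which also prefixes each message with `hostname:rank` taken from the launcher's environment variables. It relies on `fcntl`, so it is Unix/Linux-only.

```
# a minimal sketch of the printflock idea - an assumed simplified version, not
# necessarily what debug/printflock.py does internally
import fcntl
import os
import socket


def printflock(*args, **kwargs):
    """print() under an exclusive flock so concurrent ranks don't interleave lines."""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*args, **kwargs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


if __name__ == "__main__":
    # LOCAL_RANK is set by torch.distributed.run; default to 0 for a plain run
    rank = os.environ.get("LOCAL_RANK", "0")
    printflock(f"[{socket.gethostname()}:{rank}] my debug message")
```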
18 | 
19 | If you're in a SLURM environment, the `torch.distributed.run` command line from above becomes:
20 | 
21 | ```
22 | srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
23 | --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
24 | --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
25 | --role $(hostname -s): --tee 3 \
26 | torch-distributed-gpu-test.py'
27 | ```
28 | 
29 | Of course, adjust your environment variables to match; this was just an example.
30 | 
31 | Important! Note that I'm using a single-quoted string of commands passed to `bash -c`. This way the `hostname -s` command is delayed until it's run on each of the nodes. If you used double quotes above, `hostname -s` would get executed on the starting node and then all nodes would get the same hostname as the prefix, which defeats the purpose of using these flags. So if you use double quotes you need to rewrite the above like so:
32 | 
33 | 
34 | ```
35 | srun --jobid $SLURM_JOBID bash -c "python -m torch.distributed.run \
36 | --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank \$SLURM_PROCID \
37 | --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
38 | --role \$(hostname -s): --tee 3 \
39 | torch-distributed-gpu-test.py"
40 | ```
41 | 
42 | `$SLURM_PROCID` is escaped too as it needs to be specific to each node and it's unknown during the launch of the SLURM job on the main node. So there are 2 `\$` escapes in this version of the code.
43 | 
44 | This prefixing functionality is also super-helpful when a distributed program fails, which often results in interleaved assert messages that are very difficult to interpret. By `grep`ing for one `node:rank` string of choice, it's now possible to reconstruct the real error message.
45 | 
46 | For example, if you get a traceback that looks like:
47 | 
48 | ```
49 |   File "/path/to/training/dataset.py", line 785, in __init__
50 |   File "/path/to/training/dataset.py", line 785, in __init__
51 |     if self.dataset_proba.sum() != 1:
52 | AttributeError: 'list' object has no attribute 'sum'
53 |     if self.dataset_proba.sum() != 1:
54 |   File "/path/to/training/dataset.py", line 785, in __init__
55 |   File "/path/to/training/dataset.py", line 785, in __init__
56 |     if self.dataset_proba.sum() != 1:
57 |     if self.dataset_proba.sum() != 1:
58 | AttributeError: 'list' object has no attribute 'sum'
59 | AttributeError: 'list' object has no attribute 'sum'
60 | AttributeError: 'list' object has no attribute 'sum'
61 | ```
62 | 
63 | and when there are dozens of frames over 8 nodes it can't be made sense of, but the above `--tee` + `--role` will generate:
64 | 
65 | ```
66 | [host1:0]  File "/path/to/training/dataset.py", line 785, in __init__
67 | [host1:1]  File "/path/to/training/dataset.py", line 785, in __init__
68 | [host1:0]    if self.dataset_proba.sum() != 1:
69 | [host1:0]AttributeError: 'list' object has no attribute 'sum'
70 | [host1:1]    if self.dataset_proba.sum() != 1:
71 | [host1:2]  File "/path/to/training/dataset.py", line 785, in __init__
72 | [host1:3]  File "/path/to/training/dataset.py", line 785, in __init__
73 | [host1:3]    if self.dataset_proba.sum() != 1:
74 | [host1:2]    if self.dataset_proba.sum() != 1:
75 | [host1:1]AttributeError: 'list' object has no attribute 'sum'
76 | [host1:2]AttributeError: 'list' object has no attribute 'sum'
77 | [host1:3]AttributeError: 'list' object has no attribute 'sum'
78 | ```
79 | and you can `grep` this output for just one `host:rank` prefix, which gives us:
80 | 
81 | ```
82 | $ grep "[host1:0]" log.txt
83 | [host1:0]  File "/path/to/training/dataset.py", line 785, in __init__
File "/path/to/training/dataset.py", line 785, in __init__ 84 | [host1:0] if self.dataset_proba.sum() != 1: 85 | [host1:0]AttributeError: 'list' object has no attribute 'sum' 86 | ``` 87 | 88 | and voila, you can now tell what really happened. And as I mentioned earlier there can be easily a few hundred interleaved assert lines there. I was demo'ing a small example. 89 | 90 | Also, if you have just one node, you can just pass `-tee 3` and there is no need to pass `--role`. 91 | 92 | And of course if you're doing debug prints, then to solve this exact issue you can use [`printflock`](./torch-distributed-hanging-solutions.md#good-old-print). 93 | 94 | 95 | 96 | 97 | ### Dealing with Async CUDA bugs 98 | 99 | When using CUDA, failing pytorch programs very often produce a python traceback that makes no sense or can't be acted upon. This is because due to CUDA's async nature - when a CUDA kernel is executed, the program has already moved on and when the error happened the context of the program isn't there. The async functionality is there to make things faster, so that while the GPU is churning some `matmul` the program on CPU could already start doing something else. 100 | 101 | At other times some parts of the system will actually tell you that they couldn't generate the correct traceback, as in this error: 102 | 103 | ``` 104 | [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the 105 | asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/ 106 | incomplete data. To avoid this inconsistency, we are taking the entire process down. 107 | ``` 108 | 109 | There are a few solutions. 110 | 111 | If the failure is instant and can be reproduced on CPU (not all programs work on CPU), simply re-rerun it after hiding your GPUs. This is how you do it: 112 | 113 | ``` 114 | CUDA_VISIBLE_DEVICES="" python my-pytorch-program.py 115 | ``` 116 | 117 | The env var `CUDA_VISIBLE_DEVICES` is used to manually limit the visibility of GPUs to the executed program. So for example if you have 8 gpus and you want to run program1.py with first 4 gpus and program2.py with the remaining 2 gpus you can do: 118 | 119 | ``` 120 | CUDA_VISIBLE_DEVICES="0,1,2,3" python my-pytorch-program1.py 121 | CUDA_VISIBLE_DEVICES="4,5,6,7" python my-pytorch-program2.py 122 | ``` 123 | and the second program won't be the wiser that it's not using GPUs 0-3. 124 | 125 | But in the case of debug we are hiding all GPUs, by setting `CUDA_VISIBLE_DEVICES=""`. 126 | 127 | Now the program runs on CPU and you will get a really nice traceback and will fix the problem in no time. 128 | 129 | But, of course, if you your program requires multiple GPUs this won't work. And so here is another solution. 130 | 131 | Rerun your program after setting this environment variable: 132 | 133 | ``` 134 | CUDA_LAUNCH_BLOCKING=1 python my-pytorch-program.py 135 | ``` 136 | 137 | This variable tells pytorch (or any other CUDA-based program) to turn its async nature off everywhere and now all operations will be synchronous. So when the program crashes you should now get a perfect traceback and you will know exactly what ails your program. 138 | 139 | In theory enabling this variable should make everything run really slow, but in reality it really depends on your software. 
We did the whole of BLOOM-176B training using `CUDA_LAUNCH_BLOCKING=1` with [`Megatron-Deepspeed`](https://github.com/bigscience-workshop/Megatron-DeepSpeed) and had zero slowdown - we had to use it as pytorch was hanging without it and we had no time to figure out the hanging.
140 | 
141 | So, yes, switching from async to sync mode can hide some subtle race conditions, so there are times when a hang disappears, as in the example I shared above. So measure your throughput with and without this flag: sometimes it might not only help with getting an in-context traceback but actually solve your problem altogether.
142 | 
143 | Note: [NCCL==2.14.3 coming with `pytorch==1.13` hangs](https://github.com/NVIDIA/nccl/issues/750) when `CUDA_LAUNCH_BLOCKING=1` is used. So don't use it with that version of pytorch. The issue has been fixed in `nccl>=2.17` which should be included in `pytorch==2.0`.
144 | 
145 | 
146 | 
147 | 
148 | ### segfaults and getting a backtrace from a core file
149 | 
150 | It's not uncommon for a complex pytorch program to segfault and drop a core file. Especially if
151 | you're using complex extensions like NCCL.
152 | 
153 | The corefile is what the program generates when it crashes at a low level - e.g. when using a python extension - such as a CUDA kernel or really any library that is coded directly in some variant of C or another language and made accessible in python through some binding API. The most common cause of a segfault is when such software accesses memory it has not allocated. For example, a program may try to free memory it hasn't allocated. But there could be many other reasons.
154 | 
155 | When a segfault event happens Python can't do anything, as the proverbial carpet is pulled out from under its feet, so it can't generate an exception or even write anything to the output.
156 | 
157 | In these situations one must go and analyse the libC-level calls that led to the segfault, which are luckily saved in the core file.
158 | 
159 | If your program crashed, you will often find a file that will look something like: `core-python-3097667-6`
160 | 
161 | 
162 | Before we continue, make sure you have `gdb` installed:
163 | ```
164 | sudo apt-get install gdb
165 | ```
166 | 
167 | Now make sure you know the path to the python executable that was used to run the program that crashed. If you have multiple python environments you have to activate the right environment first. If you don't, `gdb` may fail to unpack the core file.
168 | 
169 | So typically I'd go:
170 | 
171 | ```
172 | conda activate my-env
173 | gdb python core-python-3097667-6
174 | ```
175 | - adjust `my-env` to whatever env you use, or instead of conda use whatever way you use to activate your python environment - and perhaps you're using the system-wide python and then you don't need to activate anything.
176 | - adjust the name of the core file to the file you have gotten - it's possible that there are many - pick the latest then.
177 | 
178 | Now `gdb` will churn for a bit and will give you a prompt where you type: `bt`.
We will use an actual core file here: 179 | 180 | ``` 181 | (gdb) bt 182 | #0 0x0000147539887a9f in raise () from /lib64/libc.so.6 183 | #1 0x000014753985ae05 in abort () from /lib64/libc.so.6 184 | #2 0x000014751b85a09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6 185 | #3 0x000014751b86053c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6 186 | #4 0x000014751b860597 in std::terminate() () from /lib64/libstdc++.so.6 187 | #5 0x000014751b86052e in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib64/libstdc++.so.6 188 | #6 0x000014750bb007ef in c10d::ProcessGroupNCCL::WorkNCCL::handleNCCLGuard() () 189 | from .../python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so 190 | #7 0x000014750bb04c69 in c10d::ProcessGroupNCCL::workCleanupLoop() () 191 | from.../python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so 192 | #8 0x000014751b88cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6 193 | #9 0x000014753a3901cf in start_thread () from /lib64/libpthread.so.0 194 | #10 0x0000147539872dd3 in clone () from /lib64/libc.so.6 195 | ``` 196 | 197 | and there you go. How do you make sense of it? 198 | 199 | Well, you go from the bottom of the stack to the top. You can tell that a `clone` call was made in `libc` which then called `start_thread` in `libpthread` and then if you keep going there are a bunch of calls in the torch libraries and finally we can see that the program terminated itself, completing with `raise` from `libc` which told the Linux kernel to kill the program and create the core file. 200 | 201 | This wasn't an easy to understand backtrace. 202 | 203 | footnote: Yes, python calls it a *traceback* and elsewhere it's called a *backtrace* - it's confusing, but it's more or less the same thing. 204 | 205 | Actually I had to ask pytorch devs for help and received: 206 | 207 | - PyTorch `ProcessGroup` watchdog thread caught an asynchronous error from NCCL 208 | - This error is an `“unhandled system error”` which in this particular case turned out to be an IB-OPA error 209 | - The `ProcessGroup`’s `WorkCleanUp` thread rethrew the error so that the main process would crash and the user would get notified (otherwise this async error would not surface) 210 | 211 | Trust me there are times when even if you're inexperienced the backtrace can give you enough of a hint to where you should look for troubleshooting. 212 | 213 | But fear not - most of the time you won't need to understand the traceback. Ideally you'd just attach the core file to your filed Issue. But it can easily be 5GB large. So the developers that will be trying to help you will ask you to generate a `gdb` backtrace and now you know how to do that. 214 | 215 | I didn't promise it'll be easy, I just showed you where to start. 216 | 217 | Now another useful details is that many programs these days run multiple threads. And `bt` only shows the main thread of the process. But, often, it can be helpful to see where other threads in the process were when segfault has happened. For that you simply type 2 commands at the `(gdb)` prompt: 218 | 219 | ``` 220 | (gdb) thread apply all bt 221 | (gdb) bt 222 | ``` 223 | 224 | and this time around you typically will get a massive report, one backtrace per thread. 225 | 226 | 227 | 228 | ### strace 229 | 230 | Similar to [py-spy](./torch-distributed-hanging-solutions.md#py-spy), `strace` is a super-useful tool which traces any running application at the low-level system calls - e.g. 
`libC` and alike. 231 | 232 | For example, run: 233 | ``` 234 | strace python -c "print('strace')" 235 | ``` 236 | and you will see everything that is done at the system call level as the above program runs. 237 | 238 | But usually it's more useful when you have a stuck program that spins all CPU cores at 100% but nothing happens and you want to see what's it doing. In this situation you simply attached to the running program like so: 239 | 240 | ``` 241 | strace --pid PID 242 | ``` 243 | where you get the PID for example from the output of `top` or `ps`. Typically I just copy-n-paste the PID of the program that consumes the most CPU - `top` usually shows it at the very top of its listing. 244 | 245 | Same as `py-spy` you may need `sudo` perms to attached to an already running process - it all depends on your system setup. But you can always start a program with `strace` as I have shown in the original example. 246 | 247 | Let's look at a small sub-snippet of the output of `strace python -c "print('strace')"` 248 | 249 | ``` 250 | write(1, "strace\n", 7strace 251 | ) = 7 252 | ``` 253 | Here we can see that a write call was executed on filedescriptor `1`, which almost always is `stdout` (`stdin` being 0, and `stderr` being 2). 254 | 255 | If you're not sure what a filedescriptor is pointing to, normally you can tell from `strace`'s output itself. But you can also do: 256 | 257 | ``` 258 | ls -l /proc/PID/fd 259 | ``` 260 | where PID is the pid of the currently running program you're trying to investigate. 261 | 262 | For example, when I run the above while running a pytest test with gpus, I got (partial output): 263 | ``` 264 | l-wx------ 1 stas stas 64 Mar 1 17:22 5 -> /dev/null 265 | lr-x------ 1 stas stas 64 Mar 1 17:22 6 -> /dev/urandom 266 | lrwx------ 1 stas stas 64 Mar 1 17:22 7 -> /dev/nvidiactl 267 | lrwx------ 1 stas stas 64 Mar 1 17:22 8 -> /dev/nvidia0 268 | lr-x------ 1 stas stas 64 Mar 1 17:22 9 -> /dev/nvidia-caps/nvidia-cap2 269 | ``` 270 | so you can see that a device `/dev/null` is open as FD (file descriptor) 5, `/dev/urandom` as FD 6, etc. 271 | 272 | Now let's go look at another snippet from our `strace` run. 273 | 274 | ``` 275 | access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 276 | ``` 277 | Here it tried to see if file `/etc/ld.so.preload` exists, but as we can see it doesn't - this can be useful if some shared library is missing - you can see where it's trying to load it from. 278 | 279 | Let's try another one: 280 | ``` 281 | openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3 282 | read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832 283 | newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=21448, ...}, AT_EMPTY_PATH) = 0 284 | mmap(NULL, 16424, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f8028807000 285 | mmap(0x7f8028808000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f8028808000 286 | mmap(0x7f8028809000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f8028809000 287 | mmap(0x7f802880a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f802880a000 288 | close(3) 289 | ``` 290 | here we can see that it opens `/lib/x86_64-linux-gnu/libpthread.so.0` and assigns it FD 3, it then reads 832 chars from FD 3, (we can also see that the first chars are ELF - which stands for a shared library format), then memory maps it and closes that file. 
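As an aside, if you'd rather resolve file descriptors programmatically than run `ls -l /proc/PID/fd` by hand, the same information can be read straight from `/proc`. Here is a minimal Linux-only sketch (it assumes you have permission to read the target process's `/proc` entry; `list_fds` is just an illustrative helper name):

```
# resolve a process's open file descriptors via /proc - Linux-only sketch,
# roughly equivalent to `ls -l /proc/PID/fd`
import os
import sys


def list_fds(pid="self"):
    fd_dir = f"/proc/{pid}/fd"
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            target = "?"  # the fd may have been closed between listdir and readlink
        print(f"{fd} -> {target}")


if __name__ == "__main__":
    # pass a PID as the first argument, or inspect this very process
    list_fds(sys.argv[1] if len(sys.argv) > 1 else "self")
```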
291 | 292 | In this following example, we see a python cached file is opened, its filepointer is moved to 0, and then it's read and closed. 293 | ``` 294 | openat(AT_FDCWD, "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/__pycache__/abc.cpython-38.pyc", O_RDONLY|O_CLOEXEC) = 3 295 | fstat(3, {st_mode=S_IFREG|0664, st_size=5329, ...}) = 0 296 | lseek(3, 0, SEEK_CUR) = 0 297 | lseek(3, 0, SEEK_CUR) = 0 298 | fstat(3, {st_mode=S_IFREG|0664, st_size=5329, ...}) = 0 299 | brk(0x23bf000) = 0x23bf000 300 | read(3, "U\r\r\n\0\0\0\0\24\216\177c\211\21\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5330) = 5329 301 | read(3, "", 1) = 0 302 | close(3) 303 | ``` 304 | It's important to notice that file descriptors are re-used, so we have seen the same FD 3 twice, but each time it was open to a different file. 305 | 306 | If your program is for example trying to reach to the Internet, you can also tell these calls from `strace` as the program would be reading from a socket file descriptor. 307 | 308 | So let's run an example on a program that downloads files from the HF hub: 309 | ``` 310 | strace python -c 'import sys; from transformers import AutoConfig; AutoConfig.from_pretrained(sys.argv[1])' t5-small 311 | ``` 312 | 313 | here is some relevant to this discussion snippet: 314 | ``` 315 | socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 3 316 | setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 317 | ioctl(3, FIONBIO, [1]) = 0 318 | connect(3, {sa_family=AF_INET6, sin6_port=htons(443), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2600:1f18:147f:e850:e203:c458:10cd:fc3c 319 | ", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress) 320 | poll([{fd=3, events=POLLOUT|POLLERR}], 1, 10000) = 1 ([{fd=3, revents=POLLOUT}]) 321 | getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 322 | [...] 323 | write(3, "\26\3\3\0F\20\0\0BA\4\373m\244\16\354/\334\205\361j\225\356\202m*\305\332\275\251\17J"..., 126) = 126 324 | read(3, 0x2f05c13, 5) = -1 EAGAIN (Resource temporarily unavailable) 325 | poll([{fd=3, events=POLLIN}], 1, 9903) = 1 ([{fd=3, revents=POLLIN}]) 326 | read(3, "\24\3\3\0\1", 5) = 5 327 | read(3, "\1", 1) = 1 328 | read(3, "\26\3\3\0(", 5) = 5 329 | read(3, "\0\0\0\0\0\0\0\0\344\v\273\225`\4\24m\234~\371\332%l\364\254\34\3472<\0356s\313"..., 40) = 40 330 | ioctl(3, FIONBIO, [1]) = 0 331 | poll([{fd=3, events=POLLOUT}], 1, 10000) = 1 ([{fd=3, revents=POLLOUT}]) 332 | write(3, "\27\3\3\1.\0\374$\361\217\337\377\264g\215\364\345\256\260\211$\326pkR\345\276,\321\221`-"..., 307) = 307 333 | ioctl(3, FIONBIO, [1]) = 0 334 | read(3, 0x2ef7283, 5) = -1 EAGAIN (Resource temporarily unavailable) 335 | poll([{fd=3, events=POLLIN}], 1, 10000) = 1 ([{fd=3, revents=POLLIN}]) 336 | ``` 337 | 338 | You can see where that again it uses FD 3 but this time it opens a INET6 socket instead of a file. You can see that it then connects to that socket, polls, reads and writes from it. 339 | 340 | There are many other super useful understandings one can derive from using this tool. 341 | 342 | BTW, if you don't want to scroll up-down, you can also save the output to a file: 343 | ``` 344 | strace -o strace.txt python -c "print('strace')" 345 | ``` 346 | 347 | 348 | ## Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs 349 | 350 | While the methodologies found in this article were developed while working with multi-node multi-gpu pytorch-based training, they, of course, can help with any multi-process multi-node Python programs. 
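Before reaching for any tools, it helps to keep in mind what a basic multi-gpu sanity check boils down to: every rank joins the process group, runs a collective and passes a barrier. Below is a minimal sketch of such a check. It is an assumed simplified version, not the repo's [`torch-distributed-gpu-test.py`](./torch-distributed-gpu-test.py) referenced in the next subsection, and it is meant to be launched with `torch.distributed.run` as shown earlier in this document. If even this hangs or crashes, the problem is most likely in the NCCL/networking setup rather than in your training code.

```
# minimal all_reduce + barrier sanity check - an assumed sketch, not the repo's
# torch-distributed-gpu-test.py; launch with:
#   python -m torch.distributed.run --nproc_per_node N this_script.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torch.distributed.run
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")  # rank/world_size come from the launcher's env vars

    # every rank contributes 1.0; after the all_reduce each rank should hold world_size
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    assert t.item() == dist.get_world_size(), f"unexpected all_reduce result: {t.item()}"

    dist.barrier()
    if dist.get_rank() == 0:
        print(f"all {dist.get_world_size()} ranks completed all_reduce + barrier OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```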
351 | 352 | ### Helper tools 353 | 354 | Try to use the following script [torch-distributed-gpu-test.py](./torch-distributed-gpu-test.py) to diagnose the situation. 355 | 356 | This will help primarily with discovering network-related issues. And also to quickly understand how multi-gpu communications work. 357 | 358 | For code-related issues read the rest of this document. 359 | 360 | 361 | ### Approaches to diagnosing multi-gpu hanging / deadlocks 362 | 363 | #### py-spy 364 | 365 | First do `pip install py-spy`. 366 | 367 | Now you can attach to each process with: 368 | 369 | ``` 370 | py-spy dump -n -p PID 371 | ``` 372 | and it will tell you where the process hangs (very often it's a nccl collective function or a `barrier`). 373 | 374 | - `PID` is the process id of the hanging python process. 375 | - `-n` is useful if you want to see strack traces from python extensions written in C, C++, etc., as the program may hang in one of the extensions 376 | - you may need to add `sudo` before the command - for more details see [this note](https://github.com/benfred/py-spy#when-do-you-need-to-run-as-sudo). 377 | 378 | 379 | Here is an example of such a stack trace: 380 | ``` 381 | Thread 835995 (active): "MainThread" 382 | broadcast (torch/distributed/distributed_c10d.py:1191) 383 | _aggregate_total_loss (deepspeed/runtime/pipe/engine.py:540) 384 | train_batch (deepspeed/runtime/pipe/engine.py:330) 385 | train_step (megatron/training.py:436) 386 | train (megatron/training.py:851) 387 | pretrain (megatron/training.py:187) 388 | (pretrain_gpt.py:239) 389 | ``` 390 | The very first line is where the program is stuck. 391 | 392 | ##### multi-process py-spy 393 | 394 | Now, how do you do it for multiple processes. Doing it one-by-one is too slow. So let's do it at once. 395 | 396 | If the launch command was `python`, what you do is: 397 | 398 | ``` 399 | pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {} 400 | ``` 401 | 402 | if `deepspeed`: 403 | 404 | ``` 405 | pgrep -P $(pgrep -o deepspeed) | xargs -I {} py-spy dump --pid {} 406 | ``` 407 | 408 | for `accelerate`: 409 | 410 | 411 | ``` 412 | pgrep -P $(pgrep -o accelerate) | xargs -I {} py-spy dump --pid {} 413 | ``` 414 | 415 | you get the idea. 416 | 417 | This particular approach will only analyse the main processes and not various other sub-processes/threads spawned by these processes. So if you have 8 gpus and 8 processes, the above will generate 8 stack traces. 418 | 419 | If you want all processes and their subprocesses, then you'd just run: 420 | 421 | 422 | ``` 423 | pgrep -f python | xargs -I {} py-spy dump --pid {} 424 | ``` 425 | (and as before replace `python` with the name of the launcher program if it's not `python`) 426 | 427 | 428 | ##### multi-node py-spy 429 | 430 | What if you have multiple nodes? 431 | 432 | You can of course `ssh` to each node interactively and dump the stack traces. 433 | 434 | If you're using the SLURM environment you can use `srun` to do it on all nodes for you. 
435 | 436 | 437 | Now in another console get the `SLURM_JOBID` (or get it from `salloc` log): 438 | ``` 439 | squeue -u `whoami` -o "%.16i %.9P %.26j %.8T %.10M %.8l %.6D %.20S %R" 440 | ``` 441 | 442 | Now use the following `srun` command after adjusting jobid with `SLURM_JOBID` from the outcome of the command above this sentence: 443 | ``` 444 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed" 445 | ``` 446 | 447 | Notes: 448 | - One must use `--gres=gpu:0` for the monitor `srun` or otherwise it will block until the main `srun` (the one running the training) exits. 449 | - Each node will generate its unique log file named `trace-nodename.out` - so this would help to identify which node(s) are problematic. You can remove `--output=trace-%N.out` if you want it all being dumped to stdout 450 | - In some SLURM versions you may also need to add `--overlap` 451 | - In some SLURM versions the jobid might not match that of reported in `squeue`, so you have to get the correct `SLURM_JOB_ID` from the logs of the job you're trying to "attach" to - i.e. your `srun` job that allocated the GPUs. 452 | - Sometimes `bash` doesn't work, but `sh` does. I think it has to do with what dot files get `source`d 453 | - You might need to also activate a custom python environment, which you can do like so: 454 | ``` 455 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'conda activate myenvname; ps auxc | ... ' || echo "failed" 456 | ``` 457 | or you can do it inside `~/.bashrc` or whatever shell's rc file you decide to use. 458 | 459 | As mentioned before if you want just the main processes you'd use this instead: 460 | ``` 461 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' || echo "failed" 462 | ``` 463 | Adjust `python` if need be as explained in the multi-gpu section above. 464 | 465 | The previous longer command will deliver traces for all python processes. 466 | 467 | If you're not getting anything, start with the basic debug like: 468 | 469 | ``` 470 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'date' 471 | ``` 472 | once you know you're talking to all the nodes, then you can progressively unravel the depth of calls, as in: 473 | 474 | ``` 475 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'date' 476 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -o python' 477 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) ' 478 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' 479 | ``` 480 | and at each stage check that the output makes sense - e.g. the 2nd and 3rd call you should be getting the PIDs of the processes. 481 | 482 | The following notes require `pip install deepspeed`. 483 | 484 | In one SLURM environment I also attempted using `pdsh` via `ds_ssh`, but somehow I wasn't able to run `py-spy` remotely - the main issue was that remote `ssh` command wasn't giving the same env as when I was logged in interactively via `ssh`. 
But if you have `sudo` access on the compute nodes then you could do: 485 | 486 | First prepare `hostfile`: 487 | ``` 488 | function makehostfile() { 489 | perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"}; 490 | $slots=8 if $slots==0; # workaround 8 gpu machines 491 | @nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}]; 492 | print map { "$b$_ slots=$slots\n" } @nodes' 493 | } 494 | makehostfile > hostfile 495 | ``` 496 | Adapt `$slots` to the number of gpus per node. You may have to adapt this script if your `scontrol` produces a different output. 497 | 498 | Now run the `py-spy` extraction command over all participating nodes: 499 | ``` 500 | ds_ssh -f hostfile "source ~/.pdshrc; ps aux | grep python | grep -v grep | grep `whoami` | awk '{print \$2}' | xargs -I {} sudo py-spy dump --pid {} " 501 | ``` 502 | 503 | 504 | 505 | #### Network-level hanging 506 | 507 | The hanging could be happening at the network level. `NCCL_DEBUG=INFO` can help here. 508 | 509 | Run the script with `NCCL_DEBUG=INFO` env var and try to study the outcome for obvious errors. It will tell you which device it's using, e.g.: 510 | ``` 511 | DeepWhite:21288:21288 [0] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0> 512 | ``` 513 | So it's using interface `enp67s0` over `192.168.50.21` 514 | 515 | Is your `192.168.50.21` firewalled? or is it somehow a misconfigured network device? 516 | 517 | Does it work if you use a loopback device `127.0.0.1`? 518 | ``` 519 | NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py 520 | ``` 521 | 522 | if not, see what other local network devices you have via `ifconfig` - try that instead of `lo` if any. 523 | 524 | It's currently using `enp67s0` in the above example. 525 | 526 | 527 | #### Isolate problematic GPUs 528 | 529 | You can also try to see if only some GPUs fail 530 | 531 | For example, does it work if you use the first 2 or the last 2 gpus: 532 | 533 | ``` 534 | CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 535 | ``` 536 | then the 2nd pair: 537 | ``` 538 | CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 539 | ``` 540 | 541 | 542 | #### python `trace` 543 | 544 | Now what happens when the training doesn't just hang, but the hanging process stops responding? e.g. this happens when there is a serious hardware issue. But what if it is recurrent and `py-spy` won't help here, since it won't be able to attach to a process that is not responding. 545 | 546 | So next came the idea of tracing all calls like one does with `strace(1)`, I researched python calls tracing facilities and have discovered that python has a `trace` sub-system. 547 | 548 | The following code will trace all python calls and log them to the console and into a dedicated per process log file, via a custom `Tee` module I added. 549 | 550 | This then can help to understand where some processes stopped responding, since we will have the log of the last call and all the previous calls before it went unresponsive. 551 | 552 | ``` 553 | $ cat train.py 554 | [...] 555 | 556 | def main(): 557 | # [...] 558 | train() 559 | 560 | import re 561 | class Tee: 562 | """ 563 | A helper class to tee print's output into a file. 
564 | Usage: 565 | sys.stdout = Tee(filename) 566 | """ 567 | 568 | def __init__(self, filename): 569 | self.stdout = sys.stdout 570 | self.file = open(filename, "a") 571 | 572 | def __getattr__(self, attr): 573 | return getattr(self.stdout, attr) 574 | 575 | def write(self, msg): 576 | self.stdout.write(msg) 577 | self.file.write(msg) 578 | self.file.flush() 579 | 580 | def flush(self): 581 | self.stdout.flush() 582 | self.file.flush() 583 | 584 | if __name__ == "__main__": 585 | 586 | import sys 587 | import trace 588 | import socket 589 | import os 590 | 591 | # enable the trace 592 | if 0: 593 | cwd = os.path.realpath('.') 594 | pid = os.getpid() 595 | hostname = socket.gethostname() 596 | local_rank = int(os.environ["LOCAL_RANK"]) 597 | trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt" 598 | 599 | # create a Trace object, telling it what to ignore, and whether to 600 | # do tracing or line-counting or both. 601 | tracer = trace.Trace( 602 | ignoredirs=[sys.prefix, sys.exec_prefix], 603 | trace=1, 604 | count=1, 605 | timing=True, 606 | ) 607 | 608 | # run the new command using the given tracer 609 | sys.stdout = Tee(trace_output_file) 610 | tracer.run('main()') 611 | else: 612 | main() 613 | 614 | ``` 615 | 616 | This code doesn't require any special handling other than enabling the trace by changing `if 0` to `if 1`. 617 | 618 | If you don't set `ignoredirs`, this will dump all python calls, which means you should expect many GBs of data to be logged, especially if you have hundreds of GPUs. 619 | 620 | Of course, you don't have to start tracing from `main` - if you suspect a specific area, you can start tracing there instead and it'll be much faster with less data to save. 621 | 622 | I wish I could tell `trace` which packages to follow, but alas it only supports dirs to ignore, which is much more difficult to set, and thus you end up with a lot more data than needed. But still, this is a super useful tool for debugging hanging processes. 623 | 624 | Also, your code will now run much, much slower, and the more packages you trace the slower it will become. 625 | 626 | ##### NicerTrace 627 | 628 | As `Trace` proved to provide very limited usability when debugging a complex multi-node multi-hour run crash, I started working on a better version of `trace`. 629 | 630 | You can find it here: [NicerTrace](./NicerTrace.py) 631 | 632 | I added multiple additional flags to the constructor and made the output much more useful. You will find a full working example in that same file, just run: 633 | 634 | ``` 635 | python NicerTrace.py 636 | ``` 637 | and you should see: 638 | 639 | ``` 640 | trace/NicerTrace.py:1 641 | 0:00:00 : 1: trace/NicerTrace.py:185 main 642 | 0:00:00 NicerTrace.py: 186: img = Image.new("RGB", (4, 4)) 643 | PIL.Image:2896 new 644 | 0:00:00 Image.py: 2912: _check_size(size) 645 | PIL.Image:2875 _check_size 646 | 0:00:00 Image.py: 2883: if not isinstance(size, (list, tuple)): 647 | 0:00:00 Image.py: 2886: if len(size) != 2: 648 | 0:00:00 Image.py: 2889: if size[0] < 0 or size[1] < 0: 649 | ``` 650 | as you will see in the example I set: 651 | 652 | ``` 653 | packages_to_include=["PIL"], 654 | ``` 655 | so it'll trace `PIL` plus anything that is not under `site-packages`. If you need to trace another package, just add it to that list. 656 | 657 | This is a very fresh work-in-progress package, so it's evolving as we are trying to make it help us resolve a very complex crashing situation.
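In case you're curious how package-filtered tracing can be approached, here is a minimal sketch of the idea built on `sys.settrace` - this is just an illustration under my own assumptions, not a description of how `NicerTrace` is actually implemented:

```
# toy "trace only these packages" filter - logs every function call made inside
# the listed packages, while skipping everything else
import sys
import time

PACKAGES_TO_INCLUDE = ["PIL"]  # package names whose calls we want to see
START = time.time()

def tracer(frame, event, arg):
    if event != "call":
        return None
    module = frame.f_globals.get("__name__", "")
    # only report frames that belong to one of the requested packages
    if any(module == p or module.startswith(p + ".") for p in PACKAGES_TO_INCLUDE):
        print(f"{time.time() - START:8.3f}s {module}:{frame.f_lineno} {frame.f_code.co_name}")
    return None  # don't descend into line-by-line tracing of this frame

sys.settrace(tracer)
from PIL import Image   # any call made inside PIL is now logged
img = Image.new("RGB", (4, 4))
sys.settrace(None)
```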
658 | 659 | 660 | ##### Working with generated trace files 661 | 662 | When the per-node-rank trace files have been generated, the following might be helpful to quickly analyse the situation: 663 | 664 | 665 | - grep for a specific match and also print the file and line number where it was found: 666 | 667 | ``` 668 | grep -n "backward" trace* 669 | ``` 670 | 671 | - show the last few lines of each trace file preceded by its file name: 672 | 673 | ``` 674 | find . -name "trace*" -exec sh -c 'echo "$1: $(tail -3 "$1")"' _ {} \; 675 | ``` 676 | 677 | - or similar to the above, but print the last 5 lines with the leading filename and some vertical whitespace for easier reading: 678 | 679 | ``` 680 | find . -name "trace*" -exec sh -c 'echo; echo $1; echo "$(tail -5 "$1")"' _ {} \; 681 | ``` 682 | 683 | - count how many times a given pattern matched in each trace file and print the file name (in this example matching the pattern `backward`): 684 | 685 | ``` 686 | find . -name "trace*" -exec sh -c 'echo "$1: $(grep "backward" $1 | wc -l)"' _ {} \; 687 | ``` 688 | 689 | 690 | #### good old `print` 691 | 692 | Once you have discovered where the hanging happens, the next step is to understand why. Ideally you'd use a debugger for that, but more often than not debugging multi-process (multi-node) issues is very difficult. 693 | 694 | In such situations a good old `print` works. You just need to add some debug prints before the calls where things hang - something that helps you understand what led to the deadlock. For example, perhaps a `barrier` was missing, so one or a few processes skipped some code while the rest of the processes are still blocked waiting for everybody to send some data (for example in NCCL collective functions like `gather` or `reduce`). 695 | 696 | You, of course, want to prefix each print with the rank of the process so that you can tell which is which. For example: 697 | 698 | ``` 699 | import torch.distributed as dist 700 | print(f"{dist.get_rank()}: passed stage 0") 701 | ``` 702 | 703 | What you will quickly discover is that if you have multiple GPUs these prints will be badly interleaved and you will have a hard time making sense of the debug data. So let's fix this. We are going to override `print` with a custom version of the same, but which uses `flock` to ensure that only one process can write to stdout at the same time. 704 | 705 | The helper module `printflock.py` is included [here](./printflock.py). To activate it just run this at the top of the module you're debugging: 706 | 707 | ``` 708 | from printflock import printflock as print 709 | ``` 710 | 711 | and now all your `print` calls in that module will magically be non-interleaved. You can, of course, just use `printflock` directly: 712 | 713 | ``` 714 | from printflock import printflock 715 | import torch.distributed as dist 716 | printflock(f"{dist.get_rank()}: passed stage 0") 717 | ``` 718 | 719 | 720 | #### Code loops 721 | 722 | Code loops can be tricky to debug in hanging scenarios. If you have code like the following: 723 | 724 | ``` 725 | for i, d in enumerate(data): 726 | some_hanging_call(d) 727 | ``` 728 | 729 | it's possible that one process hangs in the first iteration, and another process in the second iteration, which makes things very confusing. But the stack trace won't give any such indication, since the line numbers will be the same, even though the processes aren't in the same place code progression-wise.
730 | 731 | In such situations unroll the loop to be: 732 | ``` 733 | d_iter = iter(data) 734 | some_hanging_call(next(d_iter) 735 | some_hanging_call(next(d_iter) 736 | ``` 737 | and now when you run `py-spy` the line numbers will be correct. The processes hanging in the first iteration will report the first `some_hanging_call` and those in the second iteration in the second call - as each now has its own line. 738 | 739 | 740 | 741 | 742 | ## Hardware-specific issues 743 | 744 | Some AMD users may need to [Disable IOMMU](https://github.com/stas00/toolbox/issues/1#issuecomment-1076830400) 745 | -------------------------------------------------------------------------------- /debug/printflock.py: -------------------------------------------------------------------------------- 1 | # If you have ever done multi-gpu work and tried to `print` for debugging you quickly discovered 2 | # that some messages get interleaved and are impossible to make sense of. Especially so if you're 3 | # using `print` to debug values. 4 | # 5 | # This simple solution that uses the good old `flock` solves the interleaving problem. To use this 6 | # version of print you can either do: 7 | # 8 | # from printflock import printflock 9 | # import torch.distributed as dist 10 | # printflock(f"{dist.get_rank()}: my long debug message") 11 | # 12 | # or you can override `print` with a better one: 13 | # 14 | # from printflock import printflock as print 15 | # import torch.distributed as dist 16 | # print(f"{dist.get_rank()}: my long debug message") 17 | # 18 | 19 | import builtins 20 | import fcntl 21 | 22 | def printflock(*args, **kwargs): 23 | """ 24 | This is a wrapper around the built-in Python `print` which calls `flock` before calling 25 | `print` and unlocks it immediately after. This wrapper is useful for when each rank needs to 26 | print a message without getting it interleaved with prints from other ranks. 27 | The lock file is the file this wrapper is defined in. 28 | The output order will be random per rank. 29 | 30 | Example: 31 | >>> # assuming 4 GPUs 32 | >>> world_size = dist.get_world_size() 33 | >>> rank = dist.get_rank() 34 | >>> printflock(f"This is a very long message from rank {rank}/{world_size}") 35 | This is a very long message from rank 0/4 36 | This is a very long message from rank 2/4 37 | This is a very long message from rank 3/4 38 | This is a very long message from rank 1/4 39 | 40 | It can also be used to override normal `print`: 41 | 42 | from printflock import printflock as print 43 | 44 | and then you don't need to change anything in your code. 45 | """ 46 | 47 | with open(__file__, "r") as fh: 48 | fcntl.flock(fh, fcntl.LOCK_EX) 49 | try: 50 | builtins.print(*args, **kwargs) 51 | finally: 52 | fcntl.flock(fh, fcntl.LOCK_UN) 53 | -------------------------------------------------------------------------------- /debug/torch-distributed-gpu-test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # 4 | # This a `torch.distributed` diagnostics script that checks that all GPUs in the cluster (one or 5 | # many nodes) can talk to each other via nccl and allocate gpu memory. 
6 | # 7 | # To run first adjust the number of processes and nodes: 8 | # 9 | # python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 10 | # 11 | # You may need to add --master_addr $MASTER_ADDR --master_port $MASTER_PORT if using a custom addr:port 12 | # 13 | # You can also use the rdzv API: --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d 14 | # 15 | # use torch.distributed.launch instead of torch.distributed.run for torch < 1.9 16 | # 17 | # If you get a hanging in `barrier` calls you have some network issues, you may try to debug this with: 18 | # 19 | # NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 20 | # 21 | # which should tell you what's going on behind the scenes. 22 | # 23 | # 24 | # This script can be run via `srun` in the SLURM environment as well. Here is a SLURM script that 25 | # runs on 2 nodes of 4 gpus per node: 26 | 27 | # #!/bin/bash 28 | # #SBATCH --job-name=test-nodes # name 29 | # #SBATCH --nodes=2 # nodes 30 | # #SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! 31 | # #SBATCH --cpus-per-task=10 # number of cores per tasks 32 | # #SBATCH --gres=gpu:4 # number of gpus 33 | # #SBATCH --time 0:05:00 # maximum execution time (HH:MM:SS) 34 | # #SBATCH --output=%x-%j.out # output file name 35 | # 36 | # export GPUS_PER_NODE=4 37 | # export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) 38 | # export MASTER_PORT=6000 39 | # 40 | # srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ 41 | # --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ 42 | # --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 43 | # torch-distributed-gpu-test.py' 44 | # 45 | # can also add this for automatic prefixing of all logs with [hostname:rank] (in addition to `--master_addr` etc) 46 | # --role `hostname -s`: --tee 3 \ 47 | # 48 | 49 | import builtins 50 | import fcntl 51 | import os 52 | import socket 53 | import torch 54 | import torch.distributed as dist 55 | 56 | def print(*args, **kwargs): 57 | """ solves multi-process interleaved print problem """ 58 | with open(__file__, "r") as fh: 59 | fcntl.flock(fh, fcntl.LOCK_EX) 60 | try: 61 | builtins.print(*args, **kwargs) 62 | finally: 63 | fcntl.flock(fh, fcntl.LOCK_UN) 64 | 65 | local_rank = int(os.environ["LOCAL_RANK"]) 66 | torch.cuda.set_device(local_rank) 67 | device = torch.device("cuda", local_rank) 68 | hostname = socket.gethostname() 69 | 70 | gpu = f"[{hostname}-{local_rank}]" 71 | 72 | try: 73 | # test distributed 74 | dist.init_process_group("nccl") 75 | 76 | # global rank 77 | rank = dist.get_rank() 78 | world_size = dist.get_world_size() 79 | 80 | # reduction test 81 | t = torch.ones(1, device=device) 82 | dist.all_reduce(t, op=dist.ReduceOp.SUM) 83 | dist.barrier() 84 | print(f"{gpu} Reduction op=sum result: {t.item()}") 85 | 86 | # test cuda is available and can allocate memory 87 | torch.cuda.is_available() 88 | torch.ones(1).cuda(local_rank) 89 | 90 | print(f"{gpu} is OK (global rank: {rank}/{world_size})") 91 | 92 | dist.barrier() 93 | if rank == 0: 94 | print(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}") 95 | print(f"device compute capabilities={torch.cuda.get_device_capability()}") 96 | print(f"pytorch compute capabilities={torch.cuda.get_arch_list()}") 97 | 98 | except Exception: 99 | print(f"{gpu} is broken") 100 | raise 101 | 
-------------------------------------------------------------------------------- /hparams/README.md: -------------------------------------------------------------------------------- 1 | # Selecting Training Hyper-Parameters And Model Initializations 2 | 3 | ## Glossary 4 | 5 | Training jargon uses a multitude of abbreviations and terms, so here are some important ones for this chapter. 6 | 7 | - BS: Batch Size - here we mean batch size per gpu, often it is also referred to as MBS (micro-batch-size) 8 | - GBS: Global Batch Size - total batch size per iteration - may include gradient accumulation 9 | - GAS: Gradient Accumulation Steps - how many forward/backward cycles to perform before one full iteration is complete 10 | - TFLOPs: Trillion FLOPs per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS) 11 | - PP: Pipeline Parallelism 12 | 13 | ## Global Batch Size Ramp Up 14 | 15 | If you intend to train with a very large GBS, of say 1024 or 2048 samples or even higher, it's very wasteful to feed such large batch sizes to the model when the training has just started. At that point the model is essentially random and can't yet benefit from that much data per step. Therefore, to save data and resources, one often ramps up the global batch size over some period of time. 16 | 17 | It's also important not to start with a GBS that is too small, since otherwise progress won't be efficient: with too little data per iteration the compute (TFLOPS) is used inefficiently and everything slows down. This is especially so when Pipeline Parallelism (PP) is used, since the most important part of the PP tune-up is keeping the GPU idleness bubble small, and the smaller the GBS the larger the bubble is. 18 | 19 | For example, for BLOOM-176B, where we did use PP, after doing throughput benchmarking we found that starting with GBS=16 was incredibly slow (8 TFLOPs), so we eventually started with GBS=192 (73 TFLOPs) and then ramped up to GBS=2048 (150 TFLOPs) - we increased GBS by 16 every 9_765_625 samples.
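To make the ramp up concrete, here is a small sketch that computes which GBS would be in effect after a given number of consumed samples, using the BLOOM-176B numbers quoted above as defaults (the `gbs_at` function name and the linear-ramp interpretation are mine - adapt them to your own schedule):

```
def gbs_at(consumed_samples, start_gbs=192, end_gbs=2048, increment=16,
           samples_per_increment=9_765_625):
    """GBS in effect after `consumed_samples`, for a ramp up that adds
    `increment` to the GBS every `samples_per_increment` samples."""
    steps = consumed_samples // samples_per_increment
    return min(start_gbs + steps * increment, end_gbs)

for samples in [0, 50_000_000, 250_000_000, 500_000_000, 1_000_000_000, 1_500_000_000]:
    print(f"{samples:>13,} samples -> GBS {gbs_at(samples)}")
```

At that rate the full ramp from 192 to 2048 takes `(2048-192)/16 * 9_765_625 ≈ 1.13B` samples.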
20 | -------------------------------------------------------------------------------- /instabilities/README.md: -------------------------------------------------------------------------------- 1 | # Avoiding, Recovering From and Understanding Instabilities 2 | 3 | ## STD Init 4 | -------------------------------------------------------------------------------- /parallelism/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/huggingface/large_language_model_training_playbook/efa7884290d9e8b942c61b83c03c80932b9dcf1b/parallelism/README.md -------------------------------------------------------------------------------- /resources/README.md: -------------------------------------------------------------------------------- 1 | # Resources 2 | 3 | 4 | ## Publicly available training logbooks 5 | 6 | The listing is in no particular order: 7 | 8 | - BigScience BLOOM-176B (2022): 9 | [chronicles-prequel](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles-prequel.md) | 10 | [chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md) | 11 | [the full spec and discussions](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/) 12 | 13 | - BigScience pre-BLOOM 108B training experiments (2021): 14 | [chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide/chronicles.md) | 15 | [the full spec and discussions](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide) 16 | 17 | 18 | - Meta OPT-175B (2022): 19 | [logbook](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles) | 20 | [Video](https://www.youtube.com/watch?v=p9IxoSkvZ-M) 21 | 22 | - HF m4 (Flamingo repro) (2023): [Learning log](https://docs.google.com/document/d/1ZNGyVWYFUbzV0xuei4SED2QAakGjMpaaQALcKYQm46U/edit) | [Training Logbook](https://github.com/huggingface/m4-logbook/) 23 | 24 | - THUDM GLM-130B (2022): [en logbook](https://github.com/THUDM/GLM-130B/blob/main/logs/main-log-en.md) | [Mandarin version](https://github.com/THUDM/GLM-130B/blob/main/logs/main-log.md) 25 | -------------------------------------------------------------------------------- /throughput/README.md: -------------------------------------------------------------------------------- 1 | # How to Maximize Training Throughput 2 | 3 | The faster you can make your model to train the sooner the model will finish training, which is important not only to being first to publish something, but also potentially saving a lot of money. 4 | 5 | In general maximizing throughput is all about running many experiments and measuring the outcome and chosing the one that is superior. 6 | 7 | In certain situations your modeling team may ask you to choose some hyper parameters that will be detrimental to throughput but overall beneficial for the overall model's success. 8 | 9 | ## Crucial reproducibility requirements 10 | 11 | The most important requirements for a series of successful experiments is to be able to reproduce the experiment environment again and again while changing only one or a few setup variables. 12 | 13 | Therefore when you try to figure out whether some change will improve performance or make it worse, you must figure out how to keep things stable. 14 | 15 | For example, you need to find a way to prevent the network usage from fluctuations. 
When we were doing performance optimizations for [108B pre-BLOOM experiments](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr8-104B-wide) it was close to impossible to perform, since we were on a shared internode network and the exact same setup would yield different throughput depending on how many other users were using the network. It simply didn't work. During BLOOM-176B we were given a dedicated SLURM partition with an isolated network where the only traffic was ours. Doing the performance optimization in such an environment was just perfect. 16 | 17 | ## Network throughput 18 | 19 | It's critical to understand your particular model size and framework requirements with regard to network bandwidth, throughput and latency. If you underpay for the network you will end up with idle gpus and thus waste money and time. If you overpay for a very fast network but your gpus are slow, then again you waste money and time. 20 | 21 | If your network is very slow, your training is likely to be network-bound and many improvements in the training setup will not help with improving performance. 22 | 23 | Here is a simple all-reduce benchmark that you can use to quickly measure the throughput of your internode network: 24 | 25 | [all_reduce_bench.py](./all_reduce_bench.py) 26 | 27 | Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of the nodes. 28 | 29 | To run it on 4 nodes: 30 | 31 | ``` 32 | python -m torch.distributed.run --nproc_per_node=4 all_reduce_bench.py 33 | ``` 34 | 35 | You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps. 36 | 37 | Frameworks that shard weights and optimizer states like [Deepspeed](https://github.com/microsoft/DeepSpeed) w/ ZeRO Stage-3 do a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run. 38 | 39 | Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage in parallel runs computations. So as long as the communication overhead is smaller than the compute, the network requirements are satisfied and don't have to be super fantastic. 40 | 41 | 42 | ## Checkpoint activations 43 | 44 | Enabling checkpoint activations allows one to trade speed for memory. When this feature is activated, instead of remembering the outputs of, say, transformer blocks until the backward pass is done, these outputs are dropped. This frees up huge amounts of GPU memory. But, of course, a backward pass is not possible without having the outputs of the forward pass, and thus they have to be recalculated. 45 | 46 | This, of course, can vary from model to model, but typically one pays with about a 20-25% decrease in throughput, but since a huge amount of gpu memory is liberated, one can now increase the batch size per gpu and thus improve the overall effective throughput of the system.
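In PyTorch this is typically done with `torch.utils.checkpoint`. Here is a minimal sketch (the toy model and sizes are made up for illustration) that wraps each block in `checkpoint()` so its internal activations are recomputed during the backward pass instead of being stored:

```
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """stand-in for a transformer block"""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        return x + self.net(x)

class Model(nn.Module):
    def __init__(self, dim=1024, n_layers=8, checkpoint_activations=True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.checkpoint_activations = checkpoint_activations
    def forward(self, x):
        for block in self.blocks:
            if self.checkpoint_activations and self.training:
                # activations inside `block` are dropped after the forward pass and
                # recomputed during backward: less memory, roughly 20-25% more compute
                x = checkpoint(block, x)
            else:
                x = block(x)
        return x

model = Model()
# note: with the default (reentrant) checkpointing at least one input must require
# grad, otherwise the checkpointed blocks won't get gradients for their weights
inp = torch.randn(4, 1024, requires_grad=True)
model(inp).sum().backward()
```

If you're using HF Transformers models, `model.gradient_checkpointing_enable()` should give you the same behavior without any manual wrapping.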
47 | 48 | 49 | 50 | ## Vector and matrix size divisibility 51 | 52 | 53 | ### Tile and wave quantization 54 | 55 | XXX 56 | 57 | 58 | ### Number/size of Attention heads 59 | 60 | XXX 61 | 62 | 63 | ### Understanding TFLOPs 64 | -------------------------------------------------------------------------------- /throughput/all_reduce_bench.py: -------------------------------------------------------------------------------- 1 | # this version has been derived from @jeffra's gist: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36 2 | # which in turn is derived from https://github.com/NVIDIA/nccl-tests 3 | # 4 | # to run for 2 nodes: 5 | # python -m torch.distributed.run --nproc_per_node=2 all_reduce_bench.py 6 | # 7 | # the printed results are already n_gpu-agnostic (i.e. averaged for the world size) 8 | 9 | import argparse 10 | import fcntl 11 | import os 12 | import socket 13 | import time 14 | import torch 15 | import torch.distributed as dist 16 | 17 | TRIALS = 5 18 | 19 | N = 500000 20 | M = 2000 21 | 22 | def printflock(*msgs): 23 | """ print """ 24 | with open(__file__, "r") as fh: 25 | fcntl.flock(fh, fcntl.LOCK_EX) 26 | try: 27 | print(*msgs) 28 | finally: 29 | fcntl.flock(fh, fcntl.LOCK_UN) 30 | 31 | def timed_allreduce(mat, id): 32 | pre = time.perf_counter() 33 | dist.all_reduce(mat) 34 | printflock(f"ignore me {int(mat[0][0])}") # required due to lazy evaluation 35 | duration = time.perf_counter() - pre 36 | tput = ((M*N*4*2)/duration)*8 # *2 is for send + receive, *8 for gigabits/second 37 | size = M * N * 4 # 4 is fp32 38 | n = dist.get_world_size() 39 | busbw = (size / duration) * (2 * (n - 1) / n) * 8 40 | printflock(f"{id}:\n", 41 | f"duration: {duration:.4f} sec\n", 42 | f"algo throughput: {tput:.4f} bps, {tput/1e9:.4f} Gbps\n", 43 | f"busbw: {busbw / 1e9:.4f} Gbps" 44 | ) 45 | 46 | def run(local_rank): 47 | hostname = socket.gethostname() 48 | id = f"{hostname}:{local_rank}" 49 | global_rank = dist.get_rank() 50 | 51 | printflock(f"{id} data size: {M*N*4/1e9} GB") 52 | mat = torch.rand(N, M, dtype=torch.float32).cuda(local_rank) 53 | 54 | for i in range(TRIALS): 55 | dist.barrier() 56 | if global_rank == 0: 57 | print(f"\n\n\n-----------trial-{i}----------------") 58 | timed_allreduce(mat, id) 59 | 60 | def init_processes(local_rank, fn, backend='nccl'): 61 | torch.cuda.set_device(local_rank) 62 | dist.init_process_group(backend) 63 | fn(local_rank) 64 | 65 | 66 | if __name__ == "__main__": 67 | rank = int(os.environ["LOCAL_RANK"]) 68 | printflock("local_rank: %d" % rank) 69 | init_processes(local_rank=rank, fn=run) 70 | --------------------------------------------------------------------------------