├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE-CC-BY-SA ├── README.md ├── debug ├── NicerTrace.py ├── README.md ├── printflock.py └── torch-distributed-gpu-test.py ├── dtype └── README.md ├── hparams └── README.md ├── instabilities └── README.md ├── parallelism └── README.md ├── resources └── README.md ├── slurm ├── README.md ├── cron-daily.slurm └── cron-hourly.slurm └── throughput ├── README.md └── all_reduce_bench.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Contributor Covenant Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 15 | 16 | ## Our Standards 17 | 18 | Examples of behavior that contributes to a positive environment for our 19 | community include: 20 | 21 | * Demonstrating empathy and kindness toward other people 22 | * Being respectful of differing opinions, viewpoints, and experiences 23 | * Giving and gracefully accepting constructive feedback 24 | * Accepting responsibility and apologizing to those affected by our mistakes, 25 | and learning from the experience 26 | * Focusing on what is best not just for us as individuals, but for the overall 27 | community 28 | 29 | Examples of unacceptable behavior include: 30 | 31 | * The use of sexualized language or imagery, and sexual attention or advances of 32 | any kind 33 | * Trolling, insulting or derogatory comments, and personal or political attacks 34 | * Public or private harassment 35 | * Publishing others' private information, such as a physical or email address, 36 | without their explicit permission 37 | * Other conduct which could reasonably be considered inappropriate in a 38 | professional setting 39 | 40 | ## Enforcement Responsibilities 41 | 42 | Community leaders are responsible for clarifying and enforcing our standards of 43 | acceptable behavior and will take appropriate and fair corrective action in 44 | response to any behavior that they deem inappropriate, threatening, offensive, 45 | or harmful. 46 | 47 | Community leaders have the right and responsibility to remove, edit, or reject 48 | comments, commits, code, wiki edits, issues, and other contributions that are 49 | not aligned to this Code of Conduct, and will communicate reasons for moderation 50 | decisions when appropriate. 51 | 52 | ## Scope 53 | 54 | This Code of Conduct applies within all community spaces, and also applies when 55 | an individual is officially representing the community in public spaces. 56 | Examples of representing our community include using an official e-mail address, 57 | posting via an official social media account, or acting as an appointed 58 | representative at an online or offline event. 59 | 60 | ## Enforcement 61 | 62 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 63 | reported to the community leaders responsible for enforcement at 64 | feedback@huggingface.co. 65 | All complaints will be reviewed and investigated promptly and fairly. 
66 | 67 | All community leaders are obligated to respect the privacy and security of the 68 | reporter of any incident. 69 | 70 | ## Enforcement Guidelines 71 | 72 | Community leaders will follow these Community Impact Guidelines in determining 73 | the consequences for any action they deem in violation of this Code of Conduct: 74 | 75 | ### 1. Correction 76 | 77 | **Community Impact**: Use of inappropriate language or other behavior deemed 78 | unprofessional or unwelcome in the community. 79 | 80 | **Consequence**: A private, written warning from community leaders, providing 81 | clarity around the nature of the violation and an explanation of why the 82 | behavior was inappropriate. A public apology may be requested. 83 | 84 | ### 2. Warning 85 | 86 | **Community Impact**: A violation through a single incident or series of 87 | actions. 88 | 89 | **Consequence**: A warning with consequences for continued behavior. No 90 | interaction with the people involved, including unsolicited interaction with 91 | those enforcing the Code of Conduct, for a specified period of time. This 92 | includes avoiding interactions in community spaces as well as external channels 93 | like social media. Violating these terms may lead to a temporary or permanent 94 | ban. 95 | 96 | ### 3. Temporary Ban 97 | 98 | **Community Impact**: A serious violation of community standards, including 99 | sustained inappropriate behavior. 100 | 101 | **Consequence**: A temporary ban from any sort of interaction or public 102 | communication with the community for a specified period of time. No public or 103 | private interaction with the people involved, including unsolicited interaction 104 | with those enforcing the Code of Conduct, is allowed during this period. 105 | Violating these terms may lead to a permanent ban. 106 | 107 | ### 4. Permanent Ban 108 | 109 | **Community Impact**: Demonstrating a pattern of violation of community 110 | standards, including sustained inappropriate behavior, harassment of an 111 | individual, or aggression toward or disparagement of classes of individuals. 112 | 113 | **Consequence**: A permanent ban from any sort of public interaction within the 114 | community. 115 | 116 | ## Attribution 117 | 118 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 119 | version 2.1, available at 120 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 121 | 122 | Community Impact Guidelines were inspired by 123 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 127 | [https://www.contributor-covenant.org/translations][translations]. 128 | 129 | [homepage]: https://www.contributor-covenant.org 130 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 131 | [Mozilla CoC]: https://github.com/mozilla/diversity 132 | [FAQ]: https://www.contributor-covenant.org/faq 133 | [translations]: https://www.contributor-covenant.org/translations 134 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 16 | 17 | # Contribute to the Large Language Model Training Playbook 18 | 19 | The Large Language Model Training Playbook is a living document. 
We anticipate regular improvements, so please watch the repository to be notified about these. 20 | 21 | Everyone is welcome to contribute, and we value everybody's contribution. Writing new 22 | content is not the only way to help. Answering questions in issues, helping 23 | others in pull requests, and improving the existing writing are also often valuable. 24 | 25 | However, please don't file a pull request without first coordinating via the issue system (see below), as (1) the content might go beyond what the playbook is intended to cover, or (2) someone else might already be working on it. 26 | 27 | Also feel free to spread the word! You can reference the playbook in blog posts, shout out on Twitter whenever it has helped you, or simply ⭐️ the repository to say thank you. 28 | 29 | However you choose to contribute, please be mindful and respect our 30 | [code of conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md). 31 | 32 | **This guide was inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md).** 33 | 34 | ## Ways to contribute 35 | 36 | There are several ways you can contribute to the "Large Language Model Training Playbook": 37 | 38 | * Propose a new section or propose to add more content to an existing section. 39 | * Submit issues about inaccuracies or lack of clarity in the current content. 40 | * Read and comment on a pull request proposing new content or correcting the existing content. 41 | 42 | If you don't know where to start, there might be a special [Good First 43 | Issue](https://github.com/huggingface/large_language_model_training_playbook/contribute) listing. It will give you a list of open issues that are beginner-friendly and help you start contributing to open source. Just comment in the issue that you'd like to work on it. 44 | 45 | > All contributions are equally valuable to the community. 🥰 46 | 47 | ## Propose a new section and/or additional content 48 | 49 | If you would like to add a new section or content to an existing section, please **open an issue first to discuss the matter** before creating a pull request. 50 | 51 | Even though the project aims to integrate as much input from contributors as possible, we can't guarantee that every topic or contribution will be accepted, so it's always better to get approval before spending a significant amount of time writing a section. 52 | 53 | ## Submit issues about inaccuracies or lack of clarity in the current content 54 | 55 | When submitting an issue about inaccuracies or lack of clarity in the current content, please keep our 56 | [code of conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md) in mind, as we prohibit certain behaviors and types of communication. In particular, we try to build a positive environment for our 57 | community by being respectful of differing opinions, viewpoints, and experiences, and by giving and gracefully accepting constructive feedback. In a nutshell: don't forget there is a human just like you on the other side who has likely spent time and effort writing the content you are now commenting on.
58 | 59 | The repo maintainers will be very strict regarding any action they deem in violation of this Code of Conduct (see the [Enforcement Guidelines section of the Code of Conduct](https://github.com/huggingface/large_language_model_training_playbook/blob/main/CODE_OF_CONDUCT.md#Enforcement-Guidelines)) 60 | 61 | ## Create a Pull Request 62 | 63 | Before writing any section or content, we strongly advise you to search through the existing PRs or 64 | issues to make sure nobody is already working on the same thing. If you are 65 | unsure, it is always a good idea to open an issue to get some feedback. 66 | 67 | You will need basic `git` proficiency to contribute to the 68 | 🤗 Large Language Model Training Playbook. While `git` is not the easiest tool to use, it has the greatest 69 | manual. Type `git --help` in a shell and enjoy! If you prefer books, [Pro 70 | Git](https://git-scm.com/book/en/v2) is a very good reference. 71 | 72 | Follow the steps below to start contributing: 73 | 74 | 1. Fork the [repository](https://github.com/huggingface/large_language_model_training_playbook) by 75 | clicking on the **[Fork](https://github.com/huggingface/large_language_model_training_playbook/fork)** button on the repository's page. This creates a copy of the code 76 | under your GitHub user account. 77 | 78 | 2. Clone your fork to your local disk, and add the base repository as a remote: 79 | 80 | ```bash 81 | $ git clone git@github.com:/large_language_model_training_playbook.git 82 | $ cd large_language_model_training_playbook 83 | $ git remote add upstream https://github.com/huggingface/large_language_model_training_playbook.git 84 | ``` 85 | 86 | 3. Create a new branch to hold your development changes: 87 | 88 | ```bash 89 | $ git checkout -b a-descriptive-name-for-my-changes 90 | ``` 91 | 92 | 🚨 **Do not** work on the `main` branch! 93 | 94 | 4. Write the content in your branch. 95 | 96 | You can now write the new content or the correction you wanted to submit. 97 | 98 | Once you're happy with your changes, add changed files with `git add` and 99 | record your changes locally with `git commit`: 100 | 101 | ```bash 102 | $ git add modified_file.md 103 | $ git commit 104 | ``` 105 | 106 | Please remember to write [good commit 107 | messages](https://chris.beams.io/posts/git-commit/) to clearly communicate the changes you made! 108 | 109 | To keep your copy of the code up to date with the original 110 | repository, rebase your branch on `upstream/branch` *before* you open a pull request or if requested by a maintainer: 111 | 112 | ```bash 113 | $ git fetch upstream 114 | $ git rebase upstream/main 115 | ``` 116 | 117 | Push your changes to your branch: 118 | 119 | ```bash 120 | $ git push -u origin a-descriptive-name-for-my-changes 121 | ``` 122 | 123 | If you've already opened a pull request, you'll need to force push with the `--force` flag. Otherwise, if the pull request hasn't been opened yet, you can just push your changes normally. 124 | 125 | 5. Now you can go to your fork of the repository on GitHub and click on **Pull request** to open a pull request. When you're ready, you can send your changes to the project maintainers for review. 126 | 127 | 6. It's ok if maintainers request changes, it happens to our core contributors 128 | too! So everyone can see the changes in the pull request, work in your local 129 | branch and push the changes to your fork. They will automatically appear in 130 | the pull request. 
131 | 132 | ### Develop on Windows 133 | 134 | On Windows (unless you're working in [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) or WSL), you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings: 135 | 136 | ```bash 137 | git config core.autocrlf input 138 | ``` 139 | 140 | One way to run the `make` command on Windows is with MSYS2: 141 | 142 | 1. [Download MSYS2](https://www.msys2.org/), and we assume it's installed in `C:\msys64`. 143 | 2. Open the command line `C:\msys64\msys2.exe` (it should be available from the **Start** menu). 144 | 3. Run in the shell: `pacman -Syu` and install `make` with `pacman -S make`. 145 | 4. Add `C:\msys64\usr\bin` to your PATH environment variable. 146 | 147 | You can now use `make` from any terminal (Powershell, cmd.exe, etc.)! 🎉 148 | 149 | ### Sync a forked repository with upstream main (the Hugging Face repository) 150 | 151 | When updating the main branch of a forked repository, please follow these steps to avoid pinging the upstream repository which adds reference notes to each upstream PR, and sends unnecessary notifications to the developers involved in these PRs. 152 | 153 | 1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. 154 | 2. If a PR is absolutely necessary, use the following steps after checking out your branch: 155 | 156 | ```bash 157 | $ git checkout -b your-branch-for-syncing 158 | $ git pull --squash --no-commit upstream main 159 | $ git commit -m '' 160 | $ git push --set-upstream origin your-branch-for-syncing 161 | ``` 162 | -------------------------------------------------------------------------------- /LICENSE-CC-BY-SA: -------------------------------------------------------------------------------- 1 | Attribution-ShareAlike 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 
30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-ShareAlike 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-ShareAlike 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. BY-SA Compatible License means a license listed at 88 | creativecommons.org/compatiblelicenses, approved by Creative 89 | Commons as essentially the equivalent of this Public License. 90 | 91 | d. 
Copyright and Similar Rights means copyright and/or similar rights 92 | closely related to copyright including, without limitation, 93 | performance, broadcast, sound recording, and Sui Generis Database 94 | Rights, without regard to how the rights are labeled or 95 | categorized. For purposes of this Public License, the rights 96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 97 | Rights. 98 | 99 | e. Effective Technological Measures means those measures that, in the 100 | absence of proper authority, may not be circumvented under laws 101 | fulfilling obligations under Article 11 of the WIPO Copyright 102 | Treaty adopted on December 20, 1996, and/or similar international 103 | agreements. 104 | 105 | f. Exceptions and Limitations means fair use, fair dealing, and/or 106 | any other exception or limitation to Copyright and Similar Rights 107 | that applies to Your use of the Licensed Material. 108 | 109 | g. License Elements means the license attributes listed in the name 110 | of a Creative Commons Public License. The License Elements of this 111 | Public License are Attribution and ShareAlike. 112 | 113 | h. Licensed Material means the artistic or literary work, database, 114 | or other material to which the Licensor applied this Public 115 | License. 116 | 117 | i. Licensed Rights means the rights granted to You subject to the 118 | terms and conditions of this Public License, which are limited to 119 | all Copyright and Similar Rights that apply to Your use of the 120 | Licensed Material and that the Licensor has authority to license. 121 | 122 | j. Licensor means the individual(s) or entity(ies) granting rights 123 | under this Public License. 124 | 125 | k. Share means to provide material to the public by any means or 126 | process that requires permission under the Licensed Rights, such 127 | as reproduction, public display, public performance, distribution, 128 | dissemination, communication, or importation, and to make material 129 | available to the public including in ways that members of the 130 | public may access the material from a place and at a time 131 | individually chosen by them. 132 | 133 | l. Sui Generis Database Rights means rights other than copyright 134 | resulting from Directive 96/9/EC of the European Parliament and of 135 | the Council of 11 March 1996 on the legal protection of databases, 136 | as amended and/or succeeded, as well as other essentially 137 | equivalent rights anywhere in the world. 138 | 139 | m. You means the individual or entity exercising the Licensed Rights 140 | under this Public License. Your has a corresponding meaning. 141 | 142 | 143 | Section 2 -- Scope. 144 | 145 | a. License grant. 146 | 147 | 1. Subject to the terms and conditions of this Public License, 148 | the Licensor hereby grants You a worldwide, royalty-free, 149 | non-sublicensable, non-exclusive, irrevocable license to 150 | exercise the Licensed Rights in the Licensed Material to: 151 | 152 | a. reproduce and Share the Licensed Material, in whole or 153 | in part; and 154 | 155 | b. produce, reproduce, and Share Adapted Material. 156 | 157 | 2. Exceptions and Limitations. For the avoidance of doubt, where 158 | Exceptions and Limitations apply to Your use, this Public 159 | License does not apply, and You do not need to comply with 160 | its terms and conditions. 161 | 162 | 3. Term. The term of this Public License is specified in Section 163 | 6(a). 164 | 165 | 4. Media and formats; technical modifications allowed. 
The 166 | Licensor authorizes You to exercise the Licensed Rights in 167 | all media and formats whether now known or hereafter created, 168 | and to make technical modifications necessary to do so. The 169 | Licensor waives and/or agrees not to assert any right or 170 | authority to forbid You from making technical modifications 171 | necessary to exercise the Licensed Rights, including 172 | technical modifications necessary to circumvent Effective 173 | Technological Measures. For purposes of this Public License, 174 | simply making modifications authorized by this Section 2(a) 175 | (4) never produces Adapted Material. 176 | 177 | 5. Downstream recipients. 178 | 179 | a. Offer from the Licensor -- Licensed Material. Every 180 | recipient of the Licensed Material automatically 181 | receives an offer from the Licensor to exercise the 182 | Licensed Rights under the terms and conditions of this 183 | Public License. 184 | 185 | b. Additional offer from the Licensor -- Adapted Material. 186 | Every recipient of Adapted Material from You 187 | automatically receives an offer from the Licensor to 188 | exercise the Licensed Rights in the Adapted Material 189 | under the conditions of the Adapter's License You apply. 190 | 191 | c. No downstream restrictions. You may not offer or impose 192 | any additional or different terms or conditions on, or 193 | apply any Effective Technological Measures to, the 194 | Licensed Material if doing so restricts exercise of the 195 | Licensed Rights by any recipient of the Licensed 196 | Material. 197 | 198 | 6. No endorsement. Nothing in this Public License constitutes or 199 | may be construed as permission to assert or imply that You 200 | are, or that Your use of the Licensed Material is, connected 201 | with, or sponsored, endorsed, or granted official status by, 202 | the Licensor or others designated to receive attribution as 203 | provided in Section 3(a)(1)(A)(i). 204 | 205 | b. Other rights. 206 | 207 | 1. Moral rights, such as the right of integrity, are not 208 | licensed under this Public License, nor are publicity, 209 | privacy, and/or other similar personality rights; however, to 210 | the extent possible, the Licensor waives and/or agrees not to 211 | assert any such rights held by the Licensor to the limited 212 | extent necessary to allow You to exercise the Licensed 213 | Rights, but not otherwise. 214 | 215 | 2. Patent and trademark rights are not licensed under this 216 | Public License. 217 | 218 | 3. To the extent possible, the Licensor waives any right to 219 | collect royalties from You for the exercise of the Licensed 220 | Rights, whether directly or through a collecting society 221 | under any voluntary or waivable statutory or compulsory 222 | licensing scheme. In all other cases the Licensor expressly 223 | reserves any right to collect such royalties. 224 | 225 | 226 | Section 3 -- License Conditions. 227 | 228 | Your exercise of the Licensed Rights is expressly made subject to the 229 | following conditions. 230 | 231 | a. Attribution. 232 | 233 | 1. If You Share the Licensed Material (including in modified 234 | form), You must: 235 | 236 | a. retain the following if it is supplied by the Licensor 237 | with the Licensed Material: 238 | 239 | i. identification of the creator(s) of the Licensed 240 | Material and any others designated to receive 241 | attribution, in any reasonable manner requested by 242 | the Licensor (including by pseudonym if 243 | designated); 244 | 245 | ii. a copyright notice; 246 | 247 | iii. 
a notice that refers to this Public License; 248 | 249 | iv. a notice that refers to the disclaimer of 250 | warranties; 251 | 252 | v. a URI or hyperlink to the Licensed Material to the 253 | extent reasonably practicable; 254 | 255 | b. indicate if You modified the Licensed Material and 256 | retain an indication of any previous modifications; and 257 | 258 | c. indicate the Licensed Material is licensed under this 259 | Public License, and include the text of, or the URI or 260 | hyperlink to, this Public License. 261 | 262 | 2. You may satisfy the conditions in Section 3(a)(1) in any 263 | reasonable manner based on the medium, means, and context in 264 | which You Share the Licensed Material. For example, it may be 265 | reasonable to satisfy the conditions by providing a URI or 266 | hyperlink to a resource that includes the required 267 | information. 268 | 269 | 3. If requested by the Licensor, You must remove any of the 270 | information required by Section 3(a)(1)(A) to the extent 271 | reasonably practicable. 272 | 273 | b. ShareAlike. 274 | 275 | In addition to the conditions in Section 3(a), if You Share 276 | Adapted Material You produce, the following conditions also apply. 277 | 278 | 1. The Adapter's License You apply must be a Creative Commons 279 | license with the same License Elements, this version or 280 | later, or a BY-SA Compatible License. 281 | 282 | 2. You must include the text of, or the URI or hyperlink to, the 283 | Adapter's License You apply. You may satisfy this condition 284 | in any reasonable manner based on the medium, means, and 285 | context in which You Share Adapted Material. 286 | 287 | 3. You may not offer or impose any additional or different terms 288 | or conditions on, or apply any Effective Technological 289 | Measures to, Adapted Material that restrict exercise of the 290 | rights granted under the Adapter's License You apply. 291 | 292 | 293 | Section 4 -- Sui Generis Database Rights. 294 | 295 | Where the Licensed Rights include Sui Generis Database Rights that 296 | apply to Your use of the Licensed Material: 297 | 298 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 299 | to extract, reuse, reproduce, and Share all or a substantial 300 | portion of the contents of the database; 301 | 302 | b. if You include all or a substantial portion of the database 303 | contents in a database in which You have Sui Generis Database 304 | Rights, then the database in which You have Sui Generis Database 305 | Rights (but not its individual contents) is Adapted Material, 306 | 307 | including for purposes of Section 3(b); and 308 | c. You must comply with the conditions in Section 3(a) if You Share 309 | all or a substantial portion of the contents of the database. 310 | 311 | For the avoidance of doubt, this Section 4 supplements and does not 312 | replace Your obligations under this Public License where the Licensed 313 | Rights include other Copyright and Similar Rights. 314 | 315 | 316 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 317 | 318 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 319 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 320 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 321 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 322 | IMPLIED, STATUTORY, OR OTHER. 
THIS INCLUDES, WITHOUT LIMITATION, 323 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 324 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 325 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 326 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 327 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 328 | 329 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 330 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 331 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 332 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 333 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 334 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 335 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 336 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 337 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 338 | 339 | c. The disclaimer of warranties and limitation of liability provided 340 | above shall be interpreted in a manner that, to the extent 341 | possible, most closely approximates an absolute disclaimer and 342 | waiver of all liability. 343 | 344 | 345 | Section 6 -- Term and Termination. 346 | 347 | a. This Public License applies for the term of the Copyright and 348 | Similar Rights licensed here. However, if You fail to comply with 349 | this Public License, then Your rights under this Public License 350 | terminate automatically. 351 | 352 | b. Where Your right to use the Licensed Material has terminated under 353 | Section 6(a), it reinstates: 354 | 355 | 1. automatically as of the date the violation is cured, provided 356 | it is cured within 30 days of Your discovery of the 357 | violation; or 358 | 359 | 2. upon express reinstatement by the Licensor. 360 | 361 | For the avoidance of doubt, this Section 6(b) does not affect any 362 | right the Licensor may have to seek remedies for Your violations 363 | of this Public License. 364 | 365 | c. For the avoidance of doubt, the Licensor may also offer the 366 | Licensed Material under separate terms or conditions or stop 367 | distributing the Licensed Material at any time; however, doing so 368 | will not terminate this Public License. 369 | 370 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 371 | License. 372 | 373 | 374 | Section 7 -- Other Terms and Conditions. 375 | 376 | a. The Licensor shall not be bound by any additional or different 377 | terms or conditions communicated by You unless expressly agreed. 378 | 379 | b. Any arrangements, understandings, or agreements regarding the 380 | Licensed Material not stated herein are separate from and 381 | independent of the terms and conditions of this Public License. 382 | 383 | 384 | Section 8 -- Interpretation. 385 | 386 | a. For the avoidance of doubt, this Public License does not, and 387 | shall not be interpreted to, reduce, limit, restrict, or impose 388 | conditions on any use of the Licensed Material that could lawfully 389 | be made without permission under this Public License. 390 | 391 | b. To the extent possible, if any provision of this Public License is 392 | deemed unenforceable, it shall be automatically reformed to the 393 | minimum extent necessary to make it enforceable. If the provision 394 | cannot be reformed, it shall be severed from this Public License 395 | without affecting the enforceability of the remaining terms and 396 | conditions. 397 | 398 | c. 
No term or condition of this Public License will be waived and no 399 | failure to comply consented to unless expressly agreed to by the 400 | Licensor. 401 | 402 | d. Nothing in this Public License constitutes or may be interpreted 403 | as a limitation upon, or waiver of, any privileges and immunities 404 | that apply to the Licensor or You, including from the legal 405 | processes of any jurisdiction or authority. 406 | 407 | 408 | ======================================================================= 409 | 410 | Creative Commons is not a party to its public 411 | licenses. Notwithstanding, Creative Commons may elect to apply one of 412 | its public licenses to material it publishes and in those instances 413 | will be considered the “Licensor.” The text of the Creative Commons 414 | public licenses is dedicated to the public domain under the CC0 Public 415 | Domain Dedication. Except for the limited purpose of indicating that 416 | material is shared under a Creative Commons public license or as 417 | otherwise permitted by the Creative Commons policies published at 418 | creativecommons.org/policies, Creative Commons does not authorize the 419 | use of the trademark "Creative Commons" or any other trademark or logo 420 | of Creative Commons without its prior written consent including, 421 | without limitation, in connection with any unauthorized modifications 422 | to any of its public licenses or any other arrangements, 423 | understandings, or agreements concerning use of licensed material. For 424 | the avoidance of doubt, this paragraph does not form part of the 425 | public licenses. 426 | 427 | Creative Commons may be contacted at creativecommons.org. 428 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 📖 The Large Language Model Training Handbook 2 | 3 | An open collection of methodologies to help with successful training of large language models. 4 | 5 | This is technical material suitable for LLM training engineers and operators. That is the content here contains lots of scripts and copy-n-paste commands to enable you to quickly solve your problems. 6 | 7 | If you are not interested in technical details but want more of a detailed overview and concepts please refer to the sister [The Large Language Model Training Playbook](https://github.com/huggingface/large_language_model_training_playbook) instead. 8 | 9 | note: The list of topics will expand over time - at the moment filling in only a subset 10 | 11 | ## [Model parallelism](./parallelism/) 12 | 13 | ## [Maximizing throughput](./throughput/) 14 | 15 | ## [Tensor precision / Data types](./dtype/) 16 | 17 | ## [Training hyper-parameters and model initializations](./hparams/) 18 | 19 | ## [Instabilities](./instabilities/) 20 | 21 | ## [Debugging software and hardware failures](./debug/) 22 | 23 | ## [SLURM](./slurm/) 24 | 25 | ## [Resources](./resources/) 26 | 27 | ## License 28 | 29 | The content of this site is distributed under [Attribution-ShareAlike 4.0 International](./LICENSE-CC-BY-SA). 30 | 31 | Unless specified otherwise the code in this repo is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
32 | -------------------------------------------------------------------------------- /debug/NicerTrace.py: -------------------------------------------------------------------------------- 1 | """ License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """ 2 | 3 | """ NicerTrace - an improved Trace package """ 4 | 5 | """ 6 | To try it in action and to get a sense of how it can help you just run: 7 | python debug/NicerTrace.py 8 | """ 9 | 10 | 11 | import datetime 12 | import os 13 | import socket 14 | import sys 15 | import sysconfig 16 | import time 17 | import trace 18 | 19 | 20 | class NicerTrace(trace.Trace): 21 | # as the 2 paths overlap, the longer one with site-packages needs to be first 22 | py_dirs = [sysconfig.get_paths().get(k) for k in ["purelib", "stdlib"]] 23 | site_packages_dir = sysconfig.get_paths()["purelib"] 24 | stdlib_dir = sysconfig.get_paths()["stdlib"] 25 | 26 | def __init__(self, *args, packages_to_include=None, log_pids=False, **kwargs): 27 | """normal init plus added package/dir exclusion overrides: 28 | 29 | While preserving the original behavior a new optional arg is added `packages_to_include` 30 | with the following behavior: 31 | 32 | 1. if ignoredirs is a list the original trace behavior is used - only those dirs and subdirs will be excluded 33 | 2. if ignoredirs is None and packages_to_include is None - everything is included 34 | 3. if packages_to_include="uninstalled" all packages found under /.../site-packages will be excluded. I couldn't find a way to exclude core python packages under /.../lib/python3.8 since it'd then exclude site-packages as well 35 | 4. if packages_to_include=["PIL", "numpy", "pytorch"] all packages found under /.../site-packages, and /.../lib/python3.8 will be excluded except the packages that were listed to be included - use top-level package name here 36 | 5.
if packages_to_include=None, everything under /.../site-packages, and /.../lib/python3.8 will be excluded and any packages that are installed via `pip install -e .` will be included 37 | 38 | """ 39 | ignoredirs = kwargs.get("ignoredirs", None) 40 | 41 | if ignoredirs is not None and len(ignoredirs) > 1: 42 | if packages_to_include is not None: 43 | raise ValueError("can't have both ignoredirs and packages_to_include not None") 44 | kwargs["ignoredirs"] = ignoredirs 45 | elif packages_to_include is None: 46 | kwargs["ignoredirs"] = None 47 | elif packages_to_include == "uninstalled": 48 | kwargs["ignoredirs"] = self.stdlib_dir # everything including python core packages 49 | else: 50 | # exclude all of /.../lib/python3.8 and sub-paths from /.../site-packages, and 51 | packages = os.listdir(self.site_packages_dir) 52 | packages_to_exclude = set(packages) - set(packages_to_include) 53 | dirs_to_exclude = [ 54 | f"{self.site_packages_dir}/{dir}" for dir in sorted(packages_to_exclude) if not dir.endswith("-info") 55 | ] 56 | # note, no way to exclude python core packages in this situation because 57 | # sysconfig.get_paths()'s' purelib is a subset of stdlib :(, so excluding only site-packages 58 | kwargs["ignoredirs"] = dirs_to_exclude 59 | 60 | # not packages, but final module names like Image from Image.py 61 | # mods_to_exclude = [] 62 | 63 | # print("\n".join(kwargs["ignoredirs"])) 64 | 65 | super().__init__(*args, **kwargs) 66 | self.log_pids = log_pids 67 | 68 | def strip_py_dirs(self, path): 69 | """strips python path prefix like /.../site-packages, and /.../lib/python3.8 if any matches""" 70 | for prefix in self.py_dirs: 71 | if path.startswith(prefix): 72 | return path.replace(prefix + "/", "") 73 | return path 74 | 75 | def globaltrace_lt(self, frame, why, arg): 76 | """Handler for call events. 77 | If the code block being entered is to be ignored, returns `None', 78 | else returns self.localtrace. 79 | 80 | This is an override to properly show full package names: 81 | 1. if it's under site-packages or core python dir - convert to package name 82 | 2. otherwise show full path to the python file - usually uninstalled packages 83 | 84 | Additionally enter frames now include the line number since some packages have multiple 85 | methods that have the same name and there is no telling which one of them was called. 86 | 87 | It was written against https://github.com/python/cpython/blob/3.8/Lib/trace.py. 
If you're 88 | using a different python version you may have to adapt it should the core implementation 89 | change (but it's unlikely) 90 | 91 | """ 92 | if why == "call": 93 | code = frame.f_code 94 | # print(f"\n\n{frame.f_code=}") 95 | # print(dir(code)) 96 | 97 | filename = frame.f_globals.get("__file__", None) 98 | if filename: 99 | lineno = code.co_firstlineno 100 | # python's trace fails to get the full package name - let's fix it 101 | # strip the common path of python library 102 | modulename = self.strip_py_dirs(filename) 103 | if filename != modulename: 104 | # the package was installed under /.../site-packages, /.../lib/python3.8 105 | modulename, ext = os.path.splitext(modulename) 106 | modulename = modulename.replace("/", ".") 107 | else: 108 | # still full path, because the package is not installed 109 | modulename = filename 110 | 111 | if modulename is not None: 112 | # XXX: ignoremods may not work now as before 113 | ignore_it = self.ignore.names(filename, modulename) 114 | if not ignore_it: 115 | if self.trace: 116 | if self.log_pids: 117 | print(os.getpid(), end=" ") 118 | 119 | print(f" {modulename}:{lineno} {code.co_name}") 120 | return self.localtrace 121 | else: 122 | return None 123 | 124 | def localtrace_trace_and_count(self, frame, why, arg): 125 | """ 126 | Overriding the default method. 127 | 128 | Using hh:mm:ss format for timestamps (instead of secs) as it's more readable when the trace is run for hours 129 | 130 | XXX: ideally it would be nice not to repeat the same module name on every line, but when I tried 131 | that I discovered that globaltrace_lt doesn't necessarily frame all the local calls, since 132 | localtrace_trace_and_count may continue printing local calls from an earlier frame w/o 133 | notifying that the context has changed. So we are forced to reprint the module name on each 134 | line to keep at least the incomplete context. 135 | 136 | Ideally there should an indication of a frame change before all the local prints 137 | 138 | Read the disclaimer in globaltrace_lt that this was tested with py-3.8 139 | 140 | """ 141 | if why == "line": 142 | # record the file name and line number of every trace 143 | filename = frame.f_code.co_filename 144 | lineno = frame.f_lineno 145 | key = filename, lineno 146 | self.counts[key] = self.counts.get(key, 0) + 1 147 | basename = os.path.basename(filename) 148 | if self.log_pids: 149 | print(os.getpid(), end=" ") 150 | if self.start_time: 151 | delta_time = trace._time() - self.start_time 152 | delta_time = str(datetime.timedelta(seconds=delta_time)).split(".")[0] 153 | print(delta_time, end=" ") 154 | print(f"{basename}:{lineno:>6}: {trace.linecache.getline(filename, lineno)}", end="") 155 | return self.localtrace 156 | 157 | # -------------------------------- # 158 | 159 | 160 | class Tee: 161 | """ 162 | A helper class to tee print's output into a file. 
163 | Usage: 164 | sys.stdout = Tee(filename) 165 | """ 166 | 167 | def __init__(self, filename): 168 | self.stdout = sys.stdout 169 | self.file = open(filename, "a") 170 | 171 | def __getattr__(self, attr): 172 | return getattr(self.stdout, attr) 173 | 174 | def write(self, msg): 175 | # comment out the next line if you don't want to write to stdout 176 | self.stdout.write(msg) 177 | self.file.write(msg) 178 | self.file.flush() 179 | 180 | def flush(self): 181 | # comment out the next line if you don't want to write to stdout 182 | self.stdout.flush() 183 | self.file.flush() 184 | 185 | 186 | # -------------------------------- # 187 | 188 | import time 189 | 190 | from PIL import Image 191 | 192 | def main(): 193 | img = Image.new("RGB", (4, 4)) 194 | time.sleep(1) 195 | img1 = img.convert("RGB") 196 | 197 | # or if you want to try another version of main: 198 | 199 | # from transformers import AutoConfig 200 | # def main(): 201 | # c = AutoConfig.from_pretrained("t5-small") 202 | 203 | if __name__ == "__main__": 204 | # enable the trace 205 | if 1: 206 | cwd = os.path.realpath(".") 207 | pid = os.getpid() 208 | hostname = socket.gethostname() 209 | local_rank = int(os.environ.get("LOCAL_RANK", 0)) 210 | trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt" 211 | 212 | # run the new command using the given tracer 213 | sys.stdout = Tee(trace_output_file) 214 | 215 | # create a Trace object, telling it what to ignore, and whether to 216 | # do tracing or line-counting or both. 217 | # tracer = trace.Trace( 218 | tracer = NicerTrace( 219 | # ignoredirs=dirs_to_exclude, # don't set this one if you use packages_to_include 220 | # ignoremods=mods_to_exclude, 221 | trace=1, 222 | count=1, 223 | timing=True, 224 | # log_pids=True, useful if you fork workers and want to tell which process the trace belongs to 225 | packages_to_include=["PIL"], 226 | ) 227 | 228 | # string with commands to run - passed to exec() 229 | tracer.run("main()") 230 | # or to use the function interface to call main with args, kwargs 231 | # tracer.runfunc(main, *args, **kwds)) 232 | else: 233 | main() 234 | -------------------------------------------------------------------------------- /debug/README.md: -------------------------------------------------------------------------------- 1 | # Debugging Software And Hardware Failures 2 | 3 | XXX: I concat'ed 2 docs I wrote elsewhere so might need to restructure them into a more coherent doc. 4 | 5 | ## Debugging PyTorch programs 6 | 7 | ### Prefixing logs with `node:rank`, interleaved asserts 8 | 9 | When you have warnings and asserts (or debug prints), it helps a lot to prefix each log with its hostname:rank 10 | 11 | ``` 12 | python -m torch.distributed.run --role $(hostname -s): --tee 3 --nnodes 1 --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 13 | ``` 14 | 15 | Now each log line will be prefixed with `[hostname:rank]` 16 | 17 | Note that the colon `:` at the end of `--role` entry is important, that's how you get `hostname:rank` prefix. But you can add any other separator there, e.g if you use `-`, you will end up with `hostname-rank` prefix. 
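If you also want to `grep` the prefixed output after the run finishes, capture it to a file. Here is a minimal sketch, assuming the same single-node command as above and a placeholder `log.txt` file name:

```
# capture the [hostname:rank]-prefixed output to a file for later per-rank grep'ing;
# 2>&1 merges stderr into stdout, since tracebacks are printed to stderr
python -m torch.distributed.run --role $(hostname -s): --tee 3 --nnodes 1 --nproc_per_node 2 \
    torch-distributed-gpu-test.py 2>&1 | tee log.txt
```

A log file captured like this is the kind of file the `grep "[host1:0]" log.txt` example further below operates on.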
18 | 19 | If you're in a SLURM environment the above command line becomes: 20 | 21 | ``` 22 | srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ 23 | --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ 24 | --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 25 | --role $(hostname -s): --tee 3 \ 26 | torch-distributed-gpu-test.py' 27 | ``` 28 | 29 | Of course adjust your environment variables to match, this was just an example. 30 | 31 | Important! Note, that I'm using a single quoted string of commands passed to `bash -c`. This way `hostname -s` command is delayed until it's run on each of the nodes. If you'd use double quotes above, `hostname -s` will get executed on the starting node and then all nodes will get the same hostname as the prefix, which defeats the purpose of using these flags. So if you use double quotes you need to rewrite the above like so: 32 | 33 | 34 | ``` 35 | srun --jobid $SLURM_JOBID bash -c "python -m torch.distributed.run \ 36 | --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank \$SLURM_PROCID \ 37 | --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 38 | --role \$(hostname -s): --tee 3 \ 39 | torch-distributed-gpu-test.py" 40 | ``` 41 | 42 | `$SLURM_PROCID` is escaped too as it needs to be specific to each node and it's unknown during the launch of the slurm job on the main node. So there are 2 `\$` escapes in this version of the code. 43 | 44 | This prefixing functionality is also super-helpful when one gets the distributed program fail and which often results in interleaved assert messages that are very difficult to interpret. So by `grep`ing for one `node:rank` string of choice, it's now possible to reconstruct the real error message. 45 | 46 | For example, if you get a traceback that looks like: 47 | 48 | ``` 49 | File "/path/to/training/dataset.py", line 785, in __init__ 50 | File "/path/to/training/dataset.py", line 785, in __init__ 51 | if self.dataset_proba.sum() != 1: 52 | AttributeError: 'list' object has no attribute 'sum' 53 | if self.dataset_proba.sum() != 1: 54 | File "/path/to/training/dataset.py", line 785, in __init__ 55 | File "/path/to/training/dataset.py", line 785, in __init__ 56 | if self.dataset_proba.sum() != 1: 57 | if self.dataset_proba.sum() != 1: 58 | AttributeError: 'list' object has no attribute 'sum' 59 | AttributeError: 'list' object has no attribute 'sum' 60 | AttributeError: 'list' object has no attribute 'sum' 61 | ``` 62 | 63 | and when it's dozens of frames over 8 nodes it can't be made sense of, but the above `-tee` + `--role` will generate: 64 | 65 | ``` 66 | [host1:0] File "/path/to/training/dataset.py", line 785, in __init__ 67 | [host1:1] File "/path/to/training/dataset.py", line 785, in __init__ 68 | [host1:0] if self.dataset_proba.sum() != 1: 69 | [host1:0]AttributeError: 'list' object has no attribute 'sum' 70 | [host1:1] if self.dataset_proba.sum() != 1: 71 | [host1:2] File "/path/to/training/dataset.py", line 785, in __init__ 72 | [host1:3] File "/path/to/training/dataset.py", line 785, in __init__ 73 | [host1:3] if self.dataset_proba.sum() != 1: 74 | [host1:2] if self.dataset_proba.sum() != 1: 75 | [host1:1]AttributeError: 'list' object has no attribute 'sum' 76 | [host1:2]AttributeError: 'list' object has no attribute 'sum' 77 | [host1:3]AttributeError: 'list' object has no attribute 'sum' 78 | ``` 79 | and you can `grep` this output for just one `host:rank` prefix, which gives us: 80 | 81 | ``` 82 | $ grep "[host1:0]" log.txt 83 | [host1:0] 
File "/path/to/training/dataset.py", line 785, in __init__ 84 | [host1:0] if self.dataset_proba.sum() != 1: 85 | [host1:0]AttributeError: 'list' object has no attribute 'sum' 86 | ``` 87 | 88 | and voila, you can now tell what really happened. And as I mentioned earlier there can be easily a few hundred interleaved assert lines there. I was demo'ing a small example. 89 | 90 | Also, if you have just one node, you can just pass `-tee 3` and there is no need to pass `--role`. 91 | 92 | And of course if you're doing debug prints, then to solve this exact issue you can use [`printflock`](./torch-distributed-hanging-solutions.md#good-old-print). 93 | 94 | 95 | 96 | 97 | ### Dealing with Async CUDA bugs 98 | 99 | When using CUDA, failing pytorch programs very often produce a python traceback that makes no sense or can't be acted upon. This is because due to CUDA's async nature - when a CUDA kernel is executed, the program has already moved on and when the error happened the context of the program isn't there. The async functionality is there to make things faster, so that while the GPU is churning some `matmul` the program on CPU could already start doing something else. 100 | 101 | At other times some parts of the system will actually tell you that they couldn't generate the correct traceback, as in this error: 102 | 103 | ``` 104 | [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the 105 | asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/ 106 | incomplete data. To avoid this inconsistency, we are taking the entire process down. 107 | ``` 108 | 109 | There are a few solutions. 110 | 111 | If the failure is instant and can be reproduced on CPU (not all programs work on CPU), simply re-rerun it after hiding your GPUs. This is how you do it: 112 | 113 | ``` 114 | CUDA_VISIBLE_DEVICES="" python my-pytorch-program.py 115 | ``` 116 | 117 | The env var `CUDA_VISIBLE_DEVICES` is used to manually limit the visibility of GPUs to the executed program. So for example if you have 8 gpus and you want to run program1.py with first 4 gpus and program2.py with the remaining 2 gpus you can do: 118 | 119 | ``` 120 | CUDA_VISIBLE_DEVICES="0,1,2,3" python my-pytorch-program1.py 121 | CUDA_VISIBLE_DEVICES="4,5,6,7" python my-pytorch-program2.py 122 | ``` 123 | and the second program won't be the wiser that it's not using GPUs 0-3. 124 | 125 | But in the case of debug we are hiding all GPUs, by setting `CUDA_VISIBLE_DEVICES=""`. 126 | 127 | Now the program runs on CPU and you will get a really nice traceback and will fix the problem in no time. 128 | 129 | But, of course, if you your program requires multiple GPUs this won't work. And so here is another solution. 130 | 131 | Rerun your program after setting this environment variable: 132 | 133 | ``` 134 | CUDA_LAUNCH_BLOCKING=1 python my-pytorch-program.py 135 | ``` 136 | 137 | This variable tells pytorch (or any other CUDA-based program) to turn its async nature off everywhere and now all operations will be synchronous. So when the program crashes you should now get a perfect traceback and you will know exactly what ails your program. 138 | 139 | In theory enabling this variable should make everything run really slow, but in reality it really depends on your software. 
We did the whole of BLOOM-176B training using `CUDA_LAUNCH_BLOCKING=1` with [`Megatron-Deepspeed`](https://github.com/bigscience-workshop/Megatron-DeepSpeed) and had zero slowdown - we had to use it as pytorch was hanging without it and we had no time to figure out the hanging. 140 | 141 | So, yes, switching from async to sync execution can hide some subtle race conditions, so there are times when a hang disappears, as in the example I shared above. So measure your throughput with and without this flag - sometimes it might not only help with getting an in-context traceback but actually solve your problem altogether. 142 | 143 | Note: [NCCL==2.14.3 coming with `pytorch==1.13` hangs](https://github.com/NVIDIA/nccl/issues/750) when `CUDA_LAUNCH_BLOCKING=1` is used. So don't use it with that version of pytorch. The issue has been fixed in `nccl>=2.17` which should be included in `pytorch==2.0`. 144 | 145 | 146 | 147 | 148 | ### segfaults and getting a backtrace from a core file 149 | 150 | It's not uncommon for a complex pytorch program to segfault and drop a core file. Especially if 151 | you're using complex extensions like NCCL. 152 | 153 | The core file is what the program generates when it crashes at a low level - e.g. when using a python extension - such as a CUDA kernel or really any library that is coded directly in some variant of C or another language and made accessible in python through some binding API. The most common cause of a segfault is such software accessing memory it has not allocated - for example, a program may try to free memory it hasn't allocated. But there could be many other reasons. 154 | 155 | When a segfault event happens Python can't do anything, as the proverbial carpet is pulled out from under its feet, so it can't generate an exception or even write anything to the output. 156 | 157 | In these situations one must go and analyse the libC-level calls that led to the segfault, which are luckily saved in the core file. 158 | 159 | If your program crashed, you will often find a file that will look something like: `core-python-3097667-6` 160 | 161 | 162 | Before we continue make sure you have `gdb` installed: 163 | ``` 164 | sudo apt-get install gdb 165 | ``` 166 | 167 | Now make sure you know the path to the python executable that was used to run the program that crashed. If you have multiple python environments you have to activate the right environment first. If you don't, `gdb` may fail to unpack the core file. 168 | 169 | So typically I'd go: 170 | 171 | ``` 172 | conda activate my-env 173 | gdb python core-python-3097667-6 174 | ``` 175 | - adjust `my-env` to whatever env you use, or instead of conda use whatever way you use to activate your python environment - and perhaps you're using the system-wide python and then you don't need to activate anything. 176 | - adjust the name of the core file to the file you have gotten - it's possible that there are many - pick the latest one. 177 | 178 | Now `gdb` will churn for a bit and will give you a prompt where you type: `bt`.
We will use an actual core file here: 179 | 180 | ``` 181 | (gdb) bt 182 | #0 0x0000147539887a9f in raise () from /lib64/libc.so.6 183 | #1 0x000014753985ae05 in abort () from /lib64/libc.so.6 184 | #2 0x000014751b85a09b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6 185 | #3 0x000014751b86053c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6 186 | #4 0x000014751b860597 in std::terminate() () from /lib64/libstdc++.so.6 187 | #5 0x000014751b86052e in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /lib64/libstdc++.so.6 188 | #6 0x000014750bb007ef in c10d::ProcessGroupNCCL::WorkNCCL::handleNCCLGuard() () 189 | from .../python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so 190 | #7 0x000014750bb04c69 in c10d::ProcessGroupNCCL::workCleanupLoop() () 191 | from.../python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so 192 | #8 0x000014751b88cba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6 193 | #9 0x000014753a3901cf in start_thread () from /lib64/libpthread.so.0 194 | #10 0x0000147539872dd3 in clone () from /lib64/libc.so.6 195 | ``` 196 | 197 | and there you go. How do you make sense of it? 198 | 199 | Well, you go from the bottom of the stack to the top. You can tell that a `clone` call was made in `libc` which then called `start_thread` in `libpthread` and then if you keep going there are a bunch of calls in the torch libraries and finally we can see that the program terminated itself, completing with `raise` from `libc` which told the Linux kernel to kill the program and create the core file. 200 | 201 | This wasn't an easy to understand backtrace. 202 | 203 | footnote: Yes, python calls it a *traceback* and elsewhere it's called a *backtrace* - it's confusing, but it's more or less the same thing. 204 | 205 | Actually I had to ask pytorch devs for help and received: 206 | 207 | - PyTorch `ProcessGroup` watchdog thread caught an asynchronous error from NCCL 208 | - This error is an `“unhandled system error”` which in this particular case turned out to be an IB-OPA error 209 | - The `ProcessGroup`’s `WorkCleanUp` thread rethrew the error so that the main process would crash and the user would get notified (otherwise this async error would not surface) 210 | 211 | Trust me there are times when even if you're inexperienced the backtrace can give you enough of a hint to where you should look for troubleshooting. 212 | 213 | But fear not - most of the time you won't need to understand the traceback. Ideally you'd just attach the core file to your filed Issue. But it can easily be 5GB large. So the developers that will be trying to help you will ask you to generate a `gdb` backtrace and now you know how to do that. 214 | 215 | I didn't promise it'll be easy, I just showed you where to start. 216 | 217 | Now another useful details is that many programs these days run multiple threads. And `bt` only shows the main thread of the process. But, often, it can be helpful to see where other threads in the process were when segfault has happened. For that you simply type 2 commands at the `(gdb)` prompt: 218 | 219 | ``` 220 | (gdb) thread apply all bt 221 | (gdb) bt 222 | ``` 223 | 224 | and this time around you typically will get a massive report, one backtrace per thread. 225 | 226 | 227 | 228 | ### strace 229 | 230 | Similar to [py-spy](./torch-distributed-hanging-solutions.md#py-spy), `strace` is a super-useful tool which traces any running application at the low-level system calls - e.g. 
`libC` and alike. 231 | 232 | For example, run: 233 | ``` 234 | strace python -c "print('strace')" 235 | ``` 236 | and you will see everything that is done at the system call level as the above program runs. 237 | 238 | But usually it's more useful when you have a stuck program that spins all CPU cores at 100% but nothing happens and you want to see what's it doing. In this situation you simply attached to the running program like so: 239 | 240 | ``` 241 | strace --pid PID 242 | ``` 243 | where you get the PID for example from the output of `top` or `ps`. Typically I just copy-n-paste the PID of the program that consumes the most CPU - `top` usually shows it at the very top of its listing. 244 | 245 | Same as `py-spy` you may need `sudo` perms to attached to an already running process - it all depends on your system setup. But you can always start a program with `strace` as I have shown in the original example. 246 | 247 | Let's look at a small sub-snippet of the output of `strace python -c "print('strace')"` 248 | 249 | ``` 250 | write(1, "strace\n", 7strace 251 | ) = 7 252 | ``` 253 | Here we can see that a write call was executed on filedescriptor `1`, which almost always is `stdout` (`stdin` being 0, and `stderr` being 2). 254 | 255 | If you're not sure what a filedescriptor is pointing to, normally you can tell from `strace`'s output itself. But you can also do: 256 | 257 | ``` 258 | ls -l /proc/PID/fd 259 | ``` 260 | where PID is the pid of the currently running program you're trying to investigate. 261 | 262 | For example, when I run the above while running a pytest test with gpus, I got (partial output): 263 | ``` 264 | l-wx------ 1 stas stas 64 Mar 1 17:22 5 -> /dev/null 265 | lr-x------ 1 stas stas 64 Mar 1 17:22 6 -> /dev/urandom 266 | lrwx------ 1 stas stas 64 Mar 1 17:22 7 -> /dev/nvidiactl 267 | lrwx------ 1 stas stas 64 Mar 1 17:22 8 -> /dev/nvidia0 268 | lr-x------ 1 stas stas 64 Mar 1 17:22 9 -> /dev/nvidia-caps/nvidia-cap2 269 | ``` 270 | so you can see that a device `/dev/null` is open as FD (file descriptor) 5, `/dev/urandom` as FD 6, etc. 271 | 272 | Now let's go look at another snippet from our `strace` run. 273 | 274 | ``` 275 | access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) 276 | ``` 277 | Here it tried to see if file `/etc/ld.so.preload` exists, but as we can see it doesn't - this can be useful if some shared library is missing - you can see where it's trying to load it from. 278 | 279 | Let's try another one: 280 | ``` 281 | openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3 282 | read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832 283 | newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=21448, ...}, AT_EMPTY_PATH) = 0 284 | mmap(NULL, 16424, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f8028807000 285 | mmap(0x7f8028808000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f8028808000 286 | mmap(0x7f8028809000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f8028809000 287 | mmap(0x7f802880a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f802880a000 288 | close(3) 289 | ``` 290 | here we can see that it opens `/lib/x86_64-linux-gnu/libpthread.so.0` and assigns it FD 3, it then reads 832 chars from FD 3, (we can also see that the first chars are ELF - which stands for a shared library format), then memory maps it and closes that file. 
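
As an aside, you can do the same file descriptor resolution from python - here is a tiny helper sketch of my own (Linux only; the `pid` and `fd` values in the usage comment are made up):

```
import os

def fd_target(pid: int, fd: int) -> str:
    """Return the path that a given file descriptor of a running process points to."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# e.g. fd_target(1234, 3) might return '/lib/x86_64-linux-gnu/libpthread.so.0'
```
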
291 | 292 | In this following example, we see a python cached file is opened, its filepointer is moved to 0, and then it's read and closed. 293 | ``` 294 | openat(AT_FDCWD, "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/__pycache__/abc.cpython-38.pyc", O_RDONLY|O_CLOEXEC) = 3 295 | fstat(3, {st_mode=S_IFREG|0664, st_size=5329, ...}) = 0 296 | lseek(3, 0, SEEK_CUR) = 0 297 | lseek(3, 0, SEEK_CUR) = 0 298 | fstat(3, {st_mode=S_IFREG|0664, st_size=5329, ...}) = 0 299 | brk(0x23bf000) = 0x23bf000 300 | read(3, "U\r\r\n\0\0\0\0\24\216\177c\211\21\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5330) = 5329 301 | read(3, "", 1) = 0 302 | close(3) 303 | ``` 304 | It's important to notice that file descriptors are re-used, so we have seen the same FD 3 twice, but each time it was open to a different file. 305 | 306 | If your program is for example trying to reach to the Internet, you can also tell these calls from `strace` as the program would be reading from a socket file descriptor. 307 | 308 | So let's run an example on a program that downloads files from the HF hub: 309 | ``` 310 | strace python -c 'import sys; from transformers import AutoConfig; AutoConfig.from_pretrained(sys.argv[1])' t5-small 311 | ``` 312 | 313 | here is some relevant to this discussion snippet: 314 | ``` 315 | socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 3 316 | setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 317 | ioctl(3, FIONBIO, [1]) = 0 318 | connect(3, {sa_family=AF_INET6, sin6_port=htons(443), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2600:1f18:147f:e850:e203:c458:10cd:fc3c 319 | ", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress) 320 | poll([{fd=3, events=POLLOUT|POLLERR}], 1, 10000) = 1 ([{fd=3, revents=POLLOUT}]) 321 | getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 322 | [...] 323 | write(3, "\26\3\3\0F\20\0\0BA\4\373m\244\16\354/\334\205\361j\225\356\202m*\305\332\275\251\17J"..., 126) = 126 324 | read(3, 0x2f05c13, 5) = -1 EAGAIN (Resource temporarily unavailable) 325 | poll([{fd=3, events=POLLIN}], 1, 9903) = 1 ([{fd=3, revents=POLLIN}]) 326 | read(3, "\24\3\3\0\1", 5) = 5 327 | read(3, "\1", 1) = 1 328 | read(3, "\26\3\3\0(", 5) = 5 329 | read(3, "\0\0\0\0\0\0\0\0\344\v\273\225`\4\24m\234~\371\332%l\364\254\34\3472<\0356s\313"..., 40) = 40 330 | ioctl(3, FIONBIO, [1]) = 0 331 | poll([{fd=3, events=POLLOUT}], 1, 10000) = 1 ([{fd=3, revents=POLLOUT}]) 332 | write(3, "\27\3\3\1.\0\374$\361\217\337\377\264g\215\364\345\256\260\211$\326pkR\345\276,\321\221`-"..., 307) = 307 333 | ioctl(3, FIONBIO, [1]) = 0 334 | read(3, 0x2ef7283, 5) = -1 EAGAIN (Resource temporarily unavailable) 335 | poll([{fd=3, events=POLLIN}], 1, 10000) = 1 ([{fd=3, revents=POLLIN}]) 336 | ``` 337 | 338 | You can see where that again it uses FD 3 but this time it opens a INET6 socket instead of a file. You can see that it then connects to that socket, polls, reads and writes from it. 339 | 340 | There are many other super useful understandings one can derive from using this tool. 341 | 342 | BTW, if you don't want to scroll up-down, you can also save the output to a file: 343 | ``` 344 | strace -o strace.txt python -c "print('strace')" 345 | ``` 346 | 347 | 348 | ## Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs 349 | 350 | While the methodologies found in this article were developed while working with multi-node multi-gpu pytorch-based training, they, of course, can help with any multi-process multi-node Python programs. 
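
To make the failure mode concrete, here is a minimal contrived sketch of my own of the kind of bug this section is about: one rank enters a collective that another rank never joins, so the job just hangs (until it eventually times out) with no error message. The gloo backend is used here only to keep the demo free of GPU requirements:

```
# run with: python -m torch.distributed.run --nproc_per_node 2 this-script.py
import torch.distributed as dist

dist.init_process_group("gloo")

if dist.get_rank() == 0:
    dist.barrier()  # rank 0 blocks here, since rank 1 never calls the matching barrier()

print(f"{dist.get_rank()}: this line is reached only by rank 1")
```
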
351 | 352 | ### Helper tools 353 | 354 | Try to use the following script [torch-distributed-gpu-test.py](./torch-distributed-gpu-test.py) to diagnose the situation. 355 | 356 | This will help primarily with discovering network-related issues. And also to quickly understand how multi-gpu communications work. 357 | 358 | For code-related issues read the rest of this document. 359 | 360 | 361 | ### Approaches to diagnosing multi-gpu hanging / deadlocks 362 | 363 | #### py-spy 364 | 365 | First do `pip install py-spy`. 366 | 367 | Now you can attach to each process with: 368 | 369 | ``` 370 | py-spy dump -n -p PID 371 | ``` 372 | and it will tell you where the process hangs (very often it's a nccl collective function or a `barrier`). 373 | 374 | - `PID` is the process id of the hanging python process. 375 | - `-n` is useful if you want to see strack traces from python extensions written in C, C++, etc., as the program may hang in one of the extensions 376 | - you may need to add `sudo` before the command - for more details see [this note](https://github.com/benfred/py-spy#when-do-you-need-to-run-as-sudo). 377 | 378 | 379 | Here is an example of such a stack trace: 380 | ``` 381 | Thread 835995 (active): "MainThread" 382 | broadcast (torch/distributed/distributed_c10d.py:1191) 383 | _aggregate_total_loss (deepspeed/runtime/pipe/engine.py:540) 384 | train_batch (deepspeed/runtime/pipe/engine.py:330) 385 | train_step (megatron/training.py:436) 386 | train (megatron/training.py:851) 387 | pretrain (megatron/training.py:187) 388 | (pretrain_gpt.py:239) 389 | ``` 390 | The very first line is where the program is stuck. 391 | 392 | ##### multi-process py-spy 393 | 394 | Now, how do you do it for multiple processes. Doing it one-by-one is too slow. So let's do it at once. 395 | 396 | If the launch command was `python`, what you do is: 397 | 398 | ``` 399 | pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {} 400 | ``` 401 | 402 | if `deepspeed`: 403 | 404 | ``` 405 | pgrep -P $(pgrep -o deepspeed) | xargs -I {} py-spy dump --pid {} 406 | ``` 407 | 408 | for `accelerate`: 409 | 410 | 411 | ``` 412 | pgrep -P $(pgrep -o accelerate) | xargs -I {} py-spy dump --pid {} 413 | ``` 414 | 415 | you get the idea. 416 | 417 | This particular approach will only analyse the main processes and not various other sub-processes/threads spawned by these processes. So if you have 8 gpus and 8 processes, the above will generate 8 stack traces. 418 | 419 | If you want all processes and their subprocesses, then you'd just run: 420 | 421 | 422 | ``` 423 | pgrep -f python | xargs -I {} py-spy dump --pid {} 424 | ``` 425 | (and as before replace `python` with the name of the launcher program if it's not `python`) 426 | 427 | 428 | ##### multi-node py-spy 429 | 430 | What if you have multiple nodes? 431 | 432 | You can of course `ssh` to each node interactively and dump the stack traces. 433 | 434 | If you're using the SLURM environment you can use `srun` to do it on all nodes for you. 
435 | 436 | 437 | Now in another console get the `SLURM_JOBID` (or get it from `salloc` log): 438 | ``` 439 | squeue -u `whoami` -o "%.16i %.9P %.26j %.8T %.10M %.8l %.6D %.20S %R" 440 | ``` 441 | 442 | Now use the following `srun` command after adjusting jobid with `SLURM_JOBID` from the outcome of the command above this sentence: 443 | ``` 444 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed" 445 | ``` 446 | 447 | Notes: 448 | - One must use `--gres=gpu:0` for the monitor `srun` or otherwise it will block until the main `srun` (the one running the training) exits. 449 | - Each node will generate its unique log file named `trace-nodename.out` - so this would help to identify which node(s) are problematic. You can remove `--output=trace-%N.out` if you want it all being dumped to stdout 450 | - In some SLURM versions you may also need to add `--overlap` 451 | - In some SLURM versions the jobid might not match that of reported in `squeue`, so you have to get the correct `SLURM_JOB_ID` from the logs of the job you're trying to "attach" to - i.e. your `srun` job that allocated the GPUs. 452 | - Sometimes `bash` doesn't work, but `sh` does. I think it has to do with what dot files get `source`d 453 | - You might need to also activate a custom python environment, which you can do like so: 454 | ``` 455 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'conda activate myenvname; ps auxc | ... ' || echo "failed" 456 | ``` 457 | or you can do it inside `~/.bashrc` or whatever shell's rc file you decide to use. 458 | 459 | As mentioned before if you want just the main processes you'd use this instead: 460 | ``` 461 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' || echo "failed" 462 | ``` 463 | Adjust `python` if need be as explained in the multi-gpu section above. 464 | 465 | The previous longer command will deliver traces for all python processes. 466 | 467 | If you're not getting anything, start with the basic debug like: 468 | 469 | ``` 470 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'date' 471 | ``` 472 | once you know you're talking to all the nodes, then you can progressively unravel the depth of calls, as in: 473 | 474 | ``` 475 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'date' 476 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -o python' 477 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) ' 478 | srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' 479 | ``` 480 | and at each stage check that the output makes sense - e.g. the 2nd and 3rd call you should be getting the PIDs of the processes. 481 | 482 | The following notes require `pip install deepspeed`. 483 | 484 | In one SLURM environment I also attempted using `pdsh` via `ds_ssh`, but somehow I wasn't able to run `py-spy` remotely - the main issue was that remote `ssh` command wasn't giving the same env as when I was logged in interactively via `ssh`. 
But if you have `sudo` access on the compute nodes then you could do: 485 | 486 | First prepare `hostfile`: 487 | ``` 488 | function makehostfile() { 489 | perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"}; 490 | $slots=8 if $slots==0; # workaround 8 gpu machines 491 | @nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}]; 492 | print map { "$b$_ slots=$slots\n" } @nodes' 493 | } 494 | makehostfile > hostfile 495 | ``` 496 | Adapt `$slots` to the number of gpus per node. You may have to adapt this script if your `scontrol` produces a different output. 497 | 498 | Now run the `py-spy` extraction command over all participating nodes: 499 | ``` 500 | ds_ssh -f hostfile "source ~/.pdshrc; ps aux | grep python | grep -v grep | grep `whoami` | awk '{print \$2}' | xargs -I {} sudo py-spy dump --pid {} " 501 | ``` 502 | 503 | 504 | 505 | #### Network-level hanging 506 | 507 | The hanging could be happening at the network level. `NCCL_DEBUG=INFO` can help here. 508 | 509 | Run the script with `NCCL_DEBUG=INFO` env var and try to study the outcome for obvious errors. It will tell you which device it's using, e.g.: 510 | ``` 511 | DeepWhite:21288:21288 [0] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0> 512 | ``` 513 | So it's using interface `enp67s0` over `192.168.50.21` 514 | 515 | Is your `192.168.50.21` firewalled? or is it somehow a misconfigured network device? 516 | 517 | Does it work if you use a loopback device `127.0.0.1`? 518 | ``` 519 | NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py 520 | ``` 521 | 522 | if not, see what other local network devices you have via `ifconfig` - try that instead of `lo` if any. 523 | 524 | It's currently using `enp67s0` in the above example. 525 | 526 | 527 | #### Isolate problematic GPUs 528 | 529 | You can also try to see if only some GPUs fail 530 | 531 | For example, does it work if you use the first 2 or the last 2 gpus: 532 | 533 | ``` 534 | CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 535 | ``` 536 | then the 2nd pair: 537 | ``` 538 | CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 539 | ``` 540 | 541 | 542 | #### python `trace` 543 | 544 | Now what happens when the training doesn't just hang, but the hanging process stops responding? e.g. this happens when there is a serious hardware issue. But what if it is recurrent and `py-spy` won't help here, since it won't be able to attach to a process that is not responding. 545 | 546 | So next came the idea of tracing all calls like one does with `strace(1)`, I researched python calls tracing facilities and have discovered that python has a `trace` sub-system. 547 | 548 | The following code will trace all python calls and log them to the console and into a dedicated per process log file, via a custom `Tee` module I added. 549 | 550 | This then can help to understand where some processes stopped responding, since we will have the log of the last call and all the previous calls before it went unresponsive. 551 | 552 | ``` 553 | $ cat train.py 554 | [...] 555 | 556 | def main(): 557 | # [...] 558 | train() 559 | 560 | import re 561 | class Tee: 562 | """ 563 | A helper class to tee print's output into a file. 
564 | Usage: 565 | sys.stdout = Tee(filename) 566 | """ 567 | 568 | def __init__(self, filename): 569 | self.stdout = sys.stdout 570 | self.file = open(filename, "a") 571 | 572 | def __getattr__(self, attr): 573 | return getattr(self.stdout, attr) 574 | 575 | def write(self, msg): 576 | self.stdout.write(msg) 577 | self.file.write(msg) 578 | self.file.flush() 579 | 580 | def flush(self): 581 | self.stdout.flush() 582 | self.file.flush() 583 | 584 | if __name__ == "__main__": 585 | 586 | import sys 587 | import trace 588 | import socket 589 | import os 590 | 591 | # enable the trace 592 | if 0: 593 | cwd = os.path.realpath('.') 594 | pid = os.getpid() 595 | hostname = socket.gethostname() 596 | local_rank = int(os.environ["LOCAL_RANK"]) 597 | trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt" 598 | 599 | # create a Trace object, telling it what to ignore, and whether to 600 | # do tracing or line-counting or both. 601 | tracer = trace.Trace( 602 | ignoredirs=[sys.prefix, sys.exec_prefix], 603 | trace=1, 604 | count=1, 605 | timing=True, 606 | ) 607 | 608 | # run the new command using the given tracer 609 | sys.stdout = Tee(trace_output_file) 610 | tracer.run('main()') 611 | else: 612 | main() 613 | 614 | ``` 615 | 616 | This code doesn't require any special handing other than enabling the trace by changing `if 0` to `if 1`. 617 | 618 | If you don't set `ignoredirs`, this will now dump all python calls. Which means expect a lot of GBs of data logged, especially if you have hundreds of GPUs. 619 | 620 | Of course, you don't have to start tracing from `main` - if you suspect a specific are you can start tracing there instead and it'll be much faster and less data to save. 621 | 622 | I wish I could tell `trace` which packages to follow, but alas it only supports dirs to ignore, which is much more difficult to set, and thus you end up with a lot more data than needrf. But still this is a super useful tool for debugging hanging processes. 623 | 624 | Also, your code will now run much much slower and the more packages you trace the slower it will become. 625 | 626 | ##### NicerTrace 627 | 628 | As `Trace` proved to provide very limited usability when debugging a complex multi-node multi-hour run crash, I have started on working on a better version of the `trace` 629 | 630 | You can find it here: [NicerTrace](./NicerTrace.py) 631 | 632 | I added multiple additional flags to the constructor and made the output much more useful. You fill find a full working example in that same file, just run: 633 | 634 | ``` 635 | python NicerTrace.py 636 | ``` 637 | and you should see: 638 | 639 | ``` 640 | trace/NicerTrace.py:1 641 | 0:00:00 : 1: trace/NicerTrace.py:185 main 642 | 0:00:00 NicerTrace.py: 186: img = Image.new("RGB", (4, 4)) 643 | PIL.Image:2896 new 644 | 0:00:00 Image.py: 2912: _check_size(size) 645 | PIL.Image:2875 _check_size 646 | 0:00:00 Image.py: 2883: if not isinstance(size, (list, tuple)): 647 | 0:00:00 Image.py: 2886: if len(size) != 2: 648 | 0:00:00 Image.py: 2889: if size[0] < 0 or size[1] < 0: 649 | ``` 650 | as you will see in the example I set: 651 | 652 | ``` 653 | packages_to_include=["PIL"], 654 | ``` 655 | so it'll trace `PIL` plus anything that is not under `site-packages`. If you need to trace another package, just add it to that list. 656 | 657 | This is a very fresh work-in-progress package, so it's evolving as we are trying to make it help us resolve a very complex crashing situation. 
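
Meanwhile, if all you need is call-level tracing restricted to a few packages, a rough `sys.settrace`-based sketch like the following may be enough (the package list is just an example and needs adapting):

```
import sys
import time

PACKAGES_TO_INCLUDE = ["torch/distributed", "deepspeed"]  # adapt to your needs
start = time.time()

def call_tracer(frame, event, arg):
    if event == "call":
        path = frame.f_code.co_filename
        if any(p in path for p in PACKAGES_TO_INCLUDE):
            print(f"{time.time()-start:9.3f}s {path}:{frame.f_lineno} {frame.f_code.co_name}", flush=True)
    return None  # returning None skips per-line tracing, which keeps the overhead lower

sys.settrace(call_tracer)
# main()  # now run your entry point
```
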
658 | 659 | 660 | ##### Working with generated trace files 661 | 662 | When the per-node-rank trace files has been generated the following might be helpful to quickly analyse the situation: 663 | 664 | 665 | - grep for a specific match and also print the file and line number where it was found: 666 | 667 | ``` 668 | grep -n "backward" trace* 669 | ``` 670 | 671 | - show `tail -1` of all trace files followed by the name of each file: 672 | 673 | ``` 674 | find . -name "trace*" -exec sh -c 'echo "$1: $(tail -3 "$1")"' _ {} \; 675 | ``` 676 | 677 | - or similar to the above, but print 5 last lines with the leading filename and some vertical white space for an easier reading: 678 | 679 | ``` 680 | find . -name "trace*" -exec sh -c 'echo; echo $1; echo "$(tail -5 "$1")"' _ {} \; 681 | ``` 682 | 683 | - count how many times grep matched a given pattern in each ifle and print the matched file (in this example matching the pattern `backward`): 684 | 685 | ``` 686 | find . -name "trace*" -exec sh -c 'echo "$1: $(grep "backward" $1 | wc -l)"' _ {} \; 687 | ``` 688 | 689 | 690 | #### good old `print` 691 | 692 | Now once you discovered where the hanging happens to further understand why this is happening, a debugger would ideally be used, but more often than not debugging multi-process (multi-node) issues can be very difficult. 693 | 694 | In such situations a good old `print` works. You just need to add some debug prints before the calls where things hang, things that would help understand what lead to the deadlock. For example, some `barrier` was missing and one or a few processes skipped some code and while the rest of processes are still blocking waiting for everybody to send some data (for example in NCCL collective functions like `gather` or `reduce`). 695 | 696 | You of course, want to prefix each print with the rank of the process so that you could tell which is which. For example: 697 | 698 | ``` 699 | import torch.distributed as dist 700 | print(f"{dist.get_rank()}: passed stage 0") 701 | ``` 702 | 703 | What you will quickly discover is that if you have multiple GPUs these prints will be badly interleaved and you will have a hard time making sense of the debug data. So let's fix this. We are going to override `print` with a custom version of the same, but which uses `flock` to ensure that only one process can write to stdout at the same time. 704 | 705 | The helper module `printflock.py` is included [here](./printflock.py). To activate it just run this at the top of the module you're debugging: 706 | 707 | ``` 708 | from printflock import printflock as print 709 | ``` 710 | 711 | and now all your `print` calls in that module will magically be non-iterleaved. You can of course, just use `printflock` directly: 712 | 713 | ``` 714 | from printflock import printflock 715 | import torch.distributed as dist 716 | printflock(f"{dist.get_rank()}: passed stage 0") 717 | ``` 718 | 719 | 720 | #### Code loops 721 | 722 | Code loops can be tricky to debug in hanging scenarios. If you have code like the following: 723 | 724 | ``` 725 | for i, d in enumerate(data): 726 | some_hanging_call(d) 727 | ``` 728 | 729 | it's possible that one process hangs in the first iteration, and another process in the second iteration, which makes things very confusing. But the stack trace won't give such indication, as the line numbers would be the same, even though the processes aren't in the same place code progression-wise. 
730 | 731 | In such situations unroll the loop to be: 732 | ``` 733 | d_iter = iter(data) 734 | some_hanging_call(next(d_iter) 735 | some_hanging_call(next(d_iter) 736 | ``` 737 | and now when you run `py-spy` the line numbers will be correct. The processes hanging in the first iteration will report the first `some_hanging_call` and those in the second iteration in the second call - as each now has its own line. 738 | 739 | 740 | 741 | 742 | ## Hardware-specific issues 743 | 744 | Some AMD users may need to [Disable IOMMU](https://github.com/stas00/toolbox/issues/1#issuecomment-1076830400) 745 | -------------------------------------------------------------------------------- /debug/printflock.py: -------------------------------------------------------------------------------- 1 | """ License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """ 2 | 3 | # If you have ever done multi-gpu work and tried to `print` for debugging you quickly discovered 4 | # that some messages get interleaved and are impossible to make sense of. Especially so if you're 5 | # using `print` to debug values. 6 | # 7 | # This simple solution that uses the good old `flock` solves the interleaving problem. To use this 8 | # version of print you can either do: 9 | # 10 | # from printflock import printflock 11 | # import torch.distributed as dist 12 | # printflock(f"{dist.get_rank()}: my long debug message") 13 | # 14 | # or you can override `print` with a better one: 15 | # 16 | # from printflock import printflock as print 17 | # import torch.distributed as dist 18 | # print(f"{dist.get_rank()}: my long debug message") 19 | # 20 | 21 | import builtins 22 | import fcntl 23 | 24 | def printflock(*args, **kwargs): 25 | """ 26 | This is a wrapper around the built-in Python `print` which calls `flock` before calling 27 | `print` and unlocks it immediately after. This wrapper is useful for when each rank needs to 28 | print a message without getting it interleaved with prints from other ranks. 29 | The lock file is the file this wrapper is defined in. 30 | The output order will be random per rank. 31 | 32 | Example: 33 | >>> # assuming 4 GPUs 34 | >>> world_size = dist.get_world_size() 35 | >>> rank = dist.get_rank() 36 | >>> printflock(f"This is a very long message from rank {rank}/{world_size}") 37 | This is a very long message from rank 0/4 38 | This is a very long message from rank 2/4 39 | This is a very long message from rank 3/4 40 | This is a very long message from rank 1/4 41 | 42 | It can also be used to override normal `print`: 43 | 44 | from printflock import printflock as print 45 | 46 | and then you don't need to change anything in your code. 47 | """ 48 | 49 | with open(__file__, "r") as fh: 50 | fcntl.flock(fh, fcntl.LOCK_EX) 51 | try: 52 | builtins.print(*args, **kwargs) 53 | finally: 54 | fcntl.flock(fh, fcntl.LOCK_UN) 55 | -------------------------------------------------------------------------------- /debug/torch-distributed-gpu-test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """ 4 | 5 | # 6 | # This a `torch.distributed` diagnostics script that checks that all GPUs in the cluster (one or 7 | # many nodes) can talk to each other via nccl and allocate gpu memory. 
8 | # 9 | # To run first adjust the number of processes and nodes: 10 | # 11 | # python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 12 | # 13 | # You may need to add --master_addr $MASTER_ADDR --master_port $MASTER_PORT if using a custom addr:port 14 | # 15 | # You can also use the rdzv API: --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT --rdzv_backend c10d 16 | # 17 | # use torch.distributed.launch instead of torch.distributed.run for torch < 1.9 18 | # 19 | # If you get a hanging in `barrier` calls you have some network issues, you may try to debug this with: 20 | # 21 | # NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 22 | # 23 | # which should tell you what's going on behind the scenes. 24 | # 25 | # 26 | # This script can be run via `srun` in the SLURM environment as well. Here is a SLURM script that 27 | # runs on 2 nodes of 4 gpus per node: 28 | 29 | # #!/bin/bash 30 | # #SBATCH --job-name=test-nodes # name 31 | # #SBATCH --nodes=2 # nodes 32 | # #SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! 33 | # #SBATCH --cpus-per-task=10 # number of cores per tasks 34 | # #SBATCH --gres=gpu:4 # number of gpus 35 | # #SBATCH --time 0:05:00 # maximum execution time (HH:MM:SS) 36 | # #SBATCH --output=%x-%j.out # output file name 37 | # 38 | # export GPUS_PER_NODE=4 39 | # export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) 40 | # export MASTER_PORT=6000 41 | # 42 | # srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ 43 | # --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ 44 | # --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ 45 | # torch-distributed-gpu-test.py' 46 | # 47 | # can also add this for automatic prefixing of all logs with [hostname:rank] (in addition to `--master_addr` etc) 48 | # --role `hostname -s`: --tee 3 \ 49 | # 50 | 51 | import builtins 52 | import fcntl 53 | import os 54 | import socket 55 | import torch 56 | import torch.distributed as dist 57 | 58 | def print(*args, **kwargs): 59 | """ solves multi-process interleaved print problem """ 60 | with open(__file__, "r") as fh: 61 | fcntl.flock(fh, fcntl.LOCK_EX) 62 | try: 63 | builtins.print(*args, **kwargs) 64 | finally: 65 | fcntl.flock(fh, fcntl.LOCK_UN) 66 | 67 | local_rank = int(os.environ["LOCAL_RANK"]) 68 | torch.cuda.set_device(local_rank) 69 | device = torch.device("cuda", local_rank) 70 | hostname = socket.gethostname() 71 | 72 | gpu = f"[{hostname}-{local_rank}]" 73 | 74 | try: 75 | # test distributed 76 | dist.init_process_group("nccl") 77 | 78 | # global rank 79 | rank = dist.get_rank() 80 | world_size = dist.get_world_size() 81 | 82 | # reduction test 83 | t = torch.ones(1, device=device) 84 | dist.all_reduce(t, op=dist.ReduceOp.SUM) 85 | dist.barrier() 86 | print(f"{gpu} Reduction op=sum result: {t.item()}") 87 | 88 | # test cuda is available and can allocate memory 89 | torch.cuda.is_available() 90 | torch.ones(1).cuda(local_rank) 91 | 92 | print(f"{gpu} is OK (global rank: {rank}/{world_size})") 93 | 94 | dist.barrier() 95 | if rank == 0: 96 | print(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}") 97 | print(f"device compute capabilities={torch.cuda.get_device_capability()}") 98 | print(f"pytorch compute capabilities={torch.cuda.get_arch_list()}") 99 | 100 | except Exception: 101 | print(f"{gpu} is broken") 102 | raise 103 | 
-------------------------------------------------------------------------------- /dtype/README.md: -------------------------------------------------------------------------------- 1 | # Tensor precision / Data types 2 | 3 | ## Half and Mixed Precision 4 | 5 | fp16 6 | 7 | bf16 8 | 9 | mixed fp16 10 | 11 | mixed bf16 12 | 13 | 14 | ### General OPs 15 | 16 | `LayerNorm`-like operations must not do their work in half-precision, or they may lose a lot of data. Therefore when these operations are implemented correctly they do efficient internal work in fp32 and then their outputs are downcast to half-precision. Very often it's just the accumulation that is done in fp32, since adding up half-precision numbers is very lossy. 17 | 18 | example: 19 | 20 | ### Reduction collectives 21 | 22 | fp16: ok to do in fp16 if loss scaling is in place 23 | 24 | bf16: only ok in fp32 25 | 26 | ### Gradient accumulation 27 | 28 | best done in fp32 for both, but definitely for bf16 29 | 30 | 31 | ### Optimizer step / Vanishing gradients 32 | 33 | when adding a tiny gradient to a large number, that addition is often nullified 34 | 35 | fp32 master weights and fp32 optim states 36 | 37 | bf16 master weights and optim states can be done when using Kahan Summation and/or Stochastic rounding 38 | 39 | 40 | ## Using fp16-pretrained model in bf16 regime 41 | 42 | usually fails 43 | 44 | ## Using bf16-pretrained model in fp16 regime 45 | 46 | will lose some performance on conversion, but should work - best to finetune a bit 47 | -------------------------------------------------------------------------------- /hparams/README.md: -------------------------------------------------------------------------------- 1 | # Selecting Training Hyper-Parameters And Model Initializations 2 | 3 | ## Glossary 4 | 5 | Training jargon uses a multitude of abbreviations and terms, so here are some important for this chapter. 6 | 7 | - BS: Batch Size - here we mean batch size per gpu, often it is also referred to as MBS (micro-batch-size) 8 | - GBS: Global Batch Size - total batch size per iteration - may include gradient accumulation 9 | - GAS: Gradient Accumulation Steps - how many forward/backward cycles to perform before one full iteration is complete 10 | - TFLOPs: Trillion FLOPs per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS) 11 | - PP: Pipeline Parallelism 12 | 13 | ## Global Batch Size Ramp Up 14 | 15 | If you intend to train with a very large GBS, with say 1024, or 2048 samples and even higher, when you just start training, it's very wasteful to feed such large batch sizes to the model. At this point it's totally random and can't benefit from having too refined data. Therefore to save data and resources, one often ramps up the global batch size over some period of time. 16 | 17 | It's also important to not start with GBS that is too small, since otherwise the progress won't be efficient. When there is too little data the compute (TFLOPS) is inefficient and will slow everything down. This is especially so when Pipeline Parallelism (PP) is used, since the most important thing about PP tuneup is a small GPU idleness bubble, and the smaller the GBS the larger the bubble is. 18 | 19 | For example, for BLOOM-176B, where we did use PP, after doing throughput benchmarking we found that starting with GBS=16 was incredibly slow (8 TFLOPs), so we eventually started with GBS=192 (73 TFLOPs) and then we ramped up to GBS=2048 (150 TFLOPs) - we increased GBS by 16 every 9_765_625 samples. 
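
To make the ramp-up schedule concrete, here is a small sketch using the BLOOM-176B numbers quoted above (the function name and signature are mine):

```
def global_batch_size(consumed_samples, start_gbs=192, target_gbs=2048,
                      increment=16, samples_per_increment=9_765_625):
    """Return the GBS to use after `consumed_samples` samples have been seen."""
    steps = consumed_samples // samples_per_increment
    return min(target_gbs, start_gbs + steps * increment)

# e.g. global_batch_size(0) -> 192, global_batch_size(20_000_000) -> 224
```
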
20 | 21 | 22 | 23 | ### STD Init 24 | 25 | This hyper parameter is super-important and it requires math to get it right. For details see [STD Init](../instabilities#std-init). 26 | -------------------------------------------------------------------------------- /instabilities/README.md: -------------------------------------------------------------------------------- 1 | # Avoiding, Recovering From and Understanding Instabilities 2 | 3 | ## STD Init 4 | 5 | Correctly initializing the initial distribution of the tensors can have a tremendous impact on training's stability. The `std` value isn't fixed and depends on the hidden dimension size. 6 | 7 | This proved to be a very crucial setting in our pre-BLOOM 104B experiments and we couldn't break past the first few thousands iterations until we figured out that the 0.02 default `--init-method-std` in Megatron-LM was a way too big for our model. 8 | 9 | We referred to these two sources: 10 | 11 | 1. "Transformers without Tears" paper https://arxiv.org/abs/1910.05895 prescribes: `sqrt(2/(NHIDDEN*5))` 12 | 13 | 2. The 530B training paper https://arxiv.org/abs/2201.11990 they used an even smaller init formula: `sqrt(1/(NHIDDEN*3))` 14 | 15 | and decided to go with the 530B one as it leads to an even smaller init value. 16 | 17 | To make it easier to compare the two formulas, they can be rewritten as: 18 | 1. `sqrt(0.4000/NHIDDEN)` 19 | 2. `sqrt(0.3333/NHIDDEN)` 20 | 21 | Thus for `NHIDDEN=14336` the math was `sqrt(1/(14336*3)) = 0.00482` and that's what we used. It surely wasn't the only reason why we had no stability issues during BLOOM-176B training, but I think it was one of the crucial ones. 22 | 23 | 24 | ## Numerical instabilities 25 | 26 | Certain mathematical operations could be unstable when dealing with low precision numbers. 27 | 28 | For example, please see this very interesting [PyTorch guide on numerical stability](https://pytorch.org/docs/stable/notes/numerical_accuracy.html). 29 | 30 | Now let's look at a specific example of this concept in action. 31 | 32 | During 104B training experiments where fp16 mixed precision was used - the following improvement was proposed by [Corby Rosset](https://github.com/corbyrosset) to make [self-attention more stable](https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/118). 33 | 34 | Specifically this [line](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c839a8aa30731f71b3738d56009be9668508e366/megatron/model/transformer.py#L303) shows that the `norm_factor` may be multiplied after the Query * Key matrix multiplication. If the dim of Q and K are very large, the output may blow up and the `norm_factor` won't be able to save it. 35 | 36 | Proposal: move the `norm_factor` inward, so Q and K are scaled down before matrix multiply: 37 | ``` 38 | matmul_result = torch.baddbmm( 39 | matmul_result, 40 | 1.0/math.sqrt(self.norm_factor) * query_layer.transpose(0, 1), # [b * np, sq, hn] 41 | 1.0/math.sqrt(self.norm_factor) * key_layer.transpose(0, 1).transpose(1, 2), # [b * np, hn, sk] 42 | beta=0.0 if alibi is None else 1.0, alpha=1.0) 43 | 44 | # change view to [b, np, sq, sk] 45 | attention_scores = matmul_result.view(*output_size) 46 | ``` 47 | 48 | To make the operation mathematically equivalent, moving the norm factor inward requires taking sqrt again 49 | if n is a scalar, A and B matrices: 50 | ``` 51 | n * (A dot B) === (sqrt(n) * A) dot (sqrt(n) * B) 52 | ``` 53 | 54 | Now A and B dimensions can be significantly larger. 
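
Here is a tiny fp16 demo of that identity in action (assuming a CUDA GPU is available): scaling after the matmul overflows the half-precision range, whereas scaling the inputs first keeps the same mathematical result well within range:

```
import torch

n = 256
A = torch.full((1, n), 60.0, dtype=torch.float16, device="cuda")
B = torch.full((n, 1), 60.0, dtype=torch.float16, device="cuda")
alpha = 1.0 / n

print((A @ B) * alpha)                      # inf: 256*60*60=921600 overflows fp16 before scaling
print((alpha**0.5 * A) @ (alpha**0.5 * B))  # ~3600: same math, but the inputs are scaled first
```
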
55 | 56 | For CUDA kernel writers [CuBlas](https://docs.nvidia.com/cuda/cublas/index.html)'s `GemmStridedBatchedEx` at the time of this writing has a similar issue. It is defined as: 57 | 58 | ``` 59 | C+i*strideC=αop(A+i*strideA)op(B+i*strideB)+β(C+i*strideC), for i ∈[0,batchCount−1] 60 | ``` 61 | 62 | The issue is that `alpha` is multiplied after the matrix-matrix multiplication is done so it can cause instability. 63 | -------------------------------------------------------------------------------- /parallelism/README.md: -------------------------------------------------------------------------------- 1 | # Model Parallelism 2 | 3 | ## TP 4 | 5 | TP degree shouldn't span across nodes. 6 | -------------------------------------------------------------------------------- /resources/README.md: -------------------------------------------------------------------------------- 1 | # Resources 2 | 3 | 4 | ## Publicly available training logbooks 5 | 6 | The listing is in no particular order other than the year. 7 | 8 | ### 2021 9 | 10 | - BigScience pre-BLOOM 108B training experiments (2021): 11 | [chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide/chronicles.md) | 12 | [the full spec and discussions](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide) 13 | 14 | ### 2022 15 | 16 | - BigScience BLOOM-176B (2022): 17 | [chronicles-prequel](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles-prequel.md) | 18 | [chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md) | 19 | [the full spec and discussions](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/) 20 | 21 | - Meta OPT-175B (2022): 22 | [logbook](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles) | 23 | [Video](https://www.youtube.com/watch?v=p9IxoSkvZ-M) 24 | 25 | - THUDM GLM-130B (2022): [en logbook](https://github.com/THUDM/GLM-130B/blob/main/logs/main-log-en.md) | [Mandarin version](https://github.com/THUDM/GLM-130B/blob/main/logs/main-log.md) 26 | 27 | ### 2023 28 | 29 | - HuggingFace m4 80B multimodal (Flamingo repro) (2023): [Learning log](https://docs.google.com/document/d/1ZNGyVWYFUbzV0xuei4SED2QAakGjMpaaQALcKYQm46U/edit) | [Training Logbook](https://github.com/huggingface/m4-logbook/tree/main/tr-190-80b) 30 | 31 | - BloombergGPT 50B LLM - section C in [BloombergGPT: A Large Language Model for Finance](https://arxiv.org/abs/2303.17564) 32 | -------------------------------------------------------------------------------- /slurm/README.md: -------------------------------------------------------------------------------- 1 | # Working in SLURM Environment 2 | 3 | Unless you're lucky and you have a dedicated cluster that is completely under your control chances are that you will have to use SLURM to timeshare the GPUs with others. But, often, if you train at HPC, and you're given a dedicated partition you still will have to use SLURM. 4 | 5 | This document will not try to teach you SLURM as there are many manuals out there, but we will cover some specific nuances that are useful to help in the training process. 6 | 7 | 8 | ## Crontab Emulation 9 | 10 | One of the most important Unix tools is the crontab, which is essential for being able to schedule various jobs. It however usually is absent from SLURM environment. Therefore one must emulate it. Here is how. 
11 | 12 | For this presentation we are going to use `$WORK/cron/` as the base directory. And that you have an exported environment variable `WORK` pointing to some location on your filesystem - if you use Bash you can set it up in your `~/.bash_profile` or if a different shell is used use whatever startup equivalent file is. 13 | 14 | 15 | ### 1. A self-perpetuating scheduler job 16 | 17 | We will use `$WORK/cron/scheduler` dir for scheduler jobs, `$WORK/cron/cron.daily` for daily jobs and `$WORK/cron/cron.hourly` for hourly jobs: 18 | 19 | ``` 20 | $ mkdir -p $WORK/cron/scheduler 21 | $ mkdir -p $WORK/cron/cron.daily 22 | $ mkdir -p $WORK/cron/cron.hourly 23 | ``` 24 | 25 | Now copy these two slurm script in `$WORK/cron/scheduler`: 26 | - [cron-daily.slurm](cron-daily.slurm) 27 | - [cron-hourly.slurm](cron-hourly.slurm) 28 | 29 | after editing those to fit your specific environment's account and partition information. 30 | 31 | Now you can launch the crontab scheduler jobs: 32 | 33 | ``` 34 | $ cd $WORK/cron/scheduler 35 | $ sbatch cron-hourly.slurm 36 | $ sbatch cron-daily.slurm 37 | ``` 38 | 39 | This is it, these jobs will now self-perpetuate and usually you don't need to think about it again unless there is an even that makes SLURM lose all its jobs. 40 | 41 | ### 2. Daily and Hourly Cronjobs 42 | 43 | Now whenever you want some job to run once a day, you simply create a slurm job and put it into the `$WORK/cron/cron.daily` dir. 44 | 45 | Here is an example job that runs daily to update the `mlocate` file index: 46 | ``` 47 | $ cat $WORK/cron/cron.daily/mlocate-update.slurm 48 | #!/bin/bash 49 | #SBATCH --job-name=mlocate-update # job name 50 | #SBATCH --ntasks=1 # number of MP tasks 51 | #SBATCH --nodes=1 52 | #SBATCH --hint=nomultithread # we get physical cores not logical 53 | #SBATCH --time=1:00:00 # maximum execution time (HH:MM:SS) 54 | #SBATCH --output=%x-%j.out # output file name 55 | #SBATCH --partition=PARTITION # edit me 56 | #SBATCH --account=GROUP@PARTITION # edit me 57 | 58 | set -e 59 | date 60 | echo "updating mlocate db" 61 | /usr/bin/updatedb -o $WORK/lib/mlocate/work.db -U $WORK --require-visibility 0 62 | ``` 63 | 64 | This builds an index of the files under `$WORK` which you can then quickly query with: 65 | ``` 66 | /usr/bin/locate -d $WORK/lib/mlocate/work.db pattern 67 | ``` 68 | 69 | To stop running this job, just move it out of the `$WORK/cron/cron.daily` dir. 70 | 71 | The same principle applies to jobs placed into the `$WORK/cron/cron.hourly` dir. These are useful for running something every hour. 72 | 73 | Please note that this crontab implementation is approximate timing-wise, due to various delays in SLURM scheduling they will run approximately every hour and every day. You can recode these to ask SLURM to start something at a more precise time if you have to, but most of the time the just presented method works fine. 74 | 75 | Additionally, you can code your own variations to meet specific needs of your project, e.g., every-30min or every-12h jobs. 76 | 77 | 78 | ### 3. Cleanup 79 | 80 | Finally, since every cron launcher job will leave behind a log file (which is useful if for some reason things don't work), you want to create a cronjob to clean up these logs. Otherwise you may run out of inodes - these logs files are tiny, but there could be tens of thousands of those. 81 | 82 | You could use something like this in a daily job. 

```
find $WORK/cron -name "*.out" -mtime +7 -exec rm -f {} +
```
Please note that it's set to only delete files that are older than 7 days, in case you need the latest logs for diagnostics.


### Nuances

The scheduler runs with the Unix permissions of the person who launched the SLURM cron scheduler job, and so do all the other SLURM scripts launched by that cron job.



## Overcoming The lack of group SLURM job ownership

SLURM runs on Unix, but surprisingly its designers haven't adopted the concept of group ownership with regard to SLURM jobs. So if a member of your team started an array of 10 jobs 20h each, and went on vacation - unless you have `sudo` access you now can't do anything to stop those jobs if something is wrong.

I've yet to find out why this is so, but so far we have been using a kill switch workaround. You have to code it in your framework. For example, see how it was implemented in [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/e52bdabbde3c6895aceb76c1bced295c2646121f/megatron/training.py#L104) (Meg-DS). The program polls for a path on the filesystem that was pre-configured at startup, and if it finds a file there, it exits.

So if we start Meg-DS with `--kill-switch-path $WORK/tmp/training17-kill-switch`, then at any point we need to kill the SLURM job, we simply do:

```
touch $WORK/tmp/training17-kill-switch
```
and the next time the program gets to check for this file it'll detect the event and will exit voluntarily. If you have a job array, well, you will have to wait until each job starts, detects the kill switch and exits.

Of course, don't forget to remove it when you're done stopping the jobs.
```
rm $WORK/tmp/training17-kill-switch
```

Now, this doesn't always work. If the job is hanging, it'll never come to the point of checking for the kill switch, and then the only solution is to contact the sysadmins to kill the job for you. If the hang is a simple one, pytorch's distributed setup will typically auto-exit after a preset timeout of 30 minutes, but that doesn't always work either.
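
For reference, the check on the training side can be as simple as this sketch (the function name is mine, not Meg-DS's exact API):

```
import os
import sys

def exit_if_kill_switch_set(kill_switch_path):
    """Call this periodically, e.g. once per training iteration."""
    if kill_switch_path and os.path.exists(kill_switch_path):
        print(f"detected kill switch at {kill_switch_path}, exiting", flush=True)
        sys.exit(0)

# inside the training loop:
# exit_if_kill_switch_set(args.kill_switch_path)
```
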
116 | -------------------------------------------------------------------------------- /slurm/cron-daily.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cron-daily # job name 3 | #SBATCH --ntasks=1 # number of MP tasks 4 | #SBATCH --nodes=1 5 | #SBATCH --hint=nomultithread # we get physical cores not logical 6 | #SBATCH --time=0:30:00 # maximum execution time (HH:MM:SS) 7 | #SBATCH --output=%x-%j.out # output file name 8 | #SBATCH --partition=PARTITION # edit me 9 | #SBATCH --account=GROUP@PARTITION # edit me 10 | 11 | # do not set -e - we must run all of it 12 | # set -x -e 13 | 14 | cd $WORK/cron/scheduler 15 | 16 | # ensure to restart self first 17 | sbatch --begin=now+24hour cron-daily.slurm 18 | 19 | # now launch any slurm scripts in cron.daily 20 | cd $WORK/cron/cron.daily 21 | for f in *.slurm; do 22 | sbatch "$f" 23 | done 24 | -------------------------------------------------------------------------------- /slurm/cron-hourly.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=cron-hourly # job name 3 | #SBATCH --ntasks=1 # number of MP tasks 4 | #SBATCH --nodes=1 5 | #SBATCH --hint=nomultithread # we get physical cores not logical 6 | #SBATCH --time=0:30:00 # maximum execution time (HH:MM:SS) 7 | #SBATCH --output=%x-%j.out # output file name 8 | #SBATCH --partition=PARTITION # edit me 9 | #SBATCH --account=GROUP@PARTITION # edit me 10 | 11 | # do not set -e - we must run all of it 12 | # set -x -e 13 | 14 | cd $WORK/cron/scheduler 15 | 16 | # ensure to restart self first 17 | sbatch --begin=now+1hour cron-hourly.slurm 18 | 19 | # now launch any slurm scripts in cron.hourly 20 | cd $WORK/cron/cron.hourly 21 | for f in *.slurm; do 22 | sbatch "$f" 23 | done 24 | -------------------------------------------------------------------------------- /throughput/README.md: -------------------------------------------------------------------------------- 1 | # How to Maximize Training Throughput 2 | 3 | The faster you can make your model to train the sooner the model will finish training, which is important not only to being first to publish something, but also potentially saving a lot of money. 4 | 5 | In general maximizing throughput is all about running many experiments and measuring the outcome and chosing the one that is superior. 6 | 7 | In certain situations your modeling team may ask you to choose some hyper parameters that will be detrimental to throughput but overall beneficial for the overall model's success. 8 | 9 | ## Crucial reproducibility requirements 10 | 11 | The most important requirements for a series of successful experiments is to be able to reproduce the experiment environment again and again while changing only one or a few setup variables. 12 | 13 | Therefore when you try to figure out whether some change will improve performance or make it worse, you must figure out how to keep things stable. 14 | 15 | For example, you need to find a way to prevent the network usage from fluctuations. When we were doing performance optimizations for [108B pre-BLOOM experiments](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr8-104B-wide) it was close to impossible to perform, since we were on a shared internode network and the exact same setup would yield different throughput depending on how many other users used the network. It was not working. 
During BLOOM-176B we were given a dedicated SLURM partition with an isolated network where the only traffic was ours. Doing the performance optimization in such environment was just perfect. 16 | 17 | ## Network throughput 18 | 19 | It's critical to understand your particular model size and framework requirements with regard to network bandwidth, throughput and latency. If you underpay for network you will end up having idle gpus and thus you wasted money and time. If you overpay for very fast network, but your gpus are slow, then again you wasted money and time. 20 | 21 | If your network is very slow, your training is likely to be network-bound and many improvements in the training setup will not help with the improving performance. 22 | 23 | Here is a simple all-reduce benchmark that you can use to quickly measure the throughput of your internode network: 24 | 25 | [all_reduce_bench.py](./all_reduce_bench.py) 26 | 27 | Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of the nodes. 28 | 29 | To run it on 4 nodes 30 | 31 | ``` 32 | python -m torch.distributed.run --nproc_per_node=4 all_reduce_bench.py 33 | ``` 34 | 35 | You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps. 36 | 37 | Frameworks that shard weights and optim stages like [Deepspeed](https://github.com/microsoft/DeepSpeed) w/ ZeRO Stage-3 do a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run. 38 | 39 | Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage in parallel runs computations. So as long as the communication overhead is smaller than compute the network requirements are satisfied and don't have to be super fantastic. 40 | 41 | To get reasonable GPU throughput when training at scale (64+GPUs) with DeepSpeed ZeRO Stage 3: 42 | 43 | 1. 100Gbps is not enough 44 | 2. 200-400 Gbps is ok 45 | 3. 800-1000 Gbps is ideal 46 | 47 | [full details](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491) 48 | 49 | 50 | ## TFLOPs as a performance metric 51 | 52 | Before you start optimizing the performance of your training setup you need a metric that you can use to see whether the throughput is improving or not. You can measure seconds per iteration, or iterations per second, or some other such timing, but there is a more useful metric that measures TFLOPs. 53 | 54 | footnote: TFLOPs: Trillion FLOPs per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS) 55 | 56 | Measuring TFLOPs is superior because without it you don't know whether you are close to the best performance that can be achieved or not. This measurement gives you an indication of how far you're from the peak performance reported by the hardware manufacturer. 57 | 58 | In this section I will use BLOOM's training for the examplification. We use 80GB A100 NVIDIA GPUs and we trained in mixed bf16 regime. 

## TFLOPS as a performance metric

Before you start optimizing the performance of your training setup you need a metric that tells you whether the throughput is improving or not. You can measure seconds per iteration, or iterations per second, or some other such timing, but there is a more useful metric: achieved TFLOPS.

footnote: TFLOPS: trillion floating-point operations per second - [FLOPS](https://en.wikipedia.org/wiki/FLOPS)

Measuring TFLOPS is superior because without it you don't know whether you are close to the best achievable performance or not. This measurement gives you an indication of how far you are from the peak performance reported by the hardware manufacturer.

In this section I will use BLOOM's training as the example. We used 80GB A100 NVIDIA GPUs and trained in the mixed bf16 regime. So let's look at the [A100 spec](https://www.nvidia.com/en-us/data-center/a100/) which tells us:

```
BFLOAT16 Tensor Core 312 TFLOPS
```

Therefore we now know that if we were to only run `matmul` on huge bf16 matrices without copying to and from the device we should get around 312 TFLOPS max.

Practically though, due to disk IO, communications, the overhead of copying data from GPU memory to the GPU compute units, and because we can't do everything in bf16 and at times have to do math in fp32 (or tf32), we can realistically expect about half of that. So ~155 TFLOPS would be an amazing sustainable throughput for a complex training setup spanning hundreds of GPUs.

When we first started tuning things up we were at <100 TFLOPS, and a few weeks later when we launched the training we managed to get 150 TFLOPS.

The important thing to notice here is that we knew we couldn't push it much further, and so there was no point in trying to optimize it even more.

So a general rule of thumb: if your training setup gets about 1/2 of the advertised peak performance, you're doing great. Don't let that stop you from beating this suggestion and getting even more efficient, though.

When calculating TFLOPS it's important to remember that the math is different if [gradient checkpointing](#gradient-checkpointing) is enabled, since when it's activated more compute is used and it needs to be taken into account.

For transformer models the following is an estimation formula which slightly under-reports the real TFLOPS:

TFLOPS: `model_size_in_B * 4 * 2 * seqlen * global_batch_size / (time_in_sec_per_iteration * total_gpus * 1e3)`

The factor of 4 applies when activation checkpointing is used, otherwise it is 3; for 100B+ models, activation checkpointing will always be on.

So the `3*2` is often called "model FLOPs" and the `4*2` - "hardware FLOPs".

```
perl -le '$ng=64; $ms=52; $gbs=1024; $sp=127; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
(ng = total gpus, ms = model size in B, gbs = global batch size, sp = seconds per iteration; a Python version follows at the end of this section)

The same formula with bash env vars, with GBS broken down into `mbs*dp*gas` (gas = pp_chunks); note that this variant assumes 4 GPUs per node, i.e. `$NNODES*4` = total gpus:
```
echo "($MSIZE*4*2*$SEQLEN*$MICRO_BATCH_SIZE*$DP_SIZE*$GAS)/($THROUGHPUT*$NNODES*4*1000)" | bc -l
```

The exact formula is in Equation 3 of Section 5.1 of the [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) paper. You can see the code [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/251).

footnote: for inference only it'd be: `24Bsh^2 + 4Bs^2h` floating point operations per layer.
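
Since the one-liners above are easy to mistype, here is the same estimate as a small Python helper (a convenience sketch, not one of the playbook's scripts; the function name is made up):

```
def tflops_per_gpu(model_size_in_B, seqlen, global_batch_size,
                   time_in_sec_per_iteration, total_gpus, checkpointing=True):
    """Estimated achieved TFLOPS per GPU; slightly under-reports the real number."""
    factor = 4 if checkpointing else 3   # 4*2 = "hardware FLOPs", 3*2 = "model FLOPs"
    flops_per_iteration = model_size_in_B * 1e9 * factor * 2 * seqlen * global_batch_size
    return flops_per_iteration / (time_in_sec_per_iteration * total_gpus * 1e12)

# the same numbers as the perl example: 52B model, 64 GPUs, GBS=1024, seqlen=2048,
# 127 secs/iteration -> ~107 TFLOPS per GPU
print(f"{tflops_per_gpu(52, 2048, 1024, 127, 64):.1f}")
```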

## Gradient checkpointing

This is only relevant for training.

Enabling gradient checkpointing allows one to trade speed for GPU memory. When this feature is activated, instead of remembering the outputs of, say, transformer blocks until the backward pass is done, those outputs are dropped. This frees up huge amounts of GPU memory. But, of course, a backward pass is not possible without the outputs of the forward pass, and thus they have to be recalculated.

This, of course, can vary from model to model, but typically one pays with about a 20-25% decrease in throughput. Since a huge amount of GPU memory is liberated, though, one can now increase the per-GPU batch size and thus improve the overall effective throughput of the system. In some cases this allows you to double or quadruple the batch size if you were already able to fit a small batch size without OOM.

Activation checkpointing and gradient checkpointing are two terms for the same methodology.

For example, in HF Transformers models you call `model.gradient_checkpointing_enable()` to activate it in your own trainer, or if you use the HF Trainer you'd activate it with `--gradient_checkpointing 1`.


## Gradient accumulation

Depending on the situation, using a large gradient accumulation can increase throughput: even though the only thing skipped on the intermediate micro-batches is the optimizer `step` (it runs only at the accumulation boundary), the saving can be quite significant (a toy sketch of the pattern follows at the end of this section). e.g. in this particular small setup I clocked a 20-30% speedup:

- [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004592231)
- [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537)

When using pipeline parallelism a very large gradient accumulation is a must to keep the [pipeline's bubble to the minimum](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism).
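
Here is a toy, self-contained sketch of what gradient accumulation boils down to (real trainers such as the HF Trainer implement this for you; under DDP the intermediate micro-batches would additionally be wrapped in `no_sync()` so gradients are only synchronized at the boundary):

```
import torch
from torch import nn

# toy model and synthetic data, just to make the sketch runnable
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
accumulation_steps = 16  # i.e. GAS; effective batch = micro_batch * GAS * DP degree

for step in range(64):
    inputs = torch.randn(8, 512)    # one micro-batch
    targets = torch.randn(8, 512)
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # keep gradient scale right
    loss.backward()                 # gradients accumulate in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()            # the expensive, skippable part runs only here
        optimizer.zero_grad()
```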

## Vector and matrix size divisibility


### Tile and wave quantization

XXX


### Number/size of Attention heads

XXX
--------------------------------------------------------------------------------
/throughput/all_reduce_bench.py:
--------------------------------------------------------------------------------
""" License: Apache License 2.0 https://www.apache.org/licenses/LICENSE-2.0.txt """

# this version has been derived from @jeffra's gist: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36
# which in turn is derived from https://github.com/NVIDIA/nccl-tests
#
# to run on 2 gpus of a single node:
# python -m torch.distributed.run --nproc_per_node=2 all_reduce_bench.py
# (for a multi-node launch see the accompanying README.md)
#
# the printed results are already n_gpu-agnostic (i.e. averaged for the world size)

import argparse
import fcntl
import os
import socket
import time
import torch
import torch.distributed as dist

TRIALS = 5

N = 500000
M = 2000

def printflock(*msgs):
    """ solves the interleaved-print problem: serializes prints from concurrent ranks via an exclusive file lock """
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

def timed_allreduce(mat, id):
    pre = time.perf_counter()
    dist.all_reduce(mat)
    # reading a value forces a device sync, so the async all_reduce is guaranteed
    # to have completed before the duration is measured
    printflock(f"ignore me {int(mat[0][0])}")
    duration = time.perf_counter() - pre
    tput = ((M*N*4*2)/duration)*8 # *2 is for send + receive, *8 converts bytes to bits
    size = M * N * 4 # 4 is the size of fp32 in bytes
    n = dist.get_world_size()
    busbw = (size / duration) * (2 * (n - 1) / n) * 8
    printflock(f"{id}:\n",
               f"duration: {duration:.4f} sec\n",
               f"algo throughput: {tput:.4f} bps, {tput/1e9:.4f} Gbps\n",
               f"busbw: {busbw / 1e9:.4f} Gbps"
    )

def run(local_rank):
    hostname = socket.gethostname()
    id = f"{hostname}:{local_rank}"
    global_rank = dist.get_rank()

    printflock(f"{id} data size: {M*N*4/1e9} GB")
    mat = torch.rand(N, M, dtype=torch.float32).cuda(local_rank)

    for i in range(TRIALS):
        dist.barrier()
        if global_rank == 0:
            print(f"\n\n\n-----------trial-{i}----------------")
        timed_allreduce(mat, id)

def init_processes(local_rank, fn, backend='nccl'):
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend)
    fn(local_rank)


if __name__ == "__main__":
    rank = int(os.environ["LOCAL_RANK"])
    printflock("local_rank: %d" % rank)
    init_processes(local_rank=rank, fn=run)
--------------------------------------------------------------------------------