├── assets
    ├── isolation_plot.png
    ├── bad_search_space.png
    ├── good_search_space.png
    ├── gradient_clipping.png
    ├── more_frequent_evals.png
    ├── stride_instability.png
    ├── beneficial_effect_warmup.png
    ├── have_we_sampled_enough.png
    ├── instability_during_warmup.png
    ├── axis_model_with_instability.png
    └── loss_model_with_instability.png
├── CITATION.bib
├── CONTRIBUTING.md
├── LICENSE
└── README.md


/assets/isolation_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/isolation_plot.png


--------------------------------------------------------------------------------
/assets/bad_search_space.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/bad_search_space.png


--------------------------------------------------------------------------------
/assets/good_search_space.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/good_search_space.png


--------------------------------------------------------------------------------
/assets/gradient_clipping.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/gradient_clipping.png


--------------------------------------------------------------------------------
/assets/more_frequent_evals.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/more_frequent_evals.png


--------------------------------------------------------------------------------
/assets/stride_instability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/stride_instability.png


--------------------------------------------------------------------------------
/assets/beneficial_effect_warmup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/beneficial_effect_warmup.png


--------------------------------------------------------------------------------
/assets/have_we_sampled_enough.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/have_we_sampled_enough.png


--------------------------------------------------------------------------------
/assets/instability_during_warmup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/instability_during_warmup.png


--------------------------------------------------------------------------------
/assets/axis_model_with_instability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/axis_model_with_instability.png


--------------------------------------------------------------------------------
/assets/loss_model_with_instability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/tuning_playbook/main/assets/loss_model_with_instability.png


--------------------------------------------------------------------------------
/CITATION.bib:
--------------------------------------------------------------------------------
1 | @misc{tuningplaybookgithub,
2 |   author = {Varun Godbole and George E. Dahl and Justin Gilmer and Christopher J. Shallue and Zachary Nado},
3 |   title = {Deep Learning Tuning Playbook},
4 |   url = {https://github.com/google-research/tuning_playbook},
5 |   year = {2023},
6 |   note = "Version 1"
7 | }
8 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # How to Contribute
 2 | 
 3 | -   This is not an officially supported Google product.
 4 | 
 5 | -   We'd love to hear your feedback!
 6 | 
 7 |     -   If you like the playbook, please leave a star! Or email
 8 |         deep-learning-tuning-playbook \[at\] googlegroups.com. Testimonials help
 9 |         us justify creating more resources like this.
10 |     -   If anything seems incorrect, please file an issue to start a discussion.
11 |         For questions or other messages where an issue isn't appropriate, please
12 |         open a new discussion topic on GitHub.
13 | 
14 | -   As discussed in the preamble, this is a living document. We anticipate
15 |     making periodic improvements, both small and large. If you’d like to be
16 |     notified, please watch our repository (see instructions).
17 | 
18 | -   Please don't file a pull request without first coordinating with the authors
19 |     via the issue tracking system.
20 | 
21 | 
22 | ## Contributor License Agreement
23 | 
24 | Contributions to this project must be accompanied by a Contributor License
25 | Agreement (CLA). You (or your employer) retain the copyright to your
26 | contribution; this simply gives us permission to use and redistribute your
27 | contributions as part of the project. Head over to
28 | <https://cla.developers.google.com/> to see your current agreements on file or
29 | to sign a new one.
30 | 
31 | You generally only need to submit a CLA once, so if you've already submitted one
32 | (even if it was for a different project), you probably don't need to do it
33 | again.
34 | 
35 | ## Code Reviews
36 | 
37 | All submissions, including submissions by project members, require review. We
38 | use GitHub pull requests for this purpose. Consult
39 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
40 | information on using pull requests.
41 | 
42 | ## Community Guidelines
43 | 
44 | This project follows
45 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/).


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | Creative Commons Attribution 4.0 International Public License
  2 | 
  3 | By exercising the Licensed Rights (defined below), You accept and agree
  4 | to be bound by the terms and conditions of this Creative Commons
  5 | Attribution 4.0 International Public License ("Public License"). To the
  6 | extent this Public License may be interpreted as a contract, You are
  7 | granted the Licensed Rights in consideration of Your acceptance of
  8 | these terms and conditions, and the Licensor grants You such rights in
  9 | consideration of benefits the Licensor receives from making the
 10 | Licensed Material available under these terms and conditions.
 11 | 
 12 | 
 13 | Section 1 -- Definitions.
 14 | 
 15 |   a. Adapted Material means material subject to Copyright and Similar
 16 |      Rights that is derived from or based upon the Licensed Material
 17 |      and in which the Licensed Material is translated, altered,
 18 |      arranged, transformed, or otherwise modified in a manner requiring
 19 |      permission under the Copyright and Similar Rights held by the
 20 |      Licensor. For purposes of this Public License, where the Licensed
 21 |      Material is a musical work, performance, or sound recording,
 22 |      Adapted Material is always produced where the Licensed Material is
 23 |      synched in timed relation with a moving image.
 24 | 
 25 |   b. Adapter's License means the license You apply to Your Copyright
 26 |      and Similar Rights in Your contributions to Adapted Material in
 27 |      accordance with the terms and conditions of this Public License.
 28 | 
 29 |   c. Copyright and Similar Rights means copyright and/or similar rights
 30 |      closely related to copyright including, without limitation,
 31 |      performance, broadcast, sound recording, and Sui Generis Database
 32 |      Rights, without regard to how the rights are labeled or
 33 |      categorized. For purposes of this Public License, the rights
 34 |      specified in Section 2(b)(1)-(2) are not Copyright and Similar
 35 |      Rights.
 36 | 
 37 |   d. Effective Technological Measures means those measures that, in the
 38 |      absence of proper authority, may not be circumvented under laws
 39 |      fulfilling obligations under Article 11 of the WIPO Copyright
 40 |      Treaty adopted on December 20, 1996, and/or similar international
 41 |      agreements.
 42 | 
 43 |   e. Exceptions and Limitations means fair use, fair dealing, and/or
 44 |      any other exception or limitation to Copyright and Similar Rights
 45 |      that applies to Your use of the Licensed Material.
 46 | 
 47 |   f. Licensed Material means the artistic or literary work, database,
 48 |      or other material to which the Licensor applied this Public
 49 |      License.
 50 | 
 51 |   g. Licensed Rights means the rights granted to You subject to the
 52 |      terms and conditions of this Public License, which are limited to
 53 |      all Copyright and Similar Rights that apply to Your use of the
 54 |      Licensed Material and that the Licensor has authority to license.
 55 | 
 56 |   h. Licensor means the individual(s) or entity(ies) granting rights
 57 |      under this Public License.
 58 | 
 59 |   i. Share means to provide material to the public by any means or
 60 |      process that requires permission under the Licensed Rights, such
 61 |      as reproduction, public display, public performance, distribution,
 62 |      dissemination, communication, or importation, and to make material
 63 |      available to the public including in ways that members of the
 64 |      public may access the material from a place and at a time
 65 |      individually chosen by them.
 66 | 
 67 |   j. Sui Generis Database Rights means rights other than copyright
 68 |      resulting from Directive 96/9/EC of the European Parliament and of
 69 |      the Council of 11 March 1996 on the legal protection of databases,
 70 |      as amended and/or succeeded, as well as other essentially
 71 |      equivalent rights anywhere in the world.
 72 | 
 73 |   k. You means the individual or entity exercising the Licensed Rights
 74 |      under this Public License. Your has a corresponding meaning.
 75 | 
 76 | 
 77 | Section 2 -- Scope.
 78 | 
 79 |   a. License grant.
 80 | 
 81 |        1. Subject to the terms and conditions of this Public License,
 82 |           the Licensor hereby grants You a worldwide, royalty-free,
 83 |           non-sublicensable, non-exclusive, irrevocable license to
 84 |           exercise the Licensed Rights in the Licensed Material to:
 85 | 
 86 |             a. reproduce and Share the Licensed Material, in whole or
 87 |                in part; and
 88 | 
 89 |             b. produce, reproduce, and Share Adapted Material.
 90 | 
 91 |        2. Exceptions and Limitations. For the avoidance of doubt, where
 92 |           Exceptions and Limitations apply to Your use, this Public
 93 |           License does not apply, and You do not need to comply with
 94 |           its terms and conditions.
 95 | 
 96 |        3. Term. The term of this Public License is specified in Section
 97 |           6(a).
 98 | 
 99 |        4. Media and formats; technical modifications allowed. The
100 |           Licensor authorizes You to exercise the Licensed Rights in
101 |           all media and formats whether now known or hereafter created,
102 |           and to make technical modifications necessary to do so. The
103 |           Licensor waives and/or agrees not to assert any right or
104 |           authority to forbid You from making technical modifications
105 |           necessary to exercise the Licensed Rights, including
106 |           technical modifications necessary to circumvent Effective
107 |           Technological Measures. For purposes of this Public License,
108 |           simply making modifications authorized by this Section 2(a)
109 |           (4) never produces Adapted Material.
110 | 
111 |        5. Downstream recipients.
112 | 
113 |             a. Offer from the Licensor -- Licensed Material. Every
114 |                recipient of the Licensed Material automatically
115 |                receives an offer from the Licensor to exercise the
116 |                Licensed Rights under the terms and conditions of this
117 |                Public License.
118 | 
119 |             b. No downstream restrictions. You may not offer or impose
120 |                any additional or different terms or conditions on, or
121 |                apply any Effective Technological Measures to, the
122 |                Licensed Material if doing so restricts exercise of the
123 |                Licensed Rights by any recipient of the Licensed
124 |                Material.
125 | 
126 |        6. No endorsement. Nothing in this Public License constitutes or
127 |           may be construed as permission to assert or imply that You
128 |           are, or that Your use of the Licensed Material is, connected
129 |           with, or sponsored, endorsed, or granted official status by,
130 |           the Licensor or others designated to receive attribution as
131 |           provided in Section 3(a)(1)(A)(i).
132 | 
133 |   b. Other rights.
134 | 
135 |        1. Moral rights, such as the right of integrity, are not
136 |           licensed under this Public License, nor are publicity,
137 |           privacy, and/or other similar personality rights; however, to
138 |           the extent possible, the Licensor waives and/or agrees not to
139 |           assert any such rights held by the Licensor to the limited
140 |           extent necessary to allow You to exercise the Licensed
141 |           Rights, but not otherwise.
142 | 
143 |        2. Patent and trademark rights are not licensed under this
144 |           Public License.
145 | 
146 |        3. To the extent possible, the Licensor waives any right to
147 |           collect royalties from You for the exercise of the Licensed
148 |           Rights, whether directly or through a collecting society
149 |           under any voluntary or waivable statutory or compulsory
150 |           licensing scheme. In all other cases the Licensor expressly
151 |           reserves any right to collect such royalties.
152 | 
153 | 
154 | Section 3 -- License Conditions.
155 | 
156 | Your exercise of the Licensed Rights is expressly made subject to the
157 | following conditions.
158 | 
159 |   a. Attribution.
160 | 
161 |        1. If You Share the Licensed Material (including in modified
162 |           form), You must:
163 | 
164 |             a. retain the following if it is supplied by the Licensor
165 |                with the Licensed Material:
166 | 
167 |                  i. identification of the creator(s) of the Licensed
168 |                     Material and any others designated to receive
169 |                     attribution, in any reasonable manner requested by
170 |                     the Licensor (including by pseudonym if
171 |                     designated);
172 | 
173 |                 ii. a copyright notice;
174 | 
175 |                iii. a notice that refers to this Public License;
176 | 
177 |                 iv. a notice that refers to the disclaimer of
178 |                     warranties;
179 | 
180 |                  v. a URI or hyperlink to the Licensed Material to the
181 |                     extent reasonably practicable;
182 | 
183 |             b. indicate if You modified the Licensed Material and
184 |                retain an indication of any previous modifications; and
185 | 
186 |             c. indicate the Licensed Material is licensed under this
187 |                Public License, and include the text of, or the URI or
188 |                hyperlink to, this Public License.
189 | 
190 |        2. You may satisfy the conditions in Section 3(a)(1) in any
191 |           reasonable manner based on the medium, means, and context in
192 |           which You Share the Licensed Material. For example, it may be
193 |           reasonable to satisfy the conditions by providing a URI or
194 |           hyperlink to a resource that includes the required
195 |           information.
196 | 
197 |        3. If requested by the Licensor, You must remove any of the
198 |           information required by Section 3(a)(1)(A) to the extent
199 |           reasonably practicable.
200 | 
201 |        4. If You Share Adapted Material You produce, the Adapter's
202 |           License You apply must not prevent recipients of the Adapted
203 |           Material from complying with this Public License.
204 | 
205 | 
206 | Section 4 -- Sui Generis Database Rights.
207 | 
208 | Where the Licensed Rights include Sui Generis Database Rights that
209 | apply to Your use of the Licensed Material:
210 | 
211 |   a. for the avoidance of doubt, Section 2(a)(1) grants You the right
212 |      to extract, reuse, reproduce, and Share all or a substantial
213 |      portion of the contents of the database;
214 | 
215 |   b. if You include all or a substantial portion of the database
216 |      contents in a database in which You have Sui Generis Database
217 |      Rights, then the database in which You have Sui Generis Database
218 |      Rights (but not its individual contents) is Adapted Material; and
219 | 
220 |   c. You must comply with the conditions in Section 3(a) if You Share
221 |      all or a substantial portion of the contents of the database.
222 | 
223 | For the avoidance of doubt, this Section 4 supplements and does not
224 | replace Your obligations under this Public License where the Licensed
225 | Rights include other Copyright and Similar Rights.
226 | 
227 | 
228 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
229 | 
230 |   a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
231 |      EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
232 |      AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
233 |      ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
234 |      IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
235 |      WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
236 |      PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
237 |      ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
238 |      KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
239 |      ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
240 | 
241 |   b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
242 |      TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
243 |      NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
244 |      INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
245 |      COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
246 |      USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
247 |      ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
248 |      DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
249 |      IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
250 | 
251 |   c. The disclaimer of warranties and limitation of liability provided
252 |      above shall be interpreted in a manner that, to the extent
253 |      possible, most closely approximates an absolute disclaimer and
254 |      waiver of all liability.
255 | 
256 | 
257 | Section 6 -- Term and Termination.
258 | 
259 |   a. This Public License applies for the term of the Copyright and
260 |      Similar Rights licensed here. However, if You fail to comply with
261 |      this Public License, then Your rights under this Public License
262 |      terminate automatically.
263 | 
264 |   b. Where Your right to use the Licensed Material has terminated under
265 |      Section 6(a), it reinstates:
266 | 
267 |        1. automatically as of the date the violation is cured, provided
268 |           it is cured within 30 days of Your discovery of the
269 |           violation; or
270 | 
271 |        2. upon express reinstatement by the Licensor.
272 | 
273 |      For the avoidance of doubt, this Section 6(b) does not affect any
274 |      right the Licensor may have to seek remedies for Your violations
275 |      of this Public License.
276 | 
277 |   c. For the avoidance of doubt, the Licensor may also offer the
278 |      Licensed Material under separate terms or conditions or stop
279 |      distributing the Licensed Material at any time; however, doing so
280 |      will not terminate this Public License.
281 | 
282 |   d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
283 |      License.
284 | 
285 | 
286 | Section 7 -- Other Terms and Conditions.
287 | 
288 |   a. The Licensor shall not be bound by any additional or different
289 |      terms or conditions communicated by You unless expressly agreed.
290 | 
291 |   b. Any arrangements, understandings, or agreements regarding the
292 |      Licensed Material not stated herein are separate from and
293 |      independent of the terms and conditions of this Public License.
294 | 
295 | 
296 | Section 8 -- Interpretation.
297 | 
298 |   a. For the avoidance of doubt, this Public License does not, and
299 |      shall not be interpreted to, reduce, limit, restrict, or impose
300 |      conditions on any use of the Licensed Material that could lawfully
301 |      be made without permission under this Public License.
302 | 
303 |   b. To the extent possible, if any provision of this Public License is
304 |      deemed unenforceable, it shall be automatically reformed to the
305 |      minimum extent necessary to make it enforceable. If the provision
306 |      cannot be reformed, it shall be severed from this Public License
307 |      without affecting the enforceability of the remaining terms and
308 |      conditions.
309 | 
310 |   c. No term or condition of this Public License will be waived and no
311 |      failure to comply consented to unless expressly agreed to by the
312 |      Licensor.
313 | 
314 |   d. Nothing in this Public License constitutes or may be interpreted
315 |      as a limitation upon, or waiver of, any privileges and immunities
316 |      that apply to the Licensor or You, including from the legal
317 |      processes of any jurisdiction or authority.
318 | 
319 | 
320 | =======================================================================
321 | 
322 | Creative Commons is not a party to its public
323 | licenses. Notwithstanding, Creative Commons may elect to apply one of
324 | its public licenses to material it publishes and in those instances
325 | will be considered the “Licensor.” The text of the Creative Commons
326 | public licenses is dedicated to the public domain under the CC0 Public
327 | Domain Dedication. Except for the limited purpose of indicating that
328 | material is shared under a Creative Commons public license or as
329 | otherwise permitted by the Creative Commons policies published at
330 | creativecommons.org/policies, Creative Commons does not authorize the
331 | use of the trademark "Creative Commons" or any other trademark or logo
332 | of Creative Commons without its prior written consent including,
333 | without limitation, in connection with any unauthorized modifications
334 | to any of its public licenses or any other arrangements,
335 | understandings, or agreements concerning use of licensed material. For
336 | the avoidance of doubt, this paragraph does not form part of the
337 | public licenses.
338 | 
339 | Creative Commons may be contacted at creativecommons.org.
340 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
   1 | # Deep Learning Tuning Playbook
   2 | 
   3 | *This is not an officially supported Google product.*
   4 | 
   5 | **Varun Godbole<sup>&dagger;</sup>, George E. Dahl<sup>&dagger;</sup>, Justin Gilmer<sup>&dagger;</sup>, Christopher J. Shallue<sup>&Dagger;</sup>, Zachary Nado<sup>&dagger;</sup>**
   6 | 
   7 | 
   8 | &dagger; Google Research, Brain Team
   9 | 
  10 | &Dagger; Harvard University
  11 | 
  12 | ## Table of Contents
  13 | 
  14 | -   [Who is this document for?](#who-is-this-document-for)
  15 | -   [Why a tuning playbook?](#why-a-tuning-playbook)
  16 | -   [Guide for starting a new project](#guide-for-starting-a-new-project)
  17 |     -   [Choosing the model architecture](#choosing-a-model-architecture)
  18 |     -   [Choosing the optimizer](#choosing-the-optimizer)
  19 |     -   [Choosing the batch size](#choosing-the-batch-size)
  20 |     -   [Choosing the initial configuration](#choosing-the-initial-configuration)
  21 | -   [A scientific approach to improving model performance](#a-scientific-approach-to-improving-model-performance)
  22 |     -   [The incremental tuning strategy](#the-incremental-tuning-strategy)
  23 |     -   [Exploration vs exploitation](#exploration-vs-exploitation)
  24 |     -   [Choosing the goal for the next round of experiments](#choosing-the-goal-for-the-next-round-of-experiments)
  25 |     -   [Designing the next round of experiments](#Designing-the-next-round-of-experiments)
  26 |     -   [Determining whether to adopt a training pipeline change or
  27 |         hyperparameter
  28 |         configuration](#Determining-whether-to-adopt-a-training-pipeline-change-or-hyperparameter-configuration)
  29 |     -   [After exploration concludes](#After-exploration-concludes)
  30 | -   [Determining the number of steps for each training run](#Determining-the-number-of-steps-for-each-training-run)
  31 |     -   [Deciding how long to train when training is not compute-bound](#Deciding-how-long-to-train-when-training-is-not-compute-bound)
  32 |     -   [Deciding how long to train when training is compute-bound](#Deciding-how-long-to-train-when-training-is-compute-bound)
  33 | -   [Additional guidance for the training pipeline](#Additional-guidance-for-the-training-pipeline)
  34 |     -   [Optimizing the input pipeline](#Optimizing-the-input-pipeline)
  35 |     -   [Evaluating model performance](Evaluating-model-performance)
  36 |     -   [Saving checkpoints and retrospectively selecting the best checkpoint](#Saving-checkpoints-and-retrospectively-selecting-the-best-checkpoint)
  37 |     -   [Setting up experiment tracking](#Setting-up-experiment-tracking)
  38 |     -   [Batch normalization implementation details](#Batch-normalization-implementation-details)
  39 |     -   [Considerations for multi-host pipelines](#Considerations-for-multi-host-pipelines)
  40 | -   [FAQs](#faqs)
  41 | -   [Acknowledgments](#acknowledgments)
  42 | -   [Citing](#citing)
  43 | -   [Contributing](#contributing)
  44 | 
  45 | ## Who is this document for?
  46 | 
  47 | This document is for engineers and researchers (both individuals and teams)
  48 | interested in **maximizing the performance of deep learning models**. We assume
  49 | basic knowledge of machine learning and deep learning concepts.
  50 | 
  51 | Our emphasis is on the **process of hyperparameter tuning**. We touch on other
  52 | aspects of deep learning training, such as pipeline implementation and
  53 | optimization, but our treatment of those aspects is not intended to be complete.
  54 | 
  55 | We assume the machine learning problem is a supervised learning problem or
  56 | something that looks a lot like one (e.g. self-supervised). That said, some of
  57 | the prescriptions in this document may also apply to other types of problems.
  58 | 
  59 | ## Why a tuning playbook?
  60 | 
  61 | Currently, there is an astonishing amount of toil and guesswork involved in
  62 | actually getting deep neural networks to work well in practice. Even worse, the
  63 | actual recipes people use to get good results with deep learning are rarely
  64 | documented. Papers gloss over the process that led to their final results in
  65 | order to present a cleaner story, and machine learning engineers working on
  66 | commercial problems rarely have time to take a step back and generalize their
  67 | process. Textbooks tend to eschew practical guidance and prioritize fundamental
  68 | principles, even if their authors have the necessary experience in applied work
  69 | to provide useful advice. When preparing to create this document, we couldn't
  70 | find any comprehensive attempt to actually explain *how to get good results with
  71 | deep learning*. Instead, we found snippets of advice in blog posts and on social
  72 | media, tricks peeking out of the appendix of research papers, occasional case
  73 | studies about one particular project or pipeline, and a lot of confusion. There
  74 | is a vast gulf between the results achieved by deep learning experts and less
  75 | skilled practitioners using superficially similar methods. At the same time,
  76 | these very experts readily admit some of what they do might not be
  77 | well-justified. As deep learning matures and has a larger impact on the world,
  78 | the community needs more resources covering useful recipes, including all the
  79 | practical details that can be so critical for obtaining good results.
  80 | 
  81 | We are a team of five researchers and engineers who have worked in deep learning
  82 | for many years, some of us since as early as 2006. We have applied deep learning
  83 | to problems in everything from speech recognition to astronomy, and learned a
  84 | lot along the way. This document grew out of our own experience training neural
  85 | networks, teaching new machine learning engineers, and advising our colleagues
  86 | on the practice of deep learning. Although it has been gratifying to see deep
  87 | learning go from a machine learning approach practiced by a handful of academic
  88 | labs to a technology powering products used by billions of people, deep learning
  89 | is still in its infancy as an engineering discipline and we hope this document
  90 | encourages others to help systematize the field's experimental protocols.
  91 | 
  92 | This document came about as we tried to crystalize our own approach to deep
  93 | learning and thus it represents the opinions of the authors at the time of
  94 | writing, not any sort of objective truth. Our own struggles with hyperparameter
  95 | tuning made it a particular focus of our guidance, but we also cover other
  96 | important issues we have encountered in our work (or seen go wrong). Our
  97 | intention is for this work to be a living document that grows and evolves as our
  98 | beliefs change. For example, the material on debugging and mitigating training
  99 | failures would not have been possible for us to write two years ago since it is
 100 | based on recent results and ongoing investigations. Inevitably, some of our
 101 | advice will need to be updated to account for new results and improved
 102 | workflows. We do not know the *optimal* deep learning recipe, but until the
 103 | community starts writing down and debating different procedures, we cannot hope
 104 | to find it. To that end, we would encourage readers who find issues with our
 105 | advice to produce alternative recommendations, along with convincing evidence,
 106 | so we can update the playbook. We would also love to see alternative guides and
 107 | playbooks that might have different recommendations so we can work towards best
 108 | practices as a community. Finally, any sections marked with a 🤖 emoji are places
 109 | we would like to do more research. Only after trying to write this playbook did
 110 | it become completely clear how many interesting and neglected research questions
 111 | can be found in the deep learning practitioner's workflow.
 112 | 
 113 | ## Guide for starting a new project
 114 | 
 115 | Many of the decisions we make over the course of tuning can be made once at the
 116 | beginning of a project and only occasionally revisited when circumstances
 117 | change.
 118 | 
 119 | Our guidance below makes the following assumptions:
 120 | 
 121 | -   Enough of the essential work of problem formulation, data cleaning, etc. has
 122 |     already been done that spending time on the model architecture and training
 123 |     configuration makes sense.
 124 | -   There is already a pipeline set up that does training and evaluation, and it
 125 |     is easy to execute training and prediction jobs for various models of
 126 |     interest.
 127 | -   The appropriate metrics have been selected and implemented. These should be
 128 |     as representative as possible of what would be measured in the deployed
 129 |     environment.
 130 | 
 131 | ### Choosing the model architecture
 132 | 
 133 | ***Summary:*** *When starting a new project, try to reuse a model that already
 134 | works.*
 135 | 
 136 | -   Choose a well established, commonly used model architecture to get working
 137 |     first. It is always possible to build a custom model later.
 138 | -   Model architectures typically have various hyperparameters that determine
 139 |     the model's size and other details (e.g. number of layers, layer width, type
 140 |     of activation function).
 141 |     -   Thus, choosing the architecture really means choosing a family of
 142 |         different models (one for each setting of the model hyperparameters).
 143 |     -   We will consider the problem of choosing the model hyperparameters in
 144 |         [Choosing the initial configuration](#choosing-the-initial-configuration)
 145 |         and
 146 |         [A scientific approach to improving model performance](#a-scientific-approach-to-improving-model-performance).
 147 | -   When possible, try to find a paper that tackles something as close as
 148 |     possible to the problem at hand and reproduce that model as a starting
 149 |     point.
 150 | 
 151 | ### Choosing the optimizer
 152 | 
 153 | ***Summary:*** *Start with the most popular optimizer for the type of problem at
 154 | hand.*
 155 | 
 156 | -   No optimizer is the "best" across all types of machine learning problems and
 157 |     model architectures. Even just
 158 |     [comparing the performance of optimizers is a difficult task](https://arxiv.org/abs/1910.05446).
 159 |     🤖
 160 | -   We recommend sticking with well-established, popular optimizers, especially
 161 |     when starting a new project.
 162 |     -   Ideally, choose the most popular optimizer used for the same type of
 163 |         problem.
 164 | -   Be prepared to give attention to **\*****all****\*** hyperparameters of the
 165 |     chosen optimizer.
 166 |     -   Optimizers with more hyperparameters may require more tuning effort to
 167 |         find the best configuration.
 168 |     -   This is particularly relevant in the beginning stages of a project when
 169 |         we are trying to find the best values of various other hyperparameters
 170 |         (e.g. architecture hyperparameters) while treating optimizer
 171 |         hyperparameters as
 172 |         [nuisance parameters](#identifying-scientific-nuisance-and-fixed-hyperparameters).
 173 |     -   It may be preferable to start with a simpler optimizer (e.g. SGD with
 174 |         fixed momentum or Adam with fixed $\epsilon$, $\beta_{1}$, and
 175 |         $\beta_{2}$) in the initial stages of the project and switch to a more
 176 |         general optimizer later.
 177 | -   Well-established optimizers that we like include (but are not limited to):
 178 |     -   [SGD with momentum](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms)
 179 |         (we like the Nesterov variant)
 180 |     -   [Adam and NAdam](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms),
 181 |         which are more general than SGD with momentum. Note that Adam has 4
 182 |         tunable hyperparameters
 183 |         [and they can all matter](https://arxiv.org/abs/1910.05446)!
 184 |         -   See
 185 |             [How should Adam's hyperparameters be tuned?](#how-should-adams-hyperparameters-be-tuned)
 186 | 
 187 | ### Choosing the batch size
 188 | 
 189 | ***Summary:*** *The batch size governs the training speed and shouldn't be used
 190 | to directly tune the validation set performance. Often, the ideal batch size
 191 | will be the largest batch size supported by the available hardware.*
 192 | 
 193 | -   The batch size is a key factor in determining the *training time* and
 194 |     *computing resource consumption*.
 195 | -   Increasing the batch size will often reduce the training time. This can be
 196 |     highly beneficial because it, e.g.:
 197 |     -   Allows hyperparameters to be tuned more thoroughly within a fixed time
 198 |         interval, potentially resulting in a better final model.
 199 |     -   Reduces the latency of the development cycle, allowing new ideas to be
 200 |         tested more frequently.
 201 | -   Increasing the batch size may either decrease, increase, or not change the
 202 |     resource consumption.
 203 | -   The batch size should *not be* treated as a tunable hyperparameter for
 204 |     validation set performance.
 205 |     -   As long as all hyperparameters are well-tuned (especially the learning
 206 |         rate and regularization hyperparameters) and the number of training
 207 |         steps is sufficient, the same final performance should be attainable
 208 |         using any batch size (see
 209 |         [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
 210 |     -   Please see [Why shouldn't the batch size be tuned to directly improve
 211 |         validation set
 212 |         performance?](#why-shouldnt-the-batch-size-be-tuned-to-directly-improve-validation-set-performance)
 213 | 
 214 | #### Determining the feasible batch sizes and estimating training throughput
 215 | 
 216 | 
 217 | <details><summary><em>[Click to expand]</em></summary>
 218 | 
 219 | <br>
 220 | 
 221 | -   For a given model and optimizer, there will typically be a range of batch
 222 |     sizes supported by the available hardware. The limiting factor is usually
 223 |     accelerator memory.
 224 | -   Unfortunately, it can be difficult to calculate which batch sizes will fit
 225 |     in memory without running, or at least compiling, the full training program.
 226 | -   The easiest solution is usually to run training jobs at different batch
 227 |     sizes (e.g. increasing powers of 2) for a small number of steps until one of
 228 |     the jobs exceeds the available memory.
 229 | -   For each batch size, we should train for long enough to get a reliable
 230 |     estimate of the *training throughput*
 231 | 
 232 | <p align="center">training throughput = (# examples processed per second)</p>
 233 | 
 234 | <p align="center">or, equivalently, the <em>time per step</em>.</p>
 235 | 
 236 | <p align="center">time per step = (batch size) / (training throughput)</p>
 237 | 
 238 | -   When the accelerators aren't yet saturated, if the batch size doubles, the
 239 |     training throughput should also double (or at least nearly double).
 240 |     Equivalently, the time per step should be constant (or at least nearly
 241 |     constant) as the batch size increases.
 242 | -   If this is not the case then the training pipeline has a bottleneck such as
 243 |     I/O or synchronization between compute nodes. This may be worth diagnosing
 244 |     and correcting before proceeding.
 245 | -   If the training throughput increases only up to some maximum batch size,
 246 |     then we should only consider batch sizes up to that maximum batch size, even
 247 |     if a larger batch size is supported by the hardware.
 248 |     -   All benefits of using a larger batch size assume the training throughput
 249 |         increases. If it doesn't, fix the bottleneck or use the smaller batch
 250 |         size.
 251 |     -   **Gradient accumulation** simulates a larger batch size than the
 252 |         hardware can support and therefore does not provide any throughput
 253 |         benefits. It should generally be avoided in applied work.
 254 | -   These steps may need to be repeated every time the model or optimizer is
 255 |     changed (e.g. a different model architecture may allow a larger batch size
 256 |     to fit in memory).
 257 | 
 258 | </details>
 259 | 
 260 | #### Choosing the batch size to minimize training time
 261 | 
 262 | <details><summary><em>[Click to expand]</em></summary>
 263 | 
 264 | <br>
 265 | 
 266 | 
 267 | <p align="center">Training time = (time per step) x (total number of steps)</p>
 268 | 
 269 | -   We can often consider the time per step to be approximately constant for all
 270 |     feasible batch sizes. This is true when there is no overhead from parallel
 271 |     computations and all training bottlenecks have been diagnosed and corrected
 272 |     (see the
 273 |     [previous section](#determining-the-feasible-batch-sizes-and-estimating-training-throughput)
 274 |     for how to identify training bottlenecks). In practice, there is usually at
 275 |     least some overhead from increasing the batch size.
 276 | -   As the batch size increases, the total number of steps needed to reach a
 277 |     fixed performance goal typically decreases (provided all relevant
 278 |     hyperparameters are re-tuned when the batch size is changed;
 279 |     [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
 280 |     -   E.g. Doubling the batch size might halve the total number of steps
 281 |         required. This is called **perfect scaling**.
 282 |     -   Perfect scaling holds for all batch sizes up to a critical batch size,
 283 |         beyond which one achieves diminishing returns.
 284 |     -   Eventually, increasing the batch size no longer reduces the number of
 285 |         training steps (but never increases it).
 286 | -   Therefore, the batch size that minimizes training time is usually the
 287 |     largest batch size that still provides a reduction in the number of training
 288 |     steps required.
 289 |     -   This batch size depends on the dataset, model, and optimizer, and it is
 290 |         an open problem how to calculate it other than finding it experimentally
 291 |         for every new problem. 🤖
 292 |     -   When comparing batch sizes, beware the distinction between an example
 293 |         budget/[epoch](https://developers.google.com/machine-learning/glossary#epoch)
 294 |         budget (running all experiments while fixing the number of training
 295 |         example presentations) and a step budget (running all experiments with
 296 |         the number of training steps fixed).
 297 |         -   Comparing batch sizes with an epoch budget only probes the perfect
 298 |             scaling regime, even when larger batch sizes might still provide a
 299 |             meaningful speedup by reducing the number of training steps
 300 |             required.
 301 |     -   Often, the largest batch size supported by the available hardware will
 302 |         be smaller than the critical batch size. Therefore, a good rule of thumb
 303 |         (without running any experiments) is to use the largest batch size
 304 |         possible.
 305 | -   There is no point in using a larger batch size if it ends up increasing the
 306 |     training time.
 307 | 
 308 | </details>
 309 | 
 310 | #### Choosing the batch size to minimize resource consumption
 311 | 
 312 | <details><summary><em>[Click to expand]</em></summary>
 313 | 
 314 | <br>
 315 | 
 316 | 
 317 | -   There are two types of resource costs associated with increasing the batch
 318 |     size:
 319 |     1.  *Upfront costs*, e.g. purchasing new hardware or rewriting the training
 320 |         pipeline to implement multi-GPU / multi-TPU training.
 321 |     2.  *Usage costs*, e.g. billing against the team's resource budgets, billing
 322 |         from a cloud provider, electricity / maintenance costs.
 323 | -   If there are significant upfront costs to increasing the batch size, it
 324 |     might be better to defer increasing the batch size until the project has
 325 |     matured and it is easier to assess the cost-benefit tradeoff. Implementing
 326 |     multi-host parallel training programs can introduce
 327 |     [bugs](#considerations-for-multi-host-pipelines) and
 328 |     [subtle issues](#batch-normalization-implementation-details) so it is
 329 |     probably better to start off with a simpler pipeline anyway. (On the other
 330 |     hand, a large speedup in training time might be very beneficial early in the
 331 |     process when a lot of tuning experiments are needed).
 332 | -   We refer to the total usage cost (which may include multiple different kinds
 333 |     of costs) as the "resource consumption". We can break down the resource
 334 |     consumption into the following components:
 335 | 
 336 | <p align="center">Resource consumption = (resource consumption per step) x (total number of steps)</p>
 337 | 
 338 | -   Increasing the batch size usually allows us to
 339 |     [reduce the total number of steps](#choosing-the-batch-size-to-minimize-training-time).
 340 |     Whether the resource consumption increases or decreases will depend on how
 341 |     the consumption per step changes.
 342 |     -   Increasing the batch size might *decrease* the resource consumption. For
 343 |         example, if each step with the larger batch size can be run on the same
 344 |         hardware as the smaller batch size (with only a small increase in time
 345 |         per step), then any increase in the resource consumption per step might
 346 |         be outweighed by the decrease in the number of steps.
 347 |     -   Increasing the batch size might *not change* the resource consumption.
 348 |         For example, if doubling the batch size halves the number of steps
 349 |         required and doubles the number of GPUs used, the total consumption (in
 350 |         terms of GPU-hours) will not change.
 351 |     -   Increasing the batch size might *increase* the resource consumption. For
 352 |         example, if increasing the batch size requires upgraded hardware, the
 353 |         increase in consumption per step might outweigh the reduction in the
 354 |         number of steps.
 355 | 
 356 | </details>
 357 | 
 358 | #### Changing the batch size requires re-tuning most hyperparameters
 359 | 
 360 | <details><summary><em>[Click to expand]</em></summary>
 361 | 
 362 | <br>
 363 | 
 364 | 
 365 | -   The optimal values of most hyperparameters are sensitive to the batch size.
 366 |     Therefore, changing the batch size typically requires starting the tuning
 367 |     process all over again.
 368 | -   The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
 369 | -   Keep this in mind when choosing the batch size at the start of a project. If
 370 |     you need to switch to a different batch size later on, it might be
 371 |     difficult, time consuming, and expensive to re-tune everything for the new
 372 |     batch size.
 373 | 
 374 | </details>
 375 | 
 376 | #### How batch norm interacts with the batch size
 377 | 
 378 | <details><summary><em>[Click to expand]</em></summary>
 379 | 
 380 | <br>
 381 | 
 382 | 
 383 | -   Batch norm is complicated and, in general, should use a different batch size
 384 |     than the gradient computation to compute statistics. See the
 385 |     [batch norm section](#batch-normalization-implementation-details) for a
 386 |     detailed discussion.
 387 | 
 388 | </details>
 389 | 
 390 | ### Choosing the initial configuration
 391 | 
 392 | -   Before beginning hyperparameter tuning we must determine the starting point.
 393 |     This includes specifying (1) the model configuration (e.g. number of
 394 |     layers), (2) the optimizer hyperparameters (e.g. learning rate), and (3) the
 395 |     number of training steps.
 396 | -   Determining this initial configuration will require some manually configured
 397 |     training runs and trial-and-error.
 398 | -   Our guiding principle is to find a simple, relatively fast, relatively
 399 |     low-resource-consumption configuration that obtains a "reasonable" result.
 400 |     -   "Simple" means avoiding bells and whistles wherever possible; these can
 401 |         always be added later. Even if bells and whistles prove helpful down the
 402 |         road, adding them in the initial configuration risks wasting time tuning
 403 |         unhelpful features and/or baking in unnecessary complications.
 404 |         -   For example, start with a constant learning rate before adding fancy
 405 |             decay schedules.
 406 |     -   Choosing an initial configuration that is fast and consumes minimal
 407 |         resources will make hyperparameter tuning much more efficient.
 408 |         -   For example, start with a smaller model.
 409 |     -   "Reasonable" performance depends on the problem, but at minimum means
 410 |         that the trained model performs much better than random chance on the
 411 |         validation set (although it might be bad enough to not be worth
 412 |         deploying).
 413 | -   Choosing the number of training steps involves balancing the following
 414 |     tension:
 415 |     -   On the one hand, training for more steps can improve performance and
 416 |         makes hyperparameter tuning easier (see
 417 |         [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
 418 |     -   On the other hand, training for fewer steps means that each training run
 419 |         is faster and uses fewer resources, boosting tuning efficiency by
 420 |         reducing the time between cycles and allowing more experiments to be run
 421 |         in parallel. Moreover, if an unnecessarily large step budget is chosen
 422 |         initially, it might be hard to change it down the road, e.g. once the
 423 |         learning rate schedule is tuned for that number of steps.
 424 | 
 425 | ## A scientific approach to improving model performance
 426 | 
 427 | For the purposes of this document, the ultimate goal of machine learning
 428 | development is to maximize the utility of the deployed model. Even though many
 429 | aspects of the development process differ between applications (e.g. length of
 430 | time, available computing resources, type of model), we can typically use the
 431 | same basic steps and principles on any problem.
 432 | 
 433 | Our guidance below makes the following assumptions:
 434 | 
 435 | -   There is already a fully-running training pipeline along with a
 436 |     configuration that obtains a reasonable result.
 437 | -   There are enough computational resources available to conduct meaningful
 438 |     tuning experiments and run at least several training jobs in parallel.
 439 | 
 440 | ### The incremental tuning strategy
 441 | 
 442 | ***Summary:*** *Start with a simple configuration and incrementally make
 443 | improvements while building up insight into the problem. Make sure that any
 444 | improvement is based on strong evidence to avoid adding unnecessary complexity.*
 445 | 
 446 | -   Our ultimate goal is to find a configuration that maximizes the performance
 447 |     of our model.
 448 |     -   In some cases, our goal will be to maximize how much we can improve the
 449 |         model by a fixed deadline (e.g. submitting to a competition).
 450 |     -   In other cases, we want to keep improving the model indefinitely (e.g.
 451 |         continually improving a model used in production).
 452 | -   In principle, we could maximize performance by using an algorithm to
 453 |     automatically search the entire space of possible configurations, but this
 454 |     is not a practical option.
 455 |     -   The space of possible configurations is extremely large and there are
 456 |         not yet any algorithms sophisticated enough to efficiently search this
 457 |         space without human guidance.
 458 | -   Most automated search algorithms rely on a hand-designed *search space* that
 459 |     defines the set of configurations to search in, and these search spaces can
 460 |     matter quite a bit.
 461 | -   The most effective way to maximize performance is to start with a simple
 462 |     configuration and incrementally add features and make improvements while
 463 |     building up insight into the problem.
 464 |     -   We use automated search algorithms in each round of tuning and
 465 |         continually update our search spaces as our understanding grows.
 466 | -   As we explore, we will naturally find better and better configurations and
 467 |     therefore our "best" model will continually improve.
 468 |     -   We call it a *launch* when we update our best configuration (which may
 469 |         or may not correspond to an actual launch of a production model).
 470 |     -   For each launch, we must make sure that the change is based on strong
 471 |         evidence – not just random chance based on a lucky configuration – so
 472 |         that we don't add unnecessary complexity to the training pipeline.
 473 | 
 474 | At a high level, our incremental tuning strategy involves repeating the
 475 | following four steps:
 476 | 
 477 | 1.  Identify an appropriately-scoped goal for the next round of experiments.
 478 | 2.  Design and run a set of experiments that makes progress towards this goal.
 479 | 3.  Learn what we can from the results.
 480 | 4.  Consider whether to launch the new best configuration.
 481 | 
 482 | The remainder of this section will consider this strategy in much greater
 483 | detail.
 484 | 
 485 | ### Exploration vs exploitation
 486 | 
 487 | ***Summary:*** *Most of the time, our primary goal is to gain insight into the
 488 | problem.*
 489 | 
 490 | -   Although one might think we would spend most of our time trying to maximize
 491 |     performance on the validation set, in practice we spend the majority of our
 492 |     time trying to gain insight into the problem, and comparatively little time
 493 |     greedily focused on the validation error.
 494 |     -   In other words, we spend most of our time on "exploration" and only a
 495 |         small amount on "exploitation".
 496 | -   In the long run, understanding the problem is critical if we want to
 497 |     maximize our final performance. Prioritizing insight over short term gains
 498 |     can help us:
 499 |     -   Avoid launching unnecessary changes that happened to be present in
 500 |         well-performing runs merely through historical accident.
 501 |     -   Identify which hyperparameters the validation error is most sensitive
 502 |         to, which hyperparameters interact the most and therefore need to be
 503 |         re-tuned together, and which hyperparameters are relatively insensitive
 504 |         to other changes and can therefore be fixed in future experiments.
 505 |     -   Suggest potential new features to try, such as new regularizers if
 506 |         overfitting is an issue.
 507 |     -   Identify features that don't help and therefore can be removed, reducing
 508 |         the complexity of future experiments.
 509 |     -   Recognize when improvements from hyperparameter tuning have likely
 510 |         saturated.
 511 |     -   Narrow our search spaces around the optimal value to improve tuning
 512 |         efficiency.
 513 | -   When we are eventually ready to be greedy, we can focus purely on the
 514 |     validation error even if the experiments aren't maximally informative about
 515 |     the structure of the tuning problem.
 516 | 
 517 | ### Choosing the goal for the next round of experiments
 518 | 
 519 | ***Summary:*** *Each round of experiments should have a clear goal and be
 520 | sufficiently narrow in scope that the experiments can actually make progress
 521 | towards the goal.*
 522 | 
 523 | -   Each round of experiments should have a clear goal and be sufficiently
 524 |     narrow in scope that the experiments can actually make progress towards the
 525 |     goal: if we try to add multiple features or answer multiple questions at
 526 |     once, we may not be able to disentangle the separate effects on the results.
 527 | -   Example goals include:
 528 |     -   Try a potential improvement to the pipeline (e.g. a new regularizer,
 529 |         preprocessing choice, etc.).
 530 |     -   Understand the impact of a particular model hyperparameter (e.g. the
 531 |         activation function)
 532 |     -   Greedily maximize validation error.
 533 | 
 534 | ### Designing the next round of experiments
 535 | 
 536 | ***Summary:*** *Identify which hyperparameters are scientific, nuisance, and
 537 | fixed hyperparameters for the experimental goal. Create a sequence of studies to
 538 | compare different values of the scientific hyperparameters while optimizing over
 539 | the nuisance hyperparameters. Choose the search space of nuisance
 540 | hyperparameters to balance resource costs with scientific value.*
 541 | 
 542 | #### Identifying scientific, nuisance, and fixed hyperparameters
 543 | 
 544 | <details><summary><em>[Click to expand]</em></summary>
 545 | 
 546 | <br>
 547 | 
 548 | -   For a given goal, all hyperparameters will be either **scientific
 549 |     hyperparameters**, **nuisance hyperparameters**, or **fixed
 550 |     hyperparameters**.
 551 |     -   Scientific hyperparameters are those whose effect on the model's
 552 |         performance we're trying to measure.
 553 |     -   Nuisance hyperparameters are those that need to be optimized over in
 554 |         order to fairly compare different values of the scientific
 555 |         hyperparameters. This is similar to the statistical concept of
 556 |         [nuisance parameters](https://en.wikipedia.org/wiki/Nuisance_parameter).
 557 |     -   Fixed hyperparameters will have their values fixed in the current round
 558 |         of experiments. These are hyperparameters whose values do not need to
 559 |         (or we do not want them to) change when comparing different values of
 560 |         the scientific hyperparameters.
 561 |         -   By fixing certain hyperparameters for a set of experiments, we must
 562 |             accept that conclusions derived from the experiments might not be
 563 |             valid for other settings of the fixed hyperparameters. In other
 564 |             words, fixed hyperparameters create caveats for any conclusions we
 565 |             draw from the experiments.
 566 | -   For example, if our goal is to "determine whether a model with more hidden
 567 |     layers will reduce validation error", then the number of hidden layers is a
 568 |     scientific hyperparameter.
 569 |     -   The learning rate is a nuisance hyperparameter because we can only
 570 |         fairly compare models with different numbers of hidden layers if the
 571 |         learning rate is tuned separately for each number of layers (the optimal
 572 |         learning rate generally depends on the model architecture).
 573 |     -   The activation function could be a fixed hyperparameter if we have
 574 |         determined in prior experiments that the best choice of activation
 575 |         function is not sensitive to model depth, or if we are willing to limit
 576 |         our conclusions about the number of hidden layers to only cover this
 577 |         specific choice of activation function. Alternatively, it could be a
 578 |         nuisance parameter if we are prepared to tune it separately for each
 579 |         number of hidden layers.
 580 | -   Whether a particular hyperparameter is a scientific hyperparameter, nuisance
 581 |     hyperparameter, or fixed hyperparameter is not inherent to that
 582 |     hyperparameter, but changes depending on the experimental goal.
 583 |     -   For example, the choice of activation function could be a scientific
 584 |         hyperparameter (is ReLU or tanh a better choice for our problem?), a
 585 |         nuisance hyperparameter (is the best 5-layer model better than the best
 586 |         6-layer model when we allow several different possible activation
 587 |         functions?), or a fixed hyperparameter (for ReLU nets, does adding batch
 588 |         normalization in a particular position help?).
 589 | -   When designing a new round of experiments, we first identify the scientific
 590 |     hyperparameters for our experimental goal.
 591 |     -   At this stage, we consider all other hyperparameters to be nuisance
 592 |         hyperparameters.
 593 | -   Next, we convert some of the nuisance hyperparameters into fixed
 594 |     hyperparameters.
 595 |     -   With limitless resources, we would leave all non-scientific
 596 |         hyperparameters as nuisance hyperparameters so that the conclusions we
 597 |         draw from our experiments are free from caveats about fixed
 598 |         hyperparameter values.
 599 |     -   However, the more nuisance hyperparameters we attempt to tune, the
 600 |         greater the risk we fail to tune them sufficiently well for each setting
 601 |         of the scientific hyperparameters and end up reaching the wrong
 602 |         conclusions from our experiments.
 603 |         -   As described
 604 |             [below](#striking-a-balance-between-informative-and-affordable-experiments),
 605 |             we could counter this risk by increasing the computational budget,
 606 |             but often our maximum resource budget is less than would be needed
 607 |             to tune over all non-scientific hyperparameters.
 608 |     -   We choose to convert a nuisance hyperparameter into a fixed
 609 |         hyperparameter when, in our judgment, the caveats introduced by fixing
 610 |         it are less burdensome than the cost of including it as a nuisance
 611 |         hyperparameter.
 612 |         -   The more a given nuisance hyperparameter interacts with the
 613 |             scientific hyperparameters, the more damaging it is to fix its
 614 |             value. For example, the best value of the weight decay strength
 615 |             typically depends on the model size, so comparing different model
 616 |             sizes assuming a single specific value of the weight decay would not
 617 |             be very insightful.
 618 | -   Although the type we assign to each hyperparameter depends on the
 619 |     experimental goal, we have the following rules of thumb for certain
 620 |     categories of hyperparameters:
 621 |     -   Of the various optimizer hyperparameters (e.g. the learning rate,
 622 |         momentum, learning rate schedule parameters, Adam betas etc.), at least
 623 |         some of them will be nuisance hyperparameters because they tend to
 624 |         interact the most with other changes.
 625 |         -   They are rarely scientific hyperparameters because a goal like "what
 626 |             is the best learning rate for the current pipeline?" doesn't give
 627 |             much insight – the best setting could easily change with the next
 628 |             pipeline change anyway.
 629 |         -   Although we might fix some of them occasionally due to resource
 630 |             constraints or when we have particularly strong evidence that they
 631 |             don't interact with the scientific parameters, we should generally
 632 |             assume that optimizer hyperparameters must be tuned separately to
 633 |             make fair comparisons between different settings of the scientific
 634 |             hyperparameters, and thus shouldn't be fixed.
 635 |             -   Furthermore, we have no *a priori* reason to prefer one
 636 |                 optimizer hyperparameter value over another (e.g. they don't
 637 |                 usually affect the computational cost of forward passes or
 638 |                 gradients in any way).
 639 |     -   In contrast, the *choice* of optimizer is typically a scientific
 640 |         hyperparameter or fixed hyperparameter.
 641 |         -   It is a scientific hyperparameter if our experimental goal involves
 642 |             making fair comparisons between two or more different optimizers
 643 |             (e.g. "determine which optimizer produces the lowest validation
 644 |             error in a given number of steps").
 645 |         -   Alternatively, we might make it a fixed hyperparameter for a variety
 646 |             of reasons, including (1) prior experiments make us believe that the
 647 |             best optimizer for our problem is not sensitive to current
 648 |             scientific hyperparameters; and/or (2) we prefer to compare values
 649 |             of the scientific hyperparameters using this optimizer because its
 650 |             training curves are easier to reason about; and/or (3) we prefer to
 651 |             use this optimizer because it uses less memory than the
 652 |             alternatives.
 653 |     -   Hyperparameters introduced by a regularization technique are typically
 654 |         nuisance hyperparameters, but whether or not we include the
 655 |         regularization technique at all is a scientific or fixed hyperparameter.
 656 |         -   For example, dropout adds code complexity, so when deciding whether
 657 |             to include it we would make "no dropout" vs "dropout" a scientific
 658 |             hyperparameter and the dropout rate a nuisance hyperparameter.
 659 |             -   If we decide to add dropout to our pipeline based on this
 660 |                 experiment, then the dropout rate would be a nuisance
 661 |                 hyperparameter in future experiments.
 662 |     -   Architectural hyperparameters are often scientific or fixed
 663 |         hyperparameters because architecture changes can affect serving and
 664 |         training costs, latency, and memory requirements.
 665 |         -   For example, the number of layers is typically a scientific or fixed
 666 |             hyperparameter since it tends to have dramatic consequences for
 667 |             training speed and memory usage.
 668 | -   In some cases, the sets of nuisance and fixed hyperparameters will depend on
 669 |     the values of the scientific hyperparameters.
 670 |     -   For example, suppose we are trying to determine which optimizer out of
 671 |         Nesterov momentum and Adam results in the lowest validation error. The
 672 |         scientific hyperparameter is the `optimizer`, which takes values
 673 |         `{"Nesterov_momentum", "Adam"}`. The value
 674 |         `optimizer="Nesterov_momentum"` introduces the nuisance/fixed
 675 |         hyperparameters `{learning_rate, momentum}`, but the value
 676 |         `optimizer="Adam"` introduces the nuisance/fixed hyperparameters
 677 |         `{learning_rate, beta1, beta2, epsilon}`.
 678 |     -   Hyperparameters that are only present for certain values of the
 679 |         scientific hyperparameters are called **conditional hyperparameters**.
 680 |     -   We should not assume two conditional hyperparameters are the same just
 681 |         because they have the same name! In the above example, the conditional
 682 |         hyperparameter called `learning_rate` is a *different* hyperparameter
 683 |         for `optimizer="Nesterov_momentum"` versus `optimizer="Adam"`. Its role
 684 |         is similar (although not identical) in the two algorithms, but the range
 685 |         of values that work well in each of the optimizers is typically
 686 |         different by several orders of magnitude.
 687 | 
 688 | </details>
 689 | 
 690 | #### Creating a set of studies
 691 | 
 692 | <details><summary><em>[Click to expand]</em></summary>
 693 | 
 694 | <br>
 695 | 
 696 | 
 697 | -   Once we have identified the scientific and nuisance hyperparameters, we
 698 |     design a "study" or sequence of studies to make progress towards the
 699 |     experimental goal.
 700 |     -   A study specifies a set of hyperparameter configurations to be run for
 701 |         subsequent analysis. Each configuration is called a "trial".
 702 |     -   Creating a study typically involves choosing the hyperparameters that
 703 |         will vary across trials, choosing what values those hyperparameters can
 704 |         take on (the "search space"), choosing the number of trials, and
 705 |         choosing an automated search algorithm to sample that many trials from
 706 |         the search space. Alternatively, we could create a study by specifying
 707 |         the set of hyperparameter configurations manually.
 708 | -   The purpose of the studies is to run the pipeline with different values of
 709 |     the scientific hyperparameters, while at the same time **"optimizing away"**
 710 |     (or "optimizing over") the nuisance hyperparameters so that comparisons
 711 |     between different values of the scientific hyperparameters are as fair as
 712 |     possible.
 713 | -   In the simplest case, we would make a separate study for each configuration
 714 |     of the scientific parameters, where each study tunes over the nuisance
 715 |     hyperparameters.
 716 |     -   For example, if our goal is to select the best optimizer out of Nesterov
 717 |         momentum and Adam, we could create one study in which
 718 |         `optimizer="Nesterov_momentum"` and the nuisance hyperparameters are
 719 |         `{learning_rate, momentum}`, and another study in which
 720 |         `optimizer="Adam"` and the nuisance hyperparameters are `{learning_rate,
 721 |         beta1, beta2, epsilon}`. We would compare the two optimizers by
 722 |         selecting the best performing trial from each study.
 723 |     -   We can use any gradient-free optimization algorithm, including methods
 724 |         such as Bayesian optimization or evolutionary algorithms, to optimize
 725 |         over the nuisance hyperparameters, although
 726 |         [we prefer](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)
 727 |         to use quasi-random search in the
 728 |         [exploration phase](#exploration-vs-exploitation) of tuning because of a
 729 |         variety of advantages it has in this setting.
 730 |         [After exploration concludes](#after-exploration-concludes), if
 731 |         state-of-the-art Bayesian optimization software is available, that is
 732 |         our preferred choice.
 733 | -   In the more complicated case where we want to compare a large number of
 734 |     values of the scientific hyperparameters and it is impractical to make that
 735 |     many independent studies, we can include the scientific parameters in the
 736 |     same search space as the nuisance hyperparameters and use a search algorithm
 737 |     to sample values of *both* the scientific and nuisance hyperparameters in a
 738 |     single study.
 739 |     -   When taking this approach, conditional hyperparameters can cause
 740 |         problems since it is hard to specify a search space unless the set of
 741 |         nuisance hyperparameters is the same for all values of the scientific
 742 |         hyperparameters.
 743 |     -   In this case,
 744 |         [our preference](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)
 745 |         for using quasi-random search over fancier black-box optimization tools
 746 |         is even stronger, since it ensures that we obtain a relatively uniform
 747 |         sampling of values of the scientific hyperparameters. Regardless of the
 748 |         search algorithm, we need to make sure somehow that it searches the
 749 |         scientific parameters uniformly.
 750 | 
 751 | </details>
 752 | 
 753 | #### Striking a balance between informative and affordable experiments
 754 | 
 755 | <details><summary><em>[Click to expand]</em></summary>
 756 | 
 757 | <br>
 758 | 
 759 | 
 760 | -   When designing a study or sequence of studies, we need to allocate a limited
 761 |     budget in order to adequately achieve the following three desiderata:
 762 |     1.  Comparing enough different values of the scientific hyperparameters.
 763 |     2.  Tuning the nuisance hyperparameters over a large enough search space.
 764 |     3.  Sampling the search space of nuisance hyperparameters densely enough.
 765 | -   The better we can achieve these three desiderata, the more insight we can
 766 |     extract from our experiment.
 767 |     -   Comparing as many values of the scientific hyperparameters as possible
 768 |         broadens the scope of the insights we gain from the experiment.
 769 |     -   Including as many nuisance hyperparameters as possible and allowing each
 770 |         nuisance hyperparameter to vary over as wide a range as possible
 771 |         increases our confidence that a "good" value of the nuisance
 772 |         hyperparameters **exists** in the search space for each configuration of
 773 |         the scientific hyperparameters.
 774 |         -   Otherwise, we might make unfair comparisons between values of the
 775 |             scientific hyperparameters by not searching possible regions of the
 776 |             nuisance parameter space where better values might lie for some
 777 |             values of the scientific parameters.
 778 |     -   Sampling the search space of nuisance hyperparameters as densely as
 779 |         possible increases our confidence that any good settings for the
 780 |         nuisance hyperparameters that happen to exist in our search space will
 781 |         be found by the search procedure.
 782 |         -   Otherwise, we might make unfair comparisons between values of the
 783 |             scientific parameters due to some values getting luckier with the
 784 |             sampling of the nuisance hyperparameters.
 785 | -   Unfortunately, improvements in *any* of these three dimensions require
 786 |     either increasing the number of trials, and therefore increasing the
 787 |     resource cost, or finding a way to save resources in one of the other
 788 |     dimensions.
 789 |     -   Every problem has its own idiosyncrasies and computational constraints,
 790 |         so how to allocate resources across these three desiderata requires some
 791 |         level of domain knowledge.
 792 |     -   After running a study, we always try to get a sense of whether the study
 793 |         tuned the nuisance hyperparameters well enough (i.e. searched a large
 794 |         enough space extensively enough) to fairly compare the scientific
 795 |         hyperparameters (as described in greater detail
 796 |         [below](#extracting-insight-from-experimental-results)).
 797 | 
 798 | </details>
 799 | 
 800 | ### Extracting insight from experimental results
 801 | 
 802 | ***Summary:*** *In addition to trying to achieve the original scientific goal of
 803 | each group of experiments, go through a checklist of additional questions and,
 804 | if issues are discovered, revise the experiments and rerun them.*
 805 | 
 806 | -   Ultimately, each group of experiments has a specific goal and we want to
 807 |     evaluate the evidence the experiments provide toward that goal.
 808 |     -   However, if we ask the right questions, we will often find issues that
 809 |         need to be corrected before a given set of experiments can make much
 810 |         progress towards their original goal.
 811 |         -   If we don’t ask these questions, we may draw incorrect conclusions.
 812 |     -   Since running experiments can be expensive, we also want to take the
 813 |         opportunity to extract other useful insights from each group of
 814 |         experiments, even if these insights are not immediately relevant to the
 815 |         current goal.
 816 | -   Before analyzing a given set of experiments to make progress toward their
 817 |     original goal, we should ask ourselves the following additional questions:
 818 |     -   [Is the search space large enough?](#identifying-bad-search-space-boundaries)
 819 |         -   If the optimal point from a study is near the boundary of the search
 820 |             space in one or more dimensions, the search is probably not wide
 821 |             enough. In this case, we should run another study with an expanded
 822 |             search space.
 823 |     -   [Have we sampled enough points from the search space?](#not-sampling-enough-points-in-the-search-space)
 824 |         -   If not, run more points or be less ambitious in the tuning goals.
 825 |     -   What fraction of the trials in each study are **infeasible** (i.e.
 826 |         trials that diverge, get really bad loss values, or fail to run at all
 827 |         because they violate some implicit constraint)?
 828 |         -   When a very large fraction of points in a study are **infeasible**
 829 |             we should try to adjust the search space to avoid sampling such
 830 |             points, which sometimes requires reparameterizing the search space.
 831 |         -   In some cases, a large number of infeasible points can indicate a
 832 |             bug in the training code.
 833 |     -   [Does the model exhibit optimization issues?](#how-can-optimization-failures-be-debugged-and-mitigated)
 834 |     -   [What can we learn from the training curves of the best trials?](#examining-the-training-curves)
 835 |         -   For example, do the best trials have training curves consistent with
 836 |             problematic overfitting?
 837 | -   If necessary, based on the answers to the questions above, refine the most
 838 |     recent study (or group of studies) to improve the search space and/or sample
 839 |     more trials, or take some other corrective action.
 840 | -   Once we have answered the above questions, we can move on to evaluating the
 841 |     evidence the experiments provide towards our original goal (for example,
 842 |     [evaluating whether a change is useful](#detecting-whether-a-change-is-useful-with-isolation-plots)).
 843 | 
 844 | #### Identifying bad search space boundaries
 845 | 
 846 | <details><summary><em>[Click to expand]</em></summary>
 847 | 
 848 | <br>
 849 | 
 850 | 
 851 | -   A search space is suspicious if the best point sampled from it is close to
 852 |     its boundary. We might find an even better point if we expanded the search
 853 |     range in that direction.
 854 | -   To check search space boundaries, we like to plot completed trials on what
 855 |     we call **basic hyperparameter axis plots** where we plot the validation
 856 |     objective value versus one of the hyperparameters (e.g. learning rate). Each
 857 |     point on the plot corresponds to a single trial.
 858 |     -   The validation objective value for each trial should usually be the best
 859 |         value it achieved over the course of training.
 860 | 
 861 | <p align="center" id="figure-1">
 862 | <img src="assets/bad_search_space.png" width="49%" alt="Example of bad search space boundaries">
 863 | <img src="assets/good_search_space.png" width="49%" alt="Example of good search space boundaries">
 864 | </p>
 865 | 
 866 | <p align="center"><b>Figure 1:</b> Examples of bad search space boundaries and acceptable search space boundaries.</p>
 867 | 
 868 | -   The plots in [Figure 1](#figure-1) show the error rate (lower is better)
 869 |     against the initial learning rate.
 870 | -   If the best points cluster towards the edge of a search space (in some
 871 |     dimension), then the search space boundaries might need to be expanded until
 872 |     the best observed point is no longer close to the boundary.
 873 | -   Often, a study will include "infeasible" trials that diverge or get very bad
 874 |     results (marked with red Xs in the above plots).
 875 |     -   If all trials are infeasible for learning rates greater than some
 876 |         threshold value, and if the best performing trials have learning rates
 877 |         at the edge of that region, the model [may suffer from stability issues
 878 |         preventing it from accessing higher learning
 879 |         rates](#how-can-optimization-failures-be-debugged-and-mitigated).
 880 | 
 881 | </details>
 882 | 
 883 | #### Not sampling enough points in the search space
 884 | 
 885 | <details><summary><em>[Click to expand]</em></summary>
 886 | 
 887 | <br>
 888 | 
 889 | 
 890 | -   In general,
 891 |     [it can be very difficult to know](#how-many-trials-are-needed-to-get-good-results-with-quasi-random-search)
 892 |     if the search space has been sampled densely enough. 🤖
 893 | -   Running more trials is of course better, but comes at an obvious cost.
 894 | -   Since it is so hard to know when we have sampled enough, we usually sample
 895 |     what we can afford and try to calibrate our intuitive confidence from
 896 |     repeatedly looking at various hyperparameter axis plots and trying to get a
 897 |     sense of how many points are in the "good" region of the search space.
 898 | 
 899 | </details>
 900 | 
 901 | #### Examining the training curves
 902 | 
 903 | <details><summary><em>[Click to expand]</em></summary>
 904 | 
 905 | <br>
 906 | 
 907 | 
 908 | ***Summary:*** *Examining the training curves is an easy way to identify common
 909 | failure modes and can help us prioritize what actions to take next.*
 910 | 
 911 | -   Although in many cases the primary objective of our experiments only
 912 |     requires considering the validation error of each trial, we must be careful
 913 |     when reducing each trial to a single number because it can hide important
 914 |     details about what’s going on below the surface.
 915 | -   For every study, we always look at the **training curves** (training error
 916 |     and validation error plotted versus training step over the duration of
 917 |     training) of at least the best few trials.
 918 | -   Even if this is not necessary for addressing the primary experimental
 919 |     objective, examining the training curves is an easy way to identify common
 920 |     failure modes and can help us prioritize what actions to take next.
 921 | -   When examining the training curves, we are interested in the following
 922 |     questions.
 923 | -   Are any of the trials exhibiting **problematic overfitting?**
 924 |     -   Problematic overfitting occurs when the validation error starts
 925 |         *increasing* at some point during training.
 926 |     -   In experimental settings where we optimize away nuisance hyperparameters
 927 |         by selecting the "best" trial for each setting of the scientific
 928 |         hyperparameters, we should check for problematic overfitting in *at
 929 |         least* each of the best trials corresponding to the settings of the
 930 |         scientific hyperparameters that we’re comparing.
 931 |         -   If any of the best trials exhibits problematic overfitting, we
 932 |             usually want to re-run the experiment with additional regularization
 933 |             techniques and/or better tune the existing regularization parameters
 934 |             before comparing the values of the scientific hyperparameters.
 935 |             -   This may not apply if the scientific hyperparameters include
 936 |                 regularization parameters, since then it would not be surprising
 937 |                 if low-strength settings of those regularization parameters
 938 |                 resulted in problematic overfitting.
 939 |         -   Reducing overfitting is often straightforward using common
 940 |             regularization techniques that add minimal code complexity or extra
 941 |             computation (e.g. dropout, label smoothing, weight decay), so it’s
 942 |             usually no big deal to add one or more of these to the next round of
 943 |             experiments.
 944 |         -   For example, if the scientific hyperparameter is "number of hidden
 945 |             layers" and the best trial that uses the largest number of hidden
 946 |             layers exhibited problematic overfitting, then we would usually
 947 |             prefer to try it again with additional regularization instead of
 948 |             immediately selecting the smaller number of hidden layers.
 949 |         -   Even if none of the "best" trials are exhibiting problematic
 950 |             overfitting, there might still be a problem if it occurs in *any* of
 951 |             the trials.
 952 |             -   Selecting the best trial suppresses configurations exhibiting
 953 |                 problematic overfitting and favors those that do not. In other
 954 |                 words, it will favor configurations with more regularization.
 955 |             -   However, anything that makes training worse can act as a
 956 |                 regularizer, even if it wasn't intended that way. For example,
 957 |                 choosing a smaller learning rate can regularize training by
 958 |                 hobbling the optimization process, but we typically don't want
 959 |                 to choose the learning rate this way.
 960 |             -   So we must be aware that the "best" trial for each setting of
 961 |                 the scientific hyperparameters might be selected in such a way
 962 |                 that favors "bad" values of some of the scientific or nuisance
 963 |                 hyperparameters.
 964 | -   Is there high step-to-step variance in the training or validation error late
 965 |     in training?
 966 |     -   If so, this could interfere with our ability to compare different values
 967 |         of the scientific hyperparameters (since each trial randomly ends on a
 968 |         "lucky" or "unlucky" step) and our ability to reproduce the result of
 969 |         the best trial in production (since the production model might not end
 970 |         on the same "lucky" step as in the study).
 971 |     -   The most likely causes of step-to-step variance are batch variance (from
 972 |         randomly sampling examples from the training set for each batch), small
 973 |         validation sets, and using a learning rate that’s too high late in
 974 |         training.
 975 |     -   Possible remedies include increasing the batch size, obtaining more
 976 |         validation data, using learning rate decay, or using Polyak averaging.
 977 | -   Are the trials still improving at the end of training?
 978 |     -   If so, this indicates that we are in the
 979 |         ["compute bound" regime](#determining-the-number-of-steps-for-each-training-run)
 980 |         and we may benefit from
 981 |         [increasing the number of training steps](#Deciding-how-long-to-train-when-training-is-compute-bound)
 982 |         or changing the learning rate schedule.
 983 | -   Has performance on the training and validation sets saturated long before
 984 |     the final training step?
 985 |     -   If so, this indicates that we are in the
 986 |         ["not compute-bound"](#determining-the-number-of-steps-for-each-training-run)
 987 |         regime and that we may be able to
 988 |         [decrease the number of training steps](#deciding-how-long-to-train-when-training-is-not-compute-bound).
 989 | -   Although we cannot enumerate them all, there are many other additional
 990 |     behaviors that can become evident from examining the training curves (e.g.
 991 |     training loss *increasing* during training usually indicates a bug in the
 992 |     training pipeline).
 993 | 
 994 | </details>
 995 | 
 996 | #### Detecting whether a change is useful with isolation plots
 997 | 
 998 | <details><summary><em>[Click to expand]</em></summary>
 999 | 
1000 | <br>
1001 | 
1002 | 
1003 | <p align="center" id="figure-2">
1004 | <img src="assets/isolation_plot.png" width="49%" alt="Isolation plot that investigates the best value of weight decay for ResNet-50
1005 | trained on ImageNet.">
1006 | </p>
1007 | 
1008 | <p align="center"><b>Figure 2:</b> Isolation plot that investigates the best value of weight decay for ResNet-50 trained on ImageNet.</p>
1009 | 
1010 | -   Often, the goal of a set of experiments is to compare different values of a
1011 |     scientific hyperparameter.
1012 |     -   For example, we may want to determine the value of weight decay that
1013 |         results in the best validation error.
1014 | -   An **isolation plot** is a special case of the basic hyper-parameter axis
1015 |     plot. Each point on an isolation plot corresponds to the performance of the
1016 |     *best* trial across some (or all) of the nuisance hyperparameters.
1017 |     -   In other words, we plot the model performance after "optimizing away"
1018 |         the nuisance hyperparameters.
1019 | -   An isolation plot makes it easier to perform an apples-to-apples comparison
1020 |     between different values of the scientific hyperparameter.
1021 | -   For example, [Figure 2](#figure-2) reveals the value of weight decay that
1022 |     produces the best validation performance for a particular configuration of
1023 |     ResNet-50 trained on ImageNet.
1024 |     -   If our goal is to determine whether to include weight decay at all, then
1025 |         we would compare the best point from this plot against the baseline of
1026 |         no weight decay. For a fair comparison, the baseline should also have
1027 |         its learning rate equally well tuned.
1028 | -   When we have data generated by (quasi)random search and are considering a
1029 |     continuous hyperparameter for an isolation plot, we can approximate the
1030 |     isolation plot by bucketing the x-axis values of the basic hyperparameter
1031 |     axis plot and taking the best trial in each vertical slice defined by the
1032 |     buckets.
1033 | 
1034 | </details>
1035 | 
1036 | #### Automate generically useful plots
1037 | 
1038 | <details><summary><em>[Click to expand]</em></summary>
1039 | 
1040 | <br>
1041 | 
1042 | -   The more effort it is to generate plots, the less likely we are to look at
1043 |     them as much as we should, so it behooves us to set up our infrastructure to
1044 |     automatically produce as many of them as possible.
1045 | -   At a minimum, we automatically generate basic hyperparameter axis plots for
1046 |     all hyperparameters that we vary in an experiment.
1047 | -   Additionally, we automatically produce training curves for all trials and
1048 |     make it as easy as possible to find the best few trials of each study and
1049 |     examine their training curves.
1050 | -   There are many other potential plots and visualizations we can add that can
1051 |     be useful. Although the ones described above are a good starting point, to
1052 |     paraphrase Geoffrey Hinton, "Every time you plot something new, you learn
1053 |     something new."
1054 | 
1055 | </details>
1056 | 
1057 | ### Determining whether to adopt a training pipeline change or hyperparameter configuration
1058 | 
1059 | ***Summary:*** *When deciding whether to make a change to our model or training
1060 | procedure or adopt a new hyperparameter configuration going forward, we need to
1061 | be aware of the different sources of variation in our results.*
1062 | 
1063 | -   When we are trying to improve our model, we might observe that a particular
1064 |     candidate change initially achieves a better validation error compared to
1065 |     our incumbent configuration, but find that after repeating the experiment
1066 |     there is no consistent advantage. Informally, we can group the most
1067 |     important sources of variation that might cause such an inconsistent result
1068 |     into the following broad categories:
1069 |     -   **Training procedure variance**, **retrain variance**, or **trial
1070 |         variance**: the variation we see between training runs that use the same
1071 |         hyperparameters, but different random seeds.
1072 |         -   For example, different random initializations, training data
1073 |             shuffles, dropout masks, patterns of data augmentation operations,
1074 |             and orderings of parallel arithmetic operations, are all potential
1075 |             sources of trial variance.
1076 |     -   **Hyperparameter search variance**, or **study variance**: the variation
1077 |         in results caused by our procedure to select the hyperparameters.
1078 |         -   For example, we might run the same experiment with a particular
1079 |             search space, but with two different seeds for quasi-random search
1080 |             and end up selecting different hyperparameter values.
1081 |     -   **Data collection and sampling variance**: the variance from any sort of
1082 |         random split into training, validation, and test data or variance due to
1083 |         the training data generation process more generally.
1084 | -   It is all well and good to make comparisons of validation error rates
1085 |     estimated on a finite validation set using fastidious statistical tests, but
1086 |     often the trial variance alone can produce statistically significant
1087 |     differences between two different trained models that use the same
1088 |     hyperparameter settings.
1089 | -   We are most concerned about study variance when trying to make conclusions
1090 |     that go beyond the level of an individual point in hyperparameters space.
1091 |     -   The study variance depends on the number of trials and the search space
1092 |         and we have seen cases where it is larger than the trial variance as
1093 |         well as cases where it is much smaller.
1094 | -   Therefore, before adopting a candidate change, consider running the best
1095 |     trial N times to characterize the run-to-run trial variance.
1096 |     -   Usually, we can get away with only recharacterizing the trial variance
1097 |         after major changes to the pipeline, but in some applications we might
1098 |         need fresher estimates.
1099 |     -   In other applications, characterizing the trial variance is too costly
1100 |         to be worth it.
1101 | -   At the end of the day, although we only want to adopt changes (including new
1102 |     hyperparameter configurations) that produce real improvements, demanding
1103 |     complete certainty that something helps isn't the right answer either.
1104 | -   Therefore, if a new hyperparameter point (or other change) gets a better
1105 |     result than the baseline (taking into account the retrain variance of both
1106 |     the new point and the baseline as best we can), then we probably should
1107 |     adopt it as the new baseline for future comparisons.
1108 |     -   However, we should only adopt changes that produce improvements that
1109 |         outweigh any complexity they add.
1110 | 
1111 | ### After exploration concludes
1112 | 
1113 | ***Summary:*** *Bayesian optimization tools are a compelling option once we’re
1114 | done exploring for good search spaces and have decided what hyperparameters even
1115 | should be tuned at all.*
1116 | 
1117 | -   At some point, our priorities will shift from learning more about the tuning
1118 |     problem to producing a single best configuration to launch or otherwise use.
1119 | -   At this point, there should be a refined search space that comfortably
1120 |     contains the local region around the best observed trial and has been
1121 |     adequately sampled.
1122 | -   Our exploration work should have revealed the most essential hyperparameters
1123 |     to tune (as well as sensible ranges for them) that we can use to construct a
1124 |     search space for a final automated tuning study using as large a tuning
1125 |     budget as possible.
1126 | -   Since we no longer care about maximizing our insight into the tuning
1127 |     problem, many of
1128 |     [the advantages of quasi-random search](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)
1129 |     no longer apply and Bayesian optimization tools should be used to
1130 |     automatically find the best hyperparameter configuration.
1131 |     -   [Open-Source Vizier](https://github.com/google/vizier) implements
1132 |         a variety of sophisticated algorithms for tuning ML models, including
1133 |         Bayesian Optimization algorithms.
1134 |     -   If the search space contains a non-trivial volume of divergent points
1135 |         (points that get NaN training loss or even training loss many standard
1136 |         deviations worse than the mean), it is important to use black box
1137 |         optimization tools that properly handle trials that diverge (see
1138 |         [Bayesian Optimization with Unknown Constraints](https://arxiv.org/abs/1403.5607)
1139 |         for an excellent way to deal with this issue). [Open-Source Vizier](https://github.com/google/vizier)
1140 |         has support for divergent points by marking trials as infeasible, although it may not use our preferred approach from [Gelbart et al.](https://arxiv.org/abs/1403.5607), depending on how it is configured.
1141 | -   At this point, we should also consider checking the performance on the test
1142 |     set.
1143 |     -   In principle, we could even fold the validation set into the training
1144 |         set and retraining the best configuration found with Bayesian
1145 |         optimization. However, this is only appropriate if there won't be future
1146 |         launches with this specific workload (e.g. a one-time Kaggle
1147 |         competition).
1148 | 
1149 | ## Determining the number of steps for each training run
1150 | 
1151 | -   There are two types of workloads: those that are compute-bound and those
1152 |     that are not.
1153 | -   When training is **compute-bound**, training is limited by how long we are
1154 |     willing to wait and not by how much training data we have or some other
1155 |     factor.
1156 |     -   In this case, if we can somehow train longer or more efficiently, we
1157 |         should see a lower training loss and, with proper tuning, an improved
1158 |         validation loss.
1159 |     -   In other words, *speeding up* training is equivalent to *improving*
1160 |         training and the "optimal" training time is always "as long as we can
1161 |         afford."
1162 |     -   That said, just because a workload is compute-limited doesn't mean
1163 |         training longer/faster is the only way to improve results.
1164 | -   When training is **not compute-bound**, we can afford to train as long as we
1165 |     would like to, and, at some point, training longer doesn't help much (or
1166 |     even causes problematic overfitting).
1167 |     -   In this case, we should expect to be able to train to very low training
1168 |         loss, to the point where training longer might slightly reduce the
1169 |         training loss, but will not meaningfully reduce the validation loss.
1170 |     -   Particularly when training is not compute-bound, a more generous
1171 |         training time budget can make tuning easier, especially when tuning
1172 |         learning rate decay schedules, since they have a particularly strong
1173 |         interaction with the training budget.
1174 |         -   In other words, very stingy training time budgets might require a
1175 |             learning rate decay schedule tuned to perfection in order to achieve
1176 |             a good error rate.
1177 | -   Regardless of whether a given workload is compute-bound or not, methods that
1178 |     increase the variance of the gradients (across batches) will usually result
1179 |     in slower training progress, and thus may increase the number of training
1180 |     steps required to reach a particular validation loss. High gradient variance
1181 |     can be caused by:
1182 |     -   Using a smaller batch size
1183 |     -   Adding data augmentation
1184 |     -   Adding some types of regularization (e.g. dropout)
1185 | 
1186 | ### Deciding how long to train when training is *not* compute-bound
1187 | 
1188 | -   Our main goal is to ensure we are training long enough for the model to
1189 |     reach the best possible result, while avoiding being overly wasteful in the
1190 |     number of training steps.
1191 | -   When in doubt, err on the side of training longer. Performance should never
1192 |     degrade when training longer, assuming retrospective (optimal) checkpoint
1193 |     selection is used properly and checkpoints are frequent enough.
1194 | -   Never tune the `max_train_steps` number in a study. Pick a value and use it
1195 |     for all trials. From these trials, plot the training step that retrospective
1196 |     checkpoint selection finds in order to refine the choice of
1197 |     `max_train_steps`.
1198 |     -   For example, if the best step is always during the first 10% of
1199 |         training, then the maximum number of steps is way too high.
1200 |     -   Alternatively, if the best step is consistently in the last 25% of
1201 |         training we might benefit from training longer and re-tuning the decay
1202 |         schedule.
1203 | -   The ideal number of training steps can change when the architecture or data
1204 |     changes (e.g. adding data augmentation).
1205 | -   Below we describe how to pick an initial candidate value for
1206 |     `max_train_steps` based on the number of steps necessary to "perfectly fit"
1207 |     the training set using a constant learning rate.
1208 |     -   Note, we are not using the phrase "perfectly fit the training set" in a
1209 |         precise or mathematically well-defined way. It is merely meant as an
1210 |         informal descriptor to indicate a very low training loss.
1211 |         -   For example, when training with the log loss, absent regularization
1212 |             terms, we might see the training loss keep slowly improving until we
1213 |             reach floating point limits as the network weights grow without
1214 |             bound and the predictions of the model on the training set become
1215 |             increasingly confident. In this case, we might say the model
1216 |             "perfectly fit" the training set around the time the
1217 |             misclassification error reached zero on the training set.
1218 |     -   The starting value for `max_train_steps` we find may need to be
1219 |         increased if the amount of gradient noise in the training procedure
1220 |         increases.
1221 |         -   For example, if data augmentation or regularizers like dropout are
1222 |             introduced to the model.
1223 |     -   It may be possible to decrease `max_train_steps` if the training process
1224 |         improves somehow.
1225 |         -   For example, with a better tuned optimizer or a better tuned
1226 |             learning rate schedule.
1227 | 
1228 | #### Algorithm for picking an initial candidate for max_train_steps using a learning rate sweep
1229 | 
1230 | <details><summary><em>[Click to expand]</em></summary>
1231 | 
1232 | <br>
1233 | 
1234 | -   This procedure assumes it is possible to not only "perfectly" fit the
1235 |     training set, but to do so using a constant learning rate schedule.
1236 | -   If it is possible to perfectly fit the entire training set, then there must
1237 |     exist a configuration (with some value of `max_train_steps`) that perfectly
1238 |     fits the training set; find any such configuration and use its value of
1239 |     `max_train_steps` as a starting point `N`.
1240 | -   Run a constant learning rate sweep (i.e. grid search the learning rate)
1241 |     without data augmentation and without regularization where each trial trains
1242 |     for `N` steps.
1243 | -   The number of steps required for the fastest trial in the sweep to reach
1244 |     perfect training performance is our initial guess for `max_train_steps`.
1245 | -   **NOTE:** Bad search spaces can make it possible to engage in
1246 |     self-deception.
1247 |     -   For example, if all the learning rates in a study are too small, we
1248 |         might incorrectly conclude that a very large value of `max_train_steps`
1249 |         is necessary.
1250 |     -   At a minimum, we should check that the optimal learning rate in the
1251 |         study is not at the boundary of the search space.
1252 | 
1253 | </details>
1254 | 
1255 | ### Deciding how long to train when training is compute-bound
1256 | 
1257 | -   In some cases, training loss keeps improving indefinitely and our patience
1258 |     and computational resources become the limiting factors.
1259 | -   If training loss (or even validation loss) keeps improving indefinitely,
1260 |     should we always train as long as we can afford? Not necessarily.
1261 |     -   We might be able to tune more effectively by running a larger number of
1262 |         shorter experiments and reserving the longest "production length" runs
1263 |         for the models we hope to launch.
1264 |     -   As the training time for trials approaches our patience limit, tuning
1265 |         experiments become more relevant for our potential launch candidates,
1266 |         but we can complete fewer of them.
1267 |     -   There are probably many questions we can answer while only training for
1268 |         ~10% of the production length, but there is always a risk that our
1269 |         conclusions at this time limit will not apply to experiments at 20% of
1270 |         the production length, let alone 100%.
1271 | -   Tuning in multiple rounds with increasing, per-trial training step limits is
1272 |     a sensible approach.
1273 |     -   We can do as many rounds as we want, but usually 1-3 are the most
1274 |         practical.
1275 |     -   Essentially, try to obtain as much understanding of the problem as
1276 |         possible using trials with a very quick turnaround time, trading off
1277 |         tuning thoroughness with relevance to the final, longest runs.
1278 |     -   Once a given per-trial time limit has generated useful insights, we can
1279 |         increase the training time and continue tuning, double-checking our
1280 |         conclusions from the shorter runs as needed.
1281 | -   As a starting point, we recommend two rounds of tuning:
1282 |     -   Round 1: Shorter runs to find good model and optimizer hyperparameters.
1283 |     -   Round 2: Very few long runs on good hyperparameter points to get the
1284 |         final model.
1285 | -   The biggest question going from `Round i` &rarr; `Round i+1` is how to
1286 |     adjust learning rate decay schedules.
1287 |     -   One common pitfall when adjusting learning rate schedules between rounds
1288 |         is using all the extra training steps with too small of a learning rate.
1289 | 
1290 | #### Round 1
1291 | 
1292 | <details><summary><em>[Click to expand]</em></summary>
1293 | 
1294 | <br>
1295 | 
1296 | -   Unfortunately, there is no guarantee that good hyperparameters found in
1297 |     short, incomplete training are still good choices when training length is
1298 |     significantly increased. However, for some kinds of hyperparameters, they
1299 |     are often correlated enough for Round 1 to be useful.
1300 | -   What hyperparameter values found in shorter runs do we expect to transfer to
1301 |     longer training runs? For all of this, we need more research. But based on
1302 |     what we know so far, here are the authors’ suspicions in order of decreasing
1303 |     probability of transferring:
1304 |     -   Very likely to transfer
1305 |         -   Early training instability can be resolved in the first round of
1306 |             tuning using a smaller number of training steps. Perhaps these
1307 |             hyperparameters are the closest thing to a sure bet for transfer
1308 |             that we have.
1309 |             -   Warmup length
1310 |             -   Initialization
1311 |     -   Likely to transfer
1312 |         -   Model architecture - A dramatic win in the model architecture will
1313 |             usually transfer, but there are probably many counterexamples.
1314 |     -   Might transfer
1315 |         -   Optimization algorithm/optimizer hyperparameters - We think this
1316 |             would "loosely" transfer. It’s definitely weaker than the things
1317 |             above it.
1318 |         -   Data augmentation
1319 |         -   Regularization
1320 |             -   If it isn't possible to perfectly fit the training set, the
1321 |                 model might be in a regime where regularization is unlikely to
1322 |                 help very much.
1323 |     -   Unlikely to transfer
1324 |         -   Learning rate schedule: unlikely to transfer perfectly.
1325 |             -   [This paper](https://arxiv.org/abs/2203.15556) suggests that
1326 |                 even decay schedule transfers, but we don't believe this is true
1327 |                 in general. Example: Tuning sqrt decay on small # of training
1328 |                 steps then extending to large # will result in the majority of
1329 |                 training occurring at overly small steps.
1330 |                 -   One can likely do "good enough" with most schedules in the
1331 |                     limit of extreme training budget, but noticeable performance
1332 |                     improvements can likely be seen if it is tuned.
1333 |             -   [Understanding Short-Horizon Bias in Stochastic
1334 |                 Meta-Optimization](https://arxiv.org/abs/1803.02021) describes
1335 |                 the dangers of trying to pick learning rates myopically.
1336 | 
1337 | </details>
1338 | 
1339 | #### Round 2
1340 | 
1341 | <details><summary><em>[Click to expand]</em></summary>
1342 | 
1343 | <br>
1344 | 
1345 | -   Run the best hyperparameter configuration from Round 1.
1346 | -   **(Speculation)** 🤖 Use the extra steps to extend the period of training at
1347 |     a high learning rate.
1348 |     -   E.g. if linear schedule then keep the length of the decay fixed from
1349 |         Round 1 and extend the period of constant lr in the beginning.
1350 |     -   For cosine decay, just keep the base lr from Round 1 and extend
1351 |         `max_train_steps` as in
1352 |         [Chinchilla paper](https://arxiv.org/abs/2203.15556).
1353 | -   More rounds might make sense for teams with very mature modeling and tuning
1354 |     pipelines and very long and expensive production training runs, but they
1355 |     will often be overkill.
1356 |     -   We've described how to transfer from Step 1 &rarr; Step 2. If we didn't care
1357 |         about analysis time and if making efficient use of compute was the
1358 |         overriding concern, then the ideal would be to exponentially increase
1359 |         the length of training runs (and thus the end-to-end time to complete a
1360 |         study) over many different rounds of tuning.
1361 |         -   At each round we systematically ensure our choices continue to hold
1362 |             up.
1363 |         -   New ideas go through a pipeline that progressively derisks them
1364 |             using increasingly long-running experiments from Step i to Step i+1.
1365 | 
1366 | </details>
1367 | 
1368 | ## Additional guidance for the training pipeline
1369 | 
1370 | ### Optimizing the input pipeline
1371 | 
1372 | ***Summary:*** *The causes and interventions of input-bound pipelines are highly
1373 | task-dependent; use a profiler and look out for common issues.*
1374 | 
1375 | -   Use an appropriate profiler to diagnose input-bound pipelines. For example,
1376 |     [Perfetto](https://jax.readthedocs.io/en/latest/profiling.html) for JAX or
1377 |     [TensorFlow profiler](https://www.tensorflow.org/guide/profiler) for
1378 |     TensorFlow.
1379 | -   Ultimately, the specific causes and interventions will be highly
1380 |     task-dependent. Broader engineering considerations (e.g. minimizing disk
1381 |     footprint) may warrant worse input pipeline performance.
1382 | -   Common causes:
1383 |     -   Data are not colocated with the training process, causing I/O latency
1384 |         (this might happen when reading training data over a network).
1385 |     -   Expensive online data preprocessing (consider doing this once offline
1386 |         and saving).
1387 |     -   Unintentional synchronization barriers that interfere with data pipeline
1388 |         prefetching. For example, when synchronizing metrics between the device
1389 |         and host in CommonLoopUtils
1390 |         ([link](https://github.com/google/CommonLoopUtils/blob/fea2518ada8814a78e1492023fd9f00edb0b0568/clu/metrics.py#L291)).
1391 | -   Common tips:
1392 |     -   Instrument input pipeline to prefetch examples (e.g.
1393 |         [tf.data.Dataset.prefetch](https://www.tensorflow.org/guide/data_performance#prefetching))
1394 |     -   Remove unused features/metadata from each as early in the pipeline as
1395 |         possible.
1396 |     -   Increase the replication of the number of jobs generating examples for
1397 |         the input pipeline. For example, by using the
1398 |         [tf.data service](https://www.tensorflow.org/api_docs/python/tf/data/experimental/service).
1399 | 
1400 | ### Evaluating model performance
1401 | 
1402 | ***Summary:*** *Run evaluation at larger batch sizes than training. Run
1403 | evaluations at regular step intervals, not regular time intervals.*
1404 | 
1405 | #### Evaluation settings
1406 | 
1407 | <details><summary><em>[Click to expand]</em></summary>
1408 | 
1409 | <br>
1410 | 
1411 | -   There are several settings in which we can evaluate the performance of our
1412 |     models.
1413 |     -   **Online evaluation** - metrics are collected when the model is serving
1414 |         predictions in a production environment.
1415 |     -   **Offline evaluation** - metrics are collected when the model is run on
1416 |         offline train/validation/test sets that are representative of the
1417 |         production environment.
1418 |     -   **Periodic evaluations** - metrics are collected during model training
1419 |         that might either be a proxy for the offline evaluation, and/or on a
1420 |         subset of the data used in offline evaluation.
1421 | -   Online evaluation is the gold standard, but is often impractical during the
1422 |     model development phase.
1423 | -   Depending on the problem, offline evaluation can be fairly involved and
1424 |     computationally expensive.
1425 | -   Periodic evaluations are the most practical and economical choice, but may
1426 |     not fully represent the production environment.
1427 |     -   Our goal during periodic evaluation is to use an expedient proxy of the
1428 |         offline evaluation, without sacrificing the reliability of the signal we
1429 |         get during training.
1430 | 
1431 | </details>
1432 | 
1433 | #### Setting up periodic evaluations
1434 | 
1435 | <details><summary><em>[Click to expand]</em></summary>
1436 | 
1437 | <br>
1438 | 
1439 | -   We run periodic evaluations during training to monitor its progress in real
1440 |     time, to
1441 |     [facilitate retrospective model checkpoint selection](#saving-checkpoints-and-retrospectively-selecting-the-best-checkpoint),
1442 |     and so that we can
1443 |     [examine the training curves at the end of training](#examining-the-training-curves).
1444 | -   The simplest configuration is to perform both training and periodic
1445 |     evaluations within the same compute instance, periodically alternating
1446 |     between training and evaluation.
1447 |     -   In this case, the batch size used to perform evaluations should be *at
1448 |         least* as large as the batch size used for training because model
1449 |         activations don't need to be maintained during evaluation, lowering the
1450 |         computational requirements per example.
1451 | -   Periodic evaluations should be done at regular step intervals, not time
1452 |     intervals.
1453 |     -   Evaluating based on time intervals can make it harder to interpret the
1454 |         training curves, especially when training may suffer from preemptions of
1455 |         the training jobs, network latency issues, etc.
1456 | -   Periodicity in valid/test metrics (when using a shuffled
1457 |     train/validation/test split) can indicate implementation bugs such as test
1458 |     data having overlap with training data, or training data not being properly
1459 |     shuffled. Evaluating at regular step intervals can make these issues easier
1460 |     to catch.
1461 | -   Partial batches can occur when the evaluation sets are not divisible by the
1462 |     batch size. Ensure that the padded examples are correctly weighed to prevent
1463 |     the loss function from being biased by them. Often, these padded examples
1464 |     can be given a weight of zero.
1465 | -   Save sufficient information per evaluation to support offline analysis.
1466 |     Ideally, we would save predictions on a selection of individual examples
1467 |     since they can be invaluable for debugging.
1468 |     -   Generating artifacts like
1469 |         [SavedModels](https://www.tensorflow.org/guide/saved_model) make it easy
1470 |         to do ad-hoc model inspection after evaluation jobs finish.
1471 | 
1472 | </details>
1473 | 
1474 | #### Choosing a sample for periodic evaluation
1475 | 
1476 | <details><summary><em>[Click to expand]</em></summary>
1477 | 
1478 | <br>
1479 | 
1480 | -   The periodic evaluation job might not run fast enough to compute metrics on
1481 |     the full offline evaluation set in a reasonable amount of time. This often
1482 |     necessitates sampling data for periodic evaluation.
1483 | -   We consider the following factors when constructing a sampled dataset:
1484 |     -   <ins>Sample size</ins>
1485 |         -   Check that the performance computed on the sampled dataset used by
1486 |             the periodic job matches the performance on the whole offline
1487 |             evaluation set, i.e. there is no skew between the sampled set and
1488 |             the full dataset.
1489 |         -   The dataset used for periodic evaluation should be small enough that
1490 |             it’s easy to generate model predictions over its entirety, but large
1491 |             enough that improvements to the model can be accurately measured
1492 |             (i.e. not overwhelmed by label noise).
1493 |         -   It should be large enough to accommodate multiple such evaluations
1494 |             across trials in sequence, and still produce accurate estimates.
1495 |             That is, to avoid adaptively "fitting" to the validation set over
1496 |             time, in a way that doesn't generalize to a held-out test set.
1497 |             However, this consideration is rarely a practical concern.
1498 |     -   <ins>Imbalanced datasets</ins>
1499 |         -   For imbalanced datasets, performance on rare classes of examples
1500 |             will often be noisy.
1501 |         -   For datasets with a small number of examples in a class label, log
1502 |             the number of examples predicted correctly to get more insight into
1503 |             accuracy improvements (.05 sensitivity improvement sounds exciting,
1504 |             but was it just one more example correct?).
1505 | 
1506 | </details>
1507 | 
1508 | ### Saving checkpoints and retrospectively selecting the best checkpoint
1509 | 
1510 | ***Summary:*** *Run training for a fixed number of steps and retrospectively
1511 | choose the best checkpoint from the run.*
1512 | 
1513 | -   Most deep learning frameworks support
1514 |     [model checkpointing](https://flax.readthedocs.io/en/latest/api_reference/flax.training.html).
1515 |     That is, the current state of the model is periodically preserved on disk.
1516 |     This allows the training job to be resilient to compute instance
1517 |     interruptions.
1518 | -   The best checkpoint is often not the last checkpoint, particularly when the
1519 |     validation set performance does not continue to increase over time but
1520 |     rather fluctuates about a particular value.
1521 | -   Set up the pipeline to keep track of the N best checkpoints seen so far
1522 |     during training. At the end of training, model selection is then a matter of
1523 |     choosing the best checkpoint seen during training. We call this
1524 |     **retrospective optimal checkpoint selection**.
1525 | -   Supporting prospective early stopping is usually not necessary, since we’re
1526 |     pre-specifying a trial budget and are preserving the N best checkpoints seen
1527 |     so far.
1528 | 
1529 | ### Setting up experiment tracking
1530 | 
1531 | ***Summary:*** *When tracking different experiments, make sure to note a number
1532 | of essentials like the best performance of a checkpoint in the study, and a
1533 | short description of the study.*
1534 | 
1535 | -   We've found that keeping track of experiment results in a spreadsheet has
1536 |     been helpful for the sorts of modeling problems we've worked on. It often
1537 |     has the following columns:
1538 |     -   Study name
1539 |     -   A link to wherever the config for the study is stored.
1540 |     -   Notes or a short description of the study.
1541 |     -   Number of trials run
1542 |     -   Performance on the validation set of the best checkpoint in the study.
1543 |     -   Specific reproduction commands or notes on what unsubmitted changes were
1544 |         necessary to launch training.
1545 | -   Find a tracking system that captures at least the information listed above
1546 |     and is convenient for the people doing it. Untracked experiments might as
1547 |     well not exist.
1548 | 
1549 | ### Batch normalization implementation details
1550 | 
1551 | ***Summary:*** *Nowadays batch norm can often be replaced with LayerNorm, but in
1552 | cases where it cannot, there are tricky details when changing the batch size or
1553 | number of hosts.*
1554 | 
1555 | -   Batch norm normalizes activations using their mean and variance over the
1556 |     current batch, but in the multi-device setting these statistics are
1557 |     different on each device unless explicitly synchronized.
1558 | -   Anecdotal reports (mostly on ImageNet) say calculating these normalizing
1559 |     statistics using only ~64 examples actually works better in practice (see
1560 |     Ghost Batch Norm from [this paper](https://arxiv.org/abs/1705.08741)).
1561 | -   Decoupling the total batch size and the number of examples used to calculate
1562 |     batch norm statistics is particularly useful for batch size comparisons.
1563 | -   Ghost batch norm implementations do not always correctly handle the case
1564 |     where the per-device batch size > virtual batch size. In this case we'd
1565 |     actually need to subsample the batch on each device in order to get the
1566 |     proper number of batch norm statistic examples.
1567 | -   Exponential moving averages used in test mode batch norm are just a linear
1568 |     combination of training statistics, so these EMAs only need to be
1569 |     synchronized before saving them in checkpoints. However, some common
1570 |     implementations of batch norm do not synchronize these EMAs and only save
1571 |     the EMA from the first device.
1572 | 
1573 | ### Considerations for multi-host pipelines
1574 | 
1575 | ***Summary:*** *for logging, evals, RNGs, checkpointing, and data sharding,
1576 | multi-host training can make it very easy to introduce bugs!*
1577 | 
1578 | -   Ensure the pipeline is only logging and checkpointing on one host.
1579 | -   Make sure before evaluation or checkpointing is run, the batch norm
1580 |     statistics are synchronized across hosts.
1581 | -   It is critical to have RNG seeds that are the same across hosts (for model
1582 |     initialization), and seeds that are different across hosts (for data
1583 |     shuffling/preprocessing), so make sure to mark them appropriately.
1584 | -   Sharding data files across hosts is usually recommended for improved
1585 |     performance.
1586 | 
1587 | ## FAQs
1588 | 
1589 | ### What is the best learning rate decay schedule family?
1590 | 
1591 | <details><summary><em>[Click to expand]</em></summary>
1592 | 
1593 | <br>
1594 | 
1595 | -   It’s an open problem. It’s not clear how to construct a set of rigorous
1596 |     experiments to confidently answer what the "best" LR decay schedule is.
1597 | -   Although we don't know the best schedule family, we're confident that it’s
1598 |     important to have some (non-constant) schedule and that tuning it matters.
1599 | -   Different learning rates work best at different times during the
1600 |     optimization process. Having some sort of schedule makes it more likely for
1601 |     the model to hit a good learning rate.
1602 | 
1603 | </details>
1604 | 
1605 | ### Which learning rate decay should I use as a default?
1606 | 
1607 | <details><summary><em>[Click to expand]</em></summary>
1608 | <br>
1609 | 
1610 | -   Our preference is either linear decay or cosine decay, and a bunch of other
1611 |     schedule families are probably good too.
1612 | 
1613 | </details>
1614 | 
1615 | ### Why do some papers have complicated learning rate schedules?
1616 | 
1617 | <details><summary><em>[Click to expand]</em></summary>
1618 | <br>
1619 | 
1620 | -   It’s not uncommon to see papers with complicated piecewise learning rate
1621 |     (LR) decay schedules.
1622 | -   Readers often wonder how the authors arrived at such a complicated study.
1623 | -   Many complicated LR decay schedules are the result of tuning the schedule as
1624 |     a function of the validation set performance in an ad hoc way:
1625 |     1.  Start a single training run with some simple LR decay (or a constant
1626 |         learning rate).
1627 |     2.  Keep training running until the performance seems to stagnate. If this
1628 |         happens, pause training. Resume it with a perhaps steeper LR decay
1629 |         schedule (or smaller constant learning rate) from this point. Repeat
1630 |         this process until the conference/launch deadline.
1631 | -   Blithely copying the resulting *schedule* is generally not a good idea since
1632 |     the best particular schedule will be sensitive to a host of other
1633 |     hyperparameter choices.
1634 |     -   Better to copy the *algorithm* that produced the schedule, although this
1635 |         is rarely possible when arbitrary human judgment produced the schedule.
1636 | -   This type of validation-error-sensitive schedule is fine to use if it can be
1637 |     fully automated, but human-in-the-loop schedules that are a function of
1638 |     validation error are brittle and not easily reproducible, so we recommend
1639 |     avoiding them.
1640 |     -   Before publishing results that used such a schedule, please try to make
1641 |         it fully reproducible.
1642 | 
1643 | </details>
1644 | 
1645 | ### How should Adam’s hyperparameters be tuned?
1646 | 
1647 | <details><summary><em>[Click to expand]</em></summary>
1648 | <br>
1649 | 
1650 | -   As discussed above, making general statements about search spaces and how
1651 |     many points one should sample from the search space is very difficult. Note
1652 |     that not all the hyperparameters in Adam are equally important. The
1653 |     following rules of thumb correspond to different "budgets" for the number of
1654 |     trials in a study.
1655 |     -   If < 10 trials in a study, only tune the (base) learning rate.
1656 |     -   If 10-25 trials, tune learning rate and $\beta_1$.
1657 |     -   If 25+ trials, tune the learning rate, $\beta_1$ and $\epsilon$.
1658 |     -   If one can run substantially more than 25 trials, additionally tune
1659 |         $\beta_2$.
1660 | 
1661 | </details>
1662 | 
1663 | ### Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?
1664 | 
1665 | <details><summary><em>[Click to expand]</em></summary>
1666 | 
1667 | -   Quasi-random search (based on
1668 |     [low-discrepancy sequences](https://en.wikipedia.org/wiki/Low-discrepancy_sequence))
1669 |     is our preference over fancier black box optimization tools when used as
1670 |     part of an iterative tuning process intended to maximize insight into the
1671 |     tuning problem (what we refer to as the "exploration phase"). Bayesian
1672 |     optimization and similar tools are more appropriate for the exploitation
1673 |     phase.
1674 | -   Quasi-random search based on randomly shifted low-discrepancy sequences can
1675 |     be thought of as "jittered, shuffled grid search", since it uniformly, but
1676 |     randomly, explores a given search space and spreads out the search points
1677 |     more than random search.
1678 | -   The advantages of quasi-random search over more sophisticated black box
1679 |     optimization tools (e.g. Bayesian optimization, evolutionary algorithms)
1680 |     include:
1681 |     1.  Sampling the search space non-adaptively makes it possible to change the
1682 |         tuning objective in post hoc analysis without rerunning experiments.
1683 |         -   For example, we usually want to find the best trial in terms of
1684 |             validation error achieved at any point in training. But the
1685 |             non-adaptive nature of quasi-random search makes it possible to find
1686 |             the best trial based on final validation error, training error, or
1687 |             some alternative evaluation metric without rerunning any
1688 |             experiments.
1689 |     2.  Quasi-random search behaves in a consistent and statistically
1690 |         reproducible way.
1691 |         -   It should be possible to reproduce a study from six months ago even
1692 |             if the implementation of the search algorithm changes, as long as it
1693 |             maintains the same uniformity properties. If using sophisticated
1694 |             Bayesian optimization software, the implementation might change in
1695 |             an important way between versions, making it much harder to
1696 |             reproduce an old search. It isn’t always possible to roll back to an
1697 |             old implementation (e.g. if the optimization tool is run as a
1698 |             service).
1699 |     3.  Its uniform exploration of the search space makes it easier to reason
1700 |         about the results and what they might suggest about the search space.
1701 |         -   For example, if the best point in the traversal of quasi-random
1702 |             search is at the boundary of the search space, this is a good (but
1703 |             not foolproof) signal that the search space bounds should be
1704 |             changed. [This section](#identifying-bad-search-space-boundaries)
1705 |             goes into more depth. However, an adaptive black box optimization
1706 |             algorithm might have neglected the middle of the search space
1707 |             because of some unlucky early trials even if it happens to contain
1708 |             equally good points, since it is this exact sort of non-uniformity
1709 |             that a good optimization algorithm needs to employ to speed up the
1710 |             search.
1711 |     4.  Running different numbers of trials in parallel versus sequentially will
1712 |         not produce statistically different results when using quasi-random
1713 |         search (or other non-adaptive search algorithms), unlike with adaptive
1714 |         algorithms.
1715 |     5.  More sophisticated search algorithms may not always handle infeasible
1716 |         points correctly, especially if they aren't designed with neural network
1717 |         hyperparameter tuning in mind.
1718 |     6.  Quasi-random search is simple and works especially well when many tuning
1719 |         trials will be running in parallel.
1720 |         -   Anecdotally[^3], it is very hard for an adaptive algorithm to beat a
1721 |             quasi-random search that has 2X its budget, especially when many
1722 |             trials need to be run in parallel (and thus there are very few
1723 |             chances to make use of previous trial results when launching new
1724 |             trials).
1725 |         -   Without expertise in Bayesian optimization and other advanced black
1726 |             box optimization methods, we might not achieve the benefits they
1727 |             are, in principle, capable of providing. It is hard to benchmark
1728 |             advanced black box optimization algorithms in realistic deep
1729 |             learning tuning conditions. They are a very active area of current
1730 |             research, and the more sophisticated algorithms come with their own
1731 |             pitfalls for inexperienced users. Experts in these methods are able
1732 |             to get good results, but in high-parallelism conditions the search
1733 |             space and budget tend to matter a lot more.
1734 | -   That said, if our computational resources only allow a small number of
1735 |     trials to run in parallel and we can afford to run many trials in sequence,
1736 |     Bayesian optimization becomes much more attractive despite making our tuning
1737 |     results harder to interpret.
1738 | 
1739 | [^3]: Ben Recht and Kevin Jamieson
1740 |     [pointed out](http://www.argmin.net/2016/06/20/hypertuning/) how strong
1741 |     2X-budget random search is as a baseline (the
1742 |     [Hyperband paper](https://jmlr.org/papers/volume18/16-558/16-558.pdf)
1743 |     makes similar arguments), but it is certainly possible to find search
1744 |     spaces and problems where state-of-the-art Bayesian optimization
1745 |     techniques crush random search that has 2X the budget. However, in our
1746 |     experience beating 2X-budget random search gets much harder in the
1747 |     high-parallelism regime since Bayesian optimization has no opportunity to
1748 |     observe the results of previous trials.
1749 | 
1750 | </details>
1751 | 
1752 | ### Where can I find an implementation of quasi-random search?
1753 | 
1754 | <details><summary><em>[Click to expand]</em></summary>
1755 | <br>
1756 | 
1757 | -   [Open-Source Vizier](https://github.com/google/vizier) has an [implementation
1758 |     of quasi-ranom search](https://github.com/google/vizier/blob/main/vizier/_src/algorithms/designers/quasi_random.py). Set `algorithm="QUASI_RANDOM_SEARCH"` in [this usage example](https://oss-vizier.readthedocs.io/en/latest/guides/user/running_vizier.html).
1759 | -   An alternative implementation exists
1760 |     [here](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/halton.py).
1761 | -   Both implementations above generate a Halton sequence for a given search space (intended to
1762 |     implement a shifted, scrambled Halton sequence as recommended in
1763 |     https://arxiv.org/abs/1706.03200).
1764 | -   If a quasi-random search algorithm based on a low-discrepancy sequence is
1765 |     not available, it is possible to substitute pseudo random uniform search
1766 |     instead, although this is likely to be slightly less efficient.
1767 |     -   In 1-2 dimensions, grid search is also acceptable, although not in
1768 |         higher dimensions (see
1769 |         [Bergstra & Bengio, 2012](https://www.jmlr.org/papers/v13/bergstra12a.html)).
1770 | 
1771 | </details>
1772 | 
1773 | ### How many trials are needed to get good results with quasi-random search?
1774 | 
1775 | <details><summary><em>[Click to expand]</em></summary>
1776 | <br>
1777 | 
1778 | <p align="center">
1779 | <img src="assets/have_we_sampled_enough.png" width="49%" alt="A box plot showing the importance of sampling enough">
1780 | </p>
1781 | 
1782 | <p align="center"><b>Figure 3:</b> A ResNet-50 was tuned on ImageNet with 100
1783 | trials. Via bootstrapping, different amounts of tuning budget were simulated.
1784 | Box plots of the best performances for each trial budget are plotted above.
1785 | 
1786 | -   There is no way to answer this question in general, but we can look at
1787 |     specific examples.
1788 | -   As the Figure 3 shows, the number of trials in a study can have a
1789 |     substantial impact on the results.
1790 |     -   Notice how large the interquartile ranges are when 6 trials were
1791 |         sampled, versus when 20 trials were sampled.
1792 |     -   Even with 20 trials, it is likely that the difference between especially
1793 |         lucky and unlucky studies will be larger than the typical variation
1794 |         between re-trains of this model on different random seeds, with fixed
1795 |         hyperparameters, which for this workload might be around +/- 0.1% on a
1796 |         validation error rate of \~23%.
1797 | 
1798 | </details>
1799 | 
1800 | ### How can optimization failures be debugged and mitigated?
1801 | 
1802 | <details><summary><em>[Click to expand]</em></summary>
1803 | <br>
1804 | 
1805 | 
1806 | ***Summary:*** *If the model is experiencing optimization difficulties, it’s
1807 | important to fix them before trying other things. Diagnosing and correcting
1808 | training failures is an active area of research.*
1809 | 
1810 | <p align="center">
1811 | <img src="assets/stride_instability.png" width="80%" alt="Changing the strides in a single residual block in a WideResnet results in training instability.">
1812 | </p>
1813 | 
1814 | 
1815 | <p align="center"><b>Figure 4:</b> Changing the strides in a single residual block (2x2 -> 1x1) in a WideResnet results in training instability. This does not degrade performance at low learning rates, but high learning rates no longer train well due to the instability. Applying 1000 steps of learning rate warmup resolves this particular instance of instability, allowing stable training at max learning rate of .1.</p>
1816 | 
1817 | #### Identifying unstable workloads
1818 | 
1819 | -   Any workload will become unstable if the learning rate is too large.
1820 |     Instability is only an issue when it forces us to use a learning rate that’s
1821 |     too small.
1822 | -   There are at least two types of training instability worth distinguishing:
1823 |     1.  Instability at initialization/early in training.
1824 |     2.  Sudden instability in the middle of training.
1825 | -   We can take a systematic approach to identifying stability issues in our
1826 |     workload.
1827 |     1.  Do a learning rate sweep and find the best learning rate lr*.
1828 |     2.  Plot training loss curves for learning rates just above lr*.
1829 |     3.  If the learning rates > lr* show loss instability (loss goes up not down
1830 |         during periods of training), then it is likely that fixing the
1831 |         instability will result in better training.
1832 | -   Log the L2 norm of the full loss gradient during training, outlier values
1833 |     can result in spurious instability in the middle of training. This can
1834 |     inform how to pick gradient/update clipping.
1835 | 
1836 | **NOTE:** Some models show very early instability followed by a recovery that
1837 | results in slow but stable training. **Common evaluation schedules can miss
1838 | these issues by not evaluating frequently enough!**
1839 | 
1840 | To check for this, we can train for an abbreviated run of just \~500 steps using
1841 | `lr = 2 * current best`, but evaluate every step.
1842 | 
1843 | <p align="center">
1844 | <img src="assets/more_frequent_evals.png" width="80%" alt="Illustration of the value of more frequent evaluations at the start of
1845 | training.">
1846 | </p>
1847 | 
1848 | <p align="center"><b>Figure 5:</b> Illustration of the value of more frequent evaluations at the start of training. Useful if there’s a suspicion that the model suffers from early training instability.</p>
1849 | 
1850 | #### Potential fixes for common instability patterns
1851 | 
1852 | -   Apply learning rate warmup
1853 |     -   Best for early training instability.
1854 | -   Apply gradient clipping
1855 |     -   Good for both early and mid training instability, may fix some bad inits
1856 |         that warmup cannot.
1857 | -   Try a new optimizer
1858 |     -   Sometimes Adam can handle instabilities that Momentum can’t. This is an
1859 |         active area of research.
1860 | -   We can ensure that we’re using best practices/initializations for our model
1861 |     architecture (examples below).
1862 |     -   Add residual connections and normalization if the model doesn't contain
1863 |         it already.
1864 | -   Normalization should be the last operation before the residual. E.g. x +
1865 |     Norm(f(x)).
1866 | -   Norm(x + f(x)) known to cause issues.
1867 | -   Try initializing residual branches to 0 (e.g.
1868 |     [ReZero init](https://arxiv.org/abs/2003.04887)).
1869 | -   Lower the learning rate
1870 |     -   This is a last resort.
1871 | 
1872 | #### Learning rate warmup
1873 | 
1874 | <p align="center">
1875 | <img src="assets/instability_during_warmup.png" width="80%" alt="An example of instability during a warmup period (note the horizontal axis log
1876 | scale).">
1877 | </p>
1878 | 
1879 | <p align="center"><b>Figure 6:</b> An example of instability during a warmup period (note the horizontal axis log scale). 40k steps of warmup was needed for successful training in this case.</p>
1880 | 
1881 | ##### When to apply learning rate warmup
1882 | 
1883 | <p align="center">
1884 | <img src="assets/axis_model_with_instability.png" width="49%" alt="Axis plot for model with instability">
1885 | </p>
1886 | 
1887 | <p align="center"><b>Figure 7a:</b> An example of a hyperparameter axis plot for a model exhibiting training instability. The best learning rate is at the edge of what is feasible. An "infeasible" trial is defined as one that either produces NaNs or uncharacteristically high values of the loss.</p>
1888 | 
1889 | <p align="center">
1890 | <img src="assets/loss_model_with_instability.png" width="49%" alt="Loss curve for model with instability">
1891 | </p>
1892 | 
1893 | <p align="center"><b>Figure 7b:</b> The training loss of a model trained with a learning rate where we see instability.</p>
1894 | 
1895 | -   Figure 7a shows a hyperparameter axis plot that indicates a model
1896 |     experiencing optimization instabilities, because the best learning rate is
1897 |     right at the edge of instability.
1898 | -   Figure 7b shows how this can be double-checked by examining the training
1899 |     loss of a model trained with a learning rate either 5x or 10x larger than
1900 |     this peak. If that plot shows a sudden rise in the loss after a steady
1901 |     decline (e.g. at step \~10k in the figure above), then the model likely
1902 |     suffers from optimization instability.
1903 | 
1904 | ##### How to apply learning rate warmup
1905 | 
1906 | <p align="center">
1907 | <img src="assets/beneficial_effect_warmup.png" width="80%" alt="Beneficial effect of warmup on training instabilities">
1908 | </p>
1909 | 
1910 | <p align="center"><b>Figure 8:</b> Beneficial effect of learning rate warmup on addressing training instabilities.</p>
1911 | 
1912 | -   Using the section immediately above, we assume that the practitioner has
1913 |     already identified the learning rate at which the model becomes unstable.
1914 |     This is the `unstable_base_learning_rate`.
1915 | -   Warmup involves prepending a learning rate schedule that ramps up the
1916 |     learning rate from 0 to some stable `base_learning_rate`, that is at least
1917 |     one order of magnitude larger than `unstable_base_learning_rate`. The
1918 |     default would be to try a `base_learning_rate` that’s 10x
1919 |     `unstable_base_learning_rate`. Although note that it’d be possible to run
1920 |     this entire procedure again for something like 100x
1921 |     `unstable_base_learning_rate`. The specific schedule is:
1922 |     -   Ramp up from 0 to `base_learning_rate` over `warmup_steps`.
1923 |     -   Train at a constant rate for `post_warmup_steps`.
1924 | -   Our goal is to find the shortest number of `warmup_steps` that allows us to
1925 |     access peak learning rates that are much higher than
1926 |     `unstable_base_learning_rate`.
1927 | -   So for each `base_learning_rate`, we need to tune `warmup_steps` and
1928 |     `post_warmup_steps`. It’s usually fine to set `post_warmup_steps` to be
1929 |     `2*warmup_steps`.
1930 | -   Warmup can be tuned independently of an existing decay schedule.
1931 |     `warmup_steps` should be swept at a few different orders of magnitude. For
1932 |     example, an example study could try [10, 10<sup>3</sup>, 10<sup>4</sup>,
1933 |     10<sup>5</sup>]. The largest feasible point shouldn't be more than 10% of
1934 |     `max_train_steps`.
1935 | -   Once a `warmup_steps` that doesn't blow up training at `base_learning_rate`
1936 |     has been established, it should be applied to the baseline model.
1937 |     Essentially, we prepend this schedule onto the existing schedule, and use
1938 |     the optimal checkpoint selection discussed above to compare this experiment
1939 |     to the baseline. For example, if we originally had 10,000 `max_train_steps`
1940 |     and did `warmup_steps` for 1000 steps, the new training procedure should run
1941 |     for 11,000 steps total.
1942 | -   If long `warmup_steps` are required for stable training (>5% of
1943 |     `max_train_steps`), `max_train_steps` may need to be increased to account
1944 |     for this.
1945 | -   There isn't really a "typical" value across the full range of workloads.
1946 |     Some models only need 100 steps, while others (particularly transformers)
1947 |     may need 40k+.
1948 | 
1949 | #### Gradient clipping
1950 | 
1951 | <p align="center">
1952 | <img src="assets/gradient_clipping.png" width="80%" alt="Gradient clipping on early training instabilities">
1953 | </p>
1954 | 
1955 | <p align="center"><b>Figure 9:</b> Illustration of gradient clipping correcting early training instability.</p>
1956 | 
1957 | -   Gradient clipping is most useful when large or outlier gradient issues
1958 |     occur.
1959 | -   Clipping can fix either early training instability (large gradient norm
1960 |     early), or mid training instabilities (sudden gradient spikes mid training).
1961 | -   Sometimes longer warmup periods can correct instabilities that clipping does
1962 |     not: see [this section above](#How-to-apply-learning-rate-warmup).
1963 |     -   🤖 What about clipping during warmup?
1964 | -   The ideal clip thresholds are just above the "typical" gradient norm.
1965 | -   Here’s an example of how gradient clipping could be done:
1966 |     -   If the norm of the gradient $\left | g \right |$ is greater than the
1967 |         gradient clipping threshold $\lambda$, then do ${g}'= \lambda \times \frac{g}{\left | g \right |}$ where ${g}'$ is the new gradient.
1968 | -   Log the unclipped gradient norm during training. By default, generate:
1969 |     -   A plot of gradient norm vs step
1970 |     -   A histogram of gradient norms aggregated over all steps
1971 | -   Choose a gradient clipping threshold based on the 90th percentile of
1972 |     gradient norms.
1973 |     -   The threshold will be workload dependent, but 90% is a good starting
1974 |         point. If it doesn't work, this threshold can be tuned.
1975 |     -   🤖 What about some sort of adaptive strategy?
1976 | -   If we try gradient clipping and the instability issues remain, we can try it
1977 |     harder (i.e. make the threshold smaller).
1978 | -   Extremely aggressive gradient clipping is in essence a strange way of
1979 |     reducing the learning rate. If we find ourselves using extremely aggressive
1980 |     clipping, we probably should just cut the learning rate instead.
1981 | -   We would usually consider having >50% of the updates getting clipped somehow
1982 |     as "extremely aggressive".
1983 | -   If we need to do extremely aggressive gradient clipping to deal with our
1984 |     instability issues, then we might as well reduce the learning rate.
1985 | 
1986 | </details>
1987 | 
1988 | ### Why do you call the learning rate and other optimization parameters hyperparameters? They are not parameters of any prior distribution.
1989 | 
1990 | <details><summary><em>[Click to expand]</em></summary>
1991 | <br>
1992 | 
1993 | -   It is true that the term "hyperparameter" has a precise
1994 |     [meaning](https://en.wikipedia.org/wiki/Hyperparameter) in Bayesian machine
1995 |     learning and referring to the learning rate and most of the other parameters
1996 |     we tune in deep learning as "hyperparameters" is an abuse of terminology.
1997 | -   We would prefer to use the term "metaparameter" for learning rates,
1998 |     architectural parameters, and all the other things we tune in deep learning,
1999 |     since it avoids the potential for confusion that comes from misusing the
2000 |     word "hyperparameter" (confusion that is especially likely when discussing
2001 |     Bayesian optimization where the probabilistic response surface models have
2002 |     their own true hyperparameters).
2003 | -   Unfortunately, although potentially confusing, the term hyperparameter has become
2004 |     extremely common in the deep learning community.
2005 | -   Therefore, for a document, such as this one, intended for a wide audience
2006 |     that includes many people who are unlikely to be aware of this technicality,
2007 |     we made the choice to contribute to one source of confusion in the
2008 |     field in hopes of avoiding another.
2009 | -   That said, we might make a different choice when publishing a research
2010 |     paper, and we would encourage others to use "metaparameter" instead in most
2011 |     contexts.
2012 | 
2013 | </details>
2014 | 
2015 | ### Why shouldn't the batch size be tuned to directly improve validation set performance?
2016 | 
2017 | <details><summary><em>[Click to expand]</em></summary>
2018 | <br>
2019 | 
2020 | -   Changing the batch size *without changing any other details of the training pipeline* will often affect the validation set performance.
2021 | -   However, the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.
2022 | -   The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
2023 |     - Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Thus, larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques.
2024 | - In addition, [the number of training steps may need to be adjusted](#choosing-the-batch-size-to-minimize-training-time) when changing the batch size.
2025 | -   Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable validation performance (see [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
2026 | 
2027 | </details>
2028 | 
2029 | ### What are the update rules for all the popular optimization algorithms?
2030 | 
2031 | <details><summary><em>[Click to expand]</em></summary>
2032 | 
2033 | <br>
2034 | 
2035 | #### Stochastic gradient descent (SGD)
2036 | 
2037 | $$\theta_{t+1} = \theta_{t} - \eta_t \nabla \mathcal{l}(\theta_t)$$
2038 | 
2039 | #### Momentum
2040 | 
2041 | $$v_0 = 0$$
2042 | 
2043 | $$v_{t+1} = \gamma v_{t} + \nabla \mathcal{l}(\theta_t)$$
2044 | 
2045 | $$\theta_{t+1} = \theta_{t} - \eta_t v_{t+1}$$
2046 | 
2047 | #### Nesterov
2048 | 
2049 | $$v_0 = 0$$
2050 | 
2051 | $$v_{t+1} = \gamma v_{t} + \nabla \mathcal{l}(\theta_t)$$
2052 | 
2053 | $$\theta_{t+1} = \theta_{t} - \eta_t( \gamma v_{t+1} + \nabla \mathcal{l}(\theta_{t})$$
2054 | 
2055 | #### RMSProp
2056 | 
2057 | $$v_0 = 1 \text{,} m_0 = 0$$
2058 | 
2059 | $$v_{t+1} = \rho v_{t} + (1 - \rho) \nabla \mathcal{l}(\theta_t)^2$$
2060 | 
2061 | $$m_{t+1} = \gamma m_{t} + \frac{\eta_t}{\sqrt{v_{t+1} + \epsilon}}\nabla \mathcal{l}(\theta_t)$$
2062 | 
2063 | $$\theta_{t+1} = \theta_{t} - m_{t+1}$$
2064 | 
2065 | #### ADAM
2066 | 
2067 | $$m_0 = 0 \text{,} v_0 = 0$$
2068 | 
2069 | $$m_{t+1} = \beta_1 m_{t} + (1 - \beta_1) \nabla \mathcal{l} (\theta_t)$$
2070 | 
2071 | $$v_{t+1} = \beta_2 v_{t} + (1 - \beta_2) \nabla \mathcal{l}(\theta_t)^2$$
2072 | 
2073 | $$b_{t+1} = \frac{\sqrt{1 - \beta_2^{t+1}}}{1 - \beta_1^{t+1}}$$
2074 | 
2075 | $$\theta_{t+1} = \theta_{t} - \alpha_t \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon} b_{t+1}$$
2076 | 
2077 | #### NADAM
2078 | 
2079 | $$m_0 = 0 \text{,} v_0 = 0$$
2080 | 
2081 | $$m_{t+1} = \beta_1 m_{t} + (1 - \beta_1) \nabla \mathcal{l} (\theta_t)$$
2082 | 
2083 | $$v_{t+1} = \beta_2 v_{t} + (1 - \beta_2) \nabla \mathcal{l} (\theta_t)^2$$
2084 | 
2085 | $$b_{t+1} = \frac{\sqrt{1 - \beta_2^{t+1}}}{1 - \beta_1^{t+1}}$$
2086 | 
2087 | $$\theta_{t+1} = \theta_{t} - \alpha_t \frac{\beta_1 m_{t+1} + (1 - \beta_1) \nabla \mathcal{l} (\theta_t)}{\sqrt{v_{t+1}} + \epsilon} b_{t+1}$$
2088 | 
2089 | </details>
2090 | 
2091 | ## Acknowledgments
2092 | 
2093 | -   We owe a debt of gratitude to Max Bileschi, Roy Frostig, Zelda Mariet, Stan
2094 |     Bileschi, Mohammad Norouzi, Chris DuBois and Charles Sutton for reading the
2095 |     manuscript and providing valuable feedback.
2096 | -   We reused some experimental data for several plots that were originally
2097 |     produced by Naman Agarwal for other joint research.
2098 | -   We would like to thank Will Chen for invaluable advice on the presentation of the document.
2099 | -   We would also like to thank Rohan Anil for useful discussions.
2100 | 
2101 | ## Citing
2102 | 
2103 | ```
2104 | @misc{tuningplaybookgithub,
2105 |   author = {Varun Godbole and George E. Dahl and Justin Gilmer and Christopher J. Shallue and Zachary Nado},
2106 |   title = {Deep Learning Tuning Playbook},
2107 |   url = {http://github.com/google/tuning_playbook},
2108 |   year = {2023},
2109 |   note = {Version 1.0}
2110 | }
2111 | ```
2112 | 
2113 | ## Contributing
2114 | 
2115 | -   This is not an officially supported Google product.
2116 | 
2117 | -   We'd love to hear your feedback!
2118 | 
2119 |     -   If you like the playbook, please [leave a star](https://docs.github.com/en/get-started/exploring-projects-on-github/saving-repositories-with-stars#starring-a-repository)! Or email
2120 |         deep-learning-tuning-playbook \[at\] googlegroups.com. Testimonials help
2121 |         us justify creating more resources like this.
2122 |     -   If anything seems incorrect, please file an issue to start a discussion.
2123 |         For questions or other messages where an issue isn't appropriate, please
2124 |         open a new discussion topic on GitHub.
2125 | 
2126 | -   As discussed in the preamble, this is a living document. We anticipate
2127 |     making periodic improvements, both small and large. If you’d like to be
2128 |     notified, please watch our repository (see [instructions](https://docs.github.com/en/account-and-profile/managing-subscriptions-and-notifications-on-github/setting-up-notifications/configuring-notifications#configuring-your-watch-settings-for-an-individual-repository)).
2129 | 
2130 | -   Please don't file a pull request without first coordinating with the authors
2131 |     via the issue tracking system.
2132 | 
2133 | ### Contributor License Agreement
2134 | 
2135 | Contributions to this project must be accompanied by a Contributor License
2136 | Agreement (CLA). You (or your employer) retain the copyright to your
2137 | contribution; this simply gives us permission to use and redistribute your
2138 | contributions as part of the project. Head over to
2139 | <https://cla.developers.google.com/> to see your current agreements on file or
2140 | to sign a new one.
2141 | 
2142 | You generally only need to submit a CLA once, so if you've already submitted one
2143 | (even if it was for a different project), you probably don't need to do it
2144 | again.
2145 | 
2146 | ### Code Reviews
2147 | 
2148 | All submissions, including submissions by project members, require review. We
2149 | use GitHub pull requests for this purpose. Consult
2150 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
2151 | information on using pull requests.
2152 | 
2153 | ### Community Guidelines
2154 | 
2155 | This project follows
2156 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/).
2157 | 


--------------------------------------------------------------------------------