├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE.md ├── README.md ├── THIRD-PARTY-NOTICES.txt ├── assets ├── efficiency.png ├── nlp.png ├── scalability.png └── vision.png ├── examples ├── __init__.py ├── image_classification │ ├── CIFAR_TIMM.py │ ├── CV_TIMM.py │ ├── CelebA_TIMM.py │ ├── README.md │ ├── ZERO_examples │ │ ├── CIFAR_TIMM_FSDP_extending.py │ │ ├── CIFAR_TIMM_ZERO1.py │ │ ├── CIFAR_TIMM_ZERO23.py │ │ ├── CIFAR_TIMM_ZERO_extending.py │ │ └── cifar_config.json │ └── __init__.py ├── requirements.txt ├── table2text │ ├── README.md │ ├── __init__.py │ ├── compiled_args.py │ ├── data_utils │ │ ├── __init__.py │ │ ├── data_collator.py │ │ └── language_modeling.py │ ├── decoding_utils.py │ ├── gpt_config_stage123.json │ ├── misc.py │ ├── models.py │ ├── run.sh │ ├── run_ZERO1.sh │ ├── run_ZERO23.sh │ ├── run_ZERO_extending.py │ ├── run_language_modeling.py │ ├── run_language_modeling_ZERO23.py │ ├── run_language_modeling_extending.py │ └── trainer.py └── text_classification │ ├── README.md │ ├── __init__.py │ ├── data │ ├── download_dataset.sh │ ├── make_k_shot_without_dev.py │ └── make_valid_data.py │ ├── run_classification.py │ ├── run_wrapper.py │ └── src │ ├── __init__.py │ ├── common.py │ ├── compiled_args.py │ ├── dataset.py │ ├── label_search.py │ ├── models.py │ ├── processors.py │ └── trainer.py ├── fastDP ├── README.md ├── __init__.py ├── accounting │ ├── __init__.py │ ├── accounting_manager.py │ └── rdp_accounting.py ├── autograd_grad_sample.py ├── autograd_grad_sample_dist.py ├── lora_utils.py ├── privacy_engine.py ├── privacy_engine_dist_extending.py ├── privacy_engine_dist_stage23.py ├── supported_differentially_private_layers.py ├── supported_layers_grad_samplers.py └── transformers_support.py ├── requirements.txt └── setup.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. 
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at
194 | 
195 | http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 | 
--------------------------------------------------------------------------------
/NOTICE.md:
--------------------------------------------------------------------------------
1 | Fast Differential Privacy
2 | 
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Fast Differential Privacy
2 | 
3 | *Fast Differential Privacy* (**fastDP**) is a library that enables differentially private optimization of PyTorch models with only a few additional lines of code. The goal of this library is to make DP deep learning as close to standard non-private learning as possible, in terms of **speed, memory cost, scalability, accuracy and hyperparameter-tuning**. It supports all PyTorch optimizers, popular models in [TIMM](https://github.com/rwightman/pytorch-image-models), [torchvision](https://github.com/pytorch/vision) and [HuggingFace](https://huggingface.co/transformers/) (up to the supported modules), multiple privacy accountants, multiple clipping functions/styles, most parameter-efficient training methods, and distributed solutions such as DeepSpeed and FSDP. The library has provably little overhead in training time and memory cost, compared with standard non-private optimization.
4 | 
5 | 
6 | ---
7 | ## Installation
8 | To install the library after cloning the repository, run
9 | ```bash
10 | python -m setup develop
11 | ```
12 | 
13 | > :warning: **NOTE**: We strongly recommend Python>=3.8 and torch<=1.11 (a known issue is that torch 1.12 can be up to 3 times slower).
14 | 
15 | ## Getting started
16 | To train a model with differential privacy, simply create a `PrivacyEngine` and continue with the standard training pipeline:
17 | 
18 | ```python
19 | from fastDP import PrivacyEngine
20 | optimizer = SGD(model.parameters(), lr=0.05)
21 | privacy_engine = PrivacyEngine(
22 |     model,
23 |     batch_size=256,
24 |     sample_size=50000,
25 |     epochs=3,
26 |     target_epsilon=2,
27 |     clipping_fn='automatic',
28 |     clipping_mode='MixOpt',
29 |     origin_params=None,
30 |     clipping_style='all-layer',
31 | )
32 | # attaching to the optimizer is not needed for multi-GPU distributed learning
33 | privacy_engine.attach(optimizer)
34 | 
35 | #----- standard training pipeline
36 | loss = F.cross_entropy(model(batch), labels)
37 | loss.backward()
38 | optimizer.step()
39 | optimizer.zero_grad()
40 | ```
41 | 
42 | We provide details about our privacy engine in `fastDP/README.md`, including the supported modules and the arguments. By default, we use the `'MixOpt'` (hybrid book-keeping [4]) clipping mode, which enjoys almost the same time complexity as non-private optimization, and the automatic clipping function [8], which removes the need to tune the clipping threshold `max_grad_norm`.
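If you prefer to run the privacy accounting yourself, you can pass a precomputed `noise_multiplier` in place of `target_epsilon`. A minimal sketch, mirroring the bundled example scripts that compute the noise with Opacus's accounting utility (the hyperparameter values are placeholders, and `model`/`optimizer` are reused from the snippet above):

```python
from opacus.accountants.utils import get_noise_multiplier
from fastDP import PrivacyEngine

sigma = get_noise_multiplier(
    target_epsilon=2,         # privacy budget
    target_delta=1e-5,
    sample_rate=256 / 50000,  # batch_size / sample_size
    epochs=3,
)
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50000,
    noise_multiplier=sigma,   # bypasses the built-in epsilon-to-sigma conversion
    epochs=3,
)
privacy_engine.attach(optimizer)
```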
We support both the RDP and GLW privacy accountants, and additional accountants can be used through the argument `noise_multiplier`, after computing it with the [[Automating differential privacy computation](https://github.com/yuxiangw/autodp)] library.
43 | 
44 | We also support gradient accumulation, which lets training use the very large batch sizes that benefit DP optimization:
45 | ```python
46 | for i, batch in enumerate(dataloader):
47 |     loss = F.cross_entropy(model(batch), labels)
48 |     loss.backward()
49 |     if (i + 1) % gradient_accumulation_steps == 0:
50 |         optimizer.step()
51 |         optimizer.zero_grad()
52 | ```
53 | 
54 | ## Foundation model release
55 | We release DP vision foundation models in [v2.1](https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1): VisionTransformer models (ViT, ~86M parameters) following [Pre-training Differentially Private Models with Limited Public Data](https://arxiv.org/abs/2402.18752) in NeurIPS 2024. These models have [epsilon=2](https://github.com/awslabs/fast-differential-privacy/releases/download/v2.1/ViT_base_imgnet11k_DP_eps2.pt) and [epsilon=8](https://github.com/awslabs/fast-differential-privacy/releases/download/v2.1/ViT_base_imgnet11k_DP_eps8.pt), pre-trained on ImageNet-1k with AdamW (1k classes, 1 million images) and ImageNet-11k with DP-AdamW (11k classes, 11 million images). More DP foundation models to come!
56 | 
57 | ## Highlights
58 | 1. This library enables large model training in the **multi-GPU distributed setting** and **supports mixed precision training** under DeepSpeed and FSDP; a rough sketch of the distributed wiring follows the figure below.
59 | 
60 | <p align="center"><img src="assets/scalability.png"></p>
61 | 
62 | The scalability has been tested on 100B-parameter models with 512 GPUs.
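As a rough sketch of how a distributed run differs from the single-GPU recipe in Getting started: the engine is created on the model and `attach` is skipped (see the comment in that snippet). The wrapping order and the use of the plain `PrivacyEngine` here are assumptions for illustration only; the tested ZeRO and FSDP scripts use the dedicated engines in `fastDP/privacy_engine_dist_stage23.py` and `fastDP/privacy_engine_dist_extending.py` (see `examples/image_classification/ZERO_examples` and `examples/table2text`).

```python
# Illustrative only; launch with e.g. `torchrun --nproc_per_node=8 train.py`.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from fastDP import PrivacyEngine

dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

model = torch.nn.Linear(784, 10).cuda()  # stand-in model
privacy_engine = PrivacyEngine(model, batch_size=256, sample_size=50000,
                               epochs=3, target_epsilon=2)
model = FSDP(model)                      # assumed order: install DP hooks, then shard
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
# NOTE: privacy_engine.attach(optimizer) is NOT called in the multi-GPU setting.
```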


66 | 67 | 2. This library enables DP training to have almost **the same time and space complexity** as the standard non-private training. This is achieved by three key techniques as described in [4]: mixed ghost norm, book-keeping, and ghost differentiation. In practice, we observe <20% memory overhead and <25% slowdown across different tasks. 68 | 69 |

70 | <p align="center"><img src="assets/efficiency.png"></p>
71 | 
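To make the ghost norm technique from point 2 concrete, here is a self-contained sketch (an illustration of the underlying identity, not the library's internal code): for a linear layer, each per-sample gradient norm can be computed from two small Gram matrices, without ever materializing the per-sample gradients.

```python
import torch

B, T, d, p = 4, 8, 16, 10   # batch size, sequence length, in/out features
a = torch.randn(B, T, d)    # layer inputs (activations)
g = torch.randn(B, T, p)    # gradients w.r.t. the layer outputs

# Direct route: materialize per-sample gradients, O(B*p*d) memory.
per_sample_grad = torch.einsum('bti,btj->bij', g, a)   # (B, p, d)
norms_direct = per_sample_grad.flatten(1).norm(dim=1)

# Ghost norm route: two T x T Gram matrices per sample, O(B*T^2) memory.
gram_a = torch.einsum('bti,bsi->bts', a, a)            # (B, T, T)
gram_g = torch.einsum('bti,bsi->bts', g, g)            # (B, T, T)
norms_ghost = (gram_a * gram_g).sum(dim=(1, 2)).sqrt()

assert torch.allclose(norms_direct, norms_ghost, rtol=1e-4, atol=1e-4)
```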

72 | 
73 | 3. Specifically, this library overcomes the severe memory issues of large models (commonly encountered by Opacus, which instantiates the per-sample gradients) and of high-dimensional data (commonly encountered by ghost clipping, e.g. in Private Transformers), by leveraging the mixed ghost norm trick [3,8]; a sketch of the per-layer decision rule follows the figures below.
74 | 
75 | 

76 | <p align="center"><img src="assets/vision.png"></p>
77 | <p align="center"><img src="assets/nlp.png"></p>
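A minimal sketch of that per-layer decision (the 2*T^2 vs. d*p threshold follows [3]; the library's actual book-keeping may differ in its details):

```python
def ghost_norm_is_cheaper(T: int, d: int, p: int) -> bool:
    """Mixed ghost norm rule of thumb for one layer [3,8].

    Ghost norm builds two T x T Gram matrices per sample (~2*T^2 floats),
    while materializing the per-sample gradient costs ~d*p floats, where
    T is the sequence length (or number of spatial positions), d the
    input width and p the output width.
    """
    return 2 * T ** 2 <= d * p

# A ViT linear layer (T=197 tokens, d=p=768): ghost norm wins.
print(ghost_norm_is_cheaper(197, 768, 768))  # True
# An early conv layer viewed as a matmul (T=56*56, d=3*3*3, p=64): it loses.
print(ghost_norm_is_cheaper(3136, 27, 64))   # False
```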

78 | 
79 | 4. We **support all optimizers** in [`torch.optim`](https://pytorch.org/docs/stable/optim.html) (SGD, Adam, AdaGrad, ...) and a wide range of **models** (BERT, RoBERTa, GPT, ViT, BEiT, CrossViT, DEiT, ResNet, VGG, DenseNet, ...), including their parameter-efficient variants. For example, one can run DP bias-term fine-tuning (DP-BiTFiT) by simply freezing the non-bias terms, as in `examples/image_classification`.
80 | 
81 | ------
82 | Full fine-tuning results on a single A100 GPU
83 | 
84 | | Datasets | ε | Setting | Model | Accuracy | Time (min)/epoch |
85 | |----------|---|---------|-------|----------|------------------|
86 | | CIFAR10  | 2 | [6] | ViT-large     | 98.9      | 7.0  |
87 | | CIFAR100 | 2 | [6] | BEiT-large    | 88.7      | 6.5  |
88 | | CelebA   | 3 | [6] | ResNet18      | 88.2      | 2.7  |
89 | | SST2     | 3 | [8] | RoBERTa-large | 93.9      | 13.5 |
90 | | QNLI     | 3 | [8] | RoBERTa-large | 91.0      | 20.2 |
91 | | QQP      | 3 | [8] | RoBERTa-large | 86.8      | 70.0 |
92 | | MNLI     | 3 | [8] | RoBERTa-large | 86.3/86.7 | 77.1 |
93 | 
94 | More datasets, epsilon budgets, models, fine-tuning styles, and different hyperparameters can be found in the related papers.
95 | 
96 | 
97 | 
98 | ## Examples
99 | The `examples` folder covers table-to-text generation (E2E and DART datasets with GPT2 models), text classification (SST2/QNLI/QQP/MNLI datasets with BERT/RoBERTa models), and image classification (CIFAR10/CIFAR100/CelebA datasets with [TIMM](https://github.com/rwightman/pytorch-image-models)/[torchvision](https://github.com/pytorch/vision) models). A detailed `README.md` can be found in each sub-folder. These examples can be used to reproduce the results in [2,3,4,6,8].
100 | 
101 | 
102 | ## Citation
103 | Please consider citing the following if you use this library in your work:
104 | ```
105 | @inproceedings{bu2023differentially,
106 |   title={Differentially private optimization on large model at small cost},
107 |   author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
108 |   booktitle={International Conference on Machine Learning},
109 |   pages={3192--3218},
110 |   year={2023},
111 |   organization={PMLR}
112 | }
113 | 
114 | @article{bu2023zero,
115 |   title={Zero redundancy distributed learning with differential privacy},
116 |   author={Bu, Zhiqi and Chiu, Justin and Liu, Ruixuan and Zha, Sheng and Karypis, George},
117 |   booktitle={ICLR 2023 Workshop on Pitfalls of limited data and computation for Trustworthy ML},
118 |   journal={arXiv preprint arXiv:2311.11822},
119 |   year={2023}
120 | }
121 | 
122 | @inproceedings{bu2022differentially,
123 |   title={Differentially Private Bias-Term Fine-tuning of Foundation Models},
124 |   author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
125 |   booktitle={Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022},
126 |   year={2022}
127 | }
128 | ```
129 | 
130 | ## Acknowledgements
131 | This codebase is largely inspired by [[Opacus (v0.15)]](https://github.com/pytorch/opacus), [[Private transformers (v0.2.3)]](https://github.com/lxuechen/private-transformers), [[Private Vision]](https://github.com/woodyx218/private_vision), and [[FastGradClip]](https://github.com/ppmlguy/fastgradclip).
132 | 
133 | ## References
134 | [1] Ian Goodfellow. "Efficient per-example gradient computations." arXiv preprint arXiv:1510.01799 (2015).
135 | 
136 | [2] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto.
"Large language models can be strong differentially private learners." ICLR (2022). 137 | 138 | [3] Zhiqi Bu, Jialin Mao, and Shiyun Xu. "Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy." NeurIPS (2022). 139 | 140 | [4] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Optimization on Large Model at Small Cost." ICML (2023). 141 | 142 | [5] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen et al. "Opacus: User-friendly differential privacy library in PyTorch." arXiv preprint arXiv:2109.12298 (2021). 143 | 144 | [6] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Bias-Term Fine-tuning of Foundation Models." ICML (2024). 145 | 146 | [7] Martin Abadi, et al. "Deep learning with differential privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 147 | 148 | [8] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Automatic clipping: Differentially private deep learning made easier and stronger." NeurIPS (2023). 149 | 150 | [9] Zhiqi Bu, Xinwei Zhang, Mingyi Hong, Sheng Zha, and George Karypis. "Pre-training Differentially Private Models with Limited Public Data." NeurIPS (2024). 151 | -------------------------------------------------------------------------------- /THIRD-PARTY-NOTICES.txt: -------------------------------------------------------------------------------- 1 | ** private-transformers; version 0.2.3 -- https://github.com/lxuechen/private-transformers 2 | ** opacus; version 0.15 -- https://github.com/pytorch/opacus 3 | ** private_vision; version initial version -- https://github.com/woodyx218/private_vision 4 | 5 | Apache License 6 | Version 2.0, January 2004 7 | http://www.apache.org/licenses/ 8 | 9 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 10 | 11 | 1. Definitions. 12 | 13 | "License" shall mean the terms and conditions for use, reproduction, and 14 | distribution as defined by Sections 1 through 9 of this document. 15 | 16 | "Licensor" shall mean the copyright owner or entity authorized by the copyright 17 | owner that is granting the License. 18 | 19 | "Legal Entity" shall mean the union of the acting entity and all other entities 20 | that control, are controlled by, or are under common control with that entity. 21 | For the purposes of this definition, "control" means (i) the power, direct or 22 | indirect, to cause the direction or management of such entity, whether by 23 | contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the 24 | outstanding shares, or (iii) beneficial ownership of such entity. 25 | 26 | "You" (or "Your") shall mean an individual or Legal Entity exercising 27 | permissions granted by this License. 28 | 29 | "Source" form shall mean the preferred form for making modifications, including 30 | but not limited to software source code, documentation source, and configuration 31 | files. 32 | 33 | "Object" form shall mean any form resulting from mechanical transformation or 34 | translation of a Source form, including but not limited to compiled object code, 35 | generated documentation, and conversions to other media types. 36 | 37 | "Work" shall mean the work of authorship, whether in Source or Object form, made 38 | available under the License, as indicated by a copyright notice that is included 39 | in or attached to the work (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object form, that 42 | is based on (or derived from) the Work and for which the editorial revisions, 43 | annotations, elaborations, or other modifications represent, as a whole, an 44 | original work of authorship. For the purposes of this License, Derivative Works 45 | shall not include works that remain separable from, or merely link (or bind by 46 | name) to the interfaces of, the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including the original version 49 | of the Work and any modifications or additions to that Work or Derivative Works 50 | thereof, that is intentionally submitted to Licensor for inclusion in the Work 51 | by the copyright owner or by an individual or Legal Entity authorized to submit 52 | on behalf of the copyright owner. For the purposes of this definition, 53 | "submitted" means any form of electronic, verbal, or written communication sent 54 | to the Licensor or its representatives, including but not limited to 55 | communication on electronic mailing lists, source code control systems, and 56 | issue tracking systems that are managed by, or on behalf of, the Licensor for 57 | the purpose of discussing and improving the Work, but excluding communication 58 | that is conspicuously marked or otherwise designated in writing by the copyright 59 | owner as "Not a Contribution." 60 | 61 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf 62 | of whom a Contribution has been received by Licensor and subsequently 63 | incorporated within the Work. 64 | 65 | 2. Grant of Copyright License. Subject to the terms and conditions of this 66 | License, each Contributor hereby grants to You a perpetual, worldwide, non- 67 | exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, 68 | prepare Derivative Works of, publicly display, publicly perform, sublicense, and 69 | distribute the Work and such Derivative Works in Source or Object form. 70 | 71 | 3. Grant of Patent License. Subject to the terms and conditions of this License, 72 | each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no- 73 | charge, royalty-free, irrevocable (except as stated in this section) patent 74 | license to make, have made, use, offer to sell, sell, import, and otherwise 75 | transfer the Work, where such license applies only to those patent claims 76 | licensable by such Contributor that are necessarily infringed by their 77 | Contribution(s) alone or by combination of their Contribution(s) with the Work 78 | to which such Contribution(s) was submitted. If You institute patent litigation 79 | against any entity (including a cross-claim or counterclaim in a lawsuit) 80 | alleging that the Work or a Contribution incorporated within the Work 81 | constitutes direct or contributory patent infringement, then any patent licenses 82 | granted to You under this License for that Work shall terminate as of the date 83 | such litigation is filed. 84 | 85 | 4. Redistribution. 
You may reproduce and distribute copies of the Work or 86 | Derivative Works thereof in any medium, with or without modifications, and in 87 | Source or Object form, provided that You meet the following conditions: 88 | 89 | (a) You must give any other recipients of the Work or Derivative Works a 90 | copy of this License; and 91 | 92 | (b) You must cause any modified files to carry prominent notices stating 93 | that You changed the files; and 94 | 95 | (c) You must retain, in the Source form of any Derivative Works that You 96 | distribute, all copyright, patent, trademark, and attribution notices from the 97 | Source form of the Work, excluding those notices that do not pertain to any part 98 | of the Derivative Works; and 99 | 100 | (d) If the Work includes a "NOTICE" text file as part of its distribution, 101 | then any Derivative Works that You distribute must include a readable copy of 102 | the attribution notices contained within such NOTICE file, excluding those 103 | notices that do not pertain to any part of the Derivative Works, in at least one 104 | of the following places: within a NOTICE text file distributed as part of the 105 | Derivative Works; within the Source form or documentation, if provided along 106 | with the Derivative Works; or, within a display generated by the Derivative 107 | Works, if and wherever such third-party notices normally appear. The contents of 108 | the NOTICE file are for informational purposes only and do not modify the 109 | License. You may add Your own attribution notices within Derivative Works that 110 | You distribute, alongside or as an addendum to the NOTICE text from the Work, 111 | provided that such additional attribution notices cannot be construed as 112 | modifying the License. 113 | 114 | You may add Your own copyright statement to Your modifications and may 115 | provide additional or different license terms and conditions for use, 116 | reproduction, or distribution of Your modifications, or for any such Derivative 117 | Works as a whole, provided Your use, reproduction, and distribution of the Work 118 | otherwise complies with the conditions stated in this License. 119 | 120 | 5. Submission of Contributions. Unless You explicitly state otherwise, any 121 | Contribution intentionally submitted for inclusion in the Work by You to the 122 | Licensor shall be under the terms and conditions of this License, without any 123 | additional terms or conditions. Notwithstanding the above, nothing herein shall 124 | supersede or modify the terms of any separate license agreement you may have 125 | executed with Licensor regarding such Contributions. 126 | 127 | 6. Trademarks. This License does not grant permission to use the trade names, 128 | trademarks, service marks, or product names of the Licensor, except as required 129 | for reasonable and customary use in describing the origin of the Work and 130 | reproducing the content of the NOTICE file. 131 | 132 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in 133 | writing, Licensor provides the Work (and each Contributor provides its 134 | Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 135 | KIND, either express or implied, including, without limitation, any warranties 136 | or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 137 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 138 | appropriateness of using or redistributing the Work and assume any risks 139 | associated with Your exercise of permissions under this License. 140 | 141 | 8. Limitation of Liability. In no event and under no legal theory, whether in 142 | tort (including negligence), contract, or otherwise, unless required by 143 | applicable law (such as deliberate and grossly negligent acts) or agreed to in 144 | writing, shall any Contributor be liable to You for damages, including any 145 | direct, indirect, special, incidental, or consequential damages of any character 146 | arising as a result of this License or out of the use or inability to use the 147 | Work (including but not limited to damages for loss of goodwill, work stoppage, 148 | computer failure or malfunction, or any and all other commercial damages or 149 | losses), even if such Contributor has been advised of the possibility of such 150 | damages. 151 | 152 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or 153 | Derivative Works thereof, You may choose to offer, and charge a fee for, 154 | acceptance of support, warranty, indemnity, or other liability obligations 155 | and/or rights consistent with this License. However, in accepting such 156 | obligations, You may act only on Your own behalf and on Your sole 157 | responsibility, not on behalf of any other Contributor, and only if You agree to 158 | indemnify, defend, and hold each Contributor harmless for any liability incurred 159 | by, or claims asserted against, such Contributor by reason of your accepting any 160 | such warranty or additional liability. 161 | 162 | END OF TERMS AND CONDITIONS 163 | 164 | APPENDIX: How to apply the Apache License to your work. 165 | 166 | To apply the Apache License to your work, attach the following boilerplate 167 | notice, with the fields enclosed by brackets "[]" replaced with your own 168 | identifying information. (Don't include the brackets!) The text should be 169 | enclosed in the appropriate comment syntax for the file format. We also 170 | recommend that a file or class name and description of purpose be included on 171 | the same "printed page" as the copyright notice for easier identification within 172 | third-party archives. 173 | 174 | Copyright [yyyy] [name of copyright owner] 175 | 176 | Licensed under the Apache License, Version 2.0 (the "License"); 177 | you may not use this file except in compliance with the License. 178 | You may obtain a copy of the License at 179 | 180 | http://www.apache.org/licenses/LICENSE-2.0 181 | 182 | Unless required by applicable law or agreed to in writing, software 183 | distributed under the License is distributed on an "AS IS" BASIS, 184 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 185 | See the License for the specific language governing permissions and 186 | limitations under the License. 187 | 188 | * For Private-Transformers see also this required NOTICE: 189 | None 190 | * For opacus see also this required NOTICE: 191 | Copyright (c) Meta Platforms, Inc. and affiliates. 
192 | * For private_vision see also this required NOTICE: 193 | None 194 | 195 | ------ 196 | 197 | ** ml-swissknife; version 0.1.7 -- https://github.com/lxuechen/ml-swissknife 198 | None 199 | 200 | MIT License 201 | 202 | Copyright (c) 203 | 204 | Permission is hereby granted, free of charge, to any person obtaining a copy of 205 | this software and associated documentation files (the "Software"), to deal in 206 | the Software without restriction, including without limitation the rights to 207 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 208 | the Software, and to permit persons to whom the Software is furnished to do so, 209 | subject to the following conditions: 210 | 211 | The above copyright notice and this permission notice shall be included in all 212 | copies or substantial portions of the Software. 213 | 214 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 215 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 216 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 217 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 218 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 219 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 220 | -------------------------------------------------------------------------------- /assets/efficiency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/efficiency.png -------------------------------------------------------------------------------- /assets/nlp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/nlp.png -------------------------------------------------------------------------------- /assets/scalability.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/scalability.png -------------------------------------------------------------------------------- /assets/vision.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/vision.png -------------------------------------------------------------------------------- /examples/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/__init__.py -------------------------------------------------------------------------------- /examples/image_classification/CIFAR_TIMM.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 4 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 5 | return None 6 | 7 | device= torch.device("cuda:0") 8 | 9 | # Data 10 | print('==> Preparing data..') 11 | 12 | transformation 
= torchvision.transforms.Compose([
13 |         torchvision.transforms.Resize(args.dimension),
14 |         torchvision.transforms.ToTensor(),
15 |         torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
16 |     ])
17 | 
18 | 
19 |     if args.cifar_data == 'CIFAR10':
20 |         trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation)
21 |         testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation)
22 |     elif args.cifar_data == 'CIFAR100':
23 |         trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation)
24 |         testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation)
25 |     else:
26 |         raise ValueError("Must specify dataset as CIFAR10 or CIFAR100")
27 | 
28 | 
29 |     trainloader = torch.utils.data.DataLoader(
30 |         trainset, batch_size=args.mini_bs, shuffle=True, num_workers=4)
31 | 
32 |     testloader = torch.utils.data.DataLoader(
33 |         testset, batch_size=100, shuffle=False, num_workers=4)
34 | 
35 |     n_acc_steps = args.bs // args.mini_bs  # gradient accumulation steps
36 | 
37 |     # Model
38 |     print('==> Building model..', args.model, '; BatchNorm is replaced by GroupNorm. Mode: ', args.clipping_mode)
39 |     net = timm.create_model(args.model, pretrained=True, num_classes=int(args.cifar_data[5:]))
40 |     net = ModuleValidator.fix(net); net = net.to(device)
41 | 
42 |     print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
43 |     print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
44 | 
45 |     criterion = nn.CrossEntropyLoss()
46 | 
47 |     optimizer = optim.Adam(net.parameters(), lr=args.lr)
48 | 
49 |     if 'BiTFiT' in args.clipping_mode:  # freeze non-bias terms; not needed for DP-BiTFiT but used here for safety
50 |         for name, param in net.named_parameters():
51 |             if '.bias' not in name:
52 |                 param.requires_grad_(False)
53 | 
54 |     # Privacy engine
55 |     if 'nonDP' not in args.clipping_mode:
56 |         sigma = get_noise_multiplier(
57 |             target_epsilon = args.epsilon,
58 |             target_delta = 1e-5,
59 |             sample_rate = args.bs / len(trainset),
60 |             epochs = args.epochs,
61 |         )
62 | 
63 |         if 'BK' in args.clipping_mode:
64 |             clipping_mode = args.clipping_mode[3:]  # strip the 'BK-' prefix
65 |         else:
66 |             clipping_mode = 'ghost'
67 | 
68 |         if args.clipping_style in [['all-layer'], ['layer-wise'], ['param-wise']]:
69 |             args.clipping_style = args.clipping_style[0]  # unwrap the single-element list from nargs='+'
70 |         privacy_engine = PrivacyEngine(
71 |             net,
72 |             batch_size=args.bs,
73 |             sample_size=len(trainset),
74 |             noise_multiplier=sigma,
75 |             epochs=args.epochs,
76 |             clipping_mode=clipping_mode,
77 |             clipping_style=args.clipping_style,
78 |             origin_params=args.origin_params,  # e.g. ['patch_embed.proj.bias']
79 |         )
80 |         privacy_engine.attach(optimizer)
81 | 
82 | 
83 |     def train(epoch):
84 | 
85 |         net.train()
86 |         train_loss = 0
87 |         correct = 0
88 |         total = 0
89 | 
90 | 
91 |         for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
92 |             inputs, targets = inputs.to(device), targets.to(device)
93 |             outputs = net(inputs)
94 |             loss = criterion(outputs, targets)
95 | 
96 |             loss.backward()
97 |             if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
98 |                 optimizer.step()
99 |                 optimizer.zero_grad()
100 | 
101 |             train_loss += loss.item()
102 |             _, predicted = outputs.max(1)
103 |             total += targets.size(0)
104 |             correct += predicted.eq(targets).sum().item()
105 | 
106 |         print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
107 |               % (train_loss/(batch_idx+1), 100.*correct/total,
correct, total))
108 | 
109 |     def test(epoch):
110 |         net.eval()
111 |         test_loss = 0
112 |         correct = 0
113 |         total = 0
114 |         with torch.no_grad():
115 |             for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
116 |                 inputs, targets = inputs.to(device), targets.to(device)
117 |                 outputs = net(inputs)
118 |                 loss = criterion(outputs, targets)
119 | 
120 |                 test_loss += loss.item()
121 |                 _, predicted = outputs.max(1)
122 |                 total += targets.size(0)
123 |                 correct += predicted.eq(targets).sum().item()
124 | 
125 |             print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
126 |                   % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
127 | 
128 |     for epoch in range(args.epochs):
129 |         train(epoch)
130 |         test(epoch)
131 | 
132 | 
133 | if __name__ == '__main__':
134 |     import argparse
135 | 
136 |     parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
137 |     parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
138 |     parser.add_argument('--epochs', default=3, type=int,
139 |                         help='number of epochs')
140 |     parser.add_argument('--bs', default=1000, type=int, help='batch size')
141 |     parser.add_argument('--mini_bs', type=int, default=50)
142 |     parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
143 |     parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
144 |     parser.add_argument('--clipping_style', default='all-layer', nargs='+', type=str)
145 |     parser.add_argument('--model', default='vit_small_patch16_224', type=str)
146 |     parser.add_argument('--cifar_data', type=str, default='CIFAR10')
147 |     parser.add_argument('--dimension', type=int, default=224)
148 |     parser.add_argument('--origin_params', nargs='+', default=None)
149 | 
150 |     args = parser.parse_args()
151 | 
152 |     from fastDP import PrivacyEngine
153 | 
154 |     import torch
155 |     import torchvision
156 |     torch.manual_seed(2)
157 |     import torch.nn as nn
158 |     import torch.optim as optim
159 |     import timm
160 |     from opacus.validators import ModuleValidator
161 |     from opacus.accountants.utils import get_noise_multiplier
162 |     from tqdm import tqdm
163 |     import warnings; warnings.filterwarnings("ignore")
164 | 
165 |     main(args)
166 | 
--------------------------------------------------------------------------------
/examples/image_classification/CV_TIMM.py:
--------------------------------------------------------------------------------
1 | '''Train computer vision models with PyTorch.'''
2 | def main(args):
3 | 
4 |     device = torch.device("cuda:0")
5 | 
6 |     # Data
7 |     transformation = torchvision.transforms.Compose([
8 |         torchvision.transforms.Resize((224, 224)),  # https://discuss.pytorch.org/t/runtimeerror-stack-expects-each-tensor-to-be-equal-size-but-got-3-224-224-at-entry-0-and-3-224-336-at-entry-3/87211/10
9 |         torchvision.transforms.ToTensor(),
10 |         torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
11 |     ])
12 | 
13 |     if args.dataset_name in ['SVHN', 'CIFAR10']:
14 |         num_classes = 10
15 |     elif args.dataset_name in ['CIFAR100', 'FGVCAircraft']:
16 |         num_classes = 100
17 |     elif args.dataset_name in ['Food101']:
18 |         num_classes = 101
19 |     elif args.dataset_name in ['GTSRB']:
20 |         num_classes = 43
21 |     elif args.dataset_name in ['CelebA']:
22 |         num_classes = 40
23 |     elif args.dataset_name in ['Places365']:
24 |         num_classes = 365
25 |     elif args.dataset_name in ['ImageNet']:
26 |         num_classes = 1000
27 |     elif args.dataset_name in ['INaturalist']:
28 |         num_classes = 10000
29 | 
30 | 
31 |     if args.dataset_name in ['SVHN', 'Food101', 'GTSRB', 'FGVCAircraft']:
32 |         trainset =
getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', download=True, transform=transformation) 33 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='test', download=True, transform=transformation) 34 | elif args.dataset_name in ['CIFAR10','CIFAR100']: 35 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', train=True, download=True, transform=transformation) 36 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', train=False, download=True, transform=transformation) 37 | elif args.dataset_name=='CelebA': 38 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', download=False, target_type='attr', transform=transformation) 39 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='test', download=False, target_type='attr',transform=transformation) 40 | elif args.dataset_name=='Places365': 41 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train-standard', small=True, download=False, transform=transformation) 42 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='val', small=True, download=False, transform=transformation) 43 | elif args.dataset_name=='INaturalist': 44 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', version='2021_train_mini', download=False, transform=transformation) 45 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', version='2021_valid', download=False, transform=transformation) 46 | elif args.dataset_name=='ImageNet': 47 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', transform=transformation) 48 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='val', transform=transformation) 49 | 50 | trainloader = torch.utils.data.DataLoader( 51 | trainset, batch_size=args.mini_bs, shuffle=True, num_workers=4) 52 | 53 | testloader = torch.utils.data.DataLoader( 54 | testset, batch_size=100, shuffle=False, num_workers=4) 55 | 56 | n_acc_steps = args.bs // args.mini_bs # gradient accumulation steps 57 | 58 | 59 | # Model 60 | net = timm.create_model(args.model, pretrained=True, num_classes = num_classes) 61 | net = ModuleValidator.fix(net).to(device) 62 | 63 | if args.dataset_name=='CelebA': 64 | criterion = nn.BCEWithLogitsLoss(reduction='none') 65 | else: 66 | criterion = nn.CrossEntropyLoss() 67 | 68 | 69 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 70 | 71 | if 'BiTFiT' in args.clipping_mode: 72 | for name,layer in net.named_modules(): 73 | if hasattr(layer,'weight'): 74 | temp_layer=layer 75 | for name,param in net.named_parameters(): 76 | if '.bias' not in name: 77 | param.requires_grad_(False) 78 | for param in temp_layer.parameters(): 79 | param.requires_grad_(True) 80 | 81 | # Privacy engine 82 | if 'nonDP' not in args.clipping_mode: 83 | sigma=get_noise_multiplier( 84 | target_epsilon = args.epsilon, 85 | target_delta = 1/len(trainset), 86 | sample_rate = args.bs/len(trainset), 87 | epochs = args.epochs, 88 | ) 89 | print(f'adding noise level {sigma}') 90 | privacy_engine = PrivacyEngine( 91 | net, 92 | batch_size=args.bs, 93 | sample_size=len(trainset), 94 | noise_multiplier=sigma, 95 | epochs=args.epochs, 96 | clipping_mode='MixOpt', 97 | clipping_style='all-layer', 98 | ) 99 | privacy_engine.attach(optimizer) 100 | 101 | 102 | tr_loss=[] 103 | te_loss=[] 104 | tr_acc=[] 105 | te_acc=[] 106 | 107 | def train(epoch): 108 | 109 | 
net.train()
110 |         train_loss = 0
111 |         correct = 0
112 |         total = 0
113 | 
114 | 
115 |         for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
116 |             inputs, targets = inputs.to(device), targets.to(device)
117 |             outputs = net(inputs)
118 |             if args.dataset_name == 'CelebA':
119 |                 loss = criterion(outputs, targets.float()).sum(dim=1).mean()  # multi-label BCE
120 |             else:
121 |                 loss = criterion(outputs, targets)
122 | 
123 |             loss.backward()
124 |             if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
125 |                 optimizer.step()
126 |                 optimizer.zero_grad()
127 | 
128 |             train_loss += loss.item()
129 |             total += targets.size(0)
130 |             if args.dataset_name == 'CelebA':
131 |                 correct += ((outputs > 0) == targets).sum(dim=0).float().mean()  # average accuracy over the 40 attributes
132 |             else:
133 |                 _, predicted = outputs.max(1)
134 |                 correct += predicted.eq(targets).sum().item()
135 | 
136 |             if args.dataset_name in ['Places365', 'INaturalist', 'ImageNet'] and (batch_idx + 1) % 100 == 0:
137 |                 print(loss.item(), 100.*correct/total)
138 | 
139 | 
140 |         tr_loss.append(train_loss/(batch_idx+1))
141 |         tr_acc.append(100.*correct/total)
142 |         print('Epoch: ', epoch, 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
143 |               % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
144 | 
145 |     def test(epoch):
146 |         net.eval()
147 |         test_loss = 0
148 |         correct = 0
149 |         total = 0
150 |         with torch.no_grad():
151 |             for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
152 |                 inputs, targets = inputs.to(device), targets.to(device)
153 |                 outputs = net(inputs)
154 |                 if args.dataset_name == 'CelebA':
155 |                     loss = criterion(outputs, targets.float()).sum(dim=1).mean()
156 |                 else:
157 |                     loss = criterion(outputs, targets)
158 | 
159 |                 test_loss += loss.item()
160 |                 total += targets.size(0)
161 |                 if args.dataset_name == 'CelebA':
162 |                     correct += ((outputs > 0) == targets).sum(dim=0).float().mean()
163 |                 else:
164 |                     _, predicted = outputs.max(1)
165 |                     correct += predicted.eq(targets).sum().item()
166 | 
167 |             te_loss.append(test_loss/(batch_idx+1))
168 |             te_acc.append(100.*correct/total)
169 |             print('Epoch: ', epoch, 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
170 |                   % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
171 | 
172 |     for epoch in range(args.epochs):
173 |         train(epoch)
174 |         test(epoch)
175 |     print(tr_loss, tr_acc, te_loss, te_acc)
176 | 
177 | 
178 | if __name__ == '__main__':
179 |     import argparse
180 | 
181 |     parser = argparse.ArgumentParser(description='PyTorch CV Training')
182 |     parser.add_argument('--lr', default=5e-4, type=float, help='learning rate')
183 |     parser.add_argument('--epochs', default=5, type=int,
184 |                         help='number of epochs')
185 |     parser.add_argument('--bs', default=1000, type=int, help='batch size')
186 |     parser.add_argument('--mini_bs', type=int, default=100)
187 |     parser.add_argument('--epsilon', default=8, type=float, help='target epsilon')
188 |     parser.add_argument('--dataset_name', type=str, default='CIFAR10', help='https://pytorch.org/vision/stable/datasets.html')
189 |     parser.add_argument('--clipping_mode', type=str, default='MixOpt', choices=['BiTFiT', 'MixOpt', 'nonDP', 'nonDP-BiTFiT'])
190 |     parser.add_argument('--model', default='vit_base_patch16_224', type=str, help='model name')
191 | 
192 |     args = parser.parse_args()
193 | 
194 |     from fastDP import PrivacyEngine
195 | 
196 |     import torch
197 |     import torchvision
198 |     torch.manual_seed(2)
199 |     import torch.nn as nn
200 |     import torch.optim as optim
201 |     import timm
202 |     from opacus.validators import ModuleValidator
from opacus.accountants.utils import get_noise_multiplier 204 | from tqdm import tqdm 205 | import numpy as np 206 | import warnings; warnings.filterwarnings("ignore") 207 | main(args) 208 | -------------------------------------------------------------------------------- /examples/image_classification/CelebA_TIMM.py: -------------------------------------------------------------------------------- 1 | #This runs multi-label classification 2 | def main(args): 3 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 4 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 5 | return None 6 | 7 | device = torch.device('cuda') 8 | 9 | # Data 10 | print('==> Preparing data..') 11 | 12 | train_set = datasets.CelebA(root='.', split='train',target_type='attr', 13 | transform=transforms.Compose([ 14 | transforms.ToTensor(), 15 | #transforms.Normalize(mean=[0.5,0.5,0.5],std=[0.5,0.5,0.5]), 16 | ])) 17 | test_set = datasets.CelebA(root=".", split='test', target_type='attr', 18 | transform=transforms.Compose([ 19 | transforms.ToTensor()])) 20 | 21 | if args.labels==None: 22 | args.labels=list(range(40)) 23 | print('Training on all 40 labels.') 24 | else: 25 | print('Training on ', [attr_names[ind] for ind in args.labels]) 26 | 27 | 28 | train_set.attr = train_set.attr[:, args.labels].type(torch.float32) 29 | test_set.attr = test_set.attr[:, args.labels].type(torch.float32) 30 | 31 | print('Training/Testing set size: ', len(train_set),len(test_set),' ; Image dimension: ',train_set[0][0].shape) 32 | 33 | trainloader = torch.utils.data.DataLoader( 34 | train_set, batch_size=args.mini_bs, pin_memory=True,num_workers=4,shuffle=True) 35 | testloader = torch.utils.data.DataLoader( 36 | test_set, batch_size=500, pin_memory=True,num_workers=4, shuffle=False) 37 | 38 | n_acc_steps=args.bs//args.mini_bs 39 | 40 | # Model 41 | print('==> Building model..', args.model,'; BatchNorm is replaced by GroupNorm.') 42 | net = timm.create_model(args.model, pretrained=True, num_classes=len(args.labels)) 43 | net = ModuleValidator.fix(net) 44 | net=net.to(device) 45 | 46 | for name,param in net.named_parameters(): 47 | print("First trainable parameter is: ",name);break 48 | 49 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 50 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 51 | 52 | criterion = nn.BCEWithLogitsLoss(reduction='none') 53 | 54 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 55 | 56 | if 'BiTFiT' in args.clipping_mode: 57 | for name,param in net.named_parameters(): 58 | if '.bias' not in name: 59 | param.requires_grad_(False) 60 | 61 | # Privacy engine 62 | if 'nonDP' not in args.clipping_mode: 63 | sigma=get_noise_multiplier( 64 | target_epsilon = args.epsilon, 65 | target_delta = 5e-6, 66 | sample_rate = args.bs/len(train_set), 67 | epochs = args.epochs, 68 | ) 69 | 70 | if 'BK' in args.clipping_mode: 71 | clipping_mode=args.clipping_mode[3:] 72 | else: 73 | clipping_mode='ghost' 74 | privacy_engine = PrivacyEngine( 75 | net, 76 | batch_size=args.bs, 77 | sample_size=len(train_set), 78 | noise_multiplier=sigma, 79 | epochs=args.epochs, 80 | clipping_mode=clipping_mode, 81 | origin_params=args.origin_params, 82 | ) 83 | privacy_engine.attach(optimizer) 84 | 85 | 86 | def train(epoch): 87 | 88 | net.train() 89 | train_loss = 0 90 | correct = np.zeros_like([0]*len(args.labels)) 91 | total = 0 92 | for 
batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)): 93 | inputs, targets = inputs.to(device), targets.to(device) 94 | outputs = net(inputs) 95 | loss = criterion(outputs, targets.float()).sum(dim=1).mean() 96 | 97 | 98 | loss.backward() 99 | if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)): 100 | optimizer.step() 101 | optimizer.zero_grad() 102 | 103 | train_loss += loss.item() 104 | total += targets.size(0) 105 | correct += ((outputs > 0) == targets).sum(dim=0).cpu().detach().numpy() 106 | 107 | print('Epoch: ', epoch, 'Train Loss: ', train_loss/(batch_idx+1), 108 | ' | Acc: ', 100.*correct/total, np.mean(100.0 * correct / total)) 109 | 110 | def test(epoch): 111 | net.eval() 112 | test_loss = 0 113 | correct = np.zeros_like([0]*len(args.labels)) 114 | total = 0 115 | with torch.no_grad(): 116 | for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)): 117 | inputs, targets = inputs.to(device), targets.to(device) 118 | outputs = net(inputs) 119 | loss = criterion(outputs, targets.float()).sum(dim=1) 120 | loss = loss.mean() 121 | 122 | test_loss += loss.item() 123 | total += targets.size(0) 124 | correct += ((outputs > 0) == targets).sum(dim=0).cpu().detach().numpy() 125 | 126 | print('Epoch: ', epoch, 'Test Loss: ', test_loss/(batch_idx+1), 127 | ' | Acc: ', 100.*correct/total, np.mean(100.0 * correct / total)) 128 | 129 | 130 | for epoch in range(args.epochs): 131 | train(epoch) 132 | test(epoch) 133 | 134 | 135 | if __name__ == '__main__': 136 | import argparse 137 | parser = argparse.ArgumentParser() 138 | parser.add_argument('--lr', type=float, default=0.001) 139 | parser.add_argument('--epochs', type=int, default=10) 140 | parser.add_argument('--bs', type=int, default=500) 141 | parser.add_argument('--mini_bs', type=int, default=100) 142 | parser.add_argument('--epsilon', default=3, type=float) 143 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 144 | parser.add_argument('--model', type=str, default='resnet18') 145 | parser.add_argument('--labels', nargs="*", type=int, default=None,help='List of label indices, 0-39 for CelebA') 146 | parser.add_argument('--origin_params', nargs='+', default=None) 147 | 148 | 149 | args = parser.parse_args() 150 | 151 | attr_names=['5_o_Clock_Shadow','Arched_Eyebrows','Attractive','Bags_Under_Eyes', 152 | 'Bald','Bangs','Big_Lips','Big_Nose', 153 | 'Black_Hair','Blond_Hair','Blurry','Brown_Hair', 154 | 'Bushy_Eyebrows','Chubby','Double_Chin','Eyeglasses', 155 | 'Goatee','Gray_Hair','Heavy_Makeup','High_Cheekbones', 156 | 'Male','Mouth_Slightly_Open','Mustache','Narrow_Eyes', 157 | 'No_Beard','Oval_Face','Pale_Skin','Pointy_Nose', 158 | 'Receding_Hairline','Rosy_Cheeks','Sideburns','Smiling', 159 | 'Straight_Hair','Wavy_Hair','Wearing_Earrings','Wearing_Hat', 160 | 'Wearing_Lipstick','Wearing_Necklace','Wearing_Necktie','Young'] 161 | 162 | import numpy as np 163 | from fastDP import PrivacyEngine 164 | 165 | import torch 166 | from torchvision import datasets, transforms 167 | torch.manual_seed(0) 168 | import torch.nn as nn 169 | import torch.optim as optim 170 | import timm 171 | from opacus.validators import ModuleValidator 172 | from opacus.accountants.utils import get_noise_multiplier 173 | from tqdm import tqdm 174 | import warnings; warnings.filterwarnings("ignore") 175 | 176 | main(args) 177 | -------------------------------------------------------------------------------- /examples/image_classification/README.md: 
-------------------------------------------------------------------------------- 1 | ## DP image classification with convolutional neural networks and vision transformers 2 | 3 | We provide scripts to run DP optimization on CIFAR10, CIFAR100, SVHN, ImageNet, CelebA, Places365, INaturalist, and other datasets, using the models (CNN and ViT) from [TIMM](https://github.com/rwightman/pytorch-image-models/tree/master/timm/models). Supported models include VGG, ResNet, Wide ResNet, ViT, CrossViT, BEiT, DEiT, ... 4 | 5 | ### Multi-GPU distributed learning 6 | See the `ZERO_examples` folder. Our privacy engine supports DeepSpeed (ZeRO 1+2+3) and FSDP with mixed precision training. For example, 7 | ```plaintext 8 | deepspeed CIFAR_TIMM_ZERO1.py --model vit_large_patch16_224 --cifar_data CIFAR10 --deepspeed_config cifar_config.json 9 | ``` 10 | 11 | ### CIFAR10/CIFAR100 12 | ```plaintext 13 | python -m CIFAR_TIMM --model vit_large_patch16_224 --origin_params 'patch_embed.proj.bias' --clipping_mode BK-MixOpt --cifar_data CIFAR10 14 | ``` 15 | 16 | The script by default uses (hybrid) book-keeping from [Differentially Private Optimization on Large Model at Small Cost](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory burden without affecting accuracy. This script achieves state-of-the-art accuracy with BEiT-large and ViT-large in under 7 min per epoch on one A100 GPU (40GB). Notice that `--origin_params 'patch_embed.proj.bias'` specifically accelerates ViT through the ghost differentiation trick. 17 | 18 | Arguments: 19 | 20 | * `--cifar_data`: Whether to train on the CIFAR10 (default) or CIFAR100 dataset. 21 | 22 | * `--epsilon`: Target privacy spending, default is 2. 23 | 24 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `nonDP` (non-private full fine-tuning), `BK-ghost` (base book-keeping), `BK-MixGhostClip`, `BK-MixOpt` (default), `BiTFiT` (DP bias-term fine-tuning) and `nonDP-BiTFiT` (non-private BiTFiT). All BK algorithms are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf), and DP-BiTFiT is from [Bu et al., 2022](https://arxiv.org/pdf/2210.00036.pdf). 25 | 26 | * `--model`: The pretrained model from TIMM; check the full list with `timm.list_models(pretrained=True)`. 27 | 28 | * `--origin_params`: Origin parameters for the ghost differentiation trick from [Bu et al. Appendix D.3](https://arxiv.org/pdf/2210.00038.pdf). Default is `None` (not using the trick). To enjoy the acceleration from the trick, set it to the model's first trainable layer's parameters. 29 | 30 | * `--dimension`: Dimension of images, default is 224, i.e. the image is resized to 224×224. 31 | 32 | * `--lr`: Learning rate, default is 0.0005. Note that the BiTFiT learning rate should be larger than full fine-tuning's. 33 | 34 | * `--mini_bs`: Physical batch size for gradient accumulation; it determines memory and speed, but not accuracy. Default is 50. 35 | 36 | * `--bs`: Logical batch size that determines convergence and accuracy; should be a multiple of `mini_bs`. Default is 1000. 37 | 38 | * `--epochs`: Number of epochs, default is 3. 39 | 40 | * `--clipping_style`: Which group-wise per-sample gradient clipping style to use. This argument takes one of `all-layer` (flat clipping), `layer-wise` (each layer is a group, including both weight and bias parameters), `param-wise` (each parameter is a group), or a list of layer names (general group-wise clipping). For example, a uniform 3-group clipping can be implemented with 41 | ```plaintext 42 | python -m CIFAR_TIMM --model vit_base_patch16_224 --origin_params 'patch_embed.proj.bias' --clipping_style patch_embed.proj blocks.4.norm1 blocks.8.norm1 --cifar_data CIFAR10 43 | ``` 44 | 45 | ### CelebA 46 | Download the dataset via `torchvision`, or from the [official host](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) with all the .txt files and `/img_align_celeba` in the same directory. 47 | ```plaintext 48 | python -m CelebA_TIMM --model resnet18 49 | ``` 50 | Same arguments `[lr, epochs, bs, mini_bs, epsilon, clipping mode, model]` as the CIFAR example, with one addition: `--labels`. The default is `None`, which trains all 40 labels as a multi-label/multi-task problem; otherwise training uses the label indices specified as a list. For example, label index 31 is 'Smiling' and label index 20 is 'Male'. 51 | 52 | ### General computer vision experiments 53 | We provide a general script to experiment on many torchvision datasets, fixing most of the arguments in the privacy engine. 54 | ```plaintext 55 | python -m CV_TIMM --model vit_base_patch16_224 --dataset_name ImageNet 56 | ``` 57 | 58 | ### Note 59 | 1. Vision models often have batch normalization layers, which violate the DP guarantee (see [Opacus](https://opacus.ai/tutorials/guide_to_module_validator) for the reason). A common solution is to replace them with group/layer/instance normalization, which is easily done with Opacus>=v1.0: `model=ModuleValidator.fix(model)`. 60 | 61 | 2. To reproduce DP image classification and compare with other packages, we refer to [private-vision](https://github.com/woodyx218/private_vision) (covering GhostClip, MixGhostClip, Opacus-like optimization) and [Opacus](https://github.com/pytorch/opacus). Different packages and clipping modes should produce the same accuracy. Note that training more epochs with larger noise usually gives better accuracy. 62 | 63 | 3. Generally speaking, GhostClip is inefficient for large images (try a 512×512 image with resnet18) and Opacus is inefficient for large models (try a 224×224 image with BEiT-large). Hence we improve on the mixed ghost norm from [Bu et al.](https://arxiv.org/abs/2205.10683) to use GhostClip or Opacus at different layers.
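
To make the engine's usage concrete, here is a minimal sketch of the setup these scripts share, based on the calls in `CV_TIMM.py` and the ZeRO examples (the hyperparameter values are illustrative, not prescriptive):
```python
# Minimal sketch: DP fine-tuning of a TIMM model with fastDP's PrivacyEngine.
import timm
import torch.optim as optim
from opacus.validators import ModuleValidator
from fastDP import PrivacyEngine

net = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
net = ModuleValidator.fix(net)      # replace BatchNorm by GroupNorm (see Note 1)
optimizer = optim.Adam(net.parameters(), lr=5e-4)

privacy_engine = PrivacyEngine(
    net,
    batch_size=1000,                # logical batch size
    sample_size=50000,              # training-set size, e.g. CIFAR10
    epochs=3,
    target_epsilon=2,
    clipping_mode='MixOpt',
    clipping_style='all-layer',
)
privacy_engine.attach(optimizer)    # optimizer.step() now clips per-sample gradients and adds noise
```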
64 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_FSDP_extending.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import torch 4 | import torch.nn as nn 5 | import torch.optim as optim 6 | 7 | import torchvision 8 | from fastDP import PrivacyEngine_Distributed_extending 9 | 10 | import timm 11 | #from opacus.validators import ModuleValidator 12 | from tqdm import tqdm 13 | import warnings; warnings.filterwarnings("ignore") 14 | 15 | 16 | import torch.distributed as dist 17 | import torch.multiprocessing as mp 18 | from torch.utils.data.distributed import DistributedSampler 19 | from fairscale.nn import FullyShardedDataParallel as FSDP 20 | 21 | #--- if import from torch <= 1.11 22 | #from torch.distributed.fsdp import FullyShardedDataParallel as FSDP 23 | #from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload,BackwardPrefetch 24 | #from torch.distributed.fsdp.wrap import default_auto_wrap_policy,enable_wrap,wrap 25 | from fairscale.nn import default_auto_wrap_policy 26 | from fairscale.internal.parallel import ProcessGroupName 27 | 28 | 29 | def setup(rank, world_size): 30 | os.environ['MASTER_ADDR'] = 'localhost' 31 | os.environ['MASTER_PORT'] = '12355' 32 | 33 | # initialize the process group 34 | dist.init_process_group("nccl", rank=rank, world_size=world_size) 35 | 36 | def cleanup(): 37 | dist.destroy_process_group() 38 | 39 | 40 | def train(epoch,net,rank,trainloader,criterion,optimizer,grad_acc_steps): 41 | net.train() 42 | ddp_loss = torch.zeros(3).to(rank) 43 | 44 | for batch_idx, data in enumerate(tqdm(trainloader)): 45 | # get the inputs; data is a list of [inputs, labels] 46 | inputs, targets = data[0].to(rank), data[1].to(rank) 47 | outputs = net(inputs) 48 | 49 | loss = criterion(outputs, targets) 50 | 51 | loss.backward() 52 | if ((batch_idx + 1) % grad_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)): 53 | optimizer.step() 54 | optimizer.zero_grad() 55 | 56 | _, predicted = outputs.max(1) 57 | 58 | ddp_loss[0] += loss.item() 59 | ddp_loss[1] += len(data[0]) 60 | ddp_loss[2] += predicted.eq(targets.view_as(predicted)).sum().item() 61 | 62 | if rank == 0: 63 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%%' 64 | % (ddp_loss[0]/(batch_idx+1), 100.*ddp_loss[2]/ddp_loss[1])) 65 | dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM) 66 | 67 | def test(epoch,net,rank,testloader,criterion): 68 | net.eval() 69 | ddp_loss = torch.zeros(3).to(rank) 70 | 71 | with torch.no_grad(): 72 | for batch_idx, data in enumerate(tqdm(testloader)): 73 | inputs, targets = data[0].to(rank), data[1].to(rank) 74 | outputs = net(inputs) 75 | loss = criterion(outputs, targets) 76 | 77 | _, predicted = outputs.max(1) 78 | ddp_loss[0] += loss.item() 79 | ddp_loss[1] += len(data[0]) 80 | ddp_loss[2] += predicted.eq(targets.view_as(predicted)).sum().item() 81 | if rank == 0: 82 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%%' 83 | % (ddp_loss[0]/ddp_loss[1]*len(inputs), 100.*ddp_loss[2]/ddp_loss[1])) 84 | 85 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 86 | def main(rank, world_size, args): 87 | 88 | grad_acc_steps = args.batch_size//args.mini_batch_size//world_size 89 | 90 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 91 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 
'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 92 | return None 93 | 94 | setup(rank, world_size) 95 | 96 | 97 | transformation = torchvision.transforms.Compose([ 98 | torchvision.transforms.Resize(args.dimension), 99 | torchvision.transforms.ToTensor(), 100 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 101 | ]) 102 | 103 | # Data 104 | print('==> Preparing data..') 105 | 106 | if args.cifar_data=='CIFAR10': 107 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=False, transform=transformation) 108 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=False, transform=transformation) 109 | elif args.cifar_data=='CIFAR100': 110 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=False, transform=transformation) 111 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=False, transform=transformation) 112 | else: 113 | return "Must specify datasets as CIFAR10 or CIFAR100" 114 | 115 | sampler_train = DistributedSampler(trainset, rank=rank, num_replicas=world_size, shuffle=True) 116 | sampler_test = DistributedSampler(testset, rank=rank, num_replicas=world_size) 117 | 118 | train_kwargs = {'batch_size': args.mini_batch_size, 'sampler': sampler_train} 119 | test_kwargs = {'batch_size': 10, 'sampler': sampler_test} 120 | cuda_kwargs = {'num_workers': 2, 121 | 'pin_memory': False, 122 | 'shuffle': False} 123 | train_kwargs.update(cuda_kwargs) 124 | test_kwargs.update(cuda_kwargs) 125 | 126 | trainloader = torch.utils.data.DataLoader(trainset,**train_kwargs) 127 | testloader = torch.utils.data.DataLoader(testset, **test_kwargs) 128 | torch.cuda.set_device(rank) 129 | 130 | 131 | init_start_event = torch.cuda.Event(enable_timing=True) 132 | init_end_event = torch.cuda.Event(enable_timing=True) 133 | 134 | # Model 135 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode,grad_acc_steps) 136 | net = timm.create_model(args.model, pretrained=True, num_classes=int(args.cifar_data[5:])) 137 | 138 | if 'BiTFiT' in args.clipping_mode: 139 | for name,param in net.named_parameters(): 140 | if '.bias' not in name: 141 | param.requires_grad_(False) 142 | 143 | net = net.to(rank) 144 | 145 | # Privacy engine 146 | if 'nonDP' not in args.clipping_mode: 147 | PrivacyEngine_Distributed_extending( 148 | net, 149 | batch_size=args.batch_size, 150 | sample_size=len(trainset), 151 | epochs=args.epochs, 152 | target_epsilon=args.epsilon, 153 | num_GPUs=world_size, 154 | torch_seed_is_fixed=True, #FSDP always gives different seeds to devices if use FSDP() to wrap 155 | grad_accum_steps=grad_acc_steps, 156 | ) 157 | 158 | 159 | #net = FSDP(net,flatten_parameters=False, mixed_precision=args.fp16)# must use flatten_parameters=False https://github.com/facebookresearch/fairscale/issues/1047 160 | 161 | from fairscale.nn.wrap import wrap, enable_wrap, auto_wrap 162 | fsdp_params = dict(wrapper_cls=FSDP, mixed_precision=args.fp16, flatten_parameters=False)#,disable_reshard_on_root=False,reshard_after_forward=False,clear_autocast_cache=True) # True or False 163 | with enable_wrap(**fsdp_params): 164 | # cannot wrap the network as a whole, will lose weight.noise 165 | for pp in net.modules(): # must wrap module/layer not parameter 166 | if hasattr(pp,'weight'): # AssertionError assert not isinstance(child, cast(type, ConfigAutoWrap.wrapper_cls)) 167 | pp=auto_wrap(pp) 168 | 169 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 170 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 171 | 172 | 173 | criterion = nn.CrossEntropyLoss(reduction='sum') 174 | 175 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 176 | #https://pytorch.org/docs/stable/fsdp.html 177 | #The optimizer must be initialized after the module has been wrapped, since FSDP will shard parameters in-place and this will break any previously initialized optimizers. 
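    # A sketch of the batch-size bookkeeping here: each optimizer.step() aggregates
    # grad_acc_steps * world_size * mini_batch_size = batch_size samples, i.e. the
    # logical batch that PrivacyEngine_Distributed_extending calibrates its noise to
    # (via the grad_accum_steps and num_GPUs arguments passed above).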
178 | 179 | init_start_event.record() 180 | 181 | for epoch in range(args.epochs): 182 | train(epoch,net,rank,trainloader,criterion,optimizer,grad_acc_steps) 183 | test(epoch,net,rank,testloader,criterion) 184 | init_end_event.record() 185 | 186 | if rank == 0: 187 | print(f"CUDA event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec") 188 | 189 | cleanup() 190 | 191 | 192 | 193 | if __name__ == '__main__': 194 | 195 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training') 196 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 197 | parser.add_argument('--epochs', default=5, type=int, 198 | help='number of epochs') 199 | parser.add_argument('--batch_size', default=1024, type=int, help='logical batch size') 200 | parser.add_argument('--mini_batch_size', default=16, type=int, help='physical batch size') 201 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 202 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 203 | parser.add_argument('--model', default='vit_gigantic_patch14_224', type=str) 204 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 205 | parser.add_argument('--dimension', type=int,default=224) 206 | parser.add_argument('--fp16', type=bool, default=False) 207 | 208 | args = parser.parse_args() 209 | 210 | torch.manual_seed(2) # useful for reproduction 211 | 212 | WORLD_SIZE = torch.cuda.device_count() 213 | 214 | mp.spawn(main,args=(WORLD_SIZE, args), 215 | nprocs=WORLD_SIZE,join=True) 216 | #https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn 217 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO1.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | config=json.load(open(args.deepspeed_config)) 4 | 5 | transformation = torchvision.transforms.Compose([ 6 | torchvision.transforms.Resize(args.dimension), 7 | torchvision.transforms.ToTensor(), 8 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 9 | ]) 10 | 11 | if torch.distributed.get_rank() != 0: 12 | # might be downloading cifar data, let rank 0 download first 13 | torch.distributed.barrier() 14 | 15 | 16 | # Data 17 | print('==> Preparing data..') 18 | 19 | if args.cifar_data=='CIFAR10': 20 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation) 21 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation) 22 | elif args.cifar_data=='CIFAR100': 23 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation) 24 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation) 25 | else: 26 | return "Must specify datasets as CIFAR10 or CIFAR100" 27 | 28 | 29 | if torch.distributed.get_rank() == 0: 30 | # cifar data is downloaded, indicate other ranks can proceed 31 | torch.distributed.barrier() 32 | 33 | testloader = torch.utils.data.DataLoader(testset, batch_size=20, shuffle=False, num_workers=2) # must have num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746 34 | 35 | # Model 36 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode) 37 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:])) 38 | net = ModuleValidator.fix(net); 39 | 40 | criterion = nn.CrossEntropyLoss() 41 | 42 | if 'BiTFiT' in args.clipping_mode: 43 | for name,param in net.named_parameters(): 44 | if '.bias' not in name: 45 | param.requires_grad_(False) 46 | 47 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 48 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 49 | 50 | # Privacy engine 51 | if 'nonDP' not in args.clipping_mode: 52 | privacy_engine = PrivacyEngine( 53 | net, 54 | batch_size=config['train_batch_size'], 55 | sample_size=len(trainset), 56 | epochs=args.epochs, 57 | target_epsilon=args.epsilon, 58 | clipping_mode='MixOpt', 59 | clipping_style=args.clipping_style, 60 | num_GPUs=torch.distributed.get_world_size(), 61 | torch_seed_is_fixed=True, 62 | ) 63 | 64 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 65 | 66 | # Initialize DeepSpeed to use the following features 67 | # 1) Distributed model 68 | # 2) Distributed data loader 69 | # 3) DeepSpeed optimizer 70 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset) 71 | 72 | fp16 = model_engine.fp16_enabled();bf16 = model_engine.bfloat16_enabled() 73 | print(f'fp16={fp16},bf16={bf16}') 74 | 75 | 76 | def train(epoch): 77 | 78 | net.train() 79 | train_loss = 0 80 | correct = 0 81 | total = 0 82 | 83 | 84 | for batch_idx, data in enumerate(tqdm(trainloader)): 85 | # get the inputs; data is a list of [inputs, labels] 86 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 87 | if fp16: 88 | inputs = inputs.half() 89 | if bf16: 90 | inputs = inputs.bfloat16() 91 | outputs = model_engine(inputs) 92 | 93 | loss = criterion(outputs, targets) 94 | 95 | model_engine.backward(loss) 96 | #if ((batch_idx + 1) % 2 == 0) or ((batch_idx + 1) == len(trainloader)): 97 | model_engine.step() 98 | 99 | train_loss += loss.item() 100 | _, predicted = outputs.max(1) 101 | total += targets.size(0) 102 | correct += predicted.eq(targets).sum().item() 103 | 104 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)' 105 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)) 106 | 107 | def test(epoch): 108 | net.eval() 109 | test_loss = 0 110 | correct = 0 111 | total = 0 112 | with torch.no_grad(): 113 | for batch_idx, data in enumerate(tqdm(testloader)): 114 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 115 | if fp16: 116 | inputs = inputs.half() 117 | if bf16: 118 | inputs = inputs.bfloat16() 119 | outputs = model_engine(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py 120 | loss = criterion(outputs, targets) 121 | 122 | test_loss += loss.item() 123 | _, predicted = outputs.max(1) 124 | total += targets.size(0) 125 | correct += predicted.eq(targets).sum().item() 126 | 127 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)' 128 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)) 129 | 130 | for epoch in range(args.epochs): 131 | train(epoch) 132 | test(epoch) 133 | 134 | 135 | if __name__ == '__main__': 136 | import deepspeed 137 | import argparse 138 | 139 | parser = argparse.ArgumentParser(description='PyTorch CIFAR 
Training') 140 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 141 | parser.add_argument('--epochs', default=1, type=int, 142 | help='number of epochs') 143 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 144 | parser.add_argument('--clipping_mode', default='MixOpt', type=str) 145 | parser.add_argument('--model', default='vit_gigantic_patch14_224', type=str) 146 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 147 | parser.add_argument('--dimension', type=int,default=224) 148 | parser.add_argument('--clipping_style', type=str, default='layer-wise') 149 | 150 | parser.add_argument('--local_rank', 151 | type=int, 152 | default=-1, 153 | help='local rank passed from distributed launcher') 154 | # Include DeepSpeed configuration arguments 155 | parser = deepspeed.add_config_arguments(parser) 156 | 157 | args = parser.parse_args() 158 | 159 | from fastDP import PrivacyEngine 160 | 161 | import torch 162 | import torchvision 163 | torch.manual_seed(3) # seed is fixed here, matching torch_seed_is_fixed=True in the privacy engine 164 | import torch.nn as nn 165 | import torch.optim as optim 166 | import timm 167 | from opacus.validators import ModuleValidator 168 | from tqdm import tqdm 169 | import warnings; warnings.filterwarnings("ignore") 170 | 171 | import json 172 | 173 | deepspeed.init_distributed() 174 | 175 | main(args) 176 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO23.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | config=json.load(open(args.deepspeed_config)) 4 | 5 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 6 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 7 | return None 8 | 9 | 10 | transformation = torchvision.transforms.Compose([ 11 | torchvision.transforms.Resize(args.dimension), 12 | torchvision.transforms.ToTensor(), 13 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 14 | ]) 15 | 16 | if torch.distributed.get_rank() != 0: 17 | # might be downloading cifar data, let rank 0 download first 18 | torch.distributed.barrier() 19 | 20 | 21 | # Data 22 | print('==> Preparing data..') 23 | 24 | if args.cifar_data=='CIFAR10': 25 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation) 26 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation) 27 | elif args.cifar_data=='CIFAR100': 28 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation) 29 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation) 30 | else: 31 | return "Must specify datasets as CIFAR10 or CIFAR100" 32 | 33 | 34 | if torch.distributed.get_rank() == 0: 35 | # cifar data is downloaded, indicate other ranks can proceed 36 | torch.distributed.barrier() 37 | 38 | testloader = torch.utils.data.DataLoader(testset, batch_size=10, shuffle=False, num_workers=2) # must have num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746 39 | 40 | # Model 41 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode) 42 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:])) 43 | net = ModuleValidator.fix(net); 44 | 45 | criterion = nn.CrossEntropyLoss() 46 | 47 | if 'BiTFiT' in args.clipping_mode: 48 | for name,param in net.named_parameters(): 49 | if '.bias' not in name: 50 | param.requires_grad_(False) 51 | 52 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 53 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 54 | 55 | # Privacy engine 56 | if 'nonDP' not in args.clipping_mode: 57 | privacy_engine = PrivacyEngine_Distributed_Stage_2_and_3( 58 | net, 59 | batch_size=config['train_batch_size'], 60 | sample_size=len(trainset), 61 | epochs=args.epochs, 62 | #noise_multiplier=0, 63 | target_epsilon=args.epsilon, 64 | clipping_mode='MixOpt', 65 | clipping_style='layer-wise', 66 | num_GPUs=torch.distributed.get_world_size(), 67 | torch_seed_is_fixed=True, 68 | ) 69 | 70 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 71 | 72 | # Initialize DeepSpeed to use the following features 73 | # 1) Distributed model 74 | # 2) Distributed data loader 75 | # 3) DeepSpeed optimizer 76 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset) 77 | 78 | fp16 = model_engine.fp16_enabled();bf16 = model_engine.bfloat16_enabled() 79 | print(f'fp16={fp16},bf16={bf16}') 80 | 81 | 82 | def train(epoch): 83 | 84 | net.train() 85 | train_loss = 0 86 | correct = 0 87 | total = 0 88 | 89 | 90 | for batch_idx, data in enumerate(tqdm(trainloader)): 91 | # get the inputs; data is a list of [inputs, labels] 92 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 93 | if fp16: 94 | inputs = inputs.half() 95 | if bf16: 96 | inputs = inputs.bfloat16() 97 | outputs = model_engine(inputs) 98 | 99 | loss = criterion(outputs, targets) 100 | 101 | model_engine.backward(loss) 102 | #if ((batch_idx + 1) % 2 == 0) or ((batch_idx + 1) == len(trainloader)): 103 | model_engine.step() 104 | 105 | train_loss += loss.item() 106 | _, predicted = outputs.max(1) 107 | total += targets.size(0) 108 | correct += predicted.eq(targets).sum().item() 109 | 110 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)' 111 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)) 112 | 113 | def test(epoch): 114 | net.eval() 115 | test_loss = 0 116 | correct = 0 117 | total = 0 118 | with torch.no_grad(): 119 | for batch_idx, data in enumerate(tqdm(testloader)): 120 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 121 | if fp16: 122 | inputs = inputs.half() 123 | if bf16: 124 | inputs = inputs.bfloat16() 125 | outputs = model_engine(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py 126 | loss = criterion(outputs, targets) 127 | 128 | test_loss += loss.item() 129 | _, predicted = outputs.max(1) 130 | total += targets.size(0) 131 | correct += predicted.eq(targets).sum().item() 132 | 133 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)' 134 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)) 135 | 136 | for epoch in range(args.epochs): 137 | train(epoch) 138 | test(epoch) 139 | 140 | 141 | if __name__ == '__main__': 142 | import deepspeed 143 | import argparse 144 | 145 | parser = 
argparse.ArgumentParser(description='PyTorch CIFAR Training') 146 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 147 | parser.add_argument('--epochs', default=5, type=int, 148 | help='number of epochs') 149 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 150 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 151 | parser.add_argument('--model', default='vit_small_patch16_224', type=str) 152 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 153 | parser.add_argument('--dimension', type=int,default=224) 154 | parser.add_argument('--origin_params', nargs='+', default=None) 155 | 156 | parser.add_argument('--local_rank', 157 | type=int, 158 | default=-1, 159 | help='local rank passed from distributed launcher') 160 | # Include DeepSpeed configuration arguments 161 | parser = deepspeed.add_config_arguments(parser) 162 | 163 | args = parser.parse_args() 164 | 165 | from fastDP import PrivacyEngine_Distributed_Stage_2_and_3 166 | 167 | import torch 168 | import torchvision 169 | torch.manual_seed(3) # seed is fixed here, matching torch_seed_is_fixed=True in the privacy engine 170 | import torch.nn as nn 171 | import torch.optim as optim 172 | import timm 173 | from opacus.validators import ModuleValidator 174 | from tqdm import tqdm 175 | import warnings; warnings.filterwarnings("ignore") 176 | 177 | import json 178 | 179 | import deepspeed 180 | deepspeed.init_distributed() 181 | 182 | main(args) 183 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO_extending.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | config=json.load(open(args.deepspeed_config)) 4 | 5 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 6 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 7 | return None 8 | 9 | 10 | transformation = torchvision.transforms.Compose([ 11 | torchvision.transforms.Resize(args.dimension), 12 | torchvision.transforms.ToTensor(), 13 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 14 | ]) 15 | 16 | if torch.distributed.get_rank() != 0: 17 | # might be downloading cifar data, let rank 0 download first 18 | torch.distributed.barrier() 19 | 20 | 21 | # Data 22 | if args.cifar_data=='CIFAR10': 23 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation) 24 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation) 25 | elif args.cifar_data=='CIFAR100': 26 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation) 27 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation) 28 | else: 29 | return "Must specify datasets as CIFAR10 or CIFAR100" 30 | 31 | 32 | if torch.distributed.get_rank() == 0: 33 | # cifar data is downloaded, indicate other ranks can proceed 34 | torch.distributed.barrier() 35 | 36 | testloader = torch.utils.data.DataLoader(testset, batch_size=10, shuffle=False, num_workers=2) # must have num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746 37 | 38 | # Model 39 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode) 40 | # https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py#L376 41 | # embed_dim a.k.a. width, mlp_ratio=MLP/embed_dim, depth is number of blocks 42 | if args.model!='vitANY': 43 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:])) 44 | else: 45 | net = timm.models.vision_transformer.VisionTransformer(embed_dim=768,num_heads=12,depth=12,mlp_ratio=4,num_classes=int(args.cifar_data[5:])) 46 | 47 | if 'BiTFiT' in args.clipping_mode: # not needed for DP-BiTFiT but use here for safety 48 | for name,param in net.named_parameters(): 49 | if '.bias' not in name: 50 | param.requires_grad_(False) 51 | 52 | criterion = nn.CrossEntropyLoss() 53 | 54 | if 'nonDP' not in args.clipping_mode: 55 | PrivacyEngine_Distributed_extending( 56 | net, 57 | batch_size=config['train_batch_size'], 58 | sample_size=len(trainset), 59 | epochs=args.epochs, 60 | target_epsilon=args.epsilon, 61 | num_GPUs=torch.distributed.get_world_size(), 62 | torch_seed_is_fixed=(args.seed_fixed>=0), # better use False? 63 | grad_accum_steps=config['train_batch_size']/config['train_micro_batch_size_per_gpu']/torch.distributed.get_world_size(), 64 | ) 65 | 66 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 67 | print(f"Number of trainable parameters: {sum([p.numel() for p in net.parameters() if p.requires_grad])}({sum([p.numel() for p in net.parameters() if p.requires_grad])/sum([p.numel() for p in net.parameters()])})") 68 | 69 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 70 | 71 | # Initialize DeepSpeed to use the following features 72 | # 1) Distributed model 73 | # 2) Distributed data loader 74 | # 3) DeepSpeed optimizer 75 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset) 76 | 77 | fp16 = model_engine.fp16_enabled();bf16 = model_engine.bfloat16_enabled(); 78 | print(f'fp16={fp16},bf16={bf16}') 79 | 80 | 81 | def train(epoch): 82 | 83 | net.train() 84 | train_loss = 0 85 | correct = 0 86 | total = 0 87 | 88 | 89 | for batch_idx, data in enumerate(tqdm(trainloader)): 90 | # get the inputs; data is a list of [inputs, labels] 91 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 92 | if fp16: 93 | inputs = inputs.half() 94 | if bf16: 95 | inputs = inputs.bfloat16() 96 | outputs = model_engine(inputs) 97 | 98 | loss = criterion(outputs, targets) 99 | 100 | model_engine.backward(loss) 101 | model_engine.step() 102 | 103 | train_loss += loss.item() 104 | _, predicted = outputs.max(1) 105 | total += targets.size(0) 106 | correct += predicted.eq(targets).sum().item() 107 | 108 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)' 109 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)) 110 | 111 | def test(epoch): 112 | net.eval() 113 | test_loss = 0 114 | correct = 0 115 | total = 0 116 | with torch.no_grad(): 117 | for batch_idx, data in enumerate(tqdm(testloader)): 118 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 119 | if fp16: 120 | inputs = inputs.half() 121 | if bf16: 122 | inputs = inputs.bfloat16() 123 | outputs = model_engine(inputs) 124 | #outputs = net(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py 125 | loss = criterion(outputs, targets) 126 | 127 | test_loss += loss.item() 
128 | _, predicted = outputs.max(1) 129 | total += targets.size(0) 130 | correct += predicted.eq(targets).sum().item() 131 | 132 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)' 133 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)) 134 | 135 | for epoch in range(args.epochs): 136 | train(epoch) 137 | test(epoch) 138 | 139 | 140 | if __name__ == '__main__': 141 | import deepspeed 142 | import argparse 143 | 144 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training') 145 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 146 | parser.add_argument('--epochs', default=5, type=int, 147 | help='number of epochs') 148 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 149 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 150 | parser.add_argument('--model', default='vit_large_patch16_224', type=str) 151 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 152 | parser.add_argument('--dimension', type=int,default=224) 153 | parser.add_argument('--seed_fixed', type=int,default=3) 154 | 155 | parser.add_argument('--local_rank', 156 | type=int, 157 | default=-1, 158 | help='local rank passed from distributed launcher') 159 | # Include DeepSpeed configuration arguments 160 | parser = deepspeed.add_config_arguments(parser) 161 | 162 | args = parser.parse_args() 163 | 164 | from fastDP import PrivacyEngine_Distributed_extending 165 | 166 | import torch 167 | import torchvision 168 | if args.seed_fixed>=0: 169 | torch.manual_seed(args.seed_fixed) # matches torch_seed_is_fixed=(args.seed_fixed>=0) in the privacy engine 170 | import torch.nn as nn 171 | import torch.optim as optim 172 | import timm 173 | from tqdm import tqdm 174 | import warnings; warnings.filterwarnings("ignore") 175 | 176 | import json 177 | 178 | import deepspeed 179 | deepspeed.init_distributed() 180 | 181 | main(args) 182 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/cifar_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": 1024, 3 | "train_micro_batch_size_per_gpu": 32, 4 | "steps_per_print": 2000, 5 | "prescale_gradients": false, 6 | "bf16": { 7 | "enabled": false 8 | }, 9 | "fp16": { 10 | "enabled": true, 11 | "fp16_master_weights_and_grads": false, 12 | "loss_scale": 1.0, 13 | "loss_scale_window": 1000, 14 | "hysteresis": 2, 15 | "min_loss_scale": 1, 16 | "initial_scale_power": 0 17 | }, 18 | "wall_clock_breakdown": false, 19 | "zero_optimization": { 20 | "stage": 1, 21 | "allgather_partitions": true, 22 | "reduce_scatter": true, 23 | "allgather_bucket_size": 50000000, 24 | "reduce_bucket_size": 50000000, 25 | "overlap_comm": true, 26 | "contiguous_gradients": true, 27 | "cpu_offload": false, 28 | "stage3_max_live_parameters" : 1e8, 29 | "stage3_max_reuse_distance" : 1e8, 30 | "stage3_prefetch_bucket_size" : 1e7 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /examples/image_classification/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/image_classification/__init__.py -------------------------------------------------------------------------------- /examples/requirements.txt: 
-------------------------------------------------------------------------------- 1 | argcomplete==1.12.1 2 | avro-python3==1.9.2.1 3 | azure-storage-blob==12.4.0 4 | bottle==0.12.20 5 | certifi==2023.7.22 6 | chardet==3.0.4 7 | charset-normalizer==2.0.4 8 | click==8.0.1 9 | crcmod==1.7 10 | cycler==0.10.0 11 | datasets 12 | diffimg==0.2.3 13 | docopt==0.6.2 14 | fastavro==1.4.1 15 | filelock==3.0.12 16 | fire 17 | fusepy==2.0.4 18 | future==0.18.3 19 | gdown>=5.0 20 | gpytorch 21 | httplib2==0.19.0 22 | idna==3.2 23 | imageio==2.9.0 24 | indexed-gzip-fileobj-fork-epicfaace==1.5.4 25 | isodate==0.6.0 26 | joblib==1.2.0 27 | kiwisolver==1.3.1 28 | lazy_loader==0.3 29 | markdown2==2.4.0 30 | marshmallow==2.15.1 31 | marshmallow-jsonapi==0.15.1 32 | matplotlib==3.4.3 33 | mock==2.0.0 34 | networkx==2.6.2 35 | nltk==3.9 36 | numpy>=1.21.2 37 | oauth2client==4.1.3 38 | packaging==21.0 39 | pandas==1.3.2 40 | pathtools==0.1.2 41 | pbr==5.6.0 42 | Pillow==10.2.0 43 | psutil==5.7.2 44 | pyasn1==0.4.8 45 | pyasn1-modules==0.2.8 46 | pycparser==2.20 47 | pydot==1.4.2 48 | pymongo==3.11.4 49 | pyparsing==2.4.7 50 | PySocks==1.7.1 51 | python-dateutil==2.8.* 52 | pytz==2021.1 53 | PyWavelets==1.1.1 54 | PyYAML==5.4.* 55 | regex==2021.8.3 56 | requests 57 | retry==0.9.2 58 | sacremoses==0.0.45 59 | scikit-image==0.18.2 60 | scikit-learn==1.5.0 61 | scipy>=1.7.1 62 | seaborn==0.11.2 63 | selenium==3.141.0 64 | sentence-transformers>=2.0.0 65 | sentencepiece==0.1.96 66 | sentry-sdk==1.14.0 67 | six==1.15.0 68 | SQLAlchemy==1.3.19 69 | termcolor==1.1.0 70 | threadpoolctl==2.2.0 71 | tifffile==2021.8.8 72 | tokenizers==0.10.3 73 | tqdm>=4.62.1 74 | transformers<=4.26 75 | typing-extensions==3.7.4.3 76 | urllib3==1.26.* 77 | watchdog==0.10.3 78 | websocket-client==1.0.1 79 | -------------------------------------------------------------------------------- /examples/table2text/README.md: -------------------------------------------------------------------------------- 1 | ## DP natural language generation with Huggingface transformers 2 | 3 | ### Getting the data 4 | 5 | The E2E and DART datasets are adapted from \[[Li & Liang, 2021](https://arxiv.org/abs/2101.00190)\] and hosted by \[[Li et al., 2021](https://arxiv.org/abs/2110.05679)\] on [Google Drive](https://drive.google.com/file/d/1Re1wyUPtS3IalSsVVJhSg2sn8UNa7DM7/view?usp=sharing). To obtain the data, run 6 | ```plaintext 7 | gdown https://drive.google.com/uc?id=1Re1wyUPtS3IalSsVVJhSg2sn8UNa7DM7 8 | unzip prefix-tuning.zip 9 | ``` 10 | This should produce a `table2text/prefix-tuning/data` subfolder that contains the datasets. 11 | 12 | ### Running on single GPU 13 | 14 | Use the `run.sh` script in this folder, which assembles the command for `run_language_modeling.py`. 15 | 16 | For instance, run the following under the `examples` folder: 17 | ```plaintext 18 | bash table2text/run.sh table2text/prefix-tuning ToDeleteNLG "e2e" "gpt2" 19 | ``` 20 | 21 | The script by default uses book-keeping (BK) from [Differentially Private Optimization on Large Model at Small Cost](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory burden without affecting accuracy. For E2E/DART, training `gpt2` on one A100 GPU (40GB) takes around 2.5/4 min per epoch. 
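
For intuition, the accumulation loop follows the same pattern as the image-classification examples; below is a minimal sketch (the names `logical_bs`, `physical_bs`, `train_loader` and `model` are illustrative, not the trainer's actual API):
```python
# Sketch: gradient accumulation under DP book-keeping (BK).
n_acc_steps = logical_bs // physical_bs   # accuracy depends only on the logical batch

for batch_idx, batch in enumerate(train_loader):
    loss = model(**batch).loss
    loss.backward()                       # BK computes clipped per-sample gradients here
    if (batch_idx + 1) % n_acc_steps == 0 or (batch_idx + 1) == len(train_loader):
        optimizer.step()                  # noise is added once per logical batch
        optimizer.zero_grad()
```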
22 | 23 | Arguments (sequentially): 24 | * `--output_dir`: path to a folder where results will be written 25 | 26 | * `--task_mode`: name of task; one of "e2e" and "dart" 27 | 28 | * `--model_name_or_path`: The pretrained model; one of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large". 29 | 30 | * `--target_epsilon`: Target privacy spending, default is 8. 31 | 32 | * `--clipping_fn`: Which per-sample gradient clipping function to use; one of `automatic` (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), `Abadi` [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), `global` [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf). 33 | 34 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `MixOpt` (default, meaning hybrid book-keeping), `MixGhostClip`, `ghost`. All three modes are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf). 35 | 36 | ### Running on multi-GPU distributed learning 37 | 38 | Use `run_ZERO1.sh`, `run_ZERO23.sh` and `run_ZERO_extending.py` in this folder for ZeRO 1, ZeRO 2+3 and ZeRO 1+2+3, respectively. The scripts read the config from `gpt_config_stage123.json`. 39 | 40 | For instance, run the following under the `examples` folder: 41 | ```plaintext 42 | bash table2text/run_ZERO1.sh table2text/prefix-tuning ToDeleteNLG "e2e" "gpt2" 43 | ``` 44 | 45 | ### Evaluation 46 | 47 | The script automatically evaluates measures such as loss during training. To evaluate the generations with BLEU, ROUGE, METEOR, CIDEr, NIST, etc., we use the official [e2e-metrics](https://github.com/tuetschek/e2e-metrics) for E2E, and [GEM-metrics](https://github.com/GEM-benchmark/GEM-metrics) for DART. 48 | 49 | Specifically for E2E, after installing e2e-metrics in the `table2text` folder, run 50 | ```bash 51 | cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib) 52 | python e2e-metrics/measure_scores.py prefix-tuning/data/e2e_data/clean_references_test.txt ..//generations_model/eval/global_step_00000420.txt 53 | ``` 54 | -------------------------------------------------------------------------------- /examples/table2text/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/table2text/__init__.py -------------------------------------------------------------------------------- /examples/table2text/compiled_args.py: -------------------------------------------------------------------------------- 1 | """Compilation of all the arguments.""" 2 | import logging 3 | import os 4 | import sys 5 | from dataclasses import dataclass, field 6 | from typing import Optional 7 | 8 | import transformers 9 | 10 | MODEL_CONFIG_CLASSES = list(transformers.MODEL_WITH_LM_HEAD_MAPPING.keys()) 11 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) 12 | 13 | TRUE_TAGS = ('y', 'yes', 't', 'true') 14 | 15 | 16 | # See all possible arguments in src/transformers/training_args.py 17 | # or by passing the --help flag to this script. 18 | # We now keep distinct sets of args, for a cleaner separation of concerns. 19 | @dataclass 20 | class ModelArguments: 21 | """ 22 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. 23 | """ 24 | model_name_or_path: Optional[str] = field( 25 | default=None, 26 | metadata={ 27 | "help": "The model checkpoint for weights initialization. 
Leave None if you want to train a model from " 28 | "scratch." 29 | }, 30 | ) 31 | model_type: Optional[str] = field( 32 | default=None, 33 | metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)}, 34 | ) 35 | config_name: Optional[str] = field( 36 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} 37 | ) 38 | tokenizer_name: Optional[str] = field( 39 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} 40 | ) 41 | cache_dir: Optional[str] = field( 42 | default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"} 43 | ) 44 | 45 | static_lm_head: str = field(default='no') 46 | static_embedding: str = field(default='no') 47 | attention_only: str = field(default="no") 48 | bias_only: str = field(default="no") 49 | 50 | def __post_init__(self): 51 | self.static_lm_head = self.static_lm_head.lower() in TRUE_TAGS 52 | self.static_embedding = self.static_embedding.lower() in TRUE_TAGS 53 | self.attention_only = self.attention_only.lower() in TRUE_TAGS 54 | self.bias_only = self.bias_only.lower() in TRUE_TAGS 55 | 56 | 57 | @dataclass 58 | class DataTrainingArguments: 59 | """ 60 | Arguments pertaining to what data we are going to input our model for training and eval. 61 | """ 62 | data_folder: Optional[str] = field(default=None, metadata={"help": "Path to folder with all the data."}) 63 | 64 | # Useful for truncating the dataset. 65 | max_train_examples: Optional[int] = field(default=sys.maxsize) 66 | max_valid_examples: Optional[int] = field(default=sys.maxsize) 67 | max_eval_examples: Optional[int] = field(default=sys.maxsize) 68 | 69 | line_by_line: bool = field( 70 | default=True, 71 | metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."}, 72 | ) 73 | task_mode: Optional[str] = field( 74 | default=None, metadata={"help": "The name of the task."} 75 | ) 76 | format_mode: Optional[str] = field( 77 | default='cat', metadata={"help": "The mode of data2text format (cat, peek, nopeek)"} 78 | ) 79 | max_source_length: Optional[int] = field( 80 | default=512, metadata={"help": "the max source length of summarization data. "} 81 | ) 82 | train_max_target_length: Optional[int] = field( 83 | default=100, metadata={"help": "the max target length for training data. "} 84 | ) 85 | val_max_target_length: Optional[int] = field( 86 | default=100, metadata={"help": "the max target length for dev data. "} 87 | ) 88 | block_size: int = field( 89 | default=-1, 90 | metadata={ 91 | "help": "Optional input sequence length after tokenization." 92 | "The training dataset will be truncated in block of this size for training." 93 | "Default to the model max input length for single sentence inputs (take into account special " 94 | "tokens)." 
95 | }, 96 | ) 97 | overwrite_cache: bool = field( 98 | default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} 99 | ) 100 | max_seq_len: int = field(default=sys.maxsize) 101 | 102 | 103 | def __post_init__(self): 104 | if self.data_folder is not None: 105 | logging.warning(f'Overriding dataset paths using those given in `data_folder`') 106 | 107 | if self.task_mode == "e2e": 108 | self.train_data_file = os.path.join(self.data_folder, 'src1_train.txt') 109 | self.valid_data_file = os.path.join(self.data_folder, 'src1_valid.txt') 110 | self.eval_data_file = os.path.join(self.data_folder, 'src1_test.txt') 111 | 112 | self.train_prompt_file = os.path.join(self.data_folder, 'prompts_train.txt') 113 | self.val_prompt_file = os.path.join(self.data_folder, 'prompts_valid.txt') 114 | self.eval_prompt_file = os.path.join(self.data_folder, 'prompts_test.txt') 115 | 116 | elif self.task_mode == "dart": 117 | self.train_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-train.json') 118 | self.valid_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-dev.json') 119 | self.eval_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-test.json') 120 | 121 | self.train_prompt_file = os.path.join(self.data_folder, 'prompts_train.txt') 122 | self.val_prompt_file = os.path.join(self.data_folder, 'prompts_valid.txt') 123 | self.eval_prompt_file = os.path.join(self.data_folder, 'prompts_test.txt') 124 | 125 | 126 | @dataclass 127 | class TrainingArguments(transformers.TrainingArguments): 128 | max_eval_batches: int = field(default=-1, metadata={"help": "Maximum number of evaluation steps to run."}) 129 | max_generations: int = field(default=sys.maxsize) 130 | max_generations_train: int = field(default=10) 131 | max_generations_valid: int = field(default=10) 132 | skip_generation: str = field(default="no") 133 | 134 | ema_model_averaging: str = field(default="no") 135 | ema_model_gamma: float = field(default=0.99) 136 | ema_model_start_from: int = field(default=1000) 137 | lr_decay: str = field(default="yes") 138 | eval_epochs: int = field(default=10) 139 | 140 | deepspeed_config: str = field(default=None) 141 | num_GPUs: int = field(default=1) 142 | logical_batch_size: int = field(default=None) 143 | 144 | evaluate_during_training: str = field( 145 | default="yes", 146 | metadata={"help": "Run evaluation during training at each logging step."}, 147 | ) 148 | evaluate_before_training: str = field( 149 | default="yes", 150 | metadata={"help": "Run evaluation before training."}, 151 | ) 152 | save_at_last: str = field(default="no", metadata={"help": "Save at the end of training."}) 153 | 154 | def __post_init__(self): 155 | super(TrainingArguments, self).__post_init__() 156 | self.skip_generation = self.skip_generation.lower() in ('y', 'yes') 157 | self.ema_model_averaging = (self.ema_model_averaging.lower() in ('y', 'yes')) 158 | self.lr_decay = (self.lr_decay.lower() in ('y', 'yes')) 159 | self.evaluate_during_training = (self.evaluate_during_training in ('y', 'yes')) 160 | self.evaluate_before_training = (self.evaluate_before_training in ('y', 'yes')) 161 | self.save_at_last = (self.save_at_last in ('y', 'yes')) 162 | 163 | 164 | @dataclass 165 | class PrivacyArguments: 166 | """Arguments for differentially private training.""" 167 | per_example_max_grad_norm: float = field( 168 | default=.1, metadata={ 169 | "help": "Clipping 2-norm of per-sample gradients." 
170 |         }
171 |     )
172 |     noise_multiplier: float = field(
173 |         default=None, metadata={
174 |             "help": "Standard deviation of noise added for privacy; if `target_epsilon` is specified, "
175 |                     "the noise multiplier searched from that budget is used instead."
176 |         }
177 |     )
178 |     target_epsilon: float = field(
179 |         default=None, metadata={
180 |             "help": "Privacy budget; if `None` use the noise multiplier specified."
181 |         }
182 |     )
183 |     target_delta: float = field(
184 |         default=None, metadata={
185 |             "help": "Failure probability in approximate differential privacy; if `None` use 1 / len(train_data)."
186 |         }
187 |     )
188 |     accounting_mode: str = field(
189 |         default="rdp", metadata={"help": "One of `rdp`, `glw`, `all`."}
190 |     )
191 |     non_private: str = field(default="no")
192 |     clipping_mode: str = field(default="ghost")
193 |     clipping_fn: str = field(default="automatic")
194 |     clipping_style: str = field(default="all-layer")
195 |     torch_seed_is_fixed: bool = field(default=True)
196 | 
197 |     def __post_init__(self):
198 |         self.non_private = self.non_private.lower() in ('y', 'yes')
199 | 
--------------------------------------------------------------------------------
/examples/table2text/data_utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/table2text/data_utils/__init__.py
--------------------------------------------------------------------------------
/examples/table2text/data_utils/data_collator.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union
3 | 
4 | import torch
5 | from torch.nn.utils.rnn import pad_sequence
6 | 
7 | from transformers.tokenization_utils import PreTrainedTokenizer
8 | from transformers.tokenization_utils_base import BatchEncoding, PaddingStrategy
9 | from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
10 | 
11 | 
12 | InputDataClass = NewType("InputDataClass", Any)
13 | 
14 | """
15 | A DataCollator is a function that takes a list of samples from a Dataset
16 | and collates them into a batch, as a dictionary of Tensors.
17 | """
18 | DataCollator = NewType("DataCollator", Callable[[List[InputDataClass]], Dict[str, torch.Tensor]])
19 | 
20 | 
21 | @dataclass
22 | class DataCollatorForData2TextLanguageModeling:
23 |     """
24 |     Data collator used for language modeling.
25 |     - collates batches of tensors, honoring their tokenizer's pad_token
26 |     - preprocesses batches for masked language modeling
27 |     """
28 |     tokenizer: PreTrainedTokenizer
29 |     mlm: bool = True
30 |     format_mode: str = 'cat'
31 |     mlm_probability: float = 0.15
32 | 
33 |     def __call__(
34 |         self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
35 |     ) -> Dict[str, torch.Tensor]:
36 |         if isinstance(examples[0], (dict, BatchEncoding)):
37 |             examples = [e["input_ids"] for e in examples]
38 |         input_ids, labels, src, tgt, cate = zip(*examples)
39 |         if self.mlm:
40 |             inputs, labels = self.mask_tokens(self._tensorize_batch(input_ids))
41 |             return {"input_ids": inputs, "labels": labels}
42 |         else:
43 |             if self.format_mode == 'cat':
44 |                 mode_input = 3
45 |             elif self.format_mode == 'peek':
46 |                 mode_input = 1
47 |             elif self.format_mode == 'nopeek':
48 |                 mode_input = 2
49 |             elif self.format_mode == 'infix':
50 |                 mode_input = 4
51 | 
52 |             # mode_input = 1 # means that we take the input again.
53 | # mode_input = 2 # means that we do not peek at src again. 54 | # mode_input = 3 # means that we look at the categories, and see the input again. 55 | 56 | if mode_input == 1: 57 | # input, batch 58 | batch = self._tensorize_batch(input_ids) 59 | labels = self._tensorize_batch(labels) 60 | src = self._tensorize_batch(src) 61 | cate_batch, cate_attn = None, None 62 | # tgt = self._tensorize_batch(tgt) 63 | elif mode_input == 2: 64 | # nopeek. 65 | batch = self._tensorize_batch(tgt) 66 | labels = batch.clone() 67 | src = self._tensorize_batch(src) 68 | cate_batch, cate_attn = None, None 69 | elif mode_input == 3: 70 | batch = self._tensorize_batch(input_ids) 71 | labels = self._tensorize_batch(labels) 72 | src = self._tensorize_batch(cate) 73 | cate_batch, cate_attn = None, None 74 | elif mode_input == 4: 75 | batch = self._tensorize_batch(tgt) 76 | labels = batch.clone() 77 | src = self._tensorize_batch(src) 78 | 79 | cate_batch = self._tensorize_batch(cate) 80 | cate_attn = (cate_batch != self.tokenizer.pad_token_id) 81 | 82 | labels[labels == self.tokenizer.pad_token_id] = -100 # tgt 83 | src_attn = (src != self.tokenizer.pad_token_id) # src 84 | tgt_attn = (batch != self.tokenizer.pad_token_id) # tgt 85 | 86 | if cate_batch is None: 87 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 88 | 'src':src} 89 | else: 90 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn': tgt_attn, 91 | 'src': src, "cate_batch":cate_batch, "cate_attn":cate_attn} 92 | 93 | def _tensorize_batch( 94 | self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]] 95 | ) -> torch.Tensor: 96 | # In order to accept both lists of lists and lists of Tensors 97 | if isinstance(examples[0], (list, tuple)): 98 | examples = [torch.tensor(e, dtype=torch.long) for e in examples] 99 | length_of_first = examples[0].size(0) 100 | are_tensors_same_length = all(x.size(0) == length_of_first for x in examples) 101 | if are_tensors_same_length: 102 | return torch.stack(examples, dim=0) 103 | else: 104 | if self.tokenizer._pad_token is None: 105 | raise ValueError( 106 | "You are attempting to pad samples but the tokenizer you are using" 107 | f" ({self.tokenizer.__class__.__name__}) does not have one." 108 | ) 109 | return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id) 110 | 111 | def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: 112 | """ 113 | Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 114 | """ 115 | 116 | if self.tokenizer.mask_token is None: 117 | raise ValueError( 118 | "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer." 
119 |             )
120 | 
121 |         labels = inputs.clone()
122 |         # We sample a few tokens in each sequence for masked-LM training (with probability `mlm_probability`, which defaults to 0.15 as in BERT/RoBERTa).
123 |         probability_matrix = torch.full(labels.shape, self.mlm_probability)
124 |         special_tokens_mask = [
125 |             self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
126 |         ]
127 |         probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
128 |         if self.tokenizer._pad_token is not None:
129 |             padding_mask = labels.eq(self.tokenizer.pad_token_id)
130 |             probability_matrix.masked_fill_(padding_mask, value=0.0)
131 |         masked_indices = torch.bernoulli(probability_matrix).bool()
132 |         labels[~masked_indices] = -100  # We only compute loss on masked tokens
133 | 
134 |         # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
135 |         indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
136 |         inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)
137 | 
138 |         # 10% of the time, we replace masked input tokens with random word
139 |         indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
140 |         random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
141 |         inputs[indices_random] = random_words[indices_random]
142 | 
143 |         # The rest of the time (10% of the time) we keep the masked input tokens unchanged
144 |         return inputs, labels
145 | 
146 | 
147 | @dataclass
148 | class DataCollatorForSumLanguageModeling:
149 |     """
150 |     Data collator used for language modeling.
151 |     - collates batches of tensors, honoring their tokenizer's pad_token
152 |     - preprocesses batches for masked language modeling
153 |     """
154 |     tokenizer: PreTrainedTokenizer
155 |     mlm: bool = True
156 |     format_mode: str = 'cat'
157 |     mlm_probability: float = 0.15
158 | 
159 |     def __call__(
160 |         self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
161 |     ) -> Dict[str, torch.Tensor]:
162 |         if isinstance(examples[0], (dict, BatchEncoding)):
163 |             examples = [e["input_ids"] for e in examples]
164 |         # print(examples[0])
165 |         # print(len(examples))
166 |         input_ids, labels, src, tgt = zip(*examples)
167 |         # print(len(input_ids), len(labels), len(weights))
168 |         if self.mlm:
169 |             inputs, labels = self.mask_tokens(self._tensorize_batch(input_ids))
170 |             return {"input_ids": inputs, "labels": labels}
171 |         else:
172 | 
173 |             # print(self.format_mode)
174 | 
175 |             if self.format_mode == 'peek' or self.format_mode == 'cat':
176 |                 mode_input = 1
177 |             elif self.format_mode == 'nopeek':
178 |                 assert False, 'should use format_mode = peek or cat.'
179 |                 mode_input = 2
180 |             elif self.format_mode == 'infix':
181 |                 assert False, 'should use format_mode = peek or cat.'
182 |                 mode_input = 4
183 | 
184 |             # mode_input = 1 # means that we take the input again.
185 |             # mode_input = 2 # means that we do not peek at src again.
186 |             # mode_input = 3 # means that we look at the categories, and see the input again.
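
As a concrete reference for the collation logic in both collators, the sketch below isolates the shared padding-and-masking pattern: right-pad variable-length id sequences, overwrite padded label positions with -100 so the LM loss ignores them, and derive attention masks by comparing against the pad id. The `pad_id` and toy sequences are made up for illustration; a real run takes them from `tokenizer.pad_token_id`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_id = 0  # Stand-in for tokenizer.pad_token_id.
examples = [torch.tensor([5, 7, 9]), torch.tensor([4, 2])]

# Right-pad to the longest sequence in the batch.
batch = pad_sequence(examples, batch_first=True, padding_value=pad_id)

# Padded positions contribute nothing to the LM loss.
labels = batch.clone()
labels[labels == pad_id] = -100

# Attention masks: True on real tokens, False on padding.
attention_mask = batch != pad_id

print(batch)           # tensor([[5, 7, 9], [4, 2, 0]])
print(labels)          # tensor([[5, 7, 9], [4, 2, -100]])
print(attention_mask)  # tensor([[True, True, True], [True, True, False]])
```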
187 | 
188 |             # print(self.format_mode, mode_input)
189 | 
190 |             if mode_input == 1:
191 |                 # input, batch
192 |                 batch = self._tensorize_batch(input_ids)
193 |                 labels = self._tensorize_batch(labels)
194 |                 src = self._tensorize_batch(src)
195 | 
196 |                 labels[labels == self.tokenizer.pad_token_id] = -100  # tgt
197 |                 src_attn = (src != self.tokenizer.pad_token_id)  # src
198 |                 tgt_attn = (batch != self.tokenizer.pad_token_id)  # tgt
199 | 
200 |                 return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn': tgt_attn,
201 |                         'src': src}
202 | 
203 | 
204 |     def _tensorize_batch(
205 |         self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
206 |     ) -> torch.Tensor:
207 |         # In order to accept both lists of lists and lists of Tensors
208 |         if isinstance(examples[0], (list, tuple)):
209 |             examples = [torch.tensor(e, dtype=torch.long) for e in examples]
210 |         length_of_first = examples[0].size(0)
211 |         are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)
212 |         if are_tensors_same_length:
213 |             return torch.stack(examples, dim=0)
214 |         else:
215 |             if self.tokenizer._pad_token is None:
216 |                 raise ValueError(
217 |                     "You are attempting to pad samples but the tokenizer you are using"
218 |                     f" ({self.tokenizer.__class__.__name__}) does not have one."
219 |                 )
220 |             return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)
221 | 
--------------------------------------------------------------------------------
/examples/table2text/decoding_utils.py:
--------------------------------------------------------------------------------
1 | """Utilities for generation."""
2 | import logging
3 | import sys
4 | from typing import Optional
5 | 
6 | import tqdm
7 | import transformers
8 | 
9 | 
10 | def generate(
11 |     model: transformers.PreTrainedModel,
12 |     tokenizer: transformers.PreTrainedTokenizer,
13 |     loader=None,
14 |     prompt_dataset=None,
15 |     max_length=100,
16 |     min_length=5,
17 |     top_k=0,
18 |     top_p=0.9,  # Only filter with top_p.
19 |     repetition_penalty=1,
20 |     do_sample=False,
21 |     num_beams=5,
22 |     bad_words_ids=None,
23 |     dummy_token_id=-100,  # Used as mask.
24 |     num_return_sequences=1,
25 |     max_generations=sys.maxsize,
26 |     device=None,
27 |     padding_token="[PAD]",
28 |     **kwargs,
29 | ):
30 |     assert not model.training, "Generation must be performed when `model` is in eval mode."
31 |     if kwargs:
32 |         logging.warning(f"Unknown kwargs: {kwargs}")
33 | 
34 |     # These are linebreaks; generating these will mess up the evaluation, since those files assume one example per line.
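
The hard-coded ids `[[628], [198]]` used below can be sanity-checked directly. A quick sketch, assuming the GPT-2 BPE tokenizer that these examples load (the exact ids are vocabulary-specific):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.encode("\n"))    # [198]  -- single linebreak
print(tok.encode("\n\n"))  # [628]  -- double linebreak
# Banning these via bad_words_ids keeps each generation on a single line,
# matching the one-example-per-line evaluation files.
```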
35 | if bad_words_ids is None: 36 | bad_words_ids = [[628], [198]] 37 | if padding_token in tokenizer.get_vocab(): 38 | bad_words_ids.append(tokenizer.encode(padding_token)) 39 | 40 | kwargs = dict( 41 | model=model, 42 | tokenizer=tokenizer, 43 | max_length=max_length, 44 | min_length=min_length, 45 | top_k=top_k, 46 | top_p=top_p, 47 | repetition_penalty=repetition_penalty, 48 | do_sample=do_sample, 49 | num_beams=num_beams, 50 | bad_words_ids=bad_words_ids, 51 | dummy_token_id=dummy_token_id, 52 | num_return_sequences=num_return_sequences, 53 | max_generations=max_generations, 54 | device=device, 55 | padding_token=padding_token, 56 | ) 57 | if loader is not None: 58 | result = _generate_with_loader(loader=loader, **kwargs) 59 | elif prompt_dataset is not None: 60 | result = _generate_with_prompt_dataset(prompt_dataset=prompt_dataset, **kwargs) 61 | else: 62 | raise ValueError(f"`loader` and `prompt_dataset` cannot both be `None`.") 63 | 64 | return result 65 | 66 | 67 | def _generate_with_loader( 68 | loader, 69 | 70 | model, 71 | tokenizer: transformers.PreTrainedTokenizer, 72 | max_length, 73 | min_length, 74 | top_k, 75 | top_p, 76 | repetition_penalty, 77 | do_sample, 78 | num_beams, 79 | bad_words_ids, 80 | dummy_token_id, 81 | num_return_sequences, 82 | max_generations, 83 | device, 84 | padding_token, 85 | ): 86 | references = [] 87 | full_generations = [] # Sentences including the prompt part. 88 | unstripped_generations = [] 89 | generations = [] 90 | 91 | stop_generation = False 92 | for batch_idx, batch in tqdm.tqdm(enumerate(loader), desc="generation"): 93 | if stop_generation: 94 | break 95 | 96 | batch_input_ids, batch_labels = batch["input_ids"], batch["labels"] 97 | # e.g., inputs_ids may be [[95, 123, 32], [198, 19, 120]], and 98 | # labels may be [[-100, 123, 32], [-100, -100, 120] 99 | 100 | for input_ids, labels in zip(batch_input_ids, batch_labels): 101 | if stop_generation: 102 | break 103 | 104 | # Find the first pad token and end the sentence from there! 105 | if padding_token in tokenizer.get_vocab(): 106 | pad_positions, = ( 107 | input_ids == tokenizer.encode(padding_token, return_tensors="pt").squeeze() 108 | ).nonzero(as_tuple=True) 109 | # Some sentences might have padding; others might not. 110 | if pad_positions.numel() == 0: 111 | first_pad_position = None 112 | else: 113 | first_pad_position = pad_positions[0] 114 | reference_str: str = tokenizer.decode(input_ids[:first_pad_position], clean_up_tokenization_spaces=True) 115 | else: 116 | reference_str: str = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True) 117 | references.append(reference_str) 118 | 119 | # Find the first non- -100 position. Note there are trailing -100s. 120 | non_prompt_positions, = (labels != dummy_token_id).nonzero(as_tuple=True) 121 | first_non_prompt_position = non_prompt_positions[0].item() 122 | prompt_len = first_non_prompt_position 123 | prompt_ids = input_ids[:prompt_len] 124 | 125 | output_ids = model.generate( 126 | input_ids=prompt_ids[None, ...].to(device), 127 | max_length=max_length + prompt_len, # This cannot be a 0-D tensor! 128 | min_length=min_length, 129 | top_k=top_k, 130 | top_p=top_p, 131 | repetition_penalty=repetition_penalty, 132 | do_sample=do_sample, 133 | bad_words_ids=bad_words_ids, 134 | num_return_sequences=num_return_sequences, 135 | num_beams=num_beams, 136 | pad_token_id=tokenizer.eos_token_id, # Stop the stupid logging... 137 | ) 138 | output_ids = output_ids.squeeze(dim=0) # Throw away batch dimension. 
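
The prompt-length bookkeeping in this loop is easiest to see on toy values. A minimal sketch, reusing the illustrative ids from the comment earlier in this function: the first non-`-100` label position marks where the prompt ends.

```python
import torch

dummy_token_id = -100
input_ids = torch.tensor([95, 123, 32])  # toy ids, as in the comment above
labels = torch.tensor([-100, 123, 32])   # -100 marks prompt positions

non_prompt_positions, = (labels != dummy_token_id).nonzero(as_tuple=True)
prompt_len = non_prompt_positions[0].item()  # 1
prompt_ids = input_ids[:prompt_len]          # tensor([95])
print(prompt_len, prompt_ids)
```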
139 | 140 | whole_str: str = tokenizer.decode(output_ids, clean_up_tokenization_spaces=True) 141 | prompt_str: str = tokenizer.decode(prompt_ids, clean_up_tokenization_spaces=True) 142 | output_str: str = whole_str[len(prompt_str):] 143 | 144 | full_generations.append(whole_str) 145 | del whole_str, prompt_str 146 | 147 | # Remove potential eos_token at the end. 148 | eos_position: Optional[int] = output_str.find(tokenizer.eos_token) 149 | if eos_position == -1: # Didn't generate eos_token; that's okay -- just skip! 150 | eos_position = None 151 | output_str = output_str[:eos_position] 152 | unstripped_generations.append(output_str) 153 | 154 | # Removing leading and trailing spaces. 155 | output_str = output_str.strip() 156 | 157 | generations.append(output_str) 158 | 159 | if len(generations) >= max_generations: 160 | stop_generation = True 161 | 162 | return full_generations, unstripped_generations, generations, references 163 | 164 | 165 | def _generate_with_prompt_dataset( 166 | prompt_dataset, 167 | 168 | model, 169 | tokenizer, 170 | max_length, 171 | min_length, 172 | top_k, 173 | top_p, 174 | repetition_penalty, 175 | do_sample, 176 | num_beams, 177 | bad_words_ids, 178 | dummy_token_id, 179 | num_return_sequences, 180 | max_generations, 181 | device, 182 | padding_token, 183 | ): 184 | references = [] 185 | full_generations = [] # Sentences including the prompt part. 186 | unstripped_generations = [] 187 | generations = [] 188 | 189 | stop_generation = False 190 | for input_ids in tqdm.tqdm(prompt_dataset, desc="generation"): 191 | if stop_generation: 192 | break 193 | 194 | prompt_len = len(input_ids[0]) 195 | output_ids = model.generate( 196 | input_ids=input_ids.to(device), 197 | max_length=max_length + prompt_len, # This cannot be a 0-D tensor! 198 | min_length=min_length, 199 | top_k=top_k, 200 | top_p=top_p, 201 | repetition_penalty=repetition_penalty, 202 | do_sample=do_sample, 203 | bad_words_ids=bad_words_ids, 204 | num_return_sequences=num_return_sequences, 205 | num_beams=num_beams, 206 | pad_token_id=tokenizer.eos_token_id, # Stop the stupid logging... 207 | ) 208 | output_ids = output_ids.squeeze(dim=0) # Throw away batch dimension. 209 | input_ids = input_ids.squeeze(dim=0) 210 | 211 | whole_str: str = tokenizer.decode(output_ids, clean_up_tokenization_spaces=True) 212 | prompt_str: str = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True) 213 | output_str: str = whole_str[len(prompt_str):] 214 | 215 | full_generations.append(whole_str) 216 | del whole_str, prompt_str 217 | 218 | # Remove potential eos_token at the end. 219 | eos_position: Optional[int] = output_str.find(tokenizer.eos_token) 220 | if eos_position == -1: # Didn't generate eos_token; that's okay -- just skip! 221 | eos_position = None 222 | output_str = output_str[:eos_position] 223 | unstripped_generations.append(output_str) 224 | 225 | # Removing leading and trailing spaces. 
226 | output_str = output_str.strip() 227 | 228 | generations.append(output_str) 229 | 230 | if len(generations) >= max_generations: 231 | stop_generation = True 232 | return full_generations, unstripped_generations, generations, references 233 | -------------------------------------------------------------------------------- /examples/table2text/gpt_config_stage123.json: -------------------------------------------------------------------------------- 1 | { 2 | "bf16": { 3 | "enabled": true 4 | }, 5 | "fp16": { 6 | "enabled": false, 7 | "fp16_master_weights_and_grads": false, 8 | "loss_scale": 1, 9 | "loss_scale_window": 1000, 10 | "hysteresis": 2, 11 | "min_loss_scale": 1, 12 | "initial_scale_power": 3 13 | }, 14 | "train_micro_batch_size_per_gpu": 999999999, 15 | "wall_clock_breakdown": false, 16 | "zero_optimization": { 17 | "stage": 1, 18 | "allgather_partitions": true, 19 | "reduce_scatter": true, 20 | "allgather_bucket_size": 50000000, 21 | "reduce_bucket_size": 50000000, 22 | "overlap_comm": true, 23 | "contiguous_gradients": true, 24 | "cpu_offload": false 25 | } 26 | } 27 | -------------------------------------------------------------------------------- /examples/table2text/misc.py: -------------------------------------------------------------------------------- 1 | """Miscellaneous utilities. 2 | 3 | Mostly bespoke data loaders at the moment. 4 | """ 5 | 6 | from transformers import ( 7 | DataCollatorForLanguageModeling, 8 | DataCollatorForPermutationLanguageModeling, 9 | PreTrainedTokenizer 10 | ) 11 | 12 | try: 13 | from .compiled_args import DataTrainingArguments 14 | from .data_utils.data_collator import DataCollatorForData2TextLanguageModeling 15 | from .data_utils.language_modeling import LineByLineE2ETextDataset, LineByLineTriplesTextDataset 16 | except: 17 | from compiled_args import DataTrainingArguments 18 | from data_utils.data_collator import DataCollatorForData2TextLanguageModeling 19 | from data_utils.language_modeling import LineByLineE2ETextDataset, LineByLineTriplesTextDataset 20 | 21 | 22 | def get_dataset_with_path( 23 | data_args: DataTrainingArguments, 24 | tokenizer: PreTrainedTokenizer, 25 | file_path: str, 26 | max_examples: int, 27 | **_, 28 | ): 29 | if data_args.line_by_line: 30 | if data_args.task_mode == 'e2e': 31 | dataset = LineByLineE2ETextDataset( 32 | tokenizer=tokenizer, 33 | file_path=file_path, 34 | block_size=data_args.block_size, 35 | bos_tok=tokenizer.bos_token, 36 | eos_tok=tokenizer.eos_token, 37 | max_seq_len=data_args.max_seq_len, 38 | max_examples=max_examples, 39 | ) 40 | elif data_args.task_mode == 'dart': 41 | dataset = LineByLineTriplesTextDataset( 42 | tokenizer=tokenizer, 43 | file_path=file_path, 44 | block_size=data_args.block_size, 45 | bos_tok=tokenizer.bos_token, 46 | eos_tok=tokenizer.eos_token, 47 | max_seq_len=data_args.max_seq_len, 48 | max_examples=max_examples, 49 | ) 50 | else: 51 | raise ValueError(f"Unknown `args.task_mode`: {data_args.task_mode}") 52 | 53 | else: 54 | raise ValueError("table2text task don't support anything other than line_by_line!") 55 | return dataset 56 | 57 | 58 | def get_prompt_dataset(file_path, tokenizer): 59 | with open(file_path, 'r') as f: 60 | lines = f.readlines() 61 | encoded_lines = [ 62 | tokenizer.encode(line.strip(), add_special_tokens=False, return_tensors="pt") 63 | for line in lines 64 | ] 65 | return encoded_lines 66 | 67 | 68 | def get_all_datasets(config, tokenizer, data_args, model_args, **_): 69 | kwargs = dict(data_args=data_args, tokenizer=tokenizer, 
cache_dir=model_args.cache_dir) 70 | train_dataset = get_dataset_with_path( 71 | **kwargs, file_path=data_args.train_data_file, max_examples=data_args.max_train_examples 72 | ) 73 | valid_dataset = get_dataset_with_path( 74 | **kwargs, file_path=data_args.valid_data_file, max_examples=data_args.max_valid_examples 75 | ) 76 | eval_dataset = get_dataset_with_path( 77 | **kwargs, file_path=data_args.eval_data_file, max_examples=data_args.max_eval_examples 78 | ) 79 | 80 | if config.model_type == "xlnet": 81 | data_collator = DataCollatorForPermutationLanguageModeling( 82 | tokenizer=tokenizer, 83 | plm_probability=data_args.plm_probability, 84 | max_span_length=data_args.max_span_length, 85 | ) 86 | else: 87 | if data_args.task_mode == 'e2e' or data_args.task_mode == 'dart': 88 | data_collator = DataCollatorForData2TextLanguageModeling( 89 | tokenizer=tokenizer, mlm=False, format_mode=data_args.format_mode 90 | ) 91 | else: 92 | data_collator = DataCollatorForLanguageModeling( 93 | tokenizer=tokenizer, mlm=False, 94 | ) 95 | 96 | return train_dataset, valid_dataset, eval_dataset, data_collator 97 | -------------------------------------------------------------------------------- /examples/table2text/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from transformers import GPT2PreTrainedModel, GPT2LMHeadModel 4 | 5 | 6 | class _View(nn.Module): 7 | def __init__(self, shape): 8 | super(_View, self).__init__() 9 | self.shape = shape 10 | 11 | def forward(self, x): 12 | return x.reshape(*self.shape) 13 | 14 | 15 | class PrefixTuner(GPT2PreTrainedModel): 16 | """A minimalistic implementation of the core components.""" 17 | 18 | def __init__(self, config, model_args, gpt2=None): 19 | super(PrefixTuner, self).__init__(config=config) 20 | 21 | # Instantiate a GPT-2, and DON'T optimizer it! 22 | if gpt2 is None: 23 | self.gpt2 = GPT2LMHeadModel.from_pretrained( 24 | model_args.model_name_or_path, config=config, cache_dir=model_args.cache_dir, 25 | ) 26 | else: 27 | self.gpt2 = gpt2 28 | 29 | self.register_buffer('extra_prefix_ids', torch.arange(model_args.prefix_len)) 30 | # TODO: Also introduce the easier net. 31 | self.extra_prefix_net = nn.Sequential( 32 | nn.Embedding(model_args.prefix_len, config.n_embd), 33 | nn.Linear(config.n_embd, model_args.mid_dim), 34 | nn.Tanh(), 35 | nn.Linear(model_args.mid_dim, config.n_layer * 2 * config.n_embd), 36 | _View((-1, model_args.prefix_len, config.n_layer * 2, config.n_head, config.n_embd // config.n_head)), 37 | nn.Dropout(model_args.prefix_dropout), 38 | ) 39 | 40 | def make_past_key_values(self, bsz=None): 41 | extra_prefix_ids = self.extra_prefix_ids[None, :].expand(bsz, -1) 42 | past_key_values = self.extra_prefix_net(extra_prefix_ids) 43 | # (n_layer, batch_size, n_head, prefix_len, n_embed // n_head). 44 | # e.g., (2, 1, 12, 5, 64,). 
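
The permute-and-split step that follows is the crux of `make_past_key_values`: it turns the prefix network's output into one `(key, value)` chunk per layer. A standalone shape check, with sizes made up to match the example shape in the comment above (`n_layer=2`, `bsz=1`, `n_head=12`, `prefix_len=5`, `head_dim=64`):

```python
import torch

bsz, prefix_len, n_layer, n_head, head_dim = 1, 5, 2, 12, 64
out = torch.randn(bsz, prefix_len, n_layer * 2, n_head, head_dim)  # prefix-net output

out = out.permute([2, 0, 3, 1, 4])     # (n_layer * 2, bsz, n_head, prefix_len, head_dim)
past_key_values = out.split(2, dim=0)  # one chunk of size 2 (key, value) per layer

print(len(past_key_values))      # 2 == n_layer
print(past_key_values[0].shape)  # torch.Size([2, 1, 12, 5, 64])
```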
45 | past_key_values = past_key_values.permute([2, 0, 3, 1, 4]).split(2, dim=0) 46 | return past_key_values 47 | 48 | def state_dict(self): 49 | """Avoid storing GPT-2, since it's not even trained.""" 50 | return self.extra_prefix_net.state_dict() 51 | 52 | def load_state_dict(self, state_dict): 53 | """Avoid loading GPT-2, since it's not even trained.""" 54 | self.extra_prefix_net.load_state_dict(state_dict) 55 | 56 | @property 57 | def major_device(self): 58 | """Returns the device where the parameters are on.""" 59 | return next(self.parameters()).device 60 | 61 | def forward( 62 | self, 63 | input_ids, 64 | attention_mask=None, 65 | token_type_ids=None, 66 | position_ids=None, 67 | head_mask=None, 68 | inputs_embeds=None, 69 | encoder_hidden_states=None, 70 | encoder_attention_mask=None, 71 | labels=None, 72 | use_cache=None, 73 | output_attentions=None, 74 | output_hidden_states=None, 75 | return_dict=None, 76 | **kwargs, 77 | ): 78 | past_key_values = self.make_past_key_values(bsz=input_ids.size(0)) 79 | return self.gpt2( 80 | input_ids=input_ids, 81 | past_key_values=past_key_values, 82 | attention_mask=attention_mask, 83 | token_type_ids=token_type_ids, 84 | position_ids=position_ids, 85 | head_mask=head_mask, 86 | inputs_embeds=inputs_embeds, 87 | encoder_hidden_states=encoder_hidden_states, 88 | encoder_attention_mask=encoder_attention_mask, 89 | labels=labels, 90 | use_cache=use_cache, 91 | output_attentions=output_attentions, 92 | output_hidden_states=output_hidden_states, 93 | return_dict=return_dict, 94 | **kwargs 95 | ) 96 | 97 | def generate(self, input_ids, num_beams, **kwargs): 98 | # Additional files also changed: 99 | # src/transformers/generation_utils.py 100 | # src/transformers/models/gpt2/modeling_gpt2.py 101 | 102 | # A sanity check is to optimize the model for a few updates and check if the beam-search generations changed. 103 | # The confusing logic in generation_utils: 104 | # 1) `past` is used in `GPT2LMHeadModel:prepare_inputs_for_generation`, 105 | # 2) it's converted to `past_key_values` in that function, 106 | # 3) `past_key_values` is then updated in forward due to return_dict, 107 | # 4) `past` is set to `past_key_values` in `generation_utils:_update_model_kwargs_for_generation` 108 | 109 | # This is expansion step is important for generation, since otherwise the shapes are wrong. 110 | past_key_values = self.make_past_key_values(bsz=input_ids.size(0) * num_beams) 111 | # --- 112 | 113 | return self.gpt2.generate( 114 | input_ids=input_ids, 115 | num_beams=num_beams, 116 | past_key_values=past_key_values, 117 | 118 | use_cache=True, 119 | position_ids=None, 120 | 121 | # The logic: At beginning, past=None, and then it gets replaced with past_key_values. 122 | # Can't directly give in past, since otherwise, input_ids gets truncated to the last index. 
123 | use_past_key_values_as_past_at_init=True, 124 | nullify_attention_mask=True, 125 | # --- 126 | 127 | **kwargs 128 | ) 129 | -------------------------------------------------------------------------------- /examples/table2text/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1} 4 | output_dir=${2} 5 | task_mode=${3} 6 | model_name_or_path=${4:-"gpt2"} # One of distilgpt2, gpt2, gpt2-medium, gpt2-large 7 | target_epsilon=${5:-8} 8 | clipping_fn=${6:-"automatic"} 9 | clipping_mode=${7:-"MixOpt"} 10 | clipping_style=${8:-"all-layer"} 11 | bias_only=${9:-"no"} 12 | non_private=${10:-"no"} 13 | physical_batch_size=${11:-50} 14 | learning_rate=${12:-0.002} 15 | batch_size=${13:-1000} 16 | attention_only=${14:-"no"} 17 | static_lm_head=${15:-"no"} 18 | static_embedding=${16:-"no"} 19 | 20 | if [[ ${task_mode} == "e2e" ]]; then 21 | data_dir="${data_dir}/data/e2e_data" 22 | target_delta=8e-6 23 | num_train_epochs=10 24 | max_seq_len=100 25 | else 26 | if [[ ${task_mode} == "dart" ]]; then 27 | target_delta=1e-5 28 | data_dir="${data_dir}/data/dart" 29 | num_train_epochs=15 # Approximately same number of updates. 30 | learning_rate=5e-4 # Lower learning rate for stability in large models. 31 | max_seq_len=120 32 | else 33 | echo "Unknown task: ${task_mode}" 34 | exit 1 35 | fi 36 | fi 37 | 38 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size})) 39 | 40 | # Arguments in the last two lines are the most important. 41 | python table2text/run_language_modeling.py \ 42 | --output_dir ${output_dir} --overwrite_output_dir \ 43 | --task_mode ${task_mode} \ 44 | --model_name_or_path ${model_name_or_path} \ 45 | --tokenizer_name ${model_name_or_path} \ 46 | --do_train --do_eval \ 47 | --line_by_line \ 48 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 49 | --logging_dir ${output_dir} --logging_steps -1 \ 50 | --seed 0 \ 51 | --eval_steps 100 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 52 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 53 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 54 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 55 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 56 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \ 57 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 58 | --non_private ${non_private} \ 59 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" \ 60 | 61 | 62 | 63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /examples/table2text/run_ZERO1.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1:-"data/prefix-tuning"} 4 | output_dir=${2:-"data/output"} 5 | task_mode=${3:-"e2e"} 6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl","gptj" 7 | target_epsilon=${5:-8} 8 
| clipping_fn=${6:-"automatic"} 9 | clipping_mode=${7:-"MixOpt"} 10 | clipping_style=${8:-"layer-wise"} 11 | bias_only=${9:-"no"} 12 | non_private=${10:-"no"} 13 | physical_batch_size=${11:-4} 14 | learning_rate=${12:-0.002} 15 | batch_size=${13:-1024} 16 | attention_only=${14:-"no"} 17 | static_lm_head=${15:-"no"} 18 | static_embedding=${16:-"no"} 19 | num_GPUs=${17:-8} 20 | deepspeed_config=${18:-"table2text/gpt_config_stage123.json"} 21 | 22 | if [[ ${task_mode} == "e2e" ]]; then 23 | data_dir="${data_dir}/data/e2e_data" 24 | target_delta=8e-6 25 | num_train_epochs=10 26 | max_seq_len=100 27 | if [[ ${bias_only} == "yes" ]]; then 28 | learning_rate=1e-2 29 | else 30 | learning_rate=2e-3 31 | fi 32 | else 33 | if [[ ${task_mode} == "dart" ]]; then 34 | target_delta=1e-5 35 | data_dir="${data_dir}/data/dart" 36 | num_train_epochs=15 # Approximately same number of updates. 37 | learning_rate=5e-4 # Lower learning rate for stability in large models. 38 | max_seq_len=120 39 | if [[ ${bias_only} == "yes" ]]; then 40 | learning_rate=2e-3 41 | else 42 | learning_rate=5e-4 43 | fi 44 | 45 | else 46 | echo "Unknown task: ${task_mode}" 47 | exit 1 48 | fi 49 | fi 50 | 51 | deepspeed table2text/run_language_modeling.py --deepspeed_config ${deepspeed_config} \ 52 | --output_dir ${output_dir} --overwrite_output_dir \ 53 | --task_mode ${task_mode} \ 54 | --model_name_or_path ${model_name_or_path} \ 55 | --tokenizer_name ${model_name_or_path} \ 56 | --do_train --do_eval \ 57 | --line_by_line \ 58 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 59 | --logging_dir ${output_dir} --logging_steps -1 \ 60 | --seed 0 \ 61 | --dataloader_num_workers 2 \ 62 | --eval_steps -1 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \ 63 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 64 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 65 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 66 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 67 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 68 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --logical_batch_size ${batch_size}\ 69 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 70 | --non_private ${non_private} \ 71 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" 72 | -------------------------------------------------------------------------------- /examples/table2text/run_ZERO23.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1:-"table2text/data/prefix-tuning"} 4 | output_dir=${2:-"table2text/data/output"} 5 | task_mode=${3:-"e2e"} 6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl" 7 | target_epsilon=${5:-8} 8 | clipping_fn=${6:-"automatic"} 9 | clipping_mode=${7:-"MixOpt"} 10 | clipping_style=${8:-"layer-wise"} 11 | bias_only=${9:-"no"} 12 | non_private=${10:-"no"} 13 | physical_batch_size=${11:-4} 14 | learning_rate=${12:-0.001} 15 | batch_size=${13:-1024} 16 | attention_only=${14:-"no"} 17 | 
static_lm_head=${15:-"no"} 18 | static_embedding=${16:-"no"} 19 | num_GPUs=${17:-8} 20 | deepspeed_config=${18:-"table2text/gpt_config_stage123.json"} 21 | 22 | if [[ ${task_mode} == "e2e" ]]; then 23 | data_dir="${data_dir}/data/e2e_data" 24 | target_delta=8e-6 25 | num_train_epochs=10 26 | max_seq_len=100 27 | else 28 | if [[ ${task_mode} == "dart" ]]; then 29 | target_delta=1e-5 30 | data_dir="${data_dir}/data/dart" 31 | num_train_epochs=15 # Approximately same number of updates. 32 | learning_rate=5e-4 # Lower learning rate for stability in large models. 33 | max_seq_len=120 34 | else 35 | echo "Unknown task: ${task_mode}" 36 | exit 1 37 | fi 38 | fi 39 | 40 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size} / ${num_GPUs})) 41 | 42 | # Arguments in the last two lines are the most important. 43 | deepspeed table2text/run_language_modeling_ZERO23.py --deepspeed_config ${deepspeed_config} \ 44 | --output_dir ${output_dir} --overwrite_output_dir \ 45 | --task_mode ${task_mode} \ 46 | --model_name_or_path ${model_name_or_path} \ 47 | --tokenizer_name ${model_name_or_path} \ 48 | --do_train --do_eval \ 49 | --line_by_line \ 50 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 51 | --logging_dir ${output_dir} --logging_steps -1 \ 52 | --seed 0 \ 53 | --dataloader_num_workers 2 \ 54 | --eval_steps -1 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \ 55 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 56 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 57 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 58 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 59 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 60 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \ 61 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 62 | --non_private ${non_private} \ 63 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" 64 | -------------------------------------------------------------------------------- /examples/table2text/run_ZERO_extending.py: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1} 4 | output_dir=${2} 5 | task_mode=${3:-"e2e"} 6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl" 7 | target_epsilon=${5:-8} 8 | bias_only=${6:-"no"} 9 | non_private=${7:-"no"} 10 | physical_batch_size=${8:-4} 11 | learning_rate=${9:-0.001} 12 | batch_size=${10:-1024} 13 | attention_only=${11:-"no"} 14 | static_lm_head=${12:-"no"} 15 | static_embedding=${13:-"no"} 16 | num_GPUs=${14:-8} 17 | deepspeed_config=${15:-"table2text/gpt_config_stage123.json"} 18 | 19 | if [[ ${task_mode} == "e2e" ]]; then 20 | data_dir="${data_dir}/data/e2e_data" 21 | target_delta=8e-6 22 | num_train_epochs=10 23 | max_seq_len=100 24 | else 25 | if [[ ${task_mode} == "dart" ]]; then 26 | target_delta=1e-5 27 | data_dir="${data_dir}/data/dart" 28 | num_train_epochs=15 # Approximately same number of updates. 
29 | learning_rate=5e-4 # Lower learning rate for stability in large models. 30 | max_seq_len=120 31 | else 32 | echo "Unknown task: ${task_mode}" 33 | exit 1 34 | fi 35 | fi 36 | 37 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size} / ${num_GPUs})) 38 | 39 | deepspeed table2text/run_language_modeling_extending.py --deepspeed_config ${deepspeed_config} \ 40 | --output_dir ${output_dir} --overwrite_output_dir \ 41 | --task_mode ${task_mode} \ 42 | --model_name_or_path ${model_name_or_path} \ 43 | --tokenizer_name ${model_name_or_path} \ 44 | --do_train --do_eval \ 45 | --line_by_line \ 46 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 47 | --logging_dir ${output_dir} --logging_steps -1 \ 48 | --seed 0 \ 49 | --dataloader_num_workers 2 \ 50 | --eval_steps 100 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \ 51 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 52 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 53 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 54 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 55 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 56 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \ 57 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 58 | --non_private ${non_private} \ 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /examples/text_classification/README.md: -------------------------------------------------------------------------------- 1 | ## DP text classification with Huggingface transformers 2 | 3 | ### Getting the data 4 | 5 | We adopt the data pipeline by \[[Li et al., 2021](https://arxiv.org/pdf/2110.05679.pdf)\], which is adapted from the excellent work by \[[Gao et al., 2021](https://arxiv.org/pdf/2012.15723.pdf)\]. To obtain the data, run the following: 6 | 7 | ```plaintext 8 | cd data; bash download_dataset.sh 9 | ``` 10 | 11 | This should produce a `data/original` subfolder that contains the GLUE ([General Language Understanding Evaluation](https://huggingface.co/datasets/glue)) datasets. 12 | 13 | ### Running 14 | 15 | Use the `run_wrapper.py` script in the folder, which runs the `run_classification.py` for the command. 16 | 17 | Necessary arguments: 18 | 19 | * `--output_dir`: path to a folder where results will be written 20 | * `--task_name`: name of task; one of `sst-2`, `qnli`, `qqp`, `mnli` 21 | 22 | For instance, run the following under the `examples` folder: 23 | 24 | ```plaintext 25 | python -m text_classification.run_wrapper --output_dir ToDeleteNLU --task_name sst-2 26 | ``` 27 | 28 | The script by default uses book-keeping (BK) by [[Differentially Private Optimization on Large Model at Small Cost]](https://arxiv.org/pdf/2210.00038.pdf) for the DP full fine-tuning. Gradient accumulation is used so that larger physical batch size allows faster training at heavier memory burden, but the accuracy is not affected. 
For SST-2/QNLI/QQP/MNLI, running `roberta-base` on one A100 GPU (40GB) takes around 5/8/37/32 min per epoch.
29 | 
30 | Additional arguments:
31 | 
32 | * `--model_name_or_path`: The pretrained model; one of `distilbert-base-uncased`, `bert-base-uncased`, `bert-large-uncased`, `distilroberta-base`, `roberta-base`, `roberta-large`.
33 | 
34 | * `--target_epsilon`: Target privacy spending; default is 8.
35 | 
36 | * `--few_shot_type`: Whether to use the generic prompt formatter described in Section 3.2 of our paper; pass `prompt` to use it, or `finetune` to fine-tune without it.
37 | 
38 | * `--non_private`: Whether to train differentially privately; one of `yes`, `no` (default).
39 | 
40 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `ghost` (default, meaning book-keeping), `MixGhostClip`, `MixOpt`. All three modes are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf).
41 | 
42 | * `--clipping_fn`: Which per-sample gradient clipping function to use; one of `automatic` (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), `Abadi` [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), `global` [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf).
43 | 
44 | * `--clipping_style`: Which per-sample gradient clipping style to use; one of `all-layer` (flat clipping), `layer-wise` (each layer is a group, including both weight and bias parameters), `param-wise` (each parameter is a group), or a list of layer names (general group-wise clipping). For example, pass `--clipping_style 2` for 2-group clipping or `--clipping_style 12` for 12-group clipping.
45 | 
46 | * `--attention_only`: Whether to only train attention layers; one of `yes`, `no` (default).
47 | 
48 | * `--bias_only`: Whether to only train bias terms; one of `yes`, `no` (default). If yes, this is implementing [[Differentially Private Bias-Term only
49 | Fine-tuning]](https://arxiv.org/pdf/2210.00036.pdf).
50 | 
51 | * `--physical_batch_size`: Physical batch size under gradient accumulation; it determines memory use and speed, but not accuracy.
52 | 
53 | * `--batch_size`: Logical batch size, which determines convergence and accuracy; must be a multiple of `physical_batch_size`. Default is None.
54 | 
55 | Note that, keeping the other training hyperparameters (e.g., number of training epochs, clipping threshold, learning rate) at their defaults, the script should reproduce the results in \[[Li et al., 2021](https://arxiv.org/pdf/2110.05679.pdf); [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)\].
56 | 
--------------------------------------------------------------------------------
/examples/text_classification/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/text_classification/__init__.py
--------------------------------------------------------------------------------
/examples/text_classification/data/download_dataset.sh:
--------------------------------------------------------------------------------
1 | wget https://nlp.cs.princeton.edu/projects/lm-bff/datasets.tar
2 | tar xvf datasets.tar
--------------------------------------------------------------------------------
/examples/text_classification/data/make_k_shot_without_dev.py:
--------------------------------------------------------------------------------
1 | """The datasets in the k-shot folder contain dev.tsv; we make the test set the dev set in the new k-shot.
2 | 3 | python -m classification.data.make_k_shot_without_dev 4 | """ 5 | import os 6 | 7 | from ml_swissknife import utils 8 | 9 | join = os.path.join 10 | 11 | base_dir = '/nlp/scr/lxuechen/data/lm-bff/data/k-shot' 12 | new_dir = '/nlp/scr/lxuechen/data/lm-bff/data/k-shot-no-dev' 13 | 14 | task_names = ("SST-2", "QNLI", "MNLI", "QQP") 15 | for task_name in task_names: 16 | folder = join(base_dir, task_name) 17 | new_folder = join(new_dir, task_name) 18 | 19 | for name in utils.listdir(folder): 20 | subfolder = join(folder, name) 21 | new_subfolder = join(new_folder, name) 22 | os.makedirs(new_subfolder, exist_ok=True) 23 | 24 | train = join(subfolder, 'train.tsv') 25 | new_train = join(new_subfolder, 'train.tsv') 26 | os.system(f'cp {train} {new_train}') 27 | 28 | if task_name == "MNLI": 29 | test = join(subfolder, 'test_matched.tsv') 30 | new_dev = join(new_subfolder, 'dev_matched.tsv') 31 | os.system(f'cp {test} {new_dev}') 32 | 33 | test = join(subfolder, 'test_mismatched.tsv') 34 | new_dev = join(new_subfolder, 'dev_mismatched.tsv') 35 | os.system(f'cp {test} {new_dev}') 36 | else: 37 | test = join(subfolder, 'test.tsv') 38 | new_dev = join(new_subfolder, 'dev.tsv') 39 | os.system(f'cp {test} {new_dev}') 40 | -------------------------------------------------------------------------------- /examples/text_classification/data/make_valid_data.py: -------------------------------------------------------------------------------- 1 | """Make the separate validation data, so that we don't tune on dev set. 2 | 3 | python -m classification.data.make_valid_data 4 | """ 5 | import os 6 | 7 | import fire 8 | import numpy as np 9 | import tqdm 10 | 11 | 12 | def write_lines(path, lines, mode="w"): 13 | os.makedirs(os.path.dirname(path), exist_ok=True) 14 | with open(path, mode) as f: 15 | f.writelines(lines) 16 | print(len(lines)) 17 | 18 | 19 | def main(): 20 | valid_percentage = 0.1 21 | original_dir = "/nlp/scr/lxuechen/data/lm-bff/data/original" 22 | new_dir = "/nlp/scr/lxuechen/data/lm-bff/data/glue-with-validation" 23 | 24 | task_folders = ("GLUE-SST-2", "QNLI", "QQP") 25 | for task_folder in task_folders: 26 | # Create train and valid splits. 27 | full_train_path = os.path.join(original_dir, task_folder, 'train.tsv') 28 | with open(full_train_path, 'r') as f: 29 | full_train = f.readlines() 30 | 31 | header = full_train[0] 32 | full_train = full_train[1:] # Remove header. 33 | 34 | indices = np.random.permutation(len(full_train)) 35 | new_valid_size = int(len(indices) * valid_percentage) 36 | new_train_size = len(indices) - new_valid_size 37 | new_train_indices = indices[:new_train_size] 38 | new_valid_indices = indices[new_train_size:] 39 | assert len(new_train_indices) == new_train_size 40 | assert len(new_valid_indices) == new_valid_size 41 | 42 | new_train = [header] + [full_train[i] for i in new_train_indices] 43 | new_valid = [header] + [full_train[i] for i in new_valid_indices] 44 | 45 | new_train_path = os.path.join(new_dir, task_folder, 'train.tsv') 46 | new_valid_path = os.path.join(new_dir, task_folder, 'dev.tsv') 47 | 48 | write_lines(new_train_path, new_train) 49 | write_lines(new_valid_path, new_valid) 50 | del new_train, new_valid, new_train_path, new_valid_path 51 | del new_train_size, new_train_indices 52 | del new_valid_size, new_valid_indices 53 | 54 | # Make test! 
55 | test_path = os.path.join(original_dir, task_folder, 'dev.tsv') 56 | new_test_path = os.path.join(new_dir, task_folder, 'test.tsv') 57 | os.system(f'cp {test_path} {new_test_path}') 58 | del test_path, new_test_path 59 | 60 | # Make valid set for MNLI; different, since matched/mismatched! 61 | task_folder = "MNLI" 62 | matched_genres = ['slate', 'government', 'telephone', 'travel', 'fiction'] 63 | mismatched_genres = ['letters', 'verbatim', 'facetoface', 'oup', 'nineeleven'] 64 | full_train_path = os.path.join(original_dir, task_folder, 'train.tsv') 65 | with open(full_train_path, 'r') as f: 66 | full_train = f.readlines() 67 | full_train_csv = [line.split('\t') for line in full_train] 68 | 69 | # Check the lengths are correct. 70 | l = len(full_train_csv[0]) 71 | for line in full_train_csv: 72 | assert l == len(line) 73 | 74 | # Remove header. 75 | header = full_train[0] 76 | header_csv = full_train_csv[0] 77 | 78 | full_train = full_train[1:] 79 | full_train_csv = full_train_csv[1:] 80 | 81 | # Get index of genre. 82 | genre_index = header_csv.index('genre') 83 | 84 | # Shuffle both! 85 | indices = np.random.permutation(len(full_train)) 86 | full_train = [full_train[i] for i in indices] 87 | full_train_csv = [full_train_csv[i] for i in indices] 88 | 89 | # Split validation. 90 | new_valid_size = int(len(indices) * valid_percentage) 91 | new_matched_valid_size = new_mismatched_valid_size = new_valid_size // 2 92 | 93 | # Fetch the indices. 94 | new_train_indices = [] 95 | new_matched_valid_indices = [] 96 | new_mismatched_valid_indices = [] 97 | matched_count = mismatched_count = 0 98 | for i, row in enumerate(full_train_csv): 99 | genre = row[genre_index] 100 | if genre in matched_genres and matched_count < new_matched_valid_size: 101 | new_matched_valid_indices.append(i) 102 | matched_count += 1 103 | elif genre in mismatched_genres and mismatched_count < new_mismatched_valid_size: 104 | new_mismatched_valid_indices.append(i) 105 | mismatched_count += 1 106 | else: 107 | new_train_indices.append(i) 108 | 109 | new_matched_valid_indices = set(new_matched_valid_indices) 110 | new_mismatched_valid_indices = set(new_mismatched_valid_indices) 111 | 112 | new_train = [header] 113 | new_matched_valid = [header] 114 | new_mismatched_valid = [header] 115 | for i, line in tqdm.tqdm(enumerate(full_train)): 116 | if i in new_matched_valid_indices: 117 | new_matched_valid.append(line) 118 | elif i in new_mismatched_valid_indices: 119 | new_mismatched_valid.append(line) 120 | else: 121 | new_train.append(line) 122 | 123 | new_train_path = os.path.join(new_dir, task_folder, 'train.tsv') 124 | new_matched_valid_path = os.path.join(new_dir, task_folder, 'dev_matched.tsv') 125 | new_mismatched_valid_path = os.path.join(new_dir, task_folder, 'dev_mismatched.tsv') 126 | 127 | write_lines(new_train_path, new_train) 128 | write_lines(new_matched_valid_path, new_matched_valid) 129 | write_lines(new_mismatched_valid_path, new_mismatched_valid) 130 | 131 | matched_test_path = os.path.join(original_dir, task_folder, 'dev_matched.tsv') 132 | new_matched_test_path = os.path.join(new_dir, task_folder, 'test_matched.tsv') 133 | os.system(f'cp {matched_test_path} {new_matched_test_path}') 134 | 135 | mismatched_test_path = os.path.join(original_dir, task_folder, 'dev_mismatched.tsv') 136 | new_mismatched_test_path = os.path.join(new_dir, task_folder, 'test_mismatched.tsv') 137 | os.system(f'cp {mismatched_test_path} {new_mismatched_test_path}') 138 | 139 | 140 | if __name__ == "__main__": 141 | fire.Fire(main) 
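
The shuffle-then-slice split used in `main` above generalizes to any line-based dataset. A self-contained sketch with a made-up ten-example dataset (the real script leaves NumPy's global RNG unseeded; the explicit seed here is only for a reproducible illustration):

```python
import numpy as np

lines = [f"example-{i}\n" for i in range(10)]  # stand-in for the header-stripped train split
valid_percentage = 0.1

rng = np.random.default_rng(0)
indices = rng.permutation(len(lines))
new_valid_size = int(len(indices) * valid_percentage)
new_train_size = len(indices) - new_valid_size

new_train = [lines[i] for i in indices[:new_train_size]]
new_valid = [lines[i] for i in indices[new_train_size:]]
assert len(new_train) == 9 and len(new_valid) == 1
```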
142 | -------------------------------------------------------------------------------- /examples/text_classification/run_wrapper.py: -------------------------------------------------------------------------------- 1 | """Wrapper launcher script.""" 2 | 3 | import os 4 | 5 | import fire 6 | 7 | from .src import common 8 | 9 | 10 | def _get_command( 11 | task_name, 12 | output_dir, 13 | model_name_or_path, 14 | data_dir, 15 | learning_rate, 16 | clipping_mode: str, 17 | clipping_fn: str, 18 | clipping_style: str, 19 | non_private, 20 | target_epsilon, 21 | few_shot_type, 22 | seed, 23 | attention_only, 24 | bias_only, 25 | static_lm_head, 26 | static_embedding, 27 | randomly_initialize, 28 | physical_batch_size, 29 | batch_size, 30 | num_train_epochs, 31 | eval_steps, 32 | ): 33 | task_name_to_factor = { 34 | "sst-2": 1, "qnli": 2, "qqp": 6, "mnli": 6, 35 | } 36 | factor = task_name_to_factor[task_name] 37 | 38 | if batch_size is None: 39 | base_batch_size = 1000 40 | # This batch size selection roughly ensures the sampling rates on different 41 | # datasets are in the same ballpark. 42 | batch_size = int(base_batch_size * factor) 43 | gradient_accumulation_steps = batch_size // physical_batch_size 44 | 45 | if num_train_epochs is None: 46 | base_num_train_epochs = 3 47 | num_train_epochs = int(base_num_train_epochs * factor) 48 | 49 | if learning_rate is None: 50 | if non_private.lower() in ('yes', 'y', 'true', 't'): 51 | learning_rate = 5e-5 52 | if bias_only.lower() in ('yes', 'y', 'true', 't'): 53 | learning_rate=1e-3 54 | else: 55 | learning_rate = 5e-4 56 | if bias_only.lower() in ('yes', 'y', 'true', 't'): 57 | learning_rate=5e-3 58 | 59 | data_dir = f"{data_dir}/{common.task_name2suffix_name[task_name]}" 60 | template = { 61 | "sst-2": "*cls**sent_0*_It_was*mask*.*sep+*", 62 | "mnli": "*cls**sent-_0*?*mask*,*+sentl_1**sep+*", 63 | "qnli": "*cls**sent-_0*?*mask*,*+sentl_1**sep+*", 64 | "qqp": "*cls**sent-_0**mask*,*+sentl_1**sep+*", 65 | }[task_name] 66 | 67 | # Epochs chosen roughly to match e2e number of updates. 
We didn't hyperparameter tune on classification tasks :) 68 | cmd = f''' 69 | python -m text_classification.run_classification \ 70 | --task_name {task_name} \ 71 | --data_dir {data_dir} \ 72 | --output_dir {output_dir} \ 73 | --overwrite_output_dir \ 74 | --model_name_or_path {model_name_or_path} \ 75 | --few_shot_type {few_shot_type} \ 76 | --num_k 1 \ 77 | --num_sample 1 --seed {seed} \ 78 | --template {template} \ 79 | --non_private {non_private} \ 80 | --num_train_epochs {num_train_epochs} \ 81 | --target_epsilon {target_epsilon} \ 82 | --per_device_train_batch_size {physical_batch_size} \ 83 | --gradient_accumulation_steps {gradient_accumulation_steps} \ 84 | --per_device_eval_batch_size 8 \ 85 | --per_example_max_grad_norm 0.1 --clipping_mode {clipping_mode} \ 86 | --clipping_fn {clipping_fn} --clipping_style {clipping_style}\ 87 | --learning_rate {learning_rate} \ 88 | --lr_decay yes \ 89 | --adam_epsilon 1e-08 \ 90 | --weight_decay 0 \ 91 | --max_seq_len 256 \ 92 | --evaluation_strategy steps --eval_steps {eval_steps} --evaluate_before_training True \ 93 | --do_train --do_eval \ 94 | --first_sent_limit 200 --other_sent_limit 200 --truncate_head yes \ 95 | --attention_only {attention_only} --bias_only {bias_only} --static_lm_head {static_lm_head} --static_embedding {static_embedding} \ 96 | --randomly_initialize {randomly_initialize} 97 | ''' 98 | return cmd 99 | 100 | 101 | def main( 102 | output_dir, 103 | task_name, 104 | few_shot_type="prompt", # finetune or prompt 105 | seed=0, 106 | model_name_or_path="roberta-base", 107 | data_dir="text_classification/data/original", 108 | learning_rate=None, 109 | clipping_mode="MixOpt", 110 | clipping_fn="automatic", 111 | clipping_style="all-layer", 112 | non_private="no", 113 | target_epsilon=8, 114 | attention_only="no", 115 | bias_only="no", 116 | static_lm_head="no", 117 | static_embedding="no", 118 | physical_batch_size =40, 119 | eval_steps=10, 120 | randomly_initialize="no", 121 | batch_size=None, 122 | num_train_epochs=None, 123 | ): 124 | command = _get_command( 125 | output_dir=output_dir, 126 | task_name=task_name, 127 | model_name_or_path=model_name_or_path, 128 | data_dir=data_dir, 129 | learning_rate=learning_rate, 130 | clipping_mode=clipping_mode, 131 | clipping_fn=clipping_fn, 132 | clipping_style=clipping_style, 133 | non_private=non_private, 134 | target_epsilon=target_epsilon, 135 | few_shot_type=few_shot_type, 136 | seed=seed, 137 | attention_only=attention_only, 138 | bias_only=bias_only, 139 | static_lm_head=static_lm_head, 140 | static_embedding=static_embedding, 141 | physical_batch_size = physical_batch_size, 142 | eval_steps=eval_steps, 143 | randomly_initialize=randomly_initialize, 144 | batch_size=batch_size, 145 | num_train_epochs=num_train_epochs, 146 | ) 147 | print('Running command:') 148 | print(command) 149 | os.system(command) 150 | 151 | 152 | if __name__ == "__main__": 153 | fire.Fire(main) 154 | -------------------------------------------------------------------------------- /examples/text_classification/src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/text_classification/src/__init__.py -------------------------------------------------------------------------------- /examples/text_classification/src/common.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | task_name2suffix_name 
= {"sst-2": "GLUE-SST-2", "mnli": "MNLI", "qqp": "QQP", "qnli": "QNLI"} 4 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 5 | true_tags = ('y', 'yes', 't', 'true') 6 | -------------------------------------------------------------------------------- /examples/text_classification/src/compiled_args.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | 3 | import transformers 4 | 5 | from .common import true_tags 6 | from typing import Optional 7 | 8 | 9 | @dataclass 10 | class PrivacyArguments: 11 | """Arguments for differentially private training.""" 12 | 13 | per_example_max_grad_norm: float = field( 14 | default=.1, metadata={ 15 | "help": "Clipping 2-norm of per-sample gradients." 16 | } 17 | ) 18 | noise_multiplier: float = field( 19 | default=None, metadata={ 20 | "help": "Standard deviation of noise added for privacy; if `target_epsilon` is specified, " 21 | "use the one searched based budget" 22 | } 23 | ) 24 | target_epsilon: float = field( 25 | default=None, metadata={ 26 | "help": "Privacy budget; if `None` use the noise multiplier specified." 27 | } 28 | ) 29 | target_delta: float = field( 30 | default=None, metadata={ 31 | "help": "Lax probability in approximate differential privacy; if `None` use 1 / len(train_data)." 32 | } 33 | ) 34 | non_private: str = field( 35 | default="yes", metadata={"help": "Train non-privately if True."} 36 | ) 37 | accounting_mode: str = field( 38 | default="rdp", metadata={"help": "One of (`rdp`, `glw`, `all`)."} 39 | ) 40 | clipping_mode: str = field( 41 | default="default" 42 | ) 43 | clipping_fn: str = field( 44 | default="automatic" 45 | ) 46 | clipping_style: str = field( 47 | default="all-layer" 48 | ) 49 | 50 | def __post_init__(self): 51 | self.non_private = self.non_private.lower() in true_tags # noqa 52 | 53 | 54 | @dataclass 55 | class TrainingArguments(transformers.TrainingArguments): 56 | eval_epochs: int = field(default=10, metadata={"help": "Evaluate once such epochs"}) 57 | evaluate_before_training: bool = field(default=False, metadata={"help": "Run evaluation before training."}) 58 | lr_decay: str = field( 59 | default="no", metadata={"help": "Apply the usual linear decay if `yes`, otherwise no deacy."} 60 | ) 61 | evaluate_test_split: bool = field(default=False, metadata={"help": "Run evaluation on the test split"}) 62 | 63 | def __post_init__(self): 64 | super(TrainingArguments, self).__post_init__() 65 | self.lr_decay = self.lr_decay.lower() in true_tags # noqa 66 | -------------------------------------------------------------------------------- /examples/text_classification/src/label_search.py: -------------------------------------------------------------------------------- 1 | """Automatic label search helpers.""" 2 | 3 | import itertools 4 | import logging 5 | import multiprocessing 6 | 7 | import numpy as np 8 | import scipy.spatial as spatial 9 | import scipy.special as special 10 | import scipy.stats as stats 11 | import tqdm 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | def select_likely_words(train_logits, train_labels, k_likely=1000, vocab=None, is_regression=False): 17 | """Pre-select likely words based on conditional likelihood.""" 18 | indices = [] 19 | if is_regression: 20 | median = np.median(train_labels) 21 | train_labels = (train_labels > median).astype(np.int) 22 | num_labels = np.max(train_labels) + 1 23 | for idx in range(num_labels): 24 | label_logits = train_logits[train_labels == idx] 25 | 
scores = label_logits.mean(axis=0) 26 | kept = [] 27 | for i in np.argsort(-scores): 28 | text = vocab[i] 29 | if not text.startswith("Ġ"): 30 | continue 31 | kept.append(i) 32 | indices.append(kept[:k_likely]) 33 | return indices 34 | 35 | 36 | def select_neighbors(distances, k_neighbors, valid): 37 | """Select k nearest neighbors based on distance (filtered to be within the 'valid' set).""" 38 | indices = np.argsort(distances) 39 | neighbors = [] 40 | for i in indices: 41 | if i not in valid: 42 | continue 43 | neighbors.append(i) 44 | if k_neighbors > 0: 45 | return neighbors[:k_neighbors] 46 | return neighbors 47 | 48 | 49 | def init(train_logits, train_labels): 50 | global logits, labels 51 | logits = train_logits 52 | labels = train_labels 53 | 54 | 55 | def eval_pairing_acc(pairing): 56 | global logits, labels 57 | label_logits = np.take(logits, pairing, axis=-1) 58 | preds = np.argmax(label_logits, axis=-1) 59 | correct = np.sum(preds == labels) 60 | return correct / len(labels) 61 | 62 | 63 | def eval_pairing_corr(pairing): 64 | global logits, labels 65 | if pairing[0] == pairing[1]: 66 | return -1 67 | label_logits = np.take(logits, pairing, axis=-1) 68 | label_probs = special.softmax(label_logits, axis=-1)[:, 1] 69 | pearson_corr = stats.pearsonr(label_probs, labels)[0] 70 | return pearson_corr 71 | 72 | 73 | def find_labels( 74 | model, 75 | train_logits, 76 | train_labels, 77 | seed_labels=None, 78 | k_likely=1000, 79 | k_neighbors=None, 80 | top_n=-1, 81 | vocab=None, 82 | is_regression=False, 83 | ): 84 | # Get top indices based on conditional likelihood using the LM. 85 | likely_indices = select_likely_words( 86 | train_logits=train_logits, 87 | train_labels=train_labels, 88 | k_likely=k_likely, 89 | vocab=vocab, 90 | is_regression=is_regression) 91 | 92 | logger.info("Top labels (conditional) per class:") 93 | for i, inds in enumerate(likely_indices): 94 | logger.info("\t| Label %d: %s", i, ", ".join([vocab[i] for i in inds[:10]])) 95 | 96 | # Convert to sets. 97 | valid_indices = [set(inds) for inds in likely_indices] 98 | 99 | # If specified, further re-rank according to nearest neighbors of seed labels. 100 | # Otherwise, keep ranking as is (based on conditional likelihood only). 101 | if seed_labels: 102 | assert (vocab is not None) 103 | seed_ids = [vocab.index(l) for l in seed_labels] 104 | vocab_vecs = model.lm_head.decoder.weight.detach().cpu().numpy() 105 | seed_vecs = np.take(vocab_vecs, seed_ids, axis=0) 106 | 107 | # [num_labels, vocab_size] 108 | label_distances = spatial.distance.cdist(seed_vecs, vocab_vecs, metric="cosine") 109 | 110 | # Establish label candidates (as k nearest neighbors). 111 | label_candidates = [] 112 | logger.info("Re-ranked by nearest neighbors:") 113 | for i, distances in enumerate(label_distances): 114 | label_candidates.append(select_neighbors(distances, k_neighbors, valid_indices[i])) 115 | logger.info("\t| Label: %s", seed_labels[i]) 116 | logger.info("\t| Neighbors: %s", " ".join([vocab[idx] for idx in label_candidates[i]])) 117 | else: 118 | label_candidates = likely_indices 119 | 120 | # Brute-force search all valid pairings. 121 | pairings = list(itertools.product(*label_candidates)) 122 | 123 | if is_regression: 124 | eval_pairing = eval_pairing_corr 125 | metric = "corr" 126 | else: 127 | eval_pairing = eval_pairing_acc 128 | metric = "acc" 129 | 130 | # Score each pairing. 
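    # Each `pairing` is one candidate token id per class; workers score a pairing
    # with eval_pairing_acc (classification) or eval_pairing_corr (regression),
    # reading the train logits/labels shared through the pool initializer `init`.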
131 |     pairing_scores = []
132 |     with multiprocessing.Pool(initializer=init, initargs=(train_logits, train_labels)) as workers:
133 |         with tqdm.tqdm(total=len(pairings)) as pbar:
134 |             chunksize = max(10, int(len(pairings) / 1000))
135 |             for score in workers.imap(eval_pairing, pairings, chunksize=chunksize):
136 |                 pairing_scores.append(score)
137 |                 pbar.update()
138 |
139 |     # Take top-n.
140 |     best_idx = np.argsort(-np.array(pairing_scores))[:top_n]
141 |     best_scores = [pairing_scores[i] for i in best_idx]
142 |     best_pairings = [pairings[i] for i in best_idx]
143 |
144 |     logger.info("Automatically searched pairings:")
145 |     for i, indices in enumerate(best_pairings):
146 |         logger.info("\t| %s (%s = %2.2f)", " ".join([vocab[j] for j in indices]), metric, best_scores[i])
147 |
148 |     return best_pairings
149 |
--------------------------------------------------------------------------------
/fastDP/README.md:
--------------------------------------------------------------------------------
1 | ### Two Privacy Engines
2 |
3 | FastDP provides two privacy engines to compute the private gradient: **hook-based** and **torch-extending**. These privacy engines are mathematically equivalent, though their applicability and computational efficiency can differ. We summarize the differences below and note that some limitations can be overcome with more engineering effort.
4 |
5 | |  | Hook-based (DP) | Torch-extending (DP) | Standard (non-DP) |
6 | |:----------------------------:|:-------------------------------:|:----------------:|:------------:|
7 | | Speed (1/time complexity) | 80-100% | ~70% | 100% |
8 | | Memory cost (space complexity) | 100-130% | ~100% | 100% |
9 | | ZeRO distribution solution | ✅ Supported | ✅ Supported | ✅ Supported |
10 | | Most types of layers | ✅ Supported (see below) | ✅ Supported (see below) | ✅ Supported |
11 | | Per-sample clipping styles | ✅ Supported for all styles | Layer-wise style | ✅ Not needed |
12 | | Per-sample clipping functions | ✅ Supported for all functions | Automatic clipping | ✅ Not needed |
13 | | Modifying optimizers | Needed for `PrivacyEngine`; not needed for ZeRO | ✅ Not needed | ✅ Not needed |
14 | | Private gradient stored in | `param.private_grad` | `param.grad` | `param.grad` |
15 | | Fused kernel | ✅ Supported | Not supported | ✅ Supported |
16 | | Ghost differentiation (origin param) | Supported on single GPU | Not supported | Not needed |
17 | | Recommended usage | Single GPU or ZeRO | General | General |
18 |
19 | #### 1. Hook-based
20 | The hook-based approach computes the private gradient with forward hooks (to store the activations) and backward hooks (to compute the per-sample gradient norms, to clip, and to add noise). See [this tutorial for hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html). This approach first computes the private gradient and then overrides the non-DP gradient.
21 |
22 | On a single GPU or under data parallelism (see `PrivacyEngine`), the hooks are backward module hooks, which are triggered before `param.grad` is computed; in ZeRO (see `PrivacyEngine_Distributed_Stage_2_and_3`), backward tensor hooks are used instead, which are triggered after `param.grad` has been computed.
23 |
24 | #### 2. Torch-extending
25 | The torch-extending approach computes the private gradient directly by re-writing the model's back-propagation mechanism (see `PrivacyEngine_Distributed_extending`). See [this tutorial for extending torch modules](https://pytorch.org/docs/stable/notes/extending.html#extending-torch-nn).
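To make the mechanism concrete, here is a minimal sketch of the idea: a linear layer whose backward pass clips per-sample gradients inside a custom `autograd.Function`. This is an illustration only, assuming 2D inputs, Abadi-style clipping by the weight gradient's norm, and no noise injection; the library's actual implementation in `supported_differentially_private_layers.py` is more general.

```python
import torch
import torch.nn.functional as F

class _DPLinearFunction(torch.autograd.Function):
    """Toy linear layer whose backward returns the sum of clipped per-sample gradients."""

    @staticmethod
    def forward(ctx, x, weight, bias, max_grad_norm):
        ctx.save_for_backward(x, weight)
        ctx.max_grad_norm = max_grad_norm
        return F.linear(x, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        grad_input = grad_output @ weight                            # (B, in); propagated unclipped
        grad_w = torch.einsum('bo,bi->boi', grad_output, x)          # per-sample weight grads (B, out, in)
        norms = grad_w.flatten(start_dim=1).norm(2, dim=1)           # per-sample gradient norms (B,)
        scale = (ctx.max_grad_norm / (norms + 1e-6)).clamp(max=1.0)  # Abadi-style clipping factors
        grad_weight = torch.einsum('b,boi->oi', scale, grad_w)       # sum of clipped per-sample grads
        grad_bias = torch.einsum('b,bo->o', scale, grad_output)
        return grad_input, grad_weight, grad_bias, None

# usage sketch: y = _DPLinearFunction.apply(x, weight, bias, 1.0)
```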
This approach overrides the non-DP modules as shown in `supported_differentially_private_layers.py`. Because it modifies neither the optimizers nor the communication orchestration of distributed solutions, it is expected to be broadly applicable. However, some slowdown may be observed since the extension is not implemented at the C++ level.
26 |
27 | ### Supported Modules
28 |
29 | Our privacy engine supports the commonly used modules that constitute most models, with possibly two methods to compute the per-sample gradient norm:
30 | * nn.Linear (GhostClip & Grad Instantiation)
31 | * nn.LayerNorm (Grad Instantiation)
32 | * nn.GroupNorm (Grad Instantiation)
33 | * nn.InstanceNorm (Grad Instantiation)
34 | * nn.Embedding (GhostClip)
35 | * nn.Conv1d (GhostClip & Grad Instantiation)
36 | * nn.Conv2d (GhostClip & Grad Instantiation)
37 | * nn.Conv3d (GhostClip & Grad Instantiation)
38 |
39 | Frozen (e.g. `nn.Linear` with `requires_grad=False`) and non-trainable (e.g. `nn.ReLU`, `nn.Tanh`, `nn.MaxPool2d`) modules are also supported.
40 |
41 | Note that GhostClip stands for ghost clipping [1][2][3], which computes the gradient norms without creating or storing the gradients, while Grad Instantiation stands for per-sample gradient instantiation [5], which generates the per-sample gradients and then computes their norms. Grad Instantiation can be inefficient for large models, and GhostClip can be inefficient for high-dimensional data; therefore, for modules that support both methods, we allow the method to be chosen per layer (known as the hybrid algorithms of [3][4]).
42 |
43 | ### Arguments
44 | * `module`: The model to be optimized with differential privacy.
45 | * `batch_size`: Logical batch size that determines the convergence and accuracy.
46 | * `sample_size`: Number of training samples.
47 | * `target_epsilon`: Target privacy budget ε.
48 | * `target_delta`: Target privacy budget δ; should be smaller than 1/sample_size.
49 | * `max_grad_norm`: Per-sample gradient clipping threshold; defaults to 1. No need to tune if `clipping_fn="automatic"`.
50 | * `epochs`: Number of epochs. Not needed if `noise_multiplier` is provided.
51 | * `noise_multiplier`: Level of independent Gaussian noise injected into the gradient. This can be computed automatically under different `accounting_mode` settings if `target_epsilon, batch_size, sample_size, epochs` are provided.
52 | * `accounting_mode`: Privacy accounting theory to use, one of "rdp" (default), "glw", "all".
53 | * `named_params`: Specifies which parameters to optimize with differential privacy.
54 | * `clipping_mode`: Per-sample gradient clipping mode, one of 'ghost', 'MixGhostClip', 'MixOpt' (default) from [4]. Note that different clipping modes, including Opacus [5], GhostClip [2] and Mixed GhostClip [3], give the same convergence and accuracy, though at significantly different time/space complexity.
55 | * `clipping_fn`: Per-sample gradient clipping function to use; one of "automatic" (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), "Abadi" [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), "global" [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf).
56 | * `clipping_style`: Per-sample gradient clipping style to use; one of `all-layer` (flat clipping), `layer-wise` (each layer is a block, including both weight and bias parameters), `param-wise` (each parameter is a block), or a list of layer names (general block-wise clipping).
57 | * `origin_params`: Origin parameters for the ghost differentiation trick from [Bu et al. Appendix D.3](https://arxiv.org/pdf/2210.00038.pdf). Default is `None` (not using the trick). To enjoy the acceleration from the trick, set to each model's first trainable layer's parameters. For example, in text classification with RoBERTa, set `origin_params=["_embeddings"]`; in text generation with GPT2, set `origin_params=["wte","wpe"]`; in image classification with BEiT, set `origin_params=["patch_embed.proj.bias"]`. This trick gives about a 8/6 ≈ 1.33× speedup at no memory overhead.
58 |
59 | ### Usage
60 | Our privacy engine uses PyTorch [forward and backward hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html) to clip per-sample gradients and to add noise. To privately train models, attach the privacy engine to any optimizer from [torch.optim](https://pytorch.org/docs/stable/optim.html); during backward propagation the engine accumulates the sum of clipped per-sample gradients into `.grad`, and it additionally injects noise at `step`.
61 |
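Putting the arguments together, a minimal sketch of attaching the engine (the model, optimizer, and numbers below are illustrative, not prescribed by the library):

```python
import torch
from fastDP import PrivacyEngine

model = torch.nn.Linear(784, 10)  # toy model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

privacy_engine = PrivacyEngine(
    model,
    batch_size=1024,     # logical batch size
    sample_size=50000,   # number of training samples
    epochs=3,
    target_epsilon=8,
    clipping_fn='automatic',
    clipping_style='all-layer',
)
privacy_engine.attach(optimizer)
# From here on, loss.backward() accumulates clipped per-sample gradients
# and optimizer.step() adds the calibrated Gaussian noise.
```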
62 | To conduct DP bias-term fine-tuning (DP-BiTFiT [6]), simply freeze all non-bias terms:
63 | ```python
64 | [param.requires_grad_(False) for name, param in model.named_parameters() if '.bias' not in name]
65 | ```
66 | Note that for two-phase DP training (e.g. the appendix of [6], or DP continual training), one needs to detach the first engine and attach a new engine to a new optimizer.
67 |
68 | ### References
69 | [1] Goodfellow, Ian. "Efficient per-example gradient computations." arXiv preprint arXiv:1510.01799 (2015).
70 |
71 | [2] Li, Xuechen, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. "Large language models can be strong differentially private learners." arXiv preprint arXiv:2110.05679 (2021).
72 |
73 | [3] Bu, Zhiqi, Jialin Mao, and Shiyun Xu. "Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy." arXiv preprint arXiv:2205.10683 (2022).
74 |
75 | [4] Bu, Zhiqi, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Optimization on Large Model at Small Cost." arXiv preprint arXiv:2210.00038 (2022).
76 |
77 | [5] Yousefpour, Ashkan, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen et al. "Opacus: User-friendly differential privacy library in PyTorch." arXiv preprint arXiv:2109.12298 (2021).
78 |
79 | [6] Bu, Zhiqi, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Bias-Term only Fine-tuning of Foundation Models." arXiv preprint arXiv:2210.00036 (2022).
80 |
--------------------------------------------------------------------------------
/fastDP/__init__.py:
--------------------------------------------------------------------------------
1 | from .
import lora_utils 2 | from .privacy_engine import PrivacyEngine 3 | from .privacy_engine_dist_stage23 import PrivacyEngine_Distributed_Stage_2_and_3 4 | from .privacy_engine_dist_extending import PrivacyEngine_Distributed_extending 5 | from .supported_differentially_private_layers import * 6 | __version__ = '2.0.0' 7 | -------------------------------------------------------------------------------- /fastDP/accounting/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/fastDP/accounting/__init__.py -------------------------------------------------------------------------------- /fastDP/accounting/accounting_manager.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import math 3 | from typing import Dict, Optional, Union 4 | 5 | from . import rdp_accounting 6 | 7 | DEFAULT_ALPHAS = tuple(1 + x / 10.0 for x in range(1, 100)) + tuple(range(12, 64)) # RDP. 8 | 9 | 10 | class AccountingManager(abc.ABC): 11 | def _get_sigma_with_target_epsilon( 12 | self, 13 | target_epsilon, 14 | target_delta, 15 | sample_rate, 16 | steps, 17 | threshold, 18 | sigma_hi_init, 19 | sigma_lo_init, 20 | ): 21 | """Binary search σ given ε and δ.""" 22 | if sigma_lo_init > sigma_hi_init: 23 | raise ValueError("`sigma_lo` should be smaller than `sigma_hi`.") 24 | 25 | # Find an appropriate region for binary search. 26 | sigma_hi = sigma_hi_init 27 | sigma_lo = sigma_lo_init 28 | 29 | # Ensure sigma_hi isn't too small. 30 | while True: 31 | eps = self._compute_epsilon_from_sigma(sigma_hi, sample_rate, target_delta, steps) 32 | if eps < target_epsilon: 33 | break 34 | sigma_hi *= 2 35 | 36 | # Ensure sigma_lo isn't too large. 37 | while True: 38 | eps = self._compute_epsilon_from_sigma(sigma_lo, sample_rate, target_delta, steps) 39 | if eps > target_epsilon: 40 | break 41 | sigma_lo /= 2 42 | 43 | # Binary search. 44 | while sigma_hi - sigma_lo > threshold: 45 | sigma = (sigma_hi + sigma_lo) / 2 46 | eps = self._compute_epsilon_from_sigma(sigma, sample_rate, target_delta, steps) 47 | if eps < target_epsilon: 48 | sigma_hi = sigma 49 | else: 50 | sigma_lo = sigma 51 | 52 | # Conservative estimate. 
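        # (epsilon decreases as sigma grows, so returning the upper end of the
        # bracket guarantees the realized epsilon is at most `target_epsilon`,
        # at the cost of at most `threshold` extra noise.)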
53 | return sigma_hi 54 | 55 | @abc.abstractmethod 56 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict: 57 | """Override for reporting results.""" 58 | raise NotImplementedError 59 | 60 | @abc.abstractmethod 61 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps) -> float: 62 | """Override for binary sigma search.""" 63 | raise NotImplementedError 64 | 65 | def compute_sigma( 66 | self, 67 | target_epsilon: float, 68 | target_delta: float, 69 | sample_rate: float, 70 | epochs: Optional[Union[float, int]] = None, 71 | steps=None, 72 | threshold=1e-3, 73 | sigma_hi_init=4, 74 | sigma_lo_init=0.1, 75 | ) -> float: 76 | if steps is None: 77 | if epochs is None: 78 | raise ValueError("Epochs and steps cannot both be None.") 79 | steps = math.ceil(epochs / sample_rate) 80 | return self._get_sigma_with_target_epsilon( 81 | target_epsilon=target_epsilon, 82 | target_delta=target_delta, 83 | sample_rate=sample_rate, 84 | steps=steps, 85 | threshold=threshold, 86 | sigma_hi_init=sigma_hi_init, 87 | sigma_lo_init=sigma_lo_init, 88 | ) 89 | 90 | 91 | class RDPManager(AccountingManager): 92 | def __init__(self, alphas): 93 | super(RDPManager, self).__init__() 94 | self._alphas = alphas 95 | 96 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps): 97 | return self.compute_epsilon(sigma, sample_rate, target_delta, steps)["eps_rdp"] 98 | 99 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict: 100 | """Compute RDP as usual, but convert to (ε, δ)-DP based on the result by Canonne, Kamath, Steinke.""" 101 | rdp = rdp_accounting.compute_rdp(q=sample_rate, noise_multiplier=sigma, steps=steps, orders=self._alphas) 102 | eps, alpha = rdp_accounting.get_privacy_spent(orders=self._alphas, rdp=rdp, delta=target_delta) 103 | return dict(eps_rdp=eps, alpha_rdp=alpha) 104 | 105 | 106 | class GLWManager(AccountingManager): 107 | def __init__(self, eps_error=0.05): 108 | super(GLWManager, self).__init__() 109 | self._eps_error = eps_error 110 | 111 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps): 112 | return self.compute_epsilon(sigma, sample_rate, target_delta, steps)["eps_upper"] # Be conservative. 113 | 114 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict: 115 | if steps == 0: 116 | return dict(eps_low=None, eps_estimate=None, eps_upper=None) 117 | 118 | from prv_accountant import Accountant 119 | accountant = Accountant( 120 | noise_multiplier=sigma, 121 | sampling_probability=sample_rate, 122 | delta=target_delta, 123 | eps_error=self._eps_error, 124 | max_compositions=steps 125 | ) 126 | eps_low, eps_estimate, eps_upper = accountant.compute_epsilon(num_compositions=steps) 127 | return dict(eps_low=eps_low, eps_estimate=eps_estimate, eps_upper=eps_upper) 128 | -------------------------------------------------------------------------------- /fastDP/accounting/rdp_accounting.py: -------------------------------------------------------------------------------- 1 | r""" 2 | This file is adapted from the privacy accounting procedure in Opacus', which in turn is adapted from tf-privacy. 3 | Below is the original documentation in Opacus. 4 | 5 | *Based on Google's TF Privacy:* https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/analysis 6 | /rdp_accountant.py. 
7 | *Here, we update this code to Python 3, and optimize dependencies.* 8 | 9 | Functionality for computing Renyi Differential Privacy (RDP) of an additive 10 | Sampled Gaussian Mechanism (SGM). 11 | 12 | Example: 13 | Suppose that we have run an SGM applied to a function with L2-sensitivity of 1. 14 | 15 | Its parameters are given as a list of tuples 16 | ``[(q_1, sigma_1, steps_1), ..., (q_k, sigma_k, steps_k)],`` 17 | and we wish to compute epsilon for a given target delta. 18 | 19 | The example code would be: 20 | 21 | >>> max_order = 32 22 | >>> orders = range(2, max_order + 1) 23 | >>> rdp = np.zeros_like(orders, dtype=float) 24 | >>> for q, sigma, steps in parameters: 25 | >>> rdp += privacy_analysis.compute_rdp(q, sigma, steps, orders) 26 | >>> epsilon, opt_order = privacy_analysis.get_privacy_spent(orders, rdp, delta) 27 | """ 28 | 29 | import math 30 | from typing import List, Sequence, Union 31 | 32 | import numpy as np 33 | from scipy import special 34 | 35 | 36 | ######################## 37 | # LOG-SPACE ARITHMETIC # 38 | ######################## 39 | 40 | 41 | def _log_add(logx: float, logy: float) -> float: 42 | r"""Adds two numbers in the log space. 43 | 44 | Args: 45 | logx: First term in log space. 46 | logy: Second term in log space. 47 | 48 | Returns: 49 | Sum of numbers in log space. 50 | """ 51 | a, b = min(logx, logy), max(logx, logy) 52 | if a == -np.inf: # adding 0 53 | return b 54 | # Use exp(a) + exp(b) = (exp(a - b) + 1) * exp(b) 55 | return math.log1p(math.exp(a - b)) + b # log1p(x) = log(x + 1) 56 | 57 | 58 | def _log_sub(logx: float, logy: float) -> float: 59 | r"""Subtracts two numbers in the log space. 60 | 61 | Args: 62 | logx: First term in log space. Expected to be greater than the second term. 63 | logy: First term in log space. Expected to be less than the first term. 64 | 65 | Returns: 66 | Difference of numbers in log space. 67 | 68 | Raises: 69 | ValueError 70 | If the result is negative. 71 | """ 72 | if logx < logy: 73 | raise ValueError("The result of subtraction must be non-negative.") 74 | if logy == -np.inf: # subtracting 0 75 | return logx 76 | if logx == logy: 77 | return -np.inf # 0 is represented as -np.inf in the log space. 78 | 79 | try: 80 | # Use exp(x) - exp(y) = (exp(x - y) - 1) * exp(y). 81 | return math.log(math.expm1(logx - logy)) + logy # expm1(x) = exp(x) - 1 82 | except OverflowError: 83 | return logx 84 | 85 | 86 | def _compute_log_a_for_int_alpha(q: float, sigma: float, alpha: int) -> float: 87 | r"""Computes :math:`log(A_\alpha)` for integer ``alpha``. 88 | 89 | Notes: 90 | Note that 91 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``, 92 | and that 0 < ``q`` < 1. 93 | 94 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf for details. 95 | 96 | Args: 97 | q: Sampling rate of SGM. 98 | sigma: The standard deviation of the additive Gaussian noise. 99 | alpha: The order at which RDP is computed. 100 | 101 | Returns: 102 | :math:`log(A_\alpha)` as defined in Section 3.3 of 103 | https://arxiv.org/pdf/1908.10530.pdf. 104 | """ 105 | 106 | # Initialize with 0 in the log space. 
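    # A_alpha = sum_{i=0}^{alpha} binom(alpha, i) * q^i * (1-q)^(alpha-i) * exp((i^2 - i) / (2 sigma^2));
    # each term is accumulated with _log_add in log space for numerical stability.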
107 | log_a = -np.inf 108 | 109 | for i in range(alpha + 1): 110 | log_coef_i = ( 111 | math.log(special.binom(alpha, i)) 112 | + i * math.log(q) 113 | + (alpha - i) * math.log(1 - q) 114 | ) 115 | 116 | s = log_coef_i + (i * i - i) / (2 * (sigma ** 2)) 117 | log_a = _log_add(log_a, s) 118 | 119 | return float(log_a) 120 | 121 | 122 | def _compute_log_a_for_frac_alpha(q: float, sigma: float, alpha: float) -> float: 123 | r"""Computes :math:`log(A_\alpha)` for fractional ``alpha``. 124 | 125 | Notes: 126 | Note that 127 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``, 128 | and that 0 < ``q`` < 1. 129 | 130 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf for details. 131 | 132 | Args: 133 | q: Sampling rate of SGM. 134 | sigma: The standard deviation of the additive Gaussian noise. 135 | alpha: The order at which RDP is computed. 136 | 137 | Returns: 138 | :math:`log(A_\alpha)` as defined in Section 3.3 of 139 | https://arxiv.org/pdf/1908.10530.pdf. 140 | """ 141 | # The two parts of A_alpha, integrals over (-inf,z0] and [z0, +inf), are 142 | # initialized to 0 in the log space: 143 | log_a0, log_a1 = -np.inf, -np.inf 144 | i = 0 145 | 146 | z0 = sigma ** 2 * math.log(1 / q - 1) + 0.5 147 | 148 | while True: # do ... until loop 149 | coef = special.binom(alpha, i) 150 | log_coef = math.log(abs(coef)) 151 | j = alpha - i 152 | 153 | log_t0 = log_coef + i * math.log(q) + j * math.log(1 - q) 154 | log_t1 = log_coef + j * math.log(q) + i * math.log(1 - q) 155 | 156 | log_e0 = math.log(0.5) + _log_erfc((i - z0) / (math.sqrt(2) * sigma)) 157 | log_e1 = math.log(0.5) + _log_erfc((z0 - j) / (math.sqrt(2) * sigma)) 158 | 159 | log_s0 = log_t0 + (i * i - i) / (2 * (sigma ** 2)) + log_e0 160 | log_s1 = log_t1 + (j * j - j) / (2 * (sigma ** 2)) + log_e1 161 | 162 | if coef > 0: 163 | log_a0 = _log_add(log_a0, log_s0) 164 | log_a1 = _log_add(log_a1, log_s1) 165 | else: 166 | log_a0 = _log_sub(log_a0, log_s0) 167 | log_a1 = _log_sub(log_a1, log_s1) 168 | 169 | i += 1 170 | if max(log_s0, log_s1) < -30: 171 | break 172 | 173 | return _log_add(log_a0, log_a1) 174 | 175 | 176 | def _compute_log_a(q: float, sigma: float, alpha: float) -> float: 177 | r"""Computes :math:`log(A_\alpha)` for any positive finite ``alpha``. 178 | 179 | Notes: 180 | Note that 181 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``, 182 | and that 0 < ``q`` < 1. 183 | 184 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf 185 | for details. 186 | 187 | Args: 188 | q: Sampling rate of SGM. 189 | sigma: The standard deviation of the additive Gaussian noise. 190 | alpha: The order at which RDP is computed. 191 | 192 | Returns: 193 | :math:`log(A_\alpha)` as defined in the paper mentioned above. 194 | """ 195 | if float(alpha).is_integer(): 196 | return _compute_log_a_for_int_alpha(q, sigma, int(alpha)) 197 | else: 198 | return _compute_log_a_for_frac_alpha(q, sigma, alpha) 199 | 200 | 201 | def _log_erfc(x: float) -> float: 202 | r"""Computes :math:`log(erfc(x))` with high accuracy for large ``x``. 203 | 204 | Helper function used in computation of :math:`log(A_\alpha)` 205 | for a fractional alpha. 206 | 207 | Args: 208 | x: The input to the function 209 | 210 | Returns: 211 | :math:`log(erfc(x))` 212 | """ 213 | return math.log(2) + special.log_ndtr(-x * 2 ** 0.5) 214 | 215 | 216 | def _compute_rdp(q: float, sigma: float, alpha: float) -> float: 217 | r"""Computes RDP of the Sampled Gaussian Mechanism at order ``alpha``. 218 | 219 | Args: 220 | q: Sampling rate of SGM. 
221 | sigma: The standard deviation of the additive Gaussian noise. 222 | alpha: The order at which RDP is computed. 223 | 224 | Returns: 225 | RDP at order ``alpha``; can be np.inf. 226 | """ 227 | if q == 0: 228 | return 0 229 | 230 | # no privacy 231 | if sigma == 0: 232 | return np.inf 233 | 234 | if q == 1.0: 235 | return alpha / (2 * sigma ** 2) 236 | 237 | if np.isinf(alpha): 238 | return np.inf 239 | 240 | return _compute_log_a(q, sigma, alpha) / (alpha - 1) 241 | 242 | 243 | def compute_rdp( 244 | q: float, noise_multiplier: float, steps: int, orders: Union[Sequence[float], float] 245 | ) -> Union[List[float], float]: 246 | r"""Computes Renyi Differential Privacy (RDP) guarantees of the 247 | Sampled Gaussian Mechanism (SGM) iterated ``steps`` times. 248 | 249 | Args: 250 | q: Sampling rate of SGM. 251 | noise_multiplier: The ratio of the standard deviation of the 252 | additive Gaussian noise to the L2-sensitivity of the function 253 | to which it is added. Note that this is same as the standard 254 | deviation of the additive Gaussian noise when the L2-sensitivity 255 | of the function is 1. 256 | steps: The number of iterations of the mechanism. 257 | orders: An array (or a scalar) of RDP orders. 258 | 259 | Returns: 260 | The RDP guarantees at all orders; can be ``np.inf``. 261 | """ 262 | if isinstance(orders, float): 263 | rdp = _compute_rdp(q, noise_multiplier, orders) 264 | else: 265 | rdp = np.array([_compute_rdp(q, noise_multiplier, order) for order in orders]) 266 | 267 | return rdp * steps 268 | 269 | 270 | # Based on 271 | # https://github.com/tensorflow/privacy/blob/5f07198b66b3617b22609db983926e3ba97cd905/tensorflow_privacy/privacy/analysis/rdp_accountant.py#L237 272 | def get_privacy_spent(orders, rdp, delta): 273 | """Compute epsilon given a list of RDP values and target delta. 274 | Args: 275 | orders: An array (or a scalar) of orders. 276 | rdp: A list (or a scalar) of RDP guarantees. 277 | delta: The target delta. 278 | Returns: 279 | Pair of (eps, optimal_order). 280 | Raises: 281 | ValueError: If input is malformed. 282 | """ 283 | orders_vec = np.atleast_1d(orders) 284 | rdp_vec = np.atleast_1d(rdp) 285 | 286 | if delta <= 0: 287 | raise ValueError("Privacy failure probability bound delta must be >0.") 288 | if len(orders_vec) != len(rdp_vec): 289 | raise ValueError("Input lists must have the same length.") 290 | 291 | # Basic bound (see https://arxiv.org/abs/1702.07476 Proposition 3 in v3): 292 | # eps = min( rdp_vec - math.log(delta) / (orders_vec - 1) ) 293 | 294 | # Improved bound from https://arxiv.org/abs/2004.00010 Proposition 12 (in v4). 295 | # Also appears in https://arxiv.org/abs/2001.05990 Equation 20 (in v1). 296 | eps_vec = [] 297 | for (a, r) in zip(orders_vec, rdp_vec): 298 | if a < 1: 299 | raise ValueError("Renyi divergence order must be >=1.") 300 | if r < 0: 301 | raise ValueError("Renyi divergence must be >=0.") 302 | 303 | if delta ** 2 + math.expm1(-r) >= 0: 304 | # In this case, we can simply bound via KL divergence: 305 | # delta <= sqrt(1-exp(-KL)). 306 | eps = 0 # No need to try further computation if we have eps = 0. 307 | elif a > 1.01: 308 | # This bound is not numerically stable as alpha->1. 309 | # Thus we have a min value of alpha. 310 | # The bound is also not useful for small alpha, so doesn't matter. 311 | eps = r + math.log1p(-1 / a) - math.log(delta * a) / (a - 1) 312 | else: 313 | # In this case we can't do anything. E.g., asking for delta = 0. 
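            # (The KL shortcut above did not apply, and the conversion formula is
            # unstable for alpha <= 1.01, so no finite epsilon is certified at this
            # order; np.inf simply drops out of the min over orders below.)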
314 |                 eps = np.inf
315 |         eps_vec.append(eps)
316 |
317 |     idx_opt = np.argmin(eps_vec)
318 |     return max(0, eps_vec[idx_opt]), orders_vec[idx_opt]
319 |
--------------------------------------------------------------------------------
/fastDP/autograd_grad_sample_dist.py:
--------------------------------------------------------------------------------
1 | """
2 | A large portion of this code is adapted from Opacus v0.15 (https://github.com/pytorch/opacus)
3 | and from Private-transformers v0.2.3 (https://github.com/lxuechen/private-transformers)
4 | which are licensed under Apache License 2.0.
5 |
6 | We have modified it considerably to support book-keeping and BiTFiT.
7 | """
8 |
9 | from typing import Tuple
10 |
11 | import torch
12 | import torch.nn as nn
13 |
14 | from .supported_layers_grad_samplers import _supported_layers_norm_sample_AND_clipping, _create_or_extend_private_grad
15 |
16 | def requires_grad(module: nn.Module) -> bool:
17 |     """
18 |     Checks if any parameters in a specified module require gradients.
19 |
20 |     Args:
21 |         module: PyTorch module whose parameters are examined
22 |
23 |     Returns:
24 |         Flag indicating whether any parameters require gradients
25 |     """
26 |     return any(p.requires_grad for p in module.parameters() if hasattr(p, 'requires_grad'))
27 |
28 |
29 | def add_hooks(model: nn.Module, loss_reduction='mean', clipping_mode='MixOpt', bias_only=False,
30 |               clipping_style='all-layer', block_heads=None, named_params=None, named_layers=None,
31 |               clipping_fn=None, numerical_stability_constant=None, max_grad_norm_layerwise=None):
32 |     r"""
33 |     Adds hooks to model to save activations (to layers) and backprop (to params) values.
34 |
35 |     The hooks will
36 |
37 |     1. save activations into ``layer.activations`` (NOT param.activations) during forward pass.
38 |     Note: BiTFiT is special in that, if a layer only requires the bias gradient, no forward hook is needed.
39 |
40 |     2. compute per-sample grad norm or grad and save in ``param.norm_sample`` or ``param.grad_sample`` during backward pass.
41 |
42 |     Args:
43 |         model: Model to which hooks are added.
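        loss_reduction: Either 'mean' or 'sum'; must match the reduction of the training loss.
        clipping_mode: Per-sample clipping mode, e.g. 'MixOpt' (default), 'ghost', or 'MixGhostClip'.
        clipping_style: One of 'all-layer', 'layer-wise', 'param-wise', or a list of block heads.
        clipping_fn: One of 'automatic', 'Abadi', or 'global'.
        max_grad_norm_layerwise: Clipping threshold applied to each block.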
44 | """ 45 | if hasattr(model, "autograd_grad_sample_hooks"): 46 | raise ValueError("Trying to add hooks twice to the same model") 47 | 48 | handles = [] 49 | 50 | for name, layer in model.named_modules(): 51 | if type(layer) in _supported_layers_norm_sample_AND_clipping and requires_grad(layer): 52 | if hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad: 53 | #print('Attaching forward hook on', name) 54 | handles.append(layer.register_forward_hook(_capture_activations)) 55 | 56 | def this_backward(this_layer, grad_input, grad_output): 57 | _prepare_sample_grad_or_norm(this_layer, grad_output, loss_reduction, clipping_mode,bias_only) 58 | _per_block_clip_grad(this_layer, named_params, named_layers, clipping_style, clipping_fn, numerical_stability_constant, max_grad_norm_layerwise) 59 | 60 | # Starting with 1.8.0, can use `register_full_backward_hook`, but slower 61 | handles.append(layer.register_backward_hook(this_backward)) 62 | 63 | model.__dict__.setdefault("autograd_grad_sample_hooks", []).extend(handles) 64 | 65 | 66 | def remove_hooks(model: nn.Module): 67 | """Removes hooks added by `add_hooks()`.""" 68 | for handle in model.autograd_grad_sample_hooks: 69 | handle.remove() 70 | del model.autograd_grad_sample_hooks 71 | 72 | 73 | def _capture_activations(layer: nn.Module, inputs: Tuple, outputs: Tuple): 74 | """Forward hook handler captures AND saves activations.""" 75 | layer.activations=inputs[0].detach() 76 | 77 | def _prepare_sample_grad_or_norm( 78 | layer: nn.Module, 79 | grad_output: Tuple[torch.Tensor], 80 | loss_reduction='mean', 81 | clipping_mode='MixOpt', 82 | bias_only=False, 83 | ): 84 | """Backward hook handler captures AND saves grad_outputs (book-keeping).""" 85 | backprops = grad_output[0].detach() 86 | 87 | """Computes per-sample grad norm or grad for individual layers.""" 88 | if not hasattr(layer,'activations'): 89 | layer.activations=None 90 | if loss_reduction=='mean': 91 | backprops = backprops * backprops.shape[0] # .backprops should save dL_i/ds, not 1/B*dL_i/ds, the mean reduction is taken care of in privacy engine .step() 92 | compute_layer_grad_sample, _ = _supported_layers_norm_sample_AND_clipping.get(type(layer)) 93 | 94 | compute_layer_grad_sample(layer, layer.activations, backprops, clipping_mode) 95 | 96 | layer.backprops=backprops 97 | 98 | 99 | def _per_block_clip_grad( 100 | layer: nn.Module, named_params, named_layers, clipping_style, clipping_fn, 101 | numerical_stability_constant,max_grad_norm_layerwise 102 | ): 103 | 104 | if clipping_style=='layer-wise': 105 | if hasattr(layer,'weight') and hasattr(layer.weight,'norm_sample'): 106 | norm_sample = layer.weight.norm_sample 107 | if hasattr(layer,'bias') and hasattr(layer.bias,'norm_sample'): 108 | norm_sample = torch.stack([layer.weight.norm_sample,layer.bias.norm_sample], dim=0).norm(2, dim=0); 109 | else: 110 | norm_sample = layer.bias.norm_sample 111 | #norm_sample = torch.stack([param.norm_sample for param in layer.parameters() if hasattr(param,'norm_sample')], dim=0).norm(2, dim=0); 112 | 113 | # compute per-sample grad norm and clipping factor 114 | if clipping_fn=='automatic': 115 | C = max_grad_norm_layerwise / (norm_sample + numerical_stability_constant)#torch.ones_like(norm_sample,dtype=layer.weight.dtype)#change to non-DP C=1 works under mixed precision 116 | elif clipping_fn=='Abadi': 117 | C = torch.clamp_max(max_grad_norm_layerwise / (norm_sample + numerical_stability_constant), 1.) 
118 | elif clipping_fn=='global': 119 | C = (norm_sample<=max_grad_norm_layerwise).float() 120 | else: 121 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.") 122 | 123 | if hasattr(layer,'weight') and hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad and hasattr(layer,'activations') and hasattr(layer.weight,'norm_sample'): 124 | #--- weight, compute clipped gradient 125 | _, compute_layer_grad = _supported_layers_norm_sample_AND_clipping.get(type(layer)) 126 | common_type=torch.promote_types(layer.activations.dtype,layer.backprops.dtype) 127 | grad_weight = compute_layer_grad(layer, layer.activations.to(common_type), torch.einsum('b...,b->b...',layer.backprops.to(common_type),C), C) 128 | del layer.activations, layer.backprops 129 | _create_or_extend_private_grad(layer.weight, grad_weight, accumulate_private_grad = False) 130 | 131 | if hasattr(layer,'bias') and hasattr(layer.bias,'requires_grad') and layer.bias.requires_grad and hasattr(layer.bias,'grad_sample') and hasattr(layer.bias,'norm_sample'): 132 | #--- bias, compute clipped gradient 133 | grad_bias = torch.einsum("b...,b->...", layer.bias.grad_sample, C)#(layer.bias.grad_sample*C.unsqueeze(1)).sum(dim=0)# 134 | del layer.bias.grad_sample 135 | _create_or_extend_private_grad(layer.bias, grad_bias, accumulate_private_grad = False) 136 | 137 | elif clipping_style=='param-wise': 138 | if hasattr(layer,'weight') and hasattr(layer.weight,'norm_sample'): 139 | if clipping_fn=='automatic': 140 | C_weight = max_grad_norm_layerwise / (layer.weight.norm_sample + numerical_stability_constant) 141 | elif clipping_fn=='Abadi': 142 | C_weight = torch.clamp_max(max_grad_norm_layerwise / (layer.weight.norm_sample + numerical_stability_constant), 1.) 143 | elif clipping_fn=='global': 144 | C_weight = (layer.weight.norm_sample<=max_grad_norm_layerwise).float() 145 | else: 146 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.") 147 | 148 | if hasattr(layer,'bias') and hasattr(layer.bias,'norm_sample'): 149 | if clipping_fn=='automatic': 150 | C_bias = max_grad_norm_layerwise / (layer.bias.norm_sample + numerical_stability_constant) 151 | elif clipping_fn=='Abadi': 152 | C_bias = torch.clamp_max(max_grad_norm_layerwise / (layer.bias.norm_sample + numerical_stability_constant), 1.) 153 | elif clipping_fn=='global': 154 | C_bias = (layer.bias.norm_sample<=max_grad_norm_layerwise).float() 155 | else: 156 | raise ValueError(f"Unknown clipping function {clipping_fn}. 
Expected one of Abadi, automatic, global.") 157 | 158 | 159 | if hasattr(layer,'weight') and hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad and hasattr(layer,'activations') and hasattr(layer.weight,'norm_sample'): 160 | _, compute_layer_grad = _supported_layers_norm_sample_AND_clipping.get(type(layer)) 161 | grad_weight = compute_layer_grad(layer, layer.activations, torch.einsum('b...,b->b...',layer.backprops,C_weight), C_weight) 162 | del layer.activations, layer.backprops 163 | 164 | _create_or_extend_private_grad(layer.weight, grad_weight, accumulate_private_grad = False) 165 | 166 | 167 | #--- bias, compute clipped gradient 168 | if hasattr(layer,'bias') and hasattr(layer.bias,'requires_grad') and layer.bias.requires_grad and hasattr(layer.bias,'grad_sample') and hasattr(layer.bias,'norm_sample'): 169 | grad_bias = torch.einsum("b...,b->...", layer.bias.grad_sample, C_bias) 170 | del layer.bias.grad_sample 171 | _create_or_extend_private_grad(layer.bias, grad_bias, accumulate_private_grad = False) 172 | else: 173 | raise ValueError(f"Unknown clipping style {clipping_style}. Expected one of 'layer-wise','param-wise'.") 174 | 175 | 176 | for param in layer.parameters(): 177 | if hasattr(param,'norm_sample'): 178 | del param.norm_sample 179 | -------------------------------------------------------------------------------- /fastDP/lora_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | LoRA layers. 3 | 4 | This version does not have merged weights for zero latency inference. It makes the code easier to read and maintain. 5 | Adapted from 6 | https://github.com/microsoft/LoRA 7 | https://www.microsoft.com/en-us/research/project/dp-transformers/ 8 | """ 9 | 10 | import torch 11 | import transformers 12 | from torch import nn 13 | 14 | 15 | class DPMergedLinear(nn.Module): 16 | def __init__( 17 | self, 18 | in_features: int, 19 | out_features: int, 20 | lora_r=0, 21 | lora_alpha=1., 22 | lora_dropout=0., 23 | ): 24 | super(DPMergedLinear, self).__init__() 25 | self.linear = nn.Linear(in_features=in_features, out_features=out_features) 26 | self.lora_r = lora_r 27 | self.lora_alpha = lora_alpha 28 | self.lora_dropout = nn.Dropout(p=lora_dropout) 29 | if self.lora_r > 0: 30 | self.lora_A = nn.Linear(in_features=in_features, out_features=lora_r, bias=False) 31 | self.lora_B = nn.Linear(in_features=lora_r, out_features=out_features, bias=False) 32 | self.scaling = self.lora_alpha / lora_r 33 | self.reset_parameters() 34 | 35 | def forward(self, x: torch.Tensor): 36 | result = self.linear(x) 37 | if self.lora_r > 0: 38 | after_dropout = self.lora_dropout(x) 39 | after_A = self.lora_A(after_dropout) 40 | after_B = self.lora_B(after_A) 41 | result += after_B * self.scaling 42 | return result 43 | 44 | def reset_parameters(self): 45 | self.linear.reset_parameters() 46 | if self.lora_r > 0: 47 | self.lora_A.reset_parameters() 48 | self.lora_B.weight.data.zero_() 49 | 50 | @staticmethod 51 | def from_transformers_conv1d( 52 | original_layer, 53 | lora_r=0, 54 | lora_alpha=1., 55 | lora_dropout=0., 56 | ) -> "DPMergedLinear": 57 | lora_layer = DPMergedLinear( 58 | in_features=original_layer.weight.shape[0], 59 | out_features=original_layer.weight.shape[1], 60 | lora_r=lora_r, 61 | lora_alpha=lora_alpha, 62 | lora_dropout=lora_dropout, 63 | ).to(original_layer.weight.device) 64 | lora_layer.linear.weight.data.copy_(original_layer.weight.T.data) 65 | lora_layer.linear.bias.data.copy_(original_layer.bias.data) 66 | return lora_layer 
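# Note: transformers' Conv1D (used in GPT-2 attention) stores its weight as
# (in_features, out_features), the transpose of nn.Linear's (out_features, in_features);
# hence the `original_layer.weight.T` copy in `from_transformers_conv1d` above.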
67 |
68 |
69 | def convert_gpt2_attention_to_lora(
70 |     model: transformers.GPT2PreTrainedModel,
71 |     lora_r=0,
72 |     lora_alpha=1.,
73 |     lora_dropout=0.,
74 | ) -> transformers.GPT2PreTrainedModel:
75 |     if not isinstance(model, transformers.GPT2PreTrainedModel):
76 |         raise TypeError("Requires a GPT2 model")
77 |
78 |     if not hasattr(model, "h") and hasattr(model, "transformer"):
79 |         transformer = model.transformer
80 |     else:
81 |         transformer = model
82 |
83 |     for h_i in transformer.h:
84 |         new_layer = DPMergedLinear.from_transformers_conv1d(
85 |             original_layer=h_i.attn.c_attn,
86 |             lora_r=lora_r,
87 |             lora_alpha=lora_alpha,
88 |             lora_dropout=lora_dropout,
89 |         )
90 |         h_i.attn.c_attn = new_layer
91 |
92 |     return model
93 |
94 |
95 | def mark_only_lora_as_trainable(model: torch.nn.Module) -> None:
96 |     model.requires_grad_(True)
97 |     for n, p in model.named_parameters():
98 |         if 'lora_' not in n:
99 |             p.requires_grad = False
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch~=1.11.0+cu113
2 | prv-accountant
3 | transformers>=4.20.1
4 | numpy
5 | scipy
6 | jupyterlab
7 | jupyter
8 | opacus>=1.0
9 | ml-swissknife
10 | opt_einsum
11 | pytest
12 | pydantic==1.10
13 | tqdm>=4.62.1
14 | deepspeed~=0.8.3
15 | fairscale==0.4
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 | import setuptools
5 |
6 | # For simplicity, we store the version in the __version__ attribute of the source.
7 | here = os.path.realpath(os.path.dirname(__file__))
8 | print(here)
9 | with open(os.path.join(here, 'fastDP', '__init__.py')) as f:
10 |     meta_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", f.read(), re.M)
11 | if meta_match:
12 |     version = meta_match.group(1)
13 | else:
14 |     raise RuntimeError("Unable to find __version__ string.")
15 |
16 | with open(os.path.join(here, 'README.md')) as f:
17 |     readme = f.read()
18 |
19 | setuptools.setup(
20 |     name="fastDP",
21 |     version=version,
22 |     author="Zhiqi Bu",
23 |     author_email="woodyx218@gmail.com",
24 |     description="Optimally efficient implementation of differentially private optimization (with per-sample gradient clipping).",
25 |     long_description=readme,
26 |     url="",
27 |     packages=setuptools.find_packages(exclude=['examples', 'tests']),
28 |     python_requires='~=3.8',
29 |     classifiers=[
30 |         "Programming Language :: Python :: 3",
31 |         "License :: OSI Approved :: Apache Software License",
32 |     ],
33 | )
--------------------------------------------------------------------------------