├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── NOTICE.md
├── README.md
├── THIRD-PARTY-NOTICES.txt
├── assets
│   ├── efficiency.png
│   ├── nlp.png
│   ├── scalability.png
│   └── vision.png
├── examples
│   ├── __init__.py
│   ├── image_classification
│   │   ├── CIFAR_TIMM.py
│   │   ├── CV_TIMM.py
│   │   ├── CelebA_TIMM.py
│   │   ├── README.md
│   │   ├── ZERO_examples
│   │   │   ├── CIFAR_TIMM_FSDP_extending.py
│   │   │   ├── CIFAR_TIMM_ZERO1.py
│   │   │   ├── CIFAR_TIMM_ZERO23.py
│   │   │   ├── CIFAR_TIMM_ZERO_extending.py
│   │   │   └── cifar_config.json
│   │   └── __init__.py
│   ├── requirements.txt
│   ├── table2text
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── compiled_args.py
│   │   ├── data_utils
│   │   │   ├── __init__.py
│   │   │   ├── data_collator.py
│   │   │   └── language_modeling.py
│   │   ├── decoding_utils.py
│   │   ├── gpt_config_stage123.json
│   │   ├── misc.py
│   │   ├── models.py
│   │   ├── run.sh
│   │   ├── run_ZERO1.sh
│   │   ├── run_ZERO23.sh
│   │   ├── run_ZERO_extending.py
│   │   ├── run_language_modeling.py
│   │   ├── run_language_modeling_ZERO23.py
│   │   ├── run_language_modeling_extending.py
│   │   └── trainer.py
│   └── text_classification
│       ├── README.md
│       ├── __init__.py
│       ├── data
│       │   ├── download_dataset.sh
│       │   ├── make_k_shot_without_dev.py
│       │   └── make_valid_data.py
│       ├── run_classification.py
│       ├── run_wrapper.py
│       └── src
│           ├── __init__.py
│           ├── common.py
│           ├── compiled_args.py
│           ├── dataset.py
│           ├── label_search.py
│           ├── models.py
│           ├── processors.py
│           └── trainer.py
├── fastDP
│   ├── README.md
│   ├── __init__.py
│   ├── accounting
│   │   ├── __init__.py
│   │   ├── accounting_manager.py
│   │   └── rdp_accounting.py
│   ├── autograd_grad_sample.py
│   ├── autograd_grad_sample_dist.py
│   ├── lora_utils.py
│   ├── privacy_engine.py
│   ├── privacy_engine_dist_extending.py
│   ├── privacy_engine_dist_stage23.py
│   ├── supported_differentially_private_layers.py
│   ├── supported_layers_grad_samplers.py
│   └── transformers_support.py
├── requirements.txt
└── setup.py
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), any 'help wanted' issues are a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/NOTICE.md:
--------------------------------------------------------------------------------
1 | Fast Differential Privacy
2 |
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Fast Differential Privacy
2 |
3 | *Fast Differential Privacy* (**fastDP**) is a library that allows differentially private optimization of PyTorch models with a few additional lines of code. The goal of this library is to make DP deep learning as similar to standard non-private learning as possible, in terms of **speed, memory cost, scalability, accuracy and hyperparameter-tuning**. It supports all PyTorch optimizers, popular models in [TIMM](https://github.com/rwightman/pytorch-image-models), [torchvision](https://github.com/pytorch/vision), [HuggingFace](https://huggingface.co/transformers/) (up to supported modules), multiple privacy accountants, multiple clipping functions/styles, most parameter-efficient training methods, and distributed solutions such as DeepSpeed and FSDP. The library has provably little overhead in terms of training time and memory cost, compared with standard non-private optimization.
4 |
5 |
6 | ---
7 | ## Installation
8 | To install the library after cloning the repository, run
9 | ```bash
10 | python -m setup develop
11 | ```
12 |
13 | > :warning: **NOTE**: We strongly recommend Python>=3.8 and torch<=1.11 (it is a known issue that torch 1.12 can slow training down by as much as 3 times).
14 |
15 | ## Getting started
16 | To train a model with differential privacy, simply create a `PrivacyEngine` and continue the standard training pipeline:
17 |
18 | ```python
19 | from fastDP import PrivacyEngine
20 | optimizer = SGD(model.parameters(), lr=0.05)
21 | privacy_engine = PrivacyEngine(
22 | model,
23 | batch_size=256,
24 | sample_size=50000,
25 | epochs=3,
26 | target_epsilon=2,
27 | clipping_fn='automatic',
28 | clipping_mode='MixOpt',
29 | origin_params=None,
30 | clipping_style='all-layer',
31 | )
32 | # attaching to optimizers is not needed for multi-GPU distributed learning
33 | privacy_engine.attach(optimizer)
34 |
35 | #----- standard training pipeline
36 | loss = F.cross_entropy(model(batch), labels)
37 | loss.backward()
38 | optimizer.step()
39 | optimizer.zero_grad()
40 | ```
41 |
42 | We provide details about our privacy engine in `fastDP/README.md`, including the supported modules and arguments. By default, we use the `'MixOpt'` (hybrid book-keeping [4]) clipping mode, which enjoys almost the same time complexity as non-private optimization, and the automatic clipping function [8], which does not require tuning the clipping threshold `max_grad_norm`. We support the RDP and GLW privacy accountants; additional accountants can be used through the argument `noise_multiplier`, computed for example with the [autodp](https://github.com/yuxiangw/autodp) library (Automating differential privacy computation).
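
For instance, a minimal sketch of calibrating the noise externally and passing it in directly (here using Opacus' `get_noise_multiplier`, as the bundled examples do; `model` and `optimizer` are assumed to exist as in the snippet above):

```python
from opacus.accountants.utils import get_noise_multiplier
from fastDP import PrivacyEngine

# Calibrate sigma with an external accountant (any accountant, e.g. from
# autodp, could produce this value instead).
sigma = get_noise_multiplier(
    target_epsilon=2,
    target_delta=1e-5,
    sample_rate=256 / 50000,
    epochs=3,
)
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50000,
    noise_multiplier=sigma,  # supplied directly, instead of target_epsilon
    epochs=3,
)
privacy_engine.attach(optimizer)
```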
43 |
44 | Specifically, we support gradient accumulation to enable very large logical batch sizes, which is beneficial to DP optimization:
45 | ```python
46 | for i, batch in enumerate(dataloader):
47 |     loss = F.cross_entropy(model(batch), labels)
48 |     loss.backward()
49 |     if (i + 1) % gradient_accumulation_steps == 0:
50 |         optimizer.step()
51 |         optimizer.zero_grad()
52 | ```
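
Note that the `batch_size` passed to `PrivacyEngine` should be the logical batch size, i.e. the dataloader's per-iteration batch size times the number of accumulation steps, matching the pattern in the bundled examples:

```python
# As in examples/image_classification/CIFAR_TIMM.py, where
# n_acc_steps = args.bs // args.mini_bs and batch_size=args.bs is the value
# given to PrivacyEngine. The values below are illustrative.
logical_bs, mini_bs = 1000, 50
gradient_accumulation_steps = logical_bs // mini_bs  # = 20
```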
53 |
54 | ## Foundation model release
55 | We release DP vision foundation models in [v2.1](https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1): VisionTransformer models (ViT; ~86M param) following [Pre-training Differentially Private Models with Limited Public Data](https://arxiv.org/abs/2402.18752) in NeurIPS 2024. These models have [epsilon=2](https://github.com/awslabs/fast-differential-privacy/releases/download/v2.1/ViT_base_imgnet11k_DP_eps2.pt) and [epsilon=8](https://github.com/awslabs/fast-differential-privacy/releases/download/v2.1/ViT_base_imgnet11k_DP_eps8.pt), pre-trained on ImageNet-1k with AdamW (1k classes, 1 million images) and ImageNet-11k with DP-AdamW (11k classes, 11 million images). More DP foundation models to come!
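
A minimal loading sketch (assumptions: the released `.pt` file is a plain PyTorch state dict for TIMM's ViT-Base, and the classifier head is re-initialized for the downstream task):

```python
import timm
import torch

# Hypothetical loading sketch -- the exact checkpoint layout is not documented here.
model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=10)
state_dict = torch.load('ViT_base_imgnet11k_DP_eps2.pt', map_location='cpu')
# strict=False tolerates the mismatched classification head when fine-tuning.
model.load_state_dict(state_dict, strict=False)
```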
56 |
57 | ## Highlights
58 | 1. This library enables large model training in the **multi-GPU distributed setting** and **supports mixed precision training** under DeepSpeed and FSDP.
59 |
60 |
61 |
62 | The scalability has been tested on 100B models with 512 GPUs.
63 |
64 |
65 |
66 |
67 | 2. This library enables DP training to have almost **the same time and space complexity** as the standard non-private training. This is achieved by three key techniques as described in [4]: mixed ghost norm, book-keeping, and ghost differentiation. In practice, we observe <20% memory overhead and <25% slowdown across different tasks.
68 |
69 |
70 |
71 |
72 |
73 | 3. Specifically, this library overcomes the severe memory issues of large models (commonly encountered by Opacus, which computes the per-sample gradients) and of high-dimensional data (commonly encountered by ghost clipping, e.g. in Private Transformers), by leveraging the mixed ghost norm trick [3,8].
74 |
75 |
76 |
77 |
78 |
79 | 4. We **support all optimizers** in [`torch.optim`](https://pytorch.org/docs/stable/optim.html) (SGD, Adam, AdaGrad,...) and a wide range of **models** (BERT, RoBERTa, GPT, ViT, BEiT, CrossViT, DEiT, ResNet, VGG, DenseNet,...), including their parameter-efficient variants. For example, one can run DP bias-term fine-tuning (DP-BiTFiT) by simply freezing non-bias terms, as in `examples/image_classification`.
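
For instance, a minimal sketch following the BiTFiT branch of `examples/image_classification/CIFAR_TIMM.py` (`model` is any supported network; run this before constructing the `PrivacyEngine`):

```python
# DP-BiTFiT: freeze every parameter except the bias terms.
for name, param in model.named_parameters():
    if '.bias' not in name:
        param.requires_grad_(False)
```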
80 |
81 | ------
82 | Full fine-tuning results on a single A100 GPU
83 |
84 | | Datasets | ε | Setting | Model | Accuracy | Time(min)/epoch |
85 | |----------|---|----------------------------------------------------------|---------------|-----------|-----------------|
86 | | CIFAR10 | 2 | [6] | ViT-large | 98.9 | 7.0 |
87 | | CIFAR100 | 2 | [6] | BEiT-large | 88.7 | 6.5 |
88 | | CelebA | 3 | [6] | ResNet18 | 88.2 | 2.7 |
89 | | SST2 | 3 | [8] | RoBERTa-large | 93.9 | 13.5 |
90 | | QNLI | 3 | [8] | RoBERTa-large | 91.0 | 20.2 |
91 | | QQP | 3 | [8] | RoBERTa-large | 86.8 | 70.0 |
92 | | MNLI | 3 | [8] | RoBERTa-large | 86.3/86.7 | 77.1 |
93 |
94 | More datasets, epsilon budgets, models, fine-tuning styles, and hyperparameters can be found in the related papers.
95 |
96 |
97 |
98 | ## Examples
99 | The `examples` folder covers three tasks: table-to-text generation (E2E and DART datasets with GPT2 models), text classification (SST2/QNLI/QQP/MNLI datasets with BERT/RoBERTa models), and image classification (CIFAR10/CIFAR100/CelebA datasets with [TIMM](https://github.com/rwightman/pytorch-image-models)/[torchvision](https://github.com/pytorch/vision) models). A detailed `README.md` can be found in each sub-folder. These examples can be used to reproduce the results in [2,3,4,6,8].
100 |
101 |
102 | ## Citation
103 | Please consider citing the following if you use this library in your work:
104 | ```
105 | @inproceedings{bu2023differentially,
106 | title={Differentially private optimization on large model at small cost},
107 | author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
108 | booktitle={International Conference on Machine Learning},
109 | pages={3192--3218},
110 | year={2023},
111 | organization={PMLR}
112 | }
113 |
114 | @article{bu2023zero,
115 | title={Zero redundancy distributed learning with differential privacy},
116 | author={Bu, Zhiqi and Chiu, Justin and Liu, Ruixuan and Zha, Sheng and Karypis, George},
117 | booktitle={ICLR 2023 Workshop on Pitfalls of limited data and computation for Trustworthy ML},
118 | journal={arXiv preprint arXiv:2311.11822},
119 | year={2023}
120 | }
121 |
122 | @inproceedings{bu2022differentially,
123 | title={Differentially Private Bias-Term Fine-tuning of Foundation Models},
124 | author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
125 | booktitle={Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022},
126 | year={2022}
127 | }
128 | ```
129 |
130 | ## Acknowledgements
131 | This codebase is largely inspired by [[Opacus (v0.15)]](https://github.com/pytorch/opacus), [[Private transformers (v0.2.3)]](https://github.com/lxuechen/private-transformers), [[Private Vision]](https://github.com/woodyx218/private_vision), and [[FastGradClip]](https://github.com/ppmlguy/fastgradclip).
132 |
133 | ## References
134 | [1] Ian Goodfellow. "Efficient per-example gradient computations." arXiv preprint arXiv:1510.01799 (2015).
135 |
136 | [2] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. "Large language models can be strong differentially private learners." ICLR (2022).
137 |
138 | [3] Zhiqi Bu, Jialin Mao, and Shiyun Xu. "Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy." NeurIPS (2022).
139 |
140 | [4] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Optimization on Large Model at Small Cost." ICML (2023).
141 |
142 | [5] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen et al. "Opacus: User-friendly differential privacy library in PyTorch." arXiv preprint arXiv:2109.12298 (2021).
143 |
144 | [6] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Bias-Term Fine-tuning of Foundation Models." ICML (2024).
145 |
146 | [7] Martin Abadi, et al. "Deep learning with differential privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.
147 |
148 | [8] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Automatic clipping: Differentially private deep learning made easier and stronger." NeurIPS (2023).
149 |
150 | [9] Zhiqi Bu, Xinwei Zhang, Mingyi Hong, Sheng Zha, and George Karypis. "Pre-training Differentially Private Models with Limited Public Data." NeurIPS (2024).
151 |
--------------------------------------------------------------------------------
/THIRD-PARTY-NOTICES.txt:
--------------------------------------------------------------------------------
1 | ** private-transformers; version 0.2.3 -- https://github.com/lxuechen/private-transformers
2 | ** opacus; version 0.15 -- https://github.com/pytorch/opacus
3 | ** private_vision; version initial version -- https://github.com/woodyx218/private_vision
4 |
5 | Apache License
6 | Version 2.0, January 2004
7 | http://www.apache.org/licenses/
8 |
9 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
10 |
11 | 1. Definitions.
12 |
13 | "License" shall mean the terms and conditions for use, reproduction, and
14 | distribution as defined by Sections 1 through 9 of this document.
15 |
16 | "Licensor" shall mean the copyright owner or entity authorized by the copyright
17 | owner that is granting the License.
18 |
19 | "Legal Entity" shall mean the union of the acting entity and all other entities
20 | that control, are controlled by, or are under common control with that entity.
21 | For the purposes of this definition, "control" means (i) the power, direct or
22 | indirect, to cause the direction or management of such entity, whether by
23 | contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the
24 | outstanding shares, or (iii) beneficial ownership of such entity.
25 |
26 | "You" (or "Your") shall mean an individual or Legal Entity exercising
27 | permissions granted by this License.
28 |
29 | "Source" form shall mean the preferred form for making modifications, including
30 | but not limited to software source code, documentation source, and configuration
31 | files.
32 |
33 | "Object" form shall mean any form resulting from mechanical transformation or
34 | translation of a Source form, including but not limited to compiled object code,
35 | generated documentation, and conversions to other media types.
36 |
37 | "Work" shall mean the work of authorship, whether in Source or Object form, made
38 | available under the License, as indicated by a copyright notice that is included
39 | in or attached to the work (an example is provided in the Appendix below).
40 |
41 | "Derivative Works" shall mean any work, whether in Source or Object form, that
42 | is based on (or derived from) the Work and for which the editorial revisions,
43 | annotations, elaborations, or other modifications represent, as a whole, an
44 | original work of authorship. For the purposes of this License, Derivative Works
45 | shall not include works that remain separable from, or merely link (or bind by
46 | name) to the interfaces of, the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including the original version
49 | of the Work and any modifications or additions to that Work or Derivative Works
50 | thereof, that is intentionally submitted to Licensor for inclusion in the Work
51 | by the copyright owner or by an individual or Legal Entity authorized to submit
52 | on behalf of the copyright owner. For the purposes of this definition,
53 | "submitted" means any form of electronic, verbal, or written communication sent
54 | to the Licensor or its representatives, including but not limited to
55 | communication on electronic mailing lists, source code control systems, and
56 | issue tracking systems that are managed by, or on behalf of, the Licensor for
57 | the purpose of discussing and improving the Work, but excluding communication
58 | that is conspicuously marked or otherwise designated in writing by the copyright
59 | owner as "Not a Contribution."
60 |
61 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf
62 | of whom a Contribution has been received by Licensor and subsequently
63 | incorporated within the Work.
64 |
65 | 2. Grant of Copyright License. Subject to the terms and conditions of this
66 | License, each Contributor hereby grants to You a perpetual, worldwide, non-
67 | exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce,
68 | prepare Derivative Works of, publicly display, publicly perform, sublicense, and
69 | distribute the Work and such Derivative Works in Source or Object form.
70 |
71 | 3. Grant of Patent License. Subject to the terms and conditions of this License,
72 | each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-
73 | charge, royalty-free, irrevocable (except as stated in this section) patent
74 | license to make, have made, use, offer to sell, sell, import, and otherwise
75 | transfer the Work, where such license applies only to those patent claims
76 | licensable by such Contributor that are necessarily infringed by their
77 | Contribution(s) alone or by combination of their Contribution(s) with the Work
78 | to which such Contribution(s) was submitted. If You institute patent litigation
79 | against any entity (including a cross-claim or counterclaim in a lawsuit)
80 | alleging that the Work or a Contribution incorporated within the Work
81 | constitutes direct or contributory patent infringement, then any patent licenses
82 | granted to You under this License for that Work shall terminate as of the date
83 | such litigation is filed.
84 |
85 | 4. Redistribution. You may reproduce and distribute copies of the Work or
86 | Derivative Works thereof in any medium, with or without modifications, and in
87 | Source or Object form, provided that You meet the following conditions:
88 |
89 | (a) You must give any other recipients of the Work or Derivative Works a
90 | copy of this License; and
91 |
92 | (b) You must cause any modified files to carry prominent notices stating
93 | that You changed the files; and
94 |
95 | (c) You must retain, in the Source form of any Derivative Works that You
96 | distribute, all copyright, patent, trademark, and attribution notices from the
97 | Source form of the Work, excluding those notices that do not pertain to any part
98 | of the Derivative Works; and
99 |
100 | (d) If the Work includes a "NOTICE" text file as part of its distribution,
101 | then any Derivative Works that You distribute must include a readable copy of
102 | the attribution notices contained within such NOTICE file, excluding those
103 | notices that do not pertain to any part of the Derivative Works, in at least one
104 | of the following places: within a NOTICE text file distributed as part of the
105 | Derivative Works; within the Source form or documentation, if provided along
106 | with the Derivative Works; or, within a display generated by the Derivative
107 | Works, if and wherever such third-party notices normally appear. The contents of
108 | the NOTICE file are for informational purposes only and do not modify the
109 | License. You may add Your own attribution notices within Derivative Works that
110 | You distribute, alongside or as an addendum to the NOTICE text from the Work,
111 | provided that such additional attribution notices cannot be construed as
112 | modifying the License.
113 |
114 | You may add Your own copyright statement to Your modifications and may
115 | provide additional or different license terms and conditions for use,
116 | reproduction, or distribution of Your modifications, or for any such Derivative
117 | Works as a whole, provided Your use, reproduction, and distribution of the Work
118 | otherwise complies with the conditions stated in this License.
119 |
120 | 5. Submission of Contributions. Unless You explicitly state otherwise, any
121 | Contribution intentionally submitted for inclusion in the Work by You to the
122 | Licensor shall be under the terms and conditions of this License, without any
123 | additional terms or conditions. Notwithstanding the above, nothing herein shall
124 | supersede or modify the terms of any separate license agreement you may have
125 | executed with Licensor regarding such Contributions.
126 |
127 | 6. Trademarks. This License does not grant permission to use the trade names,
128 | trademarks, service marks, or product names of the Licensor, except as required
129 | for reasonable and customary use in describing the origin of the Work and
130 | reproducing the content of the NOTICE file.
131 |
132 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in
133 | writing, Licensor provides the Work (and each Contributor provides its
134 | Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
135 | KIND, either express or implied, including, without limitation, any warranties
136 | or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
137 | PARTICULAR PURPOSE. You are solely responsible for determining the
138 | appropriateness of using or redistributing the Work and assume any risks
139 | associated with Your exercise of permissions under this License.
140 |
141 | 8. Limitation of Liability. In no event and under no legal theory, whether in
142 | tort (including negligence), contract, or otherwise, unless required by
143 | applicable law (such as deliberate and grossly negligent acts) or agreed to in
144 | writing, shall any Contributor be liable to You for damages, including any
145 | direct, indirect, special, incidental, or consequential damages of any character
146 | arising as a result of this License or out of the use or inability to use the
147 | Work (including but not limited to damages for loss of goodwill, work stoppage,
148 | computer failure or malfunction, or any and all other commercial damages or
149 | losses), even if such Contributor has been advised of the possibility of such
150 | damages.
151 |
152 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or
153 | Derivative Works thereof, You may choose to offer, and charge a fee for,
154 | acceptance of support, warranty, indemnity, or other liability obligations
155 | and/or rights consistent with this License. However, in accepting such
156 | obligations, You may act only on Your own behalf and on Your sole
157 | responsibility, not on behalf of any other Contributor, and only if You agree to
158 | indemnify, defend, and hold each Contributor harmless for any liability incurred
159 | by, or claims asserted against, such Contributor by reason of your accepting any
160 | such warranty or additional liability.
161 |
162 | END OF TERMS AND CONDITIONS
163 |
164 | APPENDIX: How to apply the Apache License to your work.
165 |
166 | To apply the Apache License to your work, attach the following boilerplate
167 | notice, with the fields enclosed by brackets "[]" replaced with your own
168 | identifying information. (Don't include the brackets!) The text should be
169 | enclosed in the appropriate comment syntax for the file format. We also
170 | recommend that a file or class name and description of purpose be included on
171 | the same "printed page" as the copyright notice for easier identification within
172 | third-party archives.
173 |
174 | Copyright [yyyy] [name of copyright owner]
175 |
176 | Licensed under the Apache License, Version 2.0 (the "License");
177 | you may not use this file except in compliance with the License.
178 | You may obtain a copy of the License at
179 |
180 | http://www.apache.org/licenses/LICENSE-2.0
181 |
182 | Unless required by applicable law or agreed to in writing, software
183 | distributed under the License is distributed on an "AS IS" BASIS,
184 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
185 | See the License for the specific language governing permissions and
186 | limitations under the License.
187 |
188 | * For Private-Transformers see also this required NOTICE:
189 | None
190 | * For opacus see also this required NOTICE:
191 | Copyright (c) Meta Platforms, Inc. and affiliates.
192 | * For private_vision see also this required NOTICE:
193 | None
194 |
195 | ------
196 |
197 | ** ml-swissknife; version 0.1.7 -- https://github.com/lxuechen/ml-swissknife
198 | None
199 |
200 | MIT License
201 |
202 | Copyright (c)
203 |
204 | Permission is hereby granted, free of charge, to any person obtaining a copy of
205 | this software and associated documentation files (the "Software"), to deal in
206 | the Software without restriction, including without limitation the rights to
207 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
208 | the Software, and to permit persons to whom the Software is furnished to do so,
209 | subject to the following conditions:
210 |
211 | The above copyright notice and this permission notice shall be included in all
212 | copies or substantial portions of the Software.
213 |
214 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
215 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
216 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
217 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
218 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
219 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
220 |
--------------------------------------------------------------------------------
/assets/efficiency.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/efficiency.png
--------------------------------------------------------------------------------
/assets/nlp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/nlp.png
--------------------------------------------------------------------------------
/assets/scalability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/scalability.png
--------------------------------------------------------------------------------
/assets/vision.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/vision.png
--------------------------------------------------------------------------------
/examples/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/__init__.py
--------------------------------------------------------------------------------
/examples/image_classification/CIFAR_TIMM.py:
--------------------------------------------------------------------------------
1 | '''Train CIFAR10/CIFAR100 with PyTorch.'''
2 | def main(args):
3 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']:
4 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'")
5 | return None
6 |
7 | device= torch.device("cuda:0")
8 |
9 | # Data
10 | print('==> Preparing data..')
11 |
12 | transformation = torchvision.transforms.Compose([
13 | torchvision.transforms.Resize(args.dimension),
14 | torchvision.transforms.ToTensor(),
15 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)),
16 | ])
17 |
18 |
19 | if args.cifar_data=='CIFAR10':
20 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation)
21 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation)
22 | elif args.cifar_data=='CIFAR100':
23 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation)
24 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation)
25 | else:
26 |         print("Must specify dataset as CIFAR10 or CIFAR100"); return None
27 |
28 |
29 | trainloader = torch.utils.data.DataLoader(
30 | trainset, batch_size=args.mini_bs, shuffle=True, num_workers=4)
31 |
32 | testloader = torch.utils.data.DataLoader(
33 | testset, batch_size=100, shuffle=False, num_workers=4)
34 |
35 | n_acc_steps = args.bs // args.mini_bs # gradient accumulation steps
36 |
37 | # Model
38 | print('==> Building model..', args.model,'; BatchNorm is replaced by GroupNorm. Mode: ', args.clipping_mode)
39 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:]))
40 | net = ModuleValidator.fix(net); net=net.to(device)
41 |
42 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
43 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
44 |
45 | criterion = nn.CrossEntropyLoss()
46 |
47 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
48 |
49 | if 'BiTFiT' in args.clipping_mode: # not needed for DP-BiTFiT but use here for safety
50 | for name,param in net.named_parameters():
51 | if '.bias' not in name:
52 | param.requires_grad_(False)
53 |
54 | # Privacy engine
55 | if 'nonDP' not in args.clipping_mode:
56 | sigma=get_noise_multiplier(
57 | target_epsilon = args.epsilon,
58 | target_delta = 1e-5,
59 | sample_rate = args.bs/len(trainset),
60 | epochs = args.epochs,
61 | )
62 |
63 | if 'BK' in args.clipping_mode:
64 | clipping_mode=args.clipping_mode[3:]
65 | else:
66 | clipping_mode='ghost'
67 |
68 |         if args.clipping_style in [['all-layer'],['layer-wise'],['param-wise']]: # unwrap the single-element list produced by nargs='+'
69 | args.clipping_style=args.clipping_style[0]
70 | privacy_engine = PrivacyEngine(
71 | net,
72 | batch_size=args.bs,
73 | sample_size=len(trainset),
74 | noise_multiplier=sigma,
75 | epochs=args.epochs,
76 | clipping_mode=clipping_mode,
77 | clipping_style=args.clipping_style,
78 | origin_params=args.origin_params,#['patch_embed.proj.bias'],
79 | )
80 | privacy_engine.attach(optimizer)
81 |
82 |
83 | def train(epoch):
84 |
85 | net.train()
86 | train_loss = 0
87 | correct = 0
88 | total = 0
89 |
90 |
91 | for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
92 | inputs, targets = inputs.to(device), targets.to(device)
93 | outputs = net(inputs)
94 | loss = criterion(outputs, targets)
95 |
96 | loss.backward()
97 | if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
98 | optimizer.step()
99 | optimizer.zero_grad()
100 |
101 | train_loss += loss.item()
102 | _, predicted = outputs.max(1)
103 | total += targets.size(0)
104 | correct += predicted.eq(targets).sum().item()
105 |
106 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
107 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
108 |
109 | def test(epoch):
110 | net.eval()
111 | test_loss = 0
112 | correct = 0
113 | total = 0
114 | with torch.no_grad():
115 | for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
116 | inputs, targets = inputs.to(device), targets.to(device)
117 | outputs = net(inputs)
118 | loss = criterion(outputs, targets)
119 |
120 | test_loss += loss.item()
121 | _, predicted = outputs.max(1)
122 | total += targets.size(0)
123 | correct += predicted.eq(targets).sum().item()
124 |
125 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
126 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
127 |
128 | for epoch in range(args.epochs):
129 | train(epoch)
130 | test(epoch)
131 |
132 |
133 | if __name__ == '__main__':
134 | import argparse
135 |
136 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
137 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
138 | parser.add_argument('--epochs', default=3, type=int,
139 |                         help='number of epochs')
140 | parser.add_argument('--bs', default=1000, type=int, help='batch size')
141 | parser.add_argument('--mini_bs', type=int, default=50)
142 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
143 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
144 | parser.add_argument('--clipping_style', default='all-layer', nargs='+',type=str)
145 | parser.add_argument('--model', default='vit_small_patch16_224', type=str)
146 | parser.add_argument('--cifar_data', type=str, default='CIFAR10')
147 | parser.add_argument('--dimension', type=int,default=224)
148 | parser.add_argument('--origin_params', nargs='+', default=None)
149 |
150 | args = parser.parse_args()
151 |
152 | from fastDP import PrivacyEngine
153 |
154 | import torch
155 | import torchvision
156 | torch.manual_seed(2)
157 | import torch.nn as nn
158 | import torch.optim as optim
159 | import timm
160 | from opacus.validators import ModuleValidator
161 | from opacus.accountants.utils import get_noise_multiplier
162 | from tqdm import tqdm
163 | import warnings; warnings.filterwarnings("ignore")
164 |
165 | main(args)
166 |
--------------------------------------------------------------------------------
/examples/image_classification/CV_TIMM.py:
--------------------------------------------------------------------------------
1 | '''Train CV with PyTorch.'''
2 | def main(args):
3 |
4 | device= torch.device("cuda:0")
5 |
6 | # Data
7 | transformation = torchvision.transforms.Compose([
8 | torchvision.transforms.Resize((224,224)),#https://discuss.pytorch.org/t/runtimeerror-stack-expects-each-tensor-to-be-equal-size-but-got-3-224-224-at-entry-0-and-3-224-336-at-entry-3/87211/10
9 | torchvision.transforms.ToTensor(),
10 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)),
11 | ])
12 |
13 | if args.dataset_name in ['SVHN','CIFAR10']:
14 | num_classes=10
15 | elif args.dataset_name in ['CIFAR100','FGVCAircraft']:
16 | num_classes=100
17 | elif args.dataset_name in ['Food101']:
18 | num_classes=101
19 | elif args.dataset_name in ['GTSRB']:
20 | num_classes=43
21 | elif args.dataset_name in ['CelebA']:
22 | num_classes=40
23 | elif args.dataset_name in ['Places365']:
24 | num_classes=365
25 | elif args.dataset_name in ['ImageNet']:
26 | num_classes=1000
27 | elif args.dataset_name in ['INaturalist']:
28 | num_classes=10000
29 |
30 |
31 | if args.dataset_name in ['SVHN','Food101','GTSRB','FGVCAircraft']:
32 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', download=True, transform=transformation)
33 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='test', download=True, transform=transformation)
34 | elif args.dataset_name in ['CIFAR10','CIFAR100']:
35 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', train=True, download=True, transform=transformation)
36 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', train=False, download=True, transform=transformation)
37 | elif args.dataset_name=='CelebA':
38 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', download=False, target_type='attr', transform=transformation)
39 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='test', download=False, target_type='attr',transform=transformation)
40 | elif args.dataset_name=='Places365':
41 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train-standard', small=True, download=False, transform=transformation)
42 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='val', small=True, download=False, transform=transformation)
43 | elif args.dataset_name=='INaturalist':
44 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', version='2021_train_mini', download=False, transform=transformation)
45 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', version='2021_valid', download=False, transform=transformation)
46 | elif args.dataset_name=='ImageNet':
47 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', transform=transformation)
48 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='val', transform=transformation)
49 |
50 | trainloader = torch.utils.data.DataLoader(
51 | trainset, batch_size=args.mini_bs, shuffle=True, num_workers=4)
52 |
53 | testloader = torch.utils.data.DataLoader(
54 | testset, batch_size=100, shuffle=False, num_workers=4)
55 |
56 | n_acc_steps = args.bs // args.mini_bs # gradient accumulation steps
57 |
58 |
59 | # Model
60 | net = timm.create_model(args.model, pretrained=True, num_classes = num_classes)
61 | net = ModuleValidator.fix(net).to(device)
62 |
63 | if args.dataset_name=='CelebA':
64 | criterion = nn.BCEWithLogitsLoss(reduction='none')
65 | else:
66 | criterion = nn.CrossEntropyLoss()
67 |
68 |
69 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
70 |
71 |     if 'BiTFiT' in args.clipping_mode:
72 |         for name,layer in net.named_modules(): # find the last module with a weight, i.e. the classification head
73 |             if hasattr(layer,'weight'):
74 |                 temp_layer=layer
75 |         for name,param in net.named_parameters(): # freeze all non-bias parameters (DP-BiTFiT)
76 |             if '.bias' not in name:
77 |                 param.requires_grad_(False)
78 |         for param in temp_layer.parameters(): # keep the newly initialized head trainable
79 |             param.requires_grad_(True)
80 |
81 | # Privacy engine
82 | if 'nonDP' not in args.clipping_mode:
83 | sigma=get_noise_multiplier(
84 | target_epsilon = args.epsilon,
85 | target_delta = 1/len(trainset),
86 | sample_rate = args.bs/len(trainset),
87 | epochs = args.epochs,
88 | )
89 | print(f'adding noise level {sigma}')
90 | privacy_engine = PrivacyEngine(
91 | net,
92 | batch_size=args.bs,
93 | sample_size=len(trainset),
94 | noise_multiplier=sigma,
95 | epochs=args.epochs,
96 | clipping_mode='MixOpt',
97 | clipping_style='all-layer',
98 | )
99 | privacy_engine.attach(optimizer)
100 |
101 |
102 | tr_loss=[]
103 | te_loss=[]
104 | tr_acc=[]
105 | te_acc=[]
106 |
107 | def train(epoch):
108 |
109 | net.train()
110 | train_loss = 0
111 | correct = 0
112 | total = 0
113 |
114 |
115 | for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
116 | inputs, targets = inputs.to(device), targets.to(device)
117 | outputs = net(inputs)
118 | if args.dataset_name=='CelebA':
119 | loss = criterion(outputs, targets.float()).sum(dim=1).mean()
120 | else:
121 | loss = criterion(outputs, targets);#print(loss.item())
122 |
123 | loss.backward()
124 | if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
125 | optimizer.step()
126 | optimizer.zero_grad()
127 |
128 | train_loss += loss.item()
129 | total += targets.size(0)
130 | if args.dataset_name=='CelebA':
131 | correct += ((outputs > 0) == targets).sum(dim=0).float().mean()
132 | else:
133 | _, predicted = outputs.max(1)
134 | correct += predicted.eq(targets).sum().item()
135 |
136 | if args.dataset_name in ['Places365','INaturalist','ImageNet'] and (batch_idx + 1) % 100 == 0:
137 | print(loss.item(),100.*correct/total)
138 |
139 |
140 | tr_loss.append(train_loss/(batch_idx+1))
141 | tr_acc.append(100.*correct/total)
142 | print('Epoch: ', epoch, 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
143 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
144 |
145 | def test(epoch):
146 | net.eval()
147 | test_loss = 0
148 | correct = 0
149 | total = 0
150 | with torch.no_grad():
151 | for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
152 | inputs, targets = inputs.to(device), targets.to(device)
153 | outputs = net(inputs)
154 | if args.dataset_name=='CelebA':
155 | loss = criterion(outputs, targets.float()).sum(dim=1).mean()
156 | else:
157 | loss = criterion(outputs, targets);#print(loss.item())
158 |
159 | test_loss += loss.item()
160 | total += targets.size(0)
161 | if args.dataset_name=='CelebA':
162 | correct += ((outputs > 0) == targets).sum(dim=0).float().mean()
163 | else:
164 | _, predicted = outputs.max(1)
165 | correct += predicted.eq(targets).sum().item()
166 |
167 | te_loss.append(test_loss/(batch_idx+1))
168 | te_acc.append(100.*correct/total)
169 | print('Epoch: ', epoch, 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
170 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
171 |
172 | for epoch in range(args.epochs):
173 | train(epoch)
174 | test(epoch)
175 | print(tr_loss,tr_acc,te_loss,te_acc)
176 |
177 |
178 | if __name__ == '__main__':
179 | import argparse
180 |
181 |     parser = argparse.ArgumentParser(description='PyTorch CV Training')
182 | parser.add_argument('--lr', default=5e-4, type=float, help='learning rate')
183 | parser.add_argument('--epochs', default=5, type=int,
184 |                         help='number of epochs')
185 | parser.add_argument('--bs', default=1000, type=int, help='batch size')
186 | parser.add_argument('--mini_bs', type=int, default=100)
187 | parser.add_argument('--epsilon', default=8, type=float, help='target epsilon')
188 | parser.add_argument('--dataset_name', type=str, default='CIFAR10',help='https://pytorch.org/vision/stable/datasets.html')
189 | parser.add_argument('--clipping_mode', type=str, default='MixOpt',choices=['BiTFiT','MixOpt', 'nonDP','nonDP-BiTFiT'])
190 | parser.add_argument('--model', default='vit_base_patch16_224', type=str, help='model name')
191 |
192 | args = parser.parse_args()
193 |
194 | from fastDP import PrivacyEngine
195 |
196 | import torch
197 | import torchvision
198 | torch.manual_seed(2)
199 | import torch.nn as nn
200 | import torch.optim as optim
201 | import timm
202 | from opacus.validators import ModuleValidator
203 | from opacus.accountants.utils import get_noise_multiplier
204 | from tqdm import tqdm
205 | import numpy as np
206 | import warnings; warnings.filterwarnings("ignore")
207 | main(args)
208 |
--------------------------------------------------------------------------------
/examples/image_classification/CelebA_TIMM.py:
--------------------------------------------------------------------------------
1 | #This runs multi-label classification
2 | def main(args):
3 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']:
4 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'")
5 | return None
6 |
7 | device = torch.device('cuda')
8 |
9 | # Data
10 | print('==> Preparing data..')
11 |
12 | train_set = datasets.CelebA(root='.', split='train',target_type='attr',
13 | transform=transforms.Compose([
14 | transforms.ToTensor(),
15 | #transforms.Normalize(mean=[0.5,0.5,0.5],std=[0.5,0.5,0.5]),
16 | ]))
17 | test_set = datasets.CelebA(root=".", split='test', target_type='attr',
18 | transform=transforms.Compose([
19 | transforms.ToTensor()]))
20 |
21 |     if args.labels is None:
22 | args.labels=list(range(40))
23 | print('Training on all 40 labels.')
24 | else:
25 | print('Training on ', [attr_names[ind] for ind in args.labels])
26 |
27 |
28 | train_set.attr = train_set.attr[:, args.labels].type(torch.float32)
29 | test_set.attr = test_set.attr[:, args.labels].type(torch.float32)
30 |
31 | print('Training/Testing set size: ', len(train_set),len(test_set),' ; Image dimension: ',train_set[0][0].shape)
32 |
33 | trainloader = torch.utils.data.DataLoader(
34 | train_set, batch_size=args.mini_bs, pin_memory=True,num_workers=4,shuffle=True)
35 | testloader = torch.utils.data.DataLoader(
36 | test_set, batch_size=500, pin_memory=True,num_workers=4, shuffle=False)
37 |
38 |     n_acc_steps=args.bs//args.mini_bs # micro-batches accumulated per optimizer step (logical batch size = args.bs)
39 |
40 | # Model
41 | print('==> Building model..', args.model,'; BatchNorm is replaced by GroupNorm.')
42 | net = timm.create_model(args.model, pretrained=True, num_classes=len(args.labels))
43 | net = ModuleValidator.fix(net)
44 | net=net.to(device)
45 |
46 | for name,param in net.named_parameters():
47 | print("First trainable parameter is: ",name);break
48 |
49 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
50 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
51 |
52 | criterion = nn.BCEWithLogitsLoss(reduction='none')
53 |
54 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
55 |
56 | if 'BiTFiT' in args.clipping_mode:
57 | for name,param in net.named_parameters():
58 | if '.bias' not in name:
59 | param.requires_grad_(False)
60 |
61 | # Privacy engine
62 | if 'nonDP' not in args.clipping_mode:
63 | sigma=get_noise_multiplier(
64 | target_epsilon = args.epsilon,
65 |             target_delta = 5e-6, # smaller than 1/len(train_set) for CelebA's ~163k training images
66 | sample_rate = args.bs/len(train_set),
67 | epochs = args.epochs,
68 | )
69 |
70 | if 'BK' in args.clipping_mode:
71 |             clipping_mode=args.clipping_mode[3:] # strip the 'BK-' prefix, e.g. 'BK-MixOpt' -> 'MixOpt'
72 | else:
73 | clipping_mode='ghost'
74 | privacy_engine = PrivacyEngine(
75 | net,
76 | batch_size=args.bs,
77 | sample_size=len(train_set),
78 | noise_multiplier=sigma,
79 | epochs=args.epochs,
80 | clipping_mode=clipping_mode,
81 | origin_params=args.origin_params,
82 | )
83 | privacy_engine.attach(optimizer)
84 |
85 |
86 | def train(epoch):
87 |
88 | net.train()
89 | train_loss = 0
90 |         correct = np.zeros(len(args.labels), dtype=int)
91 | total = 0
92 | for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
93 | inputs, targets = inputs.to(device), targets.to(device)
94 | outputs = net(inputs)
95 | loss = criterion(outputs, targets.float()).sum(dim=1).mean()
96 |
97 |
98 | loss.backward()
99 | if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
100 | optimizer.step()
101 | optimizer.zero_grad()
102 |
103 | train_loss += loss.item()
104 | total += targets.size(0)
105 | correct += ((outputs > 0) == targets).sum(dim=0).cpu().detach().numpy()
106 |
107 |         print('Epoch: ', epoch, 'Train Loss: ', train_loss/(batch_idx+1),
108 |                 ' | Acc: ', 100.*correct/total, np.mean(100.0 * correct / total))
109 |
110 | def test(epoch):
111 | net.eval()
112 | test_loss = 0
113 |         correct = np.zeros(len(args.labels), dtype=int)
114 | total = 0
115 | with torch.no_grad():
116 | for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
117 | inputs, targets = inputs.to(device), targets.to(device)
118 | outputs = net(inputs)
119 | loss = criterion(outputs, targets.float()).sum(dim=1)
120 | loss = loss.mean()
121 |
122 | test_loss += loss.item()
123 | total += targets.size(0)
124 | correct += ((outputs > 0) == targets).sum(dim=0).cpu().detach().numpy()
125 |
126 |         print('Epoch: ', epoch, 'Test Loss: ', test_loss/(batch_idx+1),
127 |                 ' | Acc: ', 100.*correct/total, np.mean(100.0 * correct / total))
128 |
129 |
130 | for epoch in range(args.epochs):
131 | train(epoch)
132 | test(epoch)
133 |
134 |
135 | if __name__ == '__main__':
136 | import argparse
137 | parser = argparse.ArgumentParser()
138 | parser.add_argument('--lr', type=float, default=0.001)
139 | parser.add_argument('--epochs', type=int, default=10)
140 | parser.add_argument('--bs', type=int, default=500)
141 | parser.add_argument('--mini_bs', type=int, default=100)
142 | parser.add_argument('--epsilon', default=3, type=float)
143 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
144 | parser.add_argument('--model', type=str, default='resnet18')
145 | parser.add_argument('--labels', nargs="*", type=int, default=None,help='List of label indices, 0-39 for CelebA')
146 | parser.add_argument('--origin_params', nargs='+', default=None)
147 |
148 |
149 | args = parser.parse_args()
150 |
151 | attr_names=['5_o_Clock_Shadow','Arched_Eyebrows','Attractive','Bags_Under_Eyes',
152 | 'Bald','Bangs','Big_Lips','Big_Nose',
153 | 'Black_Hair','Blond_Hair','Blurry','Brown_Hair',
154 | 'Bushy_Eyebrows','Chubby','Double_Chin','Eyeglasses',
155 | 'Goatee','Gray_Hair','Heavy_Makeup','High_Cheekbones',
156 | 'Male','Mouth_Slightly_Open','Mustache','Narrow_Eyes',
157 | 'No_Beard','Oval_Face','Pale_Skin','Pointy_Nose',
158 | 'Receding_Hairline','Rosy_Cheeks','Sideburns','Smiling',
159 | 'Straight_Hair','Wavy_Hair','Wearing_Earrings','Wearing_Hat',
160 | 'Wearing_Lipstick','Wearing_Necklace','Wearing_Necktie','Young']
161 |
162 | import numpy as np
163 | from fastDP import PrivacyEngine
164 |
165 | import torch
166 | from torchvision import datasets, transforms
167 | torch.manual_seed(0)
168 | import torch.nn as nn
169 | import torch.optim as optim
170 | import timm
171 | from opacus.validators import ModuleValidator
172 | from opacus.accountants.utils import get_noise_multiplier
173 | from tqdm import tqdm
174 | import warnings; warnings.filterwarnings("ignore")
175 |
176 | main(args)
177 |
--------------------------------------------------------------------------------
/examples/image_classification/README.md:
--------------------------------------------------------------------------------
1 | ## DP image classification with convolutional neural networks and vision transformers
2 |
3 | We provide the scripts to implement DP optimization on CIFAR10, CIFAR100, SVHN, ImageNet, CelebA, Places365, INaturalist, and other datasets, using the models (CNN and ViT) from [TIMM](https://github.com/rwightman/pytorch-image-models/tree/master/timm/models). Supported models include VGG, ResNet, Wide ResNet, ViT, CrossViT, BEiT, DEiT, ...
4 |
5 | ### Multi-GPU distributed learning
6 | See `ZERO_examples` folder. Our privacy engine supports DeepSpeed (ZeRO 1+2+3) and FSDP with mixed precision training. For example,
7 | ```plaintext
8 | deepspeed CIFAR_TIMM_ZERO1.py --model vit_large_patch16_224 --cifar_data CIFAR10 --deepspeed_config cifar_config.json
9 | ```
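The FSDP example in the same folder (`CIFAR_TIMM_FSDP_extending.py`) spawns one process per visible GPU via `torch.multiprocessing.spawn`, so it can be launched with plain `python`; for instance, with the script's default batch sizes:
```plaintext
python CIFAR_TIMM_FSDP_extending.py --model vit_gigantic_patch14_224 --cifar_data CIFAR100 --batch_size 1024 --mini_batch_size 16
```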
10 |
11 | ### CIFAR10/CIFAR100
12 | ```plaintext
13 | python -m CIFAR_TIMM --model vit_large_patch16_224 --origin_params 'patch_embed.proj.bias' --clipping_mode BK-MixOpt --cifar_data CIFAR10
14 | ```
15 |
16 | The script by default uses (hybrid) book-keeping from [Differentially Private Optimization on Large Model at Small Cost](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory cost without affecting accuracy. This script achieves state-of-the-art accuracy with BEiT-large and ViT-large in under 7 min per epoch on one A100 GPU (40GB). Notice that `--origin_params 'patch_embed.proj.bias'` specifically accelerates ViT through the ghost differentiation trick.
17 |
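Concretely, the scripts step the optimizer only every `bs // mini_bs` micro-batches. A minimal sketch of this pattern (not a standalone program: `net`, `criterion`, `optimizer`, `trainloader`, and `device` are named as in the scripts):

```python
n_acc_steps = args.bs // args.mini_bs  # micro-batches accumulated per optimizer step

for batch_idx, (inputs, targets) in enumerate(trainloader):
    inputs, targets = inputs.to(device), targets.to(device)
    loss = criterion(net(inputs), targets)
    loss.backward()  # with the privacy engine attached, per-sample gradients are clipped and accumulated here
    if (batch_idx + 1) % n_acc_steps == 0 or (batch_idx + 1) == len(trainloader):
        optimizer.step()       # noise is added once per logical batch of size args.bs
        optimizer.zero_grad()
```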
18 | Arguments:
19 |
20 | * `--cifar_data`: Whether to train on CIFAR10 (default) or CIFAR100 datasets.
21 |
22 | * `--epsilon`: Target privacy spending, default is 2.
23 |
24 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `nonDP` (non-private full fine-tuning), `BK-ghost` (base book-keeping), `BK-MixGhostClip`, `BK-MixOpt` (default), `BiTFiT` (DP bias-term fine-tuning) and `nonDP-BiTFiT` (non-private BiTFiT). All BK algorithms are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf), and DP-BiTFiT is from [Bu et al., 2022](https://arxiv.org/pdf/2210.00036.pdf).
25 |
26 | * `--model`: The pretrained model from TIMM, check the full list by `timm.list_models(pretrained=True)`.
27 |
28 | * `--origin_params`: Origin parameters for the ghost differentiation trick from [Bu et al. Appendix D.3](https://arxiv.org/pdf/2210.00038.pdf). Default is `None` (not using the trick). To enjoy the acceleration from the trick, set to each model's first trainable layer's parameters.
29 |
30 | * `--dimension`: Dimension of images, default is 224, i.e. the image is resized to 224×224.
31 |
32 | * `--lr`: Learning rate, default is 0.0005. Note that the BiTFiT learning rate should be larger than that of full fine-tuning.
33 |
34 | * `--mini_bs` : Physical batch size for gradient accumulation; determines memory use and speed but not accuracy. Default is 50.
35 |
36 | * `--bs` : Logical batch size that determines convergence and accuracy; should be a multiple of `--mini_bs`. Default is 1000.
37 |
38 | * `--epochs`: Number of epochs, default is 3.
39 |
40 | * `--clipping_style`: Which group-wise per-sample gradient clipping style to use. This argument takes one of `all-layer` (flat clipping), `layer-wise` (each layer is a group, including both weight and bias parameters), `param-wise` (each parameter is a group), or a list of layer names (general group-wise clipping). For example, a uniform 3-group clipping can be implemented using
41 | ```plaintext
42 | python -m CIFAR_TIMM --model vit_base_patch16_224 --origin_params 'patch_embed.proj.bias' --clipping_style patch_embed.proj blocks.4.norm1 blocks.8.norm1 --cifar_data CIFAR10
43 | ```
44 |
45 | ### CelebA
46 | Download dataset by `torchvision`, or from the [official host](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) with all .txt and `/img_align_celeba` in the same directory.
47 | ```plaintext
48 | python -m CelebA_TIMM --model resnet18
49 | ```
50 | Same arguments `[lr, epochs, bs, mini_bs, epsilon, clipping mode, model]` as the CIFAR example, with one additional argument `--labels`. The default `None` trains on all 40 labels as a multi-label/multi-task problem; otherwise training uses the label indices specified as a list. For example, label index 31 is 'Smiling' and label index 20 is 'Male'.
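For instance, to train only on the 'Smiling' and 'Male' attributes:
```plaintext
python -m CelebA_TIMM --model resnet18 --labels 31 20
```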
51 |
52 | ### General computer vision experiments
53 | We provide a general script to experiment on many torchvision datasets, and we fix most of the arguments in the privacy engine.
54 | ```plaintext
55 | python -m CV_TIMM --model vit_base_patch16_224 --dataset_name ImageNet
56 | ```
57 |
58 | ### Note
59 | 1. Vision models oftentimes have batch normalization layers, which violate the DP guarantee (see [Opacus](https://opacus.ai/tutorials/guide_to_module_validator) for the reason). A common solution is to replace them with group/layer/instance normalization, which is easily done with Opacus>=v1.0: `model=ModuleValidator.fix(model)` (see the sketch after these notes).
60 |
61 | 2. To reproduce DP image classification and compare with other packages, we refer to [private-vision](https://github.com/woodyx218/private_vision) (covering GhostClip, MixGhostClip, Opacus-like optimization) and [Opacus](https://github.com/pytorch/opacus). Different packages and clipping modes should produce the same accuracy. Note that training more epochs with larger noise usually gives better accuracy.
62 |
63 | 3. Generally speaking, GhostClip is inefficient for large images (try a 512×512 image with resnet18) and Opacus is inefficient for large models (try a 224×224 image with BEiT-large). Hence we improve on the mixed ghost norm from [Bu et al.](https://arxiv.org/abs/2205.10683) to use GhostClip or Opacus at different layers.
64 |
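As referenced in the first note, a minimal sketch of the fix-and-validate pattern used throughout these scripts (assuming Opacus>=v1.0 and a TIMM model):

```python
import timm
from opacus.validators import ModuleValidator

net = timm.create_model('resnet18', pretrained=False, num_classes=10)
print(ModuleValidator.validate(net, strict=False))  # lists DP-incompatible modules, e.g. BatchNorm
net = ModuleValidator.fix(net)                      # replaces BatchNorm with GroupNorm
assert ModuleValidator.validate(net, strict=False) == []  # the fixed model now passes validation
```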
--------------------------------------------------------------------------------
/examples/image_classification/ZERO_examples/CIFAR_TIMM_FSDP_extending.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import torch
4 | import torch.nn as nn
5 | import torch.optim as optim
6 |
7 | import torchvision
8 | from fastDP import PrivacyEngine_Distributed_extending
9 |
10 | import timm
11 | #from opacus.validators import ModuleValidator
12 | from tqdm import tqdm
13 | import warnings; warnings.filterwarnings("ignore")
14 |
15 |
16 | import torch.distributed as dist
17 | import torch.multiprocessing as mp
18 | from torch.utils.data.distributed import DistributedSampler
19 | from fairscale.nn import FullyShardedDataParallel as FSDP
20 |
21 | #--- if import from torch <= 1.11
22 | #from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
23 | #from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload,BackwardPrefetch
24 | #from torch.distributed.fsdp.wrap import default_auto_wrap_policy,enable_wrap,wrap
25 | from fairscale.nn import default_auto_wrap_policy
26 | from fairscale.internal.parallel import ProcessGroupName
27 |
28 |
29 | def setup(rank, world_size):
30 | os.environ['MASTER_ADDR'] = 'localhost'
31 | os.environ['MASTER_PORT'] = '12355'
32 |
33 | # initialize the process group
34 | dist.init_process_group("nccl", rank=rank, world_size=world_size)
35 |
36 | def cleanup():
37 | dist.destroy_process_group()
38 |
39 |
40 | def train(epoch,net,rank,trainloader,criterion,optimizer,grad_acc_steps):
41 | net.train()
42 | ddp_loss = torch.zeros(3).to(rank)
43 |
44 | for batch_idx, data in enumerate(tqdm(trainloader)):
45 | # get the inputs; data is a list of [inputs, labels]
46 | inputs, targets = data[0].to(rank), data[1].to(rank)
47 | outputs = net(inputs)
48 |
49 | loss = criterion(outputs, targets)
50 |
51 | loss.backward()
52 | if ((batch_idx + 1) % grad_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
53 | optimizer.step()
54 | optimizer.zero_grad()
55 |
56 | _, predicted = outputs.max(1)
57 |
58 | ddp_loss[0] += loss.item()
59 | ddp_loss[1] += len(data[0])
60 | ddp_loss[2] += predicted.eq(targets.view_as(predicted)).sum().item()
61 |
62 |     dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM) # aggregate loss/count/correct across ranks before reporting
63 |     if rank == 0:
64 |         print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%%'
65 |             % (ddp_loss[0]/(batch_idx+1), 100.*ddp_loss[2]/ddp_loss[1]))
66 |
67 | def test(epoch,net,rank,testloader,criterion):
68 | net.eval()
69 | ddp_loss = torch.zeros(3).to(rank)
70 |
71 | with torch.no_grad():
72 | for batch_idx, data in enumerate(tqdm(testloader)):
73 | inputs, targets = data[0].to(rank), data[1].to(rank)
74 | outputs = net(inputs)
75 | loss = criterion(outputs, targets)
76 |
77 | _, predicted = outputs.max(1)
78 | ddp_loss[0] += loss.item()
79 | ddp_loss[1] += len(data[0])
80 | ddp_loss[2] += predicted.eq(targets.view_as(predicted)).sum().item()
81 | if rank == 0:
82 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%%'
83 | % (ddp_loss[0]/ddp_loss[1]*len(inputs), 100.*ddp_loss[2]/ddp_loss[1]))
84 |
85 | '''Train CIFAR10/CIFAR100 with PyTorch.'''
86 | def main(rank, world_size, args):
87 |
88 | grad_acc_steps = args.batch_size//args.mini_batch_size//world_size
89 |
90 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']:
91 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'")
92 | return None
93 |
94 | setup(rank, world_size)
95 |
96 |
97 | transformation = torchvision.transforms.Compose([
98 | torchvision.transforms.Resize(args.dimension),
99 | torchvision.transforms.ToTensor(),
100 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)),
101 | ])
102 |
103 | # Data
104 | print('==> Preparing data..')
105 |
106 | if args.cifar_data=='CIFAR10':
107 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=False, transform=transformation)
108 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=False, transform=transformation)
109 | elif args.cifar_data=='CIFAR100':
110 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=False, transform=transformation)
111 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=False, transform=transformation)
112 | else:
113 | return "Must specify datasets as CIFAR10 or CIFAR100"
114 |
115 | sampler_train = DistributedSampler(trainset, rank=rank, num_replicas=world_size, shuffle=True)
116 | sampler_test = DistributedSampler(testset, rank=rank, num_replicas=world_size)
117 |
118 | train_kwargs = {'batch_size': args.mini_batch_size, 'sampler': sampler_train}
119 | test_kwargs = {'batch_size': 10, 'sampler': sampler_test}
120 | cuda_kwargs = {'num_workers': 2,
121 | 'pin_memory': False,
122 | 'shuffle': False}
123 | train_kwargs.update(cuda_kwargs)
124 | test_kwargs.update(cuda_kwargs)
125 |
126 | trainloader = torch.utils.data.DataLoader(trainset,**train_kwargs)
127 | testloader = torch.utils.data.DataLoader(testset, **test_kwargs)
128 | torch.cuda.set_device(rank)
129 |
130 |
131 | init_start_event = torch.cuda.Event(enable_timing=True)
132 | init_end_event = torch.cuda.Event(enable_timing=True)
133 |
134 | # Model
135 | print('==> Building and fixing model..', args.model,'. Mode: ', args.clipping_mode,grad_acc_steps)
136 | net = timm.create_model(args.model, pretrained=True, num_classes=int(args.cifar_data[5:]))
137 |
138 | if 'BiTFiT' in args.clipping_mode:
139 | for name,param in net.named_parameters():
140 | if '.bias' not in name:
141 | param.requires_grad_(False)
142 |
143 | net = net.to(rank)
144 |
145 | # Privacy engine
146 | if 'nonDP' not in args.clipping_mode:
147 | PrivacyEngine_Distributed_extending(
148 | net,
149 | batch_size=args.batch_size,
150 | sample_size=len(trainset),
151 | epochs=args.epochs,
152 | target_epsilon=args.epsilon,
153 | num_GPUs=world_size,
154 |             torch_seed_is_fixed=True, # FSDP gives different seeds to the devices if FSDP() wraps the whole model
155 | grad_accum_steps=grad_acc_steps,
156 | )
157 |
158 |
159 | #net = FSDP(net,flatten_parameters=False, mixed_precision=args.fp16)# must use flatten_parameters=False https://github.com/facebookresearch/fairscale/issues/1047
160 |
161 | from fairscale.nn.wrap import wrap, enable_wrap, auto_wrap
162 | fsdp_params = dict(wrapper_cls=FSDP, mixed_precision=args.fp16, flatten_parameters=False)#,disable_reshard_on_root=False,reshard_after_forward=False,clear_autocast_cache=True) # True or False
163 | with enable_wrap(**fsdp_params):
164 |         # cannot wrap the network as a whole, or the weight.noise attribute is lost
165 |         for pp in net.modules(): # must wrap modules/layers, not parameters
166 |             if hasattr(pp,'weight'): # only wrap modules holding weights, to avoid ConfigAutoWrap's assertion on re-wrapping
167 | pp=auto_wrap(pp)
168 |
169 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
170 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
171 |
172 |
173 | criterion = nn.CrossEntropyLoss(reduction='sum')
174 |
175 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
176 | #https://pytorch.org/docs/stable/fsdp.html
177 | #The optimizer must be initialized after the module has been wrapped, since FSDP will shard parameters in-place and this will break any previously initialized optimizers.
178 |
179 | init_start_event.record()
180 |
181 | for epoch in range(args.epochs):
182 | train(epoch,net,rank,trainloader,criterion,optimizer,grad_acc_steps)
183 | test(epoch,net,rank,testloader,criterion)
184 | init_end_event.record()
185 |
186 | if rank == 0:
187 | print(f"CUDA event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")
188 |
189 | cleanup()
190 |
191 |
192 |
193 | if __name__ == '__main__':
194 |
195 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
196 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
197 | parser.add_argument('--epochs', default=5, type=int,
198 |                         help='number of epochs')
199 | parser.add_argument('--batch_size', default=1024, type=int, help='logical batch size')
200 | parser.add_argument('--mini_batch_size', default=16, type=int, help='physical batch size')
201 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
202 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
203 | parser.add_argument('--model', default='vit_gigantic_patch14_224', type=str)
204 | parser.add_argument('--cifar_data', type=str, default='CIFAR100')
205 | parser.add_argument('--dimension', type=int,default=224)
206 |     parser.add_argument('--fp16', action='store_true') # note: type=bool would treat any non-empty string, including "False", as True
207 |
208 | args = parser.parse_args()
209 |
210 | torch.manual_seed(2) # useful for reproduction
211 |
212 | WORLD_SIZE = torch.cuda.device_count()
213 |
214 | mp.spawn(main,args=(WORLD_SIZE, args),
215 | nprocs=WORLD_SIZE,join=True)
216 | #https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn
217 |
--------------------------------------------------------------------------------
/examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO1.py:
--------------------------------------------------------------------------------
1 | '''Train CIFAR10/CIFAR100 with PyTorch.'''
2 | def main(args):
3 | config=json.load(open(args.deepspeed_config))
4 |
5 | transformation = torchvision.transforms.Compose([
6 | torchvision.transforms.Resize(args.dimension),
7 | torchvision.transforms.ToTensor(),
8 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)),
9 | ])
10 |
11 | if torch.distributed.get_rank() != 0:
12 | # might be downloading cifar data, let rank 0 download first
13 | torch.distributed.barrier()
14 |
15 |
16 | # Data
17 | print('==> Preparing data..')
18 |
19 | if args.cifar_data=='CIFAR10':
20 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation)
21 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation)
22 | elif args.cifar_data=='CIFAR100':
23 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation)
24 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation)
25 | else:
26 | return "Must specify datasets as CIFAR10 or CIFAR100"
27 |
28 |
29 | if torch.distributed.get_rank() == 0:
30 | # cifar data is downloaded, indicate other ranks can proceed
31 | torch.distributed.barrier()
32 |
33 |     testloader = torch.utils.data.DataLoader(testset, batch_size=20, shuffle=False, num_workers=2) # must use num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746
34 |
35 | # Model
36 | print('==> Building and fixing model..', args.model,'. Mode: ', args.clipping_mode)
37 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:]))
38 |     net = ModuleValidator.fix(net)
39 |
40 | criterion = nn.CrossEntropyLoss()
41 |
42 | if 'BiTFiT' in args.clipping_mode:
43 | for name,param in net.named_parameters():
44 | if '.bias' not in name:
45 | param.requires_grad_(False)
46 |
47 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
48 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
49 |
50 | # Privacy engine
51 | if 'nonDP' not in args.clipping_mode:
52 | privacy_engine = PrivacyEngine(
53 | net,
54 | batch_size=config['train_batch_size'],
55 | sample_size=len(trainset),
56 | epochs=args.epochs,
57 | target_epsilon=args.epsilon,
58 | clipping_mode='MixOpt',
59 | clipping_style=args.clipping_style,
60 | num_GPUs=torch.distributed.get_world_size(),
61 | torch_seed_is_fixed=True,
62 | )
63 |
64 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
65 |
66 | # Initialize DeepSpeed to use the following features
67 | # 1) Distributed model
68 | # 2) Distributed data loader
69 | # 3) DeepSpeed optimizer
70 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset)
71 |
72 |     fp16 = model_engine.fp16_enabled(); bf16 = model_engine.bfloat16_enabled()
73 | print(f'fp16={fp16},bf16={bf16}')
74 |
75 |
76 | def train(epoch):
77 |
78 | net.train()
79 | train_loss = 0
80 | correct = 0
81 | total = 0
82 |
83 |
84 | for batch_idx, data in enumerate(tqdm(trainloader)):
85 | # get the inputs; data is a list of [inputs, labels]
86 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
87 | if fp16:
88 | inputs = inputs.half()
89 | if bf16:
90 | inputs = inputs.bfloat16()
91 | outputs = model_engine(inputs)
92 |
93 | loss = criterion(outputs, targets)
94 |
95 | model_engine.backward(loss)
96 | #if ((batch_idx + 1) % 2 == 0) or ((batch_idx + 1) == len(trainloader)):
97 | model_engine.step()
98 |
99 | train_loss += loss.item()
100 | _, predicted = outputs.max(1)
101 | total += targets.size(0)
102 | correct += predicted.eq(targets).sum().item()
103 |
104 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
105 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
106 |
107 | def test(epoch):
108 | net.eval()
109 | test_loss = 0
110 | correct = 0
111 | total = 0
112 | with torch.no_grad():
113 | for batch_idx, data in enumerate(tqdm(testloader)):
114 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
115 | if fp16:
116 | inputs = inputs.half()
117 | if bf16:
118 | inputs = inputs.bfloat16()
119 | outputs = model_engine(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py
120 | loss = criterion(outputs, targets)
121 |
122 | test_loss += loss.item()
123 | _, predicted = outputs.max(1)
124 | total += targets.size(0)
125 | correct += predicted.eq(targets).sum().item()
126 |
127 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
128 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
129 |
130 | for epoch in range(args.epochs):
131 | train(epoch)
132 | test(epoch)
133 |
134 |
135 | if __name__ == '__main__':
136 | import deepspeed
137 | import argparse
138 |
139 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
140 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
141 | parser.add_argument('--epochs', default=1, type=int,
142 |                         help='number of epochs')
143 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
144 | parser.add_argument('--clipping_mode', default='MixOpt', type=str)
145 | parser.add_argument('--model', default='vit_gigantic_patch14_224', type=str)
146 | parser.add_argument('--cifar_data', type=str, default='CIFAR100')
147 | parser.add_argument('--dimension', type=int,default=224)
148 | parser.add_argument('--clipping_style', type=str, default='layer-wise')
149 |
150 | parser.add_argument('--local_rank',
151 | type=int,
152 | default=-1,
153 | help='local rank passed from distributed launcher')
154 | # Include DeepSpeed configuration arguments
155 | parser = deepspeed.add_config_arguments(parser)
156 |
157 | args = parser.parse_args()
158 |
159 | from fastDP import PrivacyEngine
160 |
161 | import torch
162 | import torchvision
163 |     torch.manual_seed(3) # if the seed is changed or removed, update the privacy engine's torch_seed_is_fixed argument
164 | import torch.nn as nn
165 | import torch.optim as optim
166 | import timm
167 | from opacus.validators import ModuleValidator
168 | from tqdm import tqdm
169 | import warnings; warnings.filterwarnings("ignore")
170 |
171 | import json
172 |
173 | deepspeed.init_distributed()
174 |
175 | main(args)
176 |
--------------------------------------------------------------------------------
/examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO23.py:
--------------------------------------------------------------------------------
1 | '''Train CIFAR10/CIFAR100 with PyTorch.'''
2 | def main(args):
3 | config=json.load(open(args.deepspeed_config))
4 |
5 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']:
6 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'")
7 | return None
8 |
9 |
10 | transformation = torchvision.transforms.Compose([
11 | torchvision.transforms.Resize(args.dimension),
12 | torchvision.transforms.ToTensor(),
13 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)),
14 | ])
15 |
16 | if torch.distributed.get_rank() != 0:
17 | # might be downloading cifar data, let rank 0 download first
18 | torch.distributed.barrier()
19 |
20 |
21 | # Data
22 | print('==> Preparing data..')
23 |
24 | if args.cifar_data=='CIFAR10':
25 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation)
26 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation)
27 | elif args.cifar_data=='CIFAR100':
28 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation)
29 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation)
30 | else:
31 | return "Must specify datasets as CIFAR10 or CIFAR100"
32 |
33 |
34 | if torch.distributed.get_rank() == 0:
35 | # cifar data is downloaded, indicate other ranks can proceed
36 | torch.distributed.barrier()
37 |
38 |     testloader = torch.utils.data.DataLoader(testset, batch_size=10, shuffle=False, num_workers=2) # must use num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746
39 |
40 | # Model
41 | print('==> Building and fixing model..', args.model,'. Mode: ', args.clipping_mode)
42 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:]))
43 |     net = ModuleValidator.fix(net)
44 |
45 | criterion = nn.CrossEntropyLoss()
46 |
47 | if 'BiTFiT' in args.clipping_mode:
48 | for name,param in net.named_parameters():
49 | if '.bias' not in name:
50 | param.requires_grad_(False)
51 |
52 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
53 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
54 |
55 | # Privacy engine
56 | if 'nonDP' not in args.clipping_mode:
57 | privacy_engine = PrivacyEngine_Distributed_Stage_2_and_3(
58 | net,
59 | batch_size=config['train_batch_size'],
60 | sample_size=len(trainset),
61 | epochs=args.epochs,
62 | #noise_multiplier=0,
63 | target_epsilon=args.epsilon,
64 | clipping_mode='MixOpt',
65 | clipping_style='layer-wise',
66 | num_GPUs=torch.distributed.get_world_size(),
67 | torch_seed_is_fixed=True,
68 | )
69 |
70 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
71 |
72 | # Initialize DeepSpeed to use the following features
73 | # 1) Distributed model
74 | # 2) Distributed data loader
75 | # 3) DeepSpeed optimizer
76 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset)
77 |
78 |     fp16 = model_engine.fp16_enabled(); bf16 = model_engine.bfloat16_enabled()
79 | print(f'fp16={fp16},bf16={bf16}')
80 |
81 |
82 | def train(epoch):
83 |
84 | net.train()
85 | train_loss = 0
86 | correct = 0
87 | total = 0
88 |
89 |
90 | for batch_idx, data in enumerate(tqdm(trainloader)):
91 | # get the inputs; data is a list of [inputs, labels]
92 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
93 | if fp16:
94 | inputs = inputs.half()
95 | if bf16:
96 | inputs = inputs.bfloat16()
97 | outputs = model_engine(inputs)
98 |
99 | loss = criterion(outputs, targets)
100 |
101 | model_engine.backward(loss)
102 | #if ((batch_idx + 1) % 2 == 0) or ((batch_idx + 1) == len(trainloader)):
103 | model_engine.step()
104 |
105 | train_loss += loss.item()
106 | _, predicted = outputs.max(1)
107 | total += targets.size(0)
108 | correct += predicted.eq(targets).sum().item()
109 |
110 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
111 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
112 |
113 | def test(epoch):
114 | net.eval()
115 | test_loss = 0
116 | correct = 0
117 | total = 0
118 | with torch.no_grad():
119 | for batch_idx, data in enumerate(tqdm(testloader)):
120 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
121 | if fp16:
122 | inputs = inputs.half()
123 | if bf16:
124 | inputs = inputs.bfloat16()
125 | outputs = model_engine(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py
126 | loss = criterion(outputs, targets)
127 |
128 | test_loss += loss.item()
129 | _, predicted = outputs.max(1)
130 | total += targets.size(0)
131 | correct += predicted.eq(targets).sum().item()
132 |
133 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
134 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
135 |
136 | for epoch in range(args.epochs):
137 | train(epoch)
138 | test(epoch)
139 |
140 |
141 | if __name__ == '__main__':
142 | import deepspeed
143 | import argparse
144 |
145 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
146 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
147 | parser.add_argument('--epochs', default=5, type=int,
148 |                         help='number of epochs')
149 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
150 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
151 | parser.add_argument('--model', default='vit_small_patch16_224', type=str)
152 | parser.add_argument('--cifar_data', type=str, default='CIFAR100')
153 | parser.add_argument('--dimension', type=int,default=224)
154 | parser.add_argument('--origin_params', nargs='+', default=None)
155 |
156 | parser.add_argument('--local_rank',
157 | type=int,
158 | default=-1,
159 | help='local rank passed from distributed launcher')
160 | # Include DeepSpeed configuration arguments
161 | parser = deepspeed.add_config_arguments(parser)
162 |
163 | args = parser.parse_args()
164 |
165 | from fastDP import PrivacyEngine_Distributed_Stage_2_and_3
166 |
167 | import torch
168 | import torchvision
169 |     torch.manual_seed(3) # if the seed is changed or removed, update the privacy engine's torch_seed_is_fixed argument
170 | import torch.nn as nn
171 | import torch.optim as optim
172 | import timm
173 | from opacus.validators import ModuleValidator
174 | from tqdm import tqdm
175 | import warnings; warnings.filterwarnings("ignore")
176 |
177 | import json
178 |
179 | import deepspeed
180 | deepspeed.init_distributed()
181 |
182 | main(args)
183 |
--------------------------------------------------------------------------------
/examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO_extending.py:
--------------------------------------------------------------------------------
1 | '''Train CIFAR10/CIFAR100 with PyTorch.'''
2 | def main(args):
3 | config=json.load(open(args.deepspeed_config))
4 |
5 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']:
6 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'")
7 | return None
8 |
9 |
10 | transformation = torchvision.transforms.Compose([
11 | torchvision.transforms.Resize(args.dimension),
12 | torchvision.transforms.ToTensor(),
13 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)),
14 | ])
15 |
16 | if torch.distributed.get_rank() != 0:
17 | # might be downloading cifar data, let rank 0 download first
18 | torch.distributed.barrier()
19 |
20 |
21 | # Data
22 | if args.cifar_data=='CIFAR10':
23 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation)
24 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation)
25 | elif args.cifar_data=='CIFAR100':
26 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation)
27 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation)
28 | else:
29 | return "Must specify datasets as CIFAR10 or CIFAR100"
30 |
31 |
32 | if torch.distributed.get_rank() == 0:
33 | # cifar data is downloaded, indicate other ranks can proceed
34 | torch.distributed.barrier()
35 |
36 |     testloader = torch.utils.data.DataLoader(testset, batch_size=10, shuffle=False, num_workers=2) # must use num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746
37 |
38 | # Model
39 | print('==> Building and fixing model..', args.model,'. Mode: ', args.clipping_mode)
40 | # https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py#L376
41 | # embed_dim a.k.a. width, mlp_ratio=MLP/embed_dim, depth is number of blocks
42 | if args.model!='vitANY':
43 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:]))
44 | else:
45 | net = timm.models.vision_transformer.VisionTransformer(embed_dim=768,num_heads=12,depth=12,mlp_ratio=4,num_classes=int(args.cifar_data[5:]))
46 |
47 | if 'BiTFiT' in args.clipping_mode: # not needed for DP-BiTFiT but use here for safety
48 | for name,param in net.named_parameters():
49 | if '.bias' not in name:
50 | param.requires_grad_(False)
51 |
52 | criterion = nn.CrossEntropyLoss()
53 |
54 | if 'nonDP' not in args.clipping_mode:
55 | PrivacyEngine_Distributed_extending(
56 | net,
57 | batch_size=config['train_batch_size'],
58 | sample_size=len(trainset),
59 | epochs=args.epochs,
60 | target_epsilon=args.epsilon,
61 | num_GPUs=torch.distributed.get_world_size(),
62 | torch_seed_is_fixed=(args.seed_fixed>=0), # better use False?
63 |             grad_accum_steps=config['train_batch_size']//config['train_micro_batch_size_per_gpu']//torch.distributed.get_world_size(), # integer division, to pass an int
64 | )
65 |
66 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
67 | print(f"Number of trainable parameters: {sum([p.numel() for p in net.parameters() if p.requires_grad])}({sum([p.numel() for p in net.parameters() if p.requires_grad])/sum([p.numel() for p in net.parameters()])})")
68 |
69 | optimizer = optim.Adam(net.parameters(), lr=args.lr)
70 |
71 | # Initialize DeepSpeed to use the following features
72 | # 1) Distributed model
73 | # 2) Distributed data loader
74 | # 3) DeepSpeed optimizer
75 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset)
76 |
77 |     fp16 = model_engine.fp16_enabled(); bf16 = model_engine.bfloat16_enabled()
78 | print(f'fp16={fp16},bf16={bf16}')
79 |
80 |
81 | def train(epoch):
82 |
83 | net.train()
84 | train_loss = 0
85 | correct = 0
86 | total = 0
87 |
88 |
89 | for batch_idx, data in enumerate(tqdm(trainloader)):
90 | # get the inputs; data is a list of [inputs, labels]
91 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
92 | if fp16:
93 | inputs = inputs.half()
94 | if bf16:
95 | inputs = inputs.bfloat16()
96 | outputs = model_engine(inputs)
97 |
98 | loss = criterion(outputs, targets)
99 |
100 | model_engine.backward(loss)
101 | model_engine.step()
102 |
103 | train_loss += loss.item()
104 | _, predicted = outputs.max(1)
105 | total += targets.size(0)
106 | correct += predicted.eq(targets).sum().item()
107 |
108 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
109 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
110 |
111 | def test(epoch):
112 | net.eval()
113 | test_loss = 0
114 | correct = 0
115 | total = 0
116 | with torch.no_grad():
117 | for batch_idx, data in enumerate(tqdm(testloader)):
118 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
119 | if fp16:
120 | inputs = inputs.half()
121 | if bf16:
122 | inputs = inputs.bfloat16()
123 | outputs = model_engine(inputs)
124 | #outputs = net(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py
125 | loss = criterion(outputs, targets)
126 |
127 | test_loss += loss.item()
128 | _, predicted = outputs.max(1)
129 | total += targets.size(0)
130 | correct += predicted.eq(targets).sum().item()
131 |
132 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
133 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
134 |
135 | for epoch in range(args.epochs):
136 | train(epoch)
137 | test(epoch)
138 |
139 |
140 | if __name__ == '__main__':
141 | import deepspeed
142 | import argparse
143 |
144 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
145 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
146 | parser.add_argument('--epochs', default=5, type=int,
147 |                         help='number of epochs')
148 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
149 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
150 | parser.add_argument('--model', default='vit_large_patch16_224', type=str)
151 | parser.add_argument('--cifar_data', type=str, default='CIFAR100')
152 | parser.add_argument('--dimension', type=int,default=224)
153 | parser.add_argument('--seed_fixed', type=int,default=3)
154 |
155 | parser.add_argument('--local_rank',
156 | type=int,
157 | default=-1,
158 | help='local rank passed from distributed launcher')
159 | # Include DeepSpeed configuration arguments
160 | parser = deepspeed.add_config_arguments(parser)
161 |
162 | args = parser.parse_args()
163 |
164 | from fastDP import PrivacyEngine_Distributed_extending
165 |
166 | import torch
167 | import torchvision
168 | if args.seed_fixed>=0:
169 |         torch.manual_seed(args.seed_fixed) # must be consistent with the privacy engine's torch_seed_is_fixed argument
170 | import torch.nn as nn
171 | import torch.optim as optim
172 | import timm
173 | from tqdm import tqdm
174 | import warnings; warnings.filterwarnings("ignore")
175 |
176 | import json
177 |
178 | import deepspeed
179 | deepspeed.init_distributed()
180 |
181 | main(args)
182 |
--------------------------------------------------------------------------------
/examples/image_classification/ZERO_examples/cifar_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "train_batch_size": 1024,
3 | "train_micro_batch_size_per_gpu": 32,
4 | "steps_per_print": 2000,
5 | "prescale_gradients": false,
6 | "bf16": {
7 | "enabled": false
8 | },
9 | "fp16": {
10 | "enabled": true,
11 | "fp16_master_weights_and_grads": false,
12 | "loss_scale": 1.0,
13 | "loss_scale_window": 1000,
14 | "hysteresis": 2,
15 | "min_loss_scale": 1,
16 | "initial_scale_power": 0
17 | },
18 | "wall_clock_breakdown": false,
19 | "zero_optimization": {
20 | "stage": 1,
21 | "allgather_partitions": true,
22 | "reduce_scatter": true,
23 | "allgather_bucket_size": 50000000,
24 | "reduce_bucket_size": 50000000,
25 | "overlap_comm": true,
26 | "contiguous_gradients": true,
27 | "cpu_offload": false,
28 | "stage3_max_live_parameters" : 1e8,
29 | "stage3_max_reuse_distance" : 1e8,
30 | "stage3_prefetch_bucket_size" : 1e7
31 | }
32 | }
33 |
--------------------------------------------------------------------------------
/examples/image_classification/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/image_classification/__init__.py
--------------------------------------------------------------------------------
/examples/requirements.txt:
--------------------------------------------------------------------------------
1 | argcomplete==1.12.1
2 | avro-python3==1.9.2.1
3 | azure-storage-blob==12.4.0
4 | bottle==0.12.20
5 | certifi==2023.7.22
6 | chardet==3.0.4
7 | charset-normalizer==2.0.4
8 | click==8.0.1
9 | crcmod==1.7
10 | cycler==0.10.0
11 | datasets
12 | diffimg==0.2.3
13 | docopt==0.6.2
14 | fastavro==1.4.1
15 | filelock==3.0.12
16 | fire
17 | fusepy==2.0.4
18 | future==0.18.3
19 | gdown>=5.0
20 | gpytorch
21 | httplib2==0.19.0
22 | idna==3.2
23 | imageio==2.9.0
24 | indexed-gzip-fileobj-fork-epicfaace==1.5.4
25 | isodate==0.6.0
26 | joblib==1.2.0
27 | kiwisolver==1.3.1
28 | lazy_loader==0.3
29 | markdown2==2.4.0
30 | marshmallow==2.15.1
31 | marshmallow-jsonapi==0.15.1
32 | matplotlib==3.4.3
33 | mock==2.0.0
34 | networkx==2.6.2
35 | nltk==3.9
36 | numpy>=1.21.2
37 | oauth2client==4.1.3
38 | packaging==21.0
39 | pandas==1.3.2
40 | pathtools==0.1.2
41 | pbr==5.6.0
42 | Pillow==10.2.0
43 | psutil==5.7.2
44 | pyasn1==0.4.8
45 | pyasn1-modules==0.2.8
46 | pycparser==2.20
47 | pydot==1.4.2
48 | pymongo==3.11.4
49 | pyparsing==2.4.7
50 | PySocks==1.7.1
51 | python-dateutil==2.8.*
52 | pytz==2021.1
53 | PyWavelets==1.1.1
54 | PyYAML==5.4.*
55 | regex==2021.8.3
56 | requests
57 | retry==0.9.2
58 | sacremoses==0.0.45
59 | scikit-image==0.18.2
60 | scikit-learn==1.5.0
61 | scipy>=1.7.1
62 | seaborn==0.11.2
63 | selenium==3.141.0
64 | sentence-transformers>=2.0.0
65 | sentencepiece==0.1.96
66 | sentry-sdk==1.14.0
67 | six==1.15.0
68 | SQLAlchemy==1.3.19
69 | termcolor==1.1.0
70 | threadpoolctl==2.2.0
71 | tifffile==2021.8.8
72 | tokenizers==0.10.3
73 | tqdm>=4.62.1
74 | transformers<=4.26
75 | typing-extensions==3.7.4.3
76 | urllib3==1.26.*
77 | watchdog==0.10.3
78 | websocket-client==1.0.1
79 |
--------------------------------------------------------------------------------
/examples/table2text/README.md:
--------------------------------------------------------------------------------
1 | ## DP natural language generation with Huggingface transformers
2 |
3 | ### Getting the data
4 |
5 | E2E and DART datasets are adapted from \[[Li & Liang, 2021](https://arxiv.org/abs/2101.00190)\] and hosted by \[[Li et al., 2021](https://arxiv.org/abs/2110.05679)\] at [Google drive](https://drive.google.com/file/d/1Re1wyUPtS3IalSsVVJhSg2sn8UNa7DM7/view?usp=sharing). To obtain the data, run
6 | ```plaintext
7 | gdown https://drive.google.com/uc?id=1Re1wyUPtS3IalSsVVJhSg2sn8UNa7DM7
8 | unzip prefix-tuning.zip
9 | ```
10 | This should produce a `table2text/prefix-tuning/data` subfolder that contains the datasets.
11 |
12 | ### Running on single GPU
13 |
14 | Use the `run.sh` script in the folder, which builds the command and runs `run_language_modeling.py`.
15 |
16 | For instance, run the following under the `examples` folder:
17 | ```plaintext
18 | bash table2text/run.sh table2text/prefix-tuning ToDeleteNLG "e2e" "gpt2"
19 | ```
20 |
21 | The script by default uses book-keeping (BK) from [Differentially Private Optimization on Large Model at Small Cost](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory cost without affecting accuracy. For E2E/DART, training `gpt2` on one A100 GPU (40GB) takes around 2.5/4 min per epoch.
22 |
23 | Arguments (sequentially):
24 | * `--output_dir`: path to a folder where results will be written
25 |
26 | * `--task_mode`: name of task; one of "e2e" and "dart"
27 |
28 | * `--model_name_or_path`: The pretrained model; one of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large".
29 |
30 | * `--target_epsilon`: Target privacy spending, default is 8.
31 |
32 | * `--clipping_fn`: Which per-sample gradient clipping function to use; one of `automatic` (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), `Abadi` [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), and `global` [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf); see the sketch after this list.
33 |
34 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `MixOpt` (default, meaning hybrid book-keeping), `MixGhostClip`, and `ghost`. All three modes are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf).
35 |
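To make the clipping functions concrete, here is a sketch (not the package's internal code) of the per-sample scaling factors they correspond to, with clipping threshold `R` (set by `--per_example_max_grad_norm`) and per-sample gradient 2-norms `g_norms`; the stability constant 0.01 follows the automatic clipping paper:

```python
import torch

R = 0.1                                   # per-sample clipping threshold
g_norms = torch.tensor([0.05, 0.2, 1.0])  # per-sample gradient 2-norms

abadi = torch.clamp(R / g_norms, max=1.0)  # Abadi et al.: min(1, R / ||g_i||)
automatic = R / (g_norms + 0.01)           # automatic clipping: R / (||g_i|| + 0.01)
# `global` instead keeps or discards each per-sample gradient by comparing
# ||g_i|| against a threshold; see Bu et al., 2021 for the exact rule.
```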
36 | ### Running on multi-GPU distributed learning
37 |
38 | Use `run_ZERO1.sh`, `run_ZERO23.sh`, and `run_ZERO_extending.py` in the folder for ZeRO 1, ZeRO 2+3, and ZeRO 1+2+3, respectively. The scripts read their configuration from `gpt_config_stage123.json`.
39 |
40 | For instance, run the following under the `examples` folder:
41 | ```plaintext
42 | bash table2text/run_ZERO1.sh table2text/prefix-tuning ToDeleteNLG "e2e" "gpt2"
43 | ```
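The ZeRO stage itself is selected in the DeepSpeed config through the standard `zero_optimization.stage` field; an illustrative fragment only (not the full shipped `gpt_config_stage123.json`, whose batch sizes may differ):
```json
{
  "train_batch_size": 1024,
  "train_micro_batch_size_per_gpu": 32,
  "zero_optimization": { "stage": 1 }
}
```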
44 |
45 | ### Evaluation
46 |
47 | The script automatically evaluates metrics such as the loss during training. To evaluate the generations with BLEU, ROUGE, METEOR, CIDEr, NIST, etc., we use the official [e2e-metrics](https://github.com/tuetschek/e2e-metrics) for E2E, and [GEM-metrics](https://github.com/GEM-benchmark/GEM-metrics) for DART.
48 |
49 | Specifically for E2E, after installing e2e-metrics in the `table2text` folder, run
50 | ```bash
51 | cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)
52 | python e2e-metrics/measure_scores.py prefix-tuning/data/e2e_data/clean_references_test.txt ..//generations_model/eval/global_step_00000420.txt
53 | ```
54 |
--------------------------------------------------------------------------------
/examples/table2text/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/table2text/__init__.py
--------------------------------------------------------------------------------
/examples/table2text/compiled_args.py:
--------------------------------------------------------------------------------
1 | """Compilation of all the arguments."""
2 | import logging
3 | import os
4 | import sys
5 | from dataclasses import dataclass, field
6 | from typing import Optional
7 |
8 | import transformers
9 |
10 | MODEL_CONFIG_CLASSES = list(transformers.MODEL_WITH_LM_HEAD_MAPPING.keys())
11 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
12 |
13 | TRUE_TAGS = ('y', 'yes', 't', 'true')
14 |
15 |
16 | # See all possible arguments in src/transformers/training_args.py
17 | # or by passing the --help flag to this script.
18 | # We now keep distinct sets of args, for a cleaner separation of concerns.
19 | @dataclass
20 | class ModelArguments:
21 | """
22 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
23 | """
24 | model_name_or_path: Optional[str] = field(
25 | default=None,
26 | metadata={
27 | "help": "The model checkpoint for weights initialization. Leave None if you want to train a model from "
28 | "scratch."
29 | },
30 | )
31 | model_type: Optional[str] = field(
32 | default=None,
33 | metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
34 | )
35 | config_name: Optional[str] = field(
36 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
37 | )
38 | tokenizer_name: Optional[str] = field(
39 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
40 | )
41 | cache_dir: Optional[str] = field(
42 | default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
43 | )
44 |
45 | static_lm_head: str = field(default='no')
46 | static_embedding: str = field(default='no')
47 | attention_only: str = field(default="no")
48 | bias_only: str = field(default="no")
49 |
50 | def __post_init__(self):
51 | self.static_lm_head = self.static_lm_head.lower() in TRUE_TAGS
52 | self.static_embedding = self.static_embedding.lower() in TRUE_TAGS
53 | self.attention_only = self.attention_only.lower() in TRUE_TAGS
54 | self.bias_only = self.bias_only.lower() in TRUE_TAGS
55 |
56 |
57 | @dataclass
58 | class DataTrainingArguments:
59 | """
60 | Arguments pertaining to what data we are going to input our model for training and eval.
61 | """
62 | data_folder: Optional[str] = field(default=None, metadata={"help": "Path to folder with all the data."})
63 |
64 | # Useful for truncating the dataset.
65 | max_train_examples: Optional[int] = field(default=sys.maxsize)
66 | max_valid_examples: Optional[int] = field(default=sys.maxsize)
67 | max_eval_examples: Optional[int] = field(default=sys.maxsize)
68 |
69 | line_by_line: bool = field(
70 | default=True,
71 | metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."},
72 | )
73 | task_mode: Optional[str] = field(
74 | default=None, metadata={"help": "The name of the task."}
75 | )
76 | format_mode: Optional[str] = field(
77 | default='cat', metadata={"help": "The mode of data2text format (cat, peek, nopeek)"}
78 | )
79 | max_source_length: Optional[int] = field(
80 | default=512, metadata={"help": "the max source length of summarization data. "}
81 | )
82 | train_max_target_length: Optional[int] = field(
83 | default=100, metadata={"help": "the max target length for training data. "}
84 | )
85 | val_max_target_length: Optional[int] = field(
86 | default=100, metadata={"help": "the max target length for dev data. "}
87 | )
88 | block_size: int = field(
89 | default=-1,
90 | metadata={
91 |             "help": "Optional input sequence length after tokenization. "
92 |                     "The training dataset will be truncated into blocks of this size for training. "
93 |                     "Defaults to the model max input length for single sentence inputs (taking into account special "
94 |                     "tokens)."
95 | },
96 | )
97 | overwrite_cache: bool = field(
98 | default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
99 | )
100 | max_seq_len: int = field(default=sys.maxsize)
101 |
102 |
103 | def __post_init__(self):
104 | if self.data_folder is not None:
105 |             logging.warning('Overriding dataset paths using those given in `data_folder`')
106 |
107 | if self.task_mode == "e2e":
108 | self.train_data_file = os.path.join(self.data_folder, 'src1_train.txt')
109 | self.valid_data_file = os.path.join(self.data_folder, 'src1_valid.txt')
110 | self.eval_data_file = os.path.join(self.data_folder, 'src1_test.txt')
111 |
112 | self.train_prompt_file = os.path.join(self.data_folder, 'prompts_train.txt')
113 | self.val_prompt_file = os.path.join(self.data_folder, 'prompts_valid.txt')
114 | self.eval_prompt_file = os.path.join(self.data_folder, 'prompts_test.txt')
115 |
116 | elif self.task_mode == "dart":
117 | self.train_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-train.json')
118 | self.valid_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-dev.json')
119 | self.eval_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-test.json')
120 |
121 | self.train_prompt_file = os.path.join(self.data_folder, 'prompts_train.txt')
122 | self.val_prompt_file = os.path.join(self.data_folder, 'prompts_valid.txt')
123 | self.eval_prompt_file = os.path.join(self.data_folder, 'prompts_test.txt')
124 |
125 |
126 | @dataclass
127 | class TrainingArguments(transformers.TrainingArguments):
128 | max_eval_batches: int = field(default=-1, metadata={"help": "Maximum number of evaluation steps to run."})
129 | max_generations: int = field(default=sys.maxsize)
130 | max_generations_train: int = field(default=10)
131 | max_generations_valid: int = field(default=10)
132 | skip_generation: str = field(default="no")
133 |
134 | ema_model_averaging: str = field(default="no")
135 | ema_model_gamma: float = field(default=0.99)
136 | ema_model_start_from: int = field(default=1000)
137 | lr_decay: str = field(default="yes")
138 | eval_epochs: int = field(default=10)
139 |
140 | deepspeed_config: str = field(default=None)
141 | num_GPUs: int = field(default=1)
142 | logical_batch_size: int = field(default=None)
143 |
144 | evaluate_during_training: str = field(
145 | default="yes",
146 | metadata={"help": "Run evaluation during training at each logging step."},
147 | )
148 | evaluate_before_training: str = field(
149 | default="yes",
150 | metadata={"help": "Run evaluation before training."},
151 | )
152 | save_at_last: str = field(default="no", metadata={"help": "Save at the end of training."})
153 |
154 | def __post_init__(self):
155 | super(TrainingArguments, self).__post_init__()
156 | self.skip_generation = self.skip_generation.lower() in ('y', 'yes')
157 | self.ema_model_averaging = (self.ema_model_averaging.lower() in ('y', 'yes'))
158 | self.lr_decay = (self.lr_decay.lower() in ('y', 'yes'))
159 |         self.evaluate_during_training = (self.evaluate_during_training.lower() in ('y', 'yes'))
160 |         self.evaluate_before_training = (self.evaluate_before_training.lower() in ('y', 'yes'))
161 |         self.save_at_last = (self.save_at_last.lower() in ('y', 'yes'))
162 |
163 |
164 | @dataclass
165 | class PrivacyArguments:
166 | """Arguments for differentially private training."""
167 | per_example_max_grad_norm: float = field(
168 | default=.1, metadata={
169 | "help": "Clipping 2-norm of per-sample gradients."
170 | }
171 | )
172 | noise_multiplier: float = field(
173 | default=None, metadata={
174 | "help": "Standard deviation of noise added for privacy; if `target_epsilon` is specified, "
175 |                     "use the noise multiplier found by searching for the target privacy budget."
176 | }
177 | )
178 | target_epsilon: float = field(
179 | default=None, metadata={
180 | "help": "Privacy budget; if `None` use the noise multiplier specified."
181 | }
182 | )
183 | target_delta: float = field(
184 | default=None, metadata={
185 |             "help": "Failure probability (delta) in approximate differential privacy; if `None`, use 1 / len(train_data)."
186 | }
187 | )
188 | accounting_mode: str = field(
189 | default="rdp", metadata={"help": "One of `rdp`, `glw`, `all`."}
190 | )
191 | non_private: str = field(default="no")
192 | clipping_mode: str = field(default="ghost")
193 | clipping_fn: str = field(default="automatic")
194 | clipping_style: str = field(default="all-layer")
195 | torch_seed_is_fixed: bool = field(default=True)
196 |
197 | def __post_init__(self):
198 | self.non_private = self.non_private.lower() in ('y', 'yes')
199 |
--------------------------------------------------------------------------------
/examples/table2text/data_utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/table2text/data_utils/__init__.py
--------------------------------------------------------------------------------
/examples/table2text/data_utils/data_collator.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union
3 |
4 | import torch
5 | from torch.nn.utils.rnn import pad_sequence
6 |
7 | from transformers.tokenization_utils import PreTrainedTokenizer
8 | from transformers.tokenization_utils_base import BatchEncoding, PaddingStrategy
9 | from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
10 |
11 |
12 | InputDataClass = NewType("InputDataClass", Any)
13 |
14 | """
15 | A DataCollator is a function that takes a list of samples from a Dataset
16 | and collates them into a batch, as a dictionary of Tensors.
17 | """
18 | DataCollator = NewType("DataCollator", Callable[[List[InputDataClass]], Dict[str, torch.Tensor]])
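# Illustrative example (assumed shapes, not from the original file): three encodings such as
#   [{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}, {"input_ids": [10]}]
# would be collated by the classes below into one right-padded tensor batch, e.g.,
#   {"input_ids": tensor([[5, 6, 7], [8, 9, P], [10, P, P]]), ...} where P is the pad id.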
19 |
20 |
21 | @dataclass
22 | class DataCollatorForData2TextLanguageModeling:
23 | """
24 | Data collator used for language modeling.
25 | - collates batches of tensors, honoring their tokenizer's pad_token
26 | - preprocesses batches for masked language modeling
27 | """
28 | tokenizer: PreTrainedTokenizer
29 | mlm: bool = True
30 | format_mode: str = 'cat'
31 | mlm_probability: float = 0.15
32 |
33 | def __call__(
34 | self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
35 | ) -> Dict[str, torch.Tensor]:
36 | if isinstance(examples[0], (dict, BatchEncoding)):
37 | examples = [e["input_ids"] for e in examples]
38 | input_ids, labels, src, tgt, cate = zip(*examples)
39 | if self.mlm:
40 |             inputs, labels = self.mask_tokens(self._tensorize_batch(input_ids))  # `batch` was previously undefined here.
41 | return {"input_ids": inputs, "labels": labels}
42 | else:
43 | if self.format_mode == 'cat':
44 | mode_input = 3
45 | elif self.format_mode == 'peek':
46 | mode_input = 1
47 | elif self.format_mode == 'nopeek':
48 | mode_input = 2
49 | elif self.format_mode == 'infix':
50 | mode_input = 4
51 |             else: raise ValueError(f"Unknown format_mode: {self.format_mode}")  # Guard against an undefined `mode_input` below.
52 | # mode_input = 1 # means that we take the input again.
53 | # mode_input = 2 # means that we do not peek at src again.
54 | # mode_input = 3 # means that we look at the categories, and see the input again.
55 |
56 | if mode_input == 1:
57 | # input, batch
58 | batch = self._tensorize_batch(input_ids)
59 | labels = self._tensorize_batch(labels)
60 | src = self._tensorize_batch(src)
61 | cate_batch, cate_attn = None, None
62 | # tgt = self._tensorize_batch(tgt)
63 | elif mode_input == 2:
64 | # nopeek.
65 | batch = self._tensorize_batch(tgt)
66 | labels = batch.clone()
67 | src = self._tensorize_batch(src)
68 | cate_batch, cate_attn = None, None
69 | elif mode_input == 3:
70 | batch = self._tensorize_batch(input_ids)
71 | labels = self._tensorize_batch(labels)
72 | src = self._tensorize_batch(cate)
73 | cate_batch, cate_attn = None, None
74 | elif mode_input == 4:
75 | batch = self._tensorize_batch(tgt)
76 | labels = batch.clone()
77 | src = self._tensorize_batch(src)
78 |
79 | cate_batch = self._tensorize_batch(cate)
80 | cate_attn = (cate_batch != self.tokenizer.pad_token_id)
81 |
82 | labels[labels == self.tokenizer.pad_token_id] = -100 # tgt
83 | src_attn = (src != self.tokenizer.pad_token_id) # src
84 | tgt_attn = (batch != self.tokenizer.pad_token_id) # tgt
85 |
86 | if cate_batch is None:
87 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn,
88 | 'src':src}
89 | else:
90 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn': tgt_attn,
91 | 'src': src, "cate_batch":cate_batch, "cate_attn":cate_attn}
92 |
93 | def _tensorize_batch(
94 | self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
95 | ) -> torch.Tensor:
96 | # In order to accept both lists of lists and lists of Tensors
97 | if isinstance(examples[0], (list, tuple)):
98 | examples = [torch.tensor(e, dtype=torch.long) for e in examples]
99 | length_of_first = examples[0].size(0)
100 | are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)
101 | if are_tensors_same_length:
102 | return torch.stack(examples, dim=0)
103 | else:
104 | if self.tokenizer._pad_token is None:
105 | raise ValueError(
106 | "You are attempting to pad samples but the tokenizer you are using"
107 | f" ({self.tokenizer.__class__.__name__}) does not have one."
108 | )
109 | return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)
110 |
111 | def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
112 | """
113 | Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
114 | """
115 |
116 | if self.tokenizer.mask_token is None:
117 | raise ValueError(
118 | "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
119 | )
120 |
121 | labels = inputs.clone()
122 | # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
123 | probability_matrix = torch.full(labels.shape, self.mlm_probability)
124 | special_tokens_mask = [
125 | self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
126 | ]
127 | probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
128 | if self.tokenizer._pad_token is not None:
129 | padding_mask = labels.eq(self.tokenizer.pad_token_id)
130 | probability_matrix.masked_fill_(padding_mask, value=0.0)
131 | masked_indices = torch.bernoulli(probability_matrix).bool()
132 | labels[~masked_indices] = -100 # We only compute loss on masked tokens
133 |
134 | # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
135 | indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
136 | inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)
137 |
138 | # 10% of the time, we replace masked input tokens with random word
139 | indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
140 | random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
141 | inputs[indices_random] = random_words[indices_random]
142 |
143 | # The rest of the time (10% of the time) we keep the masked input tokens unchanged
144 | return inputs, labels
145 |
146 |
147 | @dataclass
148 | class DataCollatorForSumLanguageModeling:
149 | """
150 | Data collator used for language modeling.
151 | - collates batches of tensors, honoring their tokenizer's pad_token
152 | - preprocesses batches for masked language modeling
153 | """
154 | tokenizer: PreTrainedTokenizer
155 | mlm: bool = True
156 | format_mode: str = 'cat'
157 | mlm_probability: float = 0.15
158 |
159 | def __call__(
160 | self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
161 | ) -> Dict[str, torch.Tensor]:
162 | if isinstance(examples[0], (dict, BatchEncoding)):
163 | examples = [e["input_ids"] for e in examples]
164 | # print(examples[0])
165 | # print(len(examples))
166 | input_ids, labels, src, tgt = zip(*examples)
167 | # print(len(input_ids), len(labels), len(weights))
168 | if self.mlm:
169 |             inputs, labels = self.mask_tokens(self._tensorize_batch(input_ids))  # `batch` was previously undefined here.
170 | return {"input_ids": inputs, "labels": labels}
171 | else:
172 |
173 | # print(self.format_mode)
174 |
175 | if self.format_mode == 'peek' or self.format_mode == 'cat':
176 | mode_input = 1
177 | elif self.format_mode == 'nopeek':
178 | assert False, 'should use format_mode = peek or cat.'
179 | mode_input = 2
180 | elif self.format_mode == 'infix':
181 | assert False, 'should use format_mode = peek or cat.'
182 | mode_input = 4
183 |             else: raise ValueError(f"Unknown format_mode: {self.format_mode}")  # Guard against an undefined `mode_input` below.
184 | # mode_input = 1 # means that we take the input again.
185 | # mode_input = 2 # means that we do not peek at src again.
186 | # mode_input = 3 # means that we look at the categories, and see the input again.
187 |
188 | # print(self.format_mode, mode_input)
189 |
190 | if mode_input == 1:
191 | # input, batch
192 | batch = self._tensorize_batch(input_ids)
193 | labels = self._tensorize_batch(labels)
194 | src = self._tensorize_batch(src)
195 |
196 | labels[labels == self.tokenizer.pad_token_id] = -100 # tgt
197 | src_attn = (src != self.tokenizer.pad_token_id) # src
198 | tgt_attn = (batch != self.tokenizer.pad_token_id) # tgt
199 |
200 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn,
201 | 'src':src}
202 |
203 |
204 | def _tensorize_batch(
205 | self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
206 | ) -> torch.Tensor:
207 | # In order to accept both lists of lists and lists of Tensors
208 | if isinstance(examples[0], (list, tuple)):
209 | examples = [torch.tensor(e, dtype=torch.long) for e in examples]
210 | length_of_first = examples[0].size(0)
211 | are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)
212 | if are_tensors_same_length:
213 | return torch.stack(examples, dim=0)
214 | else:
215 | if self.tokenizer._pad_token is None:
216 | raise ValueError(
217 | "You are attempting to pad samples but the tokenizer you are using"
218 | f" ({self.tokenizer.__class__.__name__}) does not have one."
219 | )
220 | return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)
221 |
--------------------------------------------------------------------------------
/examples/table2text/decoding_utils.py:
--------------------------------------------------------------------------------
1 | """Utilities for generation."""
2 | import logging
3 | import sys
4 | from typing import Optional
5 |
6 | import tqdm
7 | import transformers
8 |
9 |
10 | def generate(
11 | model: transformers.PreTrainedModel,
12 | tokenizer: transformers.PreTrainedTokenizer,
13 | loader=None,
14 | prompt_dataset=None,
15 | max_length=100,
16 | min_length=5,
17 | top_k=0,
18 | top_p=0.9, # Only filter with top_p.
19 | repetition_penalty=1,
20 | do_sample=False,
21 | num_beams=5,
22 | bad_words_ids=None,
23 | dummy_token_id=-100, # Used as mask.
24 | num_return_sequences=1,
25 | max_generations=sys.maxsize,
26 | device=None,
27 | padding_token="[PAD]",
28 | **kwargs,
29 | ):
30 |     assert not model.training, "Generation must be performed with `model` in eval mode."
31 | if kwargs:
32 | logging.warning(f"Unknown kwargs: {kwargs}")
33 |
34 |     # These are linebreaks; generating these will mess up the evaluation, since those files assume one example per line.
35 | if bad_words_ids is None:
36 | bad_words_ids = [[628], [198]]
37 | if padding_token in tokenizer.get_vocab():
38 | bad_words_ids.append(tokenizer.encode(padding_token))
39 |
40 | kwargs = dict(
41 | model=model,
42 | tokenizer=tokenizer,
43 | max_length=max_length,
44 | min_length=min_length,
45 | top_k=top_k,
46 | top_p=top_p,
47 | repetition_penalty=repetition_penalty,
48 | do_sample=do_sample,
49 | num_beams=num_beams,
50 | bad_words_ids=bad_words_ids,
51 | dummy_token_id=dummy_token_id,
52 | num_return_sequences=num_return_sequences,
53 | max_generations=max_generations,
54 | device=device,
55 | padding_token=padding_token,
56 | )
57 | if loader is not None:
58 | result = _generate_with_loader(loader=loader, **kwargs)
59 | elif prompt_dataset is not None:
60 | result = _generate_with_prompt_dataset(prompt_dataset=prompt_dataset, **kwargs)
61 | else:
62 | raise ValueError(f"`loader` and `prompt_dataset` cannot both be `None`.")
63 |
64 | return result
65 |
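# Illustrative usage sketch (`eval_loader` is a hypothetical name): with `model.eval()`
# and a DataLoader yielding batches that contain "input_ids" and "labels", one might call:
#   full, unstripped, stripped, refs = generate(
#       model, tokenizer, loader=eval_loader, max_length=100, num_beams=5, device="cuda",
#   )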
66 |
67 | def _generate_with_loader(
68 | loader,
69 |
70 | model,
71 | tokenizer: transformers.PreTrainedTokenizer,
72 | max_length,
73 | min_length,
74 | top_k,
75 | top_p,
76 | repetition_penalty,
77 | do_sample,
78 | num_beams,
79 | bad_words_ids,
80 | dummy_token_id,
81 | num_return_sequences,
82 | max_generations,
83 | device,
84 | padding_token,
85 | ):
86 | references = []
87 | full_generations = [] # Sentences including the prompt part.
88 | unstripped_generations = []
89 | generations = []
90 |
91 | stop_generation = False
92 | for batch_idx, batch in tqdm.tqdm(enumerate(loader), desc="generation"):
93 | if stop_generation:
94 | break
95 |
96 | batch_input_ids, batch_labels = batch["input_ids"], batch["labels"]
97 | # e.g., inputs_ids may be [[95, 123, 32], [198, 19, 120]], and
98 |         # labels may be [[-100, 123, 32], [-100, -100, 120]]
99 |
100 | for input_ids, labels in zip(batch_input_ids, batch_labels):
101 | if stop_generation:
102 | break
103 |
104 | # Find the first pad token and end the sentence from there!
105 | if padding_token in tokenizer.get_vocab():
106 | pad_positions, = (
107 | input_ids == tokenizer.encode(padding_token, return_tensors="pt").squeeze()
108 | ).nonzero(as_tuple=True)
109 | # Some sentences might have padding; others might not.
110 | if pad_positions.numel() == 0:
111 | first_pad_position = None
112 | else:
113 | first_pad_position = pad_positions[0]
114 | reference_str: str = tokenizer.decode(input_ids[:first_pad_position], clean_up_tokenization_spaces=True)
115 | else:
116 | reference_str: str = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True)
117 | references.append(reference_str)
118 |
119 | # Find the first non- -100 position. Note there are trailing -100s.
120 | non_prompt_positions, = (labels != dummy_token_id).nonzero(as_tuple=True)
121 | first_non_prompt_position = non_prompt_positions[0].item()
122 | prompt_len = first_non_prompt_position
123 | prompt_ids = input_ids[:prompt_len]
124 |
125 | output_ids = model.generate(
126 | input_ids=prompt_ids[None, ...].to(device),
127 | max_length=max_length + prompt_len, # This cannot be a 0-D tensor!
128 | min_length=min_length,
129 | top_k=top_k,
130 | top_p=top_p,
131 | repetition_penalty=repetition_penalty,
132 | do_sample=do_sample,
133 | bad_words_ids=bad_words_ids,
134 | num_return_sequences=num_return_sequences,
135 | num_beams=num_beams,
136 | pad_token_id=tokenizer.eos_token_id, # Stop the stupid logging...
137 | )
138 | output_ids = output_ids.squeeze(dim=0) # Throw away batch dimension.
139 |
140 | whole_str: str = tokenizer.decode(output_ids, clean_up_tokenization_spaces=True)
141 | prompt_str: str = tokenizer.decode(prompt_ids, clean_up_tokenization_spaces=True)
142 | output_str: str = whole_str[len(prompt_str):]
143 |
144 | full_generations.append(whole_str)
145 | del whole_str, prompt_str
146 |
147 | # Remove potential eos_token at the end.
148 | eos_position: Optional[int] = output_str.find(tokenizer.eos_token)
149 | if eos_position == -1: # Didn't generate eos_token; that's okay -- just skip!
150 | eos_position = None
151 | output_str = output_str[:eos_position]
152 | unstripped_generations.append(output_str)
153 |
154 | # Removing leading and trailing spaces.
155 | output_str = output_str.strip()
156 |
157 | generations.append(output_str)
158 |
159 | if len(generations) >= max_generations:
160 | stop_generation = True
161 |
162 | return full_generations, unstripped_generations, generations, references
163 |
164 |
165 | def _generate_with_prompt_dataset(
166 | prompt_dataset,
167 |
168 | model,
169 | tokenizer,
170 | max_length,
171 | min_length,
172 | top_k,
173 | top_p,
174 | repetition_penalty,
175 | do_sample,
176 | num_beams,
177 | bad_words_ids,
178 | dummy_token_id,
179 | num_return_sequences,
180 | max_generations,
181 | device,
182 | padding_token,
183 | ):
184 | references = []
185 | full_generations = [] # Sentences including the prompt part.
186 | unstripped_generations = []
187 | generations = []
188 |
189 | stop_generation = False
190 | for input_ids in tqdm.tqdm(prompt_dataset, desc="generation"):
191 | if stop_generation:
192 | break
193 |
194 | prompt_len = len(input_ids[0])
195 | output_ids = model.generate(
196 | input_ids=input_ids.to(device),
197 | max_length=max_length + prompt_len, # This cannot be a 0-D tensor!
198 | min_length=min_length,
199 | top_k=top_k,
200 | top_p=top_p,
201 | repetition_penalty=repetition_penalty,
202 | do_sample=do_sample,
203 | bad_words_ids=bad_words_ids,
204 | num_return_sequences=num_return_sequences,
205 | num_beams=num_beams,
206 | pad_token_id=tokenizer.eos_token_id, # Stop the stupid logging...
207 | )
208 | output_ids = output_ids.squeeze(dim=0) # Throw away batch dimension.
209 | input_ids = input_ids.squeeze(dim=0)
210 |
211 | whole_str: str = tokenizer.decode(output_ids, clean_up_tokenization_spaces=True)
212 | prompt_str: str = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True)
213 | output_str: str = whole_str[len(prompt_str):]
214 |
215 | full_generations.append(whole_str)
216 | del whole_str, prompt_str
217 |
218 | # Remove potential eos_token at the end.
219 | eos_position: Optional[int] = output_str.find(tokenizer.eos_token)
220 | if eos_position == -1: # Didn't generate eos_token; that's okay -- just skip!
221 | eos_position = None
222 | output_str = output_str[:eos_position]
223 | unstripped_generations.append(output_str)
224 |
225 | # Removing leading and trailing spaces.
226 | output_str = output_str.strip()
227 |
228 | generations.append(output_str)
229 |
230 | if len(generations) >= max_generations:
231 | stop_generation = True
232 | return full_generations, unstripped_generations, generations, references
233 |
--------------------------------------------------------------------------------
/examples/table2text/gpt_config_stage123.json:
--------------------------------------------------------------------------------
1 | {
2 | "bf16": {
3 | "enabled": true
4 | },
5 | "fp16": {
6 | "enabled": false,
7 | "fp16_master_weights_and_grads": false,
8 | "loss_scale": 1,
9 | "loss_scale_window": 1000,
10 | "hysteresis": 2,
11 | "min_loss_scale": 1,
12 | "initial_scale_power": 3
13 | },
14 | "train_micro_batch_size_per_gpu": 999999999,
15 | "wall_clock_breakdown": false,
16 | "zero_optimization": {
17 | "stage": 1,
18 | "allgather_partitions": true,
19 | "reduce_scatter": true,
20 | "allgather_bucket_size": 50000000,
21 | "reduce_bucket_size": 50000000,
22 | "overlap_comm": true,
23 | "contiguous_gradients": true,
24 | "cpu_offload": false
25 | }
26 | }
27 |
--------------------------------------------------------------------------------
/examples/table2text/misc.py:
--------------------------------------------------------------------------------
1 | """Miscellaneous utilities.
2 |
3 | Mostly bespoke data loaders at the moment.
4 | """
5 |
6 | from transformers import (
7 | DataCollatorForLanguageModeling,
8 | DataCollatorForPermutationLanguageModeling,
9 | PreTrainedTokenizer
10 | )
11 |
12 | try:
13 | from .compiled_args import DataTrainingArguments
14 | from .data_utils.data_collator import DataCollatorForData2TextLanguageModeling
15 | from .data_utils.language_modeling import LineByLineE2ETextDataset, LineByLineTriplesTextDataset
16 | except ImportError:
17 | from compiled_args import DataTrainingArguments
18 | from data_utils.data_collator import DataCollatorForData2TextLanguageModeling
19 | from data_utils.language_modeling import LineByLineE2ETextDataset, LineByLineTriplesTextDataset
20 |
21 |
22 | def get_dataset_with_path(
23 | data_args: DataTrainingArguments,
24 | tokenizer: PreTrainedTokenizer,
25 | file_path: str,
26 | max_examples: int,
27 | **_,
28 | ):
29 | if data_args.line_by_line:
30 | if data_args.task_mode == 'e2e':
31 | dataset = LineByLineE2ETextDataset(
32 | tokenizer=tokenizer,
33 | file_path=file_path,
34 | block_size=data_args.block_size,
35 | bos_tok=tokenizer.bos_token,
36 | eos_tok=tokenizer.eos_token,
37 | max_seq_len=data_args.max_seq_len,
38 | max_examples=max_examples,
39 | )
40 | elif data_args.task_mode == 'dart':
41 | dataset = LineByLineTriplesTextDataset(
42 | tokenizer=tokenizer,
43 | file_path=file_path,
44 | block_size=data_args.block_size,
45 | bos_tok=tokenizer.bos_token,
46 | eos_tok=tokenizer.eos_token,
47 | max_seq_len=data_args.max_seq_len,
48 | max_examples=max_examples,
49 | )
50 | else:
51 | raise ValueError(f"Unknown `args.task_mode`: {data_args.task_mode}")
52 |
53 | else:
54 |         raise ValueError("The table2text task doesn't support anything other than line_by_line!")
55 | return dataset
56 |
57 |
58 | def get_prompt_dataset(file_path, tokenizer):
59 | with open(file_path, 'r') as f:
60 | lines = f.readlines()
61 | encoded_lines = [
62 | tokenizer.encode(line.strip(), add_special_tokens=False, return_tensors="pt")
63 | for line in lines
64 | ]
65 | return encoded_lines
66 |
67 |
68 | def get_all_datasets(config, tokenizer, data_args, model_args, **_):
69 | kwargs = dict(data_args=data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir)
70 | train_dataset = get_dataset_with_path(
71 | **kwargs, file_path=data_args.train_data_file, max_examples=data_args.max_train_examples
72 | )
73 | valid_dataset = get_dataset_with_path(
74 | **kwargs, file_path=data_args.valid_data_file, max_examples=data_args.max_valid_examples
75 | )
76 | eval_dataset = get_dataset_with_path(
77 | **kwargs, file_path=data_args.eval_data_file, max_examples=data_args.max_eval_examples
78 | )
79 |
80 | if config.model_type == "xlnet":
81 | data_collator = DataCollatorForPermutationLanguageModeling(
82 | tokenizer=tokenizer,
83 | plm_probability=data_args.plm_probability,
84 | max_span_length=data_args.max_span_length,
85 | )
86 | else:
87 | if data_args.task_mode == 'e2e' or data_args.task_mode == 'dart':
88 | data_collator = DataCollatorForData2TextLanguageModeling(
89 | tokenizer=tokenizer, mlm=False, format_mode=data_args.format_mode
90 | )
91 | else:
92 | data_collator = DataCollatorForLanguageModeling(
93 | tokenizer=tokenizer, mlm=False,
94 | )
95 |
96 | return train_dataset, valid_dataset, eval_dataset, data_collator
97 |
--------------------------------------------------------------------------------
/examples/table2text/models.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch import nn
3 | from transformers import GPT2PreTrainedModel, GPT2LMHeadModel
4 |
5 |
6 | class _View(nn.Module):
7 | def __init__(self, shape):
8 | super(_View, self).__init__()
9 | self.shape = shape
10 |
11 | def forward(self, x):
12 | return x.reshape(*self.shape)
13 |
14 |
15 | class PrefixTuner(GPT2PreTrainedModel):
16 | """A minimalistic implementation of the core components."""
17 |
18 | def __init__(self, config, model_args, gpt2=None):
19 | super(PrefixTuner, self).__init__(config=config)
20 |
21 |         # Instantiate a GPT-2, and DON'T optimize it!
22 | if gpt2 is None:
23 | self.gpt2 = GPT2LMHeadModel.from_pretrained(
24 | model_args.model_name_or_path, config=config, cache_dir=model_args.cache_dir,
25 | )
26 | else:
27 | self.gpt2 = gpt2
28 |
29 | self.register_buffer('extra_prefix_ids', torch.arange(model_args.prefix_len))
30 | # TODO: Also introduce the easier net.
31 | self.extra_prefix_net = nn.Sequential(
32 | nn.Embedding(model_args.prefix_len, config.n_embd),
33 | nn.Linear(config.n_embd, model_args.mid_dim),
34 | nn.Tanh(),
35 | nn.Linear(model_args.mid_dim, config.n_layer * 2 * config.n_embd),
36 | _View((-1, model_args.prefix_len, config.n_layer * 2, config.n_head, config.n_embd // config.n_head)),
37 | nn.Dropout(model_args.prefix_dropout),
38 | )
39 |
40 | def make_past_key_values(self, bsz=None):
41 | extra_prefix_ids = self.extra_prefix_ids[None, :].expand(bsz, -1)
42 | past_key_values = self.extra_prefix_net(extra_prefix_ids)
43 |         # After the permute + split: a tuple of n_layer tensors, each of shape
44 |         # (2, batch_size, n_head, prefix_len, n_embd // n_head), e.g., (2, 1, 12, 5, 64).
45 | past_key_values = past_key_values.permute([2, 0, 3, 1, 4]).split(2, dim=0)
46 | return past_key_values
47 |
48 | def state_dict(self):
49 | """Avoid storing GPT-2, since it's not even trained."""
50 | return self.extra_prefix_net.state_dict()
51 |
52 | def load_state_dict(self, state_dict):
53 | """Avoid loading GPT-2, since it's not even trained."""
54 | self.extra_prefix_net.load_state_dict(state_dict)
55 |
56 | @property
57 | def major_device(self):
58 | """Returns the device where the parameters are on."""
59 | return next(self.parameters()).device
60 |
61 | def forward(
62 | self,
63 | input_ids,
64 | attention_mask=None,
65 | token_type_ids=None,
66 | position_ids=None,
67 | head_mask=None,
68 | inputs_embeds=None,
69 | encoder_hidden_states=None,
70 | encoder_attention_mask=None,
71 | labels=None,
72 | use_cache=None,
73 | output_attentions=None,
74 | output_hidden_states=None,
75 | return_dict=None,
76 | **kwargs,
77 | ):
78 | past_key_values = self.make_past_key_values(bsz=input_ids.size(0))
79 | return self.gpt2(
80 | input_ids=input_ids,
81 | past_key_values=past_key_values,
82 | attention_mask=attention_mask,
83 | token_type_ids=token_type_ids,
84 | position_ids=position_ids,
85 | head_mask=head_mask,
86 | inputs_embeds=inputs_embeds,
87 | encoder_hidden_states=encoder_hidden_states,
88 | encoder_attention_mask=encoder_attention_mask,
89 | labels=labels,
90 | use_cache=use_cache,
91 | output_attentions=output_attentions,
92 | output_hidden_states=output_hidden_states,
93 | return_dict=return_dict,
94 | **kwargs
95 | )
96 |
97 | def generate(self, input_ids, num_beams, **kwargs):
98 | # Additional files also changed:
99 | # src/transformers/generation_utils.py
100 | # src/transformers/models/gpt2/modeling_gpt2.py
101 |
102 | # A sanity check is to optimize the model for a few updates and check if the beam-search generations changed.
103 | # The confusing logic in generation_utils:
104 | # 1) `past` is used in `GPT2LMHeadModel:prepare_inputs_for_generation`,
105 | # 2) it's converted to `past_key_values` in that function,
106 | # 3) `past_key_values` is then updated in forward due to return_dict,
107 | # 4) `past` is set to `past_key_values` in `generation_utils:_update_model_kwargs_for_generation`
108 |
109 |         # This expansion step is important for generation, since otherwise the shapes are wrong.
110 | past_key_values = self.make_past_key_values(bsz=input_ids.size(0) * num_beams)
111 | # ---
112 |
113 | return self.gpt2.generate(
114 | input_ids=input_ids,
115 | num_beams=num_beams,
116 | past_key_values=past_key_values,
117 |
118 | use_cache=True,
119 | position_ids=None,
120 |
121 | # The logic: At beginning, past=None, and then it gets replaced with past_key_values.
122 | # Can't directly give in past, since otherwise, input_ids gets truncated to the last index.
123 | use_past_key_values_as_past_at_init=True,
124 | nullify_attention_mask=True,
125 | # ---
126 |
127 | **kwargs
128 | )
129 |
--------------------------------------------------------------------------------
/examples/table2text/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | data_dir=${1}
4 | output_dir=${2}
5 | task_mode=${3}
6 | model_name_or_path=${4:-"gpt2"} # One of distilgpt2, gpt2, gpt2-medium, gpt2-large
7 | target_epsilon=${5:-8}
8 | clipping_fn=${6:-"automatic"}
9 | clipping_mode=${7:-"MixOpt"}
10 | clipping_style=${8:-"all-layer"}
11 | bias_only=${9:-"no"}
12 | non_private=${10:-"no"}
13 | physical_batch_size=${11:-50}
14 | learning_rate=${12:-0.002}
15 | batch_size=${13:-1000}
16 | attention_only=${14:-"no"}
17 | static_lm_head=${15:-"no"}
18 | static_embedding=${16:-"no"}
19 |
20 | if [[ ${task_mode} == "e2e" ]]; then
21 | data_dir="${data_dir}/data/e2e_data"
22 | target_delta=8e-6
23 | num_train_epochs=10
24 | max_seq_len=100
25 | else
26 | if [[ ${task_mode} == "dart" ]]; then
27 | target_delta=1e-5
28 | data_dir="${data_dir}/data/dart"
29 | num_train_epochs=15 # Approximately same number of updates.
30 | learning_rate=5e-4 # Lower learning rate for stability in large models.
31 | max_seq_len=120
32 | else
33 | echo "Unknown task: ${task_mode}"
34 | exit 1
35 | fi
36 | fi
37 |
38 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size}))
39 |
40 | # Arguments in the last two lines are the most important.
41 | python table2text/run_language_modeling.py \
42 | --output_dir ${output_dir} --overwrite_output_dir \
43 | --task_mode ${task_mode} \
44 | --model_name_or_path ${model_name_or_path} \
45 | --tokenizer_name ${model_name_or_path} \
46 | --do_train --do_eval \
47 | --line_by_line \
48 | --save_steps 100 --save_total_limit 1 --save_at_last no \
49 | --logging_dir ${output_dir} --logging_steps -1 \
50 | --seed 0 \
51 | --eval_steps 100 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" --evaluate_during_training "no" --per_device_eval_batch_size 10 \
52 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \
53 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \
54 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \
55 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \
56 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \
57 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \
58 | --non_private ${non_private} \
59 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" \
60 |
61 |
62 |
63 |
64 |
65 |
66 |
--------------------------------------------------------------------------------
/examples/table2text/run_ZERO1.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | data_dir=${1:-"data/prefix-tuning"}
4 | output_dir=${2:-"data/output"}
5 | task_mode=${3:-"e2e"}
6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl","gptj"
7 | target_epsilon=${5:-8}
8 | clipping_fn=${6:-"automatic"}
9 | clipping_mode=${7:-"MixOpt"}
10 | clipping_style=${8:-"layer-wise"}
11 | bias_only=${9:-"no"}
12 | non_private=${10:-"no"}
13 | physical_batch_size=${11:-4}
14 | learning_rate=${12:-0.002}
15 | batch_size=${13:-1024}
16 | attention_only=${14:-"no"}
17 | static_lm_head=${15:-"no"}
18 | static_embedding=${16:-"no"}
19 | num_GPUs=${17:-8}
20 | deepspeed_config=${18:-"table2text/gpt_config_stage123.json"}
21 |
22 | if [[ ${task_mode} == "e2e" ]]; then
23 | data_dir="${data_dir}/data/e2e_data"
24 | target_delta=8e-6
25 | num_train_epochs=10
26 | max_seq_len=100
27 | if [[ ${bias_only} == "yes" ]]; then
28 | learning_rate=1e-2
29 | else
30 | learning_rate=2e-3
31 | fi
32 | else
33 | if [[ ${task_mode} == "dart" ]]; then
34 | target_delta=1e-5
35 | data_dir="${data_dir}/data/dart"
36 | num_train_epochs=15 # Approximately same number of updates.
37 | learning_rate=5e-4 # Lower learning rate for stability in large models.
38 | max_seq_len=120
39 | if [[ ${bias_only} == "yes" ]]; then
40 | learning_rate=2e-3
41 | else
42 | learning_rate=5e-4
43 | fi
44 |
45 | else
46 | echo "Unknown task: ${task_mode}"
47 | exit 1
48 | fi
49 | fi
50 |
51 | deepspeed table2text/run_language_modeling.py --deepspeed_config ${deepspeed_config} \
52 | --output_dir ${output_dir} --overwrite_output_dir \
53 | --task_mode ${task_mode} \
54 | --model_name_or_path ${model_name_or_path} \
55 | --tokenizer_name ${model_name_or_path} \
56 | --do_train --do_eval \
57 | --line_by_line \
58 | --save_steps 100 --save_total_limit 1 --save_at_last no \
59 | --logging_dir ${output_dir} --logging_steps -1 \
60 | --seed 0 \
61 | --dataloader_num_workers 2 \
62 | --eval_steps -1 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \
63 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \
64 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \
65 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \
66 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \
67 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \
68 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --logical_batch_size ${batch_size}\
69 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \
70 | --non_private ${non_private} \
71 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}"
72 |
--------------------------------------------------------------------------------
/examples/table2text/run_ZERO23.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | data_dir=${1:-"table2text/data/prefix-tuning"}
4 | output_dir=${2:-"table2text/data/output"}
5 | task_mode=${3:-"e2e"}
6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"
7 | target_epsilon=${5:-8}
8 | clipping_fn=${6:-"automatic"}
9 | clipping_mode=${7:-"MixOpt"}
10 | clipping_style=${8:-"layer-wise"}
11 | bias_only=${9:-"no"}
12 | non_private=${10:-"no"}
13 | physical_batch_size=${11:-4}
14 | learning_rate=${12:-0.001}
15 | batch_size=${13:-1024}
16 | attention_only=${14:-"no"}
17 | static_lm_head=${15:-"no"}
18 | static_embedding=${16:-"no"}
19 | num_GPUs=${17:-8}
20 | deepspeed_config=${18:-"table2text/gpt_config_stage123.json"}
21 |
22 | if [[ ${task_mode} == "e2e" ]]; then
23 | data_dir="${data_dir}/data/e2e_data"
24 | target_delta=8e-6
25 | num_train_epochs=10
26 | max_seq_len=100
27 | else
28 | if [[ ${task_mode} == "dart" ]]; then
29 | target_delta=1e-5
30 | data_dir="${data_dir}/data/dart"
31 | num_train_epochs=15 # Approximately same number of updates.
32 | learning_rate=5e-4 # Lower learning rate for stability in large models.
33 | max_seq_len=120
34 | else
35 | echo "Unknown task: ${task_mode}"
36 | exit 1
37 | fi
38 | fi
39 |
40 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size} / ${num_GPUs}))
41 |
42 | # Arguments in the last two lines are the most important.
43 | deepspeed table2text/run_language_modeling_ZERO23.py --deepspeed_config ${deepspeed_config} \
44 | --output_dir ${output_dir} --overwrite_output_dir \
45 | --task_mode ${task_mode} \
46 | --model_name_or_path ${model_name_or_path} \
47 | --tokenizer_name ${model_name_or_path} \
48 | --do_train --do_eval \
49 | --line_by_line \
50 | --save_steps 100 --save_total_limit 1 --save_at_last no \
51 | --logging_dir ${output_dir} --logging_steps -1 \
52 | --seed 0 \
53 | --dataloader_num_workers 2 \
54 | --eval_steps -1 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \
55 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \
56 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \
57 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \
58 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \
59 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \
60 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \
61 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \
62 | --non_private ${non_private} \
63 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}"
64 |
--------------------------------------------------------------------------------
/examples/table2text/run_ZERO_extending.py:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | data_dir=${1}
4 | output_dir=${2}
5 | task_mode=${3:-"e2e"}
6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"
7 | target_epsilon=${5:-8}
8 | bias_only=${6:-"no"}
9 | non_private=${7:-"no"}
10 | physical_batch_size=${8:-4}
11 | learning_rate=${9:-0.001}
12 | batch_size=${10:-1024}
13 | attention_only=${11:-"no"}
14 | static_lm_head=${12:-"no"}
15 | static_embedding=${13:-"no"}
16 | num_GPUs=${14:-8}
17 | deepspeed_config=${15:-"table2text/gpt_config_stage123.json"}
18 |
19 | if [[ ${task_mode} == "e2e" ]]; then
20 | data_dir="${data_dir}/data/e2e_data"
21 | target_delta=8e-6
22 | num_train_epochs=10
23 | max_seq_len=100
24 | else
25 | if [[ ${task_mode} == "dart" ]]; then
26 | target_delta=1e-5
27 | data_dir="${data_dir}/data/dart"
28 | num_train_epochs=15 # Approximately same number of updates.
29 | learning_rate=5e-4 # Lower learning rate for stability in large models.
30 | max_seq_len=120
31 | else
32 | echo "Unknown task: ${task_mode}"
33 | exit 1
34 | fi
35 | fi
36 |
37 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size} / ${num_GPUs}))
38 |
39 | deepspeed table2text/run_language_modeling_extending.py --deepspeed_config ${deepspeed_config} \
40 | --output_dir ${output_dir} --overwrite_output_dir \
41 | --task_mode ${task_mode} \
42 | --model_name_or_path ${model_name_or_path} \
43 | --tokenizer_name ${model_name_or_path} \
44 | --do_train --do_eval \
45 | --line_by_line \
46 | --save_steps 100 --save_total_limit 1 --save_at_last no \
47 | --logging_dir ${output_dir} --logging_steps -1 \
48 | --seed 0 \
49 | --dataloader_num_workers 2 \
50 | --eval_steps 100 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \
51 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \
52 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \
53 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \
54 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \
55 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \
56 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \
57 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \
58 | --non_private ${non_private} \
59 |
60 |
61 |
62 |
63 |
64 |
65 |
66 |
67 |
68 |
69 |
70 |
71 |
72 |
--------------------------------------------------------------------------------
/examples/text_classification/README.md:
--------------------------------------------------------------------------------
1 | ## DP text classification with Huggingface transformers
2 |
3 | ### Getting the data
4 |
5 | We adopt the data pipeline by \[[Li et al., 2021](https://arxiv.org/pdf/2110.05679.pdf)\], which is adapted from the excellent work by \[[Gao et al., 2021](https://arxiv.org/pdf/2012.15723.pdf)\]. To obtain the data, run the following:
6 |
7 | ```plaintext
8 | cd data; bash download_dataset.sh
9 | ```
10 |
11 | This should produce a `data/original` subfolder that contains the GLUE ([General Language Understanding Evaluation](https://huggingface.co/datasets/glue)) datasets.
12 |
13 | ### Running
14 |
15 | Use the `run_wrapper.py` script in the folder, which assembles the full command and runs `run_classification.py`.
16 |
17 | Necessary arguments:
18 |
19 | * `--output_dir`: path to a folder where results will be written
20 | * `--task_name`: name of task; one of `sst-2`, `qnli`, `qqp`, `mnli`
21 |
22 | For instance, run the following under the `examples` folder:
23 |
24 | ```plaintext
25 | python -m text_classification.run_wrapper --output_dir ToDeleteNLU --task_name sst-2
26 | ```
27 |
28 | The script by default uses book-keeping (BK) by [[Differentially Private Optimization on Large Model at Small Cost]](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory burden, without affecting accuracy. For SST-2/QNLI/QQP/MNLI, running `roberta-base` on one A100 GPU (40GB) takes around 5/8/37/32 minutes per epoch.
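
For instance (an illustrative invocation using the flags documented below), a logical batch size of 1000 with a physical batch size of 40 runs 1000 / 40 = 25 gradient-accumulation steps per optimizer update:

```plaintext
python -m text_classification.run_wrapper --output_dir ToDeleteNLU --task_name sst-2 \
  --physical_batch_size 40 --batch_size 1000
```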
29 |
30 | Additional arguments:
31 |
32 | * `--model_name_or_path`: The pretrained model; one of `distilbert-base-uncased`, `bert-base-uncased`, `bert-large-uncased`, `distilroberta-base`, `roberta-base`, `roberta-large`.
33 |
34 | * `--target_epsilon`: Target privacy spending, default is 8.
35 |
36 | * `--few_shot_type`: Whether to use the generic prompt formatter described in Section 3.2 of our paper; `prompt` enables it, `finetune` disables it.
37 |
38 | * `--non_private`: Whether to train differentially privately; one of `yes`, `no` (default).
39 |
40 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `ghost` (default, meaning book-keeping), `MixGhostClip`, `MixOpt`. All three modes are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf).
41 |
42 | * `--clipping_fn`: Which per-sample gradient clipping function to use; one of `automatic` (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), `Abadi` [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), `global` [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf). A conceptual sketch of per-sample clipping appears after this list.
43 |
44 | * `--clipping_style`: Which per-sample gradient clipping style to use; one of `all-layer` (flat clipping), `layer-wise` (each layer is a group, including both weight and bias parameters), `param-wise` (each parameter is a group), or a list of layer names (general group-wise clipping). For example, 2-group clipping can be selected with `--clipping_style 2`, and 12-group clipping with `--clipping_style 12`.
45 |
46 | * `--attention_only`: Whether to only train attention layers; one of `yes`, `no` (default).
47 |
48 | * `--bias_only`: Whether to only train bias terms; one of `yes`, `no` (default). If yes, this implements
49 |   [[Differentially Private Bias-Term only Fine-tuning]](https://arxiv.org/pdf/2210.00036.pdf).
50 |
51 | * `--physical_batch_size`: Physical batch size used for gradient accumulation; it determines memory use and speed, but not accuracy.
52 |
53 | * `--batch_size`: Logical batch size that determines convergence and accuracy; it should be a multiple of `physical_batch_size`. Default is `None`.
54 |
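To make the clipping options concrete, below is a minimal conceptual sketch of flat (`all-layer`) per-sample clipping with the Abadi et al. rescaling factor. It is illustrative only, assuming per-sample gradients are already materialized; it is not the optimized book-keeping implementation used by this codebase.

```python
import torch

def clip_and_accumulate(per_sample_grads, max_grad_norm=0.1, noise_multiplier=1.0):
    """per_sample_grads: one tensor per parameter, each shaped (batch, *param_shape)."""
    batch = per_sample_grads[0].shape[0]
    # Per-sample 2-norm taken jointly over all parameters (flat / "all-layer" clipping).
    sq_norms = sum(g.reshape(batch, -1).pow(2).sum(dim=1) for g in per_sample_grads)
    clip_factor = (max_grad_norm / (sq_norms.sqrt() + 1e-6)).clamp(max=1.0)
    noisy_sums = []
    for g in per_sample_grads:
        clipped = g * clip_factor.view(-1, *([1] * (g.dim() - 1)))
        summed = clipped.sum(dim=0)
        # Gaussian noise calibrated to the clipping threshold, added once per parameter.
        noisy_sums.append(summed + noise_multiplier * max_grad_norm * torch.randn_like(summed))
    return noisy_sums
```

Group-wise styles (`layer-wise`, `param-wise`, or a custom grouping) compute the norm and clipping factor per group rather than jointly over all parameters.
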
55 | Note that, keeping the other training hyperparameters (e.g., number of training epochs, clipping threshold, learning rate) at their defaults, the script should reproduce the results in \[[Li et al., 2021](https://arxiv.org/pdf/2110.05679.pdf); [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)\].
56 |
--------------------------------------------------------------------------------
/examples/text_classification/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/text_classification/__init__.py
--------------------------------------------------------------------------------
/examples/text_classification/data/download_dataset.sh:
--------------------------------------------------------------------------------
1 | wget https://nlp.cs.princeton.edu/projects/lm-bff/datasets.tar
2 | tar xvf datasets.tar
3 |
--------------------------------------------------------------------------------
/examples/text_classification/data/make_k_shot_without_dev.py:
--------------------------------------------------------------------------------
1 | """The datasets in the k-shot folder contain dev.tsv; we make the test set the dev set in the new k-shot.
2 |
3 | python -m classification.data.make_k_shot_without_dev
4 | """
5 | import os
6 |
7 | from ml_swissknife import utils
8 |
9 | join = os.path.join
10 |
11 | base_dir = '/nlp/scr/lxuechen/data/lm-bff/data/k-shot'
12 | new_dir = '/nlp/scr/lxuechen/data/lm-bff/data/k-shot-no-dev'
13 |
14 | task_names = ("SST-2", "QNLI", "MNLI", "QQP")
15 | for task_name in task_names:
16 | folder = join(base_dir, task_name)
17 | new_folder = join(new_dir, task_name)
18 |
19 | for name in utils.listdir(folder):
20 | subfolder = join(folder, name)
21 | new_subfolder = join(new_folder, name)
22 | os.makedirs(new_subfolder, exist_ok=True)
23 |
24 | train = join(subfolder, 'train.tsv')
25 | new_train = join(new_subfolder, 'train.tsv')
26 | os.system(f'cp {train} {new_train}')
27 |
28 | if task_name == "MNLI":
29 | test = join(subfolder, 'test_matched.tsv')
30 | new_dev = join(new_subfolder, 'dev_matched.tsv')
31 | os.system(f'cp {test} {new_dev}')
32 |
33 | test = join(subfolder, 'test_mismatched.tsv')
34 | new_dev = join(new_subfolder, 'dev_mismatched.tsv')
35 | os.system(f'cp {test} {new_dev}')
36 | else:
37 | test = join(subfolder, 'test.tsv')
38 | new_dev = join(new_subfolder, 'dev.tsv')
39 | os.system(f'cp {test} {new_dev}')
40 |
--------------------------------------------------------------------------------
/examples/text_classification/data/make_valid_data.py:
--------------------------------------------------------------------------------
1 | """Make the separate validation data, so that we don't tune on dev set.
2 |
3 | python -m classification.data.make_valid_data
4 | """
5 | import os
6 |
7 | import fire
8 | import numpy as np
9 | import tqdm
10 |
11 |
12 | def write_lines(path, lines, mode="w"):
13 | os.makedirs(os.path.dirname(path), exist_ok=True)
14 | with open(path, mode) as f:
15 | f.writelines(lines)
16 | print(len(lines))
17 |
18 |
19 | def main():
20 | valid_percentage = 0.1
21 | original_dir = "/nlp/scr/lxuechen/data/lm-bff/data/original"
22 | new_dir = "/nlp/scr/lxuechen/data/lm-bff/data/glue-with-validation"
23 |
24 | task_folders = ("GLUE-SST-2", "QNLI", "QQP")
25 | for task_folder in task_folders:
26 | # Create train and valid splits.
27 | full_train_path = os.path.join(original_dir, task_folder, 'train.tsv')
28 | with open(full_train_path, 'r') as f:
29 | full_train = f.readlines()
30 |
31 | header = full_train[0]
32 | full_train = full_train[1:] # Remove header.
33 |
34 | indices = np.random.permutation(len(full_train))
35 | new_valid_size = int(len(indices) * valid_percentage)
36 | new_train_size = len(indices) - new_valid_size
37 | new_train_indices = indices[:new_train_size]
38 | new_valid_indices = indices[new_train_size:]
39 | assert len(new_train_indices) == new_train_size
40 | assert len(new_valid_indices) == new_valid_size
41 |
42 | new_train = [header] + [full_train[i] for i in new_train_indices]
43 | new_valid = [header] + [full_train[i] for i in new_valid_indices]
44 |
45 | new_train_path = os.path.join(new_dir, task_folder, 'train.tsv')
46 | new_valid_path = os.path.join(new_dir, task_folder, 'dev.tsv')
47 |
48 | write_lines(new_train_path, new_train)
49 | write_lines(new_valid_path, new_valid)
50 | del new_train, new_valid, new_train_path, new_valid_path
51 | del new_train_size, new_train_indices
52 | del new_valid_size, new_valid_indices
53 |
54 | # Make test!
55 | test_path = os.path.join(original_dir, task_folder, 'dev.tsv')
56 | new_test_path = os.path.join(new_dir, task_folder, 'test.tsv')
57 | os.system(f'cp {test_path} {new_test_path}')
58 | del test_path, new_test_path
59 |
60 | # Make valid set for MNLI; different, since matched/mismatched!
61 | task_folder = "MNLI"
62 | matched_genres = ['slate', 'government', 'telephone', 'travel', 'fiction']
63 | mismatched_genres = ['letters', 'verbatim', 'facetoface', 'oup', 'nineeleven']
64 | full_train_path = os.path.join(original_dir, task_folder, 'train.tsv')
65 | with open(full_train_path, 'r') as f:
66 | full_train = f.readlines()
67 | full_train_csv = [line.split('\t') for line in full_train]
68 |
69 | # Check the lengths are correct.
70 | l = len(full_train_csv[0])
71 | for line in full_train_csv:
72 | assert l == len(line)
73 |
74 | # Remove header.
75 | header = full_train[0]
76 | header_csv = full_train_csv[0]
77 |
78 | full_train = full_train[1:]
79 | full_train_csv = full_train_csv[1:]
80 |
81 | # Get index of genre.
82 | genre_index = header_csv.index('genre')
83 |
84 | # Shuffle both!
85 | indices = np.random.permutation(len(full_train))
86 | full_train = [full_train[i] for i in indices]
87 | full_train_csv = [full_train_csv[i] for i in indices]
88 |
89 | # Split validation.
90 | new_valid_size = int(len(indices) * valid_percentage)
91 | new_matched_valid_size = new_mismatched_valid_size = new_valid_size // 2
92 |
93 | # Fetch the indices.
94 | new_train_indices = []
95 | new_matched_valid_indices = []
96 | new_mismatched_valid_indices = []
97 | matched_count = mismatched_count = 0
98 | for i, row in enumerate(full_train_csv):
99 | genre = row[genre_index]
100 | if genre in matched_genres and matched_count < new_matched_valid_size:
101 | new_matched_valid_indices.append(i)
102 | matched_count += 1
103 | elif genre in mismatched_genres and mismatched_count < new_mismatched_valid_size:
104 | new_mismatched_valid_indices.append(i)
105 | mismatched_count += 1
106 | else:
107 | new_train_indices.append(i)
108 |
109 | new_matched_valid_indices = set(new_matched_valid_indices)
110 | new_mismatched_valid_indices = set(new_mismatched_valid_indices)
111 |
112 | new_train = [header]
113 | new_matched_valid = [header]
114 | new_mismatched_valid = [header]
115 | for i, line in tqdm.tqdm(enumerate(full_train)):
116 | if i in new_matched_valid_indices:
117 | new_matched_valid.append(line)
118 | elif i in new_mismatched_valid_indices:
119 | new_mismatched_valid.append(line)
120 | else:
121 | new_train.append(line)
122 |
123 | new_train_path = os.path.join(new_dir, task_folder, 'train.tsv')
124 | new_matched_valid_path = os.path.join(new_dir, task_folder, 'dev_matched.tsv')
125 | new_mismatched_valid_path = os.path.join(new_dir, task_folder, 'dev_mismatched.tsv')
126 |
127 | write_lines(new_train_path, new_train)
128 | write_lines(new_matched_valid_path, new_matched_valid)
129 | write_lines(new_mismatched_valid_path, new_mismatched_valid)
130 |
131 | matched_test_path = os.path.join(original_dir, task_folder, 'dev_matched.tsv')
132 | new_matched_test_path = os.path.join(new_dir, task_folder, 'test_matched.tsv')
133 | os.system(f'cp {matched_test_path} {new_matched_test_path}')
134 |
135 | mismatched_test_path = os.path.join(original_dir, task_folder, 'dev_mismatched.tsv')
136 | new_mismatched_test_path = os.path.join(new_dir, task_folder, 'test_mismatched.tsv')
137 | os.system(f'cp {mismatched_test_path} {new_mismatched_test_path}')
138 |
139 |
140 | if __name__ == "__main__":
141 | fire.Fire(main)
142 |
--------------------------------------------------------------------------------
/examples/text_classification/run_wrapper.py:
--------------------------------------------------------------------------------
1 | """Wrapper launcher script."""
2 |
3 | import os
4 |
5 | import fire
6 |
7 | from .src import common
8 |
9 |
10 | def _get_command(
11 | task_name,
12 | output_dir,
13 | model_name_or_path,
14 | data_dir,
15 | learning_rate,
16 | clipping_mode: str,
17 | clipping_fn: str,
18 | clipping_style: str,
19 | non_private,
20 | target_epsilon,
21 | few_shot_type,
22 | seed,
23 | attention_only,
24 | bias_only,
25 | static_lm_head,
26 | static_embedding,
27 | randomly_initialize,
28 | physical_batch_size,
29 | batch_size,
30 | num_train_epochs,
31 | eval_steps,
32 | ):
33 | task_name_to_factor = {
34 | "sst-2": 1, "qnli": 2, "qqp": 6, "mnli": 6,
35 | }
36 | factor = task_name_to_factor[task_name]
37 |
38 | if batch_size is None:
39 | base_batch_size = 1000
40 | # This batch size selection roughly ensures the sampling rates on different
41 | # datasets are in the same ballpark.
42 | batch_size = int(base_batch_size * factor)
43 | gradient_accumulation_steps = batch_size // physical_batch_size
44 |
45 | if num_train_epochs is None:
46 | base_num_train_epochs = 3
47 | num_train_epochs = int(base_num_train_epochs * factor)
48 |
49 | if learning_rate is None:
50 | if non_private.lower() in ('yes', 'y', 'true', 't'):
51 | learning_rate = 5e-5
52 | if bias_only.lower() in ('yes', 'y', 'true', 't'):
53 | learning_rate=1e-3
54 | else:
55 | learning_rate = 5e-4
56 | if bias_only.lower() in ('yes', 'y', 'true', 't'):
57 | learning_rate=5e-3
58 |
59 | data_dir = f"{data_dir}/{common.task_name2suffix_name[task_name]}"
60 | template = {
61 | "sst-2": "*cls**sent_0*_It_was*mask*.*sep+*",
62 | "mnli": "*cls**sent-_0*?*mask*,*+sentl_1**sep+*",
63 | "qnli": "*cls**sent-_0*?*mask*,*+sentl_1**sep+*",
64 | "qqp": "*cls**sent-_0**mask*,*+sentl_1**sep+*",
65 | }[task_name]
66 |
67 | # Epochs chosen roughly to match e2e number of updates. We didn't hyperparameter tune on classification tasks :)
68 | cmd = f'''
69 | python -m text_classification.run_classification \
70 | --task_name {task_name} \
71 | --data_dir {data_dir} \
72 | --output_dir {output_dir} \
73 | --overwrite_output_dir \
74 | --model_name_or_path {model_name_or_path} \
75 | --few_shot_type {few_shot_type} \
76 | --num_k 1 \
77 | --num_sample 1 --seed {seed} \
78 | --template {template} \
79 | --non_private {non_private} \
80 | --num_train_epochs {num_train_epochs} \
81 | --target_epsilon {target_epsilon} \
82 | --per_device_train_batch_size {physical_batch_size} \
83 | --gradient_accumulation_steps {gradient_accumulation_steps} \
84 | --per_device_eval_batch_size 8 \
85 | --per_example_max_grad_norm 0.1 --clipping_mode {clipping_mode} \
86 | --clipping_fn {clipping_fn} --clipping_style {clipping_style}\
87 | --learning_rate {learning_rate} \
88 | --lr_decay yes \
89 | --adam_epsilon 1e-08 \
90 | --weight_decay 0 \
91 | --max_seq_len 256 \
92 | --evaluation_strategy steps --eval_steps {eval_steps} --evaluate_before_training True \
93 | --do_train --do_eval \
94 | --first_sent_limit 200 --other_sent_limit 200 --truncate_head yes \
95 | --attention_only {attention_only} --bias_only {bias_only} --static_lm_head {static_lm_head} --static_embedding {static_embedding} \
96 | --randomly_initialize {randomly_initialize}
97 | '''
98 | return cmd
99 |
100 |
101 | def main(
102 | output_dir,
103 | task_name,
104 | few_shot_type="prompt", # finetune or prompt
105 | seed=0,
106 | model_name_or_path="roberta-base",
107 | data_dir="text_classification/data/original",
108 | learning_rate=None,
109 | clipping_mode="MixOpt",
110 | clipping_fn="automatic",
111 | clipping_style="all-layer",
112 | non_private="no",
113 | target_epsilon=8,
114 | attention_only="no",
115 | bias_only="no",
116 | static_lm_head="no",
117 | static_embedding="no",
118 |     physical_batch_size=40,
119 | eval_steps=10,
120 | randomly_initialize="no",
121 | batch_size=None,
122 | num_train_epochs=None,
123 | ):
124 | command = _get_command(
125 | output_dir=output_dir,
126 | task_name=task_name,
127 | model_name_or_path=model_name_or_path,
128 | data_dir=data_dir,
129 | learning_rate=learning_rate,
130 | clipping_mode=clipping_mode,
131 | clipping_fn=clipping_fn,
132 | clipping_style=clipping_style,
133 | non_private=non_private,
134 | target_epsilon=target_epsilon,
135 | few_shot_type=few_shot_type,
136 | seed=seed,
137 | attention_only=attention_only,
138 | bias_only=bias_only,
139 | static_lm_head=static_lm_head,
140 | static_embedding=static_embedding,
141 |         physical_batch_size=physical_batch_size,
142 | eval_steps=eval_steps,
143 | randomly_initialize=randomly_initialize,
144 | batch_size=batch_size,
145 | num_train_epochs=num_train_epochs,
146 | )
147 | print('Running command:')
148 | print(command)
149 | os.system(command)
150 |
151 |
152 | if __name__ == "__main__":
153 | fire.Fire(main)
154 |
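# Example invocation (illustrative; run from the `examples` directory, and the
# output path, task, and epsilon below are placeholders):
#   python -m text_classification.run_wrapper \
#       --output_dir outputs/sst-2 --task_name sst-2 \
#       --model_name_or_path roberta-base --target_epsilon 8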
--------------------------------------------------------------------------------
/examples/text_classification/src/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/text_classification/src/__init__.py
--------------------------------------------------------------------------------
/examples/text_classification/src/common.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 | task_name2suffix_name = {"sst-2": "GLUE-SST-2", "mnli": "MNLI", "qqp": "QQP", "qnli": "QNLI"}
4 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
5 | true_tags = ('y', 'yes', 't', 'true')
6 |
--------------------------------------------------------------------------------
/examples/text_classification/src/compiled_args.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass, field
2 |
3 | import transformers
4 |
5 | from .common import true_tags
6 | from typing import Optional
7 |
8 |
9 | @dataclass
10 | class PrivacyArguments:
11 | """Arguments for differentially private training."""
12 |
13 | per_example_max_grad_norm: float = field(
14 | default=.1, metadata={
15 | "help": "Clipping 2-norm of per-sample gradients."
16 | }
17 | )
18 | noise_multiplier: float = field(
19 | default=None, metadata={
20 | "help": "Standard deviation of noise added for privacy; if `target_epsilon` is specified, "
21 |                     "use the value obtained by searching for that budget."
22 | }
23 | )
24 | target_epsilon: float = field(
25 | default=None, metadata={
26 | "help": "Privacy budget; if `None` use the noise multiplier specified."
27 | }
28 | )
29 | target_delta: float = field(
30 | default=None, metadata={
31 |             "help": "Failure probability in approximate differential privacy; if `None`, use 1 / len(train_data)."
32 | }
33 | )
34 | non_private: str = field(
35 |         default="yes", metadata={"help": "Train non-privately if set to a true value ('yes', 'y', 'true', 't')."}
36 | )
37 | accounting_mode: str = field(
38 | default="rdp", metadata={"help": "One of (`rdp`, `glw`, `all`)."}
39 | )
40 | clipping_mode: str = field(
41 | default="default"
42 | )
43 | clipping_fn: str = field(
44 | default="automatic"
45 | )
46 | clipping_style: str = field(
47 | default="all-layer"
48 | )
49 |
50 | def __post_init__(self):
51 | self.non_private = self.non_private.lower() in true_tags # noqa
52 |
53 |
54 | @dataclass
55 | class TrainingArguments(transformers.TrainingArguments):
56 |     eval_epochs: int = field(default=10, metadata={"help": "Evaluate once every this many epochs."})
57 | evaluate_before_training: bool = field(default=False, metadata={"help": "Run evaluation before training."})
58 | lr_decay: str = field(
59 |         default="no", metadata={"help": "Apply the usual linear decay if `yes`, otherwise no decay."}
60 | )
61 | evaluate_test_split: bool = field(default=False, metadata={"help": "Run evaluation on the test split"})
62 |
63 | def __post_init__(self):
64 | super(TrainingArguments, self).__post_init__()
65 | self.lr_decay = self.lr_decay.lower() in true_tags # noqa
66 |
--------------------------------------------------------------------------------
/examples/text_classification/src/label_search.py:
--------------------------------------------------------------------------------
1 | """Automatic label search helpers."""
2 |
3 | import itertools
4 | import logging
5 | import multiprocessing
6 |
7 | import numpy as np
8 | import scipy.spatial as spatial
9 | import scipy.special as special
10 | import scipy.stats as stats
11 | import tqdm
12 |
13 | logger = logging.getLogger(__name__)
14 |
15 |
16 | def select_likely_words(train_logits, train_labels, k_likely=1000, vocab=None, is_regression=False):
17 | """Pre-select likely words based on conditional likelihood."""
18 | indices = []
19 | if is_regression:
20 | median = np.median(train_labels)
21 |         train_labels = (train_labels > median).astype(int)  # np.int was removed in modern NumPy
22 | num_labels = np.max(train_labels) + 1
23 | for idx in range(num_labels):
24 | label_logits = train_logits[train_labels == idx]
25 | scores = label_logits.mean(axis=0)
26 | kept = []
27 | for i in np.argsort(-scores):
28 | text = vocab[i]
29 | if not text.startswith("Ġ"):
30 | continue
31 | kept.append(i)
32 | indices.append(kept[:k_likely])
33 | return indices
34 |
35 |
36 | def select_neighbors(distances, k_neighbors, valid):
37 | """Select k nearest neighbors based on distance (filtered to be within the 'valid' set)."""
38 | indices = np.argsort(distances)
39 | neighbors = []
40 | for i in indices:
41 | if i not in valid:
42 | continue
43 | neighbors.append(i)
44 | if k_neighbors > 0:
45 | return neighbors[:k_neighbors]
46 | return neighbors
47 |
48 |
49 | def init(train_logits, train_labels):
50 | global logits, labels
51 | logits = train_logits
52 | labels = train_labels
53 |
54 |
55 | def eval_pairing_acc(pairing):
56 | global logits, labels
57 | label_logits = np.take(logits, pairing, axis=-1)
58 | preds = np.argmax(label_logits, axis=-1)
59 | correct = np.sum(preds == labels)
60 | return correct / len(labels)
61 |
62 |
63 | def eval_pairing_corr(pairing):
64 | global logits, labels
65 | if pairing[0] == pairing[1]:
66 | return -1
67 | label_logits = np.take(logits, pairing, axis=-1)
68 | label_probs = special.softmax(label_logits, axis=-1)[:, 1]
69 | pearson_corr = stats.pearsonr(label_probs, labels)[0]
70 | return pearson_corr
71 |
72 |
73 | def find_labels(
74 | model,
75 | train_logits,
76 | train_labels,
77 | seed_labels=None,
78 | k_likely=1000,
79 | k_neighbors=None,
80 | top_n=-1,
81 | vocab=None,
82 | is_regression=False,
83 | ):
84 | # Get top indices based on conditional likelihood using the LM.
85 | likely_indices = select_likely_words(
86 | train_logits=train_logits,
87 | train_labels=train_labels,
88 | k_likely=k_likely,
89 | vocab=vocab,
90 | is_regression=is_regression)
91 |
92 | logger.info("Top labels (conditional) per class:")
93 | for i, inds in enumerate(likely_indices):
94 | logger.info("\t| Label %d: %s", i, ", ".join([vocab[i] for i in inds[:10]]))
95 |
96 | # Convert to sets.
97 | valid_indices = [set(inds) for inds in likely_indices]
98 |
99 | # If specified, further re-rank according to nearest neighbors of seed labels.
100 | # Otherwise, keep ranking as is (based on conditional likelihood only).
101 | if seed_labels:
102 | assert (vocab is not None)
103 | seed_ids = [vocab.index(l) for l in seed_labels]
104 | vocab_vecs = model.lm_head.decoder.weight.detach().cpu().numpy()
105 | seed_vecs = np.take(vocab_vecs, seed_ids, axis=0)
106 |
107 | # [num_labels, vocab_size]
108 | label_distances = spatial.distance.cdist(seed_vecs, vocab_vecs, metric="cosine")
109 |
110 | # Establish label candidates (as k nearest neighbors).
111 | label_candidates = []
112 | logger.info("Re-ranked by nearest neighbors:")
113 | for i, distances in enumerate(label_distances):
114 | label_candidates.append(select_neighbors(distances, k_neighbors, valid_indices[i]))
115 | logger.info("\t| Label: %s", seed_labels[i])
116 | logger.info("\t| Neighbors: %s", " ".join([vocab[idx] for idx in label_candidates[i]]))
117 | else:
118 | label_candidates = likely_indices
119 |
120 | # Brute-force search all valid pairings.
121 | pairings = list(itertools.product(*label_candidates))
122 |
123 | if is_regression:
124 | eval_pairing = eval_pairing_corr
125 | metric = "corr"
126 | else:
127 | eval_pairing = eval_pairing_acc
128 | metric = "acc"
129 |
130 | # Score each pairing.
131 | pairing_scores = []
132 | with multiprocessing.Pool(initializer=init, initargs=(train_logits, train_labels)) as workers:
133 | with tqdm.tqdm(total=len(pairings)) as pbar:
134 | chunksize = max(10, int(len(pairings) / 1000))
135 | for score in workers.imap(eval_pairing, pairings, chunksize=chunksize):
136 | pairing_scores.append(score)
137 | pbar.update()
138 |
139 | # Take top-n.
140 | best_idx = np.argsort(-np.array(pairing_scores))[:top_n]
141 | best_scores = [pairing_scores[i] for i in best_idx]
142 | best_pairings = [pairings[i] for i in best_idx]
143 |
144 | logger.info("Automatically searched pairings:")
145 | for i, indices in enumerate(best_pairings):
146 | logger.info("\t| %s (%s = %2.2f)", " ".join([vocab[j] for j in indices]), metric, best_scores[i])
147 |
148 | return best_pairings
149 |
--------------------------------------------------------------------------------
/fastDP/README.md:
--------------------------------------------------------------------------------
1 | ### Two Privacy Engines
2 |
3 | FastDP provides two privacy engines to compute the private gradient: **hook-based** and **torch-extending**. The two engines are mathematically equivalent, though they differ in applicability and computational efficiency. We summarize the differences below and note that some limitations can be overcome with more engineering effort.
4 |
5 | | | Hook-based (DP) | Torch-extending (DP) | Standard (non-DP) |
6 | |:----------------------------:|:-------------------------------:|:----------------:|:------------:|
7 | | Speed (1/time complexity) | 80-100% | ~70% | 100% |
8 | | Memory cost (space complexity) | 100-130% | ~100% | 100% |
9 | | ZeRO distribution solution | ✅ Supported | ✅ Supported | ✅ Supported |
10 | | Most types of layers | ✅ Supported (see below) | ✅ Supported (see below) | ✅ Supported |
11 | | Per-sample clipping styles | ✅ Supported for all styles | Layer-wise style | ✅ Not needed |
12 | | Per-sample clipping functions | ✅ Supported for all functions | Automatic clipping | ✅ Not needed |
13 | | Modifying optimizers | Needed for `PrivacyEngine`; not needed for ZeRO | ✅ Not needed | ✅ Not needed |
14 | | Private gradient stored in | `param.private_grad` | `param.grad` | `param.grad` |
15 | | Fused kernel | ✅ Supported | Not supported | ✅ Supported |
16 | | Ghost differentiation (origin param) | Supported on single GPU | Not supported | Not needed |
17 | | Recommended usage | Single GPU or ZeRO | General | General |
18 |
19 | #### 1. Hook-based
20 | The hook-based approach computes the private gradient with forward hooks (to store the activations) and backward hooks (to compute the per-sample gradient norms, clip, and add noise). See [this tutorial for hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html). This approach first computes the private gradient and then overrides the non-DP gradient.
21 |
22 | On a single GPU or under data parallelism (see `PrivacyEngine`), the hooks are backward module hooks, triggered before `param.grad` is computed; under ZeRO (see `PrivacyEngine_Distributed_Stage_2_and_3`), backward tensor hooks are used instead, triggered after `param.grad` has been computed.
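As a toy sketch of the mechanism (illustrative only, not the engine's actual hook code), the following captures activations in a forward hook and forms per-sample gradient norms in a backward hook for a single `nn.Linear`:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)

def forward_hook(module, inputs, outputs):
    # Store the activations for use in the backward pass.
    module.activations = inputs[0].detach()

def backward_hook(module, grad_input, grad_output):
    a = module.activations          # (B, 4) input activations
    g = grad_output[0].detach()     # (B, 2) gradient w.r.t. the output
    # The per-sample weight gradient of a linear layer is rank-1,
    # so its norm is ||a_i|| * ||g_i||.
    module.weight.norm_sample = a.norm(dim=1) * g.norm(dim=1)

layer.register_forward_hook(forward_hook)
layer.register_full_backward_hook(backward_hook)

x = torch.randn(8, 4)
layer(x).sum().backward()
print(layer.weight.norm_sample)  # 8 per-sample norms, no per-sample gradients stored
```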
23 |
24 | #### 2. Torch-extending
25 | The torch-extending approach computes the private gradient directly by rewriting the model's back-propagation mechanism (see `PrivacyEngine_Distributed_extending`). See [this tutorial for extending torch modules](https://pytorch.org/docs/stable/notes/extending.html#extending-torch-nn). This approach overrides the non-DP modules, as shown in `supported_differentially_private_layers.py`. Because it modifies neither the optimizers nor the communication orchestration of distributed solutions, it is expected to be generally applicable. However, some slowdown may be observed, as the extension is not implemented at the C++ level. A toy sketch of the idea follows.
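A minimal sketch of this strategy (using per-sample gradient instantiation and Abadi-style clipping for brevity; noise addition is omitted and the bias gradient is left unclipped; this is not the library's implementation):

```python
import torch
import torch.nn as nn

class _ClippedLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, max_norm):
        ctx.save_for_backward(x, weight)
        ctx.max_norm = max_norm
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Instantiate per-sample weight gradients and clip them before summation,
        # so the clipped (pre-noise) gradient lands directly in weight.grad.
        grad_w = torch.einsum('bo,bi->boi', grad_out, x)
        scale = (ctx.max_norm / (grad_w.flatten(1).norm(dim=1) + 1e-6)).clamp(max=1.0)
        return grad_out @ weight, torch.einsum('b,boi->oi', scale, grad_w), None

class ClippedLinear(nn.Linear):
    def forward(self, x):
        out = _ClippedLinearFn.apply(x, self.weight, 1.0)
        return out + self.bias if self.bias is not None else out
```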
26 |
27 | ### Supported Modules
28 |
29 | Our privacy engine supports the commonly used modules that constitute most models, with possibly two methods to compute the per-sample gradient norm:
30 | * nn.Linear (GhostClip & Grad Instantiation)
31 | * nn.LayerNorm (Grad Instantiation)
32 | * nn.GroupNorm (Grad Instantiation)
33 | * nn.InstanceNorm (Grad Instantiation)
34 | * nn.Embedding (GhostClip)
35 | * nn.Conv1d (GhostClip & Grad Instantiation)
36 | * nn.Conv2d (GhostClip & Grad Instantiation)
37 | * nn.Conv3d (GhostClip & Grad Instantiation)
38 |
39 | Frozen (e.g. `nn.Linear` with `requires_grad=False`) and non-trainable (e.g. `nn.ReLU`, `nn.Tanh`, `nn.MaxPool2d`) modules are also supported.
40 |
41 | Note GhostClip stands for ghost clipping [1][2][3], which computes the gradient norms without creating or storing the per-sample gradients. Grad Instantiation stands for per-sample gradient instantiation [5], which materializes the per-sample gradients and then computes their norms. Grad Instantiation can be inefficient for large models, and GhostClip can be inefficient for high-dimensional data. Therefore we allow the method to be chosen per layer (known as the hybrid algorithms in [3][4]) for modules that support both; a numerical sketch of the two methods is given below.
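The following sketch (illustrative shapes only) contrasts the two methods for a linear layer on sequence data. Both yield the same per-sample gradient norms, but GhostClip works from inner products of Gram matrices without materializing the per-sample gradients:

```python
import torch

B, T, d, p = 4, 16, 8, 8
a = torch.randn(B, T, d)   # layer input (activations)
g = torch.randn(B, T, p)   # gradient w.r.t. the layer output

# GhostClip: squared per-sample gradient norms from (T x T) Gram matrices,
# never forming the (B, p, d) per-sample gradient tensor.
ghost_sq = torch.einsum('bsd,btd,bsp,btp->b', a, a, g, g)

# Grad Instantiation: materialize per-sample gradients, then take norms.
grad_sample = torch.einsum('btp,btd->bpd', g, a)
inst_sq = grad_sample.flatten(1).norm(dim=1) ** 2

print(torch.allclose(ghost_sq, inst_sq, atol=1e-3))  # True
```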
42 |
43 | ### Arguments
44 | * `module`: The model to be optimized with differential privacy.
45 | * `batch_size`: Logical batch size that determines the convergence and accuracy.
46 | * `sample_size`: Number of training samples.
47 | * `target_epsilon`: Target privacy budget ε.
48 | * `target_delta`: Target privacy budget δ, should be smaller than 1/sample_size.
49 | * `max_grad_norm`: Per-sample gradient clipping threshold; defaults to 1. No need to tune if `clipping_fn="automatic"`.
50 | * `epochs`: Number of epochs. Not needed if `noise_multiplier` is provided.
51 | * `noise_multiplier`: Standard deviation of the independent Gaussian noise added to the gradient, relative to `max_grad_norm`. This can be computed automatically under different `accounting_mode`s if `target_epsilon, batch_size, sample_size, epochs` are provided.
52 | * `accounting_mode`: Privacy accounting theory to use, one of "rdp" (default), "glw", "all".
53 | * `named_params`: Specifies which parameters to optimize with differential privacy.
54 | * `clipping_mode`: Per-sample gradient clipping mode, one of 'ghost', 'MixGhostClip', 'MixOpt' (default) from [4]. Note different clipping modes, including Opacus [5], GhostClip [2] and Mixed GhostClip [3], give the same convergence and accuracy though at significantly different time/space complexity.
55 | * `clipping_fn`: Per-sample gradient clipping function to use; one of "automatic" (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), "Abadi" [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf) , "global" [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf).
56 | * `clipping_style`: Per-sample gradient clipping style to use; one of `all-layer` (flat clipping), `layer-wise` (each layer is a block, including both weight and bias parameters), `param-wise` (each parameter is a block), or a list of layer names (general block-wise clipping).
57 | * `origin_params`: Origin parameters for the ghost differentiation trick from [Bu et al. Appendix D.3](https://arxiv.org/pdf/2210.00038.pdf). Default is `None` (not using the trick). To enjoy the acceleration from the trick, set to each model's first trainable layer's parameters. For example, in text classification with RoBERTa, set `origin_params=["_embeddings"]`; in text generation with GPT2, set `origin_params=["wte","wpe"]`; in image classification with BEiT, set `origin_params=["patch_embed.proj.bias"]`. This trick gives roughly an 8/6 ≈ 1.33x speedup at no memory overhead.
58 |
59 | ### Usage
60 | Our privacy engine uses PyTorch [forward and backward hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html) to clip per-sample gradients and to add noise. To train models privately, attach the privacy engine to any optimizer from [torch.optim](https://pytorch.org/docs/stable/optim.html); the engine accumulates the sum of clipped per-sample gradients into `.grad` during back-propagation and additionally injects noise at `step`, as sketched below.
61 |
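A minimal usage sketch (the model, dataset sizes, and hyperparameters are placeholders):

```python
import torch
from fastDP import PrivacyEngine

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # model: any supported nn.Module
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,        # logical batch size
    sample_size=50000,     # number of training samples
    epochs=3,
    target_epsilon=2,
    clipping_fn='automatic',
    clipping_mode='MixOpt',
    clipping_style='all-layer',
)
privacy_engine.attach(optimizer)
# ...then train as usual: loss.backward(); optimizer.step(); optimizer.zero_grad()
```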
62 | To conduct DP bias-term fine-tuning (DP-BiTFiT [6]), simply freeze all non-bias terms:
63 | ```python
64 | [param.requires_grad_(False) for name, param in model.named_parameters() if '.bias' not in name]
65 | ```
66 | Note that for two-phase DP training (e.g. the appendix of [6] or DP continual training), one needs to detach the first engine and attach a new engine to a new optimizer, along the lines of the sketch below.
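A sketch of the hand-off, assuming the engine exposes a `detach` counterpart to `attach` (the method name is an assumption here; phase-two hyperparameters are placeholders):

```python
privacy_engine.detach()  # assumed API: release the phase-one optimizer
new_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
new_engine = PrivacyEngine(model, batch_size=256, sample_size=50000,
                           epochs=1, target_epsilon=1)
new_engine.attach(new_optimizer)
```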
67 |
68 | ### References
69 | [1] Goodfellow, Ian. "Efficient per-example gradient computations." arXiv preprint arXiv:1510.01799 (2015).
70 |
71 | [2] Li, Xuechen, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. "Large language models can be strong differentially private learners." arXiv preprint arXiv:2110.05679 (2021).
72 |
73 | [3] Bu, Zhiqi, Jialin Mao, and Shiyun Xu. "Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy." arXiv preprint arXiv:2205.10683 (2022).
74 |
75 | [4] Bu, Zhiqi, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Optimization on Large Model at Small Cost." arXiv preprint arXiv:2210.00038 (2022).
76 |
77 | [5] Yousefpour, Ashkan, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen et al. "Opacus: User-friendly differential privacy library in PyTorch." arXiv preprint arXiv:2109.12298 (2021).
78 |
79 | [6] Bu, Zhiqi, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Bias-Term only Fine-tuning of Foundation Models." arXiv preprint arXiv:2210.00036 (2022).
80 |
--------------------------------------------------------------------------------
/fastDP/__init__.py:
--------------------------------------------------------------------------------
1 | from . import lora_utils
2 | from .privacy_engine import PrivacyEngine
3 | from .privacy_engine_dist_stage23 import PrivacyEngine_Distributed_Stage_2_and_3
4 | from .privacy_engine_dist_extending import PrivacyEngine_Distributed_extending
5 | from .supported_differentially_private_layers import *
6 | __version__ = '2.0.0'
7 |
--------------------------------------------------------------------------------
/fastDP/accounting/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/fastDP/accounting/__init__.py
--------------------------------------------------------------------------------
/fastDP/accounting/accounting_manager.py:
--------------------------------------------------------------------------------
1 | import abc
2 | import math
3 | from typing import Dict, Optional, Union
4 |
5 | from . import rdp_accounting
6 |
7 | DEFAULT_ALPHAS = tuple(1 + x / 10.0 for x in range(1, 100)) + tuple(range(12, 64)) # RDP.
8 |
9 |
10 | class AccountingManager(abc.ABC):
11 | def _get_sigma_with_target_epsilon(
12 | self,
13 | target_epsilon,
14 | target_delta,
15 | sample_rate,
16 | steps,
17 | threshold,
18 | sigma_hi_init,
19 | sigma_lo_init,
20 | ):
21 | """Binary search σ given ε and δ."""
22 | if sigma_lo_init > sigma_hi_init:
23 | raise ValueError("`sigma_lo` should be smaller than `sigma_hi`.")
24 |
25 | # Find an appropriate region for binary search.
26 | sigma_hi = sigma_hi_init
27 | sigma_lo = sigma_lo_init
28 |
29 | # Ensure sigma_hi isn't too small.
30 | while True:
31 | eps = self._compute_epsilon_from_sigma(sigma_hi, sample_rate, target_delta, steps)
32 | if eps < target_epsilon:
33 | break
34 | sigma_hi *= 2
35 |
36 | # Ensure sigma_lo isn't too large.
37 | while True:
38 | eps = self._compute_epsilon_from_sigma(sigma_lo, sample_rate, target_delta, steps)
39 | if eps > target_epsilon:
40 | break
41 | sigma_lo /= 2
42 |
43 | # Binary search.
44 | while sigma_hi - sigma_lo > threshold:
45 | sigma = (sigma_hi + sigma_lo) / 2
46 | eps = self._compute_epsilon_from_sigma(sigma, sample_rate, target_delta, steps)
47 | if eps < target_epsilon:
48 | sigma_hi = sigma
49 | else:
50 | sigma_lo = sigma
51 |
52 | # Conservative estimate.
53 | return sigma_hi
54 |
55 | @abc.abstractmethod
56 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict:
57 | """Override for reporting results."""
58 | raise NotImplementedError
59 |
60 | @abc.abstractmethod
61 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps) -> float:
62 | """Override for binary sigma search."""
63 | raise NotImplementedError
64 |
65 | def compute_sigma(
66 | self,
67 | target_epsilon: float,
68 | target_delta: float,
69 | sample_rate: float,
70 | epochs: Optional[Union[float, int]] = None,
71 | steps=None,
72 | threshold=1e-3,
73 | sigma_hi_init=4,
74 | sigma_lo_init=0.1,
75 | ) -> float:
76 | if steps is None:
77 | if epochs is None:
78 | raise ValueError("Epochs and steps cannot both be None.")
79 | steps = math.ceil(epochs / sample_rate)
80 | return self._get_sigma_with_target_epsilon(
81 | target_epsilon=target_epsilon,
82 | target_delta=target_delta,
83 | sample_rate=sample_rate,
84 | steps=steps,
85 | threshold=threshold,
86 | sigma_hi_init=sigma_hi_init,
87 | sigma_lo_init=sigma_lo_init,
88 | )
89 |
90 |
91 | class RDPManager(AccountingManager):
92 | def __init__(self, alphas):
93 | super(RDPManager, self).__init__()
94 | self._alphas = alphas
95 |
96 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps):
97 | return self.compute_epsilon(sigma, sample_rate, target_delta, steps)["eps_rdp"]
98 |
99 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict:
100 | """Compute RDP as usual, but convert to (ε, δ)-DP based on the result by Canonne, Kamath, Steinke."""
101 | rdp = rdp_accounting.compute_rdp(q=sample_rate, noise_multiplier=sigma, steps=steps, orders=self._alphas)
102 | eps, alpha = rdp_accounting.get_privacy_spent(orders=self._alphas, rdp=rdp, delta=target_delta)
103 | return dict(eps_rdp=eps, alpha_rdp=alpha)
104 |
105 |
106 | class GLWManager(AccountingManager):
107 | def __init__(self, eps_error=0.05):
108 | super(GLWManager, self).__init__()
109 | self._eps_error = eps_error
110 |
111 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps):
112 | return self.compute_epsilon(sigma, sample_rate, target_delta, steps)["eps_upper"] # Be conservative.
113 |
114 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict:
115 | if steps == 0:
116 | return dict(eps_low=None, eps_estimate=None, eps_upper=None)
117 |
118 | from prv_accountant import Accountant
119 | accountant = Accountant(
120 | noise_multiplier=sigma,
121 | sampling_probability=sample_rate,
122 | delta=target_delta,
123 | eps_error=self._eps_error,
124 | max_compositions=steps
125 | )
126 | eps_low, eps_estimate, eps_upper = accountant.compute_epsilon(num_compositions=steps)
127 | return dict(eps_low=eps_low, eps_estimate=eps_estimate, eps_upper=eps_upper)
128 |
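# Illustrative usage sketch (hypothetical numbers; the API is as defined above):
# search the noise multiplier sigma meeting a target (epsilon, delta) under RDP,
# then report the accountant's epsilon at that sigma.
if __name__ == "__main__":
    manager = RDPManager(alphas=DEFAULT_ALPHAS)
    sigma = manager.compute_sigma(
        target_epsilon=8.0, target_delta=1e-5, sample_rate=0.01, epochs=3,
    )
    print(sigma, manager.compute_epsilon(sigma, sample_rate=0.01, target_delta=1e-5, steps=300))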
--------------------------------------------------------------------------------
/fastDP/accounting/rdp_accounting.py:
--------------------------------------------------------------------------------
1 | r"""
2 | This file is adapted from the privacy accounting procedure in Opacus, which in turn is adapted from tf-privacy.
3 | Below is the original documentation in Opacus.
4 |
5 | *Based on Google's TF Privacy:* https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/analysis
6 | /rdp_accountant.py.
7 | *Here, we update this code to Python 3, and optimize dependencies.*
8 |
9 | Functionality for computing Renyi Differential Privacy (RDP) of an additive
10 | Sampled Gaussian Mechanism (SGM).
11 |
12 | Example:
13 | Suppose that we have run an SGM applied to a function with L2-sensitivity of 1.
14 |
15 | Its parameters are given as a list of tuples
16 | ``[(q_1, sigma_1, steps_1), ..., (q_k, sigma_k, steps_k)],``
17 | and we wish to compute epsilon for a given target delta.
18 |
19 | The example code would be:
20 |
21 | >>> max_order = 32
22 | >>> orders = range(2, max_order + 1)
23 | >>> rdp = np.zeros_like(orders, dtype=float)
24 | >>> for q, sigma, steps in parameters:
25 | >>> rdp += privacy_analysis.compute_rdp(q, sigma, steps, orders)
26 | >>> epsilon, opt_order = privacy_analysis.get_privacy_spent(orders, rdp, delta)
27 | """
28 |
29 | import math
30 | from typing import List, Sequence, Union
31 |
32 | import numpy as np
33 | from scipy import special
34 |
35 |
36 | ########################
37 | # LOG-SPACE ARITHMETIC #
38 | ########################
39 |
40 |
41 | def _log_add(logx: float, logy: float) -> float:
42 | r"""Adds two numbers in the log space.
43 |
44 | Args:
45 | logx: First term in log space.
46 | logy: Second term in log space.
47 |
48 | Returns:
49 | Sum of numbers in log space.
50 | """
51 | a, b = min(logx, logy), max(logx, logy)
52 | if a == -np.inf: # adding 0
53 | return b
54 | # Use exp(a) + exp(b) = (exp(a - b) + 1) * exp(b)
55 | return math.log1p(math.exp(a - b)) + b # log1p(x) = log(x + 1)
56 |
57 |
58 | def _log_sub(logx: float, logy: float) -> float:
59 | r"""Subtracts two numbers in the log space.
60 |
61 | Args:
62 | logx: First term in log space. Expected to be greater than the second term.
63 |         logy: Second term in log space. Expected to be less than the first term.
64 |
65 | Returns:
66 | Difference of numbers in log space.
67 |
68 | Raises:
69 | ValueError
70 | If the result is negative.
71 | """
72 | if logx < logy:
73 | raise ValueError("The result of subtraction must be non-negative.")
74 | if logy == -np.inf: # subtracting 0
75 | return logx
76 | if logx == logy:
77 | return -np.inf # 0 is represented as -np.inf in the log space.
78 |
79 | try:
80 | # Use exp(x) - exp(y) = (exp(x - y) - 1) * exp(y).
81 | return math.log(math.expm1(logx - logy)) + logy # expm1(x) = exp(x) - 1
82 | except OverflowError:
83 | return logx
84 |
85 |
86 | def _compute_log_a_for_int_alpha(q: float, sigma: float, alpha: int) -> float:
87 | r"""Computes :math:`log(A_\alpha)` for integer ``alpha``.
88 |
89 | Notes:
90 | Note that
91 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``,
92 | and that 0 < ``q`` < 1.
93 |
94 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf for details.
95 |
96 | Args:
97 | q: Sampling rate of SGM.
98 | sigma: The standard deviation of the additive Gaussian noise.
99 | alpha: The order at which RDP is computed.
100 |
101 | Returns:
102 | :math:`log(A_\alpha)` as defined in Section 3.3 of
103 | https://arxiv.org/pdf/1908.10530.pdf.
104 | """
105 |
106 | # Initialize with 0 in the log space.
107 | log_a = -np.inf
108 |
109 | for i in range(alpha + 1):
110 | log_coef_i = (
111 | math.log(special.binom(alpha, i))
112 | + i * math.log(q)
113 | + (alpha - i) * math.log(1 - q)
114 | )
115 |
116 | s = log_coef_i + (i * i - i) / (2 * (sigma ** 2))
117 | log_a = _log_add(log_a, s)
118 |
119 | return float(log_a)
120 |
121 |
122 | def _compute_log_a_for_frac_alpha(q: float, sigma: float, alpha: float) -> float:
123 | r"""Computes :math:`log(A_\alpha)` for fractional ``alpha``.
124 |
125 | Notes:
126 | Note that
127 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``,
128 | and that 0 < ``q`` < 1.
129 |
130 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf for details.
131 |
132 | Args:
133 | q: Sampling rate of SGM.
134 | sigma: The standard deviation of the additive Gaussian noise.
135 | alpha: The order at which RDP is computed.
136 |
137 | Returns:
138 | :math:`log(A_\alpha)` as defined in Section 3.3 of
139 | https://arxiv.org/pdf/1908.10530.pdf.
140 | """
141 | # The two parts of A_alpha, integrals over (-inf,z0] and [z0, +inf), are
142 | # initialized to 0 in the log space:
143 | log_a0, log_a1 = -np.inf, -np.inf
144 | i = 0
145 |
146 | z0 = sigma ** 2 * math.log(1 / q - 1) + 0.5
147 |
148 | while True: # do ... until loop
149 | coef = special.binom(alpha, i)
150 | log_coef = math.log(abs(coef))
151 | j = alpha - i
152 |
153 | log_t0 = log_coef + i * math.log(q) + j * math.log(1 - q)
154 | log_t1 = log_coef + j * math.log(q) + i * math.log(1 - q)
155 |
156 | log_e0 = math.log(0.5) + _log_erfc((i - z0) / (math.sqrt(2) * sigma))
157 | log_e1 = math.log(0.5) + _log_erfc((z0 - j) / (math.sqrt(2) * sigma))
158 |
159 | log_s0 = log_t0 + (i * i - i) / (2 * (sigma ** 2)) + log_e0
160 | log_s1 = log_t1 + (j * j - j) / (2 * (sigma ** 2)) + log_e1
161 |
162 | if coef > 0:
163 | log_a0 = _log_add(log_a0, log_s0)
164 | log_a1 = _log_add(log_a1, log_s1)
165 | else:
166 | log_a0 = _log_sub(log_a0, log_s0)
167 | log_a1 = _log_sub(log_a1, log_s1)
168 |
169 | i += 1
170 | if max(log_s0, log_s1) < -30:
171 | break
172 |
173 | return _log_add(log_a0, log_a1)
174 |
175 |
176 | def _compute_log_a(q: float, sigma: float, alpha: float) -> float:
177 | r"""Computes :math:`log(A_\alpha)` for any positive finite ``alpha``.
178 |
179 | Notes:
180 | Note that
181 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``,
182 | and that 0 < ``q`` < 1.
183 |
184 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf
185 | for details.
186 |
187 | Args:
188 | q: Sampling rate of SGM.
189 | sigma: The standard deviation of the additive Gaussian noise.
190 | alpha: The order at which RDP is computed.
191 |
192 | Returns:
193 | :math:`log(A_\alpha)` as defined in the paper mentioned above.
194 | """
195 | if float(alpha).is_integer():
196 | return _compute_log_a_for_int_alpha(q, sigma, int(alpha))
197 | else:
198 | return _compute_log_a_for_frac_alpha(q, sigma, alpha)
199 |
200 |
201 | def _log_erfc(x: float) -> float:
202 | r"""Computes :math:`log(erfc(x))` with high accuracy for large ``x``.
203 |
204 | Helper function used in computation of :math:`log(A_\alpha)`
205 | for a fractional alpha.
206 |
207 | Args:
208 | x: The input to the function
209 |
210 | Returns:
211 | :math:`log(erfc(x))`
212 | """
213 | return math.log(2) + special.log_ndtr(-x * 2 ** 0.5)
214 |
215 |
216 | def _compute_rdp(q: float, sigma: float, alpha: float) -> float:
217 | r"""Computes RDP of the Sampled Gaussian Mechanism at order ``alpha``.
218 |
219 | Args:
220 | q: Sampling rate of SGM.
221 | sigma: The standard deviation of the additive Gaussian noise.
222 | alpha: The order at which RDP is computed.
223 |
224 | Returns:
225 | RDP at order ``alpha``; can be np.inf.
226 | """
227 | if q == 0:
228 | return 0
229 |
230 | # no privacy
231 | if sigma == 0:
232 | return np.inf
233 |
234 | if q == 1.0:
235 | return alpha / (2 * sigma ** 2)
236 |
237 | if np.isinf(alpha):
238 | return np.inf
239 |
240 | return _compute_log_a(q, sigma, alpha) / (alpha - 1)
241 |
242 |
243 | def compute_rdp(
244 | q: float, noise_multiplier: float, steps: int, orders: Union[Sequence[float], float]
245 | ) -> Union[List[float], float]:
246 | r"""Computes Renyi Differential Privacy (RDP) guarantees of the
247 | Sampled Gaussian Mechanism (SGM) iterated ``steps`` times.
248 |
249 | Args:
250 | q: Sampling rate of SGM.
251 | noise_multiplier: The ratio of the standard deviation of the
252 | additive Gaussian noise to the L2-sensitivity of the function
253 | to which it is added. Note that this is same as the standard
254 | deviation of the additive Gaussian noise when the L2-sensitivity
255 | of the function is 1.
256 | steps: The number of iterations of the mechanism.
257 | orders: An array (or a scalar) of RDP orders.
258 |
259 | Returns:
260 | The RDP guarantees at all orders; can be ``np.inf``.
261 | """
262 | if isinstance(orders, float):
263 | rdp = _compute_rdp(q, noise_multiplier, orders)
264 | else:
265 | rdp = np.array([_compute_rdp(q, noise_multiplier, order) for order in orders])
266 |
267 | return rdp * steps
268 |
269 |
270 | # Based on
271 | # https://github.com/tensorflow/privacy/blob/5f07198b66b3617b22609db983926e3ba97cd905/tensorflow_privacy/privacy/analysis/rdp_accountant.py#L237
272 | def get_privacy_spent(orders, rdp, delta):
273 | """Compute epsilon given a list of RDP values and target delta.
274 | Args:
275 | orders: An array (or a scalar) of orders.
276 | rdp: A list (or a scalar) of RDP guarantees.
277 | delta: The target delta.
278 | Returns:
279 | Pair of (eps, optimal_order).
280 | Raises:
281 | ValueError: If input is malformed.
282 | """
283 | orders_vec = np.atleast_1d(orders)
284 | rdp_vec = np.atleast_1d(rdp)
285 |
286 | if delta <= 0:
287 | raise ValueError("Privacy failure probability bound delta must be >0.")
288 | if len(orders_vec) != len(rdp_vec):
289 | raise ValueError("Input lists must have the same length.")
290 |
291 | # Basic bound (see https://arxiv.org/abs/1702.07476 Proposition 3 in v3):
292 | # eps = min( rdp_vec - math.log(delta) / (orders_vec - 1) )
293 |
294 | # Improved bound from https://arxiv.org/abs/2004.00010 Proposition 12 (in v4).
295 | # Also appears in https://arxiv.org/abs/2001.05990 Equation 20 (in v1).
296 | eps_vec = []
297 | for (a, r) in zip(orders_vec, rdp_vec):
298 | if a < 1:
299 | raise ValueError("Renyi divergence order must be >=1.")
300 | if r < 0:
301 | raise ValueError("Renyi divergence must be >=0.")
302 |
303 | if delta ** 2 + math.expm1(-r) >= 0:
304 | # In this case, we can simply bound via KL divergence:
305 | # delta <= sqrt(1-exp(-KL)).
306 | eps = 0 # No need to try further computation if we have eps = 0.
307 | elif a > 1.01:
308 | # This bound is not numerically stable as alpha->1.
309 | # Thus we have a min value of alpha.
310 | # The bound is also not useful for small alpha, so doesn't matter.
311 | eps = r + math.log1p(-1 / a) - math.log(delta * a) / (a - 1)
312 | else:
313 | # In this case we can't do anything. E.g., asking for delta = 0.
314 | eps = np.inf
315 | eps_vec.append(eps)
316 |
317 | idx_opt = np.argmin(eps_vec)
318 | return max(0, eps_vec[idx_opt]), orders_vec[idx_opt]
319 |
--------------------------------------------------------------------------------
/fastDP/autograd_grad_sample_dist.py:
--------------------------------------------------------------------------------
1 | """
2 | A large portion of this code is adapted from Opacus v0.15 (https://github.com/pytorch/opacus)
3 | and from Private-transformers v0.2.3 (https://github.com/lxuechen/private-transformers)
4 | which are licensed under Apache License 2.0.
5 |
6 | We have modified it considerably to support book-keeping and BiTFiT.
7 | """
8 |
9 | from typing import Tuple
10 |
11 | import torch
12 | import torch.nn as nn
13 |
14 | from .supported_layers_grad_samplers import _supported_layers_norm_sample_AND_clipping,_create_or_extend_private_grad
15 |
16 | def requires_grad(module: nn.Module) -> bool:
17 | """
18 | Checks if any parameters in a specified module require gradients.
19 |
20 | Args:
21 | module: PyTorch module whose parameters are examined
22 |
23 | Returns:
24 |         Flag indicating whether any parameter requires gradients
25 | """
26 | return any(p.requires_grad for p in module.parameters() if hasattr(p,'requires_grad'))
27 |
28 |
29 | def add_hooks(model: nn.Module, loss_reduction='mean', clipping_mode='MixOpt',bias_only=False,
30 | clipping_style='all-layer', block_heads=None, named_params=None, named_layers=None,
31 | clipping_fn=None, numerical_stability_constant=None, max_grad_norm_layerwise=None):
32 | r"""
33 | Adds hooks to model to save activations (to layers) and backprop (to params) values.
34 |
35 | The hooks will
36 |
37 | 1. save activations into ``layer.activations`` (NOT param.activations) during forward pass.
38 |     Note: BiTFiT is special in that, if a layer only requires the bias gradient, no forward hook is needed.
39 |
40 | 2. compute per-sample grad norm or grad and save in ``param.norm_sample`` or ``param.grad_sample`` during backward pass.
41 |
42 | Args:
43 | model: Model to which hooks are added.
44 | """
45 | if hasattr(model, "autograd_grad_sample_hooks"):
46 | raise ValueError("Trying to add hooks twice to the same model")
47 |
48 | handles = []
49 |
50 | for name, layer in model.named_modules():
51 | if type(layer) in _supported_layers_norm_sample_AND_clipping and requires_grad(layer):
52 | if hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad:
53 | #print('Attaching forward hook on', name)
54 | handles.append(layer.register_forward_hook(_capture_activations))
55 |
56 | def this_backward(this_layer, grad_input, grad_output):
57 | _prepare_sample_grad_or_norm(this_layer, grad_output, loss_reduction, clipping_mode,bias_only)
58 | _per_block_clip_grad(this_layer, named_params, named_layers, clipping_style, clipping_fn, numerical_stability_constant, max_grad_norm_layerwise)
59 |
60 | # Starting with 1.8.0, can use `register_full_backward_hook`, but slower
61 | handles.append(layer.register_backward_hook(this_backward))
62 |
63 | model.__dict__.setdefault("autograd_grad_sample_hooks", []).extend(handles)
64 |
65 |
66 | def remove_hooks(model: nn.Module):
67 | """Removes hooks added by `add_hooks()`."""
68 | for handle in model.autograd_grad_sample_hooks:
69 | handle.remove()
70 | del model.autograd_grad_sample_hooks
71 |
72 |
73 | def _capture_activations(layer: nn.Module, inputs: Tuple, outputs: Tuple):
74 | """Forward hook handler captures AND saves activations."""
75 | layer.activations=inputs[0].detach()
76 |
77 | def _prepare_sample_grad_or_norm(
78 | layer: nn.Module,
79 | grad_output: Tuple[torch.Tensor],
80 | loss_reduction='mean',
81 | clipping_mode='MixOpt',
82 | bias_only=False,
83 | ):
84 | """Backward hook handler captures AND saves grad_outputs (book-keeping)."""
85 | backprops = grad_output[0].detach()
86 |
87 |     # Compute per-sample grad norm or grad for individual layers.
88 | if not hasattr(layer,'activations'):
89 | layer.activations=None
90 | if loss_reduction=='mean':
91 | backprops = backprops * backprops.shape[0] # .backprops should save dL_i/ds, not 1/B*dL_i/ds, the mean reduction is taken care of in privacy engine .step()
92 | compute_layer_grad_sample, _ = _supported_layers_norm_sample_AND_clipping.get(type(layer))
93 |
94 | compute_layer_grad_sample(layer, layer.activations, backprops, clipping_mode)
95 |
96 | layer.backprops=backprops
97 |
98 |
99 | def _per_block_clip_grad(
100 | layer: nn.Module, named_params, named_layers, clipping_style, clipping_fn,
101 | numerical_stability_constant,max_grad_norm_layerwise
102 | ):
103 |
104 | if clipping_style=='layer-wise':
105 | if hasattr(layer,'weight') and hasattr(layer.weight,'norm_sample'):
106 | norm_sample = layer.weight.norm_sample
107 | if hasattr(layer,'bias') and hasattr(layer.bias,'norm_sample'):
108 |                 norm_sample = torch.stack([layer.weight.norm_sample, layer.bias.norm_sample], dim=0).norm(2, dim=0)
109 | else:
110 | norm_sample = layer.bias.norm_sample
112 |
113 | # compute per-sample grad norm and clipping factor
114 | if clipping_fn=='automatic':
115 |             C = max_grad_norm_layerwise / (norm_sample + numerical_stability_constant)  # setting C = 1 here recovers non-DP training, e.g. when debugging under mixed precision
116 | elif clipping_fn=='Abadi':
117 | C = torch.clamp_max(max_grad_norm_layerwise / (norm_sample + numerical_stability_constant), 1.)
118 | elif clipping_fn=='global':
119 | C = (norm_sample<=max_grad_norm_layerwise).float()
120 | else:
121 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.")
122 |
123 | if hasattr(layer,'weight') and hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad and hasattr(layer,'activations') and hasattr(layer.weight,'norm_sample'):
124 | #--- weight, compute clipped gradient
125 | _, compute_layer_grad = _supported_layers_norm_sample_AND_clipping.get(type(layer))
126 | common_type=torch.promote_types(layer.activations.dtype,layer.backprops.dtype)
127 | grad_weight = compute_layer_grad(layer, layer.activations.to(common_type), torch.einsum('b...,b->b...',layer.backprops.to(common_type),C), C)
128 | del layer.activations, layer.backprops
129 | _create_or_extend_private_grad(layer.weight, grad_weight, accumulate_private_grad = False)
130 |
131 | if hasattr(layer,'bias') and hasattr(layer.bias,'requires_grad') and layer.bias.requires_grad and hasattr(layer.bias,'grad_sample') and hasattr(layer.bias,'norm_sample'):
132 | #--- bias, compute clipped gradient
133 |             grad_bias = torch.einsum("b...,b->...", layer.bias.grad_sample, C)
134 | del layer.bias.grad_sample
135 | _create_or_extend_private_grad(layer.bias, grad_bias, accumulate_private_grad = False)
136 |
137 | elif clipping_style=='param-wise':
138 | if hasattr(layer,'weight') and hasattr(layer.weight,'norm_sample'):
139 | if clipping_fn=='automatic':
140 | C_weight = max_grad_norm_layerwise / (layer.weight.norm_sample + numerical_stability_constant)
141 | elif clipping_fn=='Abadi':
142 | C_weight = torch.clamp_max(max_grad_norm_layerwise / (layer.weight.norm_sample + numerical_stability_constant), 1.)
143 | elif clipping_fn=='global':
144 | C_weight = (layer.weight.norm_sample<=max_grad_norm_layerwise).float()
145 | else:
146 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.")
147 |
148 | if hasattr(layer,'bias') and hasattr(layer.bias,'norm_sample'):
149 | if clipping_fn=='automatic':
150 | C_bias = max_grad_norm_layerwise / (layer.bias.norm_sample + numerical_stability_constant)
151 | elif clipping_fn=='Abadi':
152 | C_bias = torch.clamp_max(max_grad_norm_layerwise / (layer.bias.norm_sample + numerical_stability_constant), 1.)
153 | elif clipping_fn=='global':
154 | C_bias = (layer.bias.norm_sample<=max_grad_norm_layerwise).float()
155 | else:
156 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.")
157 |
158 |
159 | if hasattr(layer,'weight') and hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad and hasattr(layer,'activations') and hasattr(layer.weight,'norm_sample'):
160 | _, compute_layer_grad = _supported_layers_norm_sample_AND_clipping.get(type(layer))
161 | grad_weight = compute_layer_grad(layer, layer.activations, torch.einsum('b...,b->b...',layer.backprops,C_weight), C_weight)
162 | del layer.activations, layer.backprops
163 |
164 | _create_or_extend_private_grad(layer.weight, grad_weight, accumulate_private_grad = False)
165 |
166 |
167 | #--- bias, compute clipped gradient
168 | if hasattr(layer,'bias') and hasattr(layer.bias,'requires_grad') and layer.bias.requires_grad and hasattr(layer.bias,'grad_sample') and hasattr(layer.bias,'norm_sample'):
169 | grad_bias = torch.einsum("b...,b->...", layer.bias.grad_sample, C_bias)
170 | del layer.bias.grad_sample
171 | _create_or_extend_private_grad(layer.bias, grad_bias, accumulate_private_grad = False)
172 | else:
173 | raise ValueError(f"Unknown clipping style {clipping_style}. Expected one of 'layer-wise','param-wise'.")
174 |
175 |
176 | for param in layer.parameters():
177 | if hasattr(param,'norm_sample'):
178 | del param.norm_sample
179 |
--------------------------------------------------------------------------------
/fastDP/lora_utils.py:
--------------------------------------------------------------------------------
1 | """
2 | LoRA layers.
3 |
4 | This version does not have merged weights for zero latency inference. It makes the code easier to read and maintain.
5 | Adapted from
6 | https://github.com/microsoft/LoRA
7 | https://www.microsoft.com/en-us/research/project/dp-transformers/
8 | """
9 |
10 | import torch
11 | import transformers
12 | from torch import nn
13 |
14 |
15 | class DPMergedLinear(nn.Module):
16 | def __init__(
17 | self,
18 | in_features: int,
19 | out_features: int,
20 | lora_r=0,
21 | lora_alpha=1.,
22 | lora_dropout=0.,
23 | ):
24 | super(DPMergedLinear, self).__init__()
25 | self.linear = nn.Linear(in_features=in_features, out_features=out_features)
26 | self.lora_r = lora_r
27 | self.lora_alpha = lora_alpha
28 | self.lora_dropout = nn.Dropout(p=lora_dropout)
29 | if self.lora_r > 0:
30 | self.lora_A = nn.Linear(in_features=in_features, out_features=lora_r, bias=False)
31 | self.lora_B = nn.Linear(in_features=lora_r, out_features=out_features, bias=False)
32 | self.scaling = self.lora_alpha / lora_r
33 | self.reset_parameters()
34 |
35 | def forward(self, x: torch.Tensor):
36 | result = self.linear(x)
37 | if self.lora_r > 0:
38 | after_dropout = self.lora_dropout(x)
39 | after_A = self.lora_A(after_dropout)
40 | after_B = self.lora_B(after_A)
41 | result += after_B * self.scaling
42 | return result
43 |
44 | def reset_parameters(self):
45 | self.linear.reset_parameters()
46 | if self.lora_r > 0:
47 | self.lora_A.reset_parameters()
48 | self.lora_B.weight.data.zero_()
49 |
50 | @staticmethod
51 | def from_transformers_conv1d(
52 | original_layer,
53 | lora_r=0,
54 | lora_alpha=1.,
55 | lora_dropout=0.,
56 | ) -> "DPMergedLinear":
57 | lora_layer = DPMergedLinear(
58 | in_features=original_layer.weight.shape[0],
59 | out_features=original_layer.weight.shape[1],
60 | lora_r=lora_r,
61 | lora_alpha=lora_alpha,
62 | lora_dropout=lora_dropout,
63 | ).to(original_layer.weight.device)
64 | lora_layer.linear.weight.data.copy_(original_layer.weight.T.data)
65 | lora_layer.linear.bias.data.copy_(original_layer.bias.data)
66 | return lora_layer
67 |
68 |
69 | def convert_gpt2_attention_to_lora(
70 | model: transformers.GPT2PreTrainedModel,
71 | lora_r=0,
72 | lora_alpha=1.,
73 | lora_dropout=0.,
74 | ) -> transformers.GPT2PreTrainedModel:
75 | if not isinstance(model, transformers.GPT2PreTrainedModel):
76 | raise TypeError("Requires a GPT2 model")
77 |
78 | if not hasattr(model, "h") and hasattr(model, "transformer"):
79 | transformer = model.transformer
80 | else:
81 | transformer = model
82 |
83 | for h_i in transformer.h:
84 | new_layer = DPMergedLinear.from_transformers_conv1d(
85 | original_layer=h_i.attn.c_attn,
86 | lora_r=lora_r,
87 | lora_alpha=lora_alpha,
88 | lora_dropout=lora_dropout,
89 | )
90 | h_i.attn.c_attn = new_layer
91 |
92 | return model
93 |
94 |
95 | def mark_only_lora_as_trainable(model: torch.nn.Module) -> None:
96 | model.requires_grad_(True)
97 | for n, p in model.named_parameters():
98 | if 'lora_' not in n:
99 | p.requires_grad = False
100 |
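# Illustrative usage sketch (the checkpoint name and LoRA hyperparameters are
# placeholders): wrap GPT-2's attention projection with LoRA, then train only
# the LoRA factors.
if __name__ == "__main__":
    gpt2 = transformers.GPT2LMHeadModel.from_pretrained("distilgpt2")
    gpt2 = convert_gpt2_attention_to_lora(gpt2, lora_r=4, lora_alpha=32, lora_dropout=0.1)
    mark_only_lora_as_trainable(gpt2)
    print(sum(p.numel() for p in gpt2.parameters() if p.requires_grad), "trainable parameters")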
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch~=1.11.0+cu113
2 | prv-accountant
3 | transformers>=4.20.1
4 | numpy
5 | scipy
6 | jupyterlab
7 | jupyter
8 | opacus>=1.0
9 | ml-swissknife
10 | opt_einsum
11 | pytest
12 | pydantic==1.10
13 | tqdm>=4.62.1
14 | deepspeed~=0.8.3
15 | fairscale==0.4
16 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 | import setuptools
5 |
6 | # for simplicity we actually store the version in the __version__ attribute in the source
7 | here = os.path.realpath(os.path.dirname(__file__))
8 | print(here)
9 | with open(os.path.join(here, 'fastDP', '__init__.py')) as f:
10 | meta_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", f.read(), re.M)
11 | if meta_match:
12 | version = meta_match.group(1)
13 | else:
14 | raise RuntimeError("Unable to find __version__ string.")
15 |
16 | with open(os.path.join(here, 'README.md')) as f:
17 | readme = f.read()
18 |
19 | setuptools.setup(
20 | name="fastDP",
21 | version=version,
22 | author="Zhiqi Bu",
23 | author_email="woodyx218@gmail.com",
24 |     description="Optimally efficient implementation of differentially private optimization (with per-sample gradient clipping).",
25 | long_description=readme,
26 | url="",
27 | packages=setuptools.find_packages(exclude=['examples', 'tests']),
28 | python_requires='~=3.8',
29 | classifiers=[
30 | "Programming Language :: Python :: 3",
31 | "License :: OSI Approved :: Apache Software License",
32 | ],
33 | )
34 |
--------------------------------------------------------------------------------