├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE.md ├── README.md ├── THIRD-PARTY-NOTICES.txt ├── assets ├── efficiency.png ├── nlp.png ├── scalability.png └── vision.png ├── examples ├── __init__.py ├── image_classification │ ├── CIFAR_TIMM.py │ ├── CV_TIMM.py │ ├── CelebA_TIMM.py │ ├── README.md │ ├── ZERO_examples │ │ ├── CIFAR_TIMM_FSDP_extending.py │ │ ├── CIFAR_TIMM_ZERO1.py │ │ ├── CIFAR_TIMM_ZERO23.py │ │ ├── CIFAR_TIMM_ZERO_extending.py │ │ └── cifar_config.json │ └── __init__.py ├── requirements.txt ├── table2text │ ├── README.md │ ├── __init__.py │ ├── compiled_args.py │ ├── data_utils │ │ ├── __init__.py │ │ ├── data_collator.py │ │ └── language_modeling.py │ ├── decoding_utils.py │ ├── gpt_config_stage123.json │ ├── misc.py │ ├── models.py │ ├── run.sh │ ├── run_ZERO1.sh │ ├── run_ZERO23.sh │ ├── run_ZERO_extending.py │ ├── run_language_modeling.py │ ├── run_language_modeling_ZERO23.py │ ├── run_language_modeling_extending.py │ └── trainer.py └── text_classification │ ├── README.md │ ├── __init__.py │ ├── data │ ├── download_dataset.sh │ ├── make_k_shot_without_dev.py │ └── make_valid_data.py │ ├── run_classification.py │ ├── run_wrapper.py │ └── src │ ├── __init__.py │ ├── common.py │ ├── compiled_args.py │ ├── dataset.py │ ├── label_search.py │ ├── models.py │ ├── processors.py │ └── trainer.py ├── fastDP ├── README.md ├── __init__.py ├── accounting │ ├── __init__.py │ ├── accounting_manager.py │ └── rdp_accounting.py ├── autograd_grad_sample.py ├── autograd_grad_sample_dist.py ├── lora_utils.py ├── privacy_engine.py ├── privacy_engine_dist_extending.py ├── privacy_engine_dist_stage23.py ├── supported_differentially_private_layers.py ├── supported_layers_grad_samplers.py └── transformers_support.py ├── requirements.txt └── setup.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. 
Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at
194 | 
195 | http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 | 
--------------------------------------------------------------------------------
/NOTICE.md:
--------------------------------------------------------------------------------
1 | Fast Differential Privacy
2 | 
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Fast Differential Privacy
2 | 
3 | *Fast Differential Privacy* (**fastDP**) is a library that enables differentially private optimization of PyTorch models with only a few additional lines of code. The goal of this library is to make DP deep learning as close to standard non-private learning as possible, in terms of **speed, memory cost, scalability, accuracy and hyperparameter-tuning**. It supports all PyTorch optimizers, popular models in [TIMM](https://github.com/rwightman/pytorch-image-models), [torchvision](https://github.com/pytorch/vision) and [HuggingFace](https://huggingface.co/transformers/) (up to the supported modules), multiple privacy accountants, multiple clipping functions/styles, most parameter-efficient training methods, and distributed solutions such as DeepSpeed and FSDP. The library has provably little overhead in training time and memory cost, compared with standard non-private optimization.
4 | 
5 | 
6 | ---
7 | ## Installation
8 | To install the library after cloning the repository, run
9 | ```bash
10 | python -m setup develop
11 | ```
12 | 
13 | > :warning: **NOTE**: We strongly recommend Python>=3.8 and torch<=1.11 (a known issue is that torch 1.12 can be up to 3 times slower).
14 | 
15 | ## Getting started
16 | To train a model with differential privacy, simply create a `PrivacyEngine` and continue with the standard training pipeline:
17 | 
18 | ```python
19 | from fastDP import PrivacyEngine
20 | optimizer = SGD(model.parameters(), lr=0.05)
21 | privacy_engine = PrivacyEngine(
22 |     model,
23 |     batch_size=256,
24 |     sample_size=50000,
25 |     epochs=3,
26 |     target_epsilon=2,
27 |     clipping_fn='automatic',
28 |     clipping_mode='MixOpt',
29 |     origin_params=None,
30 |     clipping_style='all-layer',
31 | )
32 | # attaching to the optimizer is not needed for multi-GPU distributed learning
33 | privacy_engine.attach(optimizer)
34 | 
35 | #----- standard training pipeline
36 | loss = F.cross_entropy(model(batch), labels)
37 | loss.backward()
38 | optimizer.step()
39 | optimizer.zero_grad()
40 | ```
41 | 
42 | We provide details about our privacy engine in `fastDP/README.md`, including the supported modules and the arguments. By default, we use the `'MixOpt'` (hybrid book-keeping [4]) clipping mode, which enjoys almost the same time complexity as non-private optimization, and the automatic clipping function [8], which removes the need to tune the clipping threshold `max_grad_norm`.
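If you prefer to run the privacy accounting yourself, you can pass a precomputed `noise_multiplier` in place of `target_epsilon`. A minimal sketch, mirroring the bundled example scripts that compute the noise with Opacus's accounting utility (the hyperparameter values are placeholders, and `model`/`optimizer` are reused from the snippet above):

```python
from opacus.accountants.utils import get_noise_multiplier
from fastDP import PrivacyEngine

sigma = get_noise_multiplier(
    target_epsilon=2,         # privacy budget
    target_delta=1e-5,
    sample_rate=256 / 50000,  # batch_size / sample_size
    epochs=3,
)
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50000,
    noise_multiplier=sigma,   # bypasses the built-in epsilon-to-sigma conversion
    epochs=3,
)
privacy_engine.attach(optimizer)
```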
We support both the RDP and GLW privacy accountants, and additional accountants can be used through the argument `noise_multiplier`, after computing it with the [[Automating differential privacy computation](https://github.com/yuxiangw/autodp)] library.
43 | 
44 | We also support gradient accumulation, which lets training use the very large batch sizes that benefit DP optimization:
45 | ```python
46 | for i, batch in enumerate(dataloader):
47 |     loss = F.cross_entropy(model(batch), labels)
48 |     loss.backward()
49 |     if (i + 1) % gradient_accumulation_steps == 0:
50 |         optimizer.step()
51 |         optimizer.zero_grad()
52 | ```
53 | 
54 | ## Foundation model release
55 | We release DP vision foundation models in [v2.1](https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1): VisionTransformer models (ViT, ~86M parameters) following [Pre-training Differentially Private Models with Limited Public Data](https://arxiv.org/abs/2402.18752) in NeurIPS 2024. These models have [epsilon=2](https://github.com/awslabs/fast-differential-privacy/releases/download/v2.1/ViT_base_imgnet11k_DP_eps2.pt) and [epsilon=8](https://github.com/awslabs/fast-differential-privacy/releases/download/v2.1/ViT_base_imgnet11k_DP_eps8.pt), pre-trained on ImageNet-1k with AdamW (1k classes, 1 million images) and ImageNet-11k with DP-AdamW (11k classes, 11 million images). More DP foundation models to come!
56 | 
57 | ## Highlights
58 | 1. This library enables large model training in the **multi-GPU distributed setting** and **supports mixed precision training** under DeepSpeed and FSDP; a rough sketch of the distributed wiring follows the figure below.
59 | 
60 | <p align="center"><img src="assets/scalability.png"></p>
61 | 
62 | The scalability has been tested on 100B-parameter models with 512 GPUs.
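As a rough sketch of how a distributed run differs from the single-GPU recipe in Getting started: the engine is created on the model and `attach` is skipped (see the comment in that snippet). The wrapping order and the use of the plain `PrivacyEngine` here are assumptions for illustration only; the tested ZeRO and FSDP scripts use the dedicated engines in `fastDP/privacy_engine_dist_stage23.py` and `fastDP/privacy_engine_dist_extending.py` (see `examples/image_classification/ZERO_examples` and `examples/table2text`).

```python
# Illustrative only; launch with e.g. `torchrun --nproc_per_node=8 train.py`.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from fastDP import PrivacyEngine

dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

model = torch.nn.Linear(784, 10).cuda()  # stand-in model
privacy_engine = PrivacyEngine(model, batch_size=256, sample_size=50000,
                               epochs=3, target_epsilon=2)
model = FSDP(model)                      # assumed order: install DP hooks, then shard
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
# NOTE: privacy_engine.attach(optimizer) is NOT called in the multi-GPU setting.
```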


66 | 67 | 2. This library enables DP training to have almost **the same time and space complexity** as the standard non-private training. This is achieved by three key techniques as described in [4]: mixed ghost norm, book-keeping, and ghost differentiation. In practice, we observe <20% memory overhead and <25% slowdown across different tasks. 68 | 69 |

70 | <p align="center"><img src="assets/efficiency.png"></p>
71 | 
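To make the ghost norm technique from point 2 concrete, here is a self-contained sketch (an illustration of the underlying identity, not the library's internal code): for a linear layer, each per-sample gradient norm can be computed from two small Gram matrices, without ever materializing the per-sample gradients.

```python
import torch

B, T, d, p = 4, 8, 16, 10   # batch size, sequence length, in/out features
a = torch.randn(B, T, d)    # layer inputs (activations)
g = torch.randn(B, T, p)    # gradients w.r.t. the layer outputs

# Direct route: materialize per-sample gradients, O(B*p*d) memory.
per_sample_grad = torch.einsum('bti,btj->bij', g, a)   # (B, p, d)
norms_direct = per_sample_grad.flatten(1).norm(dim=1)

# Ghost norm route: two T x T Gram matrices per sample, O(B*T^2) memory.
gram_a = torch.einsum('bti,bsi->bts', a, a)            # (B, T, T)
gram_g = torch.einsum('bti,bsi->bts', g, g)            # (B, T, T)
norms_ghost = (gram_a * gram_g).sum(dim=(1, 2)).sqrt()

assert torch.allclose(norms_direct, norms_ghost, rtol=1e-4, atol=1e-4)
```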

72 | 
73 | 3. Specifically, this library overcomes the severe memory issues of large models (commonly encountered by Opacus, which instantiates the per-sample gradients) and of high-dimensional data (commonly encountered by ghost clipping, e.g. in Private Transformers), by leveraging the mixed ghost norm trick [3,8]; a sketch of the per-layer decision rule follows the figures below.
74 | 
75 | 

76 | <p align="center"><img src="assets/vision.png"></p>
77 | <p align="center"><img src="assets/nlp.png"></p>
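A minimal sketch of that per-layer decision (the 2*T^2 vs. d*p threshold follows [3]; the library's actual book-keeping may differ in its details):

```python
def ghost_norm_is_cheaper(T: int, d: int, p: int) -> bool:
    """Mixed ghost norm rule of thumb for one layer [3,8].

    Ghost norm builds two T x T Gram matrices per sample (~2*T^2 floats),
    while materializing the per-sample gradient costs ~d*p floats, where
    T is the sequence length (or number of spatial positions), d the
    input width and p the output width.
    """
    return 2 * T ** 2 <= d * p

# A ViT linear layer (T=197 tokens, d=p=768): ghost norm wins.
print(ghost_norm_is_cheaper(197, 768, 768))  # True
# An early conv layer viewed as a matmul (T=56*56, d=3*3*3, p=64): it loses.
print(ghost_norm_is_cheaper(3136, 27, 64))   # False
```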

78 | 
79 | 4. We **support all optimizers** in [`torch.optim`](https://pytorch.org/docs/stable/optim.html) (SGD, Adam, AdaGrad, ...) and a wide range of **models** (BERT, RoBERTa, GPT, ViT, BEiT, CrossViT, DEiT, ResNet, VGG, DenseNet, ...), including their parameter-efficient variants. For example, one can run DP bias-term fine-tuning (DP-BiTFiT) by simply freezing the non-bias terms, as in `examples/image_classification`.
80 | 
81 | ------
82 | Full fine-tuning results on a single A100 GPU
83 | 
84 | | Datasets | ε | Setting | Model | Accuracy | Time (min)/epoch |
85 | |----------|---|---------|-------|----------|------------------|
86 | | CIFAR10  | 2 | [6] | ViT-large     | 98.9      | 7.0  |
87 | | CIFAR100 | 2 | [6] | BEiT-large    | 88.7      | 6.5  |
88 | | CelebA   | 3 | [6] | ResNet18      | 88.2      | 2.7  |
89 | | SST2     | 3 | [8] | RoBERTa-large | 93.9      | 13.5 |
90 | | QNLI     | 3 | [8] | RoBERTa-large | 91.0      | 20.2 |
91 | | QQP      | 3 | [8] | RoBERTa-large | 86.8      | 70.0 |
92 | | MNLI     | 3 | [8] | RoBERTa-large | 86.3/86.7 | 77.1 |
93 | 
94 | More datasets, epsilon budgets, models, fine-tuning styles, and different hyperparameters can be found in the related papers.
95 | 
96 | 
97 | 
98 | ## Examples
99 | The `examples` folder covers table-to-text generation (E2E and DART datasets with GPT2 models), text classification (SST2/QNLI/QQP/MNLI datasets with BERT/RoBERTa models), and image classification (CIFAR10/CIFAR100/CelebA datasets with [TIMM](https://github.com/rwightman/pytorch-image-models)/[torchvision](https://github.com/pytorch/vision) models). A detailed `README.md` can be found in each sub-folder. These examples can be used to reproduce the results in [2,3,4,6,8].
100 | 
101 | 
102 | ## Citation
103 | Please consider citing the following if you use this library in your work:
104 | ```
105 | @inproceedings{bu2023differentially,
106 |   title={Differentially private optimization on large model at small cost},
107 |   author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
108 |   booktitle={International Conference on Machine Learning},
109 |   pages={3192--3218},
110 |   year={2023},
111 |   organization={PMLR}
112 | }
113 | 
114 | @article{bu2023zero,
115 |   title={Zero redundancy distributed learning with differential privacy},
116 |   author={Bu, Zhiqi and Chiu, Justin and Liu, Ruixuan and Zha, Sheng and Karypis, George},
117 |   booktitle={ICLR 2023 Workshop on Pitfalls of limited data and computation for Trustworthy ML},
118 |   journal={arXiv preprint arXiv:2311.11822},
119 |   year={2023}
120 | }
121 | 
122 | @inproceedings{bu2022differentially,
123 |   title={Differentially Private Bias-Term Fine-tuning of Foundation Models},
124 |   author={Bu, Zhiqi and Wang, Yu-Xiang and Zha, Sheng and Karypis, George},
125 |   booktitle={Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022},
126 |   year={2022}
127 | }
128 | ```
129 | 
130 | ## Acknowledgements
131 | This codebase is largely inspired by [[Opacus (v0.15)]](https://github.com/pytorch/opacus), [[Private transformers (v0.2.3)]](https://github.com/lxuechen/private-transformers), [[Private Vision]](https://github.com/woodyx218/private_vision), and [[FastGradClip]](https://github.com/ppmlguy/fastgradclip).
132 | 
133 | ## References
134 | [1] Ian Goodfellow. "Efficient per-example gradient computations." arXiv preprint arXiv:1510.01799 (2015).
135 | 
136 | [2] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto.
"Large language models can be strong differentially private learners." ICLR (2022). 137 | 138 | [3] Zhiqi Bu, Jialin Mao, and Shiyun Xu. "Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy." NeurIPS (2022). 139 | 140 | [4] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Optimization on Large Model at Small Cost." ICML (2023). 141 | 142 | [5] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen et al. "Opacus: User-friendly differential privacy library in PyTorch." arXiv preprint arXiv:2109.12298 (2021). 143 | 144 | [6] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Bias-Term Fine-tuning of Foundation Models." ICML (2024). 145 | 146 | [7] Martin Abadi, et al. "Deep learning with differential privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 147 | 148 | [8] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Automatic clipping: Differentially private deep learning made easier and stronger." NeurIPS (2023). 149 | 150 | [9] Zhiqi Bu, Xinwei Zhang, Mingyi Hong, Sheng Zha, and George Karypis. "Pre-training Differentially Private Models with Limited Public Data." NeurIPS (2024). 151 | -------------------------------------------------------------------------------- /THIRD-PARTY-NOTICES.txt: -------------------------------------------------------------------------------- 1 | ** private-transformers; version 0.2.3 -- https://github.com/lxuechen/private-transformers 2 | ** opacus; version 0.15 -- https://github.com/pytorch/opacus 3 | ** private_vision; version initial version -- https://github.com/woodyx218/private_vision 4 | 5 | Apache License 6 | Version 2.0, January 2004 7 | http://www.apache.org/licenses/ 8 | 9 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 10 | 11 | 1. Definitions. 12 | 13 | "License" shall mean the terms and conditions for use, reproduction, and 14 | distribution as defined by Sections 1 through 9 of this document. 15 | 16 | "Licensor" shall mean the copyright owner or entity authorized by the copyright 17 | owner that is granting the License. 18 | 19 | "Legal Entity" shall mean the union of the acting entity and all other entities 20 | that control, are controlled by, or are under common control with that entity. 21 | For the purposes of this definition, "control" means (i) the power, direct or 22 | indirect, to cause the direction or management of such entity, whether by 23 | contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the 24 | outstanding shares, or (iii) beneficial ownership of such entity. 25 | 26 | "You" (or "Your") shall mean an individual or Legal Entity exercising 27 | permissions granted by this License. 28 | 29 | "Source" form shall mean the preferred form for making modifications, including 30 | but not limited to software source code, documentation source, and configuration 31 | files. 32 | 33 | "Object" form shall mean any form resulting from mechanical transformation or 34 | translation of a Source form, including but not limited to compiled object code, 35 | generated documentation, and conversions to other media types. 36 | 37 | "Work" shall mean the work of authorship, whether in Source or Object form, made 38 | available under the License, as indicated by a copyright notice that is included 39 | in or attached to the work (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object form, that 42 | is based on (or derived from) the Work and for which the editorial revisions, 43 | annotations, elaborations, or other modifications represent, as a whole, an 44 | original work of authorship. For the purposes of this License, Derivative Works 45 | shall not include works that remain separable from, or merely link (or bind by 46 | name) to the interfaces of, the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including the original version 49 | of the Work and any modifications or additions to that Work or Derivative Works 50 | thereof, that is intentionally submitted to Licensor for inclusion in the Work 51 | by the copyright owner or by an individual or Legal Entity authorized to submit 52 | on behalf of the copyright owner. For the purposes of this definition, 53 | "submitted" means any form of electronic, verbal, or written communication sent 54 | to the Licensor or its representatives, including but not limited to 55 | communication on electronic mailing lists, source code control systems, and 56 | issue tracking systems that are managed by, or on behalf of, the Licensor for 57 | the purpose of discussing and improving the Work, but excluding communication 58 | that is conspicuously marked or otherwise designated in writing by the copyright 59 | owner as "Not a Contribution." 60 | 61 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf 62 | of whom a Contribution has been received by Licensor and subsequently 63 | incorporated within the Work. 64 | 65 | 2. Grant of Copyright License. Subject to the terms and conditions of this 66 | License, each Contributor hereby grants to You a perpetual, worldwide, non- 67 | exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, 68 | prepare Derivative Works of, publicly display, publicly perform, sublicense, and 69 | distribute the Work and such Derivative Works in Source or Object form. 70 | 71 | 3. Grant of Patent License. Subject to the terms and conditions of this License, 72 | each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no- 73 | charge, royalty-free, irrevocable (except as stated in this section) patent 74 | license to make, have made, use, offer to sell, sell, import, and otherwise 75 | transfer the Work, where such license applies only to those patent claims 76 | licensable by such Contributor that are necessarily infringed by their 77 | Contribution(s) alone or by combination of their Contribution(s) with the Work 78 | to which such Contribution(s) was submitted. If You institute patent litigation 79 | against any entity (including a cross-claim or counterclaim in a lawsuit) 80 | alleging that the Work or a Contribution incorporated within the Work 81 | constitutes direct or contributory patent infringement, then any patent licenses 82 | granted to You under this License for that Work shall terminate as of the date 83 | such litigation is filed. 84 | 85 | 4. Redistribution. 
You may reproduce and distribute copies of the Work or 86 | Derivative Works thereof in any medium, with or without modifications, and in 87 | Source or Object form, provided that You meet the following conditions: 88 | 89 | (a) You must give any other recipients of the Work or Derivative Works a 90 | copy of this License; and 91 | 92 | (b) You must cause any modified files to carry prominent notices stating 93 | that You changed the files; and 94 | 95 | (c) You must retain, in the Source form of any Derivative Works that You 96 | distribute, all copyright, patent, trademark, and attribution notices from the 97 | Source form of the Work, excluding those notices that do not pertain to any part 98 | of the Derivative Works; and 99 | 100 | (d) If the Work includes a "NOTICE" text file as part of its distribution, 101 | then any Derivative Works that You distribute must include a readable copy of 102 | the attribution notices contained within such NOTICE file, excluding those 103 | notices that do not pertain to any part of the Derivative Works, in at least one 104 | of the following places: within a NOTICE text file distributed as part of the 105 | Derivative Works; within the Source form or documentation, if provided along 106 | with the Derivative Works; or, within a display generated by the Derivative 107 | Works, if and wherever such third-party notices normally appear. The contents of 108 | the NOTICE file are for informational purposes only and do not modify the 109 | License. You may add Your own attribution notices within Derivative Works that 110 | You distribute, alongside or as an addendum to the NOTICE text from the Work, 111 | provided that such additional attribution notices cannot be construed as 112 | modifying the License. 113 | 114 | You may add Your own copyright statement to Your modifications and may 115 | provide additional or different license terms and conditions for use, 116 | reproduction, or distribution of Your modifications, or for any such Derivative 117 | Works as a whole, provided Your use, reproduction, and distribution of the Work 118 | otherwise complies with the conditions stated in this License. 119 | 120 | 5. Submission of Contributions. Unless You explicitly state otherwise, any 121 | Contribution intentionally submitted for inclusion in the Work by You to the 122 | Licensor shall be under the terms and conditions of this License, without any 123 | additional terms or conditions. Notwithstanding the above, nothing herein shall 124 | supersede or modify the terms of any separate license agreement you may have 125 | executed with Licensor regarding such Contributions. 126 | 127 | 6. Trademarks. This License does not grant permission to use the trade names, 128 | trademarks, service marks, or product names of the Licensor, except as required 129 | for reasonable and customary use in describing the origin of the Work and 130 | reproducing the content of the NOTICE file. 131 | 132 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in 133 | writing, Licensor provides the Work (and each Contributor provides its 134 | Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 135 | KIND, either express or implied, including, without limitation, any warranties 136 | or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 137 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 138 | appropriateness of using or redistributing the Work and assume any risks 139 | associated with Your exercise of permissions under this License. 140 | 141 | 8. Limitation of Liability. In no event and under no legal theory, whether in 142 | tort (including negligence), contract, or otherwise, unless required by 143 | applicable law (such as deliberate and grossly negligent acts) or agreed to in 144 | writing, shall any Contributor be liable to You for damages, including any 145 | direct, indirect, special, incidental, or consequential damages of any character 146 | arising as a result of this License or out of the use or inability to use the 147 | Work (including but not limited to damages for loss of goodwill, work stoppage, 148 | computer failure or malfunction, or any and all other commercial damages or 149 | losses), even if such Contributor has been advised of the possibility of such 150 | damages. 151 | 152 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or 153 | Derivative Works thereof, You may choose to offer, and charge a fee for, 154 | acceptance of support, warranty, indemnity, or other liability obligations 155 | and/or rights consistent with this License. However, in accepting such 156 | obligations, You may act only on Your own behalf and on Your sole 157 | responsibility, not on behalf of any other Contributor, and only if You agree to 158 | indemnify, defend, and hold each Contributor harmless for any liability incurred 159 | by, or claims asserted against, such Contributor by reason of your accepting any 160 | such warranty or additional liability. 161 | 162 | END OF TERMS AND CONDITIONS 163 | 164 | APPENDIX: How to apply the Apache License to your work. 165 | 166 | To apply the Apache License to your work, attach the following boilerplate 167 | notice, with the fields enclosed by brackets "[]" replaced with your own 168 | identifying information. (Don't include the brackets!) The text should be 169 | enclosed in the appropriate comment syntax for the file format. We also 170 | recommend that a file or class name and description of purpose be included on 171 | the same "printed page" as the copyright notice for easier identification within 172 | third-party archives. 173 | 174 | Copyright [yyyy] [name of copyright owner] 175 | 176 | Licensed under the Apache License, Version 2.0 (the "License"); 177 | you may not use this file except in compliance with the License. 178 | You may obtain a copy of the License at 179 | 180 | http://www.apache.org/licenses/LICENSE-2.0 181 | 182 | Unless required by applicable law or agreed to in writing, software 183 | distributed under the License is distributed on an "AS IS" BASIS, 184 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 185 | See the License for the specific language governing permissions and 186 | limitations under the License. 187 | 188 | * For Private-Transformers see also this required NOTICE: 189 | None 190 | * For opacus see also this required NOTICE: 191 | Copyright (c) Meta Platforms, Inc. and affiliates. 
192 | * For private_vision see also this required NOTICE: 193 | None 194 | 195 | ------ 196 | 197 | ** ml-swissknife; version 0.1.7 -- https://github.com/lxuechen/ml-swissknife 198 | None 199 | 200 | MIT License 201 | 202 | Copyright (c) 203 | 204 | Permission is hereby granted, free of charge, to any person obtaining a copy of 205 | this software and associated documentation files (the "Software"), to deal in 206 | the Software without restriction, including without limitation the rights to 207 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 208 | the Software, and to permit persons to whom the Software is furnished to do so, 209 | subject to the following conditions: 210 | 211 | The above copyright notice and this permission notice shall be included in all 212 | copies or substantial portions of the Software. 213 | 214 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 215 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 216 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 217 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 218 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 219 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 220 | -------------------------------------------------------------------------------- /assets/efficiency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/efficiency.png -------------------------------------------------------------------------------- /assets/nlp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/nlp.png -------------------------------------------------------------------------------- /assets/scalability.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/scalability.png -------------------------------------------------------------------------------- /assets/vision.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/assets/vision.png -------------------------------------------------------------------------------- /examples/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/__init__.py -------------------------------------------------------------------------------- /examples/image_classification/CIFAR_TIMM.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 4 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 5 | return None 6 | 7 | device= torch.device("cuda:0") 8 | 9 | # Data 10 | print('==> Preparing data..') 11 | 12 | transformation 
= torchvision.transforms.Compose([
13 |         torchvision.transforms.Resize(args.dimension),
14 |         torchvision.transforms.ToTensor(),
15 |         torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
16 |     ])
17 | 
18 | 
19 |     if args.cifar_data == 'CIFAR10':
20 |         trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation)
21 |         testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation)
22 |     elif args.cifar_data == 'CIFAR100':
23 |         trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation)
24 |         testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation)
25 |     else:
26 |         raise ValueError("Must specify dataset as CIFAR10 or CIFAR100")
27 | 
28 | 
29 |     trainloader = torch.utils.data.DataLoader(
30 |         trainset, batch_size=args.mini_bs, shuffle=True, num_workers=4)
31 | 
32 |     testloader = torch.utils.data.DataLoader(
33 |         testset, batch_size=100, shuffle=False, num_workers=4)
34 | 
35 |     n_acc_steps = args.bs // args.mini_bs  # gradient accumulation steps
36 | 
37 |     # Model
38 |     print('==> Building model..', args.model, '; BatchNorm is replaced by GroupNorm. Mode: ', args.clipping_mode)
39 |     net = timm.create_model(args.model, pretrained=True, num_classes=int(args.cifar_data[5:]))
40 |     net = ModuleValidator.fix(net); net = net.to(device)
41 | 
42 |     print('Number of total parameters: ', sum([p.numel() for p in net.parameters()]))
43 |     print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad]))
44 | 
45 |     criterion = nn.CrossEntropyLoss()
46 | 
47 |     optimizer = optim.Adam(net.parameters(), lr=args.lr)
48 | 
49 |     if 'BiTFiT' in args.clipping_mode:  # freeze non-bias terms; not needed for DP-BiTFiT but used here for safety
50 |         for name, param in net.named_parameters():
51 |             if '.bias' not in name:
52 |                 param.requires_grad_(False)
53 | 
54 |     # Privacy engine
55 |     if 'nonDP' not in args.clipping_mode:
56 |         sigma = get_noise_multiplier(
57 |             target_epsilon = args.epsilon,
58 |             target_delta = 1e-5,
59 |             sample_rate = args.bs / len(trainset),
60 |             epochs = args.epochs,
61 |         )
62 | 
63 |         if 'BK' in args.clipping_mode:
64 |             clipping_mode = args.clipping_mode[3:]  # strip the 'BK-' prefix
65 |         else:
66 |             clipping_mode = 'ghost'
67 | 
68 |         if args.clipping_style in [['all-layer'], ['layer-wise'], ['param-wise']]:
69 |             args.clipping_style = args.clipping_style[0]  # unwrap the single-element list from nargs='+'
70 |         privacy_engine = PrivacyEngine(
71 |             net,
72 |             batch_size=args.bs,
73 |             sample_size=len(trainset),
74 |             noise_multiplier=sigma,
75 |             epochs=args.epochs,
76 |             clipping_mode=clipping_mode,
77 |             clipping_style=args.clipping_style,
78 |             origin_params=args.origin_params,  # e.g. ['patch_embed.proj.bias']
79 |         )
80 |         privacy_engine.attach(optimizer)
81 | 
82 | 
83 |     def train(epoch):
84 | 
85 |         net.train()
86 |         train_loss = 0
87 |         correct = 0
88 |         total = 0
89 | 
90 | 
91 |         for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
92 |             inputs, targets = inputs.to(device), targets.to(device)
93 |             outputs = net(inputs)
94 |             loss = criterion(outputs, targets)
95 | 
96 |             loss.backward()
97 |             if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
98 |                 optimizer.step()
99 |                 optimizer.zero_grad()
100 | 
101 |             train_loss += loss.item()
102 |             _, predicted = outputs.max(1)
103 |             total += targets.size(0)
104 |             correct += predicted.eq(targets).sum().item()
105 | 
106 |         print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
107 |               % (train_loss/(batch_idx+1), 100.*correct/total,
correct, total))
108 | 
109 |     def test(epoch):
110 |         net.eval()
111 |         test_loss = 0
112 |         correct = 0
113 |         total = 0
114 |         with torch.no_grad():
115 |             for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
116 |                 inputs, targets = inputs.to(device), targets.to(device)
117 |                 outputs = net(inputs)
118 |                 loss = criterion(outputs, targets)
119 | 
120 |                 test_loss += loss.item()
121 |                 _, predicted = outputs.max(1)
122 |                 total += targets.size(0)
123 |                 correct += predicted.eq(targets).sum().item()
124 | 
125 |             print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
126 |                   % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
127 | 
128 |     for epoch in range(args.epochs):
129 |         train(epoch)
130 |         test(epoch)
131 | 
132 | 
133 | if __name__ == '__main__':
134 |     import argparse
135 | 
136 |     parser = argparse.ArgumentParser(description='PyTorch CIFAR Training')
137 |     parser.add_argument('--lr', default=0.0005, type=float, help='learning rate')
138 |     parser.add_argument('--epochs', default=3, type=int,
139 |                         help='number of epochs')
140 |     parser.add_argument('--bs', default=1000, type=int, help='batch size')
141 |     parser.add_argument('--mini_bs', type=int, default=50)
142 |     parser.add_argument('--epsilon', default=2, type=float, help='target epsilon')
143 |     parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str)
144 |     parser.add_argument('--clipping_style', default='all-layer', nargs='+', type=str)
145 |     parser.add_argument('--model', default='vit_small_patch16_224', type=str)
146 |     parser.add_argument('--cifar_data', type=str, default='CIFAR10')
147 |     parser.add_argument('--dimension', type=int, default=224)
148 |     parser.add_argument('--origin_params', nargs='+', default=None)
149 | 
150 |     args = parser.parse_args()
151 | 
152 |     from fastDP import PrivacyEngine
153 | 
154 |     import torch
155 |     import torchvision
156 |     torch.manual_seed(2)
157 |     import torch.nn as nn
158 |     import torch.optim as optim
159 |     import timm
160 |     from opacus.validators import ModuleValidator
161 |     from opacus.accountants.utils import get_noise_multiplier
162 |     from tqdm import tqdm
163 |     import warnings; warnings.filterwarnings("ignore")
164 | 
165 |     main(args)
166 | 
--------------------------------------------------------------------------------
/examples/image_classification/CV_TIMM.py:
--------------------------------------------------------------------------------
1 | '''Train computer vision models with PyTorch.'''
2 | def main(args):
3 | 
4 |     device = torch.device("cuda:0")
5 | 
6 |     # Data
7 |     transformation = torchvision.transforms.Compose([
8 |         torchvision.transforms.Resize((224, 224)),  # https://discuss.pytorch.org/t/runtimeerror-stack-expects-each-tensor-to-be-equal-size-but-got-3-224-224-at-entry-0-and-3-224-336-at-entry-3/87211/10
9 |         torchvision.transforms.ToTensor(),
10 |         torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
11 |     ])
12 | 
13 |     if args.dataset_name in ['SVHN', 'CIFAR10']:
14 |         num_classes = 10
15 |     elif args.dataset_name in ['CIFAR100', 'FGVCAircraft']:
16 |         num_classes = 100
17 |     elif args.dataset_name in ['Food101']:
18 |         num_classes = 101
19 |     elif args.dataset_name in ['GTSRB']:
20 |         num_classes = 43
21 |     elif args.dataset_name in ['CelebA']:
22 |         num_classes = 40
23 |     elif args.dataset_name in ['Places365']:
24 |         num_classes = 365
25 |     elif args.dataset_name in ['ImageNet']:
26 |         num_classes = 1000
27 |     elif args.dataset_name in ['INaturalist']:
28 |         num_classes = 10000
29 | 
30 | 
31 |     if args.dataset_name in ['SVHN', 'Food101', 'GTSRB', 'FGVCAircraft']:
32 |         trainset =
getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', download=True, transform=transformation) 33 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='test', download=True, transform=transformation) 34 | elif args.dataset_name in ['CIFAR10','CIFAR100']: 35 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', train=True, download=True, transform=transformation) 36 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', train=False, download=True, transform=transformation) 37 | elif args.dataset_name=='CelebA': 38 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', download=False, target_type='attr', transform=transformation) 39 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='test', download=False, target_type='attr',transform=transformation) 40 | elif args.dataset_name=='Places365': 41 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train-standard', small=True, download=False, transform=transformation) 42 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='val', small=True, download=False, transform=transformation) 43 | elif args.dataset_name=='INaturalist': 44 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', version='2021_train_mini', download=False, transform=transformation) 45 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', version='2021_valid', download=False, transform=transformation) 46 | elif args.dataset_name=='ImageNet': 47 | trainset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='train', transform=transformation) 48 | testset = getattr(torchvision.datasets,args.dataset_name)(root='data/', split='val', transform=transformation) 49 | 50 | trainloader = torch.utils.data.DataLoader( 51 | trainset, batch_size=args.mini_bs, shuffle=True, num_workers=4) 52 | 53 | testloader = torch.utils.data.DataLoader( 54 | testset, batch_size=100, shuffle=False, num_workers=4) 55 | 56 | n_acc_steps = args.bs // args.mini_bs # gradient accumulation steps 57 | 58 | 59 | # Model 60 | net = timm.create_model(args.model, pretrained=True, num_classes = num_classes) 61 | net = ModuleValidator.fix(net).to(device) 62 | 63 | if args.dataset_name=='CelebA': 64 | criterion = nn.BCEWithLogitsLoss(reduction='none') 65 | else: 66 | criterion = nn.CrossEntropyLoss() 67 | 68 | 69 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 70 | 71 | if 'BiTFiT' in args.clipping_mode: 72 | for name,layer in net.named_modules(): 73 | if hasattr(layer,'weight'): 74 | temp_layer=layer 75 | for name,param in net.named_parameters(): 76 | if '.bias' not in name: 77 | param.requires_grad_(False) 78 | for param in temp_layer.parameters(): 79 | param.requires_grad_(True) 80 | 81 | # Privacy engine 82 | if 'nonDP' not in args.clipping_mode: 83 | sigma=get_noise_multiplier( 84 | target_epsilon = args.epsilon, 85 | target_delta = 1/len(trainset), 86 | sample_rate = args.bs/len(trainset), 87 | epochs = args.epochs, 88 | ) 89 | print(f'adding noise level {sigma}') 90 | privacy_engine = PrivacyEngine( 91 | net, 92 | batch_size=args.bs, 93 | sample_size=len(trainset), 94 | noise_multiplier=sigma, 95 | epochs=args.epochs, 96 | clipping_mode='MixOpt', 97 | clipping_style='all-layer', 98 | ) 99 | privacy_engine.attach(optimizer) 100 | 101 | 102 | tr_loss=[] 103 | te_loss=[] 104 | tr_acc=[] 105 | te_acc=[] 106 | 107 | def train(epoch): 108 | 109 | 
net.train()
110 |         train_loss = 0
111 |         correct = 0
112 |         total = 0
113 | 
114 | 
115 |         for batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)):
116 |             inputs, targets = inputs.to(device), targets.to(device)
117 |             outputs = net(inputs)
118 |             if args.dataset_name == 'CelebA':
119 |                 loss = criterion(outputs, targets.float()).sum(dim=1).mean()  # multi-label BCE
120 |             else:
121 |                 loss = criterion(outputs, targets)
122 | 
123 |             loss.backward()
124 |             if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)):
125 |                 optimizer.step()
126 |                 optimizer.zero_grad()
127 | 
128 |             train_loss += loss.item()
129 |             total += targets.size(0)
130 |             if args.dataset_name == 'CelebA':
131 |                 correct += ((outputs > 0) == targets).sum(dim=0).float().mean()  # average accuracy over the 40 attributes
132 |             else:
133 |                 _, predicted = outputs.max(1)
134 |                 correct += predicted.eq(targets).sum().item()
135 | 
136 |             if args.dataset_name in ['Places365', 'INaturalist', 'ImageNet'] and (batch_idx + 1) % 100 == 0:
137 |                 print(loss.item(), 100.*correct/total)
138 | 
139 | 
140 |         tr_loss.append(train_loss/(batch_idx+1))
141 |         tr_acc.append(100.*correct/total)
142 |         print('Epoch: ', epoch, 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)'
143 |               % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
144 | 
145 |     def test(epoch):
146 |         net.eval()
147 |         test_loss = 0
148 |         correct = 0
149 |         total = 0
150 |         with torch.no_grad():
151 |             for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)):
152 |                 inputs, targets = inputs.to(device), targets.to(device)
153 |                 outputs = net(inputs)
154 |                 if args.dataset_name == 'CelebA':
155 |                     loss = criterion(outputs, targets.float()).sum(dim=1).mean()
156 |                 else:
157 |                     loss = criterion(outputs, targets)
158 | 
159 |                 test_loss += loss.item()
160 |                 total += targets.size(0)
161 |                 if args.dataset_name == 'CelebA':
162 |                     correct += ((outputs > 0) == targets).sum(dim=0).float().mean()
163 |                 else:
164 |                     _, predicted = outputs.max(1)
165 |                     correct += predicted.eq(targets).sum().item()
166 | 
167 |             te_loss.append(test_loss/(batch_idx+1))
168 |             te_acc.append(100.*correct/total)
169 |             print('Epoch: ', epoch, 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)'
170 |                   % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
171 | 
172 |     for epoch in range(args.epochs):
173 |         train(epoch)
174 |         test(epoch)
175 |     print(tr_loss, tr_acc, te_loss, te_acc)
176 | 
177 | 
178 | if __name__ == '__main__':
179 |     import argparse
180 | 
181 |     parser = argparse.ArgumentParser(description='PyTorch CV Training')
182 |     parser.add_argument('--lr', default=5e-4, type=float, help='learning rate')
183 |     parser.add_argument('--epochs', default=5, type=int,
184 |                         help='number of epochs')
185 |     parser.add_argument('--bs', default=1000, type=int, help='batch size')
186 |     parser.add_argument('--mini_bs', type=int, default=100)
187 |     parser.add_argument('--epsilon', default=8, type=float, help='target epsilon')
188 |     parser.add_argument('--dataset_name', type=str, default='CIFAR10', help='https://pytorch.org/vision/stable/datasets.html')
189 |     parser.add_argument('--clipping_mode', type=str, default='MixOpt', choices=['BiTFiT', 'MixOpt', 'nonDP', 'nonDP-BiTFiT'])
190 |     parser.add_argument('--model', default='vit_base_patch16_224', type=str, help='model name')
191 | 
192 |     args = parser.parse_args()
193 | 
194 |     from fastDP import PrivacyEngine
195 | 
196 |     import torch
197 |     import torchvision
198 |     torch.manual_seed(2)
199 |     import torch.nn as nn
200 |     import torch.optim as optim
201 |     import timm
202 |     from opacus.validators import ModuleValidator
from opacus.accountants.utils import get_noise_multiplier 204 | from tqdm import tqdm 205 | import numpy as np 206 | import warnings; warnings.filterwarnings("ignore") 207 | main(args) 208 | -------------------------------------------------------------------------------- /examples/image_classification/CelebA_TIMM.py: -------------------------------------------------------------------------------- 1 | #This runs multi-label classification 2 | def main(args): 3 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 4 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 5 | return None 6 | 7 | device = torch.device('cuda') 8 | 9 | # Data 10 | print('==> Preparing data..') 11 | 12 | train_set = datasets.CelebA(root='.', split='train',target_type='attr', 13 | transform=transforms.Compose([ 14 | transforms.ToTensor(), 15 | #transforms.Normalize(mean=[0.5,0.5,0.5],std=[0.5,0.5,0.5]), 16 | ])) 17 | test_set = datasets.CelebA(root=".", split='test', target_type='attr', 18 | transform=transforms.Compose([ 19 | transforms.ToTensor()])) 20 | 21 | if args.labels==None: 22 | args.labels=list(range(40)) 23 | print('Training on all 40 labels.') 24 | else: 25 | print('Training on ', [attr_names[ind] for ind in args.labels]) 26 | 27 | 28 | train_set.attr = train_set.attr[:, args.labels].type(torch.float32) 29 | test_set.attr = test_set.attr[:, args.labels].type(torch.float32) 30 | 31 | print('Training/Testing set size: ', len(train_set),len(test_set),' ; Image dimension: ',train_set[0][0].shape) 32 | 33 | trainloader = torch.utils.data.DataLoader( 34 | train_set, batch_size=args.mini_bs, pin_memory=True,num_workers=4,shuffle=True) 35 | testloader = torch.utils.data.DataLoader( 36 | test_set, batch_size=500, pin_memory=True,num_workers=4, shuffle=False) 37 | 38 | n_acc_steps=args.bs//args.mini_bs 39 | 40 | # Model 41 | print('==> Building model..', args.model,'; BatchNorm is replaced by GroupNorm.') 42 | net = timm.create_model(args.model, pretrained=True, num_classes=len(args.labels)) 43 | net = ModuleValidator.fix(net) 44 | net=net.to(device) 45 | 46 | for name,param in net.named_parameters(): 47 | print("First trainable parameter is: ",name);break 48 | 49 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 50 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 51 | 52 | criterion = nn.BCEWithLogitsLoss(reduction='none') 53 | 54 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 55 | 56 | if 'BiTFiT' in args.clipping_mode: 57 | for name,param in net.named_parameters(): 58 | if '.bias' not in name: 59 | param.requires_grad_(False) 60 | 61 | # Privacy engine 62 | if 'nonDP' not in args.clipping_mode: 63 | sigma=get_noise_multiplier( 64 | target_epsilon = args.epsilon, 65 | target_delta = 5e-6, 66 | sample_rate = args.bs/len(train_set), 67 | epochs = args.epochs, 68 | ) 69 | 70 | if 'BK' in args.clipping_mode: 71 | clipping_mode=args.clipping_mode[3:] 72 | else: 73 | clipping_mode='ghost' 74 | privacy_engine = PrivacyEngine( 75 | net, 76 | batch_size=args.bs, 77 | sample_size=len(train_set), 78 | noise_multiplier=sigma, 79 | epochs=args.epochs, 80 | clipping_mode=clipping_mode, 81 | origin_params=args.origin_params, 82 | ) 83 | privacy_engine.attach(optimizer) 84 | 85 | 86 | def train(epoch): 87 | 88 | net.train() 89 | train_loss = 0 90 | correct = np.zeros_like([0]*len(args.labels)) 91 | total = 0 92 | for 
batch_idx, (inputs, targets) in enumerate(tqdm(trainloader)): 93 | inputs, targets = inputs.to(device), targets.to(device) 94 | outputs = net(inputs) 95 | loss = criterion(outputs, targets.float()).sum(dim=1).mean() 96 | 97 | 98 | loss.backward() 99 | if ((batch_idx + 1) % n_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)): 100 | optimizer.step() 101 | optimizer.zero_grad() 102 | 103 | train_loss += loss.item() 104 | total += targets.size(0) 105 | correct += ((outputs > 0) == targets).sum(dim=0).cpu().detach().numpy() 106 | 107 | print('Epoch: ', epoch, 'Train Loss: ', train_loss/(batch_idx+1), 108 | ' | Acc: ', 100.*correct/total, np.mean(100.0 * correct / total)) 109 | 110 | def test(epoch): 111 | net.eval() 112 | test_loss = 0 113 | correct = np.zeros_like([0]*len(args.labels)) 114 | total = 0 115 | with torch.no_grad(): 116 | for batch_idx, (inputs, targets) in enumerate(tqdm(testloader)): 117 | inputs, targets = inputs.to(device), targets.to(device) 118 | outputs = net(inputs) 119 | loss = criterion(outputs, targets.float()).sum(dim=1) 120 | loss = loss.mean() 121 | 122 | test_loss += loss.item() 123 | total += targets.size(0) 124 | correct += ((outputs > 0) == targets).sum(dim=0).cpu().detach().numpy() 125 | 126 | print('Epoch: ', epoch, 'Test Loss: ', test_loss/(batch_idx+1), 127 | ' | Acc: ', 100.*correct/total, np.mean(100.0 * correct / total)) 128 | 129 | 130 | for epoch in range(args.epochs): 131 | train(epoch) 132 | test(epoch) 133 | 134 | 135 | if __name__ == '__main__': 136 | import argparse 137 | parser = argparse.ArgumentParser() 138 | parser.add_argument('--lr', type=float, default=0.001) 139 | parser.add_argument('--epochs', type=int, default=10) 140 | parser.add_argument('--bs', type=int, default=500) 141 | parser.add_argument('--mini_bs', type=int, default=100) 142 | parser.add_argument('--epsilon', default=3, type=float) 143 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 144 | parser.add_argument('--model', type=str, default='resnet18') 145 | parser.add_argument('--labels', nargs="*", type=int, default=None,help='List of label indices, 0-39 for CelebA') 146 | parser.add_argument('--origin_params', nargs='+', default=None) 147 | 148 | 149 | args = parser.parse_args() 150 | 151 | attr_names=['5_o_Clock_Shadow','Arched_Eyebrows','Attractive','Bags_Under_Eyes', 152 | 'Bald','Bangs','Big_Lips','Big_Nose', 153 | 'Black_Hair','Blond_Hair','Blurry','Brown_Hair', 154 | 'Bushy_Eyebrows','Chubby','Double_Chin','Eyeglasses', 155 | 'Goatee','Gray_Hair','Heavy_Makeup','High_Cheekbones', 156 | 'Male','Mouth_Slightly_Open','Mustache','Narrow_Eyes', 157 | 'No_Beard','Oval_Face','Pale_Skin','Pointy_Nose', 158 | 'Receding_Hairline','Rosy_Cheeks','Sideburns','Smiling', 159 | 'Straight_Hair','Wavy_Hair','Wearing_Earrings','Wearing_Hat', 160 | 'Wearing_Lipstick','Wearing_Necklace','Wearing_Necktie','Young'] 161 | 162 | import numpy as np 163 | from fastDP import PrivacyEngine 164 | 165 | import torch 166 | from torchvision import datasets, transforms 167 | torch.manual_seed(0) 168 | import torch.nn as nn 169 | import torch.optim as optim 170 | import timm 171 | from opacus.validators import ModuleValidator 172 | from opacus.accountants.utils import get_noise_multiplier 173 | from tqdm import tqdm 174 | import warnings; warnings.filterwarnings("ignore") 175 | 176 | main(args) 177 | -------------------------------------------------------------------------------- /examples/image_classification/README.md: 
-------------------------------------------------------------------------------- 1 | ## DP image classification with convolutional neural networks and vision transformers 2 | 3 | We provide scripts to run DP optimization on CIFAR10, CIFAR100, SVHN, ImageNet, CelebA, Places365, INaturalist, and other datasets, using the models (CNN and ViT) from [TIMM](https://github.com/rwightman/pytorch-image-models/tree/master/timm/models). Supported models include VGG, ResNet, Wide ResNet, ViT, CrossViT, BEiT, DEiT, ... 4 | 5 | ### Multi-GPU distributed learning 6 | See the `ZERO_examples` folder. Our privacy engine supports DeepSpeed (ZeRO 1+2+3) and FSDP with mixed precision training. For example, 7 | ```plaintext 8 | deepspeed CIFAR_TIMM_ZERO1.py --model vit_large_patch16_224 --cifar_data CIFAR10 --deepspeed_config cifar_config.json 9 | ``` 10 | 11 | ### CIFAR10/CIFAR100 12 | ```plaintext 13 | python -m CIFAR_TIMM --model vit_large_patch16_224 --origin_params 'patch_embed.proj.bias' --clipping_mode BK-MixOpt --cifar_data CIFAR10 14 | ``` 15 | 16 | The script by default uses (hybrid) book-keeping from [Differentially Private Optimization on Large Model at Small Cost](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory burden without affecting accuracy. This script achieves state-of-the-art accuracy with BEiT-large and ViT-large in under 7 min per epoch on one A100 GPU (40GB). Notice that `--origin_params 'patch_embed.proj.bias'` specifically accelerates ViT through the ghost differentiation trick. 17 | 18 | Arguments: 19 | 20 | * `--cifar_data`: Whether to train on the CIFAR10 (default) or CIFAR100 dataset. 21 | 22 | * `--epsilon`: Target privacy spending, default is 2. 23 | 24 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `nonDP` (non-private full fine-tuning), `BK-ghost` (base book-keeping), `BK-MixGhostClip`, `BK-MixOpt` (default), `BiTFiT` (DP bias-term fine-tuning) and `nonDP-BiTFiT` (non-private BiTFiT). All BK algorithms are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf), and DP-BiTFiT is from [Bu et al., 2022](https://arxiv.org/pdf/2210.00036.pdf). 25 | 26 | * `--model`: The pretrained model from TIMM; check the full list with `timm.list_models(pretrained=True)`. 27 | 28 | * `--origin_params`: Origin parameters for the ghost differentiation trick from [Bu et al. Appendix D.3](https://arxiv.org/pdf/2210.00038.pdf). Default is `None` (not using the trick). To enjoy the acceleration from the trick, set it to the model's first trainable layer's parameters. 29 | 30 | * `--dimension`: Dimension of images, default is 224, i.e. the image is resized to 224×224. 31 | 32 | * `--lr`: Learning rate, default is 0.0005. Note that the BiTFiT learning rate should be larger than full fine-tuning's. 33 | 34 | * `--mini_bs`: Physical batch size for gradient accumulation; it determines memory and speed, but not accuracy. Default is 50. 35 | 36 | * `--bs`: Logical batch size that determines convergence and accuracy; should be a multiple of `mini_bs`. Default is 1000. 37 | 38 | * `--epochs`: Number of epochs, default is 3. 39 | 40 | * `--clipping_style`: Which group-wise per-sample gradient clipping style to use. This argument takes one of `all-layer` (flat clipping), `layer-wise` (each layer is a group, including both weight and bias parameters), `param-wise` (each parameter is a group), or a list of layer names (general group-wise clipping). For example, a uniform 3-group clipping can be implemented with 41 | ```plaintext 42 | python -m CIFAR_TIMM --model vit_base_patch16_224 --origin_params 'patch_embed.proj.bias' --clipping_style patch_embed.proj blocks.4.norm1 blocks.8.norm1 --cifar_data CIFAR10 43 | ``` 44 | 45 | ### CelebA 46 | Download the dataset via `torchvision`, or from the [official host](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) with all the .txt files and `/img_align_celeba` in the same directory. 47 | ```plaintext 48 | python -m CelebA_TIMM --model resnet18 49 | ``` 50 | Same arguments `[lr, epochs, bs, mini_bs, epsilon, clipping mode, model]` as the CIFAR example, with one addition: `--labels`. The default is `None`, which trains all 40 labels as a multi-label/multi-task problem; otherwise training uses the label indices specified as a list. For example, label index 31 is 'Smiling' and label index 20 is 'Male'. 51 | 52 | ### General computer vision experiments 53 | We provide a general script to experiment on many torchvision datasets, fixing most of the arguments in the privacy engine. 54 | ```plaintext 55 | python -m CV_TIMM --model vit_base_patch16_224 --dataset_name ImageNet 56 | ``` 57 | 58 | ### Note 59 | 1. Vision models often have batch normalization layers, which violate the DP guarantee (see [Opacus](https://opacus.ai/tutorials/guide_to_module_validator) for the reason). A common solution is to replace them with group/layer/instance normalization, which is easily done with Opacus>=v1.0: `model=ModuleValidator.fix(model)`. 60 | 61 | 2. To reproduce DP image classification and compare with other packages, we refer to [private-vision](https://github.com/woodyx218/private_vision) (covering GhostClip, MixGhostClip, Opacus-like optimization) and [Opacus](https://github.com/pytorch/opacus). Different packages and clipping modes should produce the same accuracy. Note that training more epochs with larger noise usually gives better accuracy. 62 | 63 | 3. Generally speaking, GhostClip is inefficient for large images (try a 512×512 image with resnet18) and Opacus is inefficient for large models (try a 224×224 image with BEiT-large). Hence we improve on the mixed ghost norm from [Bu et al.](https://arxiv.org/abs/2205.10683) to use GhostClip or Opacus at different layers.
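
To make the engine's usage concrete, here is a minimal sketch of the setup these scripts share, based on the calls in `CV_TIMM.py` and the ZeRO examples (the hyperparameter values are illustrative, not prescriptive):
```python
# Minimal sketch: DP fine-tuning of a TIMM model with fastDP's PrivacyEngine.
import timm
import torch.optim as optim
from opacus.validators import ModuleValidator
from fastDP import PrivacyEngine

net = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
net = ModuleValidator.fix(net)      # replace BatchNorm by GroupNorm (see Note 1)
optimizer = optim.Adam(net.parameters(), lr=5e-4)

privacy_engine = PrivacyEngine(
    net,
    batch_size=1000,                # logical batch size
    sample_size=50000,              # training-set size, e.g. CIFAR10
    epochs=3,
    target_epsilon=2,
    clipping_mode='MixOpt',
    clipping_style='all-layer',
)
privacy_engine.attach(optimizer)    # optimizer.step() now clips per-sample gradients and adds noise
```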
64 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_FSDP_extending.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import torch 4 | import torch.nn as nn 5 | import torch.optim as optim 6 | 7 | import torchvision 8 | from fastDP import PrivacyEngine_Distributed_extending 9 | 10 | import timm 11 | #from opacus.validators import ModuleValidator 12 | from tqdm import tqdm 13 | import warnings; warnings.filterwarnings("ignore") 14 | 15 | 16 | import torch.distributed as dist 17 | import torch.multiprocessing as mp 18 | from torch.utils.data.distributed import DistributedSampler 19 | from fairscale.nn import FullyShardedDataParallel as FSDP 20 | 21 | #--- if import from torch <= 1.11 22 | #from torch.distributed.fsdp import FullyShardedDataParallel as FSDP 23 | #from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload,BackwardPrefetch 24 | #from torch.distributed.fsdp.wrap import default_auto_wrap_policy,enable_wrap,wrap 25 | from fairscale.nn import default_auto_wrap_policy 26 | from fairscale.internal.parallel import ProcessGroupName 27 | 28 | 29 | def setup(rank, world_size): 30 | os.environ['MASTER_ADDR'] = 'localhost' 31 | os.environ['MASTER_PORT'] = '12355' 32 | 33 | # initialize the process group 34 | dist.init_process_group("nccl", rank=rank, world_size=world_size) 35 | 36 | def cleanup(): 37 | dist.destroy_process_group() 38 | 39 | 40 | def train(epoch,net,rank,trainloader,criterion,optimizer,grad_acc_steps): 41 | net.train() 42 | ddp_loss = torch.zeros(3).to(rank) 43 | 44 | for batch_idx, data in enumerate(tqdm(trainloader)): 45 | # get the inputs; data is a list of [inputs, labels] 46 | inputs, targets = data[0].to(rank), data[1].to(rank) 47 | outputs = net(inputs) 48 | 49 | loss = criterion(outputs, targets) 50 | 51 | loss.backward() 52 | if ((batch_idx + 1) % grad_acc_steps == 0) or ((batch_idx + 1) == len(trainloader)): 53 | optimizer.step() 54 | optimizer.zero_grad() 55 | 56 | _, predicted = outputs.max(1) 57 | 58 | ddp_loss[0] += loss.item() 59 | ddp_loss[1] += len(data[0]) 60 | ddp_loss[2] += predicted.eq(targets.view_as(predicted)).sum().item() 61 | 62 | if rank == 0: 63 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%%' 64 | % (ddp_loss[0]/(batch_idx+1), 100.*ddp_loss[2]/ddp_loss[1])) 65 | dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM) 66 | 67 | def test(epoch,net,rank,testloader,criterion): 68 | net.eval() 69 | ddp_loss = torch.zeros(3).to(rank) 70 | 71 | with torch.no_grad(): 72 | for batch_idx, data in enumerate(tqdm(testloader)): 73 | inputs, targets = data[0].to(rank), data[1].to(rank) 74 | outputs = net(inputs) 75 | loss = criterion(outputs, targets) 76 | 77 | _, predicted = outputs.max(1) 78 | ddp_loss[0] += loss.item() 79 | ddp_loss[1] += len(data[0]) 80 | ddp_loss[2] += predicted.eq(targets.view_as(predicted)).sum().item() 81 | if rank == 0: 82 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%%' 83 | % (ddp_loss[0]/ddp_loss[1]*len(inputs), 100.*ddp_loss[2]/ddp_loss[1])) 84 | 85 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 86 | def main(rank, world_size, args): 87 | 88 | grad_acc_steps = args.batch_size//args.mini_batch_size//world_size 89 | 90 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 91 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 
'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 92 | return None 93 | 94 | setup(rank, world_size) 95 | 96 | 97 | transformation = torchvision.transforms.Compose([ 98 | torchvision.transforms.Resize(args.dimension), 99 | torchvision.transforms.ToTensor(), 100 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 101 | ]) 102 | 103 | # Data 104 | print('==> Preparing data..') 105 | 106 | if args.cifar_data=='CIFAR10': 107 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=False, transform=transformation) 108 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=False, transform=transformation) 109 | elif args.cifar_data=='CIFAR100': 110 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=False, transform=transformation) 111 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=False, transform=transformation) 112 | else: 113 | return "Must specify datasets as CIFAR10 or CIFAR100" 114 | 115 | sampler_train = DistributedSampler(trainset, rank=rank, num_replicas=world_size, shuffle=True) 116 | sampler_test = DistributedSampler(testset, rank=rank, num_replicas=world_size) 117 | 118 | train_kwargs = {'batch_size': args.mini_batch_size, 'sampler': sampler_train} 119 | test_kwargs = {'batch_size': 10, 'sampler': sampler_test} 120 | cuda_kwargs = {'num_workers': 2, 121 | 'pin_memory': False, 122 | 'shuffle': False} 123 | train_kwargs.update(cuda_kwargs) 124 | test_kwargs.update(cuda_kwargs) 125 | 126 | trainloader = torch.utils.data.DataLoader(trainset,**train_kwargs) 127 | testloader = torch.utils.data.DataLoader(testset, **test_kwargs) 128 | torch.cuda.set_device(rank) 129 | 130 | 131 | init_start_event = torch.cuda.Event(enable_timing=True) 132 | init_end_event = torch.cuda.Event(enable_timing=True) 133 | 134 | # Model 135 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode,grad_acc_steps) 136 | net = timm.create_model(args.model, pretrained=True, num_classes=int(args.cifar_data[5:])) 137 | 138 | if 'BiTFiT' in args.clipping_mode: 139 | for name,param in net.named_parameters(): 140 | if '.bias' not in name: 141 | param.requires_grad_(False) 142 | 143 | net = net.to(rank) 144 | 145 | # Privacy engine 146 | if 'nonDP' not in args.clipping_mode: 147 | PrivacyEngine_Distributed_extending( 148 | net, 149 | batch_size=args.batch_size, 150 | sample_size=len(trainset), 151 | epochs=args.epochs, 152 | target_epsilon=args.epsilon, 153 | num_GPUs=world_size, 154 | torch_seed_is_fixed=True, #FSDP always gives different seeds to devices if use FSDP() to wrap 155 | grad_accum_steps=grad_acc_steps, 156 | ) 157 | 158 | 159 | #net = FSDP(net,flatten_parameters=False, mixed_precision=args.fp16)# must use flatten_parameters=False https://github.com/facebookresearch/fairscale/issues/1047 160 | 161 | from fairscale.nn.wrap import wrap, enable_wrap, auto_wrap 162 | fsdp_params = dict(wrapper_cls=FSDP, mixed_precision=args.fp16, flatten_parameters=False)#,disable_reshard_on_root=False,reshard_after_forward=False,clear_autocast_cache=True) # True or False 163 | with enable_wrap(**fsdp_params): 164 | # cannot wrap the network as a whole, will lose weight.noise 165 | for pp in net.modules(): # must wrap module/layer not parameter 166 | if hasattr(pp,'weight'): # AssertionError assert not isinstance(child, cast(type, ConfigAutoWrap.wrapper_cls)) 167 | pp=auto_wrap(pp) 168 | 169 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 170 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 171 | 172 | 173 | criterion = nn.CrossEntropyLoss(reduction='sum') 174 | 175 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 176 | #https://pytorch.org/docs/stable/fsdp.html 177 | #The optimizer must be initialized after the module has been wrapped, since FSDP will shard parameters in-place and this will break any previously initialized optimizers. 
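    # A sketch of the batch-size bookkeeping here: each optimizer.step() aggregates
    # grad_acc_steps * world_size * mini_batch_size = batch_size samples, i.e. the
    # logical batch that PrivacyEngine_Distributed_extending calibrates its noise to
    # (via the grad_accum_steps and num_GPUs arguments passed above).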
178 | 179 | init_start_event.record() 180 | 181 | for epoch in range(args.epochs): 182 | train(epoch,net,rank,trainloader,criterion,optimizer,grad_acc_steps) 183 | test(epoch,net,rank,testloader,criterion) 184 | init_end_event.record() 185 | 186 | if rank == 0: 187 | print(f"CUDA event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec") 188 | 189 | cleanup() 190 | 191 | 192 | 193 | if __name__ == '__main__': 194 | 195 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training') 196 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 197 | parser.add_argument('--epochs', default=5, type=int, 198 | help='number of epochs') 199 | parser.add_argument('--batch_size', default=1024, type=int, help='logical batch size') 200 | parser.add_argument('--mini_batch_size', default=16, type=int, help='physical batch size') 201 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 202 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 203 | parser.add_argument('--model', default='vit_gigantic_patch14_224', type=str) 204 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 205 | parser.add_argument('--dimension', type=int,default=224) 206 | parser.add_argument('--fp16', type=bool, default=False) 207 | 208 | args = parser.parse_args() 209 | 210 | torch.manual_seed(2) # useful for reproduction 211 | 212 | WORLD_SIZE = torch.cuda.device_count() 213 | 214 | mp.spawn(main,args=(WORLD_SIZE, args), 215 | nprocs=WORLD_SIZE,join=True) 216 | #https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn 217 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO1.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | config=json.load(open(args.deepspeed_config)) 4 | 5 | transformation = torchvision.transforms.Compose([ 6 | torchvision.transforms.Resize(args.dimension), 7 | torchvision.transforms.ToTensor(), 8 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 9 | ]) 10 | 11 | if torch.distributed.get_rank() != 0: 12 | # might be downloading cifar data, let rank 0 download first 13 | torch.distributed.barrier() 14 | 15 | 16 | # Data 17 | print('==> Preparing data..') 18 | 19 | if args.cifar_data=='CIFAR10': 20 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation) 21 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation) 22 | elif args.cifar_data=='CIFAR100': 23 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation) 24 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation) 25 | else: 26 | return "Must specify datasets as CIFAR10 or CIFAR100" 27 | 28 | 29 | if torch.distributed.get_rank() == 0: 30 | # cifar data is downloaded, indicate other ranks can proceed 31 | torch.distributed.barrier() 32 | 33 | testloader = torch.utils.data.DataLoader(testset, batch_size=20, shuffle=False, num_workers=2) # must have num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746 34 | 35 | # Model 36 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode) 37 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:])) 38 | net = ModuleValidator.fix(net); 39 | 40 | criterion = nn.CrossEntropyLoss() 41 | 42 | if 'BiTFiT' in args.clipping_mode: 43 | for name,param in net.named_parameters(): 44 | if '.bias' not in name: 45 | param.requires_grad_(False) 46 | 47 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 48 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 49 | 50 | # Privacy engine 51 | if 'nonDP' not in args.clipping_mode: 52 | privacy_engine = PrivacyEngine( 53 | net, 54 | batch_size=config['train_batch_size'], 55 | sample_size=len(trainset), 56 | epochs=args.epochs, 57 | target_epsilon=args.epsilon, 58 | clipping_mode='MixOpt', 59 | clipping_style=args.clipping_style, 60 | num_GPUs=torch.distributed.get_world_size(), 61 | torch_seed_is_fixed=True, 62 | ) 63 | 64 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 65 | 66 | # Initialize DeepSpeed to use the following features 67 | # 1) Distributed model 68 | # 2) Distributed data loader 69 | # 3) DeepSpeed optimizer 70 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset) 71 | 72 | fp16 = model_engine.fp16_enabled();bf16 = model_engine.bfloat16_enabled() 73 | print(f'fp16={fp16},bf16={bf16}') 74 | 75 | 76 | def train(epoch): 77 | 78 | net.train() 79 | train_loss = 0 80 | correct = 0 81 | total = 0 82 | 83 | 84 | for batch_idx, data in enumerate(tqdm(trainloader)): 85 | # get the inputs; data is a list of [inputs, labels] 86 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 87 | if fp16: 88 | inputs = inputs.half() 89 | if bf16: 90 | inputs = inputs.bfloat16() 91 | outputs = model_engine(inputs) 92 | 93 | loss = criterion(outputs, targets) 94 | 95 | model_engine.backward(loss) 96 | #if ((batch_idx + 1) % 2 == 0) or ((batch_idx + 1) == len(trainloader)): 97 | model_engine.step() 98 | 99 | train_loss += loss.item() 100 | _, predicted = outputs.max(1) 101 | total += targets.size(0) 102 | correct += predicted.eq(targets).sum().item() 103 | 104 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)' 105 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)) 106 | 107 | def test(epoch): 108 | net.eval() 109 | test_loss = 0 110 | correct = 0 111 | total = 0 112 | with torch.no_grad(): 113 | for batch_idx, data in enumerate(tqdm(testloader)): 114 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 115 | if fp16: 116 | inputs = inputs.half() 117 | if bf16: 118 | inputs = inputs.bfloat16() 119 | outputs = model_engine(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py 120 | loss = criterion(outputs, targets) 121 | 122 | test_loss += loss.item() 123 | _, predicted = outputs.max(1) 124 | total += targets.size(0) 125 | correct += predicted.eq(targets).sum().item() 126 | 127 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)' 128 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)) 129 | 130 | for epoch in range(args.epochs): 131 | train(epoch) 132 | test(epoch) 133 | 134 | 135 | if __name__ == '__main__': 136 | import deepspeed 137 | import argparse 138 | 139 | parser = argparse.ArgumentParser(description='PyTorch CIFAR 
Training') 140 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 141 | parser.add_argument('--epochs', default=1, type=int, 142 | help='number of epochs') 143 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 144 | parser.add_argument('--clipping_mode', default='MixOpt', type=str) 145 | parser.add_argument('--model', default='vit_gigantic_patch14_224', type=str) 146 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 147 | parser.add_argument('--dimension', type=int,default=224) 148 | parser.add_argument('--clipping_style', type=str, default='layer-wise') 149 | 150 | parser.add_argument('--local_rank', 151 | type=int, 152 | default=-1, 153 | help='local rank passed from distributed launcher') 154 | # Include DeepSpeed configuration arguments 155 | parser = deepspeed.add_config_arguments(parser) 156 | 157 | args = parser.parse_args() 158 | 159 | from fastDP import PrivacyEngine 160 | 161 | import torch 162 | import torchvision 163 | torch.manual_seed(3) # seed is fixed here, matching torch_seed_is_fixed=True in the privacy engine 164 | import torch.nn as nn 165 | import torch.optim as optim 166 | import timm 167 | from opacus.validators import ModuleValidator 168 | from tqdm import tqdm 169 | import warnings; warnings.filterwarnings("ignore") 170 | 171 | import json 172 | 173 | deepspeed.init_distributed() 174 | 175 | main(args) 176 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO23.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | config=json.load(open(args.deepspeed_config)) 4 | 5 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 6 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 7 | return None 8 | 9 | 10 | transformation = torchvision.transforms.Compose([ 11 | torchvision.transforms.Resize(args.dimension), 12 | torchvision.transforms.ToTensor(), 13 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 14 | ]) 15 | 16 | if torch.distributed.get_rank() != 0: 17 | # might be downloading cifar data, let rank 0 download first 18 | torch.distributed.barrier() 19 | 20 | 21 | # Data 22 | print('==> Preparing data..') 23 | 24 | if args.cifar_data=='CIFAR10': 25 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation) 26 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation) 27 | elif args.cifar_data=='CIFAR100': 28 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation) 29 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation) 30 | else: 31 | return "Must specify datasets as CIFAR10 or CIFAR100" 32 | 33 | 34 | if torch.distributed.get_rank() == 0: 35 | # cifar data is downloaded, indicate other ranks can proceed 36 | torch.distributed.barrier() 37 | 38 | testloader = torch.utils.data.DataLoader(testset, batch_size=10, shuffle=False, num_workers=2) # must have num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746 39 | 40 | # Model 41 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode) 42 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:])) 43 | net = ModuleValidator.fix(net); 44 | 45 | criterion = nn.CrossEntropyLoss() 46 | 47 | if 'BiTFiT' in args.clipping_mode: 48 | for name,param in net.named_parameters(): 49 | if '.bias' not in name: 50 | param.requires_grad_(False) 51 | 52 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 53 | print('Number of trainable parameters: ', sum([p.numel() for p in net.parameters() if p.requires_grad])) 54 | 55 | # Privacy engine 56 | if 'nonDP' not in args.clipping_mode: 57 | privacy_engine = PrivacyEngine_Distributed_Stage_2_and_3( 58 | net, 59 | batch_size=config['train_batch_size'], 60 | sample_size=len(trainset), 61 | epochs=args.epochs, 62 | #noise_multiplier=0, 63 | target_epsilon=args.epsilon, 64 | clipping_mode='MixOpt', 65 | clipping_style='layer-wise', 66 | num_GPUs=torch.distributed.get_world_size(), 67 | torch_seed_is_fixed=True, 68 | ) 69 | 70 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 71 | 72 | # Initialize DeepSpeed to use the following features 73 | # 1) Distributed model 74 | # 2) Distributed data loader 75 | # 3) DeepSpeed optimizer 76 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset) 77 | 78 | fp16 = model_engine.fp16_enabled();bf16 = model_engine.bfloat16_enabled() 79 | print(f'fp16={fp16},bf16={bf16}') 80 | 81 | 82 | def train(epoch): 83 | 84 | net.train() 85 | train_loss = 0 86 | correct = 0 87 | total = 0 88 | 89 | 90 | for batch_idx, data in enumerate(tqdm(trainloader)): 91 | # get the inputs; data is a list of [inputs, labels] 92 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 93 | if fp16: 94 | inputs = inputs.half() 95 | if bf16: 96 | inputs = inputs.bfloat16() 97 | outputs = model_engine(inputs) 98 | 99 | loss = criterion(outputs, targets) 100 | 101 | model_engine.backward(loss) 102 | #if ((batch_idx + 1) % 2 == 0) or ((batch_idx + 1) == len(trainloader)): 103 | model_engine.step() 104 | 105 | train_loss += loss.item() 106 | _, predicted = outputs.max(1) 107 | total += targets.size(0) 108 | correct += predicted.eq(targets).sum().item() 109 | 110 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)' 111 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)) 112 | 113 | def test(epoch): 114 | net.eval() 115 | test_loss = 0 116 | correct = 0 117 | total = 0 118 | with torch.no_grad(): 119 | for batch_idx, data in enumerate(tqdm(testloader)): 120 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 121 | if fp16: 122 | inputs = inputs.half() 123 | if bf16: 124 | inputs = inputs.bfloat16() 125 | outputs = model_engine(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py 126 | loss = criterion(outputs, targets) 127 | 128 | test_loss += loss.item() 129 | _, predicted = outputs.max(1) 130 | total += targets.size(0) 131 | correct += predicted.eq(targets).sum().item() 132 | 133 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)' 134 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)) 135 | 136 | for epoch in range(args.epochs): 137 | train(epoch) 138 | test(epoch) 139 | 140 | 141 | if __name__ == '__main__': 142 | import deepspeed 143 | import argparse 144 | 145 | parser = 
argparse.ArgumentParser(description='PyTorch CIFAR Training') 146 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 147 | parser.add_argument('--epochs', default=5, type=int, 148 | help='number of epochs') 149 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 150 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 151 | parser.add_argument('--model', default='vit_small_patch16_224', type=str) 152 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 153 | parser.add_argument('--dimension', type=int,default=224) 154 | parser.add_argument('--origin_params', nargs='+', default=None) 155 | 156 | parser.add_argument('--local_rank', 157 | type=int, 158 | default=-1, 159 | help='local rank passed from distributed launcher') 160 | # Include DeepSpeed configuration arguments 161 | parser = deepspeed.add_config_arguments(parser) 162 | 163 | args = parser.parse_args() 164 | 165 | from fastDP import PrivacyEngine_Distributed_Stage_2_and_3 166 | 167 | import torch 168 | import torchvision 169 | torch.manual_seed(3) # seed is fixed here, matching torch_seed_is_fixed=True in the privacy engine 170 | import torch.nn as nn 171 | import torch.optim as optim 172 | import timm 173 | from opacus.validators import ModuleValidator 174 | from tqdm import tqdm 175 | import warnings; warnings.filterwarnings("ignore") 176 | 177 | import json 178 | 179 | import deepspeed 180 | deepspeed.init_distributed() 181 | 182 | main(args) 183 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/CIFAR_TIMM_ZERO_extending.py: -------------------------------------------------------------------------------- 1 | '''Train CIFAR10/CIFAR100 with PyTorch.''' 2 | def main(args): 3 | config=json.load(open(args.deepspeed_config)) 4 | 5 | if args.clipping_mode not in ['nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT']: 6 | print("Mode must be one of 'nonDP','BK-ghost', 'BK-MixGhostClip', 'BK-MixOpt','nonDP-BiTFiT','BiTFiT'") 7 | return None 8 | 9 | 10 | transformation = torchvision.transforms.Compose([ 11 | torchvision.transforms.Resize(args.dimension), 12 | torchvision.transforms.ToTensor(), 13 | torchvision.transforms.Normalize((0.5, 0.5, 0.5),(0.5, 0.5, 0.5)), 14 | ]) 15 | 16 | if torch.distributed.get_rank() != 0: 17 | # might be downloading cifar data, let rank 0 download first 18 | torch.distributed.barrier() 19 | 20 | 21 | # Data 22 | if args.cifar_data=='CIFAR10': 23 | trainset = torchvision.datasets.CIFAR10(root='data/', train=True, download=True, transform=transformation) 24 | testset = torchvision.datasets.CIFAR10(root='data/', train=False, download=True, transform=transformation) 25 | elif args.cifar_data=='CIFAR100': 26 | trainset = torchvision.datasets.CIFAR100(root='data/', train=True, download=True, transform=transformation) 27 | testset = torchvision.datasets.CIFAR100(root='data/', train=False, download=True, transform=transformation) 28 | else: 29 | return "Must specify datasets as CIFAR10 or CIFAR100" 30 | 31 | 32 | if torch.distributed.get_rank() == 0: 33 | # cifar data is downloaded, indicate other ranks can proceed 34 | torch.distributed.barrier() 35 | 36 | testloader = torch.utils.data.DataLoader(testset, batch_size=10, shuffle=False, num_workers=2) # must have num_workers != 0, see https://github.com/microsoft/DeepSpeed/issues/1735#issuecomment-1025073746 37 | 38 | # Model 39 | print('==> Building and fixing model..', args.model,'. 
Mode: ', args.clipping_mode) 40 | # https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py#L376 41 | # embed_dim a.k.a. width, mlp_ratio=MLP/embed_dim, depth is number of blocks 42 | if args.model!='vitANY': 43 | net = timm.create_model(args.model,pretrained=True,num_classes=int(args.cifar_data[5:])) 44 | else: 45 | net = timm.models.vision_transformer.VisionTransformer(embed_dim=768,num_heads=12,depth=12,mlp_ratio=4,num_classes=int(args.cifar_data[5:])) 46 | 47 | if 'BiTFiT' in args.clipping_mode: # not needed for DP-BiTFiT but use here for safety 48 | for name,param in net.named_parameters(): 49 | if '.bias' not in name: 50 | param.requires_grad_(False) 51 | 52 | criterion = nn.CrossEntropyLoss() 53 | 54 | if 'nonDP' not in args.clipping_mode: 55 | PrivacyEngine_Distributed_extending( 56 | net, 57 | batch_size=config['train_batch_size'], 58 | sample_size=len(trainset), 59 | epochs=args.epochs, 60 | target_epsilon=args.epsilon, 61 | num_GPUs=torch.distributed.get_world_size(), 62 | torch_seed_is_fixed=(args.seed_fixed>=0), # better use False? 63 | grad_accum_steps=config['train_batch_size']/config['train_micro_batch_size_per_gpu']/torch.distributed.get_world_size(), 64 | ) 65 | 66 | print('Number of total parameters: ', sum([p.numel() for p in net.parameters()])) 67 | print(f"Number of trainable parameters: {sum([p.numel() for p in net.parameters() if p.requires_grad])}({sum([p.numel() for p in net.parameters() if p.requires_grad])/sum([p.numel() for p in net.parameters()])})") 68 | 69 | optimizer = optim.Adam(net.parameters(), lr=args.lr) 70 | 71 | # Initialize DeepSpeed to use the following features 72 | # 1) Distributed model 73 | # 2) Distributed data loader 74 | # 3) DeepSpeed optimizer 75 | model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, optimizer=optimizer, model_parameters=net.parameters(), training_data=trainset) 76 | 77 | fp16 = model_engine.fp16_enabled();bf16 = model_engine.bfloat16_enabled(); 78 | print(f'fp16={fp16},bf16={bf16}') 79 | 80 | 81 | def train(epoch): 82 | 83 | net.train() 84 | train_loss = 0 85 | correct = 0 86 | total = 0 87 | 88 | 89 | for batch_idx, data in enumerate(tqdm(trainloader)): 90 | # get the inputs; data is a list of [inputs, labels] 91 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 92 | if fp16: 93 | inputs = inputs.half() 94 | if bf16: 95 | inputs = inputs.bfloat16() 96 | outputs = model_engine(inputs) 97 | 98 | loss = criterion(outputs, targets) 99 | 100 | model_engine.backward(loss) 101 | model_engine.step() 102 | 103 | train_loss += loss.item() 104 | _, predicted = outputs.max(1) 105 | total += targets.size(0) 106 | correct += predicted.eq(targets).sum().item() 107 | 108 | print('Epoch: ', epoch, len(trainloader), 'Train Loss: %.3f | Acc: %.3f%% (%d/%d)' 109 | % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)) 110 | 111 | def test(epoch): 112 | net.eval() 113 | test_loss = 0 114 | correct = 0 115 | total = 0 116 | with torch.no_grad(): 117 | for batch_idx, data in enumerate(tqdm(testloader)): 118 | inputs, targets = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank) 119 | if fp16: 120 | inputs = inputs.half() 121 | if bf16: 122 | inputs = inputs.bfloat16() 123 | outputs = model_engine(inputs) 124 | #outputs = net(inputs) # https://github.com/microsoft/DeepSpeedExamples/blob/master/cifar/cifar10_deepspeed.py 125 | loss = criterion(outputs, targets) 126 | 127 | test_loss += loss.item() 
128 | _, predicted = outputs.max(1) 129 | total += targets.size(0) 130 | correct += predicted.eq(targets).sum().item() 131 | 132 | print('Epoch: ', epoch, len(testloader), 'Test Loss: %.3f | Acc: %.3f%% (%d/%d)' 133 | % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)) 134 | 135 | for epoch in range(args.epochs): 136 | train(epoch) 137 | test(epoch) 138 | 139 | 140 | if __name__ == '__main__': 141 | import deepspeed 142 | import argparse 143 | 144 | parser = argparse.ArgumentParser(description='PyTorch CIFAR Training') 145 | parser.add_argument('--lr', default=0.0005, type=float, help='learning rate') 146 | parser.add_argument('--epochs', default=5, type=int, 147 | help='number of epochs') 148 | parser.add_argument('--epsilon', default=2, type=float, help='target epsilon') 149 | parser.add_argument('--clipping_mode', default='BK-MixOpt', type=str) 150 | parser.add_argument('--model', default='vit_large_patch16_224', type=str) 151 | parser.add_argument('--cifar_data', type=str, default='CIFAR100') 152 | parser.add_argument('--dimension', type=int,default=224) 153 | parser.add_argument('--seed_fixed', type=int,default=3) 154 | 155 | parser.add_argument('--local_rank', 156 | type=int, 157 | default=-1, 158 | help='local rank passed from distributed launcher') 159 | # Include DeepSpeed configuration arguments 160 | parser = deepspeed.add_config_arguments(parser) 161 | 162 | args = parser.parse_args() 163 | 164 | from fastDP import PrivacyEngine_Distributed_extending 165 | 166 | import torch 167 | import torchvision 168 | if args.seed_fixed>=0: 169 | torch.manual_seed(args.seed_fixed) # matches torch_seed_is_fixed=(args.seed_fixed>=0) in the privacy engine 170 | import torch.nn as nn 171 | import torch.optim as optim 172 | import timm 173 | from tqdm import tqdm 174 | import warnings; warnings.filterwarnings("ignore") 175 | 176 | import json 177 | 178 | import deepspeed 179 | deepspeed.init_distributed() 180 | 181 | main(args) 182 | -------------------------------------------------------------------------------- /examples/image_classification/ZERO_examples/cifar_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "train_batch_size": 1024, 3 | "train_micro_batch_size_per_gpu": 32, 4 | "steps_per_print": 2000, 5 | "prescale_gradients": false, 6 | "bf16": { 7 | "enabled": false 8 | }, 9 | "fp16": { 10 | "enabled": true, 11 | "fp16_master_weights_and_grads": false, 12 | "loss_scale": 1.0, 13 | "loss_scale_window": 1000, 14 | "hysteresis": 2, 15 | "min_loss_scale": 1, 16 | "initial_scale_power": 0 17 | }, 18 | "wall_clock_breakdown": false, 19 | "zero_optimization": { 20 | "stage": 1, 21 | "allgather_partitions": true, 22 | "reduce_scatter": true, 23 | "allgather_bucket_size": 50000000, 24 | "reduce_bucket_size": 50000000, 25 | "overlap_comm": true, 26 | "contiguous_gradients": true, 27 | "cpu_offload": false, 28 | "stage3_max_live_parameters" : 1e8, 29 | "stage3_max_reuse_distance" : 1e8, 30 | "stage3_prefetch_bucket_size" : 1e7 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /examples/image_classification/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/image_classification/__init__.py -------------------------------------------------------------------------------- /examples/requirements.txt: 
-------------------------------------------------------------------------------- 1 | argcomplete==1.12.1 2 | avro-python3==1.9.2.1 3 | azure-storage-blob==12.4.0 4 | bottle==0.12.20 5 | certifi==2023.7.22 6 | chardet==3.0.4 7 | charset-normalizer==2.0.4 8 | click==8.0.1 9 | crcmod==1.7 10 | cycler==0.10.0 11 | datasets 12 | diffimg==0.2.3 13 | docopt==0.6.2 14 | fastavro==1.4.1 15 | filelock==3.0.12 16 | fire 17 | fusepy==2.0.4 18 | future==0.18.3 19 | gdown>=5.0 20 | gpytorch 21 | httplib2==0.19.0 22 | idna==3.2 23 | imageio==2.9.0 24 | indexed-gzip-fileobj-fork-epicfaace==1.5.4 25 | isodate==0.6.0 26 | joblib==1.2.0 27 | kiwisolver==1.3.1 28 | lazy_loader==0.3 29 | markdown2==2.4.0 30 | marshmallow==2.15.1 31 | marshmallow-jsonapi==0.15.1 32 | matplotlib==3.4.3 33 | mock==2.0.0 34 | networkx==2.6.2 35 | nltk==3.9 36 | numpy>=1.21.2 37 | oauth2client==4.1.3 38 | packaging==21.0 39 | pandas==1.3.2 40 | pathtools==0.1.2 41 | pbr==5.6.0 42 | Pillow==10.2.0 43 | psutil==5.7.2 44 | pyasn1==0.4.8 45 | pyasn1-modules==0.2.8 46 | pycparser==2.20 47 | pydot==1.4.2 48 | pymongo==3.11.4 49 | pyparsing==2.4.7 50 | PySocks==1.7.1 51 | python-dateutil==2.8.* 52 | pytz==2021.1 53 | PyWavelets==1.1.1 54 | PyYAML==5.4.* 55 | regex==2021.8.3 56 | requests 57 | retry==0.9.2 58 | sacremoses==0.0.45 59 | scikit-image==0.18.2 60 | scikit-learn==1.5.0 61 | scipy>=1.7.1 62 | seaborn==0.11.2 63 | selenium==3.141.0 64 | sentence-transformers>=2.0.0 65 | sentencepiece==0.1.96 66 | sentry-sdk==1.14.0 67 | six==1.15.0 68 | SQLAlchemy==1.3.19 69 | termcolor==1.1.0 70 | threadpoolctl==2.2.0 71 | tifffile==2021.8.8 72 | tokenizers==0.10.3 73 | tqdm>=4.62.1 74 | transformers<=4.26 75 | typing-extensions==3.7.4.3 76 | urllib3==1.26.* 77 | watchdog==0.10.3 78 | websocket-client==1.0.1 79 | -------------------------------------------------------------------------------- /examples/table2text/README.md: -------------------------------------------------------------------------------- 1 | ## DP natural language generation with Huggingface transformers 2 | 3 | ### Getting the data 4 | 5 | The E2E and DART datasets are adapted from \[[Li & Liang, 2021](https://arxiv.org/abs/2101.00190)\] and hosted by \[[Li et al., 2021](https://arxiv.org/abs/2110.05679)\] on [Google Drive](https://drive.google.com/file/d/1Re1wyUPtS3IalSsVVJhSg2sn8UNa7DM7/view?usp=sharing). To obtain the data, run 6 | ```plaintext 7 | gdown https://drive.google.com/uc?id=1Re1wyUPtS3IalSsVVJhSg2sn8UNa7DM7 8 | unzip prefix-tuning.zip 9 | ``` 10 | This should produce a `table2text/prefix-tuning/data` subfolder that contains the datasets. 11 | 12 | ### Running on single GPU 13 | 14 | Use the `run.sh` script in this folder, which assembles the command for `run_language_modeling.py`. 15 | 16 | For instance, run the following under the `examples` folder: 17 | ```plaintext 18 | bash table2text/run.sh table2text/prefix-tuning ToDeleteNLG "e2e" "gpt2" 19 | ``` 20 | 21 | The script by default uses book-keeping (BK) from [Differentially Private Optimization on Large Model at Small Cost](https://arxiv.org/pdf/2210.00038.pdf) for DP full fine-tuning. Gradient accumulation is used, so a larger physical batch size gives faster training at a heavier memory burden without affecting accuracy. For E2E/DART, training `gpt2` on one A100 GPU (40GB) takes around 2.5/4 min per epoch. 
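
For intuition, the accumulation loop follows the same pattern as the image-classification examples; below is a minimal sketch (the names `logical_bs`, `physical_bs`, `train_loader` and `model` are illustrative, not the trainer's actual API):
```python
# Sketch: gradient accumulation under DP book-keeping (BK).
n_acc_steps = logical_bs // physical_bs   # accuracy depends only on the logical batch

for batch_idx, batch in enumerate(train_loader):
    loss = model(**batch).loss
    loss.backward()                       # BK computes clipped per-sample gradients here
    if (batch_idx + 1) % n_acc_steps == 0 or (batch_idx + 1) == len(train_loader):
        optimizer.step()                  # noise is added once per logical batch
        optimizer.zero_grad()
```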
22 | 23 | Arguments (sequentially): 24 | * `--output_dir`: path to a folder where results will be written 25 | 26 | * `--task_mode`: name of task; one of "e2e" and "dart" 27 | 28 | * `--model_name_or_path`: The pretrained model; one of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large". 29 | 30 | * `--target_epsilon`: Target privacy spending, default is 8. 31 | 32 | * `--clipping_fn`: Which per-sample gradient clipping function to use; one of `automatic` (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), `Abadi` [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), `global` [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf). 33 | 34 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `MixOpt` (default, meaning hybrid book-keeping), `MixGhostClip`, `ghost`. All three modes are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf). 35 | 36 | ### Running on multi-GPU distributed learning 37 | 38 | Use `run_ZERO1.sh`, `run_ZERO23.sh` and `run_ZERO_extending.py` in this folder for ZeRO 1, ZeRO 2+3 and ZeRO 1+2+3, respectively. The scripts read the config from `gpt_config_stage123.json`. 39 | 40 | For instance, run the following under the `examples` folder: 41 | ```plaintext 42 | bash table2text/run_ZERO1.sh table2text/prefix-tuning ToDeleteNLG "e2e" "gpt2" 43 | ``` 44 | 45 | ### Evaluation 46 | 47 | The script automatically evaluates measures such as loss during training. To evaluate the generations with BLEU, ROUGE, METEOR, CIDEr, NIST, etc., we use the official [e2e-metrics](https://github.com/tuetschek/e2e-metrics) for E2E, and [GEM-metrics](https://github.com/GEM-benchmark/GEM-metrics) for DART. 48 | 49 | Specifically for E2E, after installing e2e-metrics in the `table2text` folder, run 50 | ```bash 51 | cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib) 52 | python e2e-metrics/measure_scores.py prefix-tuning/data/e2e_data/clean_references_test.txt ..//generations_model/eval/global_step_00000420.txt 53 | ``` 54 | -------------------------------------------------------------------------------- /examples/table2text/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/table2text/__init__.py -------------------------------------------------------------------------------- /examples/table2text/compiled_args.py: -------------------------------------------------------------------------------- 1 | """Compilation of all the arguments.""" 2 | import logging 3 | import os 4 | import sys 5 | from dataclasses import dataclass, field 6 | from typing import Optional 7 | 8 | import transformers 9 | 10 | MODEL_CONFIG_CLASSES = list(transformers.MODEL_WITH_LM_HEAD_MAPPING.keys()) 11 | MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) 12 | 13 | TRUE_TAGS = ('y', 'yes', 't', 'true') 14 | 15 | 16 | # See all possible arguments in src/transformers/training_args.py 17 | # or by passing the --help flag to this script. 18 | # We now keep distinct sets of args, for a cleaner separation of concerns. 19 | @dataclass 20 | class ModelArguments: 21 | """ 22 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. 23 | """ 24 | model_name_or_path: Optional[str] = field( 25 | default=None, 26 | metadata={ 27 | "help": "The model checkpoint for weights initialization. 
Leave None if you want to train a model from " 28 | "scratch." 29 | }, 30 | ) 31 | model_type: Optional[str] = field( 32 | default=None, 33 | metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)}, 34 | ) 35 | config_name: Optional[str] = field( 36 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} 37 | ) 38 | tokenizer_name: Optional[str] = field( 39 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} 40 | ) 41 | cache_dir: Optional[str] = field( 42 | default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"} 43 | ) 44 | 45 | static_lm_head: str = field(default='no') 46 | static_embedding: str = field(default='no') 47 | attention_only: str = field(default="no") 48 | bias_only: str = field(default="no") 49 | 50 | def __post_init__(self): 51 | self.static_lm_head = self.static_lm_head.lower() in TRUE_TAGS 52 | self.static_embedding = self.static_embedding.lower() in TRUE_TAGS 53 | self.attention_only = self.attention_only.lower() in TRUE_TAGS 54 | self.bias_only = self.bias_only.lower() in TRUE_TAGS 55 | 56 | 57 | @dataclass 58 | class DataTrainingArguments: 59 | """ 60 | Arguments pertaining to what data we are going to input our model for training and eval. 61 | """ 62 | data_folder: Optional[str] = field(default=None, metadata={"help": "Path to folder with all the data."}) 63 | 64 | # Useful for truncating the dataset. 65 | max_train_examples: Optional[int] = field(default=sys.maxsize) 66 | max_valid_examples: Optional[int] = field(default=sys.maxsize) 67 | max_eval_examples: Optional[int] = field(default=sys.maxsize) 68 | 69 | line_by_line: bool = field( 70 | default=True, 71 | metadata={"help": "Whether distinct lines of text in the dataset are to be handled as distinct sequences."}, 72 | ) 73 | task_mode: Optional[str] = field( 74 | default=None, metadata={"help": "The name of the task."} 75 | ) 76 | format_mode: Optional[str] = field( 77 | default='cat', metadata={"help": "The mode of data2text format (cat, peek, nopeek)"} 78 | ) 79 | max_source_length: Optional[int] = field( 80 | default=512, metadata={"help": "the max source length of summarization data. "} 81 | ) 82 | train_max_target_length: Optional[int] = field( 83 | default=100, metadata={"help": "the max target length for training data. "} 84 | ) 85 | val_max_target_length: Optional[int] = field( 86 | default=100, metadata={"help": "the max target length for dev data. "} 87 | ) 88 | block_size: int = field( 89 | default=-1, 90 | metadata={ 91 | "help": "Optional input sequence length after tokenization." 92 | "The training dataset will be truncated in block of this size for training." 93 | "Default to the model max input length for single sentence inputs (take into account special " 94 | "tokens)." 
95 | }, 96 | ) 97 | overwrite_cache: bool = field( 98 | default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} 99 | ) 100 | max_seq_len: int = field(default=sys.maxsize) 101 | 102 | 103 | def __post_init__(self): 104 | if self.data_folder is not None: 105 | logging.warning(f'Overriding dataset paths using those given in `data_folder`') 106 | 107 | if self.task_mode == "e2e": 108 | self.train_data_file = os.path.join(self.data_folder, 'src1_train.txt') 109 | self.valid_data_file = os.path.join(self.data_folder, 'src1_valid.txt') 110 | self.eval_data_file = os.path.join(self.data_folder, 'src1_test.txt') 111 | 112 | self.train_prompt_file = os.path.join(self.data_folder, 'prompts_train.txt') 113 | self.val_prompt_file = os.path.join(self.data_folder, 'prompts_valid.txt') 114 | self.eval_prompt_file = os.path.join(self.data_folder, 'prompts_test.txt') 115 | 116 | elif self.task_mode == "dart": 117 | self.train_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-train.json') 118 | self.valid_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-dev.json') 119 | self.eval_data_file = os.path.join(self.data_folder, 'dart-v1.1.1-full-test.json') 120 | 121 | self.train_prompt_file = os.path.join(self.data_folder, 'prompts_train.txt') 122 | self.val_prompt_file = os.path.join(self.data_folder, 'prompts_valid.txt') 123 | self.eval_prompt_file = os.path.join(self.data_folder, 'prompts_test.txt') 124 | 125 | 126 | @dataclass 127 | class TrainingArguments(transformers.TrainingArguments): 128 | max_eval_batches: int = field(default=-1, metadata={"help": "Maximum number of evaluation steps to run."}) 129 | max_generations: int = field(default=sys.maxsize) 130 | max_generations_train: int = field(default=10) 131 | max_generations_valid: int = field(default=10) 132 | skip_generation: str = field(default="no") 133 | 134 | ema_model_averaging: str = field(default="no") 135 | ema_model_gamma: float = field(default=0.99) 136 | ema_model_start_from: int = field(default=1000) 137 | lr_decay: str = field(default="yes") 138 | eval_epochs: int = field(default=10) 139 | 140 | deepspeed_config: str = field(default=None) 141 | num_GPUs: int = field(default=1) 142 | logical_batch_size: int = field(default=None) 143 | 144 | evaluate_during_training: str = field( 145 | default="yes", 146 | metadata={"help": "Run evaluation during training at each logging step."}, 147 | ) 148 | evaluate_before_training: str = field( 149 | default="yes", 150 | metadata={"help": "Run evaluation before training."}, 151 | ) 152 | save_at_last: str = field(default="no", metadata={"help": "Save at the end of training."}) 153 | 154 | def __post_init__(self): 155 | super(TrainingArguments, self).__post_init__() 156 | self.skip_generation = self.skip_generation.lower() in ('y', 'yes') 157 | self.ema_model_averaging = (self.ema_model_averaging.lower() in ('y', 'yes')) 158 | self.lr_decay = (self.lr_decay.lower() in ('y', 'yes')) 159 | self.evaluate_during_training = (self.evaluate_during_training in ('y', 'yes')) 160 | self.evaluate_before_training = (self.evaluate_before_training in ('y', 'yes')) 161 | self.save_at_last = (self.save_at_last in ('y', 'yes')) 162 | 163 | 164 | @dataclass 165 | class PrivacyArguments: 166 | """Arguments for differentially private training.""" 167 | per_example_max_grad_norm: float = field( 168 | default=.1, metadata={ 169 | "help": "Clipping 2-norm of per-sample gradients." 
170 |         }
171 |     )
172 |     noise_multiplier: float = field(
173 |         default=None, metadata={
174 |             "help": "Standard deviation of noise added for privacy; if `target_epsilon` is specified, "
175 |                     "the noise multiplier searched from that budget is used instead."
176 |         }
177 |     )
178 |     target_epsilon: float = field(
179 |         default=None, metadata={
180 |             "help": "Privacy budget; if `None` use the noise multiplier specified."
181 |         }
182 |     )
183 |     target_delta: float = field(
184 |         default=None, metadata={
185 |             "help": "Failure probability in approximate differential privacy; if `None` use 1 / len(train_data)."
186 |         }
187 |     )
188 |     accounting_mode: str = field(
189 |         default="rdp", metadata={"help": "One of `rdp`, `glw`, `all`."}
190 |     )
191 |     non_private: str = field(default="no")
192 |     clipping_mode: str = field(default="ghost")
193 |     clipping_fn: str = field(default="automatic")
194 |     clipping_style: str = field(default="all-layer")
195 |     torch_seed_is_fixed: bool = field(default=True)
196 | 
197 |     def __post_init__(self):
198 |         self.non_private = self.non_private.lower() in ('y', 'yes')
199 | 
--------------------------------------------------------------------------------
/examples/table2text/data_utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/table2text/data_utils/__init__.py
--------------------------------------------------------------------------------
/examples/table2text/data_utils/data_collator.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union
3 | 
4 | import torch
5 | from torch.nn.utils.rnn import pad_sequence
6 | 
7 | from transformers.tokenization_utils import PreTrainedTokenizer
8 | from transformers.tokenization_utils_base import BatchEncoding, PaddingStrategy
9 | from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
10 | 
11 | 
12 | InputDataClass = NewType("InputDataClass", Any)
13 | 
14 | """
15 | A DataCollator is a function that takes a list of samples from a Dataset
16 | and collates them into a batch, as a dictionary of Tensors.
17 | """
18 | DataCollator = NewType("DataCollator", Callable[[List[InputDataClass]], Dict[str, torch.Tensor]])
19 | 
20 | 
21 | @dataclass
22 | class DataCollatorForData2TextLanguageModeling:
23 |     """
24 |     Data collator used for language modeling.
25 |     - collates batches of tensors, honoring their tokenizer's pad_token
26 |     - preprocesses batches for masked language modeling
27 |     """
28 |     tokenizer: PreTrainedTokenizer
29 |     mlm: bool = True
30 |     format_mode: str = 'cat'
31 |     mlm_probability: float = 0.15
32 | 
33 |     def __call__(
34 |         self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
35 |     ) -> Dict[str, torch.Tensor]:
36 |         if isinstance(examples[0], (dict, BatchEncoding)):
37 |             examples = [e["input_ids"] for e in examples]
38 |         input_ids, labels, src, tgt, cate = zip(*examples)
39 |         if self.mlm:
40 |             inputs, labels = self.mask_tokens(self._tensorize_batch(input_ids))
41 |             return {"input_ids": inputs, "labels": labels}
42 |         else:
43 |             if self.format_mode == 'cat':
44 |                 mode_input = 3
45 |             elif self.format_mode == 'peek':
46 |                 mode_input = 1
47 |             elif self.format_mode == 'nopeek':
48 |                 mode_input = 2
49 |             elif self.format_mode == 'infix':
50 |                 mode_input = 4
51 | 
52 |             # mode_input = 1 # means that we take the input again.
53 | # mode_input = 2 # means that we do not peek at src again. 54 | # mode_input = 3 # means that we look at the categories, and see the input again. 55 | 56 | if mode_input == 1: 57 | # input, batch 58 | batch = self._tensorize_batch(input_ids) 59 | labels = self._tensorize_batch(labels) 60 | src = self._tensorize_batch(src) 61 | cate_batch, cate_attn = None, None 62 | # tgt = self._tensorize_batch(tgt) 63 | elif mode_input == 2: 64 | # nopeek. 65 | batch = self._tensorize_batch(tgt) 66 | labels = batch.clone() 67 | src = self._tensorize_batch(src) 68 | cate_batch, cate_attn = None, None 69 | elif mode_input == 3: 70 | batch = self._tensorize_batch(input_ids) 71 | labels = self._tensorize_batch(labels) 72 | src = self._tensorize_batch(cate) 73 | cate_batch, cate_attn = None, None 74 | elif mode_input == 4: 75 | batch = self._tensorize_batch(tgt) 76 | labels = batch.clone() 77 | src = self._tensorize_batch(src) 78 | 79 | cate_batch = self._tensorize_batch(cate) 80 | cate_attn = (cate_batch != self.tokenizer.pad_token_id) 81 | 82 | labels[labels == self.tokenizer.pad_token_id] = -100 # tgt 83 | src_attn = (src != self.tokenizer.pad_token_id) # src 84 | tgt_attn = (batch != self.tokenizer.pad_token_id) # tgt 85 | 86 | if cate_batch is None: 87 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn':tgt_attn, 88 | 'src':src} 89 | else: 90 | return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn': tgt_attn, 91 | 'src': src, "cate_batch":cate_batch, "cate_attn":cate_attn} 92 | 93 | def _tensorize_batch( 94 | self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]] 95 | ) -> torch.Tensor: 96 | # In order to accept both lists of lists and lists of Tensors 97 | if isinstance(examples[0], (list, tuple)): 98 | examples = [torch.tensor(e, dtype=torch.long) for e in examples] 99 | length_of_first = examples[0].size(0) 100 | are_tensors_same_length = all(x.size(0) == length_of_first for x in examples) 101 | if are_tensors_same_length: 102 | return torch.stack(examples, dim=0) 103 | else: 104 | if self.tokenizer._pad_token is None: 105 | raise ValueError( 106 | "You are attempting to pad samples but the tokenizer you are using" 107 | f" ({self.tokenizer.__class__.__name__}) does not have one." 108 | ) 109 | return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id) 110 | 111 | def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: 112 | """ 113 | Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 114 | """ 115 | 116 | if self.tokenizer.mask_token is None: 117 | raise ValueError( 118 | "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer." 
119 |             )
120 | 
121 |         labels = inputs.clone()
122 |         # We sample a few tokens in each sequence for masked-LM training (with probability `mlm_probability`, which defaults to 0.15 as in BERT/RoBERTa).
123 |         probability_matrix = torch.full(labels.shape, self.mlm_probability)
124 |         special_tokens_mask = [
125 |             self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
126 |         ]
127 |         probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
128 |         if self.tokenizer._pad_token is not None:
129 |             padding_mask = labels.eq(self.tokenizer.pad_token_id)
130 |             probability_matrix.masked_fill_(padding_mask, value=0.0)
131 |         masked_indices = torch.bernoulli(probability_matrix).bool()
132 |         labels[~masked_indices] = -100  # We only compute loss on masked tokens
133 | 
134 |         # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
135 |         indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
136 |         inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)
137 | 
138 |         # 10% of the time, we replace masked input tokens with random word
139 |         indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
140 |         random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
141 |         inputs[indices_random] = random_words[indices_random]
142 | 
143 |         # The rest of the time (10% of the time) we keep the masked input tokens unchanged
144 |         return inputs, labels
145 | 
146 | 
147 | @dataclass
148 | class DataCollatorForSumLanguageModeling:
149 |     """
150 |     Data collator used for language modeling.
151 |     - collates batches of tensors, honoring their tokenizer's pad_token
152 |     - preprocesses batches for masked language modeling
153 |     """
154 |     tokenizer: PreTrainedTokenizer
155 |     mlm: bool = True
156 |     format_mode: str = 'cat'
157 |     mlm_probability: float = 0.15
158 | 
159 |     def __call__(
160 |         self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
161 |     ) -> Dict[str, torch.Tensor]:
162 |         if isinstance(examples[0], (dict, BatchEncoding)):
163 |             examples = [e["input_ids"] for e in examples]
164 |         # print(examples[0])
165 |         # print(len(examples))
166 |         input_ids, labels, src, tgt = zip(*examples)
167 |         # print(len(input_ids), len(labels), len(weights))
168 |         if self.mlm:
169 |             inputs, labels = self.mask_tokens(self._tensorize_batch(input_ids))
170 |             return {"input_ids": inputs, "labels": labels}
171 |         else:
172 | 
173 |             # print(self.format_mode)
174 | 
175 |             if self.format_mode == 'peek' or self.format_mode == 'cat':
176 |                 mode_input = 1
177 |             elif self.format_mode == 'nopeek':
178 |                 assert False, 'should use format_mode = peek or cat.'
179 |                 mode_input = 2
180 |             elif self.format_mode == 'infix':
181 |                 assert False, 'should use format_mode = peek or cat.'
182 |                 mode_input = 4
183 | 
184 |             # mode_input = 1 # means that we take the input again.
185 |             # mode_input = 2 # means that we do not peek at src again.
186 |             # mode_input = 3 # means that we look at the categories, and see the input again.
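
As a concrete reference for the collation logic in both collators, the sketch below isolates the shared padding-and-masking pattern: right-pad variable-length id sequences, overwrite padded label positions with -100 so the LM loss ignores them, and derive attention masks by comparing against the pad id. The `pad_id` and toy sequences are made up for illustration; a real run takes them from `tokenizer.pad_token_id`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

pad_id = 0  # Stand-in for tokenizer.pad_token_id.
examples = [torch.tensor([5, 7, 9]), torch.tensor([4, 2])]

# Right-pad to the longest sequence in the batch.
batch = pad_sequence(examples, batch_first=True, padding_value=pad_id)

# Padded positions contribute nothing to the LM loss.
labels = batch.clone()
labels[labels == pad_id] = -100

# Attention masks: True on real tokens, False on padding.
attention_mask = batch != pad_id

print(batch)           # tensor([[5, 7, 9], [4, 2, 0]])
print(labels)          # tensor([[5, 7, 9], [4, 2, -100]])
print(attention_mask)  # tensor([[True, True, True], [True, True, False]])
```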
187 | 
188 |             # print(self.format_mode, mode_input)
189 | 
190 |             if mode_input == 1:
191 |                 # input, batch
192 |                 batch = self._tensorize_batch(input_ids)
193 |                 labels = self._tensorize_batch(labels)
194 |                 src = self._tensorize_batch(src)
195 | 
196 |                 labels[labels == self.tokenizer.pad_token_id] = -100  # tgt
197 |                 src_attn = (src != self.tokenizer.pad_token_id)  # src
198 |                 tgt_attn = (batch != self.tokenizer.pad_token_id)  # tgt
199 | 
200 |                 return {"input_ids": batch, "labels": labels, 'src_attn': src_attn, 'tgt_attn': tgt_attn,
201 |                         'src': src}
202 | 
203 | 
204 |     def _tensorize_batch(
205 |         self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
206 |     ) -> torch.Tensor:
207 |         # In order to accept both lists of lists and lists of Tensors
208 |         if isinstance(examples[0], (list, tuple)):
209 |             examples = [torch.tensor(e, dtype=torch.long) for e in examples]
210 |         length_of_first = examples[0].size(0)
211 |         are_tensors_same_length = all(x.size(0) == length_of_first for x in examples)
212 |         if are_tensors_same_length:
213 |             return torch.stack(examples, dim=0)
214 |         else:
215 |             if self.tokenizer._pad_token is None:
216 |                 raise ValueError(
217 |                     "You are attempting to pad samples but the tokenizer you are using"
218 |                     f" ({self.tokenizer.__class__.__name__}) does not have one."
219 |                 )
220 |             return pad_sequence(examples, batch_first=True, padding_value=self.tokenizer.pad_token_id)
221 | 
--------------------------------------------------------------------------------
/examples/table2text/decoding_utils.py:
--------------------------------------------------------------------------------
1 | """Utilities for generation."""
2 | import logging
3 | import sys
4 | from typing import Optional
5 | 
6 | import tqdm
7 | import transformers
8 | 
9 | 
10 | def generate(
11 |     model: transformers.PreTrainedModel,
12 |     tokenizer: transformers.PreTrainedTokenizer,
13 |     loader=None,
14 |     prompt_dataset=None,
15 |     max_length=100,
16 |     min_length=5,
17 |     top_k=0,
18 |     top_p=0.9,  # Only filter with top_p.
19 |     repetition_penalty=1,
20 |     do_sample=False,
21 |     num_beams=5,
22 |     bad_words_ids=None,
23 |     dummy_token_id=-100,  # Used as mask.
24 |     num_return_sequences=1,
25 |     max_generations=sys.maxsize,
26 |     device=None,
27 |     padding_token="[PAD]",
28 |     **kwargs,
29 | ):
30 |     assert not model.training, "Generation must be performed when `model` is in eval mode."
31 |     if kwargs:
32 |         logging.warning(f"Unknown kwargs: {kwargs}")
33 | 
34 |     # These are linebreaks; generating these will mess up the evaluation, since those files assume one example per line.
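
The hard-coded ids `[[628], [198]]` used below can be sanity-checked directly. A quick sketch, assuming the GPT-2 BPE tokenizer that these examples load (the exact ids are vocabulary-specific):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.encode("\n"))    # [198]  -- single linebreak
print(tok.encode("\n\n"))  # [628]  -- double linebreak
# Banning these via bad_words_ids keeps each generation on a single line,
# matching the one-example-per-line evaluation files.
```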
35 | if bad_words_ids is None: 36 | bad_words_ids = [[628], [198]] 37 | if padding_token in tokenizer.get_vocab(): 38 | bad_words_ids.append(tokenizer.encode(padding_token)) 39 | 40 | kwargs = dict( 41 | model=model, 42 | tokenizer=tokenizer, 43 | max_length=max_length, 44 | min_length=min_length, 45 | top_k=top_k, 46 | top_p=top_p, 47 | repetition_penalty=repetition_penalty, 48 | do_sample=do_sample, 49 | num_beams=num_beams, 50 | bad_words_ids=bad_words_ids, 51 | dummy_token_id=dummy_token_id, 52 | num_return_sequences=num_return_sequences, 53 | max_generations=max_generations, 54 | device=device, 55 | padding_token=padding_token, 56 | ) 57 | if loader is not None: 58 | result = _generate_with_loader(loader=loader, **kwargs) 59 | elif prompt_dataset is not None: 60 | result = _generate_with_prompt_dataset(prompt_dataset=prompt_dataset, **kwargs) 61 | else: 62 | raise ValueError(f"`loader` and `prompt_dataset` cannot both be `None`.") 63 | 64 | return result 65 | 66 | 67 | def _generate_with_loader( 68 | loader, 69 | 70 | model, 71 | tokenizer: transformers.PreTrainedTokenizer, 72 | max_length, 73 | min_length, 74 | top_k, 75 | top_p, 76 | repetition_penalty, 77 | do_sample, 78 | num_beams, 79 | bad_words_ids, 80 | dummy_token_id, 81 | num_return_sequences, 82 | max_generations, 83 | device, 84 | padding_token, 85 | ): 86 | references = [] 87 | full_generations = [] # Sentences including the prompt part. 88 | unstripped_generations = [] 89 | generations = [] 90 | 91 | stop_generation = False 92 | for batch_idx, batch in tqdm.tqdm(enumerate(loader), desc="generation"): 93 | if stop_generation: 94 | break 95 | 96 | batch_input_ids, batch_labels = batch["input_ids"], batch["labels"] 97 | # e.g., inputs_ids may be [[95, 123, 32], [198, 19, 120]], and 98 | # labels may be [[-100, 123, 32], [-100, -100, 120] 99 | 100 | for input_ids, labels in zip(batch_input_ids, batch_labels): 101 | if stop_generation: 102 | break 103 | 104 | # Find the first pad token and end the sentence from there! 105 | if padding_token in tokenizer.get_vocab(): 106 | pad_positions, = ( 107 | input_ids == tokenizer.encode(padding_token, return_tensors="pt").squeeze() 108 | ).nonzero(as_tuple=True) 109 | # Some sentences might have padding; others might not. 110 | if pad_positions.numel() == 0: 111 | first_pad_position = None 112 | else: 113 | first_pad_position = pad_positions[0] 114 | reference_str: str = tokenizer.decode(input_ids[:first_pad_position], clean_up_tokenization_spaces=True) 115 | else: 116 | reference_str: str = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True) 117 | references.append(reference_str) 118 | 119 | # Find the first non- -100 position. Note there are trailing -100s. 120 | non_prompt_positions, = (labels != dummy_token_id).nonzero(as_tuple=True) 121 | first_non_prompt_position = non_prompt_positions[0].item() 122 | prompt_len = first_non_prompt_position 123 | prompt_ids = input_ids[:prompt_len] 124 | 125 | output_ids = model.generate( 126 | input_ids=prompt_ids[None, ...].to(device), 127 | max_length=max_length + prompt_len, # This cannot be a 0-D tensor! 128 | min_length=min_length, 129 | top_k=top_k, 130 | top_p=top_p, 131 | repetition_penalty=repetition_penalty, 132 | do_sample=do_sample, 133 | bad_words_ids=bad_words_ids, 134 | num_return_sequences=num_return_sequences, 135 | num_beams=num_beams, 136 | pad_token_id=tokenizer.eos_token_id, # Stop the stupid logging... 137 | ) 138 | output_ids = output_ids.squeeze(dim=0) # Throw away batch dimension. 
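
The prompt-length bookkeeping in this loop is easiest to see on toy values. A minimal sketch, reusing the illustrative ids from the comment earlier in this function: the first non-`-100` label position marks where the prompt ends.

```python
import torch

dummy_token_id = -100
input_ids = torch.tensor([95, 123, 32])  # toy ids, as in the comment above
labels = torch.tensor([-100, 123, 32])   # -100 marks prompt positions

non_prompt_positions, = (labels != dummy_token_id).nonzero(as_tuple=True)
prompt_len = non_prompt_positions[0].item()  # 1
prompt_ids = input_ids[:prompt_len]          # tensor([95])
print(prompt_len, prompt_ids)
```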
139 | 140 | whole_str: str = tokenizer.decode(output_ids, clean_up_tokenization_spaces=True) 141 | prompt_str: str = tokenizer.decode(prompt_ids, clean_up_tokenization_spaces=True) 142 | output_str: str = whole_str[len(prompt_str):] 143 | 144 | full_generations.append(whole_str) 145 | del whole_str, prompt_str 146 | 147 | # Remove potential eos_token at the end. 148 | eos_position: Optional[int] = output_str.find(tokenizer.eos_token) 149 | if eos_position == -1: # Didn't generate eos_token; that's okay -- just skip! 150 | eos_position = None 151 | output_str = output_str[:eos_position] 152 | unstripped_generations.append(output_str) 153 | 154 | # Removing leading and trailing spaces. 155 | output_str = output_str.strip() 156 | 157 | generations.append(output_str) 158 | 159 | if len(generations) >= max_generations: 160 | stop_generation = True 161 | 162 | return full_generations, unstripped_generations, generations, references 163 | 164 | 165 | def _generate_with_prompt_dataset( 166 | prompt_dataset, 167 | 168 | model, 169 | tokenizer, 170 | max_length, 171 | min_length, 172 | top_k, 173 | top_p, 174 | repetition_penalty, 175 | do_sample, 176 | num_beams, 177 | bad_words_ids, 178 | dummy_token_id, 179 | num_return_sequences, 180 | max_generations, 181 | device, 182 | padding_token, 183 | ): 184 | references = [] 185 | full_generations = [] # Sentences including the prompt part. 186 | unstripped_generations = [] 187 | generations = [] 188 | 189 | stop_generation = False 190 | for input_ids in tqdm.tqdm(prompt_dataset, desc="generation"): 191 | if stop_generation: 192 | break 193 | 194 | prompt_len = len(input_ids[0]) 195 | output_ids = model.generate( 196 | input_ids=input_ids.to(device), 197 | max_length=max_length + prompt_len, # This cannot be a 0-D tensor! 198 | min_length=min_length, 199 | top_k=top_k, 200 | top_p=top_p, 201 | repetition_penalty=repetition_penalty, 202 | do_sample=do_sample, 203 | bad_words_ids=bad_words_ids, 204 | num_return_sequences=num_return_sequences, 205 | num_beams=num_beams, 206 | pad_token_id=tokenizer.eos_token_id, # Stop the stupid logging... 207 | ) 208 | output_ids = output_ids.squeeze(dim=0) # Throw away batch dimension. 209 | input_ids = input_ids.squeeze(dim=0) 210 | 211 | whole_str: str = tokenizer.decode(output_ids, clean_up_tokenization_spaces=True) 212 | prompt_str: str = tokenizer.decode(input_ids, clean_up_tokenization_spaces=True) 213 | output_str: str = whole_str[len(prompt_str):] 214 | 215 | full_generations.append(whole_str) 216 | del whole_str, prompt_str 217 | 218 | # Remove potential eos_token at the end. 219 | eos_position: Optional[int] = output_str.find(tokenizer.eos_token) 220 | if eos_position == -1: # Didn't generate eos_token; that's okay -- just skip! 221 | eos_position = None 222 | output_str = output_str[:eos_position] 223 | unstripped_generations.append(output_str) 224 | 225 | # Removing leading and trailing spaces. 
226 | output_str = output_str.strip() 227 | 228 | generations.append(output_str) 229 | 230 | if len(generations) >= max_generations: 231 | stop_generation = True 232 | return full_generations, unstripped_generations, generations, references 233 | -------------------------------------------------------------------------------- /examples/table2text/gpt_config_stage123.json: -------------------------------------------------------------------------------- 1 | { 2 | "bf16": { 3 | "enabled": true 4 | }, 5 | "fp16": { 6 | "enabled": false, 7 | "fp16_master_weights_and_grads": false, 8 | "loss_scale": 1, 9 | "loss_scale_window": 1000, 10 | "hysteresis": 2, 11 | "min_loss_scale": 1, 12 | "initial_scale_power": 3 13 | }, 14 | "train_micro_batch_size_per_gpu": 999999999, 15 | "wall_clock_breakdown": false, 16 | "zero_optimization": { 17 | "stage": 1, 18 | "allgather_partitions": true, 19 | "reduce_scatter": true, 20 | "allgather_bucket_size": 50000000, 21 | "reduce_bucket_size": 50000000, 22 | "overlap_comm": true, 23 | "contiguous_gradients": true, 24 | "cpu_offload": false 25 | } 26 | } 27 | -------------------------------------------------------------------------------- /examples/table2text/misc.py: -------------------------------------------------------------------------------- 1 | """Miscellaneous utilities. 2 | 3 | Mostly bespoke data loaders at the moment. 4 | """ 5 | 6 | from transformers import ( 7 | DataCollatorForLanguageModeling, 8 | DataCollatorForPermutationLanguageModeling, 9 | PreTrainedTokenizer 10 | ) 11 | 12 | try: 13 | from .compiled_args import DataTrainingArguments 14 | from .data_utils.data_collator import DataCollatorForData2TextLanguageModeling 15 | from .data_utils.language_modeling import LineByLineE2ETextDataset, LineByLineTriplesTextDataset 16 | except: 17 | from compiled_args import DataTrainingArguments 18 | from data_utils.data_collator import DataCollatorForData2TextLanguageModeling 19 | from data_utils.language_modeling import LineByLineE2ETextDataset, LineByLineTriplesTextDataset 20 | 21 | 22 | def get_dataset_with_path( 23 | data_args: DataTrainingArguments, 24 | tokenizer: PreTrainedTokenizer, 25 | file_path: str, 26 | max_examples: int, 27 | **_, 28 | ): 29 | if data_args.line_by_line: 30 | if data_args.task_mode == 'e2e': 31 | dataset = LineByLineE2ETextDataset( 32 | tokenizer=tokenizer, 33 | file_path=file_path, 34 | block_size=data_args.block_size, 35 | bos_tok=tokenizer.bos_token, 36 | eos_tok=tokenizer.eos_token, 37 | max_seq_len=data_args.max_seq_len, 38 | max_examples=max_examples, 39 | ) 40 | elif data_args.task_mode == 'dart': 41 | dataset = LineByLineTriplesTextDataset( 42 | tokenizer=tokenizer, 43 | file_path=file_path, 44 | block_size=data_args.block_size, 45 | bos_tok=tokenizer.bos_token, 46 | eos_tok=tokenizer.eos_token, 47 | max_seq_len=data_args.max_seq_len, 48 | max_examples=max_examples, 49 | ) 50 | else: 51 | raise ValueError(f"Unknown `args.task_mode`: {data_args.task_mode}") 52 | 53 | else: 54 | raise ValueError("table2text task don't support anything other than line_by_line!") 55 | return dataset 56 | 57 | 58 | def get_prompt_dataset(file_path, tokenizer): 59 | with open(file_path, 'r') as f: 60 | lines = f.readlines() 61 | encoded_lines = [ 62 | tokenizer.encode(line.strip(), add_special_tokens=False, return_tensors="pt") 63 | for line in lines 64 | ] 65 | return encoded_lines 66 | 67 | 68 | def get_all_datasets(config, tokenizer, data_args, model_args, **_): 69 | kwargs = dict(data_args=data_args, tokenizer=tokenizer, 
cache_dir=model_args.cache_dir) 70 | train_dataset = get_dataset_with_path( 71 | **kwargs, file_path=data_args.train_data_file, max_examples=data_args.max_train_examples 72 | ) 73 | valid_dataset = get_dataset_with_path( 74 | **kwargs, file_path=data_args.valid_data_file, max_examples=data_args.max_valid_examples 75 | ) 76 | eval_dataset = get_dataset_with_path( 77 | **kwargs, file_path=data_args.eval_data_file, max_examples=data_args.max_eval_examples 78 | ) 79 | 80 | if config.model_type == "xlnet": 81 | data_collator = DataCollatorForPermutationLanguageModeling( 82 | tokenizer=tokenizer, 83 | plm_probability=data_args.plm_probability, 84 | max_span_length=data_args.max_span_length, 85 | ) 86 | else: 87 | if data_args.task_mode == 'e2e' or data_args.task_mode == 'dart': 88 | data_collator = DataCollatorForData2TextLanguageModeling( 89 | tokenizer=tokenizer, mlm=False, format_mode=data_args.format_mode 90 | ) 91 | else: 92 | data_collator = DataCollatorForLanguageModeling( 93 | tokenizer=tokenizer, mlm=False, 94 | ) 95 | 96 | return train_dataset, valid_dataset, eval_dataset, data_collator 97 | -------------------------------------------------------------------------------- /examples/table2text/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | from transformers import GPT2PreTrainedModel, GPT2LMHeadModel 4 | 5 | 6 | class _View(nn.Module): 7 | def __init__(self, shape): 8 | super(_View, self).__init__() 9 | self.shape = shape 10 | 11 | def forward(self, x): 12 | return x.reshape(*self.shape) 13 | 14 | 15 | class PrefixTuner(GPT2PreTrainedModel): 16 | """A minimalistic implementation of the core components.""" 17 | 18 | def __init__(self, config, model_args, gpt2=None): 19 | super(PrefixTuner, self).__init__(config=config) 20 | 21 | # Instantiate a GPT-2, and DON'T optimizer it! 22 | if gpt2 is None: 23 | self.gpt2 = GPT2LMHeadModel.from_pretrained( 24 | model_args.model_name_or_path, config=config, cache_dir=model_args.cache_dir, 25 | ) 26 | else: 27 | self.gpt2 = gpt2 28 | 29 | self.register_buffer('extra_prefix_ids', torch.arange(model_args.prefix_len)) 30 | # TODO: Also introduce the easier net. 31 | self.extra_prefix_net = nn.Sequential( 32 | nn.Embedding(model_args.prefix_len, config.n_embd), 33 | nn.Linear(config.n_embd, model_args.mid_dim), 34 | nn.Tanh(), 35 | nn.Linear(model_args.mid_dim, config.n_layer * 2 * config.n_embd), 36 | _View((-1, model_args.prefix_len, config.n_layer * 2, config.n_head, config.n_embd // config.n_head)), 37 | nn.Dropout(model_args.prefix_dropout), 38 | ) 39 | 40 | def make_past_key_values(self, bsz=None): 41 | extra_prefix_ids = self.extra_prefix_ids[None, :].expand(bsz, -1) 42 | past_key_values = self.extra_prefix_net(extra_prefix_ids) 43 | # (n_layer, batch_size, n_head, prefix_len, n_embed // n_head). 44 | # e.g., (2, 1, 12, 5, 64,). 
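
The permute-and-split step that follows is the crux of `make_past_key_values`: it turns the prefix network's output into one `(key, value)` chunk per layer. A standalone shape check, with sizes made up to match the example shape in the comment above (`n_layer=2`, `bsz=1`, `n_head=12`, `prefix_len=5`, `head_dim=64`):

```python
import torch

bsz, prefix_len, n_layer, n_head, head_dim = 1, 5, 2, 12, 64
out = torch.randn(bsz, prefix_len, n_layer * 2, n_head, head_dim)  # prefix-net output

out = out.permute([2, 0, 3, 1, 4])     # (n_layer * 2, bsz, n_head, prefix_len, head_dim)
past_key_values = out.split(2, dim=0)  # one chunk of size 2 (key, value) per layer

print(len(past_key_values))      # 2 == n_layer
print(past_key_values[0].shape)  # torch.Size([2, 1, 12, 5, 64])
```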
45 | past_key_values = past_key_values.permute([2, 0, 3, 1, 4]).split(2, dim=0) 46 | return past_key_values 47 | 48 | def state_dict(self): 49 | """Avoid storing GPT-2, since it's not even trained.""" 50 | return self.extra_prefix_net.state_dict() 51 | 52 | def load_state_dict(self, state_dict): 53 | """Avoid loading GPT-2, since it's not even trained.""" 54 | self.extra_prefix_net.load_state_dict(state_dict) 55 | 56 | @property 57 | def major_device(self): 58 | """Returns the device where the parameters are on.""" 59 | return next(self.parameters()).device 60 | 61 | def forward( 62 | self, 63 | input_ids, 64 | attention_mask=None, 65 | token_type_ids=None, 66 | position_ids=None, 67 | head_mask=None, 68 | inputs_embeds=None, 69 | encoder_hidden_states=None, 70 | encoder_attention_mask=None, 71 | labels=None, 72 | use_cache=None, 73 | output_attentions=None, 74 | output_hidden_states=None, 75 | return_dict=None, 76 | **kwargs, 77 | ): 78 | past_key_values = self.make_past_key_values(bsz=input_ids.size(0)) 79 | return self.gpt2( 80 | input_ids=input_ids, 81 | past_key_values=past_key_values, 82 | attention_mask=attention_mask, 83 | token_type_ids=token_type_ids, 84 | position_ids=position_ids, 85 | head_mask=head_mask, 86 | inputs_embeds=inputs_embeds, 87 | encoder_hidden_states=encoder_hidden_states, 88 | encoder_attention_mask=encoder_attention_mask, 89 | labels=labels, 90 | use_cache=use_cache, 91 | output_attentions=output_attentions, 92 | output_hidden_states=output_hidden_states, 93 | return_dict=return_dict, 94 | **kwargs 95 | ) 96 | 97 | def generate(self, input_ids, num_beams, **kwargs): 98 | # Additional files also changed: 99 | # src/transformers/generation_utils.py 100 | # src/transformers/models/gpt2/modeling_gpt2.py 101 | 102 | # A sanity check is to optimize the model for a few updates and check if the beam-search generations changed. 103 | # The confusing logic in generation_utils: 104 | # 1) `past` is used in `GPT2LMHeadModel:prepare_inputs_for_generation`, 105 | # 2) it's converted to `past_key_values` in that function, 106 | # 3) `past_key_values` is then updated in forward due to return_dict, 107 | # 4) `past` is set to `past_key_values` in `generation_utils:_update_model_kwargs_for_generation` 108 | 109 | # This is expansion step is important for generation, since otherwise the shapes are wrong. 110 | past_key_values = self.make_past_key_values(bsz=input_ids.size(0) * num_beams) 111 | # --- 112 | 113 | return self.gpt2.generate( 114 | input_ids=input_ids, 115 | num_beams=num_beams, 116 | past_key_values=past_key_values, 117 | 118 | use_cache=True, 119 | position_ids=None, 120 | 121 | # The logic: At beginning, past=None, and then it gets replaced with past_key_values. 122 | # Can't directly give in past, since otherwise, input_ids gets truncated to the last index. 
123 | use_past_key_values_as_past_at_init=True, 124 | nullify_attention_mask=True, 125 | # --- 126 | 127 | **kwargs 128 | ) 129 | -------------------------------------------------------------------------------- /examples/table2text/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1} 4 | output_dir=${2} 5 | task_mode=${3} 6 | model_name_or_path=${4:-"gpt2"} # One of distilgpt2, gpt2, gpt2-medium, gpt2-large 7 | target_epsilon=${5:-8} 8 | clipping_fn=${6:-"automatic"} 9 | clipping_mode=${7:-"MixOpt"} 10 | clipping_style=${8:-"all-layer"} 11 | bias_only=${9:-"no"} 12 | non_private=${10:-"no"} 13 | physical_batch_size=${11:-50} 14 | learning_rate=${12:-0.002} 15 | batch_size=${13:-1000} 16 | attention_only=${14:-"no"} 17 | static_lm_head=${15:-"no"} 18 | static_embedding=${16:-"no"} 19 | 20 | if [[ ${task_mode} == "e2e" ]]; then 21 | data_dir="${data_dir}/data/e2e_data" 22 | target_delta=8e-6 23 | num_train_epochs=10 24 | max_seq_len=100 25 | else 26 | if [[ ${task_mode} == "dart" ]]; then 27 | target_delta=1e-5 28 | data_dir="${data_dir}/data/dart" 29 | num_train_epochs=15 # Approximately same number of updates. 30 | learning_rate=5e-4 # Lower learning rate for stability in large models. 31 | max_seq_len=120 32 | else 33 | echo "Unknown task: ${task_mode}" 34 | exit 1 35 | fi 36 | fi 37 | 38 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size})) 39 | 40 | # Arguments in the last two lines are the most important. 41 | python table2text/run_language_modeling.py \ 42 | --output_dir ${output_dir} --overwrite_output_dir \ 43 | --task_mode ${task_mode} \ 44 | --model_name_or_path ${model_name_or_path} \ 45 | --tokenizer_name ${model_name_or_path} \ 46 | --do_train --do_eval \ 47 | --line_by_line \ 48 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 49 | --logging_dir ${output_dir} --logging_steps -1 \ 50 | --seed 0 \ 51 | --eval_steps 100 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 52 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 53 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 54 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 55 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 56 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \ 57 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 58 | --non_private ${non_private} \ 59 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" \ 60 | 61 | 62 | 63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /examples/table2text/run_ZERO1.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1:-"data/prefix-tuning"} 4 | output_dir=${2:-"data/output"} 5 | task_mode=${3:-"e2e"} 6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl","gptj" 7 | target_epsilon=${5:-8} 8 
| clipping_fn=${6:-"automatic"} 9 | clipping_mode=${7:-"MixOpt"} 10 | clipping_style=${8:-"layer-wise"} 11 | bias_only=${9:-"no"} 12 | non_private=${10:-"no"} 13 | physical_batch_size=${11:-4} 14 | learning_rate=${12:-0.002} 15 | batch_size=${13:-1024} 16 | attention_only=${14:-"no"} 17 | static_lm_head=${15:-"no"} 18 | static_embedding=${16:-"no"} 19 | num_GPUs=${17:-8} 20 | deepspeed_config=${18:-"table2text/gpt_config_stage123.json"} 21 | 22 | if [[ ${task_mode} == "e2e" ]]; then 23 | data_dir="${data_dir}/data/e2e_data" 24 | target_delta=8e-6 25 | num_train_epochs=10 26 | max_seq_len=100 27 | if [[ ${bias_only} == "yes" ]]; then 28 | learning_rate=1e-2 29 | else 30 | learning_rate=2e-3 31 | fi 32 | else 33 | if [[ ${task_mode} == "dart" ]]; then 34 | target_delta=1e-5 35 | data_dir="${data_dir}/data/dart" 36 | num_train_epochs=15 # Approximately same number of updates. 37 | learning_rate=5e-4 # Lower learning rate for stability in large models. 38 | max_seq_len=120 39 | if [[ ${bias_only} == "yes" ]]; then 40 | learning_rate=2e-3 41 | else 42 | learning_rate=5e-4 43 | fi 44 | 45 | else 46 | echo "Unknown task: ${task_mode}" 47 | exit 1 48 | fi 49 | fi 50 | 51 | deepspeed table2text/run_language_modeling.py --deepspeed_config ${deepspeed_config} \ 52 | --output_dir ${output_dir} --overwrite_output_dir \ 53 | --task_mode ${task_mode} \ 54 | --model_name_or_path ${model_name_or_path} \ 55 | --tokenizer_name ${model_name_or_path} \ 56 | --do_train --do_eval \ 57 | --line_by_line \ 58 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 59 | --logging_dir ${output_dir} --logging_steps -1 \ 60 | --seed 0 \ 61 | --dataloader_num_workers 2 \ 62 | --eval_steps -1 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \ 63 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 64 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 65 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 66 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 67 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 68 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --logical_batch_size ${batch_size}\ 69 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 70 | --non_private ${non_private} \ 71 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" 72 | -------------------------------------------------------------------------------- /examples/table2text/run_ZERO23.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1:-"table2text/data/prefix-tuning"} 4 | output_dir=${2:-"table2text/data/output"} 5 | task_mode=${3:-"e2e"} 6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl" 7 | target_epsilon=${5:-8} 8 | clipping_fn=${6:-"automatic"} 9 | clipping_mode=${7:-"MixOpt"} 10 | clipping_style=${8:-"layer-wise"} 11 | bias_only=${9:-"no"} 12 | non_private=${10:-"no"} 13 | physical_batch_size=${11:-4} 14 | learning_rate=${12:-0.001} 15 | batch_size=${13:-1024} 16 | attention_only=${14:-"no"} 17 | 
static_lm_head=${15:-"no"} 18 | static_embedding=${16:-"no"} 19 | num_GPUs=${17:-8} 20 | deepspeed_config=${18:-"table2text/gpt_config_stage123.json"} 21 | 22 | if [[ ${task_mode} == "e2e" ]]; then 23 | data_dir="${data_dir}/data/e2e_data" 24 | target_delta=8e-6 25 | num_train_epochs=10 26 | max_seq_len=100 27 | else 28 | if [[ ${task_mode} == "dart" ]]; then 29 | target_delta=1e-5 30 | data_dir="${data_dir}/data/dart" 31 | num_train_epochs=15 # Approximately same number of updates. 32 | learning_rate=5e-4 # Lower learning rate for stability in large models. 33 | max_seq_len=120 34 | else 35 | echo "Unknown task: ${task_mode}" 36 | exit 1 37 | fi 38 | fi 39 | 40 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size} / ${num_GPUs})) 41 | 42 | # Arguments in the last two lines are the most important. 43 | deepspeed table2text/run_language_modeling_ZERO23.py --deepspeed_config ${deepspeed_config} \ 44 | --output_dir ${output_dir} --overwrite_output_dir \ 45 | --task_mode ${task_mode} \ 46 | --model_name_or_path ${model_name_or_path} \ 47 | --tokenizer_name ${model_name_or_path} \ 48 | --do_train --do_eval \ 49 | --line_by_line \ 50 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 51 | --logging_dir ${output_dir} --logging_steps -1 \ 52 | --seed 0 \ 53 | --dataloader_num_workers 2 \ 54 | --eval_steps -1 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \ 55 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 56 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 57 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 58 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 59 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 60 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \ 61 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 62 | --non_private ${non_private} \ 63 | --clipping_mode "${clipping_mode}" --clipping_fn "${clipping_fn}" --clipping_style "${clipping_style}" 64 | -------------------------------------------------------------------------------- /examples/table2text/run_ZERO_extending.py: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_dir=${1} 4 | output_dir=${2} 5 | task_mode=${3:-"e2e"} 6 | model_name_or_path=${4:-"gpt2"} # One of "distilgpt2", "gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl" 7 | target_epsilon=${5:-8} 8 | bias_only=${6:-"no"} 9 | non_private=${7:-"no"} 10 | physical_batch_size=${8:-4} 11 | learning_rate=${9:-0.001} 12 | batch_size=${10:-1024} 13 | attention_only=${11:-"no"} 14 | static_lm_head=${12:-"no"} 15 | static_embedding=${13:-"no"} 16 | num_GPUs=${14:-8} 17 | deepspeed_config=${15:-"table2text/gpt_config_stage123.json"} 18 | 19 | if [[ ${task_mode} == "e2e" ]]; then 20 | data_dir="${data_dir}/data/e2e_data" 21 | target_delta=8e-6 22 | num_train_epochs=10 23 | max_seq_len=100 24 | else 25 | if [[ ${task_mode} == "dart" ]]; then 26 | target_delta=1e-5 27 | data_dir="${data_dir}/data/dart" 28 | num_train_epochs=15 # Approximately same number of updates. 
29 | learning_rate=5e-4 # Lower learning rate for stability in large models. 30 | max_seq_len=120 31 | else 32 | echo "Unknown task: ${task_mode}" 33 | exit 1 34 | fi 35 | fi 36 | 37 | gradient_accumulation_steps=$((${batch_size} / ${physical_batch_size} / ${num_GPUs})) 38 | 39 | deepspeed table2text/run_language_modeling_extending.py --deepspeed_config ${deepspeed_config} \ 40 | --output_dir ${output_dir} --overwrite_output_dir \ 41 | --task_mode ${task_mode} \ 42 | --model_name_or_path ${model_name_or_path} \ 43 | --tokenizer_name ${model_name_or_path} \ 44 | --do_train --do_eval \ 45 | --line_by_line \ 46 | --save_steps 100 --save_total_limit 1 --save_at_last no \ 47 | --logging_dir ${output_dir} --logging_steps -1 \ 48 | --seed 0 \ 49 | --dataloader_num_workers 2 \ 50 | --eval_steps 100 --eval_epochs 999 --max_eval_batches 100 --evaluation_strategy epoch --evaluate_before_training "no" \ 51 | --evaluate_during_training "no" --per_device_eval_batch_size 10 \ 52 | --max_generations 9223372036854775807 --max_generations_train 10 --max_generations_valid 9223372036854775807 \ 53 | --max_train_examples 9223372036854775807 --max_valid_examples 9223372036854775807 --max_eval_examples 9223372036854775807 \ 54 | --data_folder ${data_dir} --max_seq_len ${max_seq_len} --format_mode cat \ 55 | --per_example_max_grad_norm 0.1 --target_delta ${target_delta} --target_epsilon ${target_epsilon} \ 56 | --learning_rate ${learning_rate} --lr_decay "no" --num_train_epochs ${num_train_epochs} --per_device_train_batch_size ${physical_batch_size} --gradient_accumulation_steps ${gradient_accumulation_steps} \ 57 | --attention_only ${attention_only} --bias_only ${bias_only} --static_lm_head ${static_lm_head} --static_embedding ${static_embedding} \ 58 | --non_private ${non_private} \ 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /examples/text_classification/README.md: -------------------------------------------------------------------------------- 1 | ## DP text classification with Huggingface transformers 2 | 3 | ### Getting the data 4 | 5 | We adopt the data pipeline by \[[Li et al., 2021](https://arxiv.org/pdf/2110.05679.pdf)\], which is adapted from the excellent work by \[[Gao et al., 2021](https://arxiv.org/pdf/2012.15723.pdf)\]. To obtain the data, run the following: 6 | 7 | ```plaintext 8 | cd data; bash download_dataset.sh 9 | ``` 10 | 11 | This should produce a `data/original` subfolder that contains the GLUE ([General Language Understanding Evaluation](https://huggingface.co/datasets/glue)) datasets. 12 | 13 | ### Running 14 | 15 | Use the `run_wrapper.py` script in the folder, which runs the `run_classification.py` for the command. 16 | 17 | Necessary arguments: 18 | 19 | * `--output_dir`: path to a folder where results will be written 20 | * `--task_name`: name of task; one of `sst-2`, `qnli`, `qqp`, `mnli` 21 | 22 | For instance, run the following under the `examples` folder: 23 | 24 | ```plaintext 25 | python -m text_classification.run_wrapper --output_dir ToDeleteNLU --task_name sst-2 26 | ``` 27 | 28 | The script by default uses book-keeping (BK) by [[Differentially Private Optimization on Large Model at Small Cost]](https://arxiv.org/pdf/2210.00038.pdf) for the DP full fine-tuning. Gradient accumulation is used so that larger physical batch size allows faster training at heavier memory burden, but the accuracy is not affected. 
For SST-2/QNLI/QQP/MNLI, running `roberta-base` on one A100 GPU (40GB) takes around 5/8/37/32 min per epoch.
29 | 
30 | Additional arguments:
31 | 
32 | * `--model_name_or_path`: The pretrained model; one of `distilbert-base-uncased`, `bert-base-uncased`, `bert-large-uncased`, `distilroberta-base`, `roberta-base`, `roberta-large`.
33 | 
34 | * `--target_epsilon`: Target privacy spending; default is 8.
35 | 
36 | * `--few_shot_type`: Whether to use the generic prompt formatter described in Section 3.2 of our paper; pass `prompt` to use it, or `finetune` to fine-tune without it.
37 | 
38 | * `--non_private`: Whether to train differentially privately; one of `yes`, `no` (default).
39 | 
40 | * `--clipping_mode`: Which DP algorithm to use for per-sample gradient clipping; one of `ghost` (default, meaning book-keeping), `MixGhostClip`, `MixOpt`. All three modes are from [Bu et al., 2022](https://arxiv.org/pdf/2210.00038.pdf).
41 | 
42 | * `--clipping_fn`: Which per-sample gradient clipping function to use; one of `automatic` (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), `Abadi` [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), `global` [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf).
43 | 
44 | * `--clipping_style`: Which per-sample gradient clipping style to use; one of `all-layer` (flat clipping), `layer-wise` (each layer is a group, including both weight and bias parameters), `param-wise` (each parameter is a group), or a list of layer names (general group-wise clipping). For example, pass `--clipping_style 2` for 2-group clipping or `--clipping_style 12` for 12-group clipping.
45 | 
46 | * `--attention_only`: Whether to only train attention layers; one of `yes`, `no` (default).
47 | 
48 | * `--bias_only`: Whether to only train bias terms; one of `yes`, `no` (default). If yes, this is implementing [[Differentially Private Bias-Term only
49 | Fine-tuning]](https://arxiv.org/pdf/2210.00036.pdf).
50 | 
51 | * `--physical_batch_size`: Physical batch size under gradient accumulation; it determines memory use and speed, but not accuracy.
52 | 
53 | * `--batch_size`: Logical batch size, which determines convergence and accuracy; must be a multiple of `physical_batch_size`. Default is None.
54 | 
55 | Note that, keeping the other training hyperparameters (e.g., number of training epochs, clipping threshold, learning rate) at their defaults, the script should reproduce the results in \[[Li et al., 2021](https://arxiv.org/pdf/2110.05679.pdf); [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)\].
56 | 
--------------------------------------------------------------------------------
/examples/text_classification/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/text_classification/__init__.py
--------------------------------------------------------------------------------
/examples/text_classification/data/download_dataset.sh:
--------------------------------------------------------------------------------
1 | wget https://nlp.cs.princeton.edu/projects/lm-bff/datasets.tar
2 | tar xvf datasets.tar
--------------------------------------------------------------------------------
/examples/text_classification/data/make_k_shot_without_dev.py:
--------------------------------------------------------------------------------
1 | """The datasets in the k-shot folder contain dev.tsv; we make the test set the dev set in the new k-shot.
2 | 3 | python -m classification.data.make_k_shot_without_dev 4 | """ 5 | import os 6 | 7 | from ml_swissknife import utils 8 | 9 | join = os.path.join 10 | 11 | base_dir = '/nlp/scr/lxuechen/data/lm-bff/data/k-shot' 12 | new_dir = '/nlp/scr/lxuechen/data/lm-bff/data/k-shot-no-dev' 13 | 14 | task_names = ("SST-2", "QNLI", "MNLI", "QQP") 15 | for task_name in task_names: 16 | folder = join(base_dir, task_name) 17 | new_folder = join(new_dir, task_name) 18 | 19 | for name in utils.listdir(folder): 20 | subfolder = join(folder, name) 21 | new_subfolder = join(new_folder, name) 22 | os.makedirs(new_subfolder, exist_ok=True) 23 | 24 | train = join(subfolder, 'train.tsv') 25 | new_train = join(new_subfolder, 'train.tsv') 26 | os.system(f'cp {train} {new_train}') 27 | 28 | if task_name == "MNLI": 29 | test = join(subfolder, 'test_matched.tsv') 30 | new_dev = join(new_subfolder, 'dev_matched.tsv') 31 | os.system(f'cp {test} {new_dev}') 32 | 33 | test = join(subfolder, 'test_mismatched.tsv') 34 | new_dev = join(new_subfolder, 'dev_mismatched.tsv') 35 | os.system(f'cp {test} {new_dev}') 36 | else: 37 | test = join(subfolder, 'test.tsv') 38 | new_dev = join(new_subfolder, 'dev.tsv') 39 | os.system(f'cp {test} {new_dev}') 40 | -------------------------------------------------------------------------------- /examples/text_classification/data/make_valid_data.py: -------------------------------------------------------------------------------- 1 | """Make the separate validation data, so that we don't tune on dev set. 2 | 3 | python -m classification.data.make_valid_data 4 | """ 5 | import os 6 | 7 | import fire 8 | import numpy as np 9 | import tqdm 10 | 11 | 12 | def write_lines(path, lines, mode="w"): 13 | os.makedirs(os.path.dirname(path), exist_ok=True) 14 | with open(path, mode) as f: 15 | f.writelines(lines) 16 | print(len(lines)) 17 | 18 | 19 | def main(): 20 | valid_percentage = 0.1 21 | original_dir = "/nlp/scr/lxuechen/data/lm-bff/data/original" 22 | new_dir = "/nlp/scr/lxuechen/data/lm-bff/data/glue-with-validation" 23 | 24 | task_folders = ("GLUE-SST-2", "QNLI", "QQP") 25 | for task_folder in task_folders: 26 | # Create train and valid splits. 27 | full_train_path = os.path.join(original_dir, task_folder, 'train.tsv') 28 | with open(full_train_path, 'r') as f: 29 | full_train = f.readlines() 30 | 31 | header = full_train[0] 32 | full_train = full_train[1:] # Remove header. 33 | 34 | indices = np.random.permutation(len(full_train)) 35 | new_valid_size = int(len(indices) * valid_percentage) 36 | new_train_size = len(indices) - new_valid_size 37 | new_train_indices = indices[:new_train_size] 38 | new_valid_indices = indices[new_train_size:] 39 | assert len(new_train_indices) == new_train_size 40 | assert len(new_valid_indices) == new_valid_size 41 | 42 | new_train = [header] + [full_train[i] for i in new_train_indices] 43 | new_valid = [header] + [full_train[i] for i in new_valid_indices] 44 | 45 | new_train_path = os.path.join(new_dir, task_folder, 'train.tsv') 46 | new_valid_path = os.path.join(new_dir, task_folder, 'dev.tsv') 47 | 48 | write_lines(new_train_path, new_train) 49 | write_lines(new_valid_path, new_valid) 50 | del new_train, new_valid, new_train_path, new_valid_path 51 | del new_train_size, new_train_indices 52 | del new_valid_size, new_valid_indices 53 | 54 | # Make test! 
55 | test_path = os.path.join(original_dir, task_folder, 'dev.tsv') 56 | new_test_path = os.path.join(new_dir, task_folder, 'test.tsv') 57 | os.system(f'cp {test_path} {new_test_path}') 58 | del test_path, new_test_path 59 | 60 | # Make valid set for MNLI; different, since matched/mismatched! 61 | task_folder = "MNLI" 62 | matched_genres = ['slate', 'government', 'telephone', 'travel', 'fiction'] 63 | mismatched_genres = ['letters', 'verbatim', 'facetoface', 'oup', 'nineeleven'] 64 | full_train_path = os.path.join(original_dir, task_folder, 'train.tsv') 65 | with open(full_train_path, 'r') as f: 66 | full_train = f.readlines() 67 | full_train_csv = [line.split('\t') for line in full_train] 68 | 69 | # Check the lengths are correct. 70 | l = len(full_train_csv[0]) 71 | for line in full_train_csv: 72 | assert l == len(line) 73 | 74 | # Remove header. 75 | header = full_train[0] 76 | header_csv = full_train_csv[0] 77 | 78 | full_train = full_train[1:] 79 | full_train_csv = full_train_csv[1:] 80 | 81 | # Get index of genre. 82 | genre_index = header_csv.index('genre') 83 | 84 | # Shuffle both! 85 | indices = np.random.permutation(len(full_train)) 86 | full_train = [full_train[i] for i in indices] 87 | full_train_csv = [full_train_csv[i] for i in indices] 88 | 89 | # Split validation. 90 | new_valid_size = int(len(indices) * valid_percentage) 91 | new_matched_valid_size = new_mismatched_valid_size = new_valid_size // 2 92 | 93 | # Fetch the indices. 94 | new_train_indices = [] 95 | new_matched_valid_indices = [] 96 | new_mismatched_valid_indices = [] 97 | matched_count = mismatched_count = 0 98 | for i, row in enumerate(full_train_csv): 99 | genre = row[genre_index] 100 | if genre in matched_genres and matched_count < new_matched_valid_size: 101 | new_matched_valid_indices.append(i) 102 | matched_count += 1 103 | elif genre in mismatched_genres and mismatched_count < new_mismatched_valid_size: 104 | new_mismatched_valid_indices.append(i) 105 | mismatched_count += 1 106 | else: 107 | new_train_indices.append(i) 108 | 109 | new_matched_valid_indices = set(new_matched_valid_indices) 110 | new_mismatched_valid_indices = set(new_mismatched_valid_indices) 111 | 112 | new_train = [header] 113 | new_matched_valid = [header] 114 | new_mismatched_valid = [header] 115 | for i, line in tqdm.tqdm(enumerate(full_train)): 116 | if i in new_matched_valid_indices: 117 | new_matched_valid.append(line) 118 | elif i in new_mismatched_valid_indices: 119 | new_mismatched_valid.append(line) 120 | else: 121 | new_train.append(line) 122 | 123 | new_train_path = os.path.join(new_dir, task_folder, 'train.tsv') 124 | new_matched_valid_path = os.path.join(new_dir, task_folder, 'dev_matched.tsv') 125 | new_mismatched_valid_path = os.path.join(new_dir, task_folder, 'dev_mismatched.tsv') 126 | 127 | write_lines(new_train_path, new_train) 128 | write_lines(new_matched_valid_path, new_matched_valid) 129 | write_lines(new_mismatched_valid_path, new_mismatched_valid) 130 | 131 | matched_test_path = os.path.join(original_dir, task_folder, 'dev_matched.tsv') 132 | new_matched_test_path = os.path.join(new_dir, task_folder, 'test_matched.tsv') 133 | os.system(f'cp {matched_test_path} {new_matched_test_path}') 134 | 135 | mismatched_test_path = os.path.join(original_dir, task_folder, 'dev_mismatched.tsv') 136 | new_mismatched_test_path = os.path.join(new_dir, task_folder, 'test_mismatched.tsv') 137 | os.system(f'cp {mismatched_test_path} {new_mismatched_test_path}') 138 | 139 | 140 | if __name__ == "__main__": 141 | fire.Fire(main) 
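
The shuffle-then-slice split used in `main` above generalizes to any line-based dataset. A self-contained sketch with a made-up ten-example dataset (the real script leaves NumPy's global RNG unseeded; the explicit seed here is only for a reproducible illustration):

```python
import numpy as np

lines = [f"example-{i}\n" for i in range(10)]  # stand-in for the header-stripped train split
valid_percentage = 0.1

rng = np.random.default_rng(0)
indices = rng.permutation(len(lines))
new_valid_size = int(len(indices) * valid_percentage)
new_train_size = len(indices) - new_valid_size

new_train = [lines[i] for i in indices[:new_train_size]]
new_valid = [lines[i] for i in indices[new_train_size:]]
assert len(new_train) == 9 and len(new_valid) == 1
```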
142 | -------------------------------------------------------------------------------- /examples/text_classification/run_wrapper.py: -------------------------------------------------------------------------------- 1 | """Wrapper launcher script.""" 2 | 3 | import os 4 | 5 | import fire 6 | 7 | from .src import common 8 | 9 | 10 | def _get_command( 11 | task_name, 12 | output_dir, 13 | model_name_or_path, 14 | data_dir, 15 | learning_rate, 16 | clipping_mode: str, 17 | clipping_fn: str, 18 | clipping_style: str, 19 | non_private, 20 | target_epsilon, 21 | few_shot_type, 22 | seed, 23 | attention_only, 24 | bias_only, 25 | static_lm_head, 26 | static_embedding, 27 | randomly_initialize, 28 | physical_batch_size, 29 | batch_size, 30 | num_train_epochs, 31 | eval_steps, 32 | ): 33 | task_name_to_factor = { 34 | "sst-2": 1, "qnli": 2, "qqp": 6, "mnli": 6, 35 | } 36 | factor = task_name_to_factor[task_name] 37 | 38 | if batch_size is None: 39 | base_batch_size = 1000 40 | # This batch size selection roughly ensures the sampling rates on different 41 | # datasets are in the same ballpark. 42 | batch_size = int(base_batch_size * factor) 43 | gradient_accumulation_steps = batch_size // physical_batch_size 44 | 45 | if num_train_epochs is None: 46 | base_num_train_epochs = 3 47 | num_train_epochs = int(base_num_train_epochs * factor) 48 | 49 | if learning_rate is None: 50 | if non_private.lower() in ('yes', 'y', 'true', 't'): 51 | learning_rate = 5e-5 52 | if bias_only.lower() in ('yes', 'y', 'true', 't'): 53 | learning_rate=1e-3 54 | else: 55 | learning_rate = 5e-4 56 | if bias_only.lower() in ('yes', 'y', 'true', 't'): 57 | learning_rate=5e-3 58 | 59 | data_dir = f"{data_dir}/{common.task_name2suffix_name[task_name]}" 60 | template = { 61 | "sst-2": "*cls**sent_0*_It_was*mask*.*sep+*", 62 | "mnli": "*cls**sent-_0*?*mask*,*+sentl_1**sep+*", 63 | "qnli": "*cls**sent-_0*?*mask*,*+sentl_1**sep+*", 64 | "qqp": "*cls**sent-_0**mask*,*+sentl_1**sep+*", 65 | }[task_name] 66 | 67 | # Epochs chosen roughly to match e2e number of updates. 
We didn't hyperparameter tune on classification tasks :) 68 | cmd = f''' 69 | python -m text_classification.run_classification \ 70 | --task_name {task_name} \ 71 | --data_dir {data_dir} \ 72 | --output_dir {output_dir} \ 73 | --overwrite_output_dir \ 74 | --model_name_or_path {model_name_or_path} \ 75 | --few_shot_type {few_shot_type} \ 76 | --num_k 1 \ 77 | --num_sample 1 --seed {seed} \ 78 | --template {template} \ 79 | --non_private {non_private} \ 80 | --num_train_epochs {num_train_epochs} \ 81 | --target_epsilon {target_epsilon} \ 82 | --per_device_train_batch_size {physical_batch_size} \ 83 | --gradient_accumulation_steps {gradient_accumulation_steps} \ 84 | --per_device_eval_batch_size 8 \ 85 | --per_example_max_grad_norm 0.1 --clipping_mode {clipping_mode} \ 86 | --clipping_fn {clipping_fn} --clipping_style {clipping_style}\ 87 | --learning_rate {learning_rate} \ 88 | --lr_decay yes \ 89 | --adam_epsilon 1e-08 \ 90 | --weight_decay 0 \ 91 | --max_seq_len 256 \ 92 | --evaluation_strategy steps --eval_steps {eval_steps} --evaluate_before_training True \ 93 | --do_train --do_eval \ 94 | --first_sent_limit 200 --other_sent_limit 200 --truncate_head yes \ 95 | --attention_only {attention_only} --bias_only {bias_only} --static_lm_head {static_lm_head} --static_embedding {static_embedding} \ 96 | --randomly_initialize {randomly_initialize} 97 | ''' 98 | return cmd 99 | 100 | 101 | def main( 102 | output_dir, 103 | task_name, 104 | few_shot_type="prompt", # finetune or prompt 105 | seed=0, 106 | model_name_or_path="roberta-base", 107 | data_dir="text_classification/data/original", 108 | learning_rate=None, 109 | clipping_mode="MixOpt", 110 | clipping_fn="automatic", 111 | clipping_style="all-layer", 112 | non_private="no", 113 | target_epsilon=8, 114 | attention_only="no", 115 | bias_only="no", 116 | static_lm_head="no", 117 | static_embedding="no", 118 | physical_batch_size =40, 119 | eval_steps=10, 120 | randomly_initialize="no", 121 | batch_size=None, 122 | num_train_epochs=None, 123 | ): 124 | command = _get_command( 125 | output_dir=output_dir, 126 | task_name=task_name, 127 | model_name_or_path=model_name_or_path, 128 | data_dir=data_dir, 129 | learning_rate=learning_rate, 130 | clipping_mode=clipping_mode, 131 | clipping_fn=clipping_fn, 132 | clipping_style=clipping_style, 133 | non_private=non_private, 134 | target_epsilon=target_epsilon, 135 | few_shot_type=few_shot_type, 136 | seed=seed, 137 | attention_only=attention_only, 138 | bias_only=bias_only, 139 | static_lm_head=static_lm_head, 140 | static_embedding=static_embedding, 141 | physical_batch_size = physical_batch_size, 142 | eval_steps=eval_steps, 143 | randomly_initialize=randomly_initialize, 144 | batch_size=batch_size, 145 | num_train_epochs=num_train_epochs, 146 | ) 147 | print('Running command:') 148 | print(command) 149 | os.system(command) 150 | 151 | 152 | if __name__ == "__main__": 153 | fire.Fire(main) 154 | -------------------------------------------------------------------------------- /examples/text_classification/src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/examples/text_classification/src/__init__.py -------------------------------------------------------------------------------- /examples/text_classification/src/common.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | task_name2suffix_name 
= {"sst-2": "GLUE-SST-2", "mnli": "MNLI", "qqp": "QQP", "qnli": "QNLI"} 4 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 5 | true_tags = ('y', 'yes', 't', 'true') 6 | -------------------------------------------------------------------------------- /examples/text_classification/src/compiled_args.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass, field 2 | 3 | import transformers 4 | 5 | from .common import true_tags 6 | from typing import Optional 7 | 8 | 9 | @dataclass 10 | class PrivacyArguments: 11 | """Arguments for differentially private training.""" 12 | 13 | per_example_max_grad_norm: float = field( 14 | default=.1, metadata={ 15 | "help": "Clipping 2-norm of per-sample gradients." 16 | } 17 | ) 18 | noise_multiplier: float = field( 19 | default=None, metadata={ 20 | "help": "Standard deviation of noise added for privacy; if `target_epsilon` is specified, " 21 | "use the one searched based budget" 22 | } 23 | ) 24 | target_epsilon: float = field( 25 | default=None, metadata={ 26 | "help": "Privacy budget; if `None` use the noise multiplier specified." 27 | } 28 | ) 29 | target_delta: float = field( 30 | default=None, metadata={ 31 | "help": "Lax probability in approximate differential privacy; if `None` use 1 / len(train_data)." 32 | } 33 | ) 34 | non_private: str = field( 35 | default="yes", metadata={"help": "Train non-privately if True."} 36 | ) 37 | accounting_mode: str = field( 38 | default="rdp", metadata={"help": "One of (`rdp`, `glw`, `all`)."} 39 | ) 40 | clipping_mode: str = field( 41 | default="default" 42 | ) 43 | clipping_fn: str = field( 44 | default="automatic" 45 | ) 46 | clipping_style: str = field( 47 | default="all-layer" 48 | ) 49 | 50 | def __post_init__(self): 51 | self.non_private = self.non_private.lower() in true_tags # noqa 52 | 53 | 54 | @dataclass 55 | class TrainingArguments(transformers.TrainingArguments): 56 | eval_epochs: int = field(default=10, metadata={"help": "Evaluate once such epochs"}) 57 | evaluate_before_training: bool = field(default=False, metadata={"help": "Run evaluation before training."}) 58 | lr_decay: str = field( 59 | default="no", metadata={"help": "Apply the usual linear decay if `yes`, otherwise no deacy."} 60 | ) 61 | evaluate_test_split: bool = field(default=False, metadata={"help": "Run evaluation on the test split"}) 62 | 63 | def __post_init__(self): 64 | super(TrainingArguments, self).__post_init__() 65 | self.lr_decay = self.lr_decay.lower() in true_tags # noqa 66 | -------------------------------------------------------------------------------- /examples/text_classification/src/label_search.py: -------------------------------------------------------------------------------- 1 | """Automatic label search helpers.""" 2 | 3 | import itertools 4 | import logging 5 | import multiprocessing 6 | 7 | import numpy as np 8 | import scipy.spatial as spatial 9 | import scipy.special as special 10 | import scipy.stats as stats 11 | import tqdm 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | def select_likely_words(train_logits, train_labels, k_likely=1000, vocab=None, is_regression=False): 17 | """Pre-select likely words based on conditional likelihood.""" 18 | indices = [] 19 | if is_regression: 20 | median = np.median(train_labels) 21 | train_labels = (train_labels > median).astype(np.int) 22 | num_labels = np.max(train_labels) + 1 23 | for idx in range(num_labels): 24 | label_logits = train_logits[train_labels == idx] 25 | 
scores = label_logits.mean(axis=0) 26 | kept = [] 27 | for i in np.argsort(-scores): 28 | text = vocab[i] 29 | if not text.startswith("Ġ"): 30 | continue 31 | kept.append(i) 32 | indices.append(kept[:k_likely]) 33 | return indices 34 | 35 | 36 | def select_neighbors(distances, k_neighbors, valid): 37 | """Select k nearest neighbors based on distance (filtered to be within the 'valid' set).""" 38 | indices = np.argsort(distances) 39 | neighbors = [] 40 | for i in indices: 41 | if i not in valid: 42 | continue 43 | neighbors.append(i) 44 | if k_neighbors > 0: 45 | return neighbors[:k_neighbors] 46 | return neighbors 47 | 48 | 49 | def init(train_logits, train_labels): 50 | global logits, labels 51 | logits = train_logits 52 | labels = train_labels 53 | 54 | 55 | def eval_pairing_acc(pairing): 56 | global logits, labels 57 | label_logits = np.take(logits, pairing, axis=-1) 58 | preds = np.argmax(label_logits, axis=-1) 59 | correct = np.sum(preds == labels) 60 | return correct / len(labels) 61 | 62 | 63 | def eval_pairing_corr(pairing): 64 | global logits, labels 65 | if pairing[0] == pairing[1]: 66 | return -1 67 | label_logits = np.take(logits, pairing, axis=-1) 68 | label_probs = special.softmax(label_logits, axis=-1)[:, 1] 69 | pearson_corr = stats.pearsonr(label_probs, labels)[0] 70 | return pearson_corr 71 | 72 | 73 | def find_labels( 74 | model, 75 | train_logits, 76 | train_labels, 77 | seed_labels=None, 78 | k_likely=1000, 79 | k_neighbors=None, 80 | top_n=-1, 81 | vocab=None, 82 | is_regression=False, 83 | ): 84 | # Get top indices based on conditional likelihood using the LM. 85 | likely_indices = select_likely_words( 86 | train_logits=train_logits, 87 | train_labels=train_labels, 88 | k_likely=k_likely, 89 | vocab=vocab, 90 | is_regression=is_regression) 91 | 92 | logger.info("Top labels (conditional) per class:") 93 | for i, inds in enumerate(likely_indices): 94 | logger.info("\t| Label %d: %s", i, ", ".join([vocab[i] for i in inds[:10]])) 95 | 96 | # Convert to sets. 97 | valid_indices = [set(inds) for inds in likely_indices] 98 | 99 | # If specified, further re-rank according to nearest neighbors of seed labels. 100 | # Otherwise, keep ranking as is (based on conditional likelihood only). 101 | if seed_labels: 102 | assert (vocab is not None) 103 | seed_ids = [vocab.index(l) for l in seed_labels] 104 | vocab_vecs = model.lm_head.decoder.weight.detach().cpu().numpy() 105 | seed_vecs = np.take(vocab_vecs, seed_ids, axis=0) 106 | 107 | # [num_labels, vocab_size] 108 | label_distances = spatial.distance.cdist(seed_vecs, vocab_vecs, metric="cosine") 109 | 110 | # Establish label candidates (as k nearest neighbors). 111 | label_candidates = [] 112 | logger.info("Re-ranked by nearest neighbors:") 113 | for i, distances in enumerate(label_distances): 114 | label_candidates.append(select_neighbors(distances, k_neighbors, valid_indices[i])) 115 | logger.info("\t| Label: %s", seed_labels[i]) 116 | logger.info("\t| Neighbors: %s", " ".join([vocab[idx] for idx in label_candidates[i]])) 117 | else: 118 | label_candidates = likely_indices 119 | 120 | # Brute-force search all valid pairings. 121 | pairings = list(itertools.product(*label_candidates)) 122 | 123 | if is_regression: 124 | eval_pairing = eval_pairing_corr 125 | metric = "corr" 126 | else: 127 | eval_pairing = eval_pairing_acc 128 | metric = "acc" 129 | 130 | # Score each pairing. 
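    # Each `pairing` is one candidate token id per class; workers score a pairing
    # with eval_pairing_acc (classification) or eval_pairing_corr (regression),
    # reading the train logits/labels shared through the pool initializer `init`.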
131 |     pairing_scores = []
132 |     with multiprocessing.Pool(initializer=init, initargs=(train_logits, train_labels)) as workers:
133 |         with tqdm.tqdm(total=len(pairings)) as pbar:
134 |             chunksize = max(10, int(len(pairings) / 1000))
135 |             for score in workers.imap(eval_pairing, pairings, chunksize=chunksize):
136 |                 pairing_scores.append(score)
137 |                 pbar.update()
138 |
139 |     # Take top-n.
140 |     best_idx = np.argsort(-np.array(pairing_scores))[:top_n]
141 |     best_scores = [pairing_scores[i] for i in best_idx]
142 |     best_pairings = [pairings[i] for i in best_idx]
143 |
144 |     logger.info("Automatically searched pairings:")
145 |     for i, indices in enumerate(best_pairings):
146 |         logger.info("\t| %s (%s = %2.2f)", " ".join([vocab[j] for j in indices]), metric, best_scores[i])
147 |
148 |     return best_pairings
149 |
--------------------------------------------------------------------------------
/fastDP/README.md:
--------------------------------------------------------------------------------
1 | ### Two Privacy Engines
2 |
3 | FastDP provides two privacy engines to compute the private gradient: **hook-based** and **torch-extending**. These privacy engines are mathematically equivalent, though their applicability and computational efficiency can differ. We summarize the differences below and note that some limitations can be overcome with more engineering effort.
4 |
5 | |  | Hook-based (DP) | Torch-extending (DP) | Standard (non-DP) |
6 | |:----------------------------:|:-------------------------------:|:----------------:|:------------:|
7 | | Speed (1/time complexity) | 80-100% | ~70% | 100% |
8 | | Memory cost (space complexity) | 100-130% | ~100% | 100% |
9 | | ZeRO distribution solution | ✅ Supported | ✅ Supported | ✅ Supported |
10 | | Most types of layers | ✅ Supported (see below) | ✅ Supported (see below) | ✅ Supported |
11 | | Per-sample clipping styles | ✅ Supported for all styles | Layer-wise style | ✅ Not needed |
12 | | Per-sample clipping functions | ✅ Supported for all functions | Automatic clipping | ✅ Not needed |
13 | | Modifying optimizers | Needed for `PrivacyEngine`; not needed for ZeRO | ✅ Not needed | ✅ Not needed |
14 | | Private gradient stored in | `param.private_grad` | `param.grad` | `param.grad` |
15 | | Fused kernel | ✅ Supported | Not supported | ✅ Supported |
16 | | Ghost differentiation (origin param) | Supported on single GPU | Not supported | Not needed |
17 | | Recommended usage | Single GPU or ZeRO | General | General |
18 |
19 | #### 1. Hook-based
20 | The hook-based approach computes the private gradient with forward hooks (to store the activations) and backward hooks (to compute the per-sample gradient norms, to clip, and to add noise). See [this tutorial for hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html). This approach first computes the private gradient and then overrides the non-DP gradient.
21 |
22 | On a single GPU or under data parallelism (see `PrivacyEngine`), the hooks are backward module hooks, which are triggered before `param.grad` is computed; in ZeRO (see `PrivacyEngine_Distributed_Stage_2_and_3`), backward tensor hooks are used instead, which are triggered after `param.grad` has been computed.
23 |
24 | #### 2. Torch-extending
25 | The torch-extending approach computes the private gradient directly by re-writing the model's back-propagation mechanism (see `PrivacyEngine_Distributed_extending`). See [this tutorial for extending torch modules](https://pytorch.org/docs/stable/notes/extending.html#extending-torch-nn).
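To make the mechanism concrete, here is a minimal sketch of the idea: a linear layer whose backward pass clips per-sample gradients inside a custom `autograd.Function`. This is an illustration only, assuming 2D inputs, Abadi-style clipping by the weight gradient's norm, and no noise injection; the library's actual implementation in `supported_differentially_private_layers.py` is more general.

```python
import torch
import torch.nn.functional as F

class _DPLinearFunction(torch.autograd.Function):
    """Toy linear layer whose backward returns the sum of clipped per-sample gradients."""

    @staticmethod
    def forward(ctx, x, weight, bias, max_grad_norm):
        ctx.save_for_backward(x, weight)
        ctx.max_grad_norm = max_grad_norm
        return F.linear(x, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        grad_input = grad_output @ weight                            # (B, in); propagated unclipped
        grad_w = torch.einsum('bo,bi->boi', grad_output, x)          # per-sample weight grads (B, out, in)
        norms = grad_w.flatten(start_dim=1).norm(2, dim=1)           # per-sample gradient norms (B,)
        scale = (ctx.max_grad_norm / (norms + 1e-6)).clamp(max=1.0)  # Abadi-style clipping factors
        grad_weight = torch.einsum('b,boi->oi', scale, grad_w)       # sum of clipped per-sample grads
        grad_bias = torch.einsum('b,bo->o', scale, grad_output)
        return grad_input, grad_weight, grad_bias, None

# usage sketch: y = _DPLinearFunction.apply(x, weight, bias, 1.0)
```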
This approach overrides the non-DP modules as shown in `supported_differentially_private_layers.py`. Because it modifies neither the optimizers nor the communication orchestration of distributed solutions, it is expected to be broadly applicable. However, some slowdown may be observed since the extension is not implemented at the C++ level.
26 |
27 | ### Supported Modules
28 |
29 | Our privacy engine supports the commonly used modules that constitute most models, with possibly two methods to compute the per-sample gradient norm:
30 | * nn.Linear (GhostClip & Grad Instantiation)
31 | * nn.LayerNorm (Grad Instantiation)
32 | * nn.GroupNorm (Grad Instantiation)
33 | * nn.InstanceNorm (Grad Instantiation)
34 | * nn.Embedding (GhostClip)
35 | * nn.Conv1d (GhostClip & Grad Instantiation)
36 | * nn.Conv2d (GhostClip & Grad Instantiation)
37 | * nn.Conv3d (GhostClip & Grad Instantiation)
38 |
39 | Frozen (e.g. `nn.Linear` with `requires_grad=False`) and non-trainable (e.g. `nn.ReLU`, `nn.Tanh`, `nn.MaxPool2d`) modules are also supported.
40 |
41 | Note that GhostClip stands for ghost clipping [1][2][3], which computes the gradient norms without creating or storing the gradients, while Grad Instantiation stands for per-sample gradient instantiation [5], which generates the per-sample gradients and then computes their norms. Grad Instantiation can be inefficient for large models, and GhostClip can be inefficient for high-dimensional data; therefore, for modules that support both methods, we allow the method to be chosen per layer (known as the hybrid algorithms of [3][4]).
42 |
43 | ### Arguments
44 | * `module`: The model to be optimized with differential privacy.
45 | * `batch_size`: Logical batch size that determines the convergence and accuracy.
46 | * `sample_size`: Number of training samples.
47 | * `target_epsilon`: Target privacy budget ε.
48 | * `target_delta`: Target privacy budget δ; should be smaller than 1/sample_size.
49 | * `max_grad_norm`: Per-sample gradient clipping threshold; defaults to 1. No need to tune if `clipping_fn="automatic"`.
50 | * `epochs`: Number of epochs. Not needed if `noise_multiplier` is provided.
51 | * `noise_multiplier`: Level of independent Gaussian noise injected into the gradient. This can be computed automatically under different `accounting_mode` settings if `target_epsilon, batch_size, sample_size, epochs` are provided.
52 | * `accounting_mode`: Privacy accounting theory to use, one of "rdp" (default), "glw", "all".
53 | * `named_params`: Specifies which parameters to optimize with differential privacy.
54 | * `clipping_mode`: Per-sample gradient clipping mode, one of 'ghost', 'MixGhostClip', 'MixOpt' (default) from [4]. Note that different clipping modes, including Opacus [5], GhostClip [2] and Mixed GhostClip [3], give the same convergence and accuracy, though at significantly different time/space complexity.
55 | * `clipping_fn`: Per-sample gradient clipping function to use; one of "automatic" (default, [Bu et al., 2022](https://arxiv.org/pdf/2206.07136.pdf)), "Abadi" [(Abadi et al., 2016)](https://arxiv.org/pdf/1607.00133.pdf), "global" [(Bu et al., 2021)](https://arxiv.org/pdf/2106.07830.pdf).
56 | * `clipping_style`: Per-sample gradient clipping style to use; one of `all-layer` (flat clipping), `layer-wise` (each layer is a block, including both weight and bias parameters), `param-wise` (each parameter is a block), or a list of layer names (general block-wise clipping).
57 | * `origin_params`: Origin parameters for the ghost differentiation trick from [Bu et al. Appendix D.3](https://arxiv.org/pdf/2210.00038.pdf). Default is `None` (not using the trick). To enjoy the acceleration from the trick, set to each model's first trainable layer's parameters. For example, in text classification with RoBERTa, set `origin_params=["_embeddings"]`; in text generation with GPT2, set `origin_params=["wte","wpe"]`; in image classification with BEiT, set `origin_params=["patch_embed.proj.bias"]`. This trick gives about a 8/6 ≈ 1.33× speedup at no memory overhead.
58 |
59 | ### Usage
60 | Our privacy engine uses PyTorch [forward and backward hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html) to clip per-sample gradients and to add noise. To privately train models, attach the privacy engine to any optimizer from [torch.optim](https://pytorch.org/docs/stable/optim.html); during backward propagation the engine accumulates the sum of clipped per-sample gradients into `.grad`, and it additionally injects noise at `step`.
61 |
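Putting the arguments together, a minimal sketch of attaching the engine (the model, optimizer, and numbers below are illustrative, not prescribed by the library):

```python
import torch
from fastDP import PrivacyEngine

model = torch.nn.Linear(784, 10)  # toy model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

privacy_engine = PrivacyEngine(
    model,
    batch_size=1024,     # logical batch size
    sample_size=50000,   # number of training samples
    epochs=3,
    target_epsilon=8,
    clipping_fn='automatic',
    clipping_style='all-layer',
)
privacy_engine.attach(optimizer)
# From here on, loss.backward() accumulates clipped per-sample gradients
# and optimizer.step() adds the calibrated Gaussian noise.
```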
62 | To conduct DP bias-term fine-tuning (DP-BiTFiT [6]), simply freeze all non-bias terms:
63 | ```python
64 | [param.requires_grad_(False) for name, param in model.named_parameters() if '.bias' not in name]
65 | ```
66 | Note that for two-phase DP training (e.g. the appendix of [6], or DP continual training), one needs to detach the first engine and attach a new engine to a new optimizer.
67 |
68 | ### References
69 | [1] Goodfellow, Ian. "Efficient per-example gradient computations." arXiv preprint arXiv:1510.01799 (2015).
70 |
71 | [2] Li, Xuechen, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. "Large language models can be strong differentially private learners." arXiv preprint arXiv:2110.05679 (2021).
72 |
73 | [3] Bu, Zhiqi, Jialin Mao, and Shiyun Xu. "Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy." arXiv preprint arXiv:2205.10683 (2022).
74 |
75 | [4] Bu, Zhiqi, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Optimization on Large Model at Small Cost." arXiv preprint arXiv:2210.00038 (2022).
76 |
77 | [5] Yousefpour, Ashkan, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen et al. "Opacus: User-friendly differential privacy library in PyTorch." arXiv preprint arXiv:2109.12298 (2021).
78 |
79 | [6] Bu, Zhiqi, Yu-Xiang Wang, Sheng Zha, and George Karypis. "Differentially Private Bias-Term only Fine-tuning of Foundation Models." arXiv preprint arXiv:2210.00036 (2022).
80 |
--------------------------------------------------------------------------------
/fastDP/__init__.py:
--------------------------------------------------------------------------------
1 | from .
import lora_utils 2 | from .privacy_engine import PrivacyEngine 3 | from .privacy_engine_dist_stage23 import PrivacyEngine_Distributed_Stage_2_and_3 4 | from .privacy_engine_dist_extending import PrivacyEngine_Distributed_extending 5 | from .supported_differentially_private_layers import * 6 | __version__ = '2.0.0' 7 | -------------------------------------------------------------------------------- /fastDP/accounting/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/fast-differential-privacy/3d5cc561aa337c72f79873ccc4fe8b900b5493b5/fastDP/accounting/__init__.py -------------------------------------------------------------------------------- /fastDP/accounting/accounting_manager.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import math 3 | from typing import Dict, Optional, Union 4 | 5 | from . import rdp_accounting 6 | 7 | DEFAULT_ALPHAS = tuple(1 + x / 10.0 for x in range(1, 100)) + tuple(range(12, 64)) # RDP. 8 | 9 | 10 | class AccountingManager(abc.ABC): 11 | def _get_sigma_with_target_epsilon( 12 | self, 13 | target_epsilon, 14 | target_delta, 15 | sample_rate, 16 | steps, 17 | threshold, 18 | sigma_hi_init, 19 | sigma_lo_init, 20 | ): 21 | """Binary search σ given ε and δ.""" 22 | if sigma_lo_init > sigma_hi_init: 23 | raise ValueError("`sigma_lo` should be smaller than `sigma_hi`.") 24 | 25 | # Find an appropriate region for binary search. 26 | sigma_hi = sigma_hi_init 27 | sigma_lo = sigma_lo_init 28 | 29 | # Ensure sigma_hi isn't too small. 30 | while True: 31 | eps = self._compute_epsilon_from_sigma(sigma_hi, sample_rate, target_delta, steps) 32 | if eps < target_epsilon: 33 | break 34 | sigma_hi *= 2 35 | 36 | # Ensure sigma_lo isn't too large. 37 | while True: 38 | eps = self._compute_epsilon_from_sigma(sigma_lo, sample_rate, target_delta, steps) 39 | if eps > target_epsilon: 40 | break 41 | sigma_lo /= 2 42 | 43 | # Binary search. 44 | while sigma_hi - sigma_lo > threshold: 45 | sigma = (sigma_hi + sigma_lo) / 2 46 | eps = self._compute_epsilon_from_sigma(sigma, sample_rate, target_delta, steps) 47 | if eps < target_epsilon: 48 | sigma_hi = sigma 49 | else: 50 | sigma_lo = sigma 51 | 52 | # Conservative estimate. 
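        # (epsilon decreases as sigma grows, so returning the upper end of the
        # bracket guarantees the realized epsilon is at most `target_epsilon`,
        # at the cost of at most `threshold` extra noise.)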
53 | return sigma_hi 54 | 55 | @abc.abstractmethod 56 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict: 57 | """Override for reporting results.""" 58 | raise NotImplementedError 59 | 60 | @abc.abstractmethod 61 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps) -> float: 62 | """Override for binary sigma search.""" 63 | raise NotImplementedError 64 | 65 | def compute_sigma( 66 | self, 67 | target_epsilon: float, 68 | target_delta: float, 69 | sample_rate: float, 70 | epochs: Optional[Union[float, int]] = None, 71 | steps=None, 72 | threshold=1e-3, 73 | sigma_hi_init=4, 74 | sigma_lo_init=0.1, 75 | ) -> float: 76 | if steps is None: 77 | if epochs is None: 78 | raise ValueError("Epochs and steps cannot both be None.") 79 | steps = math.ceil(epochs / sample_rate) 80 | return self._get_sigma_with_target_epsilon( 81 | target_epsilon=target_epsilon, 82 | target_delta=target_delta, 83 | sample_rate=sample_rate, 84 | steps=steps, 85 | threshold=threshold, 86 | sigma_hi_init=sigma_hi_init, 87 | sigma_lo_init=sigma_lo_init, 88 | ) 89 | 90 | 91 | class RDPManager(AccountingManager): 92 | def __init__(self, alphas): 93 | super(RDPManager, self).__init__() 94 | self._alphas = alphas 95 | 96 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps): 97 | return self.compute_epsilon(sigma, sample_rate, target_delta, steps)["eps_rdp"] 98 | 99 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict: 100 | """Compute RDP as usual, but convert to (ε, δ)-DP based on the result by Canonne, Kamath, Steinke.""" 101 | rdp = rdp_accounting.compute_rdp(q=sample_rate, noise_multiplier=sigma, steps=steps, orders=self._alphas) 102 | eps, alpha = rdp_accounting.get_privacy_spent(orders=self._alphas, rdp=rdp, delta=target_delta) 103 | return dict(eps_rdp=eps, alpha_rdp=alpha) 104 | 105 | 106 | class GLWManager(AccountingManager): 107 | def __init__(self, eps_error=0.05): 108 | super(GLWManager, self).__init__() 109 | self._eps_error = eps_error 110 | 111 | def _compute_epsilon_from_sigma(self, sigma, sample_rate, target_delta, steps): 112 | return self.compute_epsilon(sigma, sample_rate, target_delta, steps)["eps_upper"] # Be conservative. 113 | 114 | def compute_epsilon(self, sigma, sample_rate, target_delta, steps) -> Dict: 115 | if steps == 0: 116 | return dict(eps_low=None, eps_estimate=None, eps_upper=None) 117 | 118 | from prv_accountant import Accountant 119 | accountant = Accountant( 120 | noise_multiplier=sigma, 121 | sampling_probability=sample_rate, 122 | delta=target_delta, 123 | eps_error=self._eps_error, 124 | max_compositions=steps 125 | ) 126 | eps_low, eps_estimate, eps_upper = accountant.compute_epsilon(num_compositions=steps) 127 | return dict(eps_low=eps_low, eps_estimate=eps_estimate, eps_upper=eps_upper) 128 | -------------------------------------------------------------------------------- /fastDP/accounting/rdp_accounting.py: -------------------------------------------------------------------------------- 1 | r""" 2 | This file is adapted from the privacy accounting procedure in Opacus', which in turn is adapted from tf-privacy. 3 | Below is the original documentation in Opacus. 4 | 5 | *Based on Google's TF Privacy:* https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/analysis 6 | /rdp_accountant.py. 
7 | *Here, we update this code to Python 3, and optimize dependencies.* 8 | 9 | Functionality for computing Renyi Differential Privacy (RDP) of an additive 10 | Sampled Gaussian Mechanism (SGM). 11 | 12 | Example: 13 | Suppose that we have run an SGM applied to a function with L2-sensitivity of 1. 14 | 15 | Its parameters are given as a list of tuples 16 | ``[(q_1, sigma_1, steps_1), ..., (q_k, sigma_k, steps_k)],`` 17 | and we wish to compute epsilon for a given target delta. 18 | 19 | The example code would be: 20 | 21 | >>> max_order = 32 22 | >>> orders = range(2, max_order + 1) 23 | >>> rdp = np.zeros_like(orders, dtype=float) 24 | >>> for q, sigma, steps in parameters: 25 | >>> rdp += privacy_analysis.compute_rdp(q, sigma, steps, orders) 26 | >>> epsilon, opt_order = privacy_analysis.get_privacy_spent(orders, rdp, delta) 27 | """ 28 | 29 | import math 30 | from typing import List, Sequence, Union 31 | 32 | import numpy as np 33 | from scipy import special 34 | 35 | 36 | ######################## 37 | # LOG-SPACE ARITHMETIC # 38 | ######################## 39 | 40 | 41 | def _log_add(logx: float, logy: float) -> float: 42 | r"""Adds two numbers in the log space. 43 | 44 | Args: 45 | logx: First term in log space. 46 | logy: Second term in log space. 47 | 48 | Returns: 49 | Sum of numbers in log space. 50 | """ 51 | a, b = min(logx, logy), max(logx, logy) 52 | if a == -np.inf: # adding 0 53 | return b 54 | # Use exp(a) + exp(b) = (exp(a - b) + 1) * exp(b) 55 | return math.log1p(math.exp(a - b)) + b # log1p(x) = log(x + 1) 56 | 57 | 58 | def _log_sub(logx: float, logy: float) -> float: 59 | r"""Subtracts two numbers in the log space. 60 | 61 | Args: 62 | logx: First term in log space. Expected to be greater than the second term. 63 | logy: First term in log space. Expected to be less than the first term. 64 | 65 | Returns: 66 | Difference of numbers in log space. 67 | 68 | Raises: 69 | ValueError 70 | If the result is negative. 71 | """ 72 | if logx < logy: 73 | raise ValueError("The result of subtraction must be non-negative.") 74 | if logy == -np.inf: # subtracting 0 75 | return logx 76 | if logx == logy: 77 | return -np.inf # 0 is represented as -np.inf in the log space. 78 | 79 | try: 80 | # Use exp(x) - exp(y) = (exp(x - y) - 1) * exp(y). 81 | return math.log(math.expm1(logx - logy)) + logy # expm1(x) = exp(x) - 1 82 | except OverflowError: 83 | return logx 84 | 85 | 86 | def _compute_log_a_for_int_alpha(q: float, sigma: float, alpha: int) -> float: 87 | r"""Computes :math:`log(A_\alpha)` for integer ``alpha``. 88 | 89 | Notes: 90 | Note that 91 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``, 92 | and that 0 < ``q`` < 1. 93 | 94 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf for details. 95 | 96 | Args: 97 | q: Sampling rate of SGM. 98 | sigma: The standard deviation of the additive Gaussian noise. 99 | alpha: The order at which RDP is computed. 100 | 101 | Returns: 102 | :math:`log(A_\alpha)` as defined in Section 3.3 of 103 | https://arxiv.org/pdf/1908.10530.pdf. 104 | """ 105 | 106 | # Initialize with 0 in the log space. 
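    # A_alpha = sum_{i=0}^{alpha} binom(alpha, i) * q^i * (1-q)^(alpha-i) * exp((i^2 - i) / (2 sigma^2));
    # each term is accumulated with _log_add in log space for numerical stability.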
107 | log_a = -np.inf 108 | 109 | for i in range(alpha + 1): 110 | log_coef_i = ( 111 | math.log(special.binom(alpha, i)) 112 | + i * math.log(q) 113 | + (alpha - i) * math.log(1 - q) 114 | ) 115 | 116 | s = log_coef_i + (i * i - i) / (2 * (sigma ** 2)) 117 | log_a = _log_add(log_a, s) 118 | 119 | return float(log_a) 120 | 121 | 122 | def _compute_log_a_for_frac_alpha(q: float, sigma: float, alpha: float) -> float: 123 | r"""Computes :math:`log(A_\alpha)` for fractional ``alpha``. 124 | 125 | Notes: 126 | Note that 127 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``, 128 | and that 0 < ``q`` < 1. 129 | 130 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf for details. 131 | 132 | Args: 133 | q: Sampling rate of SGM. 134 | sigma: The standard deviation of the additive Gaussian noise. 135 | alpha: The order at which RDP is computed. 136 | 137 | Returns: 138 | :math:`log(A_\alpha)` as defined in Section 3.3 of 139 | https://arxiv.org/pdf/1908.10530.pdf. 140 | """ 141 | # The two parts of A_alpha, integrals over (-inf,z0] and [z0, +inf), are 142 | # initialized to 0 in the log space: 143 | log_a0, log_a1 = -np.inf, -np.inf 144 | i = 0 145 | 146 | z0 = sigma ** 2 * math.log(1 / q - 1) + 0.5 147 | 148 | while True: # do ... until loop 149 | coef = special.binom(alpha, i) 150 | log_coef = math.log(abs(coef)) 151 | j = alpha - i 152 | 153 | log_t0 = log_coef + i * math.log(q) + j * math.log(1 - q) 154 | log_t1 = log_coef + j * math.log(q) + i * math.log(1 - q) 155 | 156 | log_e0 = math.log(0.5) + _log_erfc((i - z0) / (math.sqrt(2) * sigma)) 157 | log_e1 = math.log(0.5) + _log_erfc((z0 - j) / (math.sqrt(2) * sigma)) 158 | 159 | log_s0 = log_t0 + (i * i - i) / (2 * (sigma ** 2)) + log_e0 160 | log_s1 = log_t1 + (j * j - j) / (2 * (sigma ** 2)) + log_e1 161 | 162 | if coef > 0: 163 | log_a0 = _log_add(log_a0, log_s0) 164 | log_a1 = _log_add(log_a1, log_s1) 165 | else: 166 | log_a0 = _log_sub(log_a0, log_s0) 167 | log_a1 = _log_sub(log_a1, log_s1) 168 | 169 | i += 1 170 | if max(log_s0, log_s1) < -30: 171 | break 172 | 173 | return _log_add(log_a0, log_a1) 174 | 175 | 176 | def _compute_log_a(q: float, sigma: float, alpha: float) -> float: 177 | r"""Computes :math:`log(A_\alpha)` for any positive finite ``alpha``. 178 | 179 | Notes: 180 | Note that 181 | :math:`A_\alpha` is real valued function of ``alpha`` and ``q``, 182 | and that 0 < ``q`` < 1. 183 | 184 | Refer to Section 3.3 of https://arxiv.org/pdf/1908.10530.pdf 185 | for details. 186 | 187 | Args: 188 | q: Sampling rate of SGM. 189 | sigma: The standard deviation of the additive Gaussian noise. 190 | alpha: The order at which RDP is computed. 191 | 192 | Returns: 193 | :math:`log(A_\alpha)` as defined in the paper mentioned above. 194 | """ 195 | if float(alpha).is_integer(): 196 | return _compute_log_a_for_int_alpha(q, sigma, int(alpha)) 197 | else: 198 | return _compute_log_a_for_frac_alpha(q, sigma, alpha) 199 | 200 | 201 | def _log_erfc(x: float) -> float: 202 | r"""Computes :math:`log(erfc(x))` with high accuracy for large ``x``. 203 | 204 | Helper function used in computation of :math:`log(A_\alpha)` 205 | for a fractional alpha. 206 | 207 | Args: 208 | x: The input to the function 209 | 210 | Returns: 211 | :math:`log(erfc(x))` 212 | """ 213 | return math.log(2) + special.log_ndtr(-x * 2 ** 0.5) 214 | 215 | 216 | def _compute_rdp(q: float, sigma: float, alpha: float) -> float: 217 | r"""Computes RDP of the Sampled Gaussian Mechanism at order ``alpha``. 218 | 219 | Args: 220 | q: Sampling rate of SGM. 
221 | sigma: The standard deviation of the additive Gaussian noise. 222 | alpha: The order at which RDP is computed. 223 | 224 | Returns: 225 | RDP at order ``alpha``; can be np.inf. 226 | """ 227 | if q == 0: 228 | return 0 229 | 230 | # no privacy 231 | if sigma == 0: 232 | return np.inf 233 | 234 | if q == 1.0: 235 | return alpha / (2 * sigma ** 2) 236 | 237 | if np.isinf(alpha): 238 | return np.inf 239 | 240 | return _compute_log_a(q, sigma, alpha) / (alpha - 1) 241 | 242 | 243 | def compute_rdp( 244 | q: float, noise_multiplier: float, steps: int, orders: Union[Sequence[float], float] 245 | ) -> Union[List[float], float]: 246 | r"""Computes Renyi Differential Privacy (RDP) guarantees of the 247 | Sampled Gaussian Mechanism (SGM) iterated ``steps`` times. 248 | 249 | Args: 250 | q: Sampling rate of SGM. 251 | noise_multiplier: The ratio of the standard deviation of the 252 | additive Gaussian noise to the L2-sensitivity of the function 253 | to which it is added. Note that this is same as the standard 254 | deviation of the additive Gaussian noise when the L2-sensitivity 255 | of the function is 1. 256 | steps: The number of iterations of the mechanism. 257 | orders: An array (or a scalar) of RDP orders. 258 | 259 | Returns: 260 | The RDP guarantees at all orders; can be ``np.inf``. 261 | """ 262 | if isinstance(orders, float): 263 | rdp = _compute_rdp(q, noise_multiplier, orders) 264 | else: 265 | rdp = np.array([_compute_rdp(q, noise_multiplier, order) for order in orders]) 266 | 267 | return rdp * steps 268 | 269 | 270 | # Based on 271 | # https://github.com/tensorflow/privacy/blob/5f07198b66b3617b22609db983926e3ba97cd905/tensorflow_privacy/privacy/analysis/rdp_accountant.py#L237 272 | def get_privacy_spent(orders, rdp, delta): 273 | """Compute epsilon given a list of RDP values and target delta. 274 | Args: 275 | orders: An array (or a scalar) of orders. 276 | rdp: A list (or a scalar) of RDP guarantees. 277 | delta: The target delta. 278 | Returns: 279 | Pair of (eps, optimal_order). 280 | Raises: 281 | ValueError: If input is malformed. 282 | """ 283 | orders_vec = np.atleast_1d(orders) 284 | rdp_vec = np.atleast_1d(rdp) 285 | 286 | if delta <= 0: 287 | raise ValueError("Privacy failure probability bound delta must be >0.") 288 | if len(orders_vec) != len(rdp_vec): 289 | raise ValueError("Input lists must have the same length.") 290 | 291 | # Basic bound (see https://arxiv.org/abs/1702.07476 Proposition 3 in v3): 292 | # eps = min( rdp_vec - math.log(delta) / (orders_vec - 1) ) 293 | 294 | # Improved bound from https://arxiv.org/abs/2004.00010 Proposition 12 (in v4). 295 | # Also appears in https://arxiv.org/abs/2001.05990 Equation 20 (in v1). 296 | eps_vec = [] 297 | for (a, r) in zip(orders_vec, rdp_vec): 298 | if a < 1: 299 | raise ValueError("Renyi divergence order must be >=1.") 300 | if r < 0: 301 | raise ValueError("Renyi divergence must be >=0.") 302 | 303 | if delta ** 2 + math.expm1(-r) >= 0: 304 | # In this case, we can simply bound via KL divergence: 305 | # delta <= sqrt(1-exp(-KL)). 306 | eps = 0 # No need to try further computation if we have eps = 0. 307 | elif a > 1.01: 308 | # This bound is not numerically stable as alpha->1. 309 | # Thus we have a min value of alpha. 310 | # The bound is also not useful for small alpha, so doesn't matter. 311 | eps = r + math.log1p(-1 / a) - math.log(delta * a) / (a - 1) 312 | else: 313 | # In this case we can't do anything. E.g., asking for delta = 0. 
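            # (The KL shortcut above did not apply, and the conversion formula is
            # unstable for alpha <= 1.01, so no finite epsilon is certified at this
            # order; np.inf simply drops out of the min over orders below.)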
314 |                 eps = np.inf
315 |         eps_vec.append(eps)
316 |
317 |     idx_opt = np.argmin(eps_vec)
318 |     return max(0, eps_vec[idx_opt]), orders_vec[idx_opt]
319 |
--------------------------------------------------------------------------------
/fastDP/autograd_grad_sample_dist.py:
--------------------------------------------------------------------------------
1 | """
2 | A large portion of this code is adapted from Opacus v0.15 (https://github.com/pytorch/opacus)
3 | and from Private-transformers v0.2.3 (https://github.com/lxuechen/private-transformers)
4 | which are licensed under Apache License 2.0.
5 |
6 | We have modified it considerably to support book-keeping and BiTFiT.
7 | """
8 |
9 | from typing import Tuple
10 |
11 | import torch
12 | import torch.nn as nn
13 |
14 | from .supported_layers_grad_samplers import _supported_layers_norm_sample_AND_clipping, _create_or_extend_private_grad
15 |
16 | def requires_grad(module: nn.Module) -> bool:
17 |     """
18 |     Checks if any parameters in a specified module require gradients.
19 |
20 |     Args:
21 |         module: PyTorch module whose parameters are examined
22 |
23 |     Returns:
24 |         Flag indicating whether any parameters require gradients
25 |     """
26 |     return any(p.requires_grad for p in module.parameters() if hasattr(p, 'requires_grad'))
27 |
28 |
29 | def add_hooks(model: nn.Module, loss_reduction='mean', clipping_mode='MixOpt', bias_only=False,
30 |               clipping_style='all-layer', block_heads=None, named_params=None, named_layers=None,
31 |               clipping_fn=None, numerical_stability_constant=None, max_grad_norm_layerwise=None):
32 |     r"""
33 |     Adds hooks to model to save activations (to layers) and backprop (to params) values.
34 |
35 |     The hooks will
36 |
37 |     1. save activations into ``layer.activations`` (NOT param.activations) during forward pass.
38 |     Note: BiTFiT is special in that, if a layer only requires the bias gradient, no forward hook is needed.
39 |
40 |     2. compute per-sample grad norm or grad and save in ``param.norm_sample`` or ``param.grad_sample`` during backward pass.
41 |
42 |     Args:
43 |         model: Model to which hooks are added.
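        loss_reduction: Either 'mean' or 'sum'; must match the reduction of the training loss.
        clipping_mode: Per-sample clipping mode, e.g. 'MixOpt' (default), 'ghost', or 'MixGhostClip'.
        clipping_style: One of 'all-layer', 'layer-wise', 'param-wise', or a list of block heads.
        clipping_fn: One of 'automatic', 'Abadi', or 'global'.
        max_grad_norm_layerwise: Clipping threshold applied to each block.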
44 | """ 45 | if hasattr(model, "autograd_grad_sample_hooks"): 46 | raise ValueError("Trying to add hooks twice to the same model") 47 | 48 | handles = [] 49 | 50 | for name, layer in model.named_modules(): 51 | if type(layer) in _supported_layers_norm_sample_AND_clipping and requires_grad(layer): 52 | if hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad: 53 | #print('Attaching forward hook on', name) 54 | handles.append(layer.register_forward_hook(_capture_activations)) 55 | 56 | def this_backward(this_layer, grad_input, grad_output): 57 | _prepare_sample_grad_or_norm(this_layer, grad_output, loss_reduction, clipping_mode,bias_only) 58 | _per_block_clip_grad(this_layer, named_params, named_layers, clipping_style, clipping_fn, numerical_stability_constant, max_grad_norm_layerwise) 59 | 60 | # Starting with 1.8.0, can use `register_full_backward_hook`, but slower 61 | handles.append(layer.register_backward_hook(this_backward)) 62 | 63 | model.__dict__.setdefault("autograd_grad_sample_hooks", []).extend(handles) 64 | 65 | 66 | def remove_hooks(model: nn.Module): 67 | """Removes hooks added by `add_hooks()`.""" 68 | for handle in model.autograd_grad_sample_hooks: 69 | handle.remove() 70 | del model.autograd_grad_sample_hooks 71 | 72 | 73 | def _capture_activations(layer: nn.Module, inputs: Tuple, outputs: Tuple): 74 | """Forward hook handler captures AND saves activations.""" 75 | layer.activations=inputs[0].detach() 76 | 77 | def _prepare_sample_grad_or_norm( 78 | layer: nn.Module, 79 | grad_output: Tuple[torch.Tensor], 80 | loss_reduction='mean', 81 | clipping_mode='MixOpt', 82 | bias_only=False, 83 | ): 84 | """Backward hook handler captures AND saves grad_outputs (book-keeping).""" 85 | backprops = grad_output[0].detach() 86 | 87 | """Computes per-sample grad norm or grad for individual layers.""" 88 | if not hasattr(layer,'activations'): 89 | layer.activations=None 90 | if loss_reduction=='mean': 91 | backprops = backprops * backprops.shape[0] # .backprops should save dL_i/ds, not 1/B*dL_i/ds, the mean reduction is taken care of in privacy engine .step() 92 | compute_layer_grad_sample, _ = _supported_layers_norm_sample_AND_clipping.get(type(layer)) 93 | 94 | compute_layer_grad_sample(layer, layer.activations, backprops, clipping_mode) 95 | 96 | layer.backprops=backprops 97 | 98 | 99 | def _per_block_clip_grad( 100 | layer: nn.Module, named_params, named_layers, clipping_style, clipping_fn, 101 | numerical_stability_constant,max_grad_norm_layerwise 102 | ): 103 | 104 | if clipping_style=='layer-wise': 105 | if hasattr(layer,'weight') and hasattr(layer.weight,'norm_sample'): 106 | norm_sample = layer.weight.norm_sample 107 | if hasattr(layer,'bias') and hasattr(layer.bias,'norm_sample'): 108 | norm_sample = torch.stack([layer.weight.norm_sample,layer.bias.norm_sample], dim=0).norm(2, dim=0); 109 | else: 110 | norm_sample = layer.bias.norm_sample 111 | #norm_sample = torch.stack([param.norm_sample for param in layer.parameters() if hasattr(param,'norm_sample')], dim=0).norm(2, dim=0); 112 | 113 | # compute per-sample grad norm and clipping factor 114 | if clipping_fn=='automatic': 115 | C = max_grad_norm_layerwise / (norm_sample + numerical_stability_constant)#torch.ones_like(norm_sample,dtype=layer.weight.dtype)#change to non-DP C=1 works under mixed precision 116 | elif clipping_fn=='Abadi': 117 | C = torch.clamp_max(max_grad_norm_layerwise / (norm_sample + numerical_stability_constant), 1.) 
118 | elif clipping_fn=='global': 119 | C = (norm_sample<=max_grad_norm_layerwise).float() 120 | else: 121 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.") 122 | 123 | if hasattr(layer,'weight') and hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad and hasattr(layer,'activations') and hasattr(layer.weight,'norm_sample'): 124 | #--- weight, compute clipped gradient 125 | _, compute_layer_grad = _supported_layers_norm_sample_AND_clipping.get(type(layer)) 126 | common_type=torch.promote_types(layer.activations.dtype,layer.backprops.dtype) 127 | grad_weight = compute_layer_grad(layer, layer.activations.to(common_type), torch.einsum('b...,b->b...',layer.backprops.to(common_type),C), C) 128 | del layer.activations, layer.backprops 129 | _create_or_extend_private_grad(layer.weight, grad_weight, accumulate_private_grad = False) 130 | 131 | if hasattr(layer,'bias') and hasattr(layer.bias,'requires_grad') and layer.bias.requires_grad and hasattr(layer.bias,'grad_sample') and hasattr(layer.bias,'norm_sample'): 132 | #--- bias, compute clipped gradient 133 | grad_bias = torch.einsum("b...,b->...", layer.bias.grad_sample, C)#(layer.bias.grad_sample*C.unsqueeze(1)).sum(dim=0)# 134 | del layer.bias.grad_sample 135 | _create_or_extend_private_grad(layer.bias, grad_bias, accumulate_private_grad = False) 136 | 137 | elif clipping_style=='param-wise': 138 | if hasattr(layer,'weight') and hasattr(layer.weight,'norm_sample'): 139 | if clipping_fn=='automatic': 140 | C_weight = max_grad_norm_layerwise / (layer.weight.norm_sample + numerical_stability_constant) 141 | elif clipping_fn=='Abadi': 142 | C_weight = torch.clamp_max(max_grad_norm_layerwise / (layer.weight.norm_sample + numerical_stability_constant), 1.) 143 | elif clipping_fn=='global': 144 | C_weight = (layer.weight.norm_sample<=max_grad_norm_layerwise).float() 145 | else: 146 | raise ValueError(f"Unknown clipping function {clipping_fn}. Expected one of Abadi, automatic, global.") 147 | 148 | if hasattr(layer,'bias') and hasattr(layer.bias,'norm_sample'): 149 | if clipping_fn=='automatic': 150 | C_bias = max_grad_norm_layerwise / (layer.bias.norm_sample + numerical_stability_constant) 151 | elif clipping_fn=='Abadi': 152 | C_bias = torch.clamp_max(max_grad_norm_layerwise / (layer.bias.norm_sample + numerical_stability_constant), 1.) 153 | elif clipping_fn=='global': 154 | C_bias = (layer.bias.norm_sample<=max_grad_norm_layerwise).float() 155 | else: 156 | raise ValueError(f"Unknown clipping function {clipping_fn}. 
Expected one of Abadi, automatic, global.") 157 | 158 | 159 | if hasattr(layer,'weight') and hasattr(layer.weight,'requires_grad') and layer.weight.requires_grad and hasattr(layer,'activations') and hasattr(layer.weight,'norm_sample'): 160 | _, compute_layer_grad = _supported_layers_norm_sample_AND_clipping.get(type(layer)) 161 | grad_weight = compute_layer_grad(layer, layer.activations, torch.einsum('b...,b->b...',layer.backprops,C_weight), C_weight) 162 | del layer.activations, layer.backprops 163 | 164 | _create_or_extend_private_grad(layer.weight, grad_weight, accumulate_private_grad = False) 165 | 166 | 167 | #--- bias, compute clipped gradient 168 | if hasattr(layer,'bias') and hasattr(layer.bias,'requires_grad') and layer.bias.requires_grad and hasattr(layer.bias,'grad_sample') and hasattr(layer.bias,'norm_sample'): 169 | grad_bias = torch.einsum("b...,b->...", layer.bias.grad_sample, C_bias) 170 | del layer.bias.grad_sample 171 | _create_or_extend_private_grad(layer.bias, grad_bias, accumulate_private_grad = False) 172 | else: 173 | raise ValueError(f"Unknown clipping style {clipping_style}. Expected one of 'layer-wise','param-wise'.") 174 | 175 | 176 | for param in layer.parameters(): 177 | if hasattr(param,'norm_sample'): 178 | del param.norm_sample 179 | -------------------------------------------------------------------------------- /fastDP/lora_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | LoRA layers. 3 | 4 | This version does not have merged weights for zero latency inference. It makes the code easier to read and maintain. 5 | Adapted from 6 | https://github.com/microsoft/LoRA 7 | https://www.microsoft.com/en-us/research/project/dp-transformers/ 8 | """ 9 | 10 | import torch 11 | import transformers 12 | from torch import nn 13 | 14 | 15 | class DPMergedLinear(nn.Module): 16 | def __init__( 17 | self, 18 | in_features: int, 19 | out_features: int, 20 | lora_r=0, 21 | lora_alpha=1., 22 | lora_dropout=0., 23 | ): 24 | super(DPMergedLinear, self).__init__() 25 | self.linear = nn.Linear(in_features=in_features, out_features=out_features) 26 | self.lora_r = lora_r 27 | self.lora_alpha = lora_alpha 28 | self.lora_dropout = nn.Dropout(p=lora_dropout) 29 | if self.lora_r > 0: 30 | self.lora_A = nn.Linear(in_features=in_features, out_features=lora_r, bias=False) 31 | self.lora_B = nn.Linear(in_features=lora_r, out_features=out_features, bias=False) 32 | self.scaling = self.lora_alpha / lora_r 33 | self.reset_parameters() 34 | 35 | def forward(self, x: torch.Tensor): 36 | result = self.linear(x) 37 | if self.lora_r > 0: 38 | after_dropout = self.lora_dropout(x) 39 | after_A = self.lora_A(after_dropout) 40 | after_B = self.lora_B(after_A) 41 | result += after_B * self.scaling 42 | return result 43 | 44 | def reset_parameters(self): 45 | self.linear.reset_parameters() 46 | if self.lora_r > 0: 47 | self.lora_A.reset_parameters() 48 | self.lora_B.weight.data.zero_() 49 | 50 | @staticmethod 51 | def from_transformers_conv1d( 52 | original_layer, 53 | lora_r=0, 54 | lora_alpha=1., 55 | lora_dropout=0., 56 | ) -> "DPMergedLinear": 57 | lora_layer = DPMergedLinear( 58 | in_features=original_layer.weight.shape[0], 59 | out_features=original_layer.weight.shape[1], 60 | lora_r=lora_r, 61 | lora_alpha=lora_alpha, 62 | lora_dropout=lora_dropout, 63 | ).to(original_layer.weight.device) 64 | lora_layer.linear.weight.data.copy_(original_layer.weight.T.data) 65 | lora_layer.linear.bias.data.copy_(original_layer.bias.data) 66 | return lora_layer 
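# Note: transformers' Conv1D (used in GPT-2 attention) stores its weight as
# (in_features, out_features), the transpose of nn.Linear's (out_features, in_features);
# hence the `original_layer.weight.T` copy in `from_transformers_conv1d` above.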
67 |
68 |
69 | def convert_gpt2_attention_to_lora(
70 |     model: transformers.GPT2PreTrainedModel,
71 |     lora_r=0,
72 |     lora_alpha=1.,
73 |     lora_dropout=0.,
74 | ) -> transformers.GPT2PreTrainedModel:
75 |     if not isinstance(model, transformers.GPT2PreTrainedModel):
76 |         raise TypeError("Requires a GPT2 model")
77 |
78 |     if not hasattr(model, "h") and hasattr(model, "transformer"):
79 |         transformer = model.transformer
80 |     else:
81 |         transformer = model
82 |
83 |     for h_i in transformer.h:
84 |         new_layer = DPMergedLinear.from_transformers_conv1d(
85 |             original_layer=h_i.attn.c_attn,
86 |             lora_r=lora_r,
87 |             lora_alpha=lora_alpha,
88 |             lora_dropout=lora_dropout,
89 |         )
90 |         h_i.attn.c_attn = new_layer
91 |
92 |     return model
93 |
94 |
95 | def mark_only_lora_as_trainable(model: torch.nn.Module) -> None:
96 |     model.requires_grad_(True)
97 |     for n, p in model.named_parameters():
98 |         if 'lora_' not in n:
99 |             p.requires_grad = False
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch~=1.11.0+cu113
2 | prv-accountant
3 | transformers>=4.20.1
4 | numpy
5 | scipy
6 | jupyterlab
7 | jupyter
8 | opacus>=1.0
9 | ml-swissknife
10 | opt_einsum
11 | pytest
12 | pydantic==1.10
13 | tqdm>=4.62.1
14 | deepspeed~=0.8.3
15 | fairscale==0.4
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 | import setuptools
5 |
6 | # For simplicity, we store the version in the __version__ attribute of the source.
7 | here = os.path.realpath(os.path.dirname(__file__))
8 | print(here)
9 | with open(os.path.join(here, 'fastDP', '__init__.py')) as f:
10 |     meta_match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", f.read(), re.M)
11 | if meta_match:
12 |     version = meta_match.group(1)
13 | else:
14 |     raise RuntimeError("Unable to find __version__ string.")
15 |
16 | with open(os.path.join(here, 'README.md')) as f:
17 |     readme = f.read()
18 |
19 | setuptools.setup(
20 |     name="fastDP",
21 |     version=version,
22 |     author="Zhiqi Bu",
23 |     author_email="woodyx218@gmail.com",
24 |     description="Optimally efficient implementation of differentially private optimization (with per-sample gradient clipping).",
25 |     long_description=readme,
26 |     url="",
27 |     packages=setuptools.find_packages(exclude=['examples', 'tests']),
28 |     python_requires='~=3.8',
29 |     classifiers=[
30 |         "Programming Language :: Python :: 3",
31 |         "License :: OSI Approved :: Apache Software License",
32 |     ],
33 | )
--------------------------------------------------------------------------------