├── requirements.txt
├── README.md
└── Dissecting-FT-Unlearning
    ├── Restoration_experiments_olmo-30tokens.ipynb
    └── Restoration_experiments_llama-30tokens.ipynb

/requirements.txt:
--------------------------------------------------------------------------------
ai2-olmo==0.2.5
aiohttp==3.9.3
aiosignal==1.3.1
anyio==4.3.0
better-abc==0.0.3
bitsandbytes==0.43.1
datasets==2.18.0
evaluate==0.4.2
huggingface-hub==0.21.4
hydra-core==1.3.2
jupyter
matplotlib==3.7.5
nltk==3.8.1
numpy==1.24.1
openai==1.25.0
pandas==2.0.3
peft==0.10.0
pillow==10.2.0
PyYAML==6.0.1
rouge==1.0.1
rouge-score==0.1.2
scipy==1.10.1
sentencepiece==0.2.0
statistics==1.0.3.5
scikit-learn==1.5.2
tokenizers==0.15.2
tornado==6.4
tqdm==4.66.2
transformer-lens==1.15.0
transformers==4.38.2
urllib3==1.26.13

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Dissecting-FT-Unlearning (EMNLP 2024 Main - Short)

The code for the experiments in our paper **[Dissecting Fine-Tuning Unlearning in Large Language Models]**, accepted at EMNLP 2024 Main (short paper). This is a follow-up to the [ConceptVectors Benchmark](https://github.com/yihuaihong/ConceptVectors).

* **Arxiv:** https://arxiv.org/abs/2410.06606


## Quick Links
- [Dissecting FT Unlearning](#Dissecting-FT-Unlearning)
  - [Quick Links](#quick-links)
  - [Overview](#overview)
  - [1. Requirements](#1-requirements)
  - [2. Download Unlearned models](#2-Download-Unlearned-models)
  - [3. Patching or Restoring Experiments](#3-Patching-or-Restoring-Experiments)
  - [How to Cite](#how-to-cite)

## Overview
This repository lets you reproduce the experiments in our paper.

> **Abstract**
> Fine-tuning-based unlearning methods prevail for removing targeted harmful, sensitive, or copyrighted information from large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this paper, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters. Interestingly, the MLP components in the model's final layer are the primary contributors to these apparently positive unlearning effects, responsible for controlling the model's behavior. Our work advocates the development of more resilient unlearning techniques for truly erasing knowledge. The code is released at https://github.com/yihuaihong/Dissecting-FT-Unlearning


## 1. Requirements
To install the required packages, please run the following commands.
```sh
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

## 2. Download Unlearned models

For convenience, we directly release the models obtained after unlearning training (via DPO or Gradient Difference) on both LLaMA2-7B-chat and OLMo-7B. The Hugging Face repository names follow the patterns below:

```python
for method in ['dpo', 'grad_diff']:
    for concept_id in [16, 18, 21, 26, 27, 38, 42, 47, 49, 54]:
        download_link = f"https://huggingface.co/YihuaiHong/llama2-7b_{method}_unlearn_on_id_{concept_id}_concept"
```

or

```python
for method in ['dpo', 'grad_diff']:
    for concept_id in [4, 37, 40, 44, 59, 77, 90, 105, 141, 147]:
        download_link = f"https://huggingface.co/YihuaiHong/olmo-7b_{method}_unlearn_on_id_{concept_id}_concept"
```
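As a convenience, the sketch below shows one way to fetch a checkpoint with `huggingface_hub` (already pinned in `requirements.txt`). The `local_dir` layout is an assumption chosen to mirror the `<base>/<method>/Full/<id>` paths the notebooks load from; adjust it to wherever your `model_dir` points.

```python
# Minimal download sketch, assuming the repo-name pattern above.
# The local_dir layout ("llama2-7b/<method>/Full/<id>") is an assumption that
# matches the model_name paths used inside the notebooks, not a requirement.
from huggingface_hub import snapshot_download

method, concept_id = "dpo", 16  # e.g., one LLaMA2-7B-chat checkpoint
snapshot_download(
    repo_id=f"YihuaiHong/llama2-7b_{method}_unlearn_on_id_{concept_id}_concept",
    local_dir=f"unlearn_results/llama2-7b/{method}/Full/{concept_id}",
)
```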
## 3. Patching or Restoring Experiments

Please run `Restoration_experiments_llama-30tokens.ipynb` or `Restoration_experiments_olmo-30tokens.ipynb`.
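Each notebook expects the vanilla model and the unlearned checkpoints on local disk (see the `model_dir` variables at the top of each notebook) plus the per-concept QA data under `data/`. For every sliding 5-layer window, the notebooks patch MLP coefficients, patch attention outputs, or restore MLP value-vector weights from the vanilla model into the unlearned model, and score recovery with a Knowledge Recovery Score. A rough, self-contained reading of the notebooks' `compute_loss` / `compute_KRS` helpers:

```python
import torch

def mse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.mean((a - b) ** 2)  # same loss the notebooks use

# Knowledge Recovery Score as implemented by the notebooks' compute_KRS helper:
# 1.0 = patching fully recovers the vanilla logits, 0.0 = no recovery at all.
def krs(original_logits, patched_logits, unlearned_logits):
    return 1 - mse(original_logits, patched_logits) / mse(original_logits, unlearned_logits)
```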
## How to Cite
```
@misc{hong2024dissectingfinetuningunlearninglarge,
      title={Dissecting Fine-Tuning Unlearning in Large Language Models},
      author={Yihuai Hong and Yuelin Zou and Lijie Hu and Ziqian Zeng and Di Wang and Haiqin Yang},
      year={2024},
      eprint={2410.06606},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06606},
}
```
--------------------------------------------------------------------------------
/Dissecting-FT-Unlearning/Restoration_experiments_olmo-30tokens.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "import torch\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    gpu_count = torch.cuda.device_count()\n",
    "    print(f\"Found {gpu_count} GPU(s) available.\")\n",
    "\n",
    "    for i in range(gpu_count):\n",
    "        gpu_name = torch.cuda.get_device_name(i)\n",
    "        print(f\"GPU {i + 1}: {gpu_name}\")\n",
    "else:\n",
    "    print(\"No GPU available.\")\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "import copy\n",
    "from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction\n",
    "from rouge import Rouge\n",
    "# from bert_score import score\n",
    "import statistics\n",
    "from ast import literal_eval\n",
    "import functools\n",
    "import json\n",
    "import os\n",
    "import random\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import torch.nn.functional as F\n",
    "\n",
    "random.seed(8888)\n",
    "torch.manual_seed(8888)\n",
    "np.random.seed(8888)\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    torch.cuda.manual_seed(8888)\n",
    "    torch.cuda.manual_seed_all(8888)\n",
    "\n",
    "\n",
    "from tqdm import tqdm\n",
    "\n",
    "torch.set_grad_enabled(False)\n",
    "tqdm.pandas()"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "torch.cuda.set_device(0)\n",
    "\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/transformers\"\n",
    "original_model_name = \"OLMo-7B\"\n",
    "\n",
    "\n",
    "original_model = AutoModelForCausalLM.from_pretrained(\n",
    "    join(model_dir, original_model_name),\n",
    "    torch_dtype=torch.bfloat16,\n",
    "    trust_remote_code=True\n",
    ")\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(join(model_dir, original_model_name))\n",
    "tokenizer.pad_token = tokenizer.eos_token\n",
    "tokenizer.padding_side = \"left\"\n",
    "\n",
    "original_model.to('cuda');\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "original_model"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "import torch.nn.functional as F\n",
    "\n",
    "def set_act_get_hooks(model, mlp_coef=False, attn=False):\n",
    "    \"\"\"\n",
    "    Works on OLMo; captures MLP coefficients or attention output values.\n",
    "    \"\"\"\n",
    "\n",
    "    for attr in [\"activations_\"]:\n",
    "        if not hasattr(model, attr):\n",
    "            setattr(model, attr, {})\n",
    "\n",
    "    def get_activation(name):\n",
    "        def hook(module, input, output):\n",
    "            if \"m_coef\" in name:\n",
    "                # the input of ff_out is the vector of MLP coefficients\n",
    "                model.activations_[name] = input[0][:, :].detach()\n",
    "            elif \"attn\" in name:\n",
    "                model.activations_[name] = output[:, :].detach()\n",
    "\n",
    "        return hook\n",
    "\n",
    "    hooks = []\n",
    "    for i in range(32):\n",
    "        if mlp_coef is True: # MLP coefficients\n",
    "            hooks.append(model.model.transformer.blocks[i].ff_out.register_forward_hook(get_activation(\"m_coef_\" + str(i))))\n",
    "        if attn is True:\n",
    "            hooks.append(model.model.transformer.blocks[i].attn_out.register_forward_hook(get_activation(\"attn_\" + str(i))))\n",
    "\n",
    "    return hooks\n",
    "\n",
    "def set_act_modify_hooks(model, original_model, mlp_coef, attn, ori_mlp_coef, ori_attn, un_mlp_coef, un_attn, layers=None):\n",
    "    \"\"\"\n",
    "    Works on OLMo; overwrites MLP coefficients or attention output values.\n",
    "    \"\"\"\n",
    "\n",
    "    def modify_activation(name, update_value, patch_input):\n",
    "        def pre_hook(module, input):\n",
    "            if \"m_coef\" in name:\n",
    "                input[0][:, :, :] = update_value\n",
    "\n",
    "        def post_hook(module, input, output):\n",
    "            if \"attn\" in name:\n",
    "                output[:, :, :] = update_value\n",
    "\n",
    "        if patch_input:\n",
    "            return pre_hook\n",
    "        else:\n",
    "            return post_hook\n",
    "\n",
    "    hooks = []\n",
    "    if layers is not None:\n",
    "        for i in range(32):\n",
    "            if i in layers and mlp_coef is True: # MLP coefficients\n",
    "                hooks.append(model.model.transformer.blocks[i].ff_out.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = ori_mlp_coef[i], patch_input=True)))\n",
    "            # else:\n",
    "            #     hooks.append(model.model.layers[i].mlp.down_proj.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = un_mlp_coef[i], patch_input=True)))\n",
    "            if i in layers and attn is True:\n",
    "                hooks.append(model.model.transformer.blocks[i].attn_out.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = ori_attn[i], patch_input=False)))\n",
    "            # else:\n",
    "            #     hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = un_attn[i], patch_input=False)))\n",
    "\n",
    "    return hooks\n",
    "\n",
    "def remove_hooks(hooks):\n",
    "    for hook in hooks:\n",
    "        hook.remove()\n",
    "\n",
    "\n",
    "def compute_loss(target_logits, logits):\n",
    "    # MSE between the vanilla model's logits and the current model's logits\n",
    "    return torch.mean((target_logits - logits) ** 2)\n",
    "\n",
    "\n",
    "def compute_KRS(ori_loss, new_loss):\n",
    "    # Knowledge Recovery Score: 1 = patching fully recovers the vanilla logits,\n",
    "    # 0 = no improvement over the unlearned model\n",
    "    return 1 - (new_loss / ori_loss)\n"
   ],
   "outputs": [],
   "execution_count": null
  },
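  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# [Added example] Quick sanity check of the KRS metric on toy tensors.\n",
    "# A minimal sketch that only assumes the compute_loss / compute_KRS helpers\n",
    "# defined above; the shapes imitate (batch, tokens, vocab) logits.\n",
    "toy_original = torch.randn(4, 30, 100)\n",
    "toy_unlearned = toy_original + 1.0 * torch.randn_like(toy_original)  # large gap\n",
    "toy_patched = toy_original + 0.1 * torch.randn_like(toy_original)    # small gap\n",
    "\n",
    "toy_ori_loss = compute_loss(toy_original, toy_unlearned)  # unlearned-vs-vanilla gap\n",
    "toy_new_loss = compute_loss(toy_original, toy_patched)    # gap after patching\n",
    "print(f\"toy KRS = {compute_KRS(toy_ori_loss, toy_new_loss):.3f} (close to 1 => knowledge restored)\")\n"
   ],
   "outputs": [],
   "execution_count": null
  },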
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# loading data:\n",
    "\n",
    "with open(\"data/olmo-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n",
    "    data = json.load(file)\n",
    "    \n",
    "for ix, item in enumerate(data):\n",
    "    print(ix, \" \", item['Concept'])\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Evaluate the original model and record its greedy next-30-token answers\n",
    "\n",
    "for index, x in enumerate(tqdm(data)):\n",
    "    \n",
    "    questions = []\n",
    "    n_new_tokens = 30\n",
    "    for idx, question in enumerate(x['QA']):\n",
    "        questions.append(f\"Question: {question}\\n Answer:\")\n",
    "\n",
    "    inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "    input_length = inputs.input_ids.size(1)\n",
    "    with torch.no_grad():\n",
    "        generation_output = original_model.generate(\n",
    "            **inputs,\n",
    "            do_sample=False,\n",
    "            max_new_tokens=n_new_tokens,\n",
    "        )\n",
    "    \n",
    "    eos_token_id = tokenizer.eos_token_id\n",
    "    normal_tokens_num_list = []\n",
    "    for i, output in enumerate(generation_output):\n",
    "        eos_count = (output == eos_token_id).sum().item()\n",
    "        total_tokens = output.size(0)\n",
    "        # number of real (non-EOS) answer tokens generated after the prompt\n",
    "        normal_tokens_count = total_tokens - eos_count - inputs.input_ids.size(1)\n",
    "        normal_tokens_num_list.append(normal_tokens_count)\n",
    "    \n",
    "    outputs = tokenizer.batch_decode(generation_output[:, :], skip_special_tokens=True)\n",
    "    \n",
    "    data[index]['QA_with_answers'] = outputs\n",
    "    data[index]['answers_tokens_num'] = normal_tokens_num_list\n",
    "\n",
    "with open(\"data/olmo-7b_concepts_test_30tokens_answers.json\", \"w\", encoding=\"utf-8\") as file:\n",
    "    json.dump(data, file, ensure_ascii=False, indent=4)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "with open(\"data/olmo-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n",
    "    data = json.load(file)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Collect activations and baseline losses on the vanilla and unlearned models\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "methods_ori_loss_list = []\n",
    "methods_ori_attn_list = []\n",
    "methods_ori_mlp_coef_list = []\n",
    "methods_un_mlp_coef_list = []\n",
    "methods_un_attn_list = []\n",
    "\n",
    "for method in ['grad_diff', 'dpo']:\n",
    "    ori_loss_list = []\n",
    "    ori_attn_list = []\n",
    "    ori_mlp_coef_list = []\n",
    "    un_mlp_coef_list = []\n",
    "    un_attn_list = []\n",
    "    \n",
    "    for x in tqdm(data):\n",
    "        # Load the unlearned model for this concept\n",
    "\n",
    "        model_name = f\"olmo-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda')\n",
    "        \n",
    "        questions = x['QA_with_answers']\n",
    "\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
    "\n",
    "        un_hooks = set_act_get_hooks(model, mlp_coef=True, attn=True)\n",
    "        with torch.no_grad():\n",
    "            output = model(**inputs)\n",
    "\n",
    "        # remove hooks\n",
    "        remove_hooks(un_hooks)\n",
    "\n",
    "        hooks = set_act_get_hooks(original_model, mlp_coef=True, attn=True)\n",
    "        with torch.no_grad():\n",
    "            original_output = original_model(**inputs)\n",
    "        # remove hooks\n",
    "        remove_hooks(hooks)\n",
    "        \n",
    "        loss_list = []\n",
    "        for i, length in enumerate(x['answers_tokens_num']):\n",
    "            loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n",
    "            loss_list.append(loss)\n",
    "        \n",
    "        un_mlp_coef = []\n",
    "        un_attn = []\n",
    "        ori_mlp_coef = []\n",
    "        ori_attn = []\n",
    "        for layer in range(32):\n",
    "            un_mlp_coef.append(model.activations_[f'm_coef_{layer}'])\n",
    "            un_attn.append(model.activations_[f'attn_{layer}'])\n",
    "            ori_mlp_coef.append(original_model.activations_[f'm_coef_{layer}'])\n",
    "            ori_attn.append(original_model.activations_[f'attn_{layer}'])\n",
    "\n",
    "\n",
    "        ori_loss_list.append(loss_list)\n",
    "\n",
    "        un_mlp_coef_list.append(un_mlp_coef) # on all data\n",
    "        un_attn_list.append(un_attn) # on all data\n",
    "\n",
    "        ori_mlp_coef_list.append(ori_mlp_coef) # on all data\n",
    "        ori_attn_list.append(ori_attn) # on all data\n",
    "\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "    methods_ori_loss_list.append(ori_loss_list)\n",
    "    methods_ori_attn_list.append(ori_attn_list)\n",
    "    methods_ori_mlp_coef_list.append(ori_mlp_coef_list)\n",
    "    methods_un_mlp_coef_list.append(un_mlp_coef_list)\n",
    "    methods_un_attn_list.append(un_attn_list)\n",
    "    
\n", 385 | "\n" 386 | ], 387 | "outputs": [], 388 | "execution_count": null 389 | }, 390 | { 391 | "cell_type": "code", 392 | "metadata": { 393 | "tags": [] 394 | }, 395 | "source": [ 396 | "#patching on MLP coefficients on unlearned models\n", 397 | "torch.cuda.set_device(0)\n", 398 | "\n", 399 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 400 | "import json\n", 401 | "from os.path import join\n", 402 | "\n", 403 | "\n", 404 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 405 | "\n", 406 | "\n", 407 | "methods_mlp_coef_patch_loss_lists = []\n", 408 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 409 | " mlp_coef_patch_loss_lists = []\n", 410 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 411 | " ori_attn_list = methods_ori_attn_list[j]\n", 412 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 413 | " un_attn_list = methods_un_attn_list[j]\n", 414 | " \n", 415 | " for index, x in enumerate(tqdm(data)):\n", 416 | "\n", 417 | " # Loading new unlearned model for certain concept \n", 418 | "\n", 419 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\" \n", 420 | "\n", 421 | " model = AutoModelForCausalLM.from_pretrained(\n", 422 | " join(model_dir, model_name),\n", 423 | " torch_dtype=torch.bfloat16,\n", 424 | " trust_remote_code=True\n", 425 | " );\n", 426 | "\n", 427 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 428 | " tokenizer.pad_token = tokenizer.eos_token\n", 429 | " tokenizer.padding_side = \"left\"\n", 430 | " model.to('cuda');\n", 431 | "\n", 432 | " questions = x['QA_with_answers']\n", 433 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 434 | " input_length = inputs.attention_mask.sum(dim=1)\n", 435 | "\n", 436 | " with torch.no_grad():\n", 437 | " original_output = original_model(**inputs)\n", 438 | "\n", 439 | " patch_loss_list = []\n", 440 | " for layer in range(32):\n", 441 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=False, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n", 442 | " with torch.no_grad():\n", 443 | " output = model(**inputs)\n", 444 | " # remove hooks\n", 445 | " remove_hooks(hooks) \n", 446 | " \n", 447 | " loss_list = []\n", 448 | " for i, length in enumerate(x['answers_tokens_num']):\n", 449 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 450 | " loss_list.append(loss)\n", 451 | " \n", 452 | " patch_loss_list.append(loss_list)\n", 453 | "\n", 454 | " mlp_coef_patch_loss_lists.append(patch_loss_list) \n", 455 | "\n", 456 | "\n", 457 | " del model\n", 458 | " del tokenizer\n", 459 | " torch.cuda.empty_cache()\n", 460 | "\n", 461 | " methods_mlp_coef_patch_loss_lists.append(mlp_coef_patch_loss_lists)\n", 462 | "\n" 463 | ], 464 | "outputs": [], 465 | "execution_count": null 466 | }, 467 | { 468 | "cell_type": "code", 469 | "metadata": { 470 | "tags": [] 471 | }, 472 | "source": [ 473 | "\n", 474 | "print('recover on OLMo by patching on coef (5 layer per patch)')\n", 475 | " \n", 476 | "coef_KRS_list_all = []\n", 477 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 478 | " print(f'on {method}')\n", 479 | " coef_KRS_list_per_method = []\n", 480 | " for patch_losses, ori_losses in zip(methods_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 481 | " 
coef_KRS_list_per_data = []\n", 482 | " for new_losses in patch_losses: #different_layer\n", 483 | " coef_KRS_list_per_layer = []\n", 484 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 485 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 486 | " coef_KRS_list_per_layer.append(KRS.cpu().float())\n", 487 | " \n", 488 | " coef_KRS_list_per_data.append(np.mean(coef_KRS_list_per_layer, axis=-1))\n", 489 | " coef_KRS_list_per_method.append(coef_KRS_list_per_data)\n", 490 | " \n", 491 | " coef_KRS_list_per_method = np.array(coef_KRS_list_per_method)\n", 492 | " avg_coef_KRS_list_per_method = np.mean(coef_KRS_list_per_method, axis=0)\n", 493 | " \n", 494 | " coef_KRS_list_all.append(avg_coef_KRS_list_per_method)\n", 495 | "\n", 496 | "coef_KRS_list_all = np.array(coef_KRS_list_all)\n", 497 | "avg_coef_KRS_list_all = np.mean(coef_KRS_list_all, axis=0)\n", 498 | "print(f'avg_coef_KRS_list_all: {list(avg_coef_KRS_list_all)}')" 499 | ], 500 | "outputs": [], 501 | "execution_count": null 502 | }, 503 | { 504 | "cell_type": "code", 505 | "metadata": {}, 506 | "source": [ 507 | "#patching on Attention states on unlearned models\n", 508 | "torch.cuda.set_device(0)\n", 509 | "\n", 510 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 511 | "import json\n", 512 | "from os.path import join\n", 513 | "\n", 514 | "\n", 515 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 516 | "\n", 517 | "methods_attn_patch_loss_lists = []\n", 518 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 519 | " attn_patch_loss_lists = []\n", 520 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 521 | " ori_attn_list = methods_ori_attn_list[j]\n", 522 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 523 | " un_attn_list = methods_un_attn_list[j]\n", 524 | " \n", 525 | " for index, x in enumerate(tqdm(data)):\n", 526 | "\n", 527 | " # Loading new unlearned model for certain concept \n", 528 | "\n", 529 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\"\n", 530 | "\n", 531 | " model = AutoModelForCausalLM.from_pretrained(\n", 532 | " join(model_dir, model_name),\n", 533 | " torch_dtype=torch.bfloat16,\n", 534 | " trust_remote_code=True\n", 535 | " );\n", 536 | "\n", 537 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 538 | " tokenizer.pad_token = tokenizer.eos_token\n", 539 | " tokenizer.padding_side = \"left\"\n", 540 | " model.to('cuda');\n", 541 | "\n", 542 | " questions = x['QA_with_answers']\n", 543 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 544 | " input_length = inputs.attention_mask.sum(dim=1)\n", 545 | "\n", 546 | " with torch.no_grad():\n", 547 | " original_output = original_model(**inputs)\n", 548 | "\n", 549 | " patch_loss_list = []\n", 550 | " for layer in range(32):\n", 551 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=False, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2]) #[i for i in range(32)]\n", 552 | " with torch.no_grad():\n", 553 | " output = model(**inputs)\n", 554 | " # remove hooks\n", 555 | " remove_hooks(hooks) \n", 556 | " \n", 557 | " loss_list = []\n", 558 | " for i, length in enumerate(x['answers_tokens_num']):\n", 559 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 560 | " loss_list.append(loss)\n", 
561 | " \n", 562 | " patch_loss_list.append(loss_list)\n", 563 | "\n", 564 | " attn_patch_loss_lists.append(patch_loss_list) \n", 565 | "\n", 566 | "\n", 567 | " del model\n", 568 | " del tokenizer\n", 569 | " torch.cuda.empty_cache()\n", 570 | "\n", 571 | " methods_attn_patch_loss_lists.append(attn_patch_loss_lists)" 572 | ], 573 | "outputs": [], 574 | "execution_count": null 575 | }, 576 | { 577 | "cell_type": "code", 578 | "metadata": { 579 | "tags": [] 580 | }, 581 | "source": [ 582 | "\n", 583 | "\n", 584 | "print('recover on OLMo by patching on attn (5 layer per patch)')\n", 585 | " \n", 586 | "attn_KRS_list_all = []\n", 587 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 588 | " print(f'on {method}')\n", 589 | " attn_KRS_list_per_method = []\n", 590 | " for patch_losses, ori_losses in zip(methods_attn_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 591 | " attn_KRS_list_per_data = []\n", 592 | " for new_losses in patch_losses: #different_layer\n", 593 | " attn_KRS_list_per_layer = []\n", 594 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 595 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 596 | " attn_KRS_list_per_layer.append(KRS.cpu().float())\n", 597 | " \n", 598 | " attn_KRS_list_per_data.append(np.mean(attn_KRS_list_per_layer, axis=-1))\n", 599 | " attn_KRS_list_per_method.append(attn_KRS_list_per_data)\n", 600 | " \n", 601 | " attn_KRS_list_per_method = np.array(attn_KRS_list_per_method)\n", 602 | " avg_attn_KRS_list_per_method = np.mean(attn_KRS_list_per_method, axis=0)\n", 603 | " \n", 604 | " attn_KRS_list_all.append(avg_attn_KRS_list_per_method)\n", 605 | "\n", 606 | "attn_KRS_list_all = np.array(attn_KRS_list_all)\n", 607 | "avg_attn_KRS_list_all = np.mean(attn_KRS_list_all, axis=0)\n", 608 | "print(f'avg_attn_KRS_list_all: {list(avg_attn_KRS_list_all)}')" 609 | ], 610 | "outputs": [], 611 | "execution_count": null 612 | }, 613 | { 614 | "cell_type": "code", 615 | "metadata": {}, 616 | "source": [ 617 | "#patching on both MLP coef and Attention states on unlearned models\n", 618 | "\n", 619 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 620 | "import json\n", 621 | "from os.path import join\n", 622 | "\n", 623 | "\n", 624 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 625 | "\n", 626 | "\n", 627 | "methods_attn_mlp_coef_patch_loss_lists = []\n", 628 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 629 | " attn_mlp_coef_patch_loss_lists = []\n", 630 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 631 | " ori_attn_list = methods_ori_attn_list[j]\n", 632 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 633 | " un_attn_list = methods_un_attn_list[j]\n", 634 | "\n", 635 | " for index, x in enumerate(tqdm(data)):\n", 636 | "\n", 637 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\" \n", 638 | "\n", 639 | " model = AutoModelForCausalLM.from_pretrained(\n", 640 | " join(model_dir, model_name),\n", 641 | " torch_dtype=torch.bfloat16,\n", 642 | " trust_remote_code=True\n", 643 | " )\n", 644 | "\n", 645 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 646 | " tokenizer.pad_token = tokenizer.eos_token\n", 647 | " tokenizer.padding_side = \"left\"\n", 648 | " model.to('cuda');\n", 649 | "\n", 650 | " questions = x['QA_with_answers']\n", 651 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 652 | " input_length = inputs.attention_mask.sum(dim=1)\n", 653 | "\n", 654 | " with 
torch.no_grad():\n", 655 | " original_output = original_model(**inputs)\n", 656 | "\n", 657 | " patch_loss_list = []\n", 658 | " for layer in range(32):\n", 659 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n", 660 | " with torch.no_grad():\n", 661 | " output = model(**inputs)\n", 662 | " # remove hooks\n", 663 | " remove_hooks(hooks) \n", 664 | " \n", 665 | " loss_list = []\n", 666 | " for i, length in enumerate(x['answers_tokens_num']):\n", 667 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 668 | " loss_list.append(loss)\n", 669 | " \n", 670 | " patch_loss_list.append(loss_list)\n", 671 | "\n", 672 | " attn_mlp_coef_patch_loss_lists.append(patch_loss_list) \n", 673 | "\n", 674 | "\n", 675 | " del model\n", 676 | " del tokenizer\n", 677 | " torch.cuda.empty_cache()\n", 678 | " \n", 679 | " methods_attn_mlp_coef_patch_loss_lists.append(attn_mlp_coef_patch_loss_lists)\n" 680 | ], 681 | "outputs": [], 682 | "execution_count": null 683 | }, 684 | { 685 | "cell_type": "code", 686 | "metadata": { 687 | "tags": [] 688 | }, 689 | "source": [ 690 | "\n", 691 | "print('recover on OLMo by patching on both coef and attn (5 layer per patch)')\n", 692 | " \n", 693 | "attn_mlp_coef_KRS_list_all = []\n", 694 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 695 | " print(f'on {method}')\n", 696 | " attn_mlp_coef_KRS_list_per_method = []\n", 697 | " for patch_losses, ori_losses in zip(methods_attn_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 698 | " attn_mlp_coef_KRS_list_per_data = []\n", 699 | " for new_losses in patch_losses: #different_layer\n", 700 | " attn_mlp_coef_KRS_list_per_layer = []\n", 701 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 702 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 703 | " attn_mlp_coef_KRS_list_per_layer.append(KRS.cpu().float())\n", 704 | " \n", 705 | " attn_mlp_coef_KRS_list_per_data.append(np.mean(attn_mlp_coef_KRS_list_per_layer, axis=-1))\n", 706 | " attn_mlp_coef_KRS_list_per_method.append(attn_mlp_coef_KRS_list_per_data)\n", 707 | " \n", 708 | " attn_mlp_coef_KRS_list_per_method = np.array(attn_mlp_coef_KRS_list_per_method)\n", 709 | " avg_attn_mlp_coef_KRS_list_per_method = np.mean(attn_mlp_coef_KRS_list_per_method, axis=0)\n", 710 | " \n", 711 | " attn_mlp_coef_KRS_list_all.append(avg_attn_mlp_coef_KRS_list_per_method)\n", 712 | "\n", 713 | "attn_mlp_coef_KRS_list_all = np.array(attn_mlp_coef_KRS_list_all)\n", 714 | "avg_attn_mlp_coef_KRS_list_all = np.mean(attn_mlp_coef_KRS_list_all, axis=0)\n", 715 | "print(f'avg_attn_mlp_coef_KRS_list_all: {list(avg_attn_mlp_coef_KRS_list_all)}')" 716 | ], 717 | "outputs": [], 718 | "execution_count": null 719 | }, 720 | { 721 | "cell_type": "code", 722 | "metadata": {}, 723 | "source": [ 724 | "#Restore MLP value vectors on unlearned models\n", 725 | "\n", 726 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 727 | "import json\n", 728 | "from os.path import join\n", 729 | "import copy\n", 730 | "\n", 731 | "global old_params\n", 732 | "\n", 733 | "\n", 734 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 735 | "\n", 736 | "methods_vectors_patch_loss_lists = []\n", 737 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 738 | " vectors_patch_loss_lists = []\n", 
739 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 740 | " ori_attn_list = methods_ori_attn_list[j]\n", 741 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 742 | " un_attn_list = methods_un_attn_list[j]\n", 743 | "\n", 744 | " for index, x in enumerate(tqdm(data)):\n", 745 | "\n", 746 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\" \n", 747 | "\n", 748 | " model = AutoModelForCausalLM.from_pretrained(\n", 749 | " join(model_dir, model_name),\n", 750 | " torch_dtype=torch.bfloat16,\n", 751 | " trust_remote_code=True\n", 752 | " )\n", 753 | "\n", 754 | " old_params = copy.deepcopy(model.state_dict())\n", 755 | "\n", 756 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 757 | " tokenizer.pad_token = tokenizer.eos_token\n", 758 | " tokenizer.padding_side = \"left\"\n", 759 | " model.to('cuda')\n", 760 | "\n", 761 | " questions = x['QA_with_answers']\n", 762 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 763 | " input_length = inputs.attention_mask.sum(dim=1)\n", 764 | "\n", 765 | " with torch.no_grad():\n", 766 | " original_output = original_model(**inputs)\n", 767 | "\n", 768 | " vectors_patch_loss_list = []\n", 769 | " for layer in range(32):\n", 770 | " for recover_layer in range(32): # select which layer's MLP value vectors to be recover:\n", 771 | " if recover_layer >= layer -2 and recover_layer <= layer + 2:\n", 772 | " model.state_dict()[f'model.transformer.blocks[{recover_layer}].ff_out.weight'][:, :] = original_model.state_dict()[f'model.transformer.blocks[{recover_layer}].ff_out.weight'][:, :] \n", 773 | " \n", 774 | " with torch.no_grad():\n", 775 | " output = model(**inputs)\n", 776 | "\n", 777 | " loss_list = []\n", 778 | " for i, length in enumerate(x['answers_tokens_num']):\n", 779 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 780 | " loss_list.append(loss)\n", 781 | "\n", 782 | " vectors_patch_loss_list.append(loss_list)\n", 783 | "\n", 784 | " model.load_state_dict(old_params)\n", 785 | " \n", 786 | " vectors_patch_loss_lists.append(vectors_patch_loss_list) \n", 787 | "\n", 788 | " del model\n", 789 | " del tokenizer\n", 790 | " torch.cuda.empty_cache()\n", 791 | " \n", 792 | " methods_vectors_patch_loss_lists.append(vectors_patch_loss_lists)\n", 793 | " \n", 794 | "\n" 795 | ], 796 | "outputs": [], 797 | "execution_count": null 798 | }, 799 | { 800 | "cell_type": "code", 801 | "metadata": { 802 | "tags": [] 803 | }, 804 | "source": [ 805 | "\n", 806 | "print('recover on OLMo by patching on value vectors (5 layer per patch)')\n", 807 | " \n", 808 | "vectors_KRS_list_all = []\n", 809 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 810 | " print(f'on {method}')\n", 811 | " vectors_KRS_list_per_method = []\n", 812 | " for patch_losses, ori_losses in zip(methods_vectors_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 813 | " vectors_KRS_list_per_data = []\n", 814 | " for new_losses in patch_losses: #different_layer\n", 815 | " vectors_KRS_list_per_layer = []\n", 816 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 817 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 818 | " vectors_KRS_list_per_layer.append(KRS.cpu().float())\n", 819 | " \n", 820 | " vectors_KRS_list_per_data.append(np.mean(vectors_KRS_list_per_layer, axis=-1))\n", 821 | " vectors_KRS_list_per_method.append(vectors_KRS_list_per_data)\n", 822 | " \n", 823 | " vectors_KRS_list_per_method = 
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "\n",
    "print('recover on OLMo by patching on value vectors (5 layer per patch)')\n",
    "    \n",
    "vectors_KRS_list_all = []\n",
    "for t, method in enumerate(['grad_diff', 'dpo']):\n",
    "    print(f'on {method}')\n",
    "    vectors_KRS_list_per_method = []\n",
    "    for patch_losses, ori_losses in zip(methods_vectors_patch_loss_lists[t], methods_ori_loss_list[t]):\n",
    "        vectors_KRS_list_per_data = []\n",
    "        for new_losses in patch_losses: # per patched layer window\n",
    "            vectors_KRS_list_per_layer = []\n",
    "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n",
    "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n",
    "                vectors_KRS_list_per_layer.append(KRS.cpu().float())\n",
    "            \n",
    "            vectors_KRS_list_per_data.append(np.mean(vectors_KRS_list_per_layer, axis=-1))\n",
    "        vectors_KRS_list_per_method.append(vectors_KRS_list_per_data)\n",
    "    \n",
    "    vectors_KRS_list_per_method = np.array(vectors_KRS_list_per_method)\n",
    "    avg_vectors_KRS_list_per_method = np.mean(vectors_KRS_list_per_method, axis=0)\n",
    "    \n",
    "    vectors_KRS_list_all.append(avg_vectors_KRS_list_per_method)\n",
    "\n",
    "vectors_KRS_list_all = np.array(vectors_KRS_list_all)\n",
    "avg_vectors_KRS_list_all = np.mean(vectors_KRS_list_all, axis=0)\n",
    "print(f'avg_vectors_KRS_list_all: {list(avg_vectors_KRS_list_all)}')"
   ],
   "outputs": [],
   "execution_count": null
  },
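  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# [Added example] Optional: plot the layer-wise KRS curves collected above.\n",
    "# A minimal sketch assuming the four avg_*_KRS_list_all arrays computed in the\n",
    "# aggregation cells; matplotlib is pinned in requirements.txt.\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "layer_idx = np.arange(len(avg_coef_KRS_list_all))\n",
    "plt.plot(layer_idx, avg_coef_KRS_list_all, label='MLP coef patch')\n",
    "plt.plot(layer_idx, avg_attn_KRS_list_all, label='Attn patch')\n",
    "plt.plot(layer_idx, avg_attn_mlp_coef_KRS_list_all, label='Attn + MLP coef patch')\n",
    "plt.plot(layer_idx, avg_vectors_KRS_list_all, label='MLP value-vector restore')\n",
    "plt.xlabel('center layer of the 5-layer patch window')\n",
    "plt.ylabel('Knowledge Recovery Score')\n",
    "plt.title('OLMo-7B: recovery per patched window')\n",
    "plt.legend()\n",
    "plt.show()\n"
   ],
   "outputs": [],
   "execution_count": null
  }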
"#wget.download(\"https://rome.baulab.info/data/dsets/known_1000.json\")\n", 91 | "\n", 92 | "\n", 93 | "model_dir = \"/mnt/workspace/workgroup/yhhong/transformers\"\n", 94 | "original_model_name = \"Llama-2-7b-chat-hf\" #\"chatglm3-6b\" #\"Llama-2-7b-chat-hf\" #\"Qwen1.5-7B-Chat\"\n", 95 | "\n", 96 | "\n", 97 | "original_model = AutoModelForCausalLM.from_pretrained(\n", 98 | " join(model_dir, original_model_name),\n", 99 | " torch_dtype=torch.bfloat16,\n", 100 | " trust_remote_code=True\n", 101 | ");\n", 102 | "\n", 103 | "tokenizer = AutoTokenizer.from_pretrained(join(model_dir, original_model_name))\n", 104 | "tokenizer.pad_token = tokenizer.eos_token\n", 105 | "tokenizer.padding_side = \"left\"\n", 106 | "\n", 107 | "original_model.to('cuda');\n" 108 | ], 109 | "outputs": [], 110 | "execution_count": null 111 | }, 112 | { 113 | "cell_type": "code", 114 | "metadata": { 115 | "tags": [] 116 | }, 117 | "source": [ 118 | "original_model" 119 | ], 120 | "outputs": [], 121 | "execution_count": null 122 | }, 123 | { 124 | "cell_type": "code", 125 | "metadata": { 126 | "collapsed": false, 127 | "jupyter": { 128 | "outputs_hidden": false 129 | }, 130 | "tags": [] 131 | }, 132 | "source": [ 133 | "import torch.nn.functional as F\n", 134 | "\n", 135 | "def set_act_get_hooks(model, mlp_coef=False, attn=False):\n", 136 | " \"\"\"\n", 137 | " Works on LLaMA-2, getting coef values or attn values\n", 138 | " \"\"\"\n", 139 | "\n", 140 | " for attr in [\"activations_\"]:\n", 141 | " if not hasattr(model, attr):\n", 142 | " setattr(model, attr, {})\n", 143 | "\n", 144 | " def get_activation(name):\n", 145 | " def hook(module, input, output):\n", 146 | " if \"m_coef\" in name:\n", 147 | " model.activations_[name] = input[0][:, :].detach() \n", 148 | " elif \"attn\" in name:\n", 149 | " model.activations_[name] = output[:, :].detach() \n", 150 | "\n", 151 | " return hook\n", 152 | "\n", 153 | " hooks = []\n", 154 | " for i in range(32):\n", 155 | " if mlp_coef is True: #co-effciency\n", 156 | " hooks.append(model.model.layers[i].mlp.down_proj.register_forward_hook(get_activation(\"m_coef_\" + str(i))))\n", 157 | " if attn is True:\n", 158 | " hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(get_activation(\"attn_\" + str(i)))) \n", 159 | " \n", 160 | "\n", 161 | " return hooks\n", 162 | "\n", 163 | "def set_act_modify_hooks(model, original_model, mlp_coef, attn, ori_mlp_coef, ori_attn, un_mlp_coef, un_attn, layers=None):\n", 164 | " \"\"\"\n", 165 | " Works on LLaMA-2, modifying coef values or attn values\n", 166 | " \"\"\"\n", 167 | "\n", 168 | " def modify_activation(name, update_value, patch_input):\n", 169 | " def pre_hook(module, input):\n", 170 | " if \"m_coef\" in name:\n", 171 | " input[0][:, :, :] = update_value \n", 172 | " \n", 173 | " def post_hook(module, input, output):\n", 174 | " if \"attn\" in name:\n", 175 | " output[:, :, :] = update_value\n", 176 | " \n", 177 | " if patch_input:\n", 178 | " return pre_hook\n", 179 | " else:\n", 180 | " return post_hook\n", 181 | "\n", 182 | " hooks = []\n", 183 | " if layers is not None:\n", 184 | " for i in range(32):\n", 185 | " if i in layers and mlp_coef is True: #co-effciency\n", 186 | " hooks.append(model.model.layers[i].mlp.down_proj.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = ori_mlp_coef[i], patch_input=True)))\n", 187 | " # else:\n", 188 | " # hooks.append(model.model.layers[i].mlp.down_proj.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = 
un_mlp_coef[i], patch_input=True)))\n", 189 | " if i in layers and attn is True:\n", 190 | " hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = ori_attn[i], patch_input=False))) \n", 191 | " # else:\n", 192 | " # hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = un_attn[i], patch_input=False))) \n", 193 | "\n", 194 | " return hooks\n", 195 | "\n", 196 | "def remove_hooks(hooks):\n", 197 | " for hook in hooks:\n", 198 | " hook.remove()\n", 199 | "\n", 200 | " \n", 201 | "def compute_loss(target_logits, logits):\n", 202 | " \n", 203 | " #MSE loss\n", 204 | "\n", 205 | " return torch.mean((target_logits - logits) ** 2)\n", 206 | "\n", 207 | "\n", 208 | "def compute_KRS(ori_loss, new_loss): \n", 209 | " return 1 - (new_loss / ori_loss)\n", 210 | " \n", 211 | "\n", 212 | "\n" 213 | ], 214 | "outputs": [], 215 | "execution_count": null 216 | }, 217 | { 218 | "metadata": { 219 | "tags": [] 220 | }, 221 | "cell_type": "code", 222 | "source": [ 223 | "# loading data:\n", 224 | "\n", 225 | "with open(\"data/llama2-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n", 226 | " data = json.load(file)\n", 227 | " \n", 228 | "for ix, item in enumerate(data):\n", 229 | " print(ix,\" \",item['Concept'])\n", 230 | "\n" 231 | ], 232 | "outputs": [], 233 | "execution_count": null 234 | }, 235 | { 236 | "cell_type": "code", 237 | "metadata": { 238 | "tags": [] 239 | }, 240 | "source": [ 241 | "# evaluate original model and obtain the next 30 tokens answers\n", 242 | "\n", 243 | "for index, x in enumerate(tqdm(data)):\n", 244 | " \n", 245 | " questions = []\n", 246 | " n_new_tokens = 30\n", 247 | " for idx, question in enumerate(x['QA']):\n", 248 | " questions.append(f\"Question: {question}\\n Answer:\")\n", 249 | "\n", 250 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 251 | " input_length = inputs.input_ids.size(1)\n", 252 | " with torch.no_grad():\n", 253 | " generation_output = original_model.generate( # mt.model\n", 254 | " **inputs,\n", 255 | " do_sample=False,\n", 256 | " max_new_tokens=n_new_tokens,\n", 257 | " )\n", 258 | "\n", 259 | " \n", 260 | " eos_token_id = tokenizer.eos_token_id\n", 261 | " normal_tokens_num_list = []\n", 262 | " for i, output in enumerate(generation_output):\n", 263 | " eos_count = (output == eos_token_id).sum().item()\n", 264 | " total_tokens = output.size(0) \n", 265 | " normal_tokens_count = total_tokens - eos_count - inputs.input_ids.size(1)\n", 266 | " normal_tokens_num_list.append(normal_tokens_count)\n", 267 | " \n", 268 | " outputs = tokenizer.batch_decode(generation_output[:, :], skip_special_tokens=True)\n", 269 | " \n", 270 | " data[index]['QA_with_answers'] = outputs\n", 271 | " data[index]['answers_tokens_num'] = normal_tokens_num_list\n", 272 | "\n", 273 | "with open(\"data/llama2-7b_concepts_test_30tokens_answers.json\", \"w\", encoding=\"utf-8\") as file:\n", 274 | " json.dump(data, file, ensure_ascii=False, indent=4)" 275 | ], 276 | "outputs": [], 277 | "execution_count": null 278 | }, 279 | { 280 | "cell_type": "code", 281 | "metadata": { 282 | "tags": [] 283 | }, 284 | "source": [ 285 | "with open(\"data/llama2-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n", 286 | " data = json.load(file)" 287 | ], 288 | "outputs": [], 289 | "execution_count": null 290 | }, 291 | { 292 | 
"cell_type": "code", 293 | "metadata": { 294 | "tags": [] 295 | }, 296 | "source": [ 297 | "#on vanilla model and unlearned model\n", 298 | "\n", 299 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 300 | "import json\n", 301 | "from os.path import join\n", 302 | "\n", 303 | "\n", 304 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 305 | "\n", 306 | "methods_ori_loss_list = [] \n", 307 | "methods_ori_attn_list = []\n", 308 | "methods_ori_mlp_coef_list = []\n", 309 | "methods_un_mlp_coef_list = []\n", 310 | "methods_un_attn_list = []\n", 311 | "\n", 312 | "for method in ['grad_diff', 'dpo']:\n", 313 | " ori_loss_list = [] \n", 314 | " ori_attn_list = []\n", 315 | " ori_mlp_coef_list = []\n", 316 | " un_mlp_coef_list = []\n", 317 | " un_attn_list = []\n", 318 | " \n", 319 | " for x in tqdm(data):\n", 320 | " # Loading new unlearned model for certain concept \n", 321 | "\n", 322 | " model_name = f\"llama2-7b/{method}/Full/{x['id']}\" \n", 323 | "\n", 324 | " model = AutoModelForCausalLM.from_pretrained(\n", 325 | " join(model_dir, model_name),\n", 326 | " torch_dtype=torch.bfloat16,\n", 327 | " trust_remote_code=True\n", 328 | " );\n", 329 | "\n", 330 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 331 | " tokenizer.pad_token = tokenizer.eos_token\n", 332 | " tokenizer.padding_side = \"left\"\n", 333 | " model.to('cuda')\n", 334 | " \n", 335 | " questions = x['QA_with_answers']\n", 336 | "\n", 337 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 338 | " input_length = inputs.attention_mask.sum(dim=1)\n", 339 | "\n", 340 | " un_hooks = set_act_get_hooks(model, mlp_coef=True, attn=True)\n", 341 | " with torch.no_grad():\n", 342 | " output = model(**inputs)\n", 343 | "\n", 344 | " # remove hooks\n", 345 | " remove_hooks(un_hooks) \n", 346 | "\n", 347 | " hooks = set_act_get_hooks(original_model, mlp_coef=True, attn=True)\n", 348 | " with torch.no_grad():\n", 349 | " original_output = original_model(**inputs)\n", 350 | " # remove hooks\n", 351 | " remove_hooks(hooks) \n", 352 | " \n", 353 | " \n", 354 | " loss_list = []\n", 355 | " for i, length in enumerate(x['answers_tokens_num']):\n", 356 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 357 | " loss_list.append(loss)\n", 358 | " \n", 359 | " un_mlp_coef = []\n", 360 | " un_attn = []\n", 361 | " ori_mlp_coef = []\n", 362 | " ori_attn = []\n", 363 | " for layer in range(32):\n", 364 | " un_mlp_coef.append(model.activations_[f'm_coef_{layer}'])\n", 365 | " un_attn.append(model.activations_[f'attn_{layer}'])\n", 366 | " ori_mlp_coef.append(original_model.activations_[f'm_coef_{layer}'])\n", 367 | " ori_attn.append(original_model.activations_[f'attn_{layer}'])\n", 368 | "\n", 369 | "\n", 370 | " ori_loss_list.append(loss_list)\n", 371 | "\n", 372 | " un_mlp_coef_list.append(un_mlp_coef) # on all data\n", 373 | " un_attn_list.append(un_attn) # on all data\n", 374 | "\n", 375 | " ori_mlp_coef_list.append(ori_mlp_coef) # on all data\n", 376 | " ori_attn_list.append(ori_attn) # on all data\n", 377 | "\n", 378 | "\n", 379 | " del model\n", 380 | " del tokenizer\n", 381 | " torch.cuda.empty_cache()\n", 382 | "\n", 383 | " methods_ori_loss_list.append(ori_loss_list)\n", 384 | " methods_ori_attn_list.append(ori_attn_list)\n", 385 | " methods_ori_mlp_coef_list.append(ori_mlp_coef_list)\n", 386 | " methods_un_mlp_coef_list.append(un_mlp_coef_list)\n", 387 | " 
methods_un_attn_list.append(un_attn_list)\n",
    "    \n",
    "\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "#patching on MLP coefficients on unlearned models\n",
    "torch.cuda.set_device(0)\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "\n",
    "methods_mlp_coef_patch_loss_lists = []\n",
    "for j, method in enumerate(['grad_diff', 'dpo']):\n",
    "    mlp_coef_patch_loss_lists = []\n",
    "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n",
    "    ori_attn_list = methods_ori_attn_list[j]\n",
    "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n",
    "    un_attn_list = methods_un_attn_list[j]\n",
    "    \n",
    "    for index, x in enumerate(tqdm(data)):\n",
    "\n",
    "        # Load the unlearned model for this concept\n",
    "\n",
    "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda');\n",
    "\n",
    "        questions = x['QA_with_answers']\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
    "\n",
    "        with torch.no_grad():\n",
    "            original_output = original_model(**inputs)\n",
    "\n",
    "        patch_loss_list = []\n",
    "        for layer in range(32):\n",
    "            hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=False, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n",
    "            with torch.no_grad():\n",
    "                output = model(**inputs)\n",
    "            # remove hooks\n",
    "            remove_hooks(hooks)\n",
    "            \n",
    "            loss_list = []\n",
    "            for i, length in enumerate(x['answers_tokens_num']):\n",
    "                loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n",
    "                loss_list.append(loss)\n",
    "            \n",
    "            patch_loss_list.append(loss_list)\n",
    "\n",
    "        mlp_coef_patch_loss_lists.append(patch_loss_list)\n",
    "\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "    methods_mlp_coef_patch_loss_lists.append(mlp_coef_patch_loss_lists)\n"
   ],
   "outputs": [],
   "execution_count": null
  },
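  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# [Added note] Shape bookkeeping for the aggregation cells below, assuming only\n",
    "# the lists built in the cells above:\n",
    "#   methods_mlp_coef_patch_loss_lists[method][concept][layer][question]\n",
    "#     = MSE to the vanilla logits after patching the 5-layer window centered on `layer`\n",
    "#   methods_ori_loss_list[method][concept][question]\n",
    "#     = baseline MSE of the unpatched unlearned model\n",
    "# The KRS cells convert these losses to recovery scores, then average over\n",
    "# questions, concepts, and finally the two unlearning methods.\n"
   ],
   "outputs": [],
   "execution_count": null
  },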
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "print('recover on LLaMA by patching on coef (5 layer per patch)')\n",
    "    \n",
    "coef_KRS_list_all = []\n",
    "for t, method in enumerate(['grad_diff', 'dpo']):\n",
    "    print(f'on {method}')\n",
    "    coef_KRS_list_per_method = []\n",
    "    for patch_losses, ori_losses in zip(methods_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n",
    "        coef_KRS_list_per_data = []\n",
    "        for new_losses in patch_losses: # per patched layer window\n",
    "            coef_KRS_list_per_layer = []\n",
    "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n",
    "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n",
    "                # .float() before numpy: bfloat16 tensors cannot be converted directly\n",
    "                coef_KRS_list_per_layer.append(KRS.cpu().float())\n",
    "            \n",
    "            coef_KRS_list_per_data.append(np.mean(coef_KRS_list_per_layer, axis=-1))\n",
    "        coef_KRS_list_per_method.append(coef_KRS_list_per_data)\n",
    "    \n",
    "    coef_KRS_list_per_method = np.array(coef_KRS_list_per_method)\n",
    "    avg_coef_KRS_list_per_method = np.mean(coef_KRS_list_per_method, axis=0)\n",
    "    \n",
    "    coef_KRS_list_all.append(avg_coef_KRS_list_per_method)\n",
    "\n",
    "coef_KRS_list_all = np.array(coef_KRS_list_all)\n",
    "avg_coef_KRS_list_all = np.mean(coef_KRS_list_all, axis=0)\n",
    "print(f'avg_coef_KRS_list_all: {list(avg_coef_KRS_list_all)}')"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "#patching on Attention states on unlearned models\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "methods_attn_patch_loss_lists = []\n",
    "for j, method in enumerate(['grad_diff', 'dpo']):\n",
    "    attn_patch_loss_lists = []\n",
    "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n",
    "    ori_attn_list = methods_ori_attn_list[j]\n",
    "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n",
    "    un_attn_list = methods_un_attn_list[j]\n",
    "    \n",
    "    for index, x in enumerate(tqdm(data)):\n",
    "\n",
    "        # Load the unlearned model for this concept\n",
    "\n",
    "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda');\n",
    "\n",
    "        questions = x['QA_with_answers']\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
    "\n",
    "        with torch.no_grad():\n",
    "            original_output = original_model(**inputs)\n",
    "\n",
    "        patch_loss_list = []\n",
    "        for layer in range(32):\n",
    "            hooks = set_act_modify_hooks(model, original_model, mlp_coef=False, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2]) #[i for i in range(32)]\n",
    "            with torch.no_grad():\n",
    "                output = model(**inputs)\n",
    "            # remove hooks\n",
    "            remove_hooks(hooks)\n",
    "            \n",
    "            loss_list = []\n",
    "            for i, length in enumerate(x['answers_tokens_num']):\n",
    "                loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n",
    "                loss_list.append(loss)\n",
    "            \n",
    "            patch_loss_list.append(loss_list)\n",
    "\n",
    "        attn_patch_loss_lists.append(patch_loss_list)\n",
    "\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "    methods_attn_patch_loss_lists.append(attn_patch_loss_lists)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "print('recover on LLaMA by patching on attn (5 layer per patch)')\n",
    "    \n",
    "attn_KRS_list_all = []\n",
    "for t, method in enumerate(['grad_diff', 'dpo']):\n",
    "    print(f'on {method}')\n",
    "    attn_KRS_list_per_method = []\n",
    "    for patch_losses, ori_losses in zip(methods_attn_patch_loss_lists[t], methods_ori_loss_list[t]):\n",
    "        attn_KRS_list_per_data = []\n",
    "        for new_losses in patch_losses: # per patched layer window\n",
    "            attn_KRS_list_per_layer = []\n",
    "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n",
    "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n",
    "                attn_KRS_list_per_layer.append(KRS.cpu().float())\n",
    "            \n",
    "            attn_KRS_list_per_data.append(np.mean(attn_KRS_list_per_layer, axis=-1))\n",
    "        attn_KRS_list_per_method.append(attn_KRS_list_per_data)\n",
    "    \n",
    "    attn_KRS_list_per_method = np.array(attn_KRS_list_per_method)\n",
    "    avg_attn_KRS_list_per_method = np.mean(attn_KRS_list_per_method, axis=0)\n",
    "    \n",
    "    attn_KRS_list_all.append(avg_attn_KRS_list_per_method)\n",
    "\n",
    "attn_KRS_list_all = np.array(attn_KRS_list_all)\n",
    "avg_attn_KRS_list_all = np.mean(attn_KRS_list_all, axis=0)\n",
    "print(f'avg_attn_KRS_list_all: {list(avg_attn_KRS_list_all)}')"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "#patching on both MLP coef and Attention states on unlearned models\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "methods_attn_mlp_coef_patch_loss_lists = []\n",
    "for j, method in enumerate(['grad_diff', 'dpo']):\n",
    "    attn_mlp_coef_patch_loss_lists = []\n",
    "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n",
    "    ori_attn_list = methods_ori_attn_list[j]\n",
    "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n",
    "    un_attn_list = methods_un_attn_list[j]\n",
    "\n",
    "    for index, x in enumerate(tqdm(data)):\n",
    "\n",
    "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda');\n",
    "\n",
    "        questions = x['QA_with_answers']\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
"\n", 653 | " with torch.no_grad():\n", 654 | " original_output = original_model(**inputs)\n", 655 | "\n", 656 | " patch_loss_list = []\n", 657 | " for layer in range(32):\n", 658 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n", 659 | " with torch.no_grad():\n", 660 | " output = model(**inputs)\n", 661 | " # remove hooks\n", 662 | " remove_hooks(hooks) \n", 663 | " \n", 664 | " loss_list = []\n", 665 | " for i, length in enumerate(x['answers_tokens_num']):\n", 666 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 667 | " loss_list.append(loss)\n", 668 | " \n", 669 | " patch_loss_list.append(loss_list)\n", 670 | "\n", 671 | " attn_mlp_coef_patch_loss_lists.append(patch_loss_list) \n", 672 | "\n", 673 | "\n", 674 | " del model\n", 675 | " del tokenizer\n", 676 | " torch.cuda.empty_cache()\n", 677 | " \n", 678 | " methods_attn_mlp_coef_patch_loss_lists.append(attn_mlp_coef_patch_loss_lists)\n", 679 | "\n" 680 | ], 681 | "outputs": [], 682 | "execution_count": null 683 | }, 684 | { 685 | "cell_type": "code", 686 | "metadata": { 687 | "tags": [] 688 | }, 689 | "source": [ 690 | "\n", 691 | "print('recover on LLaMA by patching on both coef and attn (5 layer per patch)')\n", 692 | " \n", 693 | "attn_mlp_coef_KRS_list_all = []\n", 694 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 695 | " print(f'on {method}')\n", 696 | " attn_mlp_coef_KRS_list_per_method = []\n", 697 | " for patch_losses, ori_losses in zip(methods_attn_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 698 | " attn_mlp_coef_KRS_list_per_data = []\n", 699 | " for new_losses in patch_losses: #different_layer\n", 700 | " attn_mlp_coef_KRS_list_per_layer = []\n", 701 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 702 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 703 | " attn_mlp_coef_KRS_list_per_layer.append(KRS.cpu())\n", 704 | " \n", 705 | " attn_mlp_coef_KRS_list_per_data.append(np.mean(attn_mlp_coef_KRS_list_per_layer, axis=-1))\n", 706 | " attn_mlp_coef_KRS_list_per_method.append(attn_mlp_coef_KRS_list_per_data)\n", 707 | " \n", 708 | " attn_mlp_coef_KRS_list_per_method = np.array(attn_mlp_coef_KRS_list_per_method)\n", 709 | " avg_attn_mlp_coef_KRS_list_per_method = np.mean(attn_mlp_coef_KRS_list_per_method, axis=0)\n", 710 | " \n", 711 | " attn_mlp_coef_KRS_list_all.append(avg_attn_mlp_coef_KRS_list_per_method)\n", 712 | "\n", 713 | "attn_mlp_coef_KRS_list_all = np.array(attn_mlp_coef_KRS_list_all)\n", 714 | "avg_attn_mlp_coef_KRS_list_all = np.mean(attn_mlp_coef_KRS_list_all, axis=0)\n", 715 | "print(f'avg_attn_mlp_coef_KRS_list_all: {list(avg_attn_mlp_coef_KRS_list_all)}')" 716 | ], 717 | "outputs": [], 718 | "execution_count": null 719 | }, 720 | { 721 | "cell_type": "code", 722 | "metadata": {}, 723 | "source": [ 724 | "#Restore MLP value vectors on unlearned models\n", 725 | "\n", 726 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 727 | "import json\n", 728 | "from os.path import join\n", 729 | "import copy\n", 730 | "\n", 731 | "global old_params\n", 732 | "\n", 733 | "\n", 734 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 735 | "\n", 736 | "methods_vectors_patch_loss_lists = []\n", 737 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 738 | " 
vectors_patch_loss_lists = []\n", 739 | "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 740 | "    ori_attn_list = methods_ori_attn_list[j]\n", 741 | "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 742 | "    un_attn_list = methods_un_attn_list[j]\n", 743 | "\n", 744 | "    for index, x in enumerate(tqdm(data)):\n", 745 | "\n", 746 | "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n", 747 | "\n", 748 | "        model = AutoModelForCausalLM.from_pretrained(\n", 749 | "            join(model_dir, model_name),\n", 750 | "            torch_dtype=torch.bfloat16,\n", 751 | "            trust_remote_code=True\n", 752 | "        )\n", 753 | "\n", 754 | "        old_params = copy.deepcopy(model.state_dict()) # snapshot of the unlearned weights\n", 755 | "\n", 756 | "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 757 | "        tokenizer.pad_token = tokenizer.eos_token\n", 758 | "        tokenizer.padding_side = \"left\"\n", 759 | "        model.to('cuda')\n", 760 | "\n", 761 | "        questions = x['QA_with_answers']\n", 762 | "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 763 | "        input_length = inputs.attention_mask.sum(dim=1)\n", 764 | "\n", 765 | "        with torch.no_grad():\n", 766 | "            original_output = original_model(**inputs)\n", 767 | "\n", 768 | "        vectors_patch_loss_list = []\n", 769 | "        for layer in range(32):\n", 770 | "            for recover_layer in range(32): # select which layers' MLP value vectors to restore\n", 771 | "                if layer - 2 <= recover_layer <= layer + 2: # restore a 5-layer window centered on the current layer\n", 772 | "                    model.state_dict()[f'model.layers.{recover_layer}.mlp.down_proj.weight'][:, :] = original_model.state_dict()[f'model.layers.{recover_layer}.mlp.down_proj.weight'][:, :]\n", 773 | "            \n", 774 | "            \n", 775 | "            with torch.no_grad():\n", 776 | "                output = model(**inputs)\n", 777 | "\n", 778 | "            loss_list = []\n", 779 | "            for i, length in enumerate(x['answers_tokens_num']):\n", 780 | "                loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 781 | "                loss_list.append(loss)\n", 782 | "\n", 783 | "            vectors_patch_loss_list.append(loss_list)\n", 784 | "\n", 785 | "            model.load_state_dict(old_params) # revert to the unlearned weights before patching the next window\n", 786 | "        \n", 787 | "        vectors_patch_loss_lists.append(vectors_patch_loss_list)\n", 788 | "\n", 789 | "        del model\n", 790 | "        del tokenizer\n", 791 | "        torch.cuda.empty_cache()\n", 792 | "    \n", 793 | "    methods_vectors_patch_loss_lists.append(vectors_patch_loss_lists)\n", 794 | "    \n", 795 | "\n" 796 | ], 797 | "outputs": [], 798 | "execution_count": null 799 | }, 800 | { 801 | "cell_type": "code", 802 | "metadata": { 803 | "tags": [] 804 | }, 805 | "source": [ 806 | "\n", 807 | "\n", 808 | "print('Recovery on LLaMA by restoring MLP value vectors (5 layers per patch)')\n", 809 | "    \n", 810 | "vectors_KRS_list_all = []\n", 811 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 812 | "    print(f'on {method}')\n", 813 | "    vectors_KRS_list_per_method = []\n", 814 | "    for patch_losses, ori_losses in zip(methods_vectors_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 815 | "        vectors_KRS_list_per_data = []\n", 816 | "        for new_losses in patch_losses: # one entry per patched layer\n", 817 | "            vectors_KRS_list_per_layer = []\n", 818 | "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 819 | "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 820 | "                vectors_KRS_list_per_layer.append(KRS.cpu())\n", 821 | "            \n", 822 | "            vectors_KRS_list_per_data.append(np.mean(vectors_KRS_list_per_layer, axis=-1))\n", 823 | "        vectors_KRS_list_per_method.append(vectors_KRS_list_per_data)\n", 824 | "    \n", 825 | "    
vectors_KRS_list_per_method = np.array(vectors_KRS_list_per_method)\n", 826 | " avg_vectors_KRS_list_per_method = np.mean(vectors_KRS_list_per_method, axis=0)\n", 827 | " \n", 828 | " vectors_KRS_list_all.append(avg_vectors_KRS_list_per_method)\n", 829 | "\n", 830 | "vectors_KRS_list_all = np.array(vectors_KRS_list_all)\n", 831 | "avg_vectors_KRS_list_all = np.mean(vectors_KRS_list_all, axis=0)\n", 832 | "print(f'avg_vectors_KRS_list_all: {list(avg_vectors_KRS_list_all)}')" 833 | ], 834 | "outputs": [], 835 | "execution_count": null 836 | } 837 | ], 838 | "metadata": { 839 | "kernelspec": { 840 | "display_name": "Python 3 (ipykernel)", 841 | "language": "python", 842 | "name": "python3" 843 | }, 844 | "language_info": { 845 | "codemirror_mode": { 846 | "name": "ipython", 847 | "version": 3 848 | }, 849 | "file_extension": ".py", 850 | "mimetype": "text/x-python", 851 | "name": "python", 852 | "nbconvert_exporter": "python", 853 | "pygments_lexer": "ipython3", 854 | "version": "3.11.5" 855 | } 856 | }, 857 | "nbformat": 4, 858 | "nbformat_minor": 4 859 | } 860 | --------------------------------------------------------------------------------