├── requirements.txt
├── README.md
└── Dissecting-FT-Unlearning
    ├── Restoration_experiments_olmo-30tokens.ipynb
    └── Restoration_experiments_llama-30tokens.ipynb

/requirements.txt:
--------------------------------------------------------------------------------
ai2-olmo==0.2.5
aiohttp==3.9.3
aiosignal==1.3.1
anyio==4.3.0
better-abc==0.0.3
bitsandbytes==0.43.1
datasets==2.18.0
evaluate==0.4.2
huggingface-hub==0.21.4
hydra-core==1.3.2
jupyter
matplotlib==3.7.5
nltk==3.8.1
numpy==1.24.1
openai==1.25.0
pandas==2.0.3
peft==0.10.0
pillow==10.2.0
PyYAML==6.0.1
rouge==1.0.1
rouge-score==0.1.2
scipy==1.10.1
sentencepiece==0.2.0
statistics==1.0.3.5
scikit-learn==1.5.2
tokenizers==0.15.2
tornado==6.4
tqdm==4.66.2
transformer-lens==1.15.0
transformers==4.38.2
urllib3==1.26.13

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Dissecting-FT-Unlearning (EMNLP 2024 Main - Short)

The code for the experiments in our paper **[Dissecting Fine-Tuning Unlearning in Large Language Models]**, accepted at EMNLP 2024 Main (short paper). This is a follow-up to the [ConceptVectors Benchmark](https://github.com/yihuaihong/ConceptVectors).

* **Arxiv:** https://arxiv.org/abs/2410.06606


## Quick Links
- [Dissecting FT Unlearning](#Dissecting-FT-Unlearning)
  - [Quick Links](#quick-links)
  - [Overview](#overview)
  - [1. Requirements](#1-requirements)
  - [2. Download Unlearned models](#2-Download-Unlearned-models)
  - [3. Patching or Restoring Experiments](#3-Patching-or-Restoring-Experiments)
  - [How to Cite](#how-to-cite)

## Overview
This repository lets you reproduce the experiments in our paper.

> **Abstract**
> Fine-tuning-based unlearning methods prevail for removing targeted harmful, sensitive, or copyrighted information from large language models while preserving overall capabilities. However, the true effectiveness of these methods is unclear. In this paper, we delve into the limitations of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods alter the model's knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parameters. Interestingly, the MLP components in the model's final layer are the primary contributors to these apparently positive unlearning effects, responsible for controlling the model's behavior. Our work advocates the development of more resilient unlearning techniques for truly erasing knowledge. The code is released at https://github.com/yihuaihong/Dissecting-FT-Unlearning


## 1. Requirements
To install the required packages, please run the following commands.
```sh
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

## 2. Download Unlearned models

For convenience, we directly release the models obtained after unlearning training (via DPO or Gradient Difference) on both LLaMA2-7B-chat and OLMo-7B. The Hugging Face repository names follow the patterns below:

```python
for method in ['dpo', 'grad_diff']:
    for concept_id in [16, 18, 21, 26, 27, 38, 42, 47, 49, 54]:
        download_link = f"https://huggingface.co/YihuaiHong/llama2-7b_{method}_unlearn_on_id_{concept_id}_concept"
```

or

```python
for method in ['dpo', 'grad_diff']:
    for concept_id in [4, 37, 40, 44, 59, 77, 90, 105, 141, 147]:
        download_link = f"https://huggingface.co/YihuaiHong/olmo-7b_{method}_unlearn_on_id_{concept_id}_concept"
```
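As a convenience, the sketch below shows one way to fetch a checkpoint with `huggingface_hub` (already pinned in `requirements.txt`). The `local_dir` layout is an assumption chosen to mirror the `<base>/<method>/Full/<id>` paths the notebooks load from; adjust it to wherever your `model_dir` points.

```python
# Minimal download sketch, assuming the repo-name pattern above.
# The local_dir layout ("llama2-7b/<method>/Full/<id>") is an assumption that
# matches the model_name paths used inside the notebooks, not a requirement.
from huggingface_hub import snapshot_download

method, concept_id = "dpo", 16  # e.g., one LLaMA2-7B-chat checkpoint
snapshot_download(
    repo_id=f"YihuaiHong/llama2-7b_{method}_unlearn_on_id_{concept_id}_concept",
    local_dir=f"unlearn_results/llama2-7b/{method}/Full/{concept_id}",
)
```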
## 3. Patching or Restoring Experiments

Please run `Restoration_experiments_llama-30tokens.ipynb` or `Restoration_experiments_olmo-30tokens.ipynb`.
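Each notebook expects the vanilla model and the unlearned checkpoints on local disk (see the `model_dir` variables at the top of each notebook) plus the per-concept QA data under `data/`. For every sliding 5-layer window, the notebooks patch MLP coefficients, patch attention outputs, or restore MLP value-vector weights from the vanilla model into the unlearned model, and score recovery with a Knowledge Recovery Score. A rough, self-contained reading of the notebooks' `compute_loss` / `compute_KRS` helpers:

```python
import torch

def mse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.mean((a - b) ** 2)  # same loss the notebooks use

# Knowledge Recovery Score as implemented by the notebooks' compute_KRS helper:
# 1.0 = patching fully recovers the vanilla logits, 0.0 = no recovery at all.
def krs(original_logits, patched_logits, unlearned_logits):
    return 1 - mse(original_logits, patched_logits) / mse(original_logits, unlearned_logits)
```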
## How to Cite
```
@misc{hong2024dissectingfinetuningunlearninglarge,
      title={Dissecting Fine-Tuning Unlearning in Large Language Models},
      author={Yihuai Hong and Yuelin Zou and Lijie Hu and Ziqian Zeng and Di Wang and Haiqin Yang},
      year={2024},
      eprint={2410.06606},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.06606},
}
```
--------------------------------------------------------------------------------
/Dissecting-FT-Unlearning/Restoration_experiments_olmo-30tokens.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "import torch\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    gpu_count = torch.cuda.device_count()\n",
    "    print(f\"Found {gpu_count} GPU(s) available.\")\n",
    "\n",
    "    for i in range(gpu_count):\n",
    "        gpu_name = torch.cuda.get_device_name(i)\n",
    "        print(f\"GPU {i + 1}: {gpu_name}\")\n",
    "else:\n",
    "    print(\"No GPU available.\")\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "import copy\n",
    "from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction\n",
    "from rouge import Rouge\n",
    "# from bert_score import score\n",
    "import statistics\n",
    "from ast import literal_eval\n",
    "import functools\n",
    "import json\n",
    "import os\n",
    "import random\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import torch.nn.functional as F\n",
    "\n",
    "random.seed(8888)\n",
    "torch.manual_seed(8888)\n",
    "np.random.seed(8888)\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    torch.cuda.manual_seed(8888)\n",
    "    torch.cuda.manual_seed_all(8888)\n",
    "\n",
    "\n",
    "from tqdm import tqdm\n",
    "\n",
    "torch.set_grad_enabled(False)\n",
    "tqdm.pandas()"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "torch.cuda.set_device(0)\n",
    "\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/transformers\"\n",
    "original_model_name = \"OLMo-7B\"\n",
    "\n",
    "\n",
    "original_model = AutoModelForCausalLM.from_pretrained(\n",
    "    join(model_dir, original_model_name),\n",
    "    torch_dtype=torch.bfloat16,\n",
    "    trust_remote_code=True\n",
    ")\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(join(model_dir, original_model_name))\n",
    "tokenizer.pad_token = tokenizer.eos_token\n",
    "tokenizer.padding_side = \"left\"\n",
    "\n",
    "original_model.to('cuda');\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "original_model"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "tags": []
   },
   "source": [
    "import torch.nn.functional as F\n",
    "\n",
    "def set_act_get_hooks(model, mlp_coef=False, attn=False):\n",
    "    \"\"\"\n",
    "    Works on OLMo; captures MLP coefficients or attention output values.\n",
    "    \"\"\"\n",
    "\n",
    "    for attr in [\"activations_\"]:\n",
    "        if not hasattr(model, attr):\n",
    "            setattr(model, attr, {})\n",
    "\n",
    "    def get_activation(name):\n",
    "        def hook(module, input, output):\n",
    "            if \"m_coef\" in name:\n",
    "                # the input of ff_out is the vector of MLP coefficients\n",
    "                model.activations_[name] = input[0][:, :].detach()\n",
    "            elif \"attn\" in name:\n",
    "                model.activations_[name] = output[:, :].detach()\n",
    "\n",
    "        return hook\n",
    "\n",
    "    hooks = []\n",
    "    for i in range(32):\n",
    "        if mlp_coef is True: # MLP coefficients\n",
    "            hooks.append(model.model.transformer.blocks[i].ff_out.register_forward_hook(get_activation(\"m_coef_\" + str(i))))\n",
    "        if attn is True:\n",
    "            hooks.append(model.model.transformer.blocks[i].attn_out.register_forward_hook(get_activation(\"attn_\" + str(i))))\n",
    "\n",
    "    return hooks\n",
    "\n",
    "def set_act_modify_hooks(model, original_model, mlp_coef, attn, ori_mlp_coef, ori_attn, un_mlp_coef, un_attn, layers=None):\n",
    "    \"\"\"\n",
    "    Works on OLMo; overwrites MLP coefficients or attention output values.\n",
    "    \"\"\"\n",
    "\n",
    "    def modify_activation(name, update_value, patch_input):\n",
    "        def pre_hook(module, input):\n",
    "            if \"m_coef\" in name:\n",
    "                input[0][:, :, :] = update_value\n",
    "\n",
    "        def post_hook(module, input, output):\n",
    "            if \"attn\" in name:\n",
    "                output[:, :, :] = update_value\n",
    "\n",
    "        if patch_input:\n",
    "            return pre_hook\n",
    "        else:\n",
    "            return post_hook\n",
    "\n",
    "    hooks = []\n",
    "    if layers is not None:\n",
    "        for i in range(32):\n",
    "            if i in layers and mlp_coef is True: # MLP coefficients\n",
    "                hooks.append(model.model.transformer.blocks[i].ff_out.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = ori_mlp_coef[i], patch_input=True)))\n",
    "            # else:\n",
    "            #     hooks.append(model.model.layers[i].mlp.down_proj.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = un_mlp_coef[i], patch_input=True)))\n",
    "            if i in layers and attn is True:\n",
    "                hooks.append(model.model.transformer.blocks[i].attn_out.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = ori_attn[i], patch_input=False)))\n",
    "            # else:\n",
    "            #     hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = un_attn[i], patch_input=False)))\n",
    "\n",
    "    return hooks\n",
    "\n",
    "def remove_hooks(hooks):\n",
    "    for hook in hooks:\n",
    "        hook.remove()\n",
    "\n",
    "\n",
    "def compute_loss(target_logits, logits):\n",
    "    # MSE between the vanilla model's logits and the current model's logits\n",
    "    return torch.mean((target_logits - logits) ** 2)\n",
    "\n",
    "\n",
    "def compute_KRS(ori_loss, new_loss):\n",
    "    # Knowledge Recovery Score: 1 = patching fully recovers the vanilla logits,\n",
    "    # 0 = no improvement over the unlearned model\n",
    "    return 1 - (new_loss / ori_loss)\n"
   ],
   "outputs": [],
   "execution_count": null
  },
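  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# [Added example] Quick sanity check of the KRS metric on toy tensors.\n",
    "# A minimal sketch that only assumes the compute_loss / compute_KRS helpers\n",
    "# defined above; the shapes imitate (batch, tokens, vocab) logits.\n",
    "toy_original = torch.randn(4, 30, 100)\n",
    "toy_unlearned = toy_original + 1.0 * torch.randn_like(toy_original)  # large gap\n",
    "toy_patched = toy_original + 0.1 * torch.randn_like(toy_original)    # small gap\n",
    "\n",
    "toy_ori_loss = compute_loss(toy_original, toy_unlearned)  # unlearned-vs-vanilla gap\n",
    "toy_new_loss = compute_loss(toy_original, toy_patched)    # gap after patching\n",
    "print(f\"toy KRS = {compute_KRS(toy_ori_loss, toy_new_loss):.3f} (close to 1 => knowledge restored)\")\n"
   ],
   "outputs": [],
   "execution_count": null
  },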
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# loading data:\n",
    "\n",
    "with open(\"data/olmo-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n",
    "    data = json.load(file)\n",
    "    \n",
    "for ix, item in enumerate(data):\n",
    "    print(ix, \" \", item['Concept'])\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Evaluate the original model and record its greedy next-30-token answers\n",
    "\n",
    "for index, x in enumerate(tqdm(data)):\n",
    "    \n",
    "    questions = []\n",
    "    n_new_tokens = 30\n",
    "    for idx, question in enumerate(x['QA']):\n",
    "        questions.append(f\"Question: {question}\\n Answer:\")\n",
    "\n",
    "    inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "    input_length = inputs.input_ids.size(1)\n",
    "    with torch.no_grad():\n",
    "        generation_output = original_model.generate(\n",
    "            **inputs,\n",
    "            do_sample=False,\n",
    "            max_new_tokens=n_new_tokens,\n",
    "        )\n",
    "    \n",
    "    eos_token_id = tokenizer.eos_token_id\n",
    "    normal_tokens_num_list = []\n",
    "    for i, output in enumerate(generation_output):\n",
    "        eos_count = (output == eos_token_id).sum().item()\n",
    "        total_tokens = output.size(0)\n",
    "        # number of real (non-EOS) answer tokens generated after the prompt\n",
    "        normal_tokens_count = total_tokens - eos_count - inputs.input_ids.size(1)\n",
    "        normal_tokens_num_list.append(normal_tokens_count)\n",
    "    \n",
    "    outputs = tokenizer.batch_decode(generation_output[:, :], skip_special_tokens=True)\n",
    "    \n",
    "    data[index]['QA_with_answers'] = outputs\n",
    "    data[index]['answers_tokens_num'] = normal_tokens_num_list\n",
    "\n",
    "with open(\"data/olmo-7b_concepts_test_30tokens_answers.json\", \"w\", encoding=\"utf-8\") as file:\n",
    "    json.dump(data, file, ensure_ascii=False, indent=4)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "with open(\"data/olmo-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n",
    "    data = json.load(file)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Collect activations and baseline losses on the vanilla and unlearned models\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "methods_ori_loss_list = []\n",
    "methods_ori_attn_list = []\n",
    "methods_ori_mlp_coef_list = []\n",
    "methods_un_mlp_coef_list = []\n",
    "methods_un_attn_list = []\n",
    "\n",
    "for method in ['grad_diff', 'dpo']:\n",
    "    ori_loss_list = []\n",
    "    ori_attn_list = []\n",
    "    ori_mlp_coef_list = []\n",
    "    un_mlp_coef_list = []\n",
    "    un_attn_list = []\n",
    "    \n",
    "    for x in tqdm(data):\n",
    "        # Load the unlearned model for this concept\n",
    "\n",
    "        model_name = f\"olmo-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda')\n",
    "        \n",
    "        questions = x['QA_with_answers']\n",
    "\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
    "\n",
    "        un_hooks = set_act_get_hooks(model, mlp_coef=True, attn=True)\n",
    "        with torch.no_grad():\n",
    "            output = model(**inputs)\n",
    "\n",
    "        # remove hooks\n",
    "        remove_hooks(un_hooks)\n",
    "\n",
    "        hooks = set_act_get_hooks(original_model, mlp_coef=True, attn=True)\n",
    "        with torch.no_grad():\n",
    "            original_output = original_model(**inputs)\n",
    "        # remove hooks\n",
    "        remove_hooks(hooks)\n",
    "        \n",
    "        loss_list = []\n",
    "        for i, length in enumerate(x['answers_tokens_num']):\n",
    "            loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n",
    "            loss_list.append(loss)\n",
    "        \n",
    "        un_mlp_coef = []\n",
    "        un_attn = []\n",
    "        ori_mlp_coef = []\n",
    "        ori_attn = []\n",
    "        for layer in range(32):\n",
    "            un_mlp_coef.append(model.activations_[f'm_coef_{layer}'])\n",
    "            un_attn.append(model.activations_[f'attn_{layer}'])\n",
    "            ori_mlp_coef.append(original_model.activations_[f'm_coef_{layer}'])\n",
    "            ori_attn.append(original_model.activations_[f'attn_{layer}'])\n",
    "\n",
    "\n",
    "        ori_loss_list.append(loss_list)\n",
    "\n",
    "        un_mlp_coef_list.append(un_mlp_coef) # on all data\n",
    "        un_attn_list.append(un_attn) # on all data\n",
    "\n",
    "        ori_mlp_coef_list.append(ori_mlp_coef) # on all data\n",
    "        ori_attn_list.append(ori_attn) # on all data\n",
    "\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "    methods_ori_loss_list.append(ori_loss_list)\n",
    "    methods_ori_attn_list.append(ori_attn_list)\n",
    "    methods_ori_mlp_coef_list.append(ori_mlp_coef_list)\n",
    "    methods_un_mlp_coef_list.append(un_mlp_coef_list)\n",
    "    methods_un_attn_list.append(un_attn_list)\n",
    "    
\n", 385 | "\n" 386 | ], 387 | "outputs": [], 388 | "execution_count": null 389 | }, 390 | { 391 | "cell_type": "code", 392 | "metadata": { 393 | "tags": [] 394 | }, 395 | "source": [ 396 | "#patching on MLP coefficients on unlearned models\n", 397 | "torch.cuda.set_device(0)\n", 398 | "\n", 399 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 400 | "import json\n", 401 | "from os.path import join\n", 402 | "\n", 403 | "\n", 404 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 405 | "\n", 406 | "\n", 407 | "methods_mlp_coef_patch_loss_lists = []\n", 408 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 409 | " mlp_coef_patch_loss_lists = []\n", 410 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 411 | " ori_attn_list = methods_ori_attn_list[j]\n", 412 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 413 | " un_attn_list = methods_un_attn_list[j]\n", 414 | " \n", 415 | " for index, x in enumerate(tqdm(data)):\n", 416 | "\n", 417 | " # Loading new unlearned model for certain concept \n", 418 | "\n", 419 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\" \n", 420 | "\n", 421 | " model = AutoModelForCausalLM.from_pretrained(\n", 422 | " join(model_dir, model_name),\n", 423 | " torch_dtype=torch.bfloat16,\n", 424 | " trust_remote_code=True\n", 425 | " );\n", 426 | "\n", 427 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 428 | " tokenizer.pad_token = tokenizer.eos_token\n", 429 | " tokenizer.padding_side = \"left\"\n", 430 | " model.to('cuda');\n", 431 | "\n", 432 | " questions = x['QA_with_answers']\n", 433 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 434 | " input_length = inputs.attention_mask.sum(dim=1)\n", 435 | "\n", 436 | " with torch.no_grad():\n", 437 | " original_output = original_model(**inputs)\n", 438 | "\n", 439 | " patch_loss_list = []\n", 440 | " for layer in range(32):\n", 441 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=False, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n", 442 | " with torch.no_grad():\n", 443 | " output = model(**inputs)\n", 444 | " # remove hooks\n", 445 | " remove_hooks(hooks) \n", 446 | " \n", 447 | " loss_list = []\n", 448 | " for i, length in enumerate(x['answers_tokens_num']):\n", 449 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 450 | " loss_list.append(loss)\n", 451 | " \n", 452 | " patch_loss_list.append(loss_list)\n", 453 | "\n", 454 | " mlp_coef_patch_loss_lists.append(patch_loss_list) \n", 455 | "\n", 456 | "\n", 457 | " del model\n", 458 | " del tokenizer\n", 459 | " torch.cuda.empty_cache()\n", 460 | "\n", 461 | " methods_mlp_coef_patch_loss_lists.append(mlp_coef_patch_loss_lists)\n", 462 | "\n" 463 | ], 464 | "outputs": [], 465 | "execution_count": null 466 | }, 467 | { 468 | "cell_type": "code", 469 | "metadata": { 470 | "tags": [] 471 | }, 472 | "source": [ 473 | "\n", 474 | "print('recover on OLMo by patching on coef (5 layer per patch)')\n", 475 | " \n", 476 | "coef_KRS_list_all = []\n", 477 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 478 | " print(f'on {method}')\n", 479 | " coef_KRS_list_per_method = []\n", 480 | " for patch_losses, ori_losses in zip(methods_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 481 | " 
coef_KRS_list_per_data = []\n", 482 | " for new_losses in patch_losses: #different_layer\n", 483 | " coef_KRS_list_per_layer = []\n", 484 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 485 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 486 | " coef_KRS_list_per_layer.append(KRS.cpu().float())\n", 487 | " \n", 488 | " coef_KRS_list_per_data.append(np.mean(coef_KRS_list_per_layer, axis=-1))\n", 489 | " coef_KRS_list_per_method.append(coef_KRS_list_per_data)\n", 490 | " \n", 491 | " coef_KRS_list_per_method = np.array(coef_KRS_list_per_method)\n", 492 | " avg_coef_KRS_list_per_method = np.mean(coef_KRS_list_per_method, axis=0)\n", 493 | " \n", 494 | " coef_KRS_list_all.append(avg_coef_KRS_list_per_method)\n", 495 | "\n", 496 | "coef_KRS_list_all = np.array(coef_KRS_list_all)\n", 497 | "avg_coef_KRS_list_all = np.mean(coef_KRS_list_all, axis=0)\n", 498 | "print(f'avg_coef_KRS_list_all: {list(avg_coef_KRS_list_all)}')" 499 | ], 500 | "outputs": [], 501 | "execution_count": null 502 | }, 503 | { 504 | "cell_type": "code", 505 | "metadata": {}, 506 | "source": [ 507 | "#patching on Attention states on unlearned models\n", 508 | "torch.cuda.set_device(0)\n", 509 | "\n", 510 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 511 | "import json\n", 512 | "from os.path import join\n", 513 | "\n", 514 | "\n", 515 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 516 | "\n", 517 | "methods_attn_patch_loss_lists = []\n", 518 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 519 | " attn_patch_loss_lists = []\n", 520 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 521 | " ori_attn_list = methods_ori_attn_list[j]\n", 522 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 523 | " un_attn_list = methods_un_attn_list[j]\n", 524 | " \n", 525 | " for index, x in enumerate(tqdm(data)):\n", 526 | "\n", 527 | " # Loading new unlearned model for certain concept \n", 528 | "\n", 529 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\"\n", 530 | "\n", 531 | " model = AutoModelForCausalLM.from_pretrained(\n", 532 | " join(model_dir, model_name),\n", 533 | " torch_dtype=torch.bfloat16,\n", 534 | " trust_remote_code=True\n", 535 | " );\n", 536 | "\n", 537 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 538 | " tokenizer.pad_token = tokenizer.eos_token\n", 539 | " tokenizer.padding_side = \"left\"\n", 540 | " model.to('cuda');\n", 541 | "\n", 542 | " questions = x['QA_with_answers']\n", 543 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 544 | " input_length = inputs.attention_mask.sum(dim=1)\n", 545 | "\n", 546 | " with torch.no_grad():\n", 547 | " original_output = original_model(**inputs)\n", 548 | "\n", 549 | " patch_loss_list = []\n", 550 | " for layer in range(32):\n", 551 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=False, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2]) #[i for i in range(32)]\n", 552 | " with torch.no_grad():\n", 553 | " output = model(**inputs)\n", 554 | " # remove hooks\n", 555 | " remove_hooks(hooks) \n", 556 | " \n", 557 | " loss_list = []\n", 558 | " for i, length in enumerate(x['answers_tokens_num']):\n", 559 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 560 | " loss_list.append(loss)\n", 
561 | " \n", 562 | " patch_loss_list.append(loss_list)\n", 563 | "\n", 564 | " attn_patch_loss_lists.append(patch_loss_list) \n", 565 | "\n", 566 | "\n", 567 | " del model\n", 568 | " del tokenizer\n", 569 | " torch.cuda.empty_cache()\n", 570 | "\n", 571 | " methods_attn_patch_loss_lists.append(attn_patch_loss_lists)" 572 | ], 573 | "outputs": [], 574 | "execution_count": null 575 | }, 576 | { 577 | "cell_type": "code", 578 | "metadata": { 579 | "tags": [] 580 | }, 581 | "source": [ 582 | "\n", 583 | "\n", 584 | "print('recover on OLMo by patching on attn (5 layer per patch)')\n", 585 | " \n", 586 | "attn_KRS_list_all = []\n", 587 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 588 | " print(f'on {method}')\n", 589 | " attn_KRS_list_per_method = []\n", 590 | " for patch_losses, ori_losses in zip(methods_attn_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 591 | " attn_KRS_list_per_data = []\n", 592 | " for new_losses in patch_losses: #different_layer\n", 593 | " attn_KRS_list_per_layer = []\n", 594 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 595 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 596 | " attn_KRS_list_per_layer.append(KRS.cpu().float())\n", 597 | " \n", 598 | " attn_KRS_list_per_data.append(np.mean(attn_KRS_list_per_layer, axis=-1))\n", 599 | " attn_KRS_list_per_method.append(attn_KRS_list_per_data)\n", 600 | " \n", 601 | " attn_KRS_list_per_method = np.array(attn_KRS_list_per_method)\n", 602 | " avg_attn_KRS_list_per_method = np.mean(attn_KRS_list_per_method, axis=0)\n", 603 | " \n", 604 | " attn_KRS_list_all.append(avg_attn_KRS_list_per_method)\n", 605 | "\n", 606 | "attn_KRS_list_all = np.array(attn_KRS_list_all)\n", 607 | "avg_attn_KRS_list_all = np.mean(attn_KRS_list_all, axis=0)\n", 608 | "print(f'avg_attn_KRS_list_all: {list(avg_attn_KRS_list_all)}')" 609 | ], 610 | "outputs": [], 611 | "execution_count": null 612 | }, 613 | { 614 | "cell_type": "code", 615 | "metadata": {}, 616 | "source": [ 617 | "#patching on both MLP coef and Attention states on unlearned models\n", 618 | "\n", 619 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 620 | "import json\n", 621 | "from os.path import join\n", 622 | "\n", 623 | "\n", 624 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 625 | "\n", 626 | "\n", 627 | "methods_attn_mlp_coef_patch_loss_lists = []\n", 628 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 629 | " attn_mlp_coef_patch_loss_lists = []\n", 630 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 631 | " ori_attn_list = methods_ori_attn_list[j]\n", 632 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 633 | " un_attn_list = methods_un_attn_list[j]\n", 634 | "\n", 635 | " for index, x in enumerate(tqdm(data)):\n", 636 | "\n", 637 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\" \n", 638 | "\n", 639 | " model = AutoModelForCausalLM.from_pretrained(\n", 640 | " join(model_dir, model_name),\n", 641 | " torch_dtype=torch.bfloat16,\n", 642 | " trust_remote_code=True\n", 643 | " )\n", 644 | "\n", 645 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 646 | " tokenizer.pad_token = tokenizer.eos_token\n", 647 | " tokenizer.padding_side = \"left\"\n", 648 | " model.to('cuda');\n", 649 | "\n", 650 | " questions = x['QA_with_answers']\n", 651 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 652 | " input_length = inputs.attention_mask.sum(dim=1)\n", 653 | "\n", 654 | " with 
torch.no_grad():\n", 655 | " original_output = original_model(**inputs)\n", 656 | "\n", 657 | " patch_loss_list = []\n", 658 | " for layer in range(32):\n", 659 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n", 660 | " with torch.no_grad():\n", 661 | " output = model(**inputs)\n", 662 | " # remove hooks\n", 663 | " remove_hooks(hooks) \n", 664 | " \n", 665 | " loss_list = []\n", 666 | " for i, length in enumerate(x['answers_tokens_num']):\n", 667 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 668 | " loss_list.append(loss)\n", 669 | " \n", 670 | " patch_loss_list.append(loss_list)\n", 671 | "\n", 672 | " attn_mlp_coef_patch_loss_lists.append(patch_loss_list) \n", 673 | "\n", 674 | "\n", 675 | " del model\n", 676 | " del tokenizer\n", 677 | " torch.cuda.empty_cache()\n", 678 | " \n", 679 | " methods_attn_mlp_coef_patch_loss_lists.append(attn_mlp_coef_patch_loss_lists)\n" 680 | ], 681 | "outputs": [], 682 | "execution_count": null 683 | }, 684 | { 685 | "cell_type": "code", 686 | "metadata": { 687 | "tags": [] 688 | }, 689 | "source": [ 690 | "\n", 691 | "print('recover on OLMo by patching on both coef and attn (5 layer per patch)')\n", 692 | " \n", 693 | "attn_mlp_coef_KRS_list_all = []\n", 694 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 695 | " print(f'on {method}')\n", 696 | " attn_mlp_coef_KRS_list_per_method = []\n", 697 | " for patch_losses, ori_losses in zip(methods_attn_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 698 | " attn_mlp_coef_KRS_list_per_data = []\n", 699 | " for new_losses in patch_losses: #different_layer\n", 700 | " attn_mlp_coef_KRS_list_per_layer = []\n", 701 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 702 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 703 | " attn_mlp_coef_KRS_list_per_layer.append(KRS.cpu().float())\n", 704 | " \n", 705 | " attn_mlp_coef_KRS_list_per_data.append(np.mean(attn_mlp_coef_KRS_list_per_layer, axis=-1))\n", 706 | " attn_mlp_coef_KRS_list_per_method.append(attn_mlp_coef_KRS_list_per_data)\n", 707 | " \n", 708 | " attn_mlp_coef_KRS_list_per_method = np.array(attn_mlp_coef_KRS_list_per_method)\n", 709 | " avg_attn_mlp_coef_KRS_list_per_method = np.mean(attn_mlp_coef_KRS_list_per_method, axis=0)\n", 710 | " \n", 711 | " attn_mlp_coef_KRS_list_all.append(avg_attn_mlp_coef_KRS_list_per_method)\n", 712 | "\n", 713 | "attn_mlp_coef_KRS_list_all = np.array(attn_mlp_coef_KRS_list_all)\n", 714 | "avg_attn_mlp_coef_KRS_list_all = np.mean(attn_mlp_coef_KRS_list_all, axis=0)\n", 715 | "print(f'avg_attn_mlp_coef_KRS_list_all: {list(avg_attn_mlp_coef_KRS_list_all)}')" 716 | ], 717 | "outputs": [], 718 | "execution_count": null 719 | }, 720 | { 721 | "cell_type": "code", 722 | "metadata": {}, 723 | "source": [ 724 | "#Restore MLP value vectors on unlearned models\n", 725 | "\n", 726 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 727 | "import json\n", 728 | "from os.path import join\n", 729 | "import copy\n", 730 | "\n", 731 | "global old_params\n", 732 | "\n", 733 | "\n", 734 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 735 | "\n", 736 | "methods_vectors_patch_loss_lists = []\n", 737 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 738 | " vectors_patch_loss_lists = []\n", 
739 | " ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 740 | " ori_attn_list = methods_ori_attn_list[j]\n", 741 | " un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 742 | " un_attn_list = methods_un_attn_list[j]\n", 743 | "\n", 744 | " for index, x in enumerate(tqdm(data)):\n", 745 | "\n", 746 | " model_name = f\"olmo-7b/{method}/Full/{x['id']}\" \n", 747 | "\n", 748 | " model = AutoModelForCausalLM.from_pretrained(\n", 749 | " join(model_dir, model_name),\n", 750 | " torch_dtype=torch.bfloat16,\n", 751 | " trust_remote_code=True\n", 752 | " )\n", 753 | "\n", 754 | " old_params = copy.deepcopy(model.state_dict())\n", 755 | "\n", 756 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 757 | " tokenizer.pad_token = tokenizer.eos_token\n", 758 | " tokenizer.padding_side = \"left\"\n", 759 | " model.to('cuda')\n", 760 | "\n", 761 | " questions = x['QA_with_answers']\n", 762 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 763 | " input_length = inputs.attention_mask.sum(dim=1)\n", 764 | "\n", 765 | " with torch.no_grad():\n", 766 | " original_output = original_model(**inputs)\n", 767 | "\n", 768 | " vectors_patch_loss_list = []\n", 769 | " for layer in range(32):\n", 770 | " for recover_layer in range(32): # select which layer's MLP value vectors to be recover:\n", 771 | " if recover_layer >= layer -2 and recover_layer <= layer + 2:\n", 772 | " model.state_dict()[f'model.transformer.blocks[{recover_layer}].ff_out.weight'][:, :] = original_model.state_dict()[f'model.transformer.blocks[{recover_layer}].ff_out.weight'][:, :] \n", 773 | " \n", 774 | " with torch.no_grad():\n", 775 | " output = model(**inputs)\n", 776 | "\n", 777 | " loss_list = []\n", 778 | " for i, length in enumerate(x['answers_tokens_num']):\n", 779 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 780 | " loss_list.append(loss)\n", 781 | "\n", 782 | " vectors_patch_loss_list.append(loss_list)\n", 783 | "\n", 784 | " model.load_state_dict(old_params)\n", 785 | " \n", 786 | " vectors_patch_loss_lists.append(vectors_patch_loss_list) \n", 787 | "\n", 788 | " del model\n", 789 | " del tokenizer\n", 790 | " torch.cuda.empty_cache()\n", 791 | " \n", 792 | " methods_vectors_patch_loss_lists.append(vectors_patch_loss_lists)\n", 793 | " \n", 794 | "\n" 795 | ], 796 | "outputs": [], 797 | "execution_count": null 798 | }, 799 | { 800 | "cell_type": "code", 801 | "metadata": { 802 | "tags": [] 803 | }, 804 | "source": [ 805 | "\n", 806 | "print('recover on OLMo by patching on value vectors (5 layer per patch)')\n", 807 | " \n", 808 | "vectors_KRS_list_all = []\n", 809 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 810 | " print(f'on {method}')\n", 811 | " vectors_KRS_list_per_method = []\n", 812 | " for patch_losses, ori_losses in zip(methods_vectors_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 813 | " vectors_KRS_list_per_data = []\n", 814 | " for new_losses in patch_losses: #different_layer\n", 815 | " vectors_KRS_list_per_layer = []\n", 816 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 817 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 818 | " vectors_KRS_list_per_layer.append(KRS.cpu().float())\n", 819 | " \n", 820 | " vectors_KRS_list_per_data.append(np.mean(vectors_KRS_list_per_layer, axis=-1))\n", 821 | " vectors_KRS_list_per_method.append(vectors_KRS_list_per_data)\n", 822 | " \n", 823 | " vectors_KRS_list_per_method = 
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "\n",
    "print('recover on OLMo by patching on value vectors (5 layer per patch)')\n",
    "    \n",
    "vectors_KRS_list_all = []\n",
    "for t, method in enumerate(['grad_diff', 'dpo']):\n",
    "    print(f'on {method}')\n",
    "    vectors_KRS_list_per_method = []\n",
    "    for patch_losses, ori_losses in zip(methods_vectors_patch_loss_lists[t], methods_ori_loss_list[t]):\n",
    "        vectors_KRS_list_per_data = []\n",
    "        for new_losses in patch_losses: # per patched layer window\n",
    "            vectors_KRS_list_per_layer = []\n",
    "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n",
    "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n",
    "                vectors_KRS_list_per_layer.append(KRS.cpu().float())\n",
    "            \n",
    "            vectors_KRS_list_per_data.append(np.mean(vectors_KRS_list_per_layer, axis=-1))\n",
    "        vectors_KRS_list_per_method.append(vectors_KRS_list_per_data)\n",
    "    \n",
    "    vectors_KRS_list_per_method = np.array(vectors_KRS_list_per_method)\n",
    "    avg_vectors_KRS_list_per_method = np.mean(vectors_KRS_list_per_method, axis=0)\n",
    "    \n",
    "    vectors_KRS_list_all.append(avg_vectors_KRS_list_per_method)\n",
    "\n",
    "vectors_KRS_list_all = np.array(vectors_KRS_list_all)\n",
    "avg_vectors_KRS_list_all = np.mean(vectors_KRS_list_all, axis=0)\n",
    "print(f'avg_vectors_KRS_list_all: {list(avg_vectors_KRS_list_all)}')"
   ],
   "outputs": [],
   "execution_count": null
  },
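  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# [Added example] Optional: plot the layer-wise KRS curves collected above.\n",
    "# A minimal sketch assuming the four avg_*_KRS_list_all arrays computed in the\n",
    "# aggregation cells; matplotlib is pinned in requirements.txt.\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "layer_idx = np.arange(len(avg_coef_KRS_list_all))\n",
    "plt.plot(layer_idx, avg_coef_KRS_list_all, label='MLP coef patch')\n",
    "plt.plot(layer_idx, avg_attn_KRS_list_all, label='Attn patch')\n",
    "plt.plot(layer_idx, avg_attn_mlp_coef_KRS_list_all, label='Attn + MLP coef patch')\n",
    "plt.plot(layer_idx, avg_vectors_KRS_list_all, label='MLP value-vector restore')\n",
    "plt.xlabel('center layer of the 5-layer patch window')\n",
    "plt.ylabel('Knowledge Recovery Score')\n",
    "plt.title('OLMo-7B: recovery per patched window')\n",
    "plt.legend()\n",
    "plt.show()\n"
   ],
   "outputs": [],
   "execution_count": null
  }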
"#wget.download(\"https://rome.baulab.info/data/dsets/known_1000.json\")\n", 91 | "\n", 92 | "\n", 93 | "model_dir = \"/mnt/workspace/workgroup/yhhong/transformers\"\n", 94 | "original_model_name = \"Llama-2-7b-chat-hf\" #\"chatglm3-6b\" #\"Llama-2-7b-chat-hf\" #\"Qwen1.5-7B-Chat\"\n", 95 | "\n", 96 | "\n", 97 | "original_model = AutoModelForCausalLM.from_pretrained(\n", 98 | " join(model_dir, original_model_name),\n", 99 | " torch_dtype=torch.bfloat16,\n", 100 | " trust_remote_code=True\n", 101 | ");\n", 102 | "\n", 103 | "tokenizer = AutoTokenizer.from_pretrained(join(model_dir, original_model_name))\n", 104 | "tokenizer.pad_token = tokenizer.eos_token\n", 105 | "tokenizer.padding_side = \"left\"\n", 106 | "\n", 107 | "original_model.to('cuda');\n" 108 | ], 109 | "outputs": [], 110 | "execution_count": null 111 | }, 112 | { 113 | "cell_type": "code", 114 | "metadata": { 115 | "tags": [] 116 | }, 117 | "source": [ 118 | "original_model" 119 | ], 120 | "outputs": [], 121 | "execution_count": null 122 | }, 123 | { 124 | "cell_type": "code", 125 | "metadata": { 126 | "collapsed": false, 127 | "jupyter": { 128 | "outputs_hidden": false 129 | }, 130 | "tags": [] 131 | }, 132 | "source": [ 133 | "import torch.nn.functional as F\n", 134 | "\n", 135 | "def set_act_get_hooks(model, mlp_coef=False, attn=False):\n", 136 | " \"\"\"\n", 137 | " Works on LLaMA-2, getting coef values or attn values\n", 138 | " \"\"\"\n", 139 | "\n", 140 | " for attr in [\"activations_\"]:\n", 141 | " if not hasattr(model, attr):\n", 142 | " setattr(model, attr, {})\n", 143 | "\n", 144 | " def get_activation(name):\n", 145 | " def hook(module, input, output):\n", 146 | " if \"m_coef\" in name:\n", 147 | " model.activations_[name] = input[0][:, :].detach() \n", 148 | " elif \"attn\" in name:\n", 149 | " model.activations_[name] = output[:, :].detach() \n", 150 | "\n", 151 | " return hook\n", 152 | "\n", 153 | " hooks = []\n", 154 | " for i in range(32):\n", 155 | " if mlp_coef is True: #co-effciency\n", 156 | " hooks.append(model.model.layers[i].mlp.down_proj.register_forward_hook(get_activation(\"m_coef_\" + str(i))))\n", 157 | " if attn is True:\n", 158 | " hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(get_activation(\"attn_\" + str(i)))) \n", 159 | " \n", 160 | "\n", 161 | " return hooks\n", 162 | "\n", 163 | "def set_act_modify_hooks(model, original_model, mlp_coef, attn, ori_mlp_coef, ori_attn, un_mlp_coef, un_attn, layers=None):\n", 164 | " \"\"\"\n", 165 | " Works on LLaMA-2, modifying coef values or attn values\n", 166 | " \"\"\"\n", 167 | "\n", 168 | " def modify_activation(name, update_value, patch_input):\n", 169 | " def pre_hook(module, input):\n", 170 | " if \"m_coef\" in name:\n", 171 | " input[0][:, :, :] = update_value \n", 172 | " \n", 173 | " def post_hook(module, input, output):\n", 174 | " if \"attn\" in name:\n", 175 | " output[:, :, :] = update_value\n", 176 | " \n", 177 | " if patch_input:\n", 178 | " return pre_hook\n", 179 | " else:\n", 180 | " return post_hook\n", 181 | "\n", 182 | " hooks = []\n", 183 | " if layers is not None:\n", 184 | " for i in range(32):\n", 185 | " if i in layers and mlp_coef is True: #co-effciency\n", 186 | " hooks.append(model.model.layers[i].mlp.down_proj.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = ori_mlp_coef[i], patch_input=True)))\n", 187 | " # else:\n", 188 | " # hooks.append(model.model.layers[i].mlp.down_proj.register_forward_pre_hook(modify_activation(\"m_coef_\" + str(i), update_value = 
un_mlp_coef[i], patch_input=True)))\n", 189 | " if i in layers and attn is True:\n", 190 | " hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = ori_attn[i], patch_input=False))) \n", 191 | " # else:\n", 192 | " # hooks.append(model.model.layers[i].self_attn.o_proj.register_forward_hook(modify_activation(\"attn_\" + str(i), update_value = un_attn[i], patch_input=False))) \n", 193 | "\n", 194 | " return hooks\n", 195 | "\n", 196 | "def remove_hooks(hooks):\n", 197 | " for hook in hooks:\n", 198 | " hook.remove()\n", 199 | "\n", 200 | " \n", 201 | "def compute_loss(target_logits, logits):\n", 202 | " \n", 203 | " #MSE loss\n", 204 | "\n", 205 | " return torch.mean((target_logits - logits) ** 2)\n", 206 | "\n", 207 | "\n", 208 | "def compute_KRS(ori_loss, new_loss): \n", 209 | " return 1 - (new_loss / ori_loss)\n", 210 | " \n", 211 | "\n", 212 | "\n" 213 | ], 214 | "outputs": [], 215 | "execution_count": null 216 | }, 217 | { 218 | "metadata": { 219 | "tags": [] 220 | }, 221 | "cell_type": "code", 222 | "source": [ 223 | "# loading data:\n", 224 | "\n", 225 | "with open(\"data/llama2-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n", 226 | " data = json.load(file)\n", 227 | " \n", 228 | "for ix, item in enumerate(data):\n", 229 | " print(ix,\" \",item['Concept'])\n", 230 | "\n" 231 | ], 232 | "outputs": [], 233 | "execution_count": null 234 | }, 235 | { 236 | "cell_type": "code", 237 | "metadata": { 238 | "tags": [] 239 | }, 240 | "source": [ 241 | "# evaluate original model and obtain the next 30 tokens answers\n", 242 | "\n", 243 | "for index, x in enumerate(tqdm(data)):\n", 244 | " \n", 245 | " questions = []\n", 246 | " n_new_tokens = 30\n", 247 | " for idx, question in enumerate(x['QA']):\n", 248 | " questions.append(f\"Question: {question}\\n Answer:\")\n", 249 | "\n", 250 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 251 | " input_length = inputs.input_ids.size(1)\n", 252 | " with torch.no_grad():\n", 253 | " generation_output = original_model.generate( # mt.model\n", 254 | " **inputs,\n", 255 | " do_sample=False,\n", 256 | " max_new_tokens=n_new_tokens,\n", 257 | " )\n", 258 | "\n", 259 | " \n", 260 | " eos_token_id = tokenizer.eos_token_id\n", 261 | " normal_tokens_num_list = []\n", 262 | " for i, output in enumerate(generation_output):\n", 263 | " eos_count = (output == eos_token_id).sum().item()\n", 264 | " total_tokens = output.size(0) \n", 265 | " normal_tokens_count = total_tokens - eos_count - inputs.input_ids.size(1)\n", 266 | " normal_tokens_num_list.append(normal_tokens_count)\n", 267 | " \n", 268 | " outputs = tokenizer.batch_decode(generation_output[:, :], skip_special_tokens=True)\n", 269 | " \n", 270 | " data[index]['QA_with_answers'] = outputs\n", 271 | " data[index]['answers_tokens_num'] = normal_tokens_num_list\n", 272 | "\n", 273 | "with open(\"data/llama2-7b_concepts_test_30tokens_answers.json\", \"w\", encoding=\"utf-8\") as file:\n", 274 | " json.dump(data, file, ensure_ascii=False, indent=4)" 275 | ], 276 | "outputs": [], 277 | "execution_count": null 278 | }, 279 | { 280 | "cell_type": "code", 281 | "metadata": { 282 | "tags": [] 283 | }, 284 | "source": [ 285 | "with open(\"data/llama2-7b_concepts_test_30tokens_answers.json\", \"r\", encoding=\"utf-8\") as file:\n", 286 | " data = json.load(file)" 287 | ], 288 | "outputs": [], 289 | "execution_count": null 290 | }, 291 | { 292 | 
"cell_type": "code", 293 | "metadata": { 294 | "tags": [] 295 | }, 296 | "source": [ 297 | "#on vanilla model and unlearned model\n", 298 | "\n", 299 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 300 | "import json\n", 301 | "from os.path import join\n", 302 | "\n", 303 | "\n", 304 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 305 | "\n", 306 | "methods_ori_loss_list = [] \n", 307 | "methods_ori_attn_list = []\n", 308 | "methods_ori_mlp_coef_list = []\n", 309 | "methods_un_mlp_coef_list = []\n", 310 | "methods_un_attn_list = []\n", 311 | "\n", 312 | "for method in ['grad_diff', 'dpo']:\n", 313 | " ori_loss_list = [] \n", 314 | " ori_attn_list = []\n", 315 | " ori_mlp_coef_list = []\n", 316 | " un_mlp_coef_list = []\n", 317 | " un_attn_list = []\n", 318 | " \n", 319 | " for x in tqdm(data):\n", 320 | " # Loading new unlearned model for certain concept \n", 321 | "\n", 322 | " model_name = f\"llama2-7b/{method}/Full/{x['id']}\" \n", 323 | "\n", 324 | " model = AutoModelForCausalLM.from_pretrained(\n", 325 | " join(model_dir, model_name),\n", 326 | " torch_dtype=torch.bfloat16,\n", 327 | " trust_remote_code=True\n", 328 | " );\n", 329 | "\n", 330 | " tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 331 | " tokenizer.pad_token = tokenizer.eos_token\n", 332 | " tokenizer.padding_side = \"left\"\n", 333 | " model.to('cuda')\n", 334 | " \n", 335 | " questions = x['QA_with_answers']\n", 336 | "\n", 337 | " inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 338 | " input_length = inputs.attention_mask.sum(dim=1)\n", 339 | "\n", 340 | " un_hooks = set_act_get_hooks(model, mlp_coef=True, attn=True)\n", 341 | " with torch.no_grad():\n", 342 | " output = model(**inputs)\n", 343 | "\n", 344 | " # remove hooks\n", 345 | " remove_hooks(un_hooks) \n", 346 | "\n", 347 | " hooks = set_act_get_hooks(original_model, mlp_coef=True, attn=True)\n", 348 | " with torch.no_grad():\n", 349 | " original_output = original_model(**inputs)\n", 350 | " # remove hooks\n", 351 | " remove_hooks(hooks) \n", 352 | " \n", 353 | " \n", 354 | " loss_list = []\n", 355 | " for i, length in enumerate(x['answers_tokens_num']):\n", 356 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 357 | " loss_list.append(loss)\n", 358 | " \n", 359 | " un_mlp_coef = []\n", 360 | " un_attn = []\n", 361 | " ori_mlp_coef = []\n", 362 | " ori_attn = []\n", 363 | " for layer in range(32):\n", 364 | " un_mlp_coef.append(model.activations_[f'm_coef_{layer}'])\n", 365 | " un_attn.append(model.activations_[f'attn_{layer}'])\n", 366 | " ori_mlp_coef.append(original_model.activations_[f'm_coef_{layer}'])\n", 367 | " ori_attn.append(original_model.activations_[f'attn_{layer}'])\n", 368 | "\n", 369 | "\n", 370 | " ori_loss_list.append(loss_list)\n", 371 | "\n", 372 | " un_mlp_coef_list.append(un_mlp_coef) # on all data\n", 373 | " un_attn_list.append(un_attn) # on all data\n", 374 | "\n", 375 | " ori_mlp_coef_list.append(ori_mlp_coef) # on all data\n", 376 | " ori_attn_list.append(ori_attn) # on all data\n", 377 | "\n", 378 | "\n", 379 | " del model\n", 380 | " del tokenizer\n", 381 | " torch.cuda.empty_cache()\n", 382 | "\n", 383 | " methods_ori_loss_list.append(ori_loss_list)\n", 384 | " methods_ori_attn_list.append(ori_attn_list)\n", 385 | " methods_ori_mlp_coef_list.append(ori_mlp_coef_list)\n", 386 | " methods_un_mlp_coef_list.append(un_mlp_coef_list)\n", 387 | " 
methods_un_attn_list.append(un_attn_list)\n",
    "    \n",
    "\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "#patching on MLP coefficients on unlearned models\n",
    "torch.cuda.set_device(0)\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "\n",
    "methods_mlp_coef_patch_loss_lists = []\n",
    "for j, method in enumerate(['grad_diff', 'dpo']):\n",
    "    mlp_coef_patch_loss_lists = []\n",
    "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n",
    "    ori_attn_list = methods_ori_attn_list[j]\n",
    "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n",
    "    un_attn_list = methods_un_attn_list[j]\n",
    "    \n",
    "    for index, x in enumerate(tqdm(data)):\n",
    "\n",
    "        # Load the unlearned model for this concept\n",
    "\n",
    "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda');\n",
    "\n",
    "        questions = x['QA_with_answers']\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
    "\n",
    "        with torch.no_grad():\n",
    "            original_output = original_model(**inputs)\n",
    "\n",
    "        patch_loss_list = []\n",
    "        for layer in range(32):\n",
    "            hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=False, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n",
    "            with torch.no_grad():\n",
    "                output = model(**inputs)\n",
    "            # remove hooks\n",
    "            remove_hooks(hooks)\n",
    "            \n",
    "            loss_list = []\n",
    "            for i, length in enumerate(x['answers_tokens_num']):\n",
    "                loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n",
    "                loss_list.append(loss)\n",
    "            \n",
    "            patch_loss_list.append(loss_list)\n",
    "\n",
    "        mlp_coef_patch_loss_lists.append(patch_loss_list)\n",
    "\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "    methods_mlp_coef_patch_loss_lists.append(mlp_coef_patch_loss_lists)\n"
   ],
   "outputs": [],
   "execution_count": null
  },
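  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "# [Added note] Shape bookkeeping for the aggregation cells below, assuming only\n",
    "# the lists built in the cells above:\n",
    "#   methods_mlp_coef_patch_loss_lists[method][concept][layer][question]\n",
    "#     = MSE to the vanilla logits after patching the 5-layer window centered on `layer`\n",
    "#   methods_ori_loss_list[method][concept][question]\n",
    "#     = baseline MSE of the unpatched unlearned model\n",
    "# The KRS cells convert these losses to recovery scores, then average over\n",
    "# questions, concepts, and finally the two unlearning methods.\n"
   ],
   "outputs": [],
   "execution_count": null
  },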
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "print('recover on LLaMA by patching on coef (5 layer per patch)')\n",
    "    \n",
    "coef_KRS_list_all = []\n",
    "for t, method in enumerate(['grad_diff', 'dpo']):\n",
    "    print(f'on {method}')\n",
    "    coef_KRS_list_per_method = []\n",
    "    for patch_losses, ori_losses in zip(methods_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n",
    "        coef_KRS_list_per_data = []\n",
    "        for new_losses in patch_losses: # per patched layer window\n",
    "            coef_KRS_list_per_layer = []\n",
    "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n",
    "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n",
    "                # .float() before numpy: bfloat16 tensors cannot be converted directly\n",
    "                coef_KRS_list_per_layer.append(KRS.cpu().float())\n",
    "            \n",
    "            coef_KRS_list_per_data.append(np.mean(coef_KRS_list_per_layer, axis=-1))\n",
    "        coef_KRS_list_per_method.append(coef_KRS_list_per_data)\n",
    "    \n",
    "    coef_KRS_list_per_method = np.array(coef_KRS_list_per_method)\n",
    "    avg_coef_KRS_list_per_method = np.mean(coef_KRS_list_per_method, axis=0)\n",
    "    \n",
    "    coef_KRS_list_all.append(avg_coef_KRS_list_per_method)\n",
    "\n",
    "coef_KRS_list_all = np.array(coef_KRS_list_all)\n",
    "avg_coef_KRS_list_all = np.mean(coef_KRS_list_all, axis=0)\n",
    "print(f'avg_coef_KRS_list_all: {list(avg_coef_KRS_list_all)}')"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "#patching on Attention states on unlearned models\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "methods_attn_patch_loss_lists = []\n",
    "for j, method in enumerate(['grad_diff', 'dpo']):\n",
    "    attn_patch_loss_lists = []\n",
    "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n",
    "    ori_attn_list = methods_ori_attn_list[j]\n",
    "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n",
    "    un_attn_list = methods_un_attn_list[j]\n",
    "    \n",
    "    for index, x in enumerate(tqdm(data)):\n",
    "\n",
    "        # Load the unlearned model for this concept\n",
    "\n",
    "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda');\n",
    "\n",
    "        questions = x['QA_with_answers']\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
    "\n",
    "        with torch.no_grad():\n",
    "            original_output = original_model(**inputs)\n",
    "\n",
    "        patch_loss_list = []\n",
    "        for layer in range(32):\n",
    "            hooks = set_act_modify_hooks(model, original_model, mlp_coef=False, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2]) #[i for i in range(32)]\n",
    "            with torch.no_grad():\n",
    "                output = model(**inputs)\n",
    "            # remove hooks\n",
    "            remove_hooks(hooks)\n",
    "            \n",
    "            loss_list = []\n",
    "            for i, length in enumerate(x['answers_tokens_num']):\n",
    "                loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n",
    "                loss_list.append(loss)\n",
    "            \n",
    "            patch_loss_list.append(loss_list)\n",
    "\n",
    "        attn_patch_loss_lists.append(patch_loss_list)\n",
    "\n",
    "\n",
    "        del model\n",
    "        del tokenizer\n",
    "        torch.cuda.empty_cache()\n",
    "\n",
    "    methods_attn_patch_loss_lists.append(attn_patch_loss_lists)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {
    "tags": []
   },
   "source": [
    "print('recover on LLaMA by patching on attn (5 layer per patch)')\n",
    "    \n",
    "attn_KRS_list_all = []\n",
    "for t, method in enumerate(['grad_diff', 'dpo']):\n",
    "    print(f'on {method}')\n",
    "    attn_KRS_list_per_method = []\n",
    "    for patch_losses, ori_losses in zip(methods_attn_patch_loss_lists[t], methods_ori_loss_list[t]):\n",
    "        attn_KRS_list_per_data = []\n",
    "        for new_losses in patch_losses: # per patched layer window\n",
    "            attn_KRS_list_per_layer = []\n",
    "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n",
    "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n",
    "                attn_KRS_list_per_layer.append(KRS.cpu().float())\n",
    "            \n",
    "            attn_KRS_list_per_data.append(np.mean(attn_KRS_list_per_layer, axis=-1))\n",
    "        attn_KRS_list_per_method.append(attn_KRS_list_per_data)\n",
    "    \n",
    "    attn_KRS_list_per_method = np.array(attn_KRS_list_per_method)\n",
    "    avg_attn_KRS_list_per_method = np.mean(attn_KRS_list_per_method, axis=0)\n",
    "    \n",
    "    attn_KRS_list_all.append(avg_attn_KRS_list_per_method)\n",
    "\n",
    "attn_KRS_list_all = np.array(attn_KRS_list_all)\n",
    "avg_attn_KRS_list_all = np.mean(attn_KRS_list_all, axis=0)\n",
    "print(f'avg_attn_KRS_list_all: {list(avg_attn_KRS_list_all)}')"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "#patching on both MLP coef and Attention states on unlearned models\n",
    "\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "import json\n",
    "from os.path import join\n",
    "\n",
    "\n",
    "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n",
    "\n",
    "methods_attn_mlp_coef_patch_loss_lists = []\n",
    "for j, method in enumerate(['grad_diff', 'dpo']):\n",
    "    attn_mlp_coef_patch_loss_lists = []\n",
    "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n",
    "    ori_attn_list = methods_ori_attn_list[j]\n",
    "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n",
    "    un_attn_list = methods_un_attn_list[j]\n",
    "\n",
    "    for index, x in enumerate(tqdm(data)):\n",
    "\n",
    "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n",
    "\n",
    "        model = AutoModelForCausalLM.from_pretrained(\n",
    "            join(model_dir, model_name),\n",
    "            torch_dtype=torch.bfloat16,\n",
    "            trust_remote_code=True\n",
    "        )\n",
    "\n",
    "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n",
    "        tokenizer.pad_token = tokenizer.eos_token\n",
    "        tokenizer.padding_side = \"left\"\n",
    "        model.to('cuda');\n",
    "\n",
    "        questions = x['QA_with_answers']\n",
    "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n",
    "        input_length = inputs.attention_mask.sum(dim=1)\n",
"\n", 653 | " with torch.no_grad():\n", 654 | " original_output = original_model(**inputs)\n", 655 | "\n", 656 | " patch_loss_list = []\n", 657 | " for layer in range(32):\n", 658 | " hooks = set_act_modify_hooks(model, original_model, mlp_coef=True, attn=True, ori_mlp_coef=ori_mlp_coef_list[index], ori_attn=ori_attn_list[index], un_mlp_coef=un_mlp_coef_list[index], un_attn=un_attn_list[index], layers = [layer-2, layer-1, layer, layer+1, layer+2])\n", 659 | " with torch.no_grad():\n", 660 | " output = model(**inputs)\n", 661 | " # remove hooks\n", 662 | " remove_hooks(hooks) \n", 663 | " \n", 664 | " loss_list = []\n", 665 | " for i, length in enumerate(x['answers_tokens_num']):\n", 666 | " loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 667 | " loss_list.append(loss)\n", 668 | " \n", 669 | " patch_loss_list.append(loss_list)\n", 670 | "\n", 671 | " attn_mlp_coef_patch_loss_lists.append(patch_loss_list) \n", 672 | "\n", 673 | "\n", 674 | " del model\n", 675 | " del tokenizer\n", 676 | " torch.cuda.empty_cache()\n", 677 | " \n", 678 | " methods_attn_mlp_coef_patch_loss_lists.append(attn_mlp_coef_patch_loss_lists)\n", 679 | "\n" 680 | ], 681 | "outputs": [], 682 | "execution_count": null 683 | }, 684 | { 685 | "cell_type": "code", 686 | "metadata": { 687 | "tags": [] 688 | }, 689 | "source": [ 690 | "\n", 691 | "print('recover on LLaMA by patching on both coef and attn (5 layer per patch)')\n", 692 | " \n", 693 | "attn_mlp_coef_KRS_list_all = []\n", 694 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 695 | " print(f'on {method}')\n", 696 | " attn_mlp_coef_KRS_list_per_method = []\n", 697 | " for patch_losses, ori_losses in zip(methods_attn_mlp_coef_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 698 | " attn_mlp_coef_KRS_list_per_data = []\n", 699 | " for new_losses in patch_losses: #different_layer\n", 700 | " attn_mlp_coef_KRS_list_per_layer = []\n", 701 | " for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 702 | " KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 703 | " attn_mlp_coef_KRS_list_per_layer.append(KRS.cpu())\n", 704 | " \n", 705 | " attn_mlp_coef_KRS_list_per_data.append(np.mean(attn_mlp_coef_KRS_list_per_layer, axis=-1))\n", 706 | " attn_mlp_coef_KRS_list_per_method.append(attn_mlp_coef_KRS_list_per_data)\n", 707 | " \n", 708 | " attn_mlp_coef_KRS_list_per_method = np.array(attn_mlp_coef_KRS_list_per_method)\n", 709 | " avg_attn_mlp_coef_KRS_list_per_method = np.mean(attn_mlp_coef_KRS_list_per_method, axis=0)\n", 710 | " \n", 711 | " attn_mlp_coef_KRS_list_all.append(avg_attn_mlp_coef_KRS_list_per_method)\n", 712 | "\n", 713 | "attn_mlp_coef_KRS_list_all = np.array(attn_mlp_coef_KRS_list_all)\n", 714 | "avg_attn_mlp_coef_KRS_list_all = np.mean(attn_mlp_coef_KRS_list_all, axis=0)\n", 715 | "print(f'avg_attn_mlp_coef_KRS_list_all: {list(avg_attn_mlp_coef_KRS_list_all)}')" 716 | ], 717 | "outputs": [], 718 | "execution_count": null 719 | }, 720 | { 721 | "cell_type": "code", 722 | "metadata": {}, 723 | "source": [ 724 | "#Restore MLP value vectors on unlearned models\n", 725 | "\n", 726 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 727 | "import json\n", 728 | "from os.path import join\n", 729 | "import copy\n", 730 | "\n", 731 | "global old_params\n", 732 | "\n", 733 | "\n", 734 | "model_dir = \"/mnt/workspace/workgroup/yhhong/unlearn_results\"\n", 735 | "\n", 736 | "methods_vectors_patch_loss_lists = []\n", 737 | "for j, method in enumerate(['grad_diff', 'dpo']):\n", 738 | " 
vectors_patch_loss_lists = []\n", 739 | "    ori_mlp_coef_list = methods_ori_mlp_coef_list[j]\n", 740 | "    ori_attn_list = methods_ori_attn_list[j]\n", 741 | "    un_mlp_coef_list = methods_un_mlp_coef_list[j]\n", 742 | "    un_attn_list = methods_un_attn_list[j]\n", 743 | "\n", 744 | "    for index, x in enumerate(tqdm(data)):\n", 745 | "\n", 746 | "        model_name = f\"llama2-7b/{method}/Full/{x['id']}\"\n", 747 | "\n", 748 | "        model = AutoModelForCausalLM.from_pretrained(\n", 749 | "            join(model_dir, model_name),\n", 750 | "            torch_dtype=torch.bfloat16,\n", 751 | "            trust_remote_code=True\n", 752 | "        )\n", 753 | "\n", 754 | "        old_params = copy.deepcopy(model.state_dict()) # snapshot of the unlearned weights\n", 755 | "\n", 756 | "        tokenizer = AutoTokenizer.from_pretrained(join(model_dir, model_name))\n", 757 | "        tokenizer.pad_token = tokenizer.eos_token\n", 758 | "        tokenizer.padding_side = \"left\"\n", 759 | "        model.to('cuda')\n", 760 | "\n", 761 | "        questions = x['QA_with_answers']\n", 762 | "        inputs = tokenizer(questions, return_tensors=\"pt\", padding=True, return_token_type_ids=False).to('cuda')\n", 763 | "        input_length = inputs.attention_mask.sum(dim=1)\n", 764 | "\n", 765 | "        with torch.no_grad():\n", 766 | "            original_output = original_model(**inputs)\n", 767 | "\n", 768 | "        vectors_patch_loss_list = []\n", 769 | "        for layer in range(32):\n", 770 | "            for recover_layer in range(32): # select which layers' MLP value vectors to restore\n", 771 | "                if layer - 2 <= recover_layer <= layer + 2: # restore a 5-layer window centered on the current layer\n", 772 | "                    model.state_dict()[f'model.layers.{recover_layer}.mlp.down_proj.weight'][:, :] = original_model.state_dict()[f'model.layers.{recover_layer}.mlp.down_proj.weight'][:, :]\n", 773 | "            \n", 774 | "            \n", 775 | "            with torch.no_grad():\n", 776 | "                output = model(**inputs)\n", 777 | "\n", 778 | "            loss_list = []\n", 779 | "            for i, length in enumerate(x['answers_tokens_num']):\n", 780 | "                loss = compute_loss(original_output.logits[i, -length:, :], output.logits[i, -length:, :])\n", 781 | "                loss_list.append(loss)\n", 782 | "\n", 783 | "            vectors_patch_loss_list.append(loss_list)\n", 784 | "\n", 785 | "            model.load_state_dict(old_params) # revert to the unlearned weights before patching the next window\n", 786 | "        \n", 787 | "        vectors_patch_loss_lists.append(vectors_patch_loss_list)\n", 788 | "\n", 789 | "        del model\n", 790 | "        del tokenizer\n", 791 | "        torch.cuda.empty_cache()\n", 792 | "    \n", 793 | "    methods_vectors_patch_loss_lists.append(vectors_patch_loss_lists)\n", 794 | "    \n", 795 | "\n" 796 | ], 797 | "outputs": [], 798 | "execution_count": null 799 | }, 800 | { 801 | "cell_type": "code", 802 | "metadata": { 803 | "tags": [] 804 | }, 805 | "source": [ 806 | "\n", 807 | "\n", 808 | "print('Recovery on LLaMA by restoring MLP value vectors (5 layers per patch)')\n", 809 | "    \n", 810 | "vectors_KRS_list_all = []\n", 811 | "for t, method in enumerate(['grad_diff', 'dpo']):\n", 812 | "    print(f'on {method}')\n", 813 | "    vectors_KRS_list_per_method = []\n", 814 | "    for patch_losses, ori_losses in zip(methods_vectors_patch_loss_lists[t], methods_ori_loss_list[t]):\n", 815 | "        vectors_KRS_list_per_data = []\n", 816 | "        for new_losses in patch_losses: # one entry per patched layer\n", 817 | "            vectors_KRS_list_per_layer = []\n", 818 | "            for new_loss, ori_loss in zip(new_losses, ori_losses):\n", 819 | "                KRS = compute_KRS(ori_loss, new_loss=new_loss)\n", 820 | "                vectors_KRS_list_per_layer.append(KRS.cpu())\n", 821 | "            \n", 822 | "            vectors_KRS_list_per_data.append(np.mean(vectors_KRS_list_per_layer, axis=-1))\n", 823 | "        vectors_KRS_list_per_method.append(vectors_KRS_list_per_data)\n", 824 | "    \n", 825 | "    
vectors_KRS_list_per_method = np.array(vectors_KRS_list_per_method)\n", 826 | " avg_vectors_KRS_list_per_method = np.mean(vectors_KRS_list_per_method, axis=0)\n", 827 | " \n", 828 | " vectors_KRS_list_all.append(avg_vectors_KRS_list_per_method)\n", 829 | "\n", 830 | "vectors_KRS_list_all = np.array(vectors_KRS_list_all)\n", 831 | "avg_vectors_KRS_list_all = np.mean(vectors_KRS_list_all, axis=0)\n", 832 | "print(f'avg_vectors_KRS_list_all: {list(avg_vectors_KRS_list_all)}')" 833 | ], 834 | "outputs": [], 835 | "execution_count": null 836 | } 837 | ], 838 | "metadata": { 839 | "kernelspec": { 840 | "display_name": "Python 3 (ipykernel)", 841 | "language": "python", 842 | "name": "python3" 843 | }, 844 | "language_info": { 845 | "codemirror_mode": { 846 | "name": "ipython", 847 | "version": 3 848 | }, 849 | "file_extension": ".py", 850 | "mimetype": "text/x-python", 851 | "name": "python", 852 | "nbconvert_exporter": "python", 853 | "pygments_lexer": "ipython3", 854 | "version": "3.11.5" 855 | } 856 | }, 857 | "nbformat": 4, 858 | "nbformat_minor": 4 859 | } 860 | --------------------------------------------------------------------------------