├── Poisoned_datasets
│   ├── Beauty_poison_0.01.txt
│   ├── Beauty_poison_0.02.txt
│   ├── Beauty_poison_0.03.txt
│   ├── Sports_poison_0.01.txt
│   ├── Sports_poison_0.02.txt
│   ├── Sports_poison_0.03.txt
│   ├── Toys_poison_0.01.txt
│   ├── Toys_poison_0.02.txt
│   ├── Toys_poison_0.03.txt
│   ├── Yelp_poison_0.01.txt
│   ├── Yelp_poison_0.02.txt
│   ├── Yelp_poison_0.03.txt
│   └── Yelp_poison_only1perline_0.01.txt
├── README.md
└── Seq-poison
    ├── Bi_classifier model
    │   ├── Beauty_bi_classify.pt
    │   ├── Sports_and_Outdoors_bi_classify.pt
    │   ├── Toys_and_Games_bi_classify.pt
    │   └── Yelp_bi_classify.pt
    ├── classify.py
    ├── data_processing.py
    ├── dataloader.py
    ├── dataset
    │   ├── Beauty.txt
    │   ├── Sports_and_Outdoors.txt
    │   ├── Toys_and_Games.txt
    │   └── Yelp.txt
    ├── discriminator.py
    ├── generate_data.py
    ├── generator.py
    ├── helpers.py
    ├── main.py
    ├── process.py
    └── train_classify.py
/README.md:
--------------------------------------------------------------------------------
1 | ### Poisoned_datasets 2 | This is the implementation code for the paper "Poisoning Self-supervised Learning Based Sequential Recommendations" (ACM SIGIR 2023). 3 | 4 | For each dataset, we generated fake users whose numbers are 1%, 2%, and 3% of the number of real users. 5 | 6 | It is worth noting that, in the three original __Amazon__ datasets (__Beauty__, __Sports and Outdoors__, and __Toys and Games__), no user has multiple interactions with the same item, while in the __Yelp__ dataset, users often interact with the same item multiple times. 7 | Therefore, to ensure the stealthiness of our attack, when constructing the poisoning data for the __Amazon__ datasets, we let each fake user interact with the target item only once, while in __Yelp__ we allow each fake user to interact with the target item multiple times. 8 | For comparison, we also provide poisoning data for the __Yelp__ dataset in which each fake user interacts with the same item at most once. 9 | 10 | ### Seq-poison 11 | Our model for generating fake users. 12 | 13 | #### dataset 14 | Original pre-processed user-item interaction records, obtained from the publicly available data downloaded from [Google Drive](https://drive.google.com/drive/folders/1ahiLmzU7cGRPXf5qGMqtAChte2eYp9gI). 15 | 16 | We use the "5-core" datasets as described in our paper. 17 | 18 | #### Run 19 | Create the "5-core" datasets: 20 | ``` 21 | python data_processing.py 22 | ``` 23 | You can also directly use the already-processed datasets provided in __Seq-poison/dataset__. 24 | 25 | Create the bi-classifier: 26 | 27 | ``` 28 | python train_classify.py 29 | ``` 30 | 31 | This produces the bi-classifier model __{data_name}_bi_classify.pt__.
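For reference, below is a minimal, hypothetical sketch of how the saved checkpoint could be loaded with the `Classify` network from __classify.py__. The constructor arguments are illustrative placeholders and must match the configuration actually used in __train_classify.py__; the sketch also assumes the checkpoint stores a `state_dict`.

```
# Hypothetical loading sketch -- not part of the original pipeline.
# All hyperparameter values below are placeholders and must match train_classify.py.
import torch

from classify import Classify

model = Classify(
    num_classes=2,               # real vs. fake user sequences
    vocab_size=12102,            # placeholder: number of items + special tokens
    emb_dim=64,                  # placeholder embedding size
    filter_sizes=[2, 3, 4],      # placeholder CNN filter widths
    num_filters=[100, 100, 100], # one entry per filter size
    dropout=0.5,
)

# Assumes the .pt file stores a state_dict; if train_classify.py saved the whole
# model object instead, use: model = torch.load(path, map_location="cpu")
state = torch.load("Bi_classifier model/Beauty_bi_classify.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()
```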
32 | 33 | Train the generator that generates fake users: 34 | 35 | ``` 36 | python main.py 37 | ``` 38 | 39 | Generate poisoning data (the percentage of fake users can be set): 40 | 41 | ``` 42 | python generate_data.py 43 | ``` 44 | 45 | -------------------------------------------------------------------------------- /Seq-poison/Bi_classifier model/Beauty_bi_classify.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongGroup/Poisoning-SSL-based-RS/7c30ca3d9df621080afc38f8aa20c09e0b5e6891/Seq-poison/Bi_classifier model/Beauty_bi_classify.pt -------------------------------------------------------------------------------- /Seq-poison/Bi_classifier model/Sports_and_Outdoors_bi_classify.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongGroup/Poisoning-SSL-based-RS/7c30ca3d9df621080afc38f8aa20c09e0b5e6891/Seq-poison/Bi_classifier model/Sports_and_Outdoors_bi_classify.pt -------------------------------------------------------------------------------- /Seq-poison/Bi_classifier model/Toys_and_Games_bi_classify.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongGroup/Poisoning-SSL-based-RS/7c30ca3d9df621080afc38f8aa20c09e0b5e6891/Seq-poison/Bi_classifier model/Toys_and_Games_bi_classify.pt -------------------------------------------------------------------------------- /Seq-poison/Bi_classifier model/Yelp_bi_classify.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CongGroup/Poisoning-SSL-based-RS/7c30ca3d9df621080afc38f8aa20c09e0b5e6891/Seq-poison/Bi_classifier model/Yelp_bi_classify.pt -------------------------------------------------------------------------------- /Seq-poison/classify.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | 6 | # a simple network with convs structure 7 | class Classify(nn.Module): 8 | 9 | def __init__(self, num_classes, vocab_size, emb_dim, filter_sizes, num_filters, dropout): 10 | super(Classify, self).__init__() 11 | self.emb = nn.Embedding(vocab_size, emb_dim) 12 | self.convs = nn.ModuleList([ 13 | nn.Conv2d(1, n, (f, emb_dim)) for (n, f) in zip(num_filters, filter_sizes) 14 | ]) 15 | self.highway = nn.Linear(sum(num_filters), sum(num_filters)) 16 | self.dropout = nn.Dropout(p=dropout) 17 | self.lin = nn.Linear(sum(num_filters), num_classes) 18 | self.softmax = nn.Softmax(dim=1) 19 | self.init_parameters() 20 | 21 | def forward(self, x): 22 | """ 23 | Args: 24 | x: (batch_size * seq_len) 25 | """ 26 | emb = self.emb(x).unsqueeze(1) # batch_size * 1 * seq_len * emb_dim 27 | convs = [F.relu(conv(emb)).squeeze(3) for conv in self.convs] # [batch_size * num_filter * length] 28 | pools = [F.max_pool1d(conv, conv.size(2)).squeeze(2) for conv in convs] # [batch_size * num_filter] 29 | pred = torch.cat(pools, 1) # batch_size * num_filters_sum 30 | highway = self.highway(pred) 31 | pred = torch.sigmoid(highway) * F.relu(highway) + (1. 
- torch.sigmoid(highway)) * pred 32 | logit = self.lin(self.dropout(pred)) 33 | pred = self.softmax(logit) 34 | return pred 35 | 36 | def init_parameters(self): 37 | for param in self.parameters(): 38 | param.data.uniform_(-0.05, 0.05) 39 | 40 | 41 | -------------------------------------------------------------------------------- /Seq-poison/data_processing.py: -------------------------------------------------------------------------------- 1 | import gzip 2 | import numpy as np 3 | from collections import defaultdict 4 | import pandas as pd 5 | from pandas.core.frame import DataFrame 6 | import tqdm 7 | import json 8 | 9 | """ 10 | Tool functions for generating the 5-core datasets 11 | """ 12 | 13 | def parse(path): # for Amazon 14 | g = gzip.open(path, 'r') 15 | for l in g: 16 | yield eval(l) 17 | 18 | def Yelp(date_min, date_max, rating_score): 19 | users = [] 20 | items = [] 21 | scores = [] 22 | times = [] 23 | data_file = './data_processing/Data/Yelp/yelp_academic_dataset_review_2020.json' 24 | lines = open(data_file).readlines() 25 | for line in tqdm.tqdm(lines): 26 | review = json.loads(line.strip()) 27 | rating = review['stars'] 28 | # 2004-10-12 10:13:32 2019-12-13 15:51:19 29 | date = review['date'] 30 | # drop reviews outside the date range or with a rating no greater than rating_score 31 | if date < date_min or date > date_max or float(rating) <= rating_score: 32 | continue 33 | user = review['user_id'] 34 | item = review['business_id'] 35 | time = date.replace('-','').replace(':','').replace(' ','') 36 | users.append(user) 37 | items.append(item) 38 | scores.append(rating) 39 | times.append(time) 40 | return users,items,scores,times 41 | 42 | # returns (user, item, rating, timestamp) lists; sorting is done in get_interaction 43 | def Amazon(dataset_name, rating_score): 44 | ''' 45 | reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B 46 | asin - ID of the product, e.g. 0000013714 47 | reviewerName - name of the reviewer 48 | helpful - helpfulness rating of the review, e.g. 2/3 49 | --"helpful": [2, 3], 50 | reviewText - text of the review 51 | --"reviewText": "I bought this for my husband who plays the piano. ..."
52 | overall - rating of the product 53 | --"overall": 5.0, 54 | summary - summary of the review 55 | --"summary": "Heavenly Highway Hymns", 56 | unixReviewTime - time of the review (unix time) 57 | --"unixReviewTime": 1252800000, 58 | reviewTime - time of the review (raw) 59 | --"reviewTime": "09 13, 2009" 60 | ''' 61 | users = [] 62 | items = [] 63 | scores = [] 64 | times = [] 65 | # older Amazon 66 | data_file = './data_processing/Data/'+ dataset_name +'/reviews_' + dataset_name + '_5' + '.json.gz' 67 | # latest Amazon 68 | # data_file = '/home/hui_wang/data/new_Amazon/' + dataset_name + '.json.gz' 69 | for inter in parse(data_file): 70 | if float(inter['overall']) <= rating_score: # drop reviews with a rating no greater than rating_score 71 | continue 72 | user = inter['reviewerID'] 73 | item = inter['asin'] 74 | score = inter["overall"] 75 | time = inter['unixReviewTime'] 76 | users.append(user) 77 | items.append(item) 78 | scores.append(score) 79 | times.append(time) 80 | return users,items,scores,times 81 | 82 | # iteratively filter the interactions down to a K-core 83 | def filter_Kcore(user_items, user_core, item_core): # user maps to all of its items 84 | user_count, item_count, isKcore = check_Kcore(user_items, user_core, item_core) 85 | pop_users = set() 86 | pop_items = set() 87 | while not isKcore: 88 | for user, num in user_count.items(): 89 | if user_count[user] < user_core: # remove the user entirely 90 | user_items.pop(user) 91 | pop_users.add(user) 92 | else: 93 | for item in user_items[user]: 94 | if item_count[item] < item_core: 95 | user_items[user].remove(item) 96 | pop_items.add(item) 97 | user_count, item_count, isKcore = check_Kcore(user_items, user_core, item_core) 98 | return user_items, pop_users, pop_items 99 | 100 | # check whether the (user_core, item_core) K-core condition holds 101 | def check_Kcore(user_items, user_core, item_core): 102 | user_count = defaultdict(int) 103 | item_count = defaultdict(int) 104 | for user, items in user_items.items(): 105 | for item in items: # count occurrences 106 | user_count[user] += 1 107 | item_count[item] += 1 108 | 109 | for user, num in user_count.items(): 110 | if num < user_core: 111 | return user_count, item_count, False 112 | for item, num in item_count.items(): 113 | if num < item_core: 114 | return user_count, item_count, False 115 | return user_count, item_count, True # K-core condition already satisfied 116 | 117 | def id_map(user_items): # user_items dict 118 | 119 | user2id = {} # raw 2 uid 120 | item2id = {} # raw 2 iid 121 | id2user = {} # uid 2 raw 122 | id2item = {} # iid 2 raw 123 | user_id = 1 124 | item_id = 1 125 | final_data = {} 126 | for user, items in user_items.items(): 127 | if user not in user2id: 128 | user2id[user] = str(user_id) 129 | id2user[str(user_id)] = user 130 | user_id += 1 131 | iids = [] # item id lists 132 | for item in items: 133 | if item not in item2id: 134 | item2id[item] = str(item_id) 135 | id2item[str(item_id)] = item 136 | item_id += 1 137 | iids.append(item2id[item]) 138 | uid = user2id[user] 139 | final_data[uid] = iids 140 | data_maps = { 141 | 'user2id': user2id, 142 | 'item2id': item2id, 143 | 'id2user': id2user, 144 | 'id2item': id2item 145 | } 146 | return final_data, user_id-1, item_id-1, data_maps 147 | 148 | def get_interaction(datas): 149 | user_seq = {} 150 | for index, inter in datas.iterrows(): 151 | user, item, time = inter['userId'],inter['itemId'],inter["timestamp"] 152 | if user in user_seq: 153 | user_seq[user].append((item, time)) 154 | else: 155 | user_seq[user] = [] 156 | user_seq[user].append((item, time)) 157 | 158 | for user, item_time in user_seq.items(): 159 | item_time.sort(key=lambda x: x[1]) # sort each user's interactions by timestamp 160 | items = []
161 | for t in item_time: 162 | items.append(t[0]) 163 | user_seq[user] = items 164 | return user_seq 165 | 166 | def main(data_name, data_type='Amazon'): 167 | assert data_type in {'Amazon', 'Yelp'} 168 | np.random.seed(12345) 169 | rating_score = 0.0 # reviews with a rating no greater than this score are removed 170 | # user 5-core item 5-core 171 | user_core = 5 172 | item_core = 5 173 | attribute_core = 0 174 | 175 | if data_type == 'Yelp': 176 | date_max = '2019-12-31 00:00:00' 177 | date_min = '2019-01-01 00:00:00' 178 | users, items, scores, times = Yelp(date_min, date_max, rating_score) 179 | else: 180 | users, items, scores, times = Amazon(data_name, rating_score=rating_score) 181 | 182 | data = DataFrame({ 183 | "userId":users, 184 | "itemId":items, 185 | "rating":scores, 186 | "timestamp":times 187 | }) 188 | 189 | user_items = get_interaction(data) 190 | user_items, pop_users, pop_items = filter_Kcore(user_items, user_core=user_core, item_core=item_core) 191 | 192 | 193 | data = data[~data['userId'].isin(pop_users)] 194 | data = data[~data['itemId'].isin(pop_items)] 195 | 196 | user_items, user_num, item_num, data_maps = id_map(user_items) # re-map raw ids to consecutive integer ids 197 | 198 | user2id = data_maps['user2id'] 199 | item2id = data_maps['item2id'] 200 | data['userId'] = data.userId.apply(lambda x: user2id[x]) 201 | data['itemId'] = data.itemId.apply(lambda x: item2id[x]) 202 | 203 | 204 | data.to_csv("./"+ data_name +".csv",index=False) 205 | 206 | if __name__ == "__main__": 207 | main("Beauty", data_type="Amazon") -------------------------------------------------------------------------------- /Seq-poison/dataloader.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf-8 -*- 2 | 3 | import os 4 | import random 5 | import math 6 | # from cv2 import randn  # unused import 7 | from torch.utils.data import Dataset, DataLoader 8 | import numpy as np 9 | import torch 10 | 11 | class GenDataset(Dataset): 12 | """ 13 | Pads real user interaction sequences and yields (shifted input, target) pairs for generator training 14 | """ 15 | def __init__(self, max_seq_length, user_seq, mode="train"): 16 | super(GenDataset, self).__init__() 17 | self.max_seq_length = max_seq_length 18 | self.user_seq = self.padding(user_seq) 19 | 20 | if mode == 'train': 21 | self.user_seq = self.user_seq[:int(0.9*len(self.user_seq))] 22 | else: 23 | self.user_seq = self.user_seq[int(0.9*len(self.user_seq)):] 24 | 25 | def padding(self, user_seq): 26 | user_seqs = [] 27 | for s in user_seq: 28 | pad_len = self.max_seq_length - len(s) 29 | s = s + [0] * pad_len 30 | user_seqs.append(s) 31 | return user_seqs 32 | 33 | def __len__(self): 34 | return len(self.user_seq) 35 | 36 | def __getitem__(self, index): 37 | item = self.user_seq[index] 38 | label = torch.LongTensor(np.array(item,dtype="int64")) 39 | data = [0] + item[:-1] 40 | data = torch.LongTensor(np.array(data,dtype="int64")) 41 | return data, label 42 | 43 | 44 | 45 | class DisDataset(Dataset): 46 | """ 47 | Real (label 1) and generated fake (label 0) sequences for training the discriminator 48 | """ 49 | def __init__(self, real_data, fake_data, max_seq_length): 50 | super(DisDataset, self).__init__() 51 | self.max_seq_length = max_seq_length 52 | self.data = self.padding(real_data) + fake_data 53 | self.labels = [1 for _ in range(len(real_data))] +\ 54 | [0 for _ in range(len(fake_data))] 55 | self.pairs = list(zip(self.data, self.labels)) 56 | 57 | def padding(self, user_seq): 58 | user_seqs = [] 59 | for s in user_seq: 60 | pad_len = self.max_seq_length - len(s) 61 | s = s + [0] * pad_len 62 | user_seqs.append(s) 63 | return user_seqs 64 | 65 | def __len__(self): 66 |
return len(self.pairs) 67 | 68 | def __getitem__(self, index): 69 | pair = self.pairs[index] 70 | data = torch.LongTensor(np.array(pair[0],dtype="int64")) 71 | label = torch.LongTensor([pair[1]]) 72 | return data, label 73 | 74 | # Dataset class for bi_classifier 75 | class ClaDataset(Dataset): 76 | def __init__(self, max_seq_length, real_seq, fake_seq, mask_id): 77 | self.real_seq = self.unpadding(real_seq) 78 | self.fake_seq = self.unpadding(fake_seq) 79 | self.masked_segment_sequence = [] 80 | self.anti_masked_segment_sequence = [] 81 | self.data_pairs = [] # data pair 82 | self.max_len = max_seq_length 83 | self.mask_id = mask_id 84 | 85 | self.mask_sequence() 86 | self.get_pos_neg_pairs() 87 | 88 | def unpadding(self, fake_seq): 89 | #tensor to list 90 | fake_seq = fake_seq.cpu().data.numpy().tolist() 91 | fake_seqs = [] 92 | for fake_data in fake_seq: 93 | seq = [] 94 | for i, f in enumerate(fake_data): 95 | if f != 0: 96 | seq.append(f) 97 | else: 98 | # clip when there are two consecutive "0" 99 | if i == (len(fake_data) - 1) or fake_data[i+1] == 0: 100 | break 101 | else: 102 | continue 103 | fake_seqs.append(seq) 104 | 105 | return fake_seqs 106 | 107 | def mask_sequence(self): 108 | """ 109 | mask user_seq and do padding 110 | """ 111 | for fake_data in self.fake_seq: 112 | masked_segment_sequence = [] 113 | anti_masked_segment_sequence = [] 114 | # Masked Item Prediction 115 | if len(fake_data) < 2: 116 | masked_segment_sequence = fake_data 117 | anti_masked_segment_sequence = [self.mask_id] * len(fake_data) 118 | else: 119 | real_sample = self.real_seq[random.randint(0, len(self.real_seq)-1)] 120 | min_len = len(fake_data) if len(fake_data)