├── .DS_Store
├── .gitignore
├── README.md
├── applications
    ├── heavy_hitters
    │   ├── __init__.py
    │   ├── compare_methods.py
    │   ├── hadamard_response.py
    │   ├── k_randomized_response.py
    │   ├── rappor.py
    │   └── succint_histogram.py
    └── mean_estimation
    │   ├── __init__.py
    │   └── compare_different_methods.py
├── cryptlib
    ├── Paillier.py
    ├── RSA.py
    └── __init__.py
├── dplib
    ├── .DS_Store
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   └── __init__.cpython-38.pyc
    ├── bv_library.py
    ├── dp_mechanisms
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   └── dp_base.cpython-38.pyc
    │   ├── dp_base.py
    │   ├── exponential.py
    │   ├── laplace_mechanism.py
    │   └── randomize_response_mechism.py
    ├── ldp_mechanisms
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-37.pyc
    │   │   ├── __init__.cpython-38.pyc
    │   │   ├── ldp_base.cpython-38.pyc
    │   │   ├── ldplib.cpython-37.pyc
    │   │   ├── ldplib.cpython-38.pyc
    │   │   └── meanlib.cpython-38.pyc
    │   ├── duchi_mechanism.py
    │   ├── generalized_randomized_response_mechanism.py
    │   ├── kvlib.py
    │   ├── ldp_base.py
    │   ├── ldplib.py
    │   ├── meanlib.py
    │   ├── piecewise_mechanism.py
    │   └── varlib.py
    ├── mdlib.py
    └── sunNumTools
    │   ├── Normalizer.py
    │   └── __init__.py
├── useless.py
└── utils
    ├── __init__.py
    └── evaluation_matrix.py


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/.DS_Store


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/
applications/
test_examples/
dplib/ldp_mechanisms/meanlib_res.csv
dplib/ldp_mechanisms/meanlib_tst.py
dplib/ldp_mechanisms/varlib.py
dplib/ldp_mechanisms/varlib_res.csv
/useless.py


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Introduction

- My personal website: [https://forestneo.top/](https://forestneo.top/)
- Follow me on Zhihu: [DPer](https://www.zhihu.com/people/sun-lin-83)
- Follow the WeChat official account: ["Differential Privacy" (《差分隐私》)](https://forest-pic.oss-cn-beijing.aliyuncs.com/20200308122411.png)
- QQ study and discussion group: 779053117 (to join the WeChat group, contact the group owner)


This open-source code may be used for scientific research. The project mainly consists of the following parts:

- `basis.sunDP`: content related to DP
- `basis.sunLDP`: formerly called `sunDP`; content related to LDP
- `basis.sunCrypt`: the basic flow of some cryptographic algorithms, useful for understanding how they work; the implementations are inefficient

# sunDP

# sunLDP

The ldplib provides basic randomization functions:

- `eps2p`: converts the privacy budget into a coin-flipping probability
- `discretization`: discretizes a continuous value
- `RR`: [Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias](https://www.tandfonline.com/doi/abs/10.1080/01621459.1965.10480775)
- `UE`, `SUE`, `OUE`: from the paper [Locally Differentially Private Protocols for Frequency Estimation](https://dl.acm.org/doi/10.5555/3241189.3241247)

### kvlib

Some basic encoding terms:

- `kv`: a key-value pair, denoted as $\langle k, v\rangle$, where $k\in \{0,1\}, v\in[-1,1]$.
- `kvl`: a list of key-value pairs, denoted by $[\langle k_1, v_1\rangle,\langle k_2, v_2\rangle...]$. A kvl represents either the $i$-th key-value pair or the list of key-value pairs of one user.
- `kvt`: an $n\times d$ key-value table. A kvt represents the kvls of $n$ users.
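
As a rough illustration of these three terms, the sketch below builds a `kv`, a `kvl`, and a `kvt` in plain Python. The tuple layout, the helper names `make_kvl` and `make_kvt`, and the convention that an absent key (`k = 0`) carries the value 0 are assumptions made for this example; they are not the representation used by `kvlib`.

```python
import random


def make_kvl(d, rng):
    """One user's list of d key-value pairs (hypothetical layout: tuples)."""
    kvl = []
    for _ in range(d):
        k = rng.randint(0, 1)  # k in {0, 1}
        # v in [-1, 1]; assumption: an absent key (k = 0) carries value 0
        v = rng.uniform(-1.0, 1.0) if k == 1 else 0.0
        kvl.append((k, v))
    return kvl


def make_kvt(n, d, seed=0):
    """An n x d key-value table: one kvl per user."""
    rng = random.Random(seed)
    return [make_kvl(d, rng) for _ in range(n)]


kvt = make_kvt(n=4, d=3)  # table for 4 users and 3 keys
kvl = kvt[0]              # the first user's list of 3 pairs
kv = kvl[0]               # a single <k, v> pair
print(len(kvt), len(kvl), kv)
```

A nested list keeps the example dependency-free; an array of shape `(n, d, 2)` would be an equally plausible layout.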

The kvlib mainly contains the following perturbation and analysis algorithms:

- `PrivBV`: [PrivKV: Key-Value Data Collection with Local Differential Privacy](https://ieeexplore.ieee.org/abstract/document/8835348/)
- `BiSample`: [BiSample: Bidirectional Sampling for Handling Missing Data with Local Differential Privacy.](https://www.researchgate.net/publication/339251866_BiSample_Bidirectional_Sampling_for_Handling_Missing_Data_with_Local_Differential_Privacy/stats)
- `SE`: from the paper [Conditional Analysis for Key-Value Data with Local Differential Privacy](https://arxiv.org/abs/1907.05014)

### heavy_hitters

- `Hadamard Response`: [Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication](http://arxiv.org/abs/1802.04705)
- `k-RR`: the k-randomized response
- `k-subset`:
- `RAPPOR`: [RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response](http://dl.acm.org/citation.cfm?doid=2660267.2660348)

### mean_solutions

- `duchi`: also known as the 1Bit Mechanism (note that the input domain of 1Bit is [1,m], while the input domain of duchi is [-1,1]).
- `PM`: [Collecting and Analyzing Multidimensional Data with Local Differential Privacy](https://arxiv.org/abs/1907.00782)

# sunCrypt

sunCrypt contains some common cryptographic algorithms. The implementations are inefficient but highly readable. It mainly includes:

- Paillier: a public-key cryptosystem
- RSA: a public-key cryptosystem


--------------------------------------------------------------------------------
/applications/heavy_hitters/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2020/5/5
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @File : __init__.py.py
# @Software: PyCharm
# @Function:


if __name__ == '__main__':
    pass


--------------------------------------------------------------------------------
/applications/heavy_hitters/compare_methods.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2020/5/5
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @File : compare_methods.py
# @Software: PyCharm
# @Function:

import numpy as np
import applications.heavy_hitters.hadamard_response as HR
import applications.heavy_hitters.rappor as RAPP
import applications.heavy_hitters.k_randomized_response as KRR
import matplotlib.pyplot as plt


def generate_distribution(distribution_name, domain):
    # return a normalized probability vector of length `domain`
    if distribution_name == "uniform":
        return np.full(shape=domain, fill_value=1.0 / domain)
    elif distribution_name == "gauss":
        u = domain / 2
        sigma = domain / 6
        x = np.arange(1, domain+1)
        fx = 1 / (np.sqrt(2*np.pi) * sigma) * np.e**(- (x-u)**2 / (2 * sigma**2))
        return fx / sum(fx)
    elif distribution_name == "exp":
        lmda = 2
        prob_list = np.array([lmda * np.e**(-lmda * x) for x in np.arange(1, domain+1)/10])
        return prob_list / sum(prob_list)
    else:
        raise Exception("the distribution is not contained")


def generate_bucket(n, bucket_size, distribution_name):
    # sample n bucket indices and return them together with their true histogram
    distribution = generate_distribution(distribution_name, domain=bucket_size)
    bucket_list = np.random.choice(range(bucket_size), n, p=distribution)
    hist = np.histogram(bucket_list, bins=range(bucket_size+1))
    return bucket_list, hist[0]


def draw_distribution(distribution):
    index = np.arange(len(distribution))
    plt.plot(index, distribution)
    plt.show()


def get_err(true_hist, estimate_hist, method='max'):
    if method == 'max':
        return np.max(np.fabs(true_hist - estimate_hist))
    if method == 'average':
        return np.average(np.fabs(true_hist - estimate_hist))
    if method == 'l1':
        return np.sum(np.fabs(true_hist - estimate_hist))
    if method == 'l2':
        return np.sqrt(np.sum((true_hist - estimate_hist)**2))
    else:
        raise Exception("The input method is not allowed, method = ", method)


def run_example():
    config = {
        'bucket_size': 100,
        'epsilon': 1,
        'n': 1000000,
        'error_method': 'l1'
    }

    bucket_list, true_hist = generate_bucket(n=config['n'], bucket_size=config['bucket_size'], distribution_name='uniform')
    bucket_list = np.asarray(bucket_list)
    print("true hist = ", true_hist)

    print("\n==========>>>>> in HR")
    hr = HR.HR(bucket_size=config['bucket_size'], epsilon=config['epsilon'])
    hr_private_bucket_list = [hr.user_encode(bucket) for bucket in bucket_list]
    hr_histogram = hr.aggregate_histogram(private_bucket_list=hr_private_bucket_list)
    hr_error = get_err(true_hist, hr_histogram, config['error_method'])
    # print("HR result", hr_histogram)
    print("HR error", hr_error)

    print("\n==========>>>>> in RAPPOR")
    rappor = RAPP.RAPPOR(bucket_size=config['bucket_size'], epsilon=config['epsilon'])
    rappor_private_bucket_list = [rappor.user_encode(bucket) for
                                  bucket in bucket_list]
    rappor_histogram = rappor.aggregate_histogram(private_bucket_list=rappor_private_bucket_list)
    rappor_error = get_err(true_hist, rappor_histogram, config['error_method'])
    # print("RAPPOR result", rappor_histogram)
    print("RAPPOR error", rappor_error)

    print("\n==========>>>>> in KRR")
    krr = KRR.GeneralizedRandomizedResponse(bucket_size=config['bucket_size'], epsilon=config['epsilon'])
    krr_private_bucket_list = [krr.user_encode(item) for item in bucket_list]
    krr_histogram = krr.aggregate_histogram(krr_private_bucket_list)
    krr_error = get_err(true_hist, krr_histogram, config['error_method'])
    # print("krr result ", krr_histogram)
    print("krr error ", krr_error)
    print(config)


if __name__ == '__main__':
    run_example()
    # dist = generate_distribution(distribution_name='exp', domain=20)
    # print(dist)
    # draw_distribution(dist)


--------------------------------------------------------------------------------
/applications/heavy_hitters/hadamard_response.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2020/5/9
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @File : hadamard_response.py
# @Software: PyCharm
# @Function:

import numpy as np
from applications import heavy_hitters as example


class HR:
    def __init__(self, bucket_size, epsilon):
        self.epsilon = epsilon
        # the size of buckets
        self.bucket_size = bucket_size
        # this is the K in paper
        self.private_bucket_size = int(2 ** np.ceil(np.log2(bucket_size+1)))
        self.K = self.private_bucket_size
        self.s = int(self.K / 2)

        # the probability
        self.__ph = np.e ** epsilon / (self.s * np.e ** epsilon + self.K - self.s)
        self.__pl = 1.0 / (self.s * np.e ** epsilon + self.K - self.s)

        #
        # Generating the hadamard matrix
        self.__hadamard_matrix = np.array([1])
        for i in range(int(np.log2(self.K))):
            a = np.hstack([self.__hadamard_matrix, self.__hadamard_matrix])
            b = np.hstack([self.__hadamard_matrix, -self.__hadamard_matrix])
            self.__hadamard_matrix = np.vstack([a, b])

        # to store the output items together with corresponding probability, the shape is k*K
        self.probability_matrix = np.copy(self.__hadamard_matrix)[1:, :]
        self.probability_matrix = np.where(self.probability_matrix == 1, self.__ph, self.__pl)

    def user_encode(self, bucket):
        if bucket >= self.bucket_size:
            raise Exception("the input domain is wrong, bucket = %d, k = %d" % (bucket, self.bucket_size))
        a = range(self.K)
        p = self.probability_matrix[bucket]
        encode_item = np.random.choice(a=a, p=p)
        return encode_item

    def get_Cx(self, bucket):
        # indices of the +1 entries in row (bucket + 1) of the Hadamard matrix
        hadamard_line = self.__hadamard_matrix[bucket + 1]
        Cx = np.where(hadamard_line == 1)
        return Cx[0]

    def aggregate_histogram(self, private_bucket_list):
        private_hist = np.histogram(private_bucket_list, bins=range(self.private_bucket_size + 1))[0]
        hist = np.zeros(shape=self.bucket_size)
        for i in range(self.bucket_size):
            count = 0
            cx = np.where(self.__hadamard_matrix[i + 1] == 1)[0]
            for index in cx:
                count += private_hist[index]
            hist[i] = count

        n = len(private_bucket_list)
        estimate_hist = 2.0 * (np.e**self.epsilon + 1) / (np.e**self.epsilon - 1) * (hist - n / 2)
        return estimate_hist


def run_example():
    # generate_bucket is defined in compare_methods, not in the package __init__;
    # it is imported here rather than at module level because compare_methods
    # itself imports this module
    from applications.heavy_hitters.compare_methods import generate_bucket

    bucket_size = 4
    epsilon = 1
    n = 1000000

    # np.random.seed(10)
    hr = HR(bucket_size=bucket_size, epsilon=epsilon)

    bucket_list, true_hist = generate_bucket(n=n, bucket_size=bucket_size, distribution_name='uniform')
    print("this is buckets: ", bucket_list)
    print("this is true hist: ", true_hist)

    print("==========>>>>> in HR")
    private_bucket_list = [hr.user_encode(item) for item in bucket_list]
    print("this is private buckets: ", private_bucket_list)
    estimate_hist = hr.aggregate_histogram(private_bucket_list)
    print("this is estimate_hist", estimate_hist)


if __name__ == '__main__':
    run_example()


--------------------------------------------------------------------------------
/applications/heavy_hitters/k_randomized_response.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2020/5/9
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @File : k_randomized_response.py
# @Software: PyCharm


import numpy as np
from applications import heavy_hitters as example
import matplotlib.pyplot as plt


class GeneralizedRandomizedResponse:
    def __init__(self, bucket_size, epsilon):
        self.bucket_size = bucket_size
        self.epsilon = epsilon
        self.k = bucket_size

        # p_h: probability of reporting the true bucket; p_l: probability of each other bucket
        self.p_h = np.e ** epsilon / (np.e ** epsilon + self.k - 1)
        self.p_l = 1 / (np.e ** epsilon + self.k - 1)
        self.__tf_matrix = np.full(shape=(self.k, self.k), fill_value=self.p_l)
        for i in range(self.k):
            self.__tf_matrix[i][i] = self.p_h

    def user_encode(self, bucket):
        probability_list = self.__tf_matrix[bucket]
        return np.random.choice(a=range(self.k), p=probability_list)

    def aggregate_histogram(self, private_bucket_list):
        private_hist = np.zeros(shape=self.k)
        for private_bucket in private_bucket_list:
            private_hist[private_bucket] += 1
        estimate_hist = (private_hist - len(private_bucket_list) * self.p_l) / (self.p_h - self.p_l)
        return estimate_hist

    def aggregate_histogram_by_matrix(self, private_bucket_list):
        """
        this method is to estimate the histogram by the inverse of tf_matrix
        """
        private_hist = np.zeros(shape=self.k)
        for private_bucket in private_bucket_list:
            private_hist[private_bucket] += 1
        tf_reverse = np.linalg.inv(self.__tf_matrix)
        estimated_hist = np.dot(tf_reverse, np.reshape(private_hist, (self.bucket_size, 1)))
        return np.reshape(estimated_hist, self.bucket_size)


def run_example():
    # generate_bucket is defined in compare_methods, not in the package __init__;
    # it is imported here rather than at module level because compare_methods
    # itself imports this module
    from applications.heavy_hitters.compare_methods import generate_bucket

    np.set_printoptions(threshold=40, linewidth=200, edgeitems=5)

    n = 10 ** 5
    bucket_size = 100
    epsilon = 1

    print("==========>>>>> in KRR")
    krr = GeneralizedRandomizedResponse(bucket_size=bucket_size, epsilon=epsilon)
    bucket_list, true_hist = generate_bucket(n=n, bucket_size=bucket_size, distribution_name='exp')
    print("this is buckets: ", bucket_list)
    print("this is true hist: ", true_hist)

    private_bucket_list = [krr.user_encode(item) for item in bucket_list]
    estimated_hist = krr.aggregate_histogram(private_bucket_list)
    print("this is estimate_hist", estimated_hist)

    index = range(bucket_size)
    plt.plot(index, true_hist)
    plt.plot(index, estimated_hist)
    plt.legend(['true', 'krr'])
    plt.show()


if __name__ == '__main__':
    run_example()


--------------------------------------------------------------------------------
/applications/heavy_hitters/rappor.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2020/5/5
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @File : rappor.py
# @Software: PyCharm
# @Function:

import numpy as np
from applications import heavy_hitters as example


class RAPPOR:
    def __init__(self, bucket_size, epsilon):
        # the probability of 1->1
        self.p = np.e ** (epsilon/2) / (np.e ** (epsilon/2) + 1)
        # the size of buckets
        self.bucket_size = bucket_size

    def user_encode(self, bucket):
        if bucket >= self.bucket_size:
            raise Exception("Error, the input domain is wrong, bucket = %d, k = %d" % (bucket, self.bucket_size))
        # onehot encoding
        private_bucket = np.zeros(self.bucket_size)
        private_bucket[bucket] = 1
        # randomized response: keep a 1 with probability p, flip a 0 to 1 with probability 1 - p
        return np.where(private_bucket == 1, np.random.binomial(1, self.p, self.bucket_size),
                        np.random.binomial(1, 1 - self.p, self.bucket_size))

    def aggregate_histogram(self, private_bucket_list):
        private_bucket_list = np.asarray(private_bucket_list)
        item_count = private_bucket_list.shape[0]
        private_counts = np.sum(private_bucket_list, axis=0)
        # unbiased estimate: (observed - n * (1 - p)) / (2p - 1)
        estimate_counts = (private_counts + item_count * self.p - item_count) / (2*self.p - 1)
        return estimate_counts


def run_example():
    # generate_bucket is defined in compare_methods, not in the package __init__;
    # it is imported here rather than at module level because compare_methods
    # itself imports this module
    from applications.heavy_hitters.compare_methods import generate_bucket

    bucket_size = 5
    epsilon = 1

    print("==========>>>>> in RAPPOR")
    rappor = RAPPOR(bucket_size=bucket_size, epsilon=epsilon)
    bucket_list, true_hist = generate_bucket(n=10000, bucket_size=bucket_size, distribution_name='uniform')
    print("this is buckets: ", bucket_list)
    print("this is true hist: ", true_hist)

    private_bucket_list = [rappor.user_encode(item) for item in bucket_list]
    estimate_hist = rappor.aggregate_histogram(private_bucket_list)
    print("this is estimate_hist", estimate_hist)


if __name__ == '__main__':
    run_example()


--------------------------------------------------------------------------------
/applications/heavy_hitters/succint_histogram.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2020/9/1
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @Software: PyCharm


import numpy as np
from applications import heavy_hitters as example


def cosine_similarity(arr_1: np.ndarray, arr_2: np.ndarray):
    return np.dot(arr_1,
                  arr_2) / (np.linalg.norm(arr_1) * np.linalg.norm(arr_2))


def euclidean_similarity(arr_1: np.ndarray, arr_2: np.ndarray):
    # despite the name, this returns the Euclidean distance
    return np.sqrt(np.sum((arr_1 - arr_2) ** 2))


class SuccinctHistogram:
    def __init__(self, epsilon, d, m):
        """
        Either m must be given, or n and beta (only m is accepted at present;
        the n/beta computation is commented out below)
        :param epsilon:
        :param d:
        :param m:
        :param n:
        :param beta:
        """
        self.epsilon = epsilon
        self.d = d
        self.__p = np.e ** epsilon / (np.e ** epsilon + 1)

        self.onehot_matrix = np.eye(d)
        print("onehot matrix generated")

        # """ this is m in succinct histogram"""
        # gamma = np.sqrt((np.log(2*d / beta)) / (epsilon**2 * n))
        # m = np.log(d+1) * np.log(2/beta) / (gamma ** 2)
        # self.m = int(np.ceil(m))

        self.m = m

        # generate a d*m matrix phi
        self.phi = 1/np.sqrt(self.m) * np.random.choice(a=[-1, 1], size=[d, self.m])
        print("matrix phi generated!")

        self.C = (np.e**epsilon + 1) / (np.e**epsilon - 1)
        print("*"*10, " Succint Histogram initialized!")

    def user_encode(self, value):
        if not 0 <= value < self.d:
            raise Exception("Error, the input is not in the input domain, ", value)
        # onehot_arr = self.onehot_matrix[value]
        # d_x = onehot_arr.dot(self.phi)
        # print("dx = ", d_x)
        d_x = self.phi[value]
        return self.__basic_randomizer(d_x)

    def __basic_randomizer(self, x):
        j = np.random.randint(low=0, high=self.m)
        if not np.all(x == 0):
            z_j = np.random.choice([self.C * self.m * x[j], -self.C * self.m * x[j]], p=[self.__p, 1 - self.__p])
        else:
            # x is the zero vector: report +/- C * sqrt(m) uniformly. Both outcomes use
            # sqrt(m) so that the magnitude matches |C * m * x[j]| in the branch above
            z_j = np.random.choice([-self.C * np.sqrt(self.m), self.C * np.sqrt(self.m)])
        return j, z_j

    def FO(self, z_hat):
        f = np.zeros(shape=self.d)
        for bucket in range(self.d):
            onehot_arr = self.onehot_matrix[bucket]
            f[bucket] = np.inner(onehot_arr.dot(self.phi), z_hat)
        return f

    def PROT_FO(self, users_data):
        n = len(users_data)
        z_sum = np.zeros(shape=self.m)
        print("start encoding")
        for i in range(n):
            j, z_j = self.user_encode(users_data[i])
            z_sum[j] = z_sum[j] + z_j
        print("start decoding")
        z_hat = z_sum / n
        return self.FO(z_hat=z_hat)


def run_example():
    # generate_bucket and draw_distribution are defined in compare_methods, not in
    # the package __init__, so they are imported from there
    from applications.heavy_hitters.compare_methods import generate_bucket, draw_distribution

    epsilon = 1
    n = 10 ** 6
    bucket_size = 1000
    m = 500000

    # np.random.seed(10)

    bucket_list, true_hist = generate_bucket(n=n, bucket_size=bucket_size, distribution_name='exp')

    true_distribution = true_hist / sum(true_hist)
    print(true_distribution[:10])
    draw_distribution(true_distribution)

    SH = SuccinctHistogram(epsilon=epsilon, d=bucket_size, m=m)
    estimated_hist = SH.PROT_FO(users_data=bucket_list)
    estimated_distribution = estimated_hist / sum(estimated_hist)
    draw_distribution(estimated_distribution)

    print(estimated_distribution[:10])


if __name__ == '__main__':
    run_example()


--------------------------------------------------------------------------------
/applications/mean_estimation/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/applications/mean_estimation/__init__.py


--------------------------------------------------------------------------------
/applications/mean_estimation/compare_different_methods.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2019-07-11 18:27
# @Author : ForestNeo
# @Email : dr.forestneo@gmail.com
# @Software: PyCharm

# @Update : 2020.10.20


"""
This is a case for the mean estimation tasks
"""

import numpy as np
import matplotlib.pyplot as plt
import dplib.ldp_mechanisms.meanlib as meanlib


if __name__ == '__main__':
    # generate the data in [-1,1]
    data = np.clip(np.random.normal(loc=0.2, scale=0.3, size=100000), a_min=-1, a_max=1)
    # get baseline
    m_base = np.average(data)

    epsilon_list, error_duchi, error_piecewise = [], [], []

    for i in range(1, 10):
        epsilon = 0.1 * i
        epsilon_list.append(epsilon)

        # initialize the encoding methods
        duchi = meanlib.Duchi(epsilon)
        piecewise = meanlib.PiecewiseMechanism(epsilon)

        # duchi's solution and its error
        duchi_data = [duchi.encode(value) for value in data]
        m_duchi = np.average(duchi_data)
        err_duchi = np.fabs(m_duchi - m_base)
        error_duchi.append(err_duchi)

        # piecewise solution and its error
        pm_data = [piecewise.encode(value) for value in data]
        m_piecewise = np.average(pm_data)
        err_pm = np.fabs(m_piecewise - m_base)
        error_piecewise.append(err_pm)

        print("epsilon = %.2f, err_duchi = %.4f, err_pm = %.4f" % (epsilon, err_duchi, err_pm))

    # draw the result
    plt.figure(figsize=[12, 5])
    plt.plot(epsilon_list, error_duchi, label="duchi")
    plt.plot(epsilon_list, error_piecewise, label="piecewise")
    plt.xlabel("epsilon")
    plt.ylabel("error")
    plt.legend()
    plt.show()


--------------------------------------------------------------------------------
/cryptlib/Paillier.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2021-10-01 10:37
# @Author : ForestNeo
# @Email : dr.forestneo@gmail.com
# @Software: PyCharm

"""
This file is used to test the properties of the Paillier algorithm. It is an
unfinished version, intended only for learning the method.
@ 2021.10.09
"""

import sympy


class Paillier:
    def __init__(self, p=0, q=0, g=0):
        self.__p = p
        self.__q = q
        self.__n = self.__p * self.__q
        self.__lambda = sympy.lcm(self.__p - 1, self.__q - 1)

        self.__g = g
        self.__u = 0
        if self.__g != 0:
            self.__u = sympy.mod_inverse(self.__L(g**self.__lambda % self.__n**2), self.__n)
        print("p,q,n,lambda,g,u = ", self.__p, self.__q, self.__n, self.__lambda, self.__g, self.__u)
        print("Paillier:: initialized")

    def __L(self, x):
        # L(x) = (x - 1) / n; x - 1 is a multiple of n here, so integer division is exact
        return (x-1) // self.__n

    def generate_key(self, key_bits=10):
        # choose p and q, and calculate n and lambda
        while True:
            self.__p = sympy.ntheory.generate.randprime(2**key_bits, 2**(key_bits+1))
            self.__q = sympy.ntheory.generate.randprime(2**key_bits, 2**(key_bits+1))
            self.__n = self.__p * self.__q
            if sympy.gcd(self.__n, (self.__p - 1) * (self.__q - 1)) == 1:
                break
        self.__lambda = sympy.lcm(self.__p - 1, self.__q - 1)

        # choose g
        while True:
            # self.__g = np.random.randint(1, self.__n**2)
            # TODO: 20210929: generate a random g; for convenience a prime is generated directly here
            self.__g = sympy.ntheory.generate.randprime(2, self.__n**2)
            self.__u = sympy.mod_inverse(self.__L(self.__g**self.__lambda % self.__n**2), self.__n)
            # TODO: 20210929: the exit condition has not been written yet
            break
        print("Paillier:: key generated")

    def encrypt(self, m, r=0):
        r = sympy.ntheory.randprime(3, self.__n) if r == 0 else r
        return self.__g ** m * r ** self.__n % (self.__n ** 2)

    def decrypt(self, c):
        return self.__L(c**self.__lambda % (self.__n**2)) * self.__u % self.__n

    def get_public_key(self):
        return self.__n, self.__g

    def get_private_key(self):
        return self.__lambda, self.__u

    def __str__(self):
        return "ttt"

    def is_validate(self):
        # TODO: 20210929: check whether the current parameters are valid
        return True


if __name__ == '__main__':
    pai = Paillier(p=7, q=11, g=5652)
    # pai = Paillier()
    # pai.generate_key()
    x_1, x_2 = 13, 25

    # encryption/decryption test
    print("\n encrypt-decrypt test")
    y_1, y_2 = pai.encrypt(x_1, r=23), pai.encrypt(x_2, r=23)  # r may differ between the two
    x_1, x_2 = pai.decrypt(y_1), pai.decrypt(y_2)
    print(x_1, x_2)

    # additive homomorphism test, i.e. [x_1] * [x_2] = [x_1 + x_2]
    print("\n homomorphic test 1")
    y_3 = y_1 * y_2
    x_3 = pai.decrypt(y_3)
    print(x_3)

    # verify that [m_1]^m2 mod n^2 = [m_1 * m_2]
    print("\n homomorphic test 2")
    n = pai.get_public_key()[0]
    x_tmp_1 = pai.decrypt(pai.encrypt(x_1)**x_2 % n**2)
    x_tmp_2 = (x_1 * x_2) % n
    print(x_tmp_1, x_tmp_2)


--------------------------------------------------------------------------------
/cryptlib/RSA.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2021-10-01 10:37
# @Author : ForestNeo
# @Email : dr.forestneo@gmail.com
# @Software: PyCharm

"""
This file is used to test the properties of the RSA algorithm. It is an
unfinished version, intended only for learning the method.
@ 2021.10.09
"""

import sympy
import numpy as np


class RSA:
    def __init__(self, p=0, q=0, e=0, d=0):
        self.__p = p
        self.__q = q
        self.__n = self.__p * self.__q
        self.__phi = (self.__p-1) * (self.__q-1)
        self.__e = e
        self.__d = d

    def generate_key(self, key_bits=10):
        # choose p and q, and calculate n and phi
        # key_bits = key_bits / 2  # the bit length of an RSA key refers to the bit length of n
        self.__p = sympy.ntheory.generate.randprime(2 ** key_bits, 2 ** (key_bits + 1))
        self.__q = sympy.ntheory.generate.randprime(2 ** key_bits, 2 ** (key_bits + 1))
        self.__n = self.__p * self.__q
        self.__phi = (self.__p - 1) * (self.__q - 1)
        while True:
            # brute-force approach: simply pick a prime as e
            self.__e = sympy.ntheory.generate.randprime(2, self.__phi)
            # check coprimality first: mod_inverse raises if e has no inverse modulo phi
            if sympy.gcd(self.__e, self.__phi) == 1:
                self.__d = sympy.mod_inverse(self.__e, self.__phi)
                break

    def encrypt(self, m):
        # todo: not accelerated, very slow
        return m**self.__e % self.__n

    def decrypt(self, c):
        # todo: not accelerated, very slow
        return c**self.__d % self.__n

    def get_public_key(self):
        return self.__e, self.__n

    def get_private_key(self):
        return self.__d, self.__n

    def __str__(self):
        return "ttt"

    def is_validate(self):
        # todo: check whether the current parameters are valid
        return True


if __name__ == '__main__':
    rsa = RSA()
    rsa.generate_key()

    x = 12
    y = rsa.encrypt(x)
    print("the encrypted number is: ", y)
    x = rsa.decrypt(y)
    print("the decrypted result is: ", x)


--------------------------------------------------------------------------------
/cryptlib/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/cryptlib/__init__.py


--------------------------------------------------------------------------------
/dplib/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/.DS_Store


--------------------------------------------------------------------------------
/dplib/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/__init__.py


--------------------------------------------------------------------------------
/dplib/__pycache__/__init__.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/__pycache__/__init__.cpython-37.pyc


--------------------------------------------------------------------------------
/dplib/__pycache__/__init__.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/__pycache__/__init__.cpython-38.pyc


--------------------------------------------------------------------------------
/dplib/bv_library.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
# @Time : 2021/3/3
# @Author : ForestNeo
# @Site : forestneo.com
# @Email : dr.forestneo@gmail.com
# @File : bv_library.py
# @Software: PyCharm
# @Function:

import numpy as np
import dplib.ldp_mechanisms.ldplib as ldplib
'''
The bit vector mechanism
'''

class BitVector:
    def __init__(self, random_values: np.ndarray, t: float, data_range: list):
        self.r = random_values
        self.s = len(random_values)
        self.t = t

        self.L = data_range[0]
        self.U = data_range[1]
        self.u = self.U - self.L

    def encode(self, v):
        bv = np.zeros(shape=self.s, dtype=int)
        for i in range(self.s):
            # use the instance threshold self.t (the original read a module-level t)
            if self.r[i] - self.t <= v <= self.r[i] + self.t:
                bv[i] = 1
        return bv

    def estimate_distance(self, private_data1: np.ndarray, private_data2: np.ndarray):
        # d_h is the Hamming distance between the two bit vectors
        d_h = np.sum(np.fabs(private_data1 - private_data2))
        d_e = d_h * self.u / (2 * self.s)
        return d_e


class RandomBitVector:
    def __init__(self, random_values: np.ndarray, data_range: list, p=1.0):
        self.s = len(random_values)
        self.r = random_values
        self.L = data_range[0]
        self.U = data_range[1]
        self.u = self.U - self.L
        self.p = p

    def encode(self, v):
        bv = np.where(v >= self.r, 1, 0)
        return ldplib.random_response(bit_array=bv, p=(self.p+1)/2)

    def estimate_distance(self, private_data1, private_data2):
        d_h = np.sum(np.fabs(private_data1 - private_data2))
        d_e = (d_h / self.s - ((1-self.p**2)/2)) * self.u / self.p**2
        return d_e


class PMRandomizedBitVector:
    def __init__(self, random_values: np.ndarray, data_range: list, triangle=1.0, epsilon=0.0):
        self.triangle = triangle
        self.epsilon = epsilon
        self.s = len(random_values)
        self.r = random_values
        self.L = data_range[0]
        self.U = data_range[1]
        self.u = self.U - self.L
        self.RBV = RandomBitVector(random_values=random_values, data_range=data_range, p=1)

    def encode(self, v):
        # add Laplace noise with scale triangle / epsilon, then encode the noisy value
        v = v + np.random.laplace(loc=0, scale=self.triangle/self.epsilon)
        return self.RBV.encode(v)

    def estimate_distance(self, private_data1, private_data2):
        return self.RBV.estimate_distance(private_data1, private_data2)


if __name__ == '__main__':
    length = 10000
    np.random.seed(0)
    data_range = [-10, 20]
    random_values = np.random.uniform(low=data_range[0], high=data_range[1], size=length)
    print(random_values)
    print(min(random_values), max(random_values))
    t = 4

    BV = BitVector(random_values=random_values, t=t, data_range=data_range)
    RBV = RandomBitVector(random_values=random_values, data_range=data_range, p=0.9)
    PMRBV = PMRandomizedBitVector(random_values=random_values, data_range=data_range, triangle=1, epsilon=10)
    method = BV

    data_pair = [
        [1, 3],
        [2, 3],
        [4, 6],
        [4, 8],
    ]

    for a, b in data_pair:
        p_a = method.encode(a)
        p_b = method.encode(b)
        de_true = np.fabs(a - b)
        de_esti = method.estimate_distance(p_a, p_b)
        print(a, b, de_true, de_esti)

    pass


--------------------------------------------------------------------------------
/dplib/dp_mechanisms/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/dp_mechanisms/__init__.py


--------------------------------------------------------------------------------
/dplib/dp_mechanisms/__pycache__/dp_base.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/dp_mechanisms/__pycache__/dp_base.cpython-38.pyc -------------------------------------------------------------------------------- /dplib/dp_mechanisms/dp_base.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-01-07 17:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | base class for differential privacy 9 | """ 10 | import abc 11 | 12 | 13 | class DPBase(metaclass=abc.ABCMeta): 14 | 15 | @abc.abstractmethod 16 | def randomize(self, value): 17 | """ the randomize function """ 18 | raise NotImplementedError 19 | 20 | @classmethod 21 | def _check_epsilon_delta(cls, epsilon, delta): 22 | if not (epsilon >= 0 and 0 <= delta <= 1 and epsilon+delta > 0): 23 | raise ValueError("the range of epsilon and delta is wrong, epsilon={}, delta={}".format(epsilon, delta)) 24 | return float(epsilon), float(delta) 25 | 26 | @abc.abstractmethod 27 | def get_privacy_budget(self): 28 | """ 29 | return the privacy budget 30 | """ 31 | raise NotImplementedError 32 | -------------------------------------------------------------------------------- /dplib/dp_mechanisms/exponential.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-01-07 15:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | import numpy as np 8 | 9 | 10 | class Exponential: 11 | def __init__(self, epsilon, sensitivity, range, func_score): 12 | self.epsilon = epsilon 13 | self.sensitivity = sensitivity 14 | self.range = range 15 | self.func_score = func_score 16 | 17 | def exponential(self, data): 18 | # calculate scores 19 | scores = np.asarray([self.func_score(item, data) for item in self.range]) 20 | probabilities = np.exp(self.epsilon * scores / (2 * 
self.sensitivity)) 21 | probabilities = probabilities / np.linalg.norm(probabilities, ord=1) 22 | return np.random.choice(self.range, size=1, p=probabilities)[0] 23 | 24 | 25 | def score(x, data: list): 26 | return data.count(x) / 200 27 | 28 | 29 | def run_example(): 30 | np.random.seed(0) 31 | data = list(np.random.choice(a=['a', 'b', 'c'], size=1000, replace=True, p=[0.5, 0.3, 0.2])) 32 | EXP = Exponential(epsilon=1, sensitivity=1, range=['a', 'b', 'c'], func_score=score) 33 | res = [EXP.exponential(data) for i in range(10000)] 34 | print(res.count('a'), res.count('b'), res.count('c')) 35 | 36 | 37 | if __name__ == '__main__': 38 | run_example() 39 | -------------------------------------------------------------------------------- /dplib/dp_mechanisms/laplace_mechanism.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-01-07 17:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | The laplace mechanism for differential privacy 9 | """ 10 | from dp_base import DPBase 11 | import numpy as np 12 | 13 | 14 | class LaplaceMechanism(DPBase): 15 | def __init__(self, epsilon, delta=0.0, sensitivity=1): 16 | self.__epsilon, self.__delta = self._check_epsilon_delta(epsilon, delta) 17 | self.__sensitivity = sensitivity 18 | self.__lap_scale = sensitivity / epsilon 19 | 20 | def randomize(self, value): 21 | value = self.__check_value(value) 22 | return value + np.random.laplace(loc=0, scale=self.__lap_scale) 23 | 24 | @staticmethod 25 | def __check_value(value): 26 | if value >= 0 or value < 0: 27 | return value 28 | raise ValueError("ERR: the input value={} is invalid.".format(value)) 29 | 30 | def get_privacy_budget(self): 31 | return self.__epsilon, self.__delta 32 | 33 | def get_sensitivity(self): 34 | return self.__sensitivity 35 | 36 | 37 | def run_example(): 38 | a = 1 39 | lap = LaplaceMechanism(epsilon=10, sensitivity=1) 40 | res = 
[lap.randomize(a) for _ in range(100)] 41 | print(np.average(res)) 42 | 43 | 44 | if __name__ == '__main__': 45 | run_example() 46 | -------------------------------------------------------------------------------- /dplib/dp_mechanisms/randomize_response_mechism.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-01-07 17:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | The randomized mechanism for differential privacy 9 | """ 10 | from dp_base import DPBase 11 | import numpy as np 12 | 13 | 14 | class RandomizedResponseMechanism(DPBase): 15 | def __init__(self, epsilon, delta=0.0, sensitivity=1, domain=(0, 0)): 16 | self.__epsilon, self.__delta = self._check_epsilon_delta(epsilon, delta) 17 | self.__domain = domain 18 | self.__p = np.e**epsilon / (np.e**epsilon + 1) 19 | 20 | def randomize(self, value): 21 | value = self.__check_value(value) 22 | # todo 23 | 24 | def __check_value(self, value): 25 | if self.__domain[0] <= value <= self.__domain[1]: 26 | return value 27 | raise ValueError("ERR: the input value={} is not in domain={}.".format(value, self.__domain)) 28 | 29 | def get_privacy_budget(self): 30 | return self.__epsilon, self.__delta 31 | 32 | 33 | def run_example(): 34 | a = 1 35 | rr = RandomizedResponseMechanism(epsilon=10) 36 | res = [rr.randomize(a) for _ in range(100)] 37 | 38 | 39 | if __name__ == '__main__': 40 | run_example() 41 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__init__.py -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__pycache__/__init__.cpython-37.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__pycache__/ldp_base.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__pycache__/ldp_base.cpython-38.pyc -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__pycache__/ldplib.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__pycache__/ldplib.cpython-37.pyc -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__pycache__/ldplib.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__pycache__/ldplib.cpython-38.pyc -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/__pycache__/meanlib.cpython-38.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/ldp_mechanisms/__pycache__/meanlib.cpython-38.pyc -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/duchi_mechanism.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-02-08 22:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | the duchi's solution towards mean estimation 9 | @ 2022.02.08 updated, add the LDPBase class 10 | """ 11 | 12 | from ldp_base import LDPBase 13 | import numpy as np 14 | 15 | 16 | class DuchiMechanism(LDPBase): 17 | def __init__(self, epsilon, domain=(-1, 1)): 18 | self._epsilon = self._check_epsilon(epsilon) 19 | self._domain = domain 20 | self._p = np.e ** epsilon / (np.e**epsilon + 1) 21 | 22 | def _check_value(self, value): 23 | if not self._domain[0] <= value <= self._domain[1]: 24 | raise ValueError("ERR: The input value={} is not in the input domain={}.".format(value, self._domain)) 25 | return value 26 | 27 | def randomize(self, value): 28 | value = self._check_value(value) 29 | # assume the domain is [a, b], the discretization and rr process is 30 | # P[y=a] = ((1-2p)*v + (ap+bp-a))/(b-a) 31 | a, b = self._domain 32 | rnd_p = ((1-2*self._p)*value + (a*self._p+b*self._p-a))/(b-a) 33 | rnd = np.random.random() 34 | value = a if rnd <= rnd_p else b 35 | 36 | # after the perturbation process, the expectation of y is 37 | # E[y] = (2p-1)x + (b+a)(1-p) 38 | # thus, adjust is needed 39 | value = (value - (b+a)*(1-self._p)) / (2*self._p-1) 40 | return value 41 | 42 | 43 | if __name__ == '__main__': 44 | domain = (100, 200) 45 | a = DuchiMechanism(epsilon=0.001, domain=domain) 46 | data = np.clip(np.random.laplace(loc=130, scale=20, size=10**5), domain[0], domain[1]) 47 | print(np.average(data)) 48 | 49 | p_data = [a.randomize(v) for v in data] 50 | 
print(max(p_data), min(p_data)) 51 | print(np.average(p_data)) 52 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/generalized_randomized_response_mechanism.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-02-07 17:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | the generalized randomized response mechanism, also known as krr 9 | @ 2022.02.08 updated, add the LDPBase class 10 | """ 11 | 12 | from ldp_base import LDPBase 13 | import numpy as np 14 | 15 | 16 | class GRR(LDPBase): 17 | def __init__(self, epsilon, domain): 18 | self._check_epsilon(epsilon) 19 | self._epsilon = epsilon 20 | 21 | # domain is the space the items belong to, e.g. domain = ['apple', 'banana', 'pear'] 22 | self._domain = domain 23 | self._k = len(domain) 24 | 25 | # high probability ph, low probability pl 26 | self._ph = np.e ** epsilon / (np.e ** epsilon + self._k - 1) 27 | self._pl = 1 / (np.e ** epsilon + self._k - 1) 28 | 29 | # used to build a fast item-to-index lookup, e.g. {'apple':0, 'banana':1, 'pear':2} 30 | self._item_index = {item: index for index, item in enumerate(domain)} 31 | 32 | def randomize(self, value): 33 | value = self._check_value(value) 34 | probability_arr = np.full(shape=self._k, fill_value=self._pl) 35 | probability_arr[self._item_index[value]] = self._ph 36 | return np.random.choice(a=self._domain, p=probability_arr) 37 | 38 | def _check_value(self, value): 39 | if value not in self._domain: 40 | raise Exception("ERR: the input value={} is not in the input domain={}.".format(value, self._domain)) 41 | return value 42 | 43 | def estimate_hist(self, randomized_values): 44 | counts = np.zeros(shape=self._k) 45 | for value in randomized_values: 46 | counts[self._item_index[value]] += 1 47 | return (counts - len(randomized_values) * self._pl) / (self._ph - self._pl) 48 | 49 | 50 | if __name__ == '__main__': 51 | domain = ['a', 'b', 'c'] 52 | epsilon = 1 53 | 
krr = GRR(epsilon, domain) 54 | print(krr._item_index) 55 | encoded_list = [] 56 | for i in range(1000): 57 | encoded_list.append(krr.randomize('c')) 58 | print(krr.estimate_hist(encoded_list)) -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/kvlib.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-11-01 10:37 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | import numpy as np 8 | import dplib.ldp_mechanisms.ldplib as ldplib 9 | 10 | 11 | def kvlist_get_baseline(kv_list: np.ndarray, discretization=False): 12 | if not isinstance(discretization, bool): 13 | raise Exception("Input type error: ", type(discretization)) 14 | f = np.average(kv_list[:, 0]) 15 | 16 | value_list = [] 17 | for kv in kv_list: 18 | if int(kv[0]) == 1 and discretization is True: 19 | value_list.append(ldplib.discretization(kv[1], lower=-1, upper=1)) 20 | elif int(kv[0]) == 1 and discretization is False: 21 | value_list.append(kv[1]) 22 | else: 23 | pass 24 | m = np.average(np.asarray(value_list)) 25 | return f, m 26 | 27 | 28 | def kvt_get_baseline(kvt: np.ndarray, discretization=False): 29 | if not isinstance(kvt, np.ndarray): 30 | raise Exception("type error of kvt: ", type(kvt)) 31 | 32 | n, d = kvt.shape[0], kvt.shape[1] 33 | f_list, m_list = np.zeros([d]), np.zeros([d]) 34 | 35 | for i in range(d): 36 | kv_list = kvt[:, i] 37 | f, m = kvlist_get_baseline(kv_list, discretization=discretization) 38 | f_list[i], m_list[i] = f, m 39 | return f_list, m_list 40 | 41 | 42 | def kv_en_privkv(kv, epsilon1, epsilon2, set_value=None): 43 | k, v = int(kv[0]), kv[1] 44 | if k == 1: 45 | k = ldplib.perturbation(value=k, perturbed_value=1-k, epsilon=epsilon1) 46 | if k == 1: 47 | discretize_v = ldplib.discretization(v, -1, 1) 48 | p_k, p_v = 1, ldplib.perturbation(value=discretize_v, perturbed_value=-discretize_v, 
epsilon=epsilon2) 49 | else: 50 | p_k, p_v = 0, 0 51 | else: 52 | k = ldplib.perturbation(value=k, perturbed_value=1 - k, epsilon=epsilon1) 53 | if k == 1: 54 | v = np.random.uniform(low=-1, high=1) if set_value is None else set_value 55 | discretize_v = ldplib.discretization(v, -1, 1) 56 | p_k, p_v = 1, ldplib.perturbation(value=discretize_v, perturbed_value=-discretize_v, epsilon=epsilon2) 57 | else: 58 | p_k, p_v = 0, 0 59 | return [p_k, p_v] 60 | 61 | 62 | def kv_de_privkv(p_kv_list: np.ndarray, epsilon_k, epsilon_v): 63 | if not isinstance(p_kv_list, np.ndarray): 64 | raise Exception("type error of p_kv_list: ", type(p_kv_list)) 65 | 66 | p1 = np.e ** epsilon_k / (1 + np.e ** epsilon_k) 67 | p2 = np.e ** epsilon_v / (1 + np.e ** epsilon_v) 68 | 69 | k_list = p_kv_list[:, 0] 70 | v_list = p_kv_list[:, 1] 71 | 72 | f = (np.average(k_list) + p1-1) / (2*p1 - 1) 73 | # the [0] is needed because np.where(condition) returns a tuple of index arrays, one per dimension 74 | n1 = len(np.where(v_list == 1)[0]) 75 | n2 = len(np.where(v_list == -1)[0]) 76 | 77 | n_all = n1 + n2 78 | n1_star = (p2-1) / (2*p2-1) * n_all + n1 / (2*p2-1) 79 | n2_star = (p2-1) / (2*p2-1) * n_all + n2 / (2*p2-1) 80 | n1_star = np.clip(n1_star, 0, n_all) 81 | n2_star = np.clip(n2_star, 0, n_all) 82 | m = (n1_star - n2_star) / n_all 83 | 84 | return f, m 85 | 86 | 87 | def kv_en_onehot(kv, epsilon): 88 | """ 89 | encode a kv into [a,b,c], where: 90 | a=1 represents if the k == 0 91 | b represents if v == -1 92 | c represents if v == 1 93 | """ 94 | k, v = int(kv[0]), kv[1] 95 | onehot = np.zeros([3]) 96 | if k == 0: 97 | onehot[0] = 1 98 | else: 99 | d_v = ldplib.discretization(v, -1, 1) 100 | if d_v == -1: 101 | onehot[1] = 1 102 | else: 103 | onehot[2] = 1 104 | return ldplib.random_response(bit_array=onehot, p=ldplib.eps2p(epsilon/2)) 105 | 106 | 107 | def kv_de_onehot(p_kv_list, epsilon): 108 | pass 109 | 110 | 111 | def kv_en_state_encoding(kv, epsilon): 112 | """ 113 | The unary 
encoding, also known as k-random response, is used in user side. It works as follows 114 | First, key value data is mapped into {0, 1, 2}. Basically, [0,0]->1; [1,-1]->0; [1,1]->2; 115 | Then the k-rr is used to report. 116 | :param kv: key value data, in which k in {0,1} and value in [-1,1] 117 | :param epsilon: privacy budget 118 | :return: the encoded key value data, the res is in {0,1,2} 119 | """ 120 | k, v = kv[0], ldplib.discretization(value=kv[1], lower=-1, upper=1) 121 | unary = k * v + 1 122 | return ldplib.k_random_response(unary, values=[0, 1, 2], epsilon=epsilon) 123 | 124 | 125 | def kv_de_state_encoding(p_kv_list: np.ndarray, epsilon): 126 | """ 127 | This is used in the server side. The server collects all the data and then use this function to calculate f and m. 128 | :param p_kv_list: the encoded kv list 129 | :param epsilon: the privacy budget 130 | :return: the estimated frequency and mean_estimation. 131 | """ 132 | if not isinstance(p_kv_list, np.ndarray): 133 | raise Exception("type error of p_kv_list: ", type(p_kv_list)) 134 | 135 | zero = len(np.where(p_kv_list == 1)[0]) # [0,0] 136 | pos = len(np.where(p_kv_list == 2)[0]) # [1,1] 137 | neg = len(np.where(p_kv_list == 0)[0]) # [1,-1] 138 | cnt_all = zero + pos + neg 139 | 140 | # adjust the true count 141 | cnt = np.asarray([zero, pos, neg]) 142 | p = np.e ** epsilon / (2 + np.e ** epsilon) 143 | 144 | est_cnt = (2 * cnt - cnt_all * (1 - p)) / (3 * p - 1) 145 | 146 | f = (est_cnt[1] + est_cnt[2]) / cnt_all 147 | m = (est_cnt[1] - est_cnt[2]) / (est_cnt[1] + est_cnt[2]) 148 | return f, m 149 | 150 | 151 | def kv_en_bisample(kv, epsilon): 152 | k, v = kv[0], kv[1] 153 | if k == 0: 154 | return np.random.binomial(1, 0.5), np.random.binomial(1, 1/(np.e**epsilon+1)) 155 | direction = np.random.binomial(1, 0.5) 156 | if direction == 0: # negative sampling 157 | probability = (1 - np.e ** epsilon) / (1 + np.e ** epsilon) * v / 2 + 0.5 158 | else: # positive sampling 159 | probability = (np.e ** 
epsilon - 1) / (np.e ** epsilon + 1) * v / 2 + 0.5 160 | return direction, np.random.binomial(1, probability) 161 | 162 | 163 | def kv_de_bisample(p_kv_list: np.ndarray, epsilon): 164 | pos_values = p_kv_list[p_kv_list[:, 0] == 1] 165 | neg_values = p_kv_list[p_kv_list[:, 0] == 0] 166 | f_pos = np.average(pos_values[:, 1]) 167 | f_neg = np.average(neg_values[:, 1]) 168 | 169 | p = ldplib.eps2p(epsilon) 170 | 171 | f = (2*p - 2 + f_pos + f_neg) / (2*p - 1) 172 | m = (f_pos - f_neg) / (f_pos + f_neg + 2*p - 2) 173 | return f, m 174 | 175 | # def kv_en_f2m(kv, epsilon_k, epsilon_v, method, set_value=0): 176 | # v = kv[1] if kv[0] == 1 else set_value 177 | # p_k = ldplib.random_response_old(bits=int(kv[0]), p=ldplib.eps2p(epsilon_k)) 178 | # p_v = method(v, epsilon_v) 179 | # return p_k, p_v 180 | # 181 | # 182 | # def kv_de_f2m(p_kv_list: np.ndarray, epsilon_k, set_value=0): 183 | # if not isinstance(p_kv_list, np.ndarray): 184 | # raise Exception("type error of p_kv_list: ", type(p_kv_list)) 185 | # f = np.average(p_kv_list[:, 0]) 186 | # p = ldplib.eps2p(epsilon=epsilon_k) 187 | # f = (p-1+f) / (2*p-1) 188 | # m_all = np.average(p_kv_list[:, 1]) 189 | # m = (m_all - (1 - f) * set_value) / f 190 | # return f, m 191 | 192 | 193 | def my_run_tst(): 194 | # initial random seed, optional 195 | # np.random.seed(10) 196 | 197 | # generate 100000 kv pairs with f=0.7 and m=0.3 198 | kv_list = [[np.random.binomial(1, 0.7), np.clip(a=np.random.normal(loc=0.3, scale=0.3), a_min=-1, a_max=1)] for _ in 199 | range(500000)] 200 | kv_list = np.asarray(kv_list) 201 | kv_list[:, 1] = kv_list[:, 1] * kv_list[:, 0] 202 | f_base, m_base = kvlist_get_baseline(kv_list=np.asarray(kv_list)) 203 | print("this is the baseline f=%.4f, m=%.4f" % (f_base, m_base)) 204 | 205 | epsilon = 0.1 206 | 207 | # the PrivKV method 208 | pirvkv_kv_list = [kv_en_privkv(kv, epsilon1=epsilon/2, epsilon2=epsilon/2) for kv in kv_list] 209 | f_privkv, m_privkv = kv_de_privkv(p_kv_list=np.asarray(pirvkv_kv_list), 
epsilon_k=epsilon / 2, epsilon_v=epsilon / 2) 210 | print("this is the privkv f=%.4f, m=%.4f" % (f_privkv, m_privkv)) 211 | 212 | # the StateEncoding method 213 | se_kv_list = [kv_en_state_encoding(kv, epsilon) for kv in kv_list] 214 | f_se, m_se = kv_de_state_encoding(p_kv_list=np.asarray(se_kv_list), epsilon=epsilon) 215 | print("this is the se f=%.4f, m=%.4f" % (f_se, m_se)) 216 | 217 | bisample_kv_list = [kv_en_bisample(kv,epsilon) for kv in kv_list] 218 | f_bisample, m_bisample = kv_de_bisample(p_kv_list=np.asarray(bisample_kv_list), epsilon=epsilon) 219 | print("this is the bi f=%.4f, m=%.4f" % (f_bisample, m_bisample)) 220 | 221 | 222 | if __name__ == '__main__': 223 | my_run_tst() 224 | 225 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/ldp_base.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-01-07 17:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | base class for local differential privacy 9 | """ 10 | import abc 11 | 12 | 13 | class LDPBase(metaclass=abc.ABCMeta): 14 | 15 | @abc.abstractmethod 16 | def randomize(self, value): 17 | """ 18 | the randomize function 19 | """ 20 | raise NotImplementedError 21 | 22 | @classmethod 23 | def _check_epsilon(cls, epsilon): 24 | if not (epsilon >= 0): 25 | raise ValueError("ERR: the range of epsilon={} is wrong.".format(epsilon)) 26 | return epsilon 27 | 28 | @abc.abstractmethod 29 | def _check_value(self, value): 30 | """ 31 | to check if the input value is valid 32 | """ 33 | raise NotImplementedError 34 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/ldplib.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-05-31 12:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 
5 | # @Software: PyCharm 6 | 7 | # 8 | import numpy as np 9 | 10 | 11 | def eps2p(epsilon, n=2): 12 | return np.e ** epsilon / (np.e ** epsilon + n - 1) 13 | 14 | 15 | def discretization(value, lower=0, upper=1): 16 | if value > upper or value < lower: 17 | raise Exception("the range of value is not valid in Function @Func: discretization") 18 | 19 | p = (value - lower) / (upper - lower) 20 | rnd = np.random.random() 21 | return upper if rnd < p else lower 22 | 23 | 24 | def perturbation(value, perturbed_value, epsilon): 25 | rnd = np.random.random() 26 | if rnd < eps2p(epsilon): 27 | return value 28 | return perturbed_value 29 | 30 | 31 | def k_random_response(value, values, epsilon): 32 | """ 33 | the k-random response 34 | :param value: current value 35 | :param values: the possible value 36 | :param epsilon: privacy budget 37 | :return: 38 | """ 39 | if not isinstance(values, list): 40 | raise Exception("The values should be list") 41 | if value not in values: 42 | raise Exception("Errors in k-random response") 43 | p = np.e ** epsilon / (np.e ** epsilon + len(values) - 1) 44 | if np.random.random() < p: 45 | return value 46 | others = [x for x in values if x != value]  # copy, so the caller's list is not mutated 47 | return others[np.random.randint(low=0, high=len(others))] 48 | 49 | 50 | def k_random_response_new(item, k, epsilon): 51 | if not item < k: 52 | raise Exception("the input domain is wrong, item = %d, k = %d." 
% (item, k)) 53 | p_l = 1 / (np.e ** epsilon + k - 1) 54 | p_h = np.e ** epsilon / (np.e ** epsilon + k - 1) 55 | respond_probability = np.full(shape=k, fill_value=p_l) 56 | respond_probability[item] = p_h 57 | perturbed_item = np.random.choice(a=range(k), p=respond_probability) 58 | return perturbed_item 59 | 60 | 61 | def random_response(bit_array: np.ndarray, p, q=None): 62 | """ 63 | :param bit_array: 64 | :param p: probability of 1->1 65 | :param q: probability of 0->1 66 | update: 2020.03.06 67 | :return: 68 | """ 69 | q = 1-p if q is None else q 70 | if isinstance(bit_array, int): 71 | probability = p if bit_array == 1 else q 72 | return np.random.binomial(n=1, p=probability) 73 | return np.where(bit_array == 1, np.random.binomial(1, p, len(bit_array)), np.random.binomial(1, q, len(bit_array))) 74 | 75 | 76 | def random_response_decode(bit_array_list: np.ndarray, p: float, q=None): 77 | q = 1-p if q is None else q 78 | n = bit_array_list.shape[0] 79 | y = np.sum(bit_array_list, axis=0) 80 | return (y - n * q) / (p-q) 81 | 82 | 83 | def unary_encoding(bit_array: np.ndarray, epsilon): 84 | """ 85 | the unary encoding, the default UE is SUE 86 | update: 2020.02.25 87 | """ 88 | if not isinstance(bit_array, np.ndarray): 89 | raise Exception("Type Err: ", type(bit_array)) 90 | return symmetric_unary_encoding(bit_array, epsilon) 91 | 92 | 93 | def symmetric_unary_encoding(bit_array: np.ndarray, epsilon): 94 | """ 95 | the SUE, the p and q is revised. 96 | update: 2021.04.20 97 | """ 98 | t = np.e ** (epsilon / 2) 99 | p = t / (t + 1) 100 | q = 1 / (t + 1) 101 | return random_response(bit_array, p, q) 102 | 103 | 104 | def optimized_unary_encoding(bit_array: np.ndarray, epsilon): 105 | """ 106 | the OUE, the p and q is revised. 
107 | update: 2021.04.20 108 | """ 109 | p = 1 / 2 110 | q = 1 / (np.e ** epsilon + 1) 111 | return random_response(bit_array, p, q) 112 | 113 | 114 | if __name__ == '__main__': 115 | pass 116 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/meanlib.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2021-10-09 10:37 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | @ 2021.10.09 merged the earlier Duchi and PM methods 9 | @ 2021.11.22 added the Laplace mechanism 10 | @ 2021.12.31 added the unified interface ValueEncoder 11 | """ 12 | 13 | import numpy as np 14 | import dplib.ldp_mechanisms.ldplib as ldplib 15 | 16 | 17 | class Duchi: 18 | def __init__(self, epsilon): 19 | self.__epsilon = epsilon 20 | self.__C = (np.e ** epsilon + 1) / (np.e ** epsilon - 1) 21 | 22 | def encode(self, v): 23 | value = ldplib.discretization(value=v, lower=-1, upper=1) 24 | value = ldplib.perturbation(value=value, perturbed_value=-value, epsilon=self.__epsilon) 25 | return self.__C * value 26 | 27 | 28 | class PiecewiseMechanism: 29 | def __init__(self, epsilon): 30 | self.__epsilon = epsilon 31 | 32 | def encode_author(self, v): 33 | """ 34 | Piecewise Mechanism, from paper: Collecting and Analyzing Multidimensional Data with Local Differential Privacy 35 | """ 36 | z = np.e ** (self.__epsilon / 2) 37 | P1 = (v + 1) / (2 + 2 * z) 38 | P2 = z / (z + 1) 39 | P3 = (1 - v) / (2 + 2 * z) 40 | 41 | C = (z + 1) / (z - 1) 42 | g1 = (C + 1) * v / 2 - (C - 1) / 2 43 | g2 = (C + 1) * v / 2 + (C - 1) / 2 44 | 45 | rnd = np.random.random() 46 | if rnd < P1: 47 | result = -C + np.random.random() * (g1 - (-C)) 48 | elif rnd < P1 + P2: 49 | result = (g2 - g1) * np.random.random() + g1 50 | else: 51 | result = (C - g2) * np.random.random() + g2 52 | return result 53 | 54 | # my implementation 55 | def encode(self, value): 56 | """ 57 | Piecewise Mechanism, from paper: Collecting and 
Analyzing Multidimensional Data with Local Differential Privacy 58 | """ 59 | C = (np.e ** (self.__epsilon / 2) + 1) / (np.e ** (self.__epsilon / 2) - 1) 60 | p = (np.e ** self.__epsilon - np.e ** (self.__epsilon / 2)) / (2 * np.e ** (self.__epsilon / 2) + 2) 61 | L = (C+1)/2 * value - (C-1)/2 62 | R = L + C - 1 63 | 64 | p_h = (p - p / (np.e ** self.__epsilon)) * (C - 1) 65 | 66 | rnd = np.random.random() 67 | if rnd <= p_h: 68 | rnd_v = np.random.uniform(L, R) 69 | else: 70 | rnd_v = np.random.uniform(-C, C) 71 | return rnd_v 72 | 73 | 74 | class Laplace: 75 | def __init__(self, epsilon): 76 | self.__epsilon = epsilon 77 | self.__laplace_scale = 2 / self.__epsilon 78 | 79 | def encode(self, v): 80 | return v + np.random.laplace(loc=0, scale=self.__laplace_scale) 81 | 82 | 83 | class ValueEncoder: 84 | """ 85 | A unified interface; new methods added later can also be called through it: 86 | @method: the encoding method 87 | @parameters_dict: the parameters of that encoding, given as a dict 88 | e.g. encoder = ValueEncoder(method='duchi', parameters_dict={'epsilon':1}) uses the duchi method with a privacy budget of 1 89 | """ 90 | def __init__(self, method, parameters_dict): 91 | self.method = None 92 | self.parameters_dict = parameters_dict 93 | if str.lower(method) == 'laplace': 94 | self.method = Laplace(self.parameters_dict['epsilon']) 95 | elif str.lower(method) == 'duchi': 96 | self.method = Duchi(self.parameters_dict['epsilon']) 97 | elif str.lower(method) == 'piecewise': 98 | self.method = PiecewiseMechanism(self.parameters_dict['epsilon']) 99 | else: 100 | raise Exception("ERR, method = %s not supported!" 
% str.lower(method)) 101 | 102 | def encode(self, v): 103 | if v > 1 or v < -1: 104 | raise Exception("ERR, input range error, v = %.2f" % v) 105 | return self.method.encode(v) 106 | 107 | 108 | if __name__ == '__main__': 109 | data = np.clip(np.random.normal(loc=0.2, scale=0.3, size=10**3), a_min=-1, a_max=1) 110 | encoder = ValueEncoder(method='duchi', parameters_dict={'epsilon': 5}) 111 | encoded_data = [encoder.encode(v) for v in data] 112 | print(np.average(data), np.average(encoded_data)) 113 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/piecewise_mechanism.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2022-02-07 17:48 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | the piecewise mechanism, 9 | from paper "Collecting and Analyzing Multidimensional Data with Local Differential Privacy" 10 | link: https://arxiv.org/abs/1907.00782 11 | 12 | @ 2022.02.08 updated, add the LDPBase class 13 | """ 14 | 15 | from ldp_base import LDPBase 16 | import numpy as np 17 | 18 | 19 | class PMBase(LDPBase): 20 | def __init__(self, epsilon): 21 | self._epsilon = self._check_epsilon(epsilon) 22 | 23 | z = np.e ** (self._epsilon / 2) 24 | self._C = (z + 1) / (z - 1) 25 | 26 | def randomize(self, value): 27 | v = self._check_value(value) 28 | z = np.e ** (self._epsilon / 2) 29 | 30 | P1 = (v + 1) / (2 + 2 * z) 31 | P2 = z / (z + 1) 32 | P3 = (1 - v) / (2 + 2 * z) 33 | 34 | g1 = (self._C + 1) * v / 2 - (self._C - 1) / 2 35 | g2 = (self._C + 1) * v / 2 + (self._C - 1) / 2 36 | 37 | rnd = np.random.random() 38 | if rnd < P1: 39 | result = -self._C + np.random.random() * (g1 - (-self._C)) 40 | elif rnd < P1 + P2: 41 | result = (g2 - g1) * np.random.random() + g1 42 | else: 43 | result = (self._C - g2) * np.random.random() + g2 44 | return result 45 | 46 | def randomize2(self, value, 
minor=1e-10): 47 | """ 48 | This method is sound in principle and easier to understand. 49 | However, when epsilon is very large (e.g. epsilon=100) it can fail, because the computed C equals 1, which in turn makes p_h = 0. 50 | The purpose of minor is to keep C-1 from becoming 0 when C=1, which would make p_h = 0. 51 | """ 52 | value = self._check_value(value) 53 | 54 | C = self._C 55 | p = (np.e ** self._epsilon - np.e ** (self._epsilon / 2)) / (2 * np.e ** (self._epsilon / 2) + 2) 56 | L = (C+1)/2 * value - (C-1)/2 57 | R = L + C - 1 58 | p_h = (p - p / (np.e ** self._epsilon)) * (C + minor - 1) 59 | 60 | rnd = np.random.random() 61 | if rnd <= p_h: 62 | rnd_v = np.random.uniform(L, R) 63 | else: 64 | rnd_v = np.random.uniform(-C, C) 65 | return rnd_v 66 | 67 | def _check_value(self, value): 68 | if not -1 <= value <= 1: 69 | raise ValueError("the input value={} is not in domain=[-1,1].".format(value)) 70 | return value 71 | 72 | 73 | class PiecewiseMechanism(LDPBase): 74 | def __init__(self, epsilon, domain): 75 | self._domain = domain 76 | self._epsilon = epsilon 77 | self._pm_encoder = PMBase(epsilon=epsilon) 78 | 79 | def _transform(self, value): 80 | """transform v in self.domain to v' in [-1,1]""" 81 | value = self._check_value(value) 82 | a, b = self._domain 83 | return (2*value - b - a) / (b - a) 84 | 85 | def _transform_T(self, value): 86 | """inverse of self._transform""" 87 | a, b = self._domain 88 | return (value * (b-a)+a+b)/2 89 | 90 | def randomize(self, value): 91 | value = self._transform(value) 92 | value = self._pm_encoder.randomize(value) 93 | value = self._transform_T(value) 94 | return value 95 | 96 | def _check_value(self, value): 97 | if not self._domain[0] <= value <= self._domain[1]: 98 | raise ValueError("the input value={} is not in domain={}".format(value, self._domain)) 99 | return value 100 | 101 | 102 | def myrun(): 103 | domain = (90, 200) 104 | # data = np.clip(np.random.laplace(loc=100, scale=20, size=10), domain[0], domain[1]) 105 | pm_encoder = PiecewiseMechanism(epsilon=100, domain=domain) 106 | a = 200 107 | print(pm_encoder.randomize(a)) 108 | 109 | 110 | if __name__ == '__main__':
111 | myrun() 112 | -------------------------------------------------------------------------------- /dplib/ldp_mechanisms/varlib.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2021-11-22 10:37 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | import numpy as np 8 | import pandas as pd 9 | import meanlib 10 | 11 | """ 12 | @ 2021.11.22 test different methods 13 | """ 14 | 15 | 16 | def tst_cmp(size_list, eps_list, mechanism_list, repeated_times=20): 17 | records = [] # DataFrame.append was removed in pandas 2.0, so collect rows in a list first 18 | for size in size_list: 19 | data = np.clip(np.random.normal(loc=0.2, scale=0.3, size=size), a_min=-1, a_max=1) 20 | # data = [-1, 0, 1] 21 | # print(data) 22 | var_base = np.var(data) 23 | for epsilon in eps_list: 24 | for mechanism in mechanism_list: 25 | # print(size, epsilon) 26 | err_list = [] 27 | for i in range(repeated_times): 28 | mech = mechanism(epsilon=epsilon) 29 | x_encode = [mech.encode(v) for v in data] 30 | # print("x_encode = ", x_encode) 31 | x2_encode = [mech.encode(v**2) for v in data] 32 | # print("x2_encode = ", x2_encode) 33 | esti_x2 = np.average(x2_encode) 34 | esti_x = np.average(x_encode) 35 | var_esti = esti_x2 - esti_x**2 36 | # print(esti_x2, esti_x) 37 | err_list.append(np.fabs(var_base - var_esti)) 38 | # print(err_list) 39 | record = {'size': size, 'epsilon': epsilon, 'mechanism': str(mechanism), 'error': np.average(err_list)} 40 | print(record) 41 | records.append(record) 42 | df = pd.DataFrame(records, columns=('size', 'epsilon', 'mechanism', 'error')) 43 | print(df) 44 | df.to_csv("varlib_res.csv", index=False) 45 | 46 | 47 | if __name__ == '__main__': 48 | size_list = [10**4, 5*10**4, 10**5, 5*10**5] 49 | eps_list = [0.1, 0.5, 1, 5, 10] 50 | mechanism_list = [meanlib.Duchi, meanlib.PiecewiseMechanism, meanlib.Laplace] 51 | tst_cmp(size_list, eps_list, mechanism_list) 52 | 53 | 54 | 
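The comparison in varlib.py estimates the variance as E[x^2] - E[x]^2, encoding both x and x^2 with the full epsilon. A minimal sketch of the same estimator follows, assuming instead that the budget is split evenly between the two reports. The names `laplace_encode` and `estimate_variance` are hypothetical and not part of the repository; the Laplace scale of 2/epsilon mirrors `meanlib.Laplace`.

```python
import numpy as np


def laplace_encode(values, epsilon, rng):
    """Add Laplace noise with scale 2/epsilon, as meanlib.Laplace does for inputs in [-1, 1]."""
    values = np.asarray(values, dtype=float)
    return values + rng.laplace(loc=0.0, scale=2.0 / epsilon, size=values.shape)


def estimate_variance(data, epsilon, rng):
    """Estimate Var[x] as E[x^2] - E[x]^2 from two noisy reports per user.

    Assumption (not in the original script): each report gets epsilon / 2,
    so the two reports together consume a total budget of epsilon.
    """
    data = np.asarray(data, dtype=float)
    noisy_x = laplace_encode(data, epsilon / 2, rng)
    noisy_x2 = laplace_encode(data ** 2, epsilon / 2, rng)
    return noisy_x2.mean() - noisy_x.mean() ** 2


if __name__ == '__main__':
    rng = np.random.default_rng(0)
    data = np.clip(rng.normal(loc=0.2, scale=0.3, size=10**5), -1, 1)
    print(np.var(data), estimate_variance(data, epsilon=1.0, rng=rng))
```

Vectorizing the noise over the whole array avoids the per-value Python loop, which matters at the larger sizes in `size_list`.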
-------------------------------------------------------------------------------- /dplib/mdlib.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/5/1 3 | # @Author : ForestNeo 4 | # @Site : forestneo.com 5 | # @Email : dr.forestneo@gmail.com 6 | # @File : mdlib.py 7 | # @Software: PyCharm 8 | # @Function: 9 | 10 | import dplib.ldp_mechanisms.ldplib as ldplib 11 | import numpy as np 12 | 13 | 14 | def remove_null_value(vl: np.ndarray): 15 | return vl[vl == vl] 16 | 17 | 18 | def generate_data(mr=0.3, size=10**5): 19 | # generate data that belongs to [-1,1] 20 | data = np.random.random_sample(size=size) * 2 - 1 21 | # generate missing values 22 | mr_flag = np.random.binomial(1, p=mr, size=size) 23 | data[mr_flag == 1] = np.nan 24 | return data 25 | 26 | 27 | def get_baseline(vl: np.ndarray): 28 | vl_val = remove_null_value(vl) 29 | return 1 - len(vl_val) / len(vl), np.average(vl_val) 30 | 31 | 32 | class BiSampleMD: 33 | def __init__(self, epsilon): 34 | self.epsilon = epsilon 35 | self.__p = np.e**epsilon / (np.e**epsilon + 1) 36 | 37 | def user_encode(self, val): 38 | if np.isnan(val): 39 | return np.random.binomial(1, p=0.5), np.random.binomial(1, p=1-self.__p) 40 | 41 | s = np.random.binomial(1, p=0.5) # sampling direction 42 | if s == 1: 43 | # positive sampling 44 | b = np.random.binomial(1, p=(np.e**self.epsilon - 1) / (np.e**self.epsilon + 1) * val / 2 + 0.5) 45 | else: 46 | # negative sampling 47 | b = np.random.binomial(1, p=(1 - np.e**self.epsilon) / (np.e**self.epsilon + 1) * val / 2 + 0.5) 48 | return s, b 49 | 50 | def aggregate_mean(self, p_val_lst): 51 | val_lst = np.asarray(p_val_lst) 52 | 53 | pos_lst = val_lst[val_lst[:, 0] == 1] 54 | neg_lst = val_lst[val_lst[:, 0] == 0] 55 | pos_val = pos_lst[:, 1] 56 | neg_val = neg_lst[:, 1] 57 | 58 | f_pos = 1.0 * sum(pos_val) / len(pos_val) 59 | f_neg = 1.0 * sum(neg_val) / len(neg_val) 60 | 61 | m = (f_pos - f_neg) / (f_pos + 
f_neg + 2*self.__p - 2) 62 | mr = (1 - f_pos - f_neg) / (2*self.__p - 1) 63 | return mr, m 64 | 65 | 66 | def my_example(): 67 | vl = generate_data(mr=0.3, size=10**5) 68 | mr, m = get_baseline(vl) 69 | print(vl) 70 | print("data size = ", len(vl)) 71 | 72 | epsilon = 1 73 | bisample = BiSampleMD(epsilon=epsilon) 74 | p_vl_lst = [bisample.user_encode(val) for val in vl] 75 | est_mr, est_m = bisample.aggregate_mean(p_vl_lst) 76 | 77 | print("true result: mr = %.6f, m = %.6f" % (mr, m)) 78 | print("estimated result: mr = %.6f, m = %.6f" % (est_mr, est_m)) 79 | 80 | 81 | if __name__ == '__main__': 82 | my_example() -------------------------------------------------------------------------------- /dplib/sunNumTools/Normalizer.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2021-10-09 10:37 3 | # @Author : ForestNeo 4 | # @Email : dr.forestneo@gmail.com 5 | # @Software: PyCharm 6 | 7 | """ 8 | This script is used for data scaling 9 | 10 | @ 2021.10.09: initialized 11 | """ 12 | 13 | 14 | class Normalizer: 15 | """ 16 | Data normalization; by default values are normalized to the interval [0,1] 17 | """ 18 | def __init__(self, in_domain, out_domain=(0, 1)): 19 | self.in_min, self.in_max = in_domain[0], in_domain[1] 20 | self.out_min, self.out_max = out_domain[0], out_domain[1] 21 | self.slope = (self.out_max - self.out_min) / (self.in_max - self.in_min) 22 | 23 | def normalize(self, v): 24 | if v > self.in_max or v < self.in_min: 25 | raise Exception("ERR: input out of range! 
input = %.2f, range = [%.2f, %.2f]" % (v, self.in_min, self.in_max)) 26 | return self.slope * v + self.out_min - self.slope * self.in_min 27 | 28 | def de_normalize(self, v): 29 | return (v - self.out_min + self.slope * self.in_min) / self.slope 30 | 31 | 32 | if __name__ == '__main__': 33 | input_domain = [0, 100] 34 | output_domain = [0, 1] 35 | 36 | normalizer = Normalizer(input_domain, output_domain) 37 | print(normalizer.normalize(50)) 38 | print(normalizer.de_normalize(0.5)) 39 | 40 | -------------------------------------------------------------------------------- /dplib/sunNumTools/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/dplib/sunNumTools/__init__.py -------------------------------------------------------------------------------- /useless.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/17 3 | # @Author : ForestNeo 4 | # @Site : forestneo.com 5 | # @Email : dr.forestneo@gmail.com 6 | # @File : useless.py.py 7 | # @Software: PyCharm 8 | # @Function: 9 | 10 | import numpy as np 11 | 12 | class A: 13 | def __init__(self, a): 14 | self.a = a 15 | 16 | 17 | class B(A): 18 | def __init__(self, b, a=1): 19 | super().__init__(a) 20 | self.b = b 21 | print(self.a) 22 | 23 | 24 | b = B(a=3, b=2) 25 | print(b.a, b.b) 26 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/forestneo/sunPytools/487e3c74c19f7c4cd1502e6fa2ecb9ccbbe4506a/utils/__init__.py -------------------------------------------------------------------------------- /utils/evaluation_matrix.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 
@Time : 2022/02/22 3 | # @Author : ForestNeo 4 | # @Site : forestneo.com 5 | # @Email : dr.forestneo@gmail.com 6 | # @Software: PyCharm 7 | # @Function: 8 | 9 | import numpy as np 10 | 11 | 12 | def fscore(true_labels, pred_labels): 13 | pass 14 | 15 | 16 | def precision(true_labels, pred_labels): 17 | pass 18 | 19 | 20 | 21 | if __name__ == '__main__': 22 | print("hello, world") 23 | --------------------------------------------------------------------------------
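`fscore` and `precision` in utils/evaluation_matrix.py are still empty stubs. One possible way to fill them in for binary labels is sketched below. The `positive` and `beta` parameters and the `recall` helper are assumptions added for illustration, and returning 0.0 when a denominator is empty is a convention chosen here, not something the repository specifies.

```python
import numpy as np


def precision(true_labels, pred_labels, positive=1):
    """Fraction of predicted positives that are truly positive."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    predicted_pos = pred_labels == positive
    if predicted_pos.sum() == 0:
        return 0.0  # nothing was predicted positive
    true_pos = np.sum(predicted_pos & (true_labels == positive))
    return float(true_pos / predicted_pos.sum())


def recall(true_labels, pred_labels, positive=1):
    """Fraction of true positives that were predicted positive (hypothetical helper)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    actual_pos = true_labels == positive
    if actual_pos.sum() == 0:
        return 0.0  # no positive examples in the ground truth
    true_pos = np.sum(actual_pos & (pred_labels == positive))
    return float(true_pos / actual_pos.sum())


def fscore(true_labels, pred_labels, positive=1, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    p = precision(true_labels, pred_labels, positive)
    r = recall(true_labels, pred_labels, positive)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == '__main__':
    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 1, 1, 0, 0]
    print(precision(y_true, y_pred), fscore(y_true, y_pred))
```

Keeping `recall` as a separate function lets `fscore` stay a thin combination of the two ratios, so each can be checked independently.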