├── README.md
├── conf
│   ├── age_hierarchy.txt
│   ├── edu_hierarchy.txt
│   ├── marital_hierarchy.txt
│   └── race_hierarchy.txt
├── data
│   ├── adult.data
│   ├── adult.names
│   ├── adult.test
│   └── old.adult.names
├── src
│   ├── differential_privacy.py
│   ├── exponential_mechanism.py
│   ├── k_anonymity.py
│   ├── kanonymity_eval.py
│   └── laplace_mechanism.py
└── utilis
    └── readdata.py

/README.md:
--------------------------------------------------------------------------------
1 | ## K-anonymity and Differential Privacy
2 | 
3 | 
4 | [TOC]
5 | 
6 | #### 1. K-anonymity
7 | 
8 | ##### 1.1 Generalization Hierarchy
9 | 
10 | The generalization hierarchies are defined in the files under the `conf` folder.
11 | 
12 | ##### 1.2 Heuristic Program
13 | 
14 | I implement the Datafly heuristic algorithm, whose pseudocode is shown below:
15 | 
16 | ![](https://ws4.sinaimg.cn/large/006tKfTcly1g0krnjtttwj30qw0dmjut.jpg)
17 | 
18 | Detailed comments for each function can be found in `k_anonymity.py`.
19 | 
20 | ##### 1.3 Evaluation
21 | 
22 | I evaluate the results for `k = [5, 10, 50, 100]` and calculate the distortion and precision. Both metrics are computed as given in the lecture, where distortion is:
23 | 
24 | ![](https://ws4.sinaimg.cn/large/006tKfTcly1g0ksdvufmpj30ij07a74y.jpg)
25 | 
26 | and precision is:
27 | 
28 | ![](https://ws4.sinaimg.cn/large/006tKfTcly1g0ksdvufmpj30ij07a74y.jpg)
29 | 
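In code terms, the two metrics reduce to the small helpers sketched below, mirroring `calc_distortion` and `calc_precision` in `k_anonymity.py` (the names `gen_levels`, `heights`, and `cell_level_sums` are illustrative, not the repo's):

```python
def distortion(gen_levels, heights):
    # mean over the QI attributes of (generalization level reached) / (DGH height)
    return sum(l / h for l, h in zip(gen_levels, heights)) / len(gen_levels)

def precision(cell_level_sums, heights, num_records):
    # cell_level_sums[i] = total generalization level applied to attribute i,
    # summed over all records; precision is 1 minus the normalized total
    total = sum(s / h for s, h in zip(cell_level_sums, heights))
    return 1 - total / (num_records * len(heights))
```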
30 | 
31 | #### 2. Differential Privacy
32 | 
33 | ##### 2.1 Laplace Mechanism
34 | 
35 | ###### 2.1.1 Query for e = 0.5 and e = 1
36 | 
37 | The query is issued `1000` times for each `e` in `[0.5, 1]`.
38 | 
39 | 
40 | ###### 2.1.2 0.5-Indistinguishable Proof
41 | 
42 | To show that the outputs of two queries are indistinguishable, I gather the outputs into 20 buckets and calculate the probability of each bucket, then take the quotient of the corresponding bucket probabilities. For two query results, `query1` and `query2`, I calculate both the probability of `query1` over that of `query2` and the probability of `query2` over that of `query1` (see the usage sketch following `laplace_mechanism.py` below). In both cases the quotient is smaller than $e^{\epsilon}$, which demonstrates $\epsilon$-indistinguishability.
43 | 
44 | 
45 | ###### 2.1.3 1-Indistinguishable Proof
46 | 
47 | The proof is the same as the 0.5-indistinguishability proof. We can see that $1$-indistinguishability holds in each case.
48 | 
49 | 
50 | 
51 | ###### 2.1.4 Distortion
52 | 
53 | To calculate the distortion, I use RMSE as the metric. First I compute the ground truth, which is the true average age over 25 without adding noise. Then I calculate the RMSE of the query results against the ground truth. When $\epsilon = 1$ the RMSE is smaller than when $\epsilon = 0.5$, which shows that the distortion for $\epsilon = 1$ is smaller than for $\epsilon = 0.5$.
54 | 
55 | 
56 | 
57 | ##### 2.2 Exponential Mechanism
58 | 
59 | ###### 2.2.1 Query for e = 0.5 and e = 1
60 | 
61 | The query is issued `1000` times for each `e` in `[0.5, 1]`.
62 | 
63 | 
64 | 
65 | ###### 2.2.2 0.5-Indistinguishable Proof
66 | 
67 | To prove $\epsilon$-indistinguishability, I first count the frequency of each education value in the query results, then calculate the probability of each education value. Indistinguishability is shown by checking that the probability quotient between two adjacent tables is smaller than $e^{\epsilon}$. We can see that it holds in each case.
68 | 
69 | 
70 | ###### 2.2.3 1-Indistinguishable Proof
71 | 
72 | The proof is the same as for $\epsilon = 0.5$. We can see that indistinguishability holds in each case.
73 | 
74 | 
75 | ###### 2.2.4 Distortion
76 | 
77 | The metric I use here is `1 - precision`, where `precision` is the number of query results that equal the ground truth divided by the total number of queries. `1 - precision` measures distortion, since higher precision implies lower distortion. When $\epsilon = 1$ the distortion is smaller than when $\epsilon = 0.5$, which shows that the distortion for $\epsilon = 1$ is smaller than for $\epsilon = 0.5$.
78 | 
79 | 
--------------------------------------------------------------------------------
/conf/age_hierarchy.txt:
--------------------------------------------------------------------------------
1 | 0,0-10,0-20,0-50,0-100
2 | 1,0-10,0-20,0-50,0-100
3 | 2,0-10,0-20,0-50,0-100
4 | 3,0-10,0-20,0-50,0-100
5 | 4,0-10,0-20,0-50,0-100
6 | 5,0-10,0-20,0-50,0-100
7 | 6,0-10,0-20,0-50,0-100
8 | 7,0-10,0-20,0-50,0-100
9 | 8,0-10,0-20,0-50,0-100
10 | 9,0-10,0-20,0-50,0-100
11 | 10,10-20,0-20,0-50,0-100
12 | 11,10-20,0-20,0-50,0-100
13 | 12,10-20,0-20,0-50,0-100
14 | 13,10-20,0-20,0-50,0-100
15 | 14,10-20,0-20,0-50,0-100
16 | 15,10-20,0-20,0-50,0-100
17 | 16,10-20,0-20,0-50,0-100
18 | 17,10-20,0-20,0-50,0-100
19 | 18,10-20,0-20,0-50,0-100
20 | 19,10-20,0-20,0-50,0-100
21 | 20,20-30,20-40,0-50,0-100
22 | 21,20-30,20-40,0-50,0-100
23 | 22,20-30,20-40,0-50,0-100
24 | 23,20-30,20-40,0-50,0-100
25 | 24,20-30,20-40,0-50,0-100
26 | 25,20-30,20-40,0-50,0-100
27 | 26,20-30,20-40,0-50,0-100
28 | 27,20-30,20-40,0-50,0-100
29 | 28,20-30,20-40,0-50,0-100
30 | 29,20-30,20-40,0-50,0-100
31 | 30,30-40,20-40,0-50,0-100
32 | 31,30-40,20-40,0-50,0-100
33 | 32,30-40,20-40,0-50,0-100
34 | 33,30-40,20-40,0-50,0-100
35 | 34,30-40,20-40,0-50,0-100
36 | 35,30-40,20-40,0-50,0-100
37 | 36,30-40,20-40,0-50,0-100
38 | 37,30-40,20-40,0-50,0-100
39 | 38,30-40,20-40,0-50,0-100
40 | 39,30-40,20-40,0-50,0-100
41 | 40,40-50,40-60,0-50,0-100
42 | 41,40-50,40-60,0-50,0-100
43 | 42,40-50,40-60,0-50,0-100
44 | 43,40-50,40-60,0-50,0-100
45 | 44,40-50,40-60,0-50,0-100
46 | 45,40-50,40-60,0-50,0-100
47 | 46,40-50,40-60,0-50,0-100
48 | 47,40-50,40-60,0-50,0-100
49 | 48,40-50,40-60,0-50,0-100
50 | 49,40-50,40-60,0-50,0-100
51 | 50,50-60,40-60,50-100,0-100
52 | 51,50-60,40-60,50-100,0-100
53 | 52,50-60,40-60,50-100,0-100
54 | 53,50-60,40-60,50-100,0-100
55 | 54,50-60,40-60,50-100,0-100
56 | 55,50-60,40-60,50-100,0-100
57 | 56,50-60,40-60,50-100,0-100
58 | 57,50-60,40-60,50-100,0-100
59 | 58,50-60,40-60,50-100,0-100
60 | 59,50-60,40-60,50-100,0-100
61 | 60,60-70,60-80,50-100,0-100
62 | 61,60-70,60-80,50-100,0-100
63 | 62,60-70,60-80,50-100,0-100
64 | 63,60-70,60-80,50-100,0-100
65 | 64,60-70,60-80,50-100,0-100
66 | 65,60-70,60-80,50-100,0-100
67 | 66,60-70,60-80,50-100,0-100
68 | 67,60-70,60-80,50-100,0-100
69 | 68,60-70,60-80,50-100,0-100
70 | 69,60-70,60-80,50-100,0-100
71 | 70,70-80,60-80,50-100,0-100
72 | 71,70-80,60-80,50-100,0-100
73 | 72,70-80,60-80,50-100,0-100
74 | 73,70-80,60-80,50-100,0-100
75 | 74,70-80,60-80,50-100,0-100
76 | 75,70-80,60-80,50-100,0-100
77 | 76,70-80,60-80,50-100,0-100
78 | 77,70-80,60-80,50-100,0-100
79 | 78,70-80,60-80,50-100,0-100
80 | 79,70-80,60-80,50-100,0-100
81 | 80,80-90,80-100,50-100,0-100
82 | 81,80-90,80-100,50-100,0-100
83 | 82,80-90,80-100,50-100,0-100
84 | 83,80-90,80-100,50-100,0-100
85 | 84,80-90,80-100,50-100,0-100
86 | 85,80-90,80-100,50-100,0-100
87 | 86,80-90,80-100,50-100,0-100
88 | 87,80-90,80-100,50-100,0-100
89 | 88,80-90,80-100,50-100,0-100
90 | 89,80-90,80-100,50-100,0-100
91 | 90,90-100,80-100,50-100,0-100
--------------------------------------------------------------------------------
/conf/edu_hierarchy.txt:
-------------------------------------------------------------------------------- 1 | Preschool,PrimarySchool,CompulsorySchool,BasicDegree,Educated 2 | 1st-4th,PrimarySchool,CompulsorySchool,BasicDegree,Educated 3 | 5th-6th,PrimarySchool,CompulsorySchool,BasicDegree,Educated 4 | 7th-8th,MiddleSchool,CompulsorySchool,BasicDegree,Educated 5 | 9th,MiddleSchool,CompulsorySchool,BasicDegree,Educated 6 | 10th,HighSchool,AdvancedSchool,BasicDegree,Educated 7 | 11th,HighSchool,AdvancedSchool,BasicDegree,Educated 8 | 12th,HighSchool,AdvancedSchool,BasicDegree,Educated 9 | HS-grad,HighSchool,AdvancedSchool,BasicDegree,Educated 10 | Assoc-voc,VocDegree,AdvancedSchool,BasicDegree,Educated 11 | Prof-school,VocDegree,AdvancedSchool,BasicDegree,Educated 12 | Some-college,VocDegree,AdvancedSchool,BasicDegree,Educated 13 | Assoc-acdm,UndergradSchool,ProfSchool,AdvancedDegree,Educated 14 | Bachelors,UndergradSchool,ProfSchool,AdvancedDegree,Educated 15 | Masters,GradSchool,ProfSchool,AdvancedDegree,Educated 16 | Doctorate,GradSchool,ProfSchool,AdvancedDegree,Educated -------------------------------------------------------------------------------- /conf/marital_hierarchy.txt: -------------------------------------------------------------------------------- 1 | Married-AF-spouse,MarriedTogether,Married,* 2 | Married-civ-spouse,MarriedTogether,Married,* 3 | Married-spouse-absent,MarriedSeparated,Married,* 4 | Separated,MarriedSeparated,Married,* 5 | Widowed,MarriedSingle,Married,* 6 | Divorced,MarriedSingle,Married,* 7 | Never-married,NeverMarried,NonMarried,* 8 | -------------------------------------------------------------------------------- /conf/race_hierarchy.txt: -------------------------------------------------------------------------------- 1 | Asian-Pac-Islander,Orient,* 2 | Black,Orient,* 3 | Other,Orient,* 4 | Amer-Indian-Eskimo,Occident,* 5 | White,Occident,* 6 | -------------------------------------------------------------------------------- /data/adult.names: -------------------------------------------------------------------------------- 1 | | This data was extracted from the census bureau database found at 2 | | http://www.census.gov/ftp/pub/DES/www/welcome.html 3 | | Donor: Ronny Kohavi and Barry Becker, 4 | | Data Mining and Visualization 5 | | Silicon Graphics. 6 | | e-mail: ronnyk@sgi.com for questions. 7 | | Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). 8 | | 48842 instances, mix of continuous and discrete (train=32561, test=16281) 9 | | 45222 if instances with unknown values are removed (train=30162, test=15060) 10 | | Duplicate or conflicting instances : 6 11 | | Class probabilities for adult.all file 12 | | Probability for the label '>50K' : 23.93% / 24.78% (without unknowns) 13 | | Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns) 14 | | 15 | | Extraction was done by Barry Becker from the 1994 Census database. A set of 16 | | reasonably clean records was extracted using the following conditions: 17 | | ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)) 18 | | 19 | | Prediction task is to determine whether a person makes over 50K 20 | | a year. 
21 | | 22 | | First cited in: 23 | | @inproceedings{kohavi-nbtree, 24 | | author={Ron Kohavi}, 25 | | title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a 26 | | Decision-Tree Hybrid}, 27 | | booktitle={Proceedings of the Second International Conference on 28 | | Knowledge Discovery and Data Mining}, 29 | | year = 1996, 30 | | pages={to appear}} 31 | | 32 | | Error Accuracy reported as follows, after removal of unknowns from 33 | | train/test sets): 34 | | C4.5 : 84.46+-0.30 35 | | Naive-Bayes: 83.88+-0.30 36 | | NBTree : 85.90+-0.28 37 | | 38 | | 39 | | Following algorithms were later run with the following error rates, 40 | | all after removal of unknowns and using the original train/test split. 41 | | All these numbers are straight runs using MLC++ with default values. 42 | | 43 | | Algorithm Error 44 | | -- ---------------- ----- 45 | | 1 C4.5 15.54 46 | | 2 C4.5-auto 14.46 47 | | 3 C4.5 rules 14.94 48 | | 4 Voted ID3 (0.6) 15.64 49 | | 5 Voted ID3 (0.8) 16.47 50 | | 6 T2 16.84 51 | | 7 1R 19.54 52 | | 8 NBTree 14.10 53 | | 9 CN2 16.00 54 | | 10 HOODG 14.82 55 | | 11 FSS Naive Bayes 14.05 56 | | 12 IDTM (Decision table) 14.46 57 | | 13 Naive-Bayes 16.12 58 | | 14 Nearest-neighbor (1) 21.42 59 | | 15 Nearest-neighbor (3) 20.35 60 | | 16 OC1 15.04 61 | | 17 Pebls Crashed. Unknown why (bounds WERE increased) 62 | | 63 | | Conversion of original data as follows: 64 | | 1. Discretized agrossincome into two ranges with threshold 50,000. 65 | | 2. Convert U.S. to US to avoid periods. 66 | | 3. Convert Unknown to "?" 67 | | 4. Run MLC++ GenCVFiles to generate data,test. 68 | | 69 | | Description of fnlwgt (final weight) 70 | | 71 | | The weights on the CPS files are controlled to independent estimates of the 72 | | civilian noninstitutional population of the US. These are prepared monthly 73 | | for us by Population Division here at the Census Bureau. We use 3 sets of 74 | | controls. 75 | | These are: 76 | | 1. A single cell estimate of the population 16+ for each state. 77 | | 2. Controls for Hispanic Origin by age and sex. 78 | | 3. Controls by Race, age and sex. 79 | | 80 | | We use all three sets of controls in our weighting program and "rake" through 81 | | them 6 times so that by the end we come back to all the controls we used. 82 | | 83 | | The term estimate refers to population totals derived from CPS by creating 84 | | "weighted tallies" of any specified socio-economic characteristics of the 85 | | population. 86 | | 87 | | People with similar demographic characteristics should have 88 | | similar weights. There is one important caveat to remember 89 | | about this statement. That is that since the CPS sample is 90 | | actually a collection of 51 state samples, each with its own 91 | | probability of selection, the statement only applies within 92 | | state. 93 | 94 | 95 | >50K, <=50K. 96 | 97 | age: continuous. 98 | workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 99 | fnlwgt: continuous. 100 | education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 101 | education-num: continuous. 102 | marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
103 | occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 104 | relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 105 | race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 106 | sex: Female, Male. 107 | capital-gain: continuous. 108 | capital-loss: continuous. 109 | hours-per-week: continuous. 110 | native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. 111 | -------------------------------------------------------------------------------- /data/old.adult.names: -------------------------------------------------------------------------------- 1 | 1. Title of Database: adult 2 | 2. Sources: 3 | (a) Original owners of database (name/phone/snail address/email address) 4 | US Census Bureau. 5 | (b) Donor of database (name/phone/snail address/email address) 6 | Ronny Kohavi and Barry Becker, 7 | Data Mining and Visualization 8 | Silicon Graphics. 9 | e-mail: ronnyk@sgi.com 10 | (c) Date received (databases may change over time without name change!) 11 | 05/19/96 12 | 3. Past Usage: 13 | (a) Complete reference of article where it was described/used 14 | @inproceedings{kohavi-nbtree, 15 | author={Ron Kohavi}, 16 | title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a 17 | Decision-Tree Hybrid}, 18 | booktitle={Proceedings of the Second International Conference on 19 | Knowledge Discovery and Data Mining}, 20 | year = 1996, 21 | pages={to appear}} 22 | (b) Indication of what attribute(s) were being predicted 23 | Salary greater or less than 50,000. 24 | (b) Indication of study's results (i.e. Is it a good domain to use?) 25 | Hard domain with a nice number of records. 26 | The following results obtained using MLC++ with default settings 27 | for the algorithms mentioned below. 28 | 29 | Algorithm Error 30 | -- ---------------- ----- 31 | 1 C4.5 15.54 32 | 2 C4.5-auto 14.46 33 | 3 C4.5 rules 14.94 34 | 4 Voted ID3 (0.6) 15.64 35 | 5 Voted ID3 (0.8) 16.47 36 | 6 T2 16.84 37 | 7 1R 19.54 38 | 8 NBTree 14.10 39 | 9 CN2 16.00 40 | 10 HOODG 14.82 41 | 11 FSS Naive Bayes 14.05 42 | 12 IDTM (Decision table) 14.46 43 | 13 Naive-Bayes 16.12 44 | 14 Nearest-neighbor (1) 21.42 45 | 15 Nearest-neighbor (3) 20.35 46 | 16 OC1 15.04 47 | 17 Pebls Crashed. Unknown why (bounds WERE increased) 48 | 49 | 4. Relevant Information Paragraph: 50 | Extraction was done by Barry Becker from the 1994 Census database. A set 51 | of reasonably clean records was extracted using the following conditions: 52 | ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)) 53 | 54 | 5. Number of Instances 55 | 48842 instances, mix of continuous and discrete (train=32561, test=16281) 56 | 45222 if instances with unknown values are removed (train=30162, test=15060) 57 | Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). 58 | 59 | 6. Number of Attributes 60 | 6 continuous, 8 nominal attributes. 61 | 62 | 7. Attribute Information: 63 | 64 | age: continuous. 
65 | workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
66 | fnlwgt: continuous.
67 | education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
68 | education-num: continuous.
69 | marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
70 | occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
71 | relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
72 | race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
73 | sex: Female, Male.
74 | capital-gain: continuous.
75 | capital-loss: continuous.
76 | hours-per-week: continuous.
77 | native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
78 | class: >50K, <=50K
79 | 
80 | 8. Missing Attribute Values:
81 | 
82 | 7% have missing values.
83 | 
84 | 9. Class Distribution:
85 | 
86 | Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
87 | Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
88 | 
89 | 
90 | 
--------------------------------------------------------------------------------
/src/differential_privacy.py:
--------------------------------------------------------------------------------
1 | import sys
2 | sys.path.append('../')
3 | from utilis.readdata import *
4 | import laplace_mechanism
5 | import exponential_mechanism
6 | import math
7 | 
8 | 
9 | def evaluate_laplace_mechanism(eps = [0.5, 1]):
10 |     """
11 |     Evaluate for Laplace Mechanism
12 |     """
13 |     recordsv0 = readdata()
14 |     recordsv1, recordsv2, recordsv3 = generate_data_for_laplace_mechanism(recordsv0)
15 | 
16 |     res1000 = {e:[] for e in eps}
17 |     # res4000 = {e: [] for e in eps}
18 |     rmse = {e: 0 for e in eps}
19 | 
20 | 
21 |     """
22 |     evaluate for epsilon = 0.5 and 1 for 1000 queries
23 |     """
24 |     printsent = ['original data', 'data removed a record with the oldest age',
25 |                  'data removed a record with age 26', 'data removed a record with the youngest age']
26 |     i = 0
27 |     for records in (recordsv0, recordsv1, recordsv2, recordsv3):
28 |         print('############ Processing for {} ############'.format(printsent[i]))
29 |         i += 1
30 |         LampMec = laplace_mechanism.LaplaceMechanism(records)
31 |         for e in eps:
32 |             print('query 1000 results with epsilon = {}'.format(e))
33 |             res1000[e].append(LampMec.query_with_dp(e, querynum=1000))
34 |             # res4000[e].append(LampMec.query_with_dp(e, querynum=4000))
35 |             rmse[e] = LampMec.calc_distortion(
36 |                 LampMec.query_with_dp(e, querynum=4000))
37 | 
38 |     print('\n')
39 |     for e in eps:
40 |         print('############ Prove {}-indistinguishability'.format(e))
41 |         for i in range(1, 4):
42 |             tmpresij, tmpresji = laplace_mechanism.prove_indistinguishable(
43 |                 res1000[e][0], res1000[e][i])
44 |             print('** {} ** OVER ** {} **:'.format(printsent[0], printsent[i]))
45 |             print(tmpresij)
46 |             print('** {} ** OVER ** {} **:'.format(printsent[i], printsent[0]))
47 |             print(tmpresji)
48 |         print('exp^e = {}'.format(math.exp(e)))
49 |         print('\n')
50 | 
51 |     print('############ Measure the distortion (RMSE) ############')
52 |     for e in eps:
53 |         print('RMSE for e = {}: {}'.format(e, rmse[e]))
54 |     print('Distortion of e=1 is smaller than e=0.5 ?: ', rmse[1] <= rmse[0.5])
55 |     del recordsv0
56 |     del recordsv1
57 |     del recordsv2
58 |     del recordsv3
59 | 
60 | 
61 | 
62 | 
63 | 
64 | def evaluate_exponential_mechanism(eps=[0.5,1]):
65 |     """
66 |     Evaluate for Exponential Mechanism
67 |     """
68 | 
69 |     recordsv0 = readdata()
70 |     recordsv1, recordsv2, recordsv3 = generate_data_for_exponential_mechanism(recordsv0)
71 | 
72 |     res1000 = {e:[] for e in eps}
73 |     # res4000 = {e: [] for e in eps}
74 |     dist = {e: 0 for e in eps}
75 | 
76 | 
77 |     """
78 |     evaluate for epsilon = 0.5 and 1 for 1000 queries
79 |     """
80 |     printsent = ['original data', 'data removed a record with the most frequent education',
81 |                  'data removed a record with the second most frequent education',
82 |                  'data removed a record with the least frequent education']
83 |     i = 0
84 |     for records in (recordsv0, recordsv1, recordsv2, recordsv3):
85 |         print('############ Processing for {} ############'.format(printsent[i]))
86 |         i += 1
87 |         ExpMe = exponential_mechanism.ExponentialMechanism(records)
88 |         for e in eps:
89 |             print('query 1000 results with epsilon = {}'.format(e))
90 |             res1000[e].append(ExpMe.query_with_dp(e, querynum=1000))
91 |             # res4000[e].append(ExpMe.query_with_dp(e, querynum=4000))
92 |             dist[e] = ExpMe.calc_distortion(
93 |                 ExpMe.query_with_dp(e, querynum=4000))
94 | 
95 |     print('\n')
96 |     for e in eps:
97 |         print('############ Prove {}-indistinguishability'.format(e))
98 |         for i in range(1, 4):
99 |             tmpresij, tmpresji = exponential_mechanism.prove_indistinguishable(
100 |                 res1000[e][0], res1000[e][i])
101 |             print('** {} ** OVER ** {} **:'.format(printsent[0], printsent[i]))
102 |             print(tmpresij)
103 |             print('** {} ** OVER ** {} **:'.format(printsent[i], printsent[0]))
104 |             print(tmpresji)
105 |         print('exp^e = {}'.format(math.exp(e)))
106 |         print('\n')
107 | 
108 |     print('############ Measure the distortion (1-precision) ############')
109 |     for e in eps:
110 |         print('distortion for e = {}: {}'.format(e, dist[e]))
111 |     print('Distortion of e=1 is smaller than e=0.5 ?: ', dist[1] <= dist[0.5])
112 | 
113 | 
114 | 
115 | 
116 | 
117 | if __name__ == "__main__":
118 |     print("############################### Laplace Mechanism ###############################")
119 |     evaluate_laplace_mechanism()
120 |     print('\n')
121 |     print("############################### Exponential Mechanism ###############################")
122 |     evaluate_exponential_mechanism()
123 | 
124 | 
125 | 
--------------------------------------------------------------------------------
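For reference before reading `exponential_mechanism.py`: the textbook exponential mechanism (McSherry–Talwar) samples a candidate `r` with probability proportional to `exp(e * u(r) / (2 * sensitivity))`, whereas the implementation below draws exponential random variables scaled by `e * u / (2 * sensitivity)` and normalizes them into sampling weights. A minimal sketch of the textbook form, assuming `freq` maps each education value to its relative frequency (the score `u`); names here are illustrative, not the repo's:

```python
import numpy as np

def exponential_mechanism(freq, epsilon, sensitivity=1.0):
    # weight each candidate by exp(eps * score / (2 * sensitivity)),
    # then sample one candidate proportionally to its weight
    candidates = list(freq)
    scores = np.array([freq[c] for c in candidates])
    weights = np.exp(epsilon * scores / (2 * sensitivity))
    probs = weights / weights.sum()
    return np.random.choice(candidates, p=probs)
```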
/src/exponential_mechanism.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from utilis.readdata import *
3 | from collections import Counter
4 | import math
5 | 
6 | class ExponentialMechanism():
7 |     """
8 |     exponential mechanism
9 | 
10 |     """
11 | 
12 |     def __init__(self, records):
13 |         self.records = records
14 |         self.s = self.__calculate_sensitivity()
15 |         self.__count_education_nums_prop()
16 | 
17 |     def __calculate_sensitivity(self):
18 |         """
19 |         calculate the sensitivity
20 |         as the score function is #members, the sensitivity is 1
21 | 
22 |         Returns:
23 |             [int] -- [sensitivity]
24 |         """
25 |         return 1
26 | 
27 | 
28 |     def __count_education_nums_prop(self):
29 |         """
30 |         calculate the count and relative frequency of each education value
31 |         """
32 | 
33 |         self.educnt = {}
34 |         eduidx = ATTNAME.index('education')
35 |         for record in self.records:
36 |             self.educnt[record[eduidx]] = self.educnt.get(record[eduidx], 0) + 1
37 |         self.eduprop = {}
38 |         for key, val in self.educnt.items():
39 |             self.eduprop[key] = val / len(self.records)
40 | 
41 |     def __exponential(self, u, e):
42 |         """
43 |         draw an exponentially distributed random weight with mean e * u / (2 * sensitivity)
44 | 
45 |         Arguments:
46 |             u {[float]} -- [score (relative frequency) of a candidate]
47 |             e {[float]} -- [epsilon]
48 | 
49 |         Returns:
50 |             [float] -- [random sampling weight]
51 |         """
52 | 
53 |         return np.random.exponential(e * u / (2*self.s))
54 | 
55 |     def query_with_dp(self, e = 1, querynum = 1000):
56 |         """
57 |         query with Exponential Mechanism
58 | 
59 |         Keyword Arguments:
60 |             e {float} -- [epsilon] (default: {1})
61 |             querynum {int} -- [number of queries] (default: {1000})
62 | 
63 |         Returns:
64 |             [list] -- [list of query results]
65 |         """
66 | 
67 |         # candidate = list(self.educnt.keys())
68 |         # candidatefreq = [self.educnt[k] for k in candidate]
69 |         candidate = list(self.eduprop.keys())
70 |         # print(candidate)
71 |         # print([self.educnt[k] for k in candidate ])
72 |         candidatefreq = [self.eduprop[k] for k in candidate]
73 |         res = []
74 |         for _ in range(querynum):
75 |             weights = [self.__exponential(freq, e) for freq in candidatefreq]
76 |             weights = [w/sum(weights) for w in weights]
77 |             # print(weights)
78 |             res.append(np.random.choice(candidate, p=weights))
79 |         return res
80 | 
81 | 
82 |     def calc_groundtruth(self):
83 |         """
84 |         calculate the groundtruth
85 |         the most frequent education value
86 | 
87 |         Returns:
88 |             [string] -- [most frequent education value]
89 |         """
90 | 
91 |         eduidx = ATTNAME.index('education')
92 |         return Counter([record[eduidx] for record in self.records if record[eduidx] != '*']).most_common(1)[0][0]
93 | 
94 |     def calc_distortion(self, queryres):
95 |         """
96 |         calculate the distortion
97 | 
98 |         Arguments:
99 |             queryres {[list]} -- [query result]
100 | 
101 |         Returns:
102 |             [float] -- [distortion]
103 |         """
104 | 
105 |         return 1 - Counter(queryres)[self.calc_groundtruth()]/len(queryres)
106 | 
107 | 
108 | def prove_indistinguishable(queryres1, queryres2):
109 |     """
110 |     prove indistinguishability of two query results
111 | 
112 |     Arguments:
113 |         queryres1 {[list]} -- [query 1 result]
114 |         queryres2 {[list]} -- [query 2 result]
115 | 
116 |     Returns:
117 |         [tuple] -- [averaged probability quotients, query1/query2 and query2/query1]
118 |     """
119 | 
120 |     prob1 = Counter(queryres1)
121 |     for key in prob1:
122 |         prob1[key] /= len(queryres1)
123 |     prob2 = Counter(queryres2)
124 |     for key in prob2:
125 |         prob2[key] /= len(queryres2)
126 |     res = 0
127 |     num = 0
128 |     for key in prob1:
129 |         if key not in prob2:
130 |             print('no query result {} in query 2'.format(key))
131 |             continue
132 |         res += prob1[key] / prob2[key]
133 |         num += 1
134 |     res1overres2 = res/num
135 |     res = 0
136 |     num = 0
137 |     for key in prob2:
138 |         if key not in prob1:
139 |             print('no query result {} in query 1'.format(key))
140 |             continue
141 |         res += prob2[key] / prob1[key]
142 |         num += 1
143 |     res2overres1 = res/num
144 |     return res1overres2, res2overres1
145 | 
146 | 
147 | 
148 | if __name__ == "__main__":
149 |     records = readdata()
150 |     ExpMe = ExponentialMechanism(records)
151 |     res1 = ExpMe.query_with_dp(0.05, 1000)
152 |     # res2 = ExpMe.query_with_dp(0.05, 1000)
153 |     v1, v2, v3 = generate_data_for_exponential_mechanism(records)
154 |     ExpMe2 = ExponentialMechanism(v1)
155 |     res2 = ExpMe2.query_with_dp(0.05, 1000)
156 |     # print(res1)
157 |     print(ExpMe.calc_distortion(res1))
158 |     print(ExpMe.calc_distortion(ExpMe.query_with_dp(1, 1000)))
159 |     print(ExpMe.calc_distortion(res2))
160 |     print(prove_indistinguishable(res1, res2))
161 |     print(prove_indistinguishable(res2, res1))
162 |     print(math.exp(0.05))
163 | 
--------------------------------------------------------------------------------
/src/k_anonymity.py:
--------------------------------------------------------------------------------
1 | from utilis.readdata import *
2 | 
3 | class KAnonymity():
4 |     def __init__(self, records):
5 |         self.records = records
6 |         self.confile = [AGECONFFILE, EDUCONFFILE, MARITALCONFFILE, RACECONFFILE]
7 | 
8 |     def anonymize(self, qi_names=['age', 'education', 'marital-status', 'race'], k=5):
9 |         """
10 |         anonymizer for k-anonymity
11 | 
12 |         Keyword Arguments:
13 |             qi_names {list} -- [qi names] (default: {['age', 'education', 'marital-status', 'race']})
14 |             k {int} -- [value for k] (default: {5})
15 |         """
16 | 
17 |         domains, gen_levels = {}, {}
18 |         qi_frequency = {} # store the frequency for each qi value
19 |         # record_att_gen_levels = [[0 for _ in range(len(qi_names))] for _ in range(len(self.records))]
20 | 
21 |         assert len(self.confile) == len(qi_names), 'number of config files not equal to number of QI-names'
22 |         generalize_tree = dict()
23 |         for idx, name in enumerate(qi_names):
24 |             generalize_tree[name] = Tree(self.confile[idx])
25 | 
26 |         for qiname in qi_names:
27 |             domains[qiname] = set()
28 |             gen_levels[qiname] = 0
29 | 
30 |         for idx, record in enumerate(self.records):
31 |             qi_sequence = self._get_qi_values(record[:], qi_names, generalize_tree)
32 | 
33 |             if qi_sequence in qi_frequency:
34 |                 qi_frequency[qi_sequence].add(idx)
35 |             else:
36 |                 qi_frequency[qi_sequence] = {idx}
37 |                 for j, value in enumerate(qi_sequence):
38 |                     domains[qi_names[j]].add(value)
39 | 
40 |         # iteratively generalize the attribute with the most distinct values
41 |         while True:
42 |             # count number of records not satisfying k-anonymity
43 |             negcount = 0
44 |             for qi_sequence, idxset in qi_frequency.items():
45 |                 if len(idxset) < k:
46 |                     negcount += len(idxset)
47 | 
48 |             if negcount > k:
49 |                 # continue generalization, since more than k records do not satisfy k-anonymity
50 |                 most_freq_att_num, most_freq_att_name = -1, None
51 |                 for qiname in qi_names:
52 |                     if len(domains[qiname]) > most_freq_att_num:
53 |                         most_freq_att_num = len(domains[qiname])
54 |                         most_freq_att_name = qiname
55 | 
56 |                 # pick the attribute with the most distinct values
57 |                 generalize_att = most_freq_att_name
58 |                 qi_index = qi_names.index(generalize_att)
59 |                 domains[generalize_att] = set()
60 | 
61 |                 # generalize that attribute one level higher
62 |                 for qi_sequence in list(qi_frequency.keys()):
63 |                     new_qi_sequence = list(qi_sequence)
64 |                     new_qi_sequence[qi_index] = generalize_tree[generalize_att].root[qi_sequence[qi_index]][0]
65 |                     new_qi_sequence = tuple(new_qi_sequence)
66 | 
67 |                     if new_qi_sequence in qi_frequency:
68 |                         qi_frequency[new_qi_sequence].update(
69 |                             qi_frequency[qi_sequence])
70 |                         qi_frequency.pop(qi_sequence, 0)
71 |                     else:
72 |                         qi_frequency[new_qi_sequence] = qi_frequency.pop(qi_sequence)
73 | 
74 |                     domains[generalize_att].add(new_qi_sequence[qi_index])
75 | 
76 |                 gen_levels[generalize_att] += 1
77 | 
78 | 
79 |             else:
80 |                 # end the while loop
81 |                 # suppress sequences not satisfying k-anonymity
82 |                 # save results and calculate distortion and precision
83 |                 genlvl_att = [0 for _ in range(len(qi_names))]
84 |                 dgh_att = [generalize_tree[name].level for name in qi_names]
85 |                 datasize = 0
86 |                 qiindex = [ATTNAME.index(name) for name in qi_names]
87 | 
88 |                 # make sure the output file keeps the same order as the original data file
89 |                 towriterecords = [None for _ in range(len(self.records))]
90 |                 with open('../data/adult_%d_kanonymity.data' %k, 'w') as wf:
91 |                     for qi_sequence, recordidxs in qi_frequency.items():
92 |                         if len(recordidxs) < k:
93 |                             continue
94 |                         for idx in recordidxs:
95 |                             record = self.records[idx][:]
96 |                             for i in range(len(qiindex)):
97 |                                 record[qiindex[i]] = qi_sequence[i]
98 |                                 genlvl_att[i] += generalize_tree[qi_names[i]].root[qi_sequence[i]][1]
99 |                             record = list(map(str, record))
100 |                             for i in range(len(record)):
101 |                                 if record[i] == '*' and i not in qiindex:
102 |                                     record[i] = '?'
103 |                             towriterecords[idx] = record[:]
104 |                             # wf.write(', '.join(record))
105 |                             # wf.write('\n')
106 |                         datasize += len(recordidxs)
107 |                     for record in towriterecords:
108 |                         if record is not None:
109 |                             wf.write(', '.join(record))
110 |                             wf.write('\n')
111 |                         else:
112 |                             wf.write('\n')
113 | 
114 |                 print('qi names: ', qi_names)
115 |                 # precision = self.calc_precision(genlvl_att, dgh_att, datasize, len(qi_names))
116 |                 precision = self.calc_precision(genlvl_att, dgh_att, len(self.records), len(qi_names))
117 |                 distortion = self.calc_distortion([gen_levels[qi_names[i]] for i in range(len(qi_names))], dgh_att, len(qi_names))
118 |                 print('precision: {}, distortion: {}'.format(precision, distortion))
119 |                 break
120 | 
121 | 
122 |     def calc_precision(self, genlvl_att, dgh_att, datasize, attsize = 4):
123 |         """
124 |         calculate the precision over the generalized values of each attribute
125 | 
126 |         Arguments:
127 |             genlvl_att {[list]} -- [sum of generalization levels of each attribute]
128 |             dgh_att {[list]} -- [maximum height of each attribute]
129 |             datasize {[int]} -- [data size]
130 | 
131 |         Keyword Arguments:
132 |             attsize {int} -- [number of qi attributes] (default: {4})
133 | 
134 |         Returns:
135 |             [float] -- [precision of the generalization]
136 |         """
137 | 
138 |         return 1 - sum([genlvl_att[i] / dgh_att[i] for i in range(attsize)])/(datasize*attsize)
139 | 
140 | 
141 |     def calc_distortion(self, gen_levels_att, dgh_att, attsize):
142 |         """
143 |         calculate the distortion from the generalization level of each attribute
144 | 
145 |         Arguments:
146 |             gen_levels_att {[list]} -- [generalization level reached by each attribute]
147 |             dgh_att {[list]} -- [maximum height of each attribute's DGH tree]
148 |             attsize {[int]} -- [number of qi attributes]
149 | 
150 |         Returns:
151 |             [float] -- [distortion of the generalization]
152 |         """
153 | 
154 |         print('attribute gen level:', gen_levels_att)
155 |         print('tree height:', dgh_att)
156 |         return sum([gen_levels_att[i] / dgh_att[i] for i in range(attsize)]) / attsize
157 | 
158 | 
159 |     def _get_qi_values(self, record, qi_names, generalize_tree):
160 |         """
161 |         private method
162 |         get qi values from one record
163 | 
164 |         Arguments:
165 |             record {[list]} -- [one record]
166 |             qi_names {[list]} -- [qi names]
167 |             generalize_tree {[dict]} -- [dict storing the DGH trees]
168 | 
169 |         Returns:
170 |             [tuple] -- [qi tuple value]
171 |         """
172 | 
173 |         qi_index = [ATTNAME.index(name) for name in qi_names]
174 |         seq = []
175 |         for i, idx in enumerate(qi_index):
176 |             if idx == ATTNAME.index('age'):
177 |                 if record[idx] == -1:
178 |                     seq.append('0-100')
179 |                 else:
180 |                     seq.append(str(record[idx]))
181 |             else:
182 |                 if record[idx] == '*':
183 |                     # missing value: generalize directly to the hierarchy root
184 |                     record[idx] = generalize_tree[qi_names[i]].highestgen
185 |                 seq.append(record[idx])
186 |         return tuple(seq)
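# (Added note) After a Tree is built from a hierarchy config file, tree.root maps
# every value to a (parent, level) pair, where 'parent' is the next more general
# value (None at the hierarchy root) and 'level' counts the generalization steps
# applied so far (0 for a raw value). For example, with marital_hierarchy.txt:
#   tree.root['Divorced']      == ('MarriedSingle', 0)
#   tree.root['MarriedSingle'] == ('Married', 1)
#   tree.root['Married']       == ('*', 2)
#   tree.root['*']             == (None, 3)
# so generalizing a value one level up is a single lookup: tree.root[value][0].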
187 | 
188 | 
189 | 
190 | 
191 | class Tree:
192 |     """
193 |     Tree class
194 |     builds the DGH tree; keeps track of each node's parent and its generalization level
195 |     """
196 | 
197 |     def __init__(self, confile):
198 |         self.confile = confile
199 |         self.root = dict()
200 |         self.level = -1
201 |         self.highestgen = ''
202 |         self.buildTree()
203 | 
204 | 
205 |     def buildTree(self):
206 |         """
207 |         build the DGH tree from the config file
208 |         """
209 | 
210 |         with open(self.confile, 'r') as rf:
211 |             for line in rf:
212 |                 line = line.strip()
213 |                 if not line:
214 |                     continue
215 |                 line = [col.strip() for col in line.split(',')]
216 |                 height = len(line)-1
217 |                 if self.level == -1:
218 |                     self.level = height
219 |                 if not self.highestgen:
220 |                     self.highestgen = line[-1]
221 |                 pre = None
222 |                 for idx, val in enumerate(line[::-1]):
223 |                     self.root[val] = (pre, height-idx)
224 |                     pre = val
225 | 
226 | 
227 | if __name__ == "__main__":
228 |     records = readdata()
229 |     KAnony = KAnonymity(records)
230 |     KAnony.anonymize(k = 100)
231 | 
232 | 
233 | 
234 | 
--------------------------------------------------------------------------------
/src/kanonymity_eval.py:
--------------------------------------------------------------------------------
1 | import sys
2 | sys.path.append('../')
3 | from utilis.readdata import *
4 | from k_anonymity import KAnonymity
5 | 
6 | def main():
7 |     records = readdata()
8 |     K = [5, 10, 50, 100]
9 |     KAnony = KAnonymity(records)
10 |     for k in K:
11 |         print('############# k-anonymity for k={} #############: \n'.format(k))
12 |         KAnony.anonymize(k=k)
13 |         print('\n')
14 | 
15 | 
16 | if __name__ == "__main__":
17 |     main()
18 | 
--------------------------------------------------------------------------------
/src/laplace_mechanism.py:
--------------------------------------------------------------------------------
1 | from utilis.readdata import *
2 | import numpy as np
3 | import math
4 | 
5 | 
6 | class LaplaceMechanism():
7 |     def __init__(self, records):
8 |         self.records = records
9 |         self.s = self.__calculate_sensitivity()
10 |         # print(self.s)
11 | 
12 |     def __calculate_sensitivity(self):
13 |         """
14 |         calculate the sensitivity of the average-age query
15 |         it is the oldest age / number of records with age > 25
16 | 
17 |         Returns:
18 |             [float] -- [sensitivity]
19 |         """
20 | 
21 |         num, oldage = 0, -float('inf')
22 |         ageidx = ATTNAME.index('age')
23 |         for record in self.records:
24 |             if record[ageidx] > 25:
25 |                 num += 1
26 |                 if record[ageidx] > oldage:
27 |                     oldage = record[ageidx]
28 |         return oldage / num
29 | 
30 |     def __laplacian_noise(self, e):
31 |         """
32 |         draw Laplacian noise centered at 0 with scale sensitivity / epsilon
33 |         """
34 | 
35 |         return np.random.laplace(0, self.s/e)
36 | 
37 |     def query_with_dp(self, e = 1, querynum=1000):
38 |         """
39 |         query the average age above 25 with the Laplace Mechanism
40 | 
41 |         Keyword Arguments:
42 |             e {float} -- [epsilon] (default: {1})
43 |             querynum {int} -- [number of queries] (default: {1000})
44 | 
45 |         Returns:
46 |             [list] -- [randomized query results]
47 |         """
48 | 
49 |         ageidx = ATTNAME.index('age')
50 |         agegt25 = [record[ageidx]
51 |                    for record in self.records if record[ageidx] > 25]
52 |         avgage = sum(agegt25) / len(agegt25)
53 | 
54 |         res = []
55 |         for _ in range(querynum):
56 |             res.append(round(avgage + self.__laplacian_noise(e), 2))
57 |         return res
58 | 
59 |     def calc_groundtruth(self):
60 |         """
61 |         calculate the true average age above 25 without adding noise
62 | 
63 |         Returns:
64 |             [float] -- [true average age greater than 25]
65 |         """
66 | 
67 |         agesum = 0
68 |         num = 0
69 |         ageidx = ATTNAME.index('age')
70 |         for record in self.records:
71 |             if record[ageidx] > 25:
72 |                 agesum += record[ageidx]
73 |                 num += 1
74 |         return round(agesum / num, 2)
75 | 
76 | 
77 |     def calc_distortion(self, queryres):
78 |         """
79 |         calculate the distortion
80 |         use RMSE here
81 | 
82 |         Arguments:
83 |             queryres {[list]} -- [query result]
84 | 
85 |         Returns:
86 |             [float] -- [rmse value]
87 |         """
88 | 
89 |         groundtruth = self.calc_groundtruth()
90 |         rmse = (sum((res - groundtruth)**2 for res in queryres) / len(queryres))**(1/2)
91 |         return rmse
92 | 
93 | def prove_indistinguishable(queryres1, queryres2, bucketnum = 20):
94 |     """
95 |     prove indistinguishability of two query results
96 | 
97 |     Arguments:
98 |         queryres1 {[list]} -- [query 1 result]
99 |         queryres2 {[list]} -- [query 2 result]
100 | 
101 |     Keyword Arguments:
102 |         bucketnum {int} -- [number of buckets used to calculate the probabilities] (default: {20})
103 | 
104 |     Returns:
105 |         [tuple] -- [averaged probability quotients, query1/query2 and query2/query1]
106 |     """
107 | 
108 |     maxval = max(max(queryres1), max(queryres2))
109 |     minval = min(min(queryres1), min(queryres2))
110 |     count1 = [0 for _ in range(bucketnum)]
111 |     count2 = [0 for _ in range(bucketnum)]
112 |     for val1, val2 in zip(queryres1, queryres2):
113 |         count1[math.floor((val1-minval+1)/((maxval-minval+1)/bucketnum))-1] += 1
114 |         count2[math.floor((val2-minval+1)/((maxval-minval+1)/bucketnum))-1] += 1
115 |     prob1 = list(map(lambda x: x/len(queryres1), count1))
116 |     prob2 = list(map(lambda x: x/len(queryres2), count2))
117 | 
118 |     res1overres2 = sum(p1 / p2 for p1, p2 in zip(prob1, prob2) if p2 != 0) / bucketnum
119 |     res2overres1 = sum(p2 / p1 for p1, p2 in zip(prob1, prob2) if p1 != 0) / bucketnum
120 |     return res1overres2, res2overres1
121 | 
122 | 
123 | if __name__ == "__main__":
124 |     records = readdata()
125 |     v1, v2, v3 = generate_data_for_laplace_mechanism(records)
126 |     LapMe = LaplaceMechanism(records)
127 |     res1 = LapMe.query_with_dp(0.5, 1000)
128 |     # print(res1)
129 |     # print(LapMe.calc_groundtruth())
130 |     print(LapMe.calc_distortion(LapMe.query_with_dp(1, 1000)))
131 |     LapMe2 = LaplaceMechanism(v1)
132 |     res2 = LapMe2.query_with_dp(0.5, 1000)
133 |     print(LapMe.calc_distortion(res1))
134 |     print(LapMe2.calc_distortion(res2))
135 |     print(prove_indistinguishable(res1, res2))
136 |     # print(prove_indistinguishable(res2, res1))
137 |     print(math.exp(0.5))
138 | 
--------------------------------------------------------------------------------
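To illustrate the bucket-based check described in README section 2.1.2, a minimal usage sketch of `prove_indistinguishable` above (run from within `src/`, since the code uses relative paths like `'../data'`; `0.5` is the ε being tested):

```python
import sys
import math
sys.path.append('../')

from utilis.readdata import readdata, generate_data_for_laplace_mechanism
from laplace_mechanism import LaplaceMechanism, prove_indistinguishable

records = readdata()
neighbor, _, _ = generate_data_for_laplace_mechanism(records)

# 1000 noisy answers on the original data and on an adjacent dataset
res1 = LaplaceMechanism(records).query_with_dp(0.5, querynum=1000)
res2 = LaplaceMechanism(neighbor).query_with_dp(0.5, querynum=1000)

# both averaged bucket quotients should stay below exp(epsilon)
q12, q21 = prove_indistinguishable(res1, res2)
print(q12, q21, math.exp(0.5))
```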
/utilis/readdata.py:
--------------------------------------------------------------------------------
1 | import pandas
2 | import os
3 | import random
4 | 
5 | ATTNAME = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital-status',
6 |            'occupation', 'relationship', 'race', 'sex',
7 |            'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'class']
8 | 
9 | AGECONFFILE = '../conf/age_hierarchy.txt'
10 | EDUCONFFILE = '../conf/edu_hierarchy.txt'
11 | MARITALCONFFILE = '../conf/marital_hierarchy.txt'
12 | RACECONFFILE = '../conf/race_hierarchy.txt'
13 | 
14 | 
15 | def readdata(filepath='../data', filename='adult.data'):
16 |     records = []
17 |     intidx = [ATTNAME.index(colname) for colname in (
18 |         'age', 'fnlwgt', 'education_num', 'capital-gain', 'capital-loss', 'hours-per-week')]
19 |     try:
20 |         with open(os.path.join(filepath, filename), 'r') as rf:
21 |             for line in rf:
22 |                 line = line.strip()
23 |                 if not line:
24 |                     continue
25 |                 line = [a.strip() for a in line.split(',')]
26 |                 # print(line)
27 |                 for idx in intidx:
28 |                     try:
29 |                         line[idx] = int(line[idx])
30 |                     except ValueError:
31 |                         print('attribute %s, value %s, cannot be converted to a number' %(ATTNAME[idx], line[idx]))
32 |                         line[idx] = -1
33 |                 for idx in range(len(line)):
34 |                     if line[idx] == '' or line[idx] == '?':
35 |                         line[idx] = '*'
36 |                 records.append(line)
37 |         return records
38 |     except IOError:
39 |         print('cannot open file: %s:%s' %(filepath, filename))
40 | 
41 | 
42 | def generate_data_for_laplace_mechanism(records):
43 |     """
44 |     generate three adjacent versions of the dataset for the Laplace Mechanism
45 | 
46 |     Arguments:
47 |         records {[list of list]} -- [original records of the adult dataset]
48 | 
49 |     Returns:
50 |         three adjacent datasets, each with one record removed
51 |         (the oldest, one aged 26, and the youngest, respectively)
52 |     """
53 | 
54 |     oldestidx, twentysixidx, youngestidx = -1, -1, -1
55 |     oldest, youngest = -float('inf'), float('inf')
56 |     ageidx = ATTNAME.index('age')
57 |     for idx, record in enumerate(records):
58 |         """
59 |         age == -1 means the value is missing in the dataset
60 |         """
61 |         if record[ageidx] == -1:
62 |             continue
63 |         if record[ageidx] >= oldest:
64 |             if record[ageidx] != oldest or random.random() >= 0.5:
65 |                 oldestidx, oldest = idx, record[ageidx]
66 |         if record[ageidx] <= youngest:
67 |             if record[ageidx] != youngest or random.random() >= 0.5:
68 |                 youngestidx, youngest = idx, record[ageidx]
69 |         if record[ageidx] == 26 and (twentysixidx == -1 or random.random() >= 0.5):
70 |             twentysixidx = idx
71 |     version1 = _copy_with_exclude_idx(records, oldestidx)
72 |     version2 = _copy_with_exclude_idx(records, twentysixidx)
73 |     version3 = _copy_with_exclude_idx(records, youngestidx)
74 |     return version1, version2, version3#, oldest, youngest
75 | 
76 | 
77 | def generate_data_for_exponential_mechanism(records):
78 |     """
79 |     generate three adjacent versions of the dataset for the Exponential Mechanism
80 | 
81 |     Arguments:
82 |         records {[list of list]} -- [original dataset]
83 |     """
84 |     counter = {}
85 |     eduidx = ATTNAME.index('education')
86 |     for idx, record in enumerate(records):
87 |         if record[eduidx] == '*':
88 |             continue
89 |         counter[record[eduidx]] = counter.get(record[eduidx], []) + [idx]
90 | 
91 |     firstlen, secondlen, leastlen = -float('inf'), -float('inf'), float('inf')
92 |     firstedu, secondedu, leastedu = '', '', ''
93 |     for key, val in counter.items():
94 |         if len(val) > firstlen:
95 |             secondlen = firstlen
96 |             secondedu = firstedu
97 |             firstlen = len(val)
98 |             firstedu = key
99 |         elif len(val) > secondlen:
100 |             secondlen = len(val)
101 |             secondedu = key
102 |         if len(val) < leastlen:
103 |             leastlen = len(val)
104 |             leastedu = key
105 |     firstidx = counter[firstedu][random.randrange(0, firstlen)]
106 |     secondidx = counter[secondedu][random.randrange(0, secondlen)]
107 |     leastidx = counter[leastedu][random.randrange(0, leastlen)]
108 | 
109 |     version1 = _copy_with_exclude_idx(records, firstidx)
110 |     version2 = _copy_with_exclude_idx(records, secondidx)
111 |     version3 = _copy_with_exclude_idx(records, leastidx)
112 |     return version1, version2, version3
113 | 
114 | 
115 | def _copy_with_exclude_idx(records, tgtidx):
116 |     """
117 |     generate a new list of records without the target index tgtidx
118 | 
119 |     Arguments:
120 |         records {[list of list]} -- [original records]
121 |         tgtidx {[int]} -- [index of the record to exclude]
122 | 
123 |     Returns:
124 |         [list of list] -- [copy of records excluding the record at tgtidx]
125 |     """
126 | 
127 |     return [record for idx, record in enumerate(records) if idx != tgtidx]
128 | 
129 | 
130 | def generate_hierarchy_for_age(records):
131 |     youngest, oldest = float('inf'), -float('inf')
132 |     ageidx = ATTNAME.index('age')
133 |     for record in records:
134 |         if record[ageidx] == -1:
135 |             continue
136 |         if record[ageidx] > oldest:
137 |             oldest = record[ageidx]
138 |         if record[ageidx] < youngest:
139 |             youngest = record[ageidx]
140 |     print('age max: %d min: %d' %(oldest, youngest))
141 |     with open(AGECONFFILE, 'w') as wf:
142 |         for i in range(oldest+1):
143 |             h = []
144 |             h.append(str(i))
145 |             # h.append('%s-%s' %(i//10*10, (i//10+1)*10))
146 |             # h.append('%s-%s' %(i//20*20, (i//20+1)*20))
147 |             # h.append('%s-%s' %(i//50*50, (i//50+1)*50))
148 |             # h.append('%s-%s' %(i//100*100, (i//100+1)*100))
149 |             h.append('%s-%s' % (i//25*25, (i//25+1)*25))
150 |             # h.append('%s-%s' % (i//20*20, (i//20+1)*20))
151 |             h.append('%s-%s' % (i//50*50, (i//50+1)*50))
152 |             h.append('%s-%s' % (i//100*100, (i//100+1)*100))
153 |             wf.write(','.join(h))
154 |             wf.write('\n')
155 | 
156 | def generate_hierarchy_for_edu(records):
157 |     eduset = set()
158 |     eduidx = ATTNAME.index('education')
159 |     for record in records:
160 |         if record[eduidx] != '*' and record[eduidx] not in eduset:
161 |             eduset.add(record[eduidx])
162 |     with open(EDUCONFFILE, 'w') as wf:
163 |         for edu in eduset:
164 |             wf.write(edu + ','*2)
165 |             wf.write('\n')
166 | 
167 | def generate_hierarchy_for_marital(records):
168 |     maritalset = set()
169 |     maritalidx = ATTNAME.index('marital-status')
170 |     for record in records:
171 |         if record[maritalidx] != '*' and record[maritalidx] not in maritalset:
172 |             maritalset.add(record[maritalidx])
173 |     with open(MARITALCONFFILE, 'w') as wf:
174 |         for marital in maritalset:
175 |             wf.write(marital + ','*2)
176 |             wf.write('\n')
177 | 
178 | def generate_hierarchy_for_race(records):
179 |     raceset = set()
180 |     raceidx = ATTNAME.index('race')
181 |     for record in records:
182 |         if record[raceidx] != '*' and record[raceidx] not in raceset:
183 |             raceset.add(record[raceidx])
184 |     with open(RACECONFFILE, 'w') as wf:
185 |         for race in raceset:
186 |             wf.write(race+','*2)
187 |             wf.write('\n')
188 | 
189 | 
190 | if __name__ == "__main__":
191 |     print(os.getcwd())
192 |     records = readdata()
193 |     generate_hierarchy_for_age(records)
194 |     # generate_hierarchy_for_edu(records)
195 |     # generate_hierarchy_for_marital(records)
196 |     # generate_hierarchy_for_race(records)
197 | 
--------------------------------------------------------------------------------