├── README.md
├── conf
│   ├── age_hierarchy.txt
│   ├── edu_hierarchy.txt
│   ├── marital_hierarchy.txt
│   └── race_hierarchy.txt
├── data
│   ├── adult.data
│   ├── adult.names
│   ├── adult.test
│   └── old.adult.names
├── src
│   ├── differential_privacy.py
│   ├── exponential_mechanism.py
│   ├── k_anonymity.py
│   ├── kanonymity_eval.py
│   └── laplace_mechanism.py
└── utilis
    └── readdata.py

/README.md:
--------------------------------------------------------------------------------
1 | ## K-anonymity and Differential Privacy
2 | 
3 | 
4 | [TOC]
5 | 
6 | #### 1. K-anonymity
7 | 
8 | ##### 1.1 Generalization Hierarchy
9 | 
10 | The generalization hierarchies are defined in the files under the `conf` folder.
11 | 
12 | ##### 1.2 Heuristic Program
13 | 
14 | I implement the Datafly heuristic algorithm, whose pseudocode is shown below:
15 | 
16 | ![](https://ws4.sinaimg.cn/large/006tKfTcly1g0krnjtttwj30qw0dmjut.jpg)
17 | 
18 | Detailed comments for each function can be found in `k_anonymity.py`.
19 | 
20 | ##### 1.3 Evaluation
21 | 
22 | I evaluate the results for `k = [5, 10, 50, 100]` and calculate the distortion and precision. Both metrics are computed as given in the lecture, where distortion is:
23 | 
24 | ![](https://ws4.sinaimg.cn/large/006tKfTcly1g0ksdvufmpj30ij07a74y.jpg)
25 | 
26 | and precision is:
27 | 
28 | ![](https://ws4.sinaimg.cn/large/006tKfTcly1g0ksdvufmpj30ij07a74y.jpg)
29 | 
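In code terms, the two metrics reduce to the small helpers sketched below, mirroring `calc_distortion` and `calc_precision` in `k_anonymity.py` (the names `gen_levels`, `heights`, and `cell_level_sums` are illustrative, not the repo's):

```python
def distortion(gen_levels, heights):
    # mean over the QI attributes of (generalization level reached) / (DGH height)
    return sum(l / h for l, h in zip(gen_levels, heights)) / len(gen_levels)

def precision(cell_level_sums, heights, num_records):
    # cell_level_sums[i] = total generalization level applied to attribute i,
    # summed over all records; precision is 1 minus the normalized total
    total = sum(s / h for s, h in zip(cell_level_sums, heights))
    return 1 - total / (num_records * len(heights))
```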
30 | 
31 | #### 2. Differential Privacy
32 | 
33 | ##### 2.1 Laplace Mechanism
34 | 
35 | ###### 2.1.1 Query for e = 0.5 and e = 1
36 | 
37 | The query is issued `1000` times for each `e` in `[0.5, 1]`.
38 | 
39 | 
40 | ###### 2.1.2 0.5-Indistinguishable Proof
41 | 
42 | To show that the outputs of two queries are indistinguishable, I gather the outputs into 20 buckets and calculate the probability of each bucket, then take the quotient of the corresponding bucket probabilities. For two query results, `query1` and `query2`, I calculate both the probability of `query1` over that of `query2` and the probability of `query2` over that of `query1` (see the usage sketch following `laplace_mechanism.py` below). In both cases the quotient is smaller than $e^{\epsilon}$, which demonstrates $\epsilon$-indistinguishability.
43 | 
44 | 
45 | ###### 2.1.3 1-Indistinguishable Proof
46 | 
47 | The proof is the same as the 0.5-indistinguishability proof. We can see that $1$-indistinguishability holds in each case.
48 | 
49 | 
50 | 
51 | ###### 2.1.4 Distortion
52 | 
53 | To calculate the distortion, I use RMSE as the metric. First I compute the ground truth, which is the true average age over 25 without adding noise. Then I calculate the RMSE of the query results against the ground truth. When $\epsilon = 1$ the RMSE is smaller than when $\epsilon = 0.5$, which shows that the distortion for $\epsilon = 1$ is smaller than for $\epsilon = 0.5$.
54 | 
55 | 
56 | 
57 | ##### 2.2 Exponential Mechanism
58 | 
59 | ###### 2.2.1 Query for e = 0.5 and e = 1
60 | 
61 | The query is issued `1000` times for each `e` in `[0.5, 1]`.
62 | 
63 | 
64 | 
65 | ###### 2.2.2 0.5-Indistinguishable Proof
66 | 
67 | To prove $\epsilon$-indistinguishability, I first count the frequency of each education value in the query results, then calculate the probability of each education value. Indistinguishability is shown by checking that the probability quotient between two adjacent tables is smaller than $e^{\epsilon}$. We can see that it holds in each case.
68 | 
69 | 
70 | ###### 2.2.3 1-Indistinguishable Proof
71 | 
72 | The proof is the same as for $\epsilon = 0.5$. We can see that indistinguishability holds in each case.
73 | 
74 | 
75 | ###### 2.2.4 Distortion
76 | 
77 | The metric I use here is `1 - precision`, where `precision` is the number of query results that equal the ground truth divided by the total number of queries. `1 - precision` measures distortion, since higher precision implies lower distortion. When $\epsilon = 1$ the distortion is smaller than when $\epsilon = 0.5$, which shows that the distortion for $\epsilon = 1$ is smaller than for $\epsilon = 0.5$.
78 | 
79 | 
--------------------------------------------------------------------------------
/conf/age_hierarchy.txt:
--------------------------------------------------------------------------------
1 | 0,0-10,0-20,0-50,0-100
2 | 1,0-10,0-20,0-50,0-100
3 | 2,0-10,0-20,0-50,0-100
4 | 3,0-10,0-20,0-50,0-100
5 | 4,0-10,0-20,0-50,0-100
6 | 5,0-10,0-20,0-50,0-100
7 | 6,0-10,0-20,0-50,0-100
8 | 7,0-10,0-20,0-50,0-100
9 | 8,0-10,0-20,0-50,0-100
10 | 9,0-10,0-20,0-50,0-100
11 | 10,10-20,0-20,0-50,0-100
12 | 11,10-20,0-20,0-50,0-100
13 | 12,10-20,0-20,0-50,0-100
14 | 13,10-20,0-20,0-50,0-100
15 | 14,10-20,0-20,0-50,0-100
16 | 15,10-20,0-20,0-50,0-100
17 | 16,10-20,0-20,0-50,0-100
18 | 17,10-20,0-20,0-50,0-100
19 | 18,10-20,0-20,0-50,0-100
20 | 19,10-20,0-20,0-50,0-100
21 | 20,20-30,20-40,0-50,0-100
22 | 21,20-30,20-40,0-50,0-100
23 | 22,20-30,20-40,0-50,0-100
24 | 23,20-30,20-40,0-50,0-100
25 | 24,20-30,20-40,0-50,0-100
26 | 25,20-30,20-40,0-50,0-100
27 | 26,20-30,20-40,0-50,0-100
28 | 27,20-30,20-40,0-50,0-100
29 | 28,20-30,20-40,0-50,0-100
30 | 29,20-30,20-40,0-50,0-100
31 | 30,30-40,20-40,0-50,0-100
32 | 31,30-40,20-40,0-50,0-100
33 | 32,30-40,20-40,0-50,0-100
34 | 33,30-40,20-40,0-50,0-100
35 | 34,30-40,20-40,0-50,0-100
36 | 35,30-40,20-40,0-50,0-100
37 | 36,30-40,20-40,0-50,0-100
38 | 37,30-40,20-40,0-50,0-100
39 | 38,30-40,20-40,0-50,0-100
40 | 39,30-40,20-40,0-50,0-100
41 | 40,40-50,40-60,0-50,0-100
42 | 41,40-50,40-60,0-50,0-100
43 | 42,40-50,40-60,0-50,0-100
44 | 43,40-50,40-60,0-50,0-100
45 | 44,40-50,40-60,0-50,0-100
46 | 45,40-50,40-60,0-50,0-100
47 | 46,40-50,40-60,0-50,0-100
48 | 47,40-50,40-60,0-50,0-100
49 | 48,40-50,40-60,0-50,0-100
50 | 49,40-50,40-60,0-50,0-100
51 | 50,50-60,40-60,50-100,0-100
52 | 51,50-60,40-60,50-100,0-100
53 | 52,50-60,40-60,50-100,0-100
54 | 53,50-60,40-60,50-100,0-100
55 | 54,50-60,40-60,50-100,0-100
56 | 55,50-60,40-60,50-100,0-100
57 | 56,50-60,40-60,50-100,0-100
58 | 57,50-60,40-60,50-100,0-100
59 | 58,50-60,40-60,50-100,0-100
60 | 59,50-60,40-60,50-100,0-100
61 | 60,60-70,60-80,50-100,0-100
62 | 61,60-70,60-80,50-100,0-100
63 | 62,60-70,60-80,50-100,0-100
64 | 63,60-70,60-80,50-100,0-100
65 | 64,60-70,60-80,50-100,0-100
66 | 65,60-70,60-80,50-100,0-100
67 | 66,60-70,60-80,50-100,0-100
68 | 67,60-70,60-80,50-100,0-100
69 | 68,60-70,60-80,50-100,0-100
70 | 69,60-70,60-80,50-100,0-100
71 | 70,70-80,60-80,50-100,0-100
72 | 71,70-80,60-80,50-100,0-100
73 | 72,70-80,60-80,50-100,0-100
74 | 73,70-80,60-80,50-100,0-100
75 | 74,70-80,60-80,50-100,0-100
76 | 75,70-80,60-80,50-100,0-100
77 | 76,70-80,60-80,50-100,0-100
78 | 77,70-80,60-80,50-100,0-100
79 | 78,70-80,60-80,50-100,0-100
80 | 79,70-80,60-80,50-100,0-100
81 | 80,80-90,80-100,50-100,0-100
82 | 81,80-90,80-100,50-100,0-100
83 | 82,80-90,80-100,50-100,0-100
84 | 83,80-90,80-100,50-100,0-100
85 | 84,80-90,80-100,50-100,0-100
86 | 85,80-90,80-100,50-100,0-100
87 | 86,80-90,80-100,50-100,0-100
88 | 87,80-90,80-100,50-100,0-100
89 | 88,80-90,80-100,50-100,0-100
90 | 89,80-90,80-100,50-100,0-100
91 | 90,90-100,80-100,50-100,0-100
--------------------------------------------------------------------------------
/conf/edu_hierarchy.txt:
-------------------------------------------------------------------------------- 1 | Preschool,PrimarySchool,CompulsorySchool,BasicDegree,Educated 2 | 1st-4th,PrimarySchool,CompulsorySchool,BasicDegree,Educated 3 | 5th-6th,PrimarySchool,CompulsorySchool,BasicDegree,Educated 4 | 7th-8th,MiddleSchool,CompulsorySchool,BasicDegree,Educated 5 | 9th,MiddleSchool,CompulsorySchool,BasicDegree,Educated 6 | 10th,HighSchool,AdvancedSchool,BasicDegree,Educated 7 | 11th,HighSchool,AdvancedSchool,BasicDegree,Educated 8 | 12th,HighSchool,AdvancedSchool,BasicDegree,Educated 9 | HS-grad,HighSchool,AdvancedSchool,BasicDegree,Educated 10 | Assoc-voc,VocDegree,AdvancedSchool,BasicDegree,Educated 11 | Prof-school,VocDegree,AdvancedSchool,BasicDegree,Educated 12 | Some-college,VocDegree,AdvancedSchool,BasicDegree,Educated 13 | Assoc-acdm,UndergradSchool,ProfSchool,AdvancedDegree,Educated 14 | Bachelors,UndergradSchool,ProfSchool,AdvancedDegree,Educated 15 | Masters,GradSchool,ProfSchool,AdvancedDegree,Educated 16 | Doctorate,GradSchool,ProfSchool,AdvancedDegree,Educated -------------------------------------------------------------------------------- /conf/marital_hierarchy.txt: -------------------------------------------------------------------------------- 1 | Married-AF-spouse,MarriedTogether,Married,* 2 | Married-civ-spouse,MarriedTogether,Married,* 3 | Married-spouse-absent,MarriedSeparated,Married,* 4 | Separated,MarriedSeparated,Married,* 5 | Widowed,MarriedSingle,Married,* 6 | Divorced,MarriedSingle,Married,* 7 | Never-married,NeverMarried,NonMarried,* 8 | -------------------------------------------------------------------------------- /conf/race_hierarchy.txt: -------------------------------------------------------------------------------- 1 | Asian-Pac-Islander,Orient,* 2 | Black,Orient,* 3 | Other,Orient,* 4 | Amer-Indian-Eskimo,Occident,* 5 | White,Occident,* 6 | -------------------------------------------------------------------------------- /data/adult.names: -------------------------------------------------------------------------------- 1 | | This data was extracted from the census bureau database found at 2 | | http://www.census.gov/ftp/pub/DES/www/welcome.html 3 | | Donor: Ronny Kohavi and Barry Becker, 4 | | Data Mining and Visualization 5 | | Silicon Graphics. 6 | | e-mail: ronnyk@sgi.com for questions. 7 | | Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). 8 | | 48842 instances, mix of continuous and discrete (train=32561, test=16281) 9 | | 45222 if instances with unknown values are removed (train=30162, test=15060) 10 | | Duplicate or conflicting instances : 6 11 | | Class probabilities for adult.all file 12 | | Probability for the label '>50K' : 23.93% / 24.78% (without unknowns) 13 | | Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns) 14 | | 15 | | Extraction was done by Barry Becker from the 1994 Census database. A set of 16 | | reasonably clean records was extracted using the following conditions: 17 | | ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)) 18 | | 19 | | Prediction task is to determine whether a person makes over 50K 20 | | a year. 
21 | | 22 | | First cited in: 23 | | @inproceedings{kohavi-nbtree, 24 | | author={Ron Kohavi}, 25 | | title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a 26 | | Decision-Tree Hybrid}, 27 | | booktitle={Proceedings of the Second International Conference on 28 | | Knowledge Discovery and Data Mining}, 29 | | year = 1996, 30 | | pages={to appear}} 31 | | 32 | | Error Accuracy reported as follows, after removal of unknowns from 33 | | train/test sets): 34 | | C4.5 : 84.46+-0.30 35 | | Naive-Bayes: 83.88+-0.30 36 | | NBTree : 85.90+-0.28 37 | | 38 | | 39 | | Following algorithms were later run with the following error rates, 40 | | all after removal of unknowns and using the original train/test split. 41 | | All these numbers are straight runs using MLC++ with default values. 42 | | 43 | | Algorithm Error 44 | | -- ---------------- ----- 45 | | 1 C4.5 15.54 46 | | 2 C4.5-auto 14.46 47 | | 3 C4.5 rules 14.94 48 | | 4 Voted ID3 (0.6) 15.64 49 | | 5 Voted ID3 (0.8) 16.47 50 | | 6 T2 16.84 51 | | 7 1R 19.54 52 | | 8 NBTree 14.10 53 | | 9 CN2 16.00 54 | | 10 HOODG 14.82 55 | | 11 FSS Naive Bayes 14.05 56 | | 12 IDTM (Decision table) 14.46 57 | | 13 Naive-Bayes 16.12 58 | | 14 Nearest-neighbor (1) 21.42 59 | | 15 Nearest-neighbor (3) 20.35 60 | | 16 OC1 15.04 61 | | 17 Pebls Crashed. Unknown why (bounds WERE increased) 62 | | 63 | | Conversion of original data as follows: 64 | | 1. Discretized agrossincome into two ranges with threshold 50,000. 65 | | 2. Convert U.S. to US to avoid periods. 66 | | 3. Convert Unknown to "?" 67 | | 4. Run MLC++ GenCVFiles to generate data,test. 68 | | 69 | | Description of fnlwgt (final weight) 70 | | 71 | | The weights on the CPS files are controlled to independent estimates of the 72 | | civilian noninstitutional population of the US. These are prepared monthly 73 | | for us by Population Division here at the Census Bureau. We use 3 sets of 74 | | controls. 75 | | These are: 76 | | 1. A single cell estimate of the population 16+ for each state. 77 | | 2. Controls for Hispanic Origin by age and sex. 78 | | 3. Controls by Race, age and sex. 79 | | 80 | | We use all three sets of controls in our weighting program and "rake" through 81 | | them 6 times so that by the end we come back to all the controls we used. 82 | | 83 | | The term estimate refers to population totals derived from CPS by creating 84 | | "weighted tallies" of any specified socio-economic characteristics of the 85 | | population. 86 | | 87 | | People with similar demographic characteristics should have 88 | | similar weights. There is one important caveat to remember 89 | | about this statement. That is that since the CPS sample is 90 | | actually a collection of 51 state samples, each with its own 91 | | probability of selection, the statement only applies within 92 | | state. 93 | 94 | 95 | >50K, <=50K. 96 | 97 | age: continuous. 98 | workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 99 | fnlwgt: continuous. 100 | education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 101 | education-num: continuous. 102 | marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
103 | occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 104 | relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 105 | race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 106 | sex: Female, Male. 107 | capital-gain: continuous. 108 | capital-loss: continuous. 109 | hours-per-week: continuous. 110 | native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. 111 | -------------------------------------------------------------------------------- /data/old.adult.names: -------------------------------------------------------------------------------- 1 | 1. Title of Database: adult 2 | 2. Sources: 3 | (a) Original owners of database (name/phone/snail address/email address) 4 | US Census Bureau. 5 | (b) Donor of database (name/phone/snail address/email address) 6 | Ronny Kohavi and Barry Becker, 7 | Data Mining and Visualization 8 | Silicon Graphics. 9 | e-mail: ronnyk@sgi.com 10 | (c) Date received (databases may change over time without name change!) 11 | 05/19/96 12 | 3. Past Usage: 13 | (a) Complete reference of article where it was described/used 14 | @inproceedings{kohavi-nbtree, 15 | author={Ron Kohavi}, 16 | title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a 17 | Decision-Tree Hybrid}, 18 | booktitle={Proceedings of the Second International Conference on 19 | Knowledge Discovery and Data Mining}, 20 | year = 1996, 21 | pages={to appear}} 22 | (b) Indication of what attribute(s) were being predicted 23 | Salary greater or less than 50,000. 24 | (b) Indication of study's results (i.e. Is it a good domain to use?) 25 | Hard domain with a nice number of records. 26 | The following results obtained using MLC++ with default settings 27 | for the algorithms mentioned below. 28 | 29 | Algorithm Error 30 | -- ---------------- ----- 31 | 1 C4.5 15.54 32 | 2 C4.5-auto 14.46 33 | 3 C4.5 rules 14.94 34 | 4 Voted ID3 (0.6) 15.64 35 | 5 Voted ID3 (0.8) 16.47 36 | 6 T2 16.84 37 | 7 1R 19.54 38 | 8 NBTree 14.10 39 | 9 CN2 16.00 40 | 10 HOODG 14.82 41 | 11 FSS Naive Bayes 14.05 42 | 12 IDTM (Decision table) 14.46 43 | 13 Naive-Bayes 16.12 44 | 14 Nearest-neighbor (1) 21.42 45 | 15 Nearest-neighbor (3) 20.35 46 | 16 OC1 15.04 47 | 17 Pebls Crashed. Unknown why (bounds WERE increased) 48 | 49 | 4. Relevant Information Paragraph: 50 | Extraction was done by Barry Becker from the 1994 Census database. A set 51 | of reasonably clean records was extracted using the following conditions: 52 | ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)) 53 | 54 | 5. Number of Instances 55 | 48842 instances, mix of continuous and discrete (train=32561, test=16281) 56 | 45222 if instances with unknown values are removed (train=30162, test=15060) 57 | Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). 58 | 59 | 6. Number of Attributes 60 | 6 continuous, 8 nominal attributes. 61 | 62 | 7. Attribute Information: 63 | 64 | age: continuous. 
65 | workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
66 | fnlwgt: continuous.
67 | education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
68 | education-num: continuous.
69 | marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
70 | occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
71 | relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
72 | race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
73 | sex: Female, Male.
74 | capital-gain: continuous.
75 | capital-loss: continuous.
76 | hours-per-week: continuous.
77 | native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
78 | class: >50K, <=50K
79 | 
80 | 8. Missing Attribute Values:
81 | 
82 | 7% have missing values.
83 | 
84 | 9. Class Distribution:
85 | 
86 | Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
87 | Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
88 | 
89 | 
90 | 
--------------------------------------------------------------------------------
/src/differential_privacy.py:
--------------------------------------------------------------------------------
1 | import sys
2 | sys.path.append('../')
3 | from utilis.readdata import *
4 | import laplace_mechanism
5 | import exponential_mechanism
6 | import math
7 | 
8 | 
9 | def evaluate_laplace_mechanism(eps = [0.5, 1]):
10 |     """
11 |     Evaluate for Laplace Mechanism
12 |     """
13 |     recordsv0 = readdata()
14 |     recordsv1, recordsv2, recordsv3 = generate_data_for_laplace_mechanism(recordsv0)
15 | 
16 |     res1000 = {e:[] for e in eps}
17 |     # res4000 = {e: [] for e in eps}
18 |     rmse = {e: 0 for e in eps}
19 | 
20 | 
21 |     """
22 |     evaluate for epsilon = 0.5 and 1 for 1000 queries
23 |     """
24 |     printsent = ['original data', 'data removed a record with the oldest age',
25 |                  'data removed a record with age 26', 'data removed a record with the youngest age']
26 |     i = 0
27 |     for records in (recordsv0, recordsv1, recordsv2, recordsv3):
28 |         print('############ Processing for {} ############'.format(printsent[i]))
29 |         i += 1
30 |         LampMec = laplace_mechanism.LaplaceMechanism(records)
31 |         for e in eps:
32 |             print('query 1000 results with epsilon = {}'.format(e))
33 |             res1000[e].append(LampMec.query_with_dp(e, querynum=1000))
34 |             # res4000[e].append(LampMec.query_with_dp(e, querynum=4000))
35 |             rmse[e] = LampMec.calc_distortion(
36 |                 LampMec.query_with_dp(e, querynum=4000))
37 | 
38 |     print('\n')
39 |     for e in eps:
40 |         print('############ Prove {}-indistinguishability'.format(e))
41 |         for i in range(1, 4):
42 |             tmpresij, tmpresji = laplace_mechanism.prove_indistinguishable(
43 |                 res1000[e][0], res1000[e][i])
44 |             print('** {} ** OVER ** {} **:'.format(printsent[0], printsent[i]))
45 |             print(tmpresij)
46 |             print('** {} ** OVER ** {} **:'.format(printsent[i], printsent[0]))
47 |             print(tmpresji)
48 |         print('exp^e = {}'.format(math.exp(e)))
49 |         print('\n')
50 | 
51 |     print('############ Measure the distortion (RMSE) ############')
52 |     for e in eps:
53 |         print('RMSE for e = {}: {}'.format(e, rmse[e]))
54 |     print('Distortion of e=1 is smaller than e=0.5 ?: ', rmse[1] <= rmse[0.5])
55 |     del recordsv0
56 |     del recordsv1
57 |     del recordsv2
58 |     del recordsv3
59 | 
60 | 
61 | 
62 | 
63 | 
64 | def evaluate_exponential_mechanism(eps=[0.5,1]):
65 |     """
66 |     Evaluate for Exponential Mechanism
67 |     """
68 | 
69 |     recordsv0 = readdata()
70 |     recordsv1, recordsv2, recordsv3 = generate_data_for_exponential_mechanism(recordsv0)
71 | 
72 |     res1000 = {e:[] for e in eps}
73 |     # res4000 = {e: [] for e in eps}
74 |     dist = {e: 0 for e in eps}
75 | 
76 | 
77 |     """
78 |     evaluate for epsilon = 0.5 and 1 for 1000 queries
79 |     """
80 |     printsent = ['original data', 'data removed a record with the most frequent education',
81 |                  'data removed a record with the second most frequent education',
82 |                  'data removed a record with the least frequent education']
83 |     i = 0
84 |     for records in (recordsv0, recordsv1, recordsv2, recordsv3):
85 |         print('############ Processing for {} ############'.format(printsent[i]))
86 |         i += 1
87 |         ExpMe = exponential_mechanism.ExponentialMechanism(records)
88 |         for e in eps:
89 |             print('query 1000 results with epsilon = {}'.format(e))
90 |             res1000[e].append(ExpMe.query_with_dp(e, querynum=1000))
91 |             # res4000[e].append(ExpMe.query_with_dp(e, querynum=4000))
92 |             dist[e] = ExpMe.calc_distortion(
93 |                 ExpMe.query_with_dp(e, querynum=4000))
94 | 
95 |     print('\n')
96 |     for e in eps:
97 |         print('############ Prove {}-indistinguishability'.format(e))
98 |         for i in range(1, 4):
99 |             tmpresij, tmpresji = exponential_mechanism.prove_indistinguishable(
100 |                 res1000[e][0], res1000[e][i])
101 |             print('** {} ** OVER ** {} **:'.format(printsent[0], printsent[i]))
102 |             print(tmpresij)
103 |             print('** {} ** OVER ** {} **:'.format(printsent[i], printsent[0]))
104 |             print(tmpresji)
105 |         print('exp^e = {}'.format(math.exp(e)))
106 |         print('\n')
107 | 
108 |     print('############ Measure the distortion (1-precision) ############')
109 |     for e in eps:
110 |         print('distortion for e = {}: {}'.format(e, dist[e]))
111 |     print('Distortion of e=1 is smaller than e=0.5 ?: ', dist[1] <= dist[0.5])
112 | 
113 | 
114 | 
115 | 
116 | 
117 | if __name__ == "__main__":
118 |     print("############################### Laplace Mechanism ###############################")
119 |     evaluate_laplace_mechanism()
120 |     print('\n')
121 |     print("############################### Exponential Mechanism ###############################")
122 |     evaluate_exponential_mechanism()
123 | 
124 | 
125 | 
--------------------------------------------------------------------------------
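For reference before reading `exponential_mechanism.py`: the textbook exponential mechanism (McSherry–Talwar) samples a candidate `r` with probability proportional to `exp(e * u(r) / (2 * sensitivity))`, whereas the implementation below draws exponential random variables scaled by `e * u / (2 * sensitivity)` and normalizes them into sampling weights. A minimal sketch of the textbook form, assuming `freq` maps each education value to its relative frequency (the score `u`); names here are illustrative, not the repo's:

```python
import numpy as np

def exponential_mechanism(freq, epsilon, sensitivity=1.0):
    # weight each candidate by exp(eps * score / (2 * sensitivity)),
    # then sample one candidate proportionally to its weight
    candidates = list(freq)
    scores = np.array([freq[c] for c in candidates])
    weights = np.exp(epsilon * scores / (2 * sensitivity))
    probs = weights / weights.sum()
    return np.random.choice(candidates, p=probs)
```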
/src/exponential_mechanism.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from utilis.readdata import *
3 | from collections import Counter
4 | import math
5 | 
6 | class ExponentialMechanism():
7 |     """
8 |     exponential mechanism
9 | 
10 |     """
11 | 
12 |     def __init__(self, records):
13 |         self.records = records
14 |         self.s = self.__calculate_sensitivity()
15 |         self.__count_education_nums_prop()
16 | 
17 |     def __calculate_sensitivity(self):
18 |         """
19 |         calculate the sensitivity
20 |         as the score function is #members, the sensitivity is 1
21 | 
22 |         Returns:
23 |             [int] -- [sensitivity]
24 |         """
25 |         return 1
26 | 
27 | 
28 |     def __count_education_nums_prop(self):
29 |         """
30 |         calculate the count and relative frequency of each education value
31 |         """
32 | 
33 |         self.educnt = {}
34 |         eduidx = ATTNAME.index('education')
35 |         for record in self.records:
36 |             self.educnt[record[eduidx]] = self.educnt.get(record[eduidx], 0) + 1
37 |         self.eduprop = {}
38 |         for key, val in self.educnt.items():
39 |             self.eduprop[key] = val / len(self.records)
40 | 
41 |     def __exponential(self, u, e):
42 |         """
43 |         draw an exponentially distributed random weight with mean e * u / (2 * sensitivity)
44 | 
45 |         Arguments:
46 |             u {[float]} -- [score (relative frequency) of a candidate]
47 |             e {[float]} -- [epsilon]
48 | 
49 |         Returns:
50 |             [float] -- [random sampling weight]
51 |         """
52 | 
53 |         return np.random.exponential(e * u / (2*self.s))
54 | 
55 |     def query_with_dp(self, e = 1, querynum = 1000):
56 |         """
57 |         query with Exponential Mechanism
58 | 
59 |         Keyword Arguments:
60 |             e {float} -- [epsilon] (default: {1})
61 |             querynum {int} -- [number of queries] (default: {1000})
62 | 
63 |         Returns:
64 |             [list] -- [list of query results]
65 |         """
66 | 
67 |         # candidate = list(self.educnt.keys())
68 |         # candidatefreq = [self.educnt[k] for k in candidate]
69 |         candidate = list(self.eduprop.keys())
70 |         # print(candidate)
71 |         # print([self.educnt[k] for k in candidate ])
72 |         candidatefreq = [self.eduprop[k] for k in candidate]
73 |         res = []
74 |         for _ in range(querynum):
75 |             weights = [self.__exponential(freq, e) for freq in candidatefreq]
76 |             weights = [w/sum(weights) for w in weights]
77 |             # print(weights)
78 |             res.append(np.random.choice(candidate, p=weights))
79 |         return res
80 | 
81 | 
82 |     def calc_groundtruth(self):
83 |         """
84 |         calculate the groundtruth
85 |         the most frequent education value
86 | 
87 |         Returns:
88 |             [string] -- [most frequent education value]
89 |         """
90 | 
91 |         eduidx = ATTNAME.index('education')
92 |         return Counter([record[eduidx] for record in self.records if record[eduidx] != '*']).most_common(1)[0][0]
93 | 
94 |     def calc_distortion(self, queryres):
95 |         """
96 |         calculate the distortion
97 | 
98 |         Arguments:
99 |             queryres {[list]} -- [query result]
100 | 
101 |         Returns:
102 |             [float] -- [distortion]
103 |         """
104 | 
105 |         return 1 - Counter(queryres)[self.calc_groundtruth()]/len(queryres)
106 | 
107 | 
108 | def prove_indistinguishable(queryres1, queryres2):
109 |     """
110 |     prove indistinguishability of two query results
111 | 
112 |     Arguments:
113 |         queryres1 {[list]} -- [query 1 result]
114 |         queryres2 {[list]} -- [query 2 result]
115 | 
116 |     Returns:
117 |         [tuple] -- [averaged probability quotients, query1/query2 and query2/query1]
118 |     """
119 | 
120 |     prob1 = Counter(queryres1)
121 |     for key in prob1:
122 |         prob1[key] /= len(queryres1)
123 |     prob2 = Counter(queryres2)
124 |     for key in prob2:
125 |         prob2[key] /= len(queryres2)
126 |     res = 0
127 |     num = 0
128 |     for key in prob1:
129 |         if key not in prob2:
130 |             print('no query result {} in query 2'.format(key))
131 |             continue
132 |         res += prob1[key] / prob2[key]
133 |         num += 1
134 |     res1overres2 = res/num
135 |     res = 0
136 |     num = 0
137 |     for key in prob2:
138 |         if key not in prob1:
139 |             print('no query result {} in query 1'.format(key))
140 |             continue
141 |         res += prob2[key] / prob1[key]
142 |         num += 1
143 |     res2overres1 = res/num
144 |     return res1overres2, res2overres1
145 | 
146 | 
147 | 
148 | if __name__ == "__main__":
149 |     records = readdata()
150 |     ExpMe = ExponentialMechanism(records)
151 |     res1 = ExpMe.query_with_dp(0.05, 1000)
152 |     # res2 = ExpMe.query_with_dp(0.05, 1000)
153 |     v1, v2, v3 = generate_data_for_exponential_mechanism(records)
154 |     ExpMe2 = ExponentialMechanism(v1)
155 |     res2 = ExpMe2.query_with_dp(0.05, 1000)
156 |     # print(res1)
157 |     print(ExpMe.calc_distortion(res1))
158 |     print(ExpMe.calc_distortion(ExpMe.query_with_dp(1, 1000)))
159 |     print(ExpMe.calc_distortion(res2))
160 |     print(prove_indistinguishable(res1, res2))
161 |     print(prove_indistinguishable(res2, res1))
162 |     print(math.exp(0.05))
163 | 
--------------------------------------------------------------------------------
/src/k_anonymity.py:
--------------------------------------------------------------------------------
1 | from utilis.readdata import *
2 | 
3 | class KAnonymity():
4 |     def __init__(self, records):
5 |         self.records = records
6 |         self.confile = [AGECONFFILE, EDUCONFFILE, MARITALCONFFILE, RACECONFFILE]
7 | 
8 |     def anonymize(self, qi_names=['age', 'education', 'marital-status', 'race'], k=5):
9 |         """
10 |         anonymizer for k-anonymity
11 | 
12 |         Keyword Arguments:
13 |             qi_names {list} -- [qi names] (default: {['age', 'education', 'marital-status', 'race']})
14 |             k {int} -- [value for k] (default: {5})
15 |         """
16 | 
17 |         domains, gen_levels = {}, {}
18 |         qi_frequency = {} # store the frequency for each qi value
19 |         # record_att_gen_levels = [[0 for _ in range(len(qi_names))] for _ in range(len(self.records))]
20 | 
21 |         assert len(self.confile) == len(qi_names), 'number of config files not equal to number of QI-names'
22 |         generalize_tree = dict()
23 |         for idx, name in enumerate(qi_names):
24 |             generalize_tree[name] = Tree(self.confile[idx])
25 | 
26 |         for qiname in qi_names:
27 |             domains[qiname] = set()
28 |             gen_levels[qiname] = 0
29 | 
30 |         for idx, record in enumerate(self.records):
31 |             qi_sequence = self._get_qi_values(record[:], qi_names, generalize_tree)
32 | 
33 |             if qi_sequence in qi_frequency:
34 |                 qi_frequency[qi_sequence].add(idx)
35 |             else:
36 |                 qi_frequency[qi_sequence] = {idx}
37 |                 for j, value in enumerate(qi_sequence):
38 |                     domains[qi_names[j]].add(value)
39 | 
40 |         # iteratively generalize the attribute with the most distinct values
41 |         while True:
42 |             # count number of records not satisfying k-anonymity
43 |             negcount = 0
44 |             for qi_sequence, idxset in qi_frequency.items():
45 |                 if len(idxset) < k:
46 |                     negcount += len(idxset)
47 | 
48 |             if negcount > k:
49 |                 # continue generalization, since more than k records do not satisfy k-anonymity
50 |                 most_freq_att_num, most_freq_att_name = -1, None
51 |                 for qiname in qi_names:
52 |                     if len(domains[qiname]) > most_freq_att_num:
53 |                         most_freq_att_num = len(domains[qiname])
54 |                         most_freq_att_name = qiname
55 | 
56 |                 # pick the attribute with the most distinct values
57 |                 generalize_att = most_freq_att_name
58 |                 qi_index = qi_names.index(generalize_att)
59 |                 domains[generalize_att] = set()
60 | 
61 |                 # generalize that attribute one level higher
62 |                 for qi_sequence in list(qi_frequency.keys()):
63 |                     new_qi_sequence = list(qi_sequence)
64 |                     new_qi_sequence[qi_index] = generalize_tree[generalize_att].root[qi_sequence[qi_index]][0]
65 |                     new_qi_sequence = tuple(new_qi_sequence)
66 | 
67 |                     if new_qi_sequence in qi_frequency:
68 |                         qi_frequency[new_qi_sequence].update(
69 |                             qi_frequency[qi_sequence])
70 |                         qi_frequency.pop(qi_sequence, 0)
71 |                     else:
72 |                         qi_frequency[new_qi_sequence] = qi_frequency.pop(qi_sequence)
73 | 
74 |                     domains[generalize_att].add(new_qi_sequence[qi_index])
75 | 
76 |                 gen_levels[generalize_att] += 1
77 | 
78 | 
79 |             else:
80 |                 # end the while loop
81 |                 # suppress sequences not satisfying k-anonymity
82 |                 # save results and calculate distortion and precision
83 |                 genlvl_att = [0 for _ in range(len(qi_names))]
84 |                 dgh_att = [generalize_tree[name].level for name in qi_names]
85 |                 datasize = 0
86 |                 qiindex = [ATTNAME.index(name) for name in qi_names]
87 | 
88 |                 # make sure the output file keeps the same order as the original data file
89 |                 towriterecords = [None for _ in range(len(self.records))]
90 |                 with open('../data/adult_%d_kanonymity.data' %k, 'w') as wf:
91 |                     for qi_sequence, recordidxs in qi_frequency.items():
92 |                         if len(recordidxs) < k:
93 |                             continue
94 |                         for idx in recordidxs:
95 |                             record = self.records[idx][:]
96 |                             for i in range(len(qiindex)):
97 |                                 record[qiindex[i]] = qi_sequence[i]
98 |                                 genlvl_att[i] += generalize_tree[qi_names[i]].root[qi_sequence[i]][1]
99 |                             record = list(map(str, record))
100 |                             for i in range(len(record)):
101 |                                 if record[i] == '*' and i not in qiindex:
102 |                                     record[i] = '?'
103 |                             towriterecords[idx] = record[:]
104 |                             # wf.write(', '.join(record))
105 |                             # wf.write('\n')
106 |                         datasize += len(recordidxs)
107 |                     for record in towriterecords:
108 |                         if record is not None:
109 |                             wf.write(', '.join(record))
110 |                             wf.write('\n')
111 |                         else:
112 |                             wf.write('\n')
113 | 
114 |                 print('qi names: ', qi_names)
115 |                 # precision = self.calc_precision(genlvl_att, dgh_att, datasize, len(qi_names))
116 |                 precision = self.calc_precision(genlvl_att, dgh_att, len(self.records), len(qi_names))
117 |                 distortion = self.calc_distortion([gen_levels[qi_names[i]] for i in range(len(qi_names))], dgh_att, len(qi_names))
118 |                 print('precision: {}, distortion: {}'.format(precision, distortion))
119 |                 break
120 | 
121 | 
122 |     def calc_precision(self, genlvl_att, dgh_att, datasize, attsize = 4):
123 |         """
124 |         calculate the precision over the generalized values of each attribute
125 | 
126 |         Arguments:
127 |             genlvl_att {[list]} -- [sum of generalization levels of each attribute]
128 |             dgh_att {[list]} -- [maximum height of each attribute]
129 |             datasize {[int]} -- [data size]
130 | 
131 |         Keyword Arguments:
132 |             attsize {int} -- [number of qi attributes] (default: {4})
133 | 
134 |         Returns:
135 |             [float] -- [precision of the generalization]
136 |         """
137 | 
138 |         return 1 - sum([genlvl_att[i] / dgh_att[i] for i in range(attsize)])/(datasize*attsize)
139 | 
140 | 
141 |     def calc_distortion(self, gen_levels_att, dgh_att, attsize):
142 |         """
143 |         calculate the distortion from the generalization level of each attribute
144 | 
145 |         Arguments:
146 |             gen_levels_att {[list]} -- [generalization level reached by each attribute]
147 |             dgh_att {[list]} -- [maximum height of each attribute's DGH tree]
148 |             attsize {[int]} -- [number of qi attributes]
149 | 
150 |         Returns:
151 |             [float] -- [distortion of the generalization]
152 |         """
153 | 
154 |         print('attribute gen level:', gen_levels_att)
155 |         print('tree height:', dgh_att)
156 |         return sum([gen_levels_att[i] / dgh_att[i] for i in range(attsize)]) / attsize
157 | 
158 | 
159 |     def _get_qi_values(self, record, qi_names, generalize_tree):
160 |         """
161 |         private method
162 |         get qi values from one record
163 | 
164 |         Arguments:
165 |             record {[list]} -- [one record]
166 |             qi_names {[list]} -- [qi names]
167 |             generalize_tree {[dict]} -- [dict storing the DGH trees]
168 | 
169 |         Returns:
170 |             [tuple] -- [qi tuple value]
171 |         """
172 | 
173 |         qi_index = [ATTNAME.index(name) for name in qi_names]
174 |         seq = []
175 |         for i, idx in enumerate(qi_index):
176 |             if idx == ATTNAME.index('age'):
177 |                 if record[idx] == -1:
178 |                     seq.append('0-100')
179 |                 else:
180 |                     seq.append(str(record[idx]))
181 |             else:
182 |                 if record[idx] == '*':
183 |                     # missing value: generalize directly to the hierarchy root
184 |                     record[idx] = generalize_tree[qi_names[i]].highestgen
185 |                 seq.append(record[idx])
186 |         return tuple(seq)
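# (Added note) After a Tree is built from a hierarchy config file, tree.root maps
# every value to a (parent, level) pair, where 'parent' is the next more general
# value (None at the hierarchy root) and 'level' counts the generalization steps
# applied so far (0 for a raw value). For example, with marital_hierarchy.txt:
#   tree.root['Divorced']      == ('MarriedSingle', 0)
#   tree.root['MarriedSingle'] == ('Married', 1)
#   tree.root['Married']       == ('*', 2)
#   tree.root['*']             == (None, 3)
# so generalizing a value one level up is a single lookup: tree.root[value][0].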
187 | 
188 | 
189 | 
190 | 
191 | class Tree:
192 |     """
193 |     Tree class
194 |     builds the DGH tree; keeps track of each node's parent and its generalization level
195 |     """
196 | 
197 |     def __init__(self, confile):
198 |         self.confile = confile
199 |         self.root = dict()
200 |         self.level = -1
201 |         self.highestgen = ''
202 |         self.buildTree()
203 | 
204 | 
205 |     def buildTree(self):
206 |         """
207 |         build the DGH tree from the config file
208 |         """
209 | 
210 |         with open(self.confile, 'r') as rf:
211 |             for line in rf:
212 |                 line = line.strip()
213 |                 if not line:
214 |                     continue
215 |                 line = [col.strip() for col in line.split(',')]
216 |                 height = len(line)-1
217 |                 if self.level == -1:
218 |                     self.level = height
219 |                 if not self.highestgen:
220 |                     self.highestgen = line[-1]
221 |                 pre = None
222 |                 for idx, val in enumerate(line[::-1]):
223 |                     self.root[val] = (pre, height-idx)
224 |                     pre = val
225 | 
226 | 
227 | if __name__ == "__main__":
228 |     records = readdata()
229 |     KAnony = KAnonymity(records)
230 |     KAnony.anonymize(k = 100)
231 | 
232 | 
233 | 
234 | 
--------------------------------------------------------------------------------
/src/kanonymity_eval.py:
--------------------------------------------------------------------------------
1 | import sys
2 | sys.path.append('../')
3 | from utilis.readdata import *
4 | from k_anonymity import KAnonymity
5 | 
6 | def main():
7 |     records = readdata()
8 |     K = [5, 10, 50, 100]
9 |     KAnony = KAnonymity(records)
10 |     for k in K:
11 |         print('############# k-anonymity for k={} #############: \n'.format(k))
12 |         KAnony.anonymize(k=k)
13 |         print('\n')
14 | 
15 | 
16 | if __name__ == "__main__":
17 |     main()
18 | 
--------------------------------------------------------------------------------
/src/laplace_mechanism.py:
--------------------------------------------------------------------------------
1 | from utilis.readdata import *
2 | import numpy as np
3 | import math
4 | 
5 | 
6 | class LaplaceMechanism():
7 |     def __init__(self, records):
8 |         self.records = records
9 |         self.s = self.__calculate_sensitivity()
10 |         # print(self.s)
11 | 
12 |     def __calculate_sensitivity(self):
13 |         """
14 |         calculate the sensitivity of the average-age query
15 |         it is the oldest age / number of records with age > 25
16 | 
17 |         Returns:
18 |             [float] -- [sensitivity]
19 |         """
20 | 
21 |         num, oldage = 0, -float('inf')
22 |         ageidx = ATTNAME.index('age')
23 |         for record in self.records:
24 |             if record[ageidx] > 25:
25 |                 num += 1
26 |                 if record[ageidx] > oldage:
27 |                     oldage = record[ageidx]
28 |         return oldage / num
29 | 
30 |     def __laplacian_noise(self, e):
31 |         """
32 |         draw Laplacian noise centered at 0 with scale sensitivity / epsilon
33 |         """
34 | 
35 |         return np.random.laplace(0, self.s/e)
36 | 
37 |     def query_with_dp(self, e = 1, querynum=1000):
38 |         """
39 |         query the average age above 25 with the Laplace Mechanism
40 | 
41 |         Keyword Arguments:
42 |             e {float} -- [epsilon] (default: {1})
43 |             querynum {int} -- [number of queries] (default: {1000})
44 | 
45 |         Returns:
46 |             [list] -- [randomized query results]
47 |         """
48 | 
49 |         ageidx = ATTNAME.index('age')
50 |         agegt25 = [record[ageidx]
51 |                    for record in self.records if record[ageidx] > 25]
52 |         avgage = sum(agegt25) / len(agegt25)
53 | 
54 |         res = []
55 |         for _ in range(querynum):
56 |             res.append(round(avgage + self.__laplacian_noise(e), 2))
57 |         return res
58 | 
59 |     def calc_groundtruth(self):
60 |         """
61 |         calculate the true average age above 25 without adding noise
62 | 
63 |         Returns:
64 |             [float] -- [true average age greater than 25]
65 |         """
66 | 
67 |         agesum = 0
68 |         num = 0
69 |         ageidx = ATTNAME.index('age')
70 |         for record in self.records:
71 |             if record[ageidx] > 25:
72 |                 agesum += record[ageidx]
73 |                 num += 1
74 |         return round(agesum / num, 2)
75 | 
76 | 
77 |     def calc_distortion(self, queryres):
78 |         """
79 |         calculate the distortion
80 |         use RMSE here
81 | 
82 |         Arguments:
83 |             queryres {[list]} -- [query result]
84 | 
85 |         Returns:
86 |             [float] -- [rmse value]
87 |         """
88 | 
89 |         groundtruth = self.calc_groundtruth()
90 |         rmse = (sum((res - groundtruth)**2 for res in queryres) / len(queryres))**(1/2)
91 |         return rmse
92 | 
93 | def prove_indistinguishable(queryres1, queryres2, bucketnum = 20):
94 |     """
95 |     prove indistinguishability of two query results
96 | 
97 |     Arguments:
98 |         queryres1 {[list]} -- [query 1 result]
99 |         queryres2 {[list]} -- [query 2 result]
100 | 
101 |     Keyword Arguments:
102 |         bucketnum {int} -- [number of buckets used to calculate the probabilities] (default: {20})
103 | 
104 |     Returns:
105 |         [tuple] -- [averaged probability quotients, query1/query2 and query2/query1]
106 |     """
107 | 
108 |     maxval = max(max(queryres1), max(queryres2))
109 |     minval = min(min(queryres1), min(queryres2))
110 |     count1 = [0 for _ in range(bucketnum)]
111 |     count2 = [0 for _ in range(bucketnum)]
112 |     for val1, val2 in zip(queryres1, queryres2):
113 |         count1[math.floor((val1-minval+1)/((maxval-minval+1)/bucketnum))-1] += 1
114 |         count2[math.floor((val2-minval+1)/((maxval-minval+1)/bucketnum))-1] += 1
115 |     prob1 = list(map(lambda x: x/len(queryres1), count1))
116 |     prob2 = list(map(lambda x: x/len(queryres2), count2))
117 | 
118 |     res1overres2 = sum(p1 / p2 for p1, p2 in zip(prob1, prob2) if p2 != 0) / bucketnum
119 |     res2overres1 = sum(p2 / p1 for p1, p2 in zip(prob1, prob2) if p1 != 0) / bucketnum
120 |     return res1overres2, res2overres1
121 | 
122 | 
123 | if __name__ == "__main__":
124 |     records = readdata()
125 |     v1, v2, v3 = generate_data_for_laplace_mechanism(records)
126 |     LapMe = LaplaceMechanism(records)
127 |     res1 = LapMe.query_with_dp(0.5, 1000)
128 |     # print(res1)
129 |     # print(LapMe.calc_groundtruth())
130 |     print(LapMe.calc_distortion(LapMe.query_with_dp(1, 1000)))
131 |     LapMe2 = LaplaceMechanism(v1)
132 |     res2 = LapMe2.query_with_dp(0.5, 1000)
133 |     print(LapMe.calc_distortion(res1))
134 |     print(LapMe2.calc_distortion(res2))
135 |     print(prove_indistinguishable(res1, res2))
136 |     # print(prove_indistinguishable(res2, res1))
137 |     print(math.exp(0.5))
138 | 
--------------------------------------------------------------------------------
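To illustrate the bucket-based check described in README section 2.1.2, a minimal usage sketch of `prove_indistinguishable` above (run from within `src/`, since the code uses relative paths like `'../data'`; `0.5` is the ε being tested):

```python
import sys
import math
sys.path.append('../')

from utilis.readdata import readdata, generate_data_for_laplace_mechanism
from laplace_mechanism import LaplaceMechanism, prove_indistinguishable

records = readdata()
neighbor, _, _ = generate_data_for_laplace_mechanism(records)

# 1000 noisy answers on the original data and on an adjacent dataset
res1 = LaplaceMechanism(records).query_with_dp(0.5, querynum=1000)
res2 = LaplaceMechanism(neighbor).query_with_dp(0.5, querynum=1000)

# both averaged bucket quotients should stay below exp(epsilon)
q12, q21 = prove_indistinguishable(res1, res2)
print(q12, q21, math.exp(0.5))
```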
/utilis/readdata.py:
--------------------------------------------------------------------------------
1 | import pandas
2 | import os
3 | import random
4 | 
5 | ATTNAME = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital-status',
6 |            'occupation', 'relationship', 'race', 'sex',
7 |            'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'class']
8 | 
9 | AGECONFFILE = '../conf/age_hierarchy.txt'
10 | EDUCONFFILE = '../conf/edu_hierarchy.txt'
11 | MARITALCONFFILE = '../conf/marital_hierarchy.txt'
12 | RACECONFFILE = '../conf/race_hierarchy.txt'
13 | 
14 | 
15 | def readdata(filepath='../data', filename='adult.data'):
16 |     records = []
17 |     intidx = [ATTNAME.index(colname) for colname in (
18 |         'age', 'fnlwgt', 'education_num', 'capital-gain', 'capital-loss', 'hours-per-week')]
19 |     try:
20 |         with open(os.path.join(filepath, filename), 'r') as rf:
21 |             for line in rf:
22 |                 line = line.strip()
23 |                 if not line:
24 |                     continue
25 |                 line = [a.strip() for a in line.split(',')]
26 |                 # print(line)
27 |                 for idx in intidx:
28 |                     try:
29 |                         line[idx] = int(line[idx])
30 |                     except ValueError:
31 |                         print('attribute %s, value %s, cannot be converted to a number' %(ATTNAME[idx], line[idx]))
32 |                         line[idx] = -1
33 |                 for idx in range(len(line)):
34 |                     if line[idx] == '' or line[idx] == '?':
35 |                         line[idx] = '*'
36 |                 records.append(line)
37 |         return records
38 |     except IOError:
39 |         print('cannot open file: %s:%s' %(filepath, filename))
40 | 
41 | 
42 | def generate_data_for_laplace_mechanism(records):
43 |     """
44 |     generate three adjacent versions of the dataset for the Laplace Mechanism
45 | 
46 |     Arguments:
47 |         records {[list of list]} -- [original records of the adult dataset]
48 | 
49 |     Returns:
50 |         three adjacent datasets, each with one record removed
51 |         (the oldest, one aged 26, and the youngest, respectively)
52 |     """
53 | 
54 |     oldestidx, twentysixidx, youngestidx = -1, -1, -1
55 |     oldest, youngest = -float('inf'), float('inf')
56 |     ageidx = ATTNAME.index('age')
57 |     for idx, record in enumerate(records):
58 |         """
59 |         age == -1 means the value is missing in the dataset
60 |         """
61 |         if record[ageidx] == -1:
62 |             continue
63 |         if record[ageidx] >= oldest:
64 |             if record[ageidx] != oldest or random.random() >= 0.5:
65 |                 oldestidx, oldest = idx, record[ageidx]
66 |         if record[ageidx] <= youngest:
67 |             if record[ageidx] != youngest or random.random() >= 0.5:
68 |                 youngestidx, youngest = idx, record[ageidx]
69 |         if record[ageidx] == 26 and (twentysixidx == -1 or random.random() >= 0.5):
70 |             twentysixidx = idx
71 |     version1 = _copy_with_exclude_idx(records, oldestidx)
72 |     version2 = _copy_with_exclude_idx(records, twentysixidx)
73 |     version3 = _copy_with_exclude_idx(records, youngestidx)
74 |     return version1, version2, version3#, oldest, youngest
75 | 
76 | 
77 | def generate_data_for_exponential_mechanism(records):
78 |     """
79 |     generate three adjacent versions of the dataset for the Exponential Mechanism
80 | 
81 |     Arguments:
82 |         records {[list of list]} -- [original dataset]
83 |     """
84 |     counter = {}
85 |     eduidx = ATTNAME.index('education')
86 |     for idx, record in enumerate(records):
87 |         if record[eduidx] == '*':
88 |             continue
89 |         counter[record[eduidx]] = counter.get(record[eduidx], []) + [idx]
90 | 
91 |     firstlen, secondlen, leastlen = -float('inf'), -float('inf'), float('inf')
92 |     firstedu, secondedu, leastedu = '', '', ''
93 |     for key, val in counter.items():
94 |         if len(val) > firstlen:
95 |             secondlen = firstlen
96 |             secondedu = firstedu
97 |             firstlen = len(val)
98 |             firstedu = key
99 |         elif len(val) > secondlen:
100 |             secondlen = len(val)
101 |             secondedu = key
102 |         if len(val) < leastlen:
103 |             leastlen = len(val)
104 |             leastedu = key
105 |     firstidx = counter[firstedu][random.randrange(0, firstlen)]
106 |     secondidx = counter[secondedu][random.randrange(0, secondlen)]
107 |     leastidx = counter[leastedu][random.randrange(0, leastlen)]
108 | 
109 |     version1 = _copy_with_exclude_idx(records, firstidx)
110 |     version2 = _copy_with_exclude_idx(records, secondidx)
111 |     version3 = _copy_with_exclude_idx(records, leastidx)
112 |     return version1, version2, version3
113 | 
114 | 
115 | def _copy_with_exclude_idx(records, tgtidx):
116 |     """
117 |     generate a new list of records without the target index tgtidx
118 | 
119 |     Arguments:
120 |         records {[list of list]} -- [original records]
121 |         tgtidx {[int]} -- [index of the record to exclude]
122 | 
123 |     Returns:
124 |         [list of list] -- [copy of records excluding the record at tgtidx]
125 |     """
126 | 
127 |     return [record for idx, record in enumerate(records) if idx != tgtidx]
128 | 
129 | 
130 | def generate_hierarchy_for_age(records):
131 |     youngest, oldest = float('inf'), -float('inf')
132 |     ageidx = ATTNAME.index('age')
133 |     for record in records:
134 |         if record[ageidx] == -1:
135 |             continue
136 |         if record[ageidx] > oldest:
137 |             oldest = record[ageidx]
138 |         if record[ageidx] < youngest:
139 |             youngest = record[ageidx]
140 |     print('age max: %d min: %d' %(oldest, youngest))
141 |     with open(AGECONFFILE, 'w') as wf:
142 |         for i in range(oldest+1):
143 |             h = []
144 |             h.append(str(i))
145 |             # h.append('%s-%s' %(i//10*10, (i//10+1)*10))
146 |             # h.append('%s-%s' %(i//20*20, (i//20+1)*20))
147 |             # h.append('%s-%s' %(i//50*50, (i//50+1)*50))
148 |             # h.append('%s-%s' %(i//100*100, (i//100+1)*100))
149 |             h.append('%s-%s' % (i//25*25, (i//25+1)*25))
150 |             # h.append('%s-%s' % (i//20*20, (i//20+1)*20))
151 |             h.append('%s-%s' % (i//50*50, (i//50+1)*50))
152 |             h.append('%s-%s' % (i//100*100, (i//100+1)*100))
153 |             wf.write(','.join(h))
154 |             wf.write('\n')
155 | 
156 | def generate_hierarchy_for_edu(records):
157 |     eduset = set()
158 |     eduidx = ATTNAME.index('education')
159 |     for record in records:
160 |         if record[eduidx] != '*' and record[eduidx] not in eduset:
161 |             eduset.add(record[eduidx])
162 |     with open(EDUCONFFILE, 'w') as wf:
163 |         for edu in eduset:
164 |             wf.write(edu + ','*2)
165 |             wf.write('\n')
166 | 
167 | def generate_hierarchy_for_marital(records):
168 |     maritalset = set()
169 |     maritalidx = ATTNAME.index('marital-status')
170 |     for record in records:
171 |         if record[maritalidx] != '*' and record[maritalidx] not in maritalset:
172 |             maritalset.add(record[maritalidx])
173 |     with open(MARITALCONFFILE, 'w') as wf:
174 |         for marital in maritalset:
175 |             wf.write(marital + ','*2)
176 |             wf.write('\n')
177 | 
178 | def generate_hierarchy_for_race(records):
179 |     raceset = set()
180 |     raceidx = ATTNAME.index('race')
181 |     for record in records:
182 |         if record[raceidx] != '*' and record[raceidx] not in raceset:
183 |             raceset.add(record[raceidx])
184 |     with open(RACECONFFILE, 'w') as wf:
185 |         for race in raceset:
186 |             wf.write(race+','*2)
187 |             wf.write('\n')
188 | 
189 | 
190 | if __name__ == "__main__":
191 |     print(os.getcwd())
192 |     records = readdata()
193 |     generate_hierarchy_for_age(records)
194 |     # generate_hierarchy_for_edu(records)
195 |     # generate_hierarchy_for_marital(records)
196 |     # generate_hierarchy_for_race(records)
197 | 
--------------------------------------------------------------------------------