├── tutorial
│   ├── __init__.py
│   ├── filepaths.py
│   ├── deidentify.py
│   ├── generate.py
│   └── synthesise.py
├── DataSynthesizer
│   ├── __init__.py
│   ├── lib
│   │   ├── __init__.py
│   │   ├── utils.py
│   │   └── PrivBayes.py
│   ├── datatypes
│   │   ├── __init__.py
│   │   ├── utils
│   │   │   ├── __init__.py
│   │   │   ├── DataType.py
│   │   │   └── AttributeLoader.py
│   │   ├── FloatAttribute.py
│   │   ├── IntegerAttribute.py
│   │   ├── SocialSecurityNumberAttribute.py
│   │   ├── StringAttribute.py
│   │   ├── DateTimeAttribute.py
│   │   └── AbstractAttribute.py
│   ├── README.md
│   ├── ModelInspector.py
│   ├── DataGenerator.py
│   └── DataDescriber.py
├── .gitignore
├── data
│   ├── nhs_ae_gender_codes.csv
│   ├── hospitals_london.txt
│   ├── nhs_ae_treatment_codes.csv
│   ├── hospital_ae_description_random.json
│   └── hospital_ae_description_independent.json
├── requirements.txt
├── plots
│   ├── random_Gender.png
│   ├── random_Treatment.png
│   ├── correlated_Gender.png
│   ├── independent_Gender.png
│   ├── random_Age_bracket.png
│   ├── random_Hospital_ID.png
│   ├── correlated_Treatment.png
│   ├── independent_Treatment.png
│   ├── random_Arrival_Date.png
│   ├── correlated_Age_bracket.png
│   ├── correlated_Arrival_Date.png
│   ├── correlated_Hospital_ID.png
│   ├── independent_Age_bracket.png
│   ├── independent_Hospital_ID.png
│   ├── independent_Arrival_Date.png
│   ├── random_Arrival_hour_range.png
│   ├── random_Time_in_A&E_(mins).png
│   ├── correlated_Arrival_hour_range.png
│   ├── correlated_Time_in_A&E_(mins).png
│   ├── independent_Arrival_hour_range.png
│   ├── independent_Time_in_A&E_(mins).png
│   ├── mutual_information_heatmap_random.png
│   ├── mutual_information_heatmap_correlated.png
│   ├── mutual_information_heatmap_independent.png
│   ├── random_Index_of_Multiple_Deprivation_Decile.png
│   ├── correlated_Index_of_Multiple_Deprivation_Decile.png
│   └── independent_Index_of_Multiple_Deprivation_Decile.png
├── LICENSE
└── README.md

/tutorial/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/lib/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/utils/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.venv/*
*.pyc
data/London postcodes.csv
--------------------------------------------------------------------------------
/data/nhs_ae_gender_codes.csv:
--------------------------------------------------------------------------------
Gender,Code
Not Known,0
Male,1
Female,2
Not Specified,9
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pandas==1.4.2
scipy==1.8.0
scikit-learn==1.0.2
matplotlib==3.5.1
seaborn==0.11.2
--------------------------------------------------------------------------------
/plots/*.png:
--------------------------------------------------------------------------------
27 binary plot images: per-attribute histogram comparisons for the random,
independent and correlated synthesis modes (Gender, Treatment, Age_bracket,
Hospital_ID, Arrival_Date, Arrival_hour_range, Time_in_A&E_(mins) and
Index_of_Multiple_Deprivation_Decile), plus the three
mutual_information_heatmap_* plots. The dump contains only the raw
githubusercontent URLs for these files, omitted here.
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/utils/DataType.py:
--------------------------------------------------------------------------------
from enum import Enum


class DataType(Enum):
    INTEGER = 'Integer'
    FLOAT = 'Float'
    STRING = 'String'
    DATETIME = 'DateTime'
    SOCIAL_SECURITY_NUMBER = 'SocialSecurityNumber'
--------------------------------------------------------------------------------
/data/hospitals_london.txt:
--------------------------------------------------------------------------------
Barnet Hospital
Charing Cross Hospital
Chase Farm Hospital
Chelsea and Westminster Hospital
Croydon University Hospital
Ealing Hospital
Epsom General Hospital
Hillingdon Hospital
Homerton University Hospital
King's College Hospital
Kingston Hospital
Newham General Hospital
North Middlesex Hospital
Northwick Park & St Marks Hospital
Princess Royal University Hospital
Queen Elizabeth Hospital
Queen's Hospital
Royal London Hospital
St Mary's Hospital
St Thomas' Hospital
The Royal Free Hospital
University College Hospital
University Hospital Lewisham
West Middlesex University Hospital
Whipps Cross University Hospital
The Whittington Hospital
--------------------------------------------------------------------------------
/data/nhs_ae_treatment_codes.csv:
--------------------------------------------------------------------------------
Treatment,Code
Dressing,01
Bandage/support,02
Sutures,03
Wound closure (excluding sutures),04
Plaster of Paris,05
Splint,06
Removal foreign body,08
Physiotherapy,09
Incision & drainage,11
Central line,13
Chest drain,16
Urinary catheter/suprapubic,17
Defibrillation/pacing,18
Resuscitation/cardiopulmonary resuscitation,19
Minor surgery,20
Guidance/advice only,22
Anaesthesia,23
Tetanus,24
Nebuliser/spacer,25
Recording vital signs,30
Burns review,31
Fracture review,33
Wound cleaning,34
Dressing/wound review,35
Sling/collar cuff/broad arm sling,36
Nasal airway,38
Oral airway,39
Arterial line,42
Infusion fluids,43
Blood product transfusion,44
Lumbar puncture,46
Joint aspiration,47
Occupational Therapy,52
Social work intervention,54
Eye,55
Dental treatment,56
Prescription/medicines prepared to take away,57
Other (consider alternatives),27
None (consider guidance/advice option),99
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/FloatAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType


class FloatAttribute(AbstractAttribute):
    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = True
        self.data_type = DataType.FLOAT

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        super().infer_domain(categorical_domain, numerical_range)

    def infer_distribution(self):
        super().infer_distribution()

    def generate_values_as_candidate_key(self, n):
        return np.arange(self.min, self.max, (self.max - self.min) / n)

    def sample_values_from_binning_indices(self, binning_indices):
        return super().sample_values_from_binning_indices(binning_indices)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 The Open Data Institute

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/IntegerAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType


class IntegerAttribute(AbstractAttribute):
    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = True
        self.data_type = DataType.INTEGER

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        super().infer_domain(categorical_domain, numerical_range)
        self.min = int(self.min)
        self.max = int(self.max)

    def infer_distribution(self):
        super().infer_distribution()

    def generate_values_as_candidate_key(self, n):
        return super().generate_values_as_candidate_key(n)

    def sample_values_from_binning_indices(self, binning_indices):
        column = super().sample_values_from_binning_indices(binning_indices)
        column[~column.isnull()] = column[~column.isnull()].astype(int)
        return column
--------------------------------------------------------------------------------
/DataSynthesizer/README.md:
--------------------------------------------------------------------------------
# DataSynthesizer

All code in this directory is from the open-source [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) project.

You can read more on the project and related papers at the [Privacy-Preserving Synthetic Data project page](https://homes.cs.washington.edu/~billhowe//projects/2017/07/20/Data-Synthesizer.html).

## Usage

DataSynthesizer generates a synthetic dataset from a sensitive one for public release. It is developed in Python 3.6 and requires some third-party modules: numpy, scipy, pandas, and dateutil.

## License

Copyright <2018>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
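The usual DataSynthesizer workflow is describe-then-generate. A minimal sketch of that pipeline follows; the DataGenerator and ModelInspector calls match the code shown later in this dump, but DataDescriber.py is not included in this excerpt, so its constructor argument and method names here follow the upstream project and should be treated as assumptions:

```python
import pandas as pd

from DataDescriber import DataDescriber      # API assumed from upstream DataSynthesizer
from DataGenerator import DataGenerator
from ModelInspector import ModelInspector
from lib.utils import read_json_file

# 1. Describe the sensitive dataset; correlated attribute mode learns a
#    differentially private Bayesian network over the attributes.
describer = DataDescriber(category_threshold=20)           # assumed argument name
describer.describe_dataset_in_correlated_attribute_mode(   # assumed method name
    dataset_file='sensitive.csv', epsilon=1.0, k=2)
describer.save_dataset_description_to_file('description.json')

# 2. Generate synthetic rows from the saved description.
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(10000, 'description.json')
generator.save_synthetic_data('synthetic.csv')

# 3. Compare the sensitive and synthetic datasets attribute by attribute.
description = read_json_file('description.json')
inspector = ModelInspector(pd.read_csv('sensitive.csv'), pd.read_csv('synthetic.csv'),
                           description['attribute_description'])
inspector.mutual_information_heatmap('heatmap.png')
```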
/tutorial/filepaths.py:
--------------------------------------------------------------------------------
import os
import sys
from pathlib import Path

this_filepath = Path(os.path.realpath(__file__))
project_root = str(this_filepath.parents[1])

data_dir = os.path.join(project_root, 'data/')

# add the DataSynthesizer repo to the pythonpath
data_synthesizer_dir = os.path.join(project_root, 'DataSynthesizer/')
sys.path.append(data_synthesizer_dir)

plots_dir = os.path.join(project_root, 'plots/')

postcodes_london = os.path.join(data_dir, 'London postcodes.csv')
hospitals_london = os.path.join(data_dir, 'hospitals_london.txt')
nhs_ae_gender_codes = os.path.join(data_dir, 'nhs_ae_gender_codes.csv')
nhs_ae_treatment_codes = os.path.join(data_dir, 'nhs_ae_treatment_codes.csv')
age_population_london = os.path.join(data_dir, 'age_population_london.csv')

hospital_ae_data = os.path.join(data_dir, 'hospital_ae_data.csv')
hospital_ae_data_deidentify = os.path.join(data_dir, 'hospital_ae_data_deidentify.csv')

hospital_ae_data_synthetic_random = os.path.join(
    data_dir, 'hospital_ae_data_synthetic_random.csv')
hospital_ae_data_synthetic_independent = os.path.join(
    data_dir, 'hospital_ae_data_synthetic_independent.csv')
hospital_ae_data_synthetic_correlated = os.path.join(
    data_dir, 'hospital_ae_data_synthetic_correlated.csv')

hospital_ae_description_random = os.path.join(
    data_dir, 'hospital_ae_description_random.json')
hospital_ae_description_independent = os.path.join(
    data_dir, 'hospital_ae_description_independent.json')
hospital_ae_description_correlated = os.path.join(
    data_dir, 'hospital_ae_description_correlated.json')
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/utils/AttributeLoader.py:
--------------------------------------------------------------------------------
from pandas import Series

from datatypes.DateTimeAttribute import DateTimeAttribute
from datatypes.FloatAttribute import FloatAttribute
from datatypes.IntegerAttribute import IntegerAttribute
from datatypes.SocialSecurityNumberAttribute import SocialSecurityNumberAttribute
from datatypes.StringAttribute import StringAttribute
from datatypes.utils.DataType import DataType


def parse_json(attribute_in_json):
    name = attribute_in_json['name']
    data_type = DataType(attribute_in_json['data_type'])
    is_candidate_key = attribute_in_json['is_candidate_key']
    is_categorical = attribute_in_json['is_categorical']
    histogram_size = len(attribute_in_json['distribution_bins'])
    if data_type is DataType.INTEGER:
        attribute = IntegerAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.FLOAT:
        attribute = FloatAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.DATETIME:
        attribute = DateTimeAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.STRING:
        attribute = StringAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.SOCIAL_SECURITY_NUMBER:
        attribute = SocialSecurityNumberAttribute(name, is_candidate_key,
                                                  is_categorical, histogram_size, Series())
    else:
        raise Exception('Data type {} is unknown.'.format(data_type.value))

    attribute.missing_rate = attribute_in_json['missing_rate']
    attribute.min = attribute_in_json['min']
    attribute.max = attribute_in_json['max']
    attribute.distribution_bins = attribute_in_json['distribution_bins']
    attribute.distribution_probabilities = attribute_in_json['distribution_probabilities']

    return attribute
--------------------------------------------------------------------------------
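parse_json rebuilds an attribute object purely from its JSON description, so synthetic values can be sampled without ever touching the sensitive data again. A minimal sketch with a hand-written categorical description (all values illustrative; it assumes the DataSynthesizer directory is on sys.path, as tutorial/filepaths.py arranges):

```python
from datatypes.utils.AttributeLoader import parse_json

gender_description = {
    'name': 'Gender',
    'data_type': 'String',
    'is_candidate_key': False,
    'is_categorical': True,
    'missing_rate': 0.0,
    'min': 4,                      # for strings, min/max are string lengths
    'max': 6,
    'distribution_bins': ['Female', 'Male'],
    'distribution_probabilities': [0.5, 0.5],
}

attribute = parse_json(gender_description)
indices = attribute.sample_binning_indices_in_independent_attribute_mode(5)
print(attribute.sample_values_from_binning_indices(indices).tolist())
# e.g. ['Male', 'Female', 'Female', 'Male', 'Female']
```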
/DataSynthesizer/datatypes/SocialSecurityNumberAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType


def pre_process(column: Series):
    if column.size == 0:
        return column
    elif type(column.iloc[0]) is int:
        return column
    elif type(column.iloc[0]) is str:
        return column.map(lambda x: int(x.replace('-', '')))
    else:
        raise Exception('Invalid SocialSecurityNumber.')


def is_ssn(value):
    """Test whether a number is between 0 and 1e9.

    Note this function does not take into consideration some special numbers that are never allocated.
    https://en.wikipedia.org/wiki/Social_Security_number
    """
    if type(value) is int:
        return 0 < value < 1e9
    elif type(value) is str:
        value = value.replace('-', '')
        if value.isdigit():
            return 0 < int(value) < 1e9
    return False


class SocialSecurityNumberAttribute(AbstractAttribute):
    """SocialSecurityNumber of format AAA-GG-SSSS.

    """

    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, pre_process(data))
        self.is_numerical = True
        self.data_type = DataType.SOCIAL_SECURITY_NUMBER

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        super().infer_domain(categorical_domain, numerical_range)
        self.min = int(self.min)
        self.max = int(self.max)

    def infer_distribution(self):
        super().infer_distribution()

    def generate_values_as_candidate_key(self, n):
        if n < 1e9:
            values = np.linspace(0, 1e9 - 1, num=n, dtype=int)
            values = np.random.permutation(values)
            values = [str(i).zfill(9) for i in values]
            return ['{}-{}-{}'.format(i[:3], i[3:5], i[5:]) for i in values]
        else:
            raise Exception('The candidate key "{}" cannot generate more than 1e9 distinct values.'.format(self.name))

    def sample_values_from_binning_indices(self, binning_indices):
        return super().sample_values_from_binning_indices(binning_indices)
--------------------------------------------------------------------------------
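A quick illustration of the helpers above (output values illustrative, assuming `from pandas import Series` and the class in scope):

```python
print(is_ssn('078-05-1120'))   # True  - digits with dashes, between 0 and 1e9
print(is_ssn('Monday'))        # False - not a digit string

ssn = SocialSecurityNumberAttribute('SSN', True, False, 20,
                                    Series(['078-05-1120', '219-09-9999']))
print(ssn.generate_values_as_candidate_key(3))
# e.g. ['499-99-9999', '000-00-0000', '999-99-9999']
# evenly spaced integers, shuffled, zero-padded into AAA-GG-SSSS form
```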
/DataSynthesizer/lib/utils.py:
--------------------------------------------------------------------------------
import json
import random
from string import ascii_lowercase

import numpy as np
from pandas import Series, DataFrame
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score


def set_random_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)


def mutual_information(labels_x: Series, labels_y: DataFrame):
    """Mutual information of distributions in format of Series or DataFrame.

    Parameters
    ----------
    labels_x : Series
    labels_y : DataFrame
    """
    if labels_y.shape[1] == 1:
        labels_y = labels_y.iloc[:, 0]
    else:
        # Series.get_values() was removed from pandas; join the row values instead
        labels_y = labels_y.apply(lambda x: ' '.join(map(str, x.values)), axis=1)

    return mutual_info_score(labels_x, labels_y)


def pairwise_attributes_mutual_information(dataset):
    """Compute normalized mutual information for all pairwise attributes. Return a DataFrame."""
    sorted_columns = sorted(dataset.columns)
    mi_df = DataFrame(columns=sorted_columns, index=sorted_columns, dtype=float)
    for row in mi_df.columns:
        for col in mi_df.columns:
            mi_df.loc[row, col] = normalized_mutual_info_score(dataset[row].astype(str),
                                                               dataset[col].astype(str),
                                                               average_method='arithmetic')
    return mi_df


def normalize_given_distribution(frequencies):
    distribution = np.array(frequencies, dtype=float)
    distribution = distribution.clip(0)  # replace negative values with 0
    summation = distribution.sum()
    if summation > 0:
        return distribution / summation
    else:
        return np.full_like(distribution, 1 / distribution.size)


def read_json_file(json_file):
    with open(json_file, 'r') as file:
        return json.load(file)


def infer_numerical_attributes_in_dataframe(dataframe):
    describe = dataframe.describe()
    # DataFrame.describe() usually returns 8 rows.
    if describe.shape[0] == 8:
        return set(describe.columns)
    # DataFrame.describe() returns less than 8 rows when there is no numerical attribute.
    else:
        return set()


def display_bayesian_network(bn):
    length = 0
    for child, _ in bn:
        if len(child) > length:
            length = len(child)

    print('Constructed Bayesian network:')
    for child, parents in bn:
        print("    {0:{width}} has parents {1}.".format(child, parents, width=length))


def generate_random_string(length):
    return ''.join(np.random.choice(list(ascii_lowercase), size=length))
--------------------------------------------------------------------------------
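pairwise_attributes_mutual_information is what drives the heatmap comparisons later: a normalized score of 1 means one column fully determines the other, 0 means they are independent. A toy run (illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': ['x', 'x', 'y', 'y'],   # b is a relabelling of a
                   'c': [1, 2, 1, 2]})          # c is independent of a and b
print(pairwise_attributes_mutual_information(df))
#      a    b    c
# a  1.0  1.0  0.0
# b  1.0  1.0  0.0
# c  0.0  0.0  1.0

print(normalize_given_distribution([2, -1, 2]))  # negatives clipped: [0.5 0.  0.5]
```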
/DataSynthesizer/datatypes/StringAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType
from lib import utils


class StringAttribute(AbstractAttribute):
    """Variable min and max are the lengths of the shortest and longest strings.

    """

    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = False
        self.data_type = DataType.STRING
        self.data_dropna_len = self.data_dropna.astype(str).map(len)

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        if categorical_domain:
            lengths = [len(i) for i in categorical_domain]
            self.min = min(lengths)
            self.max = max(lengths)
            self.distribution_bins = np.array(categorical_domain)
        else:
            self.min = int(self.data_dropna_len.min())
            self.max = int(self.data_dropna_len.max())
            if self.is_categorical:
                self.distribution_bins = self.data_dropna.unique()
            else:
                self.distribution_bins = np.array([self.min, self.max])

        # np.full rather than np.full_like: the bins hold strings or ints, and
        # full_like would coerce these float probabilities to that dtype
        self.distribution_probabilities = np.full(self.distribution_bins.size, 1 / self.distribution_bins.size)

    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.data_dropna_len, bins=self.histogram_size)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])
            bins = distribution[1][:-1]
            bins[0] = bins[0] - 0.001 * (bins[1] - bins[0])
            self.distribution_bins = bins

    def generate_values_as_candidate_key(self, n):
        length = np.random.randint(self.min, self.max)
        vectorized = np.vectorize(lambda x: '{}{}'.format(utils.generate_random_string(length), x))
        return vectorized(np.arange(n))

    def sample_values_from_binning_indices(self, binning_indices):
        column = super().sample_values_from_binning_indices(binning_indices)
        if not self.is_categorical:
            column[~column.isnull()] = column[~column.isnull()].apply(lambda x: utils.generate_random_string(int(x)))

        return column
--------------------------------------------------------------------------------
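Categorical strings are described by their observed value counts, while non-categorical strings are reduced to a histogram over string lengths and synthesised as random strings of a sampled length. A toy categorical run (illustrative):

```python
from pandas import Series

colours = StringAttribute('Colour', False, True, 20,
                          Series(['red', 'red', 'blue', 'green']))
colours.infer_domain()
colours.infer_distribution()
print(colours.distribution_bins)            # ['blue' 'green' 'red']  (sorted)
print(colours.distribution_probabilities)   # [0.25 0.25 0.5]
```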
/DataSynthesizer/datatypes/DateTimeAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from dateutil.parser import parse
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType
from lib.utils import normalize_given_distribution


def is_datetime(value: str):
    """Find whether a value is a datetime. Here weekdays and months are categorical values instead of datetime."""
    weekdays = {'mon', 'monday', 'tue', 'tuesday', 'wed', 'wednesday', 'thu', 'thursday', 'fri', 'friday',
                'sat', 'saturday', 'sun', 'sunday'}
    months = {'jan', 'january', 'feb', 'february', 'mar', 'march', 'apr', 'april', 'may', 'jun', 'june',
              'jul', 'july', 'aug', 'august', 'sep', 'sept', 'september', 'oct', 'october', 'nov', 'november',
              'dec', 'december'}

    value_lower = value.lower()
    if (value_lower in weekdays) or (value_lower in months):
        return False
    try:
        parse(value)
        return True
    except ValueError:
        return False


# TODO detect datetime formats
class DateTimeAttribute(AbstractAttribute):
    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = True
        self.data_type = DataType.DATETIME
        epoch_datetime = parse('1970-01-01')
        self.timestamps = self.data_dropna.map(lambda x: int((parse(x) - epoch_datetime).total_seconds()))

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        if numerical_range:
            self.min, self.max = numerical_range
            self.distribution_bins = np.array([self.min, self.max])
        else:
            # the domain is measured in epoch seconds, so take min/max over the
            # parsed timestamps rather than over the raw datetime strings
            self.min = float(self.timestamps.min())
            self.max = float(self.timestamps.max())
            if self.is_categorical:
                self.distribution_bins = self.data_dropna.unique()
            else:
                self.distribution_bins = np.array([self.min, self.max])

        # np.full rather than np.full_like, so the probabilities keep a float dtype
        self.distribution_probabilities = np.full(self.distribution_bins.size, 1 / self.distribution_bins.size)

    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.timestamps, bins=self.histogram_size, range=(self.min, self.max))
            self.distribution_probabilities = normalize_given_distribution(distribution[0])
            bins = distribution[1][:-1]
            bins[0] = bins[0] - 0.001 * (bins[1] - bins[0])
            self.distribution_bins = bins

    def generate_values_as_candidate_key(self, n):
        return np.arange(self.min, self.max, (self.max - self.min) / n)

    def sample_values_from_binning_indices(self, binning_indices):
        column = super().sample_values_from_binning_indices(binning_indices)
        column[~column.isnull()] = column[~column.isnull()].astype(int)
        return column
--------------------------------------------------------------------------------
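is_datetime delegates to dateutil but deliberately rejects bare weekday and month names, so those are treated as ordinary categories. For instance:

```python
print(is_datetime('2019-03-04 21:17:00'))  # True  - dateutil can parse it
print(is_datetime('Friday'))               # False - weekdays stay categorical
print(is_datetime('March'))                # False - months likewise
```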
/DataSynthesizer/ModelInspector.py:
--------------------------------------------------------------------------------
from typing import List

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from lib.utils import pairwise_attributes_mutual_information, normalize_given_distribution

matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

sns.set()


class ModelInspector(object):
    def __init__(self, private_df: pd.DataFrame, synthetic_df: pd.DataFrame, attribute_description):
        self.private_df = private_df
        self.synthetic_df = synthetic_df
        self.attribute_description = attribute_description

        self.candidate_keys = set()
        for attr in synthetic_df:
            if synthetic_df[attr].unique().size == synthetic_df.shape[0]:
                self.candidate_keys.add(attr)

        self.private_df.drop(columns=self.candidate_keys, inplace=True)
        self.synthetic_df.drop(columns=self.candidate_keys, inplace=True)

    def compare_histograms(self, attribute, figure_filepath):
        datatype = self.attribute_description[attribute]['data_type']
        is_categorical = self.attribute_description[attribute]['is_categorical']

        # ignore datetime attributes, since they are converted into timestamps
        if datatype == 'DateTime':
            return
        # ignore non-categorical string attributes
        elif datatype == 'String' and not is_categorical:
            return
        elif attribute in self.candidate_keys:
            return
        else:
            fig = plt.figure(figsize=(25, 12), dpi=120)
            ax1 = fig.add_subplot(121)
            ax2 = fig.add_subplot(122)

            if is_categorical:
                dist_priv = self.private_df[attribute].value_counts()
                dist_synt = self.synthetic_df[attribute].value_counts()
                # Series.iteritems() is deprecated; items() is the stable spelling
                for idx, number in dist_priv.items():
                    if idx not in dist_synt.index:
                        dist_synt.loc[idx] = 0
                for idx, number in dist_synt.items():
                    if idx not in dist_priv.index:
                        dist_priv.loc[idx] = 0
                dist_priv.index = [str(i) for i in dist_priv.index]
                dist_synt.index = [str(i) for i in dist_synt.index]
                dist_priv.sort_index(inplace=True)
                dist_synt.sort_index(inplace=True)
                pos_priv = list(range(len(dist_priv)))
                pos_synt = list(range(len(dist_synt)))
                ax1.bar(pos_priv, normalize_given_distribution(dist_priv.values), align='center', width=0.8)
                ax2.bar(pos_synt, normalize_given_distribution(dist_synt.values), align='center', width=0.8)
                ax1.set_xticks(np.arange(min(pos_priv), max(pos_priv) + 1, 1.0))
                ax2.set_xticks(np.arange(min(pos_synt), max(pos_synt) + 1, 1.0))
                ax1.set_xticklabels(dist_priv.index.tolist(), fontsize=10)
                ax2.set_xticklabels(dist_synt.index.tolist(), fontsize=10)
            # the rest are non-categorical numeric attributes.
            else:
                ax1.hist(self.private_df[attribute].dropna(), bins=15, align='left', density=True)
                ax2.hist(self.synthetic_df[attribute].dropna(), bins=15, align='left', density=True)

            ax1_x_min, ax1_x_max = ax1.get_xlim()
            ax2_x_min, ax2_x_max = ax2.get_xlim()
            ax1_y_min, ax1_y_max = ax1.get_ylim()
            ax2_y_min, ax2_y_max = ax2.get_ylim()
            x_min = min(ax1_x_min, ax2_x_min)
            x_max = max(ax1_x_max, ax2_x_max)
            y_min = min(ax1_y_min, ax2_y_min)
            y_max = max(ax1_y_max, ax2_y_max)
            ax1.set_xlim([x_min, x_max])
            ax1.set_ylim([y_min, y_max])
            ax2.set_xlim([x_min, x_max])
            ax2.set_ylim([y_min, y_max])
            fig.autofmt_xdate()

            plt.savefig(figure_filepath, bbox_inches='tight')
            plt.close()

    def mutual_information_heatmap(self, figure_filepath, attributes: List = None):
        if attributes:
            private_df = self.private_df[attributes]
            synthetic_df = self.synthetic_df[attributes]
        else:
            private_df = self.private_df
            synthetic_df = self.synthetic_df

        private_mi = pairwise_attributes_mutual_information(private_df)
        synthetic_mi = pairwise_attributes_mutual_information(synthetic_df)

        fig = plt.figure(figsize=(15, 6), dpi=120)
        fig.suptitle('Pairwise Mutual Information Comparison (Private vs Synthetic)', fontsize=20)
        ax1 = fig.add_subplot(121)
        ax2 = fig.add_subplot(122)
        sns.heatmap(private_mi, ax=ax1, cmap="GnBu")
        sns.heatmap(synthetic_mi, ax=ax2, cmap="GnBu")
        ax1.set_title('Private, max=1', fontsize=15)
        ax2.set_title('Synthetic, max=1', fontsize=15)
        fig.autofmt_xdate()
        fig.tight_layout()
        plt.subplots_adjust(top=0.83)

        plt.savefig(figure_filepath, bbox_inches='tight')
        plt.close()


if __name__ == '__main__':
    # Directories of input and output files
    input_dataset_file = '../datasets/AdultIncomeData/adult.csv'
    dataset_description_file = '../output/description/AdultIncomeData_description.txt'
    synthetic_dataset_file = '../output/synthetic_data/AdultIncomeData_synthetic.csv'

    df = pd.read_csv(input_dataset_file)
    print(df.head(5))
--------------------------------------------------------------------------------
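Usage mirrors the tutorial's inspection step; this sketch reuses file names that appear elsewhere in the repository (the ModelInspector API is as shown above):

```python
import pandas as pd
from lib.utils import read_json_file

description = read_json_file('data/hospital_ae_description_random.json')
inspector = ModelInspector(
    pd.read_csv('data/hospital_ae_data_deidentify.csv'),
    pd.read_csv('data/hospital_ae_data_synthetic_random.csv'),
    description['attribute_description'])

inspector.compare_histograms('Gender', 'plots/random_Gender.png')
inspector.mutual_information_heatmap('plots/mutual_information_heatmap_random.png')
```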
/DataSynthesizer/datatypes/AbstractAttribute.py:
--------------------------------------------------------------------------------
from abc import ABCMeta, abstractmethod
from bisect import bisect_right
from random import uniform
from typing import List, Union

import numpy as np
from numpy.random import choice
from pandas import Series

from datatypes.utils import DataType
from lib import utils


class AbstractAttribute(object):
    __metaclass__ = ABCMeta

    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        self.name = name
        self.is_candidate_key = is_candidate_key
        self.is_categorical = is_categorical
        self.histogram_size: Union[int, str] = histogram_size
        self.data: Series = data
        self.data_dropna: Series = self.data.dropna()
        self.missing_rate: float = (self.data.size - self.data_dropna.size) / (self.data.size or 1)

        self.is_numerical: bool = None
        self.data_type: DataType = None
        self.min = None
        self.max = None
        self.distribution_bins: np.ndarray = None
        self.distribution_probabilities: np.ndarray = None

    @abstractmethod
    def infer_domain(self, categorical_domain: List = None, numerical_range: List = None):
        """Infer the attribute's domain: min, max and an initial 1-D distribution over its bins.

        """
        if categorical_domain:
            self.min = min(categorical_domain)
            self.max = max(categorical_domain)
            self.distribution_bins = np.array(categorical_domain)
        elif numerical_range:
            self.min, self.max = numerical_range
            self.distribution_bins = np.array([self.min, self.max])
        else:
            self.min = float(self.data_dropna.min())
            self.max = float(self.data_dropna.max())
            if self.is_categorical:
                self.distribution_bins = self.data_dropna.unique()
            else:
                self.distribution_bins = np.array([self.min, self.max])

        # np.full rather than np.full_like: the bins may have an int or str dtype,
        # which would truncate these float probabilities
        self.distribution_probabilities = np.full(self.distribution_bins.size, 1 / self.distribution_bins.size)

    @abstractmethod
    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.data_dropna, bins=self.histogram_size, range=(self.min, self.max))
            self.distribution_bins = distribution[1][:-1]  # Remove the last bin edge
            self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])

    def inject_laplace_noise(self, epsilon=0.1, num_valid_attributes=10):
        if epsilon > 0:
            noisy_scale = num_valid_attributes / (epsilon * self.data.size)
            laplace_noises = np.random.laplace(0, scale=noisy_scale, size=len(self.distribution_probabilities))
            noisy_distribution = self.distribution_probabilities + laplace_noises
            self.distribution_probabilities = utils.normalize_given_distribution(noisy_distribution)

    def encode_values_into_bin_idx(self):
        """Encode values into bin indices for Bayesian Network construction.

        """
        if self.is_categorical:
            value_to_bin_idx = {value: idx for idx, value in enumerate(self.distribution_bins)}
            encoded = self.data.map(lambda x: value_to_bin_idx[x], na_action='ignore')
        else:
            encoded = self.data.map(lambda x: bisect_right(self.distribution_bins, x) - 1, na_action='ignore')

        encoded.fillna(len(self.distribution_bins), inplace=True)
        return encoded.astype(int, copy=False)

    def to_json(self):
        """Encode attribute information in JSON format / Python dictionary.

        """
        return {"name": self.name,
                "data_type": self.data_type.value,
                "is_categorical": self.is_categorical,
                "is_candidate_key": self.is_candidate_key,
                "min": self.min,
                "max": self.max,
                "missing_rate": self.missing_rate,
                "distribution_bins": self.distribution_bins.tolist(),
                "distribution_probabilities": self.distribution_probabilities.tolist()}

    @abstractmethod
    def generate_values_as_candidate_key(self, n):
        """When attribute should be a candidate key in output dataset.

        """
        return np.arange(n)

    def sample_binning_indices_in_independent_attribute_mode(self, n):
        """Sample an array of binning indices.

        """
        return Series(choice(len(self.distribution_probabilities), size=n, p=self.distribution_probabilities))

    @abstractmethod
    def sample_values_from_binning_indices(self, binning_indices):
        """Convert binning indices into values in domain. Used by both independent and correlated attribute mode.

        """
        return binning_indices.apply(lambda x: self.uniform_sampling_within_a_bin(x))

    def uniform_sampling_within_a_bin(self, bin_idx: int):
        num_bins = len(self.distribution_bins)
        if bin_idx == num_bins:
            return np.nan
        elif self.is_categorical:
            return self.distribution_bins[bin_idx]
        elif bin_idx < num_bins - 1:
            return uniform(self.distribution_bins[bin_idx], self.distribution_bins[bin_idx + 1])
        else:
            # sample from the last interval where the right edge is missing in self.distribution_bins
            neg_2, neg_1 = self.distribution_bins[-2:]
            return uniform(neg_1, 2 * neg_1 - neg_2)
--------------------------------------------------------------------------------
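The encode/sample pair is the core of the whole pipeline: encode_values_into_bin_idx turns raw values into bin indices for the Bayesian network, and sample_values_from_binning_indices maps sampled indices back into the attribute's domain. A toy round trip using the concrete IntegerAttribute subclass (values illustrative):

```python
from pandas import Series
from datatypes.IntegerAttribute import IntegerAttribute

ages = IntegerAttribute('Age', False, False, 4, Series([18, 25, 31, 44, 60, 73]))
ages.infer_domain()        # min=18, max=73
ages.infer_distribution()  # 4 histogram bins with left edges 18, 31.75, 45.5, 59.25

encoded = ages.encode_values_into_bin_idx()
print(encoded.tolist())    # [0, 0, 0, 1, 3, 3]
decoded = ages.sample_values_from_binning_indices(encoded)
print(decoded.tolist())    # integers drawn uniformly within each sampled bin
```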
/DataSynthesizer/DataGenerator.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd

from datatypes.utils.AttributeLoader import parse_json
from lib.utils import set_random_seed, read_json_file, generate_random_string


class DataGenerator(object):
    def __init__(self):
        self.n = 0
        self.synthetic_dataset = None
        self.description = {}
        self.encoded_dataset = None

    def generate_dataset_in_random_mode(self, n, description_file, seed=0, minimum=0, maximum=100):
        set_random_seed(seed)
        description = read_json_file(description_file)

        self.synthetic_dataset = pd.DataFrame()
        for attr in description['attribute_description'].keys():
            attr_info = description['attribute_description'][attr]
            datatype = attr_info['data_type']
            is_categorical = attr_info['is_categorical']
            is_candidate_key = attr_info['is_candidate_key']
            if is_candidate_key:
                self.synthetic_dataset[attr] = parse_json(attr_info).generate_values_as_candidate_key(n)
            elif is_categorical:
                self.synthetic_dataset[attr] = np.random.choice(attr_info['distribution_bins'], n)
            elif datatype == 'String':
                length = np.random.randint(attr_info['min'], attr_info['max'])
                self.synthetic_dataset[attr] = length
                self.synthetic_dataset[attr] = self.synthetic_dataset[attr].map(lambda x: generate_random_string(x))
            else:
                if datatype == 'Integer':
                    self.synthetic_dataset[attr] = np.random.randint(minimum, maximum + 1, n)
                else:
                    self.synthetic_dataset[attr] = np.random.uniform(minimum, maximum, n)

    def generate_dataset_in_independent_mode(self, n, description_file, seed=0):
        set_random_seed(seed)
        self.description = read_json_file(description_file)

        all_attributes = self.description['meta']['all_attributes']
        candidate_keys = set(self.description['meta']['candidate_keys'])
        self.synthetic_dataset = pd.DataFrame(columns=all_attributes)
        for attr in all_attributes:
            attr_info = self.description['attribute_description'][attr]
            column = parse_json(attr_info)

            if attr in candidate_keys:
                self.synthetic_dataset[attr] = column.generate_values_as_candidate_key(n)
            else:
                binning_indices = column.sample_binning_indices_in_independent_attribute_mode(n)
                self.synthetic_dataset[attr] = column.sample_values_from_binning_indices(binning_indices)

    def generate_dataset_in_correlated_attribute_mode(self, n, description_file, seed=0):
        set_random_seed(seed)
        self.n = n
        self.description = read_json_file(description_file)

        all_attributes = self.description['meta']['all_attributes']
        candidate_keys = set(self.description['meta']['candidate_keys'])
        self.encoded_dataset = DataGenerator.generate_encoded_dataset(self.n, self.description)
        self.synthetic_dataset = pd.DataFrame(columns=all_attributes)
        for attr in all_attributes:
            attr_info = self.description['attribute_description'][attr]
            column = parse_json(attr_info)

            if attr in self.encoded_dataset:
                self.synthetic_dataset[attr] = column.sample_values_from_binning_indices(self.encoded_dataset[attr])
            elif attr in candidate_keys:
                self.synthetic_dataset[attr] = column.generate_values_as_candidate_key(n)
            else:
                # for attributes not in BN or candidate keys, use independent attribute mode.
                binning_indices = column.sample_binning_indices_in_independent_attribute_mode(n)
                self.synthetic_dataset[attr] = column.sample_values_from_binning_indices(binning_indices)

    @staticmethod
    def get_sampling_order(bn):
        order = [bn[0][1][0]]
        for child, _ in bn:
            order.append(child)
        return order

    @staticmethod
    def generate_encoded_dataset(n, description):
        bn = description['bayesian_network']
        bn_root_attr = bn[0][1][0]
        root_attr_dist = description['conditional_probabilities'][bn_root_attr]
        encoded_df = pd.DataFrame(columns=DataGenerator.get_sampling_order(bn))
        encoded_df[bn_root_attr] = np.random.choice(len(root_attr_dist), size=n, p=root_attr_dist)

        for child, parents in bn:
            child_conditional_distributions = description['conditional_probabilities'][child]
            for parents_instance in child_conditional_distributions.keys():
                dist = child_conditional_distributions[parents_instance]
                parents_instance = list(eval(parents_instance))

                filter_condition = ''
                for parent, value in zip(parents, parents_instance):
                    filter_condition += f"(encoded_df['{parent}']=={value})&"

                filter_condition = eval(filter_condition[:-1])

                size = encoded_df[filter_condition].shape[0]
                if size:
                    encoded_df.loc[filter_condition, child] = np.random.choice(len(dist), size=size, p=dist)

            unconditioned_distribution = description['attribute_description'][child]['distribution_probabilities']
            encoded_df.loc[encoded_df[child].isnull(), child] = np.random.choice(len(unconditioned_distribution),
                                                                                 size=encoded_df[child].isnull().sum(),
                                                                                 p=unconditioned_distribution)
        encoded_df[encoded_df.columns] = encoded_df[encoded_df.columns].astype(int)
        return encoded_df

    def save_synthetic_data(self, to_file):
        self.synthetic_dataset.to_csv(to_file, index=False)


if __name__ == '__main__':
    from time import time

    dataset_description_file = '../out/AdultIncome/description_test.txt'
    dataset_description_file = '/home/haoyue/GitLab/data-responsibly-webUI/dataResponsiblyUI/static/intermediatedata/1498175138.8088856_description.txt'

    generator = DataGenerator()

    t = time()
    generator.generate_dataset_in_correlated_attribute_mode(51, dataset_description_file)
    print('running time: {} s'.format(time() - t))
    print(generator.synthetic_dataset.loc[:50])
--------------------------------------------------------------------------------
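In correlated attribute mode the description file carries the Bayesian network and its conditional probability tables; generate_encoded_dataset samples the root attribute from its marginal and each child from the table matching its parents' sampled bins. A minimal hand-written description shows the shape of that input (all numbers illustrative; it assumes the DataSynthesizer directory is on sys.path):

```python
from DataGenerator import DataGenerator

toy_description = {
    'bayesian_network': [['Treatment', ['Gender']]],   # child Treatment, parent Gender
    'conditional_probabilities': {
        'Gender': [0.5, 0.5],                 # root attribute: plain distribution over bins
        'Treatment': {'[0]': [0.9, 0.1],      # P(Treatment | Gender = bin 0)
                      '[1]': [0.2, 0.8]},     # P(Treatment | Gender = bin 1)
    },
    'attribute_description': {
        'Treatment': {'distribution_probabilities': [0.5, 0.5]},  # fallback marginal
    },
}

encoded = DataGenerator.generate_encoded_dataset(1000, toy_description)
print(encoded.groupby('Gender')['Treatment'].mean())  # ~0.1 for bin 0, ~0.8 for bin 1
```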
/tutorial/deidentify.py:
--------------------------------------------------------------------------------
'''
Takes the Hospitals A&E data generated from generate.py and runs it through a
set of de-identification steps. It then saves this as a new dataset.
'''
import random
import time
import string

import pandas as pd

import filepaths


def main():
    print('running de-identification steps...')
    start = time.time()

    # "_df" is the usual way people refer to a Pandas DataFrame object
    hospital_ae_df = pd.read_csv(filepaths.hospital_ae_data)

    print('removing Health Service ID numbers...')
    hospital_ae_df = remove_health_service_numbers(hospital_ae_df)

    print('converting postcodes to LSOA...')
    hospital_ae_df = convert_postcodes_to_lsoa(hospital_ae_df)

    print('converting LSOA to IMD decile...')
    hospital_ae_df = convert_lsoa_to_imd_decile(hospital_ae_df)

    print('replacing Hospital with random number...')
    hospital_ae_df = replace_hospital_with_random_number(hospital_ae_df)

    print('putting Arrival Hour in 4-hour bins...')
    hospital_ae_df = put_time_in_4_hour_bins(hospital_ae_df)

    print('removing non-male-or-female from gender...')
    hospital_ae_df = remove_non_male_or_female(hospital_ae_df)

    print('putting ages in age brackets...')
    hospital_ae_df = add_age_brackets(hospital_ae_df)

    hospital_ae_df.to_csv(filepaths.hospital_ae_data_deidentify, index=False)

    elapsed = round(time.time() - start, 2)
    print('done in ' + str(elapsed) + ' seconds.')


def remove_health_service_numbers(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """Drops the Health Service ID numbers column from the dataset

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """
    hospital_ae_df = hospital_ae_df.drop(columns='Health Service ID')
    return hospital_ae_df


def convert_postcodes_to_lsoa(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """Adds corresponding Lower layer super output area for each row
    depending on their postcode. Uses London postcodes dataset from
    https://www.doogal.co.uk/PostcodeDownloads.php

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """
    postcodes_df = pd.read_csv(filepaths.postcodes_london)
    hospital_ae_df = pd.merge(
        hospital_ae_df,
        postcodes_df[['Postcode', 'Lower layer super output area']],
        on='Postcode'
    )
    hospital_ae_df = hospital_ae_df.drop(columns='Postcode')
    return hospital_ae_df


def convert_lsoa_to_imd_decile(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """Maps each row's Lower layer super output area to which
    Index of Multiple Deprivation decile it's in. It calculates the decile
    rates based on the IMDs over all of London.
    Uses "London postcodes.csv" dataset from
    https://www.doogal.co.uk/PostcodeDownloads.php

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    postcodes_df = pd.read_csv(filepaths.postcodes_london)

    hospital_ae_df = pd.merge(
        hospital_ae_df,
        postcodes_df[
            ['Lower layer super output area',
             'Index of Multiple Deprivation']
        ].drop_duplicates(),
        on='Lower layer super output area'
    )
    _, bins = pd.qcut(
        postcodes_df['Index of Multiple Deprivation'], 10,
        retbins=True, labels=False
    )
    hospital_ae_df['Index of Multiple Deprivation Decile'] = pd.cut(
        hospital_ae_df['Index of Multiple Deprivation'], bins=bins,
        labels=False, include_lowest=True) + 1

    hospital_ae_df = hospital_ae_df.drop(columns='Index of Multiple Deprivation')
    hospital_ae_df = hospital_ae_df.drop(columns='Lower layer super output area')

    return hospital_ae_df


def replace_hospital_with_random_number(
        hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Gives each hospital a random six-digit ID, adds a new Hospital ID
    column with these IDs, and drops the hospital name column.

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    hospitals = hospital_ae_df['Hospital'].unique().tolist()
    random.shuffle(hospitals)
    hospitals_map = {
        hospital: ''.join(random.choices(string.digits, k=6))
        for hospital in hospitals
    }
    hospital_ae_df['Hospital ID'] = hospital_ae_df['Hospital'].map(hospitals_map)
    hospital_ae_df = hospital_ae_df.drop(columns='Hospital')

    return hospital_ae_df


def put_time_in_4_hour_bins(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Splits the arrival time into an arrival date and a 4-hour arrival
    hour range, then drops the exact arrival time.

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    arrival_times = pd.to_datetime(hospital_ae_df['Arrival Time'])
    hospital_ae_df['Arrival Date'] = arrival_times.dt.strftime('%Y-%m-%d')
    hospital_ae_df['Arrival Hour'] = arrival_times.dt.hour

    hospital_ae_df['Arrival hour range'] = pd.cut(
        hospital_ae_df['Arrival Hour'],
        bins=[0, 4, 8, 12, 16, 20, 24],
        labels=['00-03', '04-07', '08-11', '12-15', '16-19', '20-23'],
        include_lowest=True
    )
    hospital_ae_df = hospital_ae_df.drop(columns='Arrival Time')
    hospital_ae_df = hospital_ae_df.drop(columns='Arrival Hour')

    return hospital_ae_df


def remove_non_male_or_female(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Removes any record which has a non-male-or-female entry for gender.

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    hospital_ae_df = hospital_ae_df[hospital_ae_df['Gender'].isin(['Male', 'Female'])]
    return hospital_ae_df


def add_age_brackets(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Put the integer ages in to age brackets

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    hospital_ae_df['Age bracket'] = pd.cut(
        hospital_ae_df['Age'],
        bins=[0, 18, 25, 45, 65, 85, 150],
        labels=['0-17', '18-24', '25-44', '45-64', '65-84', '85-'],
        include_lowest=True
    )
    hospital_ae_df = hospital_ae_df.drop(columns='Age')
    return hospital_ae_df


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
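The decile logic in convert_lsoa_to_imd_decile is just qcut-then-cut: decile edges are computed once over the London-wide IMD values, and each record's IMD is then placed against those edges. A toy version (illustrative data):

```python
import pandas as pd

imd_scores = pd.Series(range(1, 101))   # stand-in for London-wide IMD values
_, bins = pd.qcut(imd_scores, 10, retbins=True, labels=False)

records = pd.Series([3, 42, 97])
print(pd.cut(records, bins=bins, labels=False, include_lowest=True) + 1)
# 0     1
# 1     5
# 2    10
```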
161 | 
162 |     Keyword arguments:
163 |     hospital_ae_df -- Hospitals A&E records dataframe
164 |     """
165 | 
166 |     hospital_ae_df = hospital_ae_df[hospital_ae_df['Gender'].isin(['Male', 'Female'])]
167 |     return hospital_ae_df
168 | 
169 | 
170 | def add_age_brackets(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
171 |     """
172 |     Puts the integer ages into age brackets
173 | 
174 |     Keyword arguments:
175 |     hospital_ae_df -- Hospitals A&E records dataframe
176 |     """
177 | 
178 |     hospital_ae_df['Age bracket'] = pd.cut(
179 |         hospital_ae_df['Age'],
180 |         bins=[0, 18, 25, 45, 65, 85, 150],
181 |         labels=['0-17', '18-24', '25-44', '45-64', '65-84', '85-'],
182 |         right=False  # left-closed bins, so age 18 lands in '18-24', not '0-17'
183 |     )
184 |     hospital_ae_df = hospital_ae_df.drop('Age', axis=1)
185 |     return hospital_ae_df
186 | 
187 | 
188 | if __name__ == "__main__":
189 |     main()
190 | 
--------------------------------------------------------------------------------
/tutorial/generate.py:
--------------------------------------------------------------------------------
1 | """
2 | Script that generates hospital A&E data to use in the synthetic data tutorial.
3 | 
4 | Columns of data inspired by NHS+ODI Leeds blog post:
5 | https://odileeds.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data
6 | 
7 | """
8 | 
9 | import random
10 | from datetime import datetime
11 | import string
12 | import time
13 | 
14 | import pandas as pd
15 | import numpy as np
16 | 
17 | import filepaths
18 | 
19 | # TODO: give hospitals different average waiting times
20 | 
21 | num_of_rows = 10000
22 | 
23 | 
24 | def main():
25 |     print('generating data...')
26 |     start = time.time()
27 | 
28 |     hospital_ae_dataset = {}
29 | 
30 |     print('generating Health Service ID numbers...')
31 |     hospital_ae_dataset['Health Service ID'] = generate_health_service_id_numbers()
32 | 
33 |     print('generating patient ages and times in A&E...')
34 |     (hospital_ae_dataset['Age'], hospital_ae_dataset['Time in A&E (mins)']) = generate_ages_times_in_ae()
35 | 
36 |     print('generating hospital instances...')
37 |     hospital_ae_dataset['Hospital'] = generate_hospitals()
38 | 
39 |     print('generating arrival times...')
40 |     hospital_ae_dataset['Arrival Time'] = generate_arrival_times()
41 | 
42 |     print('generating A&E treatments...')
43 |     hospital_ae_dataset['Treatment'] = generate_treatments()
44 | 
45 |     print('generating patient gender instances...')
46 |     hospital_ae_dataset['Gender'] = generate_genders()
47 | 
48 |     print('generating patient postcodes...')
49 |     hospital_ae_dataset['Postcode'] = generate_postcodes()
50 | 
51 |     write_out_dataset(hospital_ae_dataset, filepaths.hospital_ae_data)
52 |     print('dataset written out to: ', filepaths.hospital_ae_data)
53 | 
54 |     elapsed = round(time.time() - start, 2)
55 |     print('done in ' + str(elapsed) + ' seconds.')
56 | 
57 | 
58 | def generate_ages_times_in_ae():
59 |     """
60 |     Generates correlated ages and waiting times and returns them as lists
61 | 
62 |     Obviously, normally distributed ages are not very true to real life, but that is fine for our mock data.
63 | 
64 |     Correlated random data generation code based on:
65 |     https://realpython.com/python-random/
66 |     """
67 |     # Start with a correlation matrix and standard deviations.
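    # corr2cov() below turns these into the covariance matrix needed by
    # np.random.multivariate_normal:  cov = diag(stdev) @ corr @ diag(stdev)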
68 |     # 0.95 is the correlation between ages and waiting times, and the correlation of a variable with itself is 1
69 |     correlations = np.array([[1, 0.95], [0.95, 1]])
70 | 
71 |     # Standard deviations and means of ages and waiting times, respectively
72 |     stdev = np.array([20, 20])
73 |     mean = np.array([41, 60])
74 |     cov = corr2cov(correlations, stdev)
75 | 
76 |     data = np.random.multivariate_normal(mean=mean, cov=cov, size=num_of_rows)
77 |     data = np.array(data, dtype=int)
78 | 
79 |     # negative ages or waiting times wouldn't make sense, so clamp
80 |     # any value below 1 to 0 (ages) and 1 (waiting times) respectively
81 |     data[np.nonzero(data[:, 0] < 1)[0], 0] = 0
82 |     data[np.nonzero(data[:, 1] < 1)[0], 1] = 1
83 | 
84 |     ages = data[:, 0].tolist()
85 |     times_in_ae = data[:, 1].tolist()
86 | 
87 |     return (ages, times_in_ae)
88 | 
89 | 
90 | def corr2cov(correlations: np.ndarray, stdev: np.ndarray) -> np.ndarray:
91 |     """Covariance matrix from correlations & standard deviations"""
92 |     diagonal_stdev = np.diag(stdev)
93 |     covariance = diagonal_stdev @ correlations @ diagonal_stdev
94 |     return covariance
95 | 
96 | 
97 | def generate_admission_ids() -> list:
98 |     """ Generate a random 10-digit ID for every admission record (uniqueness is likely but not guaranteed) """
99 | 
100 |     uids = []
101 |     for _ in range(num_of_rows):
102 |         x = ''.join(random.choice(string.digits) for _ in range(10))
103 |         uids.append(x)
104 |     return uids
105 | 
106 | def generate_health_service_id_numbers() -> list:
107 |     """ Generate dummy Health Service ID numbers similar to the NHS 10-digit format
108 |     See: https://www.nhs.uk/using-the-nhs/about-the-nhs/what-is-an-nhs-number/
109 |     """
110 |     health_service_id_numbers = []
111 |     for _ in range(num_of_rows):
112 |         health_service_id = ''.join(random.choice(string.digits) for _ in range(3)) + '-'
113 |         health_service_id += ''.join(random.choice(string.digits) for _ in range(3)) + '-'
114 |         health_service_id += ''.join(random.choice(string.digits) for _ in range(4))
115 |         health_service_id_numbers.append(health_service_id)
116 |     return health_service_id_numbers
117 | 
118 | 
119 | def generate_postcodes() -> list:
120 |     """ Reads a .csv containing info on every London postcode. Reads the
121 |     postcodes in use and returns a sample of them.
122 | 
123 |     # List of London postcodes from https://www.doogal.co.uk/PostcodeDownloads.php
124 |     """
125 |     postcodes_df = pd.read_csv(filepaths.postcodes_london)
126 |     postcodes_in_use = list(postcodes_df[postcodes_df['In Use?'] == "Yes"]['Postcode'])
127 |     postcodes = random.choices(postcodes_in_use, k=num_of_rows)
128 |     return postcodes
129 | 
130 | 
131 | def generate_hospitals() -> list:
132 |     """ Reads the data/hospitals_london.txt file and samples
133 |     hospital names from it to add to the dataset.
134 | 
135 |     List of London hospitals loosely based on
136 |     https://en.wikipedia.org/wiki/Category:NHS_hospitals_in_London
137 |     """
138 |     with open(filepaths.hospitals_london, 'r') as file_in:
139 |         hospitals = file_in.readlines()
140 |     hospitals = [name.strip() for name in hospitals]
141 | 
142 |     weights = random.choices(range(1, 100), k=len(hospitals))
143 |     hospitals = random.choices(hospitals, k=num_of_rows, weights=weights)
144 | 
145 |     return hospitals
146 | 
147 | 
148 | def generate_arrival_times() -> list:
149 |     """ Generate and return arrival times.
150 |     Hardcodes times to the first week of April 2019
151 |     """
152 |     arrival_times = []
153 | 
154 |     # first 7 days in April 2019
155 |     days_dates = [1, 2, 3, 4, 5, 6, 7]
156 |     # have more people come in at the weekend (6 and 7 April) - higher weights
157 |     day_weights = [0.5, 0.6, 0.7, 0.8, 0.9, 1, 1]
158 |     days = random.choices(days_dates, day_weights, k=num_of_rows)
159 |     # this is just so each day has a different peak time
160 |     days_time_modes = {day: random.random() for day in days_dates}
161 | 
162 |     for day in days:
163 |         start = datetime(2019, 4, day, 0, 0, 0)
164 |         end = datetime(2019, 4, day, 23, 59, 59)
165 | 
166 |         random_num = random.triangular(0, 1, days_time_modes[day])
167 |         random_datetime = start + (end - start) * random_num
168 |         arrival_times.append(random_datetime.strftime('%Y-%m-%d %H:%M:%S'))
169 | 
170 |     return arrival_times
171 | 
172 | 
173 | def generate_genders() -> list:
174 |     """ Generate and return a list of genders for every row.
175 | 
176 |     # National codes for gender in NHS data
177 |     # https://www.datadictionary.nhs.uk/data_dictionary/attributes/p/person/person_gender_code_de.asp?shownav=1
178 |     """
179 |     gender_codes_df = pd.read_csv(filepaths.nhs_ae_gender_codes)
180 |     genders = gender_codes_df['Gender'].tolist()
181 |     # these weights are just dummy values. please don't take them as accurate.
182 |     weights = [0.005, 0.495, 0.495, 0.005]
183 |     gender_codes = random.choices(genders, k=num_of_rows, weights=weights)
184 |     return gender_codes
185 | 
186 | 
187 | def generate_treatments() -> list:
188 |     """ Generate and return a sample of the treatments patients received.
189 | 
190 |     Reads the data/nhs_ae_treatment_codes.csv file
191 | 
192 |     NHS treatment codes:
193 |     https://www.datadictionary.nhs.uk/web_site_content/supporting_information/clinical_coding/accident_and_emergency_treatment_tables.asp?shownav=1
194 |     """
195 | 
196 |     treatment_codes_df = pd.read_csv(filepaths.nhs_ae_treatment_codes)
197 |     treatments = treatment_codes_df['Treatment'].tolist()
198 | 
199 |     # likelihood of each of the treatments - make some more common than others
200 |     weights = random.choices(range(1, 100), k=len(treatments))
201 |     treatment_codes = random.choices(
202 |         treatments, k=num_of_rows, weights=weights)
203 |     return treatment_codes
204 | 
205 | 
206 | def write_out_dataset(dataset: dict, filepath: str):
207 |     """Writes the dataset to a .csv file
208 | 
209 |     Keyword arguments:
210 |     dataset -- the dataset to be written to disk
211 |     filepath -- path to write the file out to
212 |     """
213 | 
214 |     df = pd.DataFrame.from_dict(dataset)
215 |     df.to_csv(filepath, index=False)
216 | 
217 | 
218 | if __name__ == "__main__":
219 |     main()
220 | 
--------------------------------------------------------------------------------
/tutorial/synthesise.py:
--------------------------------------------------------------------------------
1 | '''
2 | This generates synthetic data from the hospital_ae_data_deidentify.csv
3 | file. It generates three types of synthetic data and saves them in
4 | different files.
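The three types correspond to DataSynthesizer's modes: 'random',
'independent' and 'correlated' (iterated over in main() below).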
5 | '''
6 | 
7 | import random
8 | import os
9 | import time
10 | 
11 | import pandas as pd
12 | import numpy as np
13 | 
14 | import filepaths
15 | from DataDescriber import DataDescriber
16 | from DataGenerator import DataGenerator
17 | from ModelInspector import ModelInspector
18 | from lib.utils import read_json_file
19 | 
20 | 
21 | attribute_to_datatype = {
22 |     'Time in A&E (mins)': 'Integer',
23 |     'Treatment': 'String',
24 |     'Gender': 'String',
25 |     'Index of Multiple Deprivation Decile': 'Integer',
26 |     'Hospital ID': 'String',
27 |     'Arrival Date': 'String',
28 |     'Arrival hour range': 'String',
29 |     'Age bracket': 'String'
30 | }
31 | 
32 | attribute_is_categorical = {
33 |     'Time in A&E (mins)': False,
34 |     'Treatment': True,
35 |     'Gender': True,
36 |     'Index of Multiple Deprivation Decile': True,
37 |     'Hospital ID': True,
38 |     'Arrival Date': True,
39 |     'Arrival hour range': True,
40 |     'Age bracket': True
41 | }
42 | 
43 | mode_filepaths = {
44 |     'random': {
45 |         'description': filepaths.hospital_ae_description_random,
46 |         'data': filepaths.hospital_ae_data_synthetic_random
47 |     },
48 |     'independent': {
49 |         'description': filepaths.hospital_ae_description_independent,
50 |         'data': filepaths.hospital_ae_data_synthetic_independent
51 |     },
52 |     'correlated': {
53 |         'description': filepaths.hospital_ae_description_correlated,
54 |         'data': filepaths.hospital_ae_data_synthetic_correlated
55 |     }
56 | }
57 | 
58 | 
59 | def main():
60 |     start = time.time()
61 | 
62 |     # "_df" is the usual way people refer to a Pandas DataFrame object
63 |     hospital_ae_df = pd.read_csv(filepaths.hospital_ae_data_deidentify)
64 | 
65 |     # let's generate the same number of rows as the original data (though we don't have to)
66 |     num_rows = len(hospital_ae_df)
67 | 
68 |     # iterate through the 3 modes to generate synthetic data
69 |     for mode in ['random', 'independent', 'correlated']:
70 | 
71 |         print('describing synthetic data for', mode, 'mode...')
72 |         describe_synthetic_data(mode, mode_filepaths[mode]['description'])
73 | 
74 |         print('generating synthetic data for', mode, 'mode...')
75 |         generate_synthetic_data(
76 |             mode,
77 |             num_rows,
78 |             mode_filepaths[mode]['description'],
79 |             mode_filepaths[mode]['data']
80 |         )
81 | 
82 |         print('comparing histograms for', mode, 'mode...')
83 |         compare_histograms(
84 |             mode,
85 |             hospital_ae_df,
86 |             mode_filepaths[mode]['description'],
87 |             mode_filepaths[mode]['data']
88 |         )
89 | 
90 |         print('comparing pairwise mutual information for', mode, 'mode...')
91 |         compare_pairwise_mutual_information(
92 |             mode,
93 |             hospital_ae_df,
94 |             mode_filepaths[mode]['description'],
95 |             mode_filepaths[mode]['data']
96 |         )
97 | 
98 |     elapsed = round(time.time() - start, 2)
99 |     print('done in ' + str(elapsed) + ' seconds.')
100 | 
101 | 
102 | def describe_synthetic_data(mode: str, description_filepath: str):
103 |     '''
104 |     Describes the de-identified dataset and saves the description to the data/ directory.
105 | 
106 |     Keyword arguments:
107 |     mode -- what type of synthetic data ('random', 'independent'
108 |             or 'correlated')
109 |     description_filepath -- filepath to the data description
110 |     '''
111 |     describer = DataDescriber()
112 | 
113 |     if mode == 'random':
114 |         describer.describe_dataset_in_random_mode(
115 |             filepaths.hospital_ae_data_deidentify,
116 |             attribute_to_datatype=attribute_to_datatype,
117 |             attribute_to_is_categorical=attribute_is_categorical)
118 | 
119 |     elif mode == 'independent':
120 |         describer.describe_dataset_in_independent_attribute_mode(
121 |             filepaths.hospital_ae_data_deidentify,
122 |             attribute_to_datatype=attribute_to_datatype,
123 |             attribute_to_is_categorical=attribute_is_categorical)
124 | 
125 |     elif mode == 'correlated':
126 |         # Increase the epsilon value to reduce the injected noise.
127 |         # We're not using differential privacy in this tutorial,
128 |         # so we'll set epsilon=0 to turn differential privacy off
129 |         epsilon = 0
130 | 
131 |         # The maximum number of parents in the Bayesian network,
132 |         # i.e., the maximum number of incoming edges.
133 |         degree_of_bayesian_network = 1
134 | 
135 |         describer.describe_dataset_in_correlated_attribute_mode(
136 |             dataset_file=filepaths.hospital_ae_data_deidentify,
137 |             epsilon=epsilon,
138 |             k=degree_of_bayesian_network,
139 |             attribute_to_datatype=attribute_to_datatype,
140 |             attribute_to_is_categorical=attribute_is_categorical)
141 |             # attribute_to_is_candidate_key=attribute_to_is_candidate_key)
142 | 
143 |     describer.save_dataset_description_to_file(description_filepath)
144 | 
145 | 
146 | def generate_synthetic_data(
147 |         mode: str,
148 |         num_rows: int,
149 |         description_filepath: str,
150 |         synthetic_data_filepath: str
151 | ):
152 |     '''
153 |     Generates the synthetic data and saves it to the data/ directory.
154 | 
155 |     Keyword arguments:
156 |     mode -- what type of synthetic data
157 |     num_rows -- number of rows in the synthetic dataset
158 |     description_filepath -- filepath to the data description
159 |     synthetic_data_filepath -- filepath to where the synthetic data is written
160 |     '''
161 |     generator = DataGenerator()
162 | 
163 |     if mode == 'random':
164 |         generator.generate_dataset_in_random_mode(num_rows, description_filepath)
165 | 
166 |     elif mode == 'independent':
167 |         generator.generate_dataset_in_independent_mode(num_rows, description_filepath)
168 | 
169 |     elif mode == 'correlated':
170 |         generator.generate_dataset_in_correlated_attribute_mode(num_rows, description_filepath)
171 | 
172 |     generator.save_synthetic_data(synthetic_data_filepath)
173 | 
174 | 
175 | def compare_histograms(
176 |         mode: str,
177 |         hospital_ae_df: pd.DataFrame,
178 |         description_filepath: str,
179 |         synthetic_data_filepath: str
180 | ):
181 |     '''
182 |     Makes comparison plots showing the histograms for each column in the
183 |     original and synthetic data.
184 | 
185 |     Keyword arguments:
186 |     mode -- what type of synthetic data
187 |     hospital_ae_df -- DataFrame of the original dataset
188 |     description_filepath -- filepath to the data description
189 |     synthetic_data_filepath -- filepath to where the synthetic data is written
190 |     '''
191 | 
192 |     synthetic_df = pd.read_csv(synthetic_data_filepath)
193 | 
194 |     # Read the attribute description from the dataset description file.
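    # (in that JSON, each attribute maps to an object with keys such as
    # "data_type", "is_categorical", "min", "max", "distribution_bins" and
    # "distribution_probabilities" -- see data/hospital_ae_description_*.json)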
195 |     attribute_description = read_json_file(
196 |         description_filepath)['attribute_description']
197 | 
198 |     inspector = ModelInspector(
199 |         hospital_ae_df, synthetic_df, attribute_description)
200 | 
201 |     for attribute in synthetic_df.columns:
202 |         figure_filepath = os.path.join(
203 |             filepaths.plots_dir,
204 |             mode + '_' + attribute + '.png'
205 |         )
206 |         # need to replace whitespace in filepath for Markdown reference
207 |         figure_filepath = figure_filepath.replace(' ', '_')
208 |         inspector.compare_histograms(attribute, figure_filepath)
209 | 
210 | def compare_pairwise_mutual_information(
211 |         mode: str,
212 |         hospital_ae_df: pd.DataFrame,
213 |         description_filepath: str,
214 |         synthetic_data_filepath: str
215 | ):
216 |     '''
217 |     Looks at the correlation between attributes by producing a mutual information heatmap
218 | 
219 |     Keyword arguments:
220 |     mode -- what type of synthetic data
221 |     hospital_ae_df -- DataFrame of the original dataset
222 |     description_filepath -- filepath to the data description
223 |     synthetic_data_filepath -- filepath to where the synthetic data is written
224 |     '''
225 | 
226 |     synthetic_df = pd.read_csv(synthetic_data_filepath)
227 | 
228 |     attribute_description = read_json_file(
229 |         description_filepath)['attribute_description']
230 | 
231 |     inspector = ModelInspector(
232 |         hospital_ae_df, synthetic_df, attribute_description)
233 | 
234 |     figure_filepath = os.path.join(
235 |         filepaths.plots_dir,
236 |         'mutual_information_heatmap_' + mode + '.png'
237 |     )
238 | 
239 |     inspector.mutual_information_heatmap(figure_filepath)
240 | 
241 | 
242 | if __name__ == "__main__":
243 |     main()
244 | 
--------------------------------------------------------------------------------
/DataSynthesizer/lib/PrivBayes.py:
--------------------------------------------------------------------------------
1 | import random
2 | import warnings
3 | from itertools import combinations, product
4 | from math import log, ceil
5 | from multiprocessing.pool import Pool
6 | 
7 | import numpy as np
8 | import pandas as pd
9 | from scipy.optimize import fsolve
10 | 
11 | from lib.utils import mutual_information, normalize_given_distribution
12 | 
13 | """
14 | This module is based on PrivBayes in the following paper:
15 | 
16 | Zhang J, Cormode G, Procopiuc CM, Srivastava D, Xiao X.
17 | PrivBayes: Private Data Release via Bayesian Networks.
18 | """
19 | 
20 | 
21 | def sensitivity(num_tuples):
22 |     """Sensitivity function for Bayesian network construction. PrivBayes Lemma 1.
23 | 
24 |     Parameters
25 |     ----------
26 |     num_tuples : int
27 |         Number of tuples in sensitive dataset.
28 | 
29 |     Returns
30 |     -------
31 |     float
32 |         Sensitivity value.
33 |     """
34 |     a = (2 / num_tuples) * log((num_tuples + 1) / 2)
35 |     b = (1 - 1 / num_tuples) * log(1 + 2 / (num_tuples - 1))
36 |     return a + b
37 | 
38 | 
39 | def delta(num_attributes, num_tuples, epsilon):
40 |     """Computes delta, a scaling factor used when applying differential privacy.
41 | 
42 |     More info is in PrivBayes Section 4.2 "A First-Cut Solution".
43 | 
44 |     Parameters
45 |     ----------
46 |     num_attributes : int
47 |         Number of attributes in dataset.
48 |     num_tuples : int
49 |         Number of tuples in dataset.
50 |     epsilon : float
51 |         Parameter of differential privacy.
52 |     """
53 |     return 2 * (num_attributes - 1) * sensitivity(num_tuples) / epsilon
54 | 
55 | 
56 | def usefulness_minus_target(k, num_attributes, num_tuples, target_usefulness=5, epsilon=0.1):
57 |     """Usefulness function in PrivBayes.
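
    Computes usefulness(k) = num_tuples * epsilon / ((num_attributes - k) * 2**(k + 3)),
    i.e. PrivBayes Lemma 3, then subtracts the target, so that calculate_k()
    below can solve for the k at which usefulness meets the target.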
58 | 
59 |     Parameters
60 |     ----------
61 |     k : int
62 |         Maximum degree in Bayesian network construction.
63 |     num_attributes : int
64 |         Number of attributes in dataset.
65 |     num_tuples : int
66 |         Number of tuples in dataset.
67 |     target_usefulness : int or float
68 |     epsilon : float
69 |         Parameter of differential privacy.
70 |     """
71 |     if k == num_attributes:
72 |         # usefulness equals the target when the degree covers all attributes
73 |         usefulness = target_usefulness
74 |     else:
75 |         usefulness = num_tuples * epsilon / ((num_attributes - k) * (2 ** (k + 3)))  # PrivBayes Lemma 3
76 |     return usefulness - target_usefulness
77 | 
78 | 
79 | def calculate_k(num_attributes, num_tuples, target_usefulness=4, epsilon=0.1):
80 |     """Calculate the maximum degree when constructing Bayesian networks. See PrivBayes Lemma 3."""
81 |     default_k = 3
82 |     initial_usefulness = usefulness_minus_target(default_k, num_attributes, num_tuples, 0, epsilon)
83 |     if initial_usefulness > target_usefulness:
84 |         return default_k
85 |     else:
86 |         arguments = (num_attributes, num_tuples, target_usefulness, epsilon)
87 |         warnings.filterwarnings("error")
88 |         try:
89 |             ans = fsolve(usefulness_minus_target, int(num_attributes / 2), args=arguments)[0]
90 |             ans = ceil(ans)
91 |         except RuntimeWarning:
92 |             print("Warning: k is not properly computed!")
93 |             ans = default_k
94 |         if ans < 1 or ans > num_attributes:
95 |             ans = default_k
96 |         return ans
97 | 
98 | 
99 | def worker(paras):
100 |     child, V, num_parents, split, dataset = paras
101 |     parents_pair_list = []
102 |     mutual_info_list = []
103 | 
104 |     if split + num_parents - 1 < len(V):
105 |         for other_parents in combinations(V[split + 1:], num_parents - 1):
106 |             parents = list(other_parents)
107 |             parents.append(V[split])
108 |             parents_pair_list.append((child, parents))
109 |             # TODO: consider computing MI over combined integers instead of strings.
110 |             mi = mutual_information(dataset[child], dataset[parents])
111 |             mutual_info_list.append(mi)
112 | 
113 |     return parents_pair_list, mutual_info_list
114 | 
115 | 
116 | def greedy_bayes(dataset, k=2, epsilon=0):
117 |     """Construct a Bayesian Network (BN) using a greedy algorithm.
118 | 
119 |     Parameters
120 |     ----------
121 |     dataset : DataFrame
122 |         Input dataset, which only contains categorical attributes.
123 |     k : int
124 |         Maximum degree of the constructed BN. If k=0, k is automatically calculated.
125 |     epsilon : float
126 |         Parameter of differential privacy.
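
    Returns
    -------
    list
        The constructed network as (child, [parents]) pairs, in the order
        the attributes were added.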
127 |     """
128 |     dataset = dataset.astype(str, copy=False)
129 |     num_tuples, num_attributes = dataset.shape
130 |     if not k:
131 |         k = calculate_k(num_attributes, num_tuples)
132 | 
133 |     print('================ Constructing Bayesian Network (BN) ================')
134 |     root_attribute = random.choice(dataset.columns)
135 |     V = [root_attribute]
136 |     rest_attributes = set(dataset.columns)
137 |     rest_attributes.remove(root_attribute)
138 |     print(f'Adding ROOT {root_attribute}')
139 |     N = []
140 |     while rest_attributes:
141 |         parents_pair_list = []
142 |         mutual_info_list = []
143 | 
144 |         num_parents = min(len(V), k)
145 |         tasks = [(child, V, num_parents, split, dataset) for child, split in
146 |                  product(rest_attributes, range(len(V) - num_parents + 1))]
147 |         with Pool() as pool:
148 |             res_list = pool.map(worker, tasks)
149 | 
150 |         for res in res_list:
151 |             parents_pair_list += res[0]
152 |             mutual_info_list += res[1]
153 | 
154 |         if epsilon:
155 |             sampling_distribution = exponential_mechanism(dataset, mutual_info_list, epsilon)
156 |             idx = np.random.choice(list(range(len(mutual_info_list))), p=sampling_distribution)
157 |         else:
158 |             idx = mutual_info_list.index(max(mutual_info_list))
159 | 
160 |         N.append(parents_pair_list[idx])
161 |         adding_attribute = parents_pair_list[idx][0]
162 |         V.append(adding_attribute)
163 |         rest_attributes.remove(adding_attribute)
164 |         print(f'Adding attribute {adding_attribute}')
165 | 
166 |     print('========================= BN constructed =========================')
167 | 
168 |     return N
169 | 
170 | 
171 | def exponential_mechanism(dataset, mutual_info_list, epsilon=0.1):
172 |     """Applies the exponential mechanism to sample an outcome."""
173 |     num_tuples, num_attributes = dataset.shape
174 |     mi_array = np.array(mutual_info_list)
175 |     mi_array = mi_array / (2 * delta(num_attributes, num_tuples, epsilon))
176 |     mi_array = np.exp(mi_array)
177 |     mi_array = normalize_given_distribution(mi_array)
178 |     return mi_array
179 | 
180 | 
181 | def laplace_noise_parameter(k, num_attributes, num_tuples, epsilon):
182 |     """Scale of the Laplace noise injected into conditional distributions. PrivBayes Algorithm 1."""
183 |     return 4 * (num_attributes - k) / (num_tuples * epsilon)
184 | 
185 | 
186 | def get_noisy_distribution_of_attributes(attributes, encoded_dataset, epsilon=0.1):
187 |     data = encoded_dataset.copy().loc[:, attributes]
188 |     data['count'] = 1
189 |     stats = data.groupby(attributes).sum()
190 | 
191 |     iterables = [range(int(encoded_dataset[attr].max()) + 1) for attr in attributes]
192 |     full_space = pd.DataFrame(columns=attributes, data=list(product(*iterables)))
193 |     stats.reset_index(inplace=True)
194 |     stats = pd.merge(full_space, stats, how='left')
195 |     stats.fillna(0, inplace=True)
196 | 
197 |     if epsilon:
198 |         k = len(attributes) - 1
199 |         num_tuples, num_attributes = encoded_dataset.shape
200 |         noise_para = laplace_noise_parameter(k, num_attributes, num_tuples, epsilon)
201 |         laplace_noises = np.random.laplace(0, scale=noise_para, size=stats.index.size)
202 |         stats['count'] += laplace_noises
203 |         stats.loc[stats['count'] < 0, 'count'] = 0
204 | 
205 |     return stats
206 | 
207 | 
208 | def construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon=0.1):
209 |     """See more in Algorithm 1 in PrivBayes.
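
    Returns a dict mapping each attribute to its noisy distribution: a plain
    list for the root attribute, and {str(parents_instance): distribution}
    for the other attributes.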
210 | 211 | """ 212 | 213 | k = len(bayesian_network[-1][1]) 214 | conditional_distributions = {} 215 | 216 | # first k+1 attributes 217 | root = bayesian_network[0][1][0] 218 | kplus1_attributes = [root] 219 | for child, _ in bayesian_network[:k]: 220 | kplus1_attributes.append(child) 221 | 222 | noisy_dist_of_kplus1_attributes = get_noisy_distribution_of_attributes(kplus1_attributes, encoded_dataset, epsilon) 223 | 224 | # generate noisy distribution of root attribute. 225 | root_stats = noisy_dist_of_kplus1_attributes.loc[:, [root, 'count']].groupby(root).sum()['count'] 226 | conditional_distributions[root] = normalize_given_distribution(root_stats).tolist() 227 | 228 | for idx, (child, parents) in enumerate(bayesian_network): 229 | conditional_distributions[child] = {} 230 | 231 | if idx < k: 232 | stats = noisy_dist_of_kplus1_attributes.copy().loc[:, parents + [child, 'count']] 233 | else: 234 | stats = get_noisy_distribution_of_attributes(parents + [child], encoded_dataset, epsilon) 235 | 236 | stats = pd.DataFrame(stats.loc[:, parents + [child, 'count']].groupby(parents + [child]).sum()) 237 | 238 | if len(parents) == 1: 239 | for parent_instance in stats.index.levels[0]: 240 | dist = normalize_given_distribution(stats.loc[parent_instance]['count']).tolist() 241 | conditional_distributions[child][str([parent_instance])] = dist 242 | else: 243 | for parents_instance in product(*stats.index.levels[:-1]): 244 | dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist() 245 | conditional_distributions[child][str(list(parents_instance))] = dist 246 | 247 | return conditional_distributions 248 | -------------------------------------------------------------------------------- /data/hospital_ae_description_random.json: -------------------------------------------------------------------------------- 1 | { 2 | "meta": { 3 | "num_tuples": 9897, 4 | "num_attributes": 8, 5 | "num_attributes_in_BN": 8, 6 | "all_attributes": [ 7 | "Time in A&E (mins)", 8 | "Treatment", 9 | "Gender", 10 | "Index of Multiple Deprivation Decile", 11 | "Hospital ID", 12 | "Arrival Date", 13 | "Arrival hour range", 14 | "Age bracket" 15 | ], 16 | "candidate_keys": [], 17 | "non_categorical_string_attributes": [], 18 | "attributes_in_BN": [ 19 | "Gender", 20 | "Hospital ID", 21 | "Treatment", 22 | "Arrival Date", 23 | "Arrival hour range", 24 | "Age bracket", 25 | "Time in A&E (mins)", 26 | "Index of Multiple Deprivation Decile" 27 | ] 28 | }, 29 | "attribute_description": { 30 | "Time in A&E (mins)": { 31 | "name": "Time in A&E (mins)", 32 | "data_type": "Integer", 33 | "is_categorical": false, 34 | "is_candidate_key": false, 35 | "min": 1, 36 | "max": 132, 37 | "missing_rate": 0.0, 38 | "distribution_bins": [ 39 | 1.0, 40 | 132.0 41 | ], 42 | "distribution_probabilities": [ 43 | 0.5, 44 | 0.5 45 | ] 46 | }, 47 | "Treatment": { 48 | "name": "Treatment", 49 | "data_type": "String", 50 | "is_categorical": true, 51 | "is_candidate_key": false, 52 | "min": 3, 53 | "max": 44, 54 | "missing_rate": 0.0, 55 | "distribution_bins": [ 56 | "Dressing", 57 | "Sutures", 58 | "Joint aspiration", 59 | "Tetanus", 60 | "Urinary catheter/suprapubic", 61 | "Lumbar puncture", 62 | "Prescription/medicines prepared to take away", 63 | "Defibrillation/pacing", 64 | "Sling/collar cuff/broad arm sling", 65 | "Other (consider alternatives)", 66 | "Nebuliser/spacer", 67 | "Eye", 68 | "Infusion fluids", 69 | "Bandage/support", 70 | "Dressing/wound review", 71 | "Arterial line", 72 | "Chest drain", 73 | "Minor surgery", 74 | 
"Wound cleaning", 75 | "Blood product transfusion", 76 | "Plaster of Paris", 77 | "Oral airway", 78 | "Wound closure (excluding sutures)", 79 | "Resuscitation/cardiopulmonary resuscitation", 80 | "Incision & drainage", 81 | "Occupational Therapy", 82 | "Dental treatment", 83 | "Removal foreign body", 84 | "Central line", 85 | "Burns review", 86 | "Anaesthesia", 87 | "Guidance/advice only", 88 | "None (consider guidance/advice option)", 89 | "Fracture review", 90 | "Nasal airway", 91 | "Social work intervention", 92 | "Physiotherapy", 93 | "Recording vital signs", 94 | "Splint" 95 | ], 96 | "distribution_probabilities": [ 97 | 0.02564102564102564, 98 | 0.02564102564102564, 99 | 0.02564102564102564, 100 | 0.02564102564102564, 101 | 0.02564102564102564, 102 | 0.02564102564102564, 103 | 0.02564102564102564, 104 | 0.02564102564102564, 105 | 0.02564102564102564, 106 | 0.02564102564102564, 107 | 0.02564102564102564, 108 | 0.02564102564102564, 109 | 0.02564102564102564, 110 | 0.02564102564102564, 111 | 0.02564102564102564, 112 | 0.02564102564102564, 113 | 0.02564102564102564, 114 | 0.02564102564102564, 115 | 0.02564102564102564, 116 | 0.02564102564102564, 117 | 0.02564102564102564, 118 | 0.02564102564102564, 119 | 0.02564102564102564, 120 | 0.02564102564102564, 121 | 0.02564102564102564, 122 | 0.02564102564102564, 123 | 0.02564102564102564, 124 | 0.02564102564102564, 125 | 0.02564102564102564, 126 | 0.02564102564102564, 127 | 0.02564102564102564, 128 | 0.02564102564102564, 129 | 0.02564102564102564, 130 | 0.02564102564102564, 131 | 0.02564102564102564, 132 | 0.02564102564102564, 133 | 0.02564102564102564, 134 | 0.02564102564102564, 135 | 0.02564102564102564 136 | ] 137 | }, 138 | "Gender": { 139 | "name": "Gender", 140 | "data_type": "String", 141 | "is_categorical": true, 142 | "is_candidate_key": false, 143 | "min": 4, 144 | "max": 6, 145 | "missing_rate": 0.0, 146 | "distribution_bins": [ 147 | "Male", 148 | "Female" 149 | ], 150 | "distribution_probabilities": [ 151 | 0.5, 152 | 0.5 153 | ] 154 | }, 155 | "Index of Multiple Deprivation Decile": { 156 | "name": "Index of Multiple Deprivation Decile", 157 | "data_type": "Integer", 158 | "is_categorical": true, 159 | "is_candidate_key": false, 160 | "min": 1, 161 | "max": 10, 162 | "missing_rate": 0.0, 163 | "distribution_bins": [ 164 | 8, 165 | 6, 166 | 2, 167 | 3, 168 | 5, 169 | 4, 170 | 7, 171 | 9, 172 | 1, 173 | 10 174 | ], 175 | "distribution_probabilities": [ 176 | 0, 177 | 0, 178 | 0, 179 | 0, 180 | 0, 181 | 0, 182 | 0, 183 | 0, 184 | 0, 185 | 0 186 | ] 187 | }, 188 | "Hospital ID": { 189 | "name": "Hospital ID", 190 | "data_type": "String", 191 | "is_categorical": true, 192 | "is_candidate_key": false, 193 | "min": 5, 194 | "max": 6, 195 | "missing_rate": 0.0, 196 | "distribution_bins": [ 197 | 714199, 198 | 339622, 199 | 514115, 200 | 147009, 201 | 660843, 202 | 434748, 203 | 881145, 204 | 192852, 205 | 954491, 206 | 864150, 207 | 849155, 208 | 87754, 209 | 61685, 210 | 209044, 211 | 719566, 212 | 872881, 213 | 378500, 214 | 304807, 215 | 769379, 216 | 85318, 217 | 624058, 218 | 883825, 219 | 851090, 220 | 450621, 221 | 104821, 222 | 860413 223 | ], 224 | "distribution_probabilities": [ 225 | 0, 226 | 0, 227 | 0, 228 | 0, 229 | 0, 230 | 0, 231 | 0, 232 | 0, 233 | 0, 234 | 0, 235 | 0, 236 | 0, 237 | 0, 238 | 0, 239 | 0, 240 | 0, 241 | 0, 242 | 0, 243 | 0, 244 | 0, 245 | 0, 246 | 0, 247 | 0, 248 | 0, 249 | 0, 250 | 0 251 | ] 252 | }, 253 | "Arrival Date": { 254 | "name": "Arrival Date", 255 | "data_type": "String", 256 | "is_categorical": 
true, 257 | "is_candidate_key": false, 258 | "min": 10, 259 | "max": 10, 260 | "missing_rate": 0.0, 261 | "distribution_bins": [ 262 | "2019-04-07", 263 | "2019-04-03", 264 | "2019-04-06", 265 | "2019-04-01", 266 | "2019-04-04", 267 | "2019-04-05", 268 | "2019-04-02" 269 | ], 270 | "distribution_probabilities": [ 271 | 0.14285714285714285, 272 | 0.14285714285714285, 273 | 0.14285714285714285, 274 | 0.14285714285714285, 275 | 0.14285714285714285, 276 | 0.14285714285714285, 277 | 0.14285714285714285 278 | ] 279 | }, 280 | "Arrival hour range": { 281 | "name": "Arrival hour range", 282 | "data_type": "String", 283 | "is_categorical": true, 284 | "is_candidate_key": false, 285 | "min": 5, 286 | "max": 5, 287 | "missing_rate": 0.0, 288 | "distribution_bins": [ 289 | "00-03", 290 | "08-11", 291 | "16-19", 292 | "12-15", 293 | "04-07", 294 | "20-23" 295 | ], 296 | "distribution_probabilities": [ 297 | 0.16666666666666666, 298 | 0.16666666666666666, 299 | 0.16666666666666666, 300 | 0.16666666666666666, 301 | 0.16666666666666666, 302 | 0.16666666666666666 303 | ] 304 | }, 305 | "Age bracket": { 306 | "name": "Age bracket", 307 | "data_type": "String", 308 | "is_categorical": true, 309 | "is_candidate_key": false, 310 | "min": 3, 311 | "max": 5, 312 | "missing_rate": 0.0, 313 | "distribution_bins": [ 314 | "25-44", 315 | "65-84", 316 | "45-64", 317 | "0-17", 318 | "18-24", 319 | "85-" 320 | ], 321 | "distribution_probabilities": [ 322 | 0.16666666666666666, 323 | 0.16666666666666666, 324 | 0.16666666666666666, 325 | 0.16666666666666666, 326 | 0.16666666666666666, 327 | 0.16666666666666666 328 | ] 329 | } 330 | } 331 | } -------------------------------------------------------------------------------- /data/hospital_ae_description_independent.json: -------------------------------------------------------------------------------- 1 | { 2 | "meta": { 3 | "num_tuples": 9897, 4 | "num_attributes": 8, 5 | "num_attributes_in_BN": 8, 6 | "all_attributes": [ 7 | "Time in A&E (mins)", 8 | "Treatment", 9 | "Gender", 10 | "Index of Multiple Deprivation Decile", 11 | "Hospital ID", 12 | "Arrival Date", 13 | "Arrival hour range", 14 | "Age bracket" 15 | ], 16 | "candidate_keys": [], 17 | "non_categorical_string_attributes": [], 18 | "attributes_in_BN": [ 19 | "Gender", 20 | "Hospital ID", 21 | "Treatment", 22 | "Arrival Date", 23 | "Arrival hour range", 24 | "Age bracket", 25 | "Time in A&E (mins)", 26 | "Index of Multiple Deprivation Decile" 27 | ] 28 | }, 29 | "attribute_description": { 30 | "Time in A&E (mins)": { 31 | "name": "Time in A&E (mins)", 32 | "data_type": "Integer", 33 | "is_categorical": false, 34 | "is_candidate_key": false, 35 | "min": 1, 36 | "max": 132, 37 | "missing_rate": 0.0, 38 | "distribution_bins": [ 39 | 1.0, 40 | 7.55, 41 | 14.1, 42 | 20.65, 43 | 27.2, 44 | 33.75, 45 | 40.3, 46 | 46.85, 47 | 53.4, 48 | 59.949999999999996, 49 | 66.5, 50 | 73.05, 51 | 79.6, 52 | 86.14999999999999, 53 | 92.7, 54 | 99.25, 55 | 105.8, 56 | 112.35, 57 | 118.89999999999999, 58 | 125.45 59 | ], 60 | "distribution_probabilities": [ 61 | 0.0048137849053241435, 62 | 0.012751209255727067, 63 | 0.012787960085510593, 64 | 0.029766054617680452, 65 | 0.03908536424246603, 66 | 0.0711838002502208, 67 | 0.08112732494047205, 68 | 0.1289694426246986, 69 | 0.1270821082831697, 70 | 0.13274010311130605, 71 | 0.11819792116763203, 72 | 0.08385242821680011, 73 | 0.06362083177515561, 74 | 0.05065187199646423, 75 | 0.014176002474539257, 76 | 0.0006923597724718898, 77 | 0.0, 78 | 0.01069262871326837, 79 | 0.007190356203672139, 80 | 
0.010618447363421076 81 | ] 82 | }, 83 | "Treatment": { 84 | "name": "Treatment", 85 | "data_type": "String", 86 | "is_categorical": true, 87 | "is_candidate_key": false, 88 | "min": 3, 89 | "max": 44, 90 | "missing_rate": 0.0, 91 | "distribution_bins": [ 92 | "Anaesthesia", 93 | "Arterial line", 94 | "Bandage/support", 95 | "Blood product transfusion", 96 | "Burns review", 97 | "Central line", 98 | "Chest drain", 99 | "Defibrillation/pacing", 100 | "Dental treatment", 101 | "Dressing", 102 | "Dressing/wound review", 103 | "Eye", 104 | "Fracture review", 105 | "Guidance/advice only", 106 | "Incision & drainage", 107 | "Infusion fluids", 108 | "Joint aspiration", 109 | "Lumbar puncture", 110 | "Minor surgery", 111 | "Nasal airway", 112 | "Nebuliser/spacer", 113 | "None (consider guidance/advice option)", 114 | "Occupational Therapy", 115 | "Oral airway", 116 | "Other (consider alternatives)", 117 | "Physiotherapy", 118 | "Plaster of Paris", 119 | "Prescription/medicines prepared to take away", 120 | "Recording vital signs", 121 | "Removal foreign body", 122 | "Resuscitation/cardiopulmonary resuscitation", 123 | "Sling/collar cuff/broad arm sling", 124 | "Social work intervention", 125 | "Splint", 126 | "Sutures", 127 | "Tetanus", 128 | "Urinary catheter/suprapubic", 129 | "Wound cleaning", 130 | "Wound closure (excluding sutures)" 131 | ], 132 | "distribution_probabilities": [ 133 | 0.03743653923302075, 134 | 0.03751715461964534, 135 | 0.04351428297613841, 136 | 0.03760806003801341, 137 | 0.010529054594822455, 138 | 0.026381269227099918, 139 | 0.007549988860363595, 140 | 0.05986187837291627, 141 | 0.00850742241230225, 142 | 0.03480733915916512, 143 | 0.027468928722761122, 144 | 0.026717830161984556, 145 | 0.03416644462647804, 146 | 0.014965422991640035, 147 | 0.007795447487468476, 148 | 0.026400977355992464, 149 | 0.041373870011095104, 150 | 0.045900876813776054, 151 | 0.05198354973969091, 152 | 0.027879729350620344, 153 | 0.0, 154 | 0.0387508493435389, 155 | 0.044485037296703195, 156 | 0.023103633518272867, 157 | 0.014430904850183484, 158 | 0.018850348982361948, 159 | 0.029749335129931886, 160 | 0.004684584676552847, 161 | 0.0005176287878392601, 162 | 0.010716969801534658, 163 | 0.01731195359361903, 164 | 0.023087506024678742, 165 | 0.03700822721307918, 166 | 0.0015958069798097654, 167 | 0.02627143667426696, 168 | 0.01493749383417674, 169 | 0.031912250792335325, 170 | 0.030438293292970365, 171 | 0.023781672453150347 172 | ] 173 | }, 174 | "Gender": { 175 | "name": "Gender", 176 | "data_type": "String", 177 | "is_categorical": true, 178 | "is_candidate_key": false, 179 | "min": 4, 180 | "max": 6, 181 | "missing_rate": 0.0, 182 | "distribution_bins": [ 183 | "Female", 184 | "Male" 185 | ], 186 | "distribution_probabilities": [ 187 | 0.5056121783414436, 188 | 0.4943878216585565 189 | ] 190 | }, 191 | "Index of Multiple Deprivation Decile": { 192 | "name": "Index of Multiple Deprivation Decile", 193 | "data_type": "Integer", 194 | "is_categorical": true, 195 | "is_candidate_key": false, 196 | "min": 1, 197 | "max": 10, 198 | "missing_rate": 0.0, 199 | "distribution_bins": [ 200 | 1, 201 | 2, 202 | 3, 203 | 4, 204 | 5, 205 | 6, 206 | 7, 207 | 8, 208 | 9, 209 | 10 210 | ], 211 | "distribution_probabilities": [ 212 | 0.0818966101509957, 213 | 0.10914128206359103, 214 | 0.08158665621236741, 215 | 0.09902888772825894, 216 | 0.10971942432967483, 217 | 0.1130540710198635, 218 | 0.09027435917728921, 219 | 0.11448459660960227, 220 | 0.0979777744972692, 221 | 0.10283633821108792 222 | ] 223 | }, 224 | 
"Hospital ID": { 225 | "name": "Hospital ID", 226 | "data_type": "String", 227 | "is_categorical": true, 228 | "is_candidate_key": false, 229 | "min": 5, 230 | "max": 6, 231 | "missing_rate": 0.0, 232 | "distribution_bins": [ 233 | 61685, 234 | 85318, 235 | 87754, 236 | 104821, 237 | 147009, 238 | 192852, 239 | 209044, 240 | 304807, 241 | 339622, 242 | 378500, 243 | 434748, 244 | 450621, 245 | 514115, 246 | 624058, 247 | 660843, 248 | 714199, 249 | 719566, 250 | 769379, 251 | 849155, 252 | 851090, 253 | 860413, 254 | 864150, 255 | 872881, 256 | 881145, 257 | 883825, 258 | 954491 259 | ], 260 | "distribution_probabilities": [ 261 | 0.05352275494736269, 262 | 0.0704080187421479, 263 | 0.07997320566661555, 264 | 0.007398206322024037, 265 | 0.053574210896845144, 266 | 0.03120563200976509, 267 | 0.02944400397157849, 268 | 0.034277388726412124, 269 | 0.022913627909725703, 270 | 0.03360203536704146, 271 | 0.05642809286467514, 272 | 0.0, 273 | 0.013587186830387246, 274 | 0.024354770752385856, 275 | 0.07317344659341625, 276 | 0.05898285136165305, 277 | 0.010627480302828646, 278 | 0.008050565576995733, 279 | 0.06730111431230959, 280 | 0.018626035275006815, 281 | 0.006169743743641671, 282 | 0.040299250943709626, 283 | 0.024344801544253756, 284 | 0.07884146818539779, 285 | 0.011315984326774164, 286 | 0.09157812282704632 287 | ] 288 | }, 289 | "Arrival Date": { 290 | "name": "Arrival Date", 291 | "data_type": "String", 292 | "is_categorical": true, 293 | "is_candidate_key": false, 294 | "min": 10, 295 | "max": 10, 296 | "missing_rate": 0.0, 297 | "distribution_bins": [ 298 | "2019-04-01", 299 | "2019-04-02", 300 | "2019-04-03", 301 | "2019-04-04", 302 | "2019-04-05", 303 | "2019-04-06", 304 | "2019-04-07" 305 | ], 306 | "distribution_probabilities": [ 307 | 0.06783542752636566, 308 | 0.12284925724304259, 309 | 0.09489380953045443, 310 | 0.15654728206204, 311 | 0.16681750153635305, 312 | 0.18350661317758582, 313 | 0.20755010892415843 314 | ] 315 | }, 316 | "Arrival hour range": { 317 | "name": "Arrival hour range", 318 | "data_type": "String", 319 | "is_categorical": true, 320 | "is_candidate_key": false, 321 | "min": 5, 322 | "max": 5, 323 | "missing_rate": 0.0, 324 | "distribution_bins": [ 325 | "00-03", 326 | "04-07", 327 | "08-11", 328 | "12-15", 329 | "16-19", 330 | "20-23" 331 | ], 332 | "distribution_probabilities": [ 333 | 0.15580117541263627, 334 | 0.19847718869225747, 335 | 0.20637700957772262, 336 | 0.19740459394581067, 337 | 0.17211788737444159, 338 | 0.06982214499713145 339 | ] 340 | }, 341 | "Age bracket": { 342 | "name": "Age bracket", 343 | "data_type": "String", 344 | "is_categorical": true, 345 | "is_candidate_key": false, 346 | "min": 3, 347 | "max": 5, 348 | "missing_rate": 0.0, 349 | "distribution_bins": [ 350 | "0-17", 351 | "18-24", 352 | "25-44", 353 | "45-64", 354 | "65-84", 355 | "85-" 356 | ], 357 | "distribution_probabilities": [ 358 | 0.13304536821134044, 359 | 0.09792995305020429, 360 | 0.3712089415455074, 361 | 0.2844464676120552, 362 | 0.10127555731296116, 363 | 0.012093712267931548 364 | ] 365 | } 366 | } 367 | } -------------------------------------------------------------------------------- /DataSynthesizer/DataDescriber.py: -------------------------------------------------------------------------------- 1 | import json 2 | from typing import Dict, List, Union 3 | 4 | from numpy import array_equal 5 | from pandas import DataFrame, read_csv 6 | 7 | from datatypes.AbstractAttribute import AbstractAttribute 8 | from datatypes.DateTimeAttribute import is_datetime, 
DateTimeAttribute 9 | from datatypes.FloatAttribute import FloatAttribute 10 | from datatypes.IntegerAttribute import IntegerAttribute 11 | from datatypes.SocialSecurityNumberAttribute import is_ssn, SocialSecurityNumberAttribute 12 | from datatypes.StringAttribute import StringAttribute 13 | from datatypes.utils.DataType import DataType 14 | from lib import utils 15 | from lib.PrivBayes import greedy_bayes, construct_noisy_conditional_distributions 16 | 17 | 18 | class DataDescriber: 19 | """Model input dataset, then save a description of the dataset into a JSON file. 20 | 21 | Attributes 22 | ---------- 23 | histogram_bins : int or str 24 | Number of bins in histograms. 25 | If it is a string such as 'auto' or 'fd', calculate the optimal bin width by `numpy.histogram_bin_edges`. 26 | category_threshold : int 27 | Categorical variables have no more than "this number" of distinct values. 28 | null_values: str or list 29 | Additional strings to recognize as missing values. 30 | By default missing values already include {‘’, ‘NULL’, ‘N/A’, ‘NA’, ‘NaN’, ‘nan’}. 31 | attr_to_datatype : dict 32 | Dictionary of {attribute: datatype}, e.g., {"age": "Integer", "gender": "String"}. 33 | attr_to_is_categorical : dict 34 | Dictionary of {attribute: boolean}, e.g., {"gender":True, "age":False}. 35 | attr_to_is_candidate_key: dict 36 | Dictionary of {attribute: boolean}, e.g., {"id":True, "name":False}. 37 | data_description: dict 38 | Nested dictionary (equivalent to JSON) recording the mined dataset information. 39 | df_input : DataFrame 40 | The input dataset to be analyzed. 41 | attr_to_column : Dict 42 | Dictionary of {attribute: AbstractAttribute} 43 | bayesian_network : list 44 | List of [child, [parent,]] to represent a Bayesian Network. 45 | df_encoded : DataFrame 46 | Input dataset encoded into integers, taken as input by PrivBayes algorithm in correlated attribute mode. 
47 | """ 48 | 49 | def __init__(self, histogram_bins: Union[int, str] = 20, category_threshold=10, null_values=None): 50 | self.histogram_bins: Union[int, str] = histogram_bins 51 | self.category_threshold: int = category_threshold 52 | self.null_values = null_values 53 | 54 | self.attr_to_datatype: Dict[str, DataType] = None 55 | self.attr_to_is_categorical: Dict[str, bool] = None 56 | self.attr_to_is_candidate_key: Dict[str, bool] = None 57 | 58 | self.data_description: Dict = {} 59 | self.df_input: DataFrame = None 60 | self.attr_to_column: Dict[str, AbstractAttribute] = None 61 | self.bayesian_network: List = None 62 | self.df_encoded: DataFrame = None 63 | 64 | def describe_dataset_in_random_mode(self, 65 | dataset_file: str, 66 | attribute_to_datatype: Dict[str, DataType] = None, 67 | attribute_to_is_categorical: Dict[str, bool] = None, 68 | attribute_to_is_candidate_key: Dict[str, bool] = None, 69 | categorical_attribute_domain_file: str = None, 70 | numerical_attribute_ranges: Dict[str, List] = None, 71 | seed=0): 72 | attribute_to_datatype = attribute_to_datatype or {} 73 | attribute_to_is_categorical = attribute_to_is_categorical or {} 74 | attribute_to_is_candidate_key = attribute_to_is_candidate_key or {} 75 | numerical_attribute_ranges = numerical_attribute_ranges or {} 76 | 77 | if categorical_attribute_domain_file: 78 | categorical_attribute_to_domain = utils.read_json_file(categorical_attribute_domain_file) 79 | else: 80 | categorical_attribute_to_domain = {} 81 | 82 | utils.set_random_seed(seed) 83 | self.attr_to_datatype = {attr: DataType(datatype) for attr, datatype in attribute_to_datatype.items()} 84 | self.attr_to_is_categorical = attribute_to_is_categorical 85 | self.attr_to_is_candidate_key = attribute_to_is_candidate_key 86 | self.read_dataset_from_csv(dataset_file) 87 | self.infer_attribute_data_types() 88 | self.analyze_dataset_meta() 89 | self.represent_input_dataset_by_columns() 90 | 91 | for column in self.attr_to_column.values(): 92 | attr_name = column.name 93 | if attr_name in categorical_attribute_to_domain: 94 | column.infer_domain(categorical_domain=categorical_attribute_to_domain[attr_name]) 95 | elif attr_name in numerical_attribute_ranges: 96 | column.infer_domain(numerical_range=numerical_attribute_ranges[attr_name]) 97 | else: 98 | column.infer_domain() 99 | 100 | # record attribute information in json format 101 | self.data_description['attribute_description'] = {} 102 | for attr, column in self.attr_to_column.items(): 103 | self.data_description['attribute_description'][attr] = column.to_json() 104 | 105 | def describe_dataset_in_independent_attribute_mode(self, 106 | dataset_file, 107 | epsilon=0.1, 108 | attribute_to_datatype: Dict[str, DataType] = None, 109 | attribute_to_is_categorical: Dict[str, bool] = None, 110 | attribute_to_is_candidate_key: Dict[str, bool] = None, 111 | categorical_attribute_domain_file: str = None, 112 | numerical_attribute_ranges: Dict[str, List] = None, 113 | seed=0): 114 | self.describe_dataset_in_random_mode(dataset_file, 115 | attribute_to_datatype, 116 | attribute_to_is_categorical, 117 | attribute_to_is_candidate_key, 118 | categorical_attribute_domain_file, 119 | numerical_attribute_ranges, 120 | seed=seed) 121 | 122 | for column in self.attr_to_column.values(): 123 | column.infer_distribution() 124 | 125 | self.inject_laplace_noise_into_distribution_per_attribute(epsilon) 126 | # record attribute information in json format 127 | self.data_description['attribute_description'] = {} 128 | for attr, column in 
self.attr_to_column.items(): 129 | self.data_description['attribute_description'][attr] = column.to_json() 130 | 131 | def describe_dataset_in_correlated_attribute_mode(self, 132 | dataset_file, 133 | k=0, 134 | epsilon=0.1, 135 | attribute_to_datatype: Dict[str, DataType] = None, 136 | attribute_to_is_categorical: Dict[str, bool] = None, 137 | attribute_to_is_candidate_key: Dict[str, bool] = None, 138 | categorical_attribute_domain_file: str = None, 139 | numerical_attribute_ranges: Dict[str, List] = None, 140 | seed=0): 141 | """Generate dataset description using correlated attribute mode. 142 | 143 | Parameters 144 | ---------- 145 | dataset_file : str 146 | File name (with directory) of the sensitive dataset as input in csv format. 147 | k : int 148 | Maximum number of parents in Bayesian network. 149 | epsilon : float 150 | A parameter in Differential Privacy. Increase epsilon value to reduce the injected noises. Set epsilon=0 to turn 151 | off Differential Privacy. 152 | attribute_to_datatype : dict 153 | Dictionary of {attribute: datatype}, e.g., {"age": "Integer", "gender": "String"}. 154 | attribute_to_is_categorical : dict 155 | Dictionary of {attribute: boolean}, e.g., {"gender":True, "age":False}. 156 | attribute_to_is_candidate_key: dict 157 | Dictionary of {attribute: boolean}, e.g., {"id":True, "name":False}. 158 | categorical_attribute_domain_file: str 159 | File name of a JSON file of some categorical attribute domains. 160 | numerical_attribute_ranges: dict 161 | Dictionary of {attribute: [min, max]}, e.g., {"age": [25, 65]} 162 | seed : int or float 163 | Seed the random number generator. 164 | """ 165 | self.describe_dataset_in_independent_attribute_mode(dataset_file, 166 | epsilon, 167 | attribute_to_datatype, 168 | attribute_to_is_categorical, 169 | attribute_to_is_candidate_key, 170 | categorical_attribute_domain_file, 171 | numerical_attribute_ranges, 172 | seed) 173 | self.df_encoded = self.encode_dataset_into_binning_indices() 174 | if self.df_encoded.shape[1] < 2: 175 | raise Exception("Correlated Attribute Mode requires at least 2 attributes/columns in dataset.") 176 | 177 | self.bayesian_network = greedy_bayes(self.df_encoded, k, epsilon) 178 | self.data_description['bayesian_network'] = self.bayesian_network 179 | self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions( 180 | self.bayesian_network, self.df_encoded, epsilon) 181 | 182 | def read_dataset_from_csv(self, file_name=None): 183 | try: 184 | self.df_input = read_csv(file_name, skipinitialspace=True, na_values=self.null_values) 185 | except (UnicodeDecodeError, NameError): 186 | self.df_input = read_csv(file_name, skipinitialspace=True, na_values=self.null_values, 187 | encoding='latin1') 188 | 189 | # Remove columns with empty active domain, i.e., all values are missing. 
190 |         attributes_before = set(self.df_input.columns)
191 |         self.df_input = self.df_input.dropna(axis=1, how='all')
192 |         attributes_after = set(self.df_input.columns)
193 |         if len(attributes_before) > len(attributes_after):
194 |             print(f'Empty columns are removed, including {attributes_before - attributes_after}.')
195 | 
196 |     def infer_attribute_data_types(self):
197 |         attributes_with_unknown_datatype = set(self.df_input.columns) - set(self.attr_to_datatype)
198 |         inferred_numerical_attributes = utils.infer_numerical_attributes_in_dataframe(self.df_input)
199 | 
200 |         for attr in attributes_with_unknown_datatype:
201 |             column_dropna = self.df_input[attr].dropna()
202 | 
203 |             # current attribute is either Integer or Float.
204 |             if attr in inferred_numerical_attributes:
205 |                 # TODO Comparing all values may be too slow for large datasets.
206 |                 if array_equal(column_dropna, column_dropna.astype(int, copy=False)):
207 |                     self.attr_to_datatype[attr] = DataType.INTEGER
208 |                 else:
209 |                     self.attr_to_datatype[attr] = DataType.FLOAT
210 | 
211 |             # current attribute is either String, DateTime, or SocialSecurityNumber.
212 |             else:
213 |                 # Sample 20 values to test its data_type.
214 |                 samples = column_dropna.sample(20, replace=True)
215 |                 if all(samples.map(is_datetime)):
216 |                     self.attr_to_datatype[attr] = DataType.DATETIME
217 |                 else:
218 |                     if all(samples.map(is_ssn)):
219 |                         self.attr_to_datatype[attr] = DataType.SOCIAL_SECURITY_NUMBER
220 |                     else:
221 |                         self.attr_to_datatype[attr] = DataType.STRING
222 | 
223 |     def analyze_dataset_meta(self):
224 |         all_attributes = set(self.df_input.columns)
225 | 
226 |         # find all candidate keys.
227 |         for attr in all_attributes - set(self.attr_to_is_candidate_key):
228 |             self.attr_to_is_candidate_key[attr] = self.df_input[attr].is_unique
229 | 
230 |         candidate_keys = {attr for attr, is_key in self.attr_to_is_candidate_key.items() if is_key}
231 | 
232 |         # find all categorical attributes.
233 |         for attr in all_attributes - set(self.attr_to_is_categorical):
234 |             self.attr_to_is_categorical[attr] = self.is_categorical(attr)
235 | 
236 |         non_categorical_string_attributes = set()
237 |         for attr, is_categorical in self.attr_to_is_categorical.items():
238 |             if not is_categorical and self.attr_to_datatype[attr] is DataType.STRING:
239 |                 non_categorical_string_attributes.add(attr)
240 | 
241 |         attributes_in_BN = list(all_attributes - candidate_keys - non_categorical_string_attributes)
242 |         non_categorical_string_attributes = list(non_categorical_string_attributes)
243 | 
244 |         self.data_description['meta'] = {"num_tuples": self.df_input.shape[0],
245 |                                          "num_attributes": self.df_input.shape[1],
246 |                                          "num_attributes_in_BN": len(attributes_in_BN),
247 |                                          "all_attributes": self.df_input.columns.tolist(),
248 |                                          "candidate_keys": list(candidate_keys),
249 |                                          "non_categorical_string_attributes": non_categorical_string_attributes,
250 |                                          "attributes_in_BN": attributes_in_BN}
251 | 
252 |     def is_categorical(self, attribute_name):
253 |         """ Detect whether an attribute is categorical.
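
        An attribute counts as categorical when it hasn't been explicitly
        flagged and has no more than self.category_threshold distinct values.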
254 | 
255 |         Parameters
256 |         ----------
257 |         attribute_name : str
258 |         """
259 |         if attribute_name in self.attr_to_is_categorical:
260 |             return self.attr_to_is_categorical[attribute_name]
261 |         else:
262 |             return self.df_input[attribute_name].dropna().unique().size <= self.category_threshold
263 | 
264 |     def represent_input_dataset_by_columns(self):
265 |         self.attr_to_column = {}
266 |         for attr in self.df_input:
267 |             data_type = self.attr_to_datatype[attr]
268 |             is_candidate_key = self.attr_to_is_candidate_key[attr]
269 |             is_categorical = self.attr_to_is_categorical[attr]
270 |             paras = (attr, is_candidate_key, is_categorical, self.histogram_bins, self.df_input[attr])
271 |             if data_type is DataType.INTEGER:
272 |                 self.attr_to_column[attr] = IntegerAttribute(*paras)
273 |             elif data_type is DataType.FLOAT:
274 |                 self.attr_to_column[attr] = FloatAttribute(*paras)
275 |             elif data_type is DataType.DATETIME:
276 |                 self.attr_to_column[attr] = DateTimeAttribute(*paras)
277 |             elif data_type is DataType.STRING:
278 |                 self.attr_to_column[attr] = StringAttribute(*paras)
279 |             elif data_type is DataType.SOCIAL_SECURITY_NUMBER:
280 |                 self.attr_to_column[attr] = SocialSecurityNumberAttribute(*paras)
281 |             else:
282 |                 raise Exception(f'The DataType of {attr} is unknown.')
283 | 
284 |     def inject_laplace_noise_into_distribution_per_attribute(self, epsilon=0.1):
285 |         num_attributes_in_BN = self.data_description['meta']['num_attributes_in_BN']
286 |         for column in self.attr_to_column.values():
287 |             assert isinstance(column, AbstractAttribute)
288 |             column.inject_laplace_noise(epsilon, num_attributes_in_BN)
289 | 
290 |     def encode_dataset_into_binning_indices(self):
291 |         """Before constructing the Bayesian network, encode the input dataset into binning indices."""
292 |         encoded_dataset = DataFrame()
293 |         for attr in self.data_description['meta']['attributes_in_BN']:
294 |             encoded_dataset[attr] = self.attr_to_column[attr].encode_values_into_bin_idx()
295 |         return encoded_dataset
296 | 
297 |     def save_dataset_description_to_file(self, file_name):
298 |         with open(file_name, 'w') as outfile:
299 |             json.dump(self.data_description, outfile, indent=4)
300 | 
301 |     def display_dataset_description(self):
302 |         print(json.dumps(self.data_description, indent=4))
303 | 
304 | 
305 | if __name__ == '__main__':
306 |     from DataGenerator import DataGenerator
307 | 
308 |     # input dataset
309 |     input_data = './data/adult_ssn.csv'
310 |     # location of two output files
311 |     mode = 'correlated_attribute_mode'
312 |     description_file = './out/{}/description.txt'.format(mode)
313 |     synthetic_data = './out/{}/synthetic_data.csv'.format(mode)
314 | 
315 |     # An attribute is categorical if its domain size is no more than this threshold.
316 |     # Here modify the threshold to adapt to the domain size of "education" (which is 14 in input dataset).
317 |     threshold_value = 20
318 | 
319 |     # Additional strings to recognize as NA/NaN.
320 |     na_values = ''
321 | 
322 |     # specify which attributes are candidate keys of input dataset.
323 |     candidate_keys = {'age': False, 'ssn': True}
324 | 
325 |     # A parameter in differential privacy.
326 |     # It roughly means that removing one tuple will change the probability of any output by at most a factor of exp(eps).
327 |     # Set eps=0 to turn off differential privacy.
328 |     eps = 0.1
329 | 
330 |     # The maximum number of parents in the Bayesian network, i.e., the maximum number of incoming edges.
331 |     degree_of_bayesian_network = 2
332 | 
333 |     # Number of tuples generated in synthetic dataset.
334 |     num_tuples_to_generate = 32561  # Here 32561 matches the size of the input dataset, but it can be set to another number.
335 | 
336 |     describer = DataDescriber(histogram_bins='fd',
337 |                               category_threshold=threshold_value,
338 |                               null_values=na_values)
339 |     describer.describe_dataset_in_correlated_attribute_mode(input_data,
340 |                                                             epsilon=eps, k=degree_of_bayesian_network,
341 |                                                             attribute_to_is_candidate_key=candidate_keys)
342 |     describer.save_dataset_description_to_file(description_file)
343 | 
344 |     generator = DataGenerator()
345 |     generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
346 |     generator.save_synthetic_data(synthetic_data)
347 |     print(generator.synthetic_dataset.head())
348 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | _Last tested: 2022-04-14. Updated the requirements and ran in Python 3.10 (although with a few warnings from Pandas)._
2 | 
3 | # Anonymisation with Synthetic Data Tutorial
4 | 
5 | ## Some questions
6 | 
7 | **What is this?**
8 | 
9 | A hands-on tutorial showing how to use Python to create synthetic data.
10 | 
11 | **Wait, what is this "synthetic data" you speak of?**
12 | 
13 | It's data created by an automated process that reproduces many of the statistical patterns of an original dataset. It is also sometimes used as a way to release data that contains no personal information, even if the original did contain lots of data that could identify people. This means programmers and data scientists can crack on with building software and algorithms that they know will work similarly on the real data.
14 | 
15 | **Who is this tutorial for?**
16 | 
17 | For anyone who programs and wants to learn about data anonymisation in general, or more specifically about synthetic data.
18 | 
19 | **What is it not for?**
20 | 
21 | Non-programmers, although we think this tutorial is still worth a browse to get some of the main ideas of what goes into anonymising a dataset. It's also not for you if you're looking for info on how to create synthetic data using the latest and greatest deep learning techniques.
22 | 
23 | **Who are you?**
24 | 
25 | We're the Open Data Institute. We work with companies and governments to build an open, trustworthy data ecosystem. Anonymisation and synthetic data are some of the many, many ways we can responsibly increase access to data. If you want to learn more, [check out our site](http://theodi.org).
26 | 
27 | **Why did you make this?**
28 | 
29 | We have an [R&D program](https://theodi.org/project/data-innovation-for-uk-research-and-development/) that has a number of projects looking into how to support innovation, improve data infrastructure and encourage ethical data sharing. One of our projects is about [managing the risks of re-identification](https://theodi.org/project/rd-broaden-access-to-personal-data-while-protecting-privacy-and-creating-a-fair-market/) in shared and open data. As you can see in the *Key outputs* section, we have other material from the project, but we thought it'd be good to have something specifically aimed at programmers who are interested in learning by doing.
30 | 
31 | **Speaking of which, can I just get to the tutorial now?**
32 | 
33 | Sure! Let's go.
34 | 
35 | ## Overview
36 | 
37 | In this tutorial you are aiming to create a safe version of accident and emergency (A&E) admissions data, collected from multiple hospitals.
This data contains some sensitive personal information about people's health and can't be openly shared. By removing and altering certain identifying information in the data we can greatly reduce the risk that patients can be re-identified, and therefore hope to release the data.
38 | 
39 | Just to be clear, we're not using actual A&E data but are creating our own simple, mock version of it.
40 | 
41 | The practical steps involve:
42 | 
43 | 1. Create an A&E admissions dataset which will contain (pretend) personal information.
44 | 2. Run some anonymisation steps over this dataset to generate a new dataset with much less re-identification risk.
45 | 3. Take this de-identified dataset and generate multiple synthetic datasets from it to reduce the re-identification risk even further.
46 | 4. Analyse the synthetic datasets to see how similar they are to the original data.
47 | 
48 | You may be wondering, why can't we just do the synthetic data step? If it's synthetic, surely it won't contain any personal information?
49 | 
50 | Not exactly. Patterns picked up in the original data can be transferred to the synthetic data. This is especially true for outliers. For instance, if there is only one person from a certain area aged over 85 and this shows up in the synthetic data, we would be able to re-identify them.
51 | 
52 | ## Credit to others
53 | 
54 | This tutorial is inspired by the [NHS England and ODI Leeds' research](https://odileeds.org/events/synae/) in creating a synthetic dataset from NHS England's accident and emergency admissions. Please do read about their project, as it's really interesting and great for learning about the benefits and risks in creating synthetic data.
55 | 
56 | Also, the synthetic data generating library we use is [DataSynthesizer](https://homes.cs.washington.edu/~billhowe//projects/2017/07/20/Data-Synthesizer.html) and comes as part of this codebase. Coming from researchers at Drexel University and the University of Washington, it's an excellent piece of software and their research and papers are well worth checking out. It's available as a [repo on Github](https://github.com/DataResponsibly/DataSynthesizer) which includes some short tutorials on how to use the toolkit and an accompanying research paper describing the theory behind it.
57 | 
58 | ---
59 | 
60 | ## Setup
61 | 
62 | First, make sure you have [Python3 installed](https://www.python.org/downloads/). You'll need Python 3.6 at minimum.
63 | 
64 | Download this repository either as a zip or clone it using Git.
65 | 
66 | Install the required libraries. You can do that, for example, inside a _virtualenv_ (e.g. `python3 -m venv .venv && source .venv/bin/activate`).
67 | 
68 | ```bash
69 | cd /path/to/repo/synthetic_data_tutorial/
70 | pip install -r requirements.txt
71 | ```
72 | 
73 | Next we'll go through how to create, de-identify and synthesise the data. We'll show this using code snippets, but the full code is contained within the `/tutorial` directory.
74 | 
75 | There are small differences between the code presented here and what's in the Python scripts, but it's mostly down to variable naming. I'd encourage you to run, edit and play with the code locally.
76 | 
77 | ## Generate mock NHS A&E dataset
78 | 
79 | The data already exists in `data/nhs_ae_mock.csv` so feel free to browse that. But you should generate your own fresh dataset using the `tutorial/generate.py` script.
80 | 
81 | To do this, you'll need to download one dataset first. It's a list of all postcodes in London.
You can find it at this page on [doogal.co.uk](https://www.doogal.co.uk/PostcodeDownloads.php), at the _London_ link under the _By English region_ section. Or just download it directly at [this link](https://www.doogal.co.uk/UKPostcodesCSV.ashx?region=E12000007) (just take note, it's 133MB in size), then place the `London postcodes.csv` file into the `data/` directory.
82 | 
83 | Or you can just do it using `curl`.
84 | 
85 | ```bash
86 | curl -o "./data/London postcodes.csv" https://www.doogal.co.uk/UKPostcodesCSV.ashx?region=E12000007
87 | ```
88 | 
89 | Then, to generate the data, run the `generate.py` script from the project root directory.
90 | 
91 | ```bash
92 | python tutorial/generate.py
93 | ```
94 | 
95 | Voila! You'll now see a new `hospital_ae_data.csv` file in the `/data` directory. Open it up and have a browse. It contains the following columns:
96 | 
97 | - **Health Service ID**: NHS number of the admitted patient
98 | - **Age**: age of patient
99 | - **Time in A&E (mins)**: how long, in minutes, the patient spent in A&E. This is generated to correlate with the age of the patient.
100 | - **Hospital**: which hospital admitted the patient - with some hospitals being more prevalent in the data than others
101 | - **Arrival Time**: what time and date the patient was admitted - with weekends busier and a different peak time for each day
102 | - **Treatment**: what the person was treated for - with certain treatments being more common than others
103 | - **Gender**: patient gender - based on [NHS patient gender codes](https://www.datadictionary.nhs.uk/data_dictionary/attributes/p/person/person_gender_code_de.asp?shownav=1)
104 | - **Postcode**: postcode of patient - random, in-use London postcodes extracted from the `London postcodes.csv` file.
105 | 
106 | We can see this dataset obviously contains some personal information. For instance, if we knew roughly the time a neighbour went to A&E we could use their postcode to figure out exactly what ailment they went in with. Or, if a list of people's Health Service IDs were to be leaked in future, lots of people could be re-identified.
107 | 
108 | Because of this, we'll need to take some de-identification steps.
109 | 
110 | ---
111 | 
112 | ## De-identification
113 | 
114 | For this stage, we're going to be loosely following the de-identification techniques used by Jonathan Pearson of NHS England, described in a blog post about [creating its own synthetic data](https://odileeds.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data).
115 | 
116 | If you look in `tutorial/deidentify.py` you'll see the full code of all the de-identification steps. You can run it easily.
117 | 
118 | ```bash
119 | python tutorial/deidentify.py
120 | ```
121 | 
122 | It takes the `data/hospital_ae_data.csv` file, runs the steps, and saves the new dataset to `data/hospital_ae_data_deidentify.csv`.
123 | 
124 | Let's break down each of these steps. It first loads the `data/hospital_ae_data.csv` file into a Pandas DataFrame called `hospital_ae_df`.
125 | 
126 | ```python
127 | # _df is a common way to refer to a Pandas DataFrame object
128 | hospital_ae_df = pd.read_csv(filepaths.hospital_ae_data)
129 | ```
130 | 
131 | (`filepaths.py` is, surprise, surprise, where all the filepaths are listed.)
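In case you're wondering what that module looks like, here's a minimal, hypothetical sketch of `tutorial/filepaths.py`. The exact contents of the real file may differ, but these are the variable names the snippets below rely on:

```python
# Hypothetical sketch of tutorial/filepaths.py (the real file may differ).
# One variable per dataset used in the tutorial.
data_dir = './data/'

postcodes_london = data_dir + 'London postcodes.csv'
hospital_ae_data = data_dir + 'hospital_ae_data.csv'
hospital_ae_data_deidentify = data_dir + 'hospital_ae_data_deidentify.csv'
```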
132 | 
133 | ### Remove Health Service ID numbers
134 | 
135 | Health Service ID numbers are direct identifiers and should be removed. So we'll simply drop the entire column.
136 | 
137 | ```python
138 | hospital_ae_df = hospital_ae_df.drop(columns=['Health Service ID'])
139 | ```
140 | 
141 | ### Where a patient lives
142 | 
143 | Pseudo-identifiers, also known as [quasi-identifiers](https://en.wikipedia.org/wiki/Quasi-identifier), are pieces of information that don't directly identify people but can be used with other information to identify a person. If we were to take the age, postcode and gender of a person, we could combine these and check the dataset to see what that person was treated for in A&E.
144 | 
145 | The data scientist from NHS England, Jonathan Pearson, describes this in the blog post:
146 | 
147 | > I started with the postcode of the patients resident lower super output area (LSOA). This is a geographical definition with an average of 1500 residents created to make reporting in England and Wales easier. I wanted to keep some basic information about the area where the patient lives whilst completely removing any information regarding any actual postcode. A key variable in health care inequalities is the patients Index of Multiple deprivation (IMD) decile (broad measure of relative deprivation) which gives an average ranked value for each LSOA. By replacing the patients resident postcode with an IMD decile I have kept a key bit of information whilst making this field non-identifiable.
148 | 
149 | We'll do just the same with our dataset.
150 | 
151 | First we'll map each row's postcode to its LSOA and then drop the postcode column.
152 | 
153 | ```python
154 | postcodes_df = pd.read_csv(filepaths.postcodes_london)
155 | hospital_ae_df = pd.merge(
156 |     hospital_ae_df,
157 |     postcodes_df[['Postcode', 'Lower layer super output area']],
158 |     on='Postcode'
159 | )
160 | hospital_ae_df = hospital_ae_df.drop(columns=['Postcode'])
161 | ```
162 | 
163 | Then we'll add an "Index of Multiple Deprivation" column, mapped from each entry's LSOA.
164 | 
165 | ```python
166 | hospital_ae_df = pd.merge(
167 |     hospital_ae_df,
168 |     postcodes_df[['Lower layer super output area', 'Index of Multiple Deprivation']].drop_duplicates(),
169 |     on='Lower layer super output area'
170 | )
171 | ```
172 | 
173 | Next we calculate the decile bins for the IMD values, using all the IMDs from the full list of London postcodes. We'll use the Pandas `qcut` (quantile cut) function for this.
174 | 
175 | ```python
176 | _, bins = pd.qcut(
177 |     postcodes_df['Index of Multiple Deprivation'],
178 |     10,
179 |     retbins=True,
180 |     labels=False
181 | )
182 | ```
183 | 
184 | Then we'll use those decile `bins` to map each row's IMD to its IMD decile.
185 | 
186 | ```python
187 | # add +1 to get deciles from 1 to 10 (not 0 to 9)
188 | hospital_ae_df['Index of Multiple Deprivation Decile'] = pd.cut(
189 |     hospital_ae_df['Index of Multiple Deprivation'],
190 |     bins=bins,
191 |     labels=False,
192 |     include_lowest=True) + 1
193 | ```
194 | 
195 | And finally drop the columns we no longer need.
196 | 
197 | ```python
198 | hospital_ae_df = hospital_ae_df.drop(columns=['Index of Multiple Deprivation'])
199 | hospital_ae_df = hospital_ae_df.drop(columns=['Lower layer super output area'])
200 | ```
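It's worth a quick sanity check that the binning behaved: every decile from 1 to 10 should appear. This check is ours and isn't part of `deidentify.py`; you can run it in a console:

```python
# Each decile from 1 to 10 should appear, roughly evenly if the mock
# postcodes were sampled evenly across London.
print(hospital_ae_df['Index of Multiple Deprivation Decile']
      .value_counts()
      .sort_index())
```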
201 | 
202 | ### Individual hospitals
203 | 
204 | The data scientist at NHS England masked individual hospitals, giving the following reason:
205 | 
206 | > As each hospital has its own complex case mix and health system, using these data to identify poor performance or possible improvements would be invalid and un-helpful. Therefore, I decided to replace the hospital code with a random number.
207 | 
208 | So we'll do as they did, replacing each hospital with a random six-digit ID.
209 | 
210 | ```python
211 | hospitals = hospital_ae_df['Hospital'].unique().tolist()
212 | random.shuffle(hospitals)
213 | hospitals_map = {
214 |     hospital : ''.join(random.choices(string.digits, k=6))
215 |     for hospital in hospitals
216 | }
217 | hospital_ae_df['Hospital ID'] = hospital_ae_df['Hospital'].map(hospitals_map)
218 | ```
219 | 
220 | And remove the `Hospital` column.
221 | 
222 | ```python
223 | hospital_ae_df = hospital_ae_df.drop(columns=['Hospital'])
224 | ```
225 | 
226 | ### Time in the data
227 | 
228 | > The next obvious step was to simplify some of the time information I have available as health care system analysis doesn't need to be responsive enough to work on a second and minute basis. Thus, I removed the time information from the 'arrival date', mapped the 'arrival time' into 4-hour chunks
229 | 
230 | First we'll split the `Arrival Time` column into `Arrival Date` and `Arrival Hour`.
231 | 
232 | ```python
233 | arrival_times = pd.to_datetime(hospital_ae_df['Arrival Time'])
234 | hospital_ae_df['Arrival Date'] = arrival_times.dt.strftime('%Y-%m-%d')
235 | hospital_ae_df['Arrival Hour'] = arrival_times.dt.hour
236 | hospital_ae_df = hospital_ae_df.drop(columns=['Arrival Time'])
237 | ```
238 | 
239 | Then we'll map the hours to 4-hour chunks and drop the `Arrival Hour` column.
240 | 
241 | ```python
242 | hospital_ae_df['Arrival hour range'] = pd.cut(
243 |     hospital_ae_df['Arrival Hour'],
244 |     bins=[0, 4, 8, 12, 16, 20, 24],
245 |     labels=['00-03', '04-07', '08-11', '12-15', '16-19', '20-23'],
246 |     right=False  # each bin includes its left edge, so hour 4 lands in '04-07'
247 | )
248 | hospital_ae_df = hospital_ae_df.drop(columns=['Arrival Hour'])
249 | ```
250 | 
251 | ### Patient demographics
252 | 
253 | > I decided to only include records with a sex of male or female in order to reduce risk of re identification through low numbers.
254 | 
255 | ```python
256 | hospital_ae_df = hospital_ae_df[hospital_ae_df['Gender'].isin(['Male', 'Female'])]
257 | ```
258 | 
259 | > For the patients age it is common practice to group these into bands and so I've used a standard set - 1-17, 18-24, 25-44, 45-64, 65-84, and 85+ - which although are non-uniform are well used segments defining different average health care usage.
260 | 
261 | ```python
262 | hospital_ae_df['Age bracket'] = pd.cut(
263 |     hospital_ae_df['Age'],
264 |     bins=[0, 18, 25, 45, 65, 85, 150],
265 |     labels=['0-17', '18-24', '25-44', '45-64', '65-84', '85-'],
266 |     right=False  # so age 18 lands in '18-24', not '0-17'
267 | )
268 | hospital_ae_df = hospital_ae_df.drop(columns=['Age'])
269 | ```
270 | 
271 | That's all the de-identification steps. Finally, we save our new de-identified dataset.
272 | 
273 | ```python
274 | hospital_ae_df.to_csv(filepaths.hospital_ae_data_deidentify, index=False)
275 | ```
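Before we move on, a quick assertion (again ours, not part of the script) confirms that the identifying columns really are gone:

```python
# None of the dropped or replaced columns should survive de-identification.
for column in ['Health Service ID', 'Postcode', 'Hospital', 'Age', 'Arrival Time']:
    assert column not in hospital_ae_df.columns, f'{column} is still present!'
print(hospital_ae_df.columns.tolist())
```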
276 | 
277 | ---
278 | 
279 | ## Synthesise
280 | 
281 | Synthetic data exists on a spectrum, from having merely the same columns and datatypes as the original data all the way to carrying nearly all of its statistical patterns.
282 | 
283 | The UK's Office for National Statistics has a great report on synthetic data, and its [_Synthetic Data Spectrum_](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot#synthetic-dataset-spectrum) section is very good at explaining the nuances in more detail.
284 | 
285 | In this tutorial we'll create not one, not two, but *three* synthetic datasets, sitting at different points along the synthetic data spectrum: *Random*, *Independent* and *Correlated*.
286 | 
287 | > In **correlated attribute mode**, we learn a differentially private Bayesian network capturing the correlation structure between attributes, then draw samples from this model to construct the result dataset.
288 | >
289 | > In cases where the correlated attribute mode is too computationally expensive or when there is insufficient data to derive a reasonable model, one can use **independent attribute mode**. In this mode, a histogram is derived for each attribute, noise is added to the histogram to achieve differential privacy, and then samples are drawn for each attribute.
290 | >
291 | > Finally, for cases of extremely sensitive data, one can use **random mode** that simply generates type-consistent random values for each attribute.
292 | 
293 | We'll go through each of these now, moving along the synthetic data spectrum from random to independent to correlated.
294 | 
295 | The toolkit we will be using to generate the three synthetic datasets is DataSynthesizer.
296 | 
297 | ### DataSynthesizer
298 | 
299 | As described in the introduction, this is an open-source toolkit for generating synthetic data. And I'd like to lavish much praise on the researchers who made it, as it's excellent.
300 | 
301 | Instead of explaining it myself, I'll use the researchers' own words from their paper:
302 | 
303 | > DataSynthesizer infers the domain of each attribute and derives a description of the distribution of attribute values in the private dataset. This information is saved in a dataset description file, to which we refer as data summary. Then DataSynthesizer is able to generate synthetic datasets of arbitrary size by sampling from the probabilistic model in the dataset description file.
304 | 
305 | We'll create and inspect our synthetic datasets using three modules within it.
306 | 
307 | > DataSynthesizer consists of three high-level modules:
308 | >
309 | > 1. **DataDescriber**: investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary.
310 | > 2. **DataGenerator**: samples from the summary computed by DataDescriber and outputs synthetic data.
311 | > 3. **ModelInspector**: shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired.
312 | 
313 | If you want to browse the code for each of these modules, you can find the Python classes for them in the `DataSynthesizer` directory (all the code in here is from the [original repo](https://github.com/DataResponsibly/DataSynthesizer)).
314 | 
315 | 
316 | ### An aside about differential privacy and Bayesian networks
317 | 
318 | You might have seen the phrase "differentially private Bayesian network" in the *correlated mode* description earlier, and got slightly panicked. But fear not! You don't need to worry *too* much about these to get DataSynthesizer working.
319 | 
320 | First off, while DataSynthesizer has the option of using differential privacy for anonymisation, we are turning it off and won't be using it in this tutorial. So you can ignore that part. However, if you care about anonymisation, you really should read up on differential privacy. I've read a lot of explainers on it and the best I found was [this article from Access Now](https://www.accessnow.org/understanding-differential-privacy-matters-digital-rights/).
321 | 
322 | Now the next term: Bayesian networks. These are directed graphs that model the statistical relationships between a dataset's variables. They do this by saying certain variables are "parents" of others, that is, their value influences their "children" variables. Parents can influence children, but children can't influence parents. In our case, if patient age is a parent of waiting time, it means the age of a patient influences how long they wait, but how long they wait doesn't influence their age. So by using Bayesian networks, DataSynthesizer can model these influences and use the model when generating the synthetic data.
323 | 
324 | It can be a slightly tricky topic to grasp, but a nice introductory tutorial on them is at the [Probabilistic World site](https://www.probabilisticworld.com/bayesian-belief-networks-part-1/). Give it a read.
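To make the parent-child idea concrete, here's a toy sketch of sampling from a one-parent network. This is our own illustration, not DataSynthesizer's actual implementation: we draw a parent value from its distribution, then draw the child from its distribution *conditional* on that parent value.

```python
import pandas as pd

# Toy one-parent Bayesian network: 'Age bracket' is the parent of
# 'Time in A&E (mins)'. Sampling the child conditional on the sampled
# parent preserves the correlation between the two attributes.
df = pd.DataFrame({
    'Age bracket': ['0-17', '0-17', '65-84', '65-84', '85-'],
    'Time in A&E (mins)': [20, 35, 80, 95, 120],
})

parent_value = df['Age bracket'].sample(1).iloc[0]
child_given_parent = df.loc[df['Age bracket'] == parent_value, 'Time in A&E (mins)']
print(parent_value, child_given_parent.sample(1).iloc[0])
```

Run it a few times: sampled young patients only ever get short waiting times, which is exactly the kind of influence the network encodes.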
325 | 
326 | ### Random mode
327 | 
328 | If we were just generating A&E data to test our software, we wouldn't care too much about the statistical patterns within the data, just that it was roughly a similar size and that the datatypes and columns aligned.
329 | 
330 | In this case, we can just generate the data at random using the `generate_dataset_in_random_mode` function within the `DataGenerator` class.
331 | 
332 | #### Data Description: Random
333 | 
334 | The first step is to create a description of the data, defining each attribute's datatype and which attributes are categorical.
335 | 
336 | ```python
337 | attribute_to_datatype = {
338 |     'Time in A&E (mins)': 'Integer',
339 |     'Treatment': 'String',
340 |     'Gender': 'String',
341 |     'Index of Multiple Deprivation Decile': 'Integer',
342 |     'Hospital ID': 'String',
343 |     'Arrival Date': 'String',
344 |     'Arrival hour range': 'String',
345 |     'Age bracket': 'String'
346 | }
347 | 
348 | attribute_is_categorical = {
349 |     'Hospital ID': True,
350 |     'Time in A&E (mins)': False,
351 |     'Treatment': True,
352 |     'Gender': True,
353 |     'Index of Multiple Deprivation Decile': False,
354 |     'Arrival Date': True,
355 |     'Arrival hour range': True,
356 |     'Age bracket': True
357 | }
358 | ```
359 | 
360 | We'll be feeding these into a `DataDescriber` instance.
361 | 
362 | ```python
363 | describer = DataDescriber()
364 | ```
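The constructor also takes a few tuning parameters, which the demo at the bottom of `DataSynthesizer/DataDescriber.py` uses like this (the values here are just illustrative; in the tutorial we stick with the defaults):

```python
# 'fd' picks histogram bin widths using the Freedman-Diaconis rule;
# category_threshold caps the domain size for an attribute to count as
# categorical; null_values adds extra strings to treat as NA/NaN.
describer = DataDescriber(histogram_bins='fd',
                          category_threshold=20,
                          null_values='')
```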
365 | 
366 | Using this `describer` instance, and feeding in the attribute descriptions, we create a description file.
367 | 
368 | ```python
369 | describer.describe_dataset_in_random_mode(
370 |     filepaths.hospital_ae_data_deidentify,
371 |     attribute_to_datatype=attribute_to_datatype,
372 |     attribute_to_is_categorical=attribute_is_categorical)
373 | describer.save_dataset_description_to_file(
374 |     filepaths.hospital_ae_description_random)
375 | ```
376 | 
377 | You can see an example description file in `data/hospital_ae_description_random.json`.
378 | 
379 | #### Data Generation: Random
380 | 
381 | Next, generate the random data. We'll generate the same number of rows as the original data, but, importantly, we could generate many more or fewer if we wanted to.
382 | 
383 | ```python
384 | num_rows = len(hospital_ae_df)
385 | ```
386 | 
387 | Now generate the random data.
388 | 
389 | ```python
390 | generator = DataGenerator()
391 | generator.generate_dataset_in_random_mode(
392 |     num_rows, filepaths.hospital_ae_description_random)
393 | generator.save_synthetic_data(filepaths.hospital_ae_data_synthetic_random)
394 | ```
395 | 
396 | You can view this random synthetic data in the file `data/hospital_ae_data_synthetic_random.csv`.
397 | 
398 | #### Attribute Comparison: Random
399 | 
400 | We'll compare each attribute in the original data to the synthetic data by generating histogram plots using the `ModelInspector` class.
401 | 
402 | `figure_filepath` is just a variable holding where we'll write the plot out to.
403 | 
404 | ```python
405 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_random)
406 | 
407 | # Read attribute description from the dataset description file.
408 | attribute_description = read_json_file(
409 |     filepaths.hospital_ae_description_random)['attribute_description']
410 | 
411 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
412 | 
413 | for attribute in synthetic_df.columns:
414 |     inspector.compare_histograms(attribute, figure_filepath)
415 | ```
416 | 
417 | Let's look at the histogram plots now for a few of the attributes. We can see that the generated data is completely random and doesn't contain any information about averages or distributions.
418 | 
419 | *Comparison of ages in original data (left) and random synthetic data (right)*
420 | ![Random mode age bracket histograms](plots/random_Age_bracket.png)
421 | 
422 | *Comparison of hospital attendance in original data (left) and random synthetic data (right)*
423 | ![Random mode hospital ID histograms](plots/random_Hospital_ID.png)
424 | 
425 | *Comparison of arrival date in original data (left) and random synthetic data (right)*
426 | ![Random mode arrival date histograms](plots/random_Arrival_Date.png)
427 | 
428 | You can see more comparison examples in the `/plots` directory.
429 | 
430 | #### Compare pairwise mutual information: Random
431 | 
432 | DataSynthesizer has a function to compare the _mutual information_ between each pair of variables in the dataset and plot them. We'll avoid the mathematical definition of mutual information, but [Scholarpedia notes](http://www.scholarpedia.org/article/Mutual_information) it:
433 | 
434 | > can be thought of as the reduction in uncertainty about one random variable given knowledge of another.
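The scores in the heatmap are normalised, running from 0 (knowing one attribute tells you nothing about the other) to 1 (one attribute fully determines the other). To get a feel for the two extremes, you can try scikit-learn's `normalized_mutual_info_score` (scikit-learn is already in the requirements); this little example is ours, not DataSynthesizer's:

```python
from sklearn.metrics import normalized_mutual_info_score

# Identical labellings share all of their information...
print(normalized_mutual_info_score([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0
# ...while independent labellings share none of it.
print(normalized_mutual_info_score([1, 1, 2, 2], [1, 2, 1, 2]))  # 0.0
```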
435 | 
436 | To create this plot, we run:
437 | 
438 | ```python
439 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_random)
440 | 
441 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
442 | inspector.mutual_information_heatmap(figure_filepath)
443 | ```
444 | 
445 | We can see the original, private data has a correlation between `Age bracket` and `Time in A&E (mins)`. Not surprisingly, this correlation is lost when we generate our random data.
446 | 
447 | *Mutual Information Heatmap in original data (left) and random synthetic data (right)*
448 | ![Random mode mutual information heatmap](plots/mutual_information_heatmap_random.png)
449 | 
450 | ### Independent attribute mode
451 | 
452 | What if we had a use case where we wanted to analyse, say, the median ages or the hospital usage in the synthetic data? In this case we'd use independent attribute mode.
453 | 
454 | #### Data Description: Independent
455 | 
456 | ```python
457 | describer.describe_dataset_in_independent_attribute_mode(
458 |     filepaths.hospital_ae_data_deidentify,
459 |     attribute_to_datatype=attribute_to_datatype, attribute_to_is_categorical=attribute_is_categorical)
460 | describer.save_dataset_description_to_file(
461 |     filepaths.hospital_ae_description_independent)
462 | ```
463 | 
464 | #### Data Generation: Independent
465 | 
466 | Next we generate the data, which keeps the distribution of each column but not the correlations between columns.
467 | 
468 | ```python
469 | generator = DataGenerator()
470 | generator.generate_dataset_in_independent_mode(
471 |     num_rows, filepaths.hospital_ae_description_independent)
472 | generator.save_synthetic_data(
473 |     filepaths.hospital_ae_data_synthetic_independent)
474 | ```
475 | 
476 | #### Attribute Comparison: Independent
477 | 
478 | Comparing the attribute histograms, we see that independent mode captures the distributions pretty accurately. The synthetic data is _mostly_ similar, but not exactly the same.
479 | 
480 | ```python
481 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_independent)
482 | attribute_description = read_json_file(
483 |     filepaths.hospital_ae_description_independent)['attribute_description']
484 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
485 | 
486 | for attribute in synthetic_df.columns:
487 |     inspector.compare_histograms(attribute, figure_filepath)
488 | ```
489 | 
490 | *Comparison of ages in original data (left) and independent synthetic data (right)*
491 | ![Independent mode age bracket histograms](plots/independent_Age_bracket.png)
492 | 
493 | *Comparison of hospital attendance in original data (left) and independent synthetic data (right)*
494 | ![Independent mode hospital ID histograms](plots/independent_Hospital_ID.png)
495 | 
496 | *Comparison of arrival date in original data (left) and independent synthetic data (right)*
497 | ![Independent mode arrival date histograms](plots/independent_Arrival_Date.png)
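Since the use case we floated was analysing summary statistics, a quick spot check (ours, not part of the tutorial scripts) shows how close the synthetic data gets:

```python
# The medians of a numeric column should roughly agree...
print(hospital_ae_df['Time in A&E (mins)'].median())
print(synthetic_df['Time in A&E (mins)'].median())
# ...and so should the share of each category.
print(hospital_ae_df['Gender'].value_counts(normalize=True))
print(synthetic_df['Gender'].value_counts(normalize=True))
```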
498 | 
499 | #### Compare pairwise mutual information: Independent
500 | 
501 | ```python
502 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_independent)
503 | 
504 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
505 | inspector.mutual_information_heatmap(figure_filepath)
506 | ```
507 | 
508 | We can see the independent data also does not contain any of the attribute correlations from the original data.
509 | 
510 | *Mutual Information Heatmap in original data (left) and independent synthetic data (right)*
511 | ![Independent mode mutual information heatmap](plots/mutual_information_heatmap_independent.png)
512 | 
513 | ### Correlated attribute mode
514 | 
515 | If we want to capture correlations between variables, for instance that patient age is related to waiting times, we'll need correlated data. To do this we use *correlated mode*.
516 | 
517 | #### Data Description: Correlated
518 | 
519 | There are a couple of parameters that are different here, so we'll explain them.
520 | 
521 | `epsilon` is DataSynthesizer's differential privacy parameter, which controls how much noise is added to the data - the lower the value, the more noise and therefore the more privacy. We're not using differential privacy, so we can turn it off by setting it to zero.
522 | 
523 | `k` is the maximum number of parents in the Bayesian network, i.e., the maximum number of incoming edges. For simplicity's sake, we're going to set this to 1, saying that only one other variable can influence any given variable.
524 | 
525 | ```python
526 | describer.describe_dataset_in_correlated_attribute_mode(
527 |     dataset_file=filepaths.hospital_ae_data_deidentify,
528 |     epsilon=0,
529 |     k=1,
530 |     attribute_to_datatype=attribute_to_datatype,
531 |     attribute_to_is_categorical=attribute_is_categorical)
532 | 
533 | describer.save_dataset_description_to_file(filepaths.hospital_ae_description_correlated)
534 | ```
535 | 
536 | #### Data Generation: Correlated
537 | 
538 | ```python
539 | generator.generate_dataset_in_correlated_attribute_mode(
540 |     num_rows, filepaths.hospital_ae_description_correlated)
541 | generator.save_synthetic_data(filepaths.hospital_ae_data_synthetic_correlated)
542 | ```
543 | 
544 | #### Attribute Comparison: Correlated
545 | 
546 | We can see correlated mode also keeps similar distributions. At a glance the histograms look identical to the originals, but look closely and you'll spot small differences in the distributions.
547 | 
548 | *Comparison of ages in original data (left) and correlated synthetic data (right)*
549 | ![Correlated mode age bracket histograms](plots/correlated_Age_bracket.png)
550 | 
551 | *Comparison of hospital attendance in original data (left) and correlated synthetic data (right)*
552 | ![Correlated mode hospital ID histograms](plots/correlated_Hospital_ID.png)
553 | 
554 | *Comparison of arrival date in original data (left) and correlated synthetic data (right)*
555 | ![Correlated mode arrival date histograms](plots/correlated_Arrival_Date.png)
556 | 
557 | #### Compare pairwise mutual information: Correlated
558 | 
559 | Finally, we can see that in correlated mode we manage to capture the correlation between `Age bracket` and `Time in A&E (mins)`.
560 | 
561 | ```python
562 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_correlated)
563 | attribute_description = read_json_file(filepaths.hospital_ae_description_correlated)['attribute_description']
564 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
565 | inspector.mutual_information_heatmap(figure_filepath)
566 | ```
567 | 
568 | *Mutual Information Heatmap in original data (left) and correlated synthetic data (right)*
569 | ![Correlated mode mutual information heatmap](plots/mutual_information_heatmap_correlated.png)
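As with the earlier stages, the whole synthesis walkthrough is scripted in `tutorial/synthesise.py`, so running it from the project root should reproduce all three datasets and the plots in one go:

```bash
python tutorial/synthesise.py
```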
570 | 
571 | ---
572 | 
573 | ### Wrap-up
574 | 
575 | This is where our tutorial ends. But there is much, much more to the world of anonymisation and synthetic data. Please check out more in the references below.
576 | 
577 | If you have any queries, comments or improvements about this tutorial, please do get in touch. You can send me a message through Github or open an Issue.
578 | 
579 | ### References
580 | 
581 | - [Exploring methods for synthetic A&E data](https://odileeds.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data) - Jonathan Pearson, NHS England, with Open Data Institute Leeds.
582 | - [DataSynthesizer Github repository](https://github.com/DataResponsibly/DataSynthesizer)
583 | - [DataSynthesizer: Privacy-Preserving Synthetic Datasets](https://faculty.washington.edu/billhowe/publications/pdfs/ping17datasynthesizer.pdf) - Haoyue Ping, Julia Stoyanovich and Bill Howe, 2017.
584 | - [ONS methodology working paper series number 16 - Synthetic data pilot](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot) - Office for National Statistics, 2019.
585 | - [Wrap-up blog post](http://theodi.org) (not yet published) from our anonymisation project, which talks about what we learned and other outputs we created.
586 | - We referred to the [UK Anonymisation Network's Decision Making Framework](https://ukanon.net/ukan-resources/ukan-decision-making-framework/) a lot during our work. It's pretty involved, but it's excellent as a deep-dive resource on anonymisation.
587 | 
--------------------------------------------------------------------------------