├── tutorial
│   ├── __init__.py
│   ├── filepaths.py
│   ├── deidentify.py
│   ├── generate.py
│   └── synthesise.py
├── DataSynthesizer
│   ├── __init__.py
│   ├── lib
│   │   ├── __init__.py
│   │   ├── utils.py
│   │   └── PrivBayes.py
│   ├── datatypes
│   │   ├── __init__.py
│   │   ├── utils
│   │   │   ├── __init__.py
│   │   │   ├── DataType.py
│   │   │   └── AttributeLoader.py
│   │   ├── FloatAttribute.py
│   │   ├── IntegerAttribute.py
│   │   ├── SocialSecurityNumberAttribute.py
│   │   ├── StringAttribute.py
│   │   ├── DateTimeAttribute.py
│   │   └── AbstractAttribute.py
│   ├── README.md
│   ├── ModelInspector.py
│   ├── DataGenerator.py
│   └── DataDescriber.py
├── .gitignore
├── data
│   ├── nhs_ae_gender_codes.csv
│   ├── hospitals_london.txt
│   ├── nhs_ae_treatment_codes.csv
│   ├── hospital_ae_description_random.json
│   └── hospital_ae_description_independent.json
├── requirements.txt
├── plots
│   ├── random_Gender.png
│   ├── random_Treatment.png
│   ├── correlated_Gender.png
│   ├── independent_Gender.png
│   ├── random_Age_bracket.png
│   ├── random_Hospital_ID.png
│   ├── correlated_Treatment.png
│   ├── independent_Treatment.png
│   ├── random_Arrival_Date.png
│   ├── correlated_Age_bracket.png
│   ├── correlated_Arrival_Date.png
│   ├── correlated_Hospital_ID.png
│   ├── independent_Age_bracket.png
│   ├── independent_Hospital_ID.png
│   ├── independent_Arrival_Date.png
│   ├── random_Arrival_hour_range.png
│   ├── random_Time_in_A&E_(mins).png
│   ├── correlated_Arrival_hour_range.png
│   ├── correlated_Time_in_A&E_(mins).png
│   ├── independent_Arrival_hour_range.png
│   ├── independent_Time_in_A&E_(mins).png
│   ├── mutual_information_heatmap_random.png
│   ├── mutual_information_heatmap_correlated.png
│   ├── mutual_information_heatmap_independent.png
│   ├── random_Index_of_Multiple_Deprivation_Decile.png
│   ├── correlated_Index_of_Multiple_Deprivation_Decile.png
│   └── independent_Index_of_Multiple_Deprivation_Decile.png
├── LICENSE
└── README.md

/tutorial/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/lib/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/utils/__init__.py:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.venv/*
*.pyc
data/London postcodes.csv
--------------------------------------------------------------------------------
/data/nhs_ae_gender_codes.csv:
--------------------------------------------------------------------------------
Gender,Code
Not Known,0
Male,1
Female,2
Not Specified,9
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pandas==1.4.2
scipy==1.8.0
scikit-learn==1.0.2
matplotlib==3.5.1
seaborn==0.11.2
--------------------------------------------------------------------------------
/plots/*.png:
--------------------------------------------------------------------------------
27 binary plot images: per-attribute histogram comparisons for the random,
independent and correlated synthesis modes (Gender, Treatment, Age_bracket,
Hospital_ID, Arrival_Date, Arrival_hour_range, Time_in_A&E_(mins) and
Index_of_Multiple_Deprivation_Decile), plus the three
mutual_information_heatmap_* plots. The dump contains only the raw
githubusercontent URLs for these files, omitted here.
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/utils/DataType.py:
--------------------------------------------------------------------------------
from enum import Enum


class DataType(Enum):
    INTEGER = 'Integer'
    FLOAT = 'Float'
    STRING = 'String'
    DATETIME = 'DateTime'
    SOCIAL_SECURITY_NUMBER = 'SocialSecurityNumber'
--------------------------------------------------------------------------------
/data/hospitals_london.txt:
--------------------------------------------------------------------------------
Barnet Hospital
Charing Cross Hospital
Chase Farm Hospital
Chelsea and Westminster Hospital
Croydon University Hospital
Ealing Hospital
Epsom General Hospital
Hillingdon Hospital
Homerton University Hospital
King's College Hospital
Kingston Hospital
Newham General Hospital
North Middlesex Hospital
Northwick Park & St Marks Hospital
Princess Royal University Hospital
Queen Elizabeth Hospital
Queen's Hospital
Royal London Hospital
St Mary's Hospital
St Thomas' Hospital
The Royal Free Hospital
University College Hospital
University Hospital Lewisham
West Middlesex University Hospital
Whipps Cross University Hospital
The Whittington Hospital
--------------------------------------------------------------------------------
/data/nhs_ae_treatment_codes.csv:
--------------------------------------------------------------------------------
Treatment,Code
Dressing,01
Bandage/support,02
Sutures,03
Wound closure (excluding sutures),04
Plaster of Paris,05
Splint,06
Removal foreign body,08
Physiotherapy,09
Incision & drainage,11
Central line,13
Chest drain,16
Urinary catheter/suprapubic,17
Defibrillation/pacing,18
Resuscitation/cardiopulmonary resuscitation,19
Minor surgery,20
Guidance/advice only,22
Anaesthesia,23
Tetanus,24
Nebuliser/spacer,25
Recording vital signs,30
Burns review,31
Fracture review,33
Wound cleaning,34
Dressing/wound review,35
Sling/collar cuff/broad arm sling,36
Nasal airway,38
Oral airway,39
Arterial line,42
Infusion fluids,43
Blood product transfusion,44
Lumbar puncture,46
Joint aspiration,47
Occupational Therapy,52
Social work intervention,54
Eye,55
Dental treatment,56
Prescription/medicines prepared to take away,57
Other (consider alternatives),27
None (consider guidance/advice option),99
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/FloatAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType


class FloatAttribute(AbstractAttribute):
    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = True
        self.data_type = DataType.FLOAT

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        super().infer_domain(categorical_domain, numerical_range)

    def infer_distribution(self):
        super().infer_distribution()

    def generate_values_as_candidate_key(self, n):
        return np.arange(self.min, self.max, (self.max - self.min) / n)

    def sample_values_from_binning_indices(self, binning_indices):
        return super().sample_values_from_binning_indices(binning_indices)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 The Open Data Institute

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/IntegerAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType


class IntegerAttribute(AbstractAttribute):
    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = True
        self.data_type = DataType.INTEGER

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        super().infer_domain(categorical_domain, numerical_range)
        self.min = int(self.min)
        self.max = int(self.max)

    def infer_distribution(self):
        super().infer_distribution()

    def generate_values_as_candidate_key(self, n):
        return super().generate_values_as_candidate_key(n)

    def sample_values_from_binning_indices(self, binning_indices):
        column = super().sample_values_from_binning_indices(binning_indices)
        column[~column.isnull()] = column[~column.isnull()].astype(int)
        return column
--------------------------------------------------------------------------------
/DataSynthesizer/README.md:
--------------------------------------------------------------------------------
# DataSynthesizer

All code in this directory is from the open-source [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) project.

You can read more on the project and related papers at the [Privacy-Preserving Synthetic Data project page](https://homes.cs.washington.edu/~billhowe//projects/2017/07/20/Data-Synthesizer.html).

## Usage

DataSynthesizer generates a synthetic dataset from a sensitive one for public release. It is developed in Python 3.6 and requires some third-party modules: numpy, scipy, pandas, and dateutil.

## License

Copyright <2018>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
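The usual DataSynthesizer workflow is describe-then-generate. A minimal sketch of that pipeline follows; the DataGenerator and ModelInspector calls match the code shown later in this dump, but DataDescriber.py is not included in this excerpt, so its constructor argument and method names here follow the upstream project and should be treated as assumptions:

```python
import pandas as pd

from DataDescriber import DataDescriber      # API assumed from upstream DataSynthesizer
from DataGenerator import DataGenerator
from ModelInspector import ModelInspector
from lib.utils import read_json_file

# 1. Describe the sensitive dataset; correlated attribute mode learns a
#    differentially private Bayesian network over the attributes.
describer = DataDescriber(category_threshold=20)           # assumed argument name
describer.describe_dataset_in_correlated_attribute_mode(   # assumed method name
    dataset_file='sensitive.csv', epsilon=1.0, k=2)
describer.save_dataset_description_to_file('description.json')

# 2. Generate synthetic rows from the saved description.
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(10000, 'description.json')
generator.save_synthetic_data('synthetic.csv')

# 3. Compare the sensitive and synthetic datasets attribute by attribute.
description = read_json_file('description.json')
inspector = ModelInspector(pd.read_csv('sensitive.csv'), pd.read_csv('synthetic.csv'),
                           description['attribute_description'])
inspector.mutual_information_heatmap('heatmap.png')
```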
/tutorial/filepaths.py:
--------------------------------------------------------------------------------
import os
import sys
from pathlib import Path

this_filepath = Path(os.path.realpath(__file__))
project_root = str(this_filepath.parents[1])

data_dir = os.path.join(project_root, 'data/')

# add the DataSynthesizer repo to the pythonpath
data_synthesizer_dir = os.path.join(project_root, 'DataSynthesizer/')
sys.path.append(data_synthesizer_dir)

plots_dir = os.path.join(project_root, 'plots/')

postcodes_london = os.path.join(data_dir, 'London postcodes.csv')
hospitals_london = os.path.join(data_dir, 'hospitals_london.txt')
nhs_ae_gender_codes = os.path.join(data_dir, 'nhs_ae_gender_codes.csv')
nhs_ae_treatment_codes = os.path.join(data_dir, 'nhs_ae_treatment_codes.csv')
age_population_london = os.path.join(data_dir, 'age_population_london.csv')

hospital_ae_data = os.path.join(data_dir, 'hospital_ae_data.csv')
hospital_ae_data_deidentify = os.path.join(data_dir, 'hospital_ae_data_deidentify.csv')

hospital_ae_data_synthetic_random = os.path.join(
    data_dir, 'hospital_ae_data_synthetic_random.csv')
hospital_ae_data_synthetic_independent = os.path.join(
    data_dir, 'hospital_ae_data_synthetic_independent.csv')
hospital_ae_data_synthetic_correlated = os.path.join(
    data_dir, 'hospital_ae_data_synthetic_correlated.csv')

hospital_ae_description_random = os.path.join(
    data_dir, 'hospital_ae_description_random.json')
hospital_ae_description_independent = os.path.join(
    data_dir, 'hospital_ae_description_independent.json')
hospital_ae_description_correlated = os.path.join(
    data_dir, 'hospital_ae_description_correlated.json')
--------------------------------------------------------------------------------
/DataSynthesizer/datatypes/utils/AttributeLoader.py:
--------------------------------------------------------------------------------
from pandas import Series

from datatypes.DateTimeAttribute import DateTimeAttribute
from datatypes.FloatAttribute import FloatAttribute
from datatypes.IntegerAttribute import IntegerAttribute
from datatypes.SocialSecurityNumberAttribute import SocialSecurityNumberAttribute
from datatypes.StringAttribute import StringAttribute
from datatypes.utils.DataType import DataType


def parse_json(attribute_in_json):
    name = attribute_in_json['name']
    data_type = DataType(attribute_in_json['data_type'])
    is_candidate_key = attribute_in_json['is_candidate_key']
    is_categorical = attribute_in_json['is_categorical']
    histogram_size = len(attribute_in_json['distribution_bins'])
    if data_type is DataType.INTEGER:
        attribute = IntegerAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.FLOAT:
        attribute = FloatAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.DATETIME:
        attribute = DateTimeAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.STRING:
        attribute = StringAttribute(name, is_candidate_key, is_categorical, histogram_size, Series())
    elif data_type is DataType.SOCIAL_SECURITY_NUMBER:
        attribute = SocialSecurityNumberAttribute(name, is_candidate_key,
                                                  is_categorical, histogram_size, Series())
    else:
        raise Exception('Data type {} is unknown.'.format(data_type.value))

    attribute.missing_rate = attribute_in_json['missing_rate']
    attribute.min = attribute_in_json['min']
    attribute.max = attribute_in_json['max']
    attribute.distribution_bins = attribute_in_json['distribution_bins']
    attribute.distribution_probabilities = attribute_in_json['distribution_probabilities']

    return attribute
--------------------------------------------------------------------------------
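parse_json rebuilds an attribute object purely from its JSON description, so synthetic values can be sampled without ever touching the sensitive data again. A minimal sketch with a hand-written categorical description (all values illustrative; it assumes the DataSynthesizer directory is on sys.path, as tutorial/filepaths.py arranges):

```python
from datatypes.utils.AttributeLoader import parse_json

gender_description = {
    'name': 'Gender',
    'data_type': 'String',
    'is_candidate_key': False,
    'is_categorical': True,
    'missing_rate': 0.0,
    'min': 4,                      # for strings, min/max are string lengths
    'max': 6,
    'distribution_bins': ['Female', 'Male'],
    'distribution_probabilities': [0.5, 0.5],
}

attribute = parse_json(gender_description)
indices = attribute.sample_binning_indices_in_independent_attribute_mode(5)
print(attribute.sample_values_from_binning_indices(indices).tolist())
# e.g. ['Male', 'Female', 'Female', 'Male', 'Female']
```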
/DataSynthesizer/datatypes/SocialSecurityNumberAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType


def pre_process(column: Series):
    if column.size == 0:
        return column
    elif type(column.iloc[0]) is int:
        return column
    elif type(column.iloc[0]) is str:
        return column.map(lambda x: int(x.replace('-', '')))
    else:
        raise Exception('Invalid SocialSecurityNumber.')


def is_ssn(value):
    """Test whether a number is between 0 and 1e9.

    Note this function does not take into consideration some special numbers that are never allocated.
    https://en.wikipedia.org/wiki/Social_Security_number
    """
    if type(value) is int:
        return 0 < value < 1e9
    elif type(value) is str:
        value = value.replace('-', '')
        if value.isdigit():
            return 0 < int(value) < 1e9
    return False


class SocialSecurityNumberAttribute(AbstractAttribute):
    """SocialSecurityNumber of format AAA-GG-SSSS.

    """

    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, pre_process(data))
        self.is_numerical = True
        self.data_type = DataType.SOCIAL_SECURITY_NUMBER

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        super().infer_domain(categorical_domain, numerical_range)
        self.min = int(self.min)
        self.max = int(self.max)

    def infer_distribution(self):
        super().infer_distribution()

    def generate_values_as_candidate_key(self, n):
        if n < 1e9:
            values = np.linspace(0, 1e9 - 1, num=n, dtype=int)
            values = np.random.permutation(values)
            values = [str(i).zfill(9) for i in values]
            return ['{}-{}-{}'.format(i[:3], i[3:5], i[5:]) for i in values]
        else:
            raise Exception('The candidate key "{}" cannot generate more than 1e9 distinct values.'.format(self.name))

    def sample_values_from_binning_indices(self, binning_indices):
        return super().sample_values_from_binning_indices(binning_indices)
--------------------------------------------------------------------------------
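A quick illustration of the helpers above (output values illustrative, assuming `from pandas import Series` and the class in scope):

```python
print(is_ssn('078-05-1120'))   # True  - digits with dashes, between 0 and 1e9
print(is_ssn('Monday'))        # False - not a digit string

ssn = SocialSecurityNumberAttribute('SSN', True, False, 20,
                                    Series(['078-05-1120', '219-09-9999']))
print(ssn.generate_values_as_candidate_key(3))
# e.g. ['499-99-9999', '000-00-0000', '999-99-9999']
# evenly spaced integers, shuffled, zero-padded into AAA-GG-SSSS form
```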
/DataSynthesizer/lib/utils.py:
--------------------------------------------------------------------------------
import json
import random
from string import ascii_lowercase

import numpy as np
from pandas import Series, DataFrame
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score


def set_random_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)


def mutual_information(labels_x: Series, labels_y: DataFrame):
    """Mutual information of distributions in format of Series or DataFrame.

    Parameters
    ----------
    labels_x : Series
    labels_y : DataFrame
    """
    if labels_y.shape[1] == 1:
        labels_y = labels_y.iloc[:, 0]
    else:
        # Series.get_values() was removed from pandas; join the row values instead
        labels_y = labels_y.apply(lambda x: ' '.join(map(str, x.values)), axis=1)

    return mutual_info_score(labels_x, labels_y)


def pairwise_attributes_mutual_information(dataset):
    """Compute normalized mutual information for all pairwise attributes. Return a DataFrame."""
    sorted_columns = sorted(dataset.columns)
    mi_df = DataFrame(columns=sorted_columns, index=sorted_columns, dtype=float)
    for row in mi_df.columns:
        for col in mi_df.columns:
            mi_df.loc[row, col] = normalized_mutual_info_score(dataset[row].astype(str),
                                                               dataset[col].astype(str),
                                                               average_method='arithmetic')
    return mi_df


def normalize_given_distribution(frequencies):
    distribution = np.array(frequencies, dtype=float)
    distribution = distribution.clip(0)  # replace negative values with 0
    summation = distribution.sum()
    if summation > 0:
        return distribution / summation
    else:
        return np.full_like(distribution, 1 / distribution.size)


def read_json_file(json_file):
    with open(json_file, 'r') as file:
        return json.load(file)


def infer_numerical_attributes_in_dataframe(dataframe):
    describe = dataframe.describe()
    # DataFrame.describe() usually returns 8 rows.
    if describe.shape[0] == 8:
        return set(describe.columns)
    # DataFrame.describe() returns less than 8 rows when there is no numerical attribute.
    else:
        return set()


def display_bayesian_network(bn):
    length = 0
    for child, _ in bn:
        if len(child) > length:
            length = len(child)

    print('Constructed Bayesian network:')
    for child, parents in bn:
        print("    {0:{width}} has parents {1}.".format(child, parents, width=length))


def generate_random_string(length):
    return ''.join(np.random.choice(list(ascii_lowercase), size=length))
--------------------------------------------------------------------------------
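pairwise_attributes_mutual_information is what drives the heatmap comparisons later: a normalized score of 1 means one column fully determines the other, 0 means they are independent. A toy run (illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': ['x', 'x', 'y', 'y'],   # b is a relabelling of a
                   'c': [1, 2, 1, 2]})          # c is independent of a and b
print(pairwise_attributes_mutual_information(df))
#      a    b    c
# a  1.0  1.0  0.0
# b  1.0  1.0  0.0
# c  0.0  0.0  1.0

print(normalize_given_distribution([2, -1, 2]))  # negatives clipped: [0.5 0.  0.5]
```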
/DataSynthesizer/datatypes/StringAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType
from lib import utils


class StringAttribute(AbstractAttribute):
    """Variable min and max are the lengths of the shortest and longest strings.

    """

    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = False
        self.data_type = DataType.STRING
        self.data_dropna_len = self.data_dropna.astype(str).map(len)

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        if categorical_domain:
            lengths = [len(i) for i in categorical_domain]
            self.min = min(lengths)
            self.max = max(lengths)
            self.distribution_bins = np.array(categorical_domain)
        else:
            self.min = int(self.data_dropna_len.min())
            self.max = int(self.data_dropna_len.max())
            if self.is_categorical:
                self.distribution_bins = self.data_dropna.unique()
            else:
                self.distribution_bins = np.array([self.min, self.max])

        # np.full rather than np.full_like: the bins hold strings or ints, and
        # full_like would coerce these float probabilities to that dtype
        self.distribution_probabilities = np.full(self.distribution_bins.size, 1 / self.distribution_bins.size)

    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.data_dropna_len, bins=self.histogram_size)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])
            bins = distribution[1][:-1]
            bins[0] = bins[0] - 0.001 * (bins[1] - bins[0])
            self.distribution_bins = bins

    def generate_values_as_candidate_key(self, n):
        length = np.random.randint(self.min, self.max)
        vectorized = np.vectorize(lambda x: '{}{}'.format(utils.generate_random_string(length), x))
        return vectorized(np.arange(n))

    def sample_values_from_binning_indices(self, binning_indices):
        column = super().sample_values_from_binning_indices(binning_indices)
        if not self.is_categorical:
            column[~column.isnull()] = column[~column.isnull()].apply(lambda x: utils.generate_random_string(int(x)))

        return column
--------------------------------------------------------------------------------
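Categorical strings are described by their observed value counts, while non-categorical strings are reduced to a histogram over string lengths and synthesised as random strings of a sampled length. A toy categorical run (illustrative):

```python
from pandas import Series

colours = StringAttribute('Colour', False, True, 20,
                          Series(['red', 'red', 'blue', 'green']))
colours.infer_domain()
colours.infer_distribution()
print(colours.distribution_bins)            # ['blue' 'green' 'red']  (sorted)
print(colours.distribution_probabilities)   # [0.25 0.25 0.5]
```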
/DataSynthesizer/datatypes/DateTimeAttribute.py:
--------------------------------------------------------------------------------
from typing import Union

import numpy as np
from dateutil.parser import parse
from pandas import Series

from datatypes.AbstractAttribute import AbstractAttribute
from datatypes.utils.DataType import DataType
from lib.utils import normalize_given_distribution


def is_datetime(value: str):
    """Find whether a value is a datetime. Here weekdays and months are categorical values instead of datetime."""
    weekdays = {'mon', 'monday', 'tue', 'tuesday', 'wed', 'wednesday', 'thu', 'thursday', 'fri', 'friday',
                'sat', 'saturday', 'sun', 'sunday'}
    months = {'jan', 'january', 'feb', 'february', 'mar', 'march', 'apr', 'april', 'may', 'jun', 'june',
              'jul', 'july', 'aug', 'august', 'sep', 'sept', 'september', 'oct', 'october', 'nov', 'november',
              'dec', 'december'}

    value_lower = value.lower()
    if (value_lower in weekdays) or (value_lower in months):
        return False
    try:
        parse(value)
        return True
    except ValueError:
        return False


# TODO detect datetime formats
class DateTimeAttribute(AbstractAttribute):
    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        super().__init__(name, is_candidate_key, is_categorical, histogram_size, data)
        self.is_numerical = True
        self.data_type = DataType.DATETIME
        epoch_datetime = parse('1970-01-01')
        self.timestamps = self.data_dropna.map(lambda x: int((parse(x) - epoch_datetime).total_seconds()))

    def infer_domain(self, categorical_domain=None, numerical_range=None):
        if numerical_range:
            self.min, self.max = numerical_range
            self.distribution_bins = np.array([self.min, self.max])
        else:
            # the domain is measured in epoch seconds, so take min/max over the
            # parsed timestamps rather than over the raw datetime strings
            self.min = float(self.timestamps.min())
            self.max = float(self.timestamps.max())
            if self.is_categorical:
                self.distribution_bins = self.data_dropna.unique()
            else:
                self.distribution_bins = np.array([self.min, self.max])

        # np.full rather than np.full_like, so the probabilities keep a float dtype
        self.distribution_probabilities = np.full(self.distribution_bins.size, 1 / self.distribution_bins.size)

    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.timestamps, bins=self.histogram_size, range=(self.min, self.max))
            self.distribution_probabilities = normalize_given_distribution(distribution[0])
            bins = distribution[1][:-1]
            bins[0] = bins[0] - 0.001 * (bins[1] - bins[0])
            self.distribution_bins = bins

    def generate_values_as_candidate_key(self, n):
        return np.arange(self.min, self.max, (self.max - self.min) / n)

    def sample_values_from_binning_indices(self, binning_indices):
        column = super().sample_values_from_binning_indices(binning_indices)
        column[~column.isnull()] = column[~column.isnull()].astype(int)
        return column
--------------------------------------------------------------------------------
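is_datetime delegates to dateutil but deliberately rejects bare weekday and month names, so those are treated as ordinary categories. For instance:

```python
print(is_datetime('2019-03-04 21:17:00'))  # True  - dateutil can parse it
print(is_datetime('Friday'))               # False - weekdays stay categorical
print(is_datetime('March'))                # False - months likewise
```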
/DataSynthesizer/ModelInspector.py:
--------------------------------------------------------------------------------
from typing import List

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from lib.utils import pairwise_attributes_mutual_information, normalize_given_distribution

matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

sns.set()


class ModelInspector(object):
    def __init__(self, private_df: pd.DataFrame, synthetic_df: pd.DataFrame, attribute_description):
        self.private_df = private_df
        self.synthetic_df = synthetic_df
        self.attribute_description = attribute_description

        self.candidate_keys = set()
        for attr in synthetic_df:
            if synthetic_df[attr].unique().size == synthetic_df.shape[0]:
                self.candidate_keys.add(attr)

        self.private_df.drop(columns=self.candidate_keys, inplace=True)
        self.synthetic_df.drop(columns=self.candidate_keys, inplace=True)

    def compare_histograms(self, attribute, figure_filepath):
        datatype = self.attribute_description[attribute]['data_type']
        is_categorical = self.attribute_description[attribute]['is_categorical']

        # ignore datetime attributes, since they are converted into timestamps
        if datatype == 'DateTime':
            return
        # ignore non-categorical string attributes
        elif datatype == 'String' and not is_categorical:
            return
        elif attribute in self.candidate_keys:
            return
        else:
            fig = plt.figure(figsize=(25, 12), dpi=120)
            ax1 = fig.add_subplot(121)
            ax2 = fig.add_subplot(122)

            if is_categorical:
                dist_priv = self.private_df[attribute].value_counts()
                dist_synt = self.synthetic_df[attribute].value_counts()
                # Series.iteritems() is deprecated; items() is the stable spelling
                for idx, number in dist_priv.items():
                    if idx not in dist_synt.index:
                        dist_synt.loc[idx] = 0
                for idx, number in dist_synt.items():
                    if idx not in dist_priv.index:
                        dist_priv.loc[idx] = 0
                dist_priv.index = [str(i) for i in dist_priv.index]
                dist_synt.index = [str(i) for i in dist_synt.index]
                dist_priv.sort_index(inplace=True)
                dist_synt.sort_index(inplace=True)
                pos_priv = list(range(len(dist_priv)))
                pos_synt = list(range(len(dist_synt)))
                ax1.bar(pos_priv, normalize_given_distribution(dist_priv.values), align='center', width=0.8)
                ax2.bar(pos_synt, normalize_given_distribution(dist_synt.values), align='center', width=0.8)
                ax1.set_xticks(np.arange(min(pos_priv), max(pos_priv) + 1, 1.0))
                ax2.set_xticks(np.arange(min(pos_synt), max(pos_synt) + 1, 1.0))
                ax1.set_xticklabels(dist_priv.index.tolist(), fontsize=10)
                ax2.set_xticklabels(dist_synt.index.tolist(), fontsize=10)
            # the rest are non-categorical numeric attributes.
            else:
                ax1.hist(self.private_df[attribute].dropna(), bins=15, align='left', density=True)
                ax2.hist(self.synthetic_df[attribute].dropna(), bins=15, align='left', density=True)

            ax1_x_min, ax1_x_max = ax1.get_xlim()
            ax2_x_min, ax2_x_max = ax2.get_xlim()
            ax1_y_min, ax1_y_max = ax1.get_ylim()
            ax2_y_min, ax2_y_max = ax2.get_ylim()
            x_min = min(ax1_x_min, ax2_x_min)
            x_max = max(ax1_x_max, ax2_x_max)
            y_min = min(ax1_y_min, ax2_y_min)
            y_max = max(ax1_y_max, ax2_y_max)
            ax1.set_xlim([x_min, x_max])
            ax1.set_ylim([y_min, y_max])
            ax2.set_xlim([x_min, x_max])
            ax2.set_ylim([y_min, y_max])
            fig.autofmt_xdate()

            plt.savefig(figure_filepath, bbox_inches='tight')
            plt.close()

    def mutual_information_heatmap(self, figure_filepath, attributes: List = None):
        if attributes:
            private_df = self.private_df[attributes]
            synthetic_df = self.synthetic_df[attributes]
        else:
            private_df = self.private_df
            synthetic_df = self.synthetic_df

        private_mi = pairwise_attributes_mutual_information(private_df)
        synthetic_mi = pairwise_attributes_mutual_information(synthetic_df)

        fig = plt.figure(figsize=(15, 6), dpi=120)
        fig.suptitle('Pairwise Mutual Information Comparison (Private vs Synthetic)', fontsize=20)
        ax1 = fig.add_subplot(121)
        ax2 = fig.add_subplot(122)
        sns.heatmap(private_mi, ax=ax1, cmap="GnBu")
        sns.heatmap(synthetic_mi, ax=ax2, cmap="GnBu")
        ax1.set_title('Private, max=1', fontsize=15)
        ax2.set_title('Synthetic, max=1', fontsize=15)
        fig.autofmt_xdate()
        fig.tight_layout()
        plt.subplots_adjust(top=0.83)

        plt.savefig(figure_filepath, bbox_inches='tight')
        plt.close()


if __name__ == '__main__':
    # Directories of input and output files
    input_dataset_file = '../datasets/AdultIncomeData/adult.csv'
    dataset_description_file = '../output/description/AdultIncomeData_description.txt'
    synthetic_dataset_file = '../output/synthetic_data/AdultIncomeData_synthetic.csv'

    df = pd.read_csv(input_dataset_file)
    print(df.head(5))
--------------------------------------------------------------------------------
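Usage mirrors the tutorial's inspection step; this sketch reuses file names that appear elsewhere in the repository (the ModelInspector API is as shown above):

```python
import pandas as pd
from lib.utils import read_json_file

description = read_json_file('data/hospital_ae_description_random.json')
inspector = ModelInspector(
    pd.read_csv('data/hospital_ae_data_deidentify.csv'),
    pd.read_csv('data/hospital_ae_data_synthetic_random.csv'),
    description['attribute_description'])

inspector.compare_histograms('Gender', 'plots/random_Gender.png')
inspector.mutual_information_heatmap('plots/mutual_information_heatmap_random.png')
```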
/DataSynthesizer/datatypes/AbstractAttribute.py:
--------------------------------------------------------------------------------
from abc import ABCMeta, abstractmethod
from bisect import bisect_right
from random import uniform
from typing import List, Union

import numpy as np
from numpy.random import choice
from pandas import Series

from datatypes.utils import DataType
from lib import utils


class AbstractAttribute(object):
    __metaclass__ = ABCMeta

    def __init__(self, name: str, is_candidate_key, is_categorical, histogram_size: Union[int, str], data: Series):
        self.name = name
        self.is_candidate_key = is_candidate_key
        self.is_categorical = is_categorical
        self.histogram_size: Union[int, str] = histogram_size
        self.data: Series = data
        self.data_dropna: Series = self.data.dropna()
        self.missing_rate: float = (self.data.size - self.data_dropna.size) / (self.data.size or 1)

        self.is_numerical: bool = None
        self.data_type: DataType = None
        self.min = None
        self.max = None
        self.distribution_bins: np.ndarray = None
        self.distribution_probabilities: np.ndarray = None

    @abstractmethod
    def infer_domain(self, categorical_domain: List = None, numerical_range: List = None):
        """Infer the attribute's domain: min, max and an initial 1-D distribution over its bins.

        """
        if categorical_domain:
            self.min = min(categorical_domain)
            self.max = max(categorical_domain)
            self.distribution_bins = np.array(categorical_domain)
        elif numerical_range:
            self.min, self.max = numerical_range
            self.distribution_bins = np.array([self.min, self.max])
        else:
            self.min = float(self.data_dropna.min())
            self.max = float(self.data_dropna.max())
            if self.is_categorical:
                self.distribution_bins = self.data_dropna.unique()
            else:
                self.distribution_bins = np.array([self.min, self.max])

        # np.full rather than np.full_like: the bins may have an int or str dtype,
        # which would truncate these float probabilities
        self.distribution_probabilities = np.full(self.distribution_bins.size, 1 / self.distribution_bins.size)

    @abstractmethod
    def infer_distribution(self):
        if self.is_categorical:
            distribution = self.data_dropna.value_counts()
            for value in set(self.distribution_bins) - set(distribution.index):
                distribution[value] = 0
            distribution.sort_index(inplace=True)
            self.distribution_probabilities = utils.normalize_given_distribution(distribution)
            self.distribution_bins = np.array(distribution.index)
        else:
            distribution = np.histogram(self.data_dropna, bins=self.histogram_size, range=(self.min, self.max))
            self.distribution_bins = distribution[1][:-1]  # Remove the last bin edge
            self.distribution_probabilities = utils.normalize_given_distribution(distribution[0])

    def inject_laplace_noise(self, epsilon=0.1, num_valid_attributes=10):
        if epsilon > 0:
            noisy_scale = num_valid_attributes / (epsilon * self.data.size)
            laplace_noises = np.random.laplace(0, scale=noisy_scale, size=len(self.distribution_probabilities))
            noisy_distribution = self.distribution_probabilities + laplace_noises
            self.distribution_probabilities = utils.normalize_given_distribution(noisy_distribution)

    def encode_values_into_bin_idx(self):
        """Encode values into bin indices for Bayesian Network construction.

        """
        if self.is_categorical:
            value_to_bin_idx = {value: idx for idx, value in enumerate(self.distribution_bins)}
            encoded = self.data.map(lambda x: value_to_bin_idx[x], na_action='ignore')
        else:
            encoded = self.data.map(lambda x: bisect_right(self.distribution_bins, x) - 1, na_action='ignore')

        encoded.fillna(len(self.distribution_bins), inplace=True)
        return encoded.astype(int, copy=False)

    def to_json(self):
        """Encode attribute information in JSON format / Python dictionary.

        """
        return {"name": self.name,
                "data_type": self.data_type.value,
                "is_categorical": self.is_categorical,
                "is_candidate_key": self.is_candidate_key,
                "min": self.min,
                "max": self.max,
                "missing_rate": self.missing_rate,
                "distribution_bins": self.distribution_bins.tolist(),
                "distribution_probabilities": self.distribution_probabilities.tolist()}

    @abstractmethod
    def generate_values_as_candidate_key(self, n):
        """When attribute should be a candidate key in output dataset.

        """
        return np.arange(n)

    def sample_binning_indices_in_independent_attribute_mode(self, n):
        """Sample an array of binning indices.

        """
        return Series(choice(len(self.distribution_probabilities), size=n, p=self.distribution_probabilities))

    @abstractmethod
    def sample_values_from_binning_indices(self, binning_indices):
        """Convert binning indices into values in domain. Used by both independent and correlated attribute mode.

        """
        return binning_indices.apply(lambda x: self.uniform_sampling_within_a_bin(x))

    def uniform_sampling_within_a_bin(self, bin_idx: int):
        num_bins = len(self.distribution_bins)
        if bin_idx == num_bins:
            return np.nan
        elif self.is_categorical:
            return self.distribution_bins[bin_idx]
        elif bin_idx < num_bins - 1:
            return uniform(self.distribution_bins[bin_idx], self.distribution_bins[bin_idx + 1])
        else:
            # sample from the last interval where the right edge is missing in self.distribution_bins
            neg_2, neg_1 = self.distribution_bins[-2:]
            return uniform(neg_1, 2 * neg_1 - neg_2)
--------------------------------------------------------------------------------
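The encode/sample pair is the core of the whole pipeline: encode_values_into_bin_idx turns raw values into bin indices for the Bayesian network, and sample_values_from_binning_indices maps sampled indices back into the attribute's domain. A toy round trip using the concrete IntegerAttribute subclass (values illustrative):

```python
from pandas import Series
from datatypes.IntegerAttribute import IntegerAttribute

ages = IntegerAttribute('Age', False, False, 4, Series([18, 25, 31, 44, 60, 73]))
ages.infer_domain()        # min=18, max=73
ages.infer_distribution()  # 4 histogram bins with left edges 18, 31.75, 45.5, 59.25

encoded = ages.encode_values_into_bin_idx()
print(encoded.tolist())    # [0, 0, 0, 1, 3, 3]
decoded = ages.sample_values_from_binning_indices(encoded)
print(decoded.tolist())    # integers drawn uniformly within each sampled bin
```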
/DataSynthesizer/DataGenerator.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd

from datatypes.utils.AttributeLoader import parse_json
from lib.utils import set_random_seed, read_json_file, generate_random_string


class DataGenerator(object):
    def __init__(self):
        self.n = 0
        self.synthetic_dataset = None
        self.description = {}
        self.encoded_dataset = None

    def generate_dataset_in_random_mode(self, n, description_file, seed=0, minimum=0, maximum=100):
        set_random_seed(seed)
        description = read_json_file(description_file)

        self.synthetic_dataset = pd.DataFrame()
        for attr in description['attribute_description'].keys():
            attr_info = description['attribute_description'][attr]
            datatype = attr_info['data_type']
            is_categorical = attr_info['is_categorical']
            is_candidate_key = attr_info['is_candidate_key']
            if is_candidate_key:
                self.synthetic_dataset[attr] = parse_json(attr_info).generate_values_as_candidate_key(n)
            elif is_categorical:
                self.synthetic_dataset[attr] = np.random.choice(attr_info['distribution_bins'], n)
            elif datatype == 'String':
                length = np.random.randint(attr_info['min'], attr_info['max'])
                self.synthetic_dataset[attr] = length
                self.synthetic_dataset[attr] = self.synthetic_dataset[attr].map(lambda x: generate_random_string(x))
            else:
                if datatype == 'Integer':
                    self.synthetic_dataset[attr] = np.random.randint(minimum, maximum + 1, n)
                else:
                    self.synthetic_dataset[attr] = np.random.uniform(minimum, maximum, n)

    def generate_dataset_in_independent_mode(self, n, description_file, seed=0):
        set_random_seed(seed)
        self.description = read_json_file(description_file)

        all_attributes = self.description['meta']['all_attributes']
        candidate_keys = set(self.description['meta']['candidate_keys'])
        self.synthetic_dataset = pd.DataFrame(columns=all_attributes)
        for attr in all_attributes:
            attr_info = self.description['attribute_description'][attr]
            column = parse_json(attr_info)

            if attr in candidate_keys:
                self.synthetic_dataset[attr] = column.generate_values_as_candidate_key(n)
            else:
                binning_indices = column.sample_binning_indices_in_independent_attribute_mode(n)
                self.synthetic_dataset[attr] = column.sample_values_from_binning_indices(binning_indices)

    def generate_dataset_in_correlated_attribute_mode(self, n, description_file, seed=0):
        set_random_seed(seed)
        self.n = n
        self.description = read_json_file(description_file)

        all_attributes = self.description['meta']['all_attributes']
        candidate_keys = set(self.description['meta']['candidate_keys'])
        self.encoded_dataset = DataGenerator.generate_encoded_dataset(self.n, self.description)
        self.synthetic_dataset = pd.DataFrame(columns=all_attributes)
        for attr in all_attributes:
            attr_info = self.description['attribute_description'][attr]
            column = parse_json(attr_info)

            if attr in self.encoded_dataset:
                self.synthetic_dataset[attr] = column.sample_values_from_binning_indices(self.encoded_dataset[attr])
            elif attr in candidate_keys:
                self.synthetic_dataset[attr] = column.generate_values_as_candidate_key(n)
            else:
                # for attributes not in BN or candidate keys, use independent attribute mode.
                binning_indices = column.sample_binning_indices_in_independent_attribute_mode(n)
                self.synthetic_dataset[attr] = column.sample_values_from_binning_indices(binning_indices)

    @staticmethod
    def get_sampling_order(bn):
        order = [bn[0][1][0]]
        for child, _ in bn:
            order.append(child)
        return order

    @staticmethod
    def generate_encoded_dataset(n, description):
        bn = description['bayesian_network']
        bn_root_attr = bn[0][1][0]
        root_attr_dist = description['conditional_probabilities'][bn_root_attr]
        encoded_df = pd.DataFrame(columns=DataGenerator.get_sampling_order(bn))
        encoded_df[bn_root_attr] = np.random.choice(len(root_attr_dist), size=n, p=root_attr_dist)

        for child, parents in bn:
            child_conditional_distributions = description['conditional_probabilities'][child]
            for parents_instance in child_conditional_distributions.keys():
                dist = child_conditional_distributions[parents_instance]
                parents_instance = list(eval(parents_instance))

                filter_condition = ''
                for parent, value in zip(parents, parents_instance):
                    filter_condition += f"(encoded_df['{parent}']=={value})&"

                filter_condition = eval(filter_condition[:-1])

                size = encoded_df[filter_condition].shape[0]
                if size:
                    encoded_df.loc[filter_condition, child] = np.random.choice(len(dist), size=size, p=dist)

            unconditioned_distribution = description['attribute_description'][child]['distribution_probabilities']
            encoded_df.loc[encoded_df[child].isnull(), child] = np.random.choice(len(unconditioned_distribution),
                                                                                 size=encoded_df[child].isnull().sum(),
                                                                                 p=unconditioned_distribution)
        encoded_df[encoded_df.columns] = encoded_df[encoded_df.columns].astype(int)
        return encoded_df

    def save_synthetic_data(self, to_file):
        self.synthetic_dataset.to_csv(to_file, index=False)


if __name__ == '__main__':
    from time import time

    dataset_description_file = '../out/AdultIncome/description_test.txt'
    dataset_description_file = '/home/haoyue/GitLab/data-responsibly-webUI/dataResponsiblyUI/static/intermediatedata/1498175138.8088856_description.txt'

    generator = DataGenerator()

    t = time()
    generator.generate_dataset_in_correlated_attribute_mode(51, dataset_description_file)
    print('running time: {} s'.format(time() - t))
    print(generator.synthetic_dataset.loc[:50])
--------------------------------------------------------------------------------
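In correlated attribute mode the description file carries the Bayesian network and its conditional probability tables; generate_encoded_dataset samples the root attribute from its marginal and each child from the table matching its parents' sampled bins. A minimal hand-written description shows the shape of that input (all numbers illustrative; it assumes the DataSynthesizer directory is on sys.path):

```python
from DataGenerator import DataGenerator

toy_description = {
    'bayesian_network': [['Treatment', ['Gender']]],   # child Treatment, parent Gender
    'conditional_probabilities': {
        'Gender': [0.5, 0.5],                 # root attribute: plain distribution over bins
        'Treatment': {'[0]': [0.9, 0.1],      # P(Treatment | Gender = bin 0)
                      '[1]': [0.2, 0.8]},     # P(Treatment | Gender = bin 1)
    },
    'attribute_description': {
        'Treatment': {'distribution_probabilities': [0.5, 0.5]},  # fallback marginal
    },
}

encoded = DataGenerator.generate_encoded_dataset(1000, toy_description)
print(encoded.groupby('Gender')['Treatment'].mean())  # ~0.1 for bin 0, ~0.8 for bin 1
```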
/tutorial/deidentify.py:
--------------------------------------------------------------------------------
'''
Takes the Hospitals A&E data generated from generate.py and runs it through a
set of de-identification steps. It then saves this as a new dataset.
'''
import random
import time
import string

import pandas as pd

import filepaths


def main():
    print('running de-identification steps...')
    start = time.time()

    # "_df" is the usual way people refer to a Pandas DataFrame object
    hospital_ae_df = pd.read_csv(filepaths.hospital_ae_data)

    print('removing Health Service ID numbers...')
    hospital_ae_df = remove_health_service_numbers(hospital_ae_df)

    print('converting postcodes to LSOA...')
    hospital_ae_df = convert_postcodes_to_lsoa(hospital_ae_df)

    print('converting LSOA to IMD decile...')
    hospital_ae_df = convert_lsoa_to_imd_decile(hospital_ae_df)

    print('replacing Hospital with random number...')
    hospital_ae_df = replace_hospital_with_random_number(hospital_ae_df)

    print('putting Arrival Hour in 4-hour bins...')
    hospital_ae_df = put_time_in_4_hour_bins(hospital_ae_df)

    print('removing non-male-or-female from gender...')
    hospital_ae_df = remove_non_male_or_female(hospital_ae_df)

    print('putting ages in age brackets...')
    hospital_ae_df = add_age_brackets(hospital_ae_df)

    hospital_ae_df.to_csv(filepaths.hospital_ae_data_deidentify, index=False)

    elapsed = round(time.time() - start, 2)
    print('done in ' + str(elapsed) + ' seconds.')


def remove_health_service_numbers(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """Drops the Health Service ID numbers column from the dataset

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """
    hospital_ae_df = hospital_ae_df.drop(columns='Health Service ID')
    return hospital_ae_df


def convert_postcodes_to_lsoa(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """Adds corresponding Lower layer super output area for each row
    depending on their postcode. Uses London postcodes dataset from
    https://www.doogal.co.uk/PostcodeDownloads.php

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """
    postcodes_df = pd.read_csv(filepaths.postcodes_london)
    hospital_ae_df = pd.merge(
        hospital_ae_df,
        postcodes_df[['Postcode', 'Lower layer super output area']],
        on='Postcode'
    )
    hospital_ae_df = hospital_ae_df.drop(columns='Postcode')
    return hospital_ae_df


def convert_lsoa_to_imd_decile(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """Maps each row's Lower layer super output area to which
    Index of Multiple Deprivation decile it's in. It calculates the decile
    rates based on the IMDs over all of London.
    Uses "London postcodes.csv" dataset from
    https://www.doogal.co.uk/PostcodeDownloads.php

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    postcodes_df = pd.read_csv(filepaths.postcodes_london)

    hospital_ae_df = pd.merge(
        hospital_ae_df,
        postcodes_df[
            ['Lower layer super output area',
             'Index of Multiple Deprivation']
        ].drop_duplicates(),
        on='Lower layer super output area'
    )
    _, bins = pd.qcut(
        postcodes_df['Index of Multiple Deprivation'], 10,
        retbins=True, labels=False
    )
    hospital_ae_df['Index of Multiple Deprivation Decile'] = pd.cut(
        hospital_ae_df['Index of Multiple Deprivation'], bins=bins,
        labels=False, include_lowest=True) + 1

    hospital_ae_df = hospital_ae_df.drop(columns='Index of Multiple Deprivation')
    hospital_ae_df = hospital_ae_df.drop(columns='Lower layer super output area')

    return hospital_ae_df


def replace_hospital_with_random_number(
        hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Gives each hospital a random six-digit ID, adds a new Hospital ID
    column with these IDs, and drops the hospital name column.

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    hospitals = hospital_ae_df['Hospital'].unique().tolist()
    random.shuffle(hospitals)
    hospitals_map = {
        hospital: ''.join(random.choices(string.digits, k=6))
        for hospital in hospitals
    }
    hospital_ae_df['Hospital ID'] = hospital_ae_df['Hospital'].map(hospitals_map)
    hospital_ae_df = hospital_ae_df.drop(columns='Hospital')

    return hospital_ae_df


def put_time_in_4_hour_bins(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Splits the arrival time into an arrival date and a 4-hour arrival
    hour range, then drops the exact arrival time.

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    arrival_times = pd.to_datetime(hospital_ae_df['Arrival Time'])
    hospital_ae_df['Arrival Date'] = arrival_times.dt.strftime('%Y-%m-%d')
    hospital_ae_df['Arrival Hour'] = arrival_times.dt.hour

    hospital_ae_df['Arrival hour range'] = pd.cut(
        hospital_ae_df['Arrival Hour'],
        bins=[0, 4, 8, 12, 16, 20, 24],
        labels=['00-03', '04-07', '08-11', '12-15', '16-19', '20-23'],
        include_lowest=True
    )
    hospital_ae_df = hospital_ae_df.drop(columns='Arrival Time')
    hospital_ae_df = hospital_ae_df.drop(columns='Arrival Hour')

    return hospital_ae_df


def remove_non_male_or_female(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Removes any record which has a non-male-or-female entry for gender.

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    hospital_ae_df = hospital_ae_df[hospital_ae_df['Gender'].isin(['Male', 'Female'])]
    return hospital_ae_df


def add_age_brackets(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
    """
    Put the integer ages in to age brackets

    Keyword arguments:
    hospital_ae_df -- Hospitals A&E records dataframe
    """

    hospital_ae_df['Age bracket'] = pd.cut(
        hospital_ae_df['Age'],
        bins=[0, 18, 25, 45, 65, 85, 150],
        labels=['0-17', '18-24', '25-44', '45-64', '65-84', '85-'],
        include_lowest=True
    )
    hospital_ae_df = hospital_ae_df.drop(columns='Age')
    return hospital_ae_df


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
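The decile logic in convert_lsoa_to_imd_decile is just qcut-then-cut: decile edges are computed once over the London-wide IMD values, and each record's IMD is then placed against those edges. A toy version (illustrative data):

```python
import pandas as pd

imd_scores = pd.Series(range(1, 101))   # stand-in for London-wide IMD values
_, bins = pd.qcut(imd_scores, 10, retbins=True, labels=False)

records = pd.Series([3, 42, 97])
print(pd.cut(records, bins=bins, labels=False, include_lowest=True) + 1)
# 0     1
# 1     5
# 2    10
```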
161 | 
162 |     Keyword arguments:
163 |     hospital_ae_df -- Hospitals A&E records dataframe
164 |     """
165 | 
166 |     hospital_ae_df = hospital_ae_df[hospital_ae_df['Gender'].isin(['Male', 'Female'])]
167 |     return hospital_ae_df
168 | 
169 | 
170 | def add_age_brackets(hospital_ae_df: pd.DataFrame) -> pd.DataFrame:
171 |     """
172 |     Puts the integer ages into age brackets
173 | 
174 |     Keyword arguments:
175 |     hospital_ae_df -- Hospitals A&E records dataframe
176 |     """
177 | 
178 |     hospital_ae_df['Age bracket'] = pd.cut(
179 |         hospital_ae_df['Age'],
180 |         bins=[0, 18, 25, 45, 65, 85, 150],
181 |         labels=['0-17', '18-24', '25-44', '45-64', '65-84', '85-'],
182 |         right=False  # left-closed bins, so age 18 lands in '18-24', not '0-17'
183 |     )
184 |     hospital_ae_df = hospital_ae_df.drop('Age', axis=1)
185 |     return hospital_ae_df
186 | 
187 | 
188 | if __name__ == "__main__":
189 |     main()
190 | 
--------------------------------------------------------------------------------
/tutorial/generate.py:
--------------------------------------------------------------------------------
1 | """
2 | Script that generates hospital A&E data to use in the synthetic data tutorial.
3 | 
4 | Columns of data inspired by NHS+ODI Leeds blog post:
5 | https://odileeds.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data
6 | 
7 | """
8 | 
9 | import random
10 | from datetime import datetime
11 | import string
12 | import time
13 | 
14 | import pandas as pd
15 | import numpy as np
16 | 
17 | import filepaths
18 | 
19 | # TODO: give hospitals different average waiting times
20 | 
21 | num_of_rows = 10000
22 | 
23 | 
24 | def main():
25 |     print('generating data...')
26 |     start = time.time()
27 | 
28 |     hospital_ae_dataset = {}
29 | 
30 |     print('generating Health Service ID numbers...')
31 |     hospital_ae_dataset['Health Service ID'] = generate_health_service_id_numbers()
32 | 
33 |     print('generating patient ages and times in A&E...')
34 |     (hospital_ae_dataset['Age'], hospital_ae_dataset['Time in A&E (mins)']) = generate_ages_times_in_ae()
35 | 
36 |     print('generating hospital instances...')
37 |     hospital_ae_dataset['Hospital'] = generate_hospitals()
38 | 
39 |     print('generating arrival times...')
40 |     hospital_ae_dataset['Arrival Time'] = generate_arrival_times()
41 | 
42 |     print('generating A&E treatments...')
43 |     hospital_ae_dataset['Treatment'] = generate_treatments()
44 | 
45 |     print('generating patient gender instances...')
46 |     hospital_ae_dataset['Gender'] = generate_genders()
47 | 
48 |     print('generating patient postcodes...')
49 |     hospital_ae_dataset['Postcode'] = generate_postcodes()
50 | 
51 |     write_out_dataset(hospital_ae_dataset, filepaths.hospital_ae_data)
52 |     print('dataset written out to: ', filepaths.hospital_ae_data)
53 | 
54 |     elapsed = round(time.time() - start, 2)
55 |     print('done in ' + str(elapsed) + ' seconds.')
56 | 
57 | 
58 | def generate_ages_times_in_ae():
59 |     """
60 |     Generates correlated ages and waiting times and returns them as lists
61 | 
62 |     Obviously, normally distributed ages are not very true to real life, but that is fine for our mock data.
63 | 
64 |     Correlated random data generation code based on:
65 |     https://realpython.com/python-random/
66 |     """
67 |     # Start with a correlation matrix and standard deviations.
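    # corr2cov() below turns these into the covariance matrix needed by
    # np.random.multivariate_normal:  cov = diag(stdev) @ corr @ diag(stdev)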
68 |     # 0.95 is the correlation between ages and waiting times, and the correlation of a variable with itself is 1
69 |     correlations = np.array([[1, 0.95], [0.95, 1]])
70 | 
71 |     # Standard deviations and means of ages and waiting times, respectively
72 |     stdev = np.array([20, 20])
73 |     mean = np.array([41, 60])
74 |     cov = corr2cov(correlations, stdev)
75 | 
76 |     data = np.random.multivariate_normal(mean=mean, cov=cov, size=num_of_rows)
77 |     data = np.array(data, dtype=int)
78 | 
79 |     # negative ages or waiting times wouldn't make sense, so clamp
80 |     # any value below 1 to 0 (ages) and 1 (waiting times) respectively
81 |     data[np.nonzero(data[:, 0] < 1)[0], 0] = 0
82 |     data[np.nonzero(data[:, 1] < 1)[0], 1] = 1
83 | 
84 |     ages = data[:, 0].tolist()
85 |     times_in_ae = data[:, 1].tolist()
86 | 
87 |     return (ages, times_in_ae)
88 | 
89 | 
90 | def corr2cov(correlations: np.ndarray, stdev: np.ndarray) -> np.ndarray:
91 |     """Covariance matrix from correlations & standard deviations"""
92 |     diagonal_stdev = np.diag(stdev)
93 |     covariance = diagonal_stdev @ correlations @ diagonal_stdev
94 |     return covariance
95 | 
96 | 
97 | def generate_admission_ids() -> list:
98 |     """ Generate a random 10-digit ID for every admission record (uniqueness is likely but not guaranteed) """
99 | 
100 |     uids = []
101 |     for _ in range(num_of_rows):
102 |         x = ''.join(random.choice(string.digits) for _ in range(10))
103 |         uids.append(x)
104 |     return uids
105 | 
106 | def generate_health_service_id_numbers() -> list:
107 |     """ Generate dummy Health Service ID numbers similar to the NHS 10-digit format
108 |     See: https://www.nhs.uk/using-the-nhs/about-the-nhs/what-is-an-nhs-number/
109 |     """
110 |     health_service_id_numbers = []
111 |     for _ in range(num_of_rows):
112 |         health_service_id = ''.join(random.choice(string.digits) for _ in range(3)) + '-'
113 |         health_service_id += ''.join(random.choice(string.digits) for _ in range(3)) + '-'
114 |         health_service_id += ''.join(random.choice(string.digits) for _ in range(4))
115 |         health_service_id_numbers.append(health_service_id)
116 |     return health_service_id_numbers
117 | 
118 | 
119 | def generate_postcodes() -> list:
120 |     """ Reads a .csv containing info on every London postcode. Reads the
121 |     postcodes in use and returns a sample of them.
122 | 
123 |     # List of London postcodes from https://www.doogal.co.uk/PostcodeDownloads.php
124 |     """
125 |     postcodes_df = pd.read_csv(filepaths.postcodes_london)
126 |     postcodes_in_use = list(postcodes_df[postcodes_df['In Use?'] == "Yes"]['Postcode'])
127 |     postcodes = random.choices(postcodes_in_use, k=num_of_rows)
128 |     return postcodes
129 | 
130 | 
131 | def generate_hospitals() -> list:
132 |     """ Reads the data/hospitals_london.txt file and samples
133 |     hospital names from it to add to the dataset.
134 | 
135 |     List of London hospitals loosely based on
136 |     https://en.wikipedia.org/wiki/Category:NHS_hospitals_in_London
137 |     """
138 |     with open(filepaths.hospitals_london, 'r') as file_in:
139 |         hospitals = file_in.readlines()
140 |     hospitals = [name.strip() for name in hospitals]
141 | 
142 |     weights = random.choices(range(1, 100), k=len(hospitals))
143 |     hospitals = random.choices(hospitals, k=num_of_rows, weights=weights)
144 | 
145 |     return hospitals
146 | 
147 | 
148 | def generate_arrival_times() -> list:
149 |     """ Generate and return arrival times.
150 |     Hardcodes times to the first week of April 2019
151 |     """
152 |     arrival_times = []
153 | 
154 |     # first 7 days in April 2019
155 |     days_dates = [1, 2, 3, 4, 5, 6, 7]
156 |     # have more people come in at the weekend (6 and 7 April) - higher weights
157 |     day_weights = [0.5, 0.6, 0.7, 0.8, 0.9, 1, 1]
158 |     days = random.choices(days_dates, day_weights, k=num_of_rows)
159 |     # this is just so each day has a different peak time
160 |     days_time_modes = {day: random.random() for day in days_dates}
161 | 
162 |     for day in days:
163 |         start = datetime(2019, 4, day, 0, 0, 0)
164 |         end = datetime(2019, 4, day, 23, 59, 59)
165 | 
166 |         random_num = random.triangular(0, 1, days_time_modes[day])
167 |         random_datetime = start + (end - start) * random_num
168 |         arrival_times.append(random_datetime.strftime('%Y-%m-%d %H:%M:%S'))
169 | 
170 |     return arrival_times
171 | 
172 | 
173 | def generate_genders() -> list:
174 |     """ Generate and return a list of genders for every row.
175 | 
176 |     # National codes for gender in NHS data
177 |     # https://www.datadictionary.nhs.uk/data_dictionary/attributes/p/person/person_gender_code_de.asp?shownav=1
178 |     """
179 |     gender_codes_df = pd.read_csv(filepaths.nhs_ae_gender_codes)
180 |     genders = gender_codes_df['Gender'].tolist()
181 |     # these weights are just dummy values. please don't take them as accurate.
182 |     weights = [0.005, 0.495, 0.495, 0.005]
183 |     gender_codes = random.choices(genders, k=num_of_rows, weights=weights)
184 |     return gender_codes
185 | 
186 | 
187 | def generate_treatments() -> list:
188 |     """ Generate and return a sample of the treatments patients received.
189 | 
190 |     Reads the data/nhs_ae_treatment_codes.csv file
191 | 
192 |     NHS treatment codes:
193 |     https://www.datadictionary.nhs.uk/web_site_content/supporting_information/clinical_coding/accident_and_emergency_treatment_tables.asp?shownav=1
194 |     """
195 | 
196 |     treatment_codes_df = pd.read_csv(filepaths.nhs_ae_treatment_codes)
197 |     treatments = treatment_codes_df['Treatment'].tolist()
198 | 
199 |     # likelihood of each of the treatments - make some more common than others
200 |     weights = random.choices(range(1, 100), k=len(treatments))
201 |     treatment_codes = random.choices(
202 |         treatments, k=num_of_rows, weights=weights)
203 |     return treatment_codes
204 | 
205 | 
206 | def write_out_dataset(dataset: dict, filepath: str):
207 |     """Writes the dataset to a .csv file
208 | 
209 |     Keyword arguments:
210 |     dataset -- the dataset to be written to disk
211 |     filepath -- path to write the file out to
212 |     """
213 | 
214 |     df = pd.DataFrame.from_dict(dataset)
215 |     df.to_csv(filepath, index=False)
216 | 
217 | 
218 | if __name__ == "__main__":
219 |     main()
220 | 
--------------------------------------------------------------------------------
/tutorial/synthesise.py:
--------------------------------------------------------------------------------
1 | '''
2 | This generates synthetic data from the hospital_ae_data_deidentify.csv
3 | file. It generates three types of synthetic data and saves them in
4 | different files.
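The three types correspond to DataSynthesizer's modes: 'random',
'independent' and 'correlated' (iterated over in main() below).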
5 | '''
6 | 
7 | import random
8 | import os
9 | import time
10 | 
11 | import pandas as pd
12 | import numpy as np
13 | 
14 | import filepaths
15 | from DataDescriber import DataDescriber
16 | from DataGenerator import DataGenerator
17 | from ModelInspector import ModelInspector
18 | from lib.utils import read_json_file
19 | 
20 | 
21 | attribute_to_datatype = {
22 |     'Time in A&E (mins)': 'Integer',
23 |     'Treatment': 'String',
24 |     'Gender': 'String',
25 |     'Index of Multiple Deprivation Decile': 'Integer',
26 |     'Hospital ID': 'String',
27 |     'Arrival Date': 'String',
28 |     'Arrival hour range': 'String',
29 |     'Age bracket': 'String'
30 | }
31 | 
32 | attribute_is_categorical = {
33 |     'Time in A&E (mins)': False,
34 |     'Treatment': True,
35 |     'Gender': True,
36 |     'Index of Multiple Deprivation Decile': True,
37 |     'Hospital ID': True,
38 |     'Arrival Date': True,
39 |     'Arrival hour range': True,
40 |     'Age bracket': True
41 | }
42 | 
43 | mode_filepaths = {
44 |     'random': {
45 |         'description': filepaths.hospital_ae_description_random,
46 |         'data': filepaths.hospital_ae_data_synthetic_random
47 |     },
48 |     'independent': {
49 |         'description': filepaths.hospital_ae_description_independent,
50 |         'data': filepaths.hospital_ae_data_synthetic_independent
51 |     },
52 |     'correlated': {
53 |         'description': filepaths.hospital_ae_description_correlated,
54 |         'data': filepaths.hospital_ae_data_synthetic_correlated
55 |     }
56 | }
57 | 
58 | 
59 | def main():
60 |     start = time.time()
61 | 
62 |     # "_df" is the usual way people refer to a Pandas DataFrame object
63 |     hospital_ae_df = pd.read_csv(filepaths.hospital_ae_data_deidentify)
64 | 
65 |     # let's generate the same number of rows as the original data (though we don't have to)
66 |     num_rows = len(hospital_ae_df)
67 | 
68 |     # iterate through the 3 modes to generate synthetic data
69 |     for mode in ['random', 'independent', 'correlated']:
70 | 
71 |         print('describing synthetic data for', mode, 'mode...')
72 |         describe_synthetic_data(mode, mode_filepaths[mode]['description'])
73 | 
74 |         print('generating synthetic data for', mode, 'mode...')
75 |         generate_synthetic_data(
76 |             mode,
77 |             num_rows,
78 |             mode_filepaths[mode]['description'],
79 |             mode_filepaths[mode]['data']
80 |         )
81 | 
82 |         print('comparing histograms for', mode, 'mode...')
83 |         compare_histograms(
84 |             mode,
85 |             hospital_ae_df,
86 |             mode_filepaths[mode]['description'],
87 |             mode_filepaths[mode]['data']
88 |         )
89 | 
90 |         print('comparing pairwise mutual information for', mode, 'mode...')
91 |         compare_pairwise_mutual_information(
92 |             mode,
93 |             hospital_ae_df,
94 |             mode_filepaths[mode]['description'],
95 |             mode_filepaths[mode]['data']
96 |         )
97 | 
98 |     elapsed = round(time.time() - start, 2)
99 |     print('done in ' + str(elapsed) + ' seconds.')
100 | 
101 | 
102 | def describe_synthetic_data(mode: str, description_filepath: str):
103 |     '''
104 |     Describes the de-identified dataset and saves the description to the data/ directory.
105 | 
106 |     Keyword arguments:
107 |     mode -- what type of synthetic data ('random', 'independent'
108 |             or 'correlated')
109 |     description_filepath -- filepath to the data description
110 |     '''
111 |     describer = DataDescriber()
112 | 
113 |     if mode == 'random':
114 |         describer.describe_dataset_in_random_mode(
115 |             filepaths.hospital_ae_data_deidentify,
116 |             attribute_to_datatype=attribute_to_datatype,
117 |             attribute_to_is_categorical=attribute_is_categorical)
118 | 
119 |     elif mode == 'independent':
120 |         describer.describe_dataset_in_independent_attribute_mode(
121 |             filepaths.hospital_ae_data_deidentify,
122 |             attribute_to_datatype=attribute_to_datatype,
123 |             attribute_to_is_categorical=attribute_is_categorical)
124 | 
125 |     elif mode == 'correlated':
126 |         # Increase the epsilon value to reduce the injected noise.
127 |         # We're not using differential privacy in this tutorial,
128 |         # so we'll set epsilon=0 to turn differential privacy off
129 |         epsilon = 0
130 | 
131 |         # The maximum number of parents in the Bayesian network,
132 |         # i.e., the maximum number of incoming edges.
133 |         degree_of_bayesian_network = 1
134 | 
135 |         describer.describe_dataset_in_correlated_attribute_mode(
136 |             dataset_file=filepaths.hospital_ae_data_deidentify,
137 |             epsilon=epsilon,
138 |             k=degree_of_bayesian_network,
139 |             attribute_to_datatype=attribute_to_datatype,
140 |             attribute_to_is_categorical=attribute_is_categorical)
141 |             # attribute_to_is_candidate_key=attribute_to_is_candidate_key)
142 | 
143 |     describer.save_dataset_description_to_file(description_filepath)
144 | 
145 | 
146 | def generate_synthetic_data(
147 |         mode: str,
148 |         num_rows: int,
149 |         description_filepath: str,
150 |         synthetic_data_filepath: str
151 | ):
152 |     '''
153 |     Generates the synthetic data and saves it to the data/ directory.
154 | 
155 |     Keyword arguments:
156 |     mode -- what type of synthetic data
157 |     num_rows -- number of rows in the synthetic dataset
158 |     description_filepath -- filepath to the data description
159 |     synthetic_data_filepath -- filepath to where the synthetic data is written
160 |     '''
161 |     generator = DataGenerator()
162 | 
163 |     if mode == 'random':
164 |         generator.generate_dataset_in_random_mode(num_rows, description_filepath)
165 | 
166 |     elif mode == 'independent':
167 |         generator.generate_dataset_in_independent_mode(num_rows, description_filepath)
168 | 
169 |     elif mode == 'correlated':
170 |         generator.generate_dataset_in_correlated_attribute_mode(num_rows, description_filepath)
171 | 
172 |     generator.save_synthetic_data(synthetic_data_filepath)
173 | 
174 | 
175 | def compare_histograms(
176 |         mode: str,
177 |         hospital_ae_df: pd.DataFrame,
178 |         description_filepath: str,
179 |         synthetic_data_filepath: str
180 | ):
181 |     '''
182 |     Makes comparison plots showing the histograms for each column in the
183 |     original and synthetic data.
184 | 
185 |     Keyword arguments:
186 |     mode -- what type of synthetic data
187 |     hospital_ae_df -- DataFrame of the original dataset
188 |     description_filepath -- filepath to the data description
189 |     synthetic_data_filepath -- filepath to where the synthetic data is written
190 |     '''
191 | 
192 |     synthetic_df = pd.read_csv(synthetic_data_filepath)
193 | 
194 |     # Read the attribute description from the dataset description file.
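    # (in that JSON, each attribute maps to an object with keys such as
    # "data_type", "is_categorical", "min", "max", "distribution_bins" and
    # "distribution_probabilities" -- see data/hospital_ae_description_*.json)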
195 |     attribute_description = read_json_file(
196 |         description_filepath)['attribute_description']
197 | 
198 |     inspector = ModelInspector(
199 |         hospital_ae_df, synthetic_df, attribute_description)
200 | 
201 |     for attribute in synthetic_df.columns:
202 |         figure_filepath = os.path.join(
203 |             filepaths.plots_dir,
204 |             mode + '_' + attribute + '.png'
205 |         )
206 |         # need to replace whitespace in filepath for Markdown reference
207 |         figure_filepath = figure_filepath.replace(' ', '_')
208 |         inspector.compare_histograms(attribute, figure_filepath)
209 | 
210 | def compare_pairwise_mutual_information(
211 |         mode: str,
212 |         hospital_ae_df: pd.DataFrame,
213 |         description_filepath: str,
214 |         synthetic_data_filepath: str
215 | ):
216 |     '''
217 |     Looks at the correlation between attributes by producing a mutual information heatmap
218 | 
219 |     Keyword arguments:
220 |     mode -- what type of synthetic data
221 |     hospital_ae_df -- DataFrame of the original dataset
222 |     description_filepath -- filepath to the data description
223 |     synthetic_data_filepath -- filepath to where the synthetic data is written
224 |     '''
225 | 
226 |     synthetic_df = pd.read_csv(synthetic_data_filepath)
227 | 
228 |     attribute_description = read_json_file(
229 |         description_filepath)['attribute_description']
230 | 
231 |     inspector = ModelInspector(
232 |         hospital_ae_df, synthetic_df, attribute_description)
233 | 
234 |     figure_filepath = os.path.join(
235 |         filepaths.plots_dir,
236 |         'mutual_information_heatmap_' + mode + '.png'
237 |     )
238 | 
239 |     inspector.mutual_information_heatmap(figure_filepath)
240 | 
241 | 
242 | if __name__ == "__main__":
243 |     main()
244 | 
--------------------------------------------------------------------------------
/DataSynthesizer/lib/PrivBayes.py:
--------------------------------------------------------------------------------
1 | import random
2 | import warnings
3 | from itertools import combinations, product
4 | from math import log, ceil
5 | from multiprocessing.pool import Pool
6 | 
7 | import numpy as np
8 | import pandas as pd
9 | from scipy.optimize import fsolve
10 | 
11 | from lib.utils import mutual_information, normalize_given_distribution
12 | 
13 | """
14 | This module is based on PrivBayes in the following paper:
15 | 
16 | Zhang J, Cormode G, Procopiuc CM, Srivastava D, Xiao X.
17 | PrivBayes: Private Data Release via Bayesian Networks.
18 | """
19 | 
20 | 
21 | def sensitivity(num_tuples):
22 |     """Sensitivity function for Bayesian network construction. PrivBayes Lemma 1.
23 | 
24 |     Parameters
25 |     ----------
26 |     num_tuples : int
27 |         Number of tuples in sensitive dataset.
28 | 
29 |     Returns
30 |     -------
31 |     float
32 |         Sensitivity value.
33 |     """
34 |     a = (2 / num_tuples) * log((num_tuples + 1) / 2)
35 |     b = (1 - 1 / num_tuples) * log(1 + 2 / (num_tuples - 1))
36 |     return a + b
37 | 
38 | 
39 | def delta(num_attributes, num_tuples, epsilon):
40 |     """Computes delta, a scaling factor used when applying differential privacy.
41 | 
42 |     More info is in PrivBayes Section 4.2 "A First-Cut Solution".
43 | 
44 |     Parameters
45 |     ----------
46 |     num_attributes : int
47 |         Number of attributes in dataset.
48 |     num_tuples : int
49 |         Number of tuples in dataset.
50 |     epsilon : float
51 |         Parameter of differential privacy.
52 |     """
53 |     return 2 * (num_attributes - 1) * sensitivity(num_tuples) / epsilon
54 | 
55 | 
56 | def usefulness_minus_target(k, num_attributes, num_tuples, target_usefulness=5, epsilon=0.1):
57 |     """Usefulness function in PrivBayes.
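
    Computes usefulness(k) = num_tuples * epsilon / ((num_attributes - k) * 2**(k + 3)),
    i.e. PrivBayes Lemma 3, then subtracts the target, so that calculate_k()
    below can solve for the k at which usefulness meets the target.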
58 | 
59 |     Parameters
60 |     ----------
61 |     k : int
62 |         Maximum degree in Bayesian network construction.
63 |     num_attributes : int
64 |         Number of attributes in dataset.
65 |     num_tuples : int
66 |         Number of tuples in dataset.
67 |     target_usefulness : int or float
68 |     epsilon : float
69 |         Parameter of differential privacy.
70 |     """
71 |     if k == num_attributes:
72 |         # usefulness equals the target when the degree covers all attributes
73 |         usefulness = target_usefulness
74 |     else:
75 |         usefulness = num_tuples * epsilon / ((num_attributes - k) * (2 ** (k + 3)))  # PrivBayes Lemma 3
76 |     return usefulness - target_usefulness
77 | 
78 | 
79 | def calculate_k(num_attributes, num_tuples, target_usefulness=4, epsilon=0.1):
80 |     """Calculate the maximum degree when constructing Bayesian networks. See PrivBayes Lemma 3."""
81 |     default_k = 3
82 |     initial_usefulness = usefulness_minus_target(default_k, num_attributes, num_tuples, 0, epsilon)
83 |     if initial_usefulness > target_usefulness:
84 |         return default_k
85 |     else:
86 |         arguments = (num_attributes, num_tuples, target_usefulness, epsilon)
87 |         warnings.filterwarnings("error")
88 |         try:
89 |             ans = fsolve(usefulness_minus_target, int(num_attributes / 2), args=arguments)[0]
90 |             ans = ceil(ans)
91 |         except RuntimeWarning:
92 |             print("Warning: k is not properly computed!")
93 |             ans = default_k
94 |         if ans < 1 or ans > num_attributes:
95 |             ans = default_k
96 |         return ans
97 | 
98 | 
99 | def worker(paras):
100 |     child, V, num_parents, split, dataset = paras
101 |     parents_pair_list = []
102 |     mutual_info_list = []
103 | 
104 |     if split + num_parents - 1 < len(V):
105 |         for other_parents in combinations(V[split + 1:], num_parents - 1):
106 |             parents = list(other_parents)
107 |             parents.append(V[split])
108 |             parents_pair_list.append((child, parents))
109 |             # TODO: consider computing MI over combined integers instead of strings.
110 |             mi = mutual_information(dataset[child], dataset[parents])
111 |             mutual_info_list.append(mi)
112 | 
113 |     return parents_pair_list, mutual_info_list
114 | 
115 | 
116 | def greedy_bayes(dataset, k=2, epsilon=0):
117 |     """Construct a Bayesian Network (BN) using a greedy algorithm.
118 | 
119 |     Parameters
120 |     ----------
121 |     dataset : DataFrame
122 |         Input dataset, which only contains categorical attributes.
123 |     k : int
124 |         Maximum degree of the constructed BN. If k=0, k is automatically calculated.
125 |     epsilon : float
126 |         Parameter of differential privacy.
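
    Returns
    -------
    list
        The constructed network as (child, [parents]) pairs, in the order
        the attributes were added.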
127 |     """
128 |     dataset = dataset.astype(str, copy=False)
129 |     num_tuples, num_attributes = dataset.shape
130 |     if not k:
131 |         k = calculate_k(num_attributes, num_tuples)
132 | 
133 |     print('================ Constructing Bayesian Network (BN) ================')
134 |     root_attribute = random.choice(dataset.columns)
135 |     V = [root_attribute]
136 |     rest_attributes = set(dataset.columns)
137 |     rest_attributes.remove(root_attribute)
138 |     print(f'Adding ROOT {root_attribute}')
139 |     N = []
140 |     while rest_attributes:
141 |         parents_pair_list = []
142 |         mutual_info_list = []
143 | 
144 |         num_parents = min(len(V), k)
145 |         tasks = [(child, V, num_parents, split, dataset) for child, split in
146 |                  product(rest_attributes, range(len(V) - num_parents + 1))]
147 |         with Pool() as pool:
148 |             res_list = pool.map(worker, tasks)
149 | 
150 |         for res in res_list:
151 |             parents_pair_list += res[0]
152 |             mutual_info_list += res[1]
153 | 
154 |         if epsilon:
155 |             sampling_distribution = exponential_mechanism(dataset, mutual_info_list, epsilon)
156 |             idx = np.random.choice(list(range(len(mutual_info_list))), p=sampling_distribution)
157 |         else:
158 |             idx = mutual_info_list.index(max(mutual_info_list))
159 | 
160 |         N.append(parents_pair_list[idx])
161 |         adding_attribute = parents_pair_list[idx][0]
162 |         V.append(adding_attribute)
163 |         rest_attributes.remove(adding_attribute)
164 |         print(f'Adding attribute {adding_attribute}')
165 | 
166 |     print('========================= BN constructed =========================')
167 | 
168 |     return N
169 | 
170 | 
171 | def exponential_mechanism(dataset, mutual_info_list, epsilon=0.1):
172 |     """Applies the exponential mechanism to sample an outcome."""
173 |     num_tuples, num_attributes = dataset.shape
174 |     mi_array = np.array(mutual_info_list)
175 |     mi_array = mi_array / (2 * delta(num_attributes, num_tuples, epsilon))
176 |     mi_array = np.exp(mi_array)
177 |     mi_array = normalize_given_distribution(mi_array)
178 |     return mi_array
179 | 
180 | 
181 | def laplace_noise_parameter(k, num_attributes, num_tuples, epsilon):
182 |     """Scale of the Laplace noise injected into conditional distributions. PrivBayes Algorithm 1."""
183 |     return 4 * (num_attributes - k) / (num_tuples * epsilon)
184 | 
185 | 
186 | def get_noisy_distribution_of_attributes(attributes, encoded_dataset, epsilon=0.1):
187 |     data = encoded_dataset.copy().loc[:, attributes]
188 |     data['count'] = 1
189 |     stats = data.groupby(attributes).sum()
190 | 
191 |     iterables = [range(int(encoded_dataset[attr].max()) + 1) for attr in attributes]
192 |     full_space = pd.DataFrame(columns=attributes, data=list(product(*iterables)))
193 |     stats.reset_index(inplace=True)
194 |     stats = pd.merge(full_space, stats, how='left')
195 |     stats.fillna(0, inplace=True)
196 | 
197 |     if epsilon:
198 |         k = len(attributes) - 1
199 |         num_tuples, num_attributes = encoded_dataset.shape
200 |         noise_para = laplace_noise_parameter(k, num_attributes, num_tuples, epsilon)
201 |         laplace_noises = np.random.laplace(0, scale=noise_para, size=stats.index.size)
202 |         stats['count'] += laplace_noises
203 |         stats.loc[stats['count'] < 0, 'count'] = 0
204 | 
205 |     return stats
206 | 
207 | 
208 | def construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon=0.1):
209 |     """See more in Algorithm 1 in PrivBayes.
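
    Returns a dict mapping each attribute to its noisy distribution: a plain
    list for the root attribute, and {str(parents_instance): distribution}
    for the other attributes.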
210 | 211 | """ 212 | 213 | k = len(bayesian_network[-1][1]) 214 | conditional_distributions = {} 215 | 216 | # first k+1 attributes 217 | root = bayesian_network[0][1][0] 218 | kplus1_attributes = [root] 219 | for child, _ in bayesian_network[:k]: 220 | kplus1_attributes.append(child) 221 | 222 | noisy_dist_of_kplus1_attributes = get_noisy_distribution_of_attributes(kplus1_attributes, encoded_dataset, epsilon) 223 | 224 | # generate noisy distribution of root attribute. 225 | root_stats = noisy_dist_of_kplus1_attributes.loc[:, [root, 'count']].groupby(root).sum()['count'] 226 | conditional_distributions[root] = normalize_given_distribution(root_stats).tolist() 227 | 228 | for idx, (child, parents) in enumerate(bayesian_network): 229 | conditional_distributions[child] = {} 230 | 231 | if idx < k: 232 | stats = noisy_dist_of_kplus1_attributes.copy().loc[:, parents + [child, 'count']] 233 | else: 234 | stats = get_noisy_distribution_of_attributes(parents + [child], encoded_dataset, epsilon) 235 | 236 | stats = pd.DataFrame(stats.loc[:, parents + [child, 'count']].groupby(parents + [child]).sum()) 237 | 238 | if len(parents) == 1: 239 | for parent_instance in stats.index.levels[0]: 240 | dist = normalize_given_distribution(stats.loc[parent_instance]['count']).tolist() 241 | conditional_distributions[child][str([parent_instance])] = dist 242 | else: 243 | for parents_instance in product(*stats.index.levels[:-1]): 244 | dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist() 245 | conditional_distributions[child][str(list(parents_instance))] = dist 246 | 247 | return conditional_distributions 248 | -------------------------------------------------------------------------------- /data/hospital_ae_description_random.json: -------------------------------------------------------------------------------- 1 | { 2 | "meta": { 3 | "num_tuples": 9897, 4 | "num_attributes": 8, 5 | "num_attributes_in_BN": 8, 6 | "all_attributes": [ 7 | "Time in A&E (mins)", 8 | "Treatment", 9 | "Gender", 10 | "Index of Multiple Deprivation Decile", 11 | "Hospital ID", 12 | "Arrival Date", 13 | "Arrival hour range", 14 | "Age bracket" 15 | ], 16 | "candidate_keys": [], 17 | "non_categorical_string_attributes": [], 18 | "attributes_in_BN": [ 19 | "Gender", 20 | "Hospital ID", 21 | "Treatment", 22 | "Arrival Date", 23 | "Arrival hour range", 24 | "Age bracket", 25 | "Time in A&E (mins)", 26 | "Index of Multiple Deprivation Decile" 27 | ] 28 | }, 29 | "attribute_description": { 30 | "Time in A&E (mins)": { 31 | "name": "Time in A&E (mins)", 32 | "data_type": "Integer", 33 | "is_categorical": false, 34 | "is_candidate_key": false, 35 | "min": 1, 36 | "max": 132, 37 | "missing_rate": 0.0, 38 | "distribution_bins": [ 39 | 1.0, 40 | 132.0 41 | ], 42 | "distribution_probabilities": [ 43 | 0.5, 44 | 0.5 45 | ] 46 | }, 47 | "Treatment": { 48 | "name": "Treatment", 49 | "data_type": "String", 50 | "is_categorical": true, 51 | "is_candidate_key": false, 52 | "min": 3, 53 | "max": 44, 54 | "missing_rate": 0.0, 55 | "distribution_bins": [ 56 | "Dressing", 57 | "Sutures", 58 | "Joint aspiration", 59 | "Tetanus", 60 | "Urinary catheter/suprapubic", 61 | "Lumbar puncture", 62 | "Prescription/medicines prepared to take away", 63 | "Defibrillation/pacing", 64 | "Sling/collar cuff/broad arm sling", 65 | "Other (consider alternatives)", 66 | "Nebuliser/spacer", 67 | "Eye", 68 | "Infusion fluids", 69 | "Bandage/support", 70 | "Dressing/wound review", 71 | "Arterial line", 72 | "Chest drain", 73 | "Minor surgery", 74 | 
"Wound cleaning", 75 | "Blood product transfusion", 76 | "Plaster of Paris", 77 | "Oral airway", 78 | "Wound closure (excluding sutures)", 79 | "Resuscitation/cardiopulmonary resuscitation", 80 | "Incision & drainage", 81 | "Occupational Therapy", 82 | "Dental treatment", 83 | "Removal foreign body", 84 | "Central line", 85 | "Burns review", 86 | "Anaesthesia", 87 | "Guidance/advice only", 88 | "None (consider guidance/advice option)", 89 | "Fracture review", 90 | "Nasal airway", 91 | "Social work intervention", 92 | "Physiotherapy", 93 | "Recording vital signs", 94 | "Splint" 95 | ], 96 | "distribution_probabilities": [ 97 | 0.02564102564102564, 98 | 0.02564102564102564, 99 | 0.02564102564102564, 100 | 0.02564102564102564, 101 | 0.02564102564102564, 102 | 0.02564102564102564, 103 | 0.02564102564102564, 104 | 0.02564102564102564, 105 | 0.02564102564102564, 106 | 0.02564102564102564, 107 | 0.02564102564102564, 108 | 0.02564102564102564, 109 | 0.02564102564102564, 110 | 0.02564102564102564, 111 | 0.02564102564102564, 112 | 0.02564102564102564, 113 | 0.02564102564102564, 114 | 0.02564102564102564, 115 | 0.02564102564102564, 116 | 0.02564102564102564, 117 | 0.02564102564102564, 118 | 0.02564102564102564, 119 | 0.02564102564102564, 120 | 0.02564102564102564, 121 | 0.02564102564102564, 122 | 0.02564102564102564, 123 | 0.02564102564102564, 124 | 0.02564102564102564, 125 | 0.02564102564102564, 126 | 0.02564102564102564, 127 | 0.02564102564102564, 128 | 0.02564102564102564, 129 | 0.02564102564102564, 130 | 0.02564102564102564, 131 | 0.02564102564102564, 132 | 0.02564102564102564, 133 | 0.02564102564102564, 134 | 0.02564102564102564, 135 | 0.02564102564102564 136 | ] 137 | }, 138 | "Gender": { 139 | "name": "Gender", 140 | "data_type": "String", 141 | "is_categorical": true, 142 | "is_candidate_key": false, 143 | "min": 4, 144 | "max": 6, 145 | "missing_rate": 0.0, 146 | "distribution_bins": [ 147 | "Male", 148 | "Female" 149 | ], 150 | "distribution_probabilities": [ 151 | 0.5, 152 | 0.5 153 | ] 154 | }, 155 | "Index of Multiple Deprivation Decile": { 156 | "name": "Index of Multiple Deprivation Decile", 157 | "data_type": "Integer", 158 | "is_categorical": true, 159 | "is_candidate_key": false, 160 | "min": 1, 161 | "max": 10, 162 | "missing_rate": 0.0, 163 | "distribution_bins": [ 164 | 8, 165 | 6, 166 | 2, 167 | 3, 168 | 5, 169 | 4, 170 | 7, 171 | 9, 172 | 1, 173 | 10 174 | ], 175 | "distribution_probabilities": [ 176 | 0, 177 | 0, 178 | 0, 179 | 0, 180 | 0, 181 | 0, 182 | 0, 183 | 0, 184 | 0, 185 | 0 186 | ] 187 | }, 188 | "Hospital ID": { 189 | "name": "Hospital ID", 190 | "data_type": "String", 191 | "is_categorical": true, 192 | "is_candidate_key": false, 193 | "min": 5, 194 | "max": 6, 195 | "missing_rate": 0.0, 196 | "distribution_bins": [ 197 | 714199, 198 | 339622, 199 | 514115, 200 | 147009, 201 | 660843, 202 | 434748, 203 | 881145, 204 | 192852, 205 | 954491, 206 | 864150, 207 | 849155, 208 | 87754, 209 | 61685, 210 | 209044, 211 | 719566, 212 | 872881, 213 | 378500, 214 | 304807, 215 | 769379, 216 | 85318, 217 | 624058, 218 | 883825, 219 | 851090, 220 | 450621, 221 | 104821, 222 | 860413 223 | ], 224 | "distribution_probabilities": [ 225 | 0, 226 | 0, 227 | 0, 228 | 0, 229 | 0, 230 | 0, 231 | 0, 232 | 0, 233 | 0, 234 | 0, 235 | 0, 236 | 0, 237 | 0, 238 | 0, 239 | 0, 240 | 0, 241 | 0, 242 | 0, 243 | 0, 244 | 0, 245 | 0, 246 | 0, 247 | 0, 248 | 0, 249 | 0, 250 | 0 251 | ] 252 | }, 253 | "Arrival Date": { 254 | "name": "Arrival Date", 255 | "data_type": "String", 256 | "is_categorical": 
true, 257 | "is_candidate_key": false, 258 | "min": 10, 259 | "max": 10, 260 | "missing_rate": 0.0, 261 | "distribution_bins": [ 262 | "2019-04-07", 263 | "2019-04-03", 264 | "2019-04-06", 265 | "2019-04-01", 266 | "2019-04-04", 267 | "2019-04-05", 268 | "2019-04-02" 269 | ], 270 | "distribution_probabilities": [ 271 | 0.14285714285714285, 272 | 0.14285714285714285, 273 | 0.14285714285714285, 274 | 0.14285714285714285, 275 | 0.14285714285714285, 276 | 0.14285714285714285, 277 | 0.14285714285714285 278 | ] 279 | }, 280 | "Arrival hour range": { 281 | "name": "Arrival hour range", 282 | "data_type": "String", 283 | "is_categorical": true, 284 | "is_candidate_key": false, 285 | "min": 5, 286 | "max": 5, 287 | "missing_rate": 0.0, 288 | "distribution_bins": [ 289 | "00-03", 290 | "08-11", 291 | "16-19", 292 | "12-15", 293 | "04-07", 294 | "20-23" 295 | ], 296 | "distribution_probabilities": [ 297 | 0.16666666666666666, 298 | 0.16666666666666666, 299 | 0.16666666666666666, 300 | 0.16666666666666666, 301 | 0.16666666666666666, 302 | 0.16666666666666666 303 | ] 304 | }, 305 | "Age bracket": { 306 | "name": "Age bracket", 307 | "data_type": "String", 308 | "is_categorical": true, 309 | "is_candidate_key": false, 310 | "min": 3, 311 | "max": 5, 312 | "missing_rate": 0.0, 313 | "distribution_bins": [ 314 | "25-44", 315 | "65-84", 316 | "45-64", 317 | "0-17", 318 | "18-24", 319 | "85-" 320 | ], 321 | "distribution_probabilities": [ 322 | 0.16666666666666666, 323 | 0.16666666666666666, 324 | 0.16666666666666666, 325 | 0.16666666666666666, 326 | 0.16666666666666666, 327 | 0.16666666666666666 328 | ] 329 | } 330 | } 331 | } -------------------------------------------------------------------------------- /data/hospital_ae_description_independent.json: -------------------------------------------------------------------------------- 1 | { 2 | "meta": { 3 | "num_tuples": 9897, 4 | "num_attributes": 8, 5 | "num_attributes_in_BN": 8, 6 | "all_attributes": [ 7 | "Time in A&E (mins)", 8 | "Treatment", 9 | "Gender", 10 | "Index of Multiple Deprivation Decile", 11 | "Hospital ID", 12 | "Arrival Date", 13 | "Arrival hour range", 14 | "Age bracket" 15 | ], 16 | "candidate_keys": [], 17 | "non_categorical_string_attributes": [], 18 | "attributes_in_BN": [ 19 | "Gender", 20 | "Hospital ID", 21 | "Treatment", 22 | "Arrival Date", 23 | "Arrival hour range", 24 | "Age bracket", 25 | "Time in A&E (mins)", 26 | "Index of Multiple Deprivation Decile" 27 | ] 28 | }, 29 | "attribute_description": { 30 | "Time in A&E (mins)": { 31 | "name": "Time in A&E (mins)", 32 | "data_type": "Integer", 33 | "is_categorical": false, 34 | "is_candidate_key": false, 35 | "min": 1, 36 | "max": 132, 37 | "missing_rate": 0.0, 38 | "distribution_bins": [ 39 | 1.0, 40 | 7.55, 41 | 14.1, 42 | 20.65, 43 | 27.2, 44 | 33.75, 45 | 40.3, 46 | 46.85, 47 | 53.4, 48 | 59.949999999999996, 49 | 66.5, 50 | 73.05, 51 | 79.6, 52 | 86.14999999999999, 53 | 92.7, 54 | 99.25, 55 | 105.8, 56 | 112.35, 57 | 118.89999999999999, 58 | 125.45 59 | ], 60 | "distribution_probabilities": [ 61 | 0.0048137849053241435, 62 | 0.012751209255727067, 63 | 0.012787960085510593, 64 | 0.029766054617680452, 65 | 0.03908536424246603, 66 | 0.0711838002502208, 67 | 0.08112732494047205, 68 | 0.1289694426246986, 69 | 0.1270821082831697, 70 | 0.13274010311130605, 71 | 0.11819792116763203, 72 | 0.08385242821680011, 73 | 0.06362083177515561, 74 | 0.05065187199646423, 75 | 0.014176002474539257, 76 | 0.0006923597724718898, 77 | 0.0, 78 | 0.01069262871326837, 79 | 0.007190356203672139, 80 | 
0.010618447363421076 81 | ] 82 | }, 83 | "Treatment": { 84 | "name": "Treatment", 85 | "data_type": "String", 86 | "is_categorical": true, 87 | "is_candidate_key": false, 88 | "min": 3, 89 | "max": 44, 90 | "missing_rate": 0.0, 91 | "distribution_bins": [ 92 | "Anaesthesia", 93 | "Arterial line", 94 | "Bandage/support", 95 | "Blood product transfusion", 96 | "Burns review", 97 | "Central line", 98 | "Chest drain", 99 | "Defibrillation/pacing", 100 | "Dental treatment", 101 | "Dressing", 102 | "Dressing/wound review", 103 | "Eye", 104 | "Fracture review", 105 | "Guidance/advice only", 106 | "Incision & drainage", 107 | "Infusion fluids", 108 | "Joint aspiration", 109 | "Lumbar puncture", 110 | "Minor surgery", 111 | "Nasal airway", 112 | "Nebuliser/spacer", 113 | "None (consider guidance/advice option)", 114 | "Occupational Therapy", 115 | "Oral airway", 116 | "Other (consider alternatives)", 117 | "Physiotherapy", 118 | "Plaster of Paris", 119 | "Prescription/medicines prepared to take away", 120 | "Recording vital signs", 121 | "Removal foreign body", 122 | "Resuscitation/cardiopulmonary resuscitation", 123 | "Sling/collar cuff/broad arm sling", 124 | "Social work intervention", 125 | "Splint", 126 | "Sutures", 127 | "Tetanus", 128 | "Urinary catheter/suprapubic", 129 | "Wound cleaning", 130 | "Wound closure (excluding sutures)" 131 | ], 132 | "distribution_probabilities": [ 133 | 0.03743653923302075, 134 | 0.03751715461964534, 135 | 0.04351428297613841, 136 | 0.03760806003801341, 137 | 0.010529054594822455, 138 | 0.026381269227099918, 139 | 0.007549988860363595, 140 | 0.05986187837291627, 141 | 0.00850742241230225, 142 | 0.03480733915916512, 143 | 0.027468928722761122, 144 | 0.026717830161984556, 145 | 0.03416644462647804, 146 | 0.014965422991640035, 147 | 0.007795447487468476, 148 | 0.026400977355992464, 149 | 0.041373870011095104, 150 | 0.045900876813776054, 151 | 0.05198354973969091, 152 | 0.027879729350620344, 153 | 0.0, 154 | 0.0387508493435389, 155 | 0.044485037296703195, 156 | 0.023103633518272867, 157 | 0.014430904850183484, 158 | 0.018850348982361948, 159 | 0.029749335129931886, 160 | 0.004684584676552847, 161 | 0.0005176287878392601, 162 | 0.010716969801534658, 163 | 0.01731195359361903, 164 | 0.023087506024678742, 165 | 0.03700822721307918, 166 | 0.0015958069798097654, 167 | 0.02627143667426696, 168 | 0.01493749383417674, 169 | 0.031912250792335325, 170 | 0.030438293292970365, 171 | 0.023781672453150347 172 | ] 173 | }, 174 | "Gender": { 175 | "name": "Gender", 176 | "data_type": "String", 177 | "is_categorical": true, 178 | "is_candidate_key": false, 179 | "min": 4, 180 | "max": 6, 181 | "missing_rate": 0.0, 182 | "distribution_bins": [ 183 | "Female", 184 | "Male" 185 | ], 186 | "distribution_probabilities": [ 187 | 0.5056121783414436, 188 | 0.4943878216585565 189 | ] 190 | }, 191 | "Index of Multiple Deprivation Decile": { 192 | "name": "Index of Multiple Deprivation Decile", 193 | "data_type": "Integer", 194 | "is_categorical": true, 195 | "is_candidate_key": false, 196 | "min": 1, 197 | "max": 10, 198 | "missing_rate": 0.0, 199 | "distribution_bins": [ 200 | 1, 201 | 2, 202 | 3, 203 | 4, 204 | 5, 205 | 6, 206 | 7, 207 | 8, 208 | 9, 209 | 10 210 | ], 211 | "distribution_probabilities": [ 212 | 0.0818966101509957, 213 | 0.10914128206359103, 214 | 0.08158665621236741, 215 | 0.09902888772825894, 216 | 0.10971942432967483, 217 | 0.1130540710198635, 218 | 0.09027435917728921, 219 | 0.11448459660960227, 220 | 0.0979777744972692, 221 | 0.10283633821108792 222 | ] 223 | }, 224 | 
"Hospital ID": { 225 | "name": "Hospital ID", 226 | "data_type": "String", 227 | "is_categorical": true, 228 | "is_candidate_key": false, 229 | "min": 5, 230 | "max": 6, 231 | "missing_rate": 0.0, 232 | "distribution_bins": [ 233 | 61685, 234 | 85318, 235 | 87754, 236 | 104821, 237 | 147009, 238 | 192852, 239 | 209044, 240 | 304807, 241 | 339622, 242 | 378500, 243 | 434748, 244 | 450621, 245 | 514115, 246 | 624058, 247 | 660843, 248 | 714199, 249 | 719566, 250 | 769379, 251 | 849155, 252 | 851090, 253 | 860413, 254 | 864150, 255 | 872881, 256 | 881145, 257 | 883825, 258 | 954491 259 | ], 260 | "distribution_probabilities": [ 261 | 0.05352275494736269, 262 | 0.0704080187421479, 263 | 0.07997320566661555, 264 | 0.007398206322024037, 265 | 0.053574210896845144, 266 | 0.03120563200976509, 267 | 0.02944400397157849, 268 | 0.034277388726412124, 269 | 0.022913627909725703, 270 | 0.03360203536704146, 271 | 0.05642809286467514, 272 | 0.0, 273 | 0.013587186830387246, 274 | 0.024354770752385856, 275 | 0.07317344659341625, 276 | 0.05898285136165305, 277 | 0.010627480302828646, 278 | 0.008050565576995733, 279 | 0.06730111431230959, 280 | 0.018626035275006815, 281 | 0.006169743743641671, 282 | 0.040299250943709626, 283 | 0.024344801544253756, 284 | 0.07884146818539779, 285 | 0.011315984326774164, 286 | 0.09157812282704632 287 | ] 288 | }, 289 | "Arrival Date": { 290 | "name": "Arrival Date", 291 | "data_type": "String", 292 | "is_categorical": true, 293 | "is_candidate_key": false, 294 | "min": 10, 295 | "max": 10, 296 | "missing_rate": 0.0, 297 | "distribution_bins": [ 298 | "2019-04-01", 299 | "2019-04-02", 300 | "2019-04-03", 301 | "2019-04-04", 302 | "2019-04-05", 303 | "2019-04-06", 304 | "2019-04-07" 305 | ], 306 | "distribution_probabilities": [ 307 | 0.06783542752636566, 308 | 0.12284925724304259, 309 | 0.09489380953045443, 310 | 0.15654728206204, 311 | 0.16681750153635305, 312 | 0.18350661317758582, 313 | 0.20755010892415843 314 | ] 315 | }, 316 | "Arrival hour range": { 317 | "name": "Arrival hour range", 318 | "data_type": "String", 319 | "is_categorical": true, 320 | "is_candidate_key": false, 321 | "min": 5, 322 | "max": 5, 323 | "missing_rate": 0.0, 324 | "distribution_bins": [ 325 | "00-03", 326 | "04-07", 327 | "08-11", 328 | "12-15", 329 | "16-19", 330 | "20-23" 331 | ], 332 | "distribution_probabilities": [ 333 | 0.15580117541263627, 334 | 0.19847718869225747, 335 | 0.20637700957772262, 336 | 0.19740459394581067, 337 | 0.17211788737444159, 338 | 0.06982214499713145 339 | ] 340 | }, 341 | "Age bracket": { 342 | "name": "Age bracket", 343 | "data_type": "String", 344 | "is_categorical": true, 345 | "is_candidate_key": false, 346 | "min": 3, 347 | "max": 5, 348 | "missing_rate": 0.0, 349 | "distribution_bins": [ 350 | "0-17", 351 | "18-24", 352 | "25-44", 353 | "45-64", 354 | "65-84", 355 | "85-" 356 | ], 357 | "distribution_probabilities": [ 358 | 0.13304536821134044, 359 | 0.09792995305020429, 360 | 0.3712089415455074, 361 | 0.2844464676120552, 362 | 0.10127555731296116, 363 | 0.012093712267931548 364 | ] 365 | } 366 | } 367 | } -------------------------------------------------------------------------------- /DataSynthesizer/DataDescriber.py: -------------------------------------------------------------------------------- 1 | import json 2 | from typing import Dict, List, Union 3 | 4 | from numpy import array_equal 5 | from pandas import DataFrame, read_csv 6 | 7 | from datatypes.AbstractAttribute import AbstractAttribute 8 | from datatypes.DateTimeAttribute import is_datetime, 
DateTimeAttribute 9 | from datatypes.FloatAttribute import FloatAttribute 10 | from datatypes.IntegerAttribute import IntegerAttribute 11 | from datatypes.SocialSecurityNumberAttribute import is_ssn, SocialSecurityNumberAttribute 12 | from datatypes.StringAttribute import StringAttribute 13 | from datatypes.utils.DataType import DataType 14 | from lib import utils 15 | from lib.PrivBayes import greedy_bayes, construct_noisy_conditional_distributions 16 | 17 | 18 | class DataDescriber: 19 | """Model input dataset, then save a description of the dataset into a JSON file. 20 | 21 | Attributes 22 | ---------- 23 | histogram_bins : int or str 24 | Number of bins in histograms. 25 | If it is a string such as 'auto' or 'fd', calculate the optimal bin width by `numpy.histogram_bin_edges`. 26 | category_threshold : int 27 | Categorical variables have no more than "this number" of distinct values. 28 | null_values: str or list 29 | Additional strings to recognize as missing values. 30 | By default missing values already include {‘’, ‘NULL’, ‘N/A’, ‘NA’, ‘NaN’, ‘nan’}. 31 | attr_to_datatype : dict 32 | Dictionary of {attribute: datatype}, e.g., {"age": "Integer", "gender": "String"}. 33 | attr_to_is_categorical : dict 34 | Dictionary of {attribute: boolean}, e.g., {"gender":True, "age":False}. 35 | attr_to_is_candidate_key: dict 36 | Dictionary of {attribute: boolean}, e.g., {"id":True, "name":False}. 37 | data_description: dict 38 | Nested dictionary (equivalent to JSON) recording the mined dataset information. 39 | df_input : DataFrame 40 | The input dataset to be analyzed. 41 | attr_to_column : Dict 42 | Dictionary of {attribute: AbstractAttribute} 43 | bayesian_network : list 44 | List of [child, [parent,]] to represent a Bayesian Network. 45 | df_encoded : DataFrame 46 | Input dataset encoded into integers, taken as input by PrivBayes algorithm in correlated attribute mode. 
47 | """ 48 | 49 | def __init__(self, histogram_bins: Union[int, str] = 20, category_threshold=10, null_values=None): 50 | self.histogram_bins: Union[int, str] = histogram_bins 51 | self.category_threshold: int = category_threshold 52 | self.null_values = null_values 53 | 54 | self.attr_to_datatype: Dict[str, DataType] = None 55 | self.attr_to_is_categorical: Dict[str, bool] = None 56 | self.attr_to_is_candidate_key: Dict[str, bool] = None 57 | 58 | self.data_description: Dict = {} 59 | self.df_input: DataFrame = None 60 | self.attr_to_column: Dict[str, AbstractAttribute] = None 61 | self.bayesian_network: List = None 62 | self.df_encoded: DataFrame = None 63 | 64 | def describe_dataset_in_random_mode(self, 65 | dataset_file: str, 66 | attribute_to_datatype: Dict[str, DataType] = None, 67 | attribute_to_is_categorical: Dict[str, bool] = None, 68 | attribute_to_is_candidate_key: Dict[str, bool] = None, 69 | categorical_attribute_domain_file: str = None, 70 | numerical_attribute_ranges: Dict[str, List] = None, 71 | seed=0): 72 | attribute_to_datatype = attribute_to_datatype or {} 73 | attribute_to_is_categorical = attribute_to_is_categorical or {} 74 | attribute_to_is_candidate_key = attribute_to_is_candidate_key or {} 75 | numerical_attribute_ranges = numerical_attribute_ranges or {} 76 | 77 | if categorical_attribute_domain_file: 78 | categorical_attribute_to_domain = utils.read_json_file(categorical_attribute_domain_file) 79 | else: 80 | categorical_attribute_to_domain = {} 81 | 82 | utils.set_random_seed(seed) 83 | self.attr_to_datatype = {attr: DataType(datatype) for attr, datatype in attribute_to_datatype.items()} 84 | self.attr_to_is_categorical = attribute_to_is_categorical 85 | self.attr_to_is_candidate_key = attribute_to_is_candidate_key 86 | self.read_dataset_from_csv(dataset_file) 87 | self.infer_attribute_data_types() 88 | self.analyze_dataset_meta() 89 | self.represent_input_dataset_by_columns() 90 | 91 | for column in self.attr_to_column.values(): 92 | attr_name = column.name 93 | if attr_name in categorical_attribute_to_domain: 94 | column.infer_domain(categorical_domain=categorical_attribute_to_domain[attr_name]) 95 | elif attr_name in numerical_attribute_ranges: 96 | column.infer_domain(numerical_range=numerical_attribute_ranges[attr_name]) 97 | else: 98 | column.infer_domain() 99 | 100 | # record attribute information in json format 101 | self.data_description['attribute_description'] = {} 102 | for attr, column in self.attr_to_column.items(): 103 | self.data_description['attribute_description'][attr] = column.to_json() 104 | 105 | def describe_dataset_in_independent_attribute_mode(self, 106 | dataset_file, 107 | epsilon=0.1, 108 | attribute_to_datatype: Dict[str, DataType] = None, 109 | attribute_to_is_categorical: Dict[str, bool] = None, 110 | attribute_to_is_candidate_key: Dict[str, bool] = None, 111 | categorical_attribute_domain_file: str = None, 112 | numerical_attribute_ranges: Dict[str, List] = None, 113 | seed=0): 114 | self.describe_dataset_in_random_mode(dataset_file, 115 | attribute_to_datatype, 116 | attribute_to_is_categorical, 117 | attribute_to_is_candidate_key, 118 | categorical_attribute_domain_file, 119 | numerical_attribute_ranges, 120 | seed=seed) 121 | 122 | for column in self.attr_to_column.values(): 123 | column.infer_distribution() 124 | 125 | self.inject_laplace_noise_into_distribution_per_attribute(epsilon) 126 | # record attribute information in json format 127 | self.data_description['attribute_description'] = {} 128 | for attr, column in 
self.attr_to_column.items(): 129 | self.data_description['attribute_description'][attr] = column.to_json() 130 | 131 | def describe_dataset_in_correlated_attribute_mode(self, 132 | dataset_file, 133 | k=0, 134 | epsilon=0.1, 135 | attribute_to_datatype: Dict[str, DataType] = None, 136 | attribute_to_is_categorical: Dict[str, bool] = None, 137 | attribute_to_is_candidate_key: Dict[str, bool] = None, 138 | categorical_attribute_domain_file: str = None, 139 | numerical_attribute_ranges: Dict[str, List] = None, 140 | seed=0): 141 | """Generate dataset description using correlated attribute mode. 142 | 143 | Parameters 144 | ---------- 145 | dataset_file : str 146 | File name (with directory) of the sensitive dataset as input in csv format. 147 | k : int 148 | Maximum number of parents in Bayesian network. 149 | epsilon : float 150 | A parameter in Differential Privacy. Increase epsilon value to reduce the injected noises. Set epsilon=0 to turn 151 | off Differential Privacy. 152 | attribute_to_datatype : dict 153 | Dictionary of {attribute: datatype}, e.g., {"age": "Integer", "gender": "String"}. 154 | attribute_to_is_categorical : dict 155 | Dictionary of {attribute: boolean}, e.g., {"gender":True, "age":False}. 156 | attribute_to_is_candidate_key: dict 157 | Dictionary of {attribute: boolean}, e.g., {"id":True, "name":False}. 158 | categorical_attribute_domain_file: str 159 | File name of a JSON file of some categorical attribute domains. 160 | numerical_attribute_ranges: dict 161 | Dictionary of {attribute: [min, max]}, e.g., {"age": [25, 65]} 162 | seed : int or float 163 | Seed the random number generator. 164 | """ 165 | self.describe_dataset_in_independent_attribute_mode(dataset_file, 166 | epsilon, 167 | attribute_to_datatype, 168 | attribute_to_is_categorical, 169 | attribute_to_is_candidate_key, 170 | categorical_attribute_domain_file, 171 | numerical_attribute_ranges, 172 | seed) 173 | self.df_encoded = self.encode_dataset_into_binning_indices() 174 | if self.df_encoded.shape[1] < 2: 175 | raise Exception("Correlated Attribute Mode requires at least 2 attributes/columns in dataset.") 176 | 177 | self.bayesian_network = greedy_bayes(self.df_encoded, k, epsilon) 178 | self.data_description['bayesian_network'] = self.bayesian_network 179 | self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions( 180 | self.bayesian_network, self.df_encoded, epsilon) 181 | 182 | def read_dataset_from_csv(self, file_name=None): 183 | try: 184 | self.df_input = read_csv(file_name, skipinitialspace=True, na_values=self.null_values) 185 | except (UnicodeDecodeError, NameError): 186 | self.df_input = read_csv(file_name, skipinitialspace=True, na_values=self.null_values, 187 | encoding='latin1') 188 | 189 | # Remove columns with empty active domain, i.e., all values are missing. 
190 |         attributes_before = set(self.df_input.columns)
191 |         self.df_input = self.df_input.dropna(axis=1, how='all')
192 |         attributes_after = set(self.df_input.columns)
193 |         if len(attributes_before) > len(attributes_after):
194 |             print(f'Empty columns are removed, including {attributes_before - attributes_after}.')
195 | 
196 |     def infer_attribute_data_types(self):
197 |         attributes_with_unknown_datatype = set(self.df_input.columns) - set(self.attr_to_datatype)
198 |         inferred_numerical_attributes = utils.infer_numerical_attributes_in_dataframe(self.df_input)
199 | 
200 |         for attr in attributes_with_unknown_datatype:
201 |             column_dropna = self.df_input[attr].dropna()
202 | 
203 |             # current attribute is either Integer or Float.
204 |             if attr in inferred_numerical_attributes:
205 |                 # TODO Comparing all values may be too slow for large datasets.
206 |                 if array_equal(column_dropna, column_dropna.astype(int, copy=False)):
207 |                     self.attr_to_datatype[attr] = DataType.INTEGER
208 |                 else:
209 |                     self.attr_to_datatype[attr] = DataType.FLOAT
210 | 
211 |             # current attribute is either String, DateTime, or SocialSecurityNumber.
212 |             else:
213 |                 # Sample 20 values to test its data_type.
214 |                 samples = column_dropna.sample(20, replace=True)
215 |                 if all(samples.map(is_datetime)):
216 |                     self.attr_to_datatype[attr] = DataType.DATETIME
217 |                 else:
218 |                     if all(samples.map(is_ssn)):
219 |                         self.attr_to_datatype[attr] = DataType.SOCIAL_SECURITY_NUMBER
220 |                     else:
221 |                         self.attr_to_datatype[attr] = DataType.STRING
222 | 
223 |     def analyze_dataset_meta(self):
224 |         all_attributes = set(self.df_input.columns)
225 | 
226 |         # find all candidate keys.
227 |         for attr in all_attributes - set(self.attr_to_is_candidate_key):
228 |             self.attr_to_is_candidate_key[attr] = self.df_input[attr].is_unique
229 | 
230 |         candidate_keys = {attr for attr, is_key in self.attr_to_is_candidate_key.items() if is_key}
231 | 
232 |         # find all categorical attributes.
233 |         for attr in all_attributes - set(self.attr_to_is_categorical):
234 |             self.attr_to_is_categorical[attr] = self.is_categorical(attr)
235 | 
236 |         non_categorical_string_attributes = set()
237 |         for attr, is_categorical in self.attr_to_is_categorical.items():
238 |             if not is_categorical and self.attr_to_datatype[attr] is DataType.STRING:
239 |                 non_categorical_string_attributes.add(attr)
240 | 
241 |         attributes_in_BN = list(all_attributes - candidate_keys - non_categorical_string_attributes)
242 |         non_categorical_string_attributes = list(non_categorical_string_attributes)
243 | 
244 |         self.data_description['meta'] = {"num_tuples": self.df_input.shape[0],
245 |                                          "num_attributes": self.df_input.shape[1],
246 |                                          "num_attributes_in_BN": len(attributes_in_BN),
247 |                                          "all_attributes": self.df_input.columns.tolist(),
248 |                                          "candidate_keys": list(candidate_keys),
249 |                                          "non_categorical_string_attributes": non_categorical_string_attributes,
250 |                                          "attributes_in_BN": attributes_in_BN}
251 | 
252 |     def is_categorical(self, attribute_name):
253 |         """ Detect whether an attribute is categorical.
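
        An attribute counts as categorical when it hasn't been explicitly
        flagged and has no more than self.category_threshold distinct values.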
254 | 
255 |         Parameters
256 |         ----------
257 |         attribute_name : str
258 |         """
259 |         if attribute_name in self.attr_to_is_categorical:
260 |             return self.attr_to_is_categorical[attribute_name]
261 |         else:
262 |             return self.df_input[attribute_name].dropna().unique().size <= self.category_threshold
263 | 
264 |     def represent_input_dataset_by_columns(self):
265 |         self.attr_to_column = {}
266 |         for attr in self.df_input:
267 |             data_type = self.attr_to_datatype[attr]
268 |             is_candidate_key = self.attr_to_is_candidate_key[attr]
269 |             is_categorical = self.attr_to_is_categorical[attr]
270 |             paras = (attr, is_candidate_key, is_categorical, self.histogram_bins, self.df_input[attr])
271 |             if data_type is DataType.INTEGER:
272 |                 self.attr_to_column[attr] = IntegerAttribute(*paras)
273 |             elif data_type is DataType.FLOAT:
274 |                 self.attr_to_column[attr] = FloatAttribute(*paras)
275 |             elif data_type is DataType.DATETIME:
276 |                 self.attr_to_column[attr] = DateTimeAttribute(*paras)
277 |             elif data_type is DataType.STRING:
278 |                 self.attr_to_column[attr] = StringAttribute(*paras)
279 |             elif data_type is DataType.SOCIAL_SECURITY_NUMBER:
280 |                 self.attr_to_column[attr] = SocialSecurityNumberAttribute(*paras)
281 |             else:
282 |                 raise Exception(f'The DataType of {attr} is unknown.')
283 | 
284 |     def inject_laplace_noise_into_distribution_per_attribute(self, epsilon=0.1):
285 |         num_attributes_in_BN = self.data_description['meta']['num_attributes_in_BN']
286 |         for column in self.attr_to_column.values():
287 |             assert isinstance(column, AbstractAttribute)
288 |             column.inject_laplace_noise(epsilon, num_attributes_in_BN)
289 | 
290 |     def encode_dataset_into_binning_indices(self):
291 |         """Before constructing the Bayesian network, encode the input dataset into binning indices."""
292 |         encoded_dataset = DataFrame()
293 |         for attr in self.data_description['meta']['attributes_in_BN']:
294 |             encoded_dataset[attr] = self.attr_to_column[attr].encode_values_into_bin_idx()
295 |         return encoded_dataset
296 | 
297 |     def save_dataset_description_to_file(self, file_name):
298 |         with open(file_name, 'w') as outfile:
299 |             json.dump(self.data_description, outfile, indent=4)
300 | 
301 |     def display_dataset_description(self):
302 |         print(json.dumps(self.data_description, indent=4))
303 | 
304 | 
305 | if __name__ == '__main__':
306 |     from DataGenerator import DataGenerator
307 | 
308 |     # input dataset
309 |     input_data = './data/adult_ssn.csv'
310 |     # location of two output files
311 |     mode = 'correlated_attribute_mode'
312 |     description_file = './out/{}/description.txt'.format(mode)
313 |     synthetic_data = './out/{}/synthetic_data.csv'.format(mode)
314 | 
315 |     # An attribute is categorical if its domain size is no more than this threshold.
316 |     # Here modify the threshold to adapt to the domain size of "education" (which is 14 in input dataset).
317 |     threshold_value = 20
318 | 
319 |     # Additional strings to recognize as NA/NaN.
320 |     na_values = ''
321 | 
322 |     # specify which attributes are candidate keys of input dataset.
323 |     candidate_keys = {'age': False, 'ssn': True}
324 | 
325 |     # A parameter in differential privacy.
326 |     # It roughly means that removing one tuple will change the probability of any output by at most a factor of exp(eps).
327 |     # Set eps=0 to turn off differential privacy.
328 |     eps = 0.1
329 | 
330 |     # The maximum number of parents in the Bayesian network, i.e., the maximum number of incoming edges.
331 |     degree_of_bayesian_network = 2
332 | 
333 |     # Number of tuples generated in synthetic dataset.
334 |     num_tuples_to_generate = 32561  # Here 32561 matches the size of the input dataset, but it can be set to another number.
335 | 
336 |     describer = DataDescriber(histogram_bins='fd',
337 |                               category_threshold=threshold_value,
338 |                               null_values=na_values)
339 |     describer.describe_dataset_in_correlated_attribute_mode(input_data,
340 |                                                             epsilon=eps, k=degree_of_bayesian_network,
341 |                                                             attribute_to_is_candidate_key=candidate_keys)
342 |     describer.save_dataset_description_to_file(description_file)
343 | 
344 |     generator = DataGenerator()
345 |     generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
346 |     generator.save_synthetic_data(synthetic_data)
347 |     print(generator.synthetic_dataset.head())
348 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | _Last tested: 2022-04-14. Updated the requirements and ran in Python 3.10 (although with a few warnings from Pandas)._
2 | 
3 | # Anonymisation with Synthetic Data Tutorial
4 | 
5 | ## Some questions
6 | 
7 | **What is this?**
8 | 
9 | A hands-on tutorial showing how to use Python to create synthetic data.
10 | 
11 | **Wait, what is this "synthetic data" you speak of?**
12 | 
13 | It's data created by an automated process that reproduces many of the statistical patterns of an original dataset. It is also sometimes used as a way to release data that contains no personal information, even if the original did contain lots of data that could identify people. This means programmers and data scientists can crack on with building software and algorithms that they know will work similarly on the real data.
14 | 
15 | **Who is this tutorial for?**
16 | 
17 | For anyone who programs and wants to learn about data anonymisation in general, or more specifically about synthetic data.
18 | 
19 | **What is it not for?**
20 | 
21 | Non-programmers, although we think this tutorial is still worth a browse to get some of the main ideas of what goes into anonymising a dataset. It's also not for you if you're looking for info on how to create synthetic data using the latest and greatest deep learning techniques.
22 | 
23 | **Who are you?**
24 | 
25 | We're the Open Data Institute. We work with companies and governments to build an open, trustworthy data ecosystem. Anonymisation and synthetic data are some of the many, many ways we can responsibly increase access to data. If you want to learn more, [check out our site](http://theodi.org).
26 | 
27 | **Why did you make this?**
28 | 
29 | We have an [R&D program](https://theodi.org/project/data-innovation-for-uk-research-and-development/) that has a number of projects looking into how to support innovation, improve data infrastructure and encourage ethical data sharing. One of our projects is about [managing the risks of re-identification](https://theodi.org/project/rd-broaden-access-to-personal-data-while-protecting-privacy-and-creating-a-fair-market/) in shared and open data. As you can see in the *Key outputs* section, we have other material from the project, but we thought it'd be good to have something specifically aimed at programmers who are interested in learning by doing.
30 | 
31 | **Speaking of which, can I just get to the tutorial now?**
32 | 
33 | Sure! Let's go.
34 | 
35 | ## Overview
36 | 
37 | In this tutorial you are aiming to create a safe version of accident and emergency (A&E) admissions data, collected from multiple hospitals.
This data contains some sensitive personal information about people's health and can't be openly shared. By removing and altering certain identifying information in the data we can greatly reduce the risk that patients can be re-identified, and therefore hope to release the data.
38 | 
39 | Just to be clear, we're not using actual A&E data but are creating our own simple, mock version of it.
40 | 
41 | The practical steps involve:
42 | 
43 | 1. Create an A&E admissions dataset which will contain (pretend) personal information.
44 | 2. Run some anonymisation steps over this dataset to generate a new dataset with much less re-identification risk.
45 | 3. Take this de-identified dataset and generate multiple synthetic datasets from it to reduce the re-identification risk even further.
46 | 4. Analyse the synthetic datasets to see how similar they are to the original data.
47 | 
48 | You may be wondering, why can't we just do the synthetic data step? If it's synthetic, surely it won't contain any personal information?
49 | 
50 | Not exactly. Patterns picked up in the original data can be transferred to the synthetic data. This is especially true for outliers. For instance, if there is only one person from a certain area aged over 85 and this shows up in the synthetic data, we would be able to re-identify them.
51 | 
52 | ## Credit to others
53 | 
54 | This tutorial is inspired by the [NHS England and ODI Leeds' research](https://odileeds.org/events/synae/) in creating a synthetic dataset from NHS England's accident and emergency admissions. Please do read about their project, as it's really interesting and great for learning about the benefits and risks in creating synthetic data.
55 | 
56 | Also, the synthetic data generating library we use is [DataSynthesizer](https://homes.cs.washington.edu/~billhowe//projects/2017/07/20/Data-Synthesizer.html) and comes as part of this codebase. Coming from researchers at Drexel University and the University of Washington, it's an excellent piece of software and their research and papers are well worth checking out. It's available as a [repo on Github](https://github.com/DataResponsibly/DataSynthesizer) which includes some short tutorials on how to use the toolkit and an accompanying research paper describing the theory behind it.
57 | 
58 | ---
59 | 
60 | ## Setup
61 | 
62 | First, make sure you have [Python3 installed](https://www.python.org/downloads/). You'll need Python 3.6 at minimum.
63 | 
64 | Download this repository either as a zip or clone it using Git.
65 | 
66 | Install the required libraries. You can do that, for example, inside a _virtualenv_ (e.g. `python3 -m venv .venv && source .venv/bin/activate`).
67 | 
68 | ```bash
69 | cd /path/to/repo/synthetic_data_tutorial/
70 | pip install -r requirements.txt
71 | ```
72 | 
73 | Next we'll go through how to create, de-identify and synthesise the data. We'll show this using code snippets, but the full code is contained within the `/tutorial` directory.
74 | 
75 | There are small differences between the code presented here and what's in the Python scripts, but it's mostly down to variable naming. I'd encourage you to run, edit and play with the code locally.
76 | 
77 | ## Generate mock NHS A&E dataset
78 | 
79 | The data already exists in `data/nhs_ae_mock.csv` so feel free to browse that. But you should generate your own fresh dataset using the `tutorial/generate.py` script.
80 | 
81 | To do this, you'll need to download one dataset first. It's a list of all postcodes in London.
You can find it at this page on [doogal.co.uk](https://www.doogal.co.uk/PostcodeDownloads.php), at the _London_ link under the _By English region_ section. Or just download it directly at [this link](https://www.doogal.co.uk/UKPostcodesCSV.ashx?region=E12000007) (just take note, it's 133MB in size), then place the `London postcodes.csv` file into the `data/` directory.
82 | 
83 | Or you can just do it using `curl`.
84 | 
85 | ```bash
86 | curl -o "./data/London postcodes.csv" https://www.doogal.co.uk/UKPostcodesCSV.ashx?region=E12000007
87 | ```
88 | 
89 | Then, to generate the data, run the `generate.py` script from the project root directory.
90 | 
91 | ```bash
92 | python tutorial/generate.py
93 | ```
94 | 
95 | Voila! You'll now see a new `hospital_ae_data.csv` file in the `/data` directory. Open it up and have a browse. It contains the following columns:
96 | 
97 | - **Health Service ID**: NHS number of the admitted patient
98 | - **Age**: age of patient
99 | - **Time in A&E (mins)**: how long, in minutes, the patient spent in A&E. This is generated to correlate with the age of the patient.
100 | - **Hospital**: which hospital admitted the patient - with some hospitals being more prevalent in the data than others
101 | - **Arrival Time**: what time and date the patient was admitted - with weekends busier and a different peak time for each day
102 | - **Treatment**: what the person was treated for - with certain treatments being more common than others
103 | - **Gender**: patient gender - based on [NHS patient gender codes](https://www.datadictionary.nhs.uk/data_dictionary/attributes/p/person/person_gender_code_de.asp?shownav=1)
104 | - **Postcode**: postcode of patient - random, in-use London postcodes extracted from the `London postcodes.csv` file.
105 | 
106 | We can see this dataset obviously contains some personal information. For instance, if we knew roughly the time a neighbour went to A&E we could use their postcode to figure out exactly what ailment they went in with. Or, if a list of people's Health Service IDs were to be leaked in future, lots of people could be re-identified.
107 | 
108 | Because of this, we'll need to take some de-identification steps.
109 | 
110 | ---
111 | 
112 | ## De-identification
113 | 
114 | For this stage, we're going to be loosely following the de-identification techniques used by Jonathan Pearson of NHS England, described in a blog post about [creating its own synthetic data](https://odileeds.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data).
115 | 
116 | If you look in `tutorial/deidentify.py` you'll see the full code of all the de-identification steps. You can run it easily.
117 | 
118 | ```bash
119 | python tutorial/deidentify.py
120 | ```
121 | 
122 | It takes the `data/hospital_ae_data.csv` file, runs the steps, and saves the new dataset to `data/hospital_ae_data_deidentify.csv`.
123 | 
124 | Let's break down each of these steps. It first loads the `data/hospital_ae_data.csv` file into a Pandas DataFrame called `hospital_ae_df`.
125 | 
126 | ```python
127 | # _df is a common way to refer to a Pandas DataFrame object
128 | hospital_ae_df = pd.read_csv(filepaths.hospital_ae_data)
129 | ```
130 | 
131 | (`filepaths.py` is, surprise, surprise, where all the filepaths are listed.)
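In case you're wondering what that module looks like, here's a minimal, hypothetical sketch of `tutorial/filepaths.py`. The exact contents of the real file may differ, but these are the variable names the snippets below rely on:

```python
# Hypothetical sketch of tutorial/filepaths.py (the real file may differ).
# One variable per dataset used in the tutorial.
data_dir = './data/'

postcodes_london = data_dir + 'London postcodes.csv'
hospital_ae_data = data_dir + 'hospital_ae_data.csv'
hospital_ae_data_deidentify = data_dir + 'hospital_ae_data_deidentify.csv'
```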
132 | 
133 | ### Remove Health Service ID numbers
134 | 
135 | Health Service ID numbers are direct identifiers and should be removed. So we'll simply drop the entire column.
136 | 
137 | ```python
138 | hospital_ae_df = hospital_ae_df.drop(columns=['Health Service ID'])
139 | ```
140 | 
141 | ### Where a patient lives
142 | 
143 | Pseudo-identifiers, also known as [quasi-identifiers](https://en.wikipedia.org/wiki/Quasi-identifier), are pieces of information that don't directly identify people but can be used with other information to identify a person. If we were to take the age, postcode and gender of a person, we could combine these and check the dataset to see what that person was treated for in A&E.
144 | 
145 | The data scientist from NHS England, Jonathan Pearson, describes this in the blog post:
146 | 
147 | > I started with the postcode of the patients resident lower super output area (LSOA). This is a geographical definition with an average of 1500 residents created to make reporting in England and Wales easier. I wanted to keep some basic information about the area where the patient lives whilst completely removing any information regarding any actual postcode. A key variable in health care inequalities is the patients Index of Multiple deprivation (IMD) decile (broad measure of relative deprivation) which gives an average ranked value for each LSOA. By replacing the patients resident postcode with an IMD decile I have kept a key bit of information whilst making this field non-identifiable.
148 | 
149 | We'll do just the same with our dataset.
150 | 
151 | First we'll map each row's postcode to its LSOA and then drop the postcode column.
152 | 
153 | ```python
154 | postcodes_df = pd.read_csv(filepaths.postcodes_london)
155 | hospital_ae_df = pd.merge(
156 |     hospital_ae_df,
157 |     postcodes_df[['Postcode', 'Lower layer super output area']],
158 |     on='Postcode'
159 | )
160 | hospital_ae_df = hospital_ae_df.drop(columns=['Postcode'])
161 | ```
162 | 
163 | Then we'll add an "Index of Multiple Deprivation" column, mapped from each entry's LSOA.
164 | 
165 | ```python
166 | hospital_ae_df = pd.merge(
167 |     hospital_ae_df,
168 |     postcodes_df[['Lower layer super output area', 'Index of Multiple Deprivation']].drop_duplicates(),
169 |     on='Lower layer super output area'
170 | )
171 | ```
172 | 
173 | Next we calculate the decile bins for the IMD values, using all the IMDs from the full list of London postcodes. We'll use the Pandas `qcut` (quantile cut) function for this.
174 | 
175 | ```python
176 | _, bins = pd.qcut(
177 |     postcodes_df['Index of Multiple Deprivation'],
178 |     10,
179 |     retbins=True,
180 |     labels=False
181 | )
182 | ```
183 | 
184 | Then we'll use those decile `bins` to map each row's IMD to its IMD decile.
185 | 
186 | ```python
187 | # add +1 to get deciles from 1 to 10 (not 0 to 9)
188 | hospital_ae_df['Index of Multiple Deprivation Decile'] = pd.cut(
189 |     hospital_ae_df['Index of Multiple Deprivation'],
190 |     bins=bins,
191 |     labels=False,
192 |     include_lowest=True) + 1
193 | ```
194 | 
195 | And finally drop the columns we no longer need.
196 | 
197 | ```python
198 | hospital_ae_df = hospital_ae_df.drop(columns=['Index of Multiple Deprivation'])
199 | hospital_ae_df = hospital_ae_df.drop(columns=['Lower layer super output area'])
200 | ```
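It's worth a quick sanity check that the binning behaved: every decile from 1 to 10 should appear. This check is ours and isn't part of `deidentify.py`; you can run it in a console:

```python
# Each decile from 1 to 10 should appear, roughly evenly if the mock
# postcodes were sampled evenly across London.
print(hospital_ae_df['Index of Multiple Deprivation Decile']
      .value_counts()
      .sort_index())
```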
201 | 
202 | ### Individual hospitals
203 | 
204 | The data scientist at NHS England masked individual hospitals, giving the following reason:
205 | 
206 | > As each hospital has its own complex case mix and health system, using these data to identify poor performance or possible improvements would be invalid and un-helpful. Therefore, I decided to replace the hospital code with a random number.
207 | 
208 | So we'll do as they did, replacing each hospital with a random six-digit ID.
209 | 
210 | ```python
211 | hospitals = hospital_ae_df['Hospital'].unique().tolist()
212 | random.shuffle(hospitals)
213 | hospitals_map = {
214 |     hospital : ''.join(random.choices(string.digits, k=6))
215 |     for hospital in hospitals
216 | }
217 | hospital_ae_df['Hospital ID'] = hospital_ae_df['Hospital'].map(hospitals_map)
218 | ```
219 | 
220 | And remove the `Hospital` column.
221 | 
222 | ```python
223 | hospital_ae_df = hospital_ae_df.drop(columns=['Hospital'])
224 | ```
225 | 
226 | ### Time in the data
227 | 
228 | > The next obvious step was to simplify some of the time information I have available as health care system analysis doesn't need to be responsive enough to work on a second and minute basis. Thus, I removed the time information from the 'arrival date', mapped the 'arrival time' into 4-hour chunks
229 | 
230 | First we'll split the `Arrival Time` column into `Arrival Date` and `Arrival Hour`.
231 | 
232 | ```python
233 | arrival_times = pd.to_datetime(hospital_ae_df['Arrival Time'])
234 | hospital_ae_df['Arrival Date'] = arrival_times.dt.strftime('%Y-%m-%d')
235 | hospital_ae_df['Arrival Hour'] = arrival_times.dt.hour
236 | hospital_ae_df = hospital_ae_df.drop(columns=['Arrival Time'])
237 | ```
238 | 
239 | Then we'll map the hours to 4-hour chunks and drop the `Arrival Hour` column.
240 | 
241 | ```python
242 | hospital_ae_df['Arrival hour range'] = pd.cut(
243 |     hospital_ae_df['Arrival Hour'],
244 |     bins=[0, 4, 8, 12, 16, 20, 24],
245 |     labels=['00-03', '04-07', '08-11', '12-15', '16-19', '20-23'],
246 |     right=False  # each bin includes its left edge, so hour 4 lands in '04-07'
247 | )
248 | hospital_ae_df = hospital_ae_df.drop(columns=['Arrival Hour'])
249 | ```
250 | 
251 | ### Patient demographics
252 | 
253 | > I decided to only include records with a sex of male or female in order to reduce risk of re identification through low numbers.
254 | 
255 | ```python
256 | hospital_ae_df = hospital_ae_df[hospital_ae_df['Gender'].isin(['Male', 'Female'])]
257 | ```
258 | 
259 | > For the patients age it is common practice to group these into bands and so I've used a standard set - 1-17, 18-24, 25-44, 45-64, 65-84, and 85+ - which although are non-uniform are well used segments defining different average health care usage.
260 | 
261 | ```python
262 | hospital_ae_df['Age bracket'] = pd.cut(
263 |     hospital_ae_df['Age'],
264 |     bins=[0, 18, 25, 45, 65, 85, 150],
265 |     labels=['0-17', '18-24', '25-44', '45-64', '65-84', '85-'],
266 |     right=False  # so age 18 lands in '18-24', not '0-17'
267 | )
268 | hospital_ae_df = hospital_ae_df.drop(columns=['Age'])
269 | ```
270 | 
271 | That's all the de-identification steps. Finally, we save our new de-identified dataset.
272 | 
273 | ```python
274 | hospital_ae_df.to_csv(filepaths.hospital_ae_data_deidentify, index=False)
275 | ```
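Before we move on, a quick assertion (again ours, not part of the script) confirms that the identifying columns really are gone:

```python
# None of the dropped or replaced columns should survive de-identification.
for column in ['Health Service ID', 'Postcode', 'Hospital', 'Age', 'Arrival Time']:
    assert column not in hospital_ae_df.columns, f'{column} is still present!'
print(hospital_ae_df.columns.tolist())
```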
276 | 
277 | ---
278 | 
279 | ## Synthesise
280 | 
281 | Synthetic data exists on a spectrum, from having merely the same columns and datatypes as the original data all the way to carrying nearly all of its statistical patterns.
282 | 
283 | The UK's Office for National Statistics has a great report on synthetic data, and its [_Synthetic Data Spectrum_](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot#synthetic-dataset-spectrum) section is very good at explaining the nuances in more detail.
284 | 
285 | In this tutorial we'll create not one, not two, but *three* synthetic datasets, sitting at different points along the synthetic data spectrum: *Random*, *Independent* and *Correlated*.
286 | 
287 | > In **correlated attribute mode**, we learn a differentially private Bayesian network capturing the correlation structure between attributes, then draw samples from this model to construct the result dataset.
288 | >
289 | > In cases where the correlated attribute mode is too computationally expensive or when there is insufficient data to derive a reasonable model, one can use **independent attribute mode**. In this mode, a histogram is derived for each attribute, noise is added to the histogram to achieve differential privacy, and then samples are drawn for each attribute.
290 | >
291 | > Finally, for cases of extremely sensitive data, one can use **random mode** that simply generates type-consistent random values for each attribute.
292 | 
293 | We'll go through each of these now, moving along the synthetic data spectrum from random to independent to correlated.
294 | 
295 | The toolkit we will be using to generate the three synthetic datasets is DataSynthesizer.
296 | 
297 | ### DataSynthesizer
298 | 
299 | As described in the introduction, this is an open-source toolkit for generating synthetic data. And I'd like to lavish much praise on the researchers who made it, as it's excellent.
300 | 
301 | Instead of explaining it myself, I'll use the researchers' own words from their paper:
302 | 
303 | > DataSynthesizer infers the domain of each attribute and derives a description of the distribution of attribute values in the private dataset. This information is saved in a dataset description file, to which we refer as data summary. Then DataSynthesizer is able to generate synthetic datasets of arbitrary size by sampling from the probabilistic model in the dataset description file.
304 | 
305 | We'll create and inspect our synthetic datasets using three modules within it.
306 | 
307 | > DataSynthesizer consists of three high-level modules:
308 | >
309 | > 1. **DataDescriber**: investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary.
310 | > 2. **DataGenerator**: samples from the summary computed by DataDescriber and outputs synthetic data.
311 | > 3. **ModelInspector**: shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired.
312 | 
313 | If you want to browse the code for each of these modules, you can find the Python classes for them in the `DataSynthesizer` directory (all the code in here is from the [original repo](https://github.com/DataResponsibly/DataSynthesizer)).
314 | 
315 | 
316 | ### An aside about differential privacy and Bayesian networks
317 | 
318 | You might have seen the phrase "differentially private Bayesian network" in the *correlated mode* description earlier, and got slightly panicked. But fear not! You don't need to worry *too* much about these to get DataSynthesizer working.
319 | 
320 | First off, while DataSynthesizer has the option of using differential privacy for anonymisation, we are turning it off and won't be using it in this tutorial. So you can ignore that part. However, if you care about anonymisation, you really should read up on differential privacy. I've read a lot of explainers on it and the best I found was [this article from Access Now](https://www.accessnow.org/understanding-differential-privacy-matters-digital-rights/).
321 | 
322 | Now the next term: Bayesian networks. These are directed graphs that model the statistical relationships between a dataset's variables. They do this by saying certain variables are "parents" of others, that is, their value influences their "children" variables. Parents can influence children, but children can't influence parents. In our case, if patient age is a parent of waiting time, it means the age of a patient influences how long they wait, but how long they wait doesn't influence their age. So by using Bayesian networks, DataSynthesizer can model these influences and use the model when generating the synthetic data.
323 | 
324 | It can be a slightly tricky topic to grasp, but a nice introductory tutorial on them is at the [Probabilistic World site](https://www.probabilisticworld.com/bayesian-belief-networks-part-1/). Give it a read.
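To make the parent-child idea concrete, here's a toy sketch of sampling from a one-parent network. This is our own illustration, not DataSynthesizer's actual implementation: we draw a parent value from its distribution, then draw the child from its distribution *conditional* on that parent value.

```python
import pandas as pd

# Toy one-parent Bayesian network: 'Age bracket' is the parent of
# 'Time in A&E (mins)'. Sampling the child conditional on the sampled
# parent preserves the correlation between the two attributes.
df = pd.DataFrame({
    'Age bracket': ['0-17', '0-17', '65-84', '65-84', '85-'],
    'Time in A&E (mins)': [20, 35, 80, 95, 120],
})

parent_value = df['Age bracket'].sample(1).iloc[0]
child_given_parent = df.loc[df['Age bracket'] == parent_value, 'Time in A&E (mins)']
print(parent_value, child_given_parent.sample(1).iloc[0])
```

Run it a few times: sampled young patients only ever get short waiting times, which is exactly the kind of influence the network encodes.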
325 | 
326 | ### Random mode
327 | 
328 | If we were just generating A&E data to test our software, we wouldn't care too much about the statistical patterns within the data, just that it was roughly a similar size and that the datatypes and columns aligned.
329 | 
330 | In this case, we can just generate the data at random using the `generate_dataset_in_random_mode` function within the `DataGenerator` class.
331 | 
332 | #### Data Description: Random
333 | 
334 | The first step is to create a description of the data, defining each attribute's datatype and which attributes are categorical.
335 | 
336 | ```python
337 | attribute_to_datatype = {
338 |     'Time in A&E (mins)': 'Integer',
339 |     'Treatment': 'String',
340 |     'Gender': 'String',
341 |     'Index of Multiple Deprivation Decile': 'Integer',
342 |     'Hospital ID': 'String',
343 |     'Arrival Date': 'String',
344 |     'Arrival hour range': 'String',
345 |     'Age bracket': 'String'
346 | }
347 | 
348 | attribute_is_categorical = {
349 |     'Hospital ID': True,
350 |     'Time in A&E (mins)': False,
351 |     'Treatment': True,
352 |     'Gender': True,
353 |     'Index of Multiple Deprivation Decile': False,
354 |     'Arrival Date': True,
355 |     'Arrival hour range': True,
356 |     'Age bracket': True
357 | }
358 | ```
359 | 
360 | We'll be feeding these into a `DataDescriber` instance.
361 | 
362 | ```python
363 | describer = DataDescriber()
364 | ```
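The constructor also takes a few tuning parameters, which the demo at the bottom of `DataSynthesizer/DataDescriber.py` uses like this (the values here are just illustrative; in the tutorial we stick with the defaults):

```python
# 'fd' picks histogram bin widths using the Freedman-Diaconis rule;
# category_threshold caps the domain size for an attribute to count as
# categorical; null_values adds extra strings to treat as NA/NaN.
describer = DataDescriber(histogram_bins='fd',
                          category_threshold=20,
                          null_values='')
```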
365 | 
366 | Using this `describer` instance, and feeding in the attribute descriptions, we create a description file.
367 | 
368 | ```python
369 | describer.describe_dataset_in_random_mode(
370 |     filepaths.hospital_ae_data_deidentify,
371 |     attribute_to_datatype=attribute_to_datatype,
372 |     attribute_to_is_categorical=attribute_is_categorical)
373 | describer.save_dataset_description_to_file(
374 |     filepaths.hospital_ae_description_random)
375 | ```
376 | 
377 | You can see an example description file in `data/hospital_ae_description_random.json`.
378 | 
379 | #### Data Generation: Random
380 | 
381 | Next, generate the random data. We'll generate the same number of rows as the original data, but, importantly, we could generate many more or fewer if we wanted to.
382 | 
383 | ```python
384 | num_rows = len(hospital_ae_df)
385 | ```
386 | 
387 | Now generate the random data.
388 | 
389 | ```python
390 | generator = DataGenerator()
391 | generator.generate_dataset_in_random_mode(
392 |     num_rows, filepaths.hospital_ae_description_random)
393 | generator.save_synthetic_data(filepaths.hospital_ae_data_synthetic_random)
394 | ```
395 | 
396 | You can view this random synthetic data in the file `data/hospital_ae_data_synthetic_random.csv`.
397 | 
398 | #### Attribute Comparison: Random
399 | 
400 | We'll compare each attribute in the original data to the synthetic data by generating histogram plots using the `ModelInspector` class.
401 | 
402 | `figure_filepath` is just a variable holding where we'll write the plot out to.
403 | 
404 | ```python
405 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_random)
406 | 
407 | # Read attribute description from the dataset description file.
408 | attribute_description = read_json_file(
409 |     filepaths.hospital_ae_description_random)['attribute_description']
410 | 
411 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
412 | 
413 | for attribute in synthetic_df.columns:
414 |     inspector.compare_histograms(attribute, figure_filepath)
415 | ```
416 | 
417 | Let's look at the histogram plots now for a few of the attributes. We can see that the generated data is completely random and doesn't contain any information about averages or distributions.
418 | 
419 | *Comparison of ages in original data (left) and random synthetic data (right)*
420 | ![Random mode age bracket histograms](plots/random_Age_bracket.png)
421 | 
422 | *Comparison of hospital attendance in original data (left) and random synthetic data (right)*
423 | ![Random mode hospital ID histograms](plots/random_Hospital_ID.png)
424 | 
425 | *Comparison of arrival date in original data (left) and random synthetic data (right)*
426 | ![Random mode arrival date histograms](plots/random_Arrival_Date.png)
427 | 
428 | You can see more comparison examples in the `/plots` directory.
429 | 
430 | #### Compare pairwise mutual information: Random
431 | 
432 | DataSynthesizer has a function to compare the _mutual information_ between each pair of variables in the dataset and plot them. We'll avoid the mathematical definition of mutual information, but [Scholarpedia notes](http://www.scholarpedia.org/article/Mutual_information) it:
433 | 
434 | > can be thought of as the reduction in uncertainty about one random variable given knowledge of another.
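The scores in the heatmap are normalised, running from 0 (knowing one attribute tells you nothing about the other) to 1 (one attribute fully determines the other). To get a feel for the two extremes, you can try scikit-learn's `normalized_mutual_info_score` (scikit-learn is already in the requirements); this little example is ours, not DataSynthesizer's:

```python
from sklearn.metrics import normalized_mutual_info_score

# Identical labellings share all of their information...
print(normalized_mutual_info_score([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0
# ...while independent labellings share none of it.
print(normalized_mutual_info_score([1, 1, 2, 2], [1, 2, 1, 2]))  # 0.0
```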
435 | 
436 | To create this plot, we run:
437 | 
438 | ```python
439 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_random)
440 | 
441 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
442 | inspector.mutual_information_heatmap(figure_filepath)
443 | ```
444 | 
445 | We can see the original, private data has a correlation between `Age bracket` and `Time in A&E (mins)`. Not surprisingly, this correlation is lost when we generate our random data.
446 | 
447 | *Mutual Information Heatmap in original data (left) and random synthetic data (right)*
448 | ![Random mode mutual information heatmap](plots/mutual_information_heatmap_random.png)
449 | 
450 | ### Independent attribute mode
451 | 
452 | What if we had a use case where we wanted to analyse, say, the median ages or the hospital usage in the synthetic data? In this case we'd use independent attribute mode.
453 | 
454 | #### Data Description: Independent
455 | 
456 | ```python
457 | describer.describe_dataset_in_independent_attribute_mode(
458 |     filepaths.hospital_ae_data_deidentify,
459 |     attribute_to_datatype=attribute_to_datatype, attribute_to_is_categorical=attribute_is_categorical)
460 | describer.save_dataset_description_to_file(
461 |     filepaths.hospital_ae_description_independent)
462 | ```
463 | 
464 | #### Data Generation: Independent
465 | 
466 | Next we generate the data, which keeps the distribution of each column but not the correlations between columns.
467 | 
468 | ```python
469 | generator = DataGenerator()
470 | generator.generate_dataset_in_independent_mode(
471 |     num_rows, filepaths.hospital_ae_description_independent)
472 | generator.save_synthetic_data(
473 |     filepaths.hospital_ae_data_synthetic_independent)
474 | ```
475 | 
476 | #### Attribute Comparison: Independent
477 | 
478 | Comparing the attribute histograms, we see that independent mode captures the distributions pretty accurately. The synthetic data is _mostly_ similar, but not exactly the same.
479 | 
480 | ```python
481 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_independent)
482 | attribute_description = read_json_file(
483 |     filepaths.hospital_ae_description_independent)['attribute_description']
484 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
485 | 
486 | for attribute in synthetic_df.columns:
487 |     inspector.compare_histograms(attribute, figure_filepath)
488 | ```
489 | 
490 | *Comparison of ages in original data (left) and independent synthetic data (right)*
491 | ![Independent mode age bracket histograms](plots/independent_Age_bracket.png)
492 | 
493 | *Comparison of hospital attendance in original data (left) and independent synthetic data (right)*
494 | ![Independent mode hospital ID histograms](plots/independent_Hospital_ID.png)
495 | 
496 | *Comparison of arrival date in original data (left) and independent synthetic data (right)*
497 | ![Independent mode arrival date histograms](plots/independent_Arrival_Date.png)
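Since the use case we floated was analysing summary statistics, a quick spot check (ours, not part of the tutorial scripts) shows how close the synthetic data gets:

```python
# The medians of a numeric column should roughly agree...
print(hospital_ae_df['Time in A&E (mins)'].median())
print(synthetic_df['Time in A&E (mins)'].median())
# ...and so should the share of each category.
print(hospital_ae_df['Gender'].value_counts(normalize=True))
print(synthetic_df['Gender'].value_counts(normalize=True))
```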
498 | 
499 | #### Compare pairwise mutual information: Independent
500 | 
501 | ```python
502 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_independent)
503 | 
504 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
505 | inspector.mutual_information_heatmap(figure_filepath)
506 | ```
507 | 
508 | We can see the independent data also does not contain any of the attribute correlations from the original data.
509 | 
510 | *Mutual Information Heatmap in original data (left) and independent synthetic data (right)*
511 | ![Independent mode mutual information heatmap](plots/mutual_information_heatmap_independent.png)
512 | 
513 | ### Correlated attribute mode
514 | 
515 | If we want to capture correlations between variables, for instance that patient age is related to waiting times, we'll need correlated data. To do this we use *correlated mode*.
516 | 
517 | #### Data Description: Correlated
518 | 
519 | There are a couple of parameters that are different here, so we'll explain them.
520 | 
521 | `epsilon` is DataSynthesizer's differential privacy parameter, which controls how much noise is added to the data - the lower the value, the more noise and therefore the more privacy. We're not using differential privacy, so we can turn it off by setting it to zero.
522 | 
523 | `k` is the maximum number of parents in the Bayesian network, i.e., the maximum number of incoming edges. For simplicity's sake, we're going to set this to 1, saying that only one other variable can influence any given variable.
524 | 
525 | ```python
526 | describer.describe_dataset_in_correlated_attribute_mode(
527 |     dataset_file=filepaths.hospital_ae_data_deidentify,
528 |     epsilon=0,
529 |     k=1,
530 |     attribute_to_datatype=attribute_to_datatype,
531 |     attribute_to_is_categorical=attribute_is_categorical)
532 | 
533 | describer.save_dataset_description_to_file(filepaths.hospital_ae_description_correlated)
534 | ```
535 | 
536 | #### Data Generation: Correlated
537 | 
538 | ```python
539 | generator.generate_dataset_in_correlated_attribute_mode(
540 |     num_rows, filepaths.hospital_ae_description_correlated)
541 | generator.save_synthetic_data(filepaths.hospital_ae_data_synthetic_correlated)
542 | ```
543 | 
544 | #### Attribute Comparison: Correlated
545 | 
546 | We can see correlated mode also keeps similar distributions. At a glance the histograms look identical to the originals, but look closely and you'll spot small differences in the distributions.
547 | 
548 | *Comparison of ages in original data (left) and correlated synthetic data (right)*
549 | ![Correlated mode age bracket histograms](plots/correlated_Age_bracket.png)
550 | 
551 | *Comparison of hospital attendance in original data (left) and correlated synthetic data (right)*
552 | ![Correlated mode hospital ID histograms](plots/correlated_Hospital_ID.png)
553 | 
554 | *Comparison of arrival date in original data (left) and correlated synthetic data (right)*
555 | ![Correlated mode arrival date histograms](plots/correlated_Arrival_Date.png)
556 | 
557 | #### Compare pairwise mutual information: Correlated
558 | 
559 | Finally, we can see that in correlated mode we manage to capture the correlation between `Age bracket` and `Time in A&E (mins)`.
560 | 
561 | ```python
562 | synthetic_df = pd.read_csv(filepaths.hospital_ae_data_synthetic_correlated)
563 | attribute_description = read_json_file(filepaths.hospital_ae_description_correlated)['attribute_description']
564 | inspector = ModelInspector(hospital_ae_df, synthetic_df, attribute_description)
565 | inspector.mutual_information_heatmap(figure_filepath)
566 | ```
567 | 
568 | *Mutual Information Heatmap in original data (left) and correlated synthetic data (right)*
569 | ![Correlated mode mutual information heatmap](plots/mutual_information_heatmap_correlated.png)
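As with the earlier stages, the whole synthesis walkthrough is scripted in `tutorial/synthesise.py`, so running it from the project root should reproduce all three datasets and the plots in one go:

```bash
python tutorial/synthesise.py
```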
570 | 
571 | ---
572 | 
573 | ### Wrap-up
574 | 
575 | This is where our tutorial ends. But there is much, much more to the world of anonymisation and synthetic data. Please check out more in the references below.
576 | 
577 | If you have any queries, comments or improvements about this tutorial, please do get in touch. You can send me a message through Github or open an Issue.
578 | 
579 | ### References
580 | 
581 | - [Exploring methods for synthetic A&E data](https://odileeds.org/blog/2019-01-24-exploring-methods-for-creating-synthetic-a-e-data) - Jonathan Pearson, NHS England, with Open Data Institute Leeds.
582 | - [DataSynthesizer Github repository](https://github.com/DataResponsibly/DataSynthesizer)
583 | - [DataSynthesizer: Privacy-Preserving Synthetic Datasets](https://faculty.washington.edu/billhowe/publications/pdfs/ping17datasynthesizer.pdf) - Haoyue Ping, Julia Stoyanovich and Bill Howe, 2017.
584 | - [ONS methodology working paper series number 16 - Synthetic data pilot](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot) - Office for National Statistics, 2019.
585 | - [Wrap-up blog post](http://theodi.org) (not yet published) from our anonymisation project, which talks about what we learned and other outputs we created.
586 | - We referred to the [UK Anonymisation Network's Decision Making Framework](https://ukanon.net/ukan-resources/ukan-decision-making-framework/) a lot during our work. It's pretty involved, but it's excellent as a deep-dive resource on anonymisation.
587 | 
--------------------------------------------------------------------------------