├── README.md
├── LICENSE
├── common_utils.py
├── fairness_example.ipynb
├── disparate_impact_remover.ipynb
├── reweighing_preproc.ipynb
└── datasets_analysis.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # AI Fairness
2 |
3 | This GitHub repository provides libraries and tutorials on the topic of AI fairness, and aims to empower developers and researchers to create AI systems that are fair and unbiased for all individuals. Inside, you will find a collection of libraries and tools that can be used to identify and mitigate sources of bias in machine learning models, together with tutorials and examples that demonstrate best practices for implementing fairness in AI, from data preprocessing to model evaluation.
4 |
5 |
6 | 
7 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Ali Alameer
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/common_utils.py:
--------------------------------------------------------------------------------
1 | # Metrics function
2 | from collections import OrderedDict
3 | from aif360.metrics import ClassificationMetric
4 |
5 | def compute_metrics(dataset_true, dataset_pred,
6 |                     unprivileged_groups, privileged_groups,
7 |                     disp=True):
8 |     """ Compute the key metrics """
9 |     classified_metric_pred = ClassificationMetric(dataset_true,
10 |                                                   dataset_pred,
11 |                                                   unprivileged_groups=unprivileged_groups,
12 |                                                   privileged_groups=privileged_groups)
13 |     metrics = OrderedDict()
14 |     metrics["Balanced accuracy"] = 0.5*(classified_metric_pred.true_positive_rate() +
15 |                                         classified_metric_pred.true_negative_rate())
16 |     metrics["Statistical parity difference"] = classified_metric_pred.statistical_parity_difference()
17 |     metrics["Disparate impact"] = classified_metric_pred.disparate_impact()
18 |     metrics["Average odds difference"] = classified_metric_pred.average_odds_difference()
19 |     metrics["Equal opportunity difference"] = classified_metric_pred.equal_opportunity_difference()
20 |     metrics["Theil index"] = classified_metric_pred.theil_index()
21 |
22 |     if disp:
23 |         for k in metrics:
24 |             print("%s = %.4f" % (k, metrics[k]))
25 |
26 |     return metrics
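27 |
28 |
29 | # ---------------------------------------------------------------------------
30 | # Hedged usage sketch (added for illustration, not part of the original
31 | # module): one way compute_metrics could be called, using a tiny synthetic
32 | # dataset so the example runs without downloading anything.  The column
33 | # names ('sex', 'label') and all values below are made up for demonstration.
34 | # ---------------------------------------------------------------------------
35 | if __name__ == "__main__":
36 |     import pandas as pd
37 |     from aif360.datasets import BinaryLabelDataset
38 |
39 |     # Ground-truth labels for eight instances, four per group (sex=1 privileged).
40 |     df_true = pd.DataFrame({"sex":   [1, 1, 1, 1, 0, 0, 0, 0],
41 |                             "label": [1, 1, 1, 0, 1, 0, 0, 0]})
42 |     # A hypothetical classifier's predictions for the same instances.
43 |     df_pred = df_true.copy()
44 |     df_pred["label"] = [1, 1, 0, 0, 1, 1, 0, 0]
45 |
46 |     ds_true = BinaryLabelDataset(df=df_true, label_names=["label"],
47 |                                  protected_attribute_names=["sex"])
48 |     ds_pred = BinaryLabelDataset(df=df_pred, label_names=["label"],
49 |                                  protected_attribute_names=["sex"])
50 |
51 |     # Prints balanced accuracy plus the group-fairness metrics defined above.
52 |     compute_metrics(ds_true, ds_pred,
53 |                     unprivileged_groups=[{"sex": 0}],
54 |                     privileged_groups=[{"sex": 1}],
55 |                     disp=True)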
--------------------------------------------------------------------------------
/fairness_example.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"attachments":{},"cell_type":"markdown","metadata":{},"source":["# Detecting and mitigating age bias on credit decisions \n","\n","### Biases and Machine Learning\n","A machine learning model makes predictions of an outcome for a particular instance. (Given an instance of a loan application, predict if the applicant will repay the loan.) The model makes these predictions based on a training dataset, where many other instances (other loan applications) and actual outcomes (whether they repaid) are provided. Thus, a machine learning algorithm will attempt to find patterns, or generalizations, in the training dataset to use when a prediction for a new instance is needed. (For example, one pattern it might discover is \"if a person has salary > USD 40K and has outstanding debt < USD 5, they will repay the loan\".) In many domains this technique, called supervised machine learning, has worked very well.\n","\n","However, sometimes the patterns that are found may not be desirable or may even be illegal. For example, a loan repay model may determine that age plays a significant role in the prediction of repayment because the training dataset happened to have better repayment for one age group than for another. This raises two problems: 1) the training dataset may not be representative of the true population of people of all age groups, and 2) even if it is representative, it is illegal to base any decision on a applicant's age, regardless of whether this is a good prediction based on historical data.\n","\n","AI Fairness 360 is designed to help address this problem with _fairness metrics_ and _bias mitigators_. Fairness metrics can be used to check for bias in machine learning workflows. Bias mitigators can be used to overcome bias in the workflow to produce a more fair outcome. \n","\n","The loan scenario describes an intuitive example of illegal bias. However, not all undesirable bias in machine learning is illegal it may also exist in more subtle ways. For example, a loan company may want a diverse portfolio of customers across all income levels, and thus, will deem it undesirable if they are making more loans to high income levels over low income levels. Although this is not illegal or unethical, it is undesirable for the company's strategy.\n","\n","As these two examples illustrate, a bias detection and/or mitigation toolkit needs to be tailored to the particular bias of interest. More specifically, it needs to know the attribute or attributes, called _protected attributes_, that are of interest: race is one example of a _protected attribute_ and age is a second.\n","\n","### The Machine Learning Workflow\n","To understand how bias can enter a machine learning model, we first review the basics of how a model is created in a supervised machine learning process. \n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","\n","First, the process starts with a _training dataset_, which contains a sequence of instances, where each instance has two components: the features and the correct prediction for those features. Next, a machine learning algorithm is trained on this training dataset to produce a machine learning model. This generated model can be used to make a prediction when given a new instance. 
A second dataset with features and correct predictions, called a _test dataset_, is used to assess the accuracy of the model.\n","Since this test dataset is the same format as the training dataset, a set of instances of features and prediction pairs, often these two datasets derive from the same initial dataset. A random partitioning algorithm is used to split the initial dataset into training and test datasets.\n","\n","Bias can enter the system in any of the three steps above. The training data set may be biased in that its outcomes may be biased towards particular kinds of instances. The algorithm that creates the model may be biased in that it may generate models that are weighted towards particular features in the input. The test data set may be biased in that it has expectations on correct answers that may be biased. These three points in the machine learning process represent points for testing and mitigating bias. In AI Fairness 360 codebase, these points are called _pre-processing_, _in-processing_, and _post-processing_. \n","\n","### AI Fairness 360\n","We are now ready to utilize AI Fairness 360 (`aif360`) to detect and mitigate bias. We will use the German credit dataset, splitting it into a training and test dataset. We will look for bias in the creation of a machine learning model to predict if an applicant should be given credit based on various features from a typical credit application. The protected attribute will be \"Age\", with \"1\" (older than or equal to 25) and \"0\" (younger than 25) being the values for the privileged and unprivileged groups, respectively.\n","Here, we will check for bias in the initial training data, mitigate the bias, and recheck. \n","\n","Here are the steps involved\n","#### Step 1: Write import statements\n","#### Step 2: Set bias detection options, load dataset, and split between train and test\n","#### Step 3: Compute fairness metric on original training dataset\n","#### Step 4: Mitigate bias by transforming the original dataset\n","#### Step 5: Compute fairness metric on transformed training dataset\n","\n","### Step 1 Import Statements\n","As with any python program, the first step will be to import the necessary packages. Below we import several components from the `aif360` package. We import the GermanDataset, metrics to check for bias, and classes related to the algorithm we will use to mitigate bias."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["
"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!pip install aif360"]},{"cell_type":"code","execution_count":1,"metadata":{"tags":["parameters"]},"outputs":[],"source":["# Load all necessary packages\n","import sys\n","sys.path.insert(1, \"../\") \n","\n","import numpy as np\n","np.random.seed(0)\n","\n","from aif360.datasets import GermanDataset\n","from aif360.metrics import BinaryLabelDatasetMetric\n","from aif360.algorithms.preprocessing import Reweighing\n","\n","from IPython.display import Markdown, display"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# preparing paths to download the datasets \n","\n","import site\n","import os\n","\n","# Get the dist-packages directory path\n","dist_packages_path = site.getsitepackages()[0]\n","\n","# Print the path\n","print(dist_packages_path)\n","\n","## German credit score dataset\n","\n","german_dir_data = 'aif360/data/raw/german/german.data'\n","german_path_data = os.path.join(dist_packages_path, german_dir_data)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import urllib.request \n","urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data\",german_path_data) "]},{"cell_type":"markdown","metadata":{},"source":["### Step 2 Load dataset, specifying protected attribute, and split dataset into train and test\n","In Step 2 we load the initial dataset, setting the protected attribute to be age. We then splits the original dataset into training and testing datasets. Although we will use only the training dataset in this tutorial, a normal workflow would also use a test dataset for assessing the efficacy (accuracy, fairness, etc.) during the development of a machine learning model. Finally, we set two variables (to be used in Step 3) for the privileged (1) and unprivileged (0) values for the age attribute. These are key inputs for detecting and mitigating bias, which will be Step 3 and Step 4. "]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[],"source":["dataset_orig = GermanDataset(\n"," protected_attribute_names=['age'], # this dataset also contains protected\n"," # attribute for \"sex\" which we do not\n"," # consider in this evaluation\n"," privileged_classes=[lambda x: x >= 25], # age >=25 is considered privileged\n"," features_to_drop=['personal_status', 'sex'] # ignore sex-related attributes\n",")\n","\n","dataset_orig_train, dataset_orig_test = dataset_orig.split([0.7], shuffle=True)\n","\n","privileged_groups = [{'age': 1}]\n","unprivileged_groups = [{'age': 0}]"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# check for protected attribute\n","dataset_orig.protected_attribute_names"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Convert AIF360 dataset to dataframe\n","\n","dataset=dataset_orig.convert_to_dataframe()\n","dataset_pd=dataset[0]"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# View the dataframe created\n","dataset_pd"]},{"cell_type":"markdown","metadata":{},"source":["### Step 3 Compute fairness metric on original training dataset\n","Now that we've identified the protected attribute 'age' and defined privileged and unprivileged values, we can use aif360 to detect bias in the dataset. 
One simple test is to compare the percentage of favorable results for the privileged and unprivileged groups, subtracting the former percentage from the latter. A negative value indicates less favorable outcomes for the unprivileged groups. This is implemented in the method called mean_difference on the BinaryLabelDatasetMetric class. The code below performs this check and displays the output, showing that the difference is -0.169905."]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[{"data":{"text/markdown":["#### Original training dataset"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["Difference in mean outcomes between unprivileged and privileged groups = -0.169905\n"]}],"source":["metric_orig_train = BinaryLabelDatasetMetric(dataset_orig_train, \n"," unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n","display(Markdown(\"#### Original training dataset\"))\n","print(\"Difference in mean outcomes between unprivileged and privileged groups = %f\" % metric_orig_train.mean_difference())"]},{"cell_type":"markdown","metadata":{},"source":["### Step 4 Mitigate bias by transforming the original dataset\n","The previous step showed that the privileged group was getting 17% more positive outcomes in the training dataset. Since this is not desirable, we are going to try to mitigate this bias in the training dataset. As stated above, this is called _pre-processing_ mitigation because it happens before the creation of the model. \n","\n","AI Fairness 360 implements several pre-processing mitigation algorithms. We will choose the Reweighing algorithm [1], which is implemented in the `Reweighing` class in the `aif360.algorithms.preprocessing` package. This algorithm will transform the dataset to have more equity in positive outcomes on the protected attribute for the privileged and unprivileged groups.\n","\n","We then call the fit and transform methods to perform the transformation, producing a newly transformed training dataset (dataset_transf_train).\n","\n","`[1] F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012.`"]},{"cell_type":"code","execution_count":4,"metadata":{"collapsed":true},"outputs":[],"source":["RW = Reweighing(unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n","dataset_transf_train = RW.fit_transform(dataset_orig_train)"]},{"cell_type":"markdown","metadata":{},"source":["### Step 5 Compute fairness metric on transformed dataset\n","Now that we have a transformed dataset, we can check how effective it was in removing bias by using the same metric we used for the original training dataset in Step 3. Once again, we use the function mean_difference in the BinaryLabelDatasetMetric class. We see the mitigation step was very effective, the difference in mean outcomes is now 0.0. 
So we went from a 17% advantage for the privileged group to equality in terms of mean outcome."]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[{"data":{"text/markdown":["#### Transformed training dataset"],"text/plain":[""]},"metadata":{},"output_type":"display_data"},{"name":"stdout","output_type":"stream","text":["Difference in mean outcomes between unprivileged and privileged groups = 0.000000\n"]}],"source":["metric_transf_train = BinaryLabelDatasetMetric(dataset_transf_train, \n"," unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n","display(Markdown(\"#### Transformed training dataset\"))\n","print(\"Difference in mean outcomes between unprivileged and privileged groups = %f\" % metric_transf_train.mean_difference())"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Summary\n","The purpose of this tutorial is to give a new user to bias detection and mitigation a gentle introduction to some of the functionality of AI Fairness 360. A more complete use case would take the next step and see how the transformed dataset impacts the accuracy and fairness of a trained model.\n","\n","There are many metrics one can use to detect the presence of bias. AI Fairness 360 provides many of them for your use. Likewise, there are many different bias mitigation algorithms one can employ, many of which are in AI Fairness 360. "]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":2},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython2","version":"2.7.11"}},"nbformat":4,"nbformat_minor":2}
2 |
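For readers who want to see exactly what Step 3's `mean_difference()` reports, here is a small numpy sketch (not a cell from the notebook) that recomputes the statistic directly from a `BinaryLabelDataset`: the favorable-outcome rate of the unprivileged group minus that of the privileged group. The helper name `mean_difference_by_hand` is ours; `dataset_orig_train` is the training split created in Step 2, and instance weights (all 1.0 for the raw data) are ignored.

```python
import numpy as np

def mean_difference_by_hand(ds, prot_attr="age", unpriv=0.0, priv=1.0):
    """P(favorable | unprivileged) - P(favorable | privileged), ignoring instance weights."""
    idx = ds.protected_attribute_names.index(prot_attr)
    group = ds.protected_attributes[:, idx]
    fav = (ds.labels.ravel() == ds.favorable_label)
    return fav[group == unpriv].mean() - fav[group == priv].mean()

# Should agree (up to floating point) with metric_orig_train.mean_difference() in Step 3.
print(mean_difference_by_hand(dataset_orig_train))
```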
--------------------------------------------------------------------------------
/disparate_impact_remover.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "id": "K_swytGulTCT"
17 | },
18 | "source": [
19 | "### This notebook demonstrates the ability of the DisparateImpactRemover algorithm.\n",
20 | "The algorithm corrects for imbalanced selection rates between unprivileged and privileged groups at various levels of repair. It follows the guidelines set forth by [1] for training the algorithm and classifier and uses the AdultDataset as an example."
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {
27 | "id": "b5uHITCPllxK"
28 | },
29 | "outputs": [],
30 | "source": [
31 | "!pip install 'aif360[all]'\n",
32 | "!wget https://raw.githubusercontent.com/Ali-Alameer/AI_fairness/main/common_utils.py"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "# preparing paths to download the datasets \n",
42 | "\n",
43 | "import site\n",
44 | "import os\n",
45 | "\n",
46 | "# Get the dist-packages directory path\n",
47 | "dist_packages_path = site.getsitepackages()[0]\n",
48 | "\n",
49 | "# Print the path\n",
50 | "print(dist_packages_path)\n",
51 | "\n",
52 | "## Adult dataset\n",
53 | "\n",
54 | "aif360_dir_data = 'aif360/data/raw/adult/adult.data'\n",
55 | "aif360_dir_test = 'aif360/data/raw/adult/adult.test'\n",
56 | "aif360_dir_names = 'aif360/data/raw/adult/adult.names'\n",
57 | "\n",
58 | "\n",
59 | "full_path_data = os.path.join(dist_packages_path, aif360_dir_data)\n",
60 | "full_path_test = os.path.join(dist_packages_path, aif360_dir_test)\n",
61 | "full_path_names = os.path.join(dist_packages_path, aif360_dir_names)\n",
62 | "\n"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "id": "KOxQFeRgmXJJ"
70 | },
71 | "outputs": [],
72 | "source": [
73 | "import urllib.request \n",
74 | "# For Adult dataset\n",
75 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",full_path_data) \n",
76 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\",full_path_test) \n",
77 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names\",full_path_names) \n"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 3,
83 | "metadata": {
84 | "id": "7Zk-dRgklTCV"
85 | },
86 | "outputs": [],
87 | "source": [
88 | "from __future__ import absolute_import\n",
89 | "from __future__ import division\n",
90 | "from __future__ import print_function\n",
91 | "from __future__ import unicode_literals\n",
92 | "\n",
93 | "import matplotlib.pyplot as plt\n",
94 | "\n",
95 | "import sys\n",
96 | "sys.path.append(\"../\")\n",
97 | "import warnings\n",
98 | "\n",
99 | "import numpy as np\n",
100 | "from tqdm import tqdm\n",
101 | "\n",
102 | "from sklearn.linear_model import LogisticRegression\n",
103 | "from sklearn.svm import SVC as SVM\n",
104 | "from sklearn.preprocessing import MinMaxScaler\n",
105 | "\n",
106 | "from aif360.algorithms.preprocessing import DisparateImpactRemover\n",
107 | "from aif360.datasets import AdultDataset\n",
108 | "from aif360.metrics import BinaryLabelDatasetMetric\n",
109 | "from common_utils import compute_metrics"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 4,
115 | "metadata": {
116 | "id": "xUeN7eYplTCW"
117 | },
118 | "outputs": [],
119 | "source": [
120 | "protected = 'sex'\n",
121 | "ad = AdultDataset(protected_attribute_names=[protected],\n",
122 | " privileged_classes=[['Male']], categorical_features=[],\n",
123 | " features_to_keep=['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week'])"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 5,
129 | "metadata": {
130 | "id": "m9LmxuY_lTCW"
131 | },
132 | "outputs": [],
133 | "source": [
134 | "scaler = MinMaxScaler(copy=False)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 6,
140 | "metadata": {
141 | "id": "YuljMCF5lTCW"
142 | },
143 | "outputs": [],
144 | "source": [
145 | "test, train = ad.split([16281])\n",
146 | "train.features = scaler.fit_transform(train.features)\n",
147 | "test.features = scaler.fit_transform(test.features)\n",
148 | "\n",
149 | "index = train.feature_names.index(protected)"
150 | ]
151 | },
152 | {
153 | "attachments": {},
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "The repair_level parameter in DisparateImpactRemover specifies the amount or intensity of the repair applied to the data. It is typically a value between 0 and 1, where 0 means no repair (i.e., no modification of the data) and 1 means maximum repair (i.e., full modification of the data to eliminate disparate impact). The value of repair_level determines the trade-off between fairness and accuracy in the resulting model. Higher values of repair_level may result in more fairness but potentially lower accuracy, while lower values may result in higher accuracy but less fairness."
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {
164 | "id": "kcUbHdaWlTCW"
165 | },
166 | "outputs": [],
167 | "source": [
168 | "acc = []\n",
169 | "dis_impact = []\n",
170 | "ave_odds_diff = []\n",
171 | "Statistical_parity_difference = []\n",
172 | "Equal_opportunity_difference = []\n",
173 | "Theil_index = []\n",
174 | "repair_level_all = []\n",
175 | "\n",
176 | "for level in tqdm(np.linspace(0., 1., 11)):\n",
177 | " di = DisparateImpactRemover(repair_level=level)\n",
178 | " train_repd = di.fit_transform(train)\n",
179 | " test_repd = di.fit_transform(test)\n",
180 | " \n",
181 | " X_tr = np.delete(train_repd.features, index, axis=1)\n",
182 | " X_te = np.delete(test_repd.features, index, axis=1)\n",
183 | " y_tr = train_repd.labels.ravel()\n",
184 | " \n",
185 | " lmod = LogisticRegression(class_weight='balanced', solver='liblinear')\n",
186 | " lmod.fit(X_tr, y_tr)\n",
187 | " \n",
188 | " test_repd_pred = test_repd.copy()\n",
189 | " test_repd_pred.labels = lmod.predict(X_te)\n",
190 | "\n",
191 | " p = [{protected: 1}]\n",
192 | " u = [{protected: 0}]\n",
193 | " cm = BinaryLabelDatasetMetric(test_repd_pred, privileged_groups=p, unprivileged_groups=u)\n",
194 | " print(\"Repair Level = %f\" % level)\n",
195 | " dis_impact.append(cm.disparate_impact())\n",
196 | " metric_test = compute_metrics(test_repd, test_repd_pred, \n",
197 | " unprivileged_groups=u, privileged_groups=p,\n",
198 | " disp = True)\n",
199 | " acc.append(metric_test[\"Balanced accuracy\"])\n",
200 | " ave_odds_diff.append(metric_test[\"Average odds difference\"])\n",
201 | " Statistical_parity_difference.append(metric_test[\"Statistical parity difference\"])\n",
202 | " Equal_opportunity_difference.append(metric_test[\"Equal opportunity difference\"])\n",
203 | " Theil_index.append(metric_test[\"Theil index\"])\n",
204 | " repair_level_all.append(level)"
205 | ]
206 | },
207 | {
208 | "attachments": {},
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "\n",
213 | "\n",
214 | "These fairness metrics provide quantitative measures to assess the presence of bias and fairness in machine learning systems, and can be used to evaluate the performance of fairness-aware machine learning algorithms and techniques.\n",
215 | "\n",
216 | "- Statistical Parity Difference: This metric measures the difference in the rate of favorable outcomes between the unprivileged and privileged groups. The ideal value is 0, and fairness is indicated by values between -0.1 and 0.1.\n",
217 | "\n",
218 | "- Disparate Impact: This metric computes the ratio of the rate of favorable outcomes between the unprivileged and privileged groups. The ideal value is 1.0, and fairness is indicated by values between 0.8 and 1.25. A value < 1 implies higher benefit for the privileged group, while a value > 1 implies higher benefit for the unprivileged group.\n",
219 | "\n",
220 | "- Average Odds Difference: This metric measures the average difference in false positive rate and true positive rate between the unprivileged and privileged groups. The ideal value is 0, and fairness is indicated by values between -0.1 and 0.1. A value < 0 implies higher benefit for the privileged group, while a value > 0 implies higher benefit for the unprivileged group.\n",
221 | "\n",
222 | "- Equal Opportunity Difference: This metric computes the difference in true positive rates between the unprivileged and privileged groups. The ideal value is 0, and fairness is indicated by values between -0.1 and 0.1. A value < 0 implies higher benefit for the privileged group, while a value > 0 implies higher benefit for the unprivileged group.\n",
223 | "\n",
224 | "- Theil Index: This metric measures the inequality in benefit allocation for individuals, computed as the generalized entropy of benefit with alpha = 1. The ideal value is 0, with lower scores indicating better fairness and higher scores indicating more problematic fairness."
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "%matplotlib inline\n",
234 | "# Create figure and axes\n",
235 | "fig, ax1 = plt.subplots()\n",
236 | "\n",
237 | "# Plotting ave_odds_diff against repair_level_all on the left-hand y-axis\n",
238 | "ax1.plot(repair_level_all, ave_odds_diff, label='Average Odds Difference')\n",
239 | "ax1.plot(repair_level_all, Statistical_parity_difference, label='Statistical Parity Difference')\n",
240 | "ax1.plot(repair_level_all, Equal_opportunity_difference, label='Equal Opportunity Difference')\n",
241 | "ax1.plot(repair_level_all, Theil_index, label='Theil Index')\n",
242 | "ax1.plot(repair_level_all, dis_impact, label='Disparate Impact')\n",
243 | "ax1.set_xlabel('Repair Level')\n",
244 | "ax1.set_ylabel('Fairness Metrics')\n",
245 | "\n",
246 | "# Create a twin axes sharing the x-axis with ax1\n",
247 | "ax2 = ax1.twinx()\n",
248 | "\n",
249 | "# Plotting acc against repair_level_all on the right-hand y-axis\n",
250 | "ax2.plot(repair_level_all, acc, label='Accuracy')\n",
251 | "ax2.set_ylabel('Accuracy')\n",
252 | "\n",
253 | "# Add legend\n",
254 | "lines = ax1.get_lines() + ax2.get_lines()\n",
255 | "ax1.legend(lines, [line.get_label() for line in lines])\n",
256 | "\n",
257 | "# Show the plot\n",
258 | "plt.show()"
259 | ]
260 | },
261 | {
262 | "attachments": {},
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "After correcting for disparate impact, we observe that the most improved metric on the graph was actually the disparate impact"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {},
273 | "outputs": [],
274 | "source": [
275 | "# Get the number of features (columns) in the ndarray\n",
276 | "num_features = train.features.shape[1]\n",
277 | "\n",
278 | "# List of feature names\n",
279 | "feature_names = train.feature_names\n",
280 | "\n",
281 | "# Set up the subplots for the histograms\n",
282 | "fig, axes = plt.subplots(1, num_features, figsize=(12, 4))\n",
283 | "\n",
284 | "# Iterate over each feature and plot the histograms\n",
285 | "for i in range(num_features):\n",
286 | " # Plot original feature histogram\n",
287 | " axes[i].hist(train.features[:, i], bins=10, alpha=0.5, color='blue', label='Original')\n",
288 | " # Plot transformed feature histogram\n",
289 | " axes[i].hist(train_repd.features[:, i], bins=10, alpha=0.5, color='orange', label='Transformed')\n",
290 | " axes[i].set_xlabel(feature_names[i]) # Set x-axis label to feature name\n",
291 | " axes[i].set_ylabel('Frequency')\n",
292 | " axes[i].set_title(f'{feature_names[i]}') # Set title to feature name histogram\n",
293 | " axes[i].legend()\n",
294 | "plt.suptitle('Original vs Transformed Feature Histograms With Disparate Impact Remover at Repair Level 1', fontsize=16, y=1.05)\n",
295 | "\n",
296 | "# Adjust the layout and display the plot\n",
297 | "plt.tight_layout()\n",
298 | "plt.show()"
299 | ]
300 | },
301 | {
302 | "attachments": {},
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "In this visualisation, we showcase the transformation of features using the Disparate Impact Remover technique with a repair level of 1, which represents the maximum repair level. We utilise histograms of features to illustrate how and where the transformation is applied."
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {
313 | "id": "Hbmz3H3VlTCX"
314 | },
315 | "outputs": [],
316 | "source": [
317 | "plt.plot(np.linspace(0, 1, 11), dis_impact, marker='o')\n",
318 | "plt.plot([0, 1], [1, 1], 'g')\n",
319 | "plt.plot([0, 1], [0.8, 0.8], 'r')\n",
320 | "plt.ylim([0.4, 1.2])\n",
321 | "plt.ylabel('Disparate Impact (DI)')\n",
322 | "plt.xlabel('repair level')\n",
323 | "plt.show()"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {
329 | "id": "15oJkoXwlTCX"
330 | },
331 | "source": [
332 | " References:\n",
333 | " .. [1] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and\n",
334 | " S. Venkatasubramanian, \"Certifying and removing disparate impact.\"\n",
335 | " ACM SIGKDD International Conference on Knowledge Discovery and Data\n",
336 | " Mining, 2015."
337 | ]
338 | }
339 | ],
340 | "metadata": {
341 | "colab": {
342 | "include_colab_link": true,
343 | "provenance": []
344 | },
345 | "kernelspec": {
346 | "display_name": "Python 3",
347 | "language": "python",
348 | "name": "python3"
349 | },
350 | "language_info": {
351 | "codemirror_mode": {
352 | "name": "ipython",
353 | "version": 3
354 | },
355 | "file_extension": ".py",
356 | "mimetype": "text/x-python",
357 | "name": "python",
358 | "nbconvert_exporter": "python",
359 | "pygments_lexer": "ipython3",
360 | "version": "3.6.7"
361 | }
362 | },
363 | "nbformat": 4,
364 | "nbformat_minor": 0
365 | }
366 |
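To make the metric definitions in the notebook's markdown concrete, here is a self-contained numpy sketch (an illustration added here, not a cell from the notebook) that computes statistical parity difference, disparate impact, equal opportunity difference and average odds difference from raw label/prediction/group arrays. The toy arrays are invented; in the notebook the same quantities come from `ClassificationMetric` via `compute_metrics`.

```python
import numpy as np

# Toy arrays: 1 = favorable outcome, group 1 = privileged (e.g. 'sex' == Male in the notebook).
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0])
group  = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

def group_rates(y_true, y_pred, mask):
    """Selection rate, true-positive rate and false-positive rate within one group."""
    sel = y_pred[mask].mean()
    tpr = y_pred[mask & (y_true == 1)].mean()
    fpr = y_pred[mask & (y_true == 0)].mean()
    return sel, tpr, fpr

sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group == 0)   # unprivileged
sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group == 1)   # privileged

print("Statistical parity difference:", sel_u - sel_p)                              # ideal 0
print("Disparate impact:            ", sel_u / sel_p)                               # ideal 1
print("Equal opportunity difference:", tpr_u - tpr_p)                               # ideal 0
print("Average odds difference:     ", 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)))   # ideal 0
```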
--------------------------------------------------------------------------------
/reweighing_preproc.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":["#### This notebook demonstrates the use of a reweighing pre-processing algorithm for bias mitigation\n"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["
"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["!pip install 'aif360[all]'\n","!wget https://raw.githubusercontent.com/Trusted-AI/AIF360/master/examples/common_utils.py"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# preparing paths to download the datasets \n","\n","import site\n","import os\n","\n","# Get the dist-packages directory path\n","dist_packages_path = site.getsitepackages()[0]\n","\n","# Print the path\n","print(dist_packages_path)\n","\n","## Adult dataset\n","\n","aif360_dir_names = 'aif360/data/raw/adult/adult.names'\n","aif360_dir_test = 'aif360/data/raw/adult/adult.test'\n","aif360_dir_data = 'aif360/data/raw/adult/adult.data'\n","full_path_names = os.path.join(dist_packages_path, aif360_dir_names)\n","full_path_test = os.path.join(dist_packages_path, aif360_dir_test)\n","full_path_data = os.path.join(dist_packages_path, aif360_dir_data)\n","\n","## German credit score dataset\n","\n","german_dir_data = 'aif360/data/raw/german/german.data'\n","german_dir_docs = 'aif360/data/raw/german/german.doc'\n","\n","german_path_data = os.path.join(dist_packages_path, german_dir_data)\n","german_path_docs = os.path.join(dist_packages_path, german_dir_docs)\n","\n","## Compas dataset \n","\n","compas_dir_csv = 'aif360/data/raw/compas/compas-scores-two-years.csv'\n","compas_path_csv = os.path.join(dist_packages_path, compas_dir_csv)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import urllib.request \n","# For Adult dataset\n","urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",full_path_data) \n","urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\",full_path_test) \n","urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names\",full_path_names) \n","\n","# # For German Dataset\n","urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data\",german_path_data) \n","urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc\",german_path_docs)\n","\n","# # For Compas Dataset\n","urllib.request.urlretrieve(\"https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv\",compas_path_csv) "]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["%matplotlib inline\n","# Load all necessary packages\n","import sys\n","sys.path.append(\"../\")\n","import numpy as np\n","from tqdm import tqdm\n","\n","from aif360.datasets import BinaryLabelDataset\n","from aif360.datasets import AdultDataset, GermanDataset, CompasDataset\n","from aif360.metrics import BinaryLabelDatasetMetric\n","from aif360.metrics import ClassificationMetric\n","from aif360.algorithms.preprocessing.reweighing import Reweighing\n","from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions\\\n"," import load_preproc_data_adult, load_preproc_data_german, load_preproc_data_compas\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.preprocessing import StandardScaler\n","from sklearn.metrics import accuracy_score\n","\n","from IPython.display import Markdown, display\n","import matplotlib.pyplot as plt\n","\n","from common_utils import compute_metrics"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["\n","# Convert AIF360 
Adult dataset to dataframe\n","dataset_orig_adult = load_preproc_data_adult(['sex'])\n","dataset_orig_adult1=dataset_orig_adult.convert_to_dataframe()\n","dataset_orig_adult2=dataset_orig_adult1[0]"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# View the dataframe created\n","dataset_orig_adult2"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore income binary in regards to sex distribution in adult dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## sex (privileged: Male, unprivileged: Female) \n","## Income binary( privileged: > $50K income, unprivileged: <= $50k income)\n","import seaborn as sns\n","sns.countplot(x=\"Income Binary\", hue= 'sex', data = dataset_orig_adult2)\n","plt.title('Income Distribution for Adult Dataset')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore sex in adult dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## sex (privileged: Male, unprivileged: Female) \n","sns.countplot(x=\"sex\", data = dataset_orig_adult2)\n","plt.title('Sex Distribution for Adult Dataset')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore race in adult dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## race (privileged: White, unprivileged: Non-white)\n","sns.countplot(x=\"race\", data = dataset_orig_adult2)\n","plt.title('Race Distribution')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Convert German dataset to dataframe\n","dataset_orig_german = load_preproc_data_german(['sex'])\n","dataset_orig_german2=dataset_orig_german.convert_to_dataframe()\n","dataset_orig_german3=dataset_orig_german2[0]\n","dataset_orig_german3"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore sex in German dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## sex (privileged: Male, unprivileged: Female) \n","sns.countplot(x=\"sex\", data = dataset_orig_german3,palette=['#432371',\"#FAAE7B\"])\n","plt.title('Sex Distribution for German Dataset')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore age in German dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## age (privileged: Older than or Equal to 25 years, unprivileged: Younger than 25 years) \n","sns.countplot(x=\"age\", data = dataset_orig_german3,palette=['#432371',\"#FAAE7B\"])\n","plt.title('Age Distribution for German Dataset')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Convert Compas dataset to dataframe\n","dataset_orig_compas = load_preproc_data_compas(['sex'])\n","dataset_orig_compas2=dataset_orig_compas.convert_to_dataframe()\n","dataset_orig_compas3=dataset_orig_compas2[0]\n","dataset_orig_compas3"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore sex in Compas dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## sex (Privileged: Female, unprivileged: Male) \n","sns.countplot(x=\"sex\", data = dataset_orig_compas3,palette=[\"#9b59b6\", \"#3498db\"])\n","plt.title('Sex Distribution for Compas 
Dataset')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["## explore race in Compas dataset to check class imbalance and distribution\n","## Privileged = 1, unprivileged = 0\n","## race (Privileged: Caucasian, unprivileged: Non-Caucasian).\n","import seaborn as sns\n","sns.countplot(x=\"race\", data = dataset_orig_compas3,palette=[\"#9b59b6\", \"#3498db\"])\n","plt.title('Race Distribution for Compas Dataset')\n","plt.show()"]},{"cell_type":"markdown","metadata":{},"source":["#### Load dataset and set options"]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[],"source":["## import dataset\n","dataset_used = \"adult\" # \"adult\", \"german\", \"compas\"\n","protected_attribute_used = 1 # 1, 2\n","\n","\n","if dataset_used == \"adult\":\n","# dataset_orig = AdultDataset()\n"," if protected_attribute_used == 1:\n"," privileged_groups = [{'sex': 1}]\n"," unprivileged_groups = [{'sex': 0}]\n"," dataset_orig = load_preproc_data_adult(['sex'])\n"," else:\n"," privileged_groups = [{'race': 1}]\n"," unprivileged_groups = [{'race': 0}]\n"," dataset_orig = load_preproc_data_adult(['race'])\n"," \n","elif dataset_used == \"german\":\n","# dataset_orig = GermanDataset()\n"," if protected_attribute_used == 1:\n"," privileged_groups = [{'sex': 1}]\n"," unprivileged_groups = [{'sex': 0}]\n"," dataset_orig = load_preproc_data_german(['sex'])\n"," else:\n"," privileged_groups = [{'age': 1}]\n"," unprivileged_groups = [{'age': 0}]\n"," dataset_orig = load_preproc_data_german(['age'])\n"," \n","elif dataset_used == \"compas\":\n","# dataset_orig = CompasDataset()\n"," if protected_attribute_used == 1:\n"," privileged_groups = [{'sex': 1}]\n"," unprivileged_groups = [{'sex': 0}]\n"," dataset_orig = load_preproc_data_compas(['sex'])\n"," else:\n"," privileged_groups = [{'race': 1}]\n"," unprivileged_groups = [{'race': 0}]\n"," dataset_orig = load_preproc_data_compas(['race'])\n","\n","all_metrics = [\"Statistical parity difference\",\n"," \"Average odds difference\",\n"," \"Equal opportunity difference\"]\n","\n","#random seed for calibrated equal odds prediction\n","np.random.seed(1)"]},{"cell_type":"markdown","metadata":{},"source":["#### Split into train, and test"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[],"source":["# Get the dataset and split into train and test\n","dataset_orig_train, dataset_orig_vt = dataset_orig.split([0.7], shuffle=True)\n","dataset_orig_valid, dataset_orig_test = dataset_orig_vt.split([0.5], shuffle=True)"]},{"cell_type":"markdown","metadata":{},"source":["#### Clean up training data"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# print out some labels, names, etc.\n","display(Markdown(\"#### Training Dataset shape\"))\n","print(dataset_orig_train.features.shape)\n","display(Markdown(\"#### Favorable and unfavorable labels\"))\n","print(dataset_orig_train.favorable_label, dataset_orig_train.unfavorable_label)\n","display(Markdown(\"#### Protected attribute names\"))\n","print(dataset_orig_train.protected_attribute_names)\n","display(Markdown(\"#### Privileged and unprivileged protected attribute values\"))\n","print(dataset_orig_train.privileged_protected_attributes, \n"," dataset_orig_train.unprivileged_protected_attributes)\n","display(Markdown(\"#### Dataset feature names\"))\n","print(dataset_orig_train.feature_names)"]},{"cell_type":"markdown","metadata":{},"source":["#### Metric for original training 
data"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Metric for the original dataset, here statistical_parity_difference is used as the following: Pr(Y=1|D=unprivileged)−Pr(Y=1|D=privileged)\n","metric_orig_train = BinaryLabelDatasetMetric(dataset_orig_train, \n"," unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n","display(Markdown(\"#### Original training dataset\"))\n","print(\"Difference in mean outcomes between unprivileged and privileged groups = %f\" % metric_orig_train.mean_difference())"]},{"cell_type":"markdown","metadata":{},"source":["#### Train with and transform the original training data"]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[],"source":["RW = Reweighing(unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n","RW.fit(dataset_orig_train)\n","dataset_transf_train = RW.transform(dataset_orig_train)"]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[],"source":["### Testing \n","assert np.abs(dataset_transf_train.instance_weights.sum()-dataset_orig_train.instance_weights.sum())<1e-6"]},{"cell_type":"markdown","metadata":{},"source":["#### Metric with the transformed training data"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["metric_transf_train = BinaryLabelDatasetMetric(dataset_transf_train, \n"," unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n","display(Markdown(\"#### Transformed training dataset\"))\n","print(\"Difference in mean outcomes between unprivileged and privileged groups = %f\" % metric_transf_train.mean_difference())"]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[],"source":["### Testing \n","assert np.abs(metric_transf_train.mean_difference()) < 1e-6"]},{"cell_type":"markdown","metadata":{},"source":["### Train classifier on original data"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[],"source":["# Logistic regression classifier and predictions\n","scale_orig = StandardScaler()\n","X_train = scale_orig.fit_transform(dataset_orig_train.features)\n","y_train = dataset_orig_train.labels.ravel()\n","w_train = dataset_orig_train.instance_weights.ravel()\n","\n","lmod = LogisticRegression()\n","lmod.fit(X_train, y_train, \n"," sample_weight=dataset_orig_train.instance_weights)\n","y_train_pred = lmod.predict(X_train)\n","\n","# positive class index\n","pos_ind = np.where(lmod.classes_ == dataset_orig_train.favorable_label)[0][0]\n","\n","dataset_orig_train_pred = dataset_orig_train.copy()\n","dataset_orig_train_pred.labels = y_train_pred"]},{"cell_type":"markdown","metadata":{},"source":["#### Obtain scores for original validation and test sets"]},{"cell_type":"code","execution_count":11,"metadata":{},"outputs":[],"source":["dataset_orig_valid_pred = dataset_orig_valid.copy(deepcopy=True)\n","X_valid = scale_orig.transform(dataset_orig_valid_pred.features)\n","y_valid = dataset_orig_valid_pred.labels\n","dataset_orig_valid_pred.scores = lmod.predict_proba(X_valid)[:,pos_ind].reshape(-1,1)\n","\n","dataset_orig_test_pred = dataset_orig_test.copy(deepcopy=True)\n","X_test = scale_orig.transform(dataset_orig_test_pred.features)\n","y_test = dataset_orig_test_pred.labels\n","dataset_orig_test_pred.scores = lmod.predict_proba(X_test)[:,pos_ind].reshape(-1,1)"]},{"cell_type":"markdown","metadata":{},"source":["### Find the optimal classification threshold from the validation 
set"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["num_thresh = 100\n","ba_arr = np.zeros(num_thresh)\n","class_thresh_arr = np.linspace(0.01, 0.99, num_thresh)\n","for idx, class_thresh in enumerate(class_thresh_arr):\n"," \n"," fav_inds = dataset_orig_valid_pred.scores > class_thresh\n"," dataset_orig_valid_pred.labels[fav_inds] = dataset_orig_valid_pred.favorable_label\n"," dataset_orig_valid_pred.labels[~fav_inds] = dataset_orig_valid_pred.unfavorable_label\n"," \n"," classified_metric_orig_valid = ClassificationMetric(dataset_orig_valid,\n"," dataset_orig_valid_pred, \n"," unprivileged_groups=unprivileged_groups,\n"," privileged_groups=privileged_groups)\n"," \n"," ba_arr[idx] = 0.5*(classified_metric_orig_valid.true_positive_rate()\\\n"," +classified_metric_orig_valid.true_negative_rate())\n","\n","best_ind = np.where(ba_arr == np.max(ba_arr))[0][0]\n","best_class_thresh = class_thresh_arr[best_ind]\n","\n","print(\"Best balanced accuracy (no reweighing) = %.4f\" % np.max(ba_arr))\n","print(\"Optimal classification threshold (no reweighing) = %.4f\" % best_class_thresh)"]},{"cell_type":"markdown","metadata":{},"source":["### Predictions from the original test set at the optimal classification threshold"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["display(Markdown(\"#### Predictions from original testing data\"))\n","bal_acc_arr_orig = []\n","disp_imp_arr_orig = []\n","avg_odds_diff_arr_orig = []\n","\n","print(\"Classification threshold used = %.4f\" % best_class_thresh)\n","for thresh in tqdm(class_thresh_arr):\n"," \n"," if thresh == best_class_thresh:\n"," disp = True\n"," else:\n"," disp = False\n"," \n"," fav_inds = dataset_orig_test_pred.scores > thresh\n"," dataset_orig_test_pred.labels[fav_inds] = dataset_orig_test_pred.favorable_label\n"," dataset_orig_test_pred.labels[~fav_inds] = dataset_orig_test_pred.unfavorable_label\n"," \n"," metric_test_bef = compute_metrics(dataset_orig_test, dataset_orig_test_pred, \n"," unprivileged_groups, privileged_groups,\n"," disp = disp)\n","\n"," bal_acc_arr_orig.append(metric_test_bef[\"Balanced accuracy\"])\n"," avg_odds_diff_arr_orig.append(metric_test_bef[\"Average odds difference\"])\n"," disp_imp_arr_orig.append(metric_test_bef[\"Disparate impact\"])"]},{"cell_type":"markdown","metadata":{},"source":["#### Display results for all thresholds"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax1 = plt.subplots(figsize=(10,7))\n","ax1.plot(class_thresh_arr, bal_acc_arr_orig)\n","ax1.set_xlabel('Classification Thresholds', fontsize=16, fontweight='bold')\n","ax1.set_ylabel('Balanced Accuracy', color='b', fontsize=16, fontweight='bold')\n","ax1.xaxis.set_tick_params(labelsize=14)\n","ax1.yaxis.set_tick_params(labelsize=14)\n","\n","\n","ax2 = ax1.twinx()\n","ax2.plot(class_thresh_arr, np.abs(1.0-np.array(disp_imp_arr_orig)), color='r')\n","ax2.set_ylabel('abs(1-disparate impact)', color='r', fontsize=16, fontweight='bold')\n","ax2.axvline(best_class_thresh, color='k', linestyle=':')\n","ax2.yaxis.set_tick_params(labelsize=14)\n","ax2.grid(True)"]},{"cell_type":"markdown","metadata":{},"source":["```abs(1-disparate impact)``` must be small (close to 0) for classifier predictions to be fair.\n","\n","However, for a classifier trained with original training data, at the best classification rate, this is quite high. 
This implies unfairness."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax1 = plt.subplots(figsize=(10,7))\n","ax1.plot(class_thresh_arr, bal_acc_arr_orig)\n","ax1.set_xlabel('Classification Thresholds', fontsize=16, fontweight='bold')\n","ax1.set_ylabel('Balanced Accuracy', color='b', fontsize=16, fontweight='bold')\n","ax1.xaxis.set_tick_params(labelsize=14)\n","ax1.yaxis.set_tick_params(labelsize=14)\n","\n","\n","ax2 = ax1.twinx()\n","ax2.plot(class_thresh_arr, avg_odds_diff_arr_orig, color='r')\n","ax2.set_ylabel('avg. odds diff.', color='r', fontsize=16, fontweight='bold')\n","ax2.axvline(best_class_thresh, color='k', linestyle=':')\n","ax2.yaxis.set_tick_params(labelsize=14)\n","ax2.grid(True)"]},{"cell_type":"markdown","metadata":{},"source":["```average odds difference = 0.5((FPR_unpriv-FPR_priv)+(TPR_unpriv-TPR_priv))``` must be close to zero for the classifier to be fair.\n","\n","However, for a classifier trained with original training data, at the best classification rate, this is quite high. This implies unfairness."]},{"cell_type":"markdown","metadata":{},"source":["### Train classifier on transformed data"]},{"cell_type":"code","execution_count":16,"metadata":{},"outputs":[],"source":["scale_transf = StandardScaler()\n","X_train = scale_transf.fit_transform(dataset_transf_train.features)\n","y_train = dataset_transf_train.labels.ravel()\n","\n","lmod = LogisticRegression()\n","lmod.fit(X_train, y_train,\n"," sample_weight=dataset_transf_train.instance_weights)\n","y_train_pred = lmod.predict(X_train)"]},{"cell_type":"markdown","metadata":{},"source":["#### Obtain scores for transformed test set"]},{"cell_type":"code","execution_count":17,"metadata":{},"outputs":[],"source":["dataset_transf_test_pred = dataset_orig_test.copy(deepcopy=True)\n","X_test = scale_transf.fit_transform(dataset_transf_test_pred.features)\n","y_test = dataset_transf_test_pred.labels\n","dataset_transf_test_pred.scores = lmod.predict_proba(X_test)[:,pos_ind].reshape(-1,1)"]},{"cell_type":"markdown","metadata":{},"source":["### Predictions from the transformed test set at the optimal classification threshold"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["display(Markdown(\"#### Predictions from transformed testing data\"))\n","bal_acc_arr_transf = []\n","disp_imp_arr_transf = []\n","avg_odds_diff_arr_transf = []\n","\n","print(\"Classification threshold used = %.4f\" % best_class_thresh)\n","for thresh in tqdm(class_thresh_arr):\n"," \n"," if thresh == best_class_thresh:\n"," disp = True\n"," else:\n"," disp = False\n"," \n"," fav_inds = dataset_transf_test_pred.scores > thresh\n"," dataset_transf_test_pred.labels[fav_inds] = dataset_transf_test_pred.favorable_label\n"," dataset_transf_test_pred.labels[~fav_inds] = dataset_transf_test_pred.unfavorable_label\n"," \n"," metric_test_aft = compute_metrics(dataset_orig_test, dataset_transf_test_pred, \n"," unprivileged_groups, privileged_groups,\n"," disp = disp)\n","\n"," bal_acc_arr_transf.append(metric_test_aft[\"Balanced accuracy\"])\n"," avg_odds_diff_arr_transf.append(metric_test_aft[\"Average odds difference\"])\n"," disp_imp_arr_transf.append(metric_test_aft[\"Disparate impact\"])"]},{"cell_type":"markdown","metadata":{},"source":["#### Display results for all thresholds"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax1 = plt.subplots(figsize=(10,7))\n","ax1.plot(class_thresh_arr, 
bal_acc_arr_transf)\n","ax1.set_xlabel('Classification Thresholds', fontsize=16, fontweight='bold')\n","ax1.set_ylabel('Balanced Accuracy', color='b', fontsize=16, fontweight='bold')\n","ax1.xaxis.set_tick_params(labelsize=14)\n","ax1.yaxis.set_tick_params(labelsize=14)\n","\n","\n","ax2 = ax1.twinx()\n","ax2.plot(class_thresh_arr, np.abs(1.0-np.array(disp_imp_arr_transf)), color='r')\n","ax2.set_ylabel('abs(1-disparate impact)', color='r', fontsize=16, fontweight='bold')\n","ax2.axvline(best_class_thresh, color='k', linestyle=':')\n","ax2.yaxis.set_tick_params(labelsize=14)\n","ax2.grid(True)"]},{"cell_type":"markdown","metadata":{},"source":["```abs(1-disparate impact)``` must be small (close to 0) for classifier predictions to be fair.\n","\n","For a classifier trained with reweighted training data, at the best classification rate, this is indeed the case.\n","This implies fairness."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax1 = plt.subplots(figsize=(10,7))\n","ax1.plot(class_thresh_arr, bal_acc_arr_transf)\n","ax1.set_xlabel('Classification Thresholds', fontsize=16, fontweight='bold')\n","ax1.set_ylabel('Balanced Accuracy', color='b', fontsize=16, fontweight='bold')\n","ax1.xaxis.set_tick_params(labelsize=14)\n","ax1.yaxis.set_tick_params(labelsize=14)\n","\n","\n","ax2 = ax1.twinx()\n","ax2.plot(class_thresh_arr, avg_odds_diff_arr_transf, color='r')\n","ax2.set_ylabel('avg. odds diff.', color='r', fontsize=16, fontweight='bold')\n","ax2.axvline(best_class_thresh, color='k', linestyle=':')\n","ax2.yaxis.set_tick_params(labelsize=14)\n","ax2.grid(True)"]},{"cell_type":"markdown","metadata":{},"source":["```average odds difference = 0.5((FPR_unpriv-FPR_priv)+(TPR_unpriv-TPR_priv))``` must be close to zero for the classifier to be fair.\n","\n","For a classifier trained with reweighted training data, at the best classification rate, this is indeed the case.\n","This implies fairness."]},{"cell_type":"markdown","metadata":{"collapsed":true},"source":["# Summary of Results\n","We show the optimal classification thresholds, and the fairness and accuracy metrics."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Classification Thresholds\n","\n","| Dataset |Classification threshold|\n","|-|-|\n","|Adult|0.2674|\n","|German|0.6732|\n","|Compas|0.5148|"]},{"cell_type":"markdown","metadata":{},"source":["### Fairness Metric: Disparate impact, Accuracy Metric: Balanced accuracy\n","\n","#### Performance\n","\n","| Dataset |Sex (Acc-Bef)|Sex (Acc-Aft)|Sex (Fair-Bef)|Sex (Fair-Aft)|Race/Age (Acc-Bef)|Race/Age (Acc-Aft)|Race/Age (Fair-Bef)|Race/Age (Fair-Aft)|\n","|-|-|-|-|-|-|-|-|-|\n","|Adult (Test)|0.7417|0.7128|0.2774|0.7625|0.7417|0.7443|0.4423|0.7430|\n","|German (Test)|0.6524|0.6460|0.9948|1.0852|0.6524|0.6460|0.3824|0.5735|\n","|Compas (Test)|0.6774|0.6562|0.6631|0.8342|0.6774|0.6342|0.6600|1.1062|\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["### Fairness Metric: Average odds difference, Accuracy Metric: Balanced accuracy\n","\n","#### Performance\n","\n","| Dataset |Sex (Acc-Bef)|Sex (Acc-Aft)|Sex (Fair-Bef)|Sex (Fair-Aft)|Race/Age (Acc-Bef)|Race/Age (Acc-Aft)|Race/Age (Fair-Bef)|Race/Age (Fair-Aft)|\n","|-|-|-|-|-|-|-|-|-|\n","|Adult (Test)|0.7417|0.7128|-0.3281|-0.0266|0.7417|0.7443|-0.1991|-0.0395|\n","|German (Test)|0.6524|0.6460|0.0071|0.0550|0.6524|0.6460|-0.3278|-0.1944|\n","|Compas (Test)|0.6774|0.6562|-0.2439|-0.0946|0.6774|0.6342|-0.1927|0.1042|"]}],"metadata":{"kernelspec":{"display_name":"Python 
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":2},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython2","version":"2.7.11"}},"nbformat":4,"nbformat_minor":2}
2 |
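The markdown cells in the notebook above judge fairness by `abs(1 - disparate impact)` and by the average odds difference. As a standalone sketch (not part of the notebook; `y_true`, `y_pred`, and the group mask `priv` are hypothetical example inputs), the snippet below shows how both quantities could be computed directly from prediction arrays with plain NumPy:

```python
# Illustrative sketch only: the fairness metrics referenced above,
# computed from raw arrays. `y_true`, `y_pred`, `priv` are made-up inputs.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])              # favourable label = 1
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])              # classifier decisions
priv   = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)  # True = privileged group

def selection_rate(pred, mask):
    """Fraction of the group receiving the favourable prediction."""
    return pred[mask].mean()

def tpr_fpr(true, pred, mask):
    """True- and false-positive rates within one group."""
    t, p = true[mask], pred[mask]
    tpr = p[t == 1].mean() if (t == 1).any() else np.nan
    fpr = p[t == 0].mean() if (t == 0).any() else np.nan
    return tpr, fpr

# Disparate impact: ratio of unprivileged to privileged selection rates
# (fair when close to 1, i.e. abs(1 - DI) close to 0).
di = selection_rate(y_pred, ~priv) / selection_rate(y_pred, priv)

# Average odds difference: 0.5*((FPR_unpriv - FPR_priv) + (TPR_unpriv - TPR_priv))
tpr_u, fpr_u = tpr_fpr(y_true, y_pred, ~priv)
tpr_p, fpr_p = tpr_fpr(y_true, y_pred, priv)
aod = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))

print(f"abs(1 - disparate impact) = {abs(1 - di):.4f}")
print(f"average odds difference   = {aod:.4f}")
```

In the notebooks themselves the same values are obtained from AIF360's `ClassificationMetric.disparate_impact()` and `ClassificationMetric.average_odds_difference()`, as wrapped by `compute_metrics` in `common_utils.py`.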
--------------------------------------------------------------------------------
/datasets_analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "id": "t0LIXWb6Gsc-"
17 | },
18 | "source": [
19 | "#### This notebook Break down the datasets used in the framework of Fairness\n",
20 | "#### It shows also the original datasets before preprocessing"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {
27 | "id": "s5v5qFwUGsdQ"
28 | },
29 | "outputs": [],
30 | "source": [
31 | "!pip install 'aif360[all]'\n",
32 | "!wget https://raw.githubusercontent.com/Ali-Alameer/AI_fairness/main/common_utils.py\n"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "# preparing paths to download the datasets \n",
42 | "\n",
43 | "import site\n",
44 | "import os\n",
45 | "\n",
46 | "# Get the dist-packages directory path\n",
47 | "dist_packages_path = site.getsitepackages()[0]\n",
48 | "\n",
49 | "# Print the path\n",
50 | "print(dist_packages_path)\n",
51 | "\n",
52 | "## Adult dataset\n",
53 | "\n",
54 | "aif360_dir_names = 'aif360/data/raw/adult/adult.names'\n",
55 | "aif360_dir_test = 'aif360/data/raw/adult/adult.test'\n",
56 | "aif360_dir_data = 'aif360/data/raw/adult/adult.data'\n",
57 | "full_path_names = os.path.join(dist_packages_path, aif360_dir_names)\n",
58 | "full_path_test = os.path.join(dist_packages_path, aif360_dir_test)\n",
59 | "full_path_data = os.path.join(dist_packages_path, aif360_dir_data)\n",
60 | "\n",
61 | "## German credit score dataset\n",
62 | "\n",
63 | "german_dir_data = 'aif360/data/raw/german/german.data'\n",
64 | "german_dir_docs = 'aif360/data/raw/german/german.doc'\n",
65 | "\n",
66 | "german_path_data = os.path.join(dist_packages_path, german_dir_data)\n",
67 | "german_path_docs = os.path.join(dist_packages_path, german_dir_docs)\n",
68 | "\n",
69 | "## Compas dataset \n",
70 | "\n",
71 | "compas_dir_csv = 'aif360/data/raw/compas/compas-scores-two-years.csv'\n",
72 | "compas_path_csv = os.path.join(dist_packages_path, compas_dir_csv)"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {
79 | "id": "z9DAk7OkGsdU"
80 | },
81 | "outputs": [],
82 | "source": [
83 | "import urllib.request \n",
84 | "# For Adult dataset\n",
85 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",full_path_data) \n",
86 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test\",full_path_test) \n",
87 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names\",full_path_names) \n",
88 | "\n",
89 | "# # For German Dataset\n",
90 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data\",german_path_data) \n",
91 | "urllib.request.urlretrieve(\"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc\",german_path_docs)\n",
92 | "\n",
93 | "# # For Compas Dataset\n",
94 | "urllib.request.urlretrieve(\"https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv\",compas_path_csv) "
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 3,
100 | "metadata": {
101 | "id": "O1Qp1g0sGsdW"
102 | },
103 | "outputs": [],
104 | "source": [
105 | "%matplotlib inline\n",
106 | "# Load all necessary packages\n",
107 | "import sys\n",
108 | "sys.path.append(\"../\")\n",
109 | "import numpy as np\n",
110 | "from tqdm import tqdm\n",
111 | "\n",
112 | "from aif360.datasets import BinaryLabelDataset\n",
113 | "from aif360.datasets import AdultDataset, GermanDataset, CompasDataset\n",
114 | "from aif360.metrics import BinaryLabelDatasetMetric\n",
115 | "from aif360.metrics import ClassificationMetric\n",
116 | "from aif360.algorithms.preprocessing.reweighing import Reweighing\n",
117 | "from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions\\\n",
118 | " import load_preproc_data_adult, load_preproc_data_german, load_preproc_data_compas\n",
119 | "from sklearn.linear_model import LogisticRegression\n",
120 | "from sklearn.preprocessing import StandardScaler\n",
121 | "from sklearn.metrics import accuracy_score\n",
122 | "\n",
123 | "from IPython.display import Markdown, display\n",
124 | "import matplotlib.pyplot as plt\n",
125 | "\n",
126 | "from common_utils import compute_metrics"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {
132 | "id": "D2DIWi1dGsdb"
133 | },
134 | "source": [
135 | "#### Load dataset and set options"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 4,
141 | "metadata": {
142 | "id": "3zIuWq7ZGsdc"
143 | },
144 | "outputs": [],
145 | "source": [
146 | "## import dataset\n",
147 | "dataset_used = \"adult\" # \"adult\", \"german\", \"compas\"\n",
148 | "protected_attribute_used = 1 # 1, 2\n",
149 | "\n",
150 | "\n",
151 | "if dataset_used == \"adult\":\n",
152 | "# dataset_orig = AdultDataset()\n",
153 | " if protected_attribute_used == 1:\n",
154 | " privileged_groups = [{'sex': 1}]\n",
155 | " unprivileged_groups = [{'sex': 0}]\n",
156 | " dataset_orig = load_preproc_data_adult(['sex'])\n",
157 | " else:\n",
158 | " privileged_groups = [{'race': 1}]\n",
159 | " unprivileged_groups = [{'race': 0}]\n",
160 | " dataset_orig = load_preproc_data_adult(['race'])\n",
161 | " \n",
162 | "elif dataset_used == \"german\":\n",
163 | "# dataset_orig = GermanDataset()\n",
164 | " if protected_attribute_used == 1:\n",
165 | " privileged_groups = [{'sex': 1}]\n",
166 | " unprivileged_groups = [{'sex': 0}]\n",
167 | " dataset_orig = load_preproc_data_german(['sex'])\n",
168 | " else:\n",
169 | " privileged_groups = [{'age': 1}]\n",
170 | " unprivileged_groups = [{'age': 0}]\n",
171 | " dataset_orig = load_preproc_data_german(['age'])\n",
172 | " \n",
173 | "elif dataset_used == \"compas\":\n",
174 | "# dataset_orig = CompasDataset()\n",
175 | " if protected_attribute_used == 1:\n",
176 | " privileged_groups = [{'sex': 1}]\n",
177 | " unprivileged_groups = [{'sex': 0}]\n",
178 | " dataset_orig = load_preproc_data_compas(['sex'])\n",
179 | " else:\n",
180 | " privileged_groups = [{'race': 1}]\n",
181 | " unprivileged_groups = [{'race': 0}]\n",
182 | " dataset_orig = load_preproc_data_compas(['race'])\n",
183 | "\n",
184 | "all_metrics = [\"Statistical parity difference\",\n",
185 | " \"Average odds difference\",\n",
186 | " \"Equal opportunity difference\"]\n",
187 | "\n",
188 | "#random seed for calibrated equal odds prediction\n",
189 | "np.random.seed(1)"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {
195 | "id": "3cyhuFW_HOa4"
196 | },
197 | "source": [
198 | "## Exploratory Data Analysis(EDA)"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {
205 | "id": "MxSoVRnPSHqa"
206 | },
207 | "outputs": [],
208 | "source": [
209 | "# Convert the initial Adult Dataset from AIf360 Library into a dataframe and view the created dataframe\n",
210 | "dataset_orig_adult= AdultDataset()\n",
211 | "dataset_orig_adult1=dataset_orig_adult.convert_to_dataframe()[0]\n",
212 | "dataset_orig_adult1"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {
219 | "id": "ZCkVkvqbJptu"
220 | },
221 | "outputs": [],
222 | "source": [
223 | "# Convert pre-processed AIF360 Adult dataset to dataframe and view dataframe created\n",
224 | "dataset_processed_adult = load_preproc_data_adult()\n",
225 | "dataset_processed_adult1=dataset_processed_adult.convert_to_dataframe()[0]\n",
226 | "dataset_processed_adult1"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "metadata": {
233 | "id": "mqL8blvcJ0yl"
234 | },
235 | "outputs": [],
236 | "source": [
237 | "# View the data structure for the pre-processed Adult dataset\n",
238 | "dataset_processed_adult1.shape"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": null,
244 | "metadata": {
245 | "id": "ufjFxfMWJ6E9"
246 | },
247 | "outputs": [],
248 | "source": [
249 | "# View the data structure for the initial Adult dataset in AIF360 library N/B missing data of 3620 rows were removed from AdultDataset\n",
250 | "dataset_orig_adult1.shape"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {
257 | "id": "dgb7lnpYKZWw"
258 | },
259 | "outputs": [],
260 | "source": [
261 | "# View a list of the dataset features\n",
262 | "dataset_orig_adult1.columns.tolist()"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": null,
268 | "metadata": {
269 | "id": "5A0sU09cKgTC"
270 | },
271 | "outputs": [],
272 | "source": [
273 | "# View the Features for the pre-processed Adult dataset\n",
274 | "dataset_processed_adult1.columns.tolist()"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {
281 | "id": "QbIB1YxbKihd"
282 | },
283 | "outputs": [],
284 | "source": [
285 | "# Check for missing values in the initial Adult dataset in AIF360 library\n",
286 | "dataset_orig_adult1.isnull().sum()"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {
293 | "id": "ciPE33J1KoTF"
294 | },
295 | "outputs": [],
296 | "source": [
297 | "# check for missing values for the pre-processed Adult dataset\n",
298 | "dataset_processed_adult1.isnull().sum()"
299 | ]
300 | },
301 | {
302 | "cell_type": "code",
303 | "execution_count": null,
304 | "metadata": {
305 | "id": "BQvV9z0HZo18"
306 | },
307 | "outputs": [],
308 | "source": [
309 | "# Explore education number of years in the Adult dataset to check for outliers using box plot\n",
310 | "import seaborn as sns\n",
311 | "sns.boxplot(x= 'education-num', data= dataset_orig_adult1)\n",
312 | "plt.title('Education number Distribution for Adult Dataset')"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "metadata": {
319 | "id": "JsclUAV7dAaF"
320 | },
321 | "outputs": [],
322 | "source": [
323 | "# Using a distplot, explore the effect of education number of years on Income level of residents \n",
324 | "dataset_orig_adult1['income-per-year']= dataset_orig_adult1['income-per-year'].replace({0.0:'<= $50k income', 1.0:'> $50K income'})\n",
325 | "sns.histplot(x= 'education-num', hue = 'income-per-year',data= dataset_orig_adult1,multiple=\"dodge\")\n",
326 | "plt.title('Education number by Income Distribution for Adult Dataset')"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": null,
332 | "metadata": {
333 | "id": "Iy0rgwLQbR20"
334 | },
335 | "outputs": [],
336 | "source": [
337 | "# Explore age distribution relationship with income level of residents\n",
338 | "sns.histplot(x= 'age', hue = 'income-per-year',data= dataset_orig_adult1,multiple=\"stack\")\n",
339 | "plt.title('Age Distribution by Income for Adult Dataset')"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": null,
345 | "metadata": {
346 | "id": "kb_R-M8k7jG_"
347 | },
348 | "outputs": [],
349 | "source": [
350 | "## explore income binary in regards to sex distribution in adult dataset to check class imbalance and distribution\n",
351 | "## Privileged = 1, unprivileged = 0\n",
352 | "##sex (privileged: Male, unprivileged: Female) \n",
353 | "##Income binary( privileged: > $50K income, unprivileged: <= $50k income)\n",
354 | "import seaborn as sns\n",
355 | "dataset_processed_adult1['Income Binary']= dataset_processed_adult1['Income Binary'].replace({0.0:'<= $50k income', 1.0:'> $50K income'})\n",
356 | "sns.countplot(x=\"Income Binary\", hue= 'sex', data = dataset_processed_adult1)\n",
357 | "plt.title('Income Distribution by Sex for Adult Dataset')\n",
358 | "plt.legend(title= 'sex', labels=(\"Female\",\"Male\"))\n",
359 | "plt.show()"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": null,
365 | "metadata": {
366 | "id": "mHGzdXxB7lC1"
367 | },
368 | "outputs": [],
369 | "source": [
370 | "## explore sex as a protected attribute in adult dataset to check class imbalance and distribution\n",
371 | "## Privileged = 1, unprivileged = 0\n",
372 | "##sex (privileged: Male, unprivileged: Female) \n",
373 | "dataset_processed_adult1['sex']= dataset_processed_adult1['sex'].replace({0.0:'Female', 1.0:'Male'})\n",
374 | "sns.countplot(x=\"sex\", data = dataset_processed_adult1)\n",
375 | "plt.title('Sex Distribution for Adult Dataset')\n",
376 | "plt.show()"
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": null,
382 | "metadata": {
383 | "id": "eQUrr6IQ7pXh"
384 | },
385 | "outputs": [],
386 | "source": [
387 | "## explore race as aprotected attribute in adult dataset to check class imbalance and distribution\n",
388 | "## Privileged = 1, unprivileged = 0\n",
389 | "##race (privileged: White, unprivileged: Non-white).\n",
390 | "dataset_processed_adult1['race']= dataset_processed_adult1['race'].replace({0.0:'Non-white', 1.0:'White'})\n",
391 | "sns.countplot(x=\"race\", data = dataset_processed_adult1)\n",
392 | "plt.title('Race Distribution for Adult Dataset')\n",
393 | "plt.show()"
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {
400 | "id": "N6BTvwHpD_yl"
401 | },
402 | "outputs": [],
403 | "source": [
404 | "# Explore income distribution of residents\n",
405 | "tips = dataset_orig_adult1\n",
406 | "ax= sns.countplot(x= 'income-per-year',data= tips)\n",
407 | "plt.title('Income Distribution for Adult Dataset')\n",
408 | "for p in ax.patches:\n",
409 | " ax.annotate('{:.1f}%'.format(100*p.get_height()/len(tips)), (p.get_x()+0.2, p.get_height()+5))"
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": null,
415 | "metadata": {
416 | "id": "NQ4QszM57rjv"
417 | },
418 | "outputs": [],
419 | "source": [
420 | "# Convert pre-processed German dataset to dataframe\n",
421 | "dataset_processed_german = load_preproc_data_german(['sex'])\n",
422 | "dataset_processed_german2=dataset_processed_german.convert_to_dataframe()[0]\n",
423 | "dataset_processed_german2"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "metadata": {
430 | "id": "DYQLRVZUU_Fu"
431 | },
432 | "outputs": [],
433 | "source": [
434 | "# Convert initial German dataset to dataframe\n",
435 | "dataset_orig_german=GermanDataset()\n",
436 | "dataset_orig_german1= dataset_orig_german.convert_to_dataframe()[0]\n",
437 | "dataset_orig_german1"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {
444 | "id": "mQGvi-ndRfaY"
445 | },
446 | "outputs": [],
447 | "source": [
448 | "# View the data structure for the pre-processed German dataset\n",
449 | "dataset_processed_german2.shape"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": null,
455 | "metadata": {
456 | "id": "Om9wDPnjRjTx"
457 | },
458 | "outputs": [],
459 | "source": [
460 | "# View the data structure for the initial German dataset in AIF360 Library\n",
461 | "dataset_orig_german1.shape"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": null,
467 | "metadata": {
468 | "id": "d6DTI78kubUk"
469 | },
470 | "outputs": [],
471 | "source": [
472 | "# View the Features for the pre-processed German dataset\n",
473 | "dataset_processed_german2.columns.tolist()"
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {
480 | "id": "Rhn7kieSt45i"
481 | },
482 | "outputs": [],
483 | "source": [
484 | "# View the Features for the Initial German dataset in AIF360\n",
485 | "dataset_orig_german1.columns.tolist()"
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": null,
491 | "metadata": {
492 | "id": "AB844igNLAcv"
493 | },
494 | "outputs": [],
495 | "source": [
496 | "# Check for missing values in pre-processed German dataset\n",
497 | "dataset_processed_german2.isnull().sum()"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "execution_count": null,
503 | "metadata": {
504 | "id": "vOPfVprOHDPQ"
505 | },
506 | "outputs": [],
507 | "source": [
508 | "# Explore credit risk based onduration in months.\n",
509 | "plt.figure(figsize=(15,5))\n",
510 | "sns.countplot(x=\"month\", hue= 'credit', data = dataset_orig_german1,palette=['#432371',\"#FAAE7B\"])\n",
511 | "plt.title('Duration of Credits(Month) Distribution for German Dataset')\n",
512 | "plt.legend(title= 'credit',labels=['good credit', 'bad credit'])\n",
513 | "plt.show()"
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "metadata": {
520 | "id": "F7FqoHL97wAv"
521 | },
522 | "outputs": [],
523 | "source": [
524 | "## explore sex in relationship with credit risk in processed German dataset to check class imbalance and distribution\n",
525 | "## Privileged = 1, unprivileged = 0\n",
526 | "##sex (privileged: Male, unprivileged: Female) \n",
527 | "## Credit ( good credit =1, bad credit = 2)\n",
528 | "dataset_processed_german2['sex']= dataset_processed_german2['sex'].replace({0.0:'Female', 1.0:'Male'})\n",
529 | "sns.countplot(x=\"sex\", hue= 'credit', data = dataset_processed_german2,palette=['#432371',\"#FAAE7B\"])\n",
530 | "plt.title('Sex Distribution for German Dataset')\n",
531 | "plt.legend(title= 'credit',labels=['good credit', 'bad credit'])\n",
532 | "plt.show()"
533 | ]
534 | },
535 | {
536 | "cell_type": "code",
537 | "execution_count": null,
538 | "metadata": {
539 | "id": "2n_zfMwy7w7J"
540 | },
541 | "outputs": [],
542 | "source": [
543 | "## explore age in German dataset to check class imbalance and distribution\n",
544 | "## Privileged = 1, unprivileged = 0\n",
545 | "## age (privileged: Older than or Equal to 25 years, unprivileged: Younger than 25 years) \n",
546 | "## Credit ( good credit =1, bad credit = 2)\n",
547 | "dataset_processed_german2['age']= dataset_processed_german2['age'].replace({0.0:'< 25 years', 1.0:'>=25 years'})\n",
548 | "sns.countplot(x=\"age\", hue ='credit',data = dataset_processed_german2,palette=['#432371',\"#FAAE7B\"])\n",
549 | "plt.title('Age Distribution for German Dataset')\n",
550 | "plt.legend(title= 'credit',labels=['good credit', 'bad credit'])\n",
551 | "plt.show()"
552 | ]
553 | },
554 | {
555 | "cell_type": "code",
556 | "execution_count": null,
557 | "metadata": {
558 | "id": "qJ4HPE6jJ2_w"
559 | },
560 | "outputs": [],
561 | "source": [
562 | "# Explore Credit distribution of German Dataset\n",
563 | "tips = dataset_processed_german2\n",
564 | "## Credit ( good credit =1, bad credit = 2)\n",
565 | "dataset_processed_german2['credit']= dataset_processed_german2['credit'].replace({1.0:'Good credit', 2.0:'Bad credit'})\n",
566 | "ax= sns.countplot(x= 'credit',data= tips, palette= ['#432371',\"#FAAE7B\"])\n",
567 | "plt.title('Credit Distribution for German Dataset')\n",
568 | "for p in ax.patches:\n",
569 | " ax.annotate('{:.1f}%'.format(100*p.get_height()/len(tips)), (p.get_x()+0.2, p.get_height()+5))"
570 | ]
571 | },
572 | {
573 | "cell_type": "code",
574 | "execution_count": null,
575 | "metadata": {
576 | "id": "KLVGepQ37zBy"
577 | },
578 | "outputs": [],
579 | "source": [
580 | "# Convert processed Compas dataset to dataframe\n",
581 | "dataset_processed_compas = load_preproc_data_compas()\n",
582 | "dataset_processed_compas2=dataset_processed_compas.convert_to_dataframe()[0]\n",
583 | "dataset_processed_compas2"
584 | ]
585 | },
586 | {
587 | "cell_type": "code",
588 | "execution_count": null,
589 | "metadata": {
590 | "id": "f1Vg_8NL0Hb2"
591 | },
592 | "outputs": [],
593 | "source": [
594 | "# Convert the Initial Compas dataset in AIF360 library to dataframe\n",
595 | "dataset_orig_compas = CompasDataset()\n",
596 | "dataset_orig_compas2 = dataset_orig_compas.convert_to_dataframe()[0]\n",
597 | "dataset_orig_compas2"
598 | ]
599 | },
600 | {
601 | "cell_type": "code",
602 | "execution_count": null,
603 | "metadata": {
604 | "id": "o_tsbWIRg_3y"
605 | },
606 | "outputs": [],
607 | "source": [
608 | "# View the data structure for the pre-processed compas dataset\n",
609 | "dataset_processed_compas2.shape"
610 | ]
611 | },
612 | {
613 | "cell_type": "code",
614 | "execution_count": null,
615 | "metadata": {
616 | "id": "-PXNIqYmhDaR"
617 | },
618 | "outputs": [],
619 | "source": [
620 | "# View the data structure for the initial compas dataset in AIF360 library\n",
621 | "dataset_orig_compas2.shape"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": null,
627 | "metadata": {
628 | "id": "3584JjQrg6yT"
629 | },
630 | "outputs": [],
631 | "source": [
632 | "# View the features for the pre-processed compas dataset\n",
633 | "dataset_processed_compas2.columns.tolist()"
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": null,
639 | "metadata": {
640 | "id": "9ehjs8o00UwM"
641 | },
642 | "outputs": [],
643 | "source": [
644 | "# View the Features for the initial compas dataset in AIF360 library\n",
645 | "dataset_orig_compas2.columns.tolist()"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "execution_count": null,
651 | "metadata": {
652 | "id": "PFEZHg3vLXv9"
653 | },
654 | "outputs": [],
655 | "source": [
656 | "# Check for missing values in pre-processed Compas dataset\n",
657 | "dataset_processed_compas2.isnull().sum()"
658 | ]
659 | },
660 | {
661 | "cell_type": "code",
662 | "execution_count": null,
663 | "metadata": {
664 | "id": "eRKWylkgae0F"
665 | },
666 | "outputs": [],
667 | "source": [
668 | "# Explore Age distribution of the Initial Compas Dataset in relation to two-years recid\n",
669 | "dataset_orig_compas2['two_year_recid']= dataset_orig_compas2['two_year_recid'].replace({0.0:'re-offended',1.0:'did not re-offend'})\n",
670 | "sns.histplot(x= 'age', hue= 'two_year_recid', data= dataset_orig_compas2,multiple=\"stack\")\n",
671 | "plt.title('Age Distribution for Compas Dataset')\n",
672 | "plt.show()"
673 | ]
674 | },
675 | {
676 | "cell_type": "code",
677 | "execution_count": null,
678 | "metadata": {
679 | "id": "T8SauRrnbes6"
680 | },
681 | "outputs": [],
682 | "source": [
683 | "#Explore the prior counts of crimes in relation to recidivism\n",
684 | "sns.histplot(x= 'priors_count', hue= 'two_year_recid', data= dataset_orig_compas2, multiple=\"stack\")\n",
685 | "plt.title('Prior Crime count Distribution for Compas Dataset' )\n",
686 | "plt.show()"
687 | ]
688 | }
689 | ],
690 | "metadata": {
691 | "colab": {
692 | "include_colab_link": true,
693 | "provenance": [],
694 | "toc_visible": true
695 | },
696 | "kernelspec": {
697 | "display_name": "Python 3",
698 | "language": "python",
699 | "name": "python3"
700 | },
701 | "language_info": {
702 | "codemirror_mode": {
703 | "name": "ipython",
704 | "version": 2
705 | },
706 | "file_extension": ".py",
707 | "mimetype": "text/x-python",
708 | "name": "python",
709 | "nbconvert_exporter": "python",
710 | "pygments_lexer": "ipython2",
711 | "version": "2.7.11"
712 | }
713 | },
714 | "nbformat": 4,
715 | "nbformat_minor": 0
716 | }
717 |
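As a follow-up to the EDA above, a dataset-level fairness summary can quantify the group imbalance that the count plots visualise. The sketch below is illustrative only (it assumes the same `aif360` installation and data download performed earlier in the notebook) and uses `BinaryLabelDatasetMetric` on the preprocessed Adult data with `sex` as the protected attribute:

```python
# Sketch: quantify group imbalance in the preprocessed Adult data before any mitigation.
# Assumes aif360 is installed and the Adult raw files have been downloaded as in the notebook.
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions import (
    load_preproc_data_adult,
)

dataset = load_preproc_data_adult(['sex'])
privileged_groups = [{'sex': 1}]    # Male
unprivileged_groups = [{'sex': 0}]  # Female

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)

# Ratio and difference of favourable-outcome (> $50K) base rates between the groups;
# values far from 1 (ratio) or 0 (difference) mirror the imbalance seen in the plots.
print("Disparate impact (original data):          %.4f" % metric.disparate_impact())
print("Statistical parity difference (original):  %.4f" % metric.statistical_parity_difference())
```

A disparate impact well below 1 (equivalently, a negative statistical parity difference) on the raw data is the kind of imbalance that the reweighing and disparate-impact-remover notebooks in this repository then attempt to mitigate.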
--------------------------------------------------------------------------------