├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── NOTICE
├── README.md
├── data
│   ├── MLU_Logo.png
│   └── final_project
│       ├── german_credit_test.csv
│       ├── german_credit_test_labels.csv
│       └── german_credit_training.csv
├── environment.yml
├── notebooks
│   ├── day_1
│   │   ├── MLA-RESML-DAY1-FINAL-STUDENT-NB-SOLUTION.ipynb
│   │   ├── MLA-RESML-DAY1-FINAL-STUDENT-NB.ipynb
│   │   └── MLA-RESML-EDA.ipynb
│   ├── day_2
│   │   ├── MLA-RESML-DATAPREP.ipynb
│   │   ├── MLA-RESML-DAY2-FINAL-STUDENT-NB-SOLUTION.ipynb
│   │   ├── MLA-RESML-DAY2-FINAL-STUDENT-NB.ipynb
│   │   ├── MLA-RESML-DI.ipynb
│   │   ├── MLA-RESML-LOGREG.ipynb
│   │   └── img
│   │       ├── fig-data.png
│   │       ├── fig-lg.png
│   │       └── fig-lr.png
│   └── day_3
│       ├── MLA-RESML-DAY3-FINAL-STUDENT-NB-SOLUTION.ipynb
│       ├── MLA-RESML-DAY3-FINAL-STUDENT-NB.ipynb
│       ├── MLA-RESML-ODDS.ipynb
│       └── MLA-RESML-SHAP.ipynb
├── requirements.txt
└── slides
    ├── MLU-RAI-DAY1.pdf
    └── MLU-RAI-DAY2.pdf
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | */.ipynb_checkpoints/*
3 | */mnist_data/*
4 |
5 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 |
2 | Apache License
3 | Version 2.0, January 2004
4 | http://www.apache.org/licenses/
5 |
6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7 |
8 | 1. Definitions.
9 |
10 | "License" shall mean the terms and conditions for use, reproduction,
11 | and distribution as defined by Sections 1 through 9 of this document.
12 |
13 | "Licensor" shall mean the copyright owner or entity authorized by
14 | the copyright owner that is granting the License.
15 |
16 | "Legal Entity" shall mean the union of the acting entity and all
17 | other entities that control, are controlled by, or are under common
18 | control with that entity. For the purposes of this definition,
19 | "control" means (i) the power, direct or indirect, to cause the
20 | direction or management of such entity, whether by contract or
21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
22 | outstanding shares, or (iii) beneficial ownership of such entity.
23 |
24 | "You" (or "Your") shall mean an individual or Legal Entity
25 | exercising permissions granted by this License.
26 |
27 | "Source" form shall mean the preferred form for making modifications,
28 | including but not limited to software source code, documentation
29 | source, and configuration files.
30 |
31 | "Object" form shall mean any form resulting from mechanical
32 | transformation or translation of a Source form, including but
33 | not limited to compiled object code, generated documentation,
34 | and conversions to other media types.
35 |
36 | "Work" shall mean the work of authorship, whether in Source or
37 | Object form, made available under the License, as indicated by a
38 | copyright notice that is included in or attached to the work
39 | (an example is provided in the Appendix below).
40 |
41 | "Derivative Works" shall mean any work, whether in Source or Object
42 | form, that is based on (or derived from) the Work and for which the
43 | editorial revisions, annotations, elaborations, or other modifications
44 | represent, as a whole, an original work of authorship. For the purposes
45 | of this License, Derivative Works shall not include works that remain
46 | separable from, or merely link (or bind by name) to the interfaces of,
47 | the Work and Derivative Works thereof.
48 |
49 | "Contribution" shall mean any work of authorship, including
50 | the original version of the Work and any modifications or additions
51 | to that Work or Derivative Works thereof, that is intentionally
52 | submitted to Licensor for inclusion in the Work by the copyright owner
53 | or by an individual or Legal Entity authorized to submit on behalf of
54 | the copyright owner. For the purposes of this definition, "submitted"
55 | means any form of electronic, verbal, or written communication sent
56 | to the Licensor or its representatives, including but not limited to
57 | communication on electronic mailing lists, source code control systems,
58 | and issue tracking systems that are managed by, or on behalf of, the
59 | Licensor for the purpose of discussing and improving the Work, but
60 | excluding communication that is conspicuously marked or otherwise
61 | designated in writing by the copyright owner as "Not a Contribution."
62 |
63 | "Contributor" shall mean Licensor and any individual or Legal Entity
64 | on behalf of whom a Contribution has been received by Licensor and
65 | subsequently incorporated within the Work.
66 |
67 | 2. Grant of Copyright License. Subject to the terms and conditions of
68 | this License, each Contributor hereby grants to You a perpetual,
69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70 | copyright license to reproduce, prepare Derivative Works of,
71 | publicly display, publicly perform, sublicense, and distribute the
72 | Work and such Derivative Works in Source or Object form.
73 |
74 | 3. Grant of Patent License. Subject to the terms and conditions of
75 | this License, each Contributor hereby grants to You a perpetual,
76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77 | (except as stated in this section) patent license to make, have made,
78 | use, offer to sell, sell, import, and otherwise transfer the Work,
79 | where such license applies only to those patent claims licensable
80 | by such Contributor that are necessarily infringed by their
81 | Contribution(s) alone or by combination of their Contribution(s)
82 | with the Work to which such Contribution(s) was submitted. If You
83 | institute patent litigation against any entity (including a
84 | cross-claim or counterclaim in a lawsuit) alleging that the Work
85 | or a Contribution incorporated within the Work constitutes direct
86 | or contributory patent infringement, then any patent licenses
87 | granted to You under this License for that Work shall terminate
88 | as of the date such litigation is filed.
89 |
90 | 4. Redistribution. You may reproduce and distribute copies of the
91 | Work or Derivative Works thereof in any medium, with or without
92 | modifications, and in Source or Object form, provided that You
93 | meet the following conditions:
94 |
95 | (a) You must give any other recipients of the Work or
96 | Derivative Works a copy of this License; and
97 |
98 | (b) You must cause any modified files to carry prominent notices
99 | stating that You changed the files; and
100 |
101 | (c) You must retain, in the Source form of any Derivative Works
102 | that You distribute, all copyright, patent, trademark, and
103 | attribution notices from the Source form of the Work,
104 | excluding those notices that do not pertain to any part of
105 | the Derivative Works; and
106 |
107 | (d) If the Work includes a "NOTICE" text file as part of its
108 | distribution, then any Derivative Works that You distribute must
109 | include a readable copy of the attribution notices contained
110 | within such NOTICE file, excluding those notices that do not
111 | pertain to any part of the Derivative Works, in at least one
112 | of the following places: within a NOTICE text file distributed
113 | as part of the Derivative Works; within the Source form or
114 | documentation, if provided along with the Derivative Works; or,
115 | within a display generated by the Derivative Works, if and
116 | wherever such third-party notices normally appear. The contents
117 | of the NOTICE file are for informational purposes only and
118 | do not modify the License. You may add Your own attribution
119 | notices within Derivative Works that You distribute, alongside
120 | or as an addendum to the NOTICE text from the Work, provided
121 | that such additional attribution notices cannot be construed
122 | as modifying the License.
123 |
124 | You may add Your own copyright statement to Your modifications and
125 | may provide additional or different license terms and conditions
126 | for use, reproduction, or distribution of Your modifications, or
127 | for any such Derivative Works as a whole, provided Your use,
128 | reproduction, and distribution of the Work otherwise complies with
129 | the conditions stated in this License.
130 |
131 | 5. Submission of Contributions. Unless You explicitly state otherwise,
132 | any Contribution intentionally submitted for inclusion in the Work
133 | by You to the Licensor shall be under the terms and conditions of
134 | this License, without any additional terms or conditions.
135 | Notwithstanding the above, nothing herein shall supersede or modify
136 | the terms of any separate license agreement you may have executed
137 | with Licensor regarding such Contributions.
138 |
139 | 6. Trademarks. This License does not grant permission to use the trade
140 | names, trademarks, service marks, or product names of the Licensor,
141 | except as required for reasonable and customary use in describing the
142 | origin of the Work and reproducing the content of the NOTICE file.
143 |
144 | 7. Disclaimer of Warranty. Unless required by applicable law or
145 | agreed to in writing, Licensor provides the Work (and each
146 | Contributor provides its Contributions) on an "AS IS" BASIS,
147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148 | implied, including, without limitation, any warranties or conditions
149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150 | PARTICULAR PURPOSE. You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 |
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 |
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 |
--------------------------------------------------------------------------------
/NOTICE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | ## Machine Learning University: Responsible AI
3 |
4 | This repository contains __slides__, __notebooks__, and __data__ for the __Machine Learning University (MLU) Responsible AI__ class. Our mission is to make Machine Learning accessible to everyone. We have courses available across many topics of machine learning and believe knowledge of ML can be a key enabler for success. This class is designed to help you get started with Responsible AI, learn about widely used Machine Learning techniques, and apply them to real-world problems.
5 |
6 | ## YouTube
7 | Watch all Responsible AI video recordings in this [YouTube playlist](https://www.youtube.com/playlist?list=PL8P_Z6C4GcuVMxhwT9JO_nKuW0QMSJ-cZ) from our [YouTube channel](https://www.youtube.com/channel/UC12LqyqTQYbXatYS9AA7Nuw/playlists).
8 |
9 | ## Course Overview
10 | There are three lectures and one final project for this class.
11 |
12 | __Lecture 1__
13 | | Title | Studio lab |
14 | | :---: | ---: |
15 | | Exploratory Data Analysis| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_1/MLA-RESML-EDA.ipynb)|
16 | | Final Challenge Day 1| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_1/MLA-RESML-DAY1-FINAL-STUDENT-NB.ipynb)|
17 | | Completed Final Challenge Day 1| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_1/MLA-RESML-DAY1-FINAL-STUDENT-NB-SOLUTION.ipynb)|
18 |
19 | __Lecture 2__
20 | | Title | Studio lab |
21 | | :---: | ---: |
22 | | Data Preparation| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_2/MLA-RESML-DATAPREP.ipynb)|
23 | | Disparate Impact| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_2/MLA-RESML-DI.ipynb)|
24 | | Logistic Regression| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_2/MLA-RESML-LOGREG.ipynb)|
25 | | Final Challenge Day 2| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_2/MLA-RESML-DAY2-FINAL-STUDENT-NB.ipynb)|
26 | | Completed Final Challenge Day 2| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_2/MLA-RESML-DAY2-FINAL-STUDENT-NB-SOLUTION.ipynb)|
27 |
28 | __Lecture 3__
29 | | Title | Studio lab |
30 | | :---: | ---: |
31 | | Equalized Odds| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_3/MLA-RESML-ODDS.ipynb)|
32 | | SHAP| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_3/MLA-RESML-SHAP.ipynb)|
33 | | Final Challenge Day 3| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_3/MLA-RESML-DAY3-FINAL-STUDENT-NB.ipynb)|
34 | | Completed Final Challenge Day 3| [](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-responsible-ai/blob/master/notebooks/day_3/MLA-RESML-DAY3-FINAL-STUDENT-NB-SOLUTION.ipynb)|
35 |
36 |
37 | __Final Project:__ Practice working with a "real-world" dataset for the final project. The final project dataset is in the [data/final_project folder](https://github.com/aws-samples/aws-machine-learning-university-responsible-ai/tree/master/data/final_project). For more details on the final project, check out [this notebook](https://github.com/aws-samples/aws-machine-learning-university-responsible-ai/blob/main/notebooks/day_1/MLA-RESML-DAY1-FINAL-STUDENT-NB.ipynb).
38 |
39 | ## Interactives/Visuals
40 | Interested in visual, interactive explanations of core machine learning concepts? Check out our [MLU-Explain articles](https://mlu-explain.github.io/) to learn at your own pace! Relevant for this class is this article on [Equality of Odds](https://mlu-explain.github.io/equality-of-odds/).
41 |
42 | ## Contribute
43 | If you would like to contribute to the project, see [CONTRIBUTING](CONTRIBUTING.md) for more information.
44 |
45 | ## License
46 | The license for this repository depends on the section. The dataset for the course is provided to you by permission of Amazon and is subject to the terms of the [Amazon License and Access](https://www.amazon.com/gp/help/customer/display.html?nodeId=201909000). You are expressly prohibited from copying, modifying, selling, exporting or using this dataset in any way other than for the purpose of completing this course. The lecture slides are released under the CC-BY-SA-4.0 License. This project is licensed under the Apache-2.0 License. See each section's LICENSE file for details.
47 |
--------------------------------------------------------------------------------
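Lecture 2 above covers disparate impact. As a rough illustration of the metric (toy numbers, not drawn from the course dataset; the two age groups follow the course's grouping):

```python
import pandas as pd

# Toy loan decisions for the two age groups used in the course.
# The numbers are illustrative, not taken from the German Credit dataset.
decisions = pd.DataFrame({
    "age_group": ["<25"] * 10 + [">=25"] * 10,
    "approved":  [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
               + [1, 1, 1, 1, 1, 1, 0, 0, 1, 0],
})

# Disparate impact: ratio of positive-outcome (approval) rates,
# disadvantaged group over advantaged group. The common "80% rule"
# flags values below 0.8 as potentially discriminatory.
rates = decisions.groupby("age_group")["approved"].mean()
di = rates["<25"] / rates[">=25"]
print(f"approval rates: <25={rates['<25']:.2f}, >=25={rates['>=25']:.2f}")
print(f"DI = {di:.2f}")
```

With these toy numbers the approval rates are 0.40 and 0.70, so DI is well below the 0.8 threshold, which is the kind of disparity the day-2 notebooks examine on the real data.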
/data/MLU_Logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-responsible-ai/db8ce25b0e84123967c8da2c4b2c67c8d2c8d6df/data/MLU_Logo.png
--------------------------------------------------------------------------------
/data/final_project/german_credit_test_labels.csv:
--------------------------------------------------------------------------------
1 | ID,credit_risk
2 | 963,1
3 | 611,1
4 | 106,1
5 | 891,0
6 | 342,0
7 | 539,0
8 | 839,0
9 | 769,0
10 | 943,0
11 | 537,0
12 | 189,0
13 | 901,0
14 | 883,0
15 | 241,0
16 | 232,0
17 | 378,1
18 | 20,0
19 | 554,0
20 | 438,0
21 | 726,0
22 | 375,1
23 | 742,0
24 | 278,1
25 | 280,0
26 | 945,0
27 | 28,0
28 | 331,1
29 | 480,0
30 | 503,1
31 | 814,1
32 | 174,1
33 | 327,0
34 | 487,0
35 | 743,0
36 | 386,0
37 | 394,0
38 | 574,0
39 | 657,0
40 | 578,1
41 | 179,0
42 | 233,0
43 | 662,0
44 | 752,0
45 | 94,0
46 | 411,0
47 | 371,0
48 | 103,0
49 | 741,0
50 | 870,0
51 | 842,1
52 | 866,0
53 | 66,0
54 | 101,0
55 | 888,0
56 | 693,0
57 | 122,0
58 | 715,0
59 | 364,1
60 | 266,0
61 | 926,0
62 | 199,1
63 | 761,1
64 | 352,0
65 | 933,0
66 | 555,1
67 | 708,0
68 | 702,0
69 | 240,1
70 | 400,0
71 | 519,0
72 | 330,0
73 | 18,1
74 | 218,0
75 | 469,0
76 | 838,0
77 | 593,1
78 | 684,0
79 | 36,0
80 | 372,0
81 | 52,0
82 | 864,1
83 | 653,1
84 | 553,0
85 | 607,1
86 | 548,1
87 | 994,0
88 | 599,0
89 | 788,1
90 | 462,0
91 | 551,0
92 | 658,0
93 | 139,0
94 | 622,1
95 | 512,0
96 | 612,0
97 | 155,1
98 | 107,0
99 | 580,1
100 | 987,0
101 | 990,0
102 | 894,0
103 | 219,0
104 | 325,0
105 | 899,1
106 | 904,0
107 | 239,0
108 | 44,1
109 | 558,1
110 | 118,1
111 | 238,0
112 | 357,1
113 | 34,0
114 | 343,0
115 | 522,1
116 | 283,0
117 | 48,0
118 | 42,0
119 | 934,0
120 | 691,0
121 | 458,0
122 | 632,0
123 | 205,0
124 | 365,0
125 | 1,1
126 | 706,1
127 | 631,1
128 | 617,0
129 | 729,0
130 | 220,0
131 | 642,1
132 | 598,1
133 | 753,0
134 | 376,0
135 | 61,0
136 | 274,1
137 | 789,1
138 | 680,0
139 | 315,1
140 | 628,0
141 | 49,0
142 | 419,1
143 | 858,1
144 | 311,0
145 | 302,1
146 | 132,0
147 | 513,0
148 | 897,0
149 | 915,1
150 | 65,0
151 | 244,0
152 | 783,1
153 | 125,0
154 | 849,1
155 | 436,0
156 | 980,1
157 | 751,1
158 | 760,0
159 | 734,0
160 | 913,0
161 | 569,1
162 | 932,0
163 | 766,1
164 | 861,1
165 | 168,0
166 | 511,0
167 | 112,0
168 | 437,0
169 | 165,0
170 | 176,0
171 | 414,1
172 | 892,0
173 | 588,1
174 | 955,0
175 | 648,1
176 | 516,0
177 | 410,0
178 | 956,0
179 | 277,0
180 | 215,0
181 | 142,0
182 | 798,0
183 | 314,0
184 | 312,0
185 | 57,0
186 | 294,0
187 | 461,0
188 | 610,1
189 | 275,0
190 | 833,0
191 | 983,1
192 | 401,0
193 | 528,1
194 | 984,0
195 | 221,0
196 | 952,1
197 | 263,0
198 | 4,1
199 | 749,0
200 | 797,0
201 | 460,0
--------------------------------------------------------------------------------
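The label file above pairs applicant `ID`s with binary `credit_risk` outcomes for scoring final-project predictions. A minimal sketch of how predictions might be checked against it (the first few rows are inlined so the snippet runs without a repository checkout; the `credit_risk_pred` column and its values are hypothetical):

```python
from io import StringIO

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# First rows of data/final_project/german_credit_test_labels.csv, inlined.
labels_csv = StringIO("ID,credit_risk\n963,1\n611,1\n106,1\n891,0\n342,0\n")
labels = pd.read_csv(labels_csv)

# Hypothetical model predictions, keyed by the same applicant IDs
# but in a different row order.
predictions = pd.DataFrame(
    {"ID": [342, 891, 106, 611, 963], "credit_risk_pred": [0, 0, 1, 0, 1]}
)

# Align on ID before scoring -- row order in a submission need not
# match the label file.
merged = labels.merge(predictions, on="ID")
acc = accuracy_score(merged["credit_risk"], merged["credit_risk_pred"])
f1 = f1_score(merged["credit_risk"], merged["credit_risk_pred"])
print(f"accuracy={acc:.2f}, f1={f1:.2f}")  # accuracy=0.80, f1=0.80
```

Joining on `ID` rather than relying on positional order avoids silently scoring misaligned rows.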
/environment.yml:
--------------------------------------------------------------------------------
1 | name: mlu-rai
2 |
3 | dependencies:
4 | - python=3.9
5 | - ipykernel
6 | - pip
7 | - pip:
8 | - -r requirements.txt
--------------------------------------------------------------------------------
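One way to build this environment locally (a sketch of standard conda usage; the notebooks themselves target SageMaker Studio Lab, where the resulting kernel appears as `.conda-mlu-rai:Python`):

```shell
# Create the conda environment defined in environment.yml. Run from the
# repository root so the "-r requirements.txt" pip entry resolves.
conda env create -f environment.yml

# The environment name "mlu-rai" comes from the file's "name:" field.
conda activate mlu-rai

# Register the environment as a Jupyter kernel (ipykernel is a dependency).
python -m ipykernel install --user --name mlu-rai
```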
/notebooks/day_1/MLA-RESML-DAY1-FINAL-STUDENT-NB-SOLUTION.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Responsible AI - Final Project Solution\n",
15 | "\n",
16 | "Build a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether some is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
17 | "\n",
18 | "### Final Project Problem: Loan Approval\n",
19 | "\n",
20 | "__Problem Definition:__\n",
21 | "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...) predict whether a loan is repaid or not (is customer a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
22 | "\n",
23 | "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individuals' age).\n",
24 | "\n",
25 | "For example, certain laws declare it unlawful for creditors to discriminate against any applicant on the basis of age (or other sensitive attributes). For more details, have a look at this paper:\n",
26 | "\n",
27 | "``` \n",
28 | "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
29 | "```\n",
30 | "\n",
31 | "__Table of contents__\n",
32 | "\n",
33 | "1. Read the datasets (Given) \n",
34 | "2. Data Processing (Implement)\n",
35 | " * Exploratory Data Analysis\n",
36 | " * Select features to build the model (Suggested)\n",
37 | " * Train - Validation - Test Datasets\n",
38 | " * Data Processing with Pipeline\n",
39 | "3. Train (and Tune) a Classifier on the Training Dataset (Implement)\n",
40 | "4. Make Predictions on the Test Dataset (Implement)\n",
41 | "5. Evaluate Results (Given)\n",
42 | "\n",
43 | "\n",
44 | "__Datasets and Files__\n",
45 | "\n",
46 | "\n",
47 | "- ```german_credit_training.csv```: Training data with loan applicants features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
48 | "\n",
49 | "- ```german_credit_test.csv```: Test data with same features as above apart from label. This will be the data to make predictions for to emulate a production environment."
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yaml` file in SageMaker Sudio Labs."
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "# Reshaping/basic libraries\n",
66 | "import pandas as pd\n",
67 | "import numpy as np\n",
68 | "\n",
69 | "# Plotting libraries\n",
70 | "import matplotlib.pyplot as plt\n",
71 | "import seaborn as sns\n",
72 | "\n",
73 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
74 | "\n",
75 | "# ML libraries\n",
76 | "from sklearn.model_selection import train_test_split\n",
77 | "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
78 | "from sklearn.impute import SimpleImputer\n",
79 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
80 | "from sklearn.pipeline import Pipeline\n",
81 | "from sklearn.compose import ColumnTransformer\n",
82 | "from sklearn.linear_model import LogisticRegression\n",
83 | "\n",
84 | "# Operational libraries\n",
85 | "import sys\n",
86 | "\n",
87 | "sys.path.append(\"..\")\n",
88 | "\n",
89 | "# Jupyter(lab) libraries\n",
90 | "import warnings\n",
91 | "\n",
92 | "warnings.filterwarnings(\"ignore\")"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "## 1. Read the datasets (Given)\n",
100 | "(Go to top)"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "Then, we read the __training__ and __test__ datasets into dataframes, using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
117 | "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
118 | "\n",
119 | "print(\"The shape of the training dataset is:\", training_data.shape)\n",
120 | "print(\"The shape of the test dataset is:\", test_data.shape)"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "## 2. Data Processing (Implement)\n",
128 | "(Go to top) "
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "### 2.1 Exploratory Data Analysis\n",
136 | "(Go to Data Processing)\n",
137 | "\n",
138 | "We look at number of rows, columns, and some simple statistics of the dataset."
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "training_data.head()"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "test_data.head()"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": null,
162 | "metadata": {},
163 | "outputs": [],
164 | "source": [
165 | "# Implement more EDA here"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "training_data.info()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "### 2.2 Select features to build the model \n",
182 | "(Go to Data Processing)\n",
183 | "\n",
184 | "For a quick start, we recommend using only a few of the numerical and categorical features. However, feel free to explore other fields. In this case, we do not need to cast our features to numerical/objects. Mindful with some of the feature names - they suggest numerical values but upon inspection it should become clear that they are actually categoricals (e.g. `employed_since_years` has been binned into groups).\n"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "# Grab model features/inputs and target/output\n",
194 | "categorical_features = [\"job_status\", \"employed_since_years\", \"savings\", \"age_groups\"]\n",
195 | "\n",
196 | "numerical_features = [\"credit_amount\", \"credit_duration_months\"]"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "Separate features and the model target."
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": null,
209 | "metadata": {},
210 | "outputs": [],
211 | "source": [
212 | "model_target = \"credit_risk\"\n",
213 | "model_features = categorical_features + numerical_features\n",
214 | "\n",
215 | "print(\"Model features: \", model_features)\n",
216 | "print(\"Model target: \", model_target)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "### 2.3 Train - Validation Datasets\n",
224 | "(Go to Data Processing)\n",
225 | "\n",
226 | "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
227 | "\n",
228 |     "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. The validation data you create here will be used later in section 3 to tune your classifier."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "# Implement here"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "sns.catplot(x=\"age_groups\", hue=\"credit_risk\", kind=\"count\", data=training_data)"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 |     "We observe that the age group with members less than 25 yrs old is at a disadvantage: almost as many applications get rejected as approved, whereas for the group with members $\\geq$ 25 yrs the ratio is almost 3:1 (i.e. three times as many applications approved as rejected)."
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": null,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "# We only need to split between train and val (test is already separate)\n",
263 | "train_data, val_data = train_test_split(\n",
264 | " training_data, test_size=0.1, shuffle=True, random_state=23\n",
265 | ")\n",
266 | "\n",
267 | "# Print the shapes of the Train - Test Datasets\n",
268 | "print(\n",
269 | " \"Train - Test - Validation datasets shapes: \",\n",
270 | " train_data.shape,\n",
271 | " test_data.shape,\n",
272 | " val_data.shape,\n",
273 | ")"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "### 2.4 Data processing with Pipeline\n",
281 | "(Go to Data Processing)\n",
282 | "\n",
283 | "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the imputed and scaled dataset. \n"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": [
292 | "# Implement here"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "### STEP 1 ###\n",
302 | "##############\n",
303 | "\n",
304 | "# Preprocess the numerical features\n",
305 | "numerical_processor = Pipeline(\n",
306 | " [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
307 | ")\n",
308 | "# Preprocess the categorical features\n",
309 | "categorical_processor = Pipeline(\n",
310 | " [\n",
311 | " (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
312 | " (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\", drop=\"if_binary\")),\n",
313 | " ]\n",
314 | ")\n",
315 | "\n",
316 | "### STEP 2 ###\n",
317 | "##############\n",
318 | "\n",
319 | "# Combine all data preprocessors from above\n",
320 | "data_processor = ColumnTransformer(\n",
321 | " [\n",
322 | " (\"numerical_processing\", numerical_processor, numerical_features),\n",
323 | " (\"categorical_processing\", categorical_processor, categorical_features),\n",
324 | " ]\n",
325 | ")\n",
326 | "\n",
327 | "### STEP 3 ###\n",
328 | "##############\n",
329 | "\n",
330 |     "# Pipeline with all desired data transformers, along with an estimator at the end\n",
331 |     "# Later you can set/access the parameters using these names - for hyperparameter tuning, for example\n",
332 | "pipeline = Pipeline(\n",
333 | " [\n",
334 | " (\"data_processing\", data_processor),\n",
335 | " (\"lg\", LogisticRegression(solver=\"lbfgs\", penalty=None)),\n",
336 | " ]\n",
337 | ")"
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "## 3. Train (and Tune) a Classifier (Implement)\n",
345 | "(Go to top)\n",
346 | "\n",
347 |     "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies or different scaling methods."
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": null,
353 | "metadata": {},
354 | "outputs": [],
355 | "source": [
356 | "# Implement here"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": [
365 | "# Get train data to train the classifier\n",
366 | "X_train = train_data[model_features]\n",
367 | "y_train = train_data[model_target]\n",
368 | "\n",
369 | "# Fit the classifier to the train data\n",
370 | "# Train data going through the Pipeline is imputed (with means from the train data),\n",
371 | "# scaled (with the min/max from the train data),\n",
372 | "# and finally used to fit the model\n",
373 | "pipeline.fit(X_train, y_train)"
374 | ]
375 | },
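  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the tuning mentioned above (optional, and only one of many possibilities): because the `Pipeline` and `ColumnTransformer` steps are named, their parameters can be reached with double-underscore paths and searched over with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The parameter path below follows the step names defined in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "# Parameter paths follow the step names: <pipeline step>__<transformer>__<step>__<parameter>\n",
    "param_grid = {\n",
    "    \"data_processing__numerical_processing__num_imputer__strategy\": [\"mean\", \"median\"]\n",
    "}\n",
    "\n",
    "# Refit the pipeline for each candidate and keep the best (by F1, suited to the skewed target)\n",
    "grid = GridSearchCV(pipeline, param_grid, scoring=\"f1\", cv=3)\n",
    "grid.fit(X_train, y_train)\n",
    "print(grid.best_params_)"
   ]
  },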
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "## 4. Make Predictions on the Test Dataset (Implement)\n",
381 | "(Go to top)\n",
382 | "\n",
383 |     "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates disparate impact (DI)."
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "# Implement here\n",
393 | "\n",
394 | "# Get test data to test the classifier\n",
395 | "# ! test data should come from german_credit_test.csv !\n",
396 | "# ...\n",
397 | "\n",
398 | "# Use the trained model to make predictions on the test dataset\n",
399 | "# test_predictions = ..."
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": null,
405 | "metadata": {},
406 | "outputs": [],
407 | "source": [
408 | "# Get test data to validate the classifier\n",
409 | "X_test = test_data[model_features]\n",
410 | "\n",
411 | "# Use the fitted model to make predictions on the test dataset\n",
412 | "# Test data going through the Pipeline is imputed (with means from the train data),\n",
413 | "# scaled (with the min/max from the train data),\n",
414 | "# and finally used to make predictions\n",
415 | "test_predictions = pipeline.predict(X_test)"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "## 5. Evaluate Results (Given)\n",
423 | "(Go to top)"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": null,
429 | "metadata": {},
430 | "outputs": [],
431 | "source": [
432 | "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
433 | "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
434 | "result_df[\"credit_risk_pred\"] = test_predictions\n",
435 | "\n",
436 | "result_df.to_csv(\"../../data/final_project/project_day1_result.csv\", index=False)"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "### Final Evaluation on Test Data - Disparate Impact\n",
444 | "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {},
451 | "outputs": [],
452 | "source": [
453 | "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
454 | " \"\"\"\n",
455 | " Function to calculate Disparate Impact metric using the results from this notebook.\n",
456 | " \"\"\"\n",
457 | " try:\n",
458 |     "        # Merge predictions with the original test data to compute outcomes per group\n",
459 | " di_df = pred_df.merge(test_data, on=\"ID\")\n",
460 | " # Count for group with members less than 25y old\n",
461 | " pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
462 | " 0\n",
463 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
464 | " total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
465 | " # Count for group with members greater equal 25y old\n",
466 | " pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
467 | " 0\n",
468 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
469 | " total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
470 |     "        # Check that both groups are present\n",
471 |     "        if total_geq25 == 0 or total_less25 == 0:\n",
472 |     "            print(\"There is only one group present in the data.\")\n",
475 | " else:\n",
476 | " disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
477 | " pos_outcomes_geq25 / total_geq25\n",
478 | " )\n",
479 | " return disparate_impact\n",
480 |     "    except Exception:\n",
481 | " print(\"Wrong inputs provided.\")"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {},
488 | "outputs": [],
489 | "source": [
490 | "calculate_di(test_data, result_df, \"credit_risk_pred\")"
491 | ]
492 | },
493 | {
494 | "cell_type": "markdown",
495 | "metadata": {},
496 | "source": [
497 | "While this might look good, keep in mind that `age_groups` was used to train the model; depending on the domain, it might not be permissible to use this feature."
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
505 | "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy."
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": null,
511 | "metadata": {},
512 | "outputs": [],
513 | "source": [
514 | "accuracy_score(\n",
515 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
516 | " \"credit_risk\"\n",
517 | " ],\n",
518 | " result_df[\"credit_risk_pred\"],\n",
519 | ")"
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": null,
525 | "metadata": {},
526 | "outputs": [],
527 | "source": [
528 | "f1_score(\n",
529 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
530 | " \"credit_risk\"\n",
531 | " ],\n",
532 | " result_df[\"credit_risk_pred\"],\n",
533 | ")"
534 | ]
535 | },
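  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick illustration with made-up numbers (not this dataset) of why accuracy alone can mislead on a skewed target: a degenerate model that always predicts the majority class still reaches high accuracy, while F1 drops to zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical skewed labels: 90 negatives, 10 positives\n",
    "y_true_demo = [0] * 90 + [1] * 10\n",
    "# Degenerate model that always predicts the majority class\n",
    "y_pred_demo = [0] * 100\n",
    "\n",
    "print(\"Accuracy:\", accuracy_score(y_true_demo, y_pred_demo))  # 0.9\n",
    "print(\"F1 score:\", f1_score(y_true_demo, y_pred_demo))  # 0.0"
   ]
  },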
536 | {
537 | "cell_type": "markdown",
538 | "metadata": {},
539 | "source": [
540 | "This is the end of the notebook."
541 | ]
542 | }
543 | ],
544 | "metadata": {
545 | "kernelspec": {
546 | "display_name": ".conda-mlu-rai:Python",
547 | "language": "python",
548 | "name": "conda-env-.conda-mlu-rai-py"
549 | },
550 | "language_info": {
551 | "codemirror_mode": {
552 | "name": "ipython",
553 | "version": 3
554 | },
555 | "file_extension": ".py",
556 | "mimetype": "text/x-python",
557 | "name": "python",
558 | "nbconvert_exporter": "python",
559 | "pygments_lexer": "ipython3",
560 | "version": "3.9.20"
561 | }
562 | },
563 | "nbformat": 4,
564 | "nbformat_minor": 4
565 | }
566 |
--------------------------------------------------------------------------------
/notebooks/day_1/MLA-RESML-DAY1-FINAL-STUDENT-NB.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "tags": []
14 | },
15 | "source": [
16 | "# Responsible AI - Final Project\n",
17 | "\n",
18 |     "Build a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
19 | "\n",
20 | "### Final Project Problem: Loan Approval\n",
21 | "\n",
22 | "__Problem Definition:__\n",
23 |     "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan is repaid or not (i.e. whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
24 | "\n",
25 |     "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individual's age).\n",
26 | "\n",
27 | "For example, certain laws declare it unlawful for creditors to discriminate against any applicant on the basis of age (or other sensitive attributes). For more details, have a look at this paper:\n",
28 | "\n",
29 | "``` \n",
30 | "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
31 | "```\n",
32 | "\n",
33 | "__Table of contents__\n",
34 | "\n",
35 | "1. Read the datasets (Given) \n",
36 | "2. Data Processing (Implement)\n",
37 | " * Exploratory Data Analysis\n",
38 | " * Select features to build the model (Suggested)\n",
39 | " * Train - Validation - Test Datasets\n",
40 | " * Data Processing with Pipeline\n",
41 | "3. Train (and Tune) a Classifier on the Training Dataset (Implement)\n",
42 | "4. Make Predictions on the Test Dataset (Implement)\n",
43 | "5. Evaluate Results (Given)\n",
44 | "\n",
45 | "\n",
46 | "__Datasets and Files__\n",
47 | "\n",
48 | "\n",
49 |     "- ```german_credit_training.csv```: Training data with loan applicants' features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
50 | "\n",
51 |     "- ```german_credit_test.csv```: Test data with the same features as above, apart from the label. This is the data you will make predictions for, emulating a production environment.\n"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 |     "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "# Reshaping/basic libraries\n",
68 | "import pandas as pd\n",
69 | "import numpy as np\n",
70 | "\n",
71 | "# Plotting libraries\n",
72 | "import matplotlib.pyplot as plt\n",
73 | "import seaborn as sns\n",
74 | "\n",
75 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
76 | "\n",
77 | "# ML libraries\n",
78 | "from sklearn.model_selection import train_test_split\n",
79 | "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
80 | "from sklearn.impute import SimpleImputer\n",
81 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
82 | "from sklearn.pipeline import Pipeline\n",
83 | "from sklearn.compose import ColumnTransformer\n",
84 | "from sklearn.linear_model import LogisticRegression\n",
85 | "\n",
86 | "# Operational libraries\n",
87 | "import sys\n",
88 | "\n",
89 | "sys.path.append(\"..\")\n",
90 | "\n",
91 | "# Jupyter(lab) libraries\n",
92 | "import warnings\n",
93 | "\n",
94 | "warnings.filterwarnings(\"ignore\")"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "## 1. Read the datasets (Given)\n",
102 | "(Go to top)\n",
103 | "\n",
104 |     "First, we read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
114 | "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
115 | "\n",
116 | "print(\"The shape of the training dataset is:\", training_data.shape)\n",
117 | "print(\"The shape of the test dataset is:\", test_data.shape)"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "## 2. Data Processing (Implement)\n",
125 | "(Go to top) "
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "### 2.1 Exploratory Data Analysis\n",
133 | "(Go to Data Processing)\n",
134 | "\n",
135 | "We look at number of rows, columns, and some simple statistics of the dataset."
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "training_data.head()"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": null,
150 | "metadata": {},
151 | "outputs": [],
152 | "source": [
153 | "test_data.head()"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "# Implement more EDA here"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "### 2.2 Select features to build the model \n",
170 | "(Go to Data Processing)\n",
171 | "\n",
172 |     "For a quick start, we recommend using only a few of the numerical and categorical features. However, feel free to explore other fields. In this case, we do not need to cast our features to numerical/objects. Be mindful of some of the feature names - they suggest numerical values, but upon inspection it should become clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).\n"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": null,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "# Grab model features/inputs and target/output\n",
182 | "categorical_features = [\"job_status\", \"employed_since_years\", \"savings\", \"age_groups\"]\n",
183 | "\n",
184 | "numerical_features = [\"credit_amount\", \"credit_duration_months\"]"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "Separate features and the model target."
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": null,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "model_target = \"credit_risk\"\n",
201 | "model_features = categorical_features + numerical_features\n",
202 | "\n",
203 | "print(\"Model features: \", model_features)\n",
204 | "print(\"Model target: \", model_target)"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | "### 2.3 Train - Validation Datasets\n",
212 | "(Go to Data Processing)\n",
213 | "\n",
214 | "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
215 | "\n",
216 |     "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. The validation data you create here will be used later in section 3 to tune your classifier."
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {},
223 | "outputs": [],
224 | "source": [
225 | "# Implement here"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | "### 2.4 Data processing with Pipeline\n",
233 | "(Go to Data Processing)\n",
234 | "\n",
235 | "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the imputed and scaled dataset. \n"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {},
242 | "outputs": [],
243 | "source": [
244 | "# Implement here"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "## 3. Train (and Tune) a Classifier (Implement)\n",
252 | "(Go to top)\n",
253 | "\n",
254 |     "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies or different scaling methods."
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "# Implement here"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "## 4. Make Predictions on the Test Dataset (Implement)\n",
271 | "(Go to top)\n",
272 | "\n",
273 |     "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates disparate impact (DI)."
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {},
280 | "outputs": [],
281 | "source": [
282 | "# Implement here\n",
283 | "\n",
284 | "# Get test data to test the classifier\n",
285 | "# ! test data should come from german_credit_test.csv !\n",
286 | "# ...\n",
287 | "\n",
288 | "# Use the trained model to make predictions on the test dataset\n",
289 | "# test_predictions = ..."
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "## 5. Evaluate Results (Given)\n",
297 | "(Go to top)"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": null,
303 | "metadata": {},
304 | "outputs": [],
305 | "source": [
306 | "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
307 | "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
308 | "result_df[\"credit_risk_pred\"] = test_predictions\n",
309 | "\n",
310 | "result_df.to_csv(\"../../data/final_project/project_day1_result.csv\", index=False)"
311 | ]
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "### Final Evaluation on Test Data - Disparate Impact\n",
318 | "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": null,
324 | "metadata": {},
325 | "outputs": [],
326 | "source": [
327 | "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
328 | " \"\"\"\n",
329 | " Function to calculate Disparate Impact metric using the results from this notebook.\n",
330 | " \"\"\"\n",
331 | " try:\n",
332 |     "        # Merge predictions with the original test data to compute outcomes per group\n",
333 | " di_df = pred_df.merge(test_data, on=\"ID\")\n",
334 | " # Count for group with members less than 25y old\n",
335 | " pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
336 | " 0\n",
337 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
338 | " total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
339 | " # Count for group with members greater equal 25y old\n",
340 | " pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
341 | " 0\n",
342 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
343 | " total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
344 |     "        # Check that both groups are present\n",
345 |     "        if total_geq25 == 0 or total_less25 == 0:\n",
346 |     "            print(\"There is only one group present in the data.\")\n",
349 | " else:\n",
350 | " disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
351 | " pos_outcomes_geq25 / total_geq25\n",
352 | " )\n",
353 | " return disparate_impact\n",
354 |     "    except Exception:\n",
355 | " print(\"Wrong inputs provided.\")"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": null,
361 | "metadata": {},
362 | "outputs": [],
363 | "source": [
364 | "calculate_di(test_data, result_df, \"credit_risk_pred\")"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
372 | "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": [
381 | "accuracy_score(\n",
382 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
383 | " \"credit_risk\"\n",
384 | " ],\n",
385 | " result_df[\"credit_risk_pred\"],\n",
386 | ")"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": [
395 | "f1_score(\n",
396 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
397 | " \"credit_risk\"\n",
398 | " ],\n",
399 | " result_df[\"credit_risk_pred\"],\n",
400 | ")"
401 | ]
402 | },
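  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick illustration with made-up numbers (not this dataset) of why accuracy alone can mislead on a skewed target: a degenerate model that always predicts the majority class still reaches high accuracy, while F1 drops to zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical skewed labels: 90 negatives, 10 positives\n",
    "y_true_demo = [0] * 90 + [1] * 10\n",
    "# Degenerate model that always predicts the majority class\n",
    "y_pred_demo = [0] * 100\n",
    "\n",
    "print(\"Accuracy:\", accuracy_score(y_true_demo, y_pred_demo))  # 0.9\n",
    "print(\"F1 score:\", f1_score(y_true_demo, y_pred_demo))  # 0.0"
   ]
  },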
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "This is the end of the notebook."
408 | ]
409 | }
410 | ],
411 | "metadata": {
412 | "kernelspec": {
413 | "display_name": ".conda-mlu-rai:Python",
414 | "language": "python",
415 | "name": "conda-env-.conda-mlu-rai-py"
416 | },
417 | "language_info": {
418 | "codemirror_mode": {
419 | "name": "ipython",
420 | "version": 3
421 | },
422 | "file_extension": ".py",
423 | "mimetype": "text/x-python",
424 | "name": "python",
425 | "nbconvert_exporter": "python",
426 | "pygments_lexer": "ipython3",
427 | "version": "3.9.20"
428 | }
429 | },
430 | "nbformat": 4,
431 | "nbformat_minor": 4
432 | }
433 |
--------------------------------------------------------------------------------
/notebooks/day_2/MLA-RESML-DATAPREP.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Responsible AI - Data Processing\n",
15 | "\n",
16 | "This notebook shows basic data processing steps required to get data ready for model ingestion.\n",
17 | "\n",
18 | "__Dataset:__ \n",
19 | "You will download a dataset for this exercise using [folktables](https://github.com/zykls/folktables). Folktables provides an API to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files which are managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).\n",
20 | "\n",
21 | "__ML Problem:__ \n",
22 | "Ultimately, the goal will be to predict whether an individual's income is above \\\\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\\\$100. The threshold of \\\\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.\n",
23 | "\n",
24 | "__Table of contents__\n",
25 | "1. Loading Data\n",
26 | "2. Data Prep: Basics\n",
27 | "3. Data Prep: Missing Values\n",
28 | "4. Data Prep: Renaming Columns\n",
29 | "5. Data Prep: Encoding Categoricals\n",
30 |     "6. Data Prep: Scaling Numericals"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 |     "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": null,
43 | "metadata": {
44 | "tags": []
45 | },
46 | "outputs": [],
47 | "source": [
48 | "# Reshaping/basic libraries\n",
49 | "import pandas as pd\n",
    "import numpy as np\n",
50 | "\n",
51 | "# Plotting libraries\n",
52 | "import matplotlib.pyplot as plt\n",
53 | "import seaborn as sns\n",
54 | "\n",
55 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
56 | "\n",
57 | "# Operational libraries\n",
58 | "import sys\n",
59 | "\n",
60 | "sys.path.append(\"..\")\n",
61 | "\n",
62 | "# ML libraries\n",
63 | "from sklearn.impute import SimpleImputer\n",
64 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
65 | "\n",
66 | "# Fairness libraries\n",
67 | "from folktables.acs import *\n",
68 | "from folktables.folktables import *\n",
69 | "from folktables.load_acs import *\n",
70 | "\n",
71 | "# Jupyter(lab) libraries\n",
72 | "import warnings\n",
73 | "\n",
74 | "warnings.filterwarnings(\"ignore\")"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {
80 | "tags": []
81 | },
82 | "source": [
83 | "## 1. Loading Data\n",
84 | "(Go to top)"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "To read in the dataset, we will be using [folktables](https://github.com/zykls/folktables) which provides access to the US Census dataset. Folktables contains predefined prediction tasks but also allows the user to specify the problem type."
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 |     "The US Census dataset distinguishes between households and individuals. To obtain data on individuals, we use `ACSDataSource` with `survey=person`. The feature names for the US Census data follow the same distinction and use `P` for `person` and `H` for `household`, e.g. `AGEP` refers to the age of an individual."
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {
105 | "tags": []
106 | },
107 | "outputs": [],
108 | "source": [
109 | "income_features = [\n",
110 | " \"AGEP\", # age individual\n",
111 | " \"COW\", # class of worker\n",
112 | " \"SCHL\", # educational attainment\n",
113 | " \"MAR\", # marital status\n",
114 | " \"OCCP\", # occupation\n",
115 | " \"POBP\", # place of birth\n",
116 | " \"RELP\", # relationship\n",
117 | " \"WKHP\", # hours worked per week past 12 months\n",
118 | " \"SEX\", # sex\n",
119 | " \"RAC1P\", # recorded detailed race code\n",
120 | " \"PWGTP\", # persons weight\n",
121 | " \"GCL\", # grandparents living with grandchildren\n",
122 | " \"SCH\", # school enrollment\n",
123 | "]\n",
124 | "\n",
125 | "# Define the prediction problem and features\n",
126 | "ACSIncome = folktables.BasicProblem(\n",
127 | " features=income_features,\n",
128 | " target=\"PINCP\", # total persons income\n",
129 | " target_transform=lambda x: x > 50000,\n",
130 | " group=\"RAC1P\",\n",
131 | "    preprocess=adult_filter, # applies the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))\n",
132 | " postprocess=lambda x: x, # applies post processing, e.g. fill all NAs\n",
133 | ")\n",
134 | "\n",
135 | "# Initialize year, duration (\"1-Year\" or \"5-Year\") and granularity (household or person)\n",
136 | "data_source = ACSDataSource(survey_year=\"2018\", horizon=\"1-Year\", survey=\"person\")\n",
137 | "# Specify region (here: California) and load data\n",
138 | "ca_data = data_source.get_data(states=[\"CA\"], download=True)\n",
139 | "# Apply transformation as per problem statement above\n",
140 | "ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {
146 | "tags": []
147 | },
148 | "source": [
149 | "## 2. Data Prep: Basics\n",
150 | "(Go to top)\n",
151 | "\n",
152 | "We want to go through the basic steps of data prep: converting all categorical features into dummy features (0/1 encoding) and scaling numerical values. Scaling is important because many ML techniques rely on distance measures, which can be distorted when values are on different scales. Before starting the encoding and scaling, you should have a look at the main characteristics of the dataset first."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {
159 | "tags": []
160 | },
161 | "outputs": [],
162 | "source": [
163 | "# Convert numpy array to dataframe\n",
164 | "df = pd.DataFrame(\n",
165 | " np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),\n",
166 | " columns=income_features + [\">50k\"],\n",
167 | ")\n",
168 | "\n",
169 | "# Print the first five rows\n",
170 | "# NaN means missing data\n",
171 | "df.head()"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "Let's cast the categorical and numerical features accordingly (see EDA for additional explanation)."
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "tags": []
186 | },
187 | "outputs": [],
188 | "source": [
189 | "categorical_features = [\n",
190 | " \"COW\",\n",
191 | " \"SCHL\",\n",
192 | " \"MAR\",\n",
193 | " \"OCCP\",\n",
194 | " \"POBP\",\n",
195 | " \"RELP\",\n",
196 | " \"SEX\",\n",
197 | " \"RAC1P\",\n",
198 | " \"GCL\",\n",
199 | " \"SCH\",\n",
200 | "]\n",
201 | "\n",
202 | "numerical_features = [\"AGEP\", \"WKHP\", \"PWGTP\"]"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {
209 | "tags": []
210 | },
211 | "outputs": [],
212 | "source": [
213 | "# We cast categorical features to `object`\n",
214 | "df[categorical_features] = df[categorical_features].astype(\"object\")\n",
215 | "\n",
216 | "# We cast numerical features to `int`\n",
217 | "df[numerical_features] = df[numerical_features].astype(\"int\")"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "Looks good, so we can now separate the model features from the model target to explore them separately.\n",
225 | "\n",
226 | "#### Model Target & Model Features"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "model_target = \">50k\"\n",
236 | "model_features = categorical_features + numerical_features"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "# Double check that the target is not accidentally part of the features\n",
246 | "model_target in model_features"
247 | ]
248 | },
249 | {
250 | "cell_type": "markdown",
251 | "metadata": {},
252 | "source": [
253 | "All good here. We made sure that the target is not in the feature list. If the statement above returns `True`, we need to remove the target by calling `model_features.remove(model_target)`.\n",
254 | "\n",
255 | "Let's have a look at missing values next."
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {
261 | "tags": []
262 | },
263 | "source": [
264 | "## 3. Data Prep: Missing Values\n",
265 | "(Go to top)\n",
266 | "\n",
267 | "The quickest way to check for missing values is to use `.isna().sum()`. This provides a count of how many missing values we have per column. We can also infer the count of missing values from `.info()`, as it provides a count of non-null values."
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "# Show missing values\n",
277 | "df.isna().sum()"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "To fill missing values we will use Sklearn's `SimpleImputer`. `SimpleImputer` is a Sklearn transformer, which means we first need to fit it before we can apply the transformation to our data. We start by initializing the transformer:"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "metadata": {},
291 | "outputs": [],
292 | "source": [
293 | "# Depending on the data type we need different imputation strategies!\n",
294 | "\n",
295 | "# If we have missing values in a numerical column, we can backfill with the mean\n",
296 | "imputer_numerical = SimpleImputer(strategy=\"mean\")\n",
297 | "\n",
298 | "# If we have missing values in a categorical column, we can backfill with \"missing\"\n",
299 | "imputer_categorical = SimpleImputer(strategy=\"constant\", fill_value=\"missing\")"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "Once the transformers have been initialized, we can fit them and apply them to data."
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "imputer_numerical.fit(df[numerical_features])\n",
316 | "imputer_categorical.fit(df[categorical_features])"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "The `.fit()` method learns the transformation (i.e. it learns the mean per column, finds the most frequent value, ...). Now that the transformation is learned, we can apply it. Be careful when doing this on a dataset that was split into train, validation and test subsets: the transformation needs to be learned on the training set only and can then be applied to all other subsets."
324 | ]
325 | },
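{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal, self-contained sketch of this rule (on toy data, not this notebook's dataframe; the split and seed below are illustrative only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Toy data with missing values\n",
"toy = pd.DataFrame({\"age\": [22.0, 35.0, np.nan, 41.0, 58.0, np.nan]})\n",
"\n",
"# Illustrative split; in practice, split before fitting any transformer\n",
"train_toy, val_toy = train_test_split(toy, test_size=0.5, random_state=0)\n",
"\n",
"imp = SimpleImputer(strategy=\"mean\")\n",
"# Learn the column mean on the training subset only ...\n",
"imp.fit(train_toy)\n",
"# ... then reuse that mean to fill missing values in both subsets\n",
"train_filled = imp.transform(train_toy)\n",
"val_filled = imp.transform(val_toy)"
]
},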
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {},
330 | "outputs": [],
331 | "source": [
332 | "df_num = imputer_numerical.transform(df[numerical_features])\n",
333 | "df_cat = imputer_categorical.transform(\n",
334 | " df[categorical_features].astype(str)\n",
335 | ") # make sure to cast all other categoricals as string\n",
336 | "\n",
337 | "df = pd.concat(\n",
338 | " [\n",
339 | " pd.DataFrame(df_num, columns=numerical_features),\n",
340 | " pd.DataFrame(df_cat, columns=categorical_features),\n",
341 | " ],\n",
342 | " axis=1,\n",
343 | ").copy(deep=True)\n",
344 | "\n",
345 | "# Show missing values\n",
346 | "df.isna().sum()"
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "Let's take a quick detour and rename the columns to make them easier to understand."
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {
359 | "tags": []
360 | },
361 | "source": [
362 | "## 4. Data Prep: Renaming Columns\n",
363 | "(Go to top)\n",
364 | "\n",
365 | "When looking at the dataframe, we notice that the column headers are not self-explanatory. This can make debugging and communicating results confusing. We should therefore consider renaming the column headers. We can do this with `.rename()`. To perform the renaming, we need to create a mapping between the old name and the new name we want to use."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {},
372 | "outputs": [],
373 | "source": [
374 | "# Create column name mapping\n",
375 | "name_mapping = {\n",
376 | " \"AGEP\": \"age_individual\",\n",
377 | " \"COW\": \"class_of_worker\",\n",
378 | " \"SCHL\": \"educational_attainment\",\n",
379 | " \"MAR\": \"marital_status\",\n",
380 | " \"OCCP\": \"occupation\",\n",
381 | " \"POBP\": \"place_of_birth\",\n",
382 | " \"RELP\": \"relationship\",\n",
383 | " \"WKHP\": \"hours_worked_weekly_past_year\",\n",
384 | " \"SEX\": \"sex\",\n",
385 | " \"RAC1P\": \"race_code\",\n",
386 | " \"PWGTP\": \"persons_weight\",\n",
387 | " \"GCL\": \"grand_parents_living_with_grandchildren\",\n",
388 | " \"SCH\": \"school_enrollment\",\n",
389 | "}\n",
390 | "\n",
391 | "# Rename the columns\n",
392 | "df.rename(name_mapping, axis=1, inplace=True)\n",
393 | "\n",
394 | "# Make sure to update the lists that contain the categorical and numerical features\n",
395 | "categorical_features = [\n",
396 | " name_mapping[k] for k in name_mapping.keys() if k in categorical_features\n",
397 | "]\n",
398 | "numerical_features = [\n",
399 | " name_mapping[k] for k in name_mapping.keys() if k in numerical_features\n",
400 | "]"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Now that we have dealt with the missing values and renamed the columns, we can convert the categorical columns to one-hot encoded versions (dummies)."
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {
413 | "tags": []
414 | },
415 | "source": [
416 | "## 5. Data Prep: Encoding Categoricals\n",
417 | "(Go to top)\n",
418 | "\n",
419 | "One-hot encoding only works if there are no NAs left in the dataframe, hence why we had to deal with the missing values first. Once again, we will use a transformer from Sklearn, `OneHotEncoder`."
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {},
426 | "outputs": [],
427 | "source": [
428 | "# Initialize OneHotEncoder\n",
429 | "ohe = OneHotEncoder(handle_unknown=\"ignore\")\n",
430 | "\n",
431 | "# Fit and transform in one step\n",
432 | "df_cat_ohe = ohe.fit_transform(df[categorical_features])\n",
433 | "\n",
434 | "# Create dataframe\n",
435 | "df_cat_new = pd.DataFrame(\n",
436 | " df_cat_ohe.toarray(), columns=ohe.get_feature_names_out(categorical_features)\n",
437 | ")\n",
438 | "\n",
439 | "df_cat_new.head()"
440 | ]
441 | },
442 | {
443 | "cell_type": "markdown",
444 | "metadata": {
445 | "tags": []
446 | },
447 | "source": [
448 | "## 6. Data Prep: Scaling Numericals\n",
449 | "(Go to top)\n",
450 | "\n",
451 | "Generally in ML we want all our numerical features to be on the same scale. This prevents certain features from being treated as more important based on the magnitude of their values alone. Scaling also helps algorithms that use distance measures to evaluate similarity. We can use `MinMaxScaler` or `StandardScaler` for scaling numerical features."
452 | ]
453 | },
454 | {
455 | "cell_type": "code",
456 | "execution_count": null,
457 | "metadata": {},
458 | "outputs": [],
459 | "source": [
460 | "# Initialize MinMaxScaler\n",
461 | "mms = MinMaxScaler()\n",
462 | "\n",
463 | "# Fit and transform in one step\n",
464 | "df_num_mms = mms.fit_transform(df[numerical_features])\n",
465 | "\n",
466 | "# Create dataframe\n",
467 | "df_num_new = pd.DataFrame(df_num_mms, columns=numerical_features)\n",
468 | "\n",
469 | "df_num_new.head()"
470 | ]
471 | },
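{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a minimal sketch of `StandardScaler` (not used above; it comes from `sklearn.preprocessing`) on toy data. Standardized values have roughly zero mean and unit variance instead of being squeezed into the [0, 1] range:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"toy = np.array([[20.0], [30.0], [40.0], [50.0]])\n",
"\n",
"# Standardize: subtract the column mean, divide by the column standard deviation\n",
"ss = StandardScaler()\n",
"toy_scaled = ss.fit_transform(toy)\n",
"\n",
"print(toy_scaled.mean(), toy_scaled.std())"
]
},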
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "This is the end of this notebook."
477 | ]
478 | }
479 | ],
480 | "metadata": {
481 | "kernelspec": {
482 | "display_name": ".conda-mlu-rai:Python",
483 | "language": "python",
484 | "name": "conda-env-.conda-mlu-rai-py"
485 | },
486 | "language_info": {
487 | "codemirror_mode": {
488 | "name": "ipython",
489 | "version": 3
490 | },
491 | "file_extension": ".py",
492 | "mimetype": "text/x-python",
493 | "name": "python",
494 | "nbconvert_exporter": "python",
495 | "pygments_lexer": "ipython3",
496 | "version": "3.9.20"
497 | }
498 | },
499 | "nbformat": 4,
500 | "nbformat_minor": 4
501 | }
502 |
--------------------------------------------------------------------------------
/notebooks/day_2/MLA-RESML-DAY2-FINAL-STUDENT-NB.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Responsible AI - Final Project\n",
15 | "\n",
16 | "Build a fair [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
17 | "\n",
18 | "### Final Project Problem: Loan Approval\n",
19 | "\n",
20 | "__Problem Definition:__\n",
21 | "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan is repaid or not (i.e. whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
22 | "\n",
23 | "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individual's age).\n",
24 | "\n",
25 | "\n",
26 | "``` \n",
27 | "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
28 | "```\n",
29 | "\n",
30 | "1. Read the datasets (Given) \n",
31 | "2. Data Processing (Implement)\n",
32 | " * Exploratory Data Analysis\n",
33 | " * Select features to build the model (Suggested)\n",
34 | " * Train - Validation - Test Datasets\n",
35 | " * Feature transformation\n",
36 | "3. Train a Classifier on the Training Dataset (Implement)\n",
37 | "4. Make Predictions on the Test Dataset (Implement)\n",
38 | "5. Evaluate Results (Given)\n",
39 | "\n",
40 | "\n",
41 | "__Datasets and Files:__\n",
42 | "\n",
43 | "\n",
44 | "- ```german_credit_training.csv```: Training data with loan applicants features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
45 | "\n",
46 | "- ```german_credit_test.csv```: Test data with same features as above apart from label. This will be the data to make predictions for to emulate a production environment."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {
60 | "tags": []
61 | },
62 | "outputs": [],
63 | "source": [
64 | "%%capture\n",
65 | "\n",
66 | "# Reshaping/basic libraries\n",
67 | "import pandas as pd\n",
68 | "import numpy as np\n",
69 | "\n",
70 | "# Plotting libraries\n",
71 | "import matplotlib.pyplot as plt\n",
72 | "import seaborn as sns\n",
73 | "\n",
74 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
75 | "\n",
76 | "# ML libraries\n",
77 | "from sklearn.model_selection import train_test_split\n",
78 | "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
79 | "from sklearn.impute import SimpleImputer\n",
80 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
81 | "from sklearn.pipeline import Pipeline\n",
82 | "from sklearn.compose import ColumnTransformer\n",
83 | "from sklearn.linear_model import LogisticRegression\n",
84 | "\n",
85 | "# Fairness libraries\n",
86 | "from folktables.acs import *\n",
87 | "from folktables.folktables import *\n",
88 | "from folktables.load_acs import *\n",
89 | "from aif360.datasets import BinaryLabelDataset, Dataset\n",
90 | "from aif360.metrics import BinaryLabelDatasetMetric\n",
91 | "from aif360.algorithms.preprocessing import DisparateImpactRemover\n",
92 | "\n",
93 | "# Operational libraries\n",
94 | "import sys\n",
95 | "\n",
96 | "sys.path.append(\"..\")\n",
97 | "\n",
98 | "# Jupyter(lab) libraries\n",
99 | "import warnings\n",
100 | "\n",
101 | "warnings.filterwarnings(\"ignore\")"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "## 1. Read the datasets (Given)\n",
109 | "(Go to top)"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "We read the __training__ and __test__ datasets into dataframes, using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "metadata": {
123 | "tags": []
124 | },
125 | "outputs": [],
126 | "source": [
127 | "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
128 | "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
129 | "\n",
130 | "print(\"The shape of the training dataset is:\", training_data.shape)\n",
131 | "print(\"The shape of the test dataset is:\", test_data.shape)"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## 2. Data Processing (Implement)\n",
139 | "(Go to top) "
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "### 2.1 Exploratory Data Analysis\n",
147 | "(Go to Data Processing)\n",
148 | "\n",
149 | "We look at number of rows, columns, and some simple statistics of the datasets."
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {
156 | "tags": []
157 | },
158 | "outputs": [],
159 | "source": [
160 | "training_data.head()"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {
167 | "tags": []
168 | },
169 | "outputs": [],
170 | "source": [
171 | "test_data.head()"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {
178 | "tags": []
179 | },
180 | "outputs": [],
181 | "source": [
182 | "# Implement more EDA here"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "### 2.2 Select features to build the model \n",
190 | "(Go to Data Processing)\n",
191 | "\n",
192 | "Let's use all the features. Below you see a snippet of code that separates categorical and numerical columns based on their data type. This should only be used if we are sure that the data types are correctly assigned (check during EDA). Be mindful of some of the feature names - they suggest numerical values, but upon inspection it should become clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).\n"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "metadata": {
199 | "tags": []
200 | },
201 | "outputs": [],
202 | "source": [
203 | "# Grab model features/inputs and target/output\n",
204 | "categorical_features = (\n",
205 | " training_data.drop(\"credit_risk\", axis=1)\n",
206 | " .select_dtypes(include=\"object\")\n",
207 | " .columns.tolist()\n",
208 | ")\n",
209 | "print(\"Categorical columns:\", categorical_features)\n",
210 | "\n",
211 | "print(\"\")\n",
212 | "\n",
213 | "numerical_features = (\n",
214 | " training_data.drop(\"credit_risk\", axis=1)\n",
215 | " .select_dtypes(include=np.number)\n",
216 | " .columns.tolist()\n",
217 | ")\n",
218 | "print(\"Numerical columns:\", numerical_features)"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "We notice that `ID` is identified as a numerical column. IDs should never be used as features for training, as they are unique per row. Let's drop the ID from the model features after we have separated target and features:"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "tags": []
233 | },
234 | "outputs": [],
235 | "source": [
236 | "model_target = \"credit_risk\"\n",
237 | "model_features = categorical_features + numerical_features\n",
238 | "\n",
239 | "print(\"Model features: \", model_features)\n",
240 | "print(\"\\n\")\n",
241 | "print(\"Model target: \", model_target)"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "metadata": {
248 | "tags": []
249 | },
250 | "outputs": [],
251 | "source": [
252 | "to_remove = \"ID\"\n",
253 | "\n",
254 | "# Drop 'ID' feature from the respective list(s)\n",
255 | "if to_remove in model_features:\n",
256 | " model_features.remove(to_remove)\n",
257 | "if to_remove in categorical_features:\n",
258 | " categorical_features.remove(to_remove)\n",
259 | "if to_remove in numerical_features:\n",
260 | " numerical_features.remove(to_remove)"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "Let's also remove `age_years` as this is an obvious proxy for the age groups."
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "to_remove = \"age_years\"\n",
277 | "\n",
278 | "# Drop 'age_years' feature from the respective list(s)\n",
279 | "if to_remove in model_features:\n",
280 | " model_features.remove(to_remove)\n",
281 | "if to_remove in categorical_features:\n",
282 | " categorical_features.remove(to_remove)\n",
283 | "if to_remove in numerical_features:\n",
284 | " numerical_features.remove(to_remove)"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "### 2.3 Train - Validation Datasets\n",
292 | "(Go to Data Processing)\n",
293 | "\n",
294 | "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
295 | "\n",
296 | "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. "
297 | ]
298 | },
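{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch of the `train_test_split()` API on toy data (the 80/20 ratio and `random_state` below are illustrative choices, not requirements):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"toy = pd.DataFrame({\"x\": range(10), \"y\": [0, 1] * 5})\n",
"\n",
"# Hold out 20% of the rows as a validation set; fix the seed for reproducibility\n",
"train_toy, val_toy = train_test_split(toy, test_size=0.2, random_state=42)\n",
"\n",
"print(train_toy.shape, val_toy.shape)"
]
},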
299 | {
300 | "cell_type": "code",
301 | "execution_count": null,
302 | "metadata": {
303 | "tags": []
304 | },
305 | "outputs": [],
306 | "source": [
307 | "# Implement here"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "### 2.4 Feature transformation\n",
315 | "(Go to Data Processing)\n",
316 | "\n",
317 | "Here, you have different options. You can use Reweighing, prepare for Disparate Impact Remover or use Suppression. Regardless of which method to use, it makes sense to prepare the data first by dealing with missing values, one-hot encoding and scaling."
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {
324 | "tags": []
325 | },
326 | "outputs": [],
327 | "source": [
328 | "### STEP 1 ###\n",
329 | "##############\n",
330 | "\n",
331 | "# Preprocess the numerical features\n",
332 | "numerical_processor = Pipeline(\n",
333 | " [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
334 | ")\n",
335 | "# Preprocess the categorical features\n",
336 | "categorical_processor = Pipeline(\n",
337 | " [\n",
338 | " (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
339 | " (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\")),\n",
340 | " ]\n",
341 | ")\n",
342 | "\n",
343 | "### STEP 2 ###\n",
344 | "##############\n",
345 | "\n",
346 | "# Combine all data preprocessors from above\n",
347 | "data_processor = ColumnTransformer(\n",
348 | " [\n",
349 | " (\"numerical_processing\", numerical_processor, numerical_features),\n",
350 | " (\"categorical_processing\", categorical_processor, categorical_features),\n",
351 | " ]\n",
352 | ")\n",
353 | "\n",
354 | "### STEP 3 ###\n",
355 | "##############\n",
356 | "\n",
357 | "# Fit the data processor to our training data and apply the transform to the test data\n",
358 | "processed_train = data_processor.fit_transform(train_data[model_features])\n",
359 | "processed_test = data_processor.transform(test_data)"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "#### 2.4.1 DI-Transformation\n",
367 | "\n",
368 | "The output you just created will not have any column names for the one-hot encoded categorical features and will also be stored as a sparse matrix. If you want to proceed with the DI transformation, you need to convert the sparse matrix back to a dataframe and also re-create the column names."
369 | ]
370 | },
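{
"cell_type": "markdown",
"metadata": {},
"source": [
"A self-contained sketch of this conversion on toy data (the column and values below are made up), mirroring the `OneHotEncoder` pattern from the DATAPREP notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"toy = pd.DataFrame({\"color\": [\"red\", \"blue\", \"red\"]})\n",
"\n",
"ohe = OneHotEncoder(handle_unknown=\"ignore\")\n",
"sparse_out = ohe.fit_transform(toy)  # scipy sparse matrix, no column names\n",
"\n",
"# Densify and re-create the column names\n",
"toy_ohe = pd.DataFrame(\n",
"    sparse_out.toarray(), columns=ohe.get_feature_names_out([\"color\"])\n",
")\n",
"\n",
"print(toy_ohe.columns.tolist())"
]
},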
371 | {
372 | "cell_type": "code",
373 | "execution_count": null,
374 | "metadata": {},
375 | "outputs": [],
376 | "source": [
377 | "# Implement here"
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {},
383 | "source": [
384 | "#### 2.4.2 Reweighing\n",
385 | "\n",
386 | "Alternatively, if you want to build the model with reweighing, you can use the custom function below and then apply the weights during the training stage."
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {
393 | "tags": []
394 | },
395 | "outputs": [],
396 | "source": [
397 | "def reweighing(data, label, sensitive_attr, return_list=True):\n",
398 | " label_dict = dict()\n",
399 | " try:\n",
400 | " # This will loop through the different label values (here: 1 - credit risk, 0 - not a credit risk)\n",
401 | " for outcome in data[label].unique():\n",
402 | " weight_map = dict()\n",
403 | " # Check for all possible groups (here we have A & B but there could be more in reality)\n",
404 | " for val in data[sensitive_attr].unique():\n",
405 | " # Calculate the probabilities\n",
406 | " nom = (\n",
407 | " len(data[data[sensitive_attr] == val])\n",
408 | " / len(data)\n",
409 | " * len(data[data[label] == outcome])\n",
410 | " / len(data)\n",
411 | " )\n",
412 | " denom = len(\n",
413 | " data[(data[sensitive_attr] == val) & (data[label] == outcome)]\n",
414 | " ) / len(data)\n",
415 | " # Store weights according to sensitive attribute\n",
416 | " weight_map[val] = round(nom / denom, 2)\n",
417 | " # Store\n",
418 | " label_dict[outcome] = weight_map\n",
419 | " # Create full list of all weights for every data point provided as input\n",
420 | " data[\"weights\"] = list(\n",
421 | " map(lambda x, y: label_dict[y][x], data[sensitive_attr], data[label])\n",
422 | " )\n",
423 | " if return_list == True:\n",
424 | " return data[\"weights\"].to_list()\n",
425 | " else:\n",
426 | " return label_dict\n",
427 | " except Exception as err:\n",
428 | " print(err)\n",
429 | " print(\"Dataframe might have no entries.\")"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {},
436 | "outputs": [],
437 | "source": [
438 | "# Implement here"
439 | ]
440 | },
441 | {
442 | "cell_type": "markdown",
443 | "metadata": {},
444 | "source": [
445 | "## 3. Train a Classifier (Implement)\n",
446 | "(Go to top)\n",
447 | "\n",
448 | "Train the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline."
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": null,
454 | "metadata": {
455 | "tags": []
456 | },
457 | "outputs": [],
458 | "source": [
459 | "# Implement here"
460 | ]
461 | },
462 | {
463 | "cell_type": "markdown",
464 | "metadata": {},
465 | "source": [
466 | "## 4. Make Predictions on the Test Dataset (Implement)\n",
467 | "(Go to top)\n",
468 | "\n",
469 | "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates DI."
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": null,
475 | "metadata": {
476 | "tags": []
477 | },
478 | "outputs": [],
479 | "source": [
480 | "# Implement here\n",
481 | "\n",
482 | "# Get test data to test the classifier\n",
483 | "# ! test data should come from german_credit_test.csv !\n",
484 | "# ...\n",
485 | "\n",
486 | "# Use the trained model to make predictions on the test dataset\n",
487 | "# test_predictions = ..."
488 | ]
489 | },
490 | {
491 | "cell_type": "markdown",
492 | "metadata": {},
493 | "source": [
494 | "## 5. Evaluate Results (Given)\n",
495 | "(Go to top)"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": null,
501 | "metadata": {
502 | "tags": []
503 | },
504 | "outputs": [],
505 | "source": [
506 | "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
507 | "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
508 | "result_df[\"credit_risk_pred\"] = test_predictions\n",
509 | "\n",
510 | "result_df.to_csv(\"../../data/final_project/project_day2_result.csv\", index=False)"
511 | ]
512 | },
513 | {
514 | "cell_type": "markdown",
515 | "metadata": {},
516 | "source": [
517 | "### Final Evaluation on Test Data - Disparate Impact\n",
518 | "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": null,
524 | "metadata": {
525 | "tags": []
526 | },
527 | "outputs": [],
528 | "source": [
529 | "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
530 | " \"\"\"\n",
531 | " Function to calculate Disparate Impact metric using the results from this notebook.\n",
532 | " \"\"\"\n",
533 | " try:\n",
534 | " # Merge predictions with original test data to model per group\n",
535 | " di_df = pred_df.merge(test_data, on=\"ID\")\n",
536 | " # Count for group with members less than 25y old\n",
537 | " pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
538 | " 0\n",
539 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
540 | " total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
541 | " # Count for group with members greater equal 25y old\n",
542 | " pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
543 | " 0\n",
544 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
545 | " total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
546 | " # Check that both groups are present in the data\n",
547 | " if total_geq25 == 0:\n",
548 | " print(\"There is only one group present in the data.\")\n",
549 | " elif total_less25 == 0:\n",
550 | " print(\"There is only one group present in the data.\")\n",
551 | " else:\n",
552 | " disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
553 | " pos_outcomes_geq25 / total_geq25\n",
554 | " )\n",
555 | " return disparate_impact\n",
556 |     "    except Exception:\n",
557 | " print(\"Wrong inputs provided.\")"
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": null,
563 | "metadata": {
564 | "tags": []
565 | },
566 | "outputs": [],
567 | "source": [
568 | "calculate_di(test_data, result_df, \"credit_risk_pred\")"
569 | ]
570 | },
571 | {
572 | "cell_type": "markdown",
573 | "metadata": {},
574 | "source": [
575 | "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
576 | "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy."
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": null,
582 | "metadata": {
583 | "tags": []
584 | },
585 | "outputs": [],
586 | "source": [
587 | "accuracy_score(\n",
588 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
589 | " \"credit_risk\"\n",
590 | " ],\n",
591 | " result_df[\"credit_risk_pred\"],\n",
592 | ")"
593 | ]
594 | },
595 | {
596 | "cell_type": "code",
597 | "execution_count": null,
598 | "metadata": {
599 | "tags": []
600 | },
601 | "outputs": [],
602 | "source": [
603 | "f1_score(\n",
604 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
605 | " \"credit_risk\"\n",
606 | " ],\n",
607 | " result_df[\"credit_risk_pred\"],\n",
608 | ")"
609 | ]
610 | },
611 | {
612 | "cell_type": "markdown",
613 | "metadata": {},
614 | "source": [
615 | "This is the end of the notebook."
616 | ]
617 | }
618 | ],
619 | "metadata": {
620 | "kernelspec": {
621 | "display_name": ".conda-mlu-rai:Python",
622 | "language": "python",
623 | "name": "conda-env-.conda-mlu-rai-py"
624 | },
625 | "language_info": {
626 | "codemirror_mode": {
627 | "name": "ipython",
628 | "version": 3
629 | },
630 | "file_extension": ".py",
631 | "mimetype": "text/x-python",
632 | "name": "python",
633 | "nbconvert_exporter": "python",
634 | "pygments_lexer": "ipython3",
635 | "version": "3.9.20"
636 | }
637 | },
638 | "nbformat": 4,
639 | "nbformat_minor": 4
640 | }
641 |
--------------------------------------------------------------------------------
/notebooks/day_2/img/fig-data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-responsible-ai/db8ce25b0e84123967c8da2c4b2c67c8d2c8d6df/notebooks/day_2/img/fig-data.png
--------------------------------------------------------------------------------
/notebooks/day_2/img/fig-lg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-responsible-ai/db8ce25b0e84123967c8da2c4b2c67c8d2c8d6df/notebooks/day_2/img/fig-lg.png
--------------------------------------------------------------------------------
/notebooks/day_2/img/fig-lr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-responsible-ai/db8ce25b0e84123967c8da2c4b2c67c8d2c8d6df/notebooks/day_2/img/fig-lr.png
--------------------------------------------------------------------------------
/notebooks/day_3/MLA-RESML-DAY3-FINAL-STUDENT-NB-SOLUTION.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Responsible AI - Final Project Solution\n",
15 | "\n",
16 |     "Build a fair [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
17 | "\n",
18 | "### Final Project Problem: Loan Approval\n",
19 | "\n",
20 | "__Problem Definition:__\n",
21 |     "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan is repaid or not (i.e., whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
22 | "\n",
23 |     "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individual's age).\n",
24 | "\n",
25 | "\n",
26 | "``` \n",
27 | "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
28 | "```\n",
29 | "\n",
30 | "1. Read the datasets (Given) \n",
31 | "2. Data Processing (Implement)\n",
32 | " * Exploratory Data Analysis\n",
33 | " * Select features to build the model (Suggested)\n",
34 | " * Train - Validation - Test Datasets\n",
35 | " * Feature transformation\n",
36 | "3. Train a Classifier on the Training Dataset (Implement)\n",
37 | "4. Make Predictions on the Test Dataset (Implement)\n",
38 | "5. Evaluate Results (Given)\n",
39 | "\n",
40 | "\n",
41 | "__Datasets and Files:__\n",
42 | "\n",
43 | "\n",
44 |     "- ```german_credit_training.csv```: Training data with loan applicants' features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
45 | "\n",
46 |     "- ```german_credit_test.csv```: Test data with the same features as above, apart from the label. This is the data to make predictions on, emulating a production environment."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 |     "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {
60 | "tags": []
61 | },
62 | "outputs": [],
63 | "source": [
64 | "%%capture\n",
65 | "\n",
66 | "# Reshaping/basic libraries\n",
67 | "import pandas as pd\n",
68 | "import numpy as np\n",
69 | "\n",
70 | "# Plotting libraries\n",
71 | "import matplotlib.pyplot as plt\n",
72 | "\n",
73 | "%matplotlib inline\n",
74 | "import seaborn as sns\n",
75 | "\n",
76 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
77 | "\n",
78 | "# ML libraries\n",
79 | "from sklearn.model_selection import train_test_split\n",
80 | "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
81 | "from sklearn.impute import SimpleImputer\n",
82 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
83 | "from sklearn.pipeline import Pipeline\n",
84 | "from sklearn.compose import ColumnTransformer\n",
85 | "from sklearn.linear_model import LogisticRegression\n",
86 | "\n",
87 | "# Operational libraries\n",
88 | "import sys\n",
89 | "\n",
90 | "sys.path.append(\"..\")\n",
91 | "sys.path.insert(1, \"..\")\n",
92 | "\n",
93 | "# Fairness libraries\n",
94 | "from folktables.acs import *\n",
95 | "from folktables.folktables import *\n",
96 | "from folktables.load_acs import *\n",
97 | "from fairlearn.reductions import EqualizedOdds\n",
98 | "from fairlearn.postprocessing import ThresholdOptimizer\n",
99 | "from fairlearn.metrics import MetricFrame, selection_rate\n",
100 | "\n",
101 | "# Jupyter(lab) libraries\n",
102 | "import warnings\n",
103 | "\n",
104 | "warnings.filterwarnings(\"ignore\")"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "## 1. Read the datasets (Given)\n",
112 | "(Go to top)"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 |     "We read the __training__ and __test__ datasets into dataframes, using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {
126 | "tags": []
127 | },
128 | "outputs": [],
129 | "source": [
130 | "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
131 | "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
132 | "\n",
133 | "print(\"The shape of the training dataset is:\", training_data.shape)\n",
134 | "print(\"The shape of the test dataset is:\", test_data.shape)"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "## 2. Data Processing (Implement)\n",
142 | "(Go to top) "
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "### 2.1 Exploratory Data Analysis\n",
150 | "(Go to Data Processing)\n",
151 | "\n",
152 |     "We look at the number of rows, columns, and some simple statistics of the datasets."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {
159 | "tags": []
160 | },
161 | "outputs": [],
162 | "source": [
163 | "training_data.head()"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {
170 | "tags": []
171 | },
172 | "outputs": [],
173 | "source": [
174 | "test_data.head()"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {
181 | "tags": []
182 | },
183 | "outputs": [],
184 | "source": [
185 | "# Implement more EDA here"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "### 2.2 Select features to build the model \n",
193 | "(Go to Data Processing)\n",
194 | "\n",
195 |     "Let's use all the features. Below you see a snippet of code that separates categorical and numerical columns based on their data type. This should only be used if we are sure that the data types are correctly assigned (check during EDA). Be mindful of some of the feature names - they suggest numerical values, but upon inspection it should become clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).\n"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "tags": []
203 | },
204 | "outputs": [],
205 | "source": [
206 | "# Grab model features/inputs and target/output\n",
207 | "categorical_features = (\n",
208 | " training_data.drop(\"credit_risk\", axis=1)\n",
209 | " .select_dtypes(include=\"object\")\n",
210 | " .columns.tolist()\n",
211 | ")\n",
212 | "print(\"Categorical columns:\", categorical_features)\n",
213 | "\n",
214 | "print(\"\")\n",
215 | "\n",
216 | "numerical_features = (\n",
217 | " training_data.drop(\"credit_risk\", axis=1)\n",
218 | " .select_dtypes(include=np.number)\n",
219 | " .columns.tolist()\n",
220 | ")\n",
221 | "print(\"Numerical columns:\", numerical_features)"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 |     "We notice that `ID` is identified as a numerical column. IDs should never be used as features for training, as they are unique per row. Let's drop the ID from the model features after we have separated target and features. Also make sure to remove the sensitive feature so it does not end up as input for training."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {
235 | "tags": []
236 | },
237 | "outputs": [],
238 | "source": [
239 | "sensitive_feature = \"age_groups\"\n",
240 | "\n",
241 | "try:\n",
242 | " numerical_features.remove(sensitive_feature)\n",
243 |     "except ValueError:\n",
244 | " pass\n",
245 | "\n",
246 | "try:\n",
247 | " categorical_features.remove(sensitive_feature)\n",
248 |     "except ValueError:\n",
249 | " pass\n",
250 | "\n",
251 | "model_target = \"credit_risk\"\n",
252 | "model_features = categorical_features + numerical_features\n",
253 | "\n",
254 | "print(\"Model features: \", model_features)\n",
255 | "print(\"\\n\")\n",
256 | "print(\"Model target: \", model_target)"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "tags": []
264 | },
265 | "outputs": [],
266 | "source": [
267 | "to_remove = \"ID\"\n",
268 | "\n",
269 | "# Drop 'ID' feature from the respective list(s)\n",
270 | "if to_remove in model_features:\n",
271 | " model_features.remove(to_remove)\n",
272 | "if to_remove in categorical_features:\n",
273 | " categorical_features.remove(to_remove)\n",
274 | "if to_remove in numerical_features:\n",
275 | " numerical_features.remove(to_remove)"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "Let's also remove `age_years` as this is an obvious proxy for the age groups."
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": null,
288 | "metadata": {
289 | "tags": []
290 | },
291 | "outputs": [],
292 | "source": [
293 | "to_remove = \"age_years\"\n",
294 | "\n",
295 |     "# Drop 'age_years' feature from the respective list(s)\n",
296 | "if to_remove in model_features:\n",
297 | " model_features.remove(to_remove)\n",
298 | "if to_remove in categorical_features:\n",
299 | " categorical_features.remove(to_remove)\n",
300 | "if to_remove in numerical_features:\n",
301 | " numerical_features.remove(to_remove)"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "### 2.3 Feature transformation\n",
309 | "(Go to Data Processing)\n",
310 | "\n",
311 | "Here, you have different options. You can use Reweighing, Disparate Impact Remover or Suppression. However, in this notebook you should try to implement Equalized Odds postprocessing. Therefore, no transformation is required at this point."
312 | ]
313 | },
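314 |    {
315 |     "cell_type": "markdown",
316 |     "metadata": {},
317 |     "source": [
318 |     "Roughly speaking, equalized odds requires the true positive rate and the false positive rate to match across the sensitive groups: for sensitive attribute $a$ and both label values $y \\in \\{0, 1\\}$,\n",
319 |     "\n",
320 |     "$$P(\\hat{y} = 1 \\mid y, a = 0) = P(\\hat{y} = 1 \\mid y, a = 1)$$\n",
321 |     "\n",
322 |     "The `ThresholdOptimizer` used later in this notebook searches for group-specific decision thresholds that (approximately) satisfy this constraint."
323 |     ]
324 |    },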
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "### 2.4 Train - Validation Datasets\n",
319 | "(Go to Data Processing)\n",
320 | "\n",
321 | "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
322 | "\n",
323 | "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. "
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {
330 | "tags": []
331 | },
332 | "outputs": [],
333 | "source": [
334 | "# Implement here"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "metadata": {
341 | "tags": []
342 | },
343 | "outputs": [],
344 | "source": [
345 | "# We only need to split between train and val (test is already separate)\n",
346 | "train_data, val_data = train_test_split(\n",
347 | " training_data, test_size=0.1, shuffle=True, random_state=23\n",
348 | ")\n",
349 | "\n",
350 | "# Print the shapes of the Train - Test Datasets\n",
351 | "print(\n",
352 | " \"Train - Test - Validation datasets shapes: \",\n",
353 | " train_data.shape,\n",
354 | " test_data.shape,\n",
355 | " val_data.shape,\n",
356 | ")\n",
357 | "\n",
358 | "train_data.reset_index(inplace=True, drop=True)\n",
359 | "val_data.reset_index(inplace=True, drop=True)"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "### 2.5 Data processing with Pipeline\n",
367 | "(Go to Data Processing)\n",
368 | "\n",
369 | "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the imputed and scaled dataset. \n"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": null,
375 | "metadata": {
376 | "tags": []
377 | },
378 | "outputs": [],
379 | "source": [
380 | "# Implement here"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": null,
386 | "metadata": {
387 | "tags": []
388 | },
389 | "outputs": [],
390 | "source": [
391 | "### STEP 1 ###\n",
392 | "##############\n",
393 | "\n",
394 | "# Preprocess the numerical features\n",
395 | "numerical_processor = Pipeline(\n",
396 | " [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
397 | ")\n",
398 | "# Preprocess the categorical features\n",
399 | "categorical_processor = Pipeline(\n",
400 | " [\n",
401 | " (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
402 | " (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\")),\n",
403 | " ]\n",
404 | ")\n",
405 | "\n",
406 | "### STEP 2 ###\n",
407 | "##############\n",
408 | "\n",
409 | "# Combine all data preprocessors from above\n",
410 | "data_processor = ColumnTransformer(\n",
411 | " [\n",
412 | " (\"numerical_processing\", numerical_processor, numerical_features),\n",
413 | " (\"categorical_processing\", categorical_processor, categorical_features),\n",
414 | " ]\n",
415 | ")\n",
416 | "\n",
417 | "\n",
418 | "### STEP 3 ###\n",
419 | "##############\n",
420 | "\n",
421 |     "# Pipeline all desired data transformers, along with an estimator at the end\n",
422 | "# Later you can set/reach the parameters using the names issued - for hyperparameter tuning, for example\n",
423 | "pipeline = Pipeline(\n",
424 | " [\n",
425 | " (\"data_processing\", data_processor),\n",
426 | " (\"lg\", LogisticRegression(random_state=0)),\n",
427 | " ]\n",
428 | ")\n",
429 | "\n",
430 | "# Visualize the pipeline\n",
431 | "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n",
432 | "from sklearn import set_config\n",
433 | "\n",
434 | "set_config(display=\"diagram\")\n",
435 | "pipeline"
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "## 3. Train (and Tune) a Classifier (Implement)\n",
443 | "(Go to top)\n",
444 | "\n",
445 |     "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies or different scaling methods."
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": null,
451 | "metadata": {
452 | "tags": []
453 | },
454 | "outputs": [],
455 | "source": [
456 | "# Implement here"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": null,
462 | "metadata": {
463 | "tags": []
464 | },
465 | "outputs": [],
466 | "source": [
467 | "# Get train data to train the classifier\n",
468 | "X_train = train_data[model_features]\n",
469 | "y_train = train_data[model_target]\n",
470 | "\n",
471 | "# Learn the transformation & extract feature names\n",
472 | "data_processor.fit(X_train)\n",
473 | "\n",
474 | "# To extract feature names we first need to fit the data processor as this will generate the one hot encoding\n",
475 | "ft_names = numerical_features + list(\n",
476 | " data_processor.transformers_[1][1]\n",
477 | " .named_steps[\"cat_encoder\"]\n",
478 | " .get_feature_names_out(categorical_features)\n",
479 | ")\n",
480 | "\n",
481 | "# Add column names and convert to data frame\n",
482 | "X_train_prep = pd.DataFrame(\n",
483 | " data_processor.transform(X_train).todense(), columns=ft_names\n",
484 | ")\n",
485 | "\n",
486 | "# Set up ThresholdOptimizer\n",
487 | "eo_model = ThresholdOptimizer(\n",
488 | " estimator=pipeline[-1],\n",
489 | " constraints=\"equalized_odds\",\n",
490 | " objective=\"accuracy_score\",\n",
491 | " grid_size=1000,\n",
492 | " flip=False,\n",
493 | " prefit=False,\n",
494 | " predict_method=\"deprecated\",\n",
495 | ")\n",
496 | "\n",
497 | "# Adjust the results that the classifier would produce by letting ThresholdOptimizer know what the sensitive features are\n",
498 | "eo_model.fit(X_train_prep, y_train, sensitive_features=train_data[\"age_groups\"].values)"
499 | ]
500 | },
501 | {
502 | "cell_type": "markdown",
503 | "metadata": {},
504 | "source": [
505 | "## 4. Make Predictions on the Test Dataset (Implement)\n",
506 | "(Go to top)\n",
507 | "\n",
508 | "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates for DI."
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "metadata": {
515 | "tags": []
516 | },
517 | "outputs": [],
518 | "source": [
519 | "# Implement here\n",
520 | "\n",
521 | "# Get test data to test the classifier\n",
522 | "# ! test data should come from german_credit_test.csv !\n",
523 | "# ...\n",
524 | "\n",
525 | "# Use the trained model to make predictions on the test dataset\n",
526 | "# test_predictions = ..."
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": null,
532 | "metadata": {
533 | "tags": []
534 | },
535 | "outputs": [],
536 | "source": [
537 |     "# Use the fitted equalized odds post-processor to create adjusted outcomes (predictions)\n",
538 | "test_predictions = eo_model.predict(\n",
539 | " np.asarray(data_processor.transform(test_data[model_features]).todense()),\n",
540 | " sensitive_features=test_data[\"age_groups\"].values,\n",
541 | ")"
542 | ]
543 | },
544 | {
545 | "cell_type": "markdown",
546 | "metadata": {},
547 | "source": [
548 | "## 5. Evaluate Results (Given)\n",
549 | "(Go to top)"
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": null,
555 | "metadata": {
556 | "tags": []
557 | },
558 | "outputs": [],
559 | "source": [
560 | "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
561 | "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
562 | "result_df[\"credit_risk_pred\"] = test_predictions\n",
563 | "\n",
564 | "result_df.to_csv(\"../../data/final_project/project_day3_result.csv\", index=False)"
565 | ]
566 | },
567 | {
568 | "cell_type": "markdown",
569 | "metadata": {},
570 | "source": [
571 | "### Final Evaluation on Test Data - Disparate Impact\n",
572 | "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
573 | ]
574 | },
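575 |    {
576 |     "cell_type": "markdown",
577 |     "metadata": {},
578 |     "source": [
579 |     "As a rough sketch of what `calculate_di` below computes: treating prediction $\\hat{y} = 0$ ('not a credit risk') as the favorable outcome, the metric is the ratio of favorable-outcome rates between the two age groups,\n",
580 |     "\n",
581 |     "$$DI = \\frac{P(\\hat{y} = 0 \\mid \\text{age} < 25)}{P(\\hat{y} = 0 \\mid \\text{age} \\geq 25)}$$\n",
582 |     "\n",
583 |     "A value close to 1 indicates that both groups receive the favorable outcome at a similar rate."
584 |     ]
585 |    },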
575 | {
576 | "cell_type": "code",
577 | "execution_count": null,
578 | "metadata": {
579 | "tags": []
580 | },
581 | "outputs": [],
582 | "source": [
583 | "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
584 | " \"\"\"\n",
585 | " Function to calculate Disparate Impact metric using the results from this notebook.\n",
586 | " \"\"\"\n",
587 | " try:\n",
588 | " # Merge predictions with original test data to model per group\n",
589 | " di_df = pred_df.merge(test_data, on=\"ID\")\n",
590 | " # Count for group with members less than 25y old\n",
591 | " pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
592 | " 0\n",
593 |     "        ] # value_counts()[0] gives the count of label 0, i.e. 'not a credit risk'\n",
594 | " total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
595 |     "        # Count for group with members 25y old or older\n",
596 | " pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
597 | " 0\n",
598 |     "        ] # value_counts()[0] gives the count of label 0, i.e. 'not a credit risk'\n",
599 | " total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
600 |     "        # Check that both groups are present\n",
601 | " if total_geq25 == 0:\n",
602 | " print(\"There is only one group present in the data.\")\n",
603 | " elif total_less25 == 0:\n",
604 | " print(\"There is only one group present in the data.\")\n",
605 | " else:\n",
606 | " disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
607 | " pos_outcomes_geq25 / total_geq25\n",
608 | " )\n",
609 | " return disparate_impact\n",
610 |     "    except Exception:\n",
611 | " print(\"Wrong inputs provided.\")"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "metadata": {
618 | "tags": []
619 | },
620 | "outputs": [],
621 | "source": [
622 | "calculate_di(test_data, result_df, \"credit_risk_pred\")"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
630 | "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy."
631 | ]
632 | },
633 | {
634 | "cell_type": "code",
635 | "execution_count": null,
636 | "metadata": {
637 | "tags": []
638 | },
639 | "outputs": [],
640 | "source": [
641 | "accuracy_score(\n",
642 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
643 | " \"credit_risk\"\n",
644 | " ],\n",
645 | " result_df[\"credit_risk_pred\"],\n",
646 | ")"
647 | ]
648 | },
649 | {
650 | "cell_type": "code",
651 | "execution_count": null,
652 | "metadata": {
653 | "tags": []
654 | },
655 | "outputs": [],
656 | "source": [
657 | "f1_score(\n",
658 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
659 | " \"credit_risk\"\n",
660 | " ],\n",
661 | " result_df[\"credit_risk_pred\"],\n",
662 | ")"
663 | ]
664 | },
665 | {
666 | "cell_type": "markdown",
667 | "metadata": {},
668 | "source": [
669 | "This is the end of the notebook."
670 | ]
671 | }
672 | ],
673 | "metadata": {
674 | "kernelspec": {
675 | "display_name": ".conda-mlu-rai:Python",
676 | "language": "python",
677 | "name": "conda-env-.conda-mlu-rai-py"
678 | },
679 | "language_info": {
680 | "codemirror_mode": {
681 | "name": "ipython",
682 | "version": 3
683 | },
684 | "file_extension": ".py",
685 | "mimetype": "text/x-python",
686 | "name": "python",
687 | "nbconvert_exporter": "python",
688 | "pygments_lexer": "ipython3",
689 | "version": "3.9.20"
690 | }
691 | },
692 | "nbformat": 4,
693 | "nbformat_minor": 4
694 | }
695 |
--------------------------------------------------------------------------------
/notebooks/day_3/MLA-RESML-DAY3-FINAL-STUDENT-NB.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 |     "# Responsible AI - Final Project\n",
15 | "\n",
16 |     "Build a fair [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
17 | "\n",
18 | "### Final Project Problem: Loan Approval\n",
19 | "\n",
20 | "__Problem Definition:__\n",
21 |     "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan is repaid or not (i.e., whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
22 | "\n",
23 |     "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individual's age).\n",
24 | "\n",
25 | "\n",
26 | "``` \n",
27 | "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
28 | "```\n",
29 | "\n",
30 | "1. Read the datasets (Given) \n",
31 | "2. Data Processing (Implement)\n",
32 | " * Exploratory Data Analysis\n",
33 | " * Select features to build the model (Suggested)\n",
34 | " * Train - Validation - Test Datasets\n",
35 | " * Feature transformation\n",
36 | "3. Train a Classifier on the Training Dataset (Implement)\n",
37 | "4. Make Predictions on the Test Dataset (Implement)\n",
38 | "5. Evaluate Results (Given)\n",
39 | "\n",
40 | "\n",
41 | "__Datasets and Files:__\n",
42 | "\n",
43 | "\n",
44 |     "- ```german_credit_training.csv```: Training data with loan applicants' features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
45 | "\n",
46 |     "- ```german_credit_test.csv```: Test data with the same features as above, apart from the label. This is the data to make predictions on, emulating a production environment."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 |     "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {
60 | "tags": []
61 | },
62 | "outputs": [],
63 | "source": [
64 | "%%capture\n",
65 | "\n",
66 | "# Reshaping/basic libraries\n",
67 | "import pandas as pd\n",
68 | "import numpy as np\n",
69 | "\n",
70 | "# Plotting libraries\n",
71 | "import matplotlib.pyplot as plt\n",
72 | "\n",
73 | "%matplotlib inline\n",
74 | "import seaborn as sns\n",
75 | "\n",
76 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
77 | "\n",
78 | "# ML libraries\n",
79 | "from sklearn.model_selection import train_test_split\n",
80 | "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
81 | "from sklearn.impute import SimpleImputer\n",
82 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
83 | "from sklearn.pipeline import Pipeline\n",
84 | "from sklearn.compose import ColumnTransformer\n",
85 | "from sklearn.linear_model import LogisticRegression\n",
86 | "\n",
87 | "# Operational libraries\n",
88 | "import sys\n",
89 | "\n",
90 | "sys.path.append(\"..\")\n",
91 | "sys.path.insert(1, \"..\")\n",
92 | "\n",
93 | "# Fairness libraries\n",
94 | "from folktables.acs import *\n",
95 | "from folktables.folktables import *\n",
96 | "from folktables.load_acs import *\n",
97 | "from fairlearn.reductions import EqualizedOdds\n",
98 | "from fairlearn.postprocessing import ThresholdOptimizer\n",
99 | "from fairlearn.metrics import MetricFrame, selection_rate\n",
100 | "\n",
101 | "# Jupyter(lab) libraries\n",
102 | "import warnings\n",
103 | "\n",
104 | "warnings.filterwarnings(\"ignore\")"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "## 1. Read the datasets (Given)\n",
112 | "(Go to top)"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 |     "We read the __training__ and __test__ datasets into dataframes, using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {
126 | "tags": []
127 | },
128 | "outputs": [],
129 | "source": [
130 | "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
131 | "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
132 | "\n",
133 | "print(\"The shape of the training dataset is:\", training_data.shape)\n",
134 | "print(\"The shape of the test dataset is:\", test_data.shape)"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "## 2. Data Processing (Implement)\n",
142 | "(Go to top) "
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "### 2.1 Exploratory Data Analysis\n",
150 | "(Go to Data Processing)\n",
151 | "\n",
152 | "We look at the number of rows and columns, and at some simple statistics of the datasets."
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {
159 | "tags": []
160 | },
161 | "outputs": [],
162 | "source": [
163 | "training_data.head()"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {
170 | "tags": []
171 | },
172 | "outputs": [],
173 | "source": [
174 | "test_data.head()"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {
181 | "tags": []
182 | },
183 | "outputs": [],
184 | "source": [
185 | "# Implement more EDA here"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "### 2.2 Select features to build the model \n",
193 | "(Go to Data Processing)\n",
194 | "\n",
195 | "Let's use all the features. The code snippet below separates categorical and numerical columns based on their data type. This should only be used if we are sure that the data types are correctly assigned (check during EDA). Be mindful of some of the feature names - they suggest numerical values, but upon inspection it becomes clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).\n",
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "tags": []
203 | },
204 | "outputs": [],
205 | "source": [
206 | "# Grab model features/inputs and target/output\n",
207 | "categorical_features = (\n",
208 | " training_data.drop(\"credit_risk\", axis=1)\n",
209 | " .select_dtypes(include=\"object\")\n",
210 | " .columns.tolist()\n",
211 | ")\n",
212 | "print(\"Categorical columns:\", categorical_features)\n",
213 | "\n",
214 | "print(\"\")\n",
215 | "\n",
216 | "numerical_features = (\n",
217 | " training_data.drop(\"credit_risk\", axis=1)\n",
218 | " .select_dtypes(include=np.number)\n",
219 | " .columns.tolist()\n",
220 | ")\n",
221 | "print(\"Numerical columns:\", numerical_features)"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "We notice that `ID` is identified as a numerical column. IDs should never be used as features for training, as they are unique per row. Let's drop `ID` from the model features after we have separated target and features. Also make sure to remove the sensitive feature so it does not end up as an input for training."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {
235 | "tags": []
236 | },
237 | "outputs": [],
238 | "source": [
239 | "sensitive_feature = \"age_groups\"\n",
240 | "\n",
241 | "try:\n",
242 | " numerical_features.remove(sensitive_feature)\n",
243 | "except ValueError:\n",
244 | " pass\n",
245 | "\n",
246 | "try:\n",
247 | " categorical_features.remove(sensitive_feature)\n",
248 | "except ValueError:\n",
249 | " pass\n",
250 | "\n",
251 | "model_target = \"credit_risk\"\n",
252 | "model_features = categorical_features + numerical_features\n",
253 | "\n",
254 | "print(\"Model features: \", model_features)\n",
255 | "print(\"\\n\")\n",
256 | "print(\"Model target: \", model_target)"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "tags": []
264 | },
265 | "outputs": [],
266 | "source": [
267 | "to_remove = \"ID\"\n",
268 | "\n",
269 | "# Drop 'ID' feature from the respective list(s)\n",
270 | "if to_remove in model_features:\n",
271 | " model_features.remove(to_remove)\n",
272 | "if to_remove in categorical_features:\n",
273 | " categorical_features.remove(to_remove)\n",
274 | "if to_remove in numerical_features:\n",
275 | " numerical_features.remove(to_remove)"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "Let's also remove `age_years` as this is an obvious proxy for the age groups."
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": null,
288 | "metadata": {
289 | "tags": []
290 | },
291 | "outputs": [],
292 | "source": [
293 | "to_remove = \"age_years\"\n",
294 | "\n",
295 | "# Drop 'age_years' feature from the respective list(s)\n",
296 | "if to_remove in model_features:\n",
297 | " model_features.remove(to_remove)\n",
298 | "if to_remove in categorical_features:\n",
299 | " categorical_features.remove(to_remove)\n",
300 | "if to_remove in numerical_features:\n",
301 | " numerical_features.remove(to_remove)"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "### 2.3 Feature transformation\n",
309 | "(Go to Data Processing)\n",
310 | "\n",
311 | "Here, you have different options: you could use Reweighing, Disparate Impact Remover, or Suppression. However, in this notebook you should try to implement Equalized Odds postprocessing, so no transformation is required at this point."
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {},
317 | "source": [
318 | "### 2.4 Train - Validation Datasets\n",
319 | "(Go to Data Processing)\n",
320 | "\n",
321 | "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
322 | "\n",
323 | "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. "
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {
330 | "tags": []
331 | },
332 | "outputs": [],
333 | "source": [
334 | "# Implement here"
335 | ]
336 | },
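{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one possible split is shown below. The 75/25 ratio, the `random_state`, stratifying on the target, and the tiny synthetic frame are all assumptions made so the snippet is self-contained; in this notebook, apply `train_test_split` to the `training_data` loaded above instead.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-in for the real training_data frame\n",
"training_data = pd.DataFrame(\n",
"    {\"credit_amount\": [1000, 2500, 700, 4000, 900, 3200, 1500, 2800],\n",
"     \"credit_risk\": [1, 0, 1, 0, 1, 0, 1, 0]}\n",
")\n",
"\n",
"# Stratify on the target so both subsets keep the class balance\n",
"train_df, val_df = train_test_split(\n",
"    training_data, test_size=0.25, shuffle=True,\n",
"    random_state=42, stratify=training_data[\"credit_risk\"],\n",
")\n",
"print(train_df.shape, val_df.shape)\n",
"```"
]
},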
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "### 2.5 Data processing with Pipeline\n",
342 | "(Go to Data Processing)\n",
343 | "\n",
344 | "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the imputed and scaled dataset. \n"
345 | ]
346 | },
347 | {
348 | "cell_type": "code",
349 | "execution_count": null,
350 | "metadata": {
351 | "tags": []
352 | },
353 | "outputs": [],
354 | "source": [
355 | "# Implement here"
356 | ]
357 | },
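{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of such a pipeline is shown below. The column names (`amount`, `purpose`) and the toy frame are made up for illustration; in this notebook, pass `numerical_features` and `categorical_features` to the `ColumnTransformer` and fit on the train split instead.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import MinMaxScaler, OneHotEncoder\n",
"\n",
"# Toy stand-in data with one missing value per column\n",
"X = pd.DataFrame({\"amount\": [1000.0, 2500.0, np.nan, 4000.0],\n",
"                  \"purpose\": [\"car\", \"tv\", \"car\", None]})\n",
"y = pd.Series([0, 1, 0, 1])\n",
"\n",
"num_proc = Pipeline([(\"imputer\", SimpleImputer(strategy=\"mean\")),\n",
"                     (\"scaler\", MinMaxScaler())])\n",
"cat_proc = Pipeline([(\"imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
"                     (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\"))])\n",
"\n",
"# Combine the per-type processors, then append the estimator\n",
"processor = ColumnTransformer([(\"num\", num_proc, [\"amount\"]),\n",
"                               (\"cat\", cat_proc, [\"purpose\"])])\n",
"\n",
"pipeline = Pipeline([(\"data_processing\", processor),\n",
"                     (\"lg\", LogisticRegression(random_state=0))])\n",
"pipeline.fit(X, y)\n",
"print(pipeline.predict(X))\n",
"```"
]
},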
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "## 3. Train (and Tune) a Classifier (Implement)\n",
363 | "(Go to top)\n",
364 | "\n",
365 | "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies and different scaling methods."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {
372 | "tags": []
373 | },
374 | "outputs": [],
375 | "source": [
376 | "# Implement here"
377 | ]
378 | },
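{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to tune the imputation and scaling choices is to wrap the pipeline in `GridSearchCV` and address nested steps via double-underscore parameter paths. The sketch below uses toy numeric data and an assumed grid (`mean` vs. `median`); adapt the pipeline and grid to the one you built above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"# Toy numeric data with one missing value\n",
"X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0], [7.0], [8.0]])\n",
"y = np.array([0, 0, 0, 0, 1, 1, 1, 1])\n",
"\n",
"pipe = Pipeline([(\"imputer\", SimpleImputer()),\n",
"                 (\"scaler\", MinMaxScaler()),\n",
"                 (\"lg\", LogisticRegression(random_state=0))])\n",
"\n",
"# <step name>__<parameter> addresses parameters of nested steps\n",
"grid = GridSearchCV(pipe, {\"imputer__strategy\": [\"mean\", \"median\"]}, cv=2)\n",
"grid.fit(X, y)\n",
"print(grid.best_params_)\n",
"```"
]
},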
379 | {
380 | "cell_type": "markdown",
381 | "metadata": {},
382 | "source": [
383 | "## 4. Make Predictions on the Test Dataset (Implement)\n",
384 | "(Go to top)\n",
385 | "\n",
386 | "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates the predictions for disparate impact (DI)."
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {
393 | "tags": []
394 | },
395 | "outputs": [],
396 | "source": [
397 | "# Implement here\n",
398 | "\n",
399 | "# Get test data to test the classifier\n",
400 | "# ! test data should come from german_credit_test.csv !\n",
401 | "# ...\n",
402 | "\n",
403 | "# Use the trained model to make predictions on the test dataset\n",
404 | "# test_predictions = ..."
405 | ]
406 | },
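{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this notebook asks for Equalized Odds postprocessing, the test predictions can come from fairlearn's `ThresholdOptimizer` wrapped around the fitted pipeline. Below is a self-contained sketch on synthetic data; the synthetic `group` array stands in for the `age_groups` column, and `base` stands in for your trained pipeline.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from fairlearn.postprocessing import ThresholdOptimizer\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Synthetic features, sensitive group, and labels\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(200, 2))\n",
"group = rng.integers(0, 2, size=200)  # stand-in for age_groups\n",
"y = ((X[:, 0] + 0.5 * group + rng.normal(scale=0.5, size=200)) > 0).astype(int)\n",
"\n",
"base = LogisticRegression().fit(X, y)\n",
"\n",
"# prefit=True reuses the already fitted estimator\n",
"postprocessor = ThresholdOptimizer(\n",
"    estimator=base,\n",
"    constraints=\"equalized_odds\",\n",
"    objective=\"accuracy_score\",\n",
"    prefit=True,\n",
"    predict_method=\"predict_proba\",\n",
")\n",
"postprocessor.fit(X, y, sensitive_features=group)\n",
"test_predictions = postprocessor.predict(X, sensitive_features=group, random_state=0)\n",
"print(test_predictions[:10])\n",
"```"
]
},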
407 | {
408 | "cell_type": "markdown",
409 | "metadata": {},
410 | "source": [
411 | "## 5. Evaluate Results (Given)\n",
412 | "(Go to top)"
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": null,
418 | "metadata": {
419 | "tags": []
420 | },
421 | "outputs": [],
422 | "source": [
423 | "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
424 | "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
425 | "result_df[\"credit_risk_pred\"] = test_predictions\n",
426 | "\n",
427 | "result_df.to_csv(\"../../data/final_project/project_day3_result.csv\", index=False)"
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "### Final Evaluation on Test Data - Disparate Impact\n",
435 | "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
436 | ]
437 | },
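{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the applicants younger than 25 as group $d$ and those aged 25 or older as group $a$, DI is the ratio of the rates of the favorable outcome (here: a predicted label of 0, i.e. 'not credit risk') in the two groups:\n",
"\n",
"$$DI = \\frac{n_d^{fav} / n_d}{n_a^{fav} / n_a}$$\n",
"\n",
"A value close to 1 indicates parity of favorable outcomes between the groups; values below roughly 0.8 are commonly flagged as disparate impact (the 'four-fifths rule')."
]
},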
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {
442 | "tags": []
443 | },
444 | "outputs": [],
445 | "source": [
446 | "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
447 | " \"\"\"\n",
448 | " Function to calculate Disparate Impact metric using the results from this notebook.\n",
449 | " \"\"\"\n",
450 | " try:\n",
451 | "        # Merge predictions with the original test data to evaluate per group\n",
452 | " di_df = pred_df.merge(test_data, on=\"ID\")\n",
453 | " # Count for group with members less than 25y old\n",
454 | " pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
455 | " 0\n",
456 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
457 | " total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
458 | " # Count for group with members greater equal 25y old\n",
459 | " pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
460 | " 0\n",
461 | " ] # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
462 | " total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
463 | "        # Check if the correct number of groups is present\n",
464 | " if total_geq25 == 0:\n",
465 | " print(\"There is only one group present in the data.\")\n",
466 | " elif total_less25 == 0:\n",
467 | " print(\"There is only one group present in the data.\")\n",
468 | " else:\n",
469 | " disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
470 | " pos_outcomes_geq25 / total_geq25\n",
471 | " )\n",
472 | " return disparate_impact\n",
473 | "    except Exception:\n",
474 | " print(\"Wrong inputs provided.\")"
475 | ]
476 | },
477 | {
478 | "cell_type": "code",
479 | "execution_count": null,
480 | "metadata": {
481 | "tags": []
482 | },
483 | "outputs": [],
484 | "source": [
485 | "calculate_di(test_data, result_df, \"credit_risk_pred\")"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
493 | "In addition to the fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed, so we will use the F1 score in addition to accuracy."
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": null,
499 | "metadata": {
500 | "tags": []
501 | },
502 | "outputs": [],
503 | "source": [
504 | "accuracy_score(\n",
505 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
506 | " \"credit_risk\"\n",
507 | " ],\n",
508 | " result_df[\"credit_risk_pred\"],\n",
509 | ")"
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": null,
515 | "metadata": {
516 | "tags": []
517 | },
518 | "outputs": [],
519 | "source": [
520 | "f1_score(\n",
521 | " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
522 | " \"credit_risk\"\n",
523 | " ],\n",
524 | " result_df[\"credit_risk_pred\"],\n",
525 | ")"
526 | ]
527 | },
528 | {
529 | "cell_type": "markdown",
530 | "metadata": {},
531 | "source": [
532 | "This is the end of the notebook."
533 | ]
534 | }
535 | ],
536 | "metadata": {
537 | "kernelspec": {
538 | "display_name": ".conda-mlu-rai:Python",
539 | "language": "python",
540 | "name": "conda-env-.conda-mlu-rai-py"
541 | },
542 | "language_info": {
543 | "codemirror_mode": {
544 | "name": "ipython",
545 | "version": 3
546 | },
547 | "file_extension": ".py",
548 | "mimetype": "text/x-python",
549 | "name": "python",
550 | "nbconvert_exporter": "python",
551 | "pygments_lexer": "ipython3",
552 | "version": "3.9.20"
553 | }
554 | },
555 | "nbformat": 4,
556 | "nbformat_minor": 4
557 | }
558 |
--------------------------------------------------------------------------------
/notebooks/day_3/MLA-RESML-ODDS.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "tags": []
14 | },
15 | "source": [
16 | "# Responsible AI - Equalized Odds\n",
17 | "\n",
18 | "This notebook shows how to process outputs produced by a probabilistic model to generate fairer results. We will use a logistic regression model to predict whether an individual's income is above \\\\$50k using US census data.\n",
19 | "\n",
20 | "__Dataset:__ \n",
21 | "The dataset we will use for this exercise comes from [folktables](https://github.com/zykls/folktables). Folktables provides code to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).\n",
22 | "\n",
23 | "__ML Problem:__ \n",
24 | "Ultimately, the goal will be to predict whether an individual's income is above \\\\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\\\$100. The threshold of \\\\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.\n",
25 | "\n",
26 | "\n",
27 | "1. Read the dataset\n",
28 | "2. Data Processing\n",
29 | " * Exploratory Data Analysis\n",
30 | " * Select features to build the model\n",
31 | " * Feature Transformation\n",
32 | " * Train - Validation - Test Datasets\n",
33 | " * Data processing with Pipeline and ColumnTransformer\n",
34 | "3. Train (and Tune) a Classifier\n",
35 | "4. Test the Classifier"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {
49 | "tags": []
50 | },
51 | "outputs": [],
52 | "source": [
53 | "%%capture\n",
54 | "\n",
55 | "# Reshaping/basic libraries\n",
56 | "import pandas as pd\n",
57 | "import numpy as np\n",
58 | "\n",
59 | "# Plotting libraries\n",
60 | "import matplotlib.pyplot as plt\n",
61 | "\n",
62 | "%matplotlib inline\n",
63 | "import seaborn as sns\n",
64 | "\n",
65 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
66 | "\n",
67 | "# ML libraries\n",
68 | "from sklearn.model_selection import train_test_split\n",
69 | "from sklearn.metrics import confusion_matrix, accuracy_score\n",
70 | "from sklearn.impute import SimpleImputer\n",
71 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
72 | "from sklearn.pipeline import Pipeline\n",
73 | "from sklearn.compose import ColumnTransformer\n",
74 | "from sklearn.linear_model import LogisticRegression\n",
75 | "\n",
76 | "# Operational libraries\n",
77 | "import sys\n",
78 | "\n",
79 | "sys.path.append(\"..\")\n",
80 | "sys.path.insert(1, \"..\")\n",
81 | "\n",
82 | "# Fairness libraries\n",
83 | "from folktables.acs import *\n",
84 | "from folktables.folktables import *\n",
85 | "from folktables.load_acs import *\n",
86 | "from fairlearn.reductions import EqualizedOdds\n",
87 | "from fairlearn.postprocessing import ThresholdOptimizer\n",
88 | "from fairlearn.metrics import MetricFrame, selection_rate\n",
89 | "\n",
90 | "# Jupyter(lab) libraries\n",
91 | "import warnings\n",
92 | "\n",
93 | "warnings.filterwarnings(\"ignore\")"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {
99 | "tags": []
100 | },
101 | "source": [
102 | "## 1. Read the dataset\n",
103 | "(Go to top)\n",
104 | "\n",
105 | "To read in the dataset, we will be using [folktables](https://github.com/zykls/folktables) which provides access to the US Census dataset. Folktables contains predefined prediction tasks but also allows the user to specify the problem type.\n",
106 | "\n",
107 | "The US Census dataset distinguishes between households and individuals. To obtain data on individuals, we use `ACSDataSource` with `survey=person`. The feature names for the US Census data follow the same distinction and use `P` for `person` and `H` for `household`, e.g. `AGEP` refers to the age of an individual."
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {
114 | "tags": []
115 | },
116 | "outputs": [],
117 | "source": [
118 | "income_features = [\n",
119 | " \"AGEP\", # age individual\n",
120 | " \"COW\", # class of worker\n",
121 | " \"SCHL\", # educational attainment\n",
122 | " \"MAR\", # marital status\n",
123 | " \"OCCP\", # occupation\n",
124 | " \"POBP\", # place of birth\n",
125 | " \"RELP\", # relationship\n",
126 | " \"WKHP\", # hours worked per week past 12 months\n",
127 | " \"SEX\", # sex\n",
128 | " \"RAC1P\", # recorded detailed race code\n",
129 | " \"PWGTP\", # persons weight\n",
130 | " \"GCL\", # grand parents living with grandchildren\n",
131 | "]\n",
132 | "\n",
133 | "# Define the prediction problem and features\n",
134 | "ACSIncome = folktables.BasicProblem(\n",
135 | " features=income_features,\n",
136 | " target=\"PINCP\", # total persons income\n",
137 | " target_transform=lambda x: x > 50000,\n",
138 | " group=\"RAC1P\",\n",
139 | " preprocess=adult_filter, # applies the following conditions; ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))\n",
140 | " postprocess=lambda x: x, # applies post processing, e.g. fill all NAs\n",
141 | ")\n",
142 | "\n",
143 | "# Initialize year, duration (\"1-Year\" or \"5-Year\") and granularity (household or person)\n",
144 | "data_source = ACSDataSource(survey_year=\"2018\", horizon=\"1-Year\", survey=\"person\")\n",
145 | "# Specify region (here: California) and load data\n",
146 | "ca_data = data_source.get_data(states=[\"CA\"], download=True)\n",
147 | "# Apply transformation as per problem statement above\n",
148 | "ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)\n",
149 | "\n",
150 | "# Convert numpy array to dataframe\n",
151 | "df = pd.DataFrame(\n",
152 | " np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),\n",
153 | " columns=income_features + [\">50k\"],\n",
154 | ")\n",
155 | "\n",
156 | "# For further modelling we keep only 2 groups\n",
157 | "df = df[df[\"RAC1P\"].isin([6, 8])].copy(deep=True)"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {
163 | "tags": []
164 | },
165 | "source": [
166 | "## 2. Data Processing\n",
167 | "(Go to top)"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "### 2.1 Exploratory Data Analysis\n",
175 | "(Go to Data Processing)\n",
176 | "\n",
177 | "We look at the number of rows and columns, and at some simple statistics of the dataset."
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "metadata": {
184 | "tags": []
185 | },
186 | "outputs": [],
187 | "source": [
188 | "# Print the first five rows\n",
189 | "# NaN means missing data\n",
190 | "df.head()"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {
197 | "tags": []
198 | },
199 | "outputs": [],
200 | "source": [
201 | "# Check how many rows and columns we have in the data frame\n",
202 | "print(\"The shape of the dataset is:\", df.shape)"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {
209 | "tags": []
210 | },
211 | "outputs": [],
212 | "source": [
213 | "# Let's see the data types and non-null values for each column\n",
214 | "df.info()"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "We can clearly see that all columns are numerical (`dtype = float64`). However, when checking the column headers (and the information at the top of the notebook), we notice that we are actually dealing with multimodal data. We expect a mix of categorical, numerical, and potentially even text information.\n",
222 | "\n",
223 | "Let's cast the features accordingly. We start by creating a list for each feature type."
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": null,
229 | "metadata": {
230 | "tags": []
231 | },
232 | "outputs": [],
233 | "source": [
234 | "categorical_features = [\n",
235 | " \"COW\",\n",
236 | " \"SCHL\",\n",
237 | " \"MAR\",\n",
238 | " \"OCCP\",\n",
239 | " \"POBP\",\n",
240 | " \"RELP\",\n",
241 | " \"SEX\",\n",
242 | " \"GCL\",\n",
243 | "]\n",
244 | "\n",
245 | "numerical_features = [\"AGEP\", \"WKHP\", \"PWGTP\"]"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {
252 | "tags": []
253 | },
254 | "outputs": [],
255 | "source": [
256 | "# Cast categorical features to `object`\n",
257 | "df[categorical_features] = df[categorical_features].astype(\"object\")\n",
258 | "\n",
259 | "# Cast numerical features to `int`\n",
260 | "df[numerical_features] = df[numerical_features].astype(\"int\")"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "Let's check with `.info()` again to make sure the changes took effect."
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": null,
273 | "metadata": {
274 | "tags": []
275 | },
276 | "outputs": [],
277 | "source": [
278 | "df.info()"
279 | ]
280 | },
281 | {
282 | "cell_type": "markdown",
283 | "metadata": {},
284 | "source": [
285 | "Looks good, so we can now separate the model features from the model target and the sensitive feature to explore them separately."
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {
292 | "tags": []
293 | },
294 | "outputs": [],
295 | "source": [
296 | "sensitive_feature = \"RAC1P\"\n",
297 | "\n",
298 | "model_target = \">50k\"\n",
299 | "model_features = categorical_features + numerical_features\n",
300 | "\n",
301 | "print(\"Model features: \", model_features)\n",
302 | "print(\"Model target: \", model_target)"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {
309 | "tags": []
310 | },
311 | "outputs": [],
312 | "source": [
313 | "# Double-check that the target is not accidentally part of the features\n",
314 | "model_target in model_features"
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "All good here. We made sure that the target is not in the feature list. If the statement above returns `True`, we need to remove the target by calling `model_features.remove(model_target)`.\n",
322 | "\n",
323 | "Let's have a look at missing values next.\n",
324 | "\n",
325 | "\n",
326 | "#### Missing values\n",
327 | "The quickest way to check for missing values is to use `.isna().sum()`. This will provide a count of how many missing values we have. In fact, we can also infer the number of missing values from `.info()`, as it provides a count of non-null values."
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": null,
333 | "metadata": {
334 | "tags": []
335 | },
336 | "outputs": [],
337 | "source": [
338 | "# Show missing values\n",
339 | "df.isna().sum()"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "Before starting with the plots, let's have a look at how many unique instances we have per column. This helps us avoid plotting charts with hundreds of unique values. Let's filter for columns with fewer than 10 unique instances."
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": null,
352 | "metadata": {
353 | "tags": []
354 | },
355 | "outputs": [],
356 | "source": [
357 | "# Count unique values per feature and keep features with fewer than 10\n",
358 | "n_unique = df[model_features].nunique()\n",
359 | "shortlist_fts = n_unique[n_unique < 10]\n",
363 | "\n",
364 | "print(shortlist_fts)"
365 | ]
366 | },
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "#### Target distribution\n",
372 | "\n",
373 | "Let's check our target distribution."
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {
380 | "tags": []
381 | },
382 | "outputs": [],
383 | "source": [
384 | "df[model_target].value_counts().plot.bar(color=\"black\")\n",
385 | "plt.show()"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "metadata": {},
391 | "source": [
392 | "We notice that we are dealing with an imbalanced dataset. This means that one class has many more examples than the other (here: 0, meaning individuals earning $\\leq$ 50k). This is relevant for model choice and for potential up-sampling or down-sampling to balance out the classes."
393 | ]
394 | },
395 | {
396 | "cell_type": "markdown",
397 | "metadata": {},
398 | "source": [
399 | "#### Feature distribution(s)\n",
400 | "\n",
401 | "Let's now plot bar charts for the shortlist features of our dataset (as defined above: feature columns with fewer than 10 unique values)."
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {
408 | "tags": []
409 | },
410 | "outputs": [],
411 | "source": [
412 | "fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))\n",
413 | "fig.suptitle(\"Feature Bar Plots\")\n",
414 | "\n",
415 | "fts = range(len(shortlist_fts.index.tolist()))\n",
416 | "for i, ax in zip(fts, axs.ravel()):\n",
417 | " df[shortlist_fts.index.tolist()[i]].value_counts().plot.bar(color=\"black\", ax=ax)\n",
418 | " ax.set_title(shortlist_fts.index.tolist()[i])\n",
419 | "plt.show()"
420 | ]
421 | },
422 | {
423 | "cell_type": "markdown",
424 | "metadata": {},
425 | "source": [
426 | "### 2.2 Select features to build the model\n",
427 | "(Go to Data Processing)\n",
428 | "\n",
429 | "During the extended EDA in the DATAPREP notebook, we learned that `GCL` is distributed almost identically across both outcome classes and also contains a lot of missing values. Therefore, we can drop it from the list of features we want to use for the model build. We also drop `OCCP` and `POBP`, as those features have too many unique categories."
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": null,
435 | "metadata": {
436 | "tags": []
437 | },
438 | "outputs": [],
439 | "source": [
440 | "to_remove = [\"GCL\", \"OCCP\", \"POBP\"]\n",
441 | "\n",
442 | "# Drop to_remove features from the respective list(s) - if applicable\n",
443 | "for ft in to_remove:\n",
444 | " if ft in model_features:\n",
445 | " model_features.remove(ft)\n",
446 | " if ft in categorical_features:\n",
447 | " categorical_features.remove(ft)\n",
448 | " if ft in numerical_features:\n",
449 | " numerical_features.remove(ft)\n",
450 | "\n",
451 | "# Let's also clean up the dataframe and only keep the features and columns we need\n",
452 | "df = df[model_features + [sensitive_feature] + [model_target]].copy(deep=True)"
453 | ]
454 | },
455 | {
456 | "cell_type": "markdown",
457 | "metadata": {},
458 | "source": [
459 | "### 2.3 Feature transformation\n",
460 | "(Go to Data Processing)\n",
461 | "\n",
462 | "In this notebook, we won't perform any transformation."
463 | ]
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "metadata": {},
468 | "source": [
469 | "### 2.4 Train - Validation - Test Datasets\n",
470 | "(Go to Data Processing)\n",
471 | "\n",
472 | "To get a training, test and validation set, we will use sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function."
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "metadata": {
479 | "tags": []
480 | },
481 | "outputs": [],
482 | "source": [
483 | "train_data, test_data = train_test_split(\n",
484 | " df, test_size=0.1, shuffle=True, random_state=23\n",
485 | ")\n",
486 | "\n",
487 | "train_data, val_data = train_test_split(\n",
488 | " train_data, test_size=0.15, shuffle=True, random_state=23\n",
489 | ")\n",
490 | "\n",
491 | "# Print the shapes of the Train - Test Datasets\n",
492 | "print(\n",
493 | " \"Train - Test - Validation datasets shapes: \",\n",
494 | " train_data.shape,\n",
495 | " test_data.shape,\n",
496 | " val_data.shape,\n",
497 | ")"
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "### 2.5 Data processing with Pipeline and ColumnTransformer\n",
505 | "(Go to Data Processing)\n",
506 | "\n",
507 | "Let's build a full model pipeline. We need preprocessing split per data type, and then we combine everything back into a composite pipeline along with a model. To achieve this, we will use sklearn's `Pipeline` and `ColumnTransformer`.\n",
508 | "\n",
509 | "__Step 1 (set up pre-processing per data type):__\n",
510 | "> For the numerical features pipeline, the __numerical_processor__ below, we impute missing values with the mean using sklearn's `SimpleImputer`, followed by a `MinMaxScaler`. Scaling is not strictly required for tree-based models, but it does matter for `LogisticRegression`. If different processing is desired for different numerical features, separate pipelines should be built, one per group of features.\n",
511 | "\n",
512 | " > In the categorical features pipeline, the __categorical_processor__ below, we impute with a placeholder value and encode with sklearn's `OneHotEncoder`. If memory is a concern, it is a good idea to check the categoricals' unique values first, to get an estimate of how many dummy features one-hot encoding will create. Note the __handle_unknown__ parameter that tells the encoder to ignore (rather than throw an error for) any value that shows up in the validation and/or test set but was not present in the initial training set.\n",
513 | " \n",
514 | "__Step 2 (combining pre-processing methods into a transformer):__ \n",
515 | " > The per-type preprocessing pipelines are then combined into a single `ColumnTransformer`, to be finally used in a Pipeline along with an estimator. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future.\n",
516 | " \n",
517 | "__Step 3 (combining transformer with a model):__ \n",
518 | "> Combine `ColumnTransformer` from Step 2 with a selected algorithm in a new pipeline. For example, the algorithm could be a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for classification problems."
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": null,
524 | "metadata": {
525 | "tags": []
526 | },
527 | "outputs": [],
528 | "source": [
529 | "### STEP 1 ###\n",
530 | "##############\n",
531 | "\n",
532 | "# Preprocess the numerical features\n",
533 | "numerical_processor = Pipeline(\n",
534 | " [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
535 | ")\n",
536 | "# Preprocess the categorical features\n",
537 | "categorical_processor = Pipeline(\n",
538 | " [\n",
539 | " (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
540 | " (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\")),\n",
541 | " ]\n",
542 | ")\n",
543 | "\n",
544 | "### STEP 2 ###\n",
545 | "##############\n",
546 | "\n",
547 | "# Combine all data preprocessors from above\n",
548 | "data_processor = ColumnTransformer(\n",
549 | " [\n",
550 | " (\"numerical_processing\", numerical_processor, numerical_features),\n",
551 | " (\"categorical_processing\", categorical_processor, categorical_features),\n",
552 | " ]\n",
553 | ")\n",
554 | "\n",
555 | "\n",
556 | "### STEP 3 ###\n",
557 | "##############\n",
558 | "\n",
559 | "# Pipeline with all desired data transformers, along with an estimator at the end\n",
560 | "# Later you can set/access the parameters using the names given - for hyperparameter tuning, for example\n",
561 | "pipeline = Pipeline(\n",
562 | " [\n",
563 | " (\"data_processing\", data_processor),\n",
564 | " (\"lg\", LogisticRegression(random_state=0)),\n",
565 | " ]\n",
566 | ")\n",
567 | "\n",
568 | "# Visualize the pipeline\n",
569 | "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n",
570 | "from sklearn import set_config\n",
571 | "\n",
572 | "set_config(display=\"diagram\")\n",
573 | "pipeline"
574 | ]
575 | },
576 | {
577 | "cell_type": "markdown",
578 | "metadata": {},
579 | "source": [
580 | "## 3. Train a Classifier\n",
581 | "(Go to top)\n",
582 | "\n",
583 | "We use the pipeline, with a Logistic Regression estimator, for training, and then apply Equalized Odds postprocessing."
584 | ]
585 | },
586 | {
587 | "cell_type": "markdown",
588 | "metadata": {},
589 | "source": [
590 | "### Model Training\n",
591 | "\n",
592 | "We train the classifier with __.fit()__ on our training dataset. "
593 | ]
594 | },
595 | {
596 | "cell_type": "code",
597 | "execution_count": null,
598 | "metadata": {
599 | "tags": []
600 | },
601 | "outputs": [],
602 | "source": [
603 | "# Get train data to train the classifier\n",
604 | "X_train = train_data[model_features]\n",
605 | "y_train = train_data[model_target]\n",
606 | "\n",
607 | "# Fit the classifier to the train data\n",
608 | "# Train data going through the Pipeline is imputed (with means from the train data),\n",
609 | "# scaled (with the min/max from the train data),\n",
610 | "# and finally used to fit the model\n",
611 | "pipeline.fit(X_train, y_train)\n",
612 | "\n",
613 | "y_train_pred = pipeline.predict(X_train)"
614 | ]
615 | },
616 | {
617 | "cell_type": "markdown",
618 | "metadata": {},
619 | "source": [
620 | "Next, we want to enforce Equalized Odds. To do so, we are going to use fairlearn's `ThresholdOptimizer`. It is a postprocessing algorithm based on the paper [Equality of Opportunity in Supervised Learning](https://proceedings.neurips.cc/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf). This technique takes an existing classifier and the sensitive feature as inputs, and derives a transformation of the classifier's predictions to enforce the specified parity constraints."
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "execution_count": null,
626 | "metadata": {
627 | "tags": []
628 | },
629 | "outputs": [],
630 | "source": [
631 | "# Set up ThresholdOptimizer\n",
632 | "eo_model = ThresholdOptimizer(\n",
633 | " estimator=pipeline[-1],\n",
634 | " constraints=\"equalized_odds\",\n",
635 | " objective=\"accuracy_score\",\n",
636 | " grid_size=1000,\n",
637 | " flip=False,\n",
638 | " prefit=False,\n",
639 | " predict_method=\"deprecated\",\n",
640 | ")\n",
641 | "\n",
642 | "# Learn the transformation & extract feature names\n",
643 | "data_processor.fit(X_train)\n",
644 | "\n",
645 | "# To extract feature names we first need to fit the data processor as this will generate the one hot encoding\n",
646 | "ft_names = numerical_features + list(\n",
647 | " data_processor.transformers_[1][1]\n",
648 | " .named_steps[\"cat_encoder\"]\n",
649 | " .get_feature_names_out(categorical_features)\n",
650 | ")\n",
651 | "\n",
652 | "# Add column names and convert to data frame\n",
653 | "X_train_prep = pd.DataFrame(\n",
654 | " data_processor.transform(X_train).todense(), columns=ft_names\n",
655 | ")\n",
656 | "\n",
657 | "# Adjust the results that the classifier would produce by letting ThresholdOptimizer know what the sensitive features are\n",
658 | "eo_model.fit(X_train_prep, y_train, sensitive_features=train_data[\"RAC1P\"].values)"
659 | ]
660 | },
661 | {
662 | "cell_type": "code",
663 | "execution_count": null,
664 | "metadata": {
665 | "tags": []
666 | },
667 | "outputs": [],
668 | "source": [
669 | "# You can now use the fitted equalized odds post-processor to create adjusted outcomes (predictions)\n",
670 | "y_train_adjusted = eo_model.predict(\n",
671 | " X_train_prep, sensitive_features=train_data[\"RAC1P\"].values\n",
672 | ")"
673 | ]
674 | },
675 | {
676 | "cell_type": "code",
677 | "execution_count": null,
678 | "metadata": {
679 | "tags": []
680 | },
681 | "outputs": [],
682 | "source": [
683 | "# Join the data, adjusted predictions, original predictions, and the true outcome\n",
684 | "eop_df_train = pd.DataFrame(\n",
685 | " {\n",
686 | " \"RAC1P\": train_data[\"RAC1P\"].reset_index(drop=True),\n",
687 | " \"y_train_adjusted\": y_train_adjusted,\n",
688 | " \"y_train_pred\": y_train_pred,\n",
689 | " \"y_train_true\": train_data[model_target].reset_index(drop=True),\n",
690 | " }\n",
691 | ")"
692 | ]
693 | },
694 | {
695 | "cell_type": "markdown",
696 | "metadata": {},
697 | "source": [
698 | "## 4. Test the Classifier\n",
699 | "(Go to top)\n",
700 | "\n",
701 | "Let's now evaluate the performance of the trained classifier on the test dataset. We use __.predict()__ this time. "
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": null,
707 | "metadata": {
708 | "tags": []
709 | },
710 | "outputs": [],
711 | "source": [
712 | "# Create the adjusted predictions for test and convert to dataframe\n",
713 | "y_test_adjusted = eo_model.predict(\n",
714 | " np.asarray(data_processor.transform(test_data[model_features]).todense()),\n",
715 | " sensitive_features=test_data[\"RAC1P\"].values,\n",
716 | ")\n",
717 | "\n",
718 | "# Let's have a look at the adjusted outputs for the test dataset\n",
719 | "print(y_test_adjusted)"
720 | ]
721 | },
722 | {
723 | "cell_type": "code",
724 | "execution_count": null,
725 | "metadata": {
726 | "tags": []
727 | },
728 | "outputs": [],
729 | "source": [
730 | "# Join the data, adjusted predictions, original predictions, and the true outcome\n",
731 | "eop_df_test = pd.DataFrame(\n",
732 | " {\n",
733 | " \"RAC1P\": test_data[\"RAC1P\"].reset_index(drop=True),\n",
734 | " \"y_test_adjusted\": y_test_adjusted,\n",
735 | " \"y_test_true\": test_data[model_target].reset_index(drop=True),\n",
736 | " \"y_test_withoutEO\": pipeline.predict(test_data[model_features]),\n",
737 | " }\n",
738 | ")\n",
739 | "\n",
740 | "%matplotlib inline\n",
741 | "# Initialize figure\n",
742 | "fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 6))\n",
743 | "\n",
744 | "# Set title of figure\n",
745 | "fig.suptitle(\"Comparison of Model Predictions and Baseline\")\n",
746 | "\n",
747 | "# Set title\n",
748 | "ax1.title.set_text(\"Baseline target distribution\")\n",
749 | "ax2.title.set_text(\"Prediction EO adjusted\")\n",
750 | "ax3.title.set_text(\"Prediction without EO adjustment\")\n",
751 | "\n",
752 | "# Create plots\n",
753 | "eop_df_test.groupby([\"RAC1P\", \"y_test_true\"]).size().unstack().plot(\n",
754 | " kind=\"bar\", stacked=True, color=sns.husl_palette(2), ax=ax1\n",
755 | ")\n",
756 | "eop_df_test.groupby([\"RAC1P\", \"y_test_adjusted\"]).size().unstack().plot(\n",
757 | " kind=\"bar\", stacked=True, color=sns.husl_palette(2), ax=ax2\n",
758 | ")\n",
759 | "eop_df_test.groupby([\"RAC1P\", \"y_test_withoutEO\"]).size().unstack().plot(\n",
760 | " kind=\"bar\", stacked=True, color=sns.husl_palette(2), ax=ax3\n",
761 | ")\n",
762 | "# Align y-axis\n",
763 | "ax2.sharey(ax1)\n",
764 | "ax3.sharey(ax1)"
765 | ]
766 | },
767 | {
768 | "cell_type": "markdown",
769 | "metadata": {},
770 | "source": [
771 | "You can clearly see that the ratios between the groups were adjusted drastically (compared to predictions without EO postprocessing). The question remains whether it was fair to reduce the number of positive outcomes for RAC1P group 6. The odds are equal now, but this came at the cost of placing many outcomes that were actually positive into the negative bucket. You can also see that, had we not enforced Equalized Odds, the model would have amplified the existing bias in the dataset."
772 | ]
773 | },
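{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of how to quantify this tradeoff, we can compare per-group confusion matrices for the adjusted and unadjusted predictions (this assumes `confusion_matrix` from `sklearn.metrics` is available in the environment):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"# Compare per-group confusion matrices with and without the EO adjustment\n",
"for grp in sorted(eop_df_test[\"RAC1P\"].unique()):\n",
"    sub = eop_df_test[eop_df_test[\"RAC1P\"] == grp]\n",
"    print(f\"Group {grp} - EO adjusted:\")\n",
"    print(confusion_matrix(sub[\"y_test_true\"], sub[\"y_test_adjusted\"]))\n",
"    print(f\"Group {grp} - without EO:\")\n",
"    print(confusion_matrix(sub[\"y_test_true\"], sub[\"y_test_withoutEO\"]))"
]
},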
774 | {
775 | "cell_type": "markdown",
776 | "metadata": {},
777 | "source": [
778 | "This is the end of the notebook."
779 | ]
780 | }
781 | ],
782 | "metadata": {
783 | "kernelspec": {
784 | "display_name": ".conda-mlu-rai:Python",
785 | "language": "python",
786 | "name": "conda-env-.conda-mlu-rai-py"
787 | },
788 | "language_info": {
789 | "codemirror_mode": {
790 | "name": "ipython",
791 | "version": 3
792 | },
793 | "file_extension": ".py",
794 | "mimetype": "text/x-python",
795 | "name": "python",
796 | "nbconvert_exporter": "python",
797 | "pygments_lexer": "ipython3",
798 | "version": "3.9.20"
799 | }
800 | },
801 | "nbformat": 4,
802 | "nbformat_minor": 4
803 | }
804 |
--------------------------------------------------------------------------------
/notebooks/day_3/MLA-RESML-SHAP.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | ""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Responsible AI - SHAP Values\n",
15 | "\n",
16 | "This notebook shows how to build a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model to predict whether an individual's income is $\\leq$ 50k using US census data. The goal is to explore why certain individuals receive a particular prediction and also to get an overall understanding of feature importance using SHAP.\n",
17 | "\n",
18 | "__Dataset:__ \n",
19 | "The dataset we will use for this exercise comes from [folktables](https://github.com/zykls/folktables). Folktables provides code to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).\n",
20 | "\n",
21 | "__ML Problem:__ \n",
22 | "Ultimately, the goal will be to predict whether an individual's income is above \\\\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\\\$100. The threshold of \\\\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.\n",
23 | "\n",
24 | "\n",
25 | "1. Read the dataset\n",
26 | "2. Data Processing\n",
27 | " * Exploratory Data Analysis\n",
28 | " * Select features to build the model\n",
29 | " * Train - Validation - Test Datasets\n",
30 | " * Data processing with Pipeline and ColumnTransformer\n",
31 | "3. Train a Classifier\n",
32 | "4. Test the Classifier\n",
33 | "5. SHAP Values"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "This notebook assumes an installation of the SageMaker kernel `.conda-mlu-rai:Python` through the `environment.yml` file in SageMaker Studio Lab."
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": null,
46 | "metadata": {
47 | "tags": []
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# Reshaping/basic libraries\n",
52 | "import pandas as pd\n",
53 | "import numpy as np\n",
54 | "\n",
55 | "# Plotting libraries\n",
56 | "import matplotlib.pyplot as plt\n",
57 | "import seaborn as sns\n",
58 | "\n",
59 | "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
60 | "\n",
61 | "# ML libraries\n",
62 | "from sklearn.model_selection import train_test_split\n",
63 | "from sklearn.metrics import confusion_matrix, accuracy_score\n",
64 | "from sklearn.impute import SimpleImputer\n",
65 | "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
66 | "from sklearn.pipeline import Pipeline\n",
67 | "from sklearn.compose import ColumnTransformer\n",
68 | "from sklearn.ensemble import RandomForestClassifier\n",
69 | "\n",
70 | "# Operational libraries\n",
71 | "import sys\n",
72 | "import json\n",
73 | "\n",
74 | "sys.path.append(\"..\")\n",
75 | "\n",
76 | "# Fairness libraries\n",
77 | "from folktables.acs import *\n",
78 | "from folktables.folktables import *\n",
79 | "from folktables.load_acs import *\n",
80 | "import shap\n",
81 | "\n",
82 | "# Jupyter(lab) libraries\n",
83 | "import warnings\n",
84 | "\n",
85 | "warnings.filterwarnings(\"ignore\")"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "## 1. Read the dataset\n",
93 | "(Go to top)\n",
94 | "\n",
95 | "To read in the dataset, we will be using [folktables](https://github.com/zykls/folktables) which provides access to the US Census dataset. Folktables contains predefined prediction tasks but also allows the user to specify the problem type.\n",
96 | "\n",
97 | "The US Census dataset distinguishes between households and individuals. To obtain data on individuals, we use `ACSDataSource` with `survey=person`. The feature names for the US Census data follow the same distinction and use `P` for `person` and `H` for `household`, e.g. `AGEP` refers to the age of an individual."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "tags": []
105 | },
106 | "outputs": [],
107 | "source": [
108 | "income_features = [\n",
109 | " \"AGEP\", # age individual\n",
110 | " \"COW\", # class of worker\n",
111 | " \"SCHL\", # educational attainment\n",
112 | " \"MAR\", # marital status\n",
113 | " \"OCCP\", # occupation\n",
114 | " \"POBP\", # place of birth\n",
115 | " \"RELP\", # relationship\n",
116 | " \"WKHP\", # hours worked per week past 12 months\n",
117 | " \"SEX\", # sex\n",
118 | " \"RAC1P\", # recorded detailed race code\n",
119 | " \"PWGTP\", # persons weight\n",
120 | "    \"GCL\", # grandparents living with grandchildren\n",
121 | "]\n",
122 | "\n",
123 | "# Define the prediction problem and features\n",
124 | "ACSIncome = folktables.BasicProblem(\n",
125 | " features=income_features,\n",
126 | " target=\"PINCP\", # total persons income\n",
127 | " target_transform=lambda x: x > 50000,\n",
128 | " group=\"RAC1P\",\n",
129 | " preprocess=adult_filter, # applies the following conditions; ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))\n",
130 | " postprocess=lambda x: x, # applies post processing, e.g. fill all NAs\n",
131 | ")\n",
132 | "\n",
133 | "# Initialize year, duration (\"1-Year\" or \"5-Year\") and granularity (household or person)\n",
134 | "data_source = ACSDataSource(survey_year=\"2018\", horizon=\"1-Year\", survey=\"person\")\n",
135 | "# Specify region (here: California) and load data\n",
136 | "ca_data = data_source.get_data(states=[\"CA\"], download=True)\n",
137 | "# Apply transformation as per problem statement above\n",
138 | "ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)\n",
139 | "\n",
140 | "# Convert numpy array to dataframe\n",
141 | "df = pd.DataFrame(\n",
142 | " np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),\n",
143 | " columns=income_features + [\">50k\"],\n",
144 | ")\n",
145 | "\n",
146 | "# For further modelling we want to use only 2 groups (see DATAPREP notebook for details)\n",
147 | "df = df[df[\"RAC1P\"].isin([6, 8])].copy(deep=True)"
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {},
153 | "source": [
154 | "## 2. Data Processing\n",
155 | "(Go to top)"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "### 2.1 Exploratory Data Analysis\n",
163 | "(Go to Data Processing)\n",
164 | "\n",
165 | "We look at number of rows, columns, and some simple statistics of the dataset."
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {
172 | "tags": []
173 | },
174 | "outputs": [],
175 | "source": [
176 | "# Print the first five rows\n",
177 | "# NaN means missing data\n",
178 | "df.head()"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "tags": []
186 | },
187 | "outputs": [],
188 | "source": [
189 | "# Check how many rows and columns we have in the data frame\n",
190 | "print(\"The shape of the dataset is:\", df.shape)"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {
197 | "tags": []
198 | },
199 | "outputs": [],
200 | "source": [
201 | "# Let's see the data types and non-null values for each column\n",
202 | "df.info()"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "We can clearly see that all columns are numerical (`dtype = float64`). However, when checking the column headers (and the information at the top of the notebook), we notice that we are actually dealing with multimodal data. We expect to see a mix of categorical, numerical, and potentially even text information.\n",
210 | "\n",
211 | "Let's cast the features accordingly. We start by creating a list for each feature type."
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {
218 | "tags": []
219 | },
220 | "outputs": [],
221 | "source": [
222 | "categorical_features = [\n",
223 | " \"COW\",\n",
224 | " \"SCHL\",\n",
225 | " \"MAR\",\n",
226 | " \"OCCP\",\n",
227 | " \"POBP\",\n",
228 | " \"RELP\",\n",
229 | " \"SEX\",\n",
230 | " \"RAC1P\",\n",
231 | " \"GCL\",\n",
232 | "]\n",
233 | "\n",
234 | "numerical_features = [\"AGEP\", \"WKHP\", \"PWGTP\"]"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {
241 | "tags": []
242 | },
243 | "outputs": [],
244 | "source": [
245 | "# We cast categorical features to `object`\n",
246 | "df[categorical_features] = df[categorical_features].astype(\"object\")\n",
247 | "\n",
248 | "# We cast numerical features to `int`\n",
249 | "df[numerical_features] = df[numerical_features].astype(\"int\")"
250 | ]
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "Let's check with `.info()` again to make sure the changes took effect."
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "tags": []
264 | },
265 | "outputs": [],
266 | "source": [
267 | "df.info()"
268 | ]
269 | },
270 | {
271 | "cell_type": "markdown",
272 | "metadata": {},
273 | "source": [
274 | "Looks good, so we can now separate model features from model target to explore them separately."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {
281 | "tags": []
282 | },
283 | "outputs": [],
284 | "source": [
285 | "model_target = \">50k\"\n",
286 | "model_features = categorical_features + numerical_features\n",
287 | "\n",
288 | "print(\"Model features: \", model_features)\n",
289 | "print(\"Model target: \", model_target)"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {
296 | "tags": []
297 | },
298 | "outputs": [],
299 | "source": [
300 | "# Double-check that the target is not accidentally part of the features\n",
301 | "model_target in model_features"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "All good here. We made sure that the target is not in the feature list. If the statement above had shown `True`, we would need to remove the target by calling `model_features.remove(model_target)`.\n",
309 | "\n",
310 | "Let's have a look at missing values next.\n",
311 | "\n",
312 | "\n",
313 | "#### Missing values\n",
314 | "The quickest way to check for missing values is to use `.isna().sum()`. This provides a count of how many missing values we have per column. In fact, we can also infer the number of missing values from `.info()`, as it provides a count of non-null values."
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {
321 | "tags": []
322 | },
323 | "outputs": [],
324 | "source": [
325 | "# Show missing values\n",
326 | "df.isna().sum()"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "Before starting with the plots, let's have a look at how many unique instances we have per column. This helps us avoid plotting charts with hundreds of unique values. Let's filter for columns with fewer than 10 unique instances."
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": null,
339 | "metadata": {
340 | "tags": []
341 | },
342 | "outputs": [],
343 | "source": [
344 | "shortlist_fts = (\n",
345 | " df[model_features]\n",
346 | " .apply(lambda col: col.nunique())\n",
347 | " .where(df[model_features].apply(lambda col: col.nunique()) < 10)\n",
348 | " .dropna()\n",
349 | ")\n",
350 | "\n",
351 | "print(shortlist_fts)"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "#### Target distribution\n",
359 | "\n",
360 | "Let's check our target distribution."
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": null,
366 | "metadata": {
367 | "tags": []
368 | },
369 | "outputs": [],
370 | "source": [
371 | "df[model_target].value_counts().plot.bar(color=\"black\")\n",
372 | "plt.show()"
373 | ]
374 | },
375 | {
376 | "cell_type": "markdown",
377 | "metadata": {},
378 | "source": [
379 | "We notice that we are dealing with an imbalanced dataset. This means there are more examples of one class (here: 0, meaning individuals earning $\\leq$ 50k) than the other. This is relevant for model choice and for potential up-sampling or down-sampling to balance the classes."
380 | ]
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "metadata": {},
385 | "source": [
386 | "#### Feature distribution(s)\n",
387 | "\n",
388 | "Let's now plot bar charts for the shortlisted features of our dataset (as defined above: feature columns with fewer than 10 unique values)."
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "metadata": {
395 | "tags": []
396 | },
397 | "outputs": [],
398 | "source": [
399 | "fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))\n",
400 | "fig.suptitle(\"Feature Bar Plots\")\n",
401 | "\n",
402 | "fts = range(len(shortlist_fts.index.tolist()))\n",
403 | "for i, ax in zip(fts, axs.ravel()):\n",
404 | " df[shortlist_fts.index.tolist()[i]].value_counts().plot.bar(color=\"black\", ax=ax)\n",
405 | " ax.set_title(shortlist_fts.index.tolist()[i])\n",
406 | "plt.show()"
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {},
412 | "source": [
413 | "### 2.2 Select features to build the model\n",
414 | "(Go to Data Processing)\n",
415 | "\n",
416 | "During the extended EDA in the DATAPREP notebook, we learned that `GCL` is distributed almost identically across both outcome classes and also contains a lot of missing values. Therefore, we can drop it from the list of features we want to use for model building. We also drop `OCCP` and `POBP`, as those features have too many unique categories."
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": null,
422 | "metadata": {
423 | "tags": []
424 | },
425 | "outputs": [],
426 | "source": [
427 | "to_remove = [\"GCL\", \"OCCP\", \"POBP\"]\n",
428 | "\n",
429 | "# Drop to_remove features from the respective list(s) - if applicable\n",
430 | "for ft in to_remove:\n",
431 | " if ft in model_features:\n",
432 | " model_features.remove(ft)\n",
433 | " if ft in categorical_features:\n",
434 | " categorical_features.remove(ft)\n",
435 | " if ft in numerical_features:\n",
436 | " numerical_features.remove(ft)\n",
437 | "\n",
438 | "# Let's also clean up the dataframe and only keep the features and columns we need\n",
439 | "df = df[model_features + [model_target]].copy(deep=True)"
440 | ]
441 | },
442 | {
443 | "cell_type": "markdown",
444 | "metadata": {},
445 | "source": [
446 | "### 2.3 Train - Validation - Test Datasets\n",
447 | "(Go to Data Processing)\n",
448 | "\n",
449 | "To get a training, test and validation set, we will use sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function."
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": null,
455 | "metadata": {
456 | "tags": []
457 | },
458 | "outputs": [],
459 | "source": [
460 | "train_data, test_data = train_test_split(\n",
461 | " df, test_size=0.1, shuffle=True, random_state=23\n",
462 | ")\n",
463 | "train_data, val_data = train_test_split(\n",
464 | " train_data, test_size=0.15, shuffle=True, random_state=23\n",
465 | ")\n",
466 | "\n",
467 | "# Print the shapes of the Train - Test Datasets\n",
468 | "print(\n",
469 | " \"Train - Test - Validation datasets shapes: \",\n",
470 | " train_data.shape,\n",
471 | " test_data.shape,\n",
472 | " val_data.shape,\n",
473 | ")"
474 | ]
475 | },
476 | {
477 | "cell_type": "markdown",
478 | "metadata": {},
479 | "source": [
480 | "### 2.4 Data processing with Pipeline and ColumnTransformer\n",
481 | "(Go to Data Processing)\n",
482 | "\n",
483 | "Let's build a full model pipeline. We need to split the pre-processing per data type and then combine everything back into a composite pipeline, along with a model. To achieve this, we will use sklearn's `Pipeline` and `ColumnTransformer`.\n",
484 | "\n",
485 | "__Step 1 (set up pre-processing per data type):__\n",
486 | "> For the numerical features pipeline, the __numerical_processor__ below, we impute missing values with the mean using sklearn's `SimpleImputer`, followed by a `MinMaxScaler` (scaling is not required for tree-based models such as the Random Forest used below, but it's a good opportunity to demonstrate more data transforms). If different processing is desired for different numerical features, separate pipelines should be built, just as shown below for the categorical features.\n",
487 | "\n",
488 | " > In the categorical features pipeline, the __categorical_processor__ below, we impute with a placeholder value and encode with sklearn's `OneHotEncoder`. If memory is a concern, it is a good idea to check the categorical features' unique values to estimate how many dummy features one-hot encoding will create. Note the __handle_unknown__ parameter, which tells the encoder to ignore (rather than throw an error for) any value that shows up in the validation and/or test set but was not present in the initial training set.\n",
489 | " \n",
490 | "__Step 2 (combining pre-processing methods into a transformer):__ \n",
491 | " > The separate preparations of the dataset features are then combined into a single `ColumnTransformer`, to be used in a Pipeline along with an estimator. This ensures that the transforms are applied automatically to the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future.\n",
492 | " \n",
493 | "__Step 3 (combining transformer with a model):__ \n",
494 | "> Combine the `ColumnTransformer` from Step 2 with a selected algorithm in a new pipeline. Here, the algorithm is a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for the classification problem."
495 | ]
496 | },
497 | {
498 | "cell_type": "code",
499 | "execution_count": null,
500 | "metadata": {
501 | "tags": []
502 | },
503 | "outputs": [],
504 | "source": [
505 | "### STEP 1 ###\n",
506 | "##############\n",
507 | "\n",
508 | "# Preprocess the numerical features\n",
509 | "numerical_processor = Pipeline(\n",
510 | " [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
511 | ")\n",
512 | "# Preprocess the categorical features\n",
513 | "categorical_processor = Pipeline(\n",
514 | " [\n",
515 | " (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
516 | " (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\")),\n",
517 | " ]\n",
518 | ")\n",
519 | "\n",
520 | "### STEP 2 ###\n",
521 | "##############\n",
522 | "\n",
523 | "# Combine all data preprocessors from above\n",
524 | "data_processor = ColumnTransformer(\n",
525 | " [\n",
526 | " (\"numerical_processing\", numerical_processor, numerical_features),\n",
527 | " (\"categorical_processing\", categorical_processor, categorical_features),\n",
528 | " ]\n",
529 | ")\n",
530 | "\n",
531 | "### STEP 3 ###\n",
532 | "##############\n",
533 | "\n",
534 | "# Pipeline with all desired data transformers, along with an estimator at the end\n",
535 | "# Later you can set/access the parameters using the names given - for hyperparameter tuning, for example\n",
536 | "pipeline = Pipeline(\n",
537 | " [\n",
538 | " (\"data_processing\", data_processor),\n",
539 | " (\"rf\", RandomForestClassifier(max_depth=10, max_features=40, random_state=1)),\n",
540 | " ]\n",
541 | ")\n",
542 | "\n",
543 | "# Visualize the pipeline\n",
544 | "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n",
545 | "from sklearn import set_config\n",
546 | "\n",
547 | "set_config(display=\"diagram\")\n",
548 | "pipeline"
549 | ]
550 | },
551 | {
552 | "cell_type": "markdown",
553 | "metadata": {},
554 | "source": [
555 | "## 3. Train a Classifier\n",
556 | "(Go to top)\n",
557 | "\n",
558 | "We use the pipeline with the desired data transformers, along with a RandomForestClassifier for training.\n"
559 | ]
560 | },
561 | {
562 | "cell_type": "markdown",
563 | "metadata": {},
564 | "source": [
565 | "### Model Training\n",
566 | "\n",
567 | "We train the classifier with __.fit()__ on our training dataset. "
568 | ]
569 | },
570 | {
571 | "cell_type": "code",
572 | "execution_count": null,
573 | "metadata": {
574 | "tags": []
575 | },
576 | "outputs": [],
577 | "source": [
578 | "# Get train data to train the classifier\n",
579 | "X_train = train_data[model_features]\n",
580 | "y_train = train_data[model_target]\n",
581 | "\n",
582 | "# Fit the classifier to the train data\n",
583 | "# Train data going through the Pipeline is imputed (with means from the train data),\n",
584 | "# scaled (with the min/max from the train data),\n",
585 | "# and finally used to fit the model\n",
586 | "pipeline.fit(X_train, y_train)"
587 | ]
588 | },
589 | {
590 | "cell_type": "code",
591 | "execution_count": null,
592 | "metadata": {
593 | "tags": []
594 | },
595 | "outputs": [],
596 | "source": [
597 | "# Get validation data to validate the classifier\n",
598 | "X_val = val_data[model_features]\n",
599 | "y_val = val_data[model_target]\n",
600 | "\n",
601 | "y_val_pred = pipeline.predict(X_val)\n",
602 | "\n",
603 | "print(\"Model performance on the validation set:\")\n",
604 | "print(\"Validation accuracy:\", accuracy_score(y_val, y_val_pred))"
605 | ]
606 | },
607 | {
608 | "cell_type": "markdown",
609 | "metadata": {},
610 | "source": [
611 | "## 4. Test the Classifier\n",
612 | "(Go to top)\n",
613 | "\n",
614 | "Let's now evaluate the performance of the trained classifier on the test dataset, again using __.predict()__. \n"
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "execution_count": null,
620 | "metadata": {
621 | "tags": []
622 | },
623 | "outputs": [],
624 | "source": [
625 | "# Get test data to evaluate the classifier\n",
626 | "X_test = test_data[model_features]\n",
627 | "y_test = test_data[model_target]\n",
628 | "\n",
629 | "# Use the fitted model to make predictions on the test dataset\n",
630 | "# Test data going through the Pipeline is imputed (with means from the train data),\n",
631 | "# scaled (with the min/max from the train data),\n",
632 | "# and finally used to make predictions\n",
633 | "test_predictions = pipeline.predict(X_test)\n",
634 | "\n",
635 | "print(\"Model performance on the test set:\")\n",
636 | "print(\"Test accuracy:\", accuracy_score(y_test, test_predictions))"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "## 5. SHAP Values\n",
644 | "(Go to top)\n",
645 | "\n",
646 | "SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (see [paper](https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html) for details).\n",
647 | "\n",
648 | "\n",
649 | "Let's have a look at SHAP values and plots for our dataset. To make the explanations easier to understand, we extract the feature names and create a small sample dataframe."
650 | ]
651 | },
652 | {
653 | "cell_type": "code",
654 | "execution_count": null,
655 | "metadata": {},
656 | "outputs": [],
657 | "source": [
658 | "ft_names = numerical_features + list(\n",
659 | " data_processor.transformers_[1][1]\n",
660 | " .named_steps[\"cat_encoder\"]\n",
661 | " .get_feature_names_out(categorical_features)\n",
662 | ")"
663 | ]
664 | },
665 | {
666 | "cell_type": "code",
667 | "execution_count": null,
668 | "metadata": {},
669 | "outputs": [],
670 | "source": [
671 | "# Let's take a smaller sub-sample of the training dataset to generate explanations with SHAP\n",
672 | "X_sample = pd.DataFrame(\n",
673 | " pipeline[0].transform(X_train[:200]).toarray(), columns=ft_names\n",
674 | ")\n",
675 | "\n",
676 | "# Create the explanations by passing in the model and the data\n",
677 | "explainer = shap.Explainer(pipeline[-1])\n",
678 | "shap_values = explainer(X_sample)"
679 | ]
680 | },
681 | {
682 | "cell_type": "markdown",
683 | "metadata": {},
684 | "source": [
685 | "We will now look at an explanation created with SHAP. The explanation starts with a base value (the average model output over the training sample we passed) and adds and subtracts SHAP values that push the model output from that base value to the final prediction. Features pushing the prediction higher are shown in red; those pushing the prediction lower are in blue."
686 | ]
687 | },
688 | {
689 | "cell_type": "code",
690 | "execution_count": null,
691 | "metadata": {},
692 | "outputs": [],
693 | "source": [
694 | "outcome_class = 0 # Needs to be 0 or 1 for binary classification\n",
695 | "\n",
696 | "# Set the style of plotting to white\n",
697 | "sns.set_style(\"white\")\n",
698 | "# Load JS visualization code to notebook\n",
699 | "shap.initjs()\n",
700 | "\n",
701 | "# Create waterfall plot\n",
702 | "shap.plots.waterfall(shap_values[0][:, outcome_class])"
703 | ]
704 | },
705 | {
706 | "cell_type": "code",
707 | "execution_count": null,
708 | "metadata": {},
709 | "outputs": [],
710 | "source": [
711 | "datapoint_num = 2  # Needs to be between 0 and len(X_sample) - 1\n",
712 | "\n",
713 | "# Plot explanations\n",
714 | "shap.force_plot(\n",
715 | " explainer.expected_value[outcome_class],\n",
716 | " shap_values[datapoint_num].values[:, outcome_class],\n",
717 | " X_sample.iloc[datapoint_num, :],\n",
718 | ")"
719 | ]
720 | },
721 | {
722 | "cell_type": "markdown",
723 | "metadata": {},
724 | "source": [
725 | "If we take a range of data points and create explanations such as the one shown above, rotate them 90 degrees, and then stack them horizontally, we can see explanations for an entire dataset:"
726 | ]
727 | },
728 | {
729 | "cell_type": "code",
730 | "execution_count": null,
731 | "metadata": {},
732 | "outputs": [],
733 | "source": [
734 | "datapoint_range = 10\n",
735 | "\n",
736 | "shap.force_plot(\n",
737 | " explainer.expected_value[outcome_class],\n",
738 | " shap_values[0:datapoint_range].values[:datapoint_range, :, outcome_class],\n",
739 | " X_sample.iloc[0:datapoint_range, :],\n",
740 | ")"
741 | ]
742 | },
743 | {
744 | "cell_type": "markdown",
745 | "metadata": {},
746 | "source": [
747 | "The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature value (red high, blue low)."
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": null,
753 | "metadata": {},
754 | "outputs": [],
755 | "source": [
756 | "shap.plots.beeswarm(shap_values[:, :, outcome_class])"
757 | ]
758 | },
759 | {
760 | "cell_type": "markdown",
761 | "metadata": {},
762 | "source": [
763 | "Taking the mean absolute value of the SHAP values for each feature produces a standard bar plot:"
764 | ]
765 | },
766 | {
767 | "cell_type": "code",
768 | "execution_count": null,
769 | "metadata": {},
770 | "outputs": [],
771 | "source": [
772 | "shap.plots.bar(shap_values[0:datapoint_range, :, outcome_class])"
773 | ]
774 | },
775 | {
776 | "cell_type": "markdown",
777 | "metadata": {},
778 | "source": [
779 | "The model in this notebook was trained without removing sensitive features, and it is concerning that both age and ethnicity rank very high in overall feature importance. This should prompt bias mitigation. Explanations can clearly surface this kind of model behavior, which is why it is so important to examine why a model makes certain predictions."
780 | ]
781 | },
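    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a hypothetical first mitigation step (a sketch, not part of the original analysis; the column names in the `sensitive` list below are assumptions and must be matched to the actual dataset), we could drop the sensitive columns, refit the same pipeline on the reduced feature set, and compare validation accuracy. Note that proxy features can still encode the same information, so dropping columns alone is not a complete mitigation."
     ]
    },
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
      "# Hypothetical sketch: the column names below are assumptions - adjust to the dataset\n",
      "sensitive = [\"age\", \"ethnicity\"]\n",
      "\n",
      "# Remove the sensitive columns from the feature lists\n",
      "num_reduced = [f for f in numerical_features if f not in sensitive]\n",
      "cat_reduced = [f for f in categorical_features if f not in sensitive]\n",
      "reduced_features = num_reduced + cat_reduced\n",
      "\n",
      "# Rebuild the same preprocessing + model pipeline on the reduced feature set\n",
      "reduced_pipeline = Pipeline(\n",
      "    [\n",
      "        (\n",
      "            \"data_processing\",\n",
      "            ColumnTransformer(\n",
      "                [\n",
      "                    (\"numerical_processing\", numerical_processor, num_reduced),\n",
      "                    (\"categorical_processing\", categorical_processor, cat_reduced),\n",
      "                ]\n",
      "            ),\n",
      "        ),\n",
      "        (\"rf\", RandomForestClassifier(max_depth=10, max_features=40, random_state=1)),\n",
      "    ]\n",
      ")\n",
      "\n",
      "reduced_pipeline.fit(train_data[reduced_features], y_train)\n",
      "print(\n",
      "    \"Validation accuracy without sensitive features:\",\n",
      "    accuracy_score(y_val, reduced_pipeline.predict(val_data[reduced_features])),\n",
      ")"
     ]
    },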
782 | {
783 | "cell_type": "markdown",
784 | "metadata": {},
785 | "source": [
786 | "This is the end of this notebook."
787 | ]
788 | }
789 | ],
790 | "metadata": {
791 | "kernelspec": {
792 | "display_name": ".conda-mlu-rai:Python",
793 | "language": "python",
794 | "name": "conda-env-.conda-mlu-rai-py"
795 | },
796 | "language_info": {
797 | "codemirror_mode": {
798 | "name": "ipython",
799 | "version": 3
800 | },
801 | "file_extension": ".py",
802 | "mimetype": "text/x-python",
803 | "name": "python",
804 | "nbconvert_exporter": "python",
805 | "pygments_lexer": "ipython3",
806 | "version": "3.9.20"
807 | }
808 | },
809 | "nbformat": 4,
810 | "nbformat_minor": 4
811 | }
812 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | absl-py==2.1.0
2 | aif360==0.6.1
3 | annotated-types==0.7.0
4 | attrs==23.2.0
5 | BlackBoxAuditing==0.1.54
6 | boto3==1.35.26
7 | botocore==1.35.26
8 | certifi==2024.8.30
9 | charset-normalizer==3.3.2
10 | cloudpickle==2.2.1
11 | contourpy==1.3.0
12 | cycler==0.12.1
13 | dill==0.3.8
14 | docker==7.1.0
15 | fairlearn==0.10.0
16 | filelock==3.16.1
17 | folktables==0.0.12
18 | fonttools==4.54.1
19 | fsspec==2024.9.0
20 | google-pasta==0.2.0
21 | h5py==3.11.0
22 | huggingface-hub==0.25.1
23 | idna==3.10
24 | importlib-metadata==6.11.0
25 | importlib_resources==6.4.5
26 | iniconfig==2.0.0
27 | ipywidgets==8.1.5
28 | jmespath==1.0.1
29 | joblib==1.4.2
30 | jsonschema==4.23.0
31 | jsonschema-specifications==2023.12.1
32 | jupyterlab_widgets==3.0.13
33 | keras==3.5.0
34 | kiwisolver==1.4.7
35 | llvmlite==0.43.0
36 | markdown-it-py==3.0.0
37 | mdurl==0.1.2
38 | memory-profiler==0.61.0
39 | ml_dtypes==0.5.0
40 | mock==4.0.3
41 | multiprocess==0.70.16
42 | namex==0.0.8
43 | networkx==3.2.1
44 | numba==0.60.0
45 | numpy==1.26.4
46 | optree==0.12.1
47 | pandas==2.2.3
48 | pathos==0.3.2
49 | pillow==10.4.0
50 | pluggy==1.5.0
51 | pox==0.3.4
52 | ppft==1.7.6.8
53 | protobuf==4.25.5
54 | pydantic==2.9.2
55 | pydantic_core==2.23.4
56 | pyparsing==3.1.4
57 | pytest==8.3.3
58 | pytest-warnings==0.3.1
59 | pytz==2024.2
60 | PyYAML==6.0.2
61 | referencing==0.35.1
62 | requests==2.32.3
63 | rich==13.8.1
64 | rpds-py==0.20.0
65 | s3transfer==0.10.2
66 | sagemaker==2.232.1
67 | sagemaker-core==1.0.7
68 | schema==0.7.7
69 | scikit-learn==1.5.2
70 | scipy==1.13.1
71 | seaborn==0.13.2
72 | shap==0.46.0
73 | slicer==0.0.8
74 | smdebug-rulesconfig==1.0.1
75 | tblib==3.0.0
76 | tempeh==0.1.12
77 | threadpoolctl==3.5.0
78 | tomli==2.0.1
79 | tqdm==4.66.5
80 | tzdata==2024.2
81 | urllib3==1.26.20
82 | widgetsnbextension==4.0.13
--------------------------------------------------------------------------------
/slides/MLU-RAI-DAY1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-responsible-ai/db8ce25b0e84123967c8da2c4b2c67c8d2c8d6df/slides/MLU-RAI-DAY1.pdf
--------------------------------------------------------------------------------
/slides/MLU-RAI-DAY2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-responsible-ai/db8ce25b0e84123967c8da2c4b2c67c8d2c8d6df/slides/MLU-RAI-DAY2.pdf
--------------------------------------------------------------------------------