├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── NOTICE
├── README.md
├── deployment
│   ├── sagemaker-graph-fraud-detection.yaml
│   ├── sagemaker-notebook-instance-stack.yaml
│   ├── sagemaker-permissions-stack.yaml
│   └── solution-assistant
│       ├── requirements.in
│       ├── requirements.txt
│       ├── solution-assistant.yaml
│       └── src
│           └── lambda_function.py
├── docs
│   ├── arch.png
│   ├── details.md
│   └── overview_video_thumbnail.png
├── iam
│   ├── launch.json
│   └── usage.json
├── metadata
│   └── metadata.json
└── source
    ├── lambda
    │   ├── data-preprocessing
    │   │   └── index.py
    │   └── graph-modelling
    │       └── index.py
    └── sagemaker
        ├── baselines
        │   ├── mlp-fraud-baseline.ipynb
        │   ├── mlp_fraud_entry_point.py
        │   ├── utils.py
        │   └── xgboost-fraud-baseline.ipynb
        ├── data-preprocessing
        │   ├── buildspec.yml
        │   ├── container
        │   │   ├── Dockerfile
        │   │   └── build_and_push.sh
        │   └── graph_data_preprocessor.py
        ├── dgl-fraud-detection.ipynb
        ├── sagemaker_graph_fraud_detection
        │   ├── __init__.py
        │   ├── config.py
        │   ├── container_build
        │   │   ├── __init__.py
        │   │   ├── container_build.py
        │   │   └── logs.py
        │   ├── dgl_fraud_detection
        │   │   ├── __init__.py
        │   │   ├── data.py
        │   │   ├── estimator_fns.py
        │   │   ├── graph.py
        │   │   ├── model
        │   │   │   ├── __init__.py
        │   │   │   ├── mxnet.py
        │   │   │   └── pytorch.py
        │   │   ├── requirements.txt
        │   │   ├── sampler.py
        │   │   ├── train_dgl_mxnet_entry_point.py
        │   │   └── utils.py
        │   ├── requirements.txt
        │   └── setup.py
        └── setup.sh
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 |
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 |
9 |
10 | ## Reporting Bugs/Feature Requests
11 |
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 |
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 |
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 |
22 |
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 |
26 | 1. You are working against the latest source on the *master* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 |
30 | To send us a pull request, please:
31 |
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 |
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 |
42 |
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 |
46 |
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 |
52 |
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 |
56 |
57 | ## Licensing
58 |
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 |
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 |
2 | Apache License
3 | Version 2.0, January 2004
4 | http://www.apache.org/licenses/
5 |
6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7 |
8 | 1. Definitions.
9 |
10 | "License" shall mean the terms and conditions for use, reproduction,
11 | and distribution as defined by Sections 1 through 9 of this document.
12 |
13 | "Licensor" shall mean the copyright owner or entity authorized by
14 | the copyright owner that is granting the License.
15 |
16 | "Legal Entity" shall mean the union of the acting entity and all
17 | other entities that control, are controlled by, or are under common
18 | control with that entity. For the purposes of this definition,
19 | "control" means (i) the power, direct or indirect, to cause the
20 | direction or management of such entity, whether by contract or
21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
22 | outstanding shares, or (iii) beneficial ownership of such entity.
23 |
24 | "You" (or "Your") shall mean an individual or Legal Entity
25 | exercising permissions granted by this License.
26 |
27 | "Source" form shall mean the preferred form for making modifications,
28 | including but not limited to software source code, documentation
29 | source, and configuration files.
30 |
31 | "Object" form shall mean any form resulting from mechanical
32 | transformation or translation of a Source form, including but
33 | not limited to compiled object code, generated documentation,
34 | and conversions to other media types.
35 |
36 | "Work" shall mean the work of authorship, whether in Source or
37 | Object form, made available under the License, as indicated by a
38 | copyright notice that is included in or attached to the work
39 | (an example is provided in the Appendix below).
40 |
41 | "Derivative Works" shall mean any work, whether in Source or Object
42 | form, that is based on (or derived from) the Work and for which the
43 | editorial revisions, annotations, elaborations, or other modifications
44 | represent, as a whole, an original work of authorship. For the purposes
45 | of this License, Derivative Works shall not include works that remain
46 | separable from, or merely link (or bind by name) to the interfaces of,
47 | the Work and Derivative Works thereof.
48 |
49 | "Contribution" shall mean any work of authorship, including
50 | the original version of the Work and any modifications or additions
51 | to that Work or Derivative Works thereof, that is intentionally
52 | submitted to Licensor for inclusion in the Work by the copyright owner
53 | or by an individual or Legal Entity authorized to submit on behalf of
54 | the copyright owner. For the purposes of this definition, "submitted"
55 | means any form of electronic, verbal, or written communication sent
56 | to the Licensor or its representatives, including but not limited to
57 | communication on electronic mailing lists, source code control systems,
58 | and issue tracking systems that are managed by, or on behalf of, the
59 | Licensor for the purpose of discussing and improving the Work, but
60 | excluding communication that is conspicuously marked or otherwise
61 | designated in writing by the copyright owner as "Not a Contribution."
62 |
63 | "Contributor" shall mean Licensor and any individual or Legal Entity
64 | on behalf of whom a Contribution has been received by Licensor and
65 | subsequently incorporated within the Work.
66 |
67 | 2. Grant of Copyright License. Subject to the terms and conditions of
68 | this License, each Contributor hereby grants to You a perpetual,
69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70 | copyright license to reproduce, prepare Derivative Works of,
71 | publicly display, publicly perform, sublicense, and distribute the
72 | Work and such Derivative Works in Source or Object form.
73 |
74 | 3. Grant of Patent License. Subject to the terms and conditions of
75 | this License, each Contributor hereby grants to You a perpetual,
76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77 | (except as stated in this section) patent license to make, have made,
78 | use, offer to sell, sell, import, and otherwise transfer the Work,
79 | where such license applies only to those patent claims licensable
80 | by such Contributor that are necessarily infringed by their
81 | Contribution(s) alone or by combination of their Contribution(s)
82 | with the Work to which such Contribution(s) was submitted. If You
83 | institute patent litigation against any entity (including a
84 | cross-claim or counterclaim in a lawsuit) alleging that the Work
85 | or a Contribution incorporated within the Work constitutes direct
86 | or contributory patent infringement, then any patent licenses
87 | granted to You under this License for that Work shall terminate
88 | as of the date such litigation is filed.
89 |
90 | 4. Redistribution. You may reproduce and distribute copies of the
91 | Work or Derivative Works thereof in any medium, with or without
92 | modifications, and in Source or Object form, provided that You
93 | meet the following conditions:
94 |
95 | (a) You must give any other recipients of the Work or
96 | Derivative Works a copy of this License; and
97 |
98 | (b) You must cause any modified files to carry prominent notices
99 | stating that You changed the files; and
100 |
101 | (c) You must retain, in the Source form of any Derivative Works
102 | that You distribute, all copyright, patent, trademark, and
103 | attribution notices from the Source form of the Work,
104 | excluding those notices that do not pertain to any part of
105 | the Derivative Works; and
106 |
107 | (d) If the Work includes a "NOTICE" text file as part of its
108 | distribution, then any Derivative Works that You distribute must
109 | include a readable copy of the attribution notices contained
110 | within such NOTICE file, excluding those notices that do not
111 | pertain to any part of the Derivative Works, in at least one
112 | of the following places: within a NOTICE text file distributed
113 | as part of the Derivative Works; within the Source form or
114 | documentation, if provided along with the Derivative Works; or,
115 | within a display generated by the Derivative Works, if and
116 | wherever such third-party notices normally appear. The contents
117 | of the NOTICE file are for informational purposes only and
118 | do not modify the License. You may add Your own attribution
119 | notices within Derivative Works that You distribute, alongside
120 | or as an addendum to the NOTICE text from the Work, provided
121 | that such additional attribution notices cannot be construed
122 | as modifying the License.
123 |
124 | You may add Your own copyright statement to Your modifications and
125 | may provide additional or different license terms and conditions
126 | for use, reproduction, or distribution of Your modifications, or
127 | for any such Derivative Works as a whole, provided Your use,
128 | reproduction, and distribution of the Work otherwise complies with
129 | the conditions stated in this License.
130 |
131 | 5. Submission of Contributions. Unless You explicitly state otherwise,
132 | any Contribution intentionally submitted for inclusion in the Work
133 | by You to the Licensor shall be under the terms and conditions of
134 | this License, without any additional terms or conditions.
135 | Notwithstanding the above, nothing herein shall supersede or modify
136 | the terms of any separate license agreement you may have executed
137 | with Licensor regarding such Contributions.
138 |
139 | 6. Trademarks. This License does not grant permission to use the trade
140 | names, trademarks, service marks, or product names of the Licensor,
141 | except as required for reasonable and customary use in describing the
142 | origin of the Work and reproducing the content of the NOTICE file.
143 |
144 | 7. Disclaimer of Warranty. Unless required by applicable law or
145 | agreed to in writing, Licensor provides the Work (and each
146 | Contributor provides its Contributions) on an "AS IS" BASIS,
147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148 | implied, including, without limitation, any warranties or conditions
149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150 | PARTICULAR PURPOSE. You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 |
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 |
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 |
--------------------------------------------------------------------------------
/NOTICE:
--------------------------------------------------------------------------------
1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Amazon SageMaker and Deep Graph Library for Fraud Detection in Heterogeneous Graphs
2 |
3 | Many online businesses lose billions annually to fraud, but machine learning based fraud detection models can help businesses predict which interactions or users are likely fraudulent and save them from incurring those costs.
4 |
5 | In this project, we formulate the problem of fraud detection as a classification task on a heterogeneous interaction network. The machine learning model is a Graph Neural Network (GNN) that learns latent representations of users or transactions which can then be easily separated into fraudulent or legitimate.
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 | This project shows how to use [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and [Deep Graph Library (DGL)](https://www.dgl.ai/) to construct a heterogeneous graph from tabular data and train a GNN model to detect fraudulent transactions in the [IEEE-CIS dataset](https://www.kaggle.com/c/ieee-fraud-detection/data).
16 |
17 | See the [details page](docs/details.md) to learn more about the techniques used, and the [online webinar](https://www.youtube.com/watch?v=P_oCAbSYRwY&feature=youtu.be) or [tutorial blog post](https://aws.amazon.com/blogs/machine-learning/detecting-fraud-in-heterogeneous-networks-using-amazon-sagemaker-and-deep-graph-library/) to see step by step explanations and instructions on how to use this solution.
18 |
19 | ## Getting Started
20 |
21 | You will need an AWS account to use this solution. Sign up for an account [here](https://aws.amazon.com/).
22 |
23 | To run this JumpStart 1P Solution and have the infrastructure deployed to your AWS account, you will need to create an active SageMaker Studio instance (see [Onboard to Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html)). When your Studio instance is *Ready*, use the instructions in [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html) to 1-Click Launch the solution.
24 |
25 | The solution artifacts are included in this GitHub repository for reference.
26 |
27 | *Note*: Solutions are available in most regions, including us-west-2 and us-east-1.
28 |
29 | **Caution**: Cloning this GitHub repository and running the code manually could lead to unexpected issues! Use the AWS CloudFormation template. You'll get an Amazon SageMaker Notebook instance that's been correctly set up and configured to access the other resources in the solution.
30 |
31 | ## Architecture
32 |
33 | The project architecture deployed by the CloudFormation template is shown here.
34 |
35 | 
36 |
37 | ## Project Organization
38 | The project is divided into two main modules.
39 |
40 | The [first module](source/sagemaker/data-preprocessing) uses [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to do feature engineering and extract edge lists from a table of transactions or interactions.
41 |
42 |
43 | The [second module](source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection) shows how to use DGL to define a GNN model and train the model using [Amazon SageMaker training infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/deep-graph-library.html).
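
For reference, here is a minimal sketch of how the training module might be handed to SageMaker using the MXNet framework estimator (SageMaker Python SDK v2 parameter names). The bucket placeholders and the `n-epochs` hyperparameter are illustrative assumptions, not the notebook's exact configuration.

```python
# Sketch only: launch the DGL training entry point as a SageMaker training job.
import sagemaker
from sagemaker.mxnet import MXNet

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = MXNet(
    entry_point="train_dgl_mxnet_entry_point.py",
    source_dir="sagemaker_graph_fraud_detection/dgl_fraud_detection",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",          # matches the template's training default
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters={"n-epochs": 100},       # illustrative hyperparameter
    output_path="s3://<solution-bucket>/training-output",  # placeholder bucket
    sagemaker_session=session,
)

# Preprocessed graph data (edge lists and node features) is expected under the
# prefix produced by the first module.
estimator.fit({"train": "s3://<solution-bucket>/preprocessed-data"})
```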
44 |
45 |
46 | The [jupyter notebook](source/sagemaker/dgl-fraud-detection.ipynb) shows how to run the full project on an [example dataset](https://www.kaggle.com/c/ieee-fraud-detection/data).
47 |
48 |
49 | The project also contains a [CloudFormation template](deployment/sagemaker-graph-fraud-detection.yaml) that deploys the code in this repo and all the AWS resources needed to run the project end to end in the AWS account it's launched in.
50 |
51 | ## Contents
52 |
53 | * `deployment/`
54 | * `sagemaker-graph-fraud-detection.yaml`: Creates AWS CloudFormation Stack for solution
55 | * `source/`
56 | * `lambda/`
57 | * `data-preprocessing/`
58 | * `index.py`: Lambda function script for invoking SageMaker processing
59 | * `graph-modelling/`
60 | * `index.py`: Lambda function script for invoking SageMaker training
61 | * `sagemaker/`
62 | * `baselines/`
63 | * `mlp-fraud-baseline.ipynb`: Jupyter notebook for feature based MLP baseline method using SageMaker and MXNet
64 | * `mlp_fraud_entry_point.py`: python entry point used by the MLP baseline notebook for MXNet training/deployment
65 | * `utils.py`: utility functions for baseline notebooks
66 |         * `xgboost-fraud-baseline.ipynb`: Jupyter notebook for feature based XGBoost baseline method using SageMaker
67 | * `data-preprocessing/`
68 | * `container/`
69 | * `Dockerfile`: Describes custom Docker image hosted on Amazon ECR for SageMaker Processing
70 | * `build_and_push.sh`: Script to build Docker image and push to Amazon ECR
71 | * `graph_data_preprocessor.py`: Custom script used by SageMaker Processing for data processing/feature engineering
72 | * `sagemaker_graph_fraud_detection/`
73 | * `dgl_fraud_detection/`
74 | * `model`
75 | * `mxnet.py`: Implements the various graph neural network models used in the project with the mxnet backend
76 | * `data.py`: Contains functions for reading node features and labels
77 | * `estimator_fns.py`: Contains functions for parsing input from SageMaker estimator objects
78 | * `graph.py`: Contains functions for constructing DGL Graphs with node features and edge lists
79 | * `requirements.txt`: Describes Python package requirements of the Amazon SageMaker training instance
80 | * `sampler.py`: Contains functions for graph sampling for mini-batch training
81 | * `train_dgl_mxnet_entry_point.py`: python entry point used by the notebook for GNN training with DGL mxnet backend
82 | * `utils.py`: python script with utility functions for computing metrics and plots
83 | * `config.py`: python file to load stack configurations and pass to sagemaker notebook
84 | * `requirements.txt`: Describes Python package requirements of the SageMaker notebook instance
85 | * `setup.py`: setup sagemaker-graph-fraud-detection as a python package
86 | * `dgl-fraud-detection.ipynb`: Orchestrates the solution. Triggers preprocessing and model training
87 | * `setup.sh`: prepare notebook environment with necessary pre-reqs
88 |
89 | ## License
90 |
91 | This project is licensed under the Apache-2.0 License.
92 |
93 |
--------------------------------------------------------------------------------
/deployment/sagemaker-graph-fraud-detection.yaml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: "2010-09-09"
2 | Description: "(SA0002) - sagemaker-graph-fraud-detection: Solution for training a graph neural network model for fraud detection using Amazon SageMaker. Version 1"
3 | Parameters:
4 | SolutionPrefix:
5 | Type: String
6 | Default: "sagemaker-soln-graph-fraud"
7 | Description: |
8 | Used to name resources created as part of this stack (and inside nested stacks too).
9 | Can be the same as the stack name used by AWS CloudFormation, but this field has extra
10 | constraints because it's used to name resources with restrictions (e.g. Amazon S3 bucket
11 | names cannot contain capital letters).
12 | AllowedPattern: '^sagemaker-soln-[a-z0-9\-]{1,20}$'
13 | ConstraintDescription: |
14 | Only allowed to use lowercase letters, hyphens and/or numbers.
15 | Should also start with 'sagemaker-soln-graph-fraud' for permission management.
16 | IamRole:
17 | Type: String
18 | Default: ""
19 | Description: |
20 | IAM Role that will be attached to the resources created by this cloudformation to grant them permissions to
21 | perform their required functions. This role should allow SageMaker and Lambda to perform the required actions like
22 | creating training jobs and processing jobs. If left blank, the template will attempt to create a role for you.
23 | This can cause a stack creation error if you don't have privileges to create new roles.
24 | S3HistoricalTransactionsPrefix:
25 | Description: Enter the S3 prefix where historical transactions/relations are stored.
26 | Type: String
27 | Default: "raw-data"
28 | S3ProcessingJobInputPrefix:
29 | Description: Enter the S3 prefix where inputs should be monitored for changes to start the processing job
30 | Type: String
31 | Default: "processing-input"
32 | S3ProcessingJobOutputPrefix:
33 | Description: Enter the S3 prefix where preprocessed data should be stored and monitored for changes to start the training job
34 | Type: String
35 | Default: "preprocessed-data"
36 | S3TrainingJobOutputPrefix:
37 | Description: Enter the S3 prefix where model and output artifacts from the training job should be stored
38 | Type: String
39 | Default: "training-output"
40 | CreateSageMakerNotebookInstance:
41 | Description: Whether to launch classic sagemaker notebook instance
42 | Type: String
43 | AllowedValues:
44 | - "true"
45 | - "false"
46 | Default: "false"
47 | BuildSageMakerContainersRemotely:
48 | Description: |
49 | Whether to launch a CodeBuild project to build sagemaker containers.
50 | If set to 'true' SageMaker notebook will use the CodeBuild project to launch a build job for sagemaker containers.
51 | If set to 'false' SageMaker notebook will attempt to build solution containers on the notebook instance.
52 | This may lead to some unexpected issues as docker isn't installed in SageMaker studio containers.
53 | Type: String
54 | AllowedValues:
55 | - "true"
56 | - "false"
57 | Default: "true"
58 | SageMakerProcessingJobContainerName:
59 | Description: Name of the SageMaker processing job ECR Container
60 | Type: String
61 | Default: "sagemaker-soln-graph-fraud-preprocessing"
62 | SageMakerProcessingJobInstanceType:
63 | Description: Instance type of the SageMaker processing job
64 | Type: String
65 | Default: "ml.m4.xlarge"
66 | SageMakerTrainingJobInstanceType:
67 | Description: Instance type of the SageMaker training job
68 | Type: String
69 | Default: "ml.p3.2xlarge"
70 | SageMakerNotebookInstanceType:
71 | Description: Instance type of the SageMaker notebook instance
72 | Type: String
73 | Default: "ml.m4.xlarge"
74 | StackVersion:
75 | Description: |
76 | CloudFormation Stack version.
77 | Use 'release' version unless you are customizing the
78 | CloudFormation templates and solution artifacts.
79 | Type: String
80 | Default: release
81 | AllowedValues:
82 | - release
83 | - development
84 |
85 | Metadata:
86 | AWS::CloudFormation::Interface:
87 | ParameterGroups:
88 | -
89 | Label:
90 | default: Solution Configuration
91 | Parameters:
92 | - SolutionPrefix
93 | - IamRole
94 | - StackVersion
95 | -
96 | Label:
97 | default: S3 Configuration
98 | Parameters:
99 | - S3HistoricalTransactionsPrefix
100 | - S3ProcessingJobInputPrefix
101 | - S3ProcessingJobOutputPrefix
102 | - S3TrainingJobOutputPrefix
103 | -
104 | Label:
105 | default: SageMaker Configuration
106 | Parameters:
107 | - CreateSageMakerNotebookInstance
108 | - BuildSageMakerContainersRemotely
109 | - SageMakerProcessingJobContainerName
110 | - SageMakerProcessingJobInstanceType
111 | - SageMakerTrainingJobInstanceType
112 | - SageMakerNotebookInstanceType
113 | ParameterLabels:
114 | SolutionPrefix:
115 | default: Solution Resources Name Prefix
116 | IamRole:
117 | default: Solution IAM Role Arn
118 | StackVersion:
119 | default: Solution Stack Version
120 | S3HistoricalTransactionsPrefix:
121 | default: S3 Data Prefix
122 | S3ProcessingJobInputPrefix:
123 | default: S3 Processing Input Prefix
124 | S3ProcessingJobOutputPrefix:
125 | default: S3 Preprocessed Data Prefix
126 | S3TrainingJobOutputPrefix:
127 | default: S3 Training Results Prefix
128 | CreateSageMakerNotebookInstance:
129 | default: Launch Classic SageMaker Notebook Instance
130 | BuildSageMakerContainersRemotely:
131 | default: Build SageMaker containers on AWS CodeBuild
132 | SageMakerProcessingJobContainerName:
133 | default: SageMaker Processing Container Name
134 | SageMakerProcessingJobInstanceType:
135 | default: SageMaker Processing Instance
136 | SageMakerTrainingJobInstanceType:
137 | default: SageMaker Training Instance
138 | SageMakerNotebookInstanceType:
139 | default: SageMaker Notebook Instance
140 |
141 | Mappings:
142 | S3:
143 | release:
144 | BucketPrefix: "sagemaker-solutions"
145 | development:
146 | BucketPrefix: "sagemaker-solutions-build"
147 | Lambda:
148 | DataPreprocessing:
149 | S3Key: "Fraud-detection-in-financial-networks/build/data_preprocessing.zip"
150 | GraphModelling:
151 | S3Key: "Fraud-detection-in-financial-networks/build/graph_modelling.zip"
152 | CodeBuild:
153 | ProcessingContainer:
154 | S3Key: "Fraud-detection-in-financial-networks/build/sagemaker_data_preprocessing.zip"
155 |
156 | Conditions:
157 | CreateClassicSageMakerResources: !Equals [ !Ref CreateSageMakerNotebookInstance, "true" ]
158 | CreateCustomSolutionRole: !Equals [!Ref IamRole, ""]
159 | CreateCodeBuildProject: !Equals [!Ref BuildSageMakerContainersRemotely, "true"]
160 |
161 | Resources:
162 | S3Bucket:
163 | Type: AWS::S3::Bucket
164 | DeletionPolicy: Retain
165 | Properties:
166 | BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
167 | PublicAccessBlockConfiguration:
168 | BlockPublicAcls: true
169 | BlockPublicPolicy: true
170 | IgnorePublicAcls: true
171 | RestrictPublicBuckets: true
172 | BucketEncryption:
173 | ServerSideEncryptionConfiguration:
174 | -
175 | ServerSideEncryptionByDefault:
176 | SSEAlgorithm: AES256
177 | NotificationConfiguration:
178 | LambdaConfigurations:
179 | -
180 | Event: s3:ObjectCreated:*
181 | Function: !GetAtt DataPreprocessingLambda.Arn
182 | Filter:
183 | S3Key:
184 | Rules:
185 | - Name: prefix
186 | Value: !Ref S3ProcessingJobInputPrefix
187 | -
188 | Event: s3:ObjectCreated:*
189 | Function: !GetAtt GraphModellingLambda.Arn
190 | Filter:
191 | S3Key:
192 | Rules:
193 | - Name: prefix
194 | Value: !Ref S3ProcessingJobOutputPrefix
195 | Metadata:
196 | cfn_nag:
197 | rules_to_suppress:
198 | - id: W35
199 | reason: Configuring logging requires supplying an existing customer S3 bucket to store logs
200 | - id: W51
201 | reason: Default access policy suffices
202 |
203 | SolutionAssistantStack:
204 | Type: "AWS::CloudFormation::Stack"
205 | Properties:
206 | TemplateURL: !Sub
207 | - "https://s3.amazonaws.com/${SolutionRefBucketBase}-${Region}/Fraud-detection-in-financial-networks/deployment/solution-assistant/solution-assistant.yaml"
208 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
209 | Region: !Ref AWS::Region
210 | Parameters:
211 | SolutionPrefix: !Ref SolutionPrefix
212 | SolutionsRefBucketName: !Sub
213 | - "${SolutionRefBucketBase}-${AWS::Region}"
214 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
215 | SolutionS3BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
216 | ECRRepository: !Ref SageMakerProcessingJobContainerName
217 | RoleArn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
218 |
219 | SageMakerPermissionsStack:
220 | Type: "AWS::CloudFormation::Stack"
221 | Condition: CreateCustomSolutionRole
222 | Properties:
223 | TemplateURL: !Sub
224 | - "https://s3.amazonaws.com/${SolutionRefBucketBase}-${Region}/Fraud-detection-in-financial-networks/deployment/sagemaker-permissions-stack.yaml"
225 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
226 | Region: !Ref AWS::Region
227 | Parameters:
228 | SolutionPrefix: !Ref SolutionPrefix
229 | SolutionS3BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
230 | SolutionCodeBuildProject: !Sub "${SolutionPrefix}-processing-job-container-build"
231 |
232 | SageMakerStack:
233 | Type: "AWS::CloudFormation::Stack"
234 | Condition: CreateClassicSageMakerResources
235 | Properties:
236 | TemplateURL: !Sub
237 | - "https://s3.amazonaws.com/${SolutionRefBucketBase}-${Region}/Fraud-detection-in-financial-networks/deployment/sagemaker-notebook-instance-stack.yaml"
238 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
239 | Region: !Ref AWS::Region
240 | Parameters:
241 | SolutionPrefix: !Ref SolutionPrefix
242 | SolutionS3BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
243 | S3InputDataPrefix: !Ref S3HistoricalTransactionsPrefix
244 | S3ProcessingJobOutputPrefix: !Ref S3ProcessingJobOutputPrefix
245 | S3TrainingJobOutputPrefix: !Ref S3TrainingJobOutputPrefix
246 | SageMakerNotebookInstanceType: !Ref SageMakerNotebookInstanceType
247 | SageMakerProcessingJobContainerName: !Ref SageMakerProcessingJobContainerName
248 | SageMakerProcessingJobContainerCodeBuild: !If [CreateCodeBuildProject, !Ref ProcessingJobContainerBuild, "local"]
249 | NotebookInstanceExecutionRoleArn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
250 |
251 | ProcessingJobContainerBuild:
252 | Condition: CreateCodeBuildProject
253 | Type: AWS::CodeBuild::Project
254 | Properties:
255 | Name: !Sub "${SolutionPrefix}-processing-job-container-build"
256 | Description: !Sub "Build docker container for SageMaker Processing job for ${SolutionPrefix}"
257 | ServiceRole: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
258 | Source:
259 | Type: S3
260 | Location: !Sub
261 | - "${SolutionRefBucketBase}-${AWS::Region}/${SourceKey}"
262 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
263 | SourceKey: !FindInMap [CodeBuild, ProcessingContainer, S3Key]
264 | Environment:
265 | ComputeType: BUILD_GENERAL1_MEDIUM
266 | Image: aws/codebuild/standard:4.0
267 | Type: LINUX_CONTAINER
268 | PrivilegedMode: True
269 | EnvironmentVariables:
270 | - Name: ecr_repository
271 | Value: !Ref SageMakerProcessingJobContainerName
272 | - Name: region
273 | Value: !Ref AWS::Region
274 | - Name: account_id
275 | Value: !Ref AWS::AccountId
276 | Artifacts:
277 | Type: NO_ARTIFACTS
278 | Metadata:
279 | cfn_nag:
280 | rules_to_suppress:
281 | - id: W32
282 | reason: overriding encryption requirements for codebuild
283 |
284 | DataPreprocessingLambda:
285 | Type: AWS::Lambda::Function
286 | Properties:
287 | Handler: "index.process_event"
288 | FunctionName: !Sub "${SolutionPrefix}-data-preprocessing"
289 | Role: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
290 | Environment:
291 | Variables:
292 | processing_job_ecr_repository: !Ref SageMakerProcessingJobContainerName
293 | processing_job_input_s3_prefix: !Ref S3ProcessingJobInputPrefix
294 | processing_job_instance_type: !Ref SageMakerProcessingJobInstanceType
295 | processing_job_output_s3_prefix: !Ref S3ProcessingJobOutputPrefix
296 | processing_job_role_arn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
297 | processing_job_s3_bucket: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
298 | processing_job_s3_raw_data_key: !Ref S3HistoricalTransactionsPrefix
299 | Runtime: "python3.7"
300 | Code:
301 | S3Bucket: !Sub
302 | - "${SolutionRefBucketBase}-${Region}"
303 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
304 | Region: !Ref AWS::Region
305 | S3Key: !FindInMap [Lambda, DataPreprocessing, S3Key]
306 | Timeout : 60
307 | MemorySize : 256
308 | Metadata:
309 | cfn_nag:
310 | rules_to_suppress:
311 | - id: W58
312 | reason: Passed in role or created role both have cloudwatch write permissions
313 | DataPreprocessingLambdaPermission:
314 | Type: AWS::Lambda::Permission
315 | Properties:
316 | Action: 'lambda:InvokeFunction'
317 | FunctionName: !Ref DataPreprocessingLambda
318 | Principal: s3.amazonaws.com
319 | SourceArn: !Sub "arn:aws:s3:::${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
320 | SourceAccount: !Ref AWS::AccountId
321 | GraphModellingLambda:
322 | Type: AWS::Lambda::Function
323 | Properties:
324 | Handler: "index.process_event"
325 | FunctionName: !Sub "${SolutionPrefix}-model-training"
326 | Role: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
327 | Environment:
328 | Variables:
329 | training_job_instance_type: !Ref SageMakerTrainingJobInstanceType
330 | training_job_output_s3_prefix: !Ref S3TrainingJobOutputPrefix
331 | training_job_role_arn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
332 | training_job_s3_bucket: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
333 | Runtime: "python3.7"
334 | Code:
335 | S3Bucket: !Sub
336 | - "${SolutionRefBucketBase}-${Region}"
337 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
338 | Region: !Ref AWS::Region
339 | S3Key: !FindInMap [Lambda, GraphModelling, S3Key]
340 | Timeout : 60
341 | MemorySize : 256
342 | Metadata:
343 | cfn_nag:
344 | rules_to_suppress:
345 | - id: W58
346 | reason: Passed in role or created role both have cloudwatch write permissions
347 | GraphModellingLambdaPermission:
348 | Type: AWS::Lambda::Permission
349 | Properties:
350 | Action: 'lambda:InvokeFunction'
351 | FunctionName: !Ref GraphModellingLambda
352 | Principal: s3.amazonaws.com
353 | SourceAccount: !Ref AWS::AccountId
354 |
355 | Outputs:
356 | SourceCode:
357 | Condition: CreateClassicSageMakerResources
358 | Description: "Open the Jupyter IDE. This authenticates you against Jupyter."
359 | Value: !GetAtt SageMakerStack.Outputs.SourceCode
360 |
361 | NotebookInstance:
362 | Condition: CreateClassicSageMakerResources
363 | Description: "SageMaker Notebook instance to manually orchestrate data preprocessing and model training"
364 | Value: !GetAtt SageMakerStack.Outputs.NotebookInstance
365 |
366 | AccountID:
367 | Description: "AWS Account ID to be passed downstream to the notebook instance"
368 | Value: !Ref AWS::AccountId
369 |
370 | AWSRegion:
371 | Description: "AWS Region to be passed downstream to the notebook instance"
372 | Value: !Ref AWS::Region
373 |
374 | IamRole:
375 | Description: "Arn of SageMaker Execution Role"
376 | Value: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole]
377 |
378 | SolutionPrefix:
379 | Description: "Solution Prefix for naming SageMaker transient resources"
380 | Value: !Ref SolutionPrefix
381 |
382 | SolutionS3Bucket:
383 | Description: "Solution S3 bucket name"
384 | Value: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}"
385 |
386 | S3InputDataPrefix:
387 | Description: "S3 bucket prefix for raw data"
388 | Value: !Ref S3HistoricalTransactionsPrefix
389 |
390 | S3ProcessingJobOutputPrefix:
391 | Description: "S3 bucket prefix for processed data"
392 | Value: !Ref S3ProcessingJobOutputPrefix
393 |
394 | S3TrainingJobOutputPrefix:
395 | Description: "S3 bucket prefix for trained model and other artifacts"
396 | Value: !Ref S3TrainingJobOutputPrefix
397 |
398 | SageMakerProcessingJobContainerName:
399 | Description: "ECR Container name for SageMaker processing job"
400 | Value: !Ref SageMakerProcessingJobContainerName
401 |
402 | SageMakerProcessingJobContainerBuild:
403 | Description: "Code build project for remotely building the sagemaker preprocessing container"
404 | Value: !If [CreateCodeBuildProject, !Ref ProcessingJobContainerBuild, "local"]
--------------------------------------------------------------------------------
/deployment/sagemaker-notebook-instance-stack.yaml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: "2010-09-09"
2 | Description: "(SA0002) - sagemaker-graph-fraud-detection SageMaker stack"
3 | Parameters:
4 | SolutionPrefix:
5 | Description: Enter the name of the prefix for the solution used for naming
6 | Type: String
7 | SolutionS3BucketName:
8 | Description: Enter the name of the S3 bucket for the solution
9 | Type: String
10 | S3InputDataPrefix:
11 | Description: S3 prefix where raw data is stored
12 | Type: String
13 | Default: "raw-data"
14 | S3ProcessingJobOutputPrefix:
15 | Description: S3 prefix where preprocessed data is stored after processing
16 | Type: String
17 | Default: "preprocessed-data"
18 | S3TrainingJobOutputPrefix:
19 | Description: S3 prefix where training outputs are stored after training
20 | Type: String
21 | Default: "training-output"
22 | SageMakerNotebookInstanceType:
23 | Description: Instance type of the SageMaker notebook instance
24 | Type: String
25 | Default: "ml.t3.medium"
26 | SageMakerProcessingJobContainerName:
27 | Description: Name of the SageMaker processing job ECR Container
28 | Type: String
29 | Default: "sagemaker-soln-graph-fraud-preprocessing"
30 | SageMakerProcessingJobContainerCodeBuild:
31 | Description: Name of the SageMaker processing container code build project
32 | Type: String
33 | Default: "local"
34 | NotebookInstanceExecutionRoleArn:
35 | Type: String
36 | Description: Execution Role for the SageMaker notebook instance
37 | StackVersion:
38 | Type: String
39 | Description: CloudFormation Stack version.
40 | Default: "release"
41 |
42 | Mappings:
43 | S3:
44 | release:
45 | BucketPrefix: "sagemaker-solutions"
46 | development:
47 | BucketPrefix: "sagemaker-solutions-build"
48 | SageMaker:
49 | Source:
50 | S3Key: "Fraud-detection-in-financial-networks/source/sagemaker/"
51 |
52 | Resources:
53 | NotebookInstance:
54 | Type: AWS::SageMaker::NotebookInstance
55 | Properties:
56 | DirectInternetAccess: Enabled
57 | InstanceType: !Ref SageMakerNotebookInstanceType
58 | LifecycleConfigName: !GetAtt LifeCycleConfig.NotebookInstanceLifecycleConfigName
59 | NotebookInstanceName: !Sub "${SolutionPrefix}-notebook-instance"
60 | RoleArn: !Ref NotebookInstanceExecutionRoleArn
61 | VolumeSizeInGB: 120
62 | Metadata:
63 | cfn_nag:
64 | rules_to_suppress:
65 | - id: W1201
66 | reason: Solution does not have KMS encryption enabled by default
67 | LifeCycleConfig:
68 | Type: AWS::SageMaker::NotebookInstanceLifecycleConfig
69 | Properties:
70 | NotebookInstanceLifecycleConfigName: !Sub "${SolutionPrefix}-nb-lifecycle-config"
71 | OnCreate:
72 | - Content:
73 | Fn::Base64: !Sub
74 | - |
75 | cd /home/ec2-user/SageMaker
76 | aws s3 cp --recursive s3://${SolutionsRefBucketBase}-${AWS::Region}/${SolutionsRefSource} .
77 | touch stack_outputs.json
78 | echo '{' >> stack_outputs.json
79 | echo ' "AccountID": "${AWS::AccountId}",' >> stack_outputs.json
80 | echo ' "AWSRegion": "${AWS::Region}",' >> stack_outputs.json
81 | echo ' "IamRole": "${NotebookInstanceExecutionRoleArn}",' >> stack_outputs.json
82 | echo ' "SolutionPrefix": "${SolutionPrefix}",' >> stack_outputs.json
83 | echo ' "SolutionS3Bucket": "${SolutionS3BucketName}",' >> stack_outputs.json
84 | echo ' "S3InputDataPrefix": "${S3InputDataPrefix}",' >> stack_outputs.json
85 | echo ' "S3ProcessingJobOutputPrefix": "${S3ProcessingJobOutputPrefix}",' >> stack_outputs.json
86 | echo ' "S3TrainingJobOutputPrefix": "${S3TrainingJobOutputPrefix}",' >> stack_outputs.json
87 | echo ' "SageMakerProcessingJobContainerName": "${SageMakerProcessingJobContainerName}",' >> stack_outputs.json
88 | echo ' "SageMakerProcessingJobContainerBuild": "${SageMakerProcessingJobContainerCodeBuild}"' >> stack_outputs.json
89 | echo '}' >> stack_outputs.json
90 | sudo chown -R ec2-user:ec2-user .
91 | - SolutionsRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
92 | SolutionsRefSource: !FindInMap [SageMaker, Source, S3Key]
93 | Outputs:
94 | SourceCode:
95 | Description: "Open the Jupyter IDE. This authenticates you against Jupyter."
96 | Value: !Sub "https://console.aws.amazon.com/sagemaker/home?region=${AWS::Region}#/notebook-instances/openNotebook/${SolutionPrefix}-notebook-instance?view=classic"
97 | NotebookInstance:
98 | Description: "SageMaker Notebook instance to manually orchestrate data preprocessing and model training"
99 | Value: !Sub "https://${SolutionPrefix}-notebook-instance.notebook.${AWS::Region}.sagemaker.aws/notebooks/dgl-fraud-detection.ipynb"
--------------------------------------------------------------------------------
/deployment/sagemaker-permissions-stack.yaml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: "2010-09-09"
2 | Description: "(SA0002) - sagemaker-graph-fraud-detection SageMaker permissions stack"
3 | Parameters:
4 | SolutionPrefix:
5 | Description: Enter the name of the prefix for the solution used for naming
6 | Type: String
7 | Default: "sagemaker-soln-graph-fraud-detection"
8 | SolutionS3BucketName:
9 | Description: Enter the name of the S3 bucket for the solution
10 | Type: String
11 | Default: "sagemaker-soln-*"
12 | SolutionCodeBuildProject:
13 | Description: Enter the name of the Code build project for the solution
14 | Type: String
15 | Default: "sagemaker-soln-*"
16 | StackVersion:
17 | Description: Enter the name of the template stack version
18 | Type: String
19 | Default: "release"
20 |
21 | Mappings:
22 | S3:
23 | release:
24 | BucketPrefix: "sagemaker-solutions"
25 | development:
26 | BucketPrefix: "sagemaker-solutions-build"
27 |
28 | Resources:
29 | NotebookInstanceExecutionRole:
30 | Type: AWS::IAM::Role
31 | Properties:
32 | RoleName: !Sub "${SolutionPrefix}-${AWS::Region}-nb-role"
33 | AssumeRolePolicyDocument:
34 | Version: '2012-10-17'
35 | Statement:
36 | - Effect: Allow
37 | Principal:
38 | AWS:
39 | - !Sub "arn:aws:iam::${AWS::AccountId}:root"
40 | Service:
41 | - sagemaker.amazonaws.com
42 | - lambda.amazonaws.com
43 | - codebuild.amazonaws.com
44 | Action:
45 | - 'sts:AssumeRole'
46 | Metadata:
47 | cfn_nag:
48 | rules_to_suppress:
49 | - id: W28
50 | reason: Needs to be explicitly named to tighten launch permissions policy
51 |
52 | NotebookInstanceIAMPolicy:
53 | Type: AWS::IAM::Policy
54 | Properties:
55 | PolicyName: !Sub "${SolutionPrefix}-nb-instance-policy"
56 | Roles:
57 | - !Ref NotebookInstanceExecutionRole
58 | PolicyDocument:
59 | Version: '2012-10-17'
60 | Statement:
61 | - Effect: Allow
62 | Action:
63 | - sagemaker:CreateTrainingJob
64 | - sagemaker:DescribeTrainingJob
65 | - sagemaker:CreateProcessingJob
66 | - sagemaker:DescribeProcessingJob
67 | Resource:
68 | - !Sub "arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:*"
69 | - Effect: Allow
70 | Action:
71 | - ecr:GetAuthorizationToken
72 | - ecr:GetDownloadUrlForLayer
73 | - ecr:BatchGetImage
74 | - ecr:PutImage
75 | - ecr:BatchCheckLayerAvailability
76 | - ecr:CreateRepository
77 | - ecr:DescribeRepositories
78 | - ecr:InitiateLayerUpload
79 | - ecr:CompleteLayerUpload
80 | - ecr:UploadLayerPart
81 | - ecr:TagResource
82 | - ecr:DescribeImages
83 | - ecr:BatchDeleteImage
84 | Resource:
85 | - "*"
86 | - !Sub "arn:aws:ecr:${AWS::Region}:${AWS::AccountId}:repository/*"
87 | - Effect: Allow
88 | Action:
89 | - codebuild:BatchGetBuilds
90 | - codebuild:StartBuild
91 | Resource:
92 | - !Sub "arn:aws:codebuild:${AWS::Region}:${AWS::AccountId}:project/${SolutionCodeBuildProject}"
93 | - !Sub "arn:aws:codebuild:${AWS::Region}:${AWS::AccountId}:build/*"
94 | - Effect: Allow
95 | Action:
96 | - cloudwatch:PutMetricData
97 | - cloudwatch:GetMetricData
98 | - cloudwatch:GetMetricStatistics
99 | - cloudwatch:ListMetrics
100 | Resource:
101 | - !Sub "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:*"
102 | - Effect: Allow
103 | Action:
104 | - logs:CreateLogGroup
105 | - logs:CreateLogStream
106 | - logs:DescribeLogStreams
107 | - logs:GetLogEvents
108 | - logs:PutLogEvents
109 | Resource:
110 | - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/sagemaker/*"
111 | - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/*"
112 | - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/*"
113 | - Effect: Allow
114 | Action:
115 | - iam:PassRole
116 | Resource:
117 | - !GetAtt NotebookInstanceExecutionRole.Arn
118 | Condition:
119 | StringEquals:
120 | iam:PassedToService: sagemaker.amazonaws.com
121 | - Effect: Allow
122 | Action:
123 | - iam:GetRole
124 | Resource:
125 | - !GetAtt NotebookInstanceExecutionRole.Arn
126 | - Effect: Allow
127 | Action:
128 | - s3:ListBucket
129 | - s3:GetObject
130 | - s3:PutObject
131 | - s3:GetObjectVersion
132 | - s3:DeleteObject
133 | - s3:DeleteBucket
134 | Resource:
135 | - !Sub "arn:aws:s3:::${SolutionS3BucketName}"
136 | - !Sub "arn:aws:s3:::${SolutionS3BucketName}/*"
137 | - !Sub
138 | - "arn:aws:s3:::${SolutionRefBucketBase}-${Region}"
139 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
140 | Region: !Ref AWS::Region
141 | - !Sub
142 | - "arn:aws:s3:::${SolutionRefBucketBase}-${Region}/*"
143 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix]
144 | Region: !Ref AWS::Region
145 | - Effect: Allow
146 | Action:
147 | - s3:CreateBucket
148 | - s3:ListBucket
149 | - s3:GetObject
150 | - s3:GetObjectVersion
151 | - s3:PutObject
152 | - s3:DeleteObject
153 | Resource:
154 | - !Sub "arn:aws:s3:::sagemaker-${AWS::Region}-${AWS::AccountId}"
155 | - !Sub "arn:aws:s3:::sagemaker-${AWS::Region}-${AWS::AccountId}/*"
156 | Metadata:
157 | cfn_nag:
158 | rules_to_suppress:
159 | - id: W12
160 | reason: ECR GetAuthorizationToken is non resource-specific action
161 |
162 | Outputs:
163 | SageMakerRoleArn:
164 | Description: "SageMaker Execution Role for the solution"
165 | Value: !GetAtt NotebookInstanceExecutionRole.Arn
--------------------------------------------------------------------------------
/deployment/solution-assistant/requirements.in:
--------------------------------------------------------------------------------
1 | crhelper
2 |
--------------------------------------------------------------------------------
/deployment/solution-assistant/requirements.txt:
--------------------------------------------------------------------------------
1 | #
2 | # This file is autogenerated by pip-compile
3 | # To update, run:
4 | #
5 | # pip-compile requirements.in
6 | #
7 | crhelper==2.0.6 # via -r requirements.in
8 |
--------------------------------------------------------------------------------
/deployment/solution-assistant/solution-assistant.yaml:
--------------------------------------------------------------------------------
1 | AWSTemplateFormatVersion: 2010-09-09
2 | Description: Stack for Solution Helper resources.
3 | Parameters:
4 | SolutionPrefix:
5 | Description: Used as a prefix to name all stack resources.
6 | Type: String
7 | SolutionsRefBucketName:
8 | Description: Amazon S3 Bucket containing solutions
9 | Type: String
10 | SolutionS3BucketName:
11 | Description: Amazon S3 Bucket used to store trained model and data.
12 | Type: String
13 | ECRRepository:
14 | Description: Amazon ECR Repository containing container images for processing job.
15 | Type: String
16 | RoleArn:
17 | Description: Role to use for lambda resource
18 | Type: String
19 | Mappings:
20 | Function:
21 | SolutionAssistant:
22 | S3Key: "Fraud-detection-in-financial-networks/build/solution_assistant.zip"
23 | Resources:
24 | SolutionAssistant:
25 | Type: "Custom::SolutionAssistant"
26 | Properties:
27 | ServiceToken: !GetAtt SolutionAssistantLambda.Arn
28 | SolutionS3BucketName: !Ref SolutionS3BucketName
29 | ECRRepository: !Ref ECRRepository
30 | SolutionAssistantLambda:
31 | Type: AWS::Lambda::Function
32 | Properties:
33 | Handler: "lambda_function.handler"
34 | FunctionName: !Sub "${SolutionPrefix}-solution-assistant"
35 | Role: !Ref RoleArn
36 | Runtime: "python3.8"
37 | Code:
38 | S3Bucket: !Ref SolutionsRefBucketName
39 | S3Key: !FindInMap
40 | - Function
41 | - SolutionAssistant
42 | - S3Key
43 | Timeout : 60
44 | Metadata:
45 | cfn_nag:
46 | rules_to_suppress:
47 | - id: W58
48 | reason: Passed in role has cloudwatch write permissions
49 |
--------------------------------------------------------------------------------
/deployment/solution-assistant/src/lambda_function.py:
--------------------------------------------------------------------------------
1 | import boto3
2 | import sys
3 |
4 | sys.path.append('./site-packages')
5 | from crhelper import CfnResource
6 |
7 | helper = CfnResource()
8 |
9 |
10 | @helper.create
11 | def on_create(_, __):
12 | pass
13 |
14 | @helper.update
15 | def on_update(_, __):
16 | pass
17 |
18 |
19 | def delete_s3_objects(bucket_name):
20 | s3_resource = boto3.resource("s3")
21 | try:
22 | s3_resource.Bucket(bucket_name).objects.all().delete()
23 | print(
24 | "Successfully deleted objects in bucket "
25 | "called '{}'.".format(bucket_name)
26 | )
27 | except s3_resource.meta.client.exceptions.NoSuchBucket:
28 | print(
29 | "Could not find bucket called '{}'. "
30 | "Skipping delete.".format(bucket_name)
31 | )
32 |
33 | def delete_ecr_images(repository_name):
34 | ecr_client = boto3.client("ecr")
35 | try:
36 | images = ecr_client.describe_images(repositoryName=repository_name)
37 | image_details = images["imageDetails"]
38 | if len(image_details) > 0:
39 | image_ids = [
40 | {"imageDigest": i["imageDigest"]} for i in image_details
41 | ]
42 | ecr_client.batch_delete_image(
43 | repositoryName=repository_name, imageIds=image_ids
44 | )
45 | print(
46 | "Successfully deleted {} images from repository "
47 | "called '{}'. ".format(len(image_details), repository_name)
48 | )
49 | else:
50 | print(
51 | "Could not find any images in repository "
52 | "called '{}' not found. "
53 | "Skipping delete.".format(repository_name)
54 | )
55 | except ecr_client.exceptions.RepositoryNotFoundException:
56 | print(
57 | "Could not find repository called '{}' not found. "
58 | "Skipping delete.".format(repository_name)
59 | )
60 |
61 |
62 |
63 | def delete_s3_bucket(bucket_name):
64 | s3_resource = boto3.resource("s3")
65 | try:
66 | s3_resource.Bucket(bucket_name).delete()
67 | print(
68 | "Successfully deleted bucket "
69 | "called '{}'.".format(bucket_name)
70 | )
71 | except s3_resource.meta.client.exceptions.NoSuchBucket:
72 | print(
73 | "Could not find bucket called '{}'. "
74 | "Skipping delete.".format(bucket_name)
75 | )
76 |
77 |
78 | @helper.delete
79 | def on_delete(event, __):
80 |
81 | # delete ecr container repo
82 | repository_name = event["ResourceProperties"]["ECRRepository"]
83 | delete_ecr_images(repository_name)
84 |
85 | # remove files in s3 and delete bucket
86 | solution_bucket = event["ResourceProperties"]["SolutionS3BucketName"]
87 | delete_s3_objects(solution_bucket)
88 | delete_s3_bucket(solution_bucket)
89 |
90 |
91 | def handler(event, context):
92 | helper(event, context)
93 |
--------------------------------------------------------------------------------
/docs/arch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/docs/arch.png
--------------------------------------------------------------------------------
/docs/details.md:
--------------------------------------------------------------------------------
1 | # Fraud Detection with Graph Neural Networks
2 |
3 | ## Overview
4 |
5 | Graph Neural Networks (GNNs) have shown promising results in solving problems in various domains, from recommendations to fraud detection. For fraud detection, GNN techniques are especially powerful because they can learn representations of users, transactions and other entities in an inductive fashion. This means GNNs are able to model potentially fraudulent users and transactions based on both their features and their connectivity patterns in an interaction graph. This allows GNN based fraud detection methods to be resilient to evasive or camouflaging techniques that are employed by malicious users to fool rules-based systems or simple feature-based models. However, real-world applications of GNNs for fraud detection have been limited due to the complex infrastructure required to train GNNs on large graphs. This project addresses this issue by using Deep Graph Library (DGL) and Amazon SageMaker to manage the complexity of training a GNN on large graphs for fraud detection.
6 |
7 | DGL is an easy-to-use, high-performance and scalable Python package for deep learning on graphs. It supports the major deep learning frameworks (PyTorch, MXNet and TensorFlow) as backends. This project uses DGL to define the graph and implement the GNN models, and performs all of the modeling and training using Amazon SageMaker managed resources.
8 |
9 | ## Problem Description and GNN Formulation
10 |
11 | Many businesses lose billions annually to fraud, but machine learning based fraud detection models can help them predict, based on training data, which interactions or users are likely fraudulent or malicious, and avoid incurring those costs. In this project, we formulate the problem of fraud detection as a classification task, where the machine learning model is a Graph Neural Network that learns latent representations that can be easily separated into fraudulent and legitimate. The model is trained using historical transactions or interactions data that contains ground-truth labels for some of the transactions/users.
12 |
13 | The interaction data is assumed to be in the form of a relational table or a set of relational tables. The tables record interactions between a user and other users or other relevant entities. From this table, we extract all the different kinds of relations and create edge lists per relation type. In order to make the node representation learning inductive, we also assume that the data contains some attributes or information about the user. If the attributes are present, we use them to create initial feature vectors. When the interaction data is timestamped, we can also encode temporal attributes extracted from the interaction table into the user features to capture the temporal behavior of the user.
14 |
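As a rough illustration of the edge-list extraction described above, the sketch below derives one bipartite edge list per attribute column with pandas. The column and output file names here are only examples borrowed from the sample dataset, not a fixed schema required by the project.

```python
import pandas as pd

# Illustrative only: build one bipartite edge list per attribute column.
# 'TransactionID', 'card1' and 'P_emaildomain' mirror the example dataset;
# any relational table with an id column and attribute columns works the same way.
transactions = pd.read_csv("transaction.csv")

for col in ["card1", "P_emaildomain"]:
    edges = (transactions[["TransactionID", col]]
             .dropna()
             .drop_duplicates())  # keep each (transaction, entity) pair once
    edges.to_csv("relation_{}_edgelist.csv".format(col), index=False, header=False)
```
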
15 | Using the edge lists, we construct a heterogeneous graph which contains the user nodes and all the other node types corresponding to relevant entities in the edge lists. A heterogeneous graph is one where user/account nodes and possibly other entities have several kinds of distinct relationships. Examples of use cases that fall under this include the following (a minimal construction sketch follows the examples):
16 |
17 |
18 | * a financial network where users transact with other users as well as specific financial institutions or applications
19 | * a gaming network where users interact with other users but also with distinct games or devices
20 | * a social network where users can have different types of links to other users
21 |
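The sketch below shows, with made-up node ids and relation names, how such a heterogeneous graph can be assembled from edge lists using DGL's `heterograph` API; in the project, the edge lists come from the preprocessing output rather than being hard-coded.

```python
import dgl

# Hypothetical relation types and hard-coded edge lists, for illustration only.
graph = dgl.heterograph({
    ('user', 'uses', 'device'):    ([0, 1, 2], [0, 0, 1]),
    ('device', 'used_by', 'user'): ([0, 0, 1], [0, 1, 2]),
    ('user', 'interacts', 'user'): ([0, 2], [1, 0]),
})
print(graph)  # summary of node and edge counts per type
```
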
22 |
23 | Once the graph is constructed, we define an R-GCN model to learn representations for the graph nodes. The R-GCN module is connected to a fully connected neural network layer to perform classification based on the node representations learned by the R-GCN. The full model is trained end-to-end in a semi-supervised fashion, where the training loss is computed using only information from nodes that have labels.
24 |
25 | Overall, the project is divided into two modules:
26 |
27 |
28 | * The [first module](../source/sagemaker/data-preprocessing) uses [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to construct a heterogeneous graph with node features, from a relational table of transactions or interactions.
29 |
30 |
31 |
32 | * The [second module](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection) shows how to use DGL to define a Graph Neural Network model and train the model using [Amazon SageMaker training infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/deep-graph-library.html).
33 |
34 |
35 | To run the full project end-to-end, use the [Jupyter notebook](../source/sagemaker/dgl-fraud-detection.ipynb).
36 |
37 | ## Data Processing and Feature Engineering
38 |
39 | The data processing and feature engineering steps convert the data from a relational table into a set of edge lists and features for the user nodes.
40 |
41 | Amazon SageMaker Processing is used to perform the data processing and feature engineering. The Amazon SageMaker Processing ScriptProcessor requires a docker container with the processing environment and dependencies, and a processing script that defines the actual data processing implementation. All the artifacts necessary for building the processing environment docker container are in the [_container_ folder](../source/sagemaker/data-preprocessing/container). The [_Dockerfile_](../source/sagemaker/data-preprocessing/container/Dockerfile) specifies the content of the container. The only requirement for the data processing script is pandas, so that’s the only package installed in the container.
42 |
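As a rough sketch of how the container and script fit together, a ScriptProcessor job can be launched from the SageMaker Python SDK as shown below. The image URI, role ARN, S3 paths and argument values are placeholders, not values fixed by the project.

```python
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Placeholder image, role and S3 locations; replace with values from your deployment.
script_processor = ScriptProcessor(
    image_uri='<account>.dkr.ecr.<region>.amazonaws.com/<repository>:latest',
    command=['python3'],
    role='<sagemaker-execution-role-arn>',
    instance_count=1,
    instance_type='ml.m4.xlarge')

script_processor.run(
    code='data-preprocessing/graph_data_preprocessor.py',
    inputs=[ProcessingInput(source='s3://<bucket>/<raw-data-prefix>/',
                            destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output',
                              destination='s3://<bucket>/<processed-data-prefix>/')],
    arguments=['--id-cols', 'card1,card2,card3',
               '--cat-cols', 'M1,M2,M3'])
```
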
43 | The actual data processing script is [_graph_data_preprocessor.py_](../source/sagemaker/data-preprocessing/graph_data_preprocessor.py). The script accepts the transaction data and the identity attributes as input arguments and performs a train/test split, reserving the last fraction of transactions, controlled by the train-data-ratio argument, as test data. The script then extracts all the various relations from the relational table and performs deduplication before writing the relations to an output folder. The script also performs feature engineering to encode the categorical features and the temporal features. Finally, if the construct-homogeneous flag is passed in, the script also writes a homogeneous edge list, consisting only of edges between user nodes, to the output folder.
44 |
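A condensed, self-contained sketch of the categorical encoding and amount transform performed by the script is shown below; the tiny example frame and the chosen columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the transactions table.
transactions = pd.DataFrame({'TransactionAmt': [100.0, 25.5, 310.0],
                             'M1': ['T', 'F', 'T'],
                             'card4': ['visa', 'mastercard', 'visa']})

# One-hot encode categorical columns and log-transform the transaction amount.
features = pd.get_dummies(transactions, columns=['M1', 'card4']).fillna(0)
features['TransactionAmt'] = features['TransactionAmt'].apply(np.log10)
print(features)
```
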
45 | Once the SageMaker Processing instance finishes running the script, the files in the output folder are uploaded to S3 for graph modeling and training.
46 |
47 |
48 | ## Graph Modeling and Training
49 |
50 | The graph modeling and training code is implemented using DGL with MXNet as the backend framework and is designed to run on managed SageMaker training instances. The [_dgl_fraud_detection_ folder](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection) contains the code that is run on those training instances. The supported graph neural network models are defined in the [_model_ package](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model), and helper functions for graph construction are implemented in [_data.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/data.py). The graph sampling functions for mini-batch graph training are implemented in [_sampler.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/sampler.py), and [_utils.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/utils.py) contains utility functions. The entry point for graph modeling and training is [_train_dgl_mxnet_entry_point.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/train_dgl_mxnet_entry_point.py).
51 |
52 | The entry point script orchestrates the entire graph training process by going through the following steps (a condensed sketch of the semi-supervised training step follows the list):
53 |
54 |
55 | * Reading in the edge lists and user features to construct the graph using the DGLGraph or DGLHeteroGraph API
56 | * Reading in the labels for the target nodes and masking labels for target nodes that won't have labels during training
57 | * Creating the Graph Neural Network model
58 | * Initializing the DataLoader and the Graph Sampler if performing mini-batch graph training
59 | * Initializing the model parameters and training the model
60 |
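The sketch below condenses the semi-supervised training step, with a plain MXNet Gluon block standing in for the R-GCN encoder. The shapes, mask and hyperparameters are made up; in the real entry point the graph, features and labels are read from the preprocessed files and fed through the GNN.

```python
from mxnet import nd, gluon, autograd

# Made-up full-graph features, labels and train mask.
n_nodes, n_features, n_classes = 100, 16, 2
features = nd.random.randn(n_nodes, n_features)
labels = nd.random.randint(0, n_classes, shape=(n_nodes,)).astype('float32')
train_mask = nd.array([1.0] * 60 + [0.0] * 40)   # loss computed only on labelled nodes

# A plain MLP stands in here for the R-GCN encoder + classification layer.
model = gluon.nn.Sequential()
model.add(gluon.nn.Dense(64, activation='relu'), gluon.nn.Dense(n_classes))
model.initialize()

loss_fn = gluon.loss.SoftmaxCELoss()
trainer = gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': 1e-2})

for epoch in range(5):
    with autograd.record():
        logits = model(features)
        # Mask out unlabelled nodes so only labelled nodes contribute to the loss.
        loss = (loss_fn(logits, labels) * train_mask).sum() / train_mask.sum()
    loss.backward()
    trainer.step(batch_size=1)   # loss already averaged over labelled nodes
    print('epoch', epoch, 'loss', loss.asscalar())
```
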
61 |
62 | At the end of model training, the script saves the models and the metrics or predictions to the output folder, which gets uploaded to S3 before the SageMaker training instance is terminated.
63 |
64 |
65 |
66 | ## FAQ
67 |
68 | ### What is fraud detection?
69 |
70 | Fraud occurs when a malicious actor illicitly or deceitfully tries to obtain goods or services that a business provides. Fraud detection is a set of techniques for identifying fraudulent cases and distinguishing them from normal or legitimate cases. In this project, we model fraud detection as a semi-supervised learning process, where some users have already been labelled as fraudulent or legitimate, and other users have no labels during training. The task is to use the trained model to infer the labels for the unlabelled users.
71 |
72 | ### What are Graphs?
73 |
74 | Graphs are a data structure that can be used to represent relationships between entities. They are a convenient and flexible way of representing information about interacting entities, and can easily be used to model many real-world processes. A graph consists of a set of entities called nodes, where pairs of nodes are connected by links called edges. Many systems that exist in the world are networks that are naturally expressed as graphs. Graphs are directed if the edges have an orientation or are asymmetric, while undirected graphs have symmetric edges. A homogeneous graph consists of nodes and edges of one type, while a heterogeneous graph allows multiple node types and/or edge types.
75 |
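For example, the same tiny undirected graph can be written down as an edge list or as an adjacency list (the node ids here are arbitrary):

```python
# Toy 3-node path graph 0 - 1 - 2, for illustration only.
edges = [(0, 1), (1, 2)]                   # undirected edge list
adjacency = {0: [1], 1: [0, 2], 2: [1]}    # equivalent adjacency list
```
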
76 |
77 | ### What are Graph Neural Networks?
78 |
79 | Graph Neural Networks are a family of neural networks that use the graph structure directly to learn useful representations for nodes and edges in a graph and to solve graph-based tasks like node classification, link prediction or graph classification. Effectively, graph neural networks are message passing networks that learn node representations by using deep learning techniques to aggregate information from neighboring nodes and edges. Popular examples of Graph Neural Networks are [Graph Convolutional Networks (GCN)](https://arxiv.org/abs/1609.02907), [GraphSage](https://arxiv.org/abs/1706.02216), [Graph Attention Networks (GAT)](https://arxiv.org/abs/1710.10903) and [Relational Graph Convolutional Networks (R-GCN)](https://arxiv.org/abs/1703.06103).
80 |
81 |
82 | ### What makes Graph Neural Networks useful?
83 |
84 | One reason Graph Neural Networks are useful is that they can learn both inductive and transductive representations, whereas classical graph learning techniques, like random walks and graph factorizations, can only learn transductive representations. A transductive representation is one that applies specifically to a particular node instance, while an inductive representation is one that can leverage the features of the node, and change as the node features change, allowing for better generalization. Additionally, varying the depth of a Graph Neural Network allows the network to integrate topologically distant information into the representation of a particular node. Graph Neural Networks are also end-to-end differentiable, so they can be trained jointly with a downstream, task-specific model, which allows the downstream model to supervise and tailor the representations learned by the GNN.
85 |
86 |
87 | ### How do Graph Neural Networks work?
88 |
89 | Graph Neural Networks are trained, like other deep neural networks, by using gradient-based optimizers like SGD or Adam to learn network parameter values that optimize a particular loss function. As with other neural networks, this is performed by running a forward step - to compute the feature representations and the loss function, a backward step - to compute the gradients of the loss with respect to the network parameters, and an optimization step - to update the network parameter values with the computed gradients. Graph Neural Networks are unique in the forward step: they compute the intermediate representations by a process known as ‘message passing’. For a particular node, this involves using the graph structure to collect all or a subset of the neighboring nodes and edges. At each layer, the intermediate representations of the neighboring nodes and edges are aggregated into a single message, which is combined with the previous intermediate representation of the node to form the new node representation. At the earliest layers, a node’s representation is informed by its ego network - its immediate neighbors - but at later layers, the node’s representation is informed by the current representations of the node’s neighbors, which were themselves informed earlier by those neighbors’ neighbors, thus extending the sphere of influence to nodes that are multiple hops away.
90 |
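The toy sketch below runs a single round of mean-aggregation message passing on a 3-node path graph. The features and weight matrices are made up and edge features are ignored, but it shows the aggregate-then-combine pattern described above.

```python
import numpy as np

# Toy path graph 0 - 1 - 2 with 2-dimensional node features (made up).
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}

W_self, W_neigh = np.eye(2), 0.5 * np.eye(2)     # toy weight matrices
new_features = np.zeros_like(features)
for node, neighs in neighbors.items():
    message = features[neighs].mean(axis=0)       # aggregate neighbor features
    new_features[node] = np.maximum(              # ReLU(combine self + message)
        features[node] @ W_self + message @ W_neigh, 0)
print(new_features)
```
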
91 |
92 | ### What is an R-GCN Model?
93 |
94 | The [Relational Graph Convolutional Network (R-GCN)](https://arxiv.org/abs/1703.06103) model is a GNN that specifically models different edge types and node types differently during message passing and aggregation. Thus, it is especially effective for learning on heterogeneous graphs, and R-GCN is the default model used in this project for node representation learning. It is based on the simpler GCN architecture but adapted for multi-relational data.
95 |
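For reference, the per-layer node update from the R-GCN paper can be written as:

$$ h_i^{(l+1)} = \sigma\Big( W_0^{(l)} h_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \tfrac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} \Big) $$

where \(\mathcal{N}_i^r\) is the set of neighbors of node \(i\) under relation \(r\), \(W_r^{(l)}\) is the relation-specific weight matrix at layer \(l\), \(W_0^{(l)}\) is the self-loop weight, and \(c_{i,r}\) is a normalization constant such as \(|\mathcal{N}_i^r|\).
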
96 |
--------------------------------------------------------------------------------
/docs/overview_video_thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/docs/overview_video_thumbnail.png
--------------------------------------------------------------------------------
/iam/launch.json:
--------------------------------------------------------------------------------
1 | {
2 | "Version": "2012-10-17",
3 | "Statement": [
4 | {
5 | "Effect": "Allow",
6 | "Action": [
7 | "servicecatalog:*"
8 | ],
9 | "Resource": [
10 | "*"
11 | ]
12 | },
13 | {
14 | "Effect": "Allow",
15 | "Action": [
16 | "cloudformation:CreateStack"
17 | ],
18 | "Resource": [
19 | "arn:aws:cloudformation:*:*:stack/sagemaker-soln-*"
20 | ],
21 | "Condition" : {
22 | "ForAllValues:StringLike" : {
23 | "cloudformation:TemplateUrl" : [
24 | "https://s3.amazonaws.com/sagemaker-solutions-us-east-1/*",
25 | "https://s3-us-east-2.amazonaws.com/sagemaker-solutions-us-east-2/*",
26 | "https://s3-us-west-2.amazonaws.com/sagemaker-solutions-us-west-2/*",
27 | "https://s3.amazonaws.com/sagemaker-solutions-*",
28 | "https://s3.amazonaws.com/sc-*",
29 | "https://s3.us-east-2.amazonaws.com/sc-*",
30 | "https://s3.us-west-2.amazonaws.com/sc-*"
31 | ]
32 | }
33 | }
34 | },
35 | {
36 | "Effect": "Allow",
37 | "Action": [
38 | "cloudformation:DeleteStack"
39 | ],
40 | "Resource": "arn:aws:cloudformation:*:*:stack/sagemaker-soln-*"
41 | },
42 | {
43 | "Effect": "Allow",
44 | "Action": [
45 | "cloudformation:DescribeStackEvents",
46 | "cloudformation:DescribeStacks",
47 | "cloudformation:DescribeStackResource",
48 | "cloudformation:DescribeStackResources",
49 | "cloudformation:GetTemplate",
50 | "cloudformation:GetTemplateSummary",
51 | "cloudformation:ValidateTemplate",
52 | "cloudformation:ListStacks",
53 | "cloudformation:ListStackResources"
54 | ],
55 | "Resource": "*"
56 | },
57 | {
58 | "Action": [
59 | "s3:GetObject"
60 | ],
61 | "Resource": [
62 | "arn:aws:s3:::sc-*/*"
63 | ],
64 | "Effect": "Allow"
65 | },
66 | {
67 | "Effect": "Allow",
68 | "Action": [
69 | "iam:PassRole"
70 | ],
71 | "Resource": "arn:aws:iam::*:role/sagemaker-soln-*"
72 | },
73 | {
74 | "Effect": "Allow",
75 | "Action": [
76 | "codebuild:CreateProject",
77 | "codebuild:DeleteProject"
78 | ],
79 | "Resource": "arn:aws:codebuild:*:*:project/sagemaker-soln-*"
80 | },
81 | {
82 | "Effect": "Allow",
83 | "Action": [
84 | "lambda:AddPermission",
85 | "lambda:RemovePermission",
86 | "lambda:CreateFunction",
87 | "lambda:DeleteFunction",
88 | "lambda:GetFunction",
89 | "lambda:GetFunctionConfiguration",
90 | "lambda:InvokeFunction"
91 | ],
92 | "Resource": "arn:aws:lambda:*:*:function:sagemaker-soln-*"
93 | },
94 | {
95 | "Effect": "Allow",
96 | "Action": [
97 | "s3:CreateBucket",
98 | "s3:DeleteBucket",
99 | "s3:PutBucketNotification",
100 | "s3:PutBucketPublicAccessBlock",
101 | "s3:PutEncryptionConfiguration"
102 | ],
103 | "Resource": "arn:aws:s3:::sagemaker-soln-*"
104 | },
105 | {
106 | "Effect": "Allow",
107 | "Action": [
108 | "s3:GetObject"
109 | ],
110 | "Resource": "arn:aws:s3:::sagemaker-solutions-*/*"
111 | },
112 | {
113 | "Effect": "Allow",
114 | "Action": [
115 | "sagemaker:CreateNotebookInstance",
116 | "sagemaker:StopNotebookInstance",
117 | "sagemaker:DescribeNotebookInstance",
118 | "sagemaker:DeleteNotebookInstance"
119 | ],
120 | "Resource": "arn:aws:sagemaker:*:*:notebook-instance/sagemaker-soln-*"
121 | },
122 | {
123 | "Effect": "Allow",
124 | "Action": [
125 | "sagemaker:CreateNotebookInstanceLifecycleConfig",
126 | "sagemaker:DescribeNotebookInstanceLifecycleConfig",
127 | "sagemaker:DeleteNotebookInstanceLifecycleConfig"
128 | ],
129 | "Resource": "arn:aws:sagemaker:*:*:notebook-instance-lifecycle-config/sagemaker-soln-*"
130 | }
131 | ]
132 | }
--------------------------------------------------------------------------------
/iam/usage.json:
--------------------------------------------------------------------------------
1 | {
2 | "Version": "2012-10-17",
3 | "Statement": [
4 | {
5 | "Action": [
6 | "sagemaker:CreateTrainingJob",
7 | "sagemaker:DescribeTrainingJob",
8 | "sagemaker:CreateProcessingJob",
9 | "sagemaker:DescribeProcessingJob"
10 | ],
11 | "Resource": [
12 | "arn:aws:sagemaker:*:*:*"
13 | ],
14 | "Effect": "Allow"
15 | },
16 | {
17 | "Action": [
18 | "ecr:GetAuthorizationToken",
19 | "ecr:GetDownloadUrlForLayer",
20 | "ecr:BatchGetImage",
21 | "ecr:PutImage",
22 | "ecr:BatchCheckLayerAvailability",
23 | "ecr:CreateRepository",
24 | "ecr:DescribeRepositories",
25 | "ecr:InitiateLayerUpload",
26 | "ecr:CompleteLayerUpload",
27 | "ecr:UploadLayerPart",
28 | "ecr:TagResource",
29 | "ecr:DescribeImages",
30 | "ecr:BatchDeleteImage"
31 | ],
32 | "Resource": [
33 | "*",
34 | "arn:aws:ecr:*:*:repository/*"
35 | ],
36 | "Effect": "Allow"
37 | },
38 | {
39 | "Action": [
40 | "codebuild:BatchGetBuilds",
41 | "codebuild:StartBuild"
42 | ],
43 | "Resource": [
44 | "arn:aws:codebuild:*:*:project/sagemaker-soln-*",
45 | "arn:aws:codebuild:*:*:build/*"
46 | ],
47 | "Effect": "Allow"
48 | },
49 | {
50 | "Action": [
51 | "cloudwatch:PutMetricData",
52 | "cloudwatch:GetMetricData",
53 | "cloudwatch:GetMetricStatistics",
54 | "cloudwatch:ListMetrics"
55 | ],
56 | "Resource": [
57 | "arn:aws:cloudwatch:*:*:*"
58 | ],
59 | "Effect": "Allow"
60 | },
61 | {
62 | "Action": [
63 | "logs:CreateLogGroup",
64 | "logs:CreateLogStream",
65 | "logs:DescribeLogStreams",
66 | "logs:GetLogEvents",
67 | "logs:PutLogEvents"
68 | ],
69 | "Resource": [
70 | "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
71 | ],
72 | "Effect": "Allow"
73 | },
74 | {
75 | "Condition": {
76 | "StringEquals": {
77 | "iam:PassedToService": "sagemaker.amazonaws.com"
78 | }
79 | },
80 | "Action": [
81 | "iam:PassRole"
82 | ],
83 | "Resource": [
84 | "arn:aws:iam::*:role/sagemaker-soln-*"
85 | ],
86 | "Effect": "Allow"
87 | },
88 | {
89 | "Action": [
90 | "iam:GetRole"
91 | ],
92 | "Resource": [
93 | "arn:aws:iam::*:role/sagemaker-soln-*"
94 | ],
95 | "Effect": "Allow"
96 | },
97 | {
98 | "Action": [
99 | "s3:ListBucket",
100 | "s3:DeleteBucket",
101 | "s3:GetObject",
102 | "s3:PutObject",
103 | "s3:DeleteObject"
104 | ],
105 | "Resource": [
106 | "arn:aws:s3:::sagemaker-solutions-*",
107 | "arn:aws:s3:::sagemaker-solutions-*/*",
108 | "arn:aws:s3:::sagemaker-soln-*",
109 | "arn:aws:s3:::sagemaker-soln-*/*"
110 | ],
111 | "Effect": "Allow"
112 | },
113 | {
114 | "Action": [
115 | "s3:CreateBucket",
116 | "s3:ListBucket",
117 | "s3:GetObject",
118 | "s3:PutObject",
119 | "s3:DeleteObject"
120 | ],
121 | "Resource": [
122 | "arn:aws:s3:::sagemaker-*",
123 | "arn:aws:s3:::sagemaker-*/*"
124 | ],
125 | "Effect": "Allow"
126 | }
127 | ]
128 | }
--------------------------------------------------------------------------------
/metadata/metadata.json:
--------------------------------------------------------------------------------
1 | {
2 | "id": "sagemaker-soln-gfd",
3 | "name": "Amazon SageMaker and Deep Graph Library for Fraud Detection in Heterogeneous Graphs",
4 | "shortName": "Graph Fraud Detection",
5 | "priority": 0,
6 | "desc": "Many online businesses lose billions annually to fraud, but machine learning based fraud detection models can help businesses predict what interactions or users are likely fraudulent and save them from incurring those costs. In this project, we formulate the problem of fraud detection as a classification task on a heterogeneous interaction network. The machine learning model is a Graph Neural Network (GNN) that learns latent representations of users or transactions which can then be easily separated into fraud or legitimate.",
7 | "meta": "fraud classification detection graph gnn network interaction",
8 | "tags": ["financial services", "fraud detection", "internet retail"],
9 | "parameters": [
10 | {
11 | "name": "SolutionPrefix",
12 | "type": "text",
13 | "default": "sagemaker-soln-graph-fraud"
14 | },
15 | {
16 | "name": "IamRole",
17 | "type": "text",
18 | "default": ""
19 | },
20 | {
21 | "name": "S3HistoricalTransactionsPrefix",
22 | "type": "text",
23 | "default": "raw-data"
24 | },
25 | {
26 | "name": "S3ProcessingJobInputPrefix",
27 | "type": "text",
28 | "default": "processing-input"
29 | },
30 | {
31 | "name": "S3ProcessingJobOutputPrefix",
32 | "type": "text",
33 | "default": "preprocessed-data"
34 | },
35 | {
36 | "name": "S3TrainingJobOutputPrefix",
37 | "type": "text",
38 | "default": "training-output"
39 | },
40 | {
41 | "name": "CreateSageMakerNotebookInstance",
42 | "type": "text",
43 | "default": "false"
44 | },
45 | {
46 | "name": "BuildSageMakerContainersRemotely",
47 | "type": "text",
48 | "default": "true"
49 | },
50 | {
51 | "name": "SageMakerProcessingJobContainerName",
52 | "type": "text",
53 | "default": "sagemaker-soln-graph-fraud-preprocessing"
54 | },
55 | {
56 | "name": "SageMakerProcessingJobInstanceType",
57 | "type": "text",
58 | "default": "ml.m4.xlarge"
59 | },
60 | {
61 | "name": "SageMakerTrainingJobInstanceType",
62 | "type": "text",
63 | "default": "ml.p3.2xlarge"
64 | },
65 | {
66 | "name": "SageMakerNotebookInstanceType",
67 | "type": "text",
68 | "default": "ml.m4.xlarge"
69 | },
70 | {
71 | "name": "StackVersion",
72 | "type": "text",
73 | "default": "release"
74 | }
75 | ],
76 | "acknowledgements": ["CAPABILITY_IAM","CAPABILITY_NAMED_IAM"],
77 | "cloudFormationTemplate": "s3-us-east-2.amazonaws.com/sagemaker-solutions-build-us-east-2/Fraud-detection-in-financial-networks/deployment/sagemaker-graph-fraud-detection.yaml",
78 | "serviceCatalogProduct": "TBD",
79 | "copyS3Source": "sagemaker-solutions-build-us-east-2",
80 | "copyS3SourcePrefix": "Fraud-detection-in-financial-networks/source/sagemaker",
81 | "notebooksDirectory": "Fraud-detection-in-financial-networks/source/sagemaker",
82 | "notebookPaths": [
83 | "Fraud-detection-in-financial-networks/source/sagemaker/dgl-fraud-detection.ipynb"
84 | ],
85 | "permissions": "TBD"
86 | }
--------------------------------------------------------------------------------
/source/lambda/data-preprocessing/index.py:
--------------------------------------------------------------------------------
1 | import os
2 | import boto3
3 | import json
4 | from time import strftime, gmtime
5 |
6 | s3_client = boto3.client('s3')
7 | S3_BUCKET = os.environ['processing_job_s3_bucket']
8 | DATA_PREFIX = os.environ['processing_job_s3_raw_data_key']
9 | INPUT_PREFIX = os.environ['processing_job_input_s3_prefix']
10 | OUTPUT_PREFIX = os.environ['processing_job_output_s3_prefix']
11 | INSTANCE_TYPE = os.environ['processing_job_instance_type']
12 | ROLE_ARN = os.environ['processing_job_role_arn']
13 | IMAGE_URI = os.environ['processing_job_ecr_repository']
14 |
15 |
16 | def process_event(event, context):
17 | print(event)
18 |
19 | event_source_s3 = event['Records'][0]['s3']
20 |
21 | print("S3 Put event source: {}".format(get_full_path(event_source_s3)))
22 | timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
23 |
24 | inputs = prepare_preprocessing_inputs(timestamp)
25 | outputs = prepare_preprocessing_output(event_source_s3, timestamp)
26 | response = run_preprocessing_job(inputs, outputs, timestamp)
27 |
28 | print(response)
29 | return response
30 |
31 |
32 | def prepare_preprocessing_inputs(timestamp, s3_bucket=S3_BUCKET, data_prefix=DATA_PREFIX, input_prefix=INPUT_PREFIX):
33 | print("Preparing Inputs")
34 | key = os.path.join(input_prefix, timestamp)
35 |
36 | print("Copying raw data from {} to {}".format(get_full_s3_path(s3_bucket, data_prefix),
37 | get_full_s3_path(s3_bucket, key)))
38 | objects = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=data_prefix)
39 | files = [content['Key'] for content in objects['Contents']]
40 | verify(files)
41 |
42 | for file in files:
43 | dest = os.path.join(key, os.path.basename(file))
44 | s3_client.copy({'Bucket': s3_bucket, 'Key': file}, s3_bucket, dest)
45 |
46 | return get_full_s3_path(s3_bucket, key)
47 |
48 |
49 | def prepare_preprocessing_output(event_source_s3, timestamp, s3_bucket=S3_BUCKET, output_prefix=OUTPUT_PREFIX):
50 | print("Preparing Output")
51 | copy_source = {
52 | 'Bucket': event_source_s3['bucket']['name'],
53 | 'Key': event_source_s3['object']['key']
54 | }
55 |
56 | key = os.path.join(output_prefix, timestamp, os.path.basename(event_source_s3['object']['key']))
57 |
58 | destination = get_full_s3_path(s3_bucket, key)
59 | print("Copying new accounts from {} to {}".format(get_full_path(event_source_s3), destination))
60 | s3_client.copy(copy_source, s3_bucket, key)
61 | return get_full_s3_path(s3_bucket, os.path.join(output_prefix, timestamp))
62 |
63 |
64 | def verify(files, s3_bucket=S3_BUCKET, data_prefix=DATA_PREFIX, expected_files=['transaction.csv', 'identity.csv']):
65 | if not all([file in list(map(os.path.basename, files)) for file in expected_files]):
66 | raise Exception("Raw data absent or incomplete in {}".format(get_full_s3_path(
67 | s3_bucket, data_prefix)))
68 |
69 |
70 | def get_full_s3_path(bucket, key):
71 | return os.path.join('s3://', bucket, key)
72 |
73 |
74 | def get_full_path(event_source_s3):
75 | return get_full_s3_path(event_source_s3['bucket']['name'], event_source_s3['object']['key'])
76 |
77 |
78 | def run_preprocessing_job(input,
79 | output,
80 | timestamp,
81 | s3_bucket=S3_BUCKET,
82 | input_prefix=INPUT_PREFIX,
83 | instance_type=INSTANCE_TYPE,
84 | image_uri=IMAGE_URI
85 | ):
86 | print("Creating SageMaker Processing job with inputs from {} and outputs to {}".format(input, output))
87 |
88 | sagemaker_client = boto3.client('sagemaker')
89 |
90 | region = boto3.session.Session().region_name
91 | account_id = boto3.client('sts').get_caller_identity().get('Account')
92 | ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, image_uri)
93 |
94 | # upload code
95 | code_file = 'data-preprocessing/graph_data_preprocessor.py'
96 | code_file_s3_key = os.path.join(input_prefix, timestamp, code_file)
97 | s3_client.upload_file(code_file, s3_bucket, code_file_s3_key)
98 |
99 | entrypoint = ["python3"] + [os.path.join("/opt/ml/processing/input/code",
100 | os.path.basename(code_file))]
101 |
102 | app_spec = {
103 | 'ImageUri': ecr_repository_uri,
104 | 'ContainerEntrypoint': entrypoint,
105 | 'ContainerArguments': ['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',
106 | '--cat-cols','M1,M2,M3,M4,M5,M6,M7,M8,M9']
107 | }
108 |
109 | processing_inputs = [
110 | {
111 | 'InputName': 'input1',
112 | 'S3Input': {
113 | 'S3Uri': input,
114 | 'LocalPath': '/opt/ml/processing/input',
115 | 'S3DataType': 'S3Prefix',
116 | 'S3InputMode': 'File',
117 | }
118 | },
119 | {
120 | 'InputName': 'code',
121 | 'S3Input': {
122 | 'S3Uri': get_full_s3_path(s3_bucket, code_file_s3_key),
123 | 'LocalPath': '/opt/ml/processing/input/code',
124 | 'S3DataType': 'S3Prefix',
125 | 'S3InputMode': 'File',
126 | }
127 | },
128 |
129 | ]
130 | processing_output = {'Outputs': [{'OutputName': 'output1',
131 | 'S3Output': {'S3Uri': output,
132 | 'LocalPath': '/opt/ml/processing/output',
133 | 'S3UploadMode': 'EndOfJob'}
134 | }]}
135 |
136 | processing_job_name = "sagemaker-graph-fraud-data-processing-{}".format(timestamp)
137 | resources = {
138 | 'ClusterConfig': {
139 | 'InstanceCount': 1,
140 | 'InstanceType': instance_type,
141 | 'VolumeSizeInGB': 30
142 | }
143 | }
144 |
145 | network_config = {'EnableNetworkIsolation': False}
146 | stopping_condition = {'MaxRuntimeInSeconds': 3600}
147 |
148 | response = sagemaker_client.create_processing_job(ProcessingInputs=processing_inputs,
149 | ProcessingOutputConfig=processing_output,
150 | ProcessingJobName=processing_job_name,
151 | ProcessingResources=resources,
152 | StoppingCondition=stopping_condition,
153 | AppSpecification=app_spec,
154 | NetworkConfig=network_config,
155 | RoleArn=ROLE_ARN)
156 | return response
157 |
--------------------------------------------------------------------------------
/source/lambda/graph-modelling/index.py:
--------------------------------------------------------------------------------
1 | import os
2 | import boto3
3 | import json
4 | import tarfile
5 | from time import strftime, gmtime
6 |
7 | s3_client = boto3.client('s3')
8 | S3_BUCKET = os.environ['training_job_s3_bucket']
9 | OUTPUT_PREFIX = os.environ['training_job_output_s3_prefix']
10 | INSTANCE_TYPE = os.environ['training_job_instance_type']
11 | ROLE_ARN = os.environ['training_job_role_arn']
12 |
13 |
14 | def process_event(event, context):
15 | print(event)
16 |
17 | event_source_s3 = event['Records'][0]['s3']
18 |
19 | print("S3 Put event source: {}".format(get_full_path(event_source_s3)))
20 | train_input, msg = verify_modelling_inputs(event_source_s3)
21 | print(msg)
22 | if not train_input:
23 | return msg
24 | timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
25 | response = run_modelling_job(timestamp, train_input)
26 | return response
27 |
28 |
29 | def verify_modelling_inputs(event_source_s3):
30 | training_kickoff_signal = "tags.csv"
31 | if training_kickoff_signal not in event_source_s3['object']['key']:
32 | msg = "Event source was not the training signal. Triggered by {} but expected folder to contain {}"
33 | return False, msg.format(get_full_s3_path(event_source_s3['bucket']['name'], event_source_s3['object']['key']),
34 | training_kickoff_signal)
35 |
36 | training_folder = os.path.dirname(event_source_s3['object']['key'])
37 | full_s3_training_folder = get_full_s3_path(event_source_s3['bucket']['name'], training_folder)
38 |
39 | objects = s3_client.list_objects_v2(Bucket=event_source_s3['bucket']['name'], Prefix=training_folder)
40 | files = [content['Key'] for content in objects['Contents']]
41 | print("Contents of training data folder :")
42 | print("\n".join(files))
43 | minimum_expected_files = ['features.csv', 'tags.csv']
44 |
45 | if not all([file in [os.path.basename(s3_file) for s3_file in files] for file in minimum_expected_files]):
46 | return False, "Training data absent or incomplete in {}".format(full_s3_training_folder)
47 |
48 | return full_s3_training_folder, "Minimum files needed for training present in {}".format(full_s3_training_folder)
49 |
50 |
51 | def run_modelling_job(timestamp,
52 | train_input,
53 | s3_bucket=S3_BUCKET,
54 | train_out_prefix=OUTPUT_PREFIX,
55 | train_job_prefix='sagemaker-graph-fraud-model-training',
56 | train_source_dir='dgl_fraud_detection',
57 | train_entry_point='train_dgl_mxnet_entry_point.py',
58 | framework='mxnet',
59 | framework_version='1.6.0',
60 | xpu='gpu',
61 | python_version='py3',
62 | instance_type=INSTANCE_TYPE
63 | ):
64 | print("Creating SageMaker Training job with inputs from {}".format(train_input))
65 |
66 | sagemaker_client = boto3.client('sagemaker')
67 | region = boto3.session.Session().region_name
68 |
69 | container = "763104351884.dkr.ecr.{}.amazonaws.com/{}-training:{}-{}-{}".format(region,
70 | framework,
71 | framework_version,
72 | xpu,
73 | python_version)
74 |
75 | training_job_name = "{}-{}".format(train_job_prefix, timestamp)
76 |
77 | code_path = tar_and_upload_to_s3(train_source_dir,
78 | s3_bucket,
79 | os.path.join(train_out_prefix, training_job_name, 'source'))
80 |
81 | framework_params = {
82 | 'sagemaker_container_log_level': str(20),
83 | 'sagemaker_enable_cloudwatch_metrics': 'false',
84 | 'sagemaker_job_name': json.dumps(training_job_name),
85 | 'sagemaker_program': json.dumps(train_entry_point),
86 | 'sagemaker_region': json.dumps(region),
87 | 'sagemaker_submit_directory': json.dumps(code_path)
88 | }
89 |
90 | model_params = {
91 | 'nodes': 'features.csv',
92 | 'edges': 'relation*',
93 | 'labels': 'tags.csv',
94 | 'model': 'rgcn',
95 | 'num-gpus': 1,
96 | 'batch-size': 10000,
97 | 'embedding-size': 64,
98 | 'n-neighbors': 1000,
99 | 'n-layers': 2,
100 | 'n-epochs': 10,
101 | 'optimizer': 'adam',
102 | 'lr': 1e-2
103 | }
104 | model_params = {k: json.dumps(str(v)) for k, v in model_params.items()}
105 |
106 | model_params.update(framework_params)
107 |
108 | train_params = \
109 | {
110 | 'TrainingJobName': training_job_name,
111 | "AlgorithmSpecification": {
112 | "TrainingImage": container,
113 | "TrainingInputMode": "File"
114 | },
115 | "RoleArn": ROLE_ARN,
116 | "OutputDataConfig": {
117 | "S3OutputPath": get_full_s3_path(s3_bucket, train_out_prefix)
118 | },
119 | "ResourceConfig": {
120 | "InstanceCount": 1,
121 | "InstanceType": instance_type,
122 | "VolumeSizeInGB": 30
123 | },
124 | "HyperParameters": model_params,
125 | "StoppingCondition": {
126 | "MaxRuntimeInSeconds": 86400
127 | },
128 | "InputDataConfig": [
129 | {
130 | "ChannelName": "train",
131 | "DataSource": {
132 | "S3DataSource": {
133 | "S3DataType": "S3Prefix",
134 | "S3Uri": train_input,
135 | "S3DataDistributionType": "FullyReplicated"
136 | }
137 | },
138 | },
139 | ]
140 | }
141 |
142 | response = sagemaker_client.create_training_job(**train_params)
143 | return response
144 |
145 |
146 | def tar_and_upload_to_s3(source, s3_bucket, s3_key):
147 | filename = "/tmp/sourcedir.tar.gz"
148 | with tarfile.open(filename, mode="w:gz") as t:
149 | for file in os.listdir(source):
150 | t.add(os.path.join(source, file), arcname=file)
151 |
152 | s3_client.upload_file(filename, s3_bucket, os.path.join(s3_key, 'sourcedir.tar.gz'))
153 |
154 | return get_full_s3_path(s3_bucket, os.path.join(s3_key, 'sourcedir.tar.gz'))
155 |
156 |
157 | def get_full_s3_path(bucket, key):
158 | return os.path.join('s3://', bucket, key)
159 |
160 |
161 | def get_full_path(event_source_s3):
162 | return get_full_s3_path(event_source_s3['bucket']['name'], event_source_s3['object']['key'])
163 |
--------------------------------------------------------------------------------
/source/sagemaker/baselines/mlp-fraud-baseline.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Baseline MLP Model\n",
8 | "## Training a Feed-foward Neural Network with node features"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": null,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "from utils import get_data\n",
18 | "import os\n",
19 | "os.chdir(\"../\")\n",
20 | "\n",
21 | "import pandas as pd\n",
22 | "import numpy as np\n",
23 | "\n",
24 | "!bash setup.sh"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Read data and upload to S3"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import io\n",
41 | "from scipy.sparse import csr_matrix, save_npz \n",
42 | "\n",
43 | "train_X, train_y, test_X, test_y = get_data()\n",
44 | "\n",
45 | "train_X.loc[:, 0] = train_y.values\n",
46 | "sparse_matrix = csr_matrix(train_X.values)\n",
47 | "filename = 'mlp-fraud-dataset.npz'\n",
48 | "save_npz(filename, sparse_matrix, compressed=True)"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {},
55 | "outputs": [],
56 | "source": [
57 | "import os\n",
58 | "import sagemaker\n",
59 | "from sagemaker.s3 import S3Uploader\n",
60 | "\n",
61 | "from sagemaker_graph_fraud_detection import config\n",
62 | "\n",
63 | "role = config.role\n",
64 | "\n",
65 | "session = sagemaker.Session()\n",
66 | "bucket = config.solution_bucket\n",
67 | "prefix = 'mlp-fraud-detection'\n",
68 | "\n",
69 | "s3_train_data = S3Uploader.upload(filename, 's3://{}/{}/{}'.format(bucket, prefix,'train'))\n",
70 | "print('Uploaded training data location: {}'.format(s3_train_data))\n",
71 | "\n",
72 | "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n",
73 | "print('Training artifacts will be uploaded to: {}'.format(output_location))"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "## Train SageMaker MXNet Estimator"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "from sagemaker import get_execution_role\n",
90 | "from sagemaker.mxnet import MXNet\n",
91 | "\n",
92 | "params = {'num-gpus': 1,\n",
93 | " 'n-layers': 5,\n",
94 | " 'n-epochs': 100,\n",
95 | " 'optimizer': 'adam',\n",
96 | " 'lr': 1e-2\n",
97 | " } \n",
98 | "\n",
99 | "mlp = MXNet(entry_point='baselines/mlp_fraud_entry_point.py',\n",
100 | " role=role, \n",
101 | " train_instance_count=1, \n",
102 | " train_instance_type='ml.p3.2xlarge',\n",
103 | " framework_version=\"1.4.1\",\n",
104 | " py_version='py3',\n",
105 | " hyperparameters=params,\n",
106 | " output_path=output_location,\n",
107 | " code_location=output_location,\n",
108 | " sagemaker_session=session)\n",
109 | "\n",
110 | "mlp.fit({'train': s3_train_data})"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {
117 | "scrolled": true
118 | },
119 | "outputs": [],
120 | "source": [
121 | "from sagemaker.predictor import json_serializer\n",
122 | "\n",
123 | "predictor = mlp.deploy(initial_instance_count=1,\n",
124 | " endpoint_name=\"mlp-fraud-endpoint\",\n",
125 | " instance_type='ml.p3.2xlarge')\n",
126 | "\n",
127 | "# Specify input and output formats.\n",
128 | "predictor.content_type = 'text/json'\n",
129 | "predictor.serializer = json_serializer\n",
130 | "predictor.deserializer = None"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
139 | "import json\n",
140 | "\n",
141 | "def predict(current_predictor, data, rows=500):\n",
142 | " split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n",
143 | " predictions = []\n",
144 | " for array in split_array:\n",
145 | " predictions.append(np.array(json.loads(current_predictor.predict(array.tolist()))))\n",
146 | " return np.concatenate(tuple(predictions), axis=0)\n",
147 | "\n",
148 | "raw_preds = predict(predictor, test_X.values[:, 1:])\n",
149 | "y_preds = np.where(raw_preds > 0.5, 1, 0)"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "from sklearn.metrics import confusion_matrix, roc_curve, auc\n",
159 | "from matplotlib import pyplot as plt\n",
160 | "%matplotlib inline\n",
161 | "\n",
162 | "def print_metrics(y_true, y_predicted):\n",
163 | "\n",
164 | " cm = confusion_matrix(y_true, y_predicted)\n",
165 | " true_neg, false_pos, false_neg, true_pos = cm.ravel()\n",
166 | " cm = pd.DataFrame(np.array([[true_pos, false_pos], [false_neg, true_neg]]),\n",
167 | " columns=[\"labels positive\", \"labels negative\"],\n",
168 | " index=[\"predicted positive\", \"predicted negative\"])\n",
169 | " \n",
170 | " acc = (true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg)\n",
171 | " precision = true_pos/(true_pos + false_pos) if (true_pos + false_pos) > 0 else 0\n",
172 | " recall = true_pos/(true_pos + false_neg) if (true_pos + false_neg) > 0 else 0\n",
173 | " f1 = 2*(precision*recall)/(precision + recall) if (precision + recall) > 0 else 0\n",
174 | " print(\"Confusion Matrix:\")\n",
175 | " print(pd.DataFrame(cm, columns=[\"labels positive\", \"labels negative\"], \n",
176 | " index=[\"predicted positive\", \"predicted negative\"]))\n",
177 | " print(\"f1: {:.4f}, precision: {:.4f}, recall: {:.4f}, acc: {:.4f}\".format(f1, precision, recall, acc))\n",
178 | " print()\n",
179 | " \n",
180 | "def plot_roc_curve(fpr, tpr, roc_auc):\n",
181 | " f = plt.figure()\n",
182 | " lw = 2\n",
183 | " plt.plot(fpr, tpr, color='darkorange',\n",
184 | " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n",
185 | " plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n",
186 | " plt.xlim([0.0, 1.0])\n",
187 | " plt.ylim([0.0, 1.05])\n",
188 | " plt.xlabel('False Positive Rate')\n",
189 | " plt.ylabel('True Positive Rate')\n",
190 | " plt.title('Model ROC curve')\n",
191 | " plt.legend(loc=\"lower right\")\n",
192 | "\n",
193 | "print_metrics(test_y, y_preds)\n",
194 | "fpr, tpr, _ = roc_curve(test_y, y_preds)\n",
195 | "roc_auc = auc(fpr, tpr)\n",
196 | "plot_roc_curve(fpr, tpr, roc_auc)"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "predictor.delete_endpoint()"
206 | ]
207 | }
208 | ],
209 | "metadata": {
210 | "kernelspec": {
211 | "display_name": "conda_python3",
212 | "language": "python",
213 | "name": "conda_python3"
214 | },
215 | "language_info": {
216 | "codemirror_mode": {
217 | "name": "ipython",
218 | "version": 3
219 | },
220 | "file_extension": ".py",
221 | "mimetype": "text/x-python",
222 | "name": "python",
223 | "nbconvert_exporter": "python",
224 | "pygments_lexer": "ipython3",
225 | "version": "3.6.10"
226 | }
227 | },
228 | "nbformat": 4,
229 | "nbformat_minor": 2
230 | }
231 |
--------------------------------------------------------------------------------
/source/sagemaker/baselines/mlp_fraud_entry_point.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import pickle
4 | import argparse
5 | import json
6 | import logging
7 | import numpy as np
8 | import mxnet as mx
9 | from mxnet import nd, gluon, autograd
10 | from scipy.sparse import load_npz
11 |
12 |
13 | def parse_args():
14 | parser = argparse.ArgumentParser()
15 |
16 | parser.add_argument('--training-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
17 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
18 | parser.add_argument('--num-gpus', type=int, default=1)
19 | parser.add_argument('--batch-size', type=int, default=200000)
20 | parser.add_argument('--optimizer', type=str, default='adam')
21 | parser.add_argument('--lr', type=float, default=1e-2)
22 | parser.add_argument('--n-epochs', type=int, default=20)
23 | parser.add_argument('--n-hidden', type=int, default=64, help='number of hidden units')
24 | parser.add_argument('--n-layers', type=int, default=5, help='number of hidden layers')
25 | parser.add_argument('--weight-decay', type=float, default=5e-4, help='Weight for L2 loss')
26 |
27 | return parser.parse_args()
28 |
29 |
30 | def get_logger(name):
31 | logger = logging.getLogger(name)
32 | log_format = '%(asctime)s %(levelname)s %(name)s: %(message)s'
33 | logging.basicConfig(format=log_format, level=logging.INFO)
34 | return logger
35 |
36 |
37 | def get_data():
38 | filename = 'mlp-fraud-dataset.npz'
39 | matrix = load_npz(os.path.join(args.training_dir, filename)).toarray().astype('float32')
40 | scale_pos_weight = np.sqrt((matrix.shape[0] - matrix[:, 0].sum()) / matrix[:, 0].sum())
41 | weight_mask = np.ones(matrix.shape[0]).astype('float32')
42 | weight_mask[np.where(matrix[:, 0])] = scale_pos_weight
43 | dataloader = gluon.data.DataLoader(gluon.data.ArrayDataset(matrix[:, 1:], matrix[:, 0], weight_mask),
44 | args.batch_size,
45 | shuffle=True,
46 | last_batch='keep')
47 | return dataloader, matrix.shape[0]
48 |
49 |
50 | def evaluate(model, dataloader, ctx):
51 | f1 = mx.metric.F1()
52 | for data, label, mask in dataloader:
53 | pred = model(data.as_in_context(ctx))
54 | f1.update(label.as_in_context(ctx), nd.softmax(pred, axis=1))
55 | return f1.get()[1]
56 |
57 |
58 | def train(model, trainer, loss, train_data, ctx):
59 | duration = []
60 | for epoch in range(args.n_epochs):
61 | tic = time.time()
62 | loss_val = 0.
63 |
64 | for features, labels, weight_mask in train_data:
65 | with autograd.record():
66 | pred = model(features.as_in_context(ctx))
67 | l = loss(pred, labels.as_in_context(ctx), mx.nd.expand_dims(weight_mask, 1).as_in_context(ctx)).sum()
68 | l.backward()
69 | trainer.step(args.batch_size)
70 |
71 | loss_val += l.asscalar()
72 | duration.append(time.time() - tic)
73 | f1 = evaluate(model, train_data, ctx)
74 | logging.info("Epoch {:05d} | Time(s) {:.4f} | Loss {:.4f} | F1 Score {:.4f} ".format(
75 | epoch, np.mean(duration), loss_val / n, f1))
76 | save_model(model)
77 | return model
78 |
79 |
80 | def save_model(model):
81 | model.save_parameters(os.path.join(args.model_dir, 'model.params'))
82 | with open(os.path.join(args.model_dir, 'model_hyperparams.pkl'), 'wb') as f:
83 | pickle.dump(args, f)
84 |
85 |
86 | def get_model(model_dir, ctx, n_classes=2, load_stored=False):
87 | if load_stored: # load using saved model state
88 | with open(os.path.join(model_dir, 'model_hyperparams.pkl'), 'rb') as f:
89 | hyperparams = pickle.load(f)
90 | else:
91 | hyperparams = args
92 |
93 | model = gluon.nn.Sequential()
94 | for _ in range(hyperparams.n_layers):
95 | model.add(gluon.nn.Dense(hyperparams.n_hidden, activation='relu'))
96 | model.add(gluon.nn.Dense(n_classes))
97 |
98 | if load_stored:
99 | model.load_parameters(os.path.join(model_dir, 'model.params'), ctx=ctx)
100 | else:
101 | model.initialize(ctx=ctx)
102 | return model
103 |
104 |
105 | if __name__ == '__main__':
106 | logging = get_logger(__name__)
107 | logging.info('numpy version:{} MXNet version:{}'.format(np.__version__, mx.__version__))
108 |
109 | args = parse_args()
110 |
111 | train_data, n = get_data()
112 |
113 | ctx = mx.gpu(0) if args.num_gpus else mx.cpu(0)
114 |
115 | model = get_model(args.model_dir, ctx, n_classes=2)
116 |
117 | logging.info(model)
118 | logging.info(model.collect_params())
119 |
120 | loss = gluon.loss.SoftmaxCELoss()
121 | trainer = gluon.Trainer(model.collect_params(), args.optimizer, {'learning_rate': args.lr, 'wd': args.weight_decay})
122 |
123 | logging.info("Starting Model training")
124 | model = train(model, trainer, loss, train_data, ctx)
125 |
126 |
127 | # ------------------------------------------------------------ #
128 | # Hosting methods #
129 | # ------------------------------------------------------------ #
130 |
131 | def model_fn(model_dir):
132 | ctx = mx.gpu(0) if list(mx.test_utils.list_gpus()) else mx.cpu(0)
133 | net = get_model(model_dir, ctx, n_classes=2, load_stored=True)
134 | return net
135 |
136 |
137 | def transform_fn(net, data, input_content_type, output_content_type):
138 | ctx = mx.gpu(0) if list(mx.test_utils.list_gpus()) else mx.cpu(0)
139 | nda = mx.nd.array(json.loads(data))
140 | prediction = nd.softmax(net(nda.as_in_context(ctx)), axis=1)[:, 1]
141 | response_body = json.dumps(prediction.asnumpy().tolist())
142 | return response_body, output_content_type
143 |
--------------------------------------------------------------------------------
/source/sagemaker/baselines/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pandas as pd
3 |
4 | def get_data():
5 | data_prefix = "processed-data/"
6 |
7 | if not os.path.exists(data_prefix):
8 | print("""Expected the following folder {} to contain the preprocessed data.
9 |               Run data processing first in the main notebook before running baseline comparisons""".format(data_prefix))
10 | return
11 |
12 | features = pd.read_csv(data_prefix + "features.csv", header=None)
13 | labels = pd.read_csv(data_prefix + "tags.csv").set_index('TransactionID')
14 | test_users = pd.read_csv(data_prefix + "test.csv", header=None)
15 |
16 | test_X = features.merge(test_users, on=[0], how='inner')
17 | train_X = features[~features[0].isin(test_users[0].values)]
18 | test_y = labels.loc[test_X[0]].isFraud
19 | train_y = labels.loc[train_X[0]].isFraud
20 |
21 | return train_X, train_y, test_X, test_y
--------------------------------------------------------------------------------
/source/sagemaker/baselines/xgboost-fraud-baseline.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Baseline XGBoost Model\n",
8 | "## Training Gradient Boosted trees with node features"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": null,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "from utils import get_data\n",
18 | "import os\n",
19 | "os.chdir(\"../\")\n",
20 | "\n",
21 | "import pandas as pd\n",
22 | "import numpy as np\n",
23 | "\n",
24 | "!bash setup.sh"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Read data and upload to S3"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import io\n",
41 | "from sklearn.datasets import dump_svmlight_file\n",
42 | "\n",
43 | "train_X, train_y, test_X, test_y = get_data()\n",
44 | "\n",
45 | "buf = io.BytesIO()\n",
46 | "dump_svmlight_file(train_X.values[:, 1:], train_y, buf)\n",
47 | "buf.seek(0);\n",
48 | "filename = 'xgboost-fraud-dataset.libsvm'\n",
49 | "with open(filename,'wb') as out:\n",
50 | " out.write(buf.read())"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "import os\n",
60 | "import sagemaker\n",
61 | "from sagemaker.s3 import S3Uploader\n",
62 | "\n",
63 | "from sagemaker_graph_fraud_detection import config\n",
64 | "\n",
65 | "role = config.role\n",
66 | "\n",
67 | "session = sagemaker.Session()\n",
68 | "bucket = config.solution_bucket\n",
69 | "prefix = 'xgboost-fraud-detection'\n",
70 | "\n",
71 | "s3_train_data = S3Uploader.upload(filename, 's3://{}/{}/{}'.format(bucket, prefix,'train'))\n",
72 | "print('Uploaded training data location: {}'.format(s3_train_data))\n",
73 | "\n",
74 | "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n",
75 | "print('Training artifacts will be uploaded to: {}'.format(output_location))"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "## Train SageMaker XGBoost Estimator"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "import boto3\n",
92 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
93 | "\n",
94 | "container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='0.90-2')\n",
95 | "scale_pos_weight = np.sqrt((len(train_y) - sum(train_y))/sum(train_y))\n",
96 | "\n",
97 | "hyperparams = {\n",
98 | " \"max_depth\":5,\n",
99 | " \"subsample\":0.8,\n",
100 | " \"num_round\":100,\n",
101 | " \"eta\":0.2,\n",
102 | " \"gamma\":4,\n",
103 | " \"min_child_weight\":6,\n",
104 | " \"silent\":0,\n",
105 | " \"objective\":'binary:logistic',\n",
106 | " \"eval_metric\":'f1',\n",
107 | " \"scale_pos_weight\": scale_pos_weight\n",
108 | "}\n",
109 | "\n",
110 | "xgb = sagemaker.estimator.Estimator(container,\n",
111 | " role,\n",
112 | " hyperparameters=hyperparams,\n",
113 | " train_instance_count=1, \n",
114 | " train_instance_type='ml.m4.xlarge',\n",
115 | " output_path=output_location,\n",
116 | " sagemaker_session=session)\n",
117 | "xgb.fit({'train': s3_train_data})"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {
124 | "scrolled": true
125 | },
126 | "outputs": [],
127 | "source": [
128 | "from sagemaker.predictor import csv_serializer\n",
129 | "\n",
130 | "predictor = xgb.deploy(initial_instance_count=1,\n",
131 | " endpoint_name=\"xgboost-fraud-endpoint\",\n",
132 | " instance_type='ml.m4.xlarge')\n",
133 | "\n",
134 | "# Specify input and output formats.\n",
135 | "predictor.content_type = 'text/csv'\n",
136 | "predictor.serializer = csv_serializer\n",
137 | "predictor.deserializer = None"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "def predict(current_predictor, data, rows=500):\n",
147 | " split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n",
148 | " predictions = ''\n",
149 | " for array in split_array:\n",
150 | " predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])\n",
151 | " return np.fromstring(predictions[1:], sep=',')\n",
152 | "\n",
153 | "raw_preds = predict(predictor, test_X.values[:, 1:])\n",
154 | "y_preds = np.where(raw_preds > 0.5, 1, 0)"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "from sklearn.metrics import confusion_matrix, roc_curve, auc\n",
164 | "from matplotlib import pyplot as plt\n",
165 | "%matplotlib inline\n",
166 | "\n",
167 | "def print_metrics(y_true, y_predicted):\n",
168 | "\n",
169 | " cm = confusion_matrix(y_true, y_predicted)\n",
170 | " true_neg, false_pos, false_neg, true_pos = cm.ravel()\n",
171 | " cm = pd.DataFrame(np.array([[true_pos, false_pos], [false_neg, true_neg]]),\n",
172 | " columns=[\"labels positive\", \"labels negative\"],\n",
173 | " index=[\"predicted positive\", \"predicted negative\"])\n",
174 | " \n",
175 | " acc = (true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg)\n",
176 | " precision = true_pos/(true_pos + false_pos) if (true_pos + false_pos) > 0 else 0\n",
177 | " recall = true_pos/(true_pos + false_neg) if (true_pos + false_neg) > 0 else 0\n",
178 | " f1 = 2*(precision*recall)/(precision + recall) if (precision + recall) > 0 else 0\n",
179 | " print(\"Confusion Matrix:\")\n",
180 | " print(pd.DataFrame(cm, columns=[\"labels positive\", \"labels negative\"], \n",
181 | " index=[\"predicted positive\", \"predicted negative\"]))\n",
182 | " print(\"f1: {:.4f}, precision: {:.4f}, recall: {:.4f}, acc: {:.4f}\".format(f1, precision, recall, acc))\n",
183 | " print()\n",
184 | " \n",
185 | "def plot_roc_curve(fpr, tpr, roc_auc):\n",
186 | " f = plt.figure()\n",
187 | " lw = 2\n",
188 | " plt.plot(fpr, tpr, color='darkorange',\n",
189 | " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n",
190 | " plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n",
191 | " plt.xlim([0.0, 1.0])\n",
192 | " plt.ylim([0.0, 1.05])\n",
193 | " plt.xlabel('False Positive Rate')\n",
194 | " plt.ylabel('True Positive Rate')\n",
195 | " plt.title('Model ROC curve')\n",
196 | " plt.legend(loc=\"lower right\")\n",
197 | "\n",
198 | "print_metrics(test_y, y_preds)\n",
199 | "fpr, tpr, _ = roc_curve(test_y, y_preds)\n",
200 | "roc_auc = auc(fpr, tpr)\n",
201 | "plot_roc_curve(fpr, tpr, roc_auc)"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "predictor.delete_endpoint()"
211 | ]
212 | }
213 | ],
214 | "metadata": {
215 | "kernelspec": {
216 | "display_name": "conda_python3",
217 | "language": "python",
218 | "name": "conda_python3"
219 | },
220 | "language_info": {
221 | "codemirror_mode": {
222 | "name": "ipython",
223 | "version": 3
224 | },
225 | "file_extension": ".py",
226 | "mimetype": "text/x-python",
227 | "name": "python",
228 | "nbconvert_exporter": "python",
229 | "pygments_lexer": "ipython3",
230 | "version": "3.6.10"
231 | }
232 | },
233 | "nbformat": 4,
234 | "nbformat_minor": 2
235 | }
236 |
--------------------------------------------------------------------------------
/source/sagemaker/data-preprocessing/buildspec.yml:
--------------------------------------------------------------------------------
1 | version: 0.2
2 |
3 | phases:
4 | build:
5 | commands:
6 | - echo Build started on `date`
7 | - echo Building the Docker image...
8 | - bash container/build_and_push.sh $ecr_repository $region $account_id
9 | post_build:
10 | commands:
11 | - echo Build completed on `date`
--------------------------------------------------------------------------------
/source/sagemaker/data-preprocessing/container/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.7-slim-buster
2 |
3 | RUN pip3 install pandas==0.24.2
4 | ENV PYTHONUNBUFFERED=TRUE
5 |
6 | ENTRYPOINT ["python3"]
--------------------------------------------------------------------------------
/source/sagemaker/data-preprocessing/container/build_and_push.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 |
3 | image=$1
4 |
5 | region=$2
6 |
7 | account=$3
8 |
9 | # Construct the full ECR image name from the account, region and image name passed in
10 | fullname="${account}.dkr.ecr.${region}.amazonaws.com/${image}:latest"
11 |
12 | # If the repository doesn't exist in ECR, create it.
13 | aws ecr describe-repositories --repository-names "${image}" > /dev/null 2>&1
14 |
15 | if [ $? -ne 0 ]
16 | then
17 | aws ecr create-repository --repository-name "${image}" > /dev/null
18 | fi
19 |
20 | # Get the login command from ECR and execute it directly
21 | $(aws ecr get-login --region ${region} --registry-ids ${account} --no-include-email)
22 |
23 |
24 | # Build the docker image locally with the image name and then push it to ECR
25 | # with the full name.
26 |
27 | docker build -t ${image} container
28 | docker tag ${image} ${fullname}
29 |
30 | docker push ${fullname}
--------------------------------------------------------------------------------
/source/sagemaker/data-preprocessing/graph_data_preprocessor.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import logging
3 | import os
4 |
5 | import pandas as pd
6 | import numpy as np
7 | from itertools import combinations
8 |
9 |
10 | def parse_args():
11 | parser = argparse.ArgumentParser()
12 | parser.add_argument('--data-dir', type=str, default='/opt/ml/processing/input')
13 | parser.add_argument('--output-dir', type=str, default='/opt/ml/processing/output')
14 | parser.add_argument('--transactions', type=str, default='transaction.csv', help='name of file with transactions')
15 | parser.add_argument('--identity', type=str, default='identity.csv', help='name of file with identity info')
16 | parser.add_argument('--id-cols', type=str, default='', help='comma separated id cols in transactions table')
17 | parser.add_argument('--cat-cols', type=str, default='', help='comma separated categorical cols in transactions')
18 | parser.add_argument('--train-data-ratio', type=float, default=0.8, help='fraction of data to use in training set')
19 | parser.add_argument('--construct-homogeneous', action="store_true", default=False,
20 | help='use bipartite graphs edgelists to construct homogenous graph edgelist')
21 | return parser.parse_args()
22 |
23 |
24 | def get_logger(name):
25 | logger = logging.getLogger(name)
26 | log_format = '%(asctime)s %(levelname)s %(name)s: %(message)s'
27 | logging.basicConfig(format=log_format, level=logging.INFO)
28 | logger.setLevel(logging.INFO)
29 | return logger
30 |
31 |
32 | def load_data(data_dir, transaction_data, identity_data, train_data_ratio, output_dir):
33 | transaction_df = pd.read_csv(os.path.join(data_dir, transaction_data))
34 | logging.info("Shape of transaction data is {}".format(transaction_df.shape))
35 | logging.info("# Tagged transactions: {}".format(len(transaction_df) - transaction_df.isFraud.isnull().sum()))
36 |
37 | identity_df = pd.read_csv(os.path.join(data_dir, identity_data))
38 | logging.info("Shape of identity data is {}".format(identity_df.shape))
39 |
40 | # extract out transactions for test/validation
41 | n_train = int(transaction_df.shape[0]*train_data_ratio)
42 | test_ids = transaction_df.TransactionID.values[n_train:]
43 |
44 | get_fraud_frac = lambda series: 100 * sum(series)/len(series)
45 | logging.info("Percent fraud for train transactions: {}".format(get_fraud_frac(transaction_df.isFraud[:n_train])))
46 | logging.info("Percent fraud for test transactions: {}".format(get_fraud_frac(transaction_df.isFraud[n_train:])))
47 | logging.info("Percent fraud for all transactions: {}".format(get_fraud_frac(transaction_df.isFraud)))
48 |
49 | with open(os.path.join(output_dir, 'test.csv'), 'w') as f:
50 | f.writelines(map(lambda x: str(x) + "\n", test_ids))
51 | logging.info("Wrote test to file: {}".format(os.path.join(output_dir, 'test.csv')))
52 |
53 | return transaction_df, identity_df, test_ids
54 |
55 |
56 | def get_features_and_labels(transactions_df, transactions_id_cols, transactions_cat_cols, output_dir):
57 | # Get features
58 | non_feature_cols = ['isFraud', 'TransactionDT'] + transactions_id_cols.split(",")
59 | feature_cols = [col for col in transactions_df.columns if col not in non_feature_cols]
60 | logging.info("Categorical columns: {}".format(transactions_cat_cols.split(",")))
61 | features = pd.get_dummies(transactions_df[feature_cols], columns=transactions_cat_cols.split(",")).fillna(0)
62 | features['TransactionAmt'] = features['TransactionAmt'].apply(np.log10)
63 | logging.info("Transformed feature columns: {}".format(list(features.columns)))
64 | logging.info("Shape of features: {}".format(features.shape))
65 | features.to_csv(os.path.join(output_dir, 'features.csv'), index=False, header=False)
66 | logging.info("Wrote features to file: {}".format(os.path.join(output_dir, 'features.csv')))
67 |
68 | # Get labels
69 | transactions_df[['TransactionID', 'isFraud']].to_csv(os.path.join(output_dir, 'tags.csv'), index=False)
70 | logging.info("Wrote labels to file: {}".format(os.path.join(output_dir, 'tags.csv')))
71 |
72 |
73 | def get_relations_and_edgelist(transactions_df, identity_df, transactions_id_cols, output_dir):
74 | # Get relations
75 | edge_types = transactions_id_cols.split(",") + list(identity_df.columns)
76 | logging.info("Found the following distinct relation types: {}".format(edge_types))
77 | id_cols = ['TransactionID'] + transactions_id_cols.split(",")
78 | full_identity_df = transactions_df[id_cols].merge(identity_df, on='TransactionID', how='left')
79 | logging.info("Shape of identity columns: {}".format(full_identity_df.shape))
80 |
81 | # extract edges
82 | edges = {}
83 | for etype in edge_types:
84 | edgelist = full_identity_df[['TransactionID', etype]].dropna()
85 | edgelist.to_csv(os.path.join(output_dir, 'relation_{}_edgelist.csv').format(etype), index=False, header=True)
86 | logging.info("Wrote edgelist to: {}".format(os.path.join(output_dir, 'relation_{}_edgelist.csv').format(etype)))
87 | edges[etype] = edgelist
88 | return edges
89 |
90 |
91 | def create_homogeneous_edgelist(edges, output_dir):
92 | homogeneous_edges = []
93 | for etype, relations in edges.items():
94 | for edge_relation, frame in relations.groupby(etype):
95 | new_edges = [(a, b) for (a, b) in combinations(frame.TransactionID.values, 2)
96 | if (a, b) not in homogeneous_edges and (b, a) not in homogeneous_edges]
97 | homogeneous_edges.extend(new_edges)
98 |
99 | with open(os.path.join(output_dir, 'homogeneous_edgelist.csv'), 'w') as f:
100 | f.writelines(map(lambda x: "{}, {}\n".format(x[0], x[1]), homogeneous_edges))
101 | logging.info("Wrote homogeneous edgelist to file: {}".format(os.path.join(output_dir, 'homogeneous_edgelist.csv')))
102 |
103 |
104 | if __name__ == '__main__':
105 | logging = get_logger(__name__)
106 |
107 | args = parse_args()
108 |
109 | transactions, identity, test_transactions = load_data(args.data_dir,
110 | args.transactions,
111 | args.identity,
112 | args.train_data_ratio,
113 | args.output_dir)
114 |
115 | get_features_and_labels(transactions, args.id_cols, args.cat_cols, args.output_dir)
116 | relational_edges = get_relations_and_edgelist(transactions, identity, args.id_cols, args.output_dir)
117 |
118 | if args.construct_homogeneous:
119 | create_homogeneous_edgelist(relational_edges, args.output_dir)
120 |
121 |
122 |
123 |
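124 | # Invocation sketch: how the SageMaker Processing job in dgl-fraud-detection.ipynb runs
125 | # this script. The paths are the container defaults above and the column lists mirror the
126 | # arguments passed by the notebook:
127 | #
128 | #   python3 graph_data_preprocessor.py \
129 | #       --data-dir /opt/ml/processing/input \
130 | #       --output-dir /opt/ml/processing/output \
131 | #       --id-cols card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain \
132 | #       --cat-cols M1,M2,M3,M4,M5,M6,M7,M8,M9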
--------------------------------------------------------------------------------
/source/sagemaker/dgl-fraud-detection.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Graph Fraud Detection with DGL on Amazon SageMaker\n",
8 | "\n",
9 | "This notebook shows an end to end pipeline to train a fraud detection model using graph neural networks. \n",
10 | "\n",
11 | "First, we process the raw dataset to prepare the features and extract the interactions in the dataset that will be used to construct the graph. \n",
12 | "\n",
13 | "Then, we create a launch a training job using the SageMaker framework estimator to train a graph neural network model with DGL."
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": null,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "!bash setup.sh\n",
23 | "\n",
24 | "import sagemaker\n",
25 | "from sagemaker_graph_fraud_detection import config, container_build\n",
26 | "\n",
27 | "role = config.role\n",
28 | "sess = sagemaker.Session()"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Data Preprocessing and Feature Engineering"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### Upload raw data to S3\n",
43 | "\n",
44 | "The dataset we use is the [IEEE-CIS Fraud Detection dataset](https://www.kaggle.com/c/ieee-fraud-detection/data?select=train_transaction.csv) which is a typical example of financial transactions dataset that many companies have. The dataset consists of two tables:\n",
45 | "\n",
46 | "* **Transactions**: Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction and features on the card used for the transaction. \n",
47 | "* **Identity**: Contains information about the identity users performing transactions. Examples of columns here include the device type and device ids used.\n",
48 | "\n",
49 | "We will go over the specific data schema in subsequent cells but now let's move the raw data to a convenient location in the S3 bucket for this proejct, where it will be picked up by the preprocessing job and training job.\n",
50 | "\n",
51 | "If you would like to use your own dataset for this demonstration. Replace the `raw_data_location` with the s3 path or local path of your dataset, and modify the data preprocessing step as needed."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "# Replace with an S3 location or local path to point to your own dataset\n",
61 | "raw_data_location = 's3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data'\n",
62 | "\n",
63 | "session_prefix = 'dgl-fraud-detection'\n",
64 | "input_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_data_prefix)\n",
65 | "\n",
66 | "!aws s3 cp --recursive $raw_data_location $input_data\n",
67 | "\n",
68 | "# Set S3 locations to store processed data for training and post-training results and artifacts respectively\n",
69 | "train_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_processing_output)\n",
70 | "train_output = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_train_output)"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "### Build container for Preprocessing and Feature Engineering\n",
78 | "\n",
79 | "Data preprocessing and feature engineering is an important component of the ML lifecycle, and Amazon SageMaker Processing allows you to do these easily on a managed infrastructure. First, we'll create a lightweight container that will serve as the environment for our data preprocessing. \n",
80 | "\n",
81 | "The Dockerfile that defines the container is shown below and it only contains the pandas package as a dependency but it can also be easily customized to add in more dependencies if your data preprocessing job requires it."
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": null,
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "!pygmentize data-preprocessing/container/Dockerfile"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Now we'll run a simple script to build a container image using the Dockerfile, and push the image to Amazon ECR. The container image will have a unique URI which the SageMaker Processing job executed."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "region = config.region_name\n",
107 | "account_id = config.aws_account\n",
108 | "ecr_repository = config.ecr_repository\n",
109 | "\n",
110 | "if config.container_build_project == \"local\":\n",
111 | " !cd data-preprocessing && bash container/build_and_push.sh $ecr_repository $region $account_id\n",
112 | "else:\n",
113 | " container_build.build(config.container_build_project)\n",
114 | "ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### Run Preprocessing job with Amazon SageMaker Processing\n",
122 | "\n",
123 | "The script we have defined at `data-preprocessing/graph_data_preprocessor.py` performs data preprocessing and feature engineering transformations on the raw data. We provide a general processing framework to convert a relational table to heterogeneous graph edgelists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:\n",
124 | "\n",
125 | "* Performing numerical encoding for categorical variables and logarithmic transformation for transaction amount\n",
126 | "* Constructing graph edgelists between transactions and other entities for the various relation types\n",
127 | "\n",
128 | "The inputs to the data preprocessing script are passed in as python command line arguments. All the columns in the relational table are classifed into one of 3 types for the purposes of data transformation: \n",
129 | "\n",
130 | "* **Identity columns** `--id-cols`: columns that contain identity information related to a user or transaction for example IP address, Phone Number, device identifiers etc. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes. The column names for these column types need to passed in to the script.\n",
131 | "\n",
132 | "* **Categorical columns** `--cat-cols`: columns that correspond to categorical features for a user's age group or whether a provided address matches with an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph. The columns names for these column types also needs to be passed in to the script\n",
133 | "\n",
134 | "* **Numerical columns**: columns that correspond to numerical features like how many times a user has tried a transaction and so on. The entries here are also used as node attributes in the heterogeneous graph. The script assumes that all columns in the tables that are not identity columns or categorical columns are numerical columns\n",
135 | "\n",
136 | "In order to adapt the preprocessing script to work with data in the same format, you can simply change the python arguments used in the cell below to a comma seperate string for the column names in your dataset. If your dataset is in a different format, then you will also have to modify the preprocessing script at `data-preprocessing/graph_data_preprocessor.py`"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "metadata": {},
143 | "outputs": [],
144 | "source": [
145 | "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n",
146 | "\n",
147 | "script_processor = ScriptProcessor(command=['python3'],\n",
148 | " image_uri=ecr_repository_uri,\n",
149 | " role=role,\n",
150 | " instance_count=1,\n",
151 | " instance_type='ml.m4.xlarge')\n",
152 | "\n",
153 | "script_processor.run(code='data-preprocessing/graph_data_preprocessor.py',\n",
154 | " inputs=[ProcessingInput(source=input_data,\n",
155 | " destination='/opt/ml/processing/input')],\n",
156 | " outputs=[ProcessingOutput(destination=train_data,\n",
157 | " source='/opt/ml/processing/output')],\n",
158 | " arguments=['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',\n",
159 | " '--cat-cols','M1,M2,M3,M4,M5,M6,M7,M8,M9'])"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "### View Results of Data Preprocessing\n",
167 | "\n",
168 | "Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data. We have a set of bipartite edge lists between transactions and different device id types as well as the features, labels and a set of transactions to validate our graph model performance."
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "metadata": {},
175 | "outputs": [],
176 | "source": [
177 | "from os import path\n",
178 | "from sagemaker.s3 import S3Downloader\n",
179 | "processed_files = S3Downloader.list(train_data)\n",
180 | "print(\"===== Processed Files =====\")\n",
181 | "print('\\n'.join(processed_files))\n",
182 | "\n",
183 | "# optionally download processed data\n",
184 | "# S3Downloader.download(train_data, train_data.split(\"/\")[-1])"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "## Train Graph Neural Network with DGL\n",
192 | "\n",
193 | "Graph Neural Networks work by learning representation for nodes or edges of a graph that are well suited for some downstream task. We can model the fraud detection problem as a node classification task, and the goal of the graph neural network would be to learn how to use information from the topology of the sub-graph for each transaction node to transform the node's features to a representation space where the node can be easily classified as fraud or not.\n",
194 | "\n",
195 | "Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types."
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "### Hyperparameters\n",
203 | "\n",
204 | "To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. \n",
205 | "\n",
206 | "Here we're setting only a few of the hyperparameters, to see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. The parameters set below are:\n",
207 | "\n",
208 | "* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.\n",
209 | "* **`edges`** is a regular expression that when expanded lists all the filenames for the edgelists\n",
210 | "* **`labels`** is the name of the file tha contains the target `node_id`s and their labels\n",
211 | "* **`model`** specify which graph neural network to use, this should be set to `r-gcn`\n",
212 | "\n",
213 | "The following hyperparameters can be tuned and adjusted to improve model performance\n",
214 | "* **batch-size** is the number nodes that are used to compute a single forward pass of the GNN\n",
215 | "\n",
216 | "* **embedding-size** is the size of the embedding dimension for non target nodes\n",
217 | "* **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training\n",
218 | "* **n-layers** is the number of GNN layers in the model\n",
219 | "* **n-epochs** is the number of training epochs for the model training job\n",
220 | "* **optimizer** is the optimization algorithm used for gradient based parameter updates\n",
221 | "* **lr** is the learning rate for parameter updates\n"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": null,
227 | "metadata": {},
228 | "outputs": [],
229 | "source": [
230 | "edges = \",\".join(map(lambda x: x.split(\"/\")[-1], [file for file in processed_files if \"relation\" in file]))\n",
231 | "params = {'nodes' : 'features.csv',\n",
232 | " 'edges': 'relation*',\n",
233 | " 'labels': 'tags.csv',\n",
234 | " 'model': 'rgcn',\n",
235 | " 'num-gpus': 1,\n",
236 | " 'batch-size': 10000,\n",
237 | " 'embedding-size': 64,\n",
238 | " 'n-neighbors': 1000,\n",
239 | " 'n-layers': 2,\n",
240 | " 'n-epochs': 10,\n",
241 | " 'optimizer': 'adam',\n",
242 | " 'lr': 1e-2\n",
243 | " }\n",
244 | "\n",
245 | "print(\"Graph will be constructed using the following edgelists:\\n{}\" .format('\\n'.join(edges.split(\",\"))))"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "### Create and Fit SageMaker Estimator\n",
253 | "\n",
254 | "With the hyperparameters defined, we can kick off the training job. We will be using the Deep Graph Library (DGL), with MXNet as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes it do this with the Framework estimators which have the deep learning frameworks already setup. Here, we create a SageMaker MXNet estimator and pass in our model training script, hyperparameters, as well as the number and type of training instances we want.\n",
255 | "\n",
256 | "We can then `fit` the estimator on the the training data location in S3."
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "scrolled": true
264 | },
265 | "outputs": [],
266 | "source": [
267 | "from sagemaker.mxnet import MXNet\n",
268 | "from time import strftime, gmtime\n",
269 | "\n",
270 | "estimator = MXNet(entry_point='train_dgl_mxnet_entry_point.py',\n",
271 | " source_dir='sagemaker_graph_fraud_detection/dgl_fraud_detection',\n",
272 | " role=role, \n",
273 | " train_instance_count=1, \n",
274 | " train_instance_type='ml.p3.2xlarge',\n",
275 | " framework_version=\"1.6.0\",\n",
276 | " py_version='py3',\n",
277 | " hyperparameters=params,\n",
278 | " output_path=train_output,\n",
279 | " code_location=train_output,\n",
280 | " sagemaker_session=sess)\n",
281 | "\n",
282 | "training_job_name = \"{}-{}\".format(config.solution_prefix, strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n",
283 | "estimator.fit({'train': train_data}, job_name=training_job_name)"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "Once the training is completed, the training instances are automatically saved and SageMaker stores the trained model and evaluation results to a location in S3."
291 | ]
292 | }
293 | ],
294 | "metadata": {
295 | "kernelspec": {
296 | "display_name": "conda_python3",
297 | "language": "python",
298 | "name": "conda_python3"
299 | },
300 | "language_info": {
301 | "codemirror_mode": {
302 | "name": "ipython",
303 | "version": 3
304 | },
305 | "file_extension": ".py",
306 | "mimetype": "text/x-python",
307 | "name": "python",
308 | "nbconvert_exporter": "python",
309 | "pygments_lexer": "ipython3",
310 | "version": "3.6.10"
311 | }
312 | },
313 | "nbformat": 4,
314 | "nbformat_minor": 2
315 | }
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/source/sagemaker/sagemaker_graph_fraud_detection/__init__.py
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/config.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | import boto3
4 | import sagemaker
5 | from pathlib import Path
6 |
7 |
8 | def get_current_folder(global_variables):
9 | # if calling from a file
10 | if "__file__" in global_variables:
11 | current_file = Path(global_variables["__file__"])
12 | current_folder = current_file.parent.resolve()
13 | # if calling from a notebook
14 | else:
15 | current_folder = Path(os.getcwd())
16 | return current_folder
17 |
18 | region = boto3.session.Session().region_name
19 | account_id = boto3.client('sts').get_caller_identity().get('Account')
20 | default_bucket = sagemaker.session.Session(boto3.session.Session()).default_bucket()
21 | default_role = sagemaker.get_execution_role()
22 |
23 | cfn_stack_outputs = {}
24 | current_folder = get_current_folder(globals())
25 | cfn_stack_outputs_filepath = Path(current_folder, '../stack_outputs.json').resolve()
26 |
27 | if os.path.exists(cfn_stack_outputs_filepath):
28 | with open(cfn_stack_outputs_filepath) as f:
29 | cfn_stack_outputs = json.load(f)
30 |
31 | aws_account = cfn_stack_outputs.get('AccountID', account_id)
32 | region_name = cfn_stack_outputs.get('AWSRegion', region)
33 |
34 | solution_prefix = cfn_stack_outputs.get('SolutionPrefix', 'sagemaker-soln-graph-fraud')
35 | solution_bucket = cfn_stack_outputs.get('SolutionS3Bucket', default_bucket)
36 |
37 | s3_data_prefix = cfn_stack_outputs.get('S3InputDataPrefix', 'raw-data')
38 | s3_processing_output = cfn_stack_outputs.get('S3ProcessingJobOutputPrefix', 'processed-data')
39 | s3_train_output = cfn_stack_outputs.get('S3TrainingJobOutputPrefix', 'training-output')
40 |
41 | ecr_repository = cfn_stack_outputs.get('SageMakerProcessingJobContainerName', 'sagemaker-soln-graph-fraud-preprocessing')
42 | container_build_project = cfn_stack_outputs.get('SageMakerProcessingJobContainerBuild', 'local')
43 |
44 | role = cfn_stack_outputs.get('IamRole', default_role)
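45 | 
46 | # Sketch of the optional ../stack_outputs.json consumed above. The keys are exactly those
47 | # read by this module; the values below are placeholders, not real resources:
48 | #   {"AccountID": "123456789012", "AWSRegion": "us-west-2",
49 | #    "SolutionPrefix": "sagemaker-soln-graph-fraud", "SolutionS3Bucket": "example-bucket",
50 | #    "S3InputDataPrefix": "raw-data", "S3ProcessingJobOutputPrefix": "processed-data",
51 | #    "S3TrainingJobOutputPrefix": "training-output",
52 | #    "SageMakerProcessingJobContainerName": "sagemaker-soln-graph-fraud-preprocessing",
53 | #    "SageMakerProcessingJobContainerBuild": "local", "IamRole": "arn:aws:iam::123456789012:role/example"}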
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/container_build/__init__.py:
--------------------------------------------------------------------------------
1 | from .container_build import build
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/container_build/container_build.py:
--------------------------------------------------------------------------------
1 | import boto3
2 | import sys
3 | import time
4 | from .logs import logs_for_build
5 |
6 | def build(project_name, session=boto3.session.Session(), log=True):
7 | print("Starting a build job for CodeBuild project: {}".format(project_name))
8 | id = _start_build(session, project_name)
9 | if log:
10 | logs_for_build(id, wait=True, session=session)
11 | else:
12 | _wait_for_build(id, session)
13 |
14 | def _start_build(session, project_name):
15 | args = {"projectName": project_name}
16 | client = session.client("codebuild")
17 |
18 | response = client.start_build(**args)
19 | return response["build"]["id"]
20 |
21 |
22 | def _wait_for_build(build_id, session, poll_seconds=10):
23 | client = session.client("codebuild")
24 | status = client.batch_get_builds(ids=[build_id])
25 | while status["builds"][0]["buildStatus"] == "IN_PROGRESS":
26 | print(".", end="")
27 | sys.stdout.flush()
28 | time.sleep(poll_seconds)
29 | status = client.batch_get_builds(ids=[build_id])
30 | print()
31 | print(f"Build complete, status = {status['builds'][0]['buildStatus']}")
32 | print(f"Logs at {status['builds'][0]['logs']['deepLink']}")
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/container_build/logs.py:
--------------------------------------------------------------------------------
1 | import collections
2 | import sys
3 | import time
4 |
5 | import botocore
6 |
7 |
8 | # Log tailing code taken as-is from sagemaker_run_notebook.
9 | # https://github.com/aws-samples/sagemaker-run-notebook/blob/master/sagemaker_run_notebook/container_build.py#L100
10 | # It can be added as a dependency once it's on PyPI.
11 |
12 |
13 | class LogState(object):
14 | STARTING = 1
15 | WAIT_IN_PROGRESS = 2
16 | TAILING = 3
17 | JOB_COMPLETE = 4
18 | COMPLETE = 5
19 |
20 |
21 | # Position is a tuple that includes the last read timestamp and the number of items that were read
22 | # at that time. This is used to figure out which event to start with on the next read.
23 | Position = collections.namedtuple("Position", ["timestamp", "skip"])
24 |
25 |
26 | def log_stream(client, log_group, stream_name, position):
27 | """A generator for log items in a single stream. This will yield all the
28 | items that are available at the current moment.
29 |
30 | Args:
31 | client (boto3.CloudWatchLogs.Client): The Boto client for CloudWatch logs.
32 | log_group (str): The name of the log group.
33 | stream_name (str): The name of the specific stream.
34 | position (Position): A tuple with the time stamp value to start reading the logs from and
35 | The number of log entries to skip at the start. This is for when
36 | there are multiple entries at the same timestamp.
37 | start_time (int): The time stamp value to start reading the logs from (default: 0).
38 | skip (int): The number of log entries to skip at the start (default: 0). This is for when there are
39 | multiple entries at the same timestamp.
40 |
41 | Yields:
42 | A tuple with:
43 | dict: A CloudWatch log event with the following key-value pairs:
44 | 'timestamp' (int): The time of the event.
45 | 'message' (str): The log event data.
46 | 'ingestionTime' (int): The time the event was ingested.
47 | Position: The new position
48 | """
49 |
50 | start_time, skip = position
51 | next_token = None
52 |
53 | event_count = 1
54 | while event_count > 0:
55 | if next_token is not None:
56 | token_arg = {"nextToken": next_token}
57 | else:
58 | token_arg = {}
59 |
60 | response = client.get_log_events(
61 | logGroupName=log_group,
62 | logStreamName=stream_name,
63 | startTime=start_time,
64 | startFromHead=True,
65 | **token_arg,
66 | )
67 | next_token = response["nextForwardToken"]
68 | events = response["events"]
69 | event_count = len(events)
70 | if event_count > skip:
71 | events = events[skip:]
72 | skip = 0
73 | else:
74 | skip = skip - event_count
75 | events = []
76 | for ev in events:
77 | ts, count = position
78 | if ev["timestamp"] == ts:
79 | position = Position(timestamp=ts, skip=count + 1)
80 | else:
81 | position = Position(timestamp=ev["timestamp"], skip=1)
82 | yield ev, position
83 |
84 |
85 | # Copy/paste/slight mods from session.logs_for_job() in the SageMaker Python SDK
86 | def logs_for_build(
87 | build_id, session, wait=False, poll=10
88 | ): # noqa: C901 - suppress complexity warning for this method
89 | """Display the logs for a given build, optionally tailing them until the
90 | build is complete.
91 |
92 | Args:
93 | build_id (str): The ID of the build to display the logs for.
94 | wait (bool): Whether to keep looking for new log entries until the build completes (default: False).
95 | poll (int): The interval in seconds between polling for new log entries and build completion (default: 10).
96 | session (boto3.session.Session): A boto3 session to use
97 |
98 | Raises:
99 | ValueError: If waiting and the build fails.
100 | """
101 |
102 | codebuild = session.client("codebuild")
103 | description = codebuild.batch_get_builds(ids=[build_id])["builds"][0]
104 | status = description["buildStatus"]
105 |
106 | log_group = description["logs"].get("groupName")
107 | stream_name = description["logs"].get("streamName") # The list of log streams
108 | position = Position(
109 | timestamp=0, skip=0
110 | ) # The current position in each stream, map of stream name -> position
111 |
112 | # Increase retries allowed (from default of 4), as we don't want waiting for a build
113 | # to be interrupted by a transient exception.
114 | config = botocore.config.Config(retries={"max_attempts": 15})
115 | client = session.client("logs", config=config)
116 |
117 | job_already_completed = False if status == "IN_PROGRESS" else True
118 |
119 | state = (
120 | LogState.STARTING if wait and not job_already_completed else LogState.COMPLETE
121 | )
122 | dot = True
123 |
124 | while state == LogState.STARTING and log_group == None:
125 | time.sleep(poll)
126 | description = codebuild.batch_get_builds(ids=[build_id])["builds"][0]
127 | log_group = description["logs"].get("groupName")
128 | stream_name = description["logs"].get("streamName")
129 |
130 | if state == LogState.STARTING:
131 | state = LogState.TAILING
132 |
133 | # The loop below implements a state machine that alternates between checking the build status and
134 | # reading whatever is available in the logs at this point. Note, that if we were called with
135 | # wait == False, we never check the job status.
136 | #
137 | # If wait == TRUE and job is not completed, the initial state is STARTING
138 | # If wait == FALSE, the initial state is COMPLETE (doesn't matter if the job really is complete).
139 | #
140 | # The state table:
141 | #
142 | # STATE ACTIONS CONDITION NEW STATE
143 | # ---------------- ---------------- ------------------- ----------------
144 | # STARTING Pause, Get Status Valid LogStream Arn TAILING
145 | # Else STARTING
146 | # TAILING Read logs, Pause, Get status Job complete JOB_COMPLETE
147 | # Else TAILING
148 | # JOB_COMPLETE Read logs, Pause Any COMPLETE
149 | # COMPLETE Read logs, Exit N/A
150 | #
151 | # Notes:
152 | # - The JOB_COMPLETE state forces us to do an extra pause and read any items that got to Cloudwatch after
153 | # the build was marked complete.
154 | last_describe_job_call = time.time()
155 | dot_printed = False
156 | while True:
157 | for event, position in log_stream(client, log_group, stream_name, position):
158 | print(event["message"].rstrip())
159 | if dot:
160 | dot = False
161 | if dot_printed:
162 | print()
163 | if state == LogState.COMPLETE:
164 | break
165 |
166 | time.sleep(poll)
167 | if dot:
168 | print(".", end="")
169 | sys.stdout.flush()
170 | dot_printed = True
171 | if state == LogState.JOB_COMPLETE:
172 | state = LogState.COMPLETE
173 | elif time.time() - last_describe_job_call >= 30:
174 | description = codebuild.batch_get_builds(ids=[build_id])["builds"][0]
175 | status = description["buildStatus"]
176 |
177 | last_describe_job_call = time.time()
178 |
179 | status = description["buildStatus"]
180 |
181 | if status != "IN_PROGRESS":
182 | print()
183 | state = LogState.JOB_COMPLETE
184 |
185 | if wait:
186 | if dot:
187 | print()
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/__init__.py
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 |
5 | def get_features(id_to_node, node_features):
6 | """
7 |
8 | :param id_to_node: dictionary mapping node names(id) to dgl node idx
9 | :param node_features: path to file containing node features
10 | :return: (np.ndarray, list) node feature matrix in order and new nodes not yet in the graph
11 | """
12 | indices, features, new_nodes = [], [], []
13 | max_node = max(id_to_node.values())
14 | with open(node_features, "r") as fh:
15 | for line in fh:
16 | node_feats = line.strip().split(",")
17 | node_id = node_feats[0]
18 | feats = np.array(list(map(float, node_feats[1:])))
19 | features.append(feats)
20 | if node_id not in id_to_node:
21 | max_node += 1
22 | id_to_node[node_id] = max_node
23 | new_nodes.append(max_node)
24 |
25 | indices.append(id_to_node[node_id])
26 |
27 | features = np.array(features).astype('float32')
28 | features = features[np.argsort(indices), :]
29 | return features, new_nodes
30 |
31 |
32 | def get_labels(id_to_node, n_nodes, target_node_type, labels_path, masked_nodes_path, additional_mask_rate=0):
33 | """
34 |
35 | :param id_to_node: dictionary mapping node names(id) to dgl node idx
36 | :param n_nodes: number of user nodes in the graph
37 | :param target_node_type: column name for target node type
38 | :param labels_path: filepath containing labelled nodes
39 | :param masked_nodes_path: filepath containing list of nodes to be masked
40 |     :param additional_mask_rate: float for additional masking of nodes with labels during training
41 |     :return: (np.ndarray, np.ndarray, np.ndarray) labels, train mask and test mask arrays
42 | """
43 | node_to_id = {v: k for k, v in id_to_node.items()}
44 | user_to_label = pd.read_csv(labels_path).set_index(target_node_type)
45 | labels = user_to_label.loc[map(int, pd.Series(node_to_id)[np.arange(n_nodes)].values)].values.flatten()
46 | masked_nodes = read_masked_nodes(masked_nodes_path)
47 | train_mask, test_mask = _get_mask(id_to_node, node_to_id, n_nodes, masked_nodes,
48 | additional_mask_rate=additional_mask_rate)
49 | return labels, train_mask, test_mask
50 |
51 |
52 | def read_masked_nodes(masked_nodes_path):
53 | """
54 | Returns a list of nodes extracted from the path passed in
55 |
56 | :param masked_nodes_path: filepath containing list of nodes to be masked i.e test users
57 | :return: list
58 | """
59 | with open(masked_nodes_path, "r") as fh:
60 | masked_nodes = [line.strip() for line in fh]
61 | return masked_nodes
62 |
63 |
64 | def _get_mask(id_to_node, node_to_id, num_nodes, masked_nodes, additional_mask_rate):
65 | """
66 | Returns the train and test mask arrays
67 |
68 | :param id_to_node: dictionary mapping node names(id) to dgl node idx
69 | :param node_to_id: dictionary mapping dgl node idx to node names(id)
70 | :param num_nodes: number of user/account nodes in the graph
71 | :param masked_nodes: list of nodes to be masked during training, nodes without labels
72 | :param additional_mask_rate: float for additional masking of nodes with labels during training
73 | :return: (list, list) train and test mask array
74 | """
75 | train_mask = np.ones(num_nodes)
76 | test_mask = np.zeros(num_nodes)
77 | for node_id in masked_nodes:
78 | train_mask[id_to_node[node_id]] = 0
79 | test_mask[id_to_node[node_id]] = 1
80 | if additional_mask_rate and additional_mask_rate < 1:
81 | unmasked = np.array([idx for idx in range(num_nodes) if node_to_id[idx] not in masked_nodes])
82 | yet_unmasked = np.random.permutation(unmasked)[:int(additional_mask_rate*num_nodes)]
83 | train_mask[yet_unmasked] = 0
84 | return train_mask, test_mask
85 |
86 |
87 | def _get_node_idx(id_to_node, node_type, node_id, ptr):
88 | if node_type in id_to_node:
89 | if node_id in id_to_node[node_type]:
90 | node_idx = id_to_node[node_type][node_id]
91 | else:
92 | id_to_node[node_type][node_id] = ptr
93 | node_idx = ptr
94 | ptr += 1
95 | else:
96 | id_to_node[node_type] = {}
97 | id_to_node[node_type][node_id] = ptr
98 | node_idx = ptr
99 | ptr += 1
100 |
101 | return node_idx, id_to_node, ptr
102 |
103 |
104 | def parse_edgelist(edges, id_to_node, header=False, source_type='user', sink_type='user'):
105 | """
106 | Parse an edgelist path file and return the edges as a list of tuple
107 | :param edges: path to comma separated file containing bipartite edges with header for edgetype
108 | :param id_to_node: dictionary containing mapping for node names(id) to dgl node indices
109 | :param header: boolean whether or not the file has a header row
110 | :param source_type: type of the source node in the edge. defaults to 'user' if no header
111 | :param sink_type: type of the sink node in the edge. defaults to 'user' if no header.
112 | :return: (list, dict) a list containing edges of a single relationship type as tuples and updated id_to_node dict.
113 | """
114 | edge_list = []
115 | source_pointer, sink_pointer = 0, 0
116 | with open(edges, "r") as fh:
117 | for i, line in enumerate(fh):
118 | source, sink = line.strip().split(",")
119 | if i == 0:
120 | if header:
121 | source_type, sink_type = source, sink
122 | if source_type in id_to_node:
123 | source_pointer = max(id_to_node[source_type].values()) + 1
124 | if sink_type in id_to_node:
125 | sink_pointer = max(id_to_node[sink_type].values()) + 1
126 | continue
127 |
128 | source_node, id_to_node, source_pointer = _get_node_idx(id_to_node, source_type, source, source_pointer)
129 | if source_type == sink_type:
130 | sink_node, id_to_node, source_pointer = _get_node_idx(id_to_node, sink_type, sink, source_pointer)
131 | else:
132 | sink_node, id_to_node, sink_pointer = _get_node_idx(id_to_node, sink_type, sink, sink_pointer)
133 |
134 | edge_list.append((source_node, sink_node))
135 |
136 | return edge_list, id_to_node, source_type, sink_type
137 |
138 |
139 | def read_edges(edges, nodes=None):
140 | """
141 | Read edges and node features
142 |
143 | :param edges: path to comma separated file containing all edges
144 | :param nodes: path to comma separated file containing all nodes + features
145 | :return: (list, list, list, dict) sources, sinks, features and id_to_node dictionary containing mappings
146 | from node names(id) to dgl node indices
147 | """
148 | node_pointer = 0
149 | id_to_node = {}
150 | features = []
151 | sources, sinks = [], []
152 | if nodes is not None:
153 | with open(nodes, "r") as fh:
154 | for line in fh:
155 | node_feats = line.strip().split(",")
156 | node_id = node_feats[0]
157 | if node_id not in id_to_node:
158 | id_to_node[node_id] = node_pointer
159 | node_pointer += 1
160 | if len(node_feats) > 1:
161 | feats = np.array(list(map(float, node_feats[1:])))
162 | features.append(feats)
163 | with open(edges, "r") as fh:
164 | for line in fh:
165 | source, sink = line.strip().split(",")
166 | sources.append(id_to_node[source])
167 | sinks.append(id_to_node[sink])
168 | else:
169 | with open(edges, "r") as fh:
170 | for line in fh:
171 | source, sink = line.strip().split(",")
172 | if source not in id_to_node:
173 | id_to_node[source] = node_pointer
174 | node_pointer += 1
175 | if sink not in id_to_node:
176 | id_to_node[sink] = node_pointer
177 | node_pointer += 1
178 | sources.append(id_to_node[source])
179 | sinks.append(id_to_node[sink])
180 |
181 | return sources, sinks, features, id_to_node
182 |
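183 | # Illustration of the edgelist file format expected by parse_edgelist with header=True, as
184 | # written by graph_data_preprocessor.py. The header row supplies the source and sink node
185 | # types; each following row adds one edge (values are made up for illustration):
186 | #
187 | #   TransactionID,card1
188 | #   2987000,13926
189 | #   2987004,13926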
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/estimator_fns.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import logging
4 |
5 |
6 | def parse_args():
7 | parser = argparse.ArgumentParser()
8 |
9 | parser.add_argument('--training-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
10 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
11 | parser.add_argument('--output-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
12 | parser.add_argument('--nodes', type=str, default='features.csv')
13 | parser.add_argument('--target-ntype', type=str, default='TransactionID')
14 | parser.add_argument('--edges', type=str, default='homogeneous_edgelist.csv')
15 | parser.add_argument('--heterogeneous', type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
16 | default=True, help='use hetero graph')
17 | parser.add_argument('--no-features', type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
18 | default=False, help='do not use node features')
19 | parser.add_argument('--mini-batch', type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
20 | default=True, help='use mini-batch training and sample graph')
21 | parser.add_argument('--labels', type=str, default='tags.csv')
22 | parser.add_argument('--new-accounts', type=str, default='test.csv')
23 | parser.add_argument('--predictions', type=str, default='preds.csv', help='file to save predictions on new-accounts')
24 | parser.add_argument('--compute-metrics', type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
25 | default=True, help='compute evaluation metrics after training')
26 | parser.add_argument('--threshold', type=float, default=0, help='threshold for making predictions, default : argmax')
27 |     parser.add_argument('--model', type=str, default='rgcn', help='gnn to use. options: rgcn, gcn, graphsage, gat, gem')
28 | parser.add_argument('--num-gpus', type=int, default=1)
29 | parser.add_argument('--batch-size', type=int, default=500)
30 | parser.add_argument('--optimizer', type=str, default='adam')
31 | parser.add_argument('--lr', type=float, default=1e-2)
32 | parser.add_argument('--n-epochs', type=int, default=20)
33 | parser.add_argument('--n-neighbors', type=int, default=10, help='number of neighbors to sample')
34 | parser.add_argument('--n-hidden', type=int, default=16, help='number of hidden units')
35 | parser.add_argument('--n-layers', type=int, default=3, help='number of hidden layers')
36 | parser.add_argument('--weight-decay', type=float, default=5e-4, help='Weight for L2 loss')
37 |     parser.add_argument('--dropout', type=float, default=0.2, help='dropout probability (for gat, applied to features only)')
38 | parser.add_argument('--attn-drop', type=float, default=0.6, help='attention dropout for gat/gem')
39 | parser.add_argument('--num-heads', type=int, default=4, help='number of hidden attention heads for gat/gem')
40 | parser.add_argument('--num-out-heads', type=int, default=1, help='number of output attention heads for gat/gem')
41 | parser.add_argument('--residual', action="store_true", default=False, help='use residual connection for gat')
42 |     parser.add_argument('--alpha', type=float, default=0.2, help='the negative slope of leaky relu')
43 | parser.add_argument('--aggregator-type', type=str, default="gcn", help="graphsage aggregator: mean/gcn/pool/lstm")
44 | parser.add_argument('--embedding-size', type=int, default=360, help="embedding size for node embedding")
45 |
46 | return parser.parse_args()
47 |
48 |
49 | def get_logger(name):
50 | logger = logging.getLogger(name)
51 | log_format = '%(asctime)s %(levelname)s %(name)s: %(message)s'
52 | logging.basicConfig(format=log_format, level=logging.INFO)
53 | logger.setLevel(logging.INFO)
54 | return logger
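55 | 
56 | # Note: SageMaker script mode passes the estimator's `hyperparameters` dict to this script
57 | # as command line flags, so the notebook's settings arrive roughly as:
58 | #   --nodes features.csv --edges relation* --labels tags.csv --model rgcn --num-gpus 1 \
59 | #   --batch-size 10000 --embedding-size 64 --n-neighbors 1000 --n-layers 2 --n-epochs 10 \
60 | #   --optimizer adam --lr 0.01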
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/graph.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import dgl
4 | import numpy as np
5 |
6 | from data import *
7 |
8 |
9 | def get_edgelists(edgelist_expression, directory):
10 | if "," in edgelist_expression:
11 | return edgelist_expression.split(",")
12 | files = os.listdir(directory)
13 | compiled_expression = re.compile(edgelist_expression)
14 | return [filename for filename in files if compiled_expression.match(filename)]
15 |
16 | def construct_graph(training_dir, edges, nodes, target_node_type, heterogeneous=True):
17 | if heterogeneous:
18 | print("Getting relation graphs from the following edge lists : {} ".format(edges))
19 | edgelists, id_to_node = {}, {}
20 | for i, edge in enumerate(edges):
21 | edgelist, id_to_node, src, dst = parse_edgelist(os.path.join(training_dir, edge), id_to_node, header=True)
22 | if src == target_node_type:
23 | src = 'target'
24 | if dst == target_node_type:
25 | dst = 'target'
26 | edgelists[(src, 'relation{}'.format(i), dst)] = edgelist
27 | print("Read edges for relation{} from edgelist: {}".format(i, os.path.join(training_dir, edge)))
28 |
29 | # reverse edge list so that relation is undirected
30 | # edgelists[(dst, 'reverse_relation{}'.format(i), src)] = [(b, a) for a, b in edgelist]
31 |
32 | # get features for target nodes
33 | features, new_nodes = get_features(id_to_node[target_node_type], os.path.join(training_dir, nodes))
34 | print("Read in features for target nodes")
35 | # handle target nodes that have features but don't have any connections
36 | # if new_nodes:
37 | # edgelists[('target', 'relation'.format(i+1), 'none')] = [(node, 0) for node in new_nodes]
38 | # edgelists[('none', 'reverse_relation{}'.format(i + 1), 'target')] = [(0, node) for node in new_nodes]
39 |
40 | # add self relation
41 | edgelists[('target', 'self_relation', 'target')] = [(t, t) for t in id_to_node[target_node_type].values()]
42 |
43 | g = dgl.heterograph(edgelists)
44 | print(
45 | "Constructed heterograph with the following metagraph structure: Node types {}, Edge types{}".format(
46 | g.ntypes, g.canonical_etypes))
47 | print("Number of nodes of type target : {}".format(g.number_of_nodes('target')))
48 |
49 | g.nodes['target'].data['features'] = features
50 |
51 | id_to_node = id_to_node[target_node_type]
52 |
53 | else:
54 | sources, sinks, features, id_to_node = read_edges(os.path.join(training_dir, edges[0]),
55 | os.path.join(training_dir, nodes))
56 |
57 | # add self relation
58 | all_nodes = sorted(id_to_node.values())
59 | sources.extend(all_nodes)
60 | sinks.extend(all_nodes)
61 |
62 | g = dgl.graph((sources, sinks))
63 |
64 | if features:
65 | g.ndata['features'] = np.array(features).astype('float32')
66 |
67 | print('read graph from node list and edge list')
68 |
69 | features = g.ndata['features']
70 |
71 | return g, features, id_to_node
72 |
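73 | # Sketch of the edgelists dictionary handed to dgl.heterograph above, with made-up node
74 | # indices. Keys follow the ('src_type', 'relation{i}', 'dst_type') convention used in
75 | # construct_graph:
76 | #
77 | #   {('target', 'relation0', 'card1'): [(0, 0), (1, 0)],
78 | #    ('target', 'relation1', 'P_emaildomain'): [(0, 0), (1, 1)],
79 | #    ('target', 'self_relation', 'target'): [(0, 0), (1, 1)]}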
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/__init__.py
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/mxnet.py:
--------------------------------------------------------------------------------
1 | from mxnet import gluon, nd
2 | from dgl.nn.mxnet import GraphConv, GATConv, SAGEConv
3 | import dgl.function as fn
4 |
5 |
6 | class HeteroRGCNLayer(gluon.Block):
7 | def __init__(self, in_size, out_size, etypes):
8 | super(HeteroRGCNLayer, self).__init__()
9 | # W_r for each relation
10 | with self.name_scope():
11 | self.weight = {name: gluon.nn.Dense(out_size, use_bias=False) for name in etypes}
12 | for child in self.weight.values():
13 | self.register_child(child)
14 |
15 | def forward(self, G, feat_dict):
16 | # The input is a dictionary of node features for each type
17 | funcs = {}
18 | for srctype, etype, dsttype in G.canonical_etypes:
19 | # Compute W_r * h
20 | if srctype in feat_dict:
21 | Wh = self.weight[etype](feat_dict[srctype])
22 | # Save it in graph for message passing
23 | G.srcnodes[srctype].data['Wh_%s' % etype] = Wh
24 | # Specify per-relation message passing functions: (message_func, reduce_func).
25 | # Note that the results are saved to the same destination feature 'h', which
26 | # hints the type wise reducer for aggregation.
27 | funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h'))
28 | # Trigger message passing of multiple types.
29 | # The first argument is the message passing functions for each relation.
30 | # The second one is the type wise reducer, could be "sum", "max",
31 | # "min", "mean", "stack"
32 | G.multi_update_all(funcs, 'sum')
33 | # return the updated node feature dictionary
34 | return {ntype: G.dstnodes[ntype].data['h'] for ntype in G.ntypes if 'h' in G.dstnodes[ntype].data}
35 |
36 |
37 | class HeteroRGCN(gluon.Block):
38 | def __init__(self, g, in_size, hidden_size, out_size, n_layers, embedding_size, ctx):
39 | super(HeteroRGCN, self).__init__()
40 | self.g = g
41 | self.ctx = ctx
42 |
43 | # Use trainable node embeddings as featureless inputs for all non target node types.
44 | with self.name_scope():
45 | self.embed_dict = {ntype: gluon.nn.Embedding(g.number_of_nodes(ntype), embedding_size)
46 | for ntype in g.ntypes if ntype != 'target'}
47 |
48 | for child in self.embed_dict.values():
49 | self.register_child(child)
50 |
51 | # create layers
52 | # input layer
53 | self.layers = gluon.nn.Sequential()
54 | self.layers.add(HeteroRGCNLayer(embedding_size, hidden_size, g.etypes))
55 | # hidden layers
56 | for i in range(n_layers - 1):
57 | self.layers.add(HeteroRGCNLayer(hidden_size, hidden_size, g.etypes))
58 | # output layer
59 | # self.layers.add(HeteroRGCNLayer(hidden_size, out_size, g.etypes))
60 | self.layers.add(gluon.nn.Dense(out_size))
61 |
62 | def forward(self, g, features):
63 | # get embeddings for all node types. for target node type, use passed in target features
64 | h_dict = {'target': features}
65 | for ntype in self.embed_dict:
66 | if g[0].number_of_nodes(ntype) > 0:
67 | h_dict[ntype] = self.embed_dict[ntype](nd.array(g[0].nodes(ntype), self.ctx))
68 |
69 | # pass through all layers
70 | for i, layer in enumerate(self.layers[:-1]):
71 | if i != 0:
72 | h_dict = {k: nd.LeakyReLU(h) for k, h in h_dict.items()}
73 | h_dict = layer(g[i], h_dict)
74 |
75 | # get target logits
76 | # return h_dict['target']
77 | return self.layers[-1](h_dict['target'])
78 |
79 |
80 | class NodeEmbeddingGNN(gluon.Block):
81 | def __init__(self,
82 | gnn,
83 | input_size,
84 | embedding_size):
85 | super(NodeEmbeddingGNN, self).__init__()
86 |
87 | with self.name_scope():
88 | self.embed = gluon.nn.Embedding(input_size, embedding_size)
89 | self.gnn = gnn
90 |
91 | def forward(self, g, nodes):
92 | features = self.embed(nodes)
93 | h = self.gnn(g, features)
94 | return h
95 |
96 |
97 | class GCN(gluon.Block):
98 | def __init__(self,
99 | g,
100 | in_feats,
101 | n_hidden,
102 | n_classes,
103 | n_layers,
104 | activation,
105 | dropout):
106 | super(GCN, self).__init__()
107 | self.g = g
108 | self.layers = gluon.nn.Sequential()
109 | # input layer
110 | self.layers.add(GraphConv(in_feats, n_hidden, activation=activation))
111 | # hidden layers
112 | for i in range(n_layers - 1):
113 | self.layers.add(GraphConv(n_hidden, n_hidden, activation=activation))
114 | # output layer
115 | # self.layers.add(GraphConv(n_hidden, n_classes))
116 | self.layers.add(gluon.nn.Dense(n_classes))
117 | self.dropout = gluon.nn.Dropout(rate=dropout)
118 |
119 | def forward(self, g, features):
120 | h = features
121 | for i, layer in enumerate(self.layers[:-1]):
122 | if i != 0:
123 | h = self.dropout(h)
124 | h = layer(g[i], h)
125 | return self.layers[-1](h)
126 |
127 |
128 | class GraphSAGE(gluon.Block):
129 | def __init__(self,
130 | g,
131 | in_feats,
132 | n_hidden,
133 | n_classes,
134 | n_layers,
135 | activation,
136 | dropout,
137 | aggregator_type):
138 | super(GraphSAGE, self).__init__()
139 | self.g = g
140 |
141 | with self.name_scope():
142 | self.layers = gluon.nn.Sequential()
143 | # input layer
144 | self.layers.add(SAGEConv(in_feats, n_hidden, aggregator_type, feat_drop=dropout, activation=activation))
145 | # hidden layers
146 | for i in range(n_layers - 1):
147 | self.layers.add(SAGEConv(n_hidden, n_hidden, aggregator_type, feat_drop=dropout, activation=activation))
148 | # output layer
149 | self.layers.add(gluon.nn.Dense(n_classes))
150 |
151 | def forward(self, g, features):
152 | h = features
153 | for i, layer in enumerate(self.layers[:-1]):
154 | h_dst = h[:g[i].number_of_dst_nodes()]
155 | h = layer(g[i], (h, h_dst))
156 | return self.layers[-1](h)
157 |
158 |
159 | class GAT(gluon.Block):
160 | def __init__(self,
161 | g,
162 | in_dim,
163 | num_hidden,
164 | num_classes,
165 | num_layers,
166 | heads,
167 | activation,
168 | feat_drop,
169 | attn_drop,
170 | alpha,
171 | residual):
172 | super(GAT, self).__init__()
173 | self.g = g
174 | self.num_layers = num_layers
175 | self.gat_layers = []
176 | self.activation = activation
177 | # input projection (no residual)
178 | self.gat_layers.append(GATConv(
179 | (in_dim, in_dim), num_hidden, heads[0],
180 | feat_drop, attn_drop, alpha, False))
181 | # hidden layers
182 | for l in range(1, num_layers):
183 | # due to multi-head, the in_dim = num_hidden * num_heads
184 | self.gat_layers.append(GATConv(
185 | (num_hidden * heads[l-1], num_hidden * heads[l-1]), num_hidden, heads[l],
186 | feat_drop, attn_drop, alpha, residual))
187 | # output projection
188 | self.output_proj = gluon.nn.Dense(num_classes)
189 | for i, layer in enumerate(self.gat_layers):
190 | self.register_child(layer, "gat_layer_{}".format(i))
191 | self.register_child(self.output_proj, "dense_layer")
192 |
193 | def forward(self, g, inputs):
194 | h = inputs
195 | for l in range(self.num_layers):
196 | h_dst = h[:g[l].number_of_dst_nodes()]
197 | h = self.gat_layers[l](g[l], (h, h_dst)).flatten()
198 | h = self.activation(h)
199 | # output projection
200 | logits = self.output_proj(h)
201 | return logits
202 |
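203 | # Minimal usage sketch for HeteroRGCN (hypothetical sizes; the training entry point builds
204 | # the model from the parsed hyperparameters and the constructed heterograph, and `blocks`
205 | # is the list of sampled blocks produced for mini-batch training, one per layer):
206 | #   model = HeteroRGCN(g, in_size=features.shape[1], hidden_size=16, out_size=2,
207 | #                      n_layers=2, embedding_size=64, ctx=mx.cpu())
208 | #   model.initialize(ctx=mx.cpu())
209 | #   logits = model(blocks, batch_features)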
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/pytorch.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 |
5 | from dgl.nn.pytorch import GraphConv, GATConv, SAGEConv
6 | import dgl.function as fn
7 |
8 |
9 | class HeteroRGCNLayer(nn.Module):
10 | def __init__(self, in_size, out_size, etypes):
11 | super(HeteroRGCNLayer, self).__init__()
12 | # W_r for each relation
13 | self.weight = nn.ModuleDict({
14 | name: nn.Linear(in_size, out_size) for name in etypes
15 | })
16 |
17 | def forward(self, G, feat_dict):
18 | # The input is a dictionary of node features for each type
19 | funcs = {}
20 | for srctype, etype, dsttype in G.canonical_etypes:
21 | # Compute W_r * h
22 | if srctype in feat_dict:
23 | Wh = self.weight[etype](feat_dict[srctype])
24 | # Save it in graph for message passing
25 | G.nodes[srctype].data['Wh_%s' % etype] = Wh
26 | # Specify per-relation message passing functions: (message_func, reduce_func).
27 |                 # Note that the results are saved to the same destination feature 'h', which
28 |                 # is what the cross-type reducer aggregates over.
29 | funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h'))
30 | # Trigger message passing of multiple types.
31 | # The first argument is the message passing functions for each relation.
32 |         # The second is the cross-type reducer, which can be "sum", "max",
33 |         # "min", "mean", or "stack".
34 | G.multi_update_all(funcs, 'sum')
35 | # return the updated node feature dictionary
36 | return {ntype: G.dstnodes[ntype].data['h'] for ntype in G.ntypes if 'h' in G.dstnodes[ntype].data}
37 |
38 |
39 | class HeteroRGCN(nn.Module):
40 | def __init__(self, g, in_size, hidden_size, out_size, n_layers, embedding_size):
41 | super(HeteroRGCN, self).__init__()
42 | # Use trainable node embeddings as featureless inputs.
43 | embed_dict = {ntype: nn.Parameter(torch.Tensor(g.number_of_nodes(ntype), in_size))
44 | for ntype in g.ntypes if ntype != 'user'}
45 | for key, embed in embed_dict.items():
46 | nn.init.xavier_uniform_(embed)
47 | self.embed = nn.ParameterDict(embed_dict)
48 | # create layers
49 | self.layers = nn.Sequential()
50 |         self.layers.add_module('input', HeteroRGCNLayer(embedding_size, hidden_size, g.etypes))
51 |         # hidden layers
52 |         for i in range(n_layers - 1):
53 |             self.layers.add_module('hidden_{}'.format(i), HeteroRGCNLayer(hidden_size, hidden_size, g.etypes))
54 | 
55 |         # output layer
56 |         self.layers.add_module('output', nn.Linear(hidden_size, out_size))
57 |
58 | def forward(self, g, features):
59 | # get embeddings for all node types. for user node type, use passed in user features
60 |         h_dict = {ntype: emb for ntype, emb in self.embed.items()}
61 | h_dict['user'] = features
62 |
63 | # pass through all layers
64 | for i, layer in enumerate(self.layers[:-1]):
65 | if i != 0:
66 | h_dict = {k: F.leaky_relu(h) for k, h in h_dict.items()}
67 | h_dict = layer(g[i], h_dict)
68 |
69 | # get user logits
70 | # return h_dict['user']
71 | return self.layers[-1](h_dict['user'])
72 |
73 |
74 | class NodeEmbeddingGNN(nn.Module):
75 | def __init__(self,
76 | gnn,
77 | input_size,
78 | embedding_size):
79 | super(NodeEmbeddingGNN, self).__init__()
80 |
81 | self.embed = nn.Embedding(input_size, embedding_size)
82 | self.gnn = gnn
83 |
84 | def forward(self, g, nodes):
85 | features = self.embed(nodes)
86 | h = self.gnn(g, features)
87 | return h
88 |
89 |
90 | class GCN(nn.Module):
91 | def __init__(self,
92 | g,
93 | in_feats,
94 | n_hidden,
95 | n_classes,
96 | n_layers,
97 | activation,
98 | dropout):
99 | super(GCN, self).__init__()
100 | self.g = g
101 | self.layers = nn.Sequential()
102 | # input layer
103 |         self.layers.add_module('input', GraphConv(in_feats, n_hidden, activation=activation))
104 |         # hidden layers
105 |         for i in range(n_layers - 1):
106 |             self.layers.add_module('hidden_{}'.format(i), GraphConv(n_hidden, n_hidden, activation=activation))
107 |         # output layer
108 |         # self.layers.add_module('output', GraphConv(n_hidden, n_classes))
109 |         self.layers.add_module('output', nn.Linear(n_hidden, n_classes))
110 | self.dropout = nn.Dropout(p=dropout)
111 |
112 | def forward(self, g, features):
113 | h = features
114 | for i, layer in enumerate(self.layers[:-1]):
115 | if i != 0:
116 | h = self.dropout(h)
117 | h = layer(g, h)
118 | return self.layers[-1](h)
119 |
120 |
121 | class GraphSAGE(nn.Module):
122 | def __init__(self,
123 | g,
124 | in_feats,
125 | n_hidden,
126 | n_classes,
127 | n_layers,
128 | activation,
129 | dropout,
130 | aggregator_type):
131 | super(GraphSAGE, self).__init__()
132 | self.g = g
133 |
134 |         # create layers
135 |         self.layers = nn.Sequential()
136 |         # input layer
137 |         self.layers.add_module('input', SAGEConv(in_feats, n_hidden, aggregator_type, feat_drop=dropout, activation=activation))
138 |         # hidden layers
139 |         for i in range(n_layers - 1):
140 |             self.layers.add_module('hidden_{}'.format(i), SAGEConv(n_hidden, n_hidden, aggregator_type, feat_drop=dropout, activation=activation))
141 |         # output layer
142 |         self.layers.add_module('output', nn.Linear(n_hidden, n_classes))
143 |
144 | def forward(self, g, features):
145 | h = features
146 | for layer in self.layers[:-1]:
147 | h = layer(g, h)
148 | return self.layers[-1](h)
149 |
150 |
151 | class GAT(nn.Module):
152 | def __init__(self,
153 | g,
154 | in_dim,
155 | num_hidden,
156 | num_classes,
157 | num_layers,
158 | heads,
159 | activation,
160 | feat_drop,
161 | attn_drop,
162 | alpha,
163 | residual):
164 | super(GAT, self).__init__()
165 | self.g = g
166 | self.num_layers = num_layers
167 | self.gat_layers = nn.ModuleList()
168 | self.activation = activation
169 | # input projection (no residual)
170 | self.gat_layers.append(GATConv(
171 | in_dim, num_hidden, heads[0],
172 | feat_drop, attn_drop, alpha, False))
173 | # hidden layers
174 | for l in range(1, num_layers):
175 | # due to multi-head, the in_dim = num_hidden * num_heads
176 | self.gat_layers.append(GATConv(
177 | num_hidden * heads[l-1], num_hidden, heads[l],
178 | feat_drop, attn_drop, alpha, residual))
179 | # output projection
180 | self.gat_layers.append(GATConv(
181 | num_hidden * heads[-2], num_classes, heads[-1],
182 | feat_drop, attn_drop, alpha, residual))
183 |
184 | def forward(self, g, inputs):
185 | h = inputs
186 | for l in range(self.num_layers):
187 | h = self.gat_layers[l](g, h).flatten()
188 | h = self.activation(h)
189 | # output projection
190 | logits = self.gat_layers[-1](g, h).mean(1)
191 | return logits
192 |
--------------------------------------------------------------------------------
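HeteroRGCNLayer.forward above registers one (message_func, reduce_func) pair per relation and lets multi_update_all combine the per-relation results with a cross-type reducer. A minimal sketch of that call on a toy heterograph (PyTorch backend; the node/edge types and sizes are hypothetical, and the dict-of-edge-lists constructor is assumed from the DGL 0.4.x API pinned in requirements.txt):

    import torch
    import dgl
    import dgl.function as fn

    G = dgl.heterograph({
        ('user', 'buys', 'item'): [(0, 0), (1, 0)],
        ('item', 'bought_by', 'user'): [(0, 0), (0, 1)],
    })
    # stand-ins for the per-relation W_r * h projections computed above
    G.nodes['user'].data['Wh_buys'] = torch.randn(2, 4)
    G.nodes['item'].data['Wh_bought_by'] = torch.randn(1, 4)
    funcs = {
        'buys': (fn.copy_u('Wh_buys', 'm'), fn.mean('m', 'h')),
        'bought_by': (fn.copy_u('Wh_bought_by', 'm'), fn.mean('m', 'h')),
    }
    G.multi_update_all(funcs, 'sum')        # 'sum' is the cross-type reducer
    print(G.nodes['item'].data['h'].shape)  # (1, 4)
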
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/requirements.txt:
--------------------------------------------------------------------------------
1 | dgl-cu101==0.4.3
2 | matplotlib==3.0.3
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/sampler.py:
--------------------------------------------------------------------------------
1 | import dgl
2 |
3 |
4 | class HeteroGraphNeighborSampler:
5 | """Neighbor sampler on heterogeneous graphs
6 | Parameters
7 | ----------
8 | g : DGLHeteroGraph
9 | Full graph
10 | category : str
11 | Category name of the seed nodes.
12 | nhops : int
13 | Number of hops to sample/number of layers in the node flow.
14 | fanout : int
15 | Fanout of each hop starting from the seed nodes. If a fanout is None,
16 | sample full neighbors.
17 | """
18 | def __init__(self, g, category, nhops, fanout=None):
19 | self.g = g
20 | self.category = category
21 | self.fanouts = [fanout] * nhops
22 |
23 | def sample_block(self, seeds):
24 | blocks = []
25 | seeds = {self.category: seeds}
26 | cur = seeds
27 | for fanout in self.fanouts:
28 | if fanout is None:
29 | frontier = dgl.in_subgraph(self.g, cur)
30 | else:
31 | frontier = dgl.sampling.sample_neighbors(self.g, cur, fanout)
32 | block = dgl.to_block(frontier, cur)
33 | cur = {}
34 | for ntype in block.srctypes:
35 | cur[ntype] = block.srcnodes[ntype].data[dgl.NID]
36 | blocks.insert(0, block)
37 | return blocks, cur[self.category]
38 |
39 |
40 | class NeighborSampler:
41 |     """Neighbor sampler on homogeneous graphs
42 | Parameters
43 | ----------
44 | g : DGLGraph
45 | Full graph
46 | nhops : int
47 | Number of hops to sample/number of layers in the node flow.
48 | fanout : int
49 | Fanout of each hop starting from the seed nodes. If a fanout is None,
50 | sample full neighbors.
51 | """
52 | def __init__(self, g, nhops, fanout=None):
53 | self.g = g
54 | self.fanouts = [fanout] * nhops
55 | self.nhops = nhops
56 |
57 | def sample_block(self, seeds):
58 | blocks = []
59 | for fanout in self.fanouts:
60 | # For each seed node, sample ``fanout`` neighbors.
61 | if fanout is None:
62 | frontier = dgl.in_subgraph(self.g, seeds)
63 | else:
64 | frontier = dgl.sampling.sample_neighbors(self.g, seeds, fanout, replace=False)
65 | # Then we compact the frontier into a bipartite graph for message passing.
66 | block = dgl.to_block(frontier, seeds)
67 | # Obtain the seed nodes for next layer.
68 | seeds = block.srcdata[dgl.NID]
69 |
70 | blocks.insert(0, block)
71 | return blocks, blocks[0].srcdata[dgl.NID]
72 |
73 |
74 | class FullGraphSampler:
75 | """Does nothing and just returns the full graph
76 | Parameters
77 | ----------
78 | g : DGLGraph
79 | Full graph
80 | nhops : int
81 | Number of hops to sample/number of layers in the node flow.
82 | """
83 |
84 | def __init__(self, g, nhops):
85 | self.g = g
86 | self.nhops = nhops
87 |
88 | def sample_block(self, seeds):
89 | return [self.g] * self.nhops, seeds
90 |
--------------------------------------------------------------------------------
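A minimal usage sketch for NeighborSampler (the toy graph and fanout are hypothetical; dgl.graph is assumed from the DGL 0.4.x API, and the MXNet backend is selected as in the training entry point below). sample_block returns one bipartite block per hop, ordered from the outermost hop down to the seeds, plus the IDs of the input nodes whose features feed the first layer:

    import os
    os.environ['DGLBACKEND'] = 'mxnet'
    import dgl
    from mxnet import nd
    from sampler import NeighborSampler  # this module

    g = dgl.graph([(0, 1), (1, 2), (2, 3), (3, 0), (1, 0), (2, 1), (3, 2), (0, 3)])  # toy bidirectional cycle
    sampler = NeighborSampler(g, nhops=2, fanout=2)
    blocks, input_nids = sampler.sample_block(nd.array([0, 1]).astype('int64'))
    print(len(blocks), input_nids)  # 2 blocks; IDs used to index the node feature matrix
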
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/train_dgl_mxnet_entry_point.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.environ['DGLBACKEND'] = 'mxnet'
3 | import mxnet as mx
4 | from mxnet import nd, gluon, autograd
5 | import dgl
6 |
7 | import numpy as np
8 | import pandas as pd
9 |
10 | import time
11 | import logging
12 | import pickle
13 |
14 | from estimator_fns import *
15 | from graph import *
16 | from data import *
17 | from utils import *
18 | from model.mxnet import *
19 | from sampler import *
20 |
21 | def normalize(feature_matrix):
22 | mean = nd.mean(feature_matrix, axis=0)
23 | stdev = nd.sqrt(nd.sum((feature_matrix - mean)**2, axis=0)/feature_matrix.shape[0])
24 | return (feature_matrix - mean) / stdev
25 |
26 |
27 | def get_dataloader(data_size, batch_size, mini_batch=True):
28 | batch_size = batch_size if mini_batch else data_size
29 | train_dataloader = gluon.data.BatchSampler(gluon.data.RandomSampler(data_size), batch_size, 'keep')
30 | test_dataloader = gluon.data.BatchSampler(gluon.data.SequentialSampler(data_size), batch_size, 'keep')
31 |
32 | return train_dataloader, test_dataloader
33 |
34 |
35 | def train(model, trainer, loss, features, labels, train_loader, test_loader, train_g, test_g, train_mask, test_mask,
36 | ctx, n_epochs, batch_size, output_dir, thresh, scale_pos_weight, compute_metrics=True, mini_batch=True):
37 | duration = []
38 | for epoch in range(n_epochs):
39 | tic = time.time()
40 | loss_val = 0.
41 |
42 | for n, batch in enumerate(train_loader):
43 | # logging.info("Iteration: {:05d}".format(n))
44 | node_flow, batch_nids = train_g.sample_block(nd.array(batch).astype('int64'))
45 | batch_indices = nd.array(batch, ctx=ctx)
46 | with autograd.record():
47 | pred = model(node_flow, features[batch_nids.as_in_context(ctx)])
48 | l = loss(pred, labels[batch_indices], mx.nd.expand_dims(scale_pos_weight*train_mask, 1)[batch_indices])
49 | l = l.sum()/len(batch)
50 |
51 | l.backward()
52 | trainer.step(batch_size=1, ignore_stale_grad=True)
53 |
54 | loss_val += l.asscalar()
55 | # logging.info("Current loss {:04f}".format(loss_val/(n+1)))
56 |
57 | duration.append(time.time() - tic)
58 | metric = evaluate(model, train_g, features, labels, train_mask, ctx, batch_size, mini_batch)
59 | logging.info("Epoch {:05d} | Time(s) {:.4f} | Loss {:.4f} | F1 {:.4f}".format(
60 | epoch, np.mean(duration), loss_val/(n+1), metric))
61 |
62 | class_preds, pred_proba = get_model_class_predictions(model, test_g, test_loader, features, ctx, threshold=thresh)
63 |
64 | if compute_metrics:
65 | acc, f1, p, r, roc, pr, ap, cm = get_metrics(class_preds, pred_proba, labels, test_mask, output_dir)
66 | logging.info("Metrics")
67 | logging.info("""Confusion Matrix:
68 | {}
69 | f1: {:.4f}, precision: {:.4f}, recall: {:.4f}, acc: {:.4f}, roc: {:.4f}, pr: {:.4f}, ap: {:.4f}
70 | """.format(cm, f1, p, r, acc, roc, pr, ap))
71 |
72 | return model, class_preds, pred_proba
73 |
74 |
75 | def evaluate(model, g, features, labels, mask, ctx, batch_size, mini_batch=True):
76 | f1 = mx.metric.F1()
77 | preds = []
78 | batch_size = batch_size if mini_batch else features.shape[0]
79 | dataloader = gluon.data.BatchSampler(gluon.data.SequentialSampler(features.shape[0]), batch_size, 'keep')
80 | for batch in dataloader:
81 | node_flow, batch_nids = g.sample_block(nd.array(batch).astype('int64'))
82 | preds.append(model(node_flow, features[batch_nids.as_in_context(ctx)]))
83 | nd.waitall()
84 |
85 | # preds = nd.concat(*preds, dim=0).argmax(axis=1)
86 | preds = nd.concat(*preds, dim=0)
87 | mask = nd.array(np.where(mask.asnumpy()), ctx=ctx)
88 | f1.update(preds=nd.softmax(preds[mask], axis=1).reshape(-3, 0), labels=labels[mask].reshape(-1,))
89 | return f1.get()[1]
90 |
91 |
92 | def get_model_predictions(model, g, dataloader, features, ctx):
93 | pred = []
94 | for batch in dataloader:
95 | node_flow, batch_nids = g.sample_block(nd.array(batch).astype('int64'))
96 | pred.append(model(node_flow, features[batch_nids.as_in_context(ctx)]))
97 | nd.waitall()
98 | return nd.concat(*pred, dim=0)
99 |
100 |
101 | def get_model_class_predictions(model, g, dataloader, features, ctx, threshold=None):
102 |     unnormalized_preds = get_model_predictions(model, g, dataloader, features, ctx)
103 | pred_proba = nd.softmax(unnormalized_preds)[:, 1].asnumpy().flatten()
104 | if not threshold:
105 | return unnormalized_preds.argmax(axis=1).asnumpy().flatten().astype(int), pred_proba
106 | return np.where(pred_proba > threshold, 1, 0), pred_proba
107 |
108 |
109 | def save_prediction(pred, pred_proba, id_to_node, training_dir, new_accounts, output_dir, predictions_file):
110 | prediction_query = read_masked_nodes(os.path.join(training_dir, new_accounts))
111 | pred_indices = np.array([id_to_node[query] for query in prediction_query])
112 |
113 | pd.DataFrame.from_dict({'target': prediction_query,
114 | 'pred_proba': pred_proba[pred_indices],
115 | 'pred': pred[pred_indices]}).to_csv(os.path.join(output_dir, predictions_file),
116 | index=False)
117 |
118 |
119 | def save_model(g, model, model_dir, hyperparams):
120 | model.save_parameters(os.path.join(model_dir, 'model.params'))
121 | with open(os.path.join(model_dir, 'model_hyperparams.pkl'), 'wb') as f:
122 | pickle.dump(hyperparams, f)
123 | with open(os.path.join(model_dir, 'graph.pkl'), 'wb') as f:
124 | pickle.dump(g, f)
125 |
126 |
127 | def get_model(g, hyperparams, in_feats, n_classes, ctx, model_dir=None):
128 |
129 | if model_dir: # load using saved model state
130 | with open(os.path.join(model_dir, 'model_hyperparams.pkl'), 'rb') as f:
131 | hyperparams = pickle.load(f)
132 | with open(os.path.join(model_dir, 'graph.pkl'), 'rb') as f:
133 | g = pickle.load(f)
134 |
135 | if hyperparams['heterogeneous']:
136 | model = HeteroRGCN(g,
137 | in_feats,
138 | hyperparams['n_hidden'],
139 | n_classes,
140 | hyperparams['n_layers'],
141 | hyperparams['embedding_size'],
142 | ctx)
143 | else:
144 | if hyperparams['model'] == 'gcn':
145 | model = GCN(g,
146 | in_feats,
147 | hyperparams['n_hidden'],
148 | n_classes,
149 | hyperparams['n_layers'],
150 | nd.relu,
151 | hyperparams['dropout'])
152 | elif hyperparams['model'] == 'graphsage':
153 | model = GraphSAGE(g,
154 | in_feats,
155 | hyperparams['n_hidden'],
156 | n_classes,
157 | hyperparams['n_layers'],
158 | nd.relu,
159 | hyperparams['dropout'],
160 | hyperparams['aggregator_type'])
161 | else:
162 | heads = ([hyperparams['num_heads']] * hyperparams['n_layers']) + [hyperparams['num_out_heads']]
163 | model = GAT(g,
164 | in_feats,
165 | hyperparams['n_hidden'],
166 | n_classes,
167 | hyperparams['n_layers'],
168 | heads,
169 | gluon.nn.Lambda(lambda data: nd.LeakyReLU(data, act_type='elu')),
170 | hyperparams['dropout'],
171 | hyperparams['attn_drop'],
172 | hyperparams['alpha'],
173 | hyperparams['residual'])
174 |
175 | if hyperparams['no_features']:
176 | model = NodeEmbeddingGNN(model, in_feats, hyperparams['embedding_size'])
177 |
178 | if model_dir:
179 | model.load_parameters(os.path.join(model_dir, 'model.params'))
180 | else:
181 | model.initialize(ctx=ctx)
182 |
183 | return model
184 |
185 |
186 | if __name__ == '__main__':
187 | logging = get_logger(__name__)
188 | logging.info('numpy version:{} MXNet version:{} DGL version:{}'.format(np.__version__,
189 | mx.__version__,
190 | dgl.__version__))
191 |
192 | args = parse_args()
193 |
194 | args.edges = get_edgelists(args.edges, args.training_dir)
195 |
196 | g, features, id_to_node = construct_graph(args.training_dir, args.edges, args.nodes, args.target_ntype,
197 | args.heterogeneous)
198 |
199 | features = normalize(nd.array(features))
200 | if args.heterogeneous:
201 | g.nodes['target'].data['features'] = features
202 | else:
203 | g.ndata['features'] = features
204 |
205 | logging.info("Getting labels")
206 | n_nodes = g.number_of_nodes('target') if args.heterogeneous else g.number_of_nodes()
207 | labels, train_mask, test_mask = get_labels(id_to_node,
208 | n_nodes,
209 | args.target_ntype,
210 | os.path.join(args.training_dir, args.labels),
211 | os.path.join(args.training_dir, args.new_accounts))
212 | logging.info("Got labels")
213 |
214 | labels = nd.array(labels).astype('float32')
215 | train_mask = nd.array(train_mask).astype('float32')
216 | test_mask = nd.array(test_mask).astype('float32')
217 |
218 | n_nodes = sum([g.number_of_nodes(n_type) for n_type in g.ntypes]) if args.heterogeneous else g.number_of_nodes()
219 | n_edges = sum([g.number_of_edges(e_type) for e_type in g.etypes]) if args.heterogeneous else g.number_of_edges()
220 |
221 | logging.info("""----Data statistics------'
222 | #Nodes: {}
223 | #Edges: {}
224 | #Features Shape: {}
225 | #Labeled Train samples: {}
226 | #Unlabeled Test samples: {}""".format(n_nodes,
227 | n_edges,
228 | features.shape,
229 | train_mask.sum().asscalar(),
230 | test_mask.sum().asscalar()))
231 |
232 | if args.num_gpus:
233 | cuda = True
234 | ctx = mx.gpu(0)
235 | else:
236 | cuda = False
237 | ctx = mx.cpu(0)
238 |
239 | logging.info("Initializing Model")
240 | in_feats = args.embedding_size if args.no_features else features.shape[1]
241 | n_classes = 2
242 | model = get_model(g, vars(args), in_feats, n_classes, ctx)
243 | logging.info("Initialized Model")
244 |
245 | if args.no_features:
246 | features = nd.array(g.nodes('target'), ctx) if args.heterogeneous else nd.array(g.nodes(), ctx)
247 | else:
248 | features = features.as_in_context(ctx)
249 |
250 | labels = labels.as_in_context(ctx)
251 | train_mask = train_mask.as_in_context(ctx)
252 | test_mask = test_mask.as_in_context(ctx)
253 |
254 | if not args.heterogeneous:
255 | # normalization
256 | degs = g.in_degrees().astype('float32')
257 | norm = mx.nd.power(degs, -0.5)
258 | if cuda:
259 | norm = norm.as_in_context(ctx)
260 | g.ndata['norm'] = mx.nd.expand_dims(norm, 1)
261 |
262 | if args.mini_batch:
263 | train_g = HeteroGraphNeighborSampler(g, 'target', args.n_layers, args.n_neighbors) if args.heterogeneous\
264 | else NeighborSampler(g, args.n_layers, args.n_neighbors)
265 |
266 | test_g = HeteroGraphNeighborSampler(g, 'target', args.n_layers) if args.heterogeneous\
267 | else NeighborSampler(g, args.n_layers)
268 | else:
269 | train_g, test_g = FullGraphSampler(g, args.n_layers), FullGraphSampler(g, args.n_layers)
270 |
271 | train_data, test_data = get_dataloader(features.shape[0], args.batch_size, args.mini_batch)
272 |
273 | loss = gluon.loss.SoftmaxCELoss()
274 | scale_pos_weight = ((train_mask.shape[0] - train_mask.sum()) / train_mask.sum())
275 |
276 | logging.info(model)
277 | logging.info(model.collect_params())
278 | trainer = gluon.Trainer(model.collect_params(), args.optimizer, {'learning_rate': args.lr, 'wd': args.weight_decay})
279 |
280 | logging.info("Starting Model training")
281 | model, pred, pred_proba = train(model, trainer, loss, features, labels, train_data, test_data, train_g, test_g,
282 | train_mask, test_mask, ctx, args.n_epochs, args.batch_size, args.output_dir,
283 | args.threshold, scale_pos_weight, args.compute_metrics, args.mini_batch)
284 | logging.info("Finished Model training")
285 |
286 | logging.info("Saving model")
287 | save_model(g, model, args.model_dir, vars(args))
288 |
289 | logging.info("Saving model predictions for new accounts")
290 | save_prediction(pred, pred_proba, id_to_node, args.training_dir, args.new_accounts, args.output_dir, args.predictions)
291 |
--------------------------------------------------------------------------------
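In the train loop above, the third argument to SoftmaxCELoss is a per-sample weight built as scale_pos_weight * train_mask, so unlabeled nodes contribute nothing to the loss and labeled nodes are rescaled by a constant factor. A minimal sketch of that weighting on toy values (the numbers are hypothetical):

    from mxnet import nd, gluon

    loss = gluon.loss.SoftmaxCELoss()
    logits = nd.array([[2.0, -1.0], [0.2, 0.4], [0.1, 0.3]])  # unnormalized class scores
    labels = nd.array([0, 1, 1])
    train_mask = nd.array([1, 1, 0])                          # third node is unlabeled
    scale_pos_weight = 5.0
    weights = nd.expand_dims(scale_pos_weight * train_mask, 1)
    l = loss(logits, labels, weights).sum() / logits.shape[0]
    print(l.asscalar())                                       # the masked node contributes zero
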
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pandas as pd
3 | import numpy as np
4 | from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
5 | import networkx as nx
6 | import matplotlib.pyplot as plt
7 |
8 |
9 | def get_metrics(pred, pred_proba, labels, mask, out_dir):
10 | labels, mask = labels.asnumpy().flatten().astype(int), mask.asnumpy().flatten().astype(int)
11 | labels, pred, pred_proba = labels[np.where(mask)], pred[np.where(mask)], pred_proba[np.where(mask)]
12 |
13 | acc = ((pred == labels)).sum() / mask.sum()
14 |
15 | true_pos = (np.where(pred == 1, 1, 0) + np.where(labels == 1, 1, 0) > 1).sum()
16 | false_pos = (np.where(pred == 1, 1, 0) + np.where(labels == 0, 1, 0) > 1).sum()
17 | false_neg = (np.where(pred == 0, 1, 0) + np.where(labels == 1, 1, 0) > 1).sum()
18 | true_neg = (np.where(pred == 0, 1, 0) + np.where(labels == 0, 1, 0) > 1).sum()
19 |
20 | precision = true_pos/(true_pos + false_pos) if (true_pos + false_pos) > 0 else 0
21 | recall = true_pos/(true_pos + false_neg) if (true_pos + false_neg) > 0 else 0
22 |
23 | f1 = 2*(precision*recall)/(precision + recall) if (precision + recall) > 0 else 0
24 | confusion_matrix = pd.DataFrame(np.array([[true_pos, false_pos], [false_neg, true_neg]]),
25 | columns=["labels positive", "labels negative"],
26 | index=["predicted positive", "predicted negative"])
27 |
28 | ap = average_precision_score(labels, pred_proba)
29 |
30 | fpr, tpr, _ = roc_curve(labels, pred_proba)
31 | prc, rec, _ = precision_recall_curve(labels, pred_proba)
32 | roc_auc = auc(fpr, tpr)
33 | pr_auc = auc(rec, prc)
34 |
35 | save_roc_curve(fpr, tpr, roc_auc, os.path.join(out_dir, "roc_curve.png"))
36 | save_pr_curve(prc, rec, pr_auc, ap, os.path.join(out_dir, "pr_curve.png"))
37 |
38 | return acc, f1, precision, recall, roc_auc, pr_auc, ap, confusion_matrix
39 |
40 |
41 | def save_roc_curve(fpr, tpr, roc_auc, location):
42 | f = plt.figure()
43 | lw = 2
44 | plt.plot(fpr, tpr, color='darkorange',
45 | lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
46 | plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
47 | plt.xlim([0.0, 1.0])
48 | plt.ylim([0.0, 1.05])
49 | plt.xlabel('False Positive Rate')
50 | plt.ylabel('True Positive Rate')
51 | plt.title('Model ROC curve')
52 | plt.legend(loc="lower right")
53 | f.savefig(location)
54 |
55 |
56 | def save_pr_curve(precision, recall, pr_auc, ap, location):
57 |     f = plt.figure()
58 |     lw = 2
59 |     plt.plot(recall, precision, color='darkorange',
60 |              lw=lw, label='PR curve (area = %0.2f)' % pr_auc)
61 | plt.xlim([0.0, 1.0])
62 | plt.ylim([0.0, 1.05])
63 | plt.xlabel('Recall')
64 | plt.ylabel('Precision')
65 | plt.title('Model PR curve: AP={0:0.2f}'.format(ap))
66 | plt.legend(loc="lower right")
67 | f.savefig(location)
68 |
69 |
70 | def save_graph_drawing(g, location):
71 | plt.figure(figsize=(12, 8))
72 | node_colors = {node: 0.0 if 'user' in node else 0.5 for node in g.nodes()}
73 | nx.draw(g, node_size=10000, pos=nx.spring_layout(g), with_labels=True, font_size=14,
74 | node_color=list(node_colors.values()), font_color='white')
75 | plt.savefig(location, bbox_inches='tight')
76 |
77 |
--------------------------------------------------------------------------------
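get_metrics above expects NumPy arrays for the class predictions and probabilities and MXNet NDArrays for the labels and mask (it calls .asnumpy() on the latter), and it writes roc_curve.png and pr_curve.png into out_dir. A minimal usage sketch with hypothetical values:

    import numpy as np
    from mxnet import nd
    from utils import get_metrics  # this module

    pred = np.array([1, 0, 1, 0])
    pred_proba = np.array([0.9, 0.2, 0.7, 0.4])
    labels = nd.array([1, 0, 0, 0])
    mask = nd.array([1, 1, 1, 0])  # only the first three nodes are scored
    acc, f1, precision, recall, roc_auc, pr_auc, ap, cm = get_metrics(pred, pred_proba, labels, mask, '.')
    print(cm)
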
/source/sagemaker/sagemaker_graph_fraud_detection/requirements.txt:
--------------------------------------------------------------------------------
1 | sagemaker==1.72.0
2 | awscli>=1.18.140
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 |
4 | setup(
5 | name='sagemaker_graph_fraud_detection',
6 | version='1.0',
7 | description='A package to organize code in the solution',
8 | packages=find_packages(exclude=('test',))
9 | )
--------------------------------------------------------------------------------
/source/sagemaker/setup.sh:
--------------------------------------------------------------------------------
1 | export PIP_DISABLE_PIP_VERSION_CHECK=1
2 |
3 | pip install -r ./sagemaker_graph_fraud_detection/requirements.txt -q
4 | pip install -e ./sagemaker_graph_fraud_detection/
--------------------------------------------------------------------------------
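setup.sh only silences pip's version check, installs the pinned requirements, and installs the sagemaker_graph_fraud_detection package in editable mode; its relative paths assume it is run from the source/sagemaker directory. A likely invocation (the working directory and shell are assumptions):

    cd source/sagemaker && bash setup.sh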