├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── NOTICE ├── README.md ├── deployment ├── sagemaker-graph-fraud-detection.yaml ├── sagemaker-notebook-instance-stack.yaml ├── sagemaker-permissions-stack.yaml └── solution-assistant │ ├── requirements.in │ ├── requirements.txt │ ├── solution-assistant.yaml │ └── src │ └── lambda_function.py ├── docs ├── arch.png ├── details.md └── overview_video_thumbnail.png ├── iam ├── launch.json └── usage.json ├── metadata └── metadata.json └── source ├── lambda ├── data-preprocessing │ └── index.py └── graph-modelling │ └── index.py └── sagemaker ├── baselines ├── mlp-fraud-baseline.ipynb ├── mlp_fraud_entry_point.py ├── utils.py └── xgboost-fraud-baseline.ipynb ├── data-preprocessing ├── buildspec.yml ├── container │ ├── Dockerfile │ └── build_and_push.sh └── graph_data_preprocessor.py ├── dgl-fraud-detection.ipynb ├── sagemaker_graph_fraud_detection ├── __init__.py ├── config.py ├── container_build │ ├── __init__.py │ ├── container_build.py │ └── logs.py ├── dgl_fraud_detection │ ├── __init__.py │ ├── data.py │ ├── estimator_fns.py │ ├── graph.py │ ├── model │ │ ├── __init__.py │ │ ├── mxnet.py │ │ └── pytorch.py │ ├── requirements.txt │ ├── sampler.py │ ├── train_dgl_mxnet_entry_point.py │ └── utils.py ├── requirements.txt └── setup.py └── setup.sh /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. 
Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | 61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes. 62 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 
30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. 
If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Amazon SageMaker and Deep Graph Library for Fraud Detection in Heterogeneous Graphs 2 | 3 | Many online businesses lose billions annually to fraud, but machine learning based fraud detection models can help businesses predict what interactions or users are likely fraudulent and save them from incurring those costs. 4 | 5 | In this project, we formulate the problem of fraud detection as a classification task on a heterogeneous interaction network. The machine learning model is a Graph Neural Network (GNN) that learns latent representations of users or transactions which can then be easily separated into fraud or legitimate. 6 | 7 |

8 | 9 | 10 | 11 |

12 | 
 13 | 
 14 | 
 15 | This project shows how to use [Amazon SageMaker](https://aws.amazon.com/sagemaker/) and [Deep Graph Library (DGL)](https://www.dgl.ai/) to construct a heterogeneous graph from tabular data and train a GNN model to detect fraudulent transactions in the [IEEE-CIS dataset](https://www.kaggle.com/c/ieee-fraud-detection/data).
 16 | 
 17 | See the [details page](docs/details.md) to learn more about the techniques used, and the [online webinar](https://www.youtube.com/watch?v=P_oCAbSYRwY&feature=youtu.be) or [tutorial blog post](https://aws.amazon.com/blogs/machine-learning/detecting-fraud-in-heterogeneous-networks-using-amazon-sagemaker-and-deep-graph-library/) for step-by-step explanations and instructions on how to use this solution.
 18 | 
 19 | ## Getting Started
 20 | 
 21 | You will need an AWS account to use this solution. Sign up for an account [here](https://aws.amazon.com/).
 22 | 
 23 | To run this JumpStart 1P Solution and have the infrastructure deploy to your AWS account, you will need to create an active SageMaker Studio instance (see [Onboard to Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html)). When your Studio instance is *Ready*, use the instructions in [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html) to 1-Click Launch the solution.
 24 | 
 25 | The solution artifacts are included in this GitHub repository for reference.
 26 | 
 27 | *Note*: Solutions are available in most regions, including us-west-2 and us-east-1.
 28 | 
 29 | **Caution**: Cloning this GitHub repository and running the code manually could lead to unexpected issues! Use the AWS CloudFormation template. You'll get an Amazon SageMaker Notebook instance that's been correctly set up and configured to access the other resources in the solution.
 30 | 
 31 | ## Architecture
 32 | 
 33 | The project architecture deployed by the CloudFormation template is shown here.
 34 | 
 35 | ![](docs/arch.png)
 36 | 
 37 | ## Project Organization
 38 | The project is divided into two main modules.
 39 | 
 40 | The [first module](source/sagemaker/data-preprocessing) uses [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to do feature engineering and extract edge lists from a table of transactions or interactions.
 41 | 
 42 | 
 43 | The [second module](source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection) shows how to use DGL to define a GNN model and train the model using [Amazon SageMaker training infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/deep-graph-library.html).
 44 | 
 45 | 
 46 | The [Jupyter notebook](source/sagemaker/dgl-fraud-detection.ipynb) shows how to run the full project on an [example dataset](https://www.kaggle.com/c/ieee-fraud-detection/data), as sketched below.
 47 | 
 48 | 
 49 | The project also contains a [CloudFormation template](deployment/sagemaker-graph-fraud-detection.yaml) that deploys the code in this repo and all AWS resources needed to run the project in an end-to-end manner in the AWS account it's launched in.
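For orientation, here is a minimal sketch of the orchestration the notebook performs with the SageMaker Python SDK: it runs the preprocessing script as a SageMaker Processing job and then launches a DGL/MXNet training job. The bucket name, image URI, and hyperparameter values below are placeholders, not values from this repository; the real notebook reads them from the deployed stack's configuration.

```python
# Minimal, illustrative sketch of the notebook's orchestration.
# S3 paths, the ECR image URI, and hyperparameters are placeholders;
# the deployed notebook reads the real values from the stack outputs.
import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker.mxnet import MXNet

role = sagemaker.get_execution_role()
bucket = "s3://<solution-bucket>"  # placeholder

# 1) Feature engineering / edge-list extraction as a SageMaker Processing job.
processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<preprocessing-repo>:latest",
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
)
processor.run(
    code="source/sagemaker/data-preprocessing/graph_data_preprocessor.py",
    inputs=[ProcessingInput(source=f"{bucket}/raw-data",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination=f"{bucket}/preprocessed-data")],
)

# 2) GNN training with the DGL/MXNet entry point on a SageMaker training instance.
estimator = MXNet(
    entry_point="train_dgl_mxnet_entry_point.py",
    source_dir="source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters={"n-epochs": 100},  # example value only
    output_path=f"{bucket}/training-output",
)
estimator.fit({"train": f"{bucket}/preprocessed-data"})
```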
50 | 51 | ## Contents 52 | 53 | * `deployment/` 54 | * `sagemaker-graph-fraud-detection.yaml`: Creates AWS CloudFormation Stack for solution 55 | * `source/` 56 | * `lambda/` 57 | * `data-preprocessing/` 58 | * `index.py`: Lambda function script for invoking SageMaker processing 59 | * `graph-modelling/` 60 | * `index.py`: Lambda function script for invoking SageMaker training 61 | * `sagemaker/` 62 | * `baselines/` 63 | * `mlp-fraud-baseline.ipynb`: Jupyter notebook for feature based MLP baseline method using SageMaker and MXNet 64 | * `mlp_fraud_entry_point.py`: python entry point used by the MLP baseline notebook for MXNet training/deployment 65 | * `utils.py`: utility functions for baseline notebooks 66 | * `xgboost-fraud-entry-point.ipynb`: Jupyter notebook for feature based XGBoost baseline method using SageMaker 67 | * `data-preprocessing/` 68 | * `container/` 69 | * `Dockerfile`: Describes custom Docker image hosted on Amazon ECR for SageMaker Processing 70 | * `build_and_push.sh`: Script to build Docker image and push to Amazon ECR 71 | * `graph_data_preprocessor.py`: Custom script used by SageMaker Processing for data processing/feature engineering 72 | * `sagemaker_graph_fraud_detection/` 73 | * `dgl_fraud_detection/` 74 | * `model` 75 | * `mxnet.py`: Implements the various graph neural network models used in the project with the mxnet backend 76 | * `data.py`: Contains functions for reading node features and labels 77 | * `estimator_fns.py`: Contains functions for parsing input from SageMaker estimator objects 78 | * `graph.py`: Contains functions for constructing DGL Graphs with node features and edge lists 79 | * `requirements.txt`: Describes Python package requirements of the Amazon SageMaker training instance 80 | * `sampler.py`: Contains functions for graph sampling for mini-batch training 81 | * `train_dgl_mxnet_entry_point.py`: python entry point used by the notebook for GNN training with DGL mxnet backend 82 | * `utils.py`: python script with utility functions for computing metrics and plots 83 | * `config.py`: python file to load stack configurations and pass to sagemaker notebook 84 | * `requirements.txt`: Describes Python package requirements of the SageMaker notebook instance 85 | * `setup.py`: setup sagemaker-graph-fraud-detection as a python package 86 | * `dgl-fraud-detection.ipynb`: Orchestrates the solution. Triggers preprocessing and model training 87 | * `setup.sh`: prepare notebook environment with necessary pre-reqs 88 | 89 | ## License 90 | 91 | This project is licensed under the Apache-2.0 License. 92 | 93 | -------------------------------------------------------------------------------- /deployment/sagemaker-graph-fraud-detection.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | Description: "(SA0002) - sagemaker-graph-fraud-detection: Solution for training a graph neural network model for fraud detection using Amazon SageMaker. Version 1" 3 | Parameters: 4 | SolutionPrefix: 5 | Type: String 6 | Default: "sagemaker-soln-graph-fraud" 7 | Description: | 8 | Used to name resources created as part of this stack (and inside nested stacks too). 9 | Can be the same as the stack name used by AWS CloudFormation, but this field has extra 10 | constraints because it's used to name resources with restrictions (e.g. Amazon S3 bucket 11 | names cannot contain capital letters). 
12 | AllowedPattern: '^sagemaker-soln-[a-z0-9\-]{1,20}$' 13 | ConstraintDescription: | 14 | Only allowed to use lowercase letters, hyphens and/or numbers. 15 | Should also start with 'sagemaker-soln-graph-fraud' for permission management. 16 | IamRole: 17 | Type: String 18 | Default: "" 19 | Description: | 20 | IAM Role that will be attached to the resources created by this cloudformation to grant them permissions to 21 | perform their required functions. This role should allow SageMaker and Lambda perform the required actions like 22 | creating training jobs and processing jobs. If left blank, the template will attempt to create a role for you. 23 | This can cause a stack creation error if you don't have privileges to create new roles. 24 | S3HistoricalTransactionsPrefix: 25 | Description: Enter the S3 prefix where historical transactions/relations are stored. 26 | Type: String 27 | Default: "raw-data" 28 | S3ProcessingJobInputPrefix: 29 | Description: Enter the S3 prefix where inputs should be monitored for changes to start the processing job 30 | Type: String 31 | Default: "processing-input" 32 | S3ProcessingJobOutputPrefix: 33 | Description: Enter the S3 prefix where preprocessed data should be stored and monitored for changes to start the training job 34 | Type: String 35 | Default: "preprocessed-data" 36 | S3TrainingJobOutputPrefix: 37 | Description: Enter the S3 prefix where model and output artifacts from the training job should be stored 38 | Type: String 39 | Default: "training-output" 40 | CreateSageMakerNotebookInstance: 41 | Description: Whether to launch classic sagemaker notebook instance 42 | Type: String 43 | AllowedValues: 44 | - "true" 45 | - "false" 46 | Default: "false" 47 | BuildSageMakerContainersRemotely: 48 | Description: | 49 | Whether to launch a CodeBuild project to build sagemaker containers. 50 | If set to 'true' SageMaker notebook will use the CodeBuild project to launch a build job for sagemaker containers. 51 | If set to 'false' SageMaker notebook will attempt to build solution containers on the notebook instance. 52 | This may lead to some unexpected issues as docker isn't installed in SageMaker studio containers. 53 | Type: String 54 | AllowedValues: 55 | - "true" 56 | - "false" 57 | Default: "true" 58 | SageMakerProcessingJobContainerName: 59 | Description: Name of the SageMaker processing job ECR Container 60 | Type: String 61 | Default: "sagemaker-soln-graph-fraud-preprocessing" 62 | SageMakerProcessingJobInstanceType: 63 | Description: Instance type of the SageMaker processing job 64 | Type: String 65 | Default: "ml.m4.xlarge" 66 | SageMakerTrainingJobInstanceType: 67 | Description: Instance type of the SageMaker processing job 68 | Type: String 69 | Default: "ml.p3.2xlarge" 70 | SageMakerNotebookInstanceType: 71 | Description: Instance type of the SageMaker notebook instance 72 | Type: String 73 | Default: "ml.m4.xlarge" 74 | StackVersion: 75 | Description: | 76 | CloudFormation Stack version. 77 | Use 'release' version unless you are customizing the 78 | CloudFormation templates and solution artifacts. 
79 | Type: String 80 | Default: release 81 | AllowedValues: 82 | - release 83 | - development 84 | 85 | Metadata: 86 | AWS::CloudFormation::Interface: 87 | ParameterGroups: 88 | - 89 | Label: 90 | default: Solution Configuration 91 | Parameters: 92 | - SolutionPrefix 93 | - IamRole 94 | - StackVersion 95 | - 96 | Label: 97 | default: S3 Configuration 98 | Parameters: 99 | - S3HistoricalTransactionsPrefix 100 | - S3ProcessingJobInputPrefix 101 | - S3ProcessingJobOutputPrefix 102 | - S3TrainingJobOutputPrefix 103 | - 104 | Label: 105 | default: SageMaker Configuration 106 | Parameters: 107 | - CreateSageMakerNotebookInstance 108 | - BuildSageMakerContainersRemotely 109 | - SageMakerProcessingJobContainerName 110 | - SageMakerProcessingJobInstanceType 111 | - SageMakerTrainingJobInstanceType 112 | - SageMakerNotebookInstanceType 113 | ParameterLabels: 114 | SolutionPrefix: 115 | default: Solution Resources Name Prefix 116 | IamRole: 117 | default: Solution IAM Role Arn 118 | StackVersion: 119 | default: Solution Stack Version 120 | S3HistoricalTransactionsPrefix: 121 | default: S3 Data Prefix 122 | S3ProcessingJobInputPrefix: 123 | default: S3 Processing Input Prefix 124 | S3ProcessingJobOutputPrefix: 125 | default: S3 Preprocessed Data Prefix 126 | S3TrainingJobOutputPrefix: 127 | default: S3 Training Results Prefix 128 | CreateSageMakerNotebookInstance: 129 | default: Launch Classic SageMaker Notebook Instance 130 | BuildSageMakerContainersRemotely: 131 | default: Build SageMaker containers on AWS CodeBuild 132 | SageMakerProcessingJobContainerName: 133 | default: SageMaker Processing Container Name 134 | SageMakerProcessingJobInstanceType: 135 | default: SageMaker Processing Instance 136 | SageMakerTrainingJobInstanceType: 137 | default: SageMaker Training Instance 138 | SageMakerNotebookInstanceType: 139 | default: SageMaker Notebook Instance 140 | 141 | Mappings: 142 | S3: 143 | release: 144 | BucketPrefix: "sagemaker-solutions" 145 | development: 146 | BucketPrefix: "sagemaker-solutions-build" 147 | Lambda: 148 | DataPreprocessing: 149 | S3Key: "Fraud-detection-in-financial-networks/build/data_preprocessing.zip" 150 | GraphModelling: 151 | S3Key: "Fraud-detection-in-financial-networks/build/graph_modelling.zip" 152 | CodeBuild: 153 | ProcessingContainer: 154 | S3Key: "Fraud-detection-in-financial-networks/build/sagemaker_data_preprocessing.zip" 155 | 156 | Conditions: 157 | CreateClassicSageMakerResources: !Equals [ !Ref CreateSageMakerNotebookInstance, "true" ] 158 | CreateCustomSolutionRole: !Equals [!Ref IamRole, ""] 159 | CreateCodeBuildProject: !Equals [!Ref BuildSageMakerContainersRemotely, "true"] 160 | 161 | Resources: 162 | S3Bucket: 163 | Type: AWS::S3::Bucket 164 | DeletionPolicy: Retain 165 | Properties: 166 | BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 167 | PublicAccessBlockConfiguration: 168 | BlockPublicAcls: true 169 | BlockPublicPolicy: true 170 | IgnorePublicAcls: true 171 | RestrictPublicBuckets: true 172 | BucketEncryption: 173 | ServerSideEncryptionConfiguration: 174 | - 175 | ServerSideEncryptionByDefault: 176 | SSEAlgorithm: AES256 177 | NotificationConfiguration: 178 | LambdaConfigurations: 179 | - 180 | Event: s3:ObjectCreated:* 181 | Function: !GetAtt DataPreprocessingLambda.Arn 182 | Filter: 183 | S3Key: 184 | Rules: 185 | - Name: prefix 186 | Value: !Ref S3ProcessingJobInputPrefix 187 | - 188 | Event: s3:ObjectCreated:* 189 | Function: !GetAtt GraphModellingLambda.Arn 190 | Filter: 191 | S3Key: 192 | Rules: 193 | - Name: prefix 194 
| Value: !Ref S3ProcessingJobOutputPrefix 195 | Metadata: 196 | cfn_nag: 197 | rules_to_suppress: 198 | - id: W35 199 | reason: Configuring logging requires supplying an existing customer S3 bucket to store logs 200 | - id: W51 201 | reason: Default access policy suffices 202 | 203 | SolutionAssistantStack: 204 | Type: "AWS::CloudFormation::Stack" 205 | Properties: 206 | TemplateURL: !Sub 207 | - "https://s3.amazonaws.com/${SolutionRefBucketBase}-${Region}/Fraud-detection-in-financial-networks/deployment/solution-assistant/solution-assistant.yaml" 208 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 209 | Region: !Ref AWS::Region 210 | Parameters: 211 | SolutionPrefix: !Ref SolutionPrefix 212 | SolutionsRefBucketName: !Sub 213 | - "${SolutionRefBucketBase}-${AWS::Region}" 214 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 215 | SolutionS3BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 216 | ECRRepository: !Ref SageMakerProcessingJobContainerName 217 | RoleArn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 218 | 219 | SageMakerPermissionsStack: 220 | Type: "AWS::CloudFormation::Stack" 221 | Condition: CreateCustomSolutionRole 222 | Properties: 223 | TemplateURL: !Sub 224 | - "https://s3.amazonaws.com/${SolutionRefBucketBase}-${Region}/Fraud-detection-in-financial-networks/deployment/sagemaker-permissions-stack.yaml" 225 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 226 | Region: !Ref AWS::Region 227 | Parameters: 228 | SolutionPrefix: !Ref SolutionPrefix 229 | SolutionS3BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 230 | SolutionCodeBuildProject: !Sub "${SolutionPrefix}-processing-job-container-build" 231 | 232 | SageMakerStack: 233 | Type: "AWS::CloudFormation::Stack" 234 | Condition: CreateClassicSageMakerResources 235 | Properties: 236 | TemplateURL: !Sub 237 | - "https://s3.amazonaws.com/${SolutionRefBucketBase}-${Region}/Fraud-detection-in-financial-networks/deployment/sagemaker-notebook-instance-stack.yaml" 238 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 239 | Region: !Ref AWS::Region 240 | Parameters: 241 | SolutionPrefix: !Ref SolutionPrefix 242 | SolutionS3BucketName: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 243 | S3InputDataPrefix: !Ref S3HistoricalTransactionsPrefix 244 | S3ProcessingJobOutputPrefix: !Ref S3ProcessingJobOutputPrefix 245 | S3TrainingJobOutputPrefix: !Ref S3TrainingJobOutputPrefix 246 | SageMakerNotebookInstanceType: !Ref SageMakerNotebookInstanceType 247 | SageMakerProcessingJobContainerName: !Ref SageMakerProcessingJobContainerName 248 | SageMakerProcessingJobContainerCodeBuild: !If [CreateCodeBuildProject, !Ref ProcessingJobContainerBuild, "local"] 249 | NotebookInstanceExecutionRoleArn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 250 | 251 | ProcessingJobContainerBuild: 252 | Condition: CreateCodeBuildProject 253 | Type: AWS::CodeBuild::Project 254 | Properties: 255 | Name: !Sub "${SolutionPrefix}-processing-job-container-build" 256 | Description: !Sub "Build docker container for SageMaker Processing job for ${SolutionPrefix}" 257 | ServiceRole: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 258 | Source: 259 | Type: S3 260 | Location: !Sub 261 | - "${SolutionRefBucketBase}-${AWS::Region}/${SourceKey}" 262 | 
- SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 263 | SourceKey: !FindInMap [CodeBuild, ProcessingContainer, S3Key] 264 | Environment: 265 | ComputeType: BUILD_GENERAL1_MEDIUM 266 | Image: aws/codebuild/standard:4.0 267 | Type: LINUX_CONTAINER 268 | PrivilegedMode: True 269 | EnvironmentVariables: 270 | - Name: ecr_repository 271 | Value: !Ref SageMakerProcessingJobContainerName 272 | - Name: region 273 | Value: !Ref AWS::Region 274 | - Name: account_id 275 | Value: !Ref AWS::AccountId 276 | Artifacts: 277 | Type: NO_ARTIFACTS 278 | Metadata: 279 | cfn_nag: 280 | rules_to_suppress: 281 | - id: W32 282 | reason: overriding encryption requirements for codebuild 283 | 284 | DataPreprocessingLambda: 285 | Type: AWS::Lambda::Function 286 | Properties: 287 | Handler: "index.process_event" 288 | FunctionName: !Sub "${SolutionPrefix}-data-preprocessing" 289 | Role: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 290 | Environment: 291 | Variables: 292 | processing_job_ecr_repository: !Ref SageMakerProcessingJobContainerName 293 | processing_job_input_s3_prefix: !Ref S3ProcessingJobInputPrefix 294 | processing_job_instance_type: !Ref SageMakerProcessingJobInstanceType 295 | processing_job_output_s3_prefix: !Ref S3ProcessingJobOutputPrefix 296 | processing_job_role_arn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 297 | processing_job_s3_bucket: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 298 | processing_job_s3_raw_data_key: !Ref S3HistoricalTransactionsPrefix 299 | Runtime: "python3.7" 300 | Code: 301 | S3Bucket: !Sub 302 | - "${SolutionRefBucketBase}-${Region}" 303 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 304 | Region: !Ref AWS::Region 305 | S3Key: !FindInMap [Lambda, DataPreprocessing, S3Key] 306 | Timeout : 60 307 | MemorySize : 256 308 | Metadata: 309 | cfn_nag: 310 | rules_to_suppress: 311 | - id: W58 312 | reason: Passed in role or created role both have cloudwatch write permissions 313 | DataPreprocessingLambdaPermission: 314 | Type: AWS::Lambda::Permission 315 | Properties: 316 | Action: 'lambda:InvokeFunction' 317 | FunctionName: !Ref DataPreprocessingLambda 318 | Principal: s3.amazonaws.com 319 | SourceArn: !Sub "arn:aws:s3:::${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 320 | SourceAccount: !Ref AWS::AccountId 321 | GraphModellingLambda: 322 | Type: AWS::Lambda::Function 323 | Properties: 324 | Handler: "index.process_event" 325 | FunctionName: !Sub "${SolutionPrefix}-model-training" 326 | Role: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 327 | Environment: 328 | Variables: 329 | training_job_instance_type: !Ref SageMakerTrainingJobInstanceType 330 | training_job_output_s3_prefix: !Ref S3TrainingJobOutputPrefix 331 | training_job_role_arn: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 332 | training_job_s3_bucket: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 333 | Runtime: "python3.7" 334 | Code: 335 | S3Bucket: !Sub 336 | - "${SolutionRefBucketBase}-${Region}" 337 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 338 | Region: !Ref AWS::Region 339 | S3Key: !FindInMap [Lambda, GraphModelling, S3Key] 340 | Timeout : 60 341 | MemorySize : 256 342 | Metadata: 343 | cfn_nag: 344 | rules_to_suppress: 345 | - id: W58 346 | reason: Passed in 
role or created role both have cloudwatch write permissions 347 | GraphModellingLambdaPermission: 348 | Type: AWS::Lambda::Permission 349 | Properties: 350 | Action: 'lambda:InvokeFunction' 351 | FunctionName: !Ref GraphModellingLambda 352 | Principal: s3.amazonaws.com 353 | SourceAccount: !Ref AWS::AccountId 354 | 355 | Outputs: 356 | SourceCode: 357 | Condition: CreateClassicSageMakerResources 358 | Description: "Open Jupyter IDE. This authenticate you against Jupyter." 359 | Value: !GetAtt SageMakerStack.Outputs.SourceCode 360 | 361 | NotebookInstance: 362 | Condition: CreateClassicSageMakerResources 363 | Description: "SageMaker Notebook instance to manually orchestrate data preprocessing and model training" 364 | Value: !GetAtt SageMakerStack.Outputs.NotebookInstance 365 | 366 | AccountID: 367 | Description: "AWS Account ID to be passed downstream to the notebook instance" 368 | Value: !Ref AWS::AccountId 369 | 370 | AWSRegion: 371 | Description: "AWS Region to be passed downstream to the notebook instance" 372 | Value: !Ref AWS::Region 373 | 374 | IamRole: 375 | Description: "Arn of SageMaker Execution Role" 376 | Value: !If [CreateCustomSolutionRole, !GetAtt SageMakerPermissionsStack.Outputs.SageMakerRoleArn, !Ref IamRole] 377 | 378 | SolutionPrefix: 379 | Description: "Solution Prefix for naming SageMaker transient resources" 380 | Value: !Ref SolutionPrefix 381 | 382 | SolutionS3Bucket: 383 | Description: "Solution S3 bucket name" 384 | Value: !Sub "${SolutionPrefix}-${AWS::AccountId}-${AWS::Region}" 385 | 386 | S3InputDataPrefix: 387 | Description: "S3 bucket prefix for raw data" 388 | Value: !Ref S3HistoricalTransactionsPrefix 389 | 390 | S3ProcessingJobOutputPrefix: 391 | Description: "S3 bucket prefix for processed data" 392 | Value: !Ref S3ProcessingJobOutputPrefix 393 | 394 | S3TrainingJobOutputPrefix: 395 | Description: "S3 bucket prefix for trained model and other artifacts" 396 | Value: !Ref S3TrainingJobOutputPrefix 397 | 398 | SageMakerProcessingJobContainerName: 399 | Description: "ECR Container name for SageMaker processing job" 400 | Value: !Ref SageMakerProcessingJobContainerName 401 | 402 | SageMakerProcessingJobContainerBuild: 403 | Description: "Code build project for remotely building the sagemaker preprocessing container" 404 | Value: !If [CreateCodeBuildProject, !Ref ProcessingJobContainerBuild, "local"] -------------------------------------------------------------------------------- /deployment/sagemaker-notebook-instance-stack.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | Description: "(SA0002) - sagemaker-graph-fraud-detection SageMaker stack" 3 | Parameters: 4 | SolutionPrefix: 5 | Description: Enter the name of the prefix for the solution used for naming 6 | Type: String 7 | SolutionS3BucketName: 8 | Description: Enter the name of the S3 bucket for the solution 9 | Type: String 10 | S3InputDataPrefix: 11 | Description: S3 prefix where raw data is stored 12 | Type: String 13 | Default: "raw-data" 14 | S3ProcessingJobOutputPrefix: 15 | Description: S3 prefix where preprocessed data is stored after processing 16 | Type: String 17 | Default: "preprocessed-data" 18 | S3TrainingJobOutputPrefix: 19 | Description: S3 prefix where training outputs are stored after training 20 | Type: String 21 | Default: "training-output" 22 | SageMakerNotebookInstanceType: 23 | Description: Instance type of the SageMaker notebook instance 24 | Type: String 25 | Default: "ml.t3.medium" 26 | 
SageMakerProcessingJobContainerName: 27 | Description: Name of the SageMaker processing job ECR Container 28 | Type: String 29 | Default: "sagemaker-soln-graph-fraud-preprocessing" 30 | SageMakerProcessingJobContainerCodeBuild: 31 | Description: Name of the SageMaker processing container code build project 32 | Type: String 33 | Default: "local" 34 | NotebookInstanceExecutionRoleArn: 35 | Type: String 36 | Description: Execution Role for the SageMaker notebook instance 37 | StackVersion: 38 | Type: String 39 | Description: CloudFormation Stack version. 40 | Default: "release" 41 | 42 | Mappings: 43 | S3: 44 | release: 45 | BucketPrefix: "sagemaker-solutions" 46 | development: 47 | BucketPrefix: "sagemaker-solutions-build" 48 | SageMaker: 49 | Source: 50 | S3Key: "Fraud-detection-in-financial-networks/source/sagemaker/" 51 | 52 | Resources: 53 | NotebookInstance: 54 | Type: AWS::SageMaker::NotebookInstance 55 | Properties: 56 | DirectInternetAccess: Enabled 57 | InstanceType: !Ref SageMakerNotebookInstanceType 58 | LifecycleConfigName: !GetAtt LifeCycleConfig.NotebookInstanceLifecycleConfigName 59 | NotebookInstanceName: !Sub "${SolutionPrefix}-notebook-instance" 60 | RoleArn: !Ref NotebookInstanceExecutionRoleArn 61 | VolumeSizeInGB: 120 62 | Metadata: 63 | cfn_nag: 64 | rules_to_suppress: 65 | - id: W1201 66 | reason: Solution does not have KMS encryption enabled by default 67 | LifeCycleConfig: 68 | Type: AWS::SageMaker::NotebookInstanceLifecycleConfig 69 | Properties: 70 | NotebookInstanceLifecycleConfigName: !Sub "${SolutionPrefix}-nb-lifecycle-config" 71 | OnCreate: 72 | - Content: 73 | Fn::Base64: !Sub 74 | - | 75 | cd /home/ec2-user/SageMaker 76 | aws s3 cp --recursive s3://${SolutionsRefBucketBase}-${AWS::Region}/${SolutionsRefSource} . 77 | touch stack_outputs.json 78 | echo '{' >> stack_outputs.json 79 | echo ' "AccountID": "${AWS::AccountId}",' >> stack_outputs.json 80 | echo ' "AWSRegion": "${AWS::Region}",' >> stack_outputs.json 81 | echo ' "IamRole": "${NotebookInstanceExecutionRoleArn}",' >> stack_outputs.json 82 | echo ' "SolutionPrefix": "${SolutionPrefix}",' >> stack_outputs.json 83 | echo ' "SolutionS3Bucket": "${SolutionS3BucketName}",' >> stack_outputs.json 84 | echo ' "S3InputDataPrefix": "${S3InputDataPrefix}",' >> stack_outputs.json 85 | echo ' "S3ProcessingJobOutputPrefix": "${S3ProcessingJobOutputPrefix}",' >> stack_outputs.json 86 | echo ' "S3TrainingJobOutputPrefix": "${S3TrainingJobOutputPrefix}",' >> stack_outputs.json 87 | echo ' "SageMakerProcessingJobContainerName": "${SageMakerProcessingJobContainerName}",' >> stack_outputs.json 88 | echo ' "SageMakerProcessingJobContainerBuild": "${SageMakerProcessingJobContainerCodeBuild}"' >> stack_outputs.json 89 | echo '}' >> stack_outputs.json 90 | sudo chown -R ec2-user:ec2-user . 91 | - SolutionsRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 92 | SolutionsRefSource: !FindInMap [SageMaker, Source, S3Key] 93 | Outputs: 94 | SourceCode: 95 | Description: "Open Jupyter IDE. This authenticate you against Jupyter." 
96 | Value: !Sub "https://console.aws.amazon.com/sagemaker/home?region=${AWS::Region}#/notebook-instances/openNotebook/${SolutionPrefix}-notebook-instance?view=classic" 97 | NotebookInstance: 98 | Description: "SageMaker Notebook instance to manually orchestrate data preprocessing and model training" 99 | Value: !Sub "https://${SolutionPrefix}-notebook-instance.notebook.${AWS::Region}.sagemaker.aws/notebooks/dgl-fraud-detection.ipynb" -------------------------------------------------------------------------------- /deployment/sagemaker-permissions-stack.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | Description: "(SA0002) - sagemaker-graph-fraud-detection SageMaker permissions stack" 3 | Parameters: 4 | SolutionPrefix: 5 | Description: Enter the name of the prefix for the solution used for naming 6 | Type: String 7 | Default: "sagemaker-soln-graph-fraud-detection" 8 | SolutionS3BucketName: 9 | Description: Enter the name of the S3 bucket for the solution 10 | Type: String 11 | Default: "sagemaker-soln-*" 12 | SolutionCodeBuildProject: 13 | Description: Enter the name of the Code build project for the solution 14 | Type: String 15 | Default: "sagemaker-soln-*" 16 | StackVersion: 17 | Description: Enter the name of the template stack version 18 | Type: String 19 | Default: "release" 20 | 21 | Mappings: 22 | S3: 23 | release: 24 | BucketPrefix: "sagemaker-solutions" 25 | development: 26 | BucketPrefix: "sagemaker-solutions-build" 27 | 28 | Resources: 29 | NotebookInstanceExecutionRole: 30 | Type: AWS::IAM::Role 31 | Properties: 32 | RoleName: !Sub "${SolutionPrefix}-${AWS::Region}-nb-role" 33 | AssumeRolePolicyDocument: 34 | Version: '2012-10-17' 35 | Statement: 36 | - Effect: Allow 37 | Principal: 38 | AWS: 39 | - !Sub "arn:aws:iam::${AWS::AccountId}:root" 40 | Service: 41 | - sagemaker.amazonaws.com 42 | - lambda.amazonaws.com 43 | - codebuild.amazonaws.com 44 | Action: 45 | - 'sts:AssumeRole' 46 | Metadata: 47 | cfn_nag: 48 | rules_to_suppress: 49 | - id: W28 50 | reason: Needs to be explicitly named to tighten launch permissions policy 51 | 52 | NotebookInstanceIAMPolicy: 53 | Type: AWS::IAM::Policy 54 | Properties: 55 | PolicyName: !Sub "${SolutionPrefix}-nb-instance-policy" 56 | Roles: 57 | - !Ref NotebookInstanceExecutionRole 58 | PolicyDocument: 59 | Version: '2012-10-17' 60 | Statement: 61 | - Effect: Allow 62 | Action: 63 | - sagemaker:CreateTrainingJob 64 | - sagemaker:DescribeTrainingJob 65 | - sagemaker:CreateProcessingJob 66 | - sagemaker:DescribeProcessingJob 67 | Resource: 68 | - !Sub "arn:aws:sagemaker:${AWS::Region}:${AWS::AccountId}:*" 69 | - Effect: Allow 70 | Action: 71 | - ecr:GetAuthorizationToken 72 | - ecr:GetDownloadUrlForLayer 73 | - ecr:BatchGetImage 74 | - ecr:PutImage 75 | - ecr:BatchCheckLayerAvailability 76 | - ecr:CreateRepository 77 | - ecr:DescribeRepositories 78 | - ecr:InitiateLayerUpload 79 | - ecr:CompleteLayerUpload 80 | - ecr:UploadLayerPart 81 | - ecr:TagResource 82 | - ecr:DescribeImages 83 | - ecr:BatchDeleteImage 84 | Resource: 85 | - "*" 86 | - !Sub "arn:aws:ecr:${AWS::Region}:${AWS::AccountId}:repository/*" 87 | - Effect: Allow 88 | Action: 89 | - codebuild:BatchGetBuilds 90 | - codebuild:StartBuild 91 | Resource: 92 | - !Sub "arn:aws:codebuild:${AWS::Region}:${AWS::AccountId}:project/${SolutionCodeBuildProject}" 93 | - !Sub "arn:aws:codebuild:${AWS::Region}:${AWS::AccountId}:build/*" 94 | - Effect: Allow 95 | Action: 96 | - cloudwatch:PutMetricData 97 | - 
cloudwatch:GetMetricData 98 | - cloudwatch:GetMetricStatistics 99 | - cloudwatch:ListMetrics 100 | Resource: 101 | - !Sub "arn:aws:cloudwatch:${AWS::Region}:${AWS::AccountId}:*" 102 | - Effect: Allow 103 | Action: 104 | - logs:CreateLogGroup 105 | - logs:CreateLogStream 106 | - logs:DescribeLogStreams 107 | - logs:GetLogEvents 108 | - logs:PutLogEvents 109 | Resource: 110 | - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/sagemaker/*" 111 | - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/*" 112 | - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/codebuild/*" 113 | - Effect: Allow 114 | Action: 115 | - iam:PassRole 116 | Resource: 117 | - !GetAtt NotebookInstanceExecutionRole.Arn 118 | Condition: 119 | StringEquals: 120 | iam:PassedToService: sagemaker.amazonaws.com 121 | - Effect: Allow 122 | Action: 123 | - iam:GetRole 124 | Resource: 125 | - !GetAtt NotebookInstanceExecutionRole.Arn 126 | - Effect: Allow 127 | Action: 128 | - s3:ListBucket 129 | - s3:GetObject 130 | - s3:PutObject 131 | - s3:GetObjectVersion 132 | - s3:DeleteObject 133 | - s3:DeleteBucket 134 | Resource: 135 | - !Sub "arn:aws:s3:::${SolutionS3BucketName}" 136 | - !Sub "arn:aws:s3:::${SolutionS3BucketName}/*" 137 | - !Sub 138 | - "arn:aws:s3:::${SolutionRefBucketBase}-${Region}" 139 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 140 | Region: !Ref AWS::Region 141 | - !Sub 142 | - "arn:aws:s3:::${SolutionRefBucketBase}-${Region}/*" 143 | - SolutionRefBucketBase: !FindInMap [S3, !Ref StackVersion, BucketPrefix] 144 | Region: !Ref AWS::Region 145 | - Effect: Allow 146 | Action: 147 | - s3:CreateBucket 148 | - s3:ListBucket 149 | - s3:GetObject 150 | - s3:GetObjectVersion 151 | - s3:PutObject 152 | - s3:DeleteObject 153 | Resource: 154 | - !Sub "arn:aws:s3:::sagemaker-${AWS::Region}-${AWS::AccountId}" 155 | - !Sub "arn:aws:s3:::sagemaker-${AWS::Region}-${AWS::AccountId}/*" 156 | Metadata: 157 | cfn_nag: 158 | rules_to_suppress: 159 | - id: W12 160 | reason: ECR GetAuthorizationToken is non resource-specific action 161 | 162 | Outputs: 163 | SageMakerRoleArn: 164 | Description: "SageMaker Execution Role for the solution" 165 | Value: !GetAtt NotebookInstanceExecutionRole.Arn -------------------------------------------------------------------------------- /deployment/solution-assistant/requirements.in: -------------------------------------------------------------------------------- 1 | crhelper 2 | -------------------------------------------------------------------------------- /deployment/solution-assistant/requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile 3 | # To update, run: 4 | # 5 | # pip-compile requirements.in 6 | # 7 | crhelper==2.0.6 # via -r requirements.in 8 | -------------------------------------------------------------------------------- /deployment/solution-assistant/solution-assistant.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: 2010-09-09 2 | Description: Stack for Solution Helper resources. 3 | Parameters: 4 | SolutionPrefix: 5 | Description: Used as a prefix to name all stack resources. 6 | Type: String 7 | SolutionsRefBucketName: 8 | Description: Amazon S3 Bucket containing solutions 9 | Type: String 10 | SolutionS3BucketName: 11 | Description: Amazon S3 Bucket used to store trained model and data. 
12 | Type: String 13 | ECRRepository: 14 | Description: Amazon ECR Repository containing container images for processing job. 15 | Type: String 16 | RoleArn: 17 | Description: Role to use for lambda resource 18 | Type: String 19 | Mappings: 20 | Function: 21 | SolutionAssistant: 22 | S3Key: "Fraud-detection-in-financial-networks/build/solution_assistant.zip" 23 | Resources: 24 | SolutionAssistant: 25 | Type: "Custom::SolutionAssistant" 26 | Properties: 27 | ServiceToken: !GetAtt SolutionAssistantLambda.Arn 28 | SolutionS3BucketName: !Ref SolutionS3BucketName 29 | ECRRepository: !Ref ECRRepository 30 | SolutionAssistantLambda: 31 | Type: AWS::Lambda::Function 32 | Properties: 33 | Handler: "lambda_function.handler" 34 | FunctionName: !Sub "${SolutionPrefix}-solution-assistant" 35 | Role: !Ref RoleArn 36 | Runtime: "python3.8" 37 | Code: 38 | S3Bucket: !Ref SolutionsRefBucketName 39 | S3Key: !FindInMap 40 | - Function 41 | - SolutionAssistant 42 | - S3Key 43 | Timeout : 60 44 | Metadata: 45 | cfn_nag: 46 | rules_to_suppress: 47 | - id: W58 48 | reason: Passed in role has cloudwatch write permissions 49 | -------------------------------------------------------------------------------- /deployment/solution-assistant/src/lambda_function.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import sys 3 | 4 | sys.path.append('./site-packages') 5 | from crhelper import CfnResource 6 | 7 | helper = CfnResource() 8 | 9 | 10 | @helper.create 11 | def on_create(_, __): 12 | pass 13 | 14 | @helper.update 15 | def on_update(_, __): 16 | pass 17 | 18 | 19 | def delete_s3_objects(bucket_name): 20 | s3_resource = boto3.resource("s3") 21 | try: 22 | s3_resource.Bucket(bucket_name).objects.all().delete() 23 | print( 24 | "Successfully deleted objects in bucket " 25 | "called '{}'.".format(bucket_name) 26 | ) 27 | except s3_resource.meta.client.exceptions.NoSuchBucket: 28 | print( 29 | "Could not find bucket called '{}'. " 30 | "Skipping delete.".format(bucket_name) 31 | ) 32 | 33 | def delete_ecr_images(repository_name): 34 | ecr_client = boto3.client("ecr") 35 | try: 36 | images = ecr_client.describe_images(repositoryName=repository_name) 37 | image_details = images["imageDetails"] 38 | if len(image_details) > 0: 39 | image_ids = [ 40 | {"imageDigest": i["imageDigest"]} for i in image_details 41 | ] 42 | ecr_client.batch_delete_image( 43 | repositoryName=repository_name, imageIds=image_ids 44 | ) 45 | print( 46 | "Successfully deleted {} images from repository " 47 | "called '{}'. ".format(len(image_details), repository_name) 48 | ) 49 | else: 50 | print( 51 | "Could not find any images in repository " 52 | "called '{}' not found. " 53 | "Skipping delete.".format(repository_name) 54 | ) 55 | except ecr_client.exceptions.RepositoryNotFoundException: 56 | print( 57 | "Could not find repository called '{}' not found. " 58 | "Skipping delete.".format(repository_name) 59 | ) 60 | 61 | 62 | 63 | def delete_s3_bucket(bucket_name): 64 | s3_resource = boto3.resource("s3") 65 | try: 66 | s3_resource.Bucket(bucket_name).delete() 67 | print( 68 | "Successfully deleted bucket " 69 | "called '{}'.".format(bucket_name) 70 | ) 71 | except s3_resource.meta.client.exceptions.NoSuchBucket: 72 | print( 73 | "Could not find bucket called '{}'. 
" 74 | "Skipping delete.".format(bucket_name) 75 | ) 76 | 77 | 78 | @helper.delete 79 | def on_delete(event, __): 80 | 81 | # delete ecr container repo 82 | repository_name = event["ResourceProperties"]["ECRRepository"] 83 | delete_ecr_images(repository_name) 84 | 85 | # remove files in s3 and delete bucket 86 | solution_bucket = event["ResourceProperties"]["SolutionS3BucketName"] 87 | delete_s3_objects(solution_bucket) 88 | delete_s3_bucket(solution_bucket) 89 | 90 | 91 | def handler(event, context): 92 | helper(event, context) 93 | -------------------------------------------------------------------------------- /docs/arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/docs/arch.png -------------------------------------------------------------------------------- /docs/details.md: -------------------------------------------------------------------------------- 1 | # Fraud Detection with Graph Neural Networks 2 | 3 | ## Overview 4 | 5 | Graph Neural Networks (GNNs) have shown promising results in solving problems in various domains from recommendations to fraud detection. For fraud detection, GNN techniques are especially powerful because they can learn representations of users, transactions and other entities in an inductive fashion. This means GNNs are able to model potentially fraudulent users and transactions based on both their features and connectivity patterns in an interaction graph. This allows GNN based fraud detections methods to be resilient towards evasive or camouflaging techniques that are employed by malicious users to fool rules based systems or simple feature based models. However, real world applications of GNNs for fraud detection have been limited due to the complex infrastructure required to train large graphs. This project addresses this issue by using Deep Graph Library (DGL) and Amazon SageMaker to manage the complexity of training a GNN on large graphs for fraud detection. 6 | 7 | DGL is an easy-to-use, high performance and scalable Python package for deep learning on graphs. It supports the major deep learning frameworks (Pytorch, MXNet and Tensorflow) as a backend. This project uses DGL to define the graph and implement the GNN models and performs all of the modeling and training using Amazon SageMaker managed resources. 8 | 9 | ## Problem Description and GNN Formulation 10 | 11 | Many businesses lose billions annually to fraud but machine learning based fraud detection models can help businesses predict based on training data what interactions or users are likely fraudulent or malicious and save them from incurring those costs. In this project, we formulate the problem of fraud detection as a classification task, where the machine learning model is a Graph Neural Network that learns good latent representations that can be easily separated into fraud and legitimate. The model is trained using historical transactions or interactions data that contains ground-truth labels for some of the transactions/users. 12 | 13 | The interaction data is assumed to be in the form of a relational table or a set of relational tables. The tables record interactions between a user and other users or other relevant entities. From this table, we extract all the different kind of relations and create edge lists per relation type. 
In order to make the node representation learning inductive, we also assume that the data contains some attributes or information about the user. We use these attributes, if they are present, to create initial feature vectors. We can also encode temporal attributes extracted from the interaction table into the user features to capture the temporal behavior of the user, in the case where our interaction data is timestamped. 14 | 15 | Using the edge lists, we construct a heterogeneous graph which contains the user nodes and all the other node types corresponding to relevant entities in the edge lists. A heterogeneous graph is one where user/account nodes and possibly other entities have several kinds of distinct relationships. Examples of use cases that fall under this include 16 | 17 | 18 | * a financial network where users transact with other users as well as specific financial institutions or applications 19 | * a gaming network where users interact with other users but also with distinct games or devices 20 | * a social network where users can have different types of links to other users 21 | 22 | 23 | Once the graph is constructed, we define an R-GCN model to learn representations for the graph nodes. The R-GCN module is connected to a fully connected neural network layer to perform classification based on the node representations learned by the R-GCN. The full model is trained end-to-end in a semi-supervised fashion, where the training loss is computed only using information from nodes that have labels. 24 | 25 | Overall, the project is divided into two modules: 26 | 27 | 28 | * The [first module](../source/sagemaker/data-preprocessing) uses [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to construct a heterogeneous graph with node features from a relational table of transactions or interactions. 29 | 30 | 31 | 32 | * The [second module](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection) shows how to use DGL to define a Graph Neural Network model and train the model using [Amazon SageMaker training infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/deep-graph-library.html). 33 | 34 | 35 | To run the full project end to end, use the [Jupyter notebook](../source/sagemaker/dgl-fraud-detection.ipynb). 36 | 37 | ## Data Processing and Feature Engineering 38 | 39 | The data processing and feature engineering steps convert the data from relational form in a table to a set of edge lists and features for the user nodes. 40 | 41 | Amazon SageMaker Processing is used to perform the data processing and feature engineering. The Amazon SageMaker Processing ScriptProcessor requires a Docker container with the processing environment and dependencies, and a processing script that defines the actual data processing implementation. All the artifacts necessary for building the processing environment Docker container are in the [_container_ folder](../source/sagemaker/data-preprocessing/container). The [_Dockerfile_](../source/sagemaker/data-preprocessing/container/Dockerfile) specifies the content of the container. The only requirement for the data processing script is pandas, so that's the only package installed in the container. 42 | 43 | The actual data processing script is [_graph_data_preprocessor.py_](../source/sagemaker/data-preprocessing/graph_data_preprocessor.py), and it can be launched as a SageMaker Processing job as sketched below.
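A minimal sketch of how the processing job can be launched with the ScriptProcessor from the SageMaker Python SDK is shown below. The account, region, repository, role, bucket and column arguments are placeholders for illustration and need to be replaced with the values for your deployment:

```python
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Placeholder values for illustration only; substitute the ones for your deployment.
account_id = "111122223333"
region = "us-east-1"
ecr_repository = "sagemaker-soln-graph-fraud-preprocessing"
role = "arn:aws:iam::111122223333:role/sagemaker-soln-example-role"
bucket = "sagemaker-soln-example-bucket"

script_processor = ScriptProcessor(
    image_uri="{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account_id, region, ecr_repository),
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
)

script_processor.run(
    code="data-preprocessing/graph_data_preprocessor.py",
    inputs=[ProcessingInput(source="s3://{}/raw-data".format(bucket),
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://{}/preprocessed-data".format(bucket))],
    arguments=["--id-cols", "card1,card2,addr1", "--cat-cols", "M1,M2,M3"],
)
```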
The script accepts the transaction data and the identity attributes as input arguments and performs a train/validation split by assigning to the validation set the new users that joined after the period specified by the train_days argument. The script then extracts all the various relations from the relation table and performs deduplication before writing the relations to an output folder. The script also performs feature engineering to encode the categorical features and the temporal features. Finally, if the construct_homogeneous argument is passed in, the script also writes a homogeneous edge list, consisting only of edges between user nodes, to the output folder. 44 | 45 | Once the SageMaker Processing instance finishes running the script, the files in the output folder are uploaded to S3 for graph modeling and training. 46 | 47 | 48 | ## Graph Modeling and Training 49 | 50 | The graph modeling and training code is implemented using DGL with MXNet as the backend framework and is designed to be run on the managed SageMaker training instances. The [_dgl_fraud_detection_ folder](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection) contains the code that is run on those training instances. The supported graph neural network models are defined in the [_model_ package](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model), and helper functions for graph construction are implemented in [_data.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/data.py). The graph sampling functions for mini-batch graph training are implemented in [_sampler.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/sampler.py), and [_utils.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/utils.py) contains utility functions. The entry point for graph modeling and training is [_train_dgl_mxnet_entry_point.py_](../source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/train_dgl_mxnet_entry_point.py). 51 | 52 | The entry point script orchestrates the entire graph training process by going through the following steps: 53 | 54 | 55 | * Reading in the edge lists and user features to construct the graph using the DGLGraph or DGLHeteroGraph API 56 | * Reading in the labels for the target nodes and masking the labels for target nodes that won't have labels during training 57 | * Creating the Graph Neural Network model 58 | * Initializing the DataLoader and the graph sampler if performing mini-batch graph training 59 | * Initializing the model parameters and training the model 60 | 61 | 62 | At the end of model training, the script saves the models and metrics or predictions to the output folder, which gets uploaded to S3 before the SageMaker training instance is terminated. 63 | 64 | 65 | 66 | ## FAQ 67 | 68 | ### What is fraud detection? 69 | 70 | Fraud is when a malicious actor illicitly or deceitfully tries to obtain goods or services that a business provides. Fraud detection is a set of techniques for identifying fraudulent cases and distinguishing them from normal or legitimate cases. In this project, we model fraud detection as a semi-supervised learning process, where some users have already been labelled as fraudulent or legitimate, and other users have no labels during training. The task is to use the trained model to infer the labels for the unlabelled users. 71 | 72 | ### What are Graphs? 73 | 74 | Graphs are a data structure that can be used to represent relationships between entities.
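To make the definition concrete, a tiny heterogeneous graph can be constructed with DGL as follows (recent DGL API; the node types, IDs and relations are invented purely for illustration):

```python
import dgl

# Toy heterogeneous graph with three node types (user, device, card) and two
# relation types; the node IDs and edges are made up for illustration.
graph = dgl.heterograph({
    ("user", "uses", "device"): ([0, 1, 2], [0, 0, 1]),
    ("user", "shares_card", "card"): ([0, 2], [0, 0]),
})

print(graph.ntypes)  # node types in the graph
print(graph.etypes)  # relation (edge) types in the graph
print(graph)         # summary of node and edge counts per type
```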
Graphs are a convenient and flexible way of representing information about interacting entities, and can easily be used to model many real-world processes. A graph consists of a set of entities called nodes, where pairs of nodes are connected by links called edges. Many systems that exist in the world are networks that are naturally expressed as graphs. Graphs can be directed, if the edges have an orientation or are asymmetric, while undirected graphs have symmetric edges. A homogeneous graph consists of nodes and edges of one type, while a heterogeneous graph allows multiple node types and/or edge types. 75 | 76 | 77 | ### What are Graph Neural Networks? 78 | 79 | Graph Neural Networks are a family of neural networks that use the graph structure directly to learn useful representations for nodes and edges in a graph and to solve graph-based tasks like node classification, link prediction or graph classification. Effectively, graph neural networks are message passing networks that learn node representations by using deep learning techniques to aggregate information from neighboring nodes and edges. Popular examples of Graph Neural Networks are [Graph Convolutional Networks (GCN)](https://arxiv.org/abs/1609.02907), [GraphSAGE](https://arxiv.org/abs/1706.02216), [Graph Attention Networks (GAT)](https://arxiv.org/abs/1710.10903) and [Relational Graph Convolutional Networks (R-GCN)](https://arxiv.org/abs/1703.06103). 80 | 81 | 82 | ### What makes Graph Neural Networks useful? 83 | 84 | One reason Graph Neural Networks are useful is that they can learn both inductive and transductive representations, whereas classical graph learning techniques, like random walks and graph factorizations, can only learn transductive representations. A transductive representation is one that applies specifically to a particular node instance, while an inductive representation is one that can leverage the features of the node, and change as the node features change, allowing for better generalization. Additionally, varying the depth of a Graph Neural Network allows the network to integrate topologically distant information into the representation of a particular node. Graph Neural Networks are also end-to-end differentiable, so they can be trained jointly with a downstream, task-specific model, which allows the downstream model to supervise and tailor the representations learned by the GNN. 85 | 86 | 87 | ### How do Graph Neural Networks work? 88 | 89 | Graph Neural Networks are trained, like other Deep Neural Networks, by using gradient-based optimizers like SGD or Adam to learn network parameter values that optimize a particular loss function. As with other neural networks, this is performed by running a forward step to compute the feature representations and the loss function, a backward step to compute the gradients of the loss with respect to the network parameters, and an optimization step to update the network parameter values with the computed gradients. Graph Neural Networks are unique in the forward step: they compute the intermediate representations by a process known as 'message passing'. For a particular node, this involves using the graph structure to collect all or a subset of the neighboring nodes and edges. At each layer, the intermediate representations of the neighboring nodes and edges are aggregated into a single message, which is combined with the previous intermediate representation of the node to form the new node representation.
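To make the forward step more concrete, here is a minimal, framework-free sketch of a single message-passing layer with mean aggregation. The function and weight names are illustrative assumptions and this is not the project's actual R-GCN implementation:

```python
import numpy as np

def message_passing_layer(node_features, neighbors, weight_self, weight_neigh):
    """One illustrative message-passing layer with mean aggregation.

    node_features: (num_nodes, d_in) array of current node representations
    neighbors: dict mapping node id -> list of neighbor node ids
    weight_self, weight_neigh: (d_in, d_out) weight matrices
    """
    num_nodes, d_in = node_features.shape
    new_features = np.zeros((num_nodes, weight_self.shape[1]))
    for v in range(num_nodes):
        # Aggregate the neighbors' current representations into a single message.
        neigh = neighbors.get(v, [])
        message = node_features[neigh].mean(axis=0) if neigh else np.zeros(d_in)
        # Combine the message with the node's own previous representation.
        new_features[v] = np.tanh(node_features[v] @ weight_self + message @ weight_neigh)
    return new_features
```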
At the earliest layers, a node's representation is informed by its ego network (its immediate neighbors), but at later layers, the node's representation is informed by the current representations of the node's neighbors, which were themselves informed earlier by those neighbors' neighbors, thus extending the sphere of influence to nodes that are multiple hops away. 90 | 91 | 92 | ### What is an R-GCN Model? 93 | 94 | The [Relational Graph Convolutional Network (R-GCN)](https://arxiv.org/abs/1703.06103) model is a GNN that specifically models different edge types and node types differently during message passing and aggregation. Thus, it is especially effective for learning on heterogeneous graphs, and R-GCN is the default model used in this project for node representation learning. It is based on the simpler GCN architecture but adapted for multi-relational data. 95 | 96 | -------------------------------------------------------------------------------- /docs/overview_video_thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/docs/overview_video_thumbnail.png -------------------------------------------------------------------------------- /iam/launch.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2012-10-17", 3 | "Statement": [ 4 | { 5 | "Effect": "Allow", 6 | "Action": [ 7 | "servicecatalog:*" 8 | ], 9 | "Resource": [ 10 | "*" 11 | ] 12 | }, 13 | { 14 | "Effect": "Allow", 15 | "Action": [ 16 | "cloudformation:CreateStack" 17 | ], 18 | "Resource": [ 19 | "arn:aws:cloudformation:*:*:stack/sagemaker-soln-*" 20 | ], 21 | "Condition" : { 22 | "ForAllValues:StringLike" : { 23 | "cloudformation:TemplateUrl" : [ 24 | "https://s3.amazonaws.com/sagemaker-solutions-us-east-1/*", 25 | "https://s3-us-east-2.amazonaws.com/sagemaker-solutions-us-east-2/*", 26 | "https://s3-us-west-2.amazonaws.com/sagemaker-solutions-us-west-2/*", 27 | "https://s3.amazonaws.com/sagemaker-solutions-*", 28 | "https://s3.amazonaws.com/sc-*", 29 | "https://s3.us-east-2.amazonaws.com/sc-*", 30 | "https://s3.us-west-2.amazonaws.com/sc-*" 31 | ] 32 | } 33 | } 34 | }, 35 | { 36 | "Effect": "Allow", 37 | "Action": [ 38 | "cloudformation:DeleteStack" 39 | ], 40 | "Resource": "arn:aws:cloudformation:*:*:stack/sagemaker-soln-*" 41 | }, 42 | { 43 | "Effect": "Allow", 44 | "Action": [ 45 | "cloudformation:DescribeStackEvents", 46 | "cloudformation:DescribeStacks", 47 | "cloudformation:DescribeStackResource", 48 | "cloudformation:DescribeStackResources", 49 | "cloudformation:GetTemplate", 50 | "cloudformation:GetTemplateSummary", 51 | "cloudformation:ValidateTemplate", 52 | "cloudformation:ListStacks", 53 | "cloudformation:ListStackResources" 54 | ], 55 | "Resource": "*" 56 | }, 57 | { 58 | "Action": [ 59 | "s3:GetObject" 60 | ], 61 | "Resource": [ 62 | "arn:aws:s3:::sc-*/*" 63 | ], 64 | "Effect": "Allow" 65 | }, 66 | { 67 | "Effect": "Allow", 68 | "Action": [ 69 | "iam:PassRole" 70 | ], 71 | "Resource": "arn:aws:iam::*:role/sagemaker-soln-*" 72 | }, 73 | { 74 | "Effect": "Allow", 75 | "Action": [ 76 | "codebuild:CreateProject", 77 | "codebuild:DeleteProject" 78 | ], 79 | "Resource": "arn:aws:codebuild:*:*:project/sagemaker-soln-*" 80 | }, 81 | { 82 | "Effect": "Allow", 83 | "Action": [ 84 | "lambda:AddPermission", 85 | "lambda:RemovePermission", 86 | "lambda:CreateFunction", 87 | "lambda:DeleteFunction", 88 | "lambda:GetFunction",
89 | "lambda:GetFunctionConfiguration", 90 | "lambda:InvokeFunction" 91 | ], 92 | "Resource": "arn:aws:lambda:*:*:function:sagemaker-soln-*" 93 | }, 94 | { 95 | "Effect": "Allow", 96 | "Action": [ 97 | "s3:CreateBucket", 98 | "s3:DeleteBucket", 99 | "s3:PutBucketNotification", 100 | "s3:PutBucketPublicAccessBlock", 101 | "s3:PutEncryptionConfiguration" 102 | ], 103 | "Resource": "arn:aws:s3:::sagemaker-soln-*" 104 | }, 105 | { 106 | "Effect": "Allow", 107 | "Action": [ 108 | "s3:GetObject" 109 | ], 110 | "Resource": "arn:aws:s3:::sagemaker-solutions-*/*" 111 | }, 112 | { 113 | "Effect": "Allow", 114 | "Action": [ 115 | "sagemaker:CreateNotebookInstance", 116 | "sagemaker:StopNotebookInstance", 117 | "sagemaker:DescribeNotebookInstance", 118 | "sagemaker:DeleteNotebookInstance" 119 | ], 120 | "Resource": "arn:aws:sagemaker:*:*:notebook-instance/sagemaker-soln-*" 121 | }, 122 | { 123 | "Effect": "Allow", 124 | "Action": [ 125 | "sagemaker:CreateNotebookInstanceLifecycleConfig", 126 | "sagemaker:DescribeNotebookInstanceLifecycleConfig", 127 | "sagemaker:DeleteNotebookInstanceLifecycleConfig" 128 | ], 129 | "Resource": "arn:aws:sagemaker:*:*:notebook-instance-lifecycle-config/sagemaker-soln-*" 130 | } 131 | ] 132 | } -------------------------------------------------------------------------------- /iam/usage.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2012-10-17", 3 | "Statement": [ 4 | { 5 | "Action": [ 6 | "sagemaker:CreateTrainingJob", 7 | "sagemaker:DescribeTrainingJob", 8 | "sagemaker:CreateProcessingJob", 9 | "sagemaker:DescribeProcessingJob" 10 | ], 11 | "Resource": [ 12 | "arn:aws:sagemaker:*:*:*" 13 | ], 14 | "Effect": "Allow" 15 | }, 16 | { 17 | "Action": [ 18 | "ecr:GetAuthorizationToken", 19 | "ecr:GetDownloadUrlForLayer", 20 | "ecr:BatchGetImage", 21 | "ecr:PutImage", 22 | "ecr:BatchCheckLayerAvailability", 23 | "ecr:CreateRepository", 24 | "ecr:DescribeRepositories", 25 | "ecr:InitiateLayerUpload", 26 | "ecr:CompleteLayerUpload", 27 | "ecr:UploadLayerPart", 28 | "ecr:TagResource", 29 | "ecr:DescribeImages", 30 | "ecr:BatchDeleteImage" 31 | ], 32 | "Resource": [ 33 | "*", 34 | "arn:aws:ecr:*:*:repository/*" 35 | ], 36 | "Effect": "Allow" 37 | }, 38 | { 39 | "Action": [ 40 | "codebuild:BatchGetBuilds", 41 | "codebuild:StartBuild" 42 | ], 43 | "Resource": [ 44 | "arn:aws:codebuild:*:*:project/sagemaker-soln-*", 45 | "arn:aws:codebuild:*:*:build/*" 46 | ], 47 | "Effect": "Allow" 48 | }, 49 | { 50 | "Action": [ 51 | "cloudwatch:PutMetricData", 52 | "cloudwatch:GetMetricData", 53 | "cloudwatch:GetMetricStatistics", 54 | "cloudwatch:ListMetrics" 55 | ], 56 | "Resource": [ 57 | "arn:aws:cloudwatch:*:*:*" 58 | ], 59 | "Effect": "Allow" 60 | }, 61 | { 62 | "Action": [ 63 | "logs:CreateLogGroup", 64 | "logs:CreateLogStream", 65 | "logs:DescribeLogStreams", 66 | "logs:GetLogEvents", 67 | "logs:PutLogEvents" 68 | ], 69 | "Resource": [ 70 | "arn:aws:logs:*:*:log-group:/aws/sagemaker/*" 71 | ], 72 | "Effect": "Allow" 73 | }, 74 | { 75 | "Condition": { 76 | "StringEquals": { 77 | "iam:PassedToService": "sagemaker.amazonaws.com" 78 | } 79 | }, 80 | "Action": [ 81 | "iam:PassRole" 82 | ], 83 | "Resource": [ 84 | "arn:aws:iam::*:role/sagemaker-soln-*" 85 | ], 86 | "Effect": "Allow" 87 | }, 88 | { 89 | "Action": [ 90 | "iam:GetRole" 91 | ], 92 | "Resource": [ 93 | "arn:aws:iam::*:role/sagemaker-soln-*" 94 | ], 95 | "Effect": "Allow" 96 | }, 97 | { 98 | "Action": [ 99 | "s3:ListBucket", 100 | "s3:DeleteBucket", 101 | "s3:GetObject", 102 
| "s3:PutObject", 103 | "s3:DeleteObject" 104 | ], 105 | "Resource": [ 106 | "arn:aws:s3:::sagemaker-solutions-*", 107 | "arn:aws:s3:::sagemaker-solutions-*/*", 108 | "arn:aws:s3:::sagemaker-soln-*", 109 | "arn:aws:s3:::sagemaker-soln-*/*" 110 | ], 111 | "Effect": "Allow" 112 | }, 113 | { 114 | "Action": [ 115 | "s3:CreateBucket", 116 | "s3:ListBucket", 117 | "s3:GetObject", 118 | "s3:PutObject", 119 | "s3:DeleteObject" 120 | ], 121 | "Resource": [ 122 | "arn:aws:s3:::sagemaker-*", 123 | "arn:aws:s3:::sagemaker-*/*" 124 | ], 125 | "Effect": "Allow" 126 | } 127 | ] 128 | } -------------------------------------------------------------------------------- /metadata/metadata.json: -------------------------------------------------------------------------------- 1 | { 2 | "id": "sagemaker-soln-gfd", 3 | "name": "Amazon SageMaker and Deep Graph Library for Fraud Detection in Heterogeneous Graphs", 4 | "shortName": "Graph Fraud Detection", 5 | "priority": 0, 6 | "desc": "Many online businesses lose billions annually to fraud, but machine learning based fraud detection models can help businesses predict what interactions or users are likely fraudulent and save them from incurring those costs. In this project, we formulate the problem of fraud detection as a classification task on a heterogeneous interaction network. The machine learning model is a Graph Neural Network (GNN) that learns latent representations of users or transactions which can then be easily separated into fraud or legitimate.", 7 | "meta": "fraud classification detection graph gnn network interaction", 8 | "tags": ["financial services", "fraud detection", "internet retail"], 9 | "parameters": [ 10 | { 11 | "name": "SolutionPrefix", 12 | "type": "text", 13 | "default": "sagemaker-soln-graph-fraud" 14 | }, 15 | { 16 | "name": "IamRole", 17 | "type": "text", 18 | "default": "" 19 | }, 20 | { 21 | "name": "S3HistoricalTransactionsPrefix", 22 | "type": "text", 23 | "default": "raw-data" 24 | }, 25 | { 26 | "name": "S3ProcessingJobInputPrefix", 27 | "type": "text", 28 | "default": "processing-input" 29 | }, 30 | { 31 | "name": "S3ProcessingJobOutputPrefix", 32 | "type": "text", 33 | "default": "preprocessed-data" 34 | }, 35 | { 36 | "name": "S3TrainingJobOutputPrefix", 37 | "type": "text", 38 | "default": "training-output" 39 | }, 40 | { 41 | "name": "CreateSageMakerNotebookInstance", 42 | "type": "text", 43 | "default": "false" 44 | }, 45 | { 46 | "name": "BuildSageMakerContainersRemotely", 47 | "type": "text", 48 | "default": "true" 49 | }, 50 | { 51 | "name": "SageMakerProcessingJobContainerName", 52 | "type": "text", 53 | "default": "sagemaker-soln-graph-fraud-preprocessing" 54 | }, 55 | { 56 | "name": "SageMakerProcessingJobInstanceType", 57 | "type": "text", 58 | "default": "ml.m4.xlarge" 59 | }, 60 | { 61 | "name": "SageMakerTrainingJobInstanceType", 62 | "type": "text", 63 | "default": "ml.p3.2xlarge" 64 | }, 65 | { 66 | "name": "SageMakerNotebookInstanceType", 67 | "type": "text", 68 | "default": "ml.m4.xlarge" 69 | }, 70 | { 71 | "name": "StackVersion", 72 | "type": "text", 73 | "default": "release" 74 | } 75 | ], 76 | "acknowledgements": ["CAPABILITY_IAM","CAPABILITY_NAMED_IAM"], 77 | "cloudFormationTemplate": "s3-us-east-2.amazonaws.com/sagemaker-solutions-build-us-east-2/Fraud-detection-in-financial-networks/deployment/sagemaker-graph-fraud-detection.yaml", 78 | "serviceCatalogProduct": "TBD", 79 | "copyS3Source": "sagemaker-solutions-build-us-east-2", 80 | "copyS3SourcePrefix": 
"Fraud-detection-in-financial-networks/source/sagemaker", 81 | "notebooksDirectory": "Fraud-detection-in-financial-networks/source/sagemaker", 82 | "notebookPaths": [ 83 | "Fraud-detection-in-financial-networks/source/sagemaker/dgl-fraud-detection.ipynb" 84 | ], 85 | "permissions": "TBD" 86 | } -------------------------------------------------------------------------------- /source/lambda/data-preprocessing/index.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | from time import strftime, gmtime 5 | 6 | s3_client = boto3.client('s3') 7 | S3_BUCKET = os.environ['processing_job_s3_bucket'] 8 | DATA_PREFIX = os.environ['processing_job_s3_raw_data_key'] 9 | INPUT_PREFIX = os.environ['processing_job_input_s3_prefix'] 10 | OUTPUT_PREFIX = os.environ['processing_job_output_s3_prefix'] 11 | INSTANCE_TYPE = os.environ['processing_job_instance_type'] 12 | ROLE_ARN = os.environ['processing_job_role_arn'] 13 | IMAGE_URI = os.environ['processing_job_ecr_repository'] 14 | 15 | 16 | def process_event(event, context): 17 | print(event) 18 | 19 | event_source_s3 = event['Records'][0]['s3'] 20 | 21 | print("S3 Put event source: {}".format(get_full_path(event_source_s3))) 22 | timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime()) 23 | 24 | inputs = prepare_preprocessing_inputs(timestamp) 25 | outputs = prepare_preprocessing_output(event_source_s3, timestamp) 26 | response = run_preprocessing_job(inputs, outputs, timestamp) 27 | 28 | print(response) 29 | return response 30 | 31 | 32 | def prepare_preprocessing_inputs(timestamp, s3_bucket=S3_BUCKET, data_prefix=DATA_PREFIX, input_prefix=INPUT_PREFIX): 33 | print("Preparing Inputs") 34 | key = os.path.join(input_prefix, timestamp) 35 | 36 | print("Copying raw data from {} to {}".format(get_full_s3_path(s3_bucket, data_prefix), 37 | get_full_s3_path(s3_bucket, key))) 38 | objects = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=data_prefix) 39 | files = [content['Key'] for content in objects['Contents']] 40 | verify(files) 41 | 42 | for file in files: 43 | dest = os.path.join(key, os.path.basename(file)) 44 | s3_client.copy({'Bucket': s3_bucket, 'Key': file}, s3_bucket, dest) 45 | 46 | return get_full_s3_path(s3_bucket, key) 47 | 48 | 49 | def prepare_preprocessing_output(event_source_s3, timestamp, s3_bucket=S3_BUCKET, output_prefix=OUTPUT_PREFIX): 50 | print("Preparing Output") 51 | copy_source = { 52 | 'Bucket': event_source_s3['bucket']['name'], 53 | 'Key': event_source_s3['object']['key'] 54 | } 55 | 56 | key = os.path.join(output_prefix, timestamp, os.path.basename(event_source_s3['object']['key'])) 57 | 58 | destination = get_full_s3_path(s3_bucket, key) 59 | print("Copying new accounts from {} to {}".format(get_full_path(event_source_s3), destination)) 60 | s3_client.copy(copy_source, s3_bucket, key) 61 | return get_full_s3_path(s3_bucket, os.path.join(output_prefix, timestamp)) 62 | 63 | 64 | def verify(files, s3_bucket=S3_BUCKET, data_prefix=DATA_PREFIX, expected_files=['transaction.csv', 'identity.csv']): 65 | if not all([file in list(map(os.path.basename, files)) for file in expected_files]): 66 | raise Exception("Raw data absent or incomplete in {}".format(get_full_s3_path( 67 | s3_bucket, data_prefix))) 68 | 69 | 70 | def get_full_s3_path(bucket, key): 71 | return os.path.join('s3://', bucket, key) 72 | 73 | 74 | def get_full_path(event_source_s3): 75 | return get_full_s3_path(event_source_s3['bucket']['name'], event_source_s3['object']['key']) 76 | 77 | 78 | 
def run_preprocessing_job(input, 79 | output, 80 | timestamp, 81 | s3_bucket=S3_BUCKET, 82 | input_prefix=INPUT_PREFIX, 83 | instance_type=INSTANCE_TYPE, 84 | image_uri=IMAGE_URI 85 | ): 86 | print("Creating SageMaker Processing job with inputs from {} and outputs to {}".format(input, output)) 87 | 88 | sagemaker_client = boto3.client('sagemaker') 89 | 90 | region = boto3.session.Session().region_name 91 | account_id = boto3.client('sts').get_caller_identity().get('Account') 92 | ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, image_uri) 93 | 94 | # upload code 95 | code_file = 'data-preprocessing/graph_data_preprocessor.py' 96 | code_file_s3_key = os.path.join(input_prefix, timestamp, code_file) 97 | s3_client.upload_file(code_file, s3_bucket, code_file_s3_key) 98 | 99 | entrypoint = ["python3"] + [os.path.join("/opt/ml/processing/input/code", 100 | os.path.basename(code_file))] 101 | 102 | app_spec = { 103 | 'ImageUri': ecr_repository_uri, 104 | 'ContainerEntrypoint': entrypoint, 105 | 'ContainerArguments': ['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain', 106 | '--cat-cols','M1,M2,M3,M4,M5,M6,M7,M8,M9'] 107 | } 108 | 109 | processing_inputs = [ 110 | { 111 | 'InputName': 'input1', 112 | 'S3Input': { 113 | 'S3Uri': input, 114 | 'LocalPath': '/opt/ml/processing/input', 115 | 'S3DataType': 'S3Prefix', 116 | 'S3InputMode': 'File', 117 | } 118 | }, 119 | { 120 | 'InputName': 'code', 121 | 'S3Input': { 122 | 'S3Uri': get_full_s3_path(s3_bucket, code_file_s3_key), 123 | 'LocalPath': '/opt/ml/processing/input/code', 124 | 'S3DataType': 'S3Prefix', 125 | 'S3InputMode': 'File', 126 | } 127 | }, 128 | 129 | ] 130 | processing_output = {'Outputs': [{'OutputName': 'output1', 131 | 'S3Output': {'S3Uri': output, 132 | 'LocalPath': '/opt/ml/processing/output', 133 | 'S3UploadMode': 'EndOfJob'} 134 | }]} 135 | 136 | processing_job_name = "sagemaker-graph-fraud-data-processing-{}".format(timestamp) 137 | resources = { 138 | 'ClusterConfig': { 139 | 'InstanceCount': 1, 140 | 'InstanceType': instance_type, 141 | 'VolumeSizeInGB': 30 142 | } 143 | } 144 | 145 | network_config = {'EnableNetworkIsolation': False} 146 | stopping_condition = {'MaxRuntimeInSeconds': 3600} 147 | 148 | response = sagemaker_client.create_processing_job(ProcessingInputs=processing_inputs, 149 | ProcessingOutputConfig=processing_output, 150 | ProcessingJobName=processing_job_name, 151 | ProcessingResources=resources, 152 | StoppingCondition=stopping_condition, 153 | AppSpecification=app_spec, 154 | NetworkConfig=network_config, 155 | RoleArn=ROLE_ARN) 156 | return response 157 | -------------------------------------------------------------------------------- /source/lambda/graph-modelling/index.py: -------------------------------------------------------------------------------- 1 | import os 2 | import boto3 3 | import json 4 | import tarfile 5 | from time import strftime, gmtime 6 | 7 | s3_client = boto3.client('s3') 8 | S3_BUCKET = os.environ['training_job_s3_bucket'] 9 | OUTPUT_PREFIX = os.environ['training_job_output_s3_prefix'] 10 | INSTANCE_TYPE = os.environ['training_job_instance_type'] 11 | ROLE_ARN = os.environ['training_job_role_arn'] 12 | 13 | 14 | def process_event(event, context): 15 | print(event) 16 | 17 | event_source_s3 = event['Records'][0]['s3'] 18 | 19 | print("S3 Put event source: {}".format(get_full_path(event_source_s3))) 20 | train_input, msg = verify_modelling_inputs(event_source_s3) 21 | print(msg) 22 | if not 
train_input: 23 | return msg 24 | timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime()) 25 | response = run_modelling_job(timestamp, train_input) 26 | return response 27 | 28 | 29 | def verify_modelling_inputs(event_source_s3): 30 | training_kickoff_signal = "tags.csv" 31 | if training_kickoff_signal not in event_source_s3['object']['key']: 32 | msg = "Event source was not the training signal. Triggered by {} but expected folder to contain {}" 33 | return False, msg.format(get_full_s3_path(event_source_s3['bucket']['name'], event_source_s3['object']['key']), 34 | training_kickoff_signal) 35 | 36 | training_folder = os.path.dirname(event_source_s3['object']['key']) 37 | full_s3_training_folder = get_full_s3_path(event_source_s3['bucket']['name'], training_folder) 38 | 39 | objects = s3_client.list_objects_v2(Bucket=event_source_s3['bucket']['name'], Prefix=training_folder) 40 | files = [content['Key'] for content in objects['Contents']] 41 | print("Contents of training data folder :") 42 | print("\n".join(files)) 43 | minimum_expected_files = ['features.csv', 'tags.csv'] 44 | 45 | if not all([file in [os.path.basename(s3_file) for s3_file in files] for file in minimum_expected_files]): 46 | return False, "Training data absent or incomplete in {}".format(full_s3_training_folder) 47 | 48 | return full_s3_training_folder, "Minimum files needed for training present in {}".format(full_s3_training_folder) 49 | 50 | 51 | def run_modelling_job(timestamp, 52 | train_input, 53 | s3_bucket=S3_BUCKET, 54 | train_out_prefix=OUTPUT_PREFIX, 55 | train_job_prefix='sagemaker-graph-fraud-model-training', 56 | train_source_dir='dgl_fraud_detection', 57 | train_entry_point='train_dgl_mxnet_entry_point.py', 58 | framework='mxnet', 59 | framework_version='1.6.0', 60 | xpu='gpu', 61 | python_version='py3', 62 | instance_type=INSTANCE_TYPE 63 | ): 64 | print("Creating SageMaker Training job with inputs from {}".format(train_input)) 65 | 66 | sagemaker_client = boto3.client('sagemaker') 67 | region = boto3.session.Session().region_name 68 | 69 | container = "763104351884.dkr.ecr.{}.amazonaws.com/{}-training:{}-{}-{}".format(region, 70 | framework, 71 | framework_version, 72 | xpu, 73 | python_version) 74 | 75 | training_job_name = "{}-{}".format(train_job_prefix, timestamp) 76 | 77 | code_path = tar_and_upload_to_s3(train_source_dir, 78 | s3_bucket, 79 | os.path.join(train_out_prefix, training_job_name, 'source')) 80 | 81 | framework_params = { 82 | 'sagemaker_container_log_level': str(20), 83 | 'sagemaker_enable_cloudwatch_metrics': 'false', 84 | 'sagemaker_job_name': json.dumps(training_job_name), 85 | 'sagemaker_program': json.dumps(train_entry_point), 86 | 'sagemaker_region': json.dumps(region), 87 | 'sagemaker_submit_directory': json.dumps(code_path) 88 | } 89 | 90 | model_params = { 91 | 'nodes': 'features.csv', 92 | 'edges': 'relation*', 93 | 'labels': 'tags.csv', 94 | 'model': 'rgcn', 95 | 'num-gpus': 1, 96 | 'batch-size': 10000, 97 | 'embedding-size': 64, 98 | 'n-neighbors': 1000, 99 | 'n-layers': 2, 100 | 'n-epochs': 10, 101 | 'optimizer': 'adam', 102 | 'lr': 1e-2 103 | } 104 | model_params = {k: json.dumps(str(v)) for k, v in model_params.items()} 105 | 106 | model_params.update(framework_params) 107 | 108 | train_params = \ 109 | { 110 | 'TrainingJobName': training_job_name, 111 | "AlgorithmSpecification": { 112 | "TrainingImage": container, 113 | "TrainingInputMode": "File" 114 | }, 115 | "RoleArn": ROLE_ARN, 116 | "OutputDataConfig": { 117 | "S3OutputPath": get_full_s3_path(s3_bucket, train_out_prefix) 
118 | }, 119 | "ResourceConfig": { 120 | "InstanceCount": 1, 121 | "InstanceType": instance_type, 122 | "VolumeSizeInGB": 30 123 | }, 124 | "HyperParameters": model_params, 125 | "StoppingCondition": { 126 | "MaxRuntimeInSeconds": 86400 127 | }, 128 | "InputDataConfig": [ 129 | { 130 | "ChannelName": "train", 131 | "DataSource": { 132 | "S3DataSource": { 133 | "S3DataType": "S3Prefix", 134 | "S3Uri": train_input, 135 | "S3DataDistributionType": "FullyReplicated" 136 | } 137 | }, 138 | }, 139 | ] 140 | } 141 | 142 | response = sagemaker_client.create_training_job(**train_params) 143 | return response 144 | 145 | 146 | def tar_and_upload_to_s3(source, s3_bucket, s3_key): 147 | filename = "/tmp/sourcedir.tar.gz" 148 | with tarfile.open(filename, mode="w:gz") as t: 149 | for file in os.listdir(source): 150 | t.add(os.path.join(source, file), arcname=file) 151 | 152 | s3_client.upload_file(filename, s3_bucket, os.path.join(s3_key, 'sourcedir.tar.gz')) 153 | 154 | return get_full_s3_path(s3_bucket, os.path.join(s3_key, 'sourcedir.tar.gz')) 155 | 156 | 157 | def get_full_s3_path(bucket, key): 158 | return os.path.join('s3://', bucket, key) 159 | 160 | 161 | def get_full_path(event_source_s3): 162 | return get_full_s3_path(event_source_s3['bucket']['name'], event_source_s3['object']['key']) 163 | -------------------------------------------------------------------------------- /source/sagemaker/baselines/mlp-fraud-baseline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Baseline MLP Model\n", 8 | "## Training a Feed-foward Neural Network with node features" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "from utils import get_data\n", 18 | "import os\n", 19 | "os.chdir(\"../\")\n", 20 | "\n", 21 | "import pandas as pd\n", 22 | "import numpy as np\n", 23 | "\n", 24 | "!bash setup.sh" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Read data and upload to S3" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import io\n", 41 | "from scipy.sparse import csr_matrix, save_npz \n", 42 | "\n", 43 | "train_X, train_y, test_X, test_y = get_data()\n", 44 | "\n", 45 | "train_X.loc[:, 0] = train_y.values\n", 46 | "sparse_matrix = csr_matrix(train_X.values)\n", 47 | "filename = 'mlp-fraud-dataset.npz'\n", 48 | "save_npz(filename, sparse_matrix, compressed=True)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "import os\n", 58 | "import sagemaker\n", 59 | "from sagemaker.s3 import S3Uploader\n", 60 | "\n", 61 | "from sagemaker_graph_fraud_detection import config\n", 62 | "\n", 63 | "role = config.role\n", 64 | "\n", 65 | "session = sagemaker.Session()\n", 66 | "bucket = config.solution_bucket\n", 67 | "prefix = 'mlp-fraud-detection'\n", 68 | "\n", 69 | "s3_train_data = S3Uploader.upload(filename, 's3://{}/{}/{}'.format(bucket, prefix,'train'))\n", 70 | "print('Uploaded training data location: {}'.format(s3_train_data))\n", 71 | "\n", 72 | "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", 73 | "print('Training artifacts will be uploaded to: {}'.format(output_location))" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": 
{}, 79 | "source": [ 80 | "## Train SageMaker MXNet Estimator" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": null, 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "from sagemaker import get_execution_role\n", 90 | "from sagemaker.mxnet import MXNet\n", 91 | "\n", 92 | "params = {'num-gpus': 1,\n", 93 | " 'n-layers': 5,\n", 94 | " 'n-epochs': 100,\n", 95 | " 'optimizer': 'adam',\n", 96 | " 'lr': 1e-2\n", 97 | " } \n", 98 | "\n", 99 | "mlp = MXNet(entry_point='baselines/mlp_fraud_entry_point.py',\n", 100 | " role=role, \n", 101 | " train_instance_count=1, \n", 102 | " train_instance_type='ml.p3.2xlarge',\n", 103 | " framework_version=\"1.4.1\",\n", 104 | " py_version='py3',\n", 105 | " hyperparameters=params,\n", 106 | " output_path=output_location,\n", 107 | " code_location=output_location,\n", 108 | " sagemaker_session=session)\n", 109 | "\n", 110 | "mlp.fit({'train': s3_train_data})" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "scrolled": true 118 | }, 119 | "outputs": [], 120 | "source": [ 121 | "from sagemaker.predictor import json_serializer\n", 122 | "\n", 123 | "predictor = mlp.deploy(initial_instance_count=1,\n", 124 | " endpoint_name=\"mlp-fraud-endpoint\",\n", 125 | " instance_type='ml.p3.2xlarge')\n", 126 | "\n", 127 | "# Specify input and output formats.\n", 128 | "predictor.content_type = 'text/json'\n", 129 | "predictor.serializer = json_serializer\n", 130 | "predictor.deserializer = None" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "import json\n", 140 | "\n", 141 | "def predict(current_predictor, data, rows=500):\n", 142 | " split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n", 143 | " predictions = []\n", 144 | " for array in split_array:\n", 145 | " predictions.append(np.array(json.loads(current_predictor.predict(array.tolist()))))\n", 146 | " return np.concatenate(tuple(predictions), axis=0)\n", 147 | "\n", 148 | "raw_preds = predict(predictor, test_X.values[:, 1:])\n", 149 | "y_preds = np.where(raw_preds > 0.5, 1, 0)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "from sklearn.metrics import confusion_matrix, roc_curve, auc\n", 159 | "from matplotlib import pyplot as plt\n", 160 | "%matplotlib inline\n", 161 | "\n", 162 | "def print_metrics(y_true, y_predicted):\n", 163 | "\n", 164 | " cm = confusion_matrix(y_true, y_predicted)\n", 165 | " true_neg, false_pos, false_neg, true_pos = cm.ravel()\n", 166 | " cm = pd.DataFrame(np.array([[true_pos, false_pos], [false_neg, true_neg]]),\n", 167 | " columns=[\"labels positive\", \"labels negative\"],\n", 168 | " index=[\"predicted positive\", \"predicted negative\"])\n", 169 | " \n", 170 | " acc = (true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg)\n", 171 | " precision = true_pos/(true_pos + false_pos) if (true_pos + false_pos) > 0 else 0\n", 172 | " recall = true_pos/(true_pos + false_neg) if (true_pos + false_neg) > 0 else 0\n", 173 | " f1 = 2*(precision*recall)/(precision + recall) if (precision + recall) > 0 else 0\n", 174 | " print(\"Confusion Matrix:\")\n", 175 | " print(pd.DataFrame(cm, columns=[\"labels positive\", \"labels negative\"], \n", 176 | " index=[\"predicted positive\", \"predicted negative\"]))\n", 177 | " print(\"f1: {:.4f}, precision: {:.4f}, recall: {:.4f}, 
acc: {:.4f}\".format(f1, precision, recall, acc))\n", 178 | " print()\n", 179 | " \n", 180 | "def plot_roc_curve(fpr, tpr, roc_auc):\n", 181 | " f = plt.figure()\n", 182 | " lw = 2\n", 183 | " plt.plot(fpr, tpr, color='darkorange',\n", 184 | " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n", 185 | " plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n", 186 | " plt.xlim([0.0, 1.0])\n", 187 | " plt.ylim([0.0, 1.05])\n", 188 | " plt.xlabel('False Positive Rate')\n", 189 | " plt.ylabel('True Positive Rate')\n", 190 | " plt.title('Model ROC curve')\n", 191 | " plt.legend(loc=\"lower right\")\n", 192 | "\n", 193 | "print_metrics(test_y, y_preds)\n", 194 | "fpr, tpr, _ = roc_curve(test_y, y_preds)\n", 195 | "roc_auc = auc(fpr, tpr)\n", 196 | "plot_roc_curve(fpr, tpr, roc_auc)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "predictor.delete_endpoint()" 206 | ] 207 | } 208 | ], 209 | "metadata": { 210 | "kernelspec": { 211 | "display_name": "conda_python3", 212 | "language": "python", 213 | "name": "conda_python3" 214 | }, 215 | "language_info": { 216 | "codemirror_mode": { 217 | "name": "ipython", 218 | "version": 3 219 | }, 220 | "file_extension": ".py", 221 | "mimetype": "text/x-python", 222 | "name": "python", 223 | "nbconvert_exporter": "python", 224 | "pygments_lexer": "ipython3", 225 | "version": "3.6.10" 226 | } 227 | }, 228 | "nbformat": 4, 229 | "nbformat_minor": 2 230 | } 231 | -------------------------------------------------------------------------------- /source/sagemaker/baselines/mlp_fraud_entry_point.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import pickle 4 | import argparse 5 | import json 6 | import logging 7 | import numpy as np 8 | import mxnet as mx 9 | from mxnet import nd, gluon, autograd 10 | from scipy.sparse import load_npz 11 | 12 | 13 | def parse_args(): 14 | parser = argparse.ArgumentParser() 15 | 16 | parser.add_argument('--training-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN']) 17 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 18 | parser.add_argument('--num-gpus', type=int, default=1) 19 | parser.add_argument('--batch-size', type=int, default=200000) 20 | parser.add_argument('--optimizer', type=str, default='adam') 21 | parser.add_argument('--lr', type=float, default=1e-2) 22 | parser.add_argument('--n-epochs', type=int, default=20) 23 | parser.add_argument('--n-hidden', type=int, default=64, help='number of hidden units') 24 | parser.add_argument('--n-layers', type=int, default=5, help='number of hidden layers') 25 | parser.add_argument('--weight-decay', type=float, default=5e-4, help='Weight for L2 loss') 26 | 27 | return parser.parse_args() 28 | 29 | 30 | def get_logger(name): 31 | logger = logging.getLogger(name) 32 | log_format = '%(asctime)s %(levelname)s %(name)s: %(message)s' 33 | logging.basicConfig(format=log_format, level=logging.INFO) 34 | return logger 35 | 36 | 37 | def get_data(): 38 | filename = 'mlp-fraud-dataset.npz' 39 | matrix = load_npz(os.path.join(args.training_dir, filename)).toarray().astype('float32') 40 | scale_pos_weight = np.sqrt((matrix.shape[0] - matrix[:, 0].sum()) / matrix[:, 0].sum()) 41 | weight_mask = np.ones(matrix.shape[0]).astype('float32') 42 | weight_mask[np.where(matrix[:, 0])] = scale_pos_weight 43 | dataloader = gluon.data.DataLoader(gluon.data.ArrayDataset(matrix[:, 1:], matrix[:, 
0], weight_mask), 44 | args.batch_size, 45 | shuffle=True, 46 | last_batch='keep') 47 | return dataloader, matrix.shape[0] 48 | 49 | 50 | def evaluate(model, dataloader, ctx): 51 | f1 = mx.metric.F1() 52 | for data, label, mask in dataloader: 53 | pred = model(data.as_in_context(ctx)) 54 | f1.update(label.as_in_context(ctx), nd.softmax(pred, axis=1)) 55 | return f1.get()[1] 56 | 57 | 58 | def train(model, trainer, loss, train_data, ctx): 59 | duration = [] 60 | for epoch in range(args.n_epochs): 61 | tic = time.time() 62 | loss_val = 0. 63 | 64 | for features, labels, weight_mask in train_data: 65 | with autograd.record(): 66 | pred = model(features.as_in_context(ctx)) 67 | l = loss(pred, labels.as_in_context(ctx), mx.nd.expand_dims(weight_mask, 1).as_in_context(ctx)).sum() 68 | l.backward() 69 | trainer.step(args.batch_size) 70 | 71 | loss_val += l.asscalar() 72 | duration.append(time.time() - tic) 73 | f1 = evaluate(model, train_data, ctx) 74 | logging.info("Epoch {:05d} | Time(s) {:.4f} | Loss {:.4f} | F1 Score {:.4f} ".format( 75 | epoch, np.mean(duration), loss_val / n, f1)) 76 | save_model(model) 77 | return model 78 | 79 | 80 | def save_model(model): 81 | model.save_parameters(os.path.join(args.model_dir, 'model.params')) 82 | with open(os.path.join(args.model_dir, 'model_hyperparams.pkl'), 'wb') as f: 83 | pickle.dump(args, f) 84 | 85 | 86 | def get_model(model_dir, ctx, n_classes=2, load_stored=False): 87 | if load_stored: # load using saved model state 88 | with open(os.path.join(model_dir, 'model_hyperparams.pkl'), 'rb') as f: 89 | hyperparams = pickle.load(f) 90 | else: 91 | hyperparams = args 92 | 93 | model = gluon.nn.Sequential() 94 | for _ in range(hyperparams.n_layers): 95 | model.add(gluon.nn.Dense(hyperparams.n_hidden, activation='relu')) 96 | model.add(gluon.nn.Dense(n_classes)) 97 | 98 | if load_stored: 99 | model.load_parameters(os.path.join(model_dir, 'model.params'), ctx=ctx) 100 | else: 101 | model.initialize(ctx=ctx) 102 | return model 103 | 104 | 105 | if __name__ == '__main__': 106 | logging = get_logger(__name__) 107 | logging.info('numpy version:{} MXNet version:{}'.format(np.__version__, mx.__version__)) 108 | 109 | args = parse_args() 110 | 111 | train_data, n = get_data() 112 | 113 | ctx = mx.gpu(0) if args.num_gpus else mx.cpu(0) 114 | 115 | model = get_model(args.model_dir, ctx, n_classes=2) 116 | 117 | logging.info(model) 118 | logging.info(model.collect_params()) 119 | 120 | loss = gluon.loss.SoftmaxCELoss() 121 | trainer = gluon.Trainer(model.collect_params(), args.optimizer, {'learning_rate': args.lr, 'wd': args.weight_decay}) 122 | 123 | logging.info("Starting Model training") 124 | model = train(model, trainer, loss, train_data, ctx) 125 | 126 | 127 | # ------------------------------------------------------------ # 128 | # Hosting methods # 129 | # ------------------------------------------------------------ # 130 | 131 | def model_fn(model_dir): 132 | ctx = mx.gpu(0) if list(mx.test_utils.list_gpus()) else mx.cpu(0) 133 | net = get_model(model_dir, ctx, n_classes=2, load_stored=True) 134 | return net 135 | 136 | 137 | def transform_fn(net, data, input_content_type, output_content_type): 138 | ctx = mx.gpu(0) if list(mx.test_utils.list_gpus()) else mx.cpu(0) 139 | nda = mx.nd.array(json.loads(data)) 140 | prediction = nd.softmax(net(nda.as_in_context(ctx)), axis=1)[:, 1] 141 | response_body = json.dumps(prediction.asnumpy().tolist()) 142 | return response_body, output_content_type 143 | 
-------------------------------------------------------------------------------- /source/sagemaker/baselines/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | 4 | def get_data(): 5 | data_prefix = "processed-data/" 6 | 7 | if not os.path.exists(data_prefix): 8 | print("""Expected the following folder {} to contain the preprocessed data. 9 | Run data processing first in main notebook before running baselines comparisons""".format(data_prefix)) 10 | return 11 | 12 | features = pd.read_csv(data_prefix + "features.csv", header=None) 13 | labels = pd.read_csv(data_prefix + "tags.csv").set_index('TransactionID') 14 | test_users = pd.read_csv(data_prefix + "test.csv", header=None) 15 | 16 | test_X = features.merge(test_users, on=[0], how='inner') 17 | train_X = features[~features[0].isin(test_users[0].values)] 18 | test_y = labels.loc[test_X[0]].isFraud 19 | train_y = labels.loc[train_X[0]].isFraud 20 | 21 | return train_X, train_y, test_X, test_y -------------------------------------------------------------------------------- /source/sagemaker/baselines/xgboost-fraud-baseline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Baseline XGBoost Model\n", 8 | "## Training Gradient Boosted trees with node features" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "from utils import get_data\n", 18 | "import os\n", 19 | "os.chdir(\"../\")\n", 20 | "\n", 21 | "import pandas as pd\n", 22 | "import numpy as np\n", 23 | "\n", 24 | "!bash setup.sh" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Read data and upload to S3" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import io\n", 41 | "from sklearn.datasets import dump_svmlight_file\n", 42 | "\n", 43 | "train_X, train_y, test_X, test_y = get_data()\n", 44 | "\n", 45 | "buf = io.BytesIO()\n", 46 | "dump_svmlight_file(train_X.values[:, 1:], train_y, buf)\n", 47 | "buf.seek(0);\n", 48 | "filename = 'xgboost-fraud-dataset.libsvm'\n", 49 | "with open(filename,'wb') as out:\n", 50 | " out.write(buf.read())" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "import os\n", 60 | "import sagemaker\n", 61 | "from sagemaker.s3 import S3Uploader\n", 62 | "\n", 63 | "from sagemaker_graph_fraud_detection import config\n", 64 | "\n", 65 | "role = config.role\n", 66 | "\n", 67 | "session = sagemaker.Session()\n", 68 | "bucket = config.solution_bucket\n", 69 | "prefix = 'xgboost-fraud-detection'\n", 70 | "\n", 71 | "s3_train_data = S3Uploader.upload(filename, 's3://{}/{}/{}'.format(bucket, prefix,'train'))\n", 72 | "print('Uploaded training data location: {}'.format(s3_train_data))\n", 73 | "\n", 74 | "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", 75 | "print('Training artifacts will be uploaded to: {}'.format(output_location))" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Train SageMaker XGBoost Estimator" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "import 
boto3\n", 92 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 93 | "\n", 94 | "container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='0.90-2')\n", 95 | "scale_pos_weight = np.sqrt((len(train_y) - sum(train_y))/sum(train_y))\n", 96 | "\n", 97 | "hyperparams = {\n", 98 | " \"max_depth\":5,\n", 99 | " \"subsample\":0.8,\n", 100 | " \"num_round\":100,\n", 101 | " \"eta\":0.2,\n", 102 | " \"gamma\":4,\n", 103 | " \"min_child_weight\":6,\n", 104 | " \"silent\":0,\n", 105 | " \"objective\":'binary:logistic',\n", 106 | " \"eval_metric\":'f1',\n", 107 | " \"scale_pos_weight\": scale_pos_weight\n", 108 | "}\n", 109 | "\n", 110 | "xgb = sagemaker.estimator.Estimator(container,\n", 111 | " role,\n", 112 | " hyperparameters=hyperparams,\n", 113 | " train_instance_count=1, \n", 114 | " train_instance_type='ml.m4.xlarge',\n", 115 | " output_path=output_location,\n", 116 | " sagemaker_session=session)\n", 117 | "xgb.fit({'train': s3_train_data})" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "scrolled": true 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "from sagemaker.predictor import csv_serializer\n", 129 | "\n", 130 | "predictor = xgb.deploy(initial_instance_count=1,\n", 131 | " endpoint_name=\"xgboost-fraud-endpoint\",\n", 132 | " instance_type='ml.m4.xlarge')\n", 133 | "\n", 134 | "# Specify input and output formats.\n", 135 | "predictor.content_type = 'text/csv'\n", 136 | "predictor.serializer = csv_serializer\n", 137 | "predictor.deserializer = None" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "def predict(current_predictor, data, rows=500):\n", 147 | " split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n", 148 | " predictions = ''\n", 149 | " for array in split_array:\n", 150 | " predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])\n", 151 | " return np.fromstring(predictions[1:], sep=',')\n", 152 | "\n", 153 | "raw_preds = predict(predictor, test_X.values[:, 1:])\n", 154 | "y_preds = np.where(raw_preds > 0.5, 1, 0)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "from sklearn.metrics import confusion_matrix, roc_curve, auc\n", 164 | "from matplotlib import pyplot as plt\n", 165 | "%matplotlib inline\n", 166 | "\n", 167 | "def print_metrics(y_true, y_predicted):\n", 168 | "\n", 169 | " cm = confusion_matrix(y_true, y_predicted)\n", 170 | " true_neg, false_pos, false_neg, true_pos = cm.ravel()\n", 171 | " cm = pd.DataFrame(np.array([[true_pos, false_pos], [false_neg, true_neg]]),\n", 172 | " columns=[\"labels positive\", \"labels negative\"],\n", 173 | " index=[\"predicted positive\", \"predicted negative\"])\n", 174 | " \n", 175 | " acc = (true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg)\n", 176 | " precision = true_pos/(true_pos + false_pos) if (true_pos + false_pos) > 0 else 0\n", 177 | " recall = true_pos/(true_pos + false_neg) if (true_pos + false_neg) > 0 else 0\n", 178 | " f1 = 2*(precision*recall)/(precision + recall) if (precision + recall) > 0 else 0\n", 179 | " print(\"Confusion Matrix:\")\n", 180 | " print(pd.DataFrame(cm, columns=[\"labels positive\", \"labels negative\"], \n", 181 | " index=[\"predicted positive\", \"predicted negative\"]))\n", 182 | " print(\"f1: {:.4f}, precision: {:.4f}, 
recall: {:.4f}, acc: {:.4f}\".format(f1, precision, recall, acc))\n", 183 | " print()\n", 184 | " \n", 185 | "def plot_roc_curve(fpr, tpr, roc_auc):\n", 186 | " f = plt.figure()\n", 187 | " lw = 2\n", 188 | " plt.plot(fpr, tpr, color='darkorange',\n", 189 | " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n", 190 | " plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n", 191 | " plt.xlim([0.0, 1.0])\n", 192 | " plt.ylim([0.0, 1.05])\n", 193 | " plt.xlabel('False Positive Rate')\n", 194 | " plt.ylabel('True Positive Rate')\n", 195 | " plt.title('Model ROC curve')\n", 196 | " plt.legend(loc=\"lower right\")\n", 197 | "\n", 198 | "print_metrics(test_y, y_preds)\n", 199 | "fpr, tpr, _ = roc_curve(test_y, y_preds)\n", 200 | "roc_auc = auc(fpr, tpr)\n", 201 | "plot_roc_curve(fpr, tpr, roc_auc)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "predictor.delete_endpoint()" 211 | ] 212 | } 213 | ], 214 | "metadata": { 215 | "kernelspec": { 216 | "display_name": "conda_python3", 217 | "language": "python", 218 | "name": "conda_python3" 219 | }, 220 | "language_info": { 221 | "codemirror_mode": { 222 | "name": "ipython", 223 | "version": 3 224 | }, 225 | "file_extension": ".py", 226 | "mimetype": "text/x-python", 227 | "name": "python", 228 | "nbconvert_exporter": "python", 229 | "pygments_lexer": "ipython3", 230 | "version": "3.6.10" 231 | } 232 | }, 233 | "nbformat": 4, 234 | "nbformat_minor": 2 235 | } 236 | -------------------------------------------------------------------------------- /source/sagemaker/data-preprocessing/buildspec.yml: -------------------------------------------------------------------------------- 1 | version: 0.2 2 | 3 | phases: 4 | build: 5 | commands: 6 | - echo Build started on `date` 7 | - echo Building the Docker image... 8 | - bash container/build_and_push.sh $ecr_repository $region $account_id 9 | post_build: 10 | commands: 11 | - echo Build completed on `date` -------------------------------------------------------------------------------- /source/sagemaker/data-preprocessing/container/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.7-slim-buster 2 | 3 | RUN pip3 install pandas==0.24.2 4 | ENV PYTHONUNBUFFERED=TRUE 5 | 6 | ENTRYPOINT ["python3"] -------------------------------------------------------------------------------- /source/sagemaker/data-preprocessing/container/build_and_push.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | image=$1 4 | 5 | region=$2 6 | 7 | account=$3 8 | 9 | # Get the region defined in the current configuration (default to us-east-1 if none defined) 10 | fullname="${account}.dkr.ecr.${region}.amazonaws.com/${image}:latest" 11 | 12 | # If the repository doesn't exist in ECR, create it. 13 | aws ecr describe-repositories --repository-names "${image}" > /dev/null 2>&1 14 | 15 | if [ $? -ne 0 ] 16 | then 17 | aws ecr create-repository --repository-name "${image}" > /dev/null 18 | fi 19 | 20 | # Get the login command from ECR and execute it directly 21 | $(aws ecr get-login --region ${region} --registry-ids ${account} --no-include-email) 22 | 23 | 24 | # Build the docker image locally with the image name and then push it to ECR 25 | # with the full name. 
26 | 27 | docker build -t ${image} container 28 | docker tag ${image} ${fullname} 29 | 30 | docker push ${fullname} -------------------------------------------------------------------------------- /source/sagemaker/data-preprocessing/graph_data_preprocessor.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import logging 3 | import os 4 | 5 | import pandas as pd 6 | import numpy as np 7 | from itertools import combinations 8 | 9 | 10 | def parse_args(): 11 | parser = argparse.ArgumentParser() 12 | parser.add_argument('--data-dir', type=str, default='/opt/ml/processing/input') 13 | parser.add_argument('--output-dir', type=str, default='/opt/ml/processing/output') 14 | parser.add_argument('--transactions', type=str, default='transaction.csv', help='name of file with transactions') 15 | parser.add_argument('--identity', type=str, default='identity.csv', help='name of file with identity info') 16 | parser.add_argument('--id-cols', type=str, default='', help='comma separated id cols in transactions table') 17 | parser.add_argument('--cat-cols', type=str, default='', help='comma separated categorical cols in transactions') 18 | parser.add_argument('--train-data-ratio', type=float, default=0.8, help='fraction of data to use in training set') 19 | parser.add_argument('--construct-homogeneous', action="store_true", default=False, 20 | help='use bipartite graphs edgelists to construct homogenous graph edgelist') 21 | return parser.parse_args() 22 | 23 | 24 | def get_logger(name): 25 | logger = logging.getLogger(name) 26 | log_format = '%(asctime)s %(levelname)s %(name)s: %(message)s' 27 | logging.basicConfig(format=log_format, level=logging.INFO) 28 | logger.setLevel(logging.INFO) 29 | return logger 30 | 31 | 32 | def load_data(data_dir, transaction_data, identity_data, train_data_ratio, output_dir): 33 | transaction_df = pd.read_csv(os.path.join(data_dir, transaction_data)) 34 | logging.info("Shape of transaction data is {}".format(transaction_df.shape)) 35 | logging.info("# Tagged transactions: {}".format(len(transaction_df) - transaction_df.isFraud.isnull().sum())) 36 | 37 | identity_df = pd.read_csv(os.path.join(data_dir, identity_data)) 38 | logging.info("Shape of identity data is {}".format(identity_df.shape)) 39 | 40 | # extract out transactions for test/validation 41 | n_train = int(transaction_df.shape[0]*train_data_ratio) 42 | test_ids = transaction_df.TransactionID.values[n_train:] 43 | 44 | get_fraud_frac = lambda series: 100 * sum(series)/len(series) 45 | logging.info("Percent fraud for train transactions: {}".format(get_fraud_frac(transaction_df.isFraud[:n_train]))) 46 | logging.info("Percent fraud for test transactions: {}".format(get_fraud_frac(transaction_df.isFraud[n_train:]))) 47 | logging.info("Percent fraud for all transactions: {}".format(get_fraud_frac(transaction_df.isFraud))) 48 | 49 | with open(os.path.join(output_dir, 'test.csv'), 'w') as f: 50 | f.writelines(map(lambda x: str(x) + "\n", test_ids)) 51 | logging.info("Wrote test to file: {}".format(os.path.join(output_dir, 'test.csv'))) 52 | 53 | return transaction_df, identity_df, test_ids 54 | 55 | 56 | def get_features_and_labels(transactions_df, transactions_id_cols, transactions_cat_cols, output_dir): 57 | # Get features 58 | non_feature_cols = ['isFraud', 'TransactionDT'] + transactions_id_cols.split(",") 59 | feature_cols = [col for col in transactions_df.columns if col not in non_feature_cols] 60 | logging.info("Categorical columns: 
{}".format(transactions_cat_cols.split(","))) 61 | features = pd.get_dummies(transactions_df[feature_cols], columns=transactions_cat_cols.split(",")).fillna(0) 62 | features['TransactionAmt'] = features['TransactionAmt'].apply(np.log10) 63 | logging.info("Transformed feature columns: {}".format(list(features.columns))) 64 | logging.info("Shape of features: {}".format(features.shape)) 65 | features.to_csv(os.path.join(output_dir, 'features.csv'), index=False, header=False) 66 | logging.info("Wrote features to file: {}".format(os.path.join(output_dir, 'features.csv'))) 67 | 68 | # Get labels 69 | transactions_df[['TransactionID', 'isFraud']].to_csv(os.path.join(output_dir, 'tags.csv'), index=False) 70 | logging.info("Wrote labels to file: {}".format(os.path.join(output_dir, 'tags.csv'))) 71 | 72 | 73 | def get_relations_and_edgelist(transactions_df, identity_df, transactions_id_cols, output_dir): 74 | # Get relations 75 | edge_types = transactions_id_cols.split(",") + list(identity_df.columns) 76 | logging.info("Found the following distinct relation types: {}".format(edge_types)) 77 | id_cols = ['TransactionID'] + transactions_id_cols.split(",") 78 | full_identity_df = transactions_df[id_cols].merge(identity_df, on='TransactionID', how='left') 79 | logging.info("Shape of identity columns: {}".format(full_identity_df.shape)) 80 | 81 | # extract edges 82 | edges = {} 83 | for etype in edge_types: 84 | edgelist = full_identity_df[['TransactionID', etype]].dropna() 85 | edgelist.to_csv(os.path.join(output_dir, 'relation_{}_edgelist.csv').format(etype), index=False, header=True) 86 | logging.info("Wrote edgelist to: {}".format(os.path.join(output_dir, 'relation_{}_edgelist.csv').format(etype))) 87 | edges[etype] = edgelist 88 | return edges 89 | 90 | 91 | def create_homogeneous_edgelist(edges, output_dir): 92 | homogeneous_edges = [] 93 | for etype, relations in edges.items(): 94 | for edge_relation, frame in relations.groupby(etype): 95 | new_edges = [(a, b) for (a, b) in combinations(frame.TransactionID.values, 2) 96 | if (a, b) not in homogeneous_edges and (b, a) not in homogeneous_edges] 97 | homogeneous_edges.extend(new_edges) 98 | 99 | with open(os.path.join(output_dir, 'homogeneous_edgelist.csv'), 'w') as f: 100 | f.writelines(map(lambda x: "{}, {}\n".format(x[0], x[1]), homogeneous_edges)) 101 | logging.info("Wrote homogeneous edgelist to file: {}".format(os.path.join(output_dir, 'homogeneous_edgelist.csv'))) 102 | 103 | 104 | if __name__ == '__main__': 105 | logging = get_logger(__name__) 106 | 107 | args = parse_args() 108 | 109 | transactions, identity, test_transactions = load_data(args.data_dir, 110 | args.transactions, 111 | args.identity, 112 | args.train_data_ratio, 113 | args.output_dir) 114 | 115 | get_features_and_labels(transactions, args.id_cols, args.cat_cols, args.output_dir) 116 | relational_edges = get_relations_and_edgelist(transactions, identity, args.id_cols, args.output_dir) 117 | 118 | if args.construct_homogeneous: 119 | create_homogeneous_edgelist(relational_edges, args.output_dir) 120 | 121 | 122 | 123 | -------------------------------------------------------------------------------- /source/sagemaker/dgl-fraud-detection.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Graph Fraud Detection with DGL on Amazon SageMaker\n", 8 | "\n", 9 | "This notebook shows an end to end pipeline to train a fraud detection model using graph 
neural networks. \n", 10 | "\n", 11 | "First, we process the raw dataset to prepare the features and extract the interactions in the dataset that will be used to construct the graph. \n", 12 | "\n", 13 | "Then, we create a launch a training job using the SageMaker framework estimator to train a graph neural network model with DGL." 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "!bash setup.sh\n", 23 | "\n", 24 | "import sagemaker\n", 25 | "from sagemaker_graph_fraud_detection import config, container_build\n", 26 | "\n", 27 | "role = config.role\n", 28 | "sess = sagemaker.Session()" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## Data Preprocessing and Feature Engineering" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### Upload raw data to S3\n", 43 | "\n", 44 | "The dataset we use is the [IEEE-CIS Fraud Detection dataset](https://www.kaggle.com/c/ieee-fraud-detection/data?select=train_transaction.csv) which is a typical example of financial transactions dataset that many companies have. The dataset consists of two tables:\n", 45 | "\n", 46 | "* **Transactions**: Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction and features on the card used for the transaction. \n", 47 | "* **Identity**: Contains information about the identity users performing transactions. Examples of columns here include the device type and device ids used.\n", 48 | "\n", 49 | "We will go over the specific data schema in subsequent cells but now let's move the raw data to a convenient location in the S3 bucket for this proejct, where it will be picked up by the preprocessing job and training job.\n", 50 | "\n", 51 | "If you would like to use your own dataset for this demonstration. Replace the `raw_data_location` with the s3 path or local path of your dataset, and modify the data preprocessing step as needed." 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# Replace with an S3 location or local path to point to your own dataset\n", 61 | "raw_data_location = 's3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data'\n", 62 | "\n", 63 | "session_prefix = 'dgl-fraud-detection'\n", 64 | "input_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_data_prefix)\n", 65 | "\n", 66 | "!aws s3 cp --recursive $raw_data_location $input_data\n", 67 | "\n", 68 | "# Set S3 locations to store processed data for training and post-training results and artifacts respectively\n", 69 | "train_data = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_processing_output)\n", 70 | "train_output = 's3://{}/{}/{}'.format(config.solution_bucket, session_prefix, config.s3_train_output)" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "### Build container for Preprocessing and Feature Engineering\n", 78 | "\n", 79 | "Data preprocessing and feature engineering is an important component of the ML lifecycle, and Amazon SageMaker Processing allows you to do these easily on a managed infrastructure. First, we'll create a lightweight container that will serve as the environment for our data preprocessing. 
\n", 80 | "\n", 81 | "The Dockerfile that defines the container is shown below and it only contains the pandas package as a dependency but it can also be easily customized to add in more dependencies if your data preprocessing job requires it." 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "!pygmentize data-preprocessing/container/Dockerfile" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Now we'll run a simple script to build a container image using the Dockerfile, and push the image to Amazon ECR. The container image will have a unique URI which the SageMaker Processing job executed." 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "region = config.region_name\n", 107 | "account_id = config.aws_account\n", 108 | "ecr_repository = config.ecr_repository\n", 109 | "\n", 110 | "if config.container_build_project == \"local\":\n", 111 | " !cd data-preprocessing && bash container/build_and_push.sh $ecr_repository $region $account_id\n", 112 | "else:\n", 113 | " container_build.build(config.container_build_project)\n", 114 | "ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### Run Preprocessing job with Amazon SageMaker Processing\n", 122 | "\n", 123 | "The script we have defined at `data-preprocessing/graph_data_preprocessor.py` performs data preprocessing and feature engineering transformations on the raw data. We provide a general processing framework to convert a relational table to heterogeneous graph edgelists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:\n", 124 | "\n", 125 | "* Performing numerical encoding for categorical variables and logarithmic transformation for transaction amount\n", 126 | "* Constructing graph edgelists between transactions and other entities for the various relation types\n", 127 | "\n", 128 | "The inputs to the data preprocessing script are passed in as python command line arguments. All the columns in the relational table are classifed into one of 3 types for the purposes of data transformation: \n", 129 | "\n", 130 | "* **Identity columns** `--id-cols`: columns that contain identity information related to a user or transaction for example IP address, Phone Number, device identifiers etc. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes. The column names for these column types need to passed in to the script.\n", 131 | "\n", 132 | "* **Categorical columns** `--cat-cols`: columns that correspond to categorical features for a user's age group or whether a provided address matches with an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph. The columns names for these column types also needs to be passed in to the script\n", 133 | "\n", 134 | "* **Numerical columns**: columns that correspond to numerical features like how many times a user has tried a transaction and so on. The entries here are also used as node attributes in the heterogeneous graph. 
The script assumes that all columns in the tables that are not identity columns or categorical columns are numerical columns\n", 135 | "\n", 136 | "In order to adapt the preprocessing script to work with data in the same format, you can simply change the python arguments used in the cell below to a comma seperate string for the column names in your dataset. If your dataset is in a different format, then you will also have to modify the preprocessing script at `data-preprocessing/graph_data_preprocessor.py`" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", 146 | "\n", 147 | "script_processor = ScriptProcessor(command=['python3'],\n", 148 | " image_uri=ecr_repository_uri,\n", 149 | " role=role,\n", 150 | " instance_count=1,\n", 151 | " instance_type='ml.m4.xlarge')\n", 152 | "\n", 153 | "script_processor.run(code='data-preprocessing/graph_data_preprocessor.py',\n", 154 | " inputs=[ProcessingInput(source=input_data,\n", 155 | " destination='/opt/ml/processing/input')],\n", 156 | " outputs=[ProcessingOutput(destination=train_data,\n", 157 | " source='/opt/ml/processing/output')],\n", 158 | " arguments=['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',\n", 159 | " '--cat-cols','M1,M2,M3,M4,M5,M6,M7,M8,M9'])" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "### View Results of Data Preprocessing\n", 167 | "\n", 168 | "Once the preprocessing job is complete, we can take a look at the contents of the S3 bucket to see the transformed data. We have a set of bipartite edge lists between transactions and different device id types as well as the features, labels and a set of transactions to validate our graph model performance." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "from os import path\n", 178 | "from sagemaker.s3 import S3Downloader\n", 179 | "processed_files = S3Downloader.list(train_data)\n", 180 | "print(\"===== Processed Files =====\")\n", 181 | "print('\\n'.join(processed_files))\n", 182 | "\n", 183 | "# optionally download processed data\n", 184 | "# S3Downloader.download(train_data, train_data.split(\"/\")[-1])" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "## Train Graph Neural Network with DGL\n", 192 | "\n", 193 | "Graph Neural Networks work by learning representation for nodes or edges of a graph that are well suited for some downstream task. We can model the fraud detection problem as a node classification task, and the goal of the graph neural network would be to learn how to use information from the topology of the sub-graph for each transaction node to transform the node's features to a representation space where the node can be easily classified as fraud or not.\n", 194 | "\n", 195 | "Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types." 
196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "### Hyperparameters\n", 203 | "\n", 204 | "To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. \n", 205 | "\n", 206 | "Here we're setting only a few of the hyperparameters, to see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. The parameters set below are:\n", 207 | "\n", 208 | "* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.\n", 209 | "* **`edges`** is a regular expression that when expanded lists all the filenames for the edgelists\n", 210 | "* **`labels`** is the name of the file tha contains the target `node_id`s and their labels\n", 211 | "* **`model`** specify which graph neural network to use, this should be set to `r-gcn`\n", 212 | "\n", 213 | "The following hyperparameters can be tuned and adjusted to improve model performance\n", 214 | "* **batch-size** is the number nodes that are used to compute a single forward pass of the GNN\n", 215 | "\n", 216 | "* **embedding-size** is the size of the embedding dimension for non target nodes\n", 217 | "* **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training\n", 218 | "* **n-layers** is the number of GNN layers in the model\n", 219 | "* **n-epochs** is the number of training epochs for the model training job\n", 220 | "* **optimizer** is the optimization algorithm used for gradient based parameter updates\n", 221 | "* **lr** is the learning rate for parameter updates\n" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "edges = \",\".join(map(lambda x: x.split(\"/\")[-1], [file for file in processed_files if \"relation\" in file]))\n", 231 | "params = {'nodes' : 'features.csv',\n", 232 | " 'edges': 'relation*',\n", 233 | " 'labels': 'tags.csv',\n", 234 | " 'model': 'rgcn',\n", 235 | " 'num-gpus': 1,\n", 236 | " 'batch-size': 10000,\n", 237 | " 'embedding-size': 64,\n", 238 | " 'n-neighbors': 1000,\n", 239 | " 'n-layers': 2,\n", 240 | " 'n-epochs': 10,\n", 241 | " 'optimizer': 'adam',\n", 242 | " 'lr': 1e-2\n", 243 | " }\n", 244 | "\n", 245 | "print(\"Graph will be constructed using the following edgelists:\\n{}\" .format('\\n'.join(edges.split(\",\"))))" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### Create and Fit SageMaker Estimator\n", 253 | "\n", 254 | "With the hyperparameters defined, we can kick off the training job. We will be using the Deep Graph Library (DGL), with MXNet as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes it do this with the Framework estimators which have the deep learning frameworks already setup. Here, we create a SageMaker MXNet estimator and pass in our model training script, hyperparameters, as well as the number and type of training instances we want.\n", 255 | "\n", 256 | "We can then `fit` the estimator on the the training data location in S3." 
257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": { 263 | "scrolled": true 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "from sagemaker.mxnet import MXNet\n", 268 | "from time import strftime, gmtime\n", 269 | "\n", 270 | "estimator = MXNet(entry_point='train_dgl_mxnet_entry_point.py',\n", 271 | " source_dir='sagemaker_graph_fraud_detection/dgl_fraud_detection',\n", 272 | " role=role, \n", 273 | " train_instance_count=1, \n", 274 | " train_instance_type='ml.p3.2xlarge',\n", 275 | " framework_version=\"1.6.0\",\n", 276 | " py_version='py3',\n", 277 | " hyperparameters=params,\n", 278 | " output_path=train_output,\n", 279 | " code_location=train_output,\n", 280 | " sagemaker_session=sess)\n", 281 | "\n", 282 | "training_job_name = \"{}-{}\".format(config.solution_prefix, strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n", 283 | "estimator.fit({'train': train_data}, job_name=training_job_name)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "Once the training is completed, the training instances are automatically saved and SageMaker stores the trained model and evaluation results to a location in S3." 291 | ] 292 | } 293 | ], 294 | "metadata": { 295 | "kernelspec": { 296 | "display_name": "conda_python3", 297 | "language": "python", 298 | "name": "conda_python3" 299 | }, 300 | "language_info": { 301 | "codemirror_mode": { 302 | "name": "ipython", 303 | "version": 3 304 | }, 305 | "file_extension": ".py", 306 | "mimetype": "text/x-python", 307 | "name": "python", 308 | "nbconvert_exporter": "python", 309 | "pygments_lexer": "ipython3", 310 | "version": "3.6.10" 311 | } 312 | }, 313 | "nbformat": 4, 314 | "nbformat_minor": 2 315 | } -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/source/sagemaker/sagemaker_graph_fraud_detection/__init__.py -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/config.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import boto3 4 | import sagemaker 5 | from pathlib import Path 6 | 7 | 8 | def get_current_folder(global_variables): 9 | # if calling from a file 10 | if "__file__" in global_variables: 11 | current_file = Path(global_variables["__file__"]) 12 | current_folder = current_file.parent.resolve() 13 | # if calling from a notebook 14 | else: 15 | current_folder = Path(os.getcwd()) 16 | return current_folder 17 | 18 | region = boto3.session.Session().region_name 19 | account_id = boto3.client('sts').get_caller_identity().get('Account') 20 | default_bucket = sagemaker.session.Session(boto3.session.Session()).default_bucket() 21 | default_role = sagemaker.get_execution_role() 22 | 23 | cfn_stack_outputs = {} 24 | current_folder = get_current_folder(globals()) 25 | cfn_stack_outputs_filepath = Path(current_folder, '../stack_outputs.json').resolve() 26 | 27 | if os.path.exists(cfn_stack_outputs_filepath): 28 | with open(cfn_stack_outputs_filepath) as f: 29 | cfn_stack_outputs = json.load(f) 30 | 31 | aws_account = cfn_stack_outputs.get('AccountID', account_id) 32 | region_name = cfn_stack_outputs.get('AWSRegion', region) 33 
| 34 | solution_prefix = cfn_stack_outputs.get('SolutionPrefix', 'sagemaker-soln-graph-fraud') 35 | solution_bucket = cfn_stack_outputs.get('SolutionS3Bucket', default_bucket) 36 | 37 | s3_data_prefix = cfn_stack_outputs.get('S3InputDataPrefix', 'raw-data') 38 | s3_processing_output = cfn_stack_outputs.get('S3ProcessingJobOutputPrefix', 'processed-data') 39 | s3_train_output = cfn_stack_outputs.get('S3TrainingJobOutputPrefix', 'training-output') 40 | 41 | ecr_repository = cfn_stack_outputs.get('SageMakerProcessingJobContainerName', 'sagemaker-soln-graph-fraud-preprocessing') 42 | container_build_project = cfn_stack_outputs.get('SageMakerProcessingJobContainerBuild', 'local') 43 | 44 | role = cfn_stack_outputs.get('IamRole', default_role) -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/container_build/__init__.py: -------------------------------------------------------------------------------- 1 | from .container_build import build -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/container_build/container_build.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import sys 3 | import time 4 | from .logs import logs_for_build 5 | 6 | def build(project_name, session=boto3.session.Session(), log=True): 7 | print("Starting a build job for CodeBuild project: {}".format(project_name)) 8 | id = _start_build(session, project_name) 9 | if log: 10 | logs_for_build(id, wait=True, session=session) 11 | else: 12 | _wait_for_build(id, session) 13 | 14 | def _start_build(session, project_name): 15 | args = {"projectName": project_name} 16 | client = session.client("codebuild") 17 | 18 | response = client.start_build(**args) 19 | return response["build"]["id"] 20 | 21 | 22 | def _wait_for_build(build_id, session, poll_seconds=10): 23 | client = session.client("codebuild") 24 | status = client.batch_get_builds(ids=[build_id]) 25 | while status["builds"][0]["buildStatus"] == "IN_PROGRESS": 26 | print(".", end="") 27 | sys.stdout.flush() 28 | time.sleep(poll_seconds) 29 | status = client.batch_get_builds(ids=[build_id]) 30 | print() 31 | print(f"Build complete, status = {status['builds'][0]['buildStatus']}") 32 | print(f"Logs at {status['builds'][0]['logs']['deepLink']}") -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/container_build/logs.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import sys 3 | import time 4 | 5 | import botocore 6 | 7 | 8 | # Log tailing code taken as-is from sagemaker_run_notebook. 9 | # https://github.com/aws-samples/sagemaker-run-notebook/blob/master/sagemaker_run_notebook/container_build.py#L100 10 | # It can be added as a dependency once its on PyPI. 11 | 12 | 13 | class LogState(object): 14 | STARTING = 1 15 | WAIT_IN_PROGRESS = 2 16 | TAILING = 3 17 | JOB_COMPLETE = 4 18 | COMPLETE = 5 19 | 20 | 21 | # Position is a tuple that includes the last read timestamp and the number of items that were read 22 | # at that time. This is used to figure out which event to start with on the next read. 23 | Position = collections.namedtuple("Position", ["timestamp", "skip"]) 24 | 25 | 26 | def log_stream(client, log_group, stream_name, position): 27 | """A generator for log items in a single stream. 
This will yield all the 28 | items that are available at the current moment. 29 | 30 | Args: 31 | client (boto3.CloudWatchLogs.Client): The Boto client for CloudWatch logs. 32 | log_group (str): The name of the log group. 33 | stream_name (str): The name of the specific stream. 34 | position (Position): A tuple with the time stamp value to start reading the logs from and 35 | The number of log entries to skip at the start. This is for when 36 | there are multiple entries at the same timestamp. 37 | start_time (int): The time stamp value to start reading the logs from (default: 0). 38 | skip (int): The number of log entries to skip at the start (default: 0). This is for when there are 39 | multiple entries at the same timestamp. 40 | 41 | Yields: 42 | A tuple with: 43 | dict: A CloudWatch log event with the following key-value pairs: 44 | 'timestamp' (int): The time of the event. 45 | 'message' (str): The log event data. 46 | 'ingestionTime' (int): The time the event was ingested. 47 | Position: The new position 48 | """ 49 | 50 | start_time, skip = position 51 | next_token = None 52 | 53 | event_count = 1 54 | while event_count > 0: 55 | if next_token is not None: 56 | token_arg = {"nextToken": next_token} 57 | else: 58 | token_arg = {} 59 | 60 | response = client.get_log_events( 61 | logGroupName=log_group, 62 | logStreamName=stream_name, 63 | startTime=start_time, 64 | startFromHead=True, 65 | **token_arg, 66 | ) 67 | next_token = response["nextForwardToken"] 68 | events = response["events"] 69 | event_count = len(events) 70 | if event_count > skip: 71 | events = events[skip:] 72 | skip = 0 73 | else: 74 | skip = skip - event_count 75 | events = [] 76 | for ev in events: 77 | ts, count = position 78 | if ev["timestamp"] == ts: 79 | position = Position(timestamp=ts, skip=count + 1) 80 | else: 81 | position = Position(timestamp=ev["timestamp"], skip=1) 82 | yield ev, position 83 | 84 | 85 | # Copy/paste/slight mods from session.logs_for_job() in the SageMaker Python SDK 86 | def logs_for_build( 87 | build_id, session, wait=False, poll=10 88 | ): # noqa: C901 - suppress complexity warning for this method 89 | """Display the logs for a given build, optionally tailing them until the 90 | build is complete. 91 | 92 | Args: 93 | build_id (str): The ID of the build to display the logs for. 94 | wait (bool): Whether to keep looking for new log entries until the build completes (default: False). 95 | poll (int): The interval in seconds between polling for new log entries and build completion (default: 10). 96 | session (boto3.session.Session): A boto3 session to use 97 | 98 | Raises: 99 | ValueError: If waiting and the build fails. 100 | """ 101 | 102 | codebuild = session.client("codebuild") 103 | description = codebuild.batch_get_builds(ids=[build_id])["builds"][0] 104 | status = description["buildStatus"] 105 | 106 | log_group = description["logs"].get("groupName") 107 | stream_name = description["logs"].get("streamName") # The list of log streams 108 | position = Position( 109 | timestamp=0, skip=0 110 | ) # The current position in each stream, map of stream name -> position 111 | 112 | # Increase retries allowed (from default of 4), as we don't want waiting for a build 113 | # to be interrupted by a transient exception. 
114 | config = botocore.config.Config(retries={"max_attempts": 15}) 115 | client = session.client("logs", config=config) 116 | 117 | job_already_completed = False if status == "IN_PROGRESS" else True 118 | 119 | state = ( 120 | LogState.STARTING if wait and not job_already_completed else LogState.COMPLETE 121 | ) 122 | dot = True 123 | 124 | while state == LogState.STARTING and log_group == None: 125 | time.sleep(poll) 126 | description = codebuild.batch_get_builds(ids=[build_id])["builds"][0] 127 | log_group = description["logs"].get("groupName") 128 | stream_name = description["logs"].get("streamName") 129 | 130 | if state == LogState.STARTING: 131 | state = LogState.TAILING 132 | 133 | # The loop below implements a state machine that alternates between checking the build status and 134 | # reading whatever is available in the logs at this point. Note, that if we were called with 135 | # wait == False, we never check the job status. 136 | # 137 | # If wait == TRUE and job is not completed, the initial state is STARTING 138 | # If wait == FALSE, the initial state is COMPLETE (doesn't matter if the job really is complete). 139 | # 140 | # The state table: 141 | # 142 | # STATE ACTIONS CONDITION NEW STATE 143 | # ---------------- ---------------- ------------------- ---------------- 144 | # STARTING Pause, Get Status Valid LogStream Arn TAILING 145 | # Else STARTING 146 | # TAILING Read logs, Pause, Get status Job complete JOB_COMPLETE 147 | # Else TAILING 148 | # JOB_COMPLETE Read logs, Pause Any COMPLETE 149 | # COMPLETE Read logs, Exit N/A 150 | # 151 | # Notes: 152 | # - The JOB_COMPLETE state forces us to do an extra pause and read any items that got to Cloudwatch after 153 | # the build was marked complete. 154 | last_describe_job_call = time.time() 155 | dot_printed = False 156 | while True: 157 | for event, position in log_stream(client, log_group, stream_name, position): 158 | print(event["message"].rstrip()) 159 | if dot: 160 | dot = False 161 | if dot_printed: 162 | print() 163 | if state == LogState.COMPLETE: 164 | break 165 | 166 | time.sleep(poll) 167 | if dot: 168 | print(".", end="") 169 | sys.stdout.flush() 170 | dot_printed = True 171 | if state == LogState.JOB_COMPLETE: 172 | state = LogState.COMPLETE 173 | elif time.time() - last_describe_job_call >= 30: 174 | description = codebuild.batch_get_builds(ids=[build_id])["builds"][0] 175 | status = description["buildStatus"] 176 | 177 | last_describe_job_call = time.time() 178 | 179 | status = description["buildStatus"] 180 | 181 | if status != "IN_PROGRESS": 182 | print() 183 | state = LogState.JOB_COMPLETE 184 | 185 | if wait: 186 | if dot: 187 | print() -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/__init__.py -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | 5 | def get_features(id_to_node, node_features): 6 | """ 7 | 8 | :param id_to_node: dictionary mapping node names(id) to dgl node idx 9 | :param 
node_features: path to file containing node features 10 | :return: (np.ndarray, list) node feature matrix in order and new nodes not yet in the graph 11 | """ 12 | indices, features, new_nodes = [], [], [] 13 | max_node = max(id_to_node.values()) 14 | with open(node_features, "r") as fh: 15 | for line in fh: 16 | node_feats = line.strip().split(",") 17 | node_id = node_feats[0] 18 | feats = np.array(list(map(float, node_feats[1:]))) 19 | features.append(feats) 20 | if node_id not in id_to_node: 21 | max_node += 1 22 | id_to_node[node_id] = max_node 23 | new_nodes.append(max_node) 24 | 25 | indices.append(id_to_node[node_id]) 26 | 27 | features = np.array(features).astype('float32') 28 | features = features[np.argsort(indices), :] 29 | return features, new_nodes 30 | 31 | 32 | def get_labels(id_to_node, n_nodes, target_node_type, labels_path, masked_nodes_path, additional_mask_rate=0): 33 | """ 34 | 35 | :param id_to_node: dictionary mapping node names(id) to dgl node idx 36 | :param n_nodes: number of user nodes in the graph 37 | :param target_node_type: column name for target node type 38 | :param labels_path: filepath containing labelled nodes 39 | :param masked_nodes_path: filepath containing list of nodes to be masked 40 | :param additional_mask_rate: additional_mask_rate: float for additional masking of nodes with labels during training 41 | :return: (list, list) train and test mask array 42 | """ 43 | node_to_id = {v: k for k, v in id_to_node.items()} 44 | user_to_label = pd.read_csv(labels_path).set_index(target_node_type) 45 | labels = user_to_label.loc[map(int, pd.Series(node_to_id)[np.arange(n_nodes)].values)].values.flatten() 46 | masked_nodes = read_masked_nodes(masked_nodes_path) 47 | train_mask, test_mask = _get_mask(id_to_node, node_to_id, n_nodes, masked_nodes, 48 | additional_mask_rate=additional_mask_rate) 49 | return labels, train_mask, test_mask 50 | 51 | 52 | def read_masked_nodes(masked_nodes_path): 53 | """ 54 | Returns a list of nodes extracted from the path passed in 55 | 56 | :param masked_nodes_path: filepath containing list of nodes to be masked i.e test users 57 | :return: list 58 | """ 59 | with open(masked_nodes_path, "r") as fh: 60 | masked_nodes = [line.strip() for line in fh] 61 | return masked_nodes 62 | 63 | 64 | def _get_mask(id_to_node, node_to_id, num_nodes, masked_nodes, additional_mask_rate): 65 | """ 66 | Returns the train and test mask arrays 67 | 68 | :param id_to_node: dictionary mapping node names(id) to dgl node idx 69 | :param node_to_id: dictionary mapping dgl node idx to node names(id) 70 | :param num_nodes: number of user/account nodes in the graph 71 | :param masked_nodes: list of nodes to be masked during training, nodes without labels 72 | :param additional_mask_rate: float for additional masking of nodes with labels during training 73 | :return: (list, list) train and test mask array 74 | """ 75 | train_mask = np.ones(num_nodes) 76 | test_mask = np.zeros(num_nodes) 77 | for node_id in masked_nodes: 78 | train_mask[id_to_node[node_id]] = 0 79 | test_mask[id_to_node[node_id]] = 1 80 | if additional_mask_rate and additional_mask_rate < 1: 81 | unmasked = np.array([idx for idx in range(num_nodes) if node_to_id[idx] not in masked_nodes]) 82 | yet_unmasked = np.random.permutation(unmasked)[:int(additional_mask_rate*num_nodes)] 83 | train_mask[yet_unmasked] = 0 84 | return train_mask, test_mask 85 | 86 | 87 | def _get_node_idx(id_to_node, node_type, node_id, ptr): 88 | if node_type in id_to_node: 89 | if node_id in id_to_node[node_type]: 90 | 
node_idx = id_to_node[node_type][node_id] 91 | else: 92 | id_to_node[node_type][node_id] = ptr 93 | node_idx = ptr 94 | ptr += 1 95 | else: 96 | id_to_node[node_type] = {} 97 | id_to_node[node_type][node_id] = ptr 98 | node_idx = ptr 99 | ptr += 1 100 | 101 | return node_idx, id_to_node, ptr 102 | 103 | 104 | def parse_edgelist(edges, id_to_node, header=False, source_type='user', sink_type='user'): 105 | """ 106 | Parse an edgelist path file and return the edges as a list of tuple 107 | :param edges: path to comma separated file containing bipartite edges with header for edgetype 108 | :param id_to_node: dictionary containing mapping for node names(id) to dgl node indices 109 | :param header: boolean whether or not the file has a header row 110 | :param source_type: type of the source node in the edge. defaults to 'user' if no header 111 | :param sink_type: type of the sink node in the edge. defaults to 'user' if no header. 112 | :return: (list, dict) a list containing edges of a single relationship type as tuples and updated id_to_node dict. 113 | """ 114 | edge_list = [] 115 | source_pointer, sink_pointer = 0, 0 116 | with open(edges, "r") as fh: 117 | for i, line in enumerate(fh): 118 | source, sink = line.strip().split(",") 119 | if i == 0: 120 | if header: 121 | source_type, sink_type = source, sink 122 | if source_type in id_to_node: 123 | source_pointer = max(id_to_node[source_type].values()) + 1 124 | if sink_type in id_to_node: 125 | sink_pointer = max(id_to_node[sink_type].values()) + 1 126 | continue 127 | 128 | source_node, id_to_node, source_pointer = _get_node_idx(id_to_node, source_type, source, source_pointer) 129 | if source_type == sink_type: 130 | sink_node, id_to_node, source_pointer = _get_node_idx(id_to_node, sink_type, sink, source_pointer) 131 | else: 132 | sink_node, id_to_node, sink_pointer = _get_node_idx(id_to_node, sink_type, sink, sink_pointer) 133 | 134 | edge_list.append((source_node, sink_node)) 135 | 136 | return edge_list, id_to_node, source_type, sink_type 137 | 138 | 139 | def read_edges(edges, nodes=None): 140 | """ 141 | Read edges and node features 142 | 143 | :param edges: path to comma separated file containing all edges 144 | :param nodes: path to comma separated file containing all nodes + features 145 | :return: (list, list, list, dict) sources, sinks, features and id_to_node dictionary containing mappings 146 | from node names(id) to dgl node indices 147 | """ 148 | node_pointer = 0 149 | id_to_node = {} 150 | features = [] 151 | sources, sinks = [], [] 152 | if nodes is not None: 153 | with open(nodes, "r") as fh: 154 | for line in fh: 155 | node_feats = line.strip().split(",") 156 | node_id = node_feats[0] 157 | if node_id not in id_to_node: 158 | id_to_node[node_id] = node_pointer 159 | node_pointer += 1 160 | if len(node_feats) > 1: 161 | feats = np.array(list(map(float, node_feats[1:]))) 162 | features.append(feats) 163 | with open(edges, "r") as fh: 164 | for line in fh: 165 | source, sink = line.strip().split(",") 166 | sources.append(id_to_node[source]) 167 | sinks.append(id_to_node[sink]) 168 | else: 169 | with open(edges, "r") as fh: 170 | for line in fh: 171 | source, sink = line.strip().split(",") 172 | if source not in id_to_node: 173 | id_to_node[source] = node_pointer 174 | node_pointer += 1 175 | if sink not in id_to_node: 176 | id_to_node[sink] = node_pointer 177 | node_pointer += 1 178 | sources.append(id_to_node[source]) 179 | sinks.append(id_to_node[sink]) 180 | 181 | return sources, sinks, features, id_to_node 182 | 
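
# ---------------------------------------------------------------------------
# Hedged usage sketch (not part of the original module). It shows how the
# helpers above are typically composed: parse one preprocessed edgelist, then
# load features and labels for the target nodes. The file names are the
# defaults written by graph_data_preprocessor.py; TRAINING_DIR is a
# hypothetical local folder containing that output.
if __name__ == '__main__':
    import os

    training_dir = os.environ.get('TRAINING_DIR', '.')
    id_to_node = {}

    # Edgelists are written with a 'TransactionID,<relation>' header row.
    edge_list, id_to_node, src, dst = parse_edgelist(
        os.path.join(training_dir, 'relation_card1_edgelist.csv'), id_to_node, header=True)

    # Features and labels are keyed by the target node type (TransactionID).
    features, new_nodes = get_features(id_to_node['TransactionID'],
                                       os.path.join(training_dir, 'features.csv'))
    labels, train_mask, test_mask = get_labels(id_to_node['TransactionID'],
                                               len(id_to_node['TransactionID']),
                                               'TransactionID',
                                               os.path.join(training_dir, 'tags.csv'),
                                               os.path.join(training_dir, 'test.csv'))
    print("edges: {}, feature matrix: {}, labelled nodes: {}".format(
        len(edge_list), features.shape, len(labels)))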
-------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/estimator_fns.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import logging 4 | 5 | 6 | def parse_args(): 7 | parser = argparse.ArgumentParser() 8 | 9 | parser.add_argument('--training-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN']) 10 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 11 | parser.add_argument('--output-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR']) 12 | parser.add_argument('--nodes', type=str, default='features.csv') 13 | parser.add_argument('--target-ntype', type=str, default='TransactionID') 14 | parser.add_argument('--edges', type=str, default='homogeneous_edgelist.csv') 15 | parser.add_argument('--heterogeneous', type=lambda x: (str(x).lower() in ['true', '1', 'yes']), 16 | default=True, help='use hetero graph') 17 | parser.add_argument('--no-features', type=lambda x: (str(x).lower() in ['true', '1', 'yes']), 18 | default=False, help='do not use node features') 19 | parser.add_argument('--mini-batch', type=lambda x: (str(x).lower() in ['true', '1', 'yes']), 20 | default=True, help='use mini-batch training and sample graph') 21 | parser.add_argument('--labels', type=str, default='tags.csv') 22 | parser.add_argument('--new-accounts', type=str, default='test.csv') 23 | parser.add_argument('--predictions', type=str, default='preds.csv', help='file to save predictions on new-accounts') 24 | parser.add_argument('--compute-metrics', type=lambda x: (str(x).lower() in ['true', '1', 'yes']), 25 | default=True, help='compute evaluation metrics after training') 26 | parser.add_argument('--threshold', type=float, default=0, help='threshold for making predictions, default : argmax') 27 | parser.add_argument('--model', type=str, default='rgcn', help='gnn to use. 
options: gcn, graphsage, gat, gem') 28 | parser.add_argument('--num-gpus', type=int, default=1) 29 | parser.add_argument('--batch-size', type=int, default=500) 30 | parser.add_argument('--optimizer', type=str, default='adam') 31 | parser.add_argument('--lr', type=float, default=1e-2) 32 | parser.add_argument('--n-epochs', type=int, default=20) 33 | parser.add_argument('--n-neighbors', type=int, default=10, help='number of neighbors to sample') 34 | parser.add_argument('--n-hidden', type=int, default=16, help='number of hidden units') 35 | parser.add_argument('--n-layers', type=int, default=3, help='number of hidden layers') 36 | parser.add_argument('--weight-decay', type=float, default=5e-4, help='Weight for L2 loss') 37 | parser.add_argument('--dropout', type=float, default=0.2, help='dropout probability, for gat only features') 38 | parser.add_argument('--attn-drop', type=float, default=0.6, help='attention dropout for gat/gem') 39 | parser.add_argument('--num-heads', type=int, default=4, help='number of hidden attention heads for gat/gem') 40 | parser.add_argument('--num-out-heads', type=int, default=1, help='number of output attention heads for gat/gem') 41 | parser.add_argument('--residual', action="store_true", default=False, help='use residual connection for gat') 42 | parser.add_argument('--alpha', type=float, default=0.2, help='the negative slop of leaky relu') 43 | parser.add_argument('--aggregator-type', type=str, default="gcn", help="graphsage aggregator: mean/gcn/pool/lstm") 44 | parser.add_argument('--embedding-size', type=int, default=360, help="embedding size for node embedding") 45 | 46 | return parser.parse_args() 47 | 48 | 49 | def get_logger(name): 50 | logger = logging.getLogger(name) 51 | log_format = '%(asctime)s %(levelname)s %(name)s: %(message)s' 52 | logging.basicConfig(format=log_format, level=logging.INFO) 53 | logger.setLevel(logging.INFO) 54 | return logger -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/graph.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import dgl 4 | import numpy as np 5 | 6 | from data import * 7 | 8 | 9 | def get_edgelists(edgelist_expression, directory): 10 | if "," in edgelist_expression: 11 | return edgelist_expression.split(",") 12 | files = os.listdir(directory) 13 | compiled_expression = re.compile(edgelist_expression) 14 | return [filename for filename in files if compiled_expression.match(filename)] 15 | 16 | def construct_graph(training_dir, edges, nodes, target_node_type, heterogeneous=True): 17 | if heterogeneous: 18 | print("Getting relation graphs from the following edge lists : {} ".format(edges)) 19 | edgelists, id_to_node = {}, {} 20 | for i, edge in enumerate(edges): 21 | edgelist, id_to_node, src, dst = parse_edgelist(os.path.join(training_dir, edge), id_to_node, header=True) 22 | if src == target_node_type: 23 | src = 'target' 24 | if dst == target_node_type: 25 | dst = 'target' 26 | edgelists[(src, 'relation{}'.format(i), dst)] = edgelist 27 | print("Read edges for relation{} from edgelist: {}".format(i, os.path.join(training_dir, edge))) 28 | 29 | # reverse edge list so that relation is undirected 30 | # edgelists[(dst, 'reverse_relation{}'.format(i), src)] = [(b, a) for a, b in edgelist] 31 | 32 | # get features for target nodes 33 | features, new_nodes = get_features(id_to_node[target_node_type], os.path.join(training_dir, nodes)) 34 | 
print("Read in features for target nodes") 35 | # handle target nodes that have features but don't have any connections 36 | # if new_nodes: 37 | # edgelists[('target', 'relation'.format(i+1), 'none')] = [(node, 0) for node in new_nodes] 38 | # edgelists[('none', 'reverse_relation{}'.format(i + 1), 'target')] = [(0, node) for node in new_nodes] 39 | 40 | # add self relation 41 | edgelists[('target', 'self_relation', 'target')] = [(t, t) for t in id_to_node[target_node_type].values()] 42 | 43 | g = dgl.heterograph(edgelists) 44 | print( 45 | "Constructed heterograph with the following metagraph structure: Node types {}, Edge types{}".format( 46 | g.ntypes, g.canonical_etypes)) 47 | print("Number of nodes of type target : {}".format(g.number_of_nodes('target'))) 48 | 49 | g.nodes['target'].data['features'] = features 50 | 51 | id_to_node = id_to_node[target_node_type] 52 | 53 | else: 54 | sources, sinks, features, id_to_node = read_edges(os.path.join(training_dir, edges[0]), 55 | os.path.join(training_dir, nodes)) 56 | 57 | # add self relation 58 | all_nodes = sorted(id_to_node.values()) 59 | sources.extend(all_nodes) 60 | sinks.extend(all_nodes) 61 | 62 | g = dgl.graph((sources, sinks)) 63 | 64 | if features: 65 | g.ndata['features'] = np.array(features).astype('float32') 66 | 67 | print('read graph from node list and edge list') 68 | 69 | features = g.ndata['features'] 70 | 71 | return g, features, id_to_node 72 | -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/awslabs/sagemaker-graph-fraud-detection/35e4203dd6ec7298c12361140013b487765cbd11/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/__init__.py -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/mxnet.py: -------------------------------------------------------------------------------- 1 | from mxnet import gluon, nd 2 | from dgl.nn.mxnet import GraphConv, GATConv, SAGEConv 3 | import dgl.function as fn 4 | 5 | 6 | class HeteroRGCNLayer(gluon.Block): 7 | def __init__(self, in_size, out_size, etypes): 8 | super(HeteroRGCNLayer, self).__init__() 9 | # W_r for each relation 10 | with self.name_scope(): 11 | self.weight = {name: gluon.nn.Dense(out_size, use_bias=False) for name in etypes} 12 | for child in self.weight.values(): 13 | self.register_child(child) 14 | 15 | def forward(self, G, feat_dict): 16 | # The input is a dictionary of node features for each type 17 | funcs = {} 18 | for srctype, etype, dsttype in G.canonical_etypes: 19 | # Compute W_r * h 20 | if srctype in feat_dict: 21 | Wh = self.weight[etype](feat_dict[srctype]) 22 | # Save it in graph for message passing 23 | G.srcnodes[srctype].data['Wh_%s' % etype] = Wh 24 | # Specify per-relation message passing functions: (message_func, reduce_func). 25 | # Note that the results are saved to the same destination feature 'h', which 26 | # hints the type wise reducer for aggregation. 27 | funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h')) 28 | # Trigger message passing of multiple types. 29 | # The first argument is the message passing functions for each relation. 
30 | # The second one is the type wise reducer, could be "sum", "max", 31 | # "min", "mean", "stack" 32 | G.multi_update_all(funcs, 'sum') 33 | # return the updated node feature dictionary 34 | return {ntype: G.dstnodes[ntype].data['h'] for ntype in G.ntypes if 'h' in G.dstnodes[ntype].data} 35 | 36 | 37 | class HeteroRGCN(gluon.Block): 38 | def __init__(self, g, in_size, hidden_size, out_size, n_layers, embedding_size, ctx): 39 | super(HeteroRGCN, self).__init__() 40 | self.g = g 41 | self.ctx = ctx 42 | 43 | # Use trainable node embeddings as featureless inputs for all non target node types. 44 | with self.name_scope(): 45 | self.embed_dict = {ntype: gluon.nn.Embedding(g.number_of_nodes(ntype), embedding_size) 46 | for ntype in g.ntypes if ntype != 'target'} 47 | 48 | for child in self.embed_dict.values(): 49 | self.register_child(child) 50 | 51 | # create layers 52 | # input layer 53 | self.layers = gluon.nn.Sequential() 54 | self.layers.add(HeteroRGCNLayer(embedding_size, hidden_size, g.etypes)) 55 | # hidden layers 56 | for i in range(n_layers - 1): 57 | self.layers.add(HeteroRGCNLayer(hidden_size, hidden_size, g.etypes)) 58 | # output layer 59 | # self.layers.add(HeteroRGCNLayer(hidden_size, out_size, g.etypes)) 60 | self.layers.add(gluon.nn.Dense(out_size)) 61 | 62 | def forward(self, g, features): 63 | # get embeddings for all node types. for target node type, use passed in target features 64 | h_dict = {'target': features} 65 | for ntype in self.embed_dict: 66 | if g[0].number_of_nodes(ntype) > 0: 67 | h_dict[ntype] = self.embed_dict[ntype](nd.array(g[0].nodes(ntype), self.ctx)) 68 | 69 | # pass through all layers 70 | for i, layer in enumerate(self.layers[:-1]): 71 | if i != 0: 72 | h_dict = {k: nd.LeakyReLU(h) for k, h in h_dict.items()} 73 | h_dict = layer(g[i], h_dict) 74 | 75 | # get target logits 76 | # return h_dict['target'] 77 | return self.layers[-1](h_dict['target']) 78 | 79 | 80 | class NodeEmbeddingGNN(gluon.Block): 81 | def __init__(self, 82 | gnn, 83 | input_size, 84 | embedding_size): 85 | super(NodeEmbeddingGNN, self).__init__() 86 | 87 | with self.name_scope(): 88 | self.embed = gluon.nn.Embedding(input_size, embedding_size) 89 | self.gnn = gnn 90 | 91 | def forward(self, g, nodes): 92 | features = self.embed(nodes) 93 | h = self.gnn(g, features) 94 | return h 95 | 96 | 97 | class GCN(gluon.Block): 98 | def __init__(self, 99 | g, 100 | in_feats, 101 | n_hidden, 102 | n_classes, 103 | n_layers, 104 | activation, 105 | dropout): 106 | super(GCN, self).__init__() 107 | self.g = g 108 | self.layers = gluon.nn.Sequential() 109 | # input layer 110 | self.layers.add(GraphConv(in_feats, n_hidden, activation=activation)) 111 | # hidden layers 112 | for i in range(n_layers - 1): 113 | self.layers.add(GraphConv(n_hidden, n_hidden, activation=activation)) 114 | # output layer 115 | # self.layers.add(GraphConv(n_hidden, n_classes)) 116 | self.layers.add(gluon.nn.Dense(n_classes)) 117 | self.dropout = gluon.nn.Dropout(rate=dropout) 118 | 119 | def forward(self, g, features): 120 | h = features 121 | for i, layer in enumerate(self.layers[:-1]): 122 | if i != 0: 123 | h = self.dropout(h) 124 | h = layer(g[i], h) 125 | return self.layers[-1](h) 126 | 127 | 128 | class GraphSAGE(gluon.Block): 129 | def __init__(self, 130 | g, 131 | in_feats, 132 | n_hidden, 133 | n_classes, 134 | n_layers, 135 | activation, 136 | dropout, 137 | aggregator_type): 138 | super(GraphSAGE, self).__init__() 139 | self.g = g 140 | 141 | with self.name_scope(): 142 | self.layers = gluon.nn.Sequential() 
143 | # input layer 144 | self.layers.add(SAGEConv(in_feats, n_hidden, aggregator_type, feat_drop=dropout, activation=activation)) 145 | # hidden layers 146 | for i in range(n_layers - 1): 147 | self.layers.add(SAGEConv(n_hidden, n_hidden, aggregator_type, feat_drop=dropout, activation=activation)) 148 | # output layer 149 | self.layers.add(gluon.nn.Dense(n_classes)) 150 | 151 | def forward(self, g, features): 152 | h = features 153 | for i, layer in enumerate(self.layers[:-1]): 154 | h_dst = h[:g[i].number_of_dst_nodes()] 155 | h = layer(g[i], (h, h_dst)) 156 | return self.layers[-1](h) 157 | 158 | 159 | class GAT(gluon.Block): 160 | def __init__(self, 161 | g, 162 | in_dim, 163 | num_hidden, 164 | num_classes, 165 | num_layers, 166 | heads, 167 | activation, 168 | feat_drop, 169 | attn_drop, 170 | alpha, 171 | residual): 172 | super(GAT, self).__init__() 173 | self.g = g 174 | self.num_layers = num_layers 175 | self.gat_layers = [] 176 | self.activation = activation 177 | # input projection (no residual) 178 | self.gat_layers.append(GATConv( 179 | (in_dim, in_dim), num_hidden, heads[0], 180 | feat_drop, attn_drop, alpha, False)) 181 | # hidden layers 182 | for l in range(1, num_layers): 183 | # due to multi-head, the in_dim = num_hidden * num_heads 184 | self.gat_layers.append(GATConv( 185 | (num_hidden * heads[l-1], num_hidden * heads[l-1]), num_hidden, heads[l], 186 | feat_drop, attn_drop, alpha, residual)) 187 | # output projection 188 | self.output_proj = gluon.nn.Dense(num_classes) 189 | for i, layer in enumerate(self.gat_layers): 190 | self.register_child(layer, "gat_layer_{}".format(i)) 191 | self.register_child(self.output_proj, "dense_layer") 192 | 193 | def forward(self, g, inputs): 194 | h = inputs 195 | for l in range(self.num_layers): 196 | h_dst = h[:g[l].number_of_dst_nodes()] 197 | h = self.gat_layers[l](g[l], (h, h_dst)).flatten() 198 | h = self.activation(h) 199 | # output projection 200 | logits = self.output_proj(h) 201 | return logits 202 | -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/model/pytorch.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | from dgl.nn.pytorch import GraphConv, GATConv, SAGEConv 6 | import dgl.function as fn 7 | 8 | 9 | class HeteroRGCNLayer(nn.Module): 10 | def __init__(self, in_size, out_size, etypes): 11 | super(HeteroRGCNLayer, self).__init__() 12 | # W_r for each relation 13 | self.weight = nn.ModuleDict({ 14 | name: nn.Linear(in_size, out_size) for name in etypes 15 | }) 16 | 17 | def forward(self, G, feat_dict): 18 | # The input is a dictionary of node features for each type 19 | funcs = {} 20 | for srctype, etype, dsttype in G.canonical_etypes: 21 | # Compute W_r * h 22 | if srctype in feat_dict: 23 | Wh = self.weight[etype](feat_dict[srctype]) 24 | # Save it in graph for message passing 25 | G.nodes[srctype].data['Wh_%s' % etype] = Wh 26 | # Specify per-relation message passing functions: (message_func, reduce_func). 27 | # Note that the results are saved to the same destination feature 'h', which 28 | # hints the type wise reducer for aggregation. 29 | funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h')) 30 | # Trigger message passing of multiple types. 31 | # The first argument is the message passing functions for each relation. 
32 | # The second one is the type wise reducer, could be "sum", "max", 33 | # "min", "mean", "stack" 34 | G.multi_update_all(funcs, 'sum') 35 | # return the updated node feature dictionary 36 | return {ntype: G.dstnodes[ntype].data['h'] for ntype in G.ntypes if 'h' in G.dstnodes[ntype].data} 37 | 38 | 39 | class HeteroRGCN(nn.Module): 40 | def __init__(self, g, in_size, hidden_size, out_size, n_layers, embedding_size): 41 | super(HeteroRGCN, self).__init__() 42 | # Use trainable node embeddings as featureless inputs. 43 | embed_dict = {ntype: nn.Parameter(torch.Tensor(g.number_of_nodes(ntype), in_size)) 44 | for ntype in g.ntypes if ntype != 'user'} 45 | for key, embed in embed_dict.items(): 46 | nn.init.xavier_uniform_(embed) 47 | self.embed = nn.ParameterDict(embed_dict) 48 | # create layers 49 | self.layers = nn.Sequential() 50 | self.layers.add_module(HeteroRGCNLayer(embedding_size, hidden_size, g.etypes)) 51 | # hidden layers 52 | for i in range(n_layers - 1): 53 | self.layers.add_module = HeteroRGCNLayer(hidden_size, hidden_size, g.etypes) 54 | 55 | # output layer 56 | self.layers.add(nn.Dense(hidden_size, out_size)) 57 | 58 | def forward(self, g, features): 59 | # get embeddings for all node types. for user node type, use passed in user features 60 | h_dict = self.embed 61 | h_dict['user'] = features 62 | 63 | # pass through all layers 64 | for i, layer in enumerate(self.layers[:-1]): 65 | if i != 0: 66 | h_dict = {k: F.leaky_relu(h) for k, h in h_dict.items()} 67 | h_dict = layer(g[i], h_dict) 68 | 69 | # get user logits 70 | # return h_dict['user'] 71 | return self.layers[-1](h_dict['user']) 72 | 73 | 74 | class NodeEmbeddingGNN(nn.Module): 75 | def __init__(self, 76 | gnn, 77 | input_size, 78 | embedding_size): 79 | super(NodeEmbeddingGNN, self).__init__() 80 | 81 | self.embed = nn.Embedding(input_size, embedding_size) 82 | self.gnn = gnn 83 | 84 | def forward(self, g, nodes): 85 | features = self.embed(nodes) 86 | h = self.gnn(g, features) 87 | return h 88 | 89 | 90 | class GCN(nn.Module): 91 | def __init__(self, 92 | g, 93 | in_feats, 94 | n_hidden, 95 | n_classes, 96 | n_layers, 97 | activation, 98 | dropout): 99 | super(GCN, self).__init__() 100 | self.g = g 101 | self.layers = nn.Sequential() 102 | # input layer 103 | self.layers.add_module(GraphConv(in_feats, n_hidden, activation=activation)) 104 | # hidden layers 105 | for i in range(n_layers - 1): 106 | self.layers.add(GraphConv(n_hidden, n_hidden, activation=activation)) 107 | # output layer 108 | # self.layers.add(GraphConv(n_hidden, n_classes)) 109 | self.layers.add(nn.Linear(n_hidden, n_classes)) 110 | self.dropout = nn.Dropout(p=dropout) 111 | 112 | def forward(self, g, features): 113 | h = features 114 | for i, layer in enumerate(self.layers[:-1]): 115 | if i != 0: 116 | h = self.dropout(h) 117 | h = layer(g, h) 118 | return self.layers[-1](h) 119 | 120 | 121 | class GraphSAGE(nn.Module): 122 | def __init__(self, 123 | g, 124 | in_feats, 125 | n_hidden, 126 | n_classes, 127 | n_layers, 128 | activation, 129 | dropout, 130 | aggregator_type): 131 | super(GraphSAGE, self).__init__() 132 | self.g = g 133 | 134 | with self.name_scope(): 135 | self.layers = nn.Sequential() 136 | # input layer 137 | self.layers.add_module(SAGEConv(in_feats, n_hidden, aggregator_type, feat_drop=dropout, activation=activation)) 138 | # hidden layers 139 | for i in range(n_layers - 1): 140 | self.layers.add_module(SAGEConv(n_hidden, n_hidden, aggregator_type, feat_drop=dropout, activation=activation)) 141 | # output layer 142 | 
143 | 
144 |     def forward(self, g, features):
145 |         h = features
146 |         for layer in self.layers[:-1]:
147 |             h = layer(g, h)
148 |         return self.layers[-1](h)
149 | 
150 | 
151 | class GAT(nn.Module):
152 |     def __init__(self,
153 |                  g,
154 |                  in_dim,
155 |                  num_hidden,
156 |                  num_classes,
157 |                  num_layers,
158 |                  heads,
159 |                  activation,
160 |                  feat_drop,
161 |                  attn_drop,
162 |                  alpha,
163 |                  residual):
164 |         super(GAT, self).__init__()
165 |         self.g = g
166 |         self.num_layers = num_layers
167 |         self.gat_layers = nn.ModuleList()
168 |         self.activation = activation
169 |         # input projection (no residual)
170 |         self.gat_layers.append(GATConv(
171 |             in_dim, num_hidden, heads[0],
172 |             feat_drop, attn_drop, alpha, False))
173 |         # hidden layers
174 |         for l in range(1, num_layers):
175 |             # due to multi-head, the in_dim = num_hidden * num_heads
176 |             self.gat_layers.append(GATConv(
177 |                 num_hidden * heads[l-1], num_hidden, heads[l],
178 |                 feat_drop, attn_drop, alpha, residual))
179 |         # output projection
180 |         self.gat_layers.append(GATConv(
181 |             num_hidden * heads[-2], num_classes, heads[-1],
182 |             feat_drop, attn_drop, alpha, residual))
183 | 
184 |     def forward(self, g, inputs):
185 |         h = inputs
186 |         for l in range(self.num_layers):
187 |             h = self.gat_layers[l](g, h).flatten(1)
188 |             h = self.activation(h)
189 |         # output projection
190 |         logits = self.gat_layers[-1](g, h).mean(1)
191 |         return logits
192 | 
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/requirements.txt:
--------------------------------------------------------------------------------
1 | dgl-cu101==0.4.3
2 | matplotlib==3.0.3
--------------------------------------------------------------------------------
/source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/sampler.py:
--------------------------------------------------------------------------------
1 | import dgl
2 | 
3 | 
4 | class HeteroGraphNeighborSampler:
5 |     """Neighbor sampler on heterogeneous graphs
6 |     Parameters
7 |     ----------
8 |     g : DGLHeteroGraph
9 |         Full graph
10 |     category : str
11 |         Category name of the seed nodes.
12 |     nhops : int
13 |         Number of hops to sample/number of layers in the node flow.
14 |     fanout : int
15 |         Fanout of each hop starting from the seed nodes. If a fanout is None,
16 |         sample full neighbors.
17 |     """
18 |     def __init__(self, g, category, nhops, fanout=None):
19 |         self.g = g
20 |         self.category = category
21 |         self.fanouts = [fanout] * nhops
22 | 
23 |     def sample_block(self, seeds):
24 |         blocks = []
25 |         seeds = {self.category: seeds}
26 |         cur = seeds
27 |         for fanout in self.fanouts:
28 |             if fanout is None:
29 |                 frontier = dgl.in_subgraph(self.g, cur)
30 |             else:
31 |                 frontier = dgl.sampling.sample_neighbors(self.g, cur, fanout)
32 |             block = dgl.to_block(frontier, cur)
33 |             cur = {}
34 |             for ntype in block.srctypes:
35 |                 cur[ntype] = block.srcnodes[ntype].data[dgl.NID]
36 |             blocks.insert(0, block)
37 |         return blocks, cur[self.category]
38 | 
39 | 
40 | class NeighborSampler:
41 |     """Neighbor sampler on homogeneous graphs
42 |     Parameters
43 |     ----------
44 |     g : DGLGraph
45 |         Full graph
46 |     nhops : int
47 |         Number of hops to sample/number of layers in the node flow.
48 |     fanout : int
49 |         Fanout of each hop starting from the seed nodes. If a fanout is None,
50 |         sample full neighbors.
51 | """ 52 | def __init__(self, g, nhops, fanout=None): 53 | self.g = g 54 | self.fanouts = [fanout] * nhops 55 | self.nhops = nhops 56 | 57 | def sample_block(self, seeds): 58 | blocks = [] 59 | for fanout in self.fanouts: 60 | # For each seed node, sample ``fanout`` neighbors. 61 | if fanout is None: 62 | frontier = dgl.in_subgraph(self.g, seeds) 63 | else: 64 | frontier = dgl.sampling.sample_neighbors(self.g, seeds, fanout, replace=False) 65 | # Then we compact the frontier into a bipartite graph for message passing. 66 | block = dgl.to_block(frontier, seeds) 67 | # Obtain the seed nodes for next layer. 68 | seeds = block.srcdata[dgl.NID] 69 | 70 | blocks.insert(0, block) 71 | return blocks, blocks[0].srcdata[dgl.NID] 72 | 73 | 74 | class FullGraphSampler: 75 | """Does nothing and just returns the full graph 76 | Parameters 77 | ---------- 78 | g : DGLGraph 79 | Full graph 80 | nhops : int 81 | Number of hops to sample/number of layers in the node flow. 82 | """ 83 | 84 | def __init__(self, g, nhops): 85 | self.g = g 86 | self.nhops = nhops 87 | 88 | def sample_block(self, seeds): 89 | return [self.g] * self.nhops, seeds 90 | -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/train_dgl_mxnet_entry_point.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.environ['DGLBACKEND'] = 'mxnet' 3 | import mxnet as mx 4 | from mxnet import nd, gluon, autograd 5 | import dgl 6 | 7 | import numpy as np 8 | import pandas as pd 9 | 10 | import time 11 | import logging 12 | import pickle 13 | 14 | from estimator_fns import * 15 | from graph import * 16 | from data import * 17 | from utils import * 18 | from model.mxnet import * 19 | from sampler import * 20 | 21 | def normalize(feature_matrix): 22 | mean = nd.mean(feature_matrix, axis=0) 23 | stdev = nd.sqrt(nd.sum((feature_matrix - mean)**2, axis=0)/feature_matrix.shape[0]) 24 | return (feature_matrix - mean) / stdev 25 | 26 | 27 | def get_dataloader(data_size, batch_size, mini_batch=True): 28 | batch_size = batch_size if mini_batch else data_size 29 | train_dataloader = gluon.data.BatchSampler(gluon.data.RandomSampler(data_size), batch_size, 'keep') 30 | test_dataloader = gluon.data.BatchSampler(gluon.data.SequentialSampler(data_size), batch_size, 'keep') 31 | 32 | return train_dataloader, test_dataloader 33 | 34 | 35 | def train(model, trainer, loss, features, labels, train_loader, test_loader, train_g, test_g, train_mask, test_mask, 36 | ctx, n_epochs, batch_size, output_dir, thresh, scale_pos_weight, compute_metrics=True, mini_batch=True): 37 | duration = [] 38 | for epoch in range(n_epochs): 39 | tic = time.time() 40 | loss_val = 0. 
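        # Mini-batch loop: the sampler expands each batch of seed node IDs into a
        # node flow (list of blocks), the model scores the seed nodes from the sampled
        # neighborhood features, and a softmax cross-entropy weighted by
        # scale_pos_weight * train_mask (so unlabeled/test nodes contribute nothing)
        # is averaged over the batch and backpropagated.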
41 | 42 | for n, batch in enumerate(train_loader): 43 | # logging.info("Iteration: {:05d}".format(n)) 44 | node_flow, batch_nids = train_g.sample_block(nd.array(batch).astype('int64')) 45 | batch_indices = nd.array(batch, ctx=ctx) 46 | with autograd.record(): 47 | pred = model(node_flow, features[batch_nids.as_in_context(ctx)]) 48 | l = loss(pred, labels[batch_indices], mx.nd.expand_dims(scale_pos_weight*train_mask, 1)[batch_indices]) 49 | l = l.sum()/len(batch) 50 | 51 | l.backward() 52 | trainer.step(batch_size=1, ignore_stale_grad=True) 53 | 54 | loss_val += l.asscalar() 55 | # logging.info("Current loss {:04f}".format(loss_val/(n+1))) 56 | 57 | duration.append(time.time() - tic) 58 | metric = evaluate(model, train_g, features, labels, train_mask, ctx, batch_size, mini_batch) 59 | logging.info("Epoch {:05d} | Time(s) {:.4f} | Loss {:.4f} | F1 {:.4f}".format( 60 | epoch, np.mean(duration), loss_val/(n+1), metric)) 61 | 62 | class_preds, pred_proba = get_model_class_predictions(model, test_g, test_loader, features, ctx, threshold=thresh) 63 | 64 | if compute_metrics: 65 | acc, f1, p, r, roc, pr, ap, cm = get_metrics(class_preds, pred_proba, labels, test_mask, output_dir) 66 | logging.info("Metrics") 67 | logging.info("""Confusion Matrix: 68 | {} 69 | f1: {:.4f}, precision: {:.4f}, recall: {:.4f}, acc: {:.4f}, roc: {:.4f}, pr: {:.4f}, ap: {:.4f} 70 | """.format(cm, f1, p, r, acc, roc, pr, ap)) 71 | 72 | return model, class_preds, pred_proba 73 | 74 | 75 | def evaluate(model, g, features, labels, mask, ctx, batch_size, mini_batch=True): 76 | f1 = mx.metric.F1() 77 | preds = [] 78 | batch_size = batch_size if mini_batch else features.shape[0] 79 | dataloader = gluon.data.BatchSampler(gluon.data.SequentialSampler(features.shape[0]), batch_size, 'keep') 80 | for batch in dataloader: 81 | node_flow, batch_nids = g.sample_block(nd.array(batch).astype('int64')) 82 | preds.append(model(node_flow, features[batch_nids.as_in_context(ctx)])) 83 | nd.waitall() 84 | 85 | # preds = nd.concat(*preds, dim=0).argmax(axis=1) 86 | preds = nd.concat(*preds, dim=0) 87 | mask = nd.array(np.where(mask.asnumpy()), ctx=ctx) 88 | f1.update(preds=nd.softmax(preds[mask], axis=1).reshape(-3, 0), labels=labels[mask].reshape(-1,)) 89 | return f1.get()[1] 90 | 91 | 92 | def get_model_predictions(model, g, dataloader, features, ctx): 93 | pred = [] 94 | for batch in dataloader: 95 | node_flow, batch_nids = g.sample_block(nd.array(batch).astype('int64')) 96 | pred.append(model(node_flow, features[batch_nids.as_in_context(ctx)])) 97 | nd.waitall() 98 | return nd.concat(*pred, dim=0) 99 | 100 | 101 | def get_model_class_predictions(model, g, datalaoder, features, ctx, threshold=None): 102 | unnormalized_preds = get_model_predictions(model, g, datalaoder, features, ctx) 103 | pred_proba = nd.softmax(unnormalized_preds)[:, 1].asnumpy().flatten() 104 | if not threshold: 105 | return unnormalized_preds.argmax(axis=1).asnumpy().flatten().astype(int), pred_proba 106 | return np.where(pred_proba > threshold, 1, 0), pred_proba 107 | 108 | 109 | def save_prediction(pred, pred_proba, id_to_node, training_dir, new_accounts, output_dir, predictions_file): 110 | prediction_query = read_masked_nodes(os.path.join(training_dir, new_accounts)) 111 | pred_indices = np.array([id_to_node[query] for query in prediction_query]) 112 | 113 | pd.DataFrame.from_dict({'target': prediction_query, 114 | 'pred_proba': pred_proba[pred_indices], 115 | 'pred': pred[pred_indices]}).to_csv(os.path.join(output_dir, predictions_file), 116 | index=False) 117 | 
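# Note: the CSV written by save_prediction above has one row per new account, with
# columns ['target', 'pred_proba', 'pred'].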
118 | 119 | def save_model(g, model, model_dir, hyperparams): 120 | model.save_parameters(os.path.join(model_dir, 'model.params')) 121 | with open(os.path.join(model_dir, 'model_hyperparams.pkl'), 'wb') as f: 122 | pickle.dump(hyperparams, f) 123 | with open(os.path.join(model_dir, 'graph.pkl'), 'wb') as f: 124 | pickle.dump(g, f) 125 | 126 | 127 | def get_model(g, hyperparams, in_feats, n_classes, ctx, model_dir=None): 128 | 129 | if model_dir: # load using saved model state 130 | with open(os.path.join(model_dir, 'model_hyperparams.pkl'), 'rb') as f: 131 | hyperparams = pickle.load(f) 132 | with open(os.path.join(model_dir, 'graph.pkl'), 'rb') as f: 133 | g = pickle.load(f) 134 | 135 | if hyperparams['heterogeneous']: 136 | model = HeteroRGCN(g, 137 | in_feats, 138 | hyperparams['n_hidden'], 139 | n_classes, 140 | hyperparams['n_layers'], 141 | hyperparams['embedding_size'], 142 | ctx) 143 | else: 144 | if hyperparams['model'] == 'gcn': 145 | model = GCN(g, 146 | in_feats, 147 | hyperparams['n_hidden'], 148 | n_classes, 149 | hyperparams['n_layers'], 150 | nd.relu, 151 | hyperparams['dropout']) 152 | elif hyperparams['model'] == 'graphsage': 153 | model = GraphSAGE(g, 154 | in_feats, 155 | hyperparams['n_hidden'], 156 | n_classes, 157 | hyperparams['n_layers'], 158 | nd.relu, 159 | hyperparams['dropout'], 160 | hyperparams['aggregator_type']) 161 | else: 162 | heads = ([hyperparams['num_heads']] * hyperparams['n_layers']) + [hyperparams['num_out_heads']] 163 | model = GAT(g, 164 | in_feats, 165 | hyperparams['n_hidden'], 166 | n_classes, 167 | hyperparams['n_layers'], 168 | heads, 169 | gluon.nn.Lambda(lambda data: nd.LeakyReLU(data, act_type='elu')), 170 | hyperparams['dropout'], 171 | hyperparams['attn_drop'], 172 | hyperparams['alpha'], 173 | hyperparams['residual']) 174 | 175 | if hyperparams['no_features']: 176 | model = NodeEmbeddingGNN(model, in_feats, hyperparams['embedding_size']) 177 | 178 | if model_dir: 179 | model.load_parameters(os.path.join(model_dir, 'model.params')) 180 | else: 181 | model.initialize(ctx=ctx) 182 | 183 | return model 184 | 185 | 186 | if __name__ == '__main__': 187 | logging = get_logger(__name__) 188 | logging.info('numpy version:{} MXNet version:{} DGL version:{}'.format(np.__version__, 189 | mx.__version__, 190 | dgl.__version__)) 191 | 192 | args = parse_args() 193 | 194 | args.edges = get_edgelists(args.edges, args.training_dir) 195 | 196 | g, features, id_to_node = construct_graph(args.training_dir, args.edges, args.nodes, args.target_ntype, 197 | args.heterogeneous) 198 | 199 | features = normalize(nd.array(features)) 200 | if args.heterogeneous: 201 | g.nodes['target'].data['features'] = features 202 | else: 203 | g.ndata['features'] = features 204 | 205 | logging.info("Getting labels") 206 | n_nodes = g.number_of_nodes('target') if args.heterogeneous else g.number_of_nodes() 207 | labels, train_mask, test_mask = get_labels(id_to_node, 208 | n_nodes, 209 | args.target_ntype, 210 | os.path.join(args.training_dir, args.labels), 211 | os.path.join(args.training_dir, args.new_accounts)) 212 | logging.info("Got labels") 213 | 214 | labels = nd.array(labels).astype('float32') 215 | train_mask = nd.array(train_mask).astype('float32') 216 | test_mask = nd.array(test_mask).astype('float32') 217 | 218 | n_nodes = sum([g.number_of_nodes(n_type) for n_type in g.ntypes]) if args.heterogeneous else g.number_of_nodes() 219 | n_edges = sum([g.number_of_edges(e_type) for e_type in g.etypes]) if args.heterogeneous else g.number_of_edges() 220 | 221 | 
logging.info("""----Data statistics------' 222 | #Nodes: {} 223 | #Edges: {} 224 | #Features Shape: {} 225 | #Labeled Train samples: {} 226 | #Unlabeled Test samples: {}""".format(n_nodes, 227 | n_edges, 228 | features.shape, 229 | train_mask.sum().asscalar(), 230 | test_mask.sum().asscalar())) 231 | 232 | if args.num_gpus: 233 | cuda = True 234 | ctx = mx.gpu(0) 235 | else: 236 | cuda = False 237 | ctx = mx.cpu(0) 238 | 239 | logging.info("Initializing Model") 240 | in_feats = args.embedding_size if args.no_features else features.shape[1] 241 | n_classes = 2 242 | model = get_model(g, vars(args), in_feats, n_classes, ctx) 243 | logging.info("Initialized Model") 244 | 245 | if args.no_features: 246 | features = nd.array(g.nodes('target'), ctx) if args.heterogeneous else nd.array(g.nodes(), ctx) 247 | else: 248 | features = features.as_in_context(ctx) 249 | 250 | labels = labels.as_in_context(ctx) 251 | train_mask = train_mask.as_in_context(ctx) 252 | test_mask = test_mask.as_in_context(ctx) 253 | 254 | if not args.heterogeneous: 255 | # normalization 256 | degs = g.in_degrees().astype('float32') 257 | norm = mx.nd.power(degs, -0.5) 258 | if cuda: 259 | norm = norm.as_in_context(ctx) 260 | g.ndata['norm'] = mx.nd.expand_dims(norm, 1) 261 | 262 | if args.mini_batch: 263 | train_g = HeteroGraphNeighborSampler(g, 'target', args.n_layers, args.n_neighbors) if args.heterogeneous\ 264 | else NeighborSampler(g, args.n_layers, args.n_neighbors) 265 | 266 | test_g = HeteroGraphNeighborSampler(g, 'target', args.n_layers) if args.heterogeneous\ 267 | else NeighborSampler(g, args.n_layers) 268 | else: 269 | train_g, test_g = FullGraphSampler(g, args.n_layers), FullGraphSampler(g, args.n_layers) 270 | 271 | train_data, test_data = get_dataloader(features.shape[0], args.batch_size, args.mini_batch) 272 | 273 | loss = gluon.loss.SoftmaxCELoss() 274 | scale_pos_weight = ((train_mask.shape[0] - train_mask.sum()) / train_mask.sum()) 275 | 276 | logging.info(model) 277 | logging.info(model.collect_params()) 278 | trainer = gluon.Trainer(model.collect_params(), args.optimizer, {'learning_rate': args.lr, 'wd': args.weight_decay}) 279 | 280 | logging.info("Starting Model training") 281 | model, pred, pred_proba = train(model, trainer, loss, features, labels, train_data, test_data, train_g, test_g, 282 | train_mask, test_mask, ctx, args.n_epochs, args.batch_size, args.output_dir, 283 | args.threshold, scale_pos_weight, args.compute_metrics, args.mini_batch) 284 | logging.info("Finished Model training") 285 | 286 | logging.info("Saving model") 287 | save_model(g, model, args.model_dir, vars(args)) 288 | 289 | logging.info("Saving model predictions for new accounts") 290 | save_prediction(pred, pred_proba, id_to_node, args.training_dir, args.new_accounts, args.output_dir, args.predictions) 291 | -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/dgl_fraud_detection/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import numpy as np 4 | from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score 5 | import networkx as nx 6 | import matplotlib.pyplot as plt 7 | 8 | 9 | def get_metrics(pred, pred_proba, labels, mask, out_dir): 10 | labels, mask = labels.asnumpy().flatten().astype(int), mask.asnumpy().flatten().astype(int) 11 | labels, pred, pred_proba = labels[np.where(mask)], pred[np.where(mask)], pred_proba[np.where(mask)] 
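    # After the two lines above, labels, pred and pred_proba only contain the masked
    # (held-out test) nodes, so every metric computed below is a test-set metric.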
12 | 13 | acc = ((pred == labels)).sum() / mask.sum() 14 | 15 | true_pos = (np.where(pred == 1, 1, 0) + np.where(labels == 1, 1, 0) > 1).sum() 16 | false_pos = (np.where(pred == 1, 1, 0) + np.where(labels == 0, 1, 0) > 1).sum() 17 | false_neg = (np.where(pred == 0, 1, 0) + np.where(labels == 1, 1, 0) > 1).sum() 18 | true_neg = (np.where(pred == 0, 1, 0) + np.where(labels == 0, 1, 0) > 1).sum() 19 | 20 | precision = true_pos/(true_pos + false_pos) if (true_pos + false_pos) > 0 else 0 21 | recall = true_pos/(true_pos + false_neg) if (true_pos + false_neg) > 0 else 0 22 | 23 | f1 = 2*(precision*recall)/(precision + recall) if (precision + recall) > 0 else 0 24 | confusion_matrix = pd.DataFrame(np.array([[true_pos, false_pos], [false_neg, true_neg]]), 25 | columns=["labels positive", "labels negative"], 26 | index=["predicted positive", "predicted negative"]) 27 | 28 | ap = average_precision_score(labels, pred_proba) 29 | 30 | fpr, tpr, _ = roc_curve(labels, pred_proba) 31 | prc, rec, _ = precision_recall_curve(labels, pred_proba) 32 | roc_auc = auc(fpr, tpr) 33 | pr_auc = auc(rec, prc) 34 | 35 | save_roc_curve(fpr, tpr, roc_auc, os.path.join(out_dir, "roc_curve.png")) 36 | save_pr_curve(prc, rec, pr_auc, ap, os.path.join(out_dir, "pr_curve.png")) 37 | 38 | return acc, f1, precision, recall, roc_auc, pr_auc, ap, confusion_matrix 39 | 40 | 41 | def save_roc_curve(fpr, tpr, roc_auc, location): 42 | f = plt.figure() 43 | lw = 2 44 | plt.plot(fpr, tpr, color='darkorange', 45 | lw=lw, label='ROC curve (area = %0.2f)' % roc_auc) 46 | plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--') 47 | plt.xlim([0.0, 1.0]) 48 | plt.ylim([0.0, 1.05]) 49 | plt.xlabel('False Positive Rate') 50 | plt.ylabel('True Positive Rate') 51 | plt.title('Model ROC curve') 52 | plt.legend(loc="lower right") 53 | f.savefig(location) 54 | 55 | 56 | def save_pr_curve(fpr, tpr, pr_auc, ap, location): 57 | f = plt.figure() 58 | lw = 2 59 | plt.plot(fpr, tpr, color='darkorange', 60 | lw=lw, label='PR curve (area = %0.2f)' % pr_auc) 61 | plt.xlim([0.0, 1.0]) 62 | plt.ylim([0.0, 1.05]) 63 | plt.xlabel('Recall') 64 | plt.ylabel('Precision') 65 | plt.title('Model PR curve: AP={0:0.2f}'.format(ap)) 66 | plt.legend(loc="lower right") 67 | f.savefig(location) 68 | 69 | 70 | def save_graph_drawing(g, location): 71 | plt.figure(figsize=(12, 8)) 72 | node_colors = {node: 0.0 if 'user' in node else 0.5 for node in g.nodes()} 73 | nx.draw(g, node_size=10000, pos=nx.spring_layout(g), with_labels=True, font_size=14, 74 | node_color=list(node_colors.values()), font_color='white') 75 | plt.savefig(location, bbox_inches='tight') 76 | 77 | -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/requirements.txt: -------------------------------------------------------------------------------- 1 | sagemaker==1.72.0 2 | awscli>=1.18.140 -------------------------------------------------------------------------------- /source/sagemaker/sagemaker_graph_fraud_detection/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | setup( 5 | name='sagemaker_graph_fraud_detection', 6 | version='1.0', 7 | description='A package to organize code in the solution', 8 | packages=find_packages(exclude=('test',)) 9 | ) -------------------------------------------------------------------------------- /source/sagemaker/setup.sh: 
-------------------------------------------------------------------------------- 1 | export PIP_DISABLE_PIP_VERSION_CHECK=1 2 | 3 | pip install -r ./sagemaker_graph_fraud_detection/requirements.txt -q 4 | pip install -e ./sagemaker_graph_fraud_detection/ --------------------------------------------------------------------------------