├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── credit-card-fraud-detection ├── 1_feature_engineering.ipynb ├── 2_model_training.ipynb ├── 3_deployment_visualization.ipynb ├── assets │ └── knowledge_graph.b0e9408219d92f2ca3c7a05cccf9a5a72e34ddbd.png ├── config │ ├── model-hpo-configuration.json │ └── training-data-configuration.json └── neptune_ml_utils.py └── social-network-recommendations ├── README.md ├── assets ├── NeptuneML-illustration.png ├── message-passing.png └── social-network.png └── social-network-recommendations.ipynb /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 
41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Amazon Neptune ML Use Cases Workshops 2 | This repository contains code examples for building machine learning on graphs using Neptune ML. The code example covers a few use cases including Fraud Detection and Recommendations. 3 | 4 | 5 | ## What Is Amazon Neptune: 6 | ![](credit-card-fraud-detection/assets/knowledge_graph.b0e9408219d92f2ca3c7a05cccf9a5a72e34ddbd.png?raw=true) 7 | 8 | Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Amazon Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with milliseconds latency. Amazon Neptune supports popular graph models Property Graph and W3C's RDF, and their respective query languages Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. 
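For example, a minimal sketch of connecting to a Neptune cluster and running a couple of Gremlin traversals from Python with the open-source `gremlinpython` driver, the same driver used in the fraud-detection notebook in this repository (the endpoint below is a placeholder for your own cluster endpoint):

```python
from gremlin_python.structure.graph import Graph
from gremlin_python.process.traversal import T
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder endpoint -- replace with your own Neptune cluster endpoint.
remote = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = Graph().traversal().withRemote(remote)

print(g.V().count().next())                   # total number of vertices
print(g.V().groupCount().by(T.label).next())  # vertex count per label

remote.close()
```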
Neptune powers graph use cases such as recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security. 9 | 10 | 11 | ## Amazon Neptune ML: 12 | Amazon Neptune ML is a new capability of Neptune that uses Graph Neural Networks (GNNs), a machine learning technique purpose-built for graphs, to make easy, fast, and more accurate predictions using graph data. With Neptune ML, you can improve the accuracy of most predictions for graphs by over 50% when compared to making predictions using non-graph methods. 13 | 14 | Making accurate predictions on graphs with billions of relationships can be difficult and time-consuming. Existing ML approaches such as XGBoost can’t operate effectively on graphs because they are designed for tabular data. As a result, using these methods on graphs can take time, require specialized skills from developers, and produce sub-optimal predictions. 15 | 16 | Using the Deep Graph Library (DGL), an open-source library to which AWS contributes that makes it easy to apply deep learning to graph data, Neptune ML automates the heavy lifting of selecting and training the best ML model for graph data, and lets users run machine learning on their graph directly using Neptune APIs and queries. As a result, you can now create, train, and apply ML on Amazon Neptune data in hours instead of weeks without the need to learn new tools and ML technologies. 17 | 18 | #### Example Use Cases 19 | 20 | **[1- Fraud Detection](credit-card-fraud-detection/):** In this use case, we use the [IEEE CIS Credit Transactions](https://www.kaggle.com/c/ieee-fraud-detection/data) dataset to build a graph dataset, ingest it into a Neptune DB cluster, and build a node classification task that predicts whether a transaction is fraudulent by leveraging the different relationships between entities. 21 | 22 | 23 | ## Environment Setup 24 | All of the examples in this repository assume that you have a Neptune DB cluster provisioned and running. If you don't have a Neptune DB cluster, you can use the [CloudFormation Template](https://docs.aws.amazon.com/neptune/latest/userguide/get-started-create-cluster.html) to provision a new Amazon Neptune cluster. 25 | ## License 26 | 27 | This library is licensed under the MIT-0 License. See the LICENSE file. 28 | 29 | -------------------------------------------------------------------------------- /credit-card-fraud-detection/2_model_training.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "8975db06", 6 | "metadata": { 7 | "slideshow": { 8 | "slide_type": "slide" 9 | } 10 | }, 11 | "source": [ 12 | "# Graph Fraud Detection with Neptune ML" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "73b417d6", 18 | "metadata": { 19 | "slideshow": { 20 | "slide_type": "slide" 21 | } 22 | }, 23 | "source": [ 24 | "In this module, we will run an end-to-end pipeline to train a fraud detection model using graph neural networks. The steps will include the following:\n", 25 | "\n", 26 | "\n", 27 | "* Fraud detection dataset\n", 28 | "* Export and Processing\n", 29 | "* Model training\n", 30 | "* Inference queries\n", 31 | "\n", 32 | "\n", 33 | "**Fraud Detection** is a set of techniques and analyses that allow organizations to identify and prevent unauthorized activity. Fraud can also be any kind of abuse of a system to gain undeserved benefits.
This can include fraudulent credit card transactions, identity theft, insurance scams, etc. Fraudsters can collude to commit illegal activities and strive to make them look normal, so they can be difficult to detect. The most effective solutions for fighting fraud use a multifaceted approach that integrates several techniques. One of these techniques is the use of graphs.\n", 34 | "\n", 35 | "Graphs allow us to understand the relationships between various entities and how they are connected together, which helps in detecting fraud patterns that couldn't be detected by traditional methods. In this workshop, we will go through building an ML model from a graph database and train a graph neural network to estimate the probability of fraud for a certain transaction." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "id": "a09eed09", 41 | "metadata": {}, 42 | "source": [ 43 | "### 1- Restore Variables" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "id": "14e451fb", 50 | "metadata": { 51 | "slideshow": { 52 | "slide_type": "fragment" 53 | } 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "%store -r" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "076567f8", 63 | "metadata": {}, 64 | "source": [ 65 | "### 3- Establish Connection with Neptune Graph\n", 66 | "\n", 67 | "The next cell of code will establish a connection with the Neptune graph DB using a Python wrapper around Apache TinkerPop Gremlin. Apache TinkerPop is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP). Gremlin is the graph traversal language of TinkerPop. It is a functional, data-flow language that enables users to write complex traversals on (or queries of) their application’s property graph. \n", 68 | "\n", 69 | "Once we establish the remote graph connection, we can traverse through the graph and run different queries on the graph object." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "id": "05c4c08e", 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "from __future__ import print_function # Python 2/3 compatibility\n", 80 | "\n", 81 | "from gremlin_python import statics\n", 82 | "from gremlin_python.structure.graph import Graph\n", 83 | "from gremlin_python.process.graph_traversal import __\n", 84 | "from gremlin_python.process.strategies import *\n", 85 | "from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection\n", 86 | "\n", 87 | "graph = Graph()\n", 88 | "\n", 89 | "remoteConn = DriverRemoteConnection('wss://'+NEPTUNE_ENDPOINT+':8182/gremlin','g')\n", 90 | "g = graph.traversal().withRemote(remoteConn)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "bfb09daa", 96 | "metadata": { 97 | "slideshow": { 98 | "slide_type": "slide" 99 | } 100 | }, 101 | "source": [ 102 | "### 5- Reset the Neptune Database (Optional)\n", 103 | "\n", 104 | "If you created a new Neptune cluster for this exercise, you do not need to run this step.
This step will make sure that the database is empty before populating it with the new data.\n", 105 | "\n", 106 | "#### 5.1- Initiate a DB reset" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "id": "b6acea2e", 113 | "metadata": { 114 | "slideshow": { 115 | "slide_type": "fragment" 116 | } 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "%%bash -s \"$NEPTUNE_ENDPOINT\" --out RESPONSE\n", 121 | "\n", 122 | "awscurl -X POST \\\n", 123 | "-H 'Content-Type: application/json' https://$1:8182/system \\\n", 124 | "-d '{ \"action\" : \"initiateDatabaseReset\" }'" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "id": "45c8d460", 130 | "metadata": {}, 131 | "source": [ 132 | "#### 5.2- Process the respose and get the token" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "id": "0703b00f", 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "import ast\n", 143 | "reset_token = ast.literal_eval(RESPONSE)['payload']['token']" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "4bcec973", 149 | "metadata": { 150 | "slideshow": { 151 | "slide_type": "slide" 152 | } 153 | }, 154 | "source": [ 155 | "#### 5.3- Perform the DB Reset Using the Token\n", 156 | "\n", 157 | "Replace the Token ID below with the one from the output above. The next cell will initiate the DB reset" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "id": "1f3103e5", 164 | "metadata": { 165 | "slideshow": { 166 | "slide_type": "fragment" 167 | } 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$reset_token\"\n", 172 | "\n", 173 | "awscurl -X POST -H 'Content-Type: application/json' https://$1:8182/system -d '\n", 174 | "{ \n", 175 | "\"action\": \"performDatabaseReset\" ,\n", 176 | "\"token\" : \"'${2}'\"\n", 177 | "}'" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "id": "b66fd0b1", 183 | "metadata": {}, 184 | "source": [ 185 | "#### 5.4- Scale up the Neptune Instance Size for the Export Process (if needed)\n", 186 | "The bulk loader uses most of the free CPU cycles available in the cluster. Scaling up the instance before ingesting the graph data will help make the process much faster." 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "id": "2ee15079", 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "!aws neptune modify-db-instance --db-instance-identifier $NEPTUNE_INSTANCE_ID --apply-immediately --db-instance-class db.r5.12xlarge" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "id": "d6e4c56d", 202 | "metadata": {}, 203 | "source": [ 204 | "Now, you can go to the Neptune Cluster console and wait for the new larger instance to be added" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "id": "be3385a5", 210 | "metadata": {}, 211 | "source": [ 212 | "### 6- Preparing the data" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "id": "608a2f31", 218 | "metadata": { 219 | "slideshow": { 220 | "slide_type": "slide" 221 | } 222 | }, 223 | "source": [ 224 | "#### 6.1- Loading the data\n", 225 | "\n", 226 | "Amazon Neptune, has a Bulk Loader to ingest data into the db. In the next block of code, we will use the loader API and point to the location of the files to upload them to Neptune.\n", 227 | "\n", 228 | "In this example, we are using the `OVERSUBSCRIBE` parallelism parameter. 
This parameter sets the bulk loader to use all available CPU resources when it runs. It generally takes 60%-70% of CPU capacity to keep the operation running as fast as I/O constraints permit.\n", 229 | "\n", 230 | "\n", 231 | "Loading data from an Amazon Simple Storage Service (Amazon S3) bucket requires an AWS Identity and Access Management (IAM) role that has access to the bucket. Follow the instructions here: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-IAM.html. In this instance, we call it `LoadFromNeptune`" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "id": "bcf8932e", 237 | "metadata": {}, 238 | "source": [ 239 | "#### 6.3- Upload the dataset to S3 bucket" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "id": "25a6474a", 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "!aws s3 cp --recursive ./data/ s3://$BUCKET/$PREFIX/ --exclude \"*\" --include \"*_vertices.csv\"\n", 250 | "!aws s3 cp data/edges.csv s3://$BUCKET/$PREFIX/" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "id": "cb4ca979", 256 | "metadata": {}, 257 | "source": [ 258 | "#### 6.4- Import the data into the Cluster " 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "id": "ff668b25", 265 | "metadata": { 266 | "slideshow": { 267 | "slide_type": "slide" 268 | } 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "%%sh -s \"$NEPTUNE_ENDPOINT\" \"$BUCKET\" \"$ACCOUNT_ID\" \"$REGION\" --out loadId\n", 273 | "\n", 274 | "awscurl -X POST -H 'Content-Type: application/json' https://$1:8182/loader -d '\n", 275 | " { \n", 276 | " \"region\" : \"'${4}'\", \n", 277 | " \"source\" : \"s3://'$2'/credit-transaction-fraud/\", \n", 278 | " \"format\" : \"csv\", \n", 279 | " \"iamRoleArn\" : \"arn:aws:iam::'$3':role/LoadFromNeptune\", \n", 280 | " \"parallelism\" : \"OVERSUBSCRIBE\",\n", 281 | " \"queueRequest\": \"TRUE\"\n", 282 | " }'" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "id": "4a06e7c7", 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "load_id = ast.literal_eval(loadId)['payload']['loadId']" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "id": "4314f550", 298 | "metadata": {}, 299 | "source": [ 300 | "#### 6.5 Get the status of the load\n", 301 | "\n", 302 | "Loading the data can take ~7 minutes." 
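Rather than re-running the status call below by hand, you can poll the loader until it finishes. A rough sketch, reusing the same `awscurl` tool and the `NEPTUNE_ENDPOINT` and `load_id` values defined above (the `overallStatus` fields follow the bulk loader Get-Status response; the sleep interval is arbitrary):

```python
import ast
import subprocess
import time

# Same Get-Status URL as the cell below, built from the variables defined earlier.
status_url = 'https://' + NEPTUNE_ENDPOINT + ':8182/loader?loadId=' + load_id

while True:
    raw = subprocess.check_output(['awscurl', '-X', 'GET', status_url]).decode()
    status = ast.literal_eval(raw)['payload']['overallStatus']['status']
    print(status)
    if status not in ('LOAD_NOT_STARTED', 'LOAD_IN_QUEUE', 'LOAD_IN_PROGRESS'):
        break
    time.sleep(30)
```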
303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "id": "0765cc34", 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$load_id\"\n", 313 | "\n", 314 | "awscurl -X GET 'https://'\"$1\"':8182/loader?loadId='$2''" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "id": "cf1cf472", 320 | "metadata": {}, 321 | "source": [ 322 | "#### Verbose status information\n", 323 | "\n", 324 | "If the bulk load failed for any reason, you can get more details on the error and location of the logs from the command below" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "id": "1a87dcab", 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "!awscurl -X GET 'https://$NEPTUNE_ENDPOINT:8182/loader/'$load_id'?details=true&errors=true&page=1&errorsPerPage=3'" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "id": "a24ee64d", 340 | "metadata": { 341 | "slideshow": { 342 | "slide_type": "slide" 343 | } 344 | }, 345 | "source": [ 346 | "#### Drop 10% of the fraud labels\n", 347 | "\n", 348 | "**NOTE: You must wait for the bulk load from previous step to complete first before dropping the fraud labels**\n", 349 | "\n", 350 | "Once the data ingestion is complete, we need to simulate entities with no labels so that the algorithm learn their label during training. In the next cell, we drop 10% of the transactions' labels so that the graph can infer this 10% after training the graph" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "id": "23147be5", 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "#pick a random range of transactions\n", 361 | "ids = [*range(2987000, 2992000)]\n", 362 | "idss = [str(id) for id in ids]\n", 363 | "\n", 364 | "#Save their values before dropping them\n", 365 | "fraud_labels = g.V(idss).hasLabel('Transaction').valueMap('isFraud').toList()\n", 366 | "\n", 367 | "#drop their fraud labels\n", 368 | "g.V(idss).hasLabel('Transaction').properties('isFraud').drop().toList()" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "id": "d38da067", 374 | "metadata": {}, 375 | "source": [ 376 | "#### Count the entities with no labels" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "id": "1d6c97b8", 383 | "metadata": { 384 | "slideshow": { 385 | "slide_type": "skip" 386 | } 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "g.V().hasLabel('Transaction').hasNot('isFraud').count().toList()" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "id": "afdf34d8", 396 | "metadata": { 397 | "slideshow": { 398 | "slide_type": "slide" 399 | } 400 | }, 401 | "source": [ 402 | "### 7- Preparing for Export\n", 403 | "\n", 404 | "Neptune ML requires that you provide training data for the Deep Graph Library (DGL) to create and test models using Amazon SageMaker in your account. To do this, you can export data from Neptune using an open-source tool named [neptune-export](https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export). \n", 405 | "\n", 406 | "You can use the tool either as a service (the Neptune-Export service) or as the Java neptune-export command line tool. The next block of code shows how to trigger the Neptune export through the API\n", 407 | "\n", 408 | "In the export command, we can pass parameters in the additionalParams field to guide the creation of a training data configuration file." 
409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "id": "8e6ca744", 414 | "metadata": {}, 415 | "source": [ 416 | "#### 7.1- Invoke the export process" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "id": "6b26bb87", 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "%%bash -s \"$NEPTUNE_ENDPOINT\" --out response \n", 427 | "\n", 428 | "awscurl --region us-east-2 -X POST -H 'Content-Type: application/json' -d ' \n", 429 | " { \"command\": \"export-pg\", \n", 430 | " \"params\": { \n", 431 | " \"endpoint\": \"\",\n", 432 | " \"cloneCluster\": false,\n", 433 | " \"cloneClusterInstanceType\": \"r5.8xlarge\"\n", 434 | " },\n", 435 | " \n", 436 | " \"additionalParams\": {\n", 437 | " \"neptune_ml\": {\n", 438 | " \"version\": \"v2.0\",\n", 439 | " \"split_rate\": [0.8,0.1,0.1],\n", 440 | " \"targets\": [\n", 441 | " {\n", 442 | " \"node\": \"Transaction\",\n", 443 | " \"property\": \"isFraud\",\n", 444 | " \"type\": \"classification\"\n", 445 | " }\n", 446 | " ]\n", 447 | " }\n", 448 | " },\n", 449 | " \"outputS3Path\": \"s3:///neptune-export\", \n", 450 | " \"jobSize\": \"medium\" }' \n", 451 | "\n", 452 | " " 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "id": "b2515051", 458 | "metadata": {}, 459 | "source": [ 460 | "#### 7.2- Get the job ID from the Previous Job" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "id": "f0d21a7a", 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [ 470 | "import ast\n", 471 | "jobId = ast.literal_eval(response)['jobId']" 472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "id": "8928a046", 477 | "metadata": {}, 478 | "source": [ 479 | "#### 7.3- Check the Status of the Export Job\n", 480 | "\n", 481 | "The export job above will spin up an instance and create a clone for the Neptune cluster to avoid disrubting the cluster. The clone will be teared down once the export job is complete. 
Wait until the export job status is **Successful** before proceeding with the next steps" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "id": "67adf236", 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [ 491 | "%%bash -s \"$jobId\" --out export_response\n", 492 | "\n", 493 | "awscurl --region us-east-2 https://r7zvc0y2ji.execute-api.us-east-2.amazonaws.com/Deployment/neptune-export/$1" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": null, 499 | "id": "4ed873dc", 500 | "metadata": {}, 501 | "outputs": [], 502 | "source": [ 503 | "ast.literal_eval(export_response)" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "id": "1d5e70dc", 509 | "metadata": {}, 510 | "source": [ 511 | "Now we wait until the status of the export job is completed successfully" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "id": "fb8743ef", 517 | "metadata": {}, 518 | "source": [ 519 | "#### 7.4 Get the Output S3 Location" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "id": "e9eec1f4", 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [ 529 | "outputS3Uri = ast.literal_eval(export_response)['outputS3Uri']" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "id": "3a2fd9bd", 535 | "metadata": {}, 536 | "source": [ 537 | "#### 7.5 Examine the Training Configurations File" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "id": "1416aa9d", 544 | "metadata": {}, 545 | "outputs": [], 546 | "source": [ 547 | "import json\n", 548 | "!aws s3 cp $outputS3Uri/training-data-configuration.json ./\n", 549 | "\n", 550 | "with open('training-data-configuration.json', 'r') as handle:\n", 551 | " parsed = json.load(handle)\n", 552 | "parsed " 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "id": "87e4f49c", 558 | "metadata": {}, 559 | "source": [ 560 | "Neptune ML infers the data types of the entities and its properties automatically but you can also set them manually in the training configurations file. 
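For instance, a minimal sketch of setting one feature definition explicitly with Python's `json` module before uploading the file back in step 7.7 (the node label and property name below, `Transaction` and `TransactionAmt`, are just examples taken from this dataset's configuration file):

```python
import json

with open('training-data-configuration.json', 'r') as handle:
    config = json.load(handle)

# Walk the node definitions and set how one numerical property is pre-processed.
for node in config['graph']['nodes']:
    if node['node'][1] == 'Transaction':
        for feature in node['features']:
            if feature['feature'][0] == 'TransactionAmt':
                feature['norm'] = 'min-max'    # normalization strategy
                feature['imputer'] = 'median'  # fill for missing values

with open('training-data-configuration.json', 'w') as handle:
    json.dump(config, handle, indent=2)
```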
We've already modified some of the data types in the configuration file and defined some pre-processing steps that will be handled by Neptune ML " 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "id": "f8aabc7b", 566 | "metadata": {}, 567 | "source": [ 568 | "#### 7.6 Copy the JSON file to examine its content" 569 | ] 570 | }, 571 | { 572 | "cell_type": "code", 573 | "execution_count": null, 574 | "id": "75d7e499", 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "!aws s3 cp $outputS3Uri/training-data-configuration.json ./" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "id": "4ecd5050", 584 | "metadata": {}, 585 | "source": [ 586 | "#### 7.7 Copy the training configurations file to S3 output location\n", 587 | "\n", 588 | "After modifying any necessary fields, upload the file back to the S3 output location" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": null, 594 | "id": "f04a7740", 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "!aws s3 cp ./training-data-configuration.json $outputS3Uri/" 599 | ] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "id": "343016d5", 604 | "metadata": { 605 | "slideshow": { 606 | "slide_type": "slide" 607 | } 608 | }, 609 | "source": [ 610 | "### 8- Model training\n" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "id": "25d5ccd7", 616 | "metadata": {}, 617 | "source": [ 618 | "Model training in Neptune ML is a 2-step process: The first step is to run a SageMaker Processing job to carry out any data pre-processing needed before training - such as categorical features encoding, data imputation, numerical features scaling, etc.\n", 619 | "\n", 620 | "#### 8.1 Data Pre-processing and Feature Engineering\n", 621 | "##### 8.1.1 Define the Training and Processing IDs" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": null, 627 | "id": "0fe23c59", 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 631 | "import time\n", 632 | "epoch_time = int(time.time())\n", 633 | "TRAINING_ID = 'data-training-' + str(epoch_time)\n", 634 | "PROCESSING_ID = 'data-processing-' + str(epoch_time)\n", 635 | "ENDPOINT_ID = 'endpoint-' + str(epoch_time)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "id": "e44e1060", 641 | "metadata": {}, 642 | "source": [ 643 | "##### 8.1.2 Invoke the Data Processing Job" 644 | ] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "execution_count": null, 649 | "id": "cce83eff", 650 | "metadata": { 651 | "slideshow": { 652 | "slide_type": "fragment" 653 | } 654 | }, 655 | "outputs": [], 656 | "source": [ 657 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$REGION\" \"$BUCKET\" \"$outputS3Uri\" \"$TRAINING_ID\" \"$PROCESSING_ID\"\n", 658 | "\n", 659 | "awscurl --region $2 --service neptune-db -X POST https://$1:8182/ml/dataprocessing -H 'Content-Type: application/json' -d '\n", 660 | " {\n", 661 | " \"inputDataS3Location\" : \"'${4}'/\",\n", 662 | " \"id\" : \"'${6}'\",\n", 663 | " \"processedDataS3Location\" : \"s3://'${3}'/neptune-export/output/\",\n", 664 | " \"processingInstanceType\": \"ml.r5.16xlarge\"\n", 665 | " }'" 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "id": "ac29827f", 671 | "metadata": {}, 672 | "source": [ 673 | "##### 8.1.3 Check the Processing Job Status" 674 | ] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "execution_count": null, 679 | "id": "528be2dc", 680 | "metadata": {}, 681 | "outputs": [], 682 | "source": [ 683 | "%%bash -s \"$NEPTUNE_ENDPOINT\" 
\"$PROCESSING_ID\" --out preprocess_response\n", 684 | "\n", 685 | "curl -s https://${1}:8182/ml/dataprocessing/${2}" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "id": "978a591c", 692 | "metadata": {}, 693 | "outputs": [], 694 | "source": [ 695 | "ast.literal_eval(preprocess_response)" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "id": "1f73c10a", 701 | "metadata": {}, 702 | "source": [ 703 | "#### 8.2 Examine the generated HPO Config file" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "id": "ae650b2c", 710 | "metadata": {}, 711 | "outputs": [], 712 | "source": [ 713 | "preprocess_location = ast.literal_eval(preprocess_response)['processingJob']['outputLocation']\n", 714 | "!aws s3 cp $preprocess_location/model-hpo-configuration.json ./config/\n", 715 | "with open('model-hpo-configuration.json', 'r') as handle:\n", 716 | " parsed = json.load(handle)\n", 717 | "parsed " 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "id": "52e17f82", 723 | "metadata": {}, 724 | "source": [ 725 | "##### 8.2.1 Let's Change some HPs" 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": null, 731 | "id": "562a2698", 732 | "metadata": {}, 733 | "outputs": [], 734 | "source": [ 735 | "HPO_file = open(\"model-hpo-configuration.json\", \"r\")\n", 736 | "HPO_JSON = json.load(HPO_file)\n", 737 | "\n", 738 | "#change the objective metric to ROC AUC\n", 739 | "HPO_JSON[\"models\"][0]['eval_metric']['metric'] = 'roc_auc'\n", 740 | "\n", 741 | "#change the frequency of evaluation to 3 epochs instead of 1\n", 742 | "HPO_JSON[\"models\"][0]['eval_frequency']['value'] = '3'\n", 743 | "\n", 744 | "HPO_file = open(\"model-hpo-configuration.json\", \"w\")\n", 745 | "json.dump(HPO_JSON, HPO_file)\n", 746 | "HPO_file.close()\n", 747 | "\n", 748 | "#upload the new model HPO configuration file to S3 processing output location\n", 749 | "!aws s3 cp config/model-hpo-configuration.json $preprocess_location/" 750 | ] 751 | }, 752 | { 753 | "cell_type": "markdown", 754 | "id": "518d9693", 755 | "metadata": {}, 756 | "source": [ 757 | "#### 8.2 Train the Model" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": null, 763 | "id": "88900bf5", 764 | "metadata": { 765 | "slideshow": { 766 | "slide_type": "fragment" 767 | } 768 | }, 769 | "outputs": [], 770 | "source": [ 771 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$REGION\" \"$BUCKET\" \"$TRAINING_ID\" \"$PROCESSING_ID\"\n", 772 | "\n", 773 | "awscurl --region $2 --service neptune-db -X POST https://$1:8182/ml/modeltraining -H 'Content-Type: application/json' -d '\n", 774 | " {\n", 775 | " \"id\" : \"'${4}'\",\n", 776 | " \"dataProcessingJobId\" : \"'${5}'\",\n", 777 | " \"trainModelS3Location\" : \"s3://'${3}'/neptune-export/neptune-model-graph-autotrainer\",\n", 778 | " \"trainingInstanceType\" : \"ml.p3.2xlarge\",\n", 779 | " \"maxHPONumberOfTrainingJobs\": 2\n", 780 | " }'" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": null, 786 | "id": "f8ac5930", 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$TRAINING_ID\" --out training_response\n", 791 | "\n", 792 | "curl -s https://${1}:8182/ml/modeltraining/${2}" 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": null, 798 | "id": "9cfdd31e", 799 | "metadata": {}, 800 | "outputs": [], 801 | "source": [ 802 | "ast.literal_eval(training_response)" 803 | ] 804 | }, 805 | { 806 
| "cell_type": "markdown", 807 | "id": "177a4030", 808 | "metadata": {}, 809 | "source": [ 810 | "### 9- Store the Variables" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "id": "ba118446", 817 | "metadata": {}, 818 | "outputs": [], 819 | "source": [ 820 | "%store BUCKET\n", 821 | "%store REGION\n", 822 | "%store ACCOUNT_ID\n", 823 | "%store PREFIX\n", 824 | "%store outputS3Uri\n", 825 | "%store fraud_labels\n", 826 | "%store idss\n", 827 | "%store NEPTUNE_ENDPOINT\n", 828 | "%store NEPTUNE_LOAD_ROLE\n", 829 | "%store PROCESSING_ID\n", 830 | "%store TRAINING_ID\n", 831 | "%store ENDPOINT_ID" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": null, 837 | "id": "b2178d53", 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [] 841 | } 842 | ], 843 | "metadata": { 844 | "celltoolbar": "Slideshow", 845 | "kernelspec": { 846 | "display_name": "conda_python3", 847 | "language": "python", 848 | "name": "conda_python3" 849 | }, 850 | "language_info": { 851 | "codemirror_mode": { 852 | "name": "ipython", 853 | "version": 3 854 | }, 855 | "file_extension": ".py", 856 | "mimetype": "text/x-python", 857 | "name": "python", 858 | "nbconvert_exporter": "python", 859 | "pygments_lexer": "ipython3", 860 | "version": "3.6.13" 861 | } 862 | }, 863 | "nbformat": 4, 864 | "nbformat_minor": 5 865 | } 866 | -------------------------------------------------------------------------------- /credit-card-fraud-detection/3_deployment_visualization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "99b69bc8", 6 | "metadata": {}, 7 | "source": [ 8 | "## Restore Variables" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "id": "a7a3550c", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "%store -r" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "eb0a8f5a", 24 | "metadata": { 25 | "slideshow": { 26 | "slide_type": "slide" 27 | } 28 | }, 29 | "source": [ 30 | "## Deploying to a Model Endpoint" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "id": "afeab85a", 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$REGION\" \"$TRAINING_ID\" \"$ENDPOINT_ID\"\n", 41 | "\n", 42 | "awscurl --region $2 --service neptune-db -X POST https://$1:8182/ml/endpoints -H 'Content-Type: application/json' -d '\n", 43 | " {\n", 44 | " \"id\" : \"'${4}'\",\n", 45 | " \"mlModelTrainingJobId\": \"'${3}'\"\n", 46 | " }'\n" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "id": "1d67b1bb", 52 | "metadata": {}, 53 | "source": [ 54 | "## Get the Endpoint Name" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "e89b9275", 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "%%bash -s \"$NEPTUNE_ENDPOINT\" \"$ENDPOINT_ID\" --out endpoint_response\n", 65 | "\n", 66 | "curl -s https://${1}:8182/ml/endpoints/${2}" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "92f92549", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "import ast\n", 77 | "endpoint_name = ast.literal_eval(endpoint_response)['endpoint']['name']" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "id": "f8ddbcf2", 83 | "metadata": {}, 84 | "source": [ 85 | "While the model is being deployed, let's get visualize the computed embeddings" 86 | ] 87 | }, 88 
| { 89 | "cell_type": "markdown", 90 | "id": "5c3d4142", 91 | "metadata": { 92 | "slideshow": { 93 | "slide_type": "slide" 94 | } 95 | }, 96 | "source": [ 97 | "## Visualization\n", 98 | "\n", 99 | "During the model training, Neptune ML will work on producing predictions and calculating the node embeddings then save them in the training output location. In the next cell, we use a helper library to download an visualize the node embeddings saved by Neptune ML" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "id": "f871cefb", 105 | "metadata": {}, 106 | "source": [ 107 | "#### Define the graph notebook config\n", 108 | "This will be used by the helper library to get the right information about the model trained\n" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "id": "db3c5a32", 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "import json\n", 119 | "neptune_config = {\n", 120 | " \"host\": NEPTUNE_ENDPOINT,\n", 121 | " \"port\": 8182,\n", 122 | " \"auth_mode\": \"DEFAULT\",\n", 123 | " \"load_from_s3_arn\": NEPTUNE_LOAD_ROLE,\n", 124 | " \"ssl\": True,\n", 125 | " \"aws_region\": REGION,\n", 126 | " \"sparql\": {\n", 127 | " \"path\": \"sparql\"\n", 128 | " }\n", 129 | "}\n", 130 | "neptune_config_json = json.dumps(neptune_config, indent = 4)\n", 131 | "\n", 132 | "with open('/home/ec2-user/graph_notebook_config.json', 'w') as file:\n", 133 | " file.write(neptune_config_json)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "b9f53cbc", 139 | "metadata": {}, 140 | "source": [ 141 | "#### Download the generated embeddings and predictions" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "id": "d8942d1c", 148 | "metadata": { 149 | "slideshow": { 150 | "slide_type": "slide" 151 | } 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "import neptune_ml_utils as neptune_ml\n", 156 | "\n", 157 | "transaction_mapping = neptune_ml.get_node_to_idx_mapping(dataprocessing_job_name=PROCESSING_ID,vertex_label=\"Transaction\")\n", 158 | "embeddings = neptune_ml.get_embeddings(training_job_name=TRAINING_ID)\n", 159 | "predictions = neptune_ml.get_predictions(training_job_name=TRAINING_ID, class_preds=True)" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "id": "fd2ba55f", 165 | "metadata": {}, 166 | "source": [ 167 | "#### Reduce the embeddings dimensions for visulaization" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "id": "84c85aa9", 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "from sklearn.decomposition import PCA\n", 178 | "from sklearn.manifold import TSNE\n", 179 | "%matplotlib inline\n", 180 | "import matplotlib.pyplot as plt\n", 181 | "from mpl_toolkits.mplot3d import Axes3D\n", 182 | "import seaborn as sns\n", 183 | "\n", 184 | "pca = PCA(n_components=3)\n", 185 | "pca_result = pca.fit_transform(embeddings)\n", 186 | "\n", 187 | "pcaone = pca_result[:,0]\n", 188 | "pcatwo = pca_result[:,1] \n", 189 | "pcathree = pca_result[:,2]\n", 190 | "print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "d42a536a", 196 | "metadata": {}, 197 | "source": [ 198 | "#### Plot the embeddings in 2D graph" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "id": "45d3d75a", 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "fig = 
plt.figure(figsize=(16,12))\n", 209 | "fig.suptitle(\"2D representation of node embeddings\")\n", 210 | "\n", 211 | "scatter = plt.scatter(pcaone, pcatwo, c=predictions)\n", 212 | "plt.legend(*scatter.legend_elements(), title=\"isFraud\", loc=\"upper right\")\n", 213 | "plt.grid()" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "id": "a267646e", 219 | "metadata": {}, 220 | "source": [ 221 | "## Invoke the Deployed Endpoint\n", 222 | "\n", 223 | "Since Neptune ML will deploy an endpoint using Amazon SageMaker, you can also invoke the SageMaker endpoint and generate the score for the fraud label" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "id": "6e0b53d6", 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "import json\n", 234 | "import boto3\n", 235 | "client = boto3.client('runtime.sagemaker')\n", 236 | "data = {\"vertices\": idss, \"topk\": 1, \"property\": \"isFraud\"} \n", 237 | "response = client.invoke_endpoint(EndpointName=endpoint_name,\n", 238 | " Body=json.dumps(data))\n", 239 | "response_body = response['Body'] \n", 240 | "res = json.loads(response_body.read())\n", 241 | "results = []\n", 242 | "for i in res['output']['nodes']:\n", 243 | " results.append(i['mlResults'][0]['inferredValue'])" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "id": "bd592861", 249 | "metadata": {}, 250 | "source": [ 251 | "#### Get Original labels to Compute Confusion Matrix" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "id": "ed52050b", 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "y_test = [i['isFraud'][0] for i in fraud_labels]" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "id": "c9ba6157", 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "from sklearn.metrics import accuracy_score\n", 272 | "accuracy = accuracy_score(y_test, results)\n", 273 | "print(\"Accuracy: %.2f%%\" % (accuracy * 100.0))" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "id": "56f320c6", 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "from sklearn.metrics import confusion_matrix\n", 284 | "confusion_matrix(y_test, results)" 285 | ] 286 | } 287 | ], 288 | "metadata": { 289 | "kernelspec": { 290 | "display_name": "conda_python3", 291 | "language": "python", 292 | "name": "conda_python3" 293 | }, 294 | "language_info": { 295 | "codemirror_mode": { 296 | "name": "ipython", 297 | "version": 3 298 | }, 299 | "file_extension": ".py", 300 | "mimetype": "text/x-python", 301 | "name": "python", 302 | "nbconvert_exporter": "python", 303 | "pygments_lexer": "ipython3", 304 | "version": "3.6.13" 305 | } 306 | }, 307 | "nbformat": 4, 308 | "nbformat_minor": 5 309 | } 310 | -------------------------------------------------------------------------------- /credit-card-fraud-detection/assets/knowledge_graph.b0e9408219d92f2ca3c7a05cccf9a5a72e34ddbd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-neptune-ml-use-cases/cd10587d16510a60b3446752995a51cc5faf81bd/credit-card-fraud-detection/assets/knowledge_graph.b0e9408219d92f2ca3c7a05cccf9a5a72e34ddbd.png -------------------------------------------------------------------------------- /credit-card-fraud-detection/config/model-hpo-configuration.json: -------------------------------------------------------------------------------- 1 | 
{"models": [{"model": "rgcn", "task_type": "node_class", "eval_metric": {"metric": "roc_auc"}, "eval_frequency": {"type": "evaluate_every_epoch", "value": "3"}, "1-tier-param": [{"param": "num-hidden", "range": [16, 128], "type": "int", "inc_strategy": "power2"}, {"param": "num-epochs", "range": [3, 30], "inc_strategy": "linear", "inc_val": 1, "type": "int", "node_strategy": "perM"}, {"param": "lr", "range": [0.001, 0.01], "type": "float", "inc_strategy": "log"}], "2-tier-param": [{"param": "dropout", "range": [0.0, 0.5], "inc_strategy": "linear", "type": "float", "default": 0.3}, {"param": "global-norm", "type": "bool", "default": true}], "3-tier-param": [{"param": "batch-size", "range": [128, 4096], "inc_strategy": "power2", "type": "int", "default": 1024}, {"param": "sparse-embedding", "type": "bool", "default": true}, {"param": "concat-node-embed", "type": "bool", "default": true}, {"param": "sparse-lr", "range": [0.001, 0.01], "inc_strategy": "log", "type": "float", "default": 0.001}, {"param": "per-feat-name-embed", "type": "bool", "default": true}, {"param": "use-class-weight", "type": "bool", "default": true}, {"param": "fanout", "type": "int", "options": [[10, 30], [15, 30], [15, 30]], "default": [10, 15, 15]}, {"param": "num-layer", "range": [1, 3], "inc_strategy": "linear", "inc_val": 1, "type": "int", "default": 2}, {"param": "num-bases", "range": [2, 8], "inc_strategy": "linear", "inc_val": 2, "type": "int", "default": 2}], "fixed-param": [{"param": "layer-norm", "type": "bool", "default": false}, {"param": "use-self-loop", "type": "bool", "default": true}, {"param": "low-mem", "type": "bool", "default": true}, {"param": "enable-early-stop", "type": "bool", "default": true}, {"param": "l2norm", "type": "float", "default": 0}]}]} -------------------------------------------------------------------------------- /credit-card-fraud-detection/config/training-data-configuration.json: -------------------------------------------------------------------------------- 1 | { 2 | "version" : "v2.0", 3 | "query_engine" : "gremlin", 4 | "graph" : { 5 | "nodes" : [ { 6 | "file_name" : "nodes/Card.consolidated.csv", 7 | "separator" : ",", 8 | "node" : [ "~id", "Card" ], 9 | "features" : [ { 10 | "feature" : [ "card1", "card1", "numerical" ], 11 | "norm" : "min-max", 12 | "imputer" : "median" 13 | }, { 14 | "feature" : [ "card2", "card2", "category" ] 15 | }, { 16 | "feature" : [ "card3", "card3", "category" ] 17 | }, { 18 | "feature" : [ "card4", "card4", "category" ] 19 | }, { 20 | "feature" : [ "card5", "card5", "category" ] 21 | }, { 22 | "feature" : [ "card6", "card6", "category" ] 23 | } ] 24 | }, { 25 | "file_name" : "nodes/Identifier.consolidated.csv", 26 | "separator" : ",", 27 | "node" : [ "~id", "Identifier" ], 28 | "features" : [ { 29 | "feature" : [ "id_01", "id_01", "numerical" ], 30 | "norm" : "min-max", 31 | "imputer" : "median" 32 | }, { 33 | "feature" : [ "id_02", "id_02", "numerical" ], 34 | "norm" : "min-max", 35 | "imputer" : "median" 36 | }, { 37 | "feature" : [ "id_05", "id_05", "numerical" ], 38 | "norm" : "min-max", 39 | "imputer" : "median" 40 | }, { 41 | "feature" : [ "id_06", "id_06", "numerical" ], 42 | "norm" : "min-max", 43 | "imputer" : "median" 44 | }, { 45 | "feature" : [ "id_11", "id_11", "numerical" ], 46 | "norm" : "min-max", 47 | "imputer" : "median" 48 | }, { 49 | "feature" : [ "id_12", "id_12", "category" ] 50 | }, { 51 | "feature" : [ "id_15", "id_15", "category" ] 52 | }, { 53 | "feature" : [ "id_16", "id_16", "category" ] 54 | }, { 55 | "feature" : [ 
"id_17", "id_17", "category" ] 56 | }, { 57 | "feature" : [ "id_19", "id_19", "category" ] 58 | }, { 59 | "feature" : [ "id_20", "id_20", "category" ] 60 | }, { 61 | "feature" : [ "id_28", "id_28", "category" ] 62 | }, { 63 | "feature" : [ "id_29", "id_29", "category" ] 64 | }, { 65 | "feature" : [ "id_31", "id_31", "category" ] 66 | }, { 67 | "feature" : [ "id_35", "id_35", "category" ] 68 | }, { 69 | "feature" : [ "id_36", "id_36", "category" ] 70 | }, { 71 | "feature" : [ "id_37", "id_37", "category" ] 72 | }, { 73 | "feature" : [ "id_38", "id_38", "category" ] 74 | }, { 75 | "feature" : [ "id_13", "id_13", "category" ] 76 | } ] 77 | }, { 78 | "file_name" : "nodes/Device.consolidated.csv", 79 | "separator" : ",", 80 | "node" : [ "~id", "Device" ], 81 | "features" : [ { 82 | "feature" : [ "DeviceType", "DeviceType", "category" ] 83 | }, { 84 | "feature" : [ "DeviceInfo", "DeviceInfo", "category" ] 85 | } ] 86 | }, { 87 | "file_name" : "nodes/Transaction.consolidated.csv", 88 | "separator" : ",", 89 | "node" : [ "~id", "Transaction" ], 90 | "features" : [ { 91 | "feature" : [ "P_emaildomain", "P_emaildomain", "category" ] 92 | }, { 93 | "feature" : [ "C1", "C1", "numerical" ], 94 | "norm" : "min-max", 95 | "imputer" : "median" 96 | }, { 97 | "feature" : [ "C2", "C2", "numerical" ], 98 | "norm" : "min-max", 99 | "imputer" : "median" 100 | }, { 101 | "feature" : [ "C3", "C3", "numerical" ], 102 | "norm" : "min-max", 103 | "imputer" : "median" 104 | }, { 105 | "feature" : [ "C4", "C4", "numerical" ], 106 | "norm" : "min-max", 107 | "imputer" : "median" 108 | }, { 109 | "feature" : [ "C5", "C5", "numerical" ], 110 | "norm" : "min-max", 111 | "imputer" : "median" 112 | }, { 113 | "feature" : [ "C6", "C6", "numerical" ], 114 | "norm" : "min-max", 115 | "imputer" : "median" 116 | }, { 117 | "feature" : [ "C7", "C7", "numerical" ], 118 | "norm" : "min-max", 119 | "imputer" : "median" 120 | }, { 121 | "feature" : [ "C8", "C8", "numerical" ], 122 | "norm" : "min-max", 123 | "imputer" : "median" 124 | }, { 125 | "feature" : [ "C9", "C9", "numerical" ], 126 | "norm" : "min-max", 127 | "imputer" : "median" 128 | }, { 129 | "feature" : [ "C10", "C10", "numerical" ], 130 | "norm" : "min-max", 131 | "imputer" : "median" 132 | }, { 133 | "feature" : [ "C11", "C11", "numerical" ], 134 | "norm" : "min-max", 135 | "imputer" : "median" 136 | }, { 137 | "feature" : [ "C12", "C12", "numerical" ], 138 | "norm" : "min-max", 139 | "imputer" : "median" 140 | }, { 141 | "feature" : [ "C13", "C13", "numerical" ], 142 | "norm" : "min-max", 143 | "imputer" : "median" 144 | }, { 145 | "feature" : [ "C14", "C14", "numerical" ], 146 | "norm" : "min-max", 147 | "imputer" : "median" 148 | }, { 149 | "feature" : [ "D1", "D1", "numerical" ], 150 | "norm" : "min-max", 151 | "imputer" : "median" 152 | }, { 153 | "feature" : [ "V95", "V95", "numerical" ], 154 | "norm" : "min-max", 155 | "imputer" : "median" 156 | }, { 157 | "feature" : [ "V96", "V96", "numerical" ], 158 | "norm" : "min-max", 159 | "imputer" : "median" 160 | }, { 161 | "feature" : [ "V97", "V97", "numerical" ], 162 | "norm" : "min-max", 163 | "imputer" : "median" 164 | }, { 165 | "feature" : [ "V98", "V98", "numerical" ], 166 | "norm" : "min-max", 167 | "imputer" : "median" 168 | }, { 169 | "feature" : [ "V99", "V99", "numerical" ], 170 | "norm" : "min-max", 171 | "imputer" : "median" 172 | }, { 173 | "feature" : [ "V100", "V100", "numerical" ], 174 | "norm" : "min-max", 175 | "imputer" : "median" 176 | }, { 177 | "feature" : [ "V101", "V101", "numerical" ], 178 
| "norm" : "min-max", 179 | "imputer" : "median" 180 | }, { 181 | "feature" : [ "V102", "V102", "numerical" ], 182 | "norm" : "min-max", 183 | "imputer" : "median" 184 | }, { 185 | "feature" : [ "V103", "V103", "numerical" ], 186 | "norm" : "min-max", 187 | "imputer" : "median" 188 | }, { 189 | "feature" : [ "V104", "V104", "numerical" ], 190 | "norm" : "min-max", 191 | "imputer" : "median" 192 | }, { 193 | "feature" : [ "V105", "V105", "numerical" ], 194 | "norm" : "min-max", 195 | "imputer" : "median" 196 | }, { 197 | "feature" : [ "V106", "V106", "numerical" ], 198 | "norm" : "min-max", 199 | "imputer" : "median" 200 | }, { 201 | "feature" : [ "V107", "V107", "numerical" ], 202 | "norm" : "min-max", 203 | "imputer" : "median" 204 | }, { 205 | "feature" : [ "V108", "V108", "numerical" ], 206 | "norm" : "min-max", 207 | "imputer" : "median" 208 | }, { 209 | "feature" : [ "V109", "V109", "numerical" ], 210 | "norm" : "min-max", 211 | "imputer" : "median" 212 | }, { 213 | "feature" : [ "V110", "V110", "numerical" ], 214 | "norm" : "min-max", 215 | "imputer" : "median" 216 | }, { 217 | "feature" : [ "V111", "V111", "numerical" ], 218 | "norm" : "min-max", 219 | "imputer" : "median" 220 | }, { 221 | "feature" : [ "V112", "V112", "numerical" ], 222 | "norm" : "min-max", 223 | "imputer" : "median" 224 | }, { 225 | "feature" : [ "V113", "V113", "numerical" ], 226 | "norm" : "min-max", 227 | "imputer" : "median" 228 | }, { 229 | "feature" : [ "V114", "V114", "numerical" ], 230 | "norm" : "min-max", 231 | "imputer" : "median" 232 | }, { 233 | "feature" : [ "V115", "V115", "numerical" ], 234 | "norm" : "min-max", 235 | "imputer" : "median" 236 | }, { 237 | "feature" : [ "V116", "V116", "numerical" ], 238 | "norm" : "min-max", 239 | "imputer" : "median" 240 | }, { 241 | "feature" : [ "V117", "V117", "numerical" ], 242 | "norm" : "min-max", 243 | "imputer" : "median" 244 | }, { 245 | "feature" : [ "V118", "V118", "numerical" ], 246 | "norm" : "min-max", 247 | "imputer" : "median" 248 | }, { 249 | "feature" : [ "V119", "V119", "numerical" ], 250 | "norm" : "min-max", 251 | "imputer" : "median" 252 | }, { 253 | "feature" : [ "V120", "V120", "numerical" ], 254 | "norm" : "min-max", 255 | "imputer" : "median" 256 | }, { 257 | "feature" : [ "V121", "V121", "numerical" ], 258 | "norm" : "min-max", 259 | "imputer" : "median" 260 | }, { 261 | "feature" : [ "V122", "V122", "numerical" ], 262 | "norm" : "min-max", 263 | "imputer" : "median" 264 | }, { 265 | "feature" : [ "V123", "V123", "numerical" ], 266 | "norm" : "min-max", 267 | "imputer" : "median" 268 | }, { 269 | "feature" : [ "V124", "V124", "numerical" ], 270 | "norm" : "min-max", 271 | "imputer" : "median" 272 | }, { 273 | "feature" : [ "V125", "V125", "numerical" ], 274 | "norm" : "min-max", 275 | "imputer" : "median" 276 | }, { 277 | "feature" : [ "V126", "V126", "numerical" ], 278 | "norm" : "min-max", 279 | "imputer" : "median" 280 | }, { 281 | "feature" : [ "V127", "V127", "numerical" ], 282 | "norm" : "min-max", 283 | "imputer" : "median" 284 | }, { 285 | "feature" : [ "V128", "V128", "numerical" ], 286 | "norm" : "min-max", 287 | "imputer" : "median" 288 | }, { 289 | "feature" : [ "V129", "V129", "numerical" ], 290 | "norm" : "min-max", 291 | "imputer" : "median" 292 | }, { 293 | "feature" : [ "V130", "V130", "numerical" ], 294 | "norm" : "min-max", 295 | "imputer" : "median" 296 | }, { 297 | "feature" : [ "V131", "V131", "numerical" ], 298 | "norm" : "min-max", 299 | "imputer" : "median" 300 | }, { 301 | "feature" : [ "V132", "V132", 
"numerical" ], 302 | "norm" : "min-max", 303 | "imputer" : "median" 304 | }, { 305 | "feature" : [ "V133", "V133", "numerical" ], 306 | "norm" : "min-max", 307 | "imputer" : "median" 308 | }, { 309 | "feature" : [ "V134", "V134", "numerical" ], 310 | "norm" : "min-max", 311 | "imputer" : "median" 312 | }, { 313 | "feature" : [ "V135", "V135", "numerical" ], 314 | "norm" : "min-max", 315 | "imputer" : "median" 316 | }, { 317 | "feature" : [ "V136", "V136", "numerical" ], 318 | "norm" : "min-max", 319 | "imputer" : "median" 320 | }, { 321 | "feature" : [ "V137", "V137", "numerical" ], 322 | "norm" : "min-max", 323 | "imputer" : "median" 324 | }, { 325 | "feature" : [ "V279", "V279", "numerical" ], 326 | "norm" : "min-max", 327 | "imputer" : "median" 328 | }, { 329 | "feature" : [ "V280", "V280", "numerical" ], 330 | "norm" : "min-max", 331 | "imputer" : "median" 332 | }, { 333 | "feature" : [ "V281", "V281", "numerical" ], 334 | "norm" : "min-max", 335 | "imputer" : "median" 336 | }, { 337 | "feature" : [ "V282", "V282", "numerical" ], 338 | "norm" : "min-max", 339 | "imputer" : "median" 340 | }, { 341 | "feature" : [ "V283", "V283", "numerical" ], 342 | "norm" : "min-max", 343 | "imputer" : "median" 344 | }, { 345 | "feature" : [ "V284", "V284", "numerical" ], 346 | "norm" : "min-max", 347 | "imputer" : "median" 348 | }, { 349 | "feature" : [ "V285", "V285", "numerical" ], 350 | "norm" : "min-max", 351 | "imputer" : "median" 352 | }, { 353 | "feature" : [ "V286", "V286", "numerical" ], 354 | "norm" : "min-max", 355 | "imputer" : "median" 356 | }, { 357 | "feature" : [ "V287", "V287", "numerical" ], 358 | "norm" : "min-max", 359 | "imputer" : "median" 360 | }, { 361 | "feature" : [ "V288", "V288", "numerical" ], 362 | "norm" : "min-max", 363 | "imputer" : "median" 364 | }, { 365 | "feature" : [ "V289", "V289", "numerical" ], 366 | "norm" : "min-max", 367 | "imputer" : "median" 368 | }, { 369 | "feature" : [ "V290", "V290", "numerical" ], 370 | "norm" : "min-max", 371 | "imputer" : "median" 372 | }, { 373 | "feature" : [ "V291", "V291", "numerical" ], 374 | "norm" : "min-max", 375 | "imputer" : "median" 376 | }, { 377 | "feature" : [ "V292", "V292", "numerical" ], 378 | "norm" : "min-max", 379 | "imputer" : "median" 380 | }, { 381 | "feature" : [ "V293", "V293", "numerical" ], 382 | "norm" : "min-max", 383 | "imputer" : "median" 384 | }, { 385 | "feature" : [ "V294", "V294", "numerical" ], 386 | "norm" : "min-max", 387 | "imputer" : "median" 388 | }, { 389 | "feature" : [ "V295", "V295", "numerical" ], 390 | "norm" : "min-max", 391 | "imputer" : "median" 392 | }, { 393 | "feature" : [ "V296", "V296", "numerical" ], 394 | "norm" : "min-max", 395 | "imputer" : "median" 396 | }, { 397 | "feature" : [ "V297", "V297", "numerical" ], 398 | "norm" : "min-max", 399 | "imputer" : "median" 400 | }, { 401 | "feature" : [ "V298", "V298", "numerical" ], 402 | "norm" : "min-max", 403 | "imputer" : "median" 404 | }, { 405 | "feature" : [ "V299", "V299", "numerical" ], 406 | "norm" : "min-max", 407 | "imputer" : "median" 408 | }, { 409 | "feature" : [ "V300", "V300", "numerical" ], 410 | "norm" : "min-max", 411 | "imputer" : "median" 412 | }, { 413 | "feature" : [ "V301", "V301", "numerical" ], 414 | "norm" : "min-max", 415 | "imputer" : "median" 416 | }, { 417 | "feature" : [ "V302", "V302", "numerical" ], 418 | "norm" : "min-max", 419 | "imputer" : "median" 420 | }, { 421 | "feature" : [ "V303", "V303", "numerical" ], 422 | "norm" : "min-max", 423 | "imputer" : "median" 424 | }, { 425 | "feature" : [ 
"V304", "V304", "numerical" ], 426 | "norm" : "min-max", 427 | "imputer" : "median" 428 | }, { 429 | "feature" : [ "V305", "V305", "numerical" ], 430 | "norm" : "min-max", 431 | "imputer" : "median" 432 | }, { 433 | "feature" : [ "V306", "V306", "numerical" ], 434 | "norm" : "min-max", 435 | "imputer" : "median" 436 | }, { 437 | "feature" : [ "V307", "V307", "numerical" ], 438 | "norm" : "min-max", 439 | "imputer" : "median" 440 | }, { 441 | "feature" : [ "V308", "V308", "numerical" ], 442 | "norm" : "min-max", 443 | "imputer" : "median" 444 | }, { 445 | "feature" : [ "V309", "V309", "numerical" ], 446 | "norm" : "min-max", 447 | "imputer" : "median" 448 | }, { 449 | "feature" : [ "V310", "V310", "numerical" ], 450 | "norm" : "min-max", 451 | "imputer" : "median" 452 | }, { 453 | "feature" : [ "V311", "V311", "numerical" ], 454 | "norm" : "min-max", 455 | "imputer" : "median" 456 | }, { 457 | "feature" : [ "V312", "V312", "numerical" ], 458 | "norm" : "min-max", 459 | "imputer" : "median" 460 | }, { 461 | "feature" : [ "V313", "V313", "numerical" ], 462 | "norm" : "min-max", 463 | "imputer" : "median" 464 | }, { 465 | "feature" : [ "V314", "V314", "numerical" ], 466 | "norm" : "min-max", 467 | "imputer" : "median" 468 | }, { 469 | "feature" : [ "V315", "V315", "numerical" ], 470 | "norm" : "min-max", 471 | "imputer" : "median" 472 | }, { 473 | "feature" : [ "V316", "V316", "numerical" ], 474 | "norm" : "min-max", 475 | "imputer" : "median" 476 | }, { 477 | "feature" : [ "V317", "V317", "numerical" ], 478 | "norm" : "min-max", 479 | "imputer" : "median" 480 | }, { 481 | "feature" : [ "V318", "V318", "numerical" ], 482 | "norm" : "min-max", 483 | "imputer" : "median" 484 | }, { 485 | "feature" : [ "V319", "V319", "numerical" ], 486 | "norm" : "min-max", 487 | "imputer" : "median" 488 | }, { 489 | "feature" : [ "V320", "V320", "numerical" ], 490 | "norm" : "min-max", 491 | "imputer" : "median" 492 | }, { 493 | "feature" : [ "V321", "V321", "numerical" ], 494 | "norm" : "min-max", 495 | "imputer" : "median" 496 | }, { 497 | "feature" : [ "ProductCD", "ProductCD", "category" ] 498 | }, { 499 | "feature" : [ "addr1", "addr1", "category" ] 500 | }, { 501 | "feature" : [ "addr2", "addr2", "category" ] 502 | }, { 503 | "feature" : [ "TransactionDT", "TransactionDT", "numerical" ], 504 | "norm" : "min-max", 505 | "imputer" : "median" 506 | }, { 507 | "feature" : [ "TransactionAmt", "TransactionAmt", "numerical" ], 508 | "norm" : "min-max", 509 | "imputer" : "median" 510 | }, { 511 | "feature" : [ "D10", "D10", "numerical" ], 512 | "norm" : "min-max", 513 | "imputer" : "median" 514 | }, { 515 | "feature" : [ "D15", "D15", "numerical" ], 516 | "norm" : "min-max", 517 | "imputer" : "median" 518 | }, { 519 | "feature" : [ "V12", "V12", "numerical" ], 520 | "norm" : "min-max", 521 | "imputer" : "median" 522 | }, { 523 | "feature" : [ "V13", "V13", "numerical" ], 524 | "norm" : "min-max", 525 | "imputer" : "median" 526 | }, { 527 | "feature" : [ "V14", "V14", "numerical" ], 528 | "norm" : "min-max", 529 | "imputer" : "median" 530 | }, { 531 | "feature" : [ "V15", "V15", "numerical" ], 532 | "norm" : "min-max", 533 | "imputer" : "median" 534 | }, { 535 | "feature" : [ "V16", "V16", "numerical" ], 536 | "norm" : "min-max", 537 | "imputer" : "median" 538 | }, { 539 | "feature" : [ "V17", "V17", "numerical" ], 540 | "norm" : "min-max", 541 | "imputer" : "median" 542 | }, { 543 | "feature" : [ "V18", "V18", "numerical" ], 544 | "norm" : "min-max", 545 | "imputer" : "median" 546 | }, { 547 | "feature" : [ 
"V19", "V19", "numerical" ], 548 | "norm" : "min-max", 549 | "imputer" : "median" 550 | }, { 551 | "feature" : [ "V20", "V20", "numerical" ], 552 | "norm" : "min-max", 553 | "imputer" : "median" 554 | }, { 555 | "feature" : [ "V21", "V21", "numerical" ], 556 | "norm" : "min-max", 557 | "imputer" : "median" 558 | }, { 559 | "feature" : [ "V22", "V22", "numerical" ], 560 | "norm" : "min-max", 561 | "imputer" : "median" 562 | }, { 563 | "feature" : [ "V23", "V23", "numerical" ], 564 | "norm" : "min-max", 565 | "imputer" : "median" 566 | }, { 567 | "feature" : [ "V24", "V24", "numerical" ], 568 | "norm" : "min-max", 569 | "imputer" : "median" 570 | }, { 571 | "feature" : [ "V25", "V25", "numerical" ], 572 | "norm" : "min-max", 573 | "imputer" : "median" 574 | }, { 575 | "feature" : [ "V26", "V26", "numerical" ], 576 | "norm" : "min-max", 577 | "imputer" : "median" 578 | }, { 579 | "feature" : [ "V27", "V27", "numerical" ], 580 | "norm" : "min-max", 581 | "imputer" : "median" 582 | }, { 583 | "feature" : [ "V28", "V28", "numerical" ], 584 | "norm" : "min-max", 585 | "imputer" : "median" 586 | }, { 587 | "feature" : [ "V29", "V29", "numerical" ], 588 | "norm" : "min-max", 589 | "imputer" : "median" 590 | }, { 591 | "feature" : [ "V30", "V30", "numerical" ], 592 | "norm" : "min-max", 593 | "imputer" : "median" 594 | }, { 595 | "feature" : [ "V31", "V31", "numerical" ], 596 | "norm" : "min-max", 597 | "imputer" : "median" 598 | }, { 599 | "feature" : [ "V32", "V32", "numerical" ], 600 | "norm" : "min-max", 601 | "imputer" : "median" 602 | }, { 603 | "feature" : [ "V33", "V33", "numerical" ], 604 | "norm" : "min-max", 605 | "imputer" : "median" 606 | }, { 607 | "feature" : [ "V34", "V34", "numerical" ], 608 | "norm" : "min-max", 609 | "imputer" : "median" 610 | }, { 611 | "feature" : [ "V53", "V53", "numerical" ], 612 | "norm" : "min-max", 613 | "imputer" : "median" 614 | }, { 615 | "feature" : [ "V54", "V54", "numerical" ], 616 | "norm" : "min-max", 617 | "imputer" : "median" 618 | }, { 619 | "feature" : [ "V55", "V55", "numerical" ], 620 | "norm" : "min-max", 621 | "imputer" : "median" 622 | }, { 623 | "feature" : [ "V56", "V56", "numerical" ], 624 | "norm" : "min-max", 625 | "imputer" : "median" 626 | }, { 627 | "feature" : [ "V57", "V57", "numerical" ], 628 | "norm" : "min-max", 629 | "imputer" : "median" 630 | }, { 631 | "feature" : [ "V58", "V58", "numerical" ], 632 | "norm" : "min-max", 633 | "imputer" : "median" 634 | }, { 635 | "feature" : [ "V59", "V59", "numerical" ], 636 | "norm" : "min-max", 637 | "imputer" : "median" 638 | }, { 639 | "feature" : [ "V60", "V60", "numerical" ], 640 | "norm" : "min-max", 641 | "imputer" : "median" 642 | }, { 643 | "feature" : [ "V61", "V61", "numerical" ], 644 | "norm" : "min-max", 645 | "imputer" : "median" 646 | }, { 647 | "feature" : [ "V62", "V62", "numerical" ], 648 | "norm" : "min-max", 649 | "imputer" : "median" 650 | }, { 651 | "feature" : [ "V63", "V63", "numerical" ], 652 | "norm" : "min-max", 653 | "imputer" : "median" 654 | }, { 655 | "feature" : [ "V64", "V64", "numerical" ], 656 | "norm" : "min-max", 657 | "imputer" : "median" 658 | }, { 659 | "feature" : [ "V65", "V65", "numerical" ], 660 | "norm" : "min-max", 661 | "imputer" : "median" 662 | }, { 663 | "feature" : [ "V66", "V66", "numerical" ], 664 | "norm" : "min-max", 665 | "imputer" : "median" 666 | }, { 667 | "feature" : [ "V67", "V67", "numerical" ], 668 | "norm" : "min-max", 669 | "imputer" : "median" 670 | }, { 671 | "feature" : [ "V68", "V68", "numerical" ], 672 | "norm" : 
"min-max", 673 | "imputer" : "median" 674 | }, { 675 | "feature" : [ "V69", "V69", "numerical" ], 676 | "norm" : "min-max", 677 | "imputer" : "median" 678 | }, { 679 | "feature" : [ "V70", "V70", "numerical" ], 680 | "norm" : "min-max", 681 | "imputer" : "median" 682 | }, { 683 | "feature" : [ "V71", "V71", "numerical" ], 684 | "norm" : "min-max", 685 | "imputer" : "median" 686 | }, { 687 | "feature" : [ "V72", "V72", "numerical" ], 688 | "norm" : "min-max", 689 | "imputer" : "median" 690 | }, { 691 | "feature" : [ "V73", "V73", "numerical" ], 692 | "norm" : "min-max", 693 | "imputer" : "median" 694 | }, { 695 | "feature" : [ "V74", "V74", "numerical" ], 696 | "norm" : "min-max", 697 | "imputer" : "median" 698 | }, { 699 | "feature" : [ "V75", "V75", "numerical" ], 700 | "norm" : "min-max", 701 | "imputer" : "median" 702 | }, { 703 | "feature" : [ "V76", "V76", "numerical" ], 704 | "norm" : "min-max", 705 | "imputer" : "median" 706 | }, { 707 | "feature" : [ "V77", "V77", "numerical" ], 708 | "norm" : "min-max", 709 | "imputer" : "median" 710 | }, { 711 | "feature" : [ "V78", "V78", "numerical" ], 712 | "norm" : "min-max", 713 | "imputer" : "median" 714 | }, { 715 | "feature" : [ "V79", "V79", "numerical" ], 716 | "norm" : "min-max", 717 | "imputer" : "median" 718 | }, { 719 | "feature" : [ "V80", "V80", "numerical" ], 720 | "norm" : "min-max", 721 | "imputer" : "median" 722 | }, { 723 | "feature" : [ "V81", "V81", "numerical" ], 724 | "norm" : "min-max", 725 | "imputer" : "median" 726 | }, { 727 | "feature" : [ "V82", "V82", "numerical" ], 728 | "norm" : "min-max", 729 | "imputer" : "median" 730 | }, { 731 | "feature" : [ "V83", "V83", "numerical" ], 732 | "norm" : "min-max", 733 | "imputer" : "median" 734 | }, { 735 | "feature" : [ "V84", "V84", "numerical" ], 736 | "norm" : "min-max", 737 | "imputer" : "median" 738 | }, { 739 | "feature" : [ "V85", "V85", "numerical" ], 740 | "norm" : "min-max", 741 | "imputer" : "median" 742 | }, { 743 | "feature" : [ "V86", "V86", "numerical" ], 744 | "norm" : "min-max", 745 | "imputer" : "median" 746 | }, { 747 | "feature" : [ "V87", "V87", "numerical" ], 748 | "norm" : "min-max", 749 | "imputer" : "median" 750 | }, { 751 | "feature" : [ "V88", "V88", "numerical" ], 752 | "norm" : "min-max", 753 | "imputer" : "median" 754 | }, { 755 | "feature" : [ "V89", "V89", "numerical" ], 756 | "norm" : "min-max", 757 | "imputer" : "median" 758 | }, { 759 | "feature" : [ "V90", "V90", "numerical" ], 760 | "norm" : "min-max", 761 | "imputer" : "median" 762 | }, { 763 | "feature" : [ "V91", "V91", "numerical" ], 764 | "norm" : "min-max", 765 | "imputer" : "median" 766 | }, { 767 | "feature" : [ "V92", "V92", "numerical" ], 768 | "norm" : "min-max", 769 | "imputer" : "median" 770 | }, { 771 | "feature" : [ "V93", "V93", "numerical" ], 772 | "norm" : "min-max", 773 | "imputer" : "median" 774 | }, { 775 | "feature" : [ "V94", "V94", "numerical" ], 776 | "norm" : "min-max", 777 | "imputer" : "median" 778 | }, { 779 | "feature" : [ "M6", "M6", "category" ] 780 | }, { 781 | "feature" : [ "D4", "D4", "numerical" ], 782 | "norm" : "min-max", 783 | "imputer" : "median" 784 | }, { 785 | "feature" : [ "V35", "V35", "numerical" ], 786 | "norm" : "min-max", 787 | "imputer" : "median" 788 | }, { 789 | "feature" : [ "V36", "V36", "numerical" ], 790 | "norm" : "min-max", 791 | "imputer" : "median" 792 | }, { 793 | "feature" : [ "V37", "V37", "numerical" ], 794 | "norm" : "min-max", 795 | "imputer" : "median" 796 | }, { 797 | "feature" : [ "V38", "V38", "numerical" ], 798 | "norm" 
: "min-max", 799 | "imputer" : "median" 800 | }, { 801 | "feature" : [ "V39", "V39", "numerical" ], 802 | "norm" : "min-max", 803 | "imputer" : "median" 804 | }, { 805 | "feature" : [ "V40", "V40", "numerical" ], 806 | "norm" : "min-max", 807 | "imputer" : "median" 808 | }, { 809 | "feature" : [ "V41", "V41", "numerical" ], 810 | "norm" : "min-max", 811 | "imputer" : "median" 812 | }, { 813 | "feature" : [ "V42", "V42", "numerical" ], 814 | "norm" : "min-max", 815 | "imputer" : "median" 816 | }, { 817 | "feature" : [ "V43", "V43", "numerical" ], 818 | "norm" : "min-max", 819 | "imputer" : "median" 820 | }, { 821 | "feature" : [ "V44", "V44", "numerical" ], 822 | "norm" : "min-max", 823 | "imputer" : "median" 824 | }, { 825 | "feature" : [ "V45", "V45", "numerical" ], 826 | "norm" : "min-max", 827 | "imputer" : "median" 828 | }, { 829 | "feature" : [ "V46", "V46", "numerical" ], 830 | "norm" : "min-max", 831 | "imputer" : "median" 832 | }, { 833 | "feature" : [ "V47", "V47", "numerical" ], 834 | "norm" : "min-max", 835 | "imputer" : "median" 836 | }, { 837 | "feature" : [ "V48", "V48", "numerical" ], 838 | "norm" : "min-max", 839 | "imputer" : "median" 840 | }, { 841 | "feature" : [ "V49", "V49", "numerical" ], 842 | "norm" : "min-max", 843 | "imputer" : "median" 844 | }, { 845 | "feature" : [ "V50", "V50", "numerical" ], 846 | "norm" : "min-max", 847 | "imputer" : "median" 848 | }, { 849 | "feature" : [ "V51", "V51", "numerical" ], 850 | "norm" : "min-max", 851 | "imputer" : "median" 852 | }, { 853 | "feature" : [ "V52", "V52", "numerical" ], 854 | "norm" : "min-max", 855 | "imputer" : "median" 856 | } ], 857 | "labels" : [ { 858 | "label" : [ "isFraud", "classification" ], 859 | "split_rate" : [ 0.8, 0.1, 0.1 ] 860 | } ] 861 | } ], 862 | "edges" : [ { 863 | "file_name" : "edges/%28Transaction%29-purchased_by-%28Card%29.consolidated.csv", 864 | "separator" : ",", 865 | "source" : [ "~from", "Transaction" ], 866 | "relation" : [ "", "purchased_by" ], 867 | "dest" : [ "~to", "Card" ] 868 | }, { 869 | "file_name" : "edges/%28Transaction%29-transacted_through-%28Email%29.consolidated.csv", 870 | "separator" : ",", 871 | "source" : [ "~from", "Transaction" ], 872 | "relation" : [ "", "transacted_through" ], 873 | "dest" : [ "~to", "Email" ] 874 | }, { 875 | "file_name" : "edges/%28Transaction%29-associated_with-%28Device%29.consolidated.csv", 876 | "separator" : ",", 877 | "source" : [ "~from", "Transaction" ], 878 | "relation" : [ "", "associated_with" ], 879 | "dest" : [ "~to", "Device" ] 880 | }, { 881 | "file_name" : "edges/%28Transaction%29-identified_by-%28Identifier%29.consolidated.csv", 882 | "separator" : ",", 883 | "source" : [ "~from", "Transaction" ], 884 | "relation" : [ "", "identified_by" ], 885 | "dest" : [ "~to", "Identifier" ] 886 | } ] 887 | }, 888 | "warnings" : [ ] 889 | } -------------------------------------------------------------------------------- /credit-card-fraud-detection/neptune_ml_utils.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import pandas as pd 3 | import numpy as np 4 | import pickle 5 | import os 6 | import requests 7 | import json 8 | import zipfile 9 | import logging 10 | import time 11 | from time import strftime, gmtime, sleep 12 | from botocore.auth import SigV4Auth 13 | from botocore.awsrequest import AWSRequest 14 | from datetime import datetime 15 | from urllib.parse import urlparse 16 | from sagemaker.s3 import S3Downloader 17 | 18 | # How often to check the status 19 | 
UPDATE_DELAY_SECONDS = 15 20 | HOME_DIRECTORY = os.path.expanduser("~") 21 | 22 | 23 | def signed_request(method, url, data=None, params=None, headers=None, service=None): 24 | request = AWSRequest(method=method, url=url, data=data, 25 | params=params, headers=headers) 26 | session = boto3.Session() 27 | credentials = session.get_credentials() 28 | try: 29 | frozen_creds = credentials.get_frozen_credentials() 30 | except AttributeError: 31 | print("Could not find valid IAM credentials in any the following locations:\n") 32 | print("env, assume-role, assume-role-with-web-identity, sso, shared-credential-file, custom-process, " 33 | "config-file, ec2-credentials-file, boto-config, container-role, iam-role\n") 34 | print("Go to https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html for more " 35 | "details on configuring your IAM credentials.") 36 | return request 37 | SigV4Auth(frozen_creds, service, boto3.Session().region_name).add_auth(request) 38 | return requests.request(method=method, url=url, headers=dict(request.headers), data=data) 39 | 40 | 41 | def load_configuration(): 42 | with open(f'{HOME_DIRECTORY}/graph_notebook_config.json') as f: 43 | data = json.load(f) 44 | host = data['host'] 45 | port = data['port'] 46 | if data['auth_mode'] == 'IAM': 47 | iam = True 48 | else: 49 | iam = False 50 | return host, port, iam 51 | 52 | 53 | def get_host(): 54 | host, port, iam = load_configuration() 55 | return host 56 | 57 | 58 | def get_iam(): 59 | host, port, iam = load_configuration() 60 | return iam 61 | 62 | 63 | def get_training_job_name(prefix: str): 64 | return f'{prefix}-{int(time.time())}' 65 | 66 | 67 | def check_ml_enabled(): 68 | host, port, use_iam = load_configuration() 69 | response = signed_request( 70 | "GET", url=f'https://{host}:{port}/ml/modeltraining', service='neptune-db') 71 | if response.status_code != 200: 72 | print('''This Neptune cluster \033[1mis not\033[0m configured to use Neptune ML. 73 | Please configure the cluster according to the Amazpnm Neptune ML documentation before proceeding.''') 74 | else: 75 | print("This Neptune cluster is configured to use Neptune ML") 76 | 77 | 78 | def get_export_service_host(): 79 | with open(f'{HOME_DIRECTORY}/.bashrc') as f: 80 | data = f.readlines() 81 | for d in data: 82 | if str.startswith(d, 'export NEPTUNE_EXPORT_API_URI'): 83 | parts = d.split('=') 84 | if len(parts) == 2: 85 | path = urlparse(parts[1].rstrip()) 86 | return path.hostname + "/v1" 87 | logging.error( 88 | "Unable to determine the Neptune Export Service Endpoint. 
You will need to enter this or assign it manually.") 89 | return None 90 | 91 | 92 | def delete_pretrained_data(setup_node_classification: bool, 93 | setup_node_regression: bool, setup_link_prediction: bool, 94 | setup_edge_regression: bool, setup_edge_classification: bool): 95 | host, port, use_iam = load_configuration() 96 | if setup_node_classification: 97 | response = signed_request("POST", service='neptune-db', 98 | url=f'https://{host}:{port}/gremlin', 99 | headers={'content-type': 'application/json'}, 100 | data=json.dumps( 101 | { 102 | 'gremlin': "g.V('movie_28', 'movie_69', 'movie_88').properties('genre').drop()"})) 103 | 104 | if response.status_code != 200: 105 | print(response.content.decode('utf-8')) 106 | if setup_node_regression: 107 | response = signed_request("POST", service='neptune-db', 108 | url=f'https://{host}:{port}/gremlin', 109 | headers={'content-type': 'application/json'}, 110 | data=json.dumps({'gremlin': "g.V('user_1').out('wrote').properties('score').drop()"})) 111 | if response.status_code != 200: 112 | print(response.content.decode('utf-8')) 113 | if setup_link_prediction: 114 | response = signed_request("POST", service='neptune-db', 115 | url=f'https://{host}:{port}/gremlin', 116 | headers={'content-type': 'application/json'}, 117 | data=json.dumps({'gremlin': "g.V('user_1').outE('rated').drop()"})) 118 | if response.status_code != 200: 119 | print(response.content.decode('utf-8')) 120 | 121 | if setup_edge_regression: 122 | response = signed_request("POST", service='neptune-db', 123 | url=f'https://{host}:{port}/gremlin', 124 | headers={'content-type': 'application/json'}, 125 | data=json.dumps( 126 | {'gremlin': "g.V('user_1').outE('rated').properties('score').drop()"})) 127 | if response.status_code != 200: 128 | print(response.content.decode('utf-8')) 129 | 130 | if setup_edge_classification: 131 | response = signed_request("POST", service='neptune-db', 132 | url=f'https://{host}:{port}/gremlin', 133 | headers={'content-type': 'application/json'}, 134 | data=json.dumps( 135 | {'gremlin': "g.V('user_1').outE('rated').properties('scale').drop()"})) 136 | if response.status_code != 200: 137 | print(response.content.decode('utf-8')) 138 | 139 | 140 | def delete_pretrained_endpoints(endpoints: dict): 141 | sm = boto3.client("sagemaker") 142 | try: 143 | if 'node_classification_endpoint_name' in endpoints and endpoints['node_classification_endpoint_name']: 144 | sm.delete_endpoint( 145 | EndpointName=endpoints['node_classification_endpoint_name']["EndpointName"]) 146 | if 'node_regression_endpoint_name' in endpoints and endpoints['node_regression_endpoint_name']: 147 | sm.delete_endpoint( 148 | EndpointName=endpoints['node_regression_endpoint_name']["EndpointName"]) 149 | if 'prediction_endpoint_name' in endpoints and endpoints['prediction_endpoint_name']: 150 | sm.delete_endpoint( 151 | EndpointName=endpoints['prediction_endpoint_name']["EndpointName"]) 152 | if 'edge_classification_endpoint_name' in endpoints and endpoints['edge_classification_endpoint_name']: 153 | sm.delete_endpoint( 154 | EndpointName=endpoints['edge_classification_endpoint_name']["EndpointName"]) 155 | if 'edge_regression_endpoint_name' in endpoints and endpoints['edge_regression_endpoint_name']: 156 | sm.delete_endpoint( 157 | EndpointName=endpoints['edge_regression_endpoint_name']["EndpointName"]) 158 | print(f'Endpoint(s) have been deleted') 159 | except Exception as e: 160 | logging.error(e) 161 | 162 | 163 | def delete_endpoint(training_job_name: str, neptune_iam_role_arn=None): 
164 | query_string = "" 165 | if neptune_iam_role_arn: 166 | query_string = f'?neptuneIamRoleArn={neptune_iam_role_arn}' 167 | host, port, use_iam = load_configuration() 168 | response = signed_request("DELETE", service='neptune-db', 169 | url=f'https://{host}:{port}/ml/endpoints/{training_job_name}{query_string}', 170 | headers={'content-type': 'application/json'}) 171 | if response.status_code != 200: 172 | print(response.content.decode('utf-8')) 173 | else: 174 | print(response.content.decode('utf-8')) 175 | print(f'Endpoint {training_job_name} has been deleted') 176 | 177 | 178 | def prepare_movielens_data(s3_bucket_uri: str): 179 | try: 180 | return MovieLensProcessor().prepare_movielens_data(s3_bucket_uri) 181 | except Exception as e: 182 | logging.error(e) 183 | 184 | 185 | def setup_pretrained_endpoints(s3_bucket_uri: str, setup_node_classification: bool, 186 | setup_node_regression: bool, setup_link_prediction: bool, \ 187 | setup_edge_classification: bool, setup_edge_regression: bool): 188 | delete_pretrained_data(setup_node_classification, 189 | setup_node_regression, setup_link_prediction, 190 | setup_edge_classification, setup_edge_regression) 191 | try: 192 | return PretrainedModels().setup_pretrained_endpoints(s3_bucket_uri, setup_node_classification, 193 | setup_node_regression, setup_link_prediction, 194 | setup_edge_classification, setup_edge_regression) 195 | except Exception as e: 196 | logging.error(e) 197 | 198 | def get_neptune_ml_job_output_location(job_name: str, job_type: str): 199 | assert job_type in ["dataprocessing", "modeltraining", "modeltransform"], "Invalid neptune ml job type" 200 | 201 | host, port, use_iam = load_configuration() 202 | 203 | response = signed_request("GET", service='neptune-db', 204 | url=f'https://{host}:{port}/ml/{job_type}/{job_name}', 205 | headers={'content-type': 'application/json'}) 206 | result = json.loads(response.content.decode('utf-8')) 207 | if result["status"] != "Completed": 208 | logging.error("Neptune ML {} job: {} is not completed".format(job_type, job_name)) 209 | return 210 | return result["processingJob"]["outputLocation"] 211 | 212 | 213 | def get_dataprocessing_job_output_location(dataprocessing_job_name: str): 214 | assert dataprocessing_job_name is not None, \ 215 | "Neptune ML training job name id should be passed, if training job s3 output is missing" 216 | return get_neptune_ml_job_output_location(dataprocessing_job_name, "dataprocessing") 217 | 218 | 219 | def get_modeltraining_job_output_location(training_job_name: str): 220 | assert training_job_name is not None, \ 221 | "Neptune ML training job name id should be passed, if training job s3 output is missing" 222 | return get_neptune_ml_job_output_location(training_job_name, "modeltraining") 223 | 224 | 225 | def get_node_to_idx_mapping(training_job_name: str = None, dataprocessing_job_name: str = None, 226 | model_artifacts_location: str = './model-artifacts', vertex_label: str = None): 227 | assert training_job_name is not None or dataprocessing_job_name is not None, \ 228 | "You must provide either a modeltraining job id or a dataprocessing job id to obtain node to index mappings" 229 | 230 | job_name = training_job_name if training_job_name is not None else dataprocessing_job_name 231 | job_type = "modeltraining" if training_job_name == job_name else "dataprocessing" 232 | filename = "mapping.info" if training_job_name == job_name else "info.pkl" 233 | mapping_key = "node2id" if training_job_name == job_name else "node_id_map" 234 | 235 | # get mappings 
236 | model_artifacts_location = os.path.join(model_artifacts_location, job_name) 237 | if not os.path.exists(os.path.join(model_artifacts_location, filename)): 238 | job_s3_output = get_neptune_ml_job_output_location(job_name, job_type) 239 | print(job_s3_output) 240 | if not job_s3_output: 241 | return 242 | S3Downloader.download(os.path.join(job_s3_output, filename), model_artifacts_location) 243 | 244 | with open(os.path.join(model_artifacts_location, filename), "rb") as f: 245 | mapping = pickle.load(f)[mapping_key] 246 | if vertex_label is not None: 247 | if vertex_label in mapping: 248 | mapping = mapping[vertex_label] 249 | else: 250 | print("Mapping for vertex label: {} not found.".format(vertex_label)) 251 | print("valid vertex labels which have vertices mapped to embeddings: {} ".format(list(mapping.keys()))) 252 | print("Returning mapping for all valid vertex labels") 253 | 254 | return mapping 255 | 256 | 257 | def get_embeddings(training_job_name: str, download_location: str = './model-artifacts'): 258 | training_job_s3_output = get_modeltraining_job_output_location(training_job_name) 259 | if not training_job_s3_output: 260 | return 261 | 262 | download_location = os.path.join(download_location, training_job_name) 263 | os.makedirs(download_location, exist_ok=True) 264 | # download embeddings and mapping info 265 | 266 | S3Downloader.download(os.path.join(training_job_s3_output, "embeddings/"), 267 | os.path.join(download_location, "embeddings/")) 268 | 269 | entity_emb = np.load(os.path.join(download_location, "embeddings", "entity.npy")) 270 | 271 | return entity_emb 272 | 273 | 274 | def get_predictions(training_job_name: str, download_location: str = './model-artifacts', class_preds: bool = False): 275 | training_job_s3_output = get_modeltraining_job_output_location(training_job_name) 276 | if not training_job_s3_output: 277 | return 278 | 279 | download_location = os.path.join(download_location, training_job_name) 280 | os.makedirs(download_location, exist_ok=True) 281 | # download embeddings and mapping info 282 | 283 | S3Downloader.download(os.path.join(training_job_s3_output, "predictions/"), 284 | os.path.join(download_location, "predictions/")) 285 | 286 | preds = np.load(os.path.join(download_location, "predictions", "result.npz"))['infer_scores'] 287 | 288 | if class_preds: 289 | return preds.argmax(axis=1) 290 | 291 | return preds 292 | 293 | 294 | def get_performance_metrics(training_job_name: str, download_location: str = './model-artifacts'): 295 | training_job_s3_output = get_modeltraining_job_output_location(training_job_name) 296 | if not training_job_s3_output: 297 | return 298 | 299 | download_location = os.path.join(download_location, training_job_name) 300 | os.makedirs(download_location, exist_ok=True) 301 | # download embeddings and mapping info 302 | 303 | S3Downloader.download(os.path.join(training_job_s3_output, "eval_metrics_info.json"), 304 | download_location) 305 | 306 | with open(os.path.join(download_location, "eval_metrics_info.json")) as f: 307 | metrics = json.load(f) 308 | 309 | return metrics 310 | 311 | 312 | class MovieLensProcessor: 313 | raw_directory = fr'{HOME_DIRECTORY}/data/raw' 314 | formatted_directory = fr'{HOME_DIRECTORY}/data/formatted' 315 | 316 | def __download_and_unzip(self): 317 | if not os.path.exists(f'{HOME_DIRECTORY}/data'): 318 | os.makedirs(f'{HOME_DIRECTORY}/data') 319 | if not os.path.exists(f'{HOME_DIRECTORY}/data/raw'): 320 | os.makedirs(f'{HOME_DIRECTORY}/data/raw') 321 | if not 
os.path.exists(f'{HOME_DIRECTORY}/data/formatted'): 322 | os.makedirs(f'{HOME_DIRECTORY}/data/formatted') 323 | # Download the MovieLens dataset 324 | url = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip' 325 | r = requests.get(url, allow_redirects=True) 326 | open(os.path.join(self.raw_directory, 'ml-100k.zip'), 'wb').write(r.content) 327 | 328 | with zipfile.ZipFile(os.path.join(self.raw_directory, 'ml-100k.zip'), 'r') as zip_ref: 329 | zip_ref.extractall(self.raw_directory) 330 | 331 | def __process_movies_genres(self): 332 | # process the movies_vertex.csv 333 | print('Processing Movies', end='\r') 334 | movies_df = pd.read_csv(os.path.join( 335 | self.raw_directory, 'ml-100k/u.item'), sep='|', encoding='ISO-8859-1', 336 | names=['~id', 'title', 'release_date', 'video_release_date', 'imdb_url', 337 | 'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 338 | 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 339 | 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']) 340 | # Parse date and convert to ISO format 341 | movies_df['release_date'] = movies_df['release_date'].apply( 342 | lambda x: str( 343 | datetime.strptime(x, '%d-%b-%Y').isoformat()) if not pd.isna(x) else x) 344 | movies_df['~label'] = 'movie' 345 | movies_df['~id'] = movies_df['~id'].apply( 346 | lambda x: f'movie_{x}') 347 | movie_genre_df = movies_df[[ 348 | '~id', 'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 349 | 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 350 | 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']] 351 | genres_edges_df = pd.DataFrame( 352 | columns=['~id', '~from', '~to', '~label']) 353 | 354 | genres = ['unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 355 | 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 356 | 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'] 357 | 358 | genre_df = pd.DataFrame(genres, columns=['~id']) 359 | genre_df['~label'] = 'genre' 360 | genre_df['name'] = genre_df['~id'] 361 | genre_df.to_csv(os.path.join(self.formatted_directory, 362 | 'genre_vertex.csv'), index=False) 363 | 364 | # Loop through all the movies and pull out the genres 365 | for index, row in movie_genre_df.iterrows(): 366 | genre_lst = [] 367 | for g in genres: 368 | if row[g] == 1: 369 | genres_edges_df = genres_edges_df.append( 370 | {'~id': f"{row['~id']}-included_in-{g}", '~label': 'included_in', 371 | '~from': row['~id'], '~to': g}, ignore_index=True) 372 | genre_lst.append(g) 373 | movies_df.loc[index, 'genre:String[]'] = ';'.join(genre_lst) 374 | 375 | # rename the release data column to specify the data type 376 | movies_df['release_date:Date'] = movies_df['release_date'] 377 | # Drop the genre columns as well as the uneeded release date columns 378 | genres.append('video_release_date') 379 | genres.append('release_date') 380 | movies_df = movies_df.drop(columns=genres) 381 | 382 | movies_df.to_csv(os.path.join(self.formatted_directory, 383 | 'movie_vertex.csv'), index=False) 384 | genres_edges_df.to_csv(os.path.join(self.formatted_directory, 385 | 'genre_edges.csv'), index=False) 386 | 387 | def __process_ratings_users(self): 388 | # Create ratings vertices and add edges on both sides 389 | print('Processing Ratings', end='\r') 390 | ratings_vertices = pd.read_csv(os.path.join( 391 | self.raw_directory, 'ml-100k/u.data'), sep='\t', encoding='ISO-8859-1', 392 | names=['~from', '~to', 'score:Int', 
'timestamp']) 393 | ratings_vertices['~from'] = ratings_vertices['~from'].apply( 394 | lambda x: f'user_{x}') 395 | ratings_vertices['~to'] = ratings_vertices['~to'].apply( 396 | lambda x: f'movie_{x}') 397 | rated_edges = ratings_vertices.copy(deep=True) 398 | 399 | ratings_vertices['~id'] = ratings_vertices['~from'].str.cat( 400 | ratings_vertices['~to'], sep=":") 401 | ratings_vertices['~label'] = "rating" 402 | dict = {} 403 | edges = {} 404 | for index, row in ratings_vertices.iterrows(): 405 | id_from = row['~from'] 406 | id_to = row['~to'] 407 | id_id = row['~id'] 408 | dict[index * 2] = {'~id': f"{id_from}-wrote-{id_id}", '~label': 'wrote', 409 | '~from': id_from, '~to': id_id} 410 | dict[index * 2 + 1] = {'~id': f"{id_id}-about-{id_to}", '~label': 'about', 411 | '~from': id_id, '~to': id_to} 412 | score = row['score:Int'] 413 | scale = '' 414 | if score == 1: 415 | scale = 'Hate' 416 | elif score == 2: 417 | scale = 'Dislike' 418 | elif score == 3: 419 | scale = 'Neutral' 420 | elif score == 4: 421 | scale = 'Like' 422 | elif score == 5: 423 | scale = 'Love' 424 | edges[index] = {'~id': f"{id_from}-rated-{id_to}", '~label': 'rated', 425 | '~from': id_from, '~to': id_to, 'score:Int': score, 'scale': scale} 426 | rating_edges_df = pd.DataFrame.from_dict(dict, "index") 427 | 428 | # Remove the from and to columns and write this out as a vertex now 429 | ratings_vertices = ratings_vertices.drop(columns=['~from', '~to']) 430 | ratings_vertices.to_csv(os.path.join(self.formatted_directory, 431 | 'ratings_vertices.csv'), index=False) 432 | # Write out the rating vertex edges for wrote and about 433 | rating_edges_df.to_csv(os.path.join(self.formatted_directory, 434 | 'ratings_vertex_edges.csv'), index=False) 435 | # Write out the rated edges 436 | rated_edges_df = pd.DataFrame.from_dict(edges, "index") 437 | rated_edges_df.to_csv(os.path.join(self.formatted_directory, 438 | 'rated_edges.csv'), index=False) 439 | 440 | def __process_users(self): 441 | print("Processing Users", end='\r') 442 | # User Vertices - Load, rename column with type, and save 443 | 444 | user_df = pd.read_csv(os.path.join( 445 | self.raw_directory, 'ml-100k/u.user'), sep='|', encoding='ISO-8859-1', 446 | names=['~id', 'age:Int', 'gender', 'occupation', 'zip_code']) 447 | user_df['~id'] = user_df['~id'].apply( 448 | lambda x: f'user_{x}') 449 | user_df['~label'] = 'user' 450 | user_df.to_csv(os.path.join(self.formatted_directory, 451 | 'user_vertex.csv'), index=False) 452 | 453 | def __upload_to_s3(self, bucketname: str): 454 | path = urlparse(bucketname, allow_fragments=False) 455 | bucket = path.netloc 456 | file_path = path.path.lstrip('/').rstrip('/') 457 | 458 | s3_client = boto3.client('s3') 459 | for root, dirs, files in os.walk(self.formatted_directory): 460 | for file in files: 461 | s3_client.upload_file(os.path.join( 462 | self.formatted_directory, file), bucket, f'{file_path}/{file}') 463 | 464 | def prepare_movielens_data(self, s3_bucket: str): 465 | bucket_name = f'{s3_bucket}/neptune-formatted/movielens-100k' 466 | self.__download_and_unzip() 467 | self.__process_movies_genres() 468 | self.__process_users() 469 | self.__process_ratings_users() 470 | self.__upload_to_s3(bucket_name) 471 | print('Completed Processing, data is ready for loading using the s3 url below:') 472 | print(bucket_name) 473 | return bucket_name 474 | 475 | 476 | class PretrainedModels: 477 | SCRIPT_PARAM_NAME = "sagemaker_program" 478 | DIR_PARAM_NAME = "sagemaker_submit_directory" 479 | CONTAINER_LOG_LEVEL_PARAM_NAME = 
"sagemaker_container_log_level" 480 | ENABLE_CLOUDWATCH_METRICS_PARAM = "sagemaker_enable_cloudwatch_metrics" 481 | MODEL_SERVER_TIMEOUT_PARAM_NAME = "sagemaker_model_server_timeout" 482 | MODEL_SERVER_WORKERS_PARAM_NAME = "sagemaker_model_server_workers" 483 | SAGEMAKER_REGION_PARAM_NAME = "sagemaker_region" 484 | INSTANCE_TYPE = 'ml.m5.2xlarge' 485 | PYTORCH_CPU_CONTAINER_IMAGE = "" 486 | PRETRAINED_MODEL = {} 487 | 488 | def __init__(self): 489 | with open('./neptune-ml-pretrained-model-config.json') as f: 490 | config = json.load(f) 491 | self.PRETRAINED_MODEL = config['models'] 492 | self.PYTORCH_CPU_CONTAINER_IMAGE = config['container_images'][boto3.session.Session( 493 | ).region_name] 494 | 495 | def __run_create_model(self, sm_client, 496 | name, 497 | role, 498 | image_uri, 499 | model_s3_location, 500 | container_mode='SingleModel', 501 | script_name='infer_entry_point.py', 502 | ): 503 | model_environment_vars = {self.SCRIPT_PARAM_NAME.upper(): script_name, 504 | self.DIR_PARAM_NAME.upper(): model_s3_location, 505 | self.CONTAINER_LOG_LEVEL_PARAM_NAME.upper(): str(20), 506 | self.MODEL_SERVER_TIMEOUT_PARAM_NAME.upper(): str(1200), 507 | self.MODEL_SERVER_WORKERS_PARAM_NAME.upper(): str(1), 508 | self.SAGEMAKER_REGION_PARAM_NAME.upper(): boto3.session.Session().region_name, 509 | self.ENABLE_CLOUDWATCH_METRICS_PARAM.upper(): "false" 510 | } 511 | 512 | container_def = [{"Image": self.PYTORCH_CPU_CONTAINER_IMAGE, 513 | "Environment": model_environment_vars, 514 | "ModelDataUrl": model_s3_location, 515 | "Mode": container_mode 516 | }] 517 | request = {"ModelName": name, 518 | "ExecutionRoleArn": role, 519 | "Containers": container_def 520 | } 521 | return sm_client.create_model(**request) 522 | 523 | def __run_create_endpoint_config(self, sm_client, 524 | model_name, 525 | instance_type='ml.m5.2xlarge', 526 | initial_instance_count=1, 527 | initial_weight=1, 528 | variant_name='AllTraffic' 529 | ): 530 | production_variant_configuration = [{ 531 | "ModelName": model_name, 532 | "InstanceType": instance_type, 533 | "InitialInstanceCount": initial_instance_count, 534 | "VariantName": variant_name, 535 | "InitialVariantWeight": initial_weight, 536 | }] 537 | request = {"EndpointConfigName": model_name, 538 | "ProductionVariants": production_variant_configuration 539 | } 540 | 541 | return sm_client.create_endpoint_config(**request) 542 | 543 | def __create_model(self, name: str, model_s3_location: str): 544 | image_uri = self.PYTORCH_CPU_CONTAINER_IMAGE 545 | instance_type = self.INSTANCE_TYPE 546 | role = self.__get_neptune_ml_role() 547 | sm = boto3.client("sagemaker") 548 | name = "{}-{}".format(name, strftime("%Y-%m-%d-%H-%M-%S", gmtime())) 549 | create_model_result = self.__run_create_model( 550 | sm, name, role, image_uri, model_s3_location) 551 | create_endpoint_config_result = self.__run_create_endpoint_config( 552 | sm, name, instance_type=instance_type) 553 | create_endpoint_result = sm.create_endpoint( 554 | EndpointName=name, EndpointConfigName=name) 555 | return name 556 | 557 | def __get_neptune_ml_role(self): 558 | with open(f'{HOME_DIRECTORY}/.bashrc') as f: 559 | data = f.readlines() 560 | for d in data: 561 | if str.startswith(d, 'export NEPTUNE_ML_ROLE_ARN'): 562 | parts = d.split('=') 563 | if len(parts) == 2: 564 | return parts[1].rstrip() 565 | logging.error("Unable to determine the Neptune ML IAM Role.") 566 | return None 567 | 568 | def __copy_s3(self, s3_bucket_uri: str, source_s3_uri: str): 569 | path = urlparse(s3_bucket_uri, allow_fragments=False) 570 | 
bucket = path.netloc 571 | file_path = path.path.lstrip('/').rstrip('/') 572 | source_path = urlparse(source_s3_uri, allow_fragments=False) 573 | source_bucket = source_path.netloc 574 | source_file_path = source_path.path.lstrip('/').rstrip('/') 575 | s3 = boto3.resource('s3') 576 | s3.meta.client.copy( 577 | {"Bucket": source_bucket, "Key": source_file_path}, bucket, file_path) 578 | 579 | def setup_pretrained_endpoints(self, s3_bucket_uri: str, 580 | setup_node_classification: bool, setup_node_regression: bool, 581 | setup_link_prediction: bool, setup_edge_classification: bool, 582 | setup_edge_regression: bool): 583 | print('Beginning endpoint creation', end='\r') 584 | if setup_node_classification: 585 | # copy model 586 | self.__copy_s3(f'{s3_bucket_uri}/pretrained-models/node-classification/model.tar.gz', 587 | self.PRETRAINED_MODEL['node_classification']) 588 | # create model 589 | classification_output = self.__create_model( 590 | 'classifi', f'{s3_bucket_uri}/pretrained-models/node-classification/model.tar.gz') 591 | if setup_node_regression: 592 | # copy model 593 | self.__copy_s3(f'{s3_bucket_uri}/pretrained-models/node-regression/model.tar.gz', 594 | self.PRETRAINED_MODEL['node_regression']) 595 | # create model 596 | regression_output = self.__create_model( 597 | 'regressi', f'{s3_bucket_uri}/pretrained-models/node-regression/model.tar.gz') 598 | if setup_link_prediction: 599 | # copy model 600 | self.__copy_s3(f'{s3_bucket_uri}/pretrained-models/link-prediction/model.tar.gz', 601 | self.PRETRAINED_MODEL['link_prediction']) 602 | # create model 603 | prediction_output = self.__create_model( 604 | 'linkpred', f'{s3_bucket_uri}/pretrained-models/link-prediction/model.tar.gz') 605 | if setup_edge_classification: 606 | # copy model 607 | self.__copy_s3(f'{s3_bucket_uri}/pretrained-models/edge-classification/model.tar.gz', 608 | self.PRETRAINED_MODEL['edge_classification']) 609 | # create model 610 | edgeclass_output = self.__create_model( 611 | 'edgeclass', f'{s3_bucket_uri}/pretrained-models/edge-classification/model.tar.gz') 612 | if setup_edge_regression: 613 | # copy model 614 | self.__copy_s3(f'{s3_bucket_uri}/pretrained-models/edge-regression/model.tar.gz', 615 | self.PRETRAINED_MODEL['edge_regression']) 616 | # create model 617 | edgereg_output = self.__create_model( 618 | 'edgereg', f'{s3_bucket_uri}/pretrained-models/edge-regression/model.tar.gz') 619 | 620 | sleep(UPDATE_DELAY_SECONDS) 621 | classification_running = setup_node_classification 622 | regression_running = setup_node_regression 623 | prediction_running = setup_link_prediction 624 | edgeclass_running = setup_edge_classification 625 | edgereg_running = setup_edge_regression 626 | classification_endpoint_name = "" 627 | regression_endpoint_name = "" 628 | prediction_endpoint_name = "" 629 | edge_classification_endpoint_name = "" 630 | edge_regression_endpoint_name = "" 631 | sucessful = False 632 | sm = boto3.client("sagemaker") 633 | while classification_running or regression_running or prediction_running or edgeclass_running or edgereg_running: 634 | if classification_running: 635 | response = sm.describe_endpoint( 636 | EndpointName=classification_output 637 | ) 638 | if response['EndpointStatus'] in ['InService', 'Failed']: 639 | if response['EndpointStatus'] == 'InService': 640 | classification_endpoint_name = response 641 | classification_running = False 642 | if regression_running: 643 | response = sm.describe_endpoint( 644 | EndpointName=regression_output 645 | ) 646 | if response['EndpointStatus'] in 
['InService', 'Failed']: 647 | if response['EndpointStatus'] == 'InService': 648 | regression_endpoint_name = response 649 | regression_running = False 650 | if prediction_running: 651 | response = sm.describe_endpoint( 652 | EndpointName=prediction_output 653 | ) 654 | if response['EndpointStatus'] in ['InService', 'Failed']: 655 | if response['EndpointStatus'] == 'InService': 656 | prediction_endpoint_name = response 657 | prediction_running = False 658 | if edgeclass_running: 659 | response = sm.describe_endpoint( 660 | EndpointName=edgeclass_output 661 | ) 662 | if response['EndpointStatus'] in ['InService', 'Failed']: 663 | if response['EndpointStatus'] == 'InService': 664 | edge_classification_endpoint_name = response 665 | edgeclass_running = False 666 | if edgereg_running: 667 | response = sm.describe_endpoint( 668 | EndpointName=edgereg_output 669 | ) 670 | if response['EndpointStatus'] in ['InService', 'Failed']: 671 | if response['EndpointStatus'] == 'InService': 672 | edge_regression_endpoint_name = response 673 | edgereg_running = False 674 | 675 | print( 676 | f'Checking Endpoint Creation Statuses at {datetime.now().strftime("%H:%M:%S")}', end='\r') 677 | sleep(UPDATE_DELAY_SECONDS) 678 | 679 | print("") 680 | if classification_endpoint_name: 681 | print( 682 | f"Node Classification Endpoint Name: {classification_endpoint_name['EndpointName']}") 683 | if regression_endpoint_name: 684 | print( 685 | f"Node Regression Endpoint Name: {regression_endpoint_name['EndpointName']}") 686 | if prediction_endpoint_name: 687 | print( 688 | f"Link Prediction Endpoint Name: {prediction_endpoint_name['EndpointName']}") 689 | if edge_classification_endpoint_name: 690 | print( 691 | f"Edge Classification Endpoint Name: {edge_classification_endpoint_name['EndpointName']}") 692 | if edge_regression_endpoint_name: 693 | print( 694 | f"Edge Regression Endpoint Name: {edge_regression_endpoint_name['EndpointName']}") 695 | print('Endpoint creation complete', end='\r') 696 | return { 697 | 'node_classification_endpoint_name': classification_endpoint_name, 698 | 'node_regression_endpoint_name': regression_endpoint_name, 699 | 'prediction_endpoint_name': prediction_endpoint_name, 700 | 'edge_classification_endpoint_name': edge_classification_endpoint_name, 701 | 'edge_regression_endpoint_name': edge_regression_endpoint_name 702 | } 703 | -------------------------------------------------------------------------------- /social-network-recommendations/README.md: -------------------------------------------------------------------------------- 1 | ## Social network graph 2 | 3 | Social networks can naturally be expressed in the form of a graph, where the nodes represent people and the connections between people, such as friendship or coworkers, is represented by edges. Here is one illustration of such social network. Let's imagine that we have a social network whose members (nodes) are Bill, Terry, Henry, Cary and Alistair etc. The relationship between them are represented by a link (edge), and each person’s interests such as, sports, arts, games and comics are represented by node properties. 4 | 5 | ![](assets/social-network.png?raw=true) 6 | 7 | 8 | ## GNN message passing mechanism to generate node embedding 9 | The way a GNN transforms the intial node features to node embeddings, is by a technique called message passing. The process of message passing is illustrated in the figure below. In the beginning, the node attributes/features are converted into numerical attributes. 
In our case, we do one-hot encoding of the categorical features (Henry’s interests: arts, comics, games). Then, the first GNN layer aggregates all of the neighbors’ (Gary and Alistair) raw features (in black) to form a new set of features (in yellow). A common approach is to apply a linear transformation to all neighboring features, aggregate them through a normalized sum, and then pass the result into a non-linear activation function, such as ReLU, to generate a set of new features. The figure below illustrates how message passing works for node *Henry*. It shows the computations for Henry and Bill only; however, the GNN message passing algorithm computes representations for all of the graph nodes, which are later used as the input features for the second layer. 10 | 11 | 12 | 13 | ![](assets/message-passing.png?raw=true) 14 | 15 | The second layer of the GNN repeats the same process, but takes the previously computed features (in yellow) from the first layer as its input. The aggregation sums up all of the neighbors’ new embedded features (Gary and Alistair) and generates the second-layer feature vector for Henry (in orange). As you can see, by repeating the message passing mechanism, we extend the feature aggregation to 2-hop neighbors. In our illustration, we limit ourselves to 2-hop neighbors, but extending to 3-hop neighbors can be done in the same way by adding another GNN layer. 16 | 17 | 18 | 19 | The final embeddings of Henry and Bill (in orange) are used to compute the link score. During the training phase, the link score is defined as 1 when an edge exists between two nodes (positive sample), and as 0 when no edge exists between two nodes (negative sample). The error, or loss, between the actual score and the prediction *f(e1,e2)* is then back-propagated into the previous layers to adjust the weights of the aggregation functions and node embeddings. Once training is finished, we can rely on each node’s embedded feature vector to compute its link score with any other node using our function *f*. 20 | 21 | 22 | ## Train your Graph Convolution Network with Amazon Neptune ML 23 | 24 | The figure below illustrates the different steps Neptune ML takes to train a GNN-based recommendation system. You can follow the notebook and try each step with sample code.
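To make the mechanics above concrete, here is a minimal, self-contained sketch of two rounds of message passing followed by link scoring, written in plain NumPy over a small subset of the users and FRIEND edges defined in the accompanying notebook. The weight matrices are untrained, random placeholders used only to show the shape of the computation; this is not Neptune ML’s actual implementation, which builds and trains these layers with DGL.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hot interest features over the vocabulary [arts, comics, games, sports, electronics],
# for a small subset of the users created in the notebook.
interests = {
    "Henry":    ["arts", "comics", "games"],
    "Gary":     ["sports"],
    "Alistair": ["arts", "sports"],
    "Mary":     ["comics", "games"],
    "Bill":     ["arts", "comics", "games", "sports"],
    "Terry":    ["sports"],
    "Colin":    ["games", "sports"],
    "Sarah":    ["arts"],
}
vocab = ["arts", "comics", "games", "sports", "electronics"]
x = {u: np.array([1.0 if v in ints else 0.0 for v in vocab]) for u, ints in interests.items()}

# A subset of the undirected FRIEND edges from the notebook.
edges = [("Henry", "Gary"), ("Henry", "Alistair"), ("Henry", "Mary"),
         ("Bill", "Sarah"), ("Bill", "Colin"), ("Bill", "Terry"),
         ("Gary", "Terry"), ("Alistair", "Terry")]
neighbors = {u: [] for u in interests}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

def gnn_layer(h, W):
    """One round of message passing: linearly transform each neighbor's features,
    aggregate them with a normalized sum (mean), then apply a ReLU non-linearity."""
    out = {}
    for u, nbrs in neighbors.items():
        agg = np.mean([W @ h[v] for v in nbrs + [u]], axis=0)  # include the node itself
        out[u] = np.maximum(agg, 0.0)                           # ReLU
    return out

# Random, untrained weights; training would adjust these by back-propagating the loss.
W1 = rng.normal(scale=0.5, size=(4, len(vocab)))   # layer 1: 5 raw features -> 4 dims
W2 = rng.normal(scale=0.5, size=(4, 4))            # layer 2: 4 dims -> 4 dims

h1 = gnn_layer(x, W1)    # embeddings carrying 1-hop information
h2 = gnn_layer(h1, W2)   # embeddings carrying 2-hop information

def link_score(u, v):
    """Score f(e1, e2): here, a sigmoid over the dot product of the final embeddings."""
    return 1.0 / (1.0 + np.exp(-h2[u] @ h2[v]))

print("score(Henry, Bill)  =", round(link_score("Henry", "Bill"), 3))
print("score(Henry, Terry) =", round(link_score("Henry", "Terry"), 3))
```

During real training, the loss between these scores and the 0/1 labels of positive and negative edge samples is what updates `W1` and `W2`, exactly as described in the previous section.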
25 | 26 | 27 | 28 | ![](assets/NeptuneML-illustration.png?raw=true) -------------------------------------------------------------------------------- /social-network-recommendations/assets/NeptuneML-illustration.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-neptune-ml-use-cases/cd10587d16510a60b3446752995a51cc5faf81bd/social-network-recommendations/assets/NeptuneML-illustration.png -------------------------------------------------------------------------------- /social-network-recommendations/assets/message-passing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-neptune-ml-use-cases/cd10587d16510a60b3446752995a51cc5faf81bd/social-network-recommendations/assets/message-passing.png -------------------------------------------------------------------------------- /social-network-recommendations/assets/social-network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-neptune-ml-use-cases/cd10587d16510a60b3446752995a51cc5faf81bd/social-network-recommendations/assets/social-network.png -------------------------------------------------------------------------------- /social-network-recommendations/social-network-recommendations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "-" 8 | } 9 | }, 10 | "source": [ 11 | "# Social Network Recommendations\n", 12 | "\n", 13 | "In this example, we're going to build a powerful social network predictive capability with Neptune ML. The techniques introduced here can be used to build predictions in other domains outside of social networks.\n", 14 | "\n", 15 | "You can quickly set up the environment by using the Neptune ML AWS CloudFormation template: \n", 16 | "https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-quick-start.html\n", 17 | "\n", 18 | "This code extends the Neptune ML code samples preconfigured by the CloudFormation template above. \n", 19 | "\n", 20 | "### People You May Know\n", 21 | "\n", 22 | "Recommender systems are one of the most widely adopted machine learning technologies in real-world applications, ranging from social networks to e-commerce platforms. In a social network, one common use case is to recommend new friends to a user based on the user’s existing friendships. Users that have common friends are likely to know each other, and should therefore receive a higher score from the recommendation system if they are not yet connected." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Setup\n", 30 | "\n", 31 | "Before we begin, we'll clear any existing data from our Neptune cluster, using the cell magic `%%gremlin` and a subsequent drop query:" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "%%gremlin\n", 41 | "\n", 42 | "g.V().drop()" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "How do we know which Neptune cluster to access? 
The cell magics exposed by Neptune Notebooks use a configuration located by default under `~/graph_notebook_config.json` At the time of initialization of the Sagemaker instance, this configuration is generated using environment variables derived from the cluster being connected to. \n", 50 | "\n", 51 | "You can check the contents of the configuration in two ways. You can print the file itself, or you can look for the configuration being used by the notebook which you have opened." 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "scrolled": true 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "%%bash\n", 63 | "\n", 64 | "cat ~/graph_notebook_config.json" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Create a Social Network\n", 72 | "\n", 73 | "Next, we'll create a small social network. Note that the script below comprises a single statement. All the vertices and edges here will be created in the context of a single transaction." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "%%gremlin\n", 83 | "\n", 84 | "g.\n", 85 | "addV('User').property('name','Bill').property('interests', 'arts;comics;games;sports').\n", 86 | "addV('User').property('name','Sarah').property('interests', 'arts').\n", 87 | "addV('User').property('name','Ben').property('interests', 'electronics').\n", 88 | "addV('User').property('name','Lucy').property('interests', 'electronics').\n", 89 | "addV('User').property('name','Colin').property('interests', 'games;sports').\n", 90 | "addV('User').property('name','Emily').property('interests', 'sports').\n", 91 | "addV('User').property('name','Gordon').property('interests', 'sports').\n", 92 | "addV('User').property('name','Kate').property('interests', 'arts').\n", 93 | "addV('User').property('name','Peter').property('interests', 'games').\n", 94 | "addV('User').property('name','Terry').property('interests', 'sports').\n", 95 | "addV('User').property('name','Alistair').property('interests', 'arts;sports').\n", 96 | "addV('User').property('name','Eve').property('interests', 'arts;electronics').\n", 97 | "addV('User').property('name','Gary').property('interests', 'sports').\n", 98 | "addV('User').property('name','Mary').property('interests', 'comics;games').\n", 99 | "addV('User').property('name','Charlie').property('interests', 'games;electronics').\n", 100 | "addV('User').property('name','Sue').property('interests', 'electronics').\n", 101 | "addV('User').property('name','Arnold').property('interests', 'comics;games').\n", 102 | "addV('User').property('name','Chloe').property('interests', 'sports').\n", 103 | "addV('User').property('name','Henry').property('interests', 'arts;comics;games').\n", 104 | "addV('User').property('name','Josie').property('interests', 'electronics').\n", 105 | "V().hasLabel('User').has('name','Sarah').as('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').\n", 106 | "V().hasLabel('User').has('name','Colin').as('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').\n", 107 | "V().hasLabel('User').has('name','Terry').as('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').\n", 108 | "V().hasLabel('User').has('name','Peter').as('a').V().hasLabel('User').has('name','Colin').addE('FRIEND').to('a').\n", 109 | 
"V().hasLabel('User').has('name','Kate').as('a').V().hasLabel('User').has('name','Ben').addE('FRIEND').to('a').\n", 110 | "V().hasLabel('User').has('name','Kate').as('a').V().hasLabel('User').has('name','Lucy').addE('FRIEND').to('a').\n", 111 | "V().hasLabel('User').has('name','Eve').as('a').V().hasLabel('User').has('name','Lucy').addE('FRIEND').to('a').\n", 112 | "V().hasLabel('User').has('name','Alistair').as('a').V().hasLabel('User').has('name','Kate').addE('FRIEND').to('a').\n", 113 | "V().hasLabel('User').has('name','Gary').as('a').V().hasLabel('User').has('name','Colin').addE('FRIEND').to('a').\n", 114 | "V().hasLabel('User').has('name','Gordon').as('a').V().hasLabel('User').has('name','Emily').addE('FRIEND').to('a').\n", 115 | "V().hasLabel('User').has('name','Alistair').as('a').V().hasLabel('User').has('name','Emily').addE('FRIEND').to('a').\n", 116 | "V().hasLabel('User').has('name','Terry').as('a').V().hasLabel('User').has('name','Gordon').addE('FRIEND').to('a').\n", 117 | "V().hasLabel('User').has('name','Alistair').as('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').\n", 118 | "V().hasLabel('User').has('name','Gary').as('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').\n", 119 | "V().hasLabel('User').has('name','Mary').as('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').\n", 120 | "V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Alistair').addE('FRIEND').to('a').\n", 121 | "V().hasLabel('User').has('name','Sue').as('a').V().hasLabel('User').has('name','Eve').addE('FRIEND').to('a').\n", 122 | "V().hasLabel('User').has('name','Sue').as('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').\n", 123 | "V().hasLabel('User').has('name','Josie').as('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').\n", 124 | "V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').\n", 125 | "V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Mary').addE('FRIEND').to('a').\n", 126 | "V().hasLabel('User').has('name','Mary').as('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').\n", 127 | "V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').\n", 128 | "V().hasLabel('User').has('name','Chloe').as('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').\n", 129 | "V().hasLabel('User').has('name','Henry').as('a').V().hasLabel('User').has('name','Arnold').addE('FRIEND').to('a').\n", 130 | "next()" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "This is what the network looks like:\n", 138 | " \n", 139 | "" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### Check the number of users in the graph " 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "%%gremlin\n", 156 | "g.V().groupCount().by(label).unfold()" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "### Check the number of relations among users" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "%%gremlin\n", 173 | "g.E().groupCount().by(label).unfold()" 174 | ] 175 | }, 176 | { 177 | 
"cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "### Explore Henry's friends" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "%%gremlin\n", 190 | "\n", 191 | "g.V().hasLabel('User').has('name', 'Henry').both('FRIEND').groupCount().by('name')" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "### Explore Henry's interests" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "%%gremlin\n", 208 | "\n", 209 | "g.V().hasLabel('User').has('name', 'Henry').values('interests')" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "### Generate a recommendation by simple query\n", 217 | "\n", 218 | "Let's now create a simple query to recommend for a specific user.\n", 219 | "\n", 220 | "In the query below, we're finding the vertex that represents our user. We're then traversing `FRIEND` relationships (we don't care about relationship direction, so we're using `both()`) to find that user's immediate friends. We're then traversing another hop into the graph, looking for friends of those friends who _are not currently connected to our user.\n", 221 | "\n", 222 | "We then count the paths to these candidate friends, and order the results based on the number of times we can reach a candidate via one of the user's immediate friends." 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": { 229 | "scrolled": true 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "%%gremlin\n", 234 | "\n", 235 | "g.V().hasLabel('User').has('name', 'Henry').as('user'). \n", 236 | " both('FRIEND').aggregate('friends'). \n", 237 | " both('FRIEND').\n", 238 | " where(P.neq('user')).where(P.without('friends')). \n", 239 | " groupCount().by('name'). \n", 240 | " order(Scope.local).by(values, Order.decr).\n", 241 | " next()" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 248 | "## Train your Graph Convolution Network with Amazon Neptune ML\n", 249 | "\n", 250 | "Neptune ML uses graph neural network technology to automatically creates, trains, and applies ML models on your graph data. Neptune ML supports common graph prediction tasks such as node classification, node regression, edge classification and regression, and link prediction. \n", 251 | "It is powered by: \n", 252 | "- **Amazon Neptune:** a purpose-built, high-performance managed graph database, which is optimized for storing billions of relationships and querying the graph with milliseconds latency. Learn more at Overview of Amazon Neptune Features.\n", 253 | "- **Amazon SageMaker:** a fully managed service that provides every developer and data scientist with the ability to prepare build, train, and deploy machine learning (ML) models quickly. \n", 254 | "- **Deep Graph Library (DGL):** an open-source, high performance and scalable Python package for deep learning on graphs. It provides fast and memory-efficient message passing primitives for training Graph Neural Networks. Neptune ML uses DGL to automatically choose and train the best ML model for your workload, enabling you to make ML-based predictions on graph data in hours instead of weeks. 
" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### Data export and configuration\n", 262 | "\n", 263 | "The first step in our Neptune ML process is to export the graph data from the Neptune Cluster." 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "#### Setup for S3 bucket" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "s3_bucket_uri=\"s3://(put-your-bucket-name-here-****)/neptune-ml-social-network-recommendation/\"\n", 280 | "# remove trailing slashes\n", 281 | "s3_bucket_uri = s3_bucket_uri[:-1] if s3_bucket_uri.endswith('/') else s3_bucket_uri\n", 282 | "s3_bucket_uri" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "HOME_DIRECTORY = '~'\n", 292 | "\n", 293 | "import os \n", 294 | "import json\n", 295 | "import logging\n", 296 | "def load_configuration():\n", 297 | " with open(os.path.expanduser(f'{HOME_DIRECTORY}/graph_notebook_config.json')) as f:\n", 298 | " data = json.load(f)\n", 299 | " host = data['host']\n", 300 | " port = data['port']\n", 301 | " if data['auth_mode'] == 'IAM':\n", 302 | " iam = True\n", 303 | " else:\n", 304 | " iam = False\n", 305 | " return host, port, iam\n", 306 | "\n", 307 | "\n", 308 | "def get_host():\n", 309 | " host, port, iam = load_configuration()\n", 310 | " return host" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "neptune_host = get_host()\n", 320 | "neptune_host" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "from urllib.parse import urlparse\n", 330 | "\n", 331 | "def get_export_service_host():\n", 332 | " with open(os.path.expanduser(f'{HOME_DIRECTORY}/.bashrc')) as f:\n", 333 | " data = f.readlines()\n", 334 | " print(data)\n", 335 | " for d in data:\n", 336 | " if str.startswith(d, 'export NEPTUNE_EXPORT_API_URI'):\n", 337 | " parts = d.split('=')\n", 338 | " if len(parts) == 2:\n", 339 | " path = urlparse(parts[1].rstrip())\n", 340 | " return path.hostname + \"/v1\"\n", 341 | " logging.error(\n", 342 | " \"Unable to determine the Neptune Export Service Endpoint. You will need to enter this or assign it manually.\")\n", 343 | " return None" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "#### export_params\n", 351 | "\n", 352 | "The first step in our Neptune ML process is to export the graph data from the Neptune Cluster. To do so, we need to specify the parameters for the data export and model configuration. Here is our example of export parameters. \n", 353 | "\n", 354 | "In export_params, we need to configure the basic setup such as the neptune host and output S3 path for exported data storage. The configuration specified in additionalParams is the type of machine learning task to perform. In this example, link prediction is optionally used to predict a particular edge type (User—FRIEND—User). If no target type is specified, Neptune ML will assume that the task is Link Prediction. The parameters also specify details about the data stored in our graph and how the machine learning model will interpret that data (we have “User” as node, and node property as “interests”). 
" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "export_params={ \n", 364 | "\"command\": \"export-pg\", \n", 365 | "\"params\": { \"endpoint\": neptune_host,\n", 366 | " \"profile\": \"neptune_ml\",\n", 367 | " \"cloneCluster\": False\n", 368 | " }, \n", 369 | "\"outputS3Path\": f'{s3_bucket_uri}/neptune-export',\n", 370 | "\"additionalParams\": {\n", 371 | " \"neptune_ml\": {\n", 372 | " \"version\": \"v2.0\",\n", 373 | " \"targets\": [\n", 374 | " {\n", 375 | " \"edge\": [\"User\", \"FRIEND\", \"User\"],\n", 376 | " \"type\" : \"link_prediction\"\n", 377 | " }\n", 378 | " ],\n", 379 | " \"features\": [\n", 380 | " {\n", 381 | " \"node\": \"User\",\n", 382 | " \"property\": \"interests\",\n", 383 | " \"type\": \"category\",\n", 384 | " \"separator\": \";\"\n", 385 | " }\n", 386 | " ]\n", 387 | " }\n", 388 | " },\n", 389 | "\"jobSize\": \"small\"}" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "%%neptune_ml export start --export-url {get_export_service_host()} --export-iam --wait --store-to export_results\n", 399 | "${export_params}" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "Once the export job succeed, we will have the Neptune graph DB exported into CSV format and stored in an S3 bucket. There will be two types of files; nodes.csv and edges.csv. training-data-configuration.json: will also be generated which has configuration needed for Neptune ML to do model training. See [export data from Neptune for Neptune ML](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-data-export.html)\n" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "## Data processing\n", 414 | "\n", 415 | "Neptune ML performs feature extraction and encoding as part of the data-processing steps. Common types of pre-processing of properties are: encoding categorical features through one-hot encoding, bucketing numerical features, or using word2vec to encode a string property or other free-form text property values.\n", 416 | "\n", 417 | "In our example, we will simply use the property “interests”. Neptune ML encodes the values as multi-categorical. However, if such categorical value is complex, i.e. more than 3 words per node. Neptune ML infers the property type to be text and uses the text_word2vec encoding." 
418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": null, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "# The processing_job_name can be set to a unique value below, otherwise one will be auto-generated\n", 427 | "import time \n", 428 | "processing_job_name=f'social-link-prediction-processing-{int(time.time())}'\n", 429 | "\n", 430 | "processing_params = f\"\"\"\n", 431 | "--config-file-name training-data-configuration.json\n", 432 | "--job-id {processing_job_name} \n", 433 | "--s3-input-uri {export_results['outputS3Uri']} \n", 434 | "--s3-processed-uri {str(s3_bucket_uri)}/preloading \"\"\"" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "%neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "At the end of this step, a DGL (Deep Graph Library) graph is generated from the exported dataset for the model training step to use. Data processing also produces a model-hpo-configuration.json file that defines the hyperparameter ranges used by the Hyperparameter Optimization (HPO) tuning jobs. We can download and modify this file to tune the model’s hyperparameters, such as batch-size, num-hidden, num-epochs, dropout, etc. \n", 451 | "\n", 452 | "See [Processing the graph data exported from Neptune for training](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-on-graphs-processing.html)\n", 453 | "\n" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "## Model training\n", 461 | "\n", 462 | "The next step in the process is the automated training of the GNN model. The model training is done in two stages. The first stage uses a SageMaker Processing job to generate a model training strategy: a configuration set that specifies what type of model and which model hyperparameter ranges will be used for the training. \n", 463 | "Then a SageMaker hyperparameter tuning job is launched. " 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "#### Important! Change the batch size \n", 471 | "\n", 472 | "Neptune ML automatically tunes the model with Hyperparameter Optimization tuning jobs whose ranges are defined in model-hpo-configuration.json. You can modify this file to tune the model according to the given parameters, such as batch-size, num-hidden, num-epochs, dropout, etc. \n", 473 | "\n", 474 | "We illustrate how to change the batch size. Because our example network is unusually tiny, the batch size must be reduced to prevent the training job from failing. 
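The cells below download model-hpo-configuration.json from the processing output location, print it, and re-upload it after the batch-size entry has been edited by hand. As an alternative, the edit can be applied programmatically once the file is on disk. The sketch below assumes only that batch-size entries look like the snippet shown further down (a dict with `param`, `range`, and `default` keys); it is not an official Neptune ML utility:

```python
import json

def set_batch_size_range(node, new_range=(2, 4), new_default=2):
    """Recursively walk the config and rewrite any 'batch-size' entry."""
    if isinstance(node, dict):
        if node.get("param") == "batch-size":
            node["range"] = list(new_range)
            node["default"] = new_default
        for value in node.values():
            set_batch_size_range(value, new_range, new_default)
    elif isinstance(node, list):
        for item in node:
            set_batch_size_range(item, new_range, new_default)

# Assumes the file has already been downloaded locally, as in the cells below.
with open("model-hpo-configuration.json") as f:
    hpo_config = json.load(f)

set_batch_size_range(hpo_config)

with open("model-hpo-configuration.json", "w") as f:
    json.dump(hpo_config, f, indent=2)
```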
" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [ 483 | "prcossing_location = processing_results['processingJob']['outputLocation']" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "bucket_name, key_name = prcossing_location.replace(\"s3://\", \"\").split(\"/\", 1)" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "import boto3\n", 502 | "\n", 503 | "s3 = boto3.client('s3')\n", 504 | "s3.download_file(bucket_name,key_name + '/model-hpo-configuration.json','model-hpo-configuration.json')" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "!cat model-hpo-configuration.json" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "#### Replace batch-size as our network size is tiny \n", 521 | " {\n", 522 | " \"param\": \"batch-size\",\n", 523 | " \"range\": [\n", 524 | " 2,\n", 525 | " 4\n", 526 | " ],\n", 527 | " \"inc_strategy\": \"power2\",\n", 528 | " \"type\": \"int\",\n", 529 | " \"default\": 2\n", 530 | " }," 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": {}, 537 | "outputs": [], 538 | "source": [ 539 | "s3.upload_file('model-hpo-configuration.json', bucket_name, key_name + '/model-hpo-configuration.json')" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "The SageMaker Hyperparameter Tuning Optimization job runs a pre-specified number of model training job trials on the processed data, try different hyperparameter combinaisons according to **model-hpo-configuration.json**, and stores the model artifacts generated by the training in the output S3 location. " 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": null, 552 | "metadata": {}, 553 | "outputs": [], 554 | "source": [ 555 | "training_job_name=f'social-link-prediction-{int(time.time())}'\n", 556 | "\n", 557 | "training_params=f\"\"\"\n", 558 | "--job-id {training_job_name} \n", 559 | "--data-processing-id {processing_job_name} \n", 560 | "--instance-type ml.c5.xlarge\n", 561 | "--s3-output-uri {str(s3_bucket_uri)}/training \"\"\"" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "%neptune_ml training start --wait --store-to training_results {training_params}" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "## Create an inference endpoint in Amazon SageMaker\n", 578 | "\n", 579 | "Now that the graph representation is learned, we can deploy the learned model behind an endpoint to perform inference requests." 
580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": {}, 586 | "outputs": [], 587 | "source": [ 588 | "endpoint_params=f\"\"\"\n", 589 | "--job-id {training_job_name} \n", 590 | "--model-job-id {training_job_name}\"\"\"" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": null, 596 | "metadata": {}, 597 | "outputs": [], 598 | "source": [ 599 | "%neptune_ml endpoint create --wait --store-to endpoint_results {endpoint_params}" 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": null, 605 | "metadata": {}, 606 | "outputs": [], 607 | "source": [ 608 | "endpoint_name=endpoint_results['endpoint']['name']" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "## Query the machine learning model using Gremlin\n", 616 | "\n", 617 | "Once the endpoint is ready, we can use it for graph inference queries. In our example, we can now get friend recommendations from Neptune ML for the user “Henry”. The query uses almost exactly the same syntax as before: it traverses the FRIEND edge and lists the other User vertices connected to Henry through a predicted FRIEND connection." 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "%%gremlin\n", 627 | "g.with(\"Neptune#ml.endpoint\",\"${endpoint_name}\").\n", 628 | " V().hasLabel('User').has('name', 'Henry').\n", 629 | " out('FRIEND').with(\"Neptune#ml.prediction\").hasLabel('User').values('name')" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": {}, 635 | "source": [ 636 | "Here is another sample prediction query, used to predict the top eight users who are most likely to connect with Henry." 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "%%gremlin\n", 646 | "g.with(\"Neptune#ml.endpoint\",\"${endpoint_name}\").with(\"Neptune#ml.limit\",8).\n", 647 | " V().hasLabel('User').has('name', 'Henry').out('FRIEND').with(\"Neptune#ml.prediction\").hasLabel('User').values('name')" 648 | ] 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "### Delete the endpoint \n", 655 | "\n", 656 | "Now that you have completed this walkthrough, you have a SageMaker inference endpoint that is still running and will incur standard charges. If you are done trying out Neptune ML and would like to avoid these recurring costs, run the cell below to delete the inference endpoint." 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": {}, 663 | "outputs": [], 664 | "source": [ 665 | "import boto3\n", 666 | "sm_boto3 = boto3.client('sagemaker')\n", 667 | "sm_boto3.delete_endpoint(EndpointName=endpoint_name)" 668 | ] 669 | }, 670 | { 671 | "cell_type": "markdown", 672 | "metadata": {}, 673 | "source": [ 674 | "## Model transform or retraining when graph data changes" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": {}, 680 | "source": [ 681 | "In scenarios where the graph changes continuously, you may need to update ML predictions with the newest graph data. The model artifacts generated by training are directly tied to the training graph, which means that the inference endpoint needs to be updated once the entities in the original training graph change.
\n", 682 | "\n", 683 | "However, you don’t need to retrain the whole model in order to make predictions on the updated graph. With incremental model inference workflow, you only need to export the data from Neptune DB, incremental data preprocessing, model transform and update the inference endpoint. The model-transform step takes the trained model from the main workflow and the results of the incremental data preprocessing step as inputs, and output new model artifact to use for inference. This new model artifact has the up-to-date graph. \n", 684 | "\n", 685 | "See more Neptune ML implementation details at [Generating new model artifacts](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-model-artifacts.html)" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "metadata": {}, 692 | "outputs": [], 693 | "source": [] 694 | } 695 | ], 696 | "metadata": { 697 | "kernelspec": { 698 | "display_name": "Python 3", 699 | "language": "python", 700 | "name": "python3" 701 | }, 702 | "language_info": { 703 | "codemirror_mode": { 704 | "name": "ipython", 705 | "version": 3 706 | }, 707 | "file_extension": ".py", 708 | "mimetype": "text/x-python", 709 | "name": "python", 710 | "nbconvert_exporter": "python", 711 | "pygments_lexer": "ipython3", 712 | "version": "3.6.13" 713 | } 714 | }, 715 | "nbformat": 4, 716 | "nbformat_minor": 4 717 | } 718 | --------------------------------------------------------------------------------