├── .github ├── solutionid_validator.sh └── workflows │ └── maintainer_workflows.yml ├── .gitignore ├── CODEOWNERS ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── architecture ├── LaunchStack.jpg └── architecture-diagram.png ├── cfn_template ├── preprocess-multimodal-data ├── README.md ├── clinical │ └── preprocess-clinical.ipynb ├── genomic │ └── preprocess-genomic.ipynb ├── medical-imaging │ ├── imaging-radiomics.ipynb │ ├── preprocess-imaging.ipynb │ └── src │ │ ├── Api.py │ │ └── radiogenomics-imaging-workflow.json └── single-patient-records.ipynb ├── store-multimodal-data ├── clinical │ └── README.md ├── genomic │ ├── store-analyze-genomicdata-with-awshealthomics.ipynb │ └── utils.py └── medical-imaging │ ├── src │ └── Api.py │ └── store-imagingdata-with-awshealthimaging.ipynb ├── train-test-ml-model ├── README.md ├── hypertension-train-test-deploy.ipynb └── train-test-model.ipynb └── visualization-dashboard └── README.md /.github/solutionid_validator.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | #set -e 3 | 4 | echo "checking solution id $1" 5 | echo "grep -nr --exclude-dir='.github' "$1" ./.." 6 | result=$(grep -nr --exclude-dir='.github' "$1" ./..) 7 | if [ $? 
-eq 0 ] 8 | then 9 | echo "Solution ID $1 found\n" 10 | echo "$result" 11 | exit 0 12 | else 13 | echo "Solution ID $1 not found" 14 | exit 1 15 | fi 16 | 17 | export result 18 | -------------------------------------------------------------------------------- /.github/workflows/maintainer_workflows.yml: -------------------------------------------------------------------------------- 1 | # Workflows managed by aws-solutions-library-samples maintainers 2 | name: Maintainer Workflows 3 | on: 4 | # Triggers the workflow on push or pull request events but only for the "main" branch 5 | push: 6 | branches: [ "main" ] 7 | pull_request: 8 | branches: [ "main" ] 9 | types: [opened, reopened, edited] 10 | 11 | jobs: 12 | CheckSolutionId: 13 | runs-on: ubuntu-latest 14 | steps: 15 | - uses: actions/checkout@v4 16 | - name: Run solutionid validator 17 | run: | 18 | chmod u+x ./.github/solutionid_validator.sh 19 | ./.github/solutionid_validator.sh ${{ vars.SOLUTIONID }} -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ### vsCode ### 2 | .vscode 3 | 4 | ### macOS ### 5 | # General 6 | .DS_Store 7 | .AppleDouble 8 | .LSOverride 9 | # Files that might appear in the root of a volume 10 | .DocumentRevisions-V100 11 | .fseventsd 12 | .Spotlight-V100 13 | .TemporaryItems 14 | .Trashes 15 | .VolumeIcon.icns 16 | .com.apple.timemachine.donotpresent 17 | 18 | **/node_modules 19 | **/dist 20 | **/build 21 | **/.DS_Store 22 | **/.angular 23 | **/.idea 24 | cdk.out 25 | **/data/**.csv 26 | 27 | .env 28 | -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | CODEOWNERS @aws-solutions-library-samples/maintainers 2 | /.github/workflows/maintainer_workflows.yml @aws-solutions-library-samples/maintainers 3 | /.github/solutionid_validator.sh 
@aws-solutions-library-samples/maintainers 4 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. 
You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). 
Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT No Attribution 2 | 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so. 10 | 11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 17 | 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Multimodal Data Analysis with AWS Health and Machine Learning Services 2 | 3 | This repository contains code samples related to the AWS [Guidance for Multimodal Data Analysis with AWS Health and Machine Learning Services](https://aws.amazon.com/solutions/guidance/multi-modal-data-analysis-with-aws-health-and-ml-services/). 
You can follow the given instructions to build an end-to-end framework for storing, integrating, and analyzing genomic, clinical, and medical imaging data. 4 | 5 | As an example, we will use the [Synthea Coherent Data Set](https://registry.opendata.aws/synthea-coherent-data/), an open-source, synthetic dataset that includes FHIR resources, MRI DICOM images, genomic data, physiological data (i.e., ECGs), and simple clinical notes. All the data types are linked together by FHIR. For the purpose of demonstration, we already converted the genomic data (originally released as CSV files with variant and annotation information) to VCF format, that is compatible with AWS HealthOmics. Similarly, we reformatted the clinical data into NDJSON files that comply with AWS HealthLake. For ease of use, we have made this derived dataset available on an Amazon S3 bucket (s3://guidance-multimodal-hcls-healthai-machinelearning) with public access. Since the Coherent Data Set focuses on cardiovascular disease, we consider the use case of predicting patients' outcomes, such as hypertension, stroke, coronary heart disease, and Alzheimer's disease. 6 | 7 | 8 | #### Architecture Overview 9 | 10 | The following diagram provides an overview of the architecture and the steps followed to ingest, store, integrate, and analyze multimodal data leveraging AWS services. 11 | 12 | ![Figure 1: Architecture for multimodal data analysis with purpose-built Health AI and machine learning services on AWS](./architecture/architecture-diagram.png) 13 | 14 | #### Instructions 15 | 16 | All of the data analytics and AI modeling are done using [Amazon SageMaker](https://aws.amazon.com/sagemaker/). 
You can create the Amazon SageMaker domain with a one-click deployment (if you already have an S3 bucket or IAM role with the same names used in the template, adjust the corresponding template parameter values): 17 | 18 | [![deployment](architecture/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/template?stackName=sagemakerstack&templateURL=https://aws-opendata-demo.s3.amazonaws.com/sagemaker_template.yaml) 19 | 20 | Once the CloudFormation stack is created, go to the Studio domain created in the Amazon SageMaker management console, select the provisioned user profile, and launch the Studio application. Once the Studio application is ready, follow [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-git.html) to clone [this code repository](https://github.com/aws-solutions-library-samples/guidance-for-multimodal-hcls-data-analysis-with-omics-healthlake-imaging-and-sagemaker-on-aws.git) in SageMaker Studio and run the following notebooks. 21 | 22 | 1. Store multimodal data with purpose-built Health AI services ([AWS HealthOmics](https://aws.amazon.com/omics/), [AWS HealthLake](https://aws.amazon.com/healthlake/), and [AWS HealthImaging](https://aws.amazon.com/healthlake/imaging/)) 23 | * To store each data type in the purpose-built Health AI service, follow the artifacts in the corresponding folders. 24 | * genomic - Run the notebook store-multimodal-data/genomic/store-analyze-genomicdata-with-awshealthomics.ipynb. This creates AWS HealthOmics data stores (Reference Store, Variant Store, and Annotation Store) to import the reference genome, VCF files, and the ClinVar annotation file. 25 | * clinical - Follow the instructions in store-multimodal-data/clinical/README.md to create an AWS HealthLake data store and import NDJSON files. 26 | * medical imaging - First, run the notebook store-multimodal-data/medical-imaging/store-imagingdata-with-awshealthimaging.ipynb to create AWS HealthImaging data stores and import DICOM files. 
Then, run preprocess-multimodal-data/medical-imaging/imaging-radiomics.ipynb to generate radiomic features from multiple images in parallel using Amazon SageMaker Processing. 27 | 28 | 2. Preprocess and analyze multimodal data with [AWS Lake Formation](https://aws.amazon.com/lake-formation/), [Amazon Athena](https://aws.amazon.com/athena/), and [Amazon SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/?sagemaker-data-wrangler-whats-new.sort-by=item.additionalFields.postDateTime&sagemaker-data-wrangler-whats-new.sort-order=desc) 29 | * To prepare and analyze the multimodal data for downstream analysis (e.g., querying with Amazon Athena, training a machine learning (ML) model with Amazon SageMaker), follow the artifacts in the corresponding folders. 30 | * genomic - Run the notebook preprocess-multimodal-data/genomic/preprocess-genomic.ipynb to execute feature engineering and store the final set of genomic features in SageMaker Feature Store, a fully managed service for machine learning features. 31 | * clinical - Run the notebook preprocess-multimodal-data/clinical/preprocess-clinical.ipynb to execute feature engineering and store the final set of clinical features in SageMaker Feature Store. 32 | * medical imaging - Run the notebook preprocess-multimodal-data/medical-imaging/preprocess-imaging.ipynb to execute feature engineering and store the final set of radiomic features in SageMaker Feature Store. 33 | 34 | At the end of this step, you have created three Feature Groups in SageMaker Feature Store, one for each data modality. We will use these features in the following steps to train a machine learning model. 35 | 36 | 3. 
Build, train, test, and deploy machine learning models with Amazon SageMaker 37 | * For the given use case of patients' outcome prediction, we will use [SageMaker AutoGluon](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html), an AutoML framework that ensembles multiple ML models to improve predictive performance. To train and test the ML model on the multimodal feature set, run the notebook train-test-ml-model/train-test-model.ipynb. This generates evaluation metrics (accuracy, precision, recall, and F1 score) for the four outcomes (hypertension, stroke, coronary heart disease, and Alzheimer's disease). 38 | 39 | 4. Create data visualization dashboards with Amazon QuickSight 40 | * Follow the instructions in visualization-dashboard/README.md to use the interactive dashboards and get a comprehensive view of patients across all three data modalities. These dashboards mitigate the challenge of data silos and help end users (e.g., clinicians, bioinformaticians, radiologists) easily access, view, and compare clinical, genomic, and medical imaging data across individual patients and cohorts. Prior to any production deployment, please coordinate with your local security team to ensure appropriate authentication/authorization is in place. 41 | 42 | 43 | #### Security 44 | 45 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 46 | 47 | #### License 48 | 49 | This library is licensed under the MIT-0 License. See the LICENSE file. 
50 | 51 | -------------------------------------------------------------------------------- /architecture/LaunchStack.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-solutions-library-samples/guidance-for-multi-modal-data-analysis-with-aws-health-and-ml-services/bb7c42edb5991c733fbf96382293a0820192a3f2/architecture/LaunchStack.jpg -------------------------------------------------------------------------------- /architecture/architecture-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-solutions-library-samples/guidance-for-multi-modal-data-analysis-with-aws-health-and-ml-services/bb7c42edb5991c733fbf96382293a0820192a3f2/architecture/architecture-diagram.png -------------------------------------------------------------------------------- /cfn_template: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | Description: 'Create resources needed for the AWS Solution: SO9254' 3 | 4 | Parameters: 5 | ParameterVPCId: 6 | Description: ID of the AWS Virtual Private Cloud (VPC) 7 | Type: AWS::EC2::VPC::Id 8 | ParameterSubnet1Id: 9 | Description: SubnetId, for Availability Zone 1 in the region in your VPC 10 | Type: AWS::EC2::Subnet::Id 11 | ParameterSubnet2Id: 12 | Description: SubnetId, for Availability Zone 2 in the region in your VPC 13 | Type: AWS::EC2::Subnet::Id 14 | HealthImagingImportRoleName: 15 | Type: String 16 | Default: HealthImagingImportJobRole 17 | Description: This is an IAM role used by Amazon HealthImaging to import data. If you have an IAM role with the same name, please change this name. 18 | HealthOmicsImportRoleName: 19 | Type: String 20 | Default: OmicsUnifiedJobRole 21 | Description: This is an IAM role used by Amazon HealthOmics to import data. If you have an IAM role with the same name, please change this name. 
22 | CreateS3BucketforSageMaker: 23 | Description: Do you want to create a S3 Bucket name like sagemaker-AWS::Region-AWS::AccountId 24 | Type: String 25 | Default: false 26 | AllowedValues: 27 | - true 28 | - false 29 | 30 | Conditions: 31 | CreateS3Bucket: !Equals 32 | - true 33 | - !Ref 'CreateS3BucketforSageMaker' 34 | 35 | Resources: 36 | SageMakerBucket: 37 | Condition: CreateS3Bucket 38 | Type: AWS::S3::Bucket 39 | Properties: 40 | BucketName: !Sub sagemaker-${AWS::Region}-${AWS::AccountId} 41 | VersioningConfiguration: 42 | Status: Enabled 43 | AccessControl: Private 44 | PublicAccessBlockConfiguration: 45 | BlockPublicAcls: TRUE 46 | BlockPublicPolicy: TRUE 47 | IgnorePublicAcls: TRUE 48 | RestrictPublicBuckets: TRUE 49 | BucketEncryption: 50 | ServerSideEncryptionConfiguration: 51 | - ServerSideEncryptionByDefault: 52 | SSEAlgorithm: AES256 53 | 54 | SageMakerExecutionRole: 55 | Type: "AWS::IAM::Role" 56 | DependsOn: HealthImagingImportJobRole 57 | Properties: 58 | RoleName: SageMakerStudioExecutionRole 59 | AssumeRolePolicyDocument: 60 | Version: 2012-10-17 61 | Statement: 62 | - Effect: Allow 63 | Principal: 64 | Service: 65 | - sagemaker.amazonaws.com 66 | - omics.amazonaws.com 67 | - medical-imaging.amazonaws.com 68 | - states.amazonaws.com 69 | - glue.amazonaws.com 70 | - codepipeline.amazonaws.com 71 | - codebuild.amazonaws.com 72 | Action: 73 | - "sts:AssumeRole" 74 | Policies: 75 | - PolicyName: S3andMedicalImaging 76 | PolicyDocument: 77 | Version: '2012-10-17' 78 | Statement: 79 | - Effect: Allow 80 | Action: 81 | - "s3:GetObject" 82 | - "s3:PutObject" 83 | - "s3:DeleteObject" 84 | - "s3:ListBucket" 85 | - "s3:GetEncryptionConfiguration" 86 | Resource: "arn:aws:s3:::*" 87 | - Effect: Allow 88 | Action: 89 | - "medical-imaging:CreateDatastore" 90 | - "medical-imaging:Get*" 91 | - "medical-imaging:List*" 92 | - "medical-imaging:Update*" 93 | - "medical-imaging:StartDICOMImportJob" 94 | - "medical-imaging:DeleteDatastore" 95 | - 
"medical-imaging:DeleteImageSet" 96 | Resource: "*" 97 | - Effect: Allow 98 | Action: 99 | - "iam:GetUser" 100 | - "iam:GetPolicy" 101 | - "iam:CreatePolicy" 102 | - "iam:GetRole" 103 | - "iam:CreateRole" 104 | - "iam:AttachRolePolicy" 105 | Resource: "*" 106 | - Effect: Allow 107 | Action: 108 | - "glue:CreateCrawler" 109 | - "glue:GetCrawler" 110 | - "glue:StartCrawler" 111 | - "glue:DeleteCrawler" 112 | Resource: "*" 113 | - Effect: "Allow" 114 | Action: 115 | - "iam:PassRole" 116 | Resource: 117 | - !GetAtt HealthImagingImportJobRole.Arn 118 | - !GetAtt OmicsAccessRole.Arn 119 | - !Sub arn:aws:iam::${AWS::AccountId}:role/SageMakerStudioExecutionRole 120 | Path: / 121 | ManagedPolicyArns: 122 | - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess 123 | - arn:aws:iam::aws:policy/AmazonHealthLakeFullAccess 124 | - arn:aws:iam::aws:policy/AmazonOmicsFullAccess 125 | - arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin 126 | - arn:aws:iam::aws:policy/AmazonAthenaFullAccess 127 | 128 | HealthImagingImportJobRole: 129 | Type: "AWS::IAM::Role" 130 | Properties: 131 | RoleName: !Ref HealthImagingImportRoleName 132 | AssumeRolePolicyDocument: 133 | Version: 2012-10-17 134 | Statement: 135 | - Effect: Allow 136 | Principal: 137 | Service: 138 | - medical-imaging.amazonaws.com 139 | Action: 140 | - "sts:AssumeRole" 141 | Policies: 142 | - PolicyName: S3andMedicalImaging 143 | PolicyDocument: 144 | Version: '2012-10-17' 145 | Statement: 146 | - Effect: Allow 147 | Action: 148 | - "s3:GetObject" 149 | - "s3:PutObject" 150 | - "s3:DeleteObject" 151 | - "s3:ListBucket" 152 | - "s3:GetEncryptionConfiguration" 153 | Resource: arn:aws:s3:::* 154 | - Effect: Allow 155 | Action: 156 | - "medical-imaging:StartDICOMImportJob" 157 | Resource: "*" 158 | 159 | OmicsAccessRole: 160 | Type: AWS::IAM::Role 161 | Properties: 162 | RoleName: !Ref HealthOmicsImportRoleName 163 | AssumeRolePolicyDocument: 164 | Statement: 165 | - Action: 166 | - sts:AssumeRole 167 | Effect: Allow 168 | 
Principal: 169 | Service: 170 | - omics.amazonaws.com 171 | Version: '2012-10-17' 172 | Path: / 173 | Policies: 174 | - PolicyName: s3-access 175 | PolicyDocument: 176 | Statement: 177 | - Action: 178 | - s3:GetObject 179 | - s3:GetBucketLocation 180 | - s3:ListBucket 181 | - s3:Put* 182 | Effect: Allow 183 | Resource: 184 | - arn:aws:s3:::* 185 | - arn:aws:s3:::*/* 186 | ManagedPolicyArns: 187 | - arn:aws:iam::aws:policy/AmazonOmicsFullAccess 188 | 189 | WorkshopDomain: 190 | Type: AWS::SageMaker::Domain 191 | DependsOn: SageMakerExecutionRole 192 | Properties: 193 | AuthMode: IAM 194 | DefaultUserSettings: 195 | ExecutionRole: !GetAtt SageMakerExecutionRole.Arn 196 | DomainName: "SageMakerStudioWorkshopDomain" 197 | SubnetIds: 198 | - !Ref ParameterSubnet1Id 199 | - !Ref ParameterSubnet2Id 200 | VpcId: !Ref ParameterVPCId 201 | 202 | DefaultUser: 203 | Type: AWS::SageMaker::UserProfile 204 | Properties: 205 | DomainId: !Ref WorkshopDomain 206 | UserProfileName: sagemaker-user 207 | DependsOn: WorkshopDomain 208 | 209 | 210 | 211 | -------------------------------------------------------------------------------- /preprocess-multimodal-data/README.md: -------------------------------------------------------------------------------- 1 | Before running the preprocessing notebooks, you will need to create a SageMaker domain if you haven't already and provide the necessary Lake Formation and Athena permissions. 2 | 3 | The preprocessing notebooks read their input data directly from S3. For the queries used to generate those S3 files, refer to the single-patient-records.ipynb notebook. 4 | 5 | ### Create a SageMaker domain 6 | 7 | If you already have a SageMaker domain set up, you can skip this step and move on to providing the necessary permissions. 8 | 9 | 1. Go to the **Amazon SageMaker Console** 10 | 2. Select **Admin configurations** on the left and then **Domains**. 11 | 3. Click **Create domain**. 12 | 4. Choose **Quick setup** 13 | 5. 
Click on **Set up**. It takes a couple of minutes for the domain setup to complete. Remain on the same page. 14 | 15 | ### Provide necessary permissions to the role 16 | 17 | 1. Get the execution role of your SageMaker domain. Click on the default user profile and note down the execution role on the right-hand side of the page. The role will look like `arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-XXXX`. Note down the `AmazonSageMaker-ExecutionRole-XXXX` part as it will be used in the next steps. 18 | 2. Go to the **Lake Formation** service page and choose **Administrative roles and tasks**. 19 | 3. Go to the **Data lake administrators** section and click on **Add**. Under **IAM users and roles**, choose the execution role noted in step 1. Click on **Confirm**. 20 | 4. Go to the **IAM** service page. Click on **Roles**. Search for the execution role noted in step 1. Click on the execution role. 21 | 5. Click on **Add permissions** and **Attach policies**. Select **AmazonAthenaFullAccess** and **Add permissions**. 22 | 6. Go back to the SageMaker domain in your SageMaker console. Under **User profiles**, click on **Launch** and select **Studio**. Wait for the Studio to launch. 
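
The console steps above (adding the execution role as a Lake Formation data lake administrator and attaching **AmazonAthenaFullAccess**) could also be scripted. The following is a minimal sketch with boto3, not part of this guidance: the function names `role_name_from_arn` and `grant_preprocessing_permissions` are hypothetical, and it assumes the caller's credentials are allowed to change Lake Formation settings and attach IAM policies. Verify the result in the console before relying on it.

```python
# Hypothetical helper; names below are illustrative and not part of this repository.
ATHENA_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonAthenaFullAccess"


def role_name_from_arn(role_arn):
    """Return the role name, i.e. the last path segment of the role ARN."""
    return role_arn.rsplit("/", 1)[-1]


def grant_preprocessing_permissions(role_arn, lakeformation_client=None, iam_client=None):
    """Add the role to the data lake administrators and attach the Athena policy.

    Returns True if the role was newly added as an administrator, False if it
    was already listed.
    """
    if lakeformation_client is None or iam_client is None:
        import boto3  # imported lazily so the helper can be exercised with stub clients

        lakeformation_client = lakeformation_client or boto3.client("lakeformation")
        iam_client = iam_client or boto3.client("iam")

    # Read the current settings so existing administrators are preserved.
    settings = lakeformation_client.get_data_lake_settings()["DataLakeSettings"]
    admins = list(settings.get("DataLakeAdmins", []))
    already_admin = any(
        admin.get("DataLakePrincipalIdentifier") == role_arn for admin in admins
    )
    if not already_admin:
        admins.append({"DataLakePrincipalIdentifier": role_arn})
        settings["DataLakeAdmins"] = admins
        lakeformation_client.put_data_lake_settings(DataLakeSettings=settings)

    # attach_role_policy expects the role name without its path prefix.
    iam_client.attach_role_policy(
        RoleName=role_name_from_arn(role_arn), PolicyArn=ATHENA_POLICY_ARN
    )
    return not already_admin
```

To use it, pass the full execution role ARN noted in step 1, for example `grant_preprocessing_permissions("arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-XXXX")`. Reading the settings before writing them back is a deliberate choice here, since `put_data_lake_settings` replaces the whole administrator list.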
23 | 24 | 25 | -------------------------------------------------------------------------------- /preprocess-multimodal-data/clinical/preprocess-clinical.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "c39ef86c-3c08-4a76-b852-939248bcd94d", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "# Read and preprocess clinical data from S3 and store features in SageMaker FeatureStore\n" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "id": "64e72b0f-db13-48b4-8d71-0e7c2d2bb7b9", 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "!pip install --no-dependencies s3fs" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "id": "ff045821-8bf7-46f8-8186-08d9e013875a", 26 | "metadata": { 27 | "tags": [] 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "import boto3\n", 32 | "import numpy as np\n", 33 | "import pandas as pd\n", 34 | "import matplotlib.pyplot as plt\n", 35 | "import io, os\n", 36 | "from time import gmtime, strftime, sleep\n", 37 | "import time\n", 38 | "import sagemaker\n", 39 | "from sagemaker.session import Session\n", 40 | "from sagemaker import get_execution_role\n", 41 | "from sagemaker.feature_store.feature_group import FeatureGroup" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "58fd3667-9479-46ca-833f-96e1948fca39", 47 | "metadata": {}, 48 | "source": [ 49 | "\n", 50 | "## Set up SageMaker FeatureStore\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "id": "b044786e-8c94-4755-b84f-8941c7e4f05d", 57 | "metadata": { 58 | "tags": [] 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "region = boto3.Session().region_name\n", 63 | "\n", 64 | "boto_session = boto3.Session(region_name=region)\n", 65 | "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n", 66 | "featurestore_runtime = 
boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n", 67 | "\n", 68 | "feature_store_session = Session(\n", 69 | " boto_session=boto_session,\n", 70 | " sagemaker_client=sagemaker_client,\n", 71 | " sagemaker_featurestore_runtime_client=featurestore_runtime\n", 72 | ")\n", 73 | "\n", 74 | "role = get_execution_role()\n", 75 | "s3_client = boto3.client('s3', region_name=region)\n", 76 | "\n", 77 | "default_s3_bucket_name = feature_store_session.default_bucket()\n", 78 | "prefix = 'sagemaker-featurestore-demo'" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "id": "6df3eac8-a59d-4a34-9a49-b24ec97045d2", 84 | "metadata": {}, 85 | "source": [ 86 | "\n", 87 | "## Get data from S3\n" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "id": "56d71658-97bf-42a8-882b-38d4d2ccbc28", 94 | "metadata": { 95 | "tags": [] 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "# Get data from S3. \n", 100 | "bucket_clin = 'guidance-multimodal-hcls-healthai-machinelearning/preprocessing'\n", 101 | "#bucket_clin = \n", 102 | "\n", 103 | "# Clinical data \n", 104 | "data_key_clin = 'final_clinical_df.csv'\n", 105 | "#data_key_clin = \n", 106 | "\n", 107 | "data_location_clin = 's3://{}/{}'.format(bucket_clin, data_key_clin)\n", 108 | "data_clinical = pd.read_csv(data_location_clin)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "ab13547e-ca00-4f74-a806-d053cbb34c1c", 114 | "metadata": {}, 115 | "source": [ 116 | "\n", 117 | "## Preprocess Data\n" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "id": "4bd20e05-596b-410d-a027-ddf2eb70678f", 124 | "metadata": { 125 | "tags": [] 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "# Replacing NaN with zeros\n", 130 | "data_clinical_1 = data_clinical.copy()\n", 131 | "data_clinical_1 = data_clinical_1.replace(np.nan, 0)\n", 132 | "data_clinical_1 = data_clinical_1.astype(str)\n", 133 | "\n", 134 | 
"#Converting all diagnosis codes to a set\n", 135 | "data_clinical_1[['alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction']] = '0'\n", 136 | "data_clinical_pred = data_clinical_1.copy()\n", 137 | "for i in range(len(data_clinical_1)):\n", 138 | " data_clinical_pred.loc[i, 'diagnosisCode'] = set(data_clinical_pred.loc[i, 'diagnosisCode'].replace('\\'', '').replace(' ', '').replace('{', '').replace('}', '').split(','))\n", 139 | "\n", 140 | "# Adding a column for prediction of Alzheimer's disease code '26929004'\n", 141 | "# Adding a column for prediction of Coronary heart disease '53741008'\n", 142 | "# Adding a column for prediction of Stroke code '230690007'\n", 143 | "# Adding a column for prediction of Hypertension code '59621000'\n", 144 | "for i in range(len(data_clinical_pred)):\n", 145 | " if \"26929004\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n", 146 | " data_clinical_pred.loc[i, 'alzheimers_prediction'] = '1'\n", 147 | " if \"53741008\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n", 148 | " data_clinical_pred.loc[i, 'coronary_heart_disease_prediction'] = '1'\n", 149 | " if \"230690007\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n", 150 | " data_clinical_pred.loc[i, 'stroke_prediction'] = '1'\n", 151 | " if \"59621000\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n", 152 | " data_clinical_pred.loc[i, 'hypertension_prediction'] = '1'\n", 153 | "print(\"Patients with Alzheimer's disease = \", len(data_clinical_pred[data_clinical_pred['alzheimers_prediction']=='1']))\n", 154 | "print(\"Patients with Coronary Heart disease = \", len(data_clinical_pred[data_clinical_pred['coronary_heart_disease_prediction']=='1']))\n", 155 | "print(\"Patients with Stroke = \", len(data_clinical_pred[data_clinical_pred['stroke_prediction']=='1']))\n", 156 | "print(\"Patients with Hypertension = \", len(data_clinical_pred[data_clinical_pred['hypertension_prediction']=='1']))\n", 157 | "\n", 158 | 
"# Delete columns with leakage and features irrelevant for model training\n", 159 | "list_delete_cols = ['diagnosisDescription', 'diagnosisCode', 'onsetdatetime', 'name', 'addressline',\n", 160 | " 'city', 'state', 'country', 'latitude', 'longitude']\n", 161 | "data_clinical_pred.drop(list_delete_cols, axis=1, inplace=True)\n", 162 | "\n", 163 | "data_clinical_pred.head(10)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "id": "1c8f4398-53df-4465-9259-ce80406f893e", 169 | "metadata": { 170 | "tags": [] 171 | }, 172 | "source": [ 173 | "\n", 174 | "## Ingest data into FeatureStore\n" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "id": "b8f82d22-8ad1-4258-be19-3f5d1dc0f4d5", 181 | "metadata": { 182 | "tags": [] 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "clinical_feature_group_name = 'clinical-feature-group'\n", 187 | "clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)\n", 188 | "\n", 189 | "current_time_sec = int(round(time.time()))\n", 190 | "\n", 191 | "def cast_object_to_string(data_frame):\n", 192 | " for label in data_frame.columns:\n", 193 | " print (label)\n", 194 | " if data_frame.dtypes[label] == 'object':\n", 195 | " data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n", 196 | "\n", 197 | "# Cast object dtype to string. 
SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n", 198 | "cast_object_to_string(data_clinical_pred)\n", 199 | "\n", 200 | "# Record identifier and event time feature names\n", 201 | "record_identifier_feature_name = \"patientID\"\n", 202 | "event_time_feature_name = \"EventTime\"\n", 203 | "\n", 204 | "# Append EventTime feature\n", 205 | "data_clinical_pred[event_time_feature_name] = pd.Series([current_time_sec]*len(data_clinical_pred), dtype=\"float64\")\n", 206 | "\n", 207 | "## If event time generates NaN\n", 208 | "data_clinical_pred[event_time_feature_name] = data_clinical_pred[event_time_feature_name].fillna(0)\n", 209 | "\n", 210 | "# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n", 211 | "clinical_feature_group.load_feature_definitions(data_frame=data_clinical_pred); # output is suppressed\n", 212 | "\n", 213 | "\n", 214 | "def wait_for_feature_group_creation_complete(feature_group):\n", 215 | " status = feature_group.describe().get(\"FeatureGroupStatus\")\n", 216 | " while status == \"Creating\":\n", 217 | " print(\"Waiting for Feature Group Creation\")\n", 218 | " time.sleep(15)\n", 219 | " status = feature_group.describe().get(\"FeatureGroupStatus\")\n", 220 | " if status != \"Created\":\n", 221 | " raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n", 222 | " print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", 223 | "\n", 224 | "clinical_feature_group.create(\n", 225 | " s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n", 226 | " record_identifier_name=record_identifier_feature_name,\n", 227 | " event_time_feature_name=event_time_feature_name,\n", 228 | " role_arn=role,\n", 229 | " enable_online_store=True\n", 230 | ")\n", 231 | "\n", 232 | "wait_for_feature_group_creation_complete(feature_group=clinical_feature_group)\n", 233 | "\n", 234 | 
"clinical_feature_group.ingest(\n", 235 | " data_frame=data_clinical_pred, max_workers=5, wait=True\n", 236 | ")" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "id": "aa2a8752-7e16-4880-b6e0-32e7882216ba", 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [] 246 | } 247 | ], 248 | "metadata": { 249 | "availableInstances": [ 250 | { 251 | "_defaultOrder": 0, 252 | "_isFastLaunch": true, 253 | "category": "General purpose", 254 | "gpuNum": 0, 255 | "hideHardwareSpecs": false, 256 | "memoryGiB": 4, 257 | "name": "ml.t3.medium", 258 | "vcpuNum": 2 259 | }, 260 | { 261 | "_defaultOrder": 1, 262 | "_isFastLaunch": false, 263 | "category": "General purpose", 264 | "gpuNum": 0, 265 | "hideHardwareSpecs": false, 266 | "memoryGiB": 8, 267 | "name": "ml.t3.large", 268 | "vcpuNum": 2 269 | }, 270 | { 271 | "_defaultOrder": 2, 272 | "_isFastLaunch": false, 273 | "category": "General purpose", 274 | "gpuNum": 0, 275 | "hideHardwareSpecs": false, 276 | "memoryGiB": 16, 277 | "name": "ml.t3.xlarge", 278 | "vcpuNum": 4 279 | }, 280 | { 281 | "_defaultOrder": 3, 282 | "_isFastLaunch": false, 283 | "category": "General purpose", 284 | "gpuNum": 0, 285 | "hideHardwareSpecs": false, 286 | "memoryGiB": 32, 287 | "name": "ml.t3.2xlarge", 288 | "vcpuNum": 8 289 | }, 290 | { 291 | "_defaultOrder": 4, 292 | "_isFastLaunch": true, 293 | "category": "General purpose", 294 | "gpuNum": 0, 295 | "hideHardwareSpecs": false, 296 | "memoryGiB": 8, 297 | "name": "ml.m5.large", 298 | "vcpuNum": 2 299 | }, 300 | { 301 | "_defaultOrder": 5, 302 | "_isFastLaunch": false, 303 | "category": "General purpose", 304 | "gpuNum": 0, 305 | "hideHardwareSpecs": false, 306 | "memoryGiB": 16, 307 | "name": "ml.m5.xlarge", 308 | "vcpuNum": 4 309 | }, 310 | { 311 | "_defaultOrder": 6, 312 | "_isFastLaunch": false, 313 | "category": "General purpose", 314 | "gpuNum": 0, 315 | "hideHardwareSpecs": false, 316 | "memoryGiB": 32, 317 | "name": "ml.m5.2xlarge", 318 | 
"vcpuNum": 8 319 | }, 320 | { 321 | "_defaultOrder": 7, 322 | "_isFastLaunch": false, 323 | "category": "General purpose", 324 | "gpuNum": 0, 325 | "hideHardwareSpecs": false, 326 | "memoryGiB": 64, 327 | "name": "ml.m5.4xlarge", 328 | "vcpuNum": 16 329 | }, 330 | { 331 | "_defaultOrder": 8, 332 | "_isFastLaunch": false, 333 | "category": "General purpose", 334 | "gpuNum": 0, 335 | "hideHardwareSpecs": false, 336 | "memoryGiB": 128, 337 | "name": "ml.m5.8xlarge", 338 | "vcpuNum": 32 339 | }, 340 | { 341 | "_defaultOrder": 9, 342 | "_isFastLaunch": false, 343 | "category": "General purpose", 344 | "gpuNum": 0, 345 | "hideHardwareSpecs": false, 346 | "memoryGiB": 192, 347 | "name": "ml.m5.12xlarge", 348 | "vcpuNum": 48 349 | }, 350 | { 351 | "_defaultOrder": 10, 352 | "_isFastLaunch": false, 353 | "category": "General purpose", 354 | "gpuNum": 0, 355 | "hideHardwareSpecs": false, 356 | "memoryGiB": 256, 357 | "name": "ml.m5.16xlarge", 358 | "vcpuNum": 64 359 | }, 360 | { 361 | "_defaultOrder": 11, 362 | "_isFastLaunch": false, 363 | "category": "General purpose", 364 | "gpuNum": 0, 365 | "hideHardwareSpecs": false, 366 | "memoryGiB": 384, 367 | "name": "ml.m5.24xlarge", 368 | "vcpuNum": 96 369 | }, 370 | { 371 | "_defaultOrder": 12, 372 | "_isFastLaunch": false, 373 | "category": "General purpose", 374 | "gpuNum": 0, 375 | "hideHardwareSpecs": false, 376 | "memoryGiB": 8, 377 | "name": "ml.m5d.large", 378 | "vcpuNum": 2 379 | }, 380 | { 381 | "_defaultOrder": 13, 382 | "_isFastLaunch": false, 383 | "category": "General purpose", 384 | "gpuNum": 0, 385 | "hideHardwareSpecs": false, 386 | "memoryGiB": 16, 387 | "name": "ml.m5d.xlarge", 388 | "vcpuNum": 4 389 | }, 390 | { 391 | "_defaultOrder": 14, 392 | "_isFastLaunch": false, 393 | "category": "General purpose", 394 | "gpuNum": 0, 395 | "hideHardwareSpecs": false, 396 | "memoryGiB": 32, 397 | "name": "ml.m5d.2xlarge", 398 | "vcpuNum": 8 399 | }, 400 | { 401 | "_defaultOrder": 15, 402 | "_isFastLaunch": false, 403 | 
"category": "General purpose", 404 | "gpuNum": 0, 405 | "hideHardwareSpecs": false, 406 | "memoryGiB": 64, 407 | "name": "ml.m5d.4xlarge", 408 | "vcpuNum": 16 409 | }, 410 | { 411 | "_defaultOrder": 16, 412 | "_isFastLaunch": false, 413 | "category": "General purpose", 414 | "gpuNum": 0, 415 | "hideHardwareSpecs": false, 416 | "memoryGiB": 128, 417 | "name": "ml.m5d.8xlarge", 418 | "vcpuNum": 32 419 | }, 420 | { 421 | "_defaultOrder": 17, 422 | "_isFastLaunch": false, 423 | "category": "General purpose", 424 | "gpuNum": 0, 425 | "hideHardwareSpecs": false, 426 | "memoryGiB": 192, 427 | "name": "ml.m5d.12xlarge", 428 | "vcpuNum": 48 429 | }, 430 | { 431 | "_defaultOrder": 18, 432 | "_isFastLaunch": false, 433 | "category": "General purpose", 434 | "gpuNum": 0, 435 | "hideHardwareSpecs": false, 436 | "memoryGiB": 256, 437 | "name": "ml.m5d.16xlarge", 438 | "vcpuNum": 64 439 | }, 440 | { 441 | "_defaultOrder": 19, 442 | "_isFastLaunch": false, 443 | "category": "General purpose", 444 | "gpuNum": 0, 445 | "hideHardwareSpecs": false, 446 | "memoryGiB": 384, 447 | "name": "ml.m5d.24xlarge", 448 | "vcpuNum": 96 449 | }, 450 | { 451 | "_defaultOrder": 20, 452 | "_isFastLaunch": false, 453 | "category": "General purpose", 454 | "gpuNum": 0, 455 | "hideHardwareSpecs": true, 456 | "memoryGiB": 0, 457 | "name": "ml.geospatial.interactive", 458 | "supportedImageNames": [ 459 | "sagemaker-geospatial-v1-0" 460 | ], 461 | "vcpuNum": 0 462 | }, 463 | { 464 | "_defaultOrder": 21, 465 | "_isFastLaunch": true, 466 | "category": "Compute optimized", 467 | "gpuNum": 0, 468 | "hideHardwareSpecs": false, 469 | "memoryGiB": 4, 470 | "name": "ml.c5.large", 471 | "vcpuNum": 2 472 | }, 473 | { 474 | "_defaultOrder": 22, 475 | "_isFastLaunch": false, 476 | "category": "Compute optimized", 477 | "gpuNum": 0, 478 | "hideHardwareSpecs": false, 479 | "memoryGiB": 8, 480 | "name": "ml.c5.xlarge", 481 | "vcpuNum": 4 482 | }, 483 | { 484 | "_defaultOrder": 23, 485 | "_isFastLaunch": false, 486 | 
"category": "Compute optimized", 487 | "gpuNum": 0, 488 | "hideHardwareSpecs": false, 489 | "memoryGiB": 16, 490 | "name": "ml.c5.2xlarge", 491 | "vcpuNum": 8 492 | }, 493 | { 494 | "_defaultOrder": 24, 495 | "_isFastLaunch": false, 496 | "category": "Compute optimized", 497 | "gpuNum": 0, 498 | "hideHardwareSpecs": false, 499 | "memoryGiB": 32, 500 | "name": "ml.c5.4xlarge", 501 | "vcpuNum": 16 502 | }, 503 | { 504 | "_defaultOrder": 25, 505 | "_isFastLaunch": false, 506 | "category": "Compute optimized", 507 | "gpuNum": 0, 508 | "hideHardwareSpecs": false, 509 | "memoryGiB": 72, 510 | "name": "ml.c5.9xlarge", 511 | "vcpuNum": 36 512 | }, 513 | { 514 | "_defaultOrder": 26, 515 | "_isFastLaunch": false, 516 | "category": "Compute optimized", 517 | "gpuNum": 0, 518 | "hideHardwareSpecs": false, 519 | "memoryGiB": 96, 520 | "name": "ml.c5.12xlarge", 521 | "vcpuNum": 48 522 | }, 523 | { 524 | "_defaultOrder": 27, 525 | "_isFastLaunch": false, 526 | "category": "Compute optimized", 527 | "gpuNum": 0, 528 | "hideHardwareSpecs": false, 529 | "memoryGiB": 144, 530 | "name": "ml.c5.18xlarge", 531 | "vcpuNum": 72 532 | }, 533 | { 534 | "_defaultOrder": 28, 535 | "_isFastLaunch": false, 536 | "category": "Compute optimized", 537 | "gpuNum": 0, 538 | "hideHardwareSpecs": false, 539 | "memoryGiB": 192, 540 | "name": "ml.c5.24xlarge", 541 | "vcpuNum": 96 542 | }, 543 | { 544 | "_defaultOrder": 29, 545 | "_isFastLaunch": true, 546 | "category": "Accelerated computing", 547 | "gpuNum": 1, 548 | "hideHardwareSpecs": false, 549 | "memoryGiB": 16, 550 | "name": "ml.g4dn.xlarge", 551 | "vcpuNum": 4 552 | }, 553 | { 554 | "_defaultOrder": 30, 555 | "_isFastLaunch": false, 556 | "category": "Accelerated computing", 557 | "gpuNum": 1, 558 | "hideHardwareSpecs": false, 559 | "memoryGiB": 32, 560 | "name": "ml.g4dn.2xlarge", 561 | "vcpuNum": 8 562 | }, 563 | { 564 | "_defaultOrder": 31, 565 | "_isFastLaunch": false, 566 | "category": "Accelerated computing", 567 | "gpuNum": 1, 568 | 
"hideHardwareSpecs": false, 569 | "memoryGiB": 64, 570 | "name": "ml.g4dn.4xlarge", 571 | "vcpuNum": 16 572 | }, 573 | { 574 | "_defaultOrder": 32, 575 | "_isFastLaunch": false, 576 | "category": "Accelerated computing", 577 | "gpuNum": 1, 578 | "hideHardwareSpecs": false, 579 | "memoryGiB": 128, 580 | "name": "ml.g4dn.8xlarge", 581 | "vcpuNum": 32 582 | }, 583 | { 584 | "_defaultOrder": 33, 585 | "_isFastLaunch": false, 586 | "category": "Accelerated computing", 587 | "gpuNum": 4, 588 | "hideHardwareSpecs": false, 589 | "memoryGiB": 192, 590 | "name": "ml.g4dn.12xlarge", 591 | "vcpuNum": 48 592 | }, 593 | { 594 | "_defaultOrder": 34, 595 | "_isFastLaunch": false, 596 | "category": "Accelerated computing", 597 | "gpuNum": 1, 598 | "hideHardwareSpecs": false, 599 | "memoryGiB": 256, 600 | "name": "ml.g4dn.16xlarge", 601 | "vcpuNum": 64 602 | }, 603 | { 604 | "_defaultOrder": 35, 605 | "_isFastLaunch": false, 606 | "category": "Accelerated computing", 607 | "gpuNum": 1, 608 | "hideHardwareSpecs": false, 609 | "memoryGiB": 61, 610 | "name": "ml.p3.2xlarge", 611 | "vcpuNum": 8 612 | }, 613 | { 614 | "_defaultOrder": 36, 615 | "_isFastLaunch": false, 616 | "category": "Accelerated computing", 617 | "gpuNum": 4, 618 | "hideHardwareSpecs": false, 619 | "memoryGiB": 244, 620 | "name": "ml.p3.8xlarge", 621 | "vcpuNum": 32 622 | }, 623 | { 624 | "_defaultOrder": 37, 625 | "_isFastLaunch": false, 626 | "category": "Accelerated computing", 627 | "gpuNum": 8, 628 | "hideHardwareSpecs": false, 629 | "memoryGiB": 488, 630 | "name": "ml.p3.16xlarge", 631 | "vcpuNum": 64 632 | }, 633 | { 634 | "_defaultOrder": 38, 635 | "_isFastLaunch": false, 636 | "category": "Accelerated computing", 637 | "gpuNum": 8, 638 | "hideHardwareSpecs": false, 639 | "memoryGiB": 768, 640 | "name": "ml.p3dn.24xlarge", 641 | "vcpuNum": 96 642 | }, 643 | { 644 | "_defaultOrder": 39, 645 | "_isFastLaunch": false, 646 | "category": "Memory Optimized", 647 | "gpuNum": 0, 648 | "hideHardwareSpecs": false, 649 | 
"memoryGiB": 16, 650 | "name": "ml.r5.large", 651 | "vcpuNum": 2 652 | }, 653 | { 654 | "_defaultOrder": 40, 655 | "_isFastLaunch": false, 656 | "category": "Memory Optimized", 657 | "gpuNum": 0, 658 | "hideHardwareSpecs": false, 659 | "memoryGiB": 32, 660 | "name": "ml.r5.xlarge", 661 | "vcpuNum": 4 662 | }, 663 | { 664 | "_defaultOrder": 41, 665 | "_isFastLaunch": false, 666 | "category": "Memory Optimized", 667 | "gpuNum": 0, 668 | "hideHardwareSpecs": false, 669 | "memoryGiB": 64, 670 | "name": "ml.r5.2xlarge", 671 | "vcpuNum": 8 672 | }, 673 | { 674 | "_defaultOrder": 42, 675 | "_isFastLaunch": false, 676 | "category": "Memory Optimized", 677 | "gpuNum": 0, 678 | "hideHardwareSpecs": false, 679 | "memoryGiB": 128, 680 | "name": "ml.r5.4xlarge", 681 | "vcpuNum": 16 682 | }, 683 | { 684 | "_defaultOrder": 43, 685 | "_isFastLaunch": false, 686 | "category": "Memory Optimized", 687 | "gpuNum": 0, 688 | "hideHardwareSpecs": false, 689 | "memoryGiB": 256, 690 | "name": "ml.r5.8xlarge", 691 | "vcpuNum": 32 692 | }, 693 | { 694 | "_defaultOrder": 44, 695 | "_isFastLaunch": false, 696 | "category": "Memory Optimized", 697 | "gpuNum": 0, 698 | "hideHardwareSpecs": false, 699 | "memoryGiB": 384, 700 | "name": "ml.r5.12xlarge", 701 | "vcpuNum": 48 702 | }, 703 | { 704 | "_defaultOrder": 45, 705 | "_isFastLaunch": false, 706 | "category": "Memory Optimized", 707 | "gpuNum": 0, 708 | "hideHardwareSpecs": false, 709 | "memoryGiB": 512, 710 | "name": "ml.r5.16xlarge", 711 | "vcpuNum": 64 712 | }, 713 | { 714 | "_defaultOrder": 46, 715 | "_isFastLaunch": false, 716 | "category": "Memory Optimized", 717 | "gpuNum": 0, 718 | "hideHardwareSpecs": false, 719 | "memoryGiB": 768, 720 | "name": "ml.r5.24xlarge", 721 | "vcpuNum": 96 722 | }, 723 | { 724 | "_defaultOrder": 47, 725 | "_isFastLaunch": false, 726 | "category": "Accelerated computing", 727 | "gpuNum": 1, 728 | "hideHardwareSpecs": false, 729 | "memoryGiB": 16, 730 | "name": "ml.g5.xlarge", 731 | "vcpuNum": 4 732 | }, 733 | 
{ 734 | "_defaultOrder": 48, 735 | "_isFastLaunch": false, 736 | "category": "Accelerated computing", 737 | "gpuNum": 1, 738 | "hideHardwareSpecs": false, 739 | "memoryGiB": 32, 740 | "name": "ml.g5.2xlarge", 741 | "vcpuNum": 8 742 | }, 743 | { 744 | "_defaultOrder": 49, 745 | "_isFastLaunch": false, 746 | "category": "Accelerated computing", 747 | "gpuNum": 1, 748 | "hideHardwareSpecs": false, 749 | "memoryGiB": 64, 750 | "name": "ml.g5.4xlarge", 751 | "vcpuNum": 16 752 | }, 753 | { 754 | "_defaultOrder": 50, 755 | "_isFastLaunch": false, 756 | "category": "Accelerated computing", 757 | "gpuNum": 1, 758 | "hideHardwareSpecs": false, 759 | "memoryGiB": 128, 760 | "name": "ml.g5.8xlarge", 761 | "vcpuNum": 32 762 | }, 763 | { 764 | "_defaultOrder": 51, 765 | "_isFastLaunch": false, 766 | "category": "Accelerated computing", 767 | "gpuNum": 1, 768 | "hideHardwareSpecs": false, 769 | "memoryGiB": 256, 770 | "name": "ml.g5.16xlarge", 771 | "vcpuNum": 64 772 | }, 773 | { 774 | "_defaultOrder": 52, 775 | "_isFastLaunch": false, 776 | "category": "Accelerated computing", 777 | "gpuNum": 4, 778 | "hideHardwareSpecs": false, 779 | "memoryGiB": 192, 780 | "name": "ml.g5.12xlarge", 781 | "vcpuNum": 48 782 | }, 783 | { 784 | "_defaultOrder": 53, 785 | "_isFastLaunch": false, 786 | "category": "Accelerated computing", 787 | "gpuNum": 4, 788 | "hideHardwareSpecs": false, 789 | "memoryGiB": 384, 790 | "name": "ml.g5.24xlarge", 791 | "vcpuNum": 96 792 | }, 793 | { 794 | "_defaultOrder": 54, 795 | "_isFastLaunch": false, 796 | "category": "Accelerated computing", 797 | "gpuNum": 8, 798 | "hideHardwareSpecs": false, 799 | "memoryGiB": 768, 800 | "name": "ml.g5.48xlarge", 801 | "vcpuNum": 192 802 | } 803 | ], 804 | "instance_type": "ml.t3.medium", 805 | "kernelspec": { 806 | "display_name": "conda_mxnet_p38", 807 | "language": "python", 808 | "name": "conda_mxnet_p38" 809 | }, 810 | "language_info": { 811 | "codemirror_mode": { 812 | "name": "ipython", 813 | "version": 3 814 | }, 815 
| "file_extension": ".py", 816 | "mimetype": "text/x-python", 817 | "name": "python", 818 | "nbconvert_exporter": "python", 819 | "pygments_lexer": "ipython3", 820 | "version": "3.8.15" 821 | } 822 | }, 823 | "nbformat": 4, 824 | "nbformat_minor": 5 825 | } 826 | -------------------------------------------------------------------------------- /preprocess-multimodal-data/genomic/preprocess-genomic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "81a85a9a-7642-422c-b4f9-f15d95b46a5b", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "# Read and preprocess genomic data from S3 and store features in SageMaker FeatureStore\n" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "id": "ff045821-8bf7-46f8-8186-08d9e013875a", 16 | "metadata": { 17 | "tags": [] 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "import boto3\n", 22 | "import numpy as np\n", 23 | "import pandas as pd\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import io, os\n", 26 | "from time import gmtime, strftime, sleep\n", 27 | "import time\n", 28 | "import sagemaker\n", 29 | "from sagemaker.session import Session\n", 30 | "from sagemaker import get_execution_role\n", 31 | "from sagemaker.feature_store.feature_group import FeatureGroup" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "69d1dabe-1be7-4d30-aaf3-9b053aff6af5", 37 | "metadata": {}, 38 | "source": [ 39 | "\n", 40 | "## Set up SageMaker FeatureStore\n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "b044786e-8c94-4755-b84f-8941c7e4f05d", 47 | "metadata": { 48 | "tags": [] 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "region = boto3.Session().region_name\n", 53 | "\n", 54 | "boto_session = boto3.Session(region_name=region)\n", 55 | "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n", 56 | 
"featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n", 57 | "\n", 58 | "feature_store_session = Session(\n", 59 | " boto_session=boto_session,\n", 60 | " sagemaker_client=sagemaker_client,\n", 61 | " sagemaker_featurestore_runtime_client=featurestore_runtime\n", 62 | ")\n", 63 | "\n", 64 | "role = get_execution_role()\n", 65 | "s3_client = boto3.client('s3', region_name=region)\n", 66 | "\n", 67 | "default_s3_bucket_name = feature_store_session.default_bucket()\n", 68 | "prefix = 'sagemaker-featurestore-demo'" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "becf6d9e-39b5-4a28-97a5-254e8fae34ae", 74 | "metadata": {}, 75 | "source": [ 76 | "\n", 77 | "## Get data from S3\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "56d71658-97bf-42a8-882b-38d4d2ccbc28", 84 | "metadata": { 85 | "tags": [] 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "# Get data from S3 \n", 90 | "bucket_gen = 'guidance-multimodal-hcls-healthai-machinelearning/preprocessing'\n", 91 | "#bucket_gen = \n", 92 | "\n", 93 | "# Genomic data \n", 94 | "data_key_gen = 'final_genomic_df.csv'\n", 95 | "#data_key_gen = \n", 96 | "\n", 97 | "data_location_gen = 's3://{}/{}'.format(bucket_gen, data_key_gen)\n", 98 | "data_genomic = pd.read_csv(data_location_gen)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "20e1c644-31a9-4d35-8197-74111504e269", 104 | "metadata": {}, 105 | "source": [ 106 | "\n", 107 | "## Preprocess Data\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "4bd20e05-596b-410d-a027-ddf2eb70678f", 114 | "metadata": { 115 | "tags": [] 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "# Replacing NaN with zeros\n", 120 | "data_genomic_1 = data_genomic.copy()\n", 121 | "data_genomic_1 = data_genomic_1.replace(np.nan, 0)\n", 122 | "data_genomic_1 = data_genomic_1.astype(str)\n", 123 | "\n", 124 | "# 
Converting all diagnosis codes to a set\n", 125 | "data_genomic_1[['alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction']] = '0'\n", 126 | "data_genomic_pred = data_genomic_1.copy()\n", 127 | "for i in range(len(data_genomic_1)):\n", 128 | " data_genomic_pred.loc[i, 'diagnosisCode'] = set(data_genomic_pred.loc[i, 'diagnosisCode'].replace('\\'', '').replace(' ', '').replace('{', '').replace('}', '').split(','))\n", 129 | "\n", 130 | "# Adding a column for prediction of Alzheimer's disease code '26929004'\n", 131 | "# Adding a column for prediction of Coronary heart disease '53741008'\n", 132 | "# Adding a column for prediction of Stroke code '230690007'\n", 133 | "# Adding a column for prediction of Hypertension code '59621000'\n", 134 | "for i in range(len(data_genomic_pred)):\n", 135 | " if \"26929004\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n", 136 | " data_genomic_pred.loc[i, 'alzheimers_prediction'] = '1'\n", 137 | " if \"53741008\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n", 138 | " data_genomic_pred.loc[i, 'coronary_heart_disease_prediction'] = '1'\n", 139 | " if \"230690007\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n", 140 | " data_genomic_pred.loc[i, 'stroke_prediction'] = '1'\n", 141 | " if \"59621000\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n", 142 | " data_genomic_pred.loc[i, 'hypertension_prediction'] = '1'\n", 143 | "print(\"Patients with Alzheimer's disease = \", len(data_genomic_pred[data_genomic_pred['alzheimers_prediction']=='1']))\n", 144 | "print(\"Patients with Coronary Heart disease = \", len(data_genomic_pred[data_genomic_pred['coronary_heart_disease_prediction']=='1']))\n", 145 | "print(\"Patients with Stroke = \", len(data_genomic_pred[data_genomic_pred['stroke_prediction']=='1']))\n", 146 | "print(\"Patients with Hypertension = \", len(data_genomic_pred[data_genomic_pred['hypertension_prediction']=='1']))\n", 147 | "\n", 148 | "# Delete columns with 
leakage and features irrelevant for model training\n", 149 | "list_delete_cols = ['diagnosisDescription', 'diagnosisCode']\n", 150 | "data_genomic_pred.drop(list_delete_cols, axis=1, inplace=True)\n", 151 | "\n", 152 | "data_genomic_pred.head(10)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "79ca5c9e-7689-4fe9-a668-6439425f13c7", 158 | "metadata": {}, 159 | "source": [ 160 | "\n", 161 | "## Ingest data into FeatureStore\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "id": "b8f82d22-8ad1-4258-be19-3f5d1dc0f4d5", 168 | "metadata": { 169 | "tags": [] 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "genomic_feature_group_name = 'genomic-feature-group'\n", 174 | "genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)\n", 175 | "\n", 176 | "current_time_sec = int(round(time.time()))\n", 177 | "\n", 178 | "def cast_object_to_string(data_frame):\n", 179 | " for label in data_frame.columns:\n", 180 | " print (label)\n", 181 | " if data_frame.dtypes[label] == 'object':\n", 182 | " data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n", 183 | "\n", 184 | "# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n", 185 | "cast_object_to_string(data_genomic_pred)\n", 186 | "\n", 187 | "# Record identifier and event time feature names\n", 188 | "record_identifier_feature_name = \"patientID\"\n", 189 | "event_time_feature_name = \"EventTime\"\n", 190 | "\n", 191 | "# Append EventTime feature\n", 192 | "data_genomic_pred[event_time_feature_name] = pd.Series([current_time_sec]*len(data_genomic_pred), dtype=\"float64\")\n", 193 | "\n", 194 | "## If event time generates NaN\n", 195 | "data_genomic_pred[event_time_feature_name] = data_genomic_pred[event_time_feature_name].fillna(0)\n", 196 | "\n", 197 | "# Load feature definitions to the feature group. 
SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n", 198 | "genomic_feature_group.load_feature_definitions(data_frame=data_genomic_pred); # output is suppressed\n", 199 | "\n", 200 | "\n", 201 | "def wait_for_feature_group_creation_complete(feature_group):\n", 202 | " status = feature_group.describe().get(\"FeatureGroupStatus\")\n", 203 | " while status == \"Creating\":\n", 204 | " print(\"Waiting for Feature Group Creation\")\n", 205 | " time.sleep(15)\n", 206 | " status = feature_group.describe().get(\"FeatureGroupStatus\")\n", 207 | " if status != \"Created\":\n", 208 | " raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n", 209 | " print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", 210 | "\n", 211 | "genomic_feature_group.create(\n", 212 | " s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n", 213 | " record_identifier_name=record_identifier_feature_name,\n", 214 | " event_time_feature_name=event_time_feature_name,\n", 215 | " role_arn=role,\n", 216 | " enable_online_store=True\n", 217 | ")\n", 218 | "\n", 219 | "wait_for_feature_group_creation_complete(feature_group=genomic_feature_group)\n", 220 | "\n", 221 | "genomic_feature_group.ingest(\n", 222 | " data_frame=data_genomic_pred, max_workers=5, wait=True\n", 223 | ")" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "id": "aa2a8752-7e16-4880-b6e0-32e7882216ba", 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [] 233 | } 234 | ], 235 | "metadata": { 236 | "availableInstances": [ 237 | { 238 | "_defaultOrder": 0, 239 | "_isFastLaunch": true, 240 | "category": "General purpose", 241 | "gpuNum": 0, 242 | "hideHardwareSpecs": false, 243 | "memoryGiB": 4, 244 | "name": "ml.t3.medium", 245 | "vcpuNum": 2 246 | }, 247 | { 248 | "_defaultOrder": 1, 249 | "_isFastLaunch": false, 250 | "category": "General purpose", 251 | "gpuNum": 0, 252 | "hideHardwareSpecs": false, 253 | 
"memoryGiB": 8, 254 | "name": "ml.t3.large", 255 | "vcpuNum": 2 256 | }, 257 | { 258 | "_defaultOrder": 2, 259 | "_isFastLaunch": false, 260 | "category": "General purpose", 261 | "gpuNum": 0, 262 | "hideHardwareSpecs": false, 263 | "memoryGiB": 16, 264 | "name": "ml.t3.xlarge", 265 | "vcpuNum": 4 266 | }, 267 | { 268 | "_defaultOrder": 3, 269 | "_isFastLaunch": false, 270 | "category": "General purpose", 271 | "gpuNum": 0, 272 | "hideHardwareSpecs": false, 273 | "memoryGiB": 32, 274 | "name": "ml.t3.2xlarge", 275 | "vcpuNum": 8 276 | }, 277 | { 278 | "_defaultOrder": 4, 279 | "_isFastLaunch": true, 280 | "category": "General purpose", 281 | "gpuNum": 0, 282 | "hideHardwareSpecs": false, 283 | "memoryGiB": 8, 284 | "name": "ml.m5.large", 285 | "vcpuNum": 2 286 | }, 287 | { 288 | "_defaultOrder": 5, 289 | "_isFastLaunch": false, 290 | "category": "General purpose", 291 | "gpuNum": 0, 292 | "hideHardwareSpecs": false, 293 | "memoryGiB": 16, 294 | "name": "ml.m5.xlarge", 295 | "vcpuNum": 4 296 | }, 297 | { 298 | "_defaultOrder": 6, 299 | "_isFastLaunch": false, 300 | "category": "General purpose", 301 | "gpuNum": 0, 302 | "hideHardwareSpecs": false, 303 | "memoryGiB": 32, 304 | "name": "ml.m5.2xlarge", 305 | "vcpuNum": 8 306 | }, 307 | { 308 | "_defaultOrder": 7, 309 | "_isFastLaunch": false, 310 | "category": "General purpose", 311 | "gpuNum": 0, 312 | "hideHardwareSpecs": false, 313 | "memoryGiB": 64, 314 | "name": "ml.m5.4xlarge", 315 | "vcpuNum": 16 316 | }, 317 | { 318 | "_defaultOrder": 8, 319 | "_isFastLaunch": false, 320 | "category": "General purpose", 321 | "gpuNum": 0, 322 | "hideHardwareSpecs": false, 323 | "memoryGiB": 128, 324 | "name": "ml.m5.8xlarge", 325 | "vcpuNum": 32 326 | }, 327 | { 328 | "_defaultOrder": 9, 329 | "_isFastLaunch": false, 330 | "category": "General purpose", 331 | "gpuNum": 0, 332 | "hideHardwareSpecs": false, 333 | "memoryGiB": 192, 334 | "name": "ml.m5.12xlarge", 335 | "vcpuNum": 48 336 | }, 337 | { 338 | "_defaultOrder": 10, 339 
| "_isFastLaunch": false, 340 | "category": "General purpose", 341 | "gpuNum": 0, 342 | "hideHardwareSpecs": false, 343 | "memoryGiB": 256, 344 | "name": "ml.m5.16xlarge", 345 | "vcpuNum": 64 346 | }, 347 | { 348 | "_defaultOrder": 11, 349 | "_isFastLaunch": false, 350 | "category": "General purpose", 351 | "gpuNum": 0, 352 | "hideHardwareSpecs": false, 353 | "memoryGiB": 384, 354 | "name": "ml.m5.24xlarge", 355 | "vcpuNum": 96 356 | }, 357 | { 358 | "_defaultOrder": 12, 359 | "_isFastLaunch": false, 360 | "category": "General purpose", 361 | "gpuNum": 0, 362 | "hideHardwareSpecs": false, 363 | "memoryGiB": 8, 364 | "name": "ml.m5d.large", 365 | "vcpuNum": 2 366 | }, 367 | { 368 | "_defaultOrder": 13, 369 | "_isFastLaunch": false, 370 | "category": "General purpose", 371 | "gpuNum": 0, 372 | "hideHardwareSpecs": false, 373 | "memoryGiB": 16, 374 | "name": "ml.m5d.xlarge", 375 | "vcpuNum": 4 376 | }, 377 | { 378 | "_defaultOrder": 14, 379 | "_isFastLaunch": false, 380 | "category": "General purpose", 381 | "gpuNum": 0, 382 | "hideHardwareSpecs": false, 383 | "memoryGiB": 32, 384 | "name": "ml.m5d.2xlarge", 385 | "vcpuNum": 8 386 | }, 387 | { 388 | "_defaultOrder": 15, 389 | "_isFastLaunch": false, 390 | "category": "General purpose", 391 | "gpuNum": 0, 392 | "hideHardwareSpecs": false, 393 | "memoryGiB": 64, 394 | "name": "ml.m5d.4xlarge", 395 | "vcpuNum": 16 396 | }, 397 | { 398 | "_defaultOrder": 16, 399 | "_isFastLaunch": false, 400 | "category": "General purpose", 401 | "gpuNum": 0, 402 | "hideHardwareSpecs": false, 403 | "memoryGiB": 128, 404 | "name": "ml.m5d.8xlarge", 405 | "vcpuNum": 32 406 | }, 407 | { 408 | "_defaultOrder": 17, 409 | "_isFastLaunch": false, 410 | "category": "General purpose", 411 | "gpuNum": 0, 412 | "hideHardwareSpecs": false, 413 | "memoryGiB": 192, 414 | "name": "ml.m5d.12xlarge", 415 | "vcpuNum": 48 416 | }, 417 | { 418 | "_defaultOrder": 18, 419 | "_isFastLaunch": false, 420 | "category": "General purpose", 421 | "gpuNum": 0, 422 | 
"hideHardwareSpecs": false, 423 | "memoryGiB": 256, 424 | "name": "ml.m5d.16xlarge", 425 | "vcpuNum": 64 426 | }, 427 | { 428 | "_defaultOrder": 19, 429 | "_isFastLaunch": false, 430 | "category": "General purpose", 431 | "gpuNum": 0, 432 | "hideHardwareSpecs": false, 433 | "memoryGiB": 384, 434 | "name": "ml.m5d.24xlarge", 435 | "vcpuNum": 96 436 | }, 437 | { 438 | "_defaultOrder": 20, 439 | "_isFastLaunch": false, 440 | "category": "General purpose", 441 | "gpuNum": 0, 442 | "hideHardwareSpecs": true, 443 | "memoryGiB": 0, 444 | "name": "ml.geospatial.interactive", 445 | "supportedImageNames": [ 446 | "sagemaker-geospatial-v1-0" 447 | ], 448 | "vcpuNum": 0 449 | }, 450 | { 451 | "_defaultOrder": 21, 452 | "_isFastLaunch": true, 453 | "category": "Compute optimized", 454 | "gpuNum": 0, 455 | "hideHardwareSpecs": false, 456 | "memoryGiB": 4, 457 | "name": "ml.c5.large", 458 | "vcpuNum": 2 459 | }, 460 | { 461 | "_defaultOrder": 22, 462 | "_isFastLaunch": false, 463 | "category": "Compute optimized", 464 | "gpuNum": 0, 465 | "hideHardwareSpecs": false, 466 | "memoryGiB": 8, 467 | "name": "ml.c5.xlarge", 468 | "vcpuNum": 4 469 | }, 470 | { 471 | "_defaultOrder": 23, 472 | "_isFastLaunch": false, 473 | "category": "Compute optimized", 474 | "gpuNum": 0, 475 | "hideHardwareSpecs": false, 476 | "memoryGiB": 16, 477 | "name": "ml.c5.2xlarge", 478 | "vcpuNum": 8 479 | }, 480 | { 481 | "_defaultOrder": 24, 482 | "_isFastLaunch": false, 483 | "category": "Compute optimized", 484 | "gpuNum": 0, 485 | "hideHardwareSpecs": false, 486 | "memoryGiB": 32, 487 | "name": "ml.c5.4xlarge", 488 | "vcpuNum": 16 489 | }, 490 | { 491 | "_defaultOrder": 25, 492 | "_isFastLaunch": false, 493 | "category": "Compute optimized", 494 | "gpuNum": 0, 495 | "hideHardwareSpecs": false, 496 | "memoryGiB": 72, 497 | "name": "ml.c5.9xlarge", 498 | "vcpuNum": 36 499 | }, 500 | { 501 | "_defaultOrder": 26, 502 | "_isFastLaunch": false, 503 | "category": "Compute optimized", 504 | "gpuNum": 0, 505 | 
"hideHardwareSpecs": false, 506 | "memoryGiB": 96, 507 | "name": "ml.c5.12xlarge", 508 | "vcpuNum": 48 509 | }, 510 | { 511 | "_defaultOrder": 27, 512 | "_isFastLaunch": false, 513 | "category": "Compute optimized", 514 | "gpuNum": 0, 515 | "hideHardwareSpecs": false, 516 | "memoryGiB": 144, 517 | "name": "ml.c5.18xlarge", 518 | "vcpuNum": 72 519 | }, 520 | { 521 | "_defaultOrder": 28, 522 | "_isFastLaunch": false, 523 | "category": "Compute optimized", 524 | "gpuNum": 0, 525 | "hideHardwareSpecs": false, 526 | "memoryGiB": 192, 527 | "name": "ml.c5.24xlarge", 528 | "vcpuNum": 96 529 | }, 530 | { 531 | "_defaultOrder": 29, 532 | "_isFastLaunch": true, 533 | "category": "Accelerated computing", 534 | "gpuNum": 1, 535 | "hideHardwareSpecs": false, 536 | "memoryGiB": 16, 537 | "name": "ml.g4dn.xlarge", 538 | "vcpuNum": 4 539 | }, 540 | { 541 | "_defaultOrder": 30, 542 | "_isFastLaunch": false, 543 | "category": "Accelerated computing", 544 | "gpuNum": 1, 545 | "hideHardwareSpecs": false, 546 | "memoryGiB": 32, 547 | "name": "ml.g4dn.2xlarge", 548 | "vcpuNum": 8 549 | }, 550 | { 551 | "_defaultOrder": 31, 552 | "_isFastLaunch": false, 553 | "category": "Accelerated computing", 554 | "gpuNum": 1, 555 | "hideHardwareSpecs": false, 556 | "memoryGiB": 64, 557 | "name": "ml.g4dn.4xlarge", 558 | "vcpuNum": 16 559 | }, 560 | { 561 | "_defaultOrder": 32, 562 | "_isFastLaunch": false, 563 | "category": "Accelerated computing", 564 | "gpuNum": 1, 565 | "hideHardwareSpecs": false, 566 | "memoryGiB": 128, 567 | "name": "ml.g4dn.8xlarge", 568 | "vcpuNum": 32 569 | }, 570 | { 571 | "_defaultOrder": 33, 572 | "_isFastLaunch": false, 573 | "category": "Accelerated computing", 574 | "gpuNum": 4, 575 | "hideHardwareSpecs": false, 576 | "memoryGiB": 192, 577 | "name": "ml.g4dn.12xlarge", 578 | "vcpuNum": 48 579 | }, 580 | { 581 | "_defaultOrder": 34, 582 | "_isFastLaunch": false, 583 | "category": "Accelerated computing", 584 | "gpuNum": 1, 585 | "hideHardwareSpecs": false, 586 | 
"memoryGiB": 256, 587 | "name": "ml.g4dn.16xlarge", 588 | "vcpuNum": 64 589 | }, 590 | { 591 | "_defaultOrder": 35, 592 | "_isFastLaunch": false, 593 | "category": "Accelerated computing", 594 | "gpuNum": 1, 595 | "hideHardwareSpecs": false, 596 | "memoryGiB": 61, 597 | "name": "ml.p3.2xlarge", 598 | "vcpuNum": 8 599 | }, 600 | { 601 | "_defaultOrder": 36, 602 | "_isFastLaunch": false, 603 | "category": "Accelerated computing", 604 | "gpuNum": 4, 605 | "hideHardwareSpecs": false, 606 | "memoryGiB": 244, 607 | "name": "ml.p3.8xlarge", 608 | "vcpuNum": 32 609 | }, 610 | { 611 | "_defaultOrder": 37, 612 | "_isFastLaunch": false, 613 | "category": "Accelerated computing", 614 | "gpuNum": 8, 615 | "hideHardwareSpecs": false, 616 | "memoryGiB": 488, 617 | "name": "ml.p3.16xlarge", 618 | "vcpuNum": 64 619 | }, 620 | { 621 | "_defaultOrder": 38, 622 | "_isFastLaunch": false, 623 | "category": "Accelerated computing", 624 | "gpuNum": 8, 625 | "hideHardwareSpecs": false, 626 | "memoryGiB": 768, 627 | "name": "ml.p3dn.24xlarge", 628 | "vcpuNum": 96 629 | }, 630 | { 631 | "_defaultOrder": 39, 632 | "_isFastLaunch": false, 633 | "category": "Memory Optimized", 634 | "gpuNum": 0, 635 | "hideHardwareSpecs": false, 636 | "memoryGiB": 16, 637 | "name": "ml.r5.large", 638 | "vcpuNum": 2 639 | }, 640 | { 641 | "_defaultOrder": 40, 642 | "_isFastLaunch": false, 643 | "category": "Memory Optimized", 644 | "gpuNum": 0, 645 | "hideHardwareSpecs": false, 646 | "memoryGiB": 32, 647 | "name": "ml.r5.xlarge", 648 | "vcpuNum": 4 649 | }, 650 | { 651 | "_defaultOrder": 41, 652 | "_isFastLaunch": false, 653 | "category": "Memory Optimized", 654 | "gpuNum": 0, 655 | "hideHardwareSpecs": false, 656 | "memoryGiB": 64, 657 | "name": "ml.r5.2xlarge", 658 | "vcpuNum": 8 659 | }, 660 | { 661 | "_defaultOrder": 42, 662 | "_isFastLaunch": false, 663 | "category": "Memory Optimized", 664 | "gpuNum": 0, 665 | "hideHardwareSpecs": false, 666 | "memoryGiB": 128, 667 | "name": "ml.r5.4xlarge", 668 | 
"vcpuNum": 16 669 | }, 670 | { 671 | "_defaultOrder": 43, 672 | "_isFastLaunch": false, 673 | "category": "Memory Optimized", 674 | "gpuNum": 0, 675 | "hideHardwareSpecs": false, 676 | "memoryGiB": 256, 677 | "name": "ml.r5.8xlarge", 678 | "vcpuNum": 32 679 | }, 680 | { 681 | "_defaultOrder": 44, 682 | "_isFastLaunch": false, 683 | "category": "Memory Optimized", 684 | "gpuNum": 0, 685 | "hideHardwareSpecs": false, 686 | "memoryGiB": 384, 687 | "name": "ml.r5.12xlarge", 688 | "vcpuNum": 48 689 | }, 690 | { 691 | "_defaultOrder": 45, 692 | "_isFastLaunch": false, 693 | "category": "Memory Optimized", 694 | "gpuNum": 0, 695 | "hideHardwareSpecs": false, 696 | "memoryGiB": 512, 697 | "name": "ml.r5.16xlarge", 698 | "vcpuNum": 64 699 | }, 700 | { 701 | "_defaultOrder": 46, 702 | "_isFastLaunch": false, 703 | "category": "Memory Optimized", 704 | "gpuNum": 0, 705 | "hideHardwareSpecs": false, 706 | "memoryGiB": 768, 707 | "name": "ml.r5.24xlarge", 708 | "vcpuNum": 96 709 | }, 710 | { 711 | "_defaultOrder": 47, 712 | "_isFastLaunch": false, 713 | "category": "Accelerated computing", 714 | "gpuNum": 1, 715 | "hideHardwareSpecs": false, 716 | "memoryGiB": 16, 717 | "name": "ml.g5.xlarge", 718 | "vcpuNum": 4 719 | }, 720 | { 721 | "_defaultOrder": 48, 722 | "_isFastLaunch": false, 723 | "category": "Accelerated computing", 724 | "gpuNum": 1, 725 | "hideHardwareSpecs": false, 726 | "memoryGiB": 32, 727 | "name": "ml.g5.2xlarge", 728 | "vcpuNum": 8 729 | }, 730 | { 731 | "_defaultOrder": 49, 732 | "_isFastLaunch": false, 733 | "category": "Accelerated computing", 734 | "gpuNum": 1, 735 | "hideHardwareSpecs": false, 736 | "memoryGiB": 64, 737 | "name": "ml.g5.4xlarge", 738 | "vcpuNum": 16 739 | }, 740 | { 741 | "_defaultOrder": 50, 742 | "_isFastLaunch": false, 743 | "category": "Accelerated computing", 744 | "gpuNum": 1, 745 | "hideHardwareSpecs": false, 746 | "memoryGiB": 128, 747 | "name": "ml.g5.8xlarge", 748 | "vcpuNum": 32 749 | }, 750 | { 751 | "_defaultOrder": 51, 752 
| "_isFastLaunch": false, 753 | "category": "Accelerated computing", 754 | "gpuNum": 1, 755 | "hideHardwareSpecs": false, 756 | "memoryGiB": 256, 757 | "name": "ml.g5.16xlarge", 758 | "vcpuNum": 64 759 | }, 760 | { 761 | "_defaultOrder": 52, 762 | "_isFastLaunch": false, 763 | "category": "Accelerated computing", 764 | "gpuNum": 4, 765 | "hideHardwareSpecs": false, 766 | "memoryGiB": 192, 767 | "name": "ml.g5.12xlarge", 768 | "vcpuNum": 48 769 | }, 770 | { 771 | "_defaultOrder": 53, 772 | "_isFastLaunch": false, 773 | "category": "Accelerated computing", 774 | "gpuNum": 4, 775 | "hideHardwareSpecs": false, 776 | "memoryGiB": 384, 777 | "name": "ml.g5.24xlarge", 778 | "vcpuNum": 96 779 | }, 780 | { 781 | "_defaultOrder": 54, 782 | "_isFastLaunch": false, 783 | "category": "Accelerated computing", 784 | "gpuNum": 8, 785 | "hideHardwareSpecs": false, 786 | "memoryGiB": 768, 787 | "name": "ml.g5.48xlarge", 788 | "vcpuNum": 192 789 | } 790 | ], 791 | "instance_type": "ml.t3.medium", 792 | "kernelspec": { 793 | "display_name": "conda_mxnet_p38", 794 | "language": "python", 795 | "name": "conda_mxnet_p38" 796 | }, 797 | "language_info": { 798 | "codemirror_mode": { 799 | "name": "ipython", 800 | "version": 3 801 | }, 802 | "file_extension": ".py", 803 | "mimetype": "text/x-python", 804 | "name": "python", 805 | "nbconvert_exporter": "python", 806 | "pygments_lexer": "ipython3", 807 | "version": "3.8.15" 808 | } 809 | }, 810 | "nbformat": 4, 811 | "nbformat_minor": 5 812 | } 813 | -------------------------------------------------------------------------------- /preprocess-multimodal-data/medical-imaging/preprocess-imaging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "a4094068-1e01-41da-a9a7-15ee28c0022d", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "# Read and preprocess imaging data from S3 and store features in SageMaker FeatureStore\n" 10 | ] 11 | }, 12 | { 
13 | "cell_type": "code", 14 | "execution_count": null, 15 | "id": "ff045821-8bf7-46f8-8186-08d9e013875a", 16 | "metadata": { 17 | "tags": [] 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "import boto3\n", 22 | "import numpy as np\n", 23 | "import pandas as pd\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import io, os\n", 26 | "from time import gmtime, strftime, sleep\n", 27 | "import time\n", 28 | "import sagemaker\n", 29 | "from sagemaker.session import Session\n", 30 | "from sagemaker import get_execution_role\n", 31 | "from sagemaker.feature_store.feature_group import FeatureGroup" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "3a0ce6f0-eb7e-4420-b994-843f436ff660", 37 | "metadata": {}, 38 | "source": [ 39 | "\n", 40 | "## Set up SageMaker FeatureStore\n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "id": "b044786e-8c94-4755-b84f-8941c7e4f05d", 47 | "metadata": { 48 | "tags": [] 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "region = boto3.Session().region_name\n", 53 | "\n", 54 | "boto_session = boto3.Session(region_name=region)\n", 55 | "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n", 56 | "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n", 57 | "\n", 58 | "feature_store_session = Session(\n", 59 | " boto_session=boto_session,\n", 60 | " sagemaker_client=sagemaker_client,\n", 61 | " sagemaker_featurestore_runtime_client=featurestore_runtime\n", 62 | ")\n", 63 | "\n", 64 | "role = get_execution_role()\n", 65 | "s3_client = boto3.client('s3', region_name=region)\n", 66 | "\n", 67 | "default_s3_bucket_name = feature_store_session.default_bucket()\n", 68 | "prefix = 'sagemaker-featurestore-demo'" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "9211f510-8be4-4877-b4f5-fa778aae298a", 74 | "metadata": {}, 75 | "source": [ 76 | "\n", 77 | "## Get data from S3\n" 78 | ] 79 | }, 80 
| { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "56d71658-97bf-42a8-882b-38d4d2ccbc28", 84 | "metadata": { 85 | "tags": [] 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "# Get data from S3 \n", 90 | "bucket_imag = 'guidance-multimodal-hcls-healthai-machinelearning/preprocessing'\n", 91 | "#bucket_imag = \n", 92 | "\n", 93 | "# imaging data \n", 94 | "data_key_imag = 'final_imaging_df.csv'\n", 95 | "#data_key_imag = \n", 96 | "\n", 97 | "data_location_imag = 's3://{}/{}'.format(bucket_imag, data_key_imag)\n", 98 | "data_imaging = pd.read_csv(data_location_imag)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "id": "7744c961-807f-4451-b6ba-1fa24025e652", 104 | "metadata": {}, 105 | "source": [ 106 | "\n", 107 | "## Preprocess Data\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "id": "4bd20e05-596b-410d-a027-ddf2eb70678f", 114 | "metadata": { 115 | "tags": [] 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "# Replacing NaN with zeros\n", 120 | "data_imaging_1 = data_imaging.copy()\n", 121 | "data_imaging_1 = data_imaging_1.replace(np.nan, 0)\n", 122 | "data_imaging_1 = data_imaging_1.astype(str)\n", 123 | "\n", 124 | "#Converting all diagnosis codes to a set\n", 125 | "data_imaging_1[['alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction']] = '0'\n", 126 | "data_imaging_pred = data_imaging_1.copy()\n", 127 | "for i in range(len(data_imaging_1)):\n", 128 | " data_imaging_pred.loc[i, 'diagnosisCode'] = set(data_imaging_pred.loc[i, 'diagnosisCode'].replace('\\'', '').replace(' ', '').replace('{', '').replace('}', '').split(','))\n", 129 | "\n", 130 | "# Adding a column for prediction of Alzheimer's disease code '26929004'\n", 131 | "# Adding a column for prediction of Coronary heart disease '53741008'\n", 132 | "# Adding a column for prediction of Stroke code '230690007'\n", 133 | "# Adding a column for prediction of 
Hypertension code '59621000'\n", 134 | "for i in range(len(data_imaging_pred)):\n", 135 | " if \"26929004\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n", 136 | " data_imaging_pred.loc[i, 'alzheimers_prediction'] = '1'\n", 137 | " if \"53741008\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n", 138 | " data_imaging_pred.loc[i, 'coronary_heart_disease_prediction'] = '1'\n", 139 | " if \"230690007\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n", 140 | " data_imaging_pred.loc[i, 'stroke_prediction'] = '1'\n", 141 | " if \"59621000\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n", 142 | " data_imaging_pred.loc[i, 'hypertension_prediction'] = '1'\n", 143 | "print(\"Patients with Alzheimer's disease = \", len(data_imaging_pred[data_imaging_pred['alzheimers_prediction']=='1']))\n", 144 | "print(\"Patients with Coronary Heart disease = \", len(data_imaging_pred[data_imaging_pred['coronary_heart_disease_prediction']=='1']))\n", 145 | "print(\"Patients with Stroke = \", len(data_imaging_pred[data_imaging_pred['stroke_prediction']=='1']))\n", 146 | "print(\"Patients with Hypertension = \", len(data_imaging_pred[data_imaging_pred['hypertension_prediction']=='1']))\n", 147 | "\n", 148 | "# Delete columns with leakage and features irrelevant for model training\n", 149 | "list_delete_cols = ['diagnosisDescription', 'diagnosisCode']\n", 150 | "data_imaging_pred.drop(list_delete_cols, axis=1, inplace=True)\n", 151 | "\n", 152 | "data_imaging_pred.head(10)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "7d80bf0c-bee9-4f46-b05f-496ae95ae0a8", 158 | "metadata": { 159 | "tags": [] 160 | }, 161 | "source": [ 162 | "\n", 163 | "## Ingest data into FeatureStore\n" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "id": "b8f82d22-8ad1-4258-be19-3f5d1dc0f4d5", 170 | "metadata": { 171 | "tags": [] 172 | }, 173 | "outputs": [], 174 | "source": [ 175 | "imaging_feature_group_name = 'imaging-feature-group' \n", 176 | 
"imaging_feature_group = FeatureGroup(name=imaging_feature_group_name, sagemaker_session=feature_store_session)\n", 177 | "\n", 178 | "current_time_sec = int(round(time.time()))\n", 179 | "\n", 180 | "def cast_object_to_string(data_frame):\n", 181 | " for label in data_frame.columns:\n", 182 | " print (label)\n", 183 | " if data_frame.dtypes[label] == 'object':\n", 184 | " data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n", 185 | "\n", 186 | "# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n", 187 | "cast_object_to_string(data_imaging_pred)\n", 188 | "\n", 189 | "# Record identifier and event time feature names\n", 190 | "record_identifier_feature_name = \"patientID\"\n", 191 | "event_time_feature_name = \"EventTime\"\n", 192 | "\n", 193 | "# Append EventTime feature\n", 194 | "data_imaging_pred[event_time_feature_name] = pd.Series([current_time_sec]*len(data_imaging_pred), dtype=\"float64\")\n", 195 | "\n", 196 | "## If event time generates NaN\n", 197 | "data_imaging_pred[event_time_feature_name] = data_imaging_pred[event_time_feature_name].fillna(0)\n", 198 | "\n", 199 | "# Load feature definitions to the feature group. 
SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n", 200 | "imaging_feature_group.load_feature_definitions(data_frame=data_imaging_pred); # output is suppressed\n", 201 | "\n", 202 | "\n", 203 | "def wait_for_feature_group_creation_complete(feature_group):\n", 204 | " status = feature_group.describe().get(\"FeatureGroupStatus\")\n", 205 | " while status == \"Creating\":\n", 206 | " print(\"Waiting for Feature Group Creation\")\n", 207 | " time.sleep(15)\n", 208 | " status = feature_group.describe().get(\"FeatureGroupStatus\")\n", 209 | " if status != \"Created\":\n", 210 | " raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n", 211 | " print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", 212 | "\n", 213 | "imaging_feature_group.create(\n", 214 | " s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n", 215 | " record_identifier_name=record_identifier_feature_name,\n", 216 | " event_time_feature_name=event_time_feature_name,\n", 217 | " role_arn=role,\n", 218 | " enable_online_store=True\n", 219 | ")\n", 220 | "\n", 221 | "wait_for_feature_group_creation_complete(feature_group=imaging_feature_group)\n", 222 | "\n", 223 | "imaging_feature_group.ingest(\n", 224 | " data_frame=data_imaging_pred, max_workers=5, wait=True\n", 225 | ")" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "aa2a8752-7e16-4880-b6e0-32e7882216ba", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | } 236 | ], 237 | "metadata": { 238 | "availableInstances": [ 239 | { 240 | "_defaultOrder": 0, 241 | "_isFastLaunch": true, 242 | "category": "General purpose", 243 | "gpuNum": 0, 244 | "hideHardwareSpecs": false, 245 | "memoryGiB": 4, 246 | "name": "ml.t3.medium", 247 | "vcpuNum": 2 248 | }, 249 | { 250 | "_defaultOrder": 1, 251 | "_isFastLaunch": false, 252 | "category": "General purpose", 253 | "gpuNum": 0, 254 | "hideHardwareSpecs": false, 255 | 
"memoryGiB": 8, 256 | "name": "ml.t3.large", 257 | "vcpuNum": 2 258 | }, 259 | { 260 | "_defaultOrder": 2, 261 | "_isFastLaunch": false, 262 | "category": "General purpose", 263 | "gpuNum": 0, 264 | "hideHardwareSpecs": false, 265 | "memoryGiB": 16, 266 | "name": "ml.t3.xlarge", 267 | "vcpuNum": 4 268 | }, 269 | { 270 | "_defaultOrder": 3, 271 | "_isFastLaunch": false, 272 | "category": "General purpose", 273 | "gpuNum": 0, 274 | "hideHardwareSpecs": false, 275 | "memoryGiB": 32, 276 | "name": "ml.t3.2xlarge", 277 | "vcpuNum": 8 278 | }, 279 | { 280 | "_defaultOrder": 4, 281 | "_isFastLaunch": true, 282 | "category": "General purpose", 283 | "gpuNum": 0, 284 | "hideHardwareSpecs": false, 285 | "memoryGiB": 8, 286 | "name": "ml.m5.large", 287 | "vcpuNum": 2 288 | }, 289 | { 290 | "_defaultOrder": 5, 291 | "_isFastLaunch": false, 292 | "category": "General purpose", 293 | "gpuNum": 0, 294 | "hideHardwareSpecs": false, 295 | "memoryGiB": 16, 296 | "name": "ml.m5.xlarge", 297 | "vcpuNum": 4 298 | }, 299 | { 300 | "_defaultOrder": 6, 301 | "_isFastLaunch": false, 302 | "category": "General purpose", 303 | "gpuNum": 0, 304 | "hideHardwareSpecs": false, 305 | "memoryGiB": 32, 306 | "name": "ml.m5.2xlarge", 307 | "vcpuNum": 8 308 | }, 309 | { 310 | "_defaultOrder": 7, 311 | "_isFastLaunch": false, 312 | "category": "General purpose", 313 | "gpuNum": 0, 314 | "hideHardwareSpecs": false, 315 | "memoryGiB": 64, 316 | "name": "ml.m5.4xlarge", 317 | "vcpuNum": 16 318 | }, 319 | { 320 | "_defaultOrder": 8, 321 | "_isFastLaunch": false, 322 | "category": "General purpose", 323 | "gpuNum": 0, 324 | "hideHardwareSpecs": false, 325 | "memoryGiB": 128, 326 | "name": "ml.m5.8xlarge", 327 | "vcpuNum": 32 328 | }, 329 | { 330 | "_defaultOrder": 9, 331 | "_isFastLaunch": false, 332 | "category": "General purpose", 333 | "gpuNum": 0, 334 | "hideHardwareSpecs": false, 335 | "memoryGiB": 192, 336 | "name": "ml.m5.12xlarge", 337 | "vcpuNum": 48 338 | }, 339 | { 340 | "_defaultOrder": 10, 341 
| "_isFastLaunch": false, 342 | "category": "General purpose", 343 | "gpuNum": 0, 344 | "hideHardwareSpecs": false, 345 | "memoryGiB": 256, 346 | "name": "ml.m5.16xlarge", 347 | "vcpuNum": 64 348 | }, 349 | { 350 | "_defaultOrder": 11, 351 | "_isFastLaunch": false, 352 | "category": "General purpose", 353 | "gpuNum": 0, 354 | "hideHardwareSpecs": false, 355 | "memoryGiB": 384, 356 | "name": "ml.m5.24xlarge", 357 | "vcpuNum": 96 358 | }, 359 | { 360 | "_defaultOrder": 12, 361 | "_isFastLaunch": false, 362 | "category": "General purpose", 363 | "gpuNum": 0, 364 | "hideHardwareSpecs": false, 365 | "memoryGiB": 8, 366 | "name": "ml.m5d.large", 367 | "vcpuNum": 2 368 | }, 369 | { 370 | "_defaultOrder": 13, 371 | "_isFastLaunch": false, 372 | "category": "General purpose", 373 | "gpuNum": 0, 374 | "hideHardwareSpecs": false, 375 | "memoryGiB": 16, 376 | "name": "ml.m5d.xlarge", 377 | "vcpuNum": 4 378 | }, 379 | { 380 | "_defaultOrder": 14, 381 | "_isFastLaunch": false, 382 | "category": "General purpose", 383 | "gpuNum": 0, 384 | "hideHardwareSpecs": false, 385 | "memoryGiB": 32, 386 | "name": "ml.m5d.2xlarge", 387 | "vcpuNum": 8 388 | }, 389 | { 390 | "_defaultOrder": 15, 391 | "_isFastLaunch": false, 392 | "category": "General purpose", 393 | "gpuNum": 0, 394 | "hideHardwareSpecs": false, 395 | "memoryGiB": 64, 396 | "name": "ml.m5d.4xlarge", 397 | "vcpuNum": 16 398 | }, 399 | { 400 | "_defaultOrder": 16, 401 | "_isFastLaunch": false, 402 | "category": "General purpose", 403 | "gpuNum": 0, 404 | "hideHardwareSpecs": false, 405 | "memoryGiB": 128, 406 | "name": "ml.m5d.8xlarge", 407 | "vcpuNum": 32 408 | }, 409 | { 410 | "_defaultOrder": 17, 411 | "_isFastLaunch": false, 412 | "category": "General purpose", 413 | "gpuNum": 0, 414 | "hideHardwareSpecs": false, 415 | "memoryGiB": 192, 416 | "name": "ml.m5d.12xlarge", 417 | "vcpuNum": 48 418 | }, 419 | { 420 | "_defaultOrder": 18, 421 | "_isFastLaunch": false, 422 | "category": "General purpose", 423 | "gpuNum": 0, 424 | 
"hideHardwareSpecs": false, 425 | "memoryGiB": 256, 426 | "name": "ml.m5d.16xlarge", 427 | "vcpuNum": 64 428 | }, 429 | { 430 | "_defaultOrder": 19, 431 | "_isFastLaunch": false, 432 | "category": "General purpose", 433 | "gpuNum": 0, 434 | "hideHardwareSpecs": false, 435 | "memoryGiB": 384, 436 | "name": "ml.m5d.24xlarge", 437 | "vcpuNum": 96 438 | }, 439 | { 440 | "_defaultOrder": 20, 441 | "_isFastLaunch": false, 442 | "category": "General purpose", 443 | "gpuNum": 0, 444 | "hideHardwareSpecs": true, 445 | "memoryGiB": 0, 446 | "name": "ml.geospatial.interactive", 447 | "supportedImageNames": [ 448 | "sagemaker-geospatial-v1-0" 449 | ], 450 | "vcpuNum": 0 451 | }, 452 | { 453 | "_defaultOrder": 21, 454 | "_isFastLaunch": true, 455 | "category": "Compute optimized", 456 | "gpuNum": 0, 457 | "hideHardwareSpecs": false, 458 | "memoryGiB": 4, 459 | "name": "ml.c5.large", 460 | "vcpuNum": 2 461 | }, 462 | { 463 | "_defaultOrder": 22, 464 | "_isFastLaunch": false, 465 | "category": "Compute optimized", 466 | "gpuNum": 0, 467 | "hideHardwareSpecs": false, 468 | "memoryGiB": 8, 469 | "name": "ml.c5.xlarge", 470 | "vcpuNum": 4 471 | }, 472 | { 473 | "_defaultOrder": 23, 474 | "_isFastLaunch": false, 475 | "category": "Compute optimized", 476 | "gpuNum": 0, 477 | "hideHardwareSpecs": false, 478 | "memoryGiB": 16, 479 | "name": "ml.c5.2xlarge", 480 | "vcpuNum": 8 481 | }, 482 | { 483 | "_defaultOrder": 24, 484 | "_isFastLaunch": false, 485 | "category": "Compute optimized", 486 | "gpuNum": 0, 487 | "hideHardwareSpecs": false, 488 | "memoryGiB": 32, 489 | "name": "ml.c5.4xlarge", 490 | "vcpuNum": 16 491 | }, 492 | { 493 | "_defaultOrder": 25, 494 | "_isFastLaunch": false, 495 | "category": "Compute optimized", 496 | "gpuNum": 0, 497 | "hideHardwareSpecs": false, 498 | "memoryGiB": 72, 499 | "name": "ml.c5.9xlarge", 500 | "vcpuNum": 36 501 | }, 502 | { 503 | "_defaultOrder": 26, 504 | "_isFastLaunch": false, 505 | "category": "Compute optimized", 506 | "gpuNum": 0, 507 | 
"hideHardwareSpecs": false, 508 | "memoryGiB": 96, 509 | "name": "ml.c5.12xlarge", 510 | "vcpuNum": 48 511 | }, 512 | { 513 | "_defaultOrder": 27, 514 | "_isFastLaunch": false, 515 | "category": "Compute optimized", 516 | "gpuNum": 0, 517 | "hideHardwareSpecs": false, 518 | "memoryGiB": 144, 519 | "name": "ml.c5.18xlarge", 520 | "vcpuNum": 72 521 | }, 522 | { 523 | "_defaultOrder": 28, 524 | "_isFastLaunch": false, 525 | "category": "Compute optimized", 526 | "gpuNum": 0, 527 | "hideHardwareSpecs": false, 528 | "memoryGiB": 192, 529 | "name": "ml.c5.24xlarge", 530 | "vcpuNum": 96 531 | }, 532 | { 533 | "_defaultOrder": 29, 534 | "_isFastLaunch": true, 535 | "category": "Accelerated computing", 536 | "gpuNum": 1, 537 | "hideHardwareSpecs": false, 538 | "memoryGiB": 16, 539 | "name": "ml.g4dn.xlarge", 540 | "vcpuNum": 4 541 | }, 542 | { 543 | "_defaultOrder": 30, 544 | "_isFastLaunch": false, 545 | "category": "Accelerated computing", 546 | "gpuNum": 1, 547 | "hideHardwareSpecs": false, 548 | "memoryGiB": 32, 549 | "name": "ml.g4dn.2xlarge", 550 | "vcpuNum": 8 551 | }, 552 | { 553 | "_defaultOrder": 31, 554 | "_isFastLaunch": false, 555 | "category": "Accelerated computing", 556 | "gpuNum": 1, 557 | "hideHardwareSpecs": false, 558 | "memoryGiB": 64, 559 | "name": "ml.g4dn.4xlarge", 560 | "vcpuNum": 16 561 | }, 562 | { 563 | "_defaultOrder": 32, 564 | "_isFastLaunch": false, 565 | "category": "Accelerated computing", 566 | "gpuNum": 1, 567 | "hideHardwareSpecs": false, 568 | "memoryGiB": 128, 569 | "name": "ml.g4dn.8xlarge", 570 | "vcpuNum": 32 571 | }, 572 | { 573 | "_defaultOrder": 33, 574 | "_isFastLaunch": false, 575 | "category": "Accelerated computing", 576 | "gpuNum": 4, 577 | "hideHardwareSpecs": false, 578 | "memoryGiB": 192, 579 | "name": "ml.g4dn.12xlarge", 580 | "vcpuNum": 48 581 | }, 582 | { 583 | "_defaultOrder": 34, 584 | "_isFastLaunch": false, 585 | "category": "Accelerated computing", 586 | "gpuNum": 1, 587 | "hideHardwareSpecs": false, 588 | 
"memoryGiB": 256, 589 | "name": "ml.g4dn.16xlarge", 590 | "vcpuNum": 64 591 | }, 592 | { 593 | "_defaultOrder": 35, 594 | "_isFastLaunch": false, 595 | "category": "Accelerated computing", 596 | "gpuNum": 1, 597 | "hideHardwareSpecs": false, 598 | "memoryGiB": 61, 599 | "name": "ml.p3.2xlarge", 600 | "vcpuNum": 8 601 | }, 602 | { 603 | "_defaultOrder": 36, 604 | "_isFastLaunch": false, 605 | "category": "Accelerated computing", 606 | "gpuNum": 4, 607 | "hideHardwareSpecs": false, 608 | "memoryGiB": 244, 609 | "name": "ml.p3.8xlarge", 610 | "vcpuNum": 32 611 | }, 612 | { 613 | "_defaultOrder": 37, 614 | "_isFastLaunch": false, 615 | "category": "Accelerated computing", 616 | "gpuNum": 8, 617 | "hideHardwareSpecs": false, 618 | "memoryGiB": 488, 619 | "name": "ml.p3.16xlarge", 620 | "vcpuNum": 64 621 | }, 622 | { 623 | "_defaultOrder": 38, 624 | "_isFastLaunch": false, 625 | "category": "Accelerated computing", 626 | "gpuNum": 8, 627 | "hideHardwareSpecs": false, 628 | "memoryGiB": 768, 629 | "name": "ml.p3dn.24xlarge", 630 | "vcpuNum": 96 631 | }, 632 | { 633 | "_defaultOrder": 39, 634 | "_isFastLaunch": false, 635 | "category": "Memory Optimized", 636 | "gpuNum": 0, 637 | "hideHardwareSpecs": false, 638 | "memoryGiB": 16, 639 | "name": "ml.r5.large", 640 | "vcpuNum": 2 641 | }, 642 | { 643 | "_defaultOrder": 40, 644 | "_isFastLaunch": false, 645 | "category": "Memory Optimized", 646 | "gpuNum": 0, 647 | "hideHardwareSpecs": false, 648 | "memoryGiB": 32, 649 | "name": "ml.r5.xlarge", 650 | "vcpuNum": 4 651 | }, 652 | { 653 | "_defaultOrder": 41, 654 | "_isFastLaunch": false, 655 | "category": "Memory Optimized", 656 | "gpuNum": 0, 657 | "hideHardwareSpecs": false, 658 | "memoryGiB": 64, 659 | "name": "ml.r5.2xlarge", 660 | "vcpuNum": 8 661 | }, 662 | { 663 | "_defaultOrder": 42, 664 | "_isFastLaunch": false, 665 | "category": "Memory Optimized", 666 | "gpuNum": 0, 667 | "hideHardwareSpecs": false, 668 | "memoryGiB": 128, 669 | "name": "ml.r5.4xlarge", 670 | 
"vcpuNum": 16 671 | }, 672 | { 673 | "_defaultOrder": 43, 674 | "_isFastLaunch": false, 675 | "category": "Memory Optimized", 676 | "gpuNum": 0, 677 | "hideHardwareSpecs": false, 678 | "memoryGiB": 256, 679 | "name": "ml.r5.8xlarge", 680 | "vcpuNum": 32 681 | }, 682 | { 683 | "_defaultOrder": 44, 684 | "_isFastLaunch": false, 685 | "category": "Memory Optimized", 686 | "gpuNum": 0, 687 | "hideHardwareSpecs": false, 688 | "memoryGiB": 384, 689 | "name": "ml.r5.12xlarge", 690 | "vcpuNum": 48 691 | }, 692 | { 693 | "_defaultOrder": 45, 694 | "_isFastLaunch": false, 695 | "category": "Memory Optimized", 696 | "gpuNum": 0, 697 | "hideHardwareSpecs": false, 698 | "memoryGiB": 512, 699 | "name": "ml.r5.16xlarge", 700 | "vcpuNum": 64 701 | }, 702 | { 703 | "_defaultOrder": 46, 704 | "_isFastLaunch": false, 705 | "category": "Memory Optimized", 706 | "gpuNum": 0, 707 | "hideHardwareSpecs": false, 708 | "memoryGiB": 768, 709 | "name": "ml.r5.24xlarge", 710 | "vcpuNum": 96 711 | }, 712 | { 713 | "_defaultOrder": 47, 714 | "_isFastLaunch": false, 715 | "category": "Accelerated computing", 716 | "gpuNum": 1, 717 | "hideHardwareSpecs": false, 718 | "memoryGiB": 16, 719 | "name": "ml.g5.xlarge", 720 | "vcpuNum": 4 721 | }, 722 | { 723 | "_defaultOrder": 48, 724 | "_isFastLaunch": false, 725 | "category": "Accelerated computing", 726 | "gpuNum": 1, 727 | "hideHardwareSpecs": false, 728 | "memoryGiB": 32, 729 | "name": "ml.g5.2xlarge", 730 | "vcpuNum": 8 731 | }, 732 | { 733 | "_defaultOrder": 49, 734 | "_isFastLaunch": false, 735 | "category": "Accelerated computing", 736 | "gpuNum": 1, 737 | "hideHardwareSpecs": false, 738 | "memoryGiB": 64, 739 | "name": "ml.g5.4xlarge", 740 | "vcpuNum": 16 741 | }, 742 | { 743 | "_defaultOrder": 50, 744 | "_isFastLaunch": false, 745 | "category": "Accelerated computing", 746 | "gpuNum": 1, 747 | "hideHardwareSpecs": false, 748 | "memoryGiB": 128, 749 | "name": "ml.g5.8xlarge", 750 | "vcpuNum": 32 751 | }, 752 | { 753 | "_defaultOrder": 51, 754 
| "_isFastLaunch": false, 755 | "category": "Accelerated computing", 756 | "gpuNum": 1, 757 | "hideHardwareSpecs": false, 758 | "memoryGiB": 256, 759 | "name": "ml.g5.16xlarge", 760 | "vcpuNum": 64 761 | }, 762 | { 763 | "_defaultOrder": 52, 764 | "_isFastLaunch": false, 765 | "category": "Accelerated computing", 766 | "gpuNum": 4, 767 | "hideHardwareSpecs": false, 768 | "memoryGiB": 192, 769 | "name": "ml.g5.12xlarge", 770 | "vcpuNum": 48 771 | }, 772 | { 773 | "_defaultOrder": 53, 774 | "_isFastLaunch": false, 775 | "category": "Accelerated computing", 776 | "gpuNum": 4, 777 | "hideHardwareSpecs": false, 778 | "memoryGiB": 384, 779 | "name": "ml.g5.24xlarge", 780 | "vcpuNum": 96 781 | }, 782 | { 783 | "_defaultOrder": 54, 784 | "_isFastLaunch": false, 785 | "category": "Accelerated computing", 786 | "gpuNum": 8, 787 | "hideHardwareSpecs": false, 788 | "memoryGiB": 768, 789 | "name": "ml.g5.48xlarge", 790 | "vcpuNum": 192 791 | } 792 | ], 793 | "instance_type": "ml.t3.medium", 794 | "kernelspec": { 795 | "display_name": "conda_mxnet_p38", 796 | "language": "python", 797 | "name": "conda_mxnet_p38" 798 | }, 799 | "language_info": { 800 | "codemirror_mode": { 801 | "name": "ipython", 802 | "version": 3 803 | }, 804 | "file_extension": ".py", 805 | "mimetype": "text/x-python", 806 | "name": "python", 807 | "nbconvert_exporter": "python", 808 | "pygments_lexer": "ipython3", 809 | "version": "3.8.15" 810 | } 811 | }, 812 | "nbformat": 4, 813 | "nbformat_minor": 5 814 | } 815 | -------------------------------------------------------------------------------- /preprocess-multimodal-data/medical-imaging/src/Api.py: -------------------------------------------------------------------------------- 1 | import array 2 | import pydicom 3 | from pydicom.sequence import Sequence 4 | from pydicom import Dataset , DataElement 5 | from pydicom.dataset import FileMetaDataset 6 | from pydicom.uid import UID 7 | import json 8 | import logging 9 | import importlib 10 | import boto3 11 | 
from openjpeg import decode 12 | import io 13 | import sys 14 | import time 15 | import os 16 | import gzip 17 | 18 | logging.basicConfig( level="INFO" ) 19 | 20 | class MedicalImaging: 21 | def __init__(self, endpoint=""): 22 | session = boto3.Session() 23 | if len(endpoint)>1: 24 | self.client = boto3.client('medical-imaging', endpoint_url=endpoint) 25 | else: 26 | self.client = boto3.client('medical-imaging') 27 | 28 | def stopwatch(self, start_time, end_time): 29 | time_lapsed = end_time - start_time 30 | return time_lapsed*1000 31 | 32 | 33 | def getMetadata(self, datastoreId, imageSetId): 34 | start_time = time.time() 35 | dicom_study_metadata = self.client.get_image_set_metadata(datastoreId=datastoreId , imageSetId=imageSetId ) 36 | json_study_metadata = json.loads( gzip.decompress(dicom_study_metadata["imageSetMetadataBlob"].read()) ) 37 | end_time = time.time() 38 | logging.info(f"Metadata fetch : {self.stopwatch(start_time,end_time)} ms") 39 | return json_study_metadata 40 | 41 | 42 | def listDatastores(self): 43 | start_time = time.time() 44 | response = self.client.list_datastores() 45 | end_time = time.time() 46 | logging.info(f"List Datastores : {self.stopwatch(start_time,end_time)} ms") 47 | return response 48 | 49 | 50 | def createDatastore(self, datastoreName): 51 | start_time = time.time() 52 | response = self.client.create_datastore(datastoreName=datastoreName) 53 | end_time = time.time() 54 | logging.info(f"Create Datastore : {self.stopwatch(start_time,end_time)} ms") 55 | return response 56 | 57 | 58 | def getDatastore(self, datastoreId): 59 | start_time = time.time() 60 | response = self.client.get_datastore(datastoreId=datastoreId) 61 | end_time = time.time() 62 | logging.info(f"Get Datastore : {self.stopwatch(start_time,end_time)} ms") 63 | return response 64 | 65 | 66 | def deleteDatastore(self, datastoreId): 67 | start_time = time.time() 68 | response = self.client.delete_datastore(datastoreId=datastoreId) 69 | end_time = time.time() 70 | 
logging.info(f"Delete Datastore : {self.stopwatch(start_time,end_time)} ms") 71 | return response 72 | 73 | 74 | def startImportJob(self, datastoreId, IamRoleArn, inputS3, outputS3): 75 | start_time = time.time() 76 | response = self.client.start_dicom_import_job( 77 | datastoreId=datastoreId, 78 | dataAccessRoleArn = IamRoleArn, 79 | inputS3Uri = inputS3, 80 | outputS3Uri = outputS3, 81 | clientToken = "demoClient" 82 | ) 83 | end_time = time.time() 84 | logging.info(f"Start Import Job : {self.stopwatch(start_time,end_time)} ms") 85 | return response 86 | 87 | 88 | def getImportJob(self, datastoreId, jobId): 89 | start_time = time.time() 90 | response = self.client.get_dicom_import_job(datastoreId=datastoreId, jobId=jobId) 91 | end_time = time.time() 92 | logging.info(f"Get Import Job : {self.stopwatch(start_time,end_time)} ms") 93 | return response 94 | 95 | 96 | def getFramePixels(self, datastoreId, imageSetId, imageFrameId): 97 | start_time = time.time() 98 | res = self.client.get_image_frame( 99 | datastoreId=datastoreId, 100 | imageSetId=imageSetId, 101 | imageFrameInformation={ 102 | 'imageFrameId': imageFrameId 103 | }) 104 | end_time = time.time() 105 | logging.debug(f"Frame fetch : {self.stopwatch(start_time,end_time)} ms") 106 | start_time = time.time() 107 | b = io.BytesIO() 108 | b.write(res['imageFrameBlob'].read()) 109 | b.seek(0) 110 | d = decode(b) 111 | end_time = time.time() 112 | logging.debug(f"Frame decode : {self.stopwatch(start_time,end_time)} ms") 113 | return d 114 | 115 | def getDICOMdataset(self, datastoreId, imageSetId): 116 | logging.debug("Reading the JSON metadata file") 117 | json_dicom_header = self.getMetadata(datastoreId , imageSetId) 118 | 119 | vrlist = [] 120 | sop_instances = [] 121 | 122 | file_meta = FileMetaDataset() 123 | file_meta.MediaStorageSOPClassUID = UID('1.2.840.10008.5.1.4.1.1.1') ## Media Storage SOP Class UID, e.g. "1.2.840.10008.5.1.4.1.1.88.34" for Comprehensive 3D SR IOD. 
124 | file_meta.MediaStorageSOPInstanceUID = UID("1.3.51.5145.5142.20010109.1105627.1.0.1") 125 | file_meta.ImplementationClassUID = UID("1.2.826.0.1.3680043.9.3811.2.0.1") 126 | file_meta.TransferSyntaxUID = UID('1.2.840.10008.1.2.1') # Explicit VR Little Endian 127 | 128 | logging.debug("Reading the Pixels") 129 | for series in json_dicom_header["Study"]["Series"]: 130 | for instances in json_dicom_header["Study"]["Series"][series]["Instances"]: 131 | ds = Dataset() 132 | ds.file_meta = file_meta 133 | 134 | PatientLevel = json_dicom_header["Patient"]["DICOM"] 135 | self.getTags(PatientLevel, ds, vrlist) 136 | StudyLevel = json_dicom_header["Study"]["DICOM"] 137 | self.getTags(StudyLevel, ds, vrlist) 138 | self.getDICOMVRs(json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOMVRs"] , vrlist) 139 | self.getTags( json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOM"] , ds, vrlist) 140 | self.getTags(json_dicom_header["Study"]["Series"][series]["DICOM"], ds, vrlist) 141 | 142 | ds.file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRLittleEndian 143 | ds.file_meta.MediaStorageSOPInstanceUID = UID(instances) 144 | ds.is_little_endian = True 145 | ds.is_implicit_VR = False 146 | 147 | frameId = json_dicom_header["Study"]["Series"][series]["Instances"][instances]["ImageFrames"][0]["ID"] 148 | pixels = self.getFramePixels(datastoreId, json_dicom_header["ImageSetID"], frameId) 149 | 150 | start_time = time.time() 151 | ds.PixelData = pixels.tobytes() 152 | sop_instances.append(ds) 153 | vrlist.clear() 154 | end_time = time.time() 155 | logging.debug(f"Output save : {self.stopwatch(start_time,end_time)} ms") 156 | return sop_instances 157 | 158 | def getDICOMVRs(self, taglevel, vrlist): 159 | for theKey in taglevel: 160 | vrlist.append( [ theKey , taglevel[theKey] ]) 161 | logging.debug(f"[getDICOMVRs] - List of private tags VRs: {vrlist}\r\n") 162 | 163 | 164 | def getTags(self, tagLevel, ds, vrlist): 165 | for theKey in
tagLevel: 166 | if theKey in ['PrivateCreatorID', 'FileMetaInformationVersion', '00291203']: 167 | continue 168 | try: 169 | try: 170 | tagvr = pydicom.datadict.dictionary_VR(theKey) 171 | except: # If the VR is not in the pydicom dictionary, the key may be a private tag listed in vrlist 172 | tagvr = None 173 | for vr in vrlist: 174 | if theKey == vr[0]: 175 | tagvr = vr[1] 176 | datavalue=tagLevel[theKey] 177 | #print(f"{tagvr} {theKey} : {datavalue}") 178 | if(tagvr == 'SQ'): 179 | logging.debug(f"{theKey} : {tagLevel[theKey]} , {vrlist}") 180 | seqs = [] 181 | for underSeq in tagLevel[theKey]: 182 | seqds = Dataset() 183 | self.getTags(underSeq, seqds, vrlist) 184 | seqs.append(seqds) 185 | datavalue = Sequence(seqs) 186 | continue 187 | if(tagvr == 'US or SS'): 188 | datavalue=tagLevel[theKey] 189 | if (int(datavalue) > 32767): 190 | tagvr = 'US' 191 | if( tagvr == 'OB'): 192 | datavalue = self.getOBVRTagValue(tagLevel[theKey] ) 193 | 194 | data_element = DataElement(theKey , tagvr , datavalue ) 195 | if data_element.tag.group != 2: 196 | try: 197 | if (int(data_element.tag.group) % 2) == 0 : # only even (standard) groups are added; private tags are skipped 198 | ds.add(data_element) 199 | except: 200 | continue 201 | except Exception as err: 202 | logging.warning(f"[HLIDataDICOMizer][getTags] - {err} for Key: {theKey}") 203 | continue 204 | 205 | 206 | 207 | def getOBVRTagValue(self, datalist): 208 | bytevals = [] 209 | for byteval in datalist: 210 | bytevals.append(int(byteval)) 211 | OBArray = bytearray(bytevals) 212 | return bytes(OBArray) 213 | 214 | -------------------------------------------------------------------------------- /preprocess-multimodal-data/medical-imaging/src/radiogenomics-imaging-workflow.json: -------------------------------------------------------------------------------- 1 | { 2 | "StartAt": "iterate_over_subjects", 3 | "States": { 4 | "iterate_over_subjects": { 5 | "ItemsPath": "$.Subject", 6 | "MaxConcurrency": 50, 7 | "Type": "Map", 8 | "Next":
"Finish", 9 | "Iterator": { 10 | "StartAt": "AHI Radiomic Feature Extraction", 11 | "States": { 12 | "Fallback": { 13 | "Type": "Pass", 14 | "Result": "This iteration failed for some reason", 15 | "End": true 16 | }, 17 | "AHI Radiomic Feature Extraction": { 18 | "Type": "Task", 19 | "OutputPath": "$.ProcessingJobArn", 20 | "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync", 21 | "Retry": [ 22 | { 23 | "ErrorEquals": [ 24 | "SageMaker.AmazonSageMakerException" 25 | ], 26 | "IntervalSeconds": 15, 27 | "MaxAttempts": 8, 28 | "BackoffRate": 1.5 29 | } 30 | ], 31 | "Catch": [ 32 | { 33 | "ErrorEquals": [ 34 | "States.TaskFailed" 35 | ], 36 | "Next": "Fallback" 37 | } 38 | ], 39 | "Parameters": { 40 | "ProcessingJobName.$": "$$.Execution.Input['PreprocessingJobName']", 41 | "ProcessingInputs": [ 42 | { 43 | "InputName": "JSON", 44 | "AppManaged": false, 45 | "S3Input": { 46 | "S3Uri.$": "States.Format('##INPUT_DATA_S3URI##/{}' , $)", 47 | "LocalPath": "/opt/ml/processing/input", 48 | "S3DataType": "S3Prefix", 49 | "S3InputMode": "File", 50 | "S3DataDistributionType": "ShardedByS3Key", 51 | "S3CompressionType": "None" 52 | } 53 | } 54 | ], 55 | "ProcessingOutputConfig": { 56 | "Outputs": [ 57 | { 58 | "OutputName": "radiomicsfeature", 59 | "AppManaged": false, 60 | "S3Output": { 61 | "S3Uri": "##OUTPUT_DATA_S3URI##", 62 | "LocalPath": "/opt/ml/processing/output/", 63 | "S3UploadMode": "EndOfJob" 64 | } 65 | } 66 | ] 67 | }, 68 | "AppSpecification": { 69 | "ImageUri": "##ECR_IMAGE_URI##", 70 | "ContainerArguments.$": "States.Array('--datastore_id', $, '--feature_store_name', $$.Execution.Input['FeatureStoreName'], '--offline_store_s3uri', $$.Execution.Input['OfflineStoreS3Uri'])", 71 | "ContainerEntrypoint": [ 72 | "python3", 73 | "/opt/ahiradiomics.py" 74 | ] 75 | }, 76 | "RoleArn": "##IAM_ROLE_ARN##", 77 | "ProcessingResources": { 78 | "ClusterConfig": { 79 | "InstanceCount": 10, 80 | "InstanceType": "ml.m5.large", 81 | "VolumeSizeInGB": 5 82 | } 83 | } 
84 | }, 85 | "End": true 86 | } 87 | } 88 | } 89 | }, 90 | "Finish": { 91 | "Type": "Succeed" 92 | } 93 | } 94 | } 95 | 96 | -------------------------------------------------------------------------------- /store-multimodal-data/clinical/README.md: -------------------------------------------------------------------------------- 1 | ## Store and analyze clinical data with Amazon HealthLake 2 | 3 | To get started with storing clinical data, follow the steps in the [Amazon HealthLake getting started guide](https://docs.aws.amazon.com/healthlake/latest/devguide/getting-started.html). 4 | 5 | Log in to your AWS account, search for Amazon HealthLake, and [create an empty Amazon HealthLake datastore](https://docs.aws.amazon.com/healthlake/latest/devguide/create-data-store.html). Provisioning takes about 20 minutes. 6 | 7 | While the datastore is being created, navigate to Amazon S3, create a new bucket to hold the sample clinical data, and then copy the sample FHIR data folder from the public code repo into the S3 bucket in your account. You can run the following CLI command to copy the data: 8 | aws s3 sync s3://guidance-multimodal-hcls-healthai-machinelearning/clinical s3://yournewbucketnamehere/clinical 9 | 10 | Once the datastore has been created, navigate to the datastore in the console and click the "Import" button in the top right. 11 | * On the Import page, press the "Browse S3" button, navigate to the sample data bucket in your account, then select the clinical folder. This folder contains various .ndjson files for the patients. 12 | * For an output file location, you can use the "sandbox-data-" folder to store your output job data. 13 | * We have created a HealthLake KMS key you can use throughout the workshop. Select that key for encryption. 14 | * Under the "Access Permissions" section, create an IAM role with a name of your preference. 15 | * Click the "Import data" button.
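
The console import steps above can also be scripted. The following is a minimal sketch, assuming the boto3 `healthlake` client and its `start_fhir_import_job` call; the datastore ID, output prefix, KMS key ARN, and IAM role ARN below are hypothetical placeholders rather than values from this guidance, so replace them with the ones from your own account.

```python
def build_fhir_import_request(datastore_id, input_s3_uri, output_s3_uri,
                              kms_key_id, role_arn, job_name="clinical-import"):
    """Assemble the keyword arguments for HealthLake's StartFHIRImportJob call."""
    return {
        "JobName": job_name,
        "DatastoreId": datastore_id,
        # Folder that holds the .ndjson patient files copied with `aws s3 sync`
        "InputDataConfig": {"S3Uri": input_s3_uri},
        # Where HealthLake writes the job output, encrypted with the given KMS key
        "JobOutputDataConfig": {
            "S3Configuration": {"S3Uri": output_s3_uri, "KmsKeyId": kms_key_id}
        },
        "DataAccessRoleArn": role_arn,
    }


def start_import(request):
    """Submit the import job; requires AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the request can be built and inspected offline

    healthlake = boto3.client("healthlake")
    return healthlake.start_fhir_import_job(**request)


# All values below are placeholders (assumptions), not real resources.
request = build_fhir_import_request(
    datastore_id="<your-datastore-id>",
    input_s3_uri="s3://yournewbucketnamehere/clinical/",
    output_s3_uri="s3://yournewbucketnamehere/import-output/",
    kms_key_id="<your-kms-key-arn>",
    role_arn="<your-healthlake-import-role-arn>",
)
print(request)
# To submit the job for real, call: start_import(request)
```

Building the request separately from the AWS call keeps the parameters easy to review before anything is submitted, which mirrors checking the fields on the console's Import page before clicking "Import data".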
16 | 17 | 18 | -------------------------------------------------------------------------------- /store-multimodal-data/genomic/store-analyze-genomicdata-with-awshealthomics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f9472cd9", 6 | "metadata": {}, 7 | "source": [] 8 | }, 9 | { 10 | "cell_type": "markdown", 11 | "id": "3c8ef7a7", 12 | "metadata": {}, 13 | "source": [ 14 | "# Create AWS HealthOmics Analytic Stores to Import Genomic Data and Run Queries" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "id": "85d3918d", 20 | "metadata": {}, 21 | "source": [ 22 | "Follow steps in this notebook to:\n", 23 | "1. create AWS HealthOmics Reference, Variant, and Annotation Stores\n", 24 | "2. import reference genome, variant files, and ClinVar annotation file from S3 to the respective data stores\n", 25 | "3. query the variant and annotation data. " 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "id": "aef34065", 31 | "metadata": {}, 32 | "source": [ 33 | "## Prerequisites and package dependencies" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "id": "fee6df5c", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "!pip install awswrangler" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "id": "2dee102e", 50 | "metadata": { 51 | "tags": [] 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "from datetime import datetime\n", 56 | "from pprint import pprint\n", 57 | "import urllib\n", 58 | "\n", 59 | "import boto3\n", 60 | "import botocore.exceptions\n", 61 | "\n", 62 | "from utils import *" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "id": "86a19a6d", 68 | "metadata": {}, 69 | "source": [ 70 | "### Create service role" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "id": "a036d0a8", 77 | "metadata": { 78 | "jupyter": { 79 | 
"source_hidden": true 80 | }, 81 | "tags": [] 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# set a timestamp\n", 86 | "dt_fmt = '%Y%m%dT%H%M%S'\n", 87 | "ts = datetime.now().strftime(dt_fmt)\n", 88 | "\n", 89 | "policy = {\n", 90 | " \"Version\": \"2012-10-17\",\n", 91 | " \"Statement\": [\n", 92 | " {\n", 93 | " \"Effect\": \"Allow\",\n", 94 | " \"Action\": [\n", 95 | " \"omics:*\"\n", 96 | " ],\n", 97 | " \"Resource\": \"*\"\n", 98 | " },\n", 99 | " {\n", 100 | " \"Effect\": \"Allow\",\n", 101 | " \"Action\": [\n", 102 | " \"ram:AcceptResourceShareInvitation\",\n", 103 | " \"ram:GetResourceShareInvitations\"\n", 104 | " ],\n", 105 | " \"Resource\": \"*\"\n", 106 | " },\n", 107 | " {\n", 108 | " \"Effect\": \"Allow\",\n", 109 | " \"Action\": [\n", 110 | " \"s3:GetBucketLocation\",\n", 111 | " \"s3:PutObject\",\n", 112 | " \"s3:GetObject\",\n", 113 | " \"s3:ListBucket\",\n", 114 | " \"s3:AbortMultipartUpload\",\n", 115 | " \"s3:ListMultipartUploadParts\",\n", 116 | " \"s3:GetObjectAcl\",\n", 117 | " \"s3:PutObjectAcl\"\n", 118 | " ],\n", 119 | " \"Resource\": \"*\"\n", 120 | " }\n", 121 | " ]\n", 122 | "}\n", 123 | "\n", 124 | "trust_policy = {\n", 125 | " \"Version\": \"2012-10-17\",\n", 126 | " \"Statement\": [\n", 127 | " {\n", 128 | " \"Effect\": \"Allow\",\n", 129 | " \"Principal\": {\n", 130 | " \"Service\": \"omics.amazonaws.com\"\n", 131 | " },\n", 132 | " \"Action\": \"sts:AssumeRole\"\n", 133 | " }\n", 134 | " ]\n", 135 | "}" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 3, 141 | "id": "9eb174a7", 142 | "metadata": { 143 | "tags": [] 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "# Base name for role and policy\n", 148 | "omics_iam_name = f'multimodal-omics-{ts}'\n", 149 | "create_omics_role(omics_iam_name, policy, trust_policy)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "id": "d7243a7e", 155 | "metadata": {}, 156 | "source": [ 157 | "### Create Omics client" 158 | ] 159 | }, 160 | { 161 | 
"cell_type": "code", 162 | "execution_count": 5, 163 | "id": "c02fcf1f", 164 | "metadata": { 165 | "tags": [] 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "omics = boto3.client('omics', region_name='us-east-1')" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "id": "94b5014a", 175 | "metadata": {}, 176 | "source": [ 177 | "### Source data\n", 178 | "Set the source data bucket to the regional replica of the Synthea Coherent dataset. If you want to use different source data, replace the bucket name here and any S3 URIs where it is used in the rest of the notebook." 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "id": "9bfd290d", 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "SOURCE_BUCKET_NAME = f\"guidance-multimodal-hcls-healthai-machinelearning-{omics.meta.region_name}\"" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "id": "adda409b", 194 | "metadata": {}, 195 | "source": [ 196 | "## Create reference store and import reference genome" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "id": "ec7c48f4", 202 | "metadata": {}, 203 | "source": [ 204 | "### Create reference store " 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 6, 210 | "id": "a3071ab4", 211 | "metadata": { 212 | "tags": [] 213 | }, 214 | "outputs": [ 215 | { 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "Checking for a reference store in region: us-east-1\n", 220 | "Congratulations, you have an existing reference store!\n" 221 | ] 222 | } 223 | ], 224 | "source": [ 225 | "print(f\"Checking for a reference store in region: {omics.meta.region_name}\")\n", 226 | "if get_ref_store_id(omics) == None:\n", 227 | " response = omics.create_reference_store(name='myReferenceStore')\n", 228 | " print(response)\n", 229 | "else:\n", 230 | " print(\"Congratulations, you have an existing reference store!\")" 231 | ] 232 | }, 233 | { 234 | 
"cell_type": "markdown", 235 | "id": "9aee28d4", 236 | "metadata": {}, 237 | "source": [ 238 | "### Import reference genome to reference store" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 4, 244 | "id": "161d02b0", 245 | "metadata": { 246 | "tags": [] 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "SOURCE_S3_URIS = {\n", 251 | " \"reference\": f\"s3://{SOURCE_BUCKET_NAME}/genomic/reference/hg19.fa\"\n", 252 | "}" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 7, 258 | "id": "6bf75b4f", 259 | "metadata": { 260 | "tags": [] 261 | }, 262 | "outputs": [], 263 | "source": [ 264 | "# If using a different reference genome, replace \"hg19\" with a different prefix\n", 265 | "\n", 266 | "ref_name = f'hg19-{ts}'\n", 267 | "\n", 268 | "ref_import_job = omics.start_reference_import_job(\n", 269 | " referenceStoreId=get_ref_store_id(omics), \n", 270 | " roleArn=get_role_arn(omics_iam_name),\n", 271 | " sources=[{\n", 272 | " 'sourceFile': SOURCE_S3_URIS[\"reference\"],\n", 273 | " 'name': ref_name,\n", 274 | " 'tags': {'SourceLocation': '1kg'}\n", 275 | " }])" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "id": "219c738a", 282 | "metadata": { 283 | "tags": [] 284 | }, 285 | "outputs": [], 286 | "source": [ 287 | "ref_import_job = omics.get_reference_import_job(\n", 288 | " referenceStoreId=get_ref_store_id(omics), \n", 289 | " id=ref_import_job['id'])\n", 290 | "ref_import_job" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "id": "5e1b56ff-92db-4db7-8e5f-f51f5733b557", 297 | "metadata": { 298 | "tags": [] 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "try:\n", 303 | " waiter = omics.get_waiter('reference_import_job_completed')\n", 304 | " waiter.wait(id=ref_import_job['id'], referenceStoreId=ref_import_job['referenceStoreId'])\n", 305 | " \n", 306 | " print(f\"reference import job {ref_import_job['id']}
complete\")\n", 307 | "except botocore.exceptions.WaiterError as e:\n", 308 | " print(f\"reference import job {ref_import_job['id']} FAILED\")\n", 309 | " print(e)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "b0f91c51", 315 | "metadata": {}, 316 | "source": [ 317 | "### !!! Wait until the above import job has finished !!!" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "id": "055d8c70", 324 | "metadata": { 325 | "tags": [] 326 | }, 327 | "outputs": [], 328 | "source": [ 329 | "resp = omics.list_references(referenceStoreId=get_ref_store_id(omics), filter={\"name\": ref_name})\n", 330 | "\n", 331 | "ref_list = resp\n", 332 | "pprint(resp)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "id": "41525f3c", 339 | "metadata": { 340 | "tags": [] 341 | }, 342 | "outputs": [], 343 | "source": [ 344 | "# Store this reference\n", 345 | "ref = omics.get_reference_metadata(\n", 346 | " referenceStoreId=get_ref_store_id(omics), \n", 347 | " id=ref_list['references'][0]['id'])\n", 348 | "ref" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "id": "c9ed9b75", 354 | "metadata": {}, 355 | "source": [ 356 | "## Create Variant Store and import VCF files" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 12, 362 | "id": "01ec4e3d", 363 | "metadata": { 364 | "tags": [] 365 | }, 366 | "outputs": [], 367 | "source": [ 368 | "SOURCE_VARIANT_URI = f\"s3://{SOURCE_BUCKET_NAME}\"" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": 13, 374 | "id": "801a2d03", 375 | "metadata": { 376 | "tags": [] 377 | }, 378 | "outputs": [], 379 | "source": [ 380 | "# generate a list of VCF files to import\n", 381 | "\n", 382 | "source = urllib.parse.urlparse(SOURCE_VARIANT_URI)\n", 383 | "bucket = source.netloc\n", 384 | "prefix = source.path[1:]\n", 385 | "\n", 386 | "s3r = boto3.resource('s3')\n", 387 | "\n", 388 | "bucket = 
s3r.Bucket(bucket)\n", 389 | "objects = bucket.objects.filter(Prefix=prefix, MaxKeys=10_000)\n", 390 | "ext = '_dna.vcf'\n", 391 | "\n", 392 | "vcf_list = [f\"s3://{o.bucket_name}/{o.key}\" for o in objects if o.key.endswith(ext)]" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "id": "ff428d41", 398 | "metadata": {}, 399 | "source": [ 400 | "### Create Variant Store" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "id": "9c1fa8fd", 407 | "metadata": { 408 | "tags": [] 409 | }, 410 | "outputs": [], 411 | "source": [ 412 | "var_store_name = f'synthea_newvariants_{ts.lower()}'\n", 413 | "\n", 414 | "response = omics.create_variant_store(\n", 415 | " name=var_store_name, \n", 416 | " reference={\"referenceArn\": get_reference_arn(ref_name, omics)}\n", 417 | ")\n", 418 | "\n", 419 | "var_store = response\n", 420 | "response" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "id": "82351bce", 426 | "metadata": {}, 427 | "source": [ 428 | "### !!! Wait until the Variant Store is created !!!" 
429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "id": "3aed46b5", 435 | "metadata": { 436 | "tags": [] 437 | }, 438 | "outputs": [], 439 | "source": [ 440 | "try:\n", 441 | " waiter = omics.get_waiter('variant_store_created')\n", 442 | " waiter.wait(name=var_store['name'])\n", 443 | "\n", 444 | " print(f\"variant store {var_store['name']} ready for use\")\n", 445 | "except botocore.exceptions.WaiterError as e:\n", 446 | " print(f\"variant store {var_store['name']} FAILED:\")\n", 447 | " print(e)\n", 448 | "\n", 449 | "var_store = omics.get_variant_store(name=var_store['name'])" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "id": "c6bb0979", 455 | "metadata": {}, 456 | "source": [ 457 | "### Import VCF files" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 17, 463 | "id": "73a348de", 464 | "metadata": { 465 | "tags": [] 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "l_vcf = [dict(zip([\"source\"],[uri])) for i, uri in enumerate(vcf_list)]\n", 470 | "\n", 471 | "response = omics.start_variant_import_job(destinationName=var_store['name'], \n", 472 | " roleArn=get_role_arn(omics_iam_name),\n", 473 | " items=l_vcf)" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "id": "cc6a56d1", 479 | "metadata": {}, 480 | "source": [ 481 | "## Query Variant Store with Amazon Athena" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": null, 487 | "id": "b2db8f94-c240-45da-8671-8370f0067407", 488 | "metadata": { 489 | "tags": [] 490 | }, 491 | "outputs": [], 492 | "source": [ 493 | "# To run Athena queries on the data, use AWS LakeFormation to create resource links to the database\n", 494 | "# For the following function to work, you need to ensure the IAM user running this notebook is a Data Lake Administrator.\n", 495 | "\n", 496 | "create_resource_link('omicsdb', var_store, store_type='variant')" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | 
"execution_count": 19, 502 | "id": "e570e5d2", 503 | "metadata": { 504 | "tags": [] 505 | }, 506 | "outputs": [ 507 | { 508 | "name": "stdout", 509 | "output_type": "stream", 510 | "text": [ 511 | "Athena engine version 3\n", 512 | "Workgroup 'omics' found using Athena engine version 3\n" 513 | ] 514 | }, 515 | { 516 | "data": { 517 | "text/plain": [ 518 | "{'Name': 'omics',\n", 519 | " 'State': 'ENABLED',\n", 520 | " 'Description': '',\n", 521 | " 'CreationTime': datetime.datetime(2023, 3, 21, 20, 36, 59, 255000, tzinfo=tzlocal()),\n", 522 | " 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n", 523 | " 'EffectiveEngineVersion': 'Athena engine version 3'}}" 524 | ] 525 | }, 526 | "execution_count": 19, 527 | "metadata": {}, 528 | "output_type": "execute_result" 529 | } 530 | ], 531 | "source": [ 532 | "# Omics Analytic Stores requires Athena engine version 3 for querying\n", 533 | "# https://docs.aws.amazon.com/athena/latest/ug/versions.html\n", 534 | "\n", 535 | "# Locate or create a suitable workgroup for Athena queries\n", 536 | "\n", 537 | "athena = boto3.client('athena')\n", 538 | "\n", 539 | "athena_workgroups = athena.list_work_groups()['WorkGroups']\n", 540 | "\n", 541 | "athena_workgroup = None\n", 542 | "for wg in athena_workgroups:\n", 543 | " print(wg['EngineVersion']['EffectiveEngineVersion'])\n", 544 | " if wg['EngineVersion']['EffectiveEngineVersion'] == 'Athena engine version 3':\n", 545 | " print(f\"Workgroup '{wg['Name']}' found using Athena engine version 3\")\n", 546 | " athena_workgroup = wg\n", 547 | " break\n", 548 | "else:\n", 549 | " print(\"No workgroups with Athena engine version 3 found. 
creating one\")\n", 550 | " athena_workgroup = athena.create_work_group(\n", 551 | " Name='omics',\n", 552 | " Configuration={\n", 553 | " \"EngineVersion\": {\n", 554 | " \"SelectedEngineVersion\": \"Athena engine version 3\"\n", 555 | " }\n", 556 | " }\n", 557 | " )\n", 558 | "\n", 559 | "athena_workgroup" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 26, 565 | "id": "8dbf203d", 566 | "metadata": { 567 | "tags": [] 568 | }, 569 | "outputs": [ 570 | { 571 | "data": { 572 | "text/html": [ 573 | "
" 693 | ], 694 | "text/plain": [ 695 | " sampleid contigname start referenceallele \\\n", 696 | "0 69eab197-6c14-7fcf-16d8-a18a222b82a4 1 46932823 A \n", 697 | "1 c7fff683-fd1b-f937-71ae-a490a80c9197 1 46932823 A \n", 698 | "2 c7fff683-fd1b-f937-71ae-a490a80c9197 1 55039973 G \n", 699 | "3 c7fff683-fd1b-f937-71ae-a490a80c9197 1 46932823 A \n", 700 | "4 1c906349-d5f7-3b79-9385-291d6ca12ddc 1 46932823 A \n", 701 | "5 69eab197-6c14-7fcf-16d8-a18a222b82a4 1 46932823 A \n", 702 | "6 69eab197-6c14-7fcf-16d8-a18a222b82a4 1 55039973 G \n", 703 | "7 1c906349-d5f7-3b79-9385-291d6ca12ddc 1 46932823 A \n", 704 | "8 1c906349-d5f7-3b79-9385-291d6ca12ddc 1 55039973 G \n", 705 | "9 dccfd9ed-8080-2743-5c17-7888e93617d5 1 46932823 A \n", 706 | "\n", 707 | " alternatealleles calls \n", 708 | "0 [G] [0, 1] \n", 709 | "1 [G] [0, 1] \n", 710 | "2 [A, T] [0, 1] \n", 711 | "3 [G] [0, 1] \n", 712 | "4 [G] [0, 1] \n", 713 | "5 [G] [0, 1] \n", 714 | "6 [A, T] [0, 1] \n", 715 | "7 [G] [0, 1] \n", 716 | "8 [A, T] [0, 1] \n", 717 | "9 [G] [0, 1] " 718 | ] 719 | }, 720 | "execution_count": 26, 721 | "metadata": {}, 722 | "output_type": "execute_result" 723 | } 724 | ], 725 | "source": [ 726 | "# Use AWS Wrangler to submit query and get results as a Pandas Dataframe\n", 727 | "\n", 728 | "import awswrangler as wr\n", 729 | "\n", 730 | "df_var = wr.athena.read_sql_query(\n", 731 | " f\"select sampleid, contigname, start, referenceallele, alternatealleles, calls from {var_store['name']} limit 10;\", \n", 732 | " database=\"omicsdb\", workgroup = \"omics\")\n", 733 | "df_var" 734 | ] 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "id": "ef5d4866", 739 | "metadata": {}, 740 | "source": [ 741 | "## Create Annotation Store and import ClinVar annotation file" 742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": null, 747 | "id": "4b747f19", 748 | "metadata": { 749 | "tags": [] 750 | }, 751 | "outputs": [], 752 | "source": [ 753 | "SOURCE_ANNOTATION_URI = 
f\"s3://{SOURCE_BUCKET_NAME}/genomic/annotation/clinvar.vcf.gz\"\n", 754 | "\n", 755 | "ann_store_name = f'synthea_annotations_{ts.lower()}'\n", 756 | "\n", 757 | "response = omics.create_annotation_store(\n", 758 | " name=ann_store_name, \n", 759 | " reference={\"referenceArn\": get_reference_arn(ref_name, omics)},\n", 760 | " storeFormat='VCF'\n", 761 | ")\n", 762 | "\n", 763 | "ann_store = response\n", 764 | "response\n", 765 | "\n", 766 | "try:\n", 767 | " waiter = omics.get_waiter('annotation_store_created')\n", 768 | " waiter.wait(name=ann_store['name'])\n", 769 | "\n", 770 | " print(f\"annotation store {ann_store['name']} ready for use\")\n", 771 | "except botocore.exceptions.WaiterError as e:\n", 772 | " print(f\"annotation store {ann_store['name']} FAILED:\")\n", 773 | " print(e)\n", 774 | "\n", 775 | "ann_store = omics.get_annotation_store(name=ann_store['name'])\n" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "id": "29db5e49", 781 | "metadata": {}, 782 | "source": [ 783 | "### !!! Wait until the Annotation Store is created !!!" 
784 | ] 785 | }, 786 | { 787 | "cell_type": "markdown", 788 | "id": "167610f7", 789 | "metadata": {}, 790 | "source": [ 791 | "### Import annotation file" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "id": "330bff87", 798 | "metadata": { 799 | "tags": [] 800 | }, 801 | "outputs": [], 802 | "source": [ 803 | "response = omics.start_annotation_import_job(\n", 804 | " destinationName=ann_store['name'],\n", 805 | " roleArn=get_role_arn(omics_iam_name),\n", 806 | " items=[{\"source\": SOURCE_ANNOTATION_URI}]\n", 807 | ")\n", 808 | "response" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "id": "8b1199c0", 814 | "metadata": {}, 815 | "source": [ 816 | "## Query Annotation Store with Amazon Athena" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": null, 822 | "id": "8345b58b-bfbd-45b0-93a7-882891f4f53a", 823 | "metadata": { 824 | "tags": [] 825 | }, 826 | "outputs": [], 827 | "source": [ 828 | "create_resource_link('omicsdb', ann_store, store_type='annotation')" 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "execution_count": 27, 834 | "id": "b0f4bf17", 835 | "metadata": { 836 | "tags": [] 837 | }, 838 | "outputs": [ 839 | { 840 | "data": { 841 | "text/html": [ 842 | "
" 951 | ], 952 | "text/plain": [ 953 | " contigname start referenceallele alternatealleles \\\n", 954 | "0 1 926009 G [T] \n", 955 | "1 1 926026 C [T] \n", 956 | "2 1 925975 T [C] \n", 957 | "3 1 926002 C [T] \n", 958 | "4 1 926013 G [A] \n", 959 | "5 1 926024 G [A] \n", 960 | "6 1 925955 C [T] \n", 961 | "7 1 925968 C [T] \n", 962 | "8 1 925985 C [T] \n", 963 | "9 1 925951 G [A] \n", 964 | "\n", 965 | " attributes \n", 966 | "0 [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14... \n", 967 | "1 [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14... \n", 968 | "2 [(CLNSIG, Uncertain_significance), (GENEINFO, ... \n", 969 | "3 [(CLNSIG, Uncertain_significance), (GENEINFO, ... \n", 970 | "4 [(CLNSIG, Uncertain_significance), (GENEINFO, ... \n", 971 | "5 [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14... \n", 972 | "6 [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14... \n", 973 | "7 [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14... \n", 974 | "8 [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14... \n", 975 | "9 [(RS, 1640863258), (CLNSIG, Uncertain_signific... 
" 976 | ] 977 | }, 978 | "execution_count": 27, 979 | "metadata": {}, 980 | "output_type": "execute_result" 981 | } 982 | ], 983 | "source": [ 984 | "df_ann = wr.athena.read_sql_query(\n", 985 | " f\"select contigname, start, referenceallele, alternatealleles, attributes from {ann_store['name']} order by contigname limit 10;\", \n", 986 | " database=\"omicsdb\", workgroup = \"omics\")\n", 987 | "df_ann" 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": null, 993 | "id": "77f9922d", 994 | "metadata": {}, 995 | "outputs": [], 996 | "source": [] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "id": "065dc264", 1002 | "metadata": {}, 1003 | "outputs": [], 1004 | "source": [] 1005 | } 1006 | ], 1007 | "metadata": { 1008 | "kernelspec": { 1009 | "display_name": "conda_python3", 1010 | "language": "python", 1011 | "name": "conda_python3" 1012 | }, 1013 | "language_info": { 1014 | "codemirror_mode": { 1015 | "name": "ipython", 1016 | "version": 3 1017 | }, 1018 | "file_extension": ".py", 1019 | "mimetype": "text/x-python", 1020 | "name": "python", 1021 | "nbconvert_exporter": "python", 1022 | "pygments_lexer": "ipython3", 1023 | "version": "3.10.12" 1024 | } 1025 | }, 1026 | "nbformat": 4, 1027 | "nbformat_minor": 5 1028 | } 1029 | -------------------------------------------------------------------------------- /store-multimodal-data/genomic/utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | 4 | import boto3 5 | import botocore.exceptions 6 | 7 | 8 | def create_omics_role(rolename, policy, trust_policy, iam=None): 9 | # Create the IAM client 10 | if not iam: 11 | iam = boto3.resource('iam') 12 | 13 | # Check if the role already exist. 
If not, create it 14 | try: 15 | role = iam.Role(rolename) 16 | role.load() 17 | 18 | except botocore.exceptions.ClientError as ex: 19 | if ex.response["Error"]["Code"] == "NoSuchEntity": 20 | #Create the role with the corresponding trust policy 21 | role = iam.create_role( 22 | RoleName=rolename, 23 | AssumeRolePolicyDocument=json.dumps(trust_policy)) 24 | 25 | #Create policy 26 | policy = iam.create_policy( 27 | PolicyName='{}-policy'.format(rolename), 28 | Description="Policy for Amazon Omics", 29 | PolicyDocument=json.dumps(policy)) 30 | 31 | #Attach the policy to the role 32 | policy.attach_role(RoleName=rolename) 33 | else: 34 | print(ex) 35 | print('Something went wrong, please retry and check your account settings and permissions') 36 | 37 | 38 | def get_role_arn(rolename, client=None): 39 | """retrieves the arn for an iam role name""" 40 | if not client: 41 | client = boto3.client('iam') 42 | 43 | role = client.get_role(RoleName=rolename)['Role'] 44 | return role['Arn'] 45 | 46 | 47 | def get_ref_store_id(client=None): 48 | if not client: 49 | client = boto3.client('omics') 50 | 51 | resp = client.list_reference_stores(maxResults=10) 52 | list_of_stores = resp.get('referenceStores') 53 | store_id = None 54 | 55 | if list_of_stores: 56 | # Since there can only be one store per region, if there is a store present use the first one 57 | store_id = list_of_stores[0].get('id') 58 | 59 | return store_id 60 | 61 | 62 | def get_reference_arn(ref_name, client=None): 63 | if not client: 64 | client = boto3.client('omics') 65 | 66 | resp = client.list_reference_stores(maxResults=10) 67 | ref_stores = resp.get('referenceStores') 68 | 69 | # There can only be one reference store per account per region 70 | # if there is a store present, it is the first one 71 | ref_store = ref_stores[0] if ref_stores else None 72 | 73 | if not ref_store: 74 | raise RuntimeError("You have not created a reference store, please go to the Amazon Omics Storage tutorial to learn how
to create one. Do not continue with this notebook") 75 | 76 | ref_arn = None 77 | resp = client.list_references(referenceStoreId=ref_store['id']) 78 | ref_list = resp.get('references') 79 | 80 | for ref in ref_list: 81 | if ref['name'] == ref_name: 82 | ref_arn = ref['arn'] 83 | 84 | if ref_arn is None: 85 | raise RuntimeError(f"Could not find {ref_name}.") 86 | 87 | return ref_arn 88 | 89 | 90 | def create_resource_link(database_name, store, store_type='variant'): 91 | ram = boto3.client('ram') 92 | glue = boto3.client('glue') 93 | 94 | caller_identity = boto3.client('sts').get_caller_identity() 95 | AWS_ACCOUNT_ID = caller_identity['Account'] 96 | AWS_IDENTITY_ARN = caller_identity['Arn'] 97 | 98 | response = ram.list_resources(resourceOwner='OTHER-ACCOUNTS', resourceType='glue:Database') 99 | 100 | if not response.get('resources'): 101 | print('no shared resources found. verify that you have successfully created an Omics Analytics store') 102 | else: 103 | store_resources = [resource for resource in response['resources'] if store['id'] in resource['arn']] 104 | if not store_resources: 105 | print(f"no shared resources matching {store_type} store id {store['id']} found") 106 | else: 107 | store_resource = store_resources[0] 108 | 109 | resource_share = ram.get_resource_shares( 110 | resourceOwner='OTHER-ACCOUNTS', 111 | resourceShareArns=[store_resource['resourceShareArn']])['resourceShares'][0] 112 | 113 | # this creates a resource link to the table for the variant store and adds it to the `omicsdb` database 114 | response = glue.create_table( 115 | DatabaseName=database_name, 116 | TableInput = { 117 | "Name": store['name'], 118 | "TargetTable": { 119 | "CatalogId": resource_share['owningAccountId'], 120 | "DatabaseName": f"{store_type}_{AWS_ACCOUNT_ID}_{store['id']}", 121 | "Name": store['name'], 122 | } 123 | } 124 | ) 125 | 126 | return store_resource, resource_share, response
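
The lookup and resource-link logic in utils.py can be exercised without AWS credentials by separating the pure data handling from the boto3 calls. The following is a minimal sketch under that assumption, not part of the repository: the helper names `first_store_id`, `find_reference_arn`, and `resource_link_table_input` are hypothetical, and the input dictionaries only mimic the response fields that utils.py itself reads (`referenceStores`, `references`, `id`, `name`, `arn`).

```python
from typing import Optional


def first_store_id(list_response: dict) -> Optional[str]:
    """Return the id of the first reference store, or None if there is none.

    Treats a missing key and an empty list the same way, so indexing [0]
    can never raise IndexError.
    """
    stores = list_response.get("referenceStores") or []
    return stores[0].get("id") if stores else None


def find_reference_arn(references: list, ref_name: str) -> str:
    """Return the ARN of the reference called ref_name.

    Like get_reference_arn in utils.py, the last matching entry wins.
    """
    ref_arn = None
    for ref in references:
        if ref.get("name") == ref_name:
            ref_arn = ref["arn"]
    if ref_arn is None:
        raise RuntimeError(f"Could not find {ref_name}.")
    return ref_arn


def resource_link_table_input(store: dict, owner_account_id: str,
                              account_id: str, store_type: str = "variant") -> dict:
    """Build the TableInput dict that create_resource_link passes to Glue."""
    return {
        "Name": store["name"],
        "TargetTable": {
            "CatalogId": owner_account_id,
            "DatabaseName": f"{store_type}_{account_id}_{store['id']}",
            "Name": store["name"],
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    store = {"id": "abc123", "name": "annotations"}
    print(first_store_id({"referenceStores": []}))
    print(resource_link_table_input(store, "111111111111", "222222222222", "annotation"))
```

Keeping these pieces free of network calls makes the empty-store and missing-reference cases easy to check before running the notebook against a real account.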
-------------------------------------------------------------------------------- /store-multimodal-data/medical-imaging/src/Api.py: -------------------------------------------------------------------------------- 1 | import array 2 | import pydicom 3 | from pydicom.sequence import Sequence 4 | from pydicom import Dataset , DataElement 5 | from pydicom.dataset import FileMetaDataset 6 | from pydicom.uid import UID 7 | import json 8 | import logging 9 | import importlib 10 | import boto3 11 | from openjpeg import decode 12 | import io 13 | import sys 14 | import time 15 | import os 16 | import gzip 17 | 18 | logging.basicConfig( level="INFO" ) 19 | 20 | class MedicalImaging: 21 | def __init__(self, endpoint=""): 22 | session = boto3.Session() 23 | if len(endpoint)>1: 24 | self.client = boto3.client('medical-imaging', endpoint_url=endpoint) 25 | else: 26 | self.client = boto3.client('medical-imaging') 27 | 28 | def stopwatch(self, start_time, end_time): 29 | time_lapsed = end_time - start_time 30 | return time_lapsed*1000 31 | 32 | 33 | def getMetadata(self, datastoreId, imageSetId): 34 | start_time = time.time() 35 | dicom_study_metadata = self.client.get_image_set_metadata(datastoreId=datastoreId , imageSetId=imageSetId ) 36 | json_study_metadata = json.loads( gzip.decompress(dicom_study_metadata["imageSetMetadataBlob"].read()) ) 37 | end_time = time.time() 38 | logging.info(f"Metadata fetch : {self.stopwatch(start_time,end_time)} ms") 39 | return json_study_metadata 40 | 41 | 42 | def listDatastores(self): 43 | start_time = time.time() 44 | response = self.client.list_datastores() 45 | end_time = time.time() 46 | logging.info(f"List Datastores : {self.stopwatch(start_time,end_time)} ms") 47 | return response 48 | 49 | 50 | def createDatastore(self, datastoreName): 51 | start_time = time.time() 52 | response = self.client.create_datastore(datastoreName=datastoreName) 53 | end_time = time.time() 54 | logging.info(f"Create Datastore : 
{self.stopwatch(start_time,end_time)} ms") 55 | return response 56 | 57 | 58 | def getDatastore(self, datastoreId): 59 | start_time = time.time() 60 | response = self.client.get_datastore(datastoreId=datastoreId) 61 | end_time = time.time() 62 | logging.info(f"Get Datastore : {self.stopwatch(start_time,end_time)} ms") 63 | return response 64 | 65 | 66 | def deleteDatastore(self, datastoreId): 67 | start_time = time.time() 68 | response = self.client.delete_datastore(datastoreId=datastoreId) 69 | end_time = time.time() 70 | logging.info(f"Delete Datastore : {self.stopwatch(start_time,end_time)} ms") 71 | return response 72 | 73 | 74 | def startImportJob(self, datastoreId, IamRoleArn, inputS3, outputS3): 75 | start_time = time.time() 76 | response = self.client.start_dicom_import_job( 77 | datastoreId=datastoreId, 78 | dataAccessRoleArn = IamRoleArn, 79 | inputS3Uri = inputS3, 80 | outputS3Uri = outputS3, 81 | clientToken = "demoClient" 82 | ) 83 | end_time = time.time() 84 | logging.info(f"Start Import Job : {self.stopwatch(start_time,end_time)} ms") 85 | return response 86 | 87 | 88 | def getImportJob(self, datastoreId, jobId): 89 | start_time = time.time() 90 | response = self.client.get_dicom_import_job(datastoreId=datastoreId, jobId=jobId) 91 | end_time = time.time() 92 | logging.info(f"Get Import Job : {self.stopwatch(start_time,end_time)} ms") 93 | return response 94 | 95 | 96 | def getFramePixels(self, datastoreId, imageSetId, imageFrameId): 97 | start_time = time.time() 98 | res = self.client.get_image_frame( 99 | datastoreId=datastoreId, 100 | imageSetId=imageSetId, 101 | imageFrameInformation={ 102 | 'imageFrameId': imageFrameId 103 | }) 104 | end_time = time.time() 105 | logging.debug(f"Frame fetch : {self.stopwatch(start_time,end_time)} ms") 106 | start_time = time.time() 107 | b = io.BytesIO() 108 | b.write(res['imageFrameBlob'].read()) 109 | b.seek(0) 110 | d = decode(b) 111 | end_time = time.time() 112 | logging.debug(f"Frame decode : 
{self.stopwatch(start_time,end_time)} ms") 113 | return d 114 | 115 | def getDICOMdataset(self, datastoreId, imageSetId): 116 | logging.debug("Reading the JSON metadata file") 117 | json_dicom_header = self.getMetadata(datastoreId , imageSetId) 118 | 119 | vrlist = [] 120 | sop_instances = [] 121 | 122 | file_meta = FileMetaDataset() 123 | file_meta.MediaStorageSOPClassUID = UID('1.2.840.10008.5.1.4.1.1.1') ## Media Storage SOP Class UID, e.g. "1.2.840.10008.5.1.4.1.1.88.34" for Comprehensive 3D SR IOD. 124 | file_meta.MediaStorageSOPInstanceUID = UID("1.3.51.5145.5142.20010109.1105627.1.0.1") 125 | file_meta.ImplementationClassUID = UID("1.2.826.0.1.3680043.9.3811.2.0.1") 126 | file_meta.TransferSyntaxUID = UID('1.2.840.10008.1.2.1') # Explicit VR Little Endian 127 | 128 | logging.debug("Reading the Pixels") 129 | for series in json_dicom_header["Study"]["Series"]: 130 | for instances in json_dicom_header["Study"]["Series"][series]["Instances"]: 131 | ds = Dataset() 132 | ds.file_meta = file_meta 133 | 134 | PatientLevel = json_dicom_header["Patient"]["DICOM"] 135 | self.getTags(PatientLevel, ds, vrlist) 136 | StudyLevel = json_dicom_header["Study"]["DICOM"] 137 | self.getTags(StudyLevel, ds, vrlist) 138 | self.getDICOMVRs(json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOMVRs"] , vrlist) 139 | self.getTags( json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOM"] , ds, vrlist) 140 | self.getTags(json_dicom_header["Study"]["Series"][series]["DICOM"], ds, vrlist) 141 | 142 | ds.file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRLittleEndian 143 | ds.file_meta.MediaStorageSOPInstanceUID = UID(instances) 144 | ds.is_little_endian = True 145 | ds.is_implicit_VR = False 146 | 147 | frameId = json_dicom_header["Study"]["Series"][series]["Instances"][instances]["ImageFrames"][0]["ID"] 148 | pixels = self.getFramePixels(datastoreId, json_dicom_header["ImageSetID"], frameId) 149 | 150 | start_time = time.time() 151 |
ds.PixelData = pixels.tobytes() 152 | sop_instances.append(ds) 153 | vrlist.clear() 154 | end_time = time.time() 155 | logging.debug(f"Output save : {self.stopwatch(start_time,end_time)} ms") 156 | return sop_instances 157 | 158 | def getDICOMVRs(self, taglevel, vrlist): 159 | for theKey in taglevel: 160 | vrlist.append( [ theKey , taglevel[theKey] ]) 161 | logging.debug(f"[getDICOMVRs] - List of private tags VRs: {vrlist}\r\n") 162 | 163 | 164 | def getTags(self, tagLevel, ds, vrlist): 165 | for theKey in tagLevel: 166 | if theKey in ['PrivateCreatorID', 'FileMetaInformationVersion', '00291203']: 167 | continue 168 | try: 169 | try: 170 | tagvr = pydicom.datadict.dictionary_VR(theKey) 171 | except Exception: # If the VR is not in the pydicom dictionary, it might be a private tag listed in the vrlist 172 | tagvr = None 173 | for vr in vrlist: 174 | if theKey == vr[0]: 175 | tagvr = vr[1] 176 | datavalue=tagLevel[theKey] 177 | #print(f"{tagvr} {theKey} : {datavalue}") 178 | if(tagvr == 'SQ'): 179 | logging.debug(f"{theKey} : {tagLevel[theKey]} , {vrlist}") 180 | seqs = [] 181 | for underSeq in tagLevel[theKey]: 182 | seqds = Dataset() 183 | self.getTags(underSeq, seqds, vrlist) 184 | seqs.append(seqds) 185 | datavalue = Sequence(seqs) 186 | continue 187 | if(tagvr == 'US or SS'): 188 | datavalue=tagLevel[theKey] 189 | if (int(datavalue) > 32767): 190 | tagvr = 'US' 191 | if( tagvr == 'OB'): 192 | datavalue = self.getOBVRTagValue(tagLevel[theKey] ) 193 | 194 | data_element = DataElement(theKey , tagvr , datavalue ) 195 | if data_element.tag.group != 2: 196 | try: 197 | if (int(data_element.tag.group) % 2) == 0 : # even group numbers only, so private tags (odd groups) are skipped 198 | ds.add(data_element) 199 | except Exception: 200 | continue 201 | except Exception as err: 202 | logging.warning(f"[HLIDataDICOMizer][getTags] - {err} for Key: {theKey}") 203 | continue 204 | 205 | 206 | 207 | def getOBVRTagValue(self, datalist): 208 | bytevals = [] 209 | for byteval in datalist: 210 |
bytevals.append(int(byteval)) 211 | OBArray = bytearray(bytevals) 212 | return bytes(OBArray) 213 | 214 | -------------------------------------------------------------------------------- /store-multimodal-data/medical-imaging/store-imagingdata-with-awshealthimaging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "3c073628-e7a4-4598-98cc-b6134c104446", 7 | "metadata": { 8 | "tags": [] 9 | }, 10 | "outputs": [], 11 | "source": [ 12 | " %%sh\n", 13 | "pip install -q --upgrade pip\n", 14 | "pip install -q --upgrade boto3 botocore\n", 15 | "pip install -q tqdm nibabel pydicom numpy pylibjpeg-openjpeg " 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "id": "a1971f92-2784-46f7-8528-47cdd530ae19", 22 | "metadata": { 23 | "tags": [] 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "import pydicom\n", 28 | "from pydicom.sequence import Sequence\n", 29 | "from pydicom import Dataset , DataElement \n", 30 | "from pydicom.dataset import FileDataset, FileMetaDataset\n", 31 | "from pydicom.uid import UID\n", 32 | "from pydicom.pixel_data_handlers.util import convert_color_space , apply_color_lut\n", 33 | "from openjpeg import decode\n", 34 | "import array\n", 35 | "import json\n", 36 | "import logging\n", 37 | "import importlib \n", 38 | "import boto3\n", 39 | "import sagemaker\n", 40 | "from sagemaker import get_execution_role\n", 41 | "import io\n", 42 | "import sys\n", 43 | "import time\n", 44 | "import os\n", 45 | "import pandas as pd\n", 46 | "from botocore.exceptions import ClientError\n", 47 | "logging.basicConfig( level=\"INFO\" )\n", 48 | "# logging.basicConfig( level=\"DEBUG\" )\n", 49 | "from src.Api import MedicalImaging \n", 50 | "medicalimaging = MedicalImaging()\n", 51 | "\n", 52 | "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", 53 | "region = boto3.Session().region_name\n", 
54 | "bucket = sagemaker.Session().default_bucket()\n", 55 | "role = get_execution_role()\n", 56 | "print(f\"S3 Bucket is {bucket}\")\n", 57 | "print(f\"IAM role is {role}\")" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "af33e224-5a2e-42f4-b35e-70c04a93106c", 63 | "metadata": {}, 64 | "source": [ 65 | "Copy over Coherent DICOM images to Default SageMaker S3 bucket" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "id": "6994f21d-f0d8-461a-9353-f288125ed474", 72 | "metadata": { 73 | "tags": [] 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "!aws s3 sync s3://guidance-multimodal-hcls-healthai-machinelearning-{region}/imaging s3://{bucket}/imaging/ 2>&1 > /dev/null" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "fd781e6e-606e-4bad-aa36-c2a48fe7bd91", 84 | "metadata": { 85 | "tags": [] 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "DatastoreName = \"WorkshopDataStore\"\n", 90 | "datastoreList = medicalimaging.listDatastores()\n", 91 | "\n", 92 | "res_createstore = None\n", 93 | "for datastore in datastoreList[\"datastoreSummaries\"]:\n", 94 | " if datastore[\"datastoreName\"] == DatastoreName:\n", 95 | " res_createstore = datastore\n", 96 | " break\n", 97 | "if res_createstore is None: \n", 98 | " res_createstore = medicalimaging.createDatastore(DatastoreName)\n", 99 | "\n", 100 | "datastoreId = res_createstore['datastoreId']\n", 101 | "res_getstore = medicalimaging.getDatastore(res_createstore['datastoreId']) \n", 102 | "status = res_getstore['datastoreProperties']['datastoreStatus']\n", 103 | "while status!='ACTIVE':\n", 104 | " time.sleep(30)\n", 105 | " res_getstore = medicalimaging.getDatastore(res_createstore['datastoreId']) \n", 106 | " status = res_getstore['datastoreProperties']['datastoreStatus']\n", 107 | " print(status)\n", 108 | "print(f\"datastoreId: {datastoreId}; status: {status}\")" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | 
"execution_count": null, 114 | "id": "ede3adfd-4f26-4ada-a75a-3cbff1ed9cc4", 115 | "metadata": { 116 | "tags": [] 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "res_startimportjob = medicalimaging.startImportJob(\n", 121 | " res_createstore['datastoreId'],\n", 122 | " f\"arn:aws:iam::{account_id}:role/HealthImagingImportJobRole\",\n", 123 | " 's3://'+bucket+'/imaging/', \n", 124 | " 's3://'+bucket+'/ahi_importjob_output/'\n", 125 | ")\n", 126 | "\n", 127 | "jobId = res_startimportjob['jobId']\n", 128 | "jobstatus = medicalimaging.getImportJob(datastoreId, jobId)['jobProperties']['jobStatus']\n", 129 | "while jobstatus!='COMPLETED':\n", 130 | " time.sleep(30)\n", 131 | " jobstatus = medicalimaging.getImportJob(datastoreId, jobId)['jobProperties']['jobStatus']\n", 132 | "print(f\"jobstatus is {jobstatus}\")" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "id": "525b2359-60ec-4395-830a-639709af8614", 139 | "metadata": { 140 | "tags": [] 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "imageSetIds = {}\n", 145 | "s3=boto3.client('s3')\n", 146 | "try:\n", 147 | " response = s3.head_object(Bucket=bucket, Key=f\"ahi_importjob_output/{datastoreId}-DicomImport-{jobId}/job-output-manifest.json\")\n", 148 | " if response['ResponseMetadata']['HTTPStatusCode'] == 200:\n", 149 | " data = s3.get_object(Bucket=bucket, Key=f\"ahi_importjob_output/{datastoreId}-DicomImport-{jobId}/SUCCESS/success.ndjson\")\n", 150 | " contents = data['Body'].read().decode(\"utf-8\")\n", 151 | " for l in contents.splitlines():\n", 152 | " isid = json.loads(l)['importResponse']['imageSetId']\n", 153 | " if isid in imageSetIds:\n", 154 | " imageSetIds[isid]+=1\n", 155 | " else:\n", 156 | " imageSetIds[isid]=1\n", 157 | "except ClientError:\n", 158 | " pass\n", 159 | "\n", 160 | "\n", 161 | "print(\"number of image sets: {}\".format(len(imageSetIds)))" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "id": 
"378fee7b-2b32-4985-a6b0-de722495c966", 168 | "metadata": { 169 | "tags": [] 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "%store datastoreId\n", 174 | "%store imageSetIds\n", 175 | "%store jobId" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "id": "1f5a7471-e9ad-45ea-a96c-572a774a7529", 181 | "metadata": {}, 182 | "source": [ 183 | "## (Optional) Save JSON to S3" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "id": "58885cdc-a662-4e43-9f3c-d215719a7931", 190 | "metadata": { 191 | "tags": [] 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "for s in imageSetIds.keys():\n", 196 | " json_dicom_header = medicalimaging.getMetadata(datastoreId, s)\n", 197 | " patient = json_dicom_header['Patient']['DICOM']\n", 198 | " patient['imagesetid'] = s\n", 199 | " s3.put_object(\n", 200 | " Body=json.dumps(patient),\n", 201 | " Bucket=OutputBucketName,\n", 202 | " Key='dicom_header/json/patient/{}'.format(s)\n", 203 | " )\n", 204 | " study=json_dicom_header['Study']['DICOM']\n", 205 | " study['imagesetid'] = s\n", 206 | " s3.put_object(\n", 207 | " Body=json.dumps(study),\n", 208 | " Bucket=OutputBucketName,\n", 209 | " Key='dicom_header/json/study/{}'.format(s)\n", 210 | " )\n", 211 | " for se in list(json_dicom_header['Study']['Series'].keys()):\n", 212 | " s3.put_object(\n", 213 | " Body=json.dumps(json_dicom_header['Study']['Series'][se]['DICOM']),\n", 214 | " Bucket=OutputBucketName,\n", 215 | " Key='dicom_header/json/series/{}'.format(s)\n", 216 | " )\n", 217 | " for i in list(json_dicom_header['Study']['Series'][se]['Instances']):\n", 218 | " s3.put_object(\n", 219 | " Body=json.dumps(json_dicom_header['Study']['Series'][se]['Instances'][i]),\n", 220 | " Bucket=OutputBucketName,\n", 221 | " Key='dicom_header/json/series/{}'.format(s)\n", 222 | " )" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "1b99d75c-f3a5-42f1-b4f8-b46f756dffbe", 228 | "metadata": {}, 229 | "source": [ 230 | 
"## Clean Up" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "id": "ec1c8cbb-8d90-4806-a873-655ccb9b20fa", 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "try:\n", 241 | " s3res = boto3.resource('s3')\n", 242 | " bucket = s3res.Bucket(InputBucketName)\n", 243 | " bucket.object_versions.delete()\n", 244 | " s3.delete_bucket(Bucket=InputBucketName)\n", 245 | " bucket = s3res.Bucket(OutputBucketName)\n", 246 | " bucket.object_versions.delete()\n", 247 | " s3.delete_bucket(Bucket=OutputBucketName)\n", 248 | "except ClientError as e:\n", 249 | " if e.response['Error']['Code'] == 'NoSuchBucket':\n", 250 | " print(\"Bucket already deleted\")\n", 251 | " \n", 252 | "try: \n", 253 | " resp = iam.detach_role_policy(PolicyArn=respons_createpolicy['Policy']['Arn'],RoleName=response_createrole['Role']['RoleName'])\n", 254 | " resp = iam.delete_policy(PolicyArn=respons_createpolicy['Policy']['Arn'])\n", 255 | " resp = iam.delete_role(RoleName=response_createrole['Role']['RoleName'])\n", 256 | "except ClientError as ee:\n", 257 | " if ee.response['Error']['Code'] == 'NoSuchEntity':\n", 258 | " print(\"Policy not attached, ignore\")\n", 259 | " else: \n", 260 | " print(ee)" 261 | ] 262 | } 263 | ], 264 | "metadata": { 265 | "availableInstances": [ 266 | { 267 | "_defaultOrder": 0, 268 | "_isFastLaunch": true, 269 | "category": "General purpose", 270 | "gpuNum": 0, 271 | "hideHardwareSpecs": false, 272 | "memoryGiB": 4, 273 | "name": "ml.t3.medium", 274 | "vcpuNum": 2 275 | }, 276 | { 277 | "_defaultOrder": 1, 278 | "_isFastLaunch": false, 279 | "category": "General purpose", 280 | "gpuNum": 0, 281 | "hideHardwareSpecs": false, 282 | "memoryGiB": 8, 283 | "name": "ml.t3.large", 284 | "vcpuNum": 2 285 | }, 286 | { 287 | "_defaultOrder": 2, 288 | "_isFastLaunch": false, 289 | "category": "General purpose", 290 | "gpuNum": 0, 291 | "hideHardwareSpecs": false, 292 | "memoryGiB": 16, 293 | "name": "ml.t3.xlarge", 294 | 
"vcpuNum": 4 295 | }, 296 | { 297 | "_defaultOrder": 3, 298 | "_isFastLaunch": false, 299 | "category": "General purpose", 300 | "gpuNum": 0, 301 | "hideHardwareSpecs": false, 302 | "memoryGiB": 32, 303 | "name": "ml.t3.2xlarge", 304 | "vcpuNum": 8 305 | }, 306 | { 307 | "_defaultOrder": 4, 308 | "_isFastLaunch": true, 309 | "category": "General purpose", 310 | "gpuNum": 0, 311 | "hideHardwareSpecs": false, 312 | "memoryGiB": 8, 313 | "name": "ml.m5.large", 314 | "vcpuNum": 2 315 | }, 316 | { 317 | "_defaultOrder": 5, 318 | "_isFastLaunch": false, 319 | "category": "General purpose", 320 | "gpuNum": 0, 321 | "hideHardwareSpecs": false, 322 | "memoryGiB": 16, 323 | "name": "ml.m5.xlarge", 324 | "vcpuNum": 4 325 | }, 326 | { 327 | "_defaultOrder": 6, 328 | "_isFastLaunch": false, 329 | "category": "General purpose", 330 | "gpuNum": 0, 331 | "hideHardwareSpecs": false, 332 | "memoryGiB": 32, 333 | "name": "ml.m5.2xlarge", 334 | "vcpuNum": 8 335 | }, 336 | { 337 | "_defaultOrder": 7, 338 | "_isFastLaunch": false, 339 | "category": "General purpose", 340 | "gpuNum": 0, 341 | "hideHardwareSpecs": false, 342 | "memoryGiB": 64, 343 | "name": "ml.m5.4xlarge", 344 | "vcpuNum": 16 345 | }, 346 | { 347 | "_defaultOrder": 8, 348 | "_isFastLaunch": false, 349 | "category": "General purpose", 350 | "gpuNum": 0, 351 | "hideHardwareSpecs": false, 352 | "memoryGiB": 128, 353 | "name": "ml.m5.8xlarge", 354 | "vcpuNum": 32 355 | }, 356 | { 357 | "_defaultOrder": 9, 358 | "_isFastLaunch": false, 359 | "category": "General purpose", 360 | "gpuNum": 0, 361 | "hideHardwareSpecs": false, 362 | "memoryGiB": 192, 363 | "name": "ml.m5.12xlarge", 364 | "vcpuNum": 48 365 | }, 366 | { 367 | "_defaultOrder": 10, 368 | "_isFastLaunch": false, 369 | "category": "General purpose", 370 | "gpuNum": 0, 371 | "hideHardwareSpecs": false, 372 | "memoryGiB": 256, 373 | "name": "ml.m5.16xlarge", 374 | "vcpuNum": 64 375 | }, 376 | { 377 | "_defaultOrder": 11, 378 | "_isFastLaunch": false, 379 | "category": 
"General purpose", 380 | "gpuNum": 0, 381 | "hideHardwareSpecs": false, 382 | "memoryGiB": 384, 383 | "name": "ml.m5.24xlarge", 384 | "vcpuNum": 96 385 | }, 386 | { 387 | "_defaultOrder": 12, 388 | "_isFastLaunch": false, 389 | "category": "General purpose", 390 | "gpuNum": 0, 391 | "hideHardwareSpecs": false, 392 | "memoryGiB": 8, 393 | "name": "ml.m5d.large", 394 | "vcpuNum": 2 395 | }, 396 | { 397 | "_defaultOrder": 13, 398 | "_isFastLaunch": false, 399 | "category": "General purpose", 400 | "gpuNum": 0, 401 | "hideHardwareSpecs": false, 402 | "memoryGiB": 16, 403 | "name": "ml.m5d.xlarge", 404 | "vcpuNum": 4 405 | }, 406 | { 407 | "_defaultOrder": 14, 408 | "_isFastLaunch": false, 409 | "category": "General purpose", 410 | "gpuNum": 0, 411 | "hideHardwareSpecs": false, 412 | "memoryGiB": 32, 413 | "name": "ml.m5d.2xlarge", 414 | "vcpuNum": 8 415 | }, 416 | { 417 | "_defaultOrder": 15, 418 | "_isFastLaunch": false, 419 | "category": "General purpose", 420 | "gpuNum": 0, 421 | "hideHardwareSpecs": false, 422 | "memoryGiB": 64, 423 | "name": "ml.m5d.4xlarge", 424 | "vcpuNum": 16 425 | }, 426 | { 427 | "_defaultOrder": 16, 428 | "_isFastLaunch": false, 429 | "category": "General purpose", 430 | "gpuNum": 0, 431 | "hideHardwareSpecs": false, 432 | "memoryGiB": 128, 433 | "name": "ml.m5d.8xlarge", 434 | "vcpuNum": 32 435 | }, 436 | { 437 | "_defaultOrder": 17, 438 | "_isFastLaunch": false, 439 | "category": "General purpose", 440 | "gpuNum": 0, 441 | "hideHardwareSpecs": false, 442 | "memoryGiB": 192, 443 | "name": "ml.m5d.12xlarge", 444 | "vcpuNum": 48 445 | }, 446 | { 447 | "_defaultOrder": 18, 448 | "_isFastLaunch": false, 449 | "category": "General purpose", 450 | "gpuNum": 0, 451 | "hideHardwareSpecs": false, 452 | "memoryGiB": 256, 453 | "name": "ml.m5d.16xlarge", 454 | "vcpuNum": 64 455 | }, 456 | { 457 | "_defaultOrder": 19, 458 | "_isFastLaunch": false, 459 | "category": "General purpose", 460 | "gpuNum": 0, 461 | "hideHardwareSpecs": false, 462 | 
"memoryGiB": 384, 463 | "name": "ml.m5d.24xlarge", 464 | "vcpuNum": 96 465 | }, 466 | { 467 | "_defaultOrder": 20, 468 | "_isFastLaunch": false, 469 | "category": "General purpose", 470 | "gpuNum": 0, 471 | "hideHardwareSpecs": true, 472 | "memoryGiB": 0, 473 | "name": "ml.geospatial.interactive", 474 | "supportedImageNames": [ 475 | "sagemaker-geospatial-v1-0" 476 | ], 477 | "vcpuNum": 0 478 | }, 479 | { 480 | "_defaultOrder": 21, 481 | "_isFastLaunch": true, 482 | "category": "Compute optimized", 483 | "gpuNum": 0, 484 | "hideHardwareSpecs": false, 485 | "memoryGiB": 4, 486 | "name": "ml.c5.large", 487 | "vcpuNum": 2 488 | }, 489 | { 490 | "_defaultOrder": 22, 491 | "_isFastLaunch": false, 492 | "category": "Compute optimized", 493 | "gpuNum": 0, 494 | "hideHardwareSpecs": false, 495 | "memoryGiB": 8, 496 | "name": "ml.c5.xlarge", 497 | "vcpuNum": 4 498 | }, 499 | { 500 | "_defaultOrder": 23, 501 | "_isFastLaunch": false, 502 | "category": "Compute optimized", 503 | "gpuNum": 0, 504 | "hideHardwareSpecs": false, 505 | "memoryGiB": 16, 506 | "name": "ml.c5.2xlarge", 507 | "vcpuNum": 8 508 | }, 509 | { 510 | "_defaultOrder": 24, 511 | "_isFastLaunch": false, 512 | "category": "Compute optimized", 513 | "gpuNum": 0, 514 | "hideHardwareSpecs": false, 515 | "memoryGiB": 32, 516 | "name": "ml.c5.4xlarge", 517 | "vcpuNum": 16 518 | }, 519 | { 520 | "_defaultOrder": 25, 521 | "_isFastLaunch": false, 522 | "category": "Compute optimized", 523 | "gpuNum": 0, 524 | "hideHardwareSpecs": false, 525 | "memoryGiB": 72, 526 | "name": "ml.c5.9xlarge", 527 | "vcpuNum": 36 528 | }, 529 | { 530 | "_defaultOrder": 26, 531 | "_isFastLaunch": false, 532 | "category": "Compute optimized", 533 | "gpuNum": 0, 534 | "hideHardwareSpecs": false, 535 | "memoryGiB": 96, 536 | "name": "ml.c5.12xlarge", 537 | "vcpuNum": 48 538 | }, 539 | { 540 | "_defaultOrder": 27, 541 | "_isFastLaunch": false, 542 | "category": "Compute optimized", 543 | "gpuNum": 0, 544 | "hideHardwareSpecs": false, 545 | 
"memoryGiB": 144, 546 | "name": "ml.c5.18xlarge", 547 | "vcpuNum": 72 548 | }, 549 | { 550 | "_defaultOrder": 28, 551 | "_isFastLaunch": false, 552 | "category": "Compute optimized", 553 | "gpuNum": 0, 554 | "hideHardwareSpecs": false, 555 | "memoryGiB": 192, 556 | "name": "ml.c5.24xlarge", 557 | "vcpuNum": 96 558 | }, 559 | { 560 | "_defaultOrder": 29, 561 | "_isFastLaunch": true, 562 | "category": "Accelerated computing", 563 | "gpuNum": 1, 564 | "hideHardwareSpecs": false, 565 | "memoryGiB": 16, 566 | "name": "ml.g4dn.xlarge", 567 | "vcpuNum": 4 568 | }, 569 | { 570 | "_defaultOrder": 30, 571 | "_isFastLaunch": false, 572 | "category": "Accelerated computing", 573 | "gpuNum": 1, 574 | "hideHardwareSpecs": false, 575 | "memoryGiB": 32, 576 | "name": "ml.g4dn.2xlarge", 577 | "vcpuNum": 8 578 | }, 579 | { 580 | "_defaultOrder": 31, 581 | "_isFastLaunch": false, 582 | "category": "Accelerated computing", 583 | "gpuNum": 1, 584 | "hideHardwareSpecs": false, 585 | "memoryGiB": 64, 586 | "name": "ml.g4dn.4xlarge", 587 | "vcpuNum": 16 588 | }, 589 | { 590 | "_defaultOrder": 32, 591 | "_isFastLaunch": false, 592 | "category": "Accelerated computing", 593 | "gpuNum": 1, 594 | "hideHardwareSpecs": false, 595 | "memoryGiB": 128, 596 | "name": "ml.g4dn.8xlarge", 597 | "vcpuNum": 32 598 | }, 599 | { 600 | "_defaultOrder": 33, 601 | "_isFastLaunch": false, 602 | "category": "Accelerated computing", 603 | "gpuNum": 4, 604 | "hideHardwareSpecs": false, 605 | "memoryGiB": 192, 606 | "name": "ml.g4dn.12xlarge", 607 | "vcpuNum": 48 608 | }, 609 | { 610 | "_defaultOrder": 34, 611 | "_isFastLaunch": false, 612 | "category": "Accelerated computing", 613 | "gpuNum": 1, 614 | "hideHardwareSpecs": false, 615 | "memoryGiB": 256, 616 | "name": "ml.g4dn.16xlarge", 617 | "vcpuNum": 64 618 | }, 619 | { 620 | "_defaultOrder": 35, 621 | "_isFastLaunch": false, 622 | "category": "Accelerated computing", 623 | "gpuNum": 1, 624 | "hideHardwareSpecs": false, 625 | "memoryGiB": 61, 626 | "name": 
"ml.p3.2xlarge", 627 | "vcpuNum": 8 628 | }, 629 | { 630 | "_defaultOrder": 36, 631 | "_isFastLaunch": false, 632 | "category": "Accelerated computing", 633 | "gpuNum": 4, 634 | "hideHardwareSpecs": false, 635 | "memoryGiB": 244, 636 | "name": "ml.p3.8xlarge", 637 | "vcpuNum": 32 638 | }, 639 | { 640 | "_defaultOrder": 37, 641 | "_isFastLaunch": false, 642 | "category": "Accelerated computing", 643 | "gpuNum": 8, 644 | "hideHardwareSpecs": false, 645 | "memoryGiB": 488, 646 | "name": "ml.p3.16xlarge", 647 | "vcpuNum": 64 648 | }, 649 | { 650 | "_defaultOrder": 38, 651 | "_isFastLaunch": false, 652 | "category": "Accelerated computing", 653 | "gpuNum": 8, 654 | "hideHardwareSpecs": false, 655 | "memoryGiB": 768, 656 | "name": "ml.p3dn.24xlarge", 657 | "vcpuNum": 96 658 | }, 659 | { 660 | "_defaultOrder": 39, 661 | "_isFastLaunch": false, 662 | "category": "Memory Optimized", 663 | "gpuNum": 0, 664 | "hideHardwareSpecs": false, 665 | "memoryGiB": 16, 666 | "name": "ml.r5.large", 667 | "vcpuNum": 2 668 | }, 669 | { 670 | "_defaultOrder": 40, 671 | "_isFastLaunch": false, 672 | "category": "Memory Optimized", 673 | "gpuNum": 0, 674 | "hideHardwareSpecs": false, 675 | "memoryGiB": 32, 676 | "name": "ml.r5.xlarge", 677 | "vcpuNum": 4 678 | }, 679 | { 680 | "_defaultOrder": 41, 681 | "_isFastLaunch": false, 682 | "category": "Memory Optimized", 683 | "gpuNum": 0, 684 | "hideHardwareSpecs": false, 685 | "memoryGiB": 64, 686 | "name": "ml.r5.2xlarge", 687 | "vcpuNum": 8 688 | }, 689 | { 690 | "_defaultOrder": 42, 691 | "_isFastLaunch": false, 692 | "category": "Memory Optimized", 693 | "gpuNum": 0, 694 | "hideHardwareSpecs": false, 695 | "memoryGiB": 128, 696 | "name": "ml.r5.4xlarge", 697 | "vcpuNum": 16 698 | }, 699 | { 700 | "_defaultOrder": 43, 701 | "_isFastLaunch": false, 702 | "category": "Memory Optimized", 703 | "gpuNum": 0, 704 | "hideHardwareSpecs": false, 705 | "memoryGiB": 256, 706 | "name": "ml.r5.8xlarge", 707 | "vcpuNum": 32 708 | }, 709 | { 710 | 
"_defaultOrder": 44, 711 | "_isFastLaunch": false, 712 | "category": "Memory Optimized", 713 | "gpuNum": 0, 714 | "hideHardwareSpecs": false, 715 | "memoryGiB": 384, 716 | "name": "ml.r5.12xlarge", 717 | "vcpuNum": 48 718 | }, 719 | { 720 | "_defaultOrder": 45, 721 | "_isFastLaunch": false, 722 | "category": "Memory Optimized", 723 | "gpuNum": 0, 724 | "hideHardwareSpecs": false, 725 | "memoryGiB": 512, 726 | "name": "ml.r5.16xlarge", 727 | "vcpuNum": 64 728 | }, 729 | { 730 | "_defaultOrder": 46, 731 | "_isFastLaunch": false, 732 | "category": "Memory Optimized", 733 | "gpuNum": 0, 734 | "hideHardwareSpecs": false, 735 | "memoryGiB": 768, 736 | "name": "ml.r5.24xlarge", 737 | "vcpuNum": 96 738 | }, 739 | { 740 | "_defaultOrder": 47, 741 | "_isFastLaunch": false, 742 | "category": "Accelerated computing", 743 | "gpuNum": 1, 744 | "hideHardwareSpecs": false, 745 | "memoryGiB": 16, 746 | "name": "ml.g5.xlarge", 747 | "vcpuNum": 4 748 | }, 749 | { 750 | "_defaultOrder": 48, 751 | "_isFastLaunch": false, 752 | "category": "Accelerated computing", 753 | "gpuNum": 1, 754 | "hideHardwareSpecs": false, 755 | "memoryGiB": 32, 756 | "name": "ml.g5.2xlarge", 757 | "vcpuNum": 8 758 | }, 759 | { 760 | "_defaultOrder": 49, 761 | "_isFastLaunch": false, 762 | "category": "Accelerated computing", 763 | "gpuNum": 1, 764 | "hideHardwareSpecs": false, 765 | "memoryGiB": 64, 766 | "name": "ml.g5.4xlarge", 767 | "vcpuNum": 16 768 | }, 769 | { 770 | "_defaultOrder": 50, 771 | "_isFastLaunch": false, 772 | "category": "Accelerated computing", 773 | "gpuNum": 1, 774 | "hideHardwareSpecs": false, 775 | "memoryGiB": 128, 776 | "name": "ml.g5.8xlarge", 777 | "vcpuNum": 32 778 | }, 779 | { 780 | "_defaultOrder": 51, 781 | "_isFastLaunch": false, 782 | "category": "Accelerated computing", 783 | "gpuNum": 1, 784 | "hideHardwareSpecs": false, 785 | "memoryGiB": 256, 786 | "name": "ml.g5.16xlarge", 787 | "vcpuNum": 64 788 | }, 789 | { 790 | "_defaultOrder": 52, 791 | "_isFastLaunch": false, 792 | 
"category": "Accelerated computing", 793 | "gpuNum": 4, 794 | "hideHardwareSpecs": false, 795 | "memoryGiB": 192, 796 | "name": "ml.g5.12xlarge", 797 | "vcpuNum": 48 798 | }, 799 | { 800 | "_defaultOrder": 53, 801 | "_isFastLaunch": false, 802 | "category": "Accelerated computing", 803 | "gpuNum": 4, 804 | "hideHardwareSpecs": false, 805 | "memoryGiB": 384, 806 | "name": "ml.g5.24xlarge", 807 | "vcpuNum": 96 808 | }, 809 | { 810 | "_defaultOrder": 54, 811 | "_isFastLaunch": false, 812 | "category": "Accelerated computing", 813 | "gpuNum": 8, 814 | "hideHardwareSpecs": false, 815 | "memoryGiB": 768, 816 | "name": "ml.g5.48xlarge", 817 | "vcpuNum": 192 818 | }, 819 | { 820 | "_defaultOrder": 55, 821 | "_isFastLaunch": false, 822 | "category": "Accelerated computing", 823 | "gpuNum": 8, 824 | "hideHardwareSpecs": false, 825 | "memoryGiB": 1152, 826 | "name": "ml.p4d.24xlarge", 827 | "vcpuNum": 96 828 | }, 829 | { 830 | "_defaultOrder": 56, 831 | "_isFastLaunch": false, 832 | "category": "Accelerated computing", 833 | "gpuNum": 8, 834 | "hideHardwareSpecs": false, 835 | "memoryGiB": 1152, 836 | "name": "ml.p4de.24xlarge", 837 | "vcpuNum": 96 838 | } 839 | ], 840 | "instance_type": "ml.t3.medium", 841 | "kernelspec": { 842 | "display_name": "Python 3 (Data Science 3.0)", 843 | "language": "python", 844 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" 845 | }, 846 | "language_info": { 847 | "codemirror_mode": { 848 | "name": "ipython", 849 | "version": 3 850 | }, 851 | "file_extension": ".py", 852 | "mimetype": "text/x-python", 853 | "name": "python", 854 | "nbconvert_exporter": "python", 855 | "pygments_lexer": "ipython3", 856 | "version": "3.10.6" 857 | } 858 | }, 859 | "nbformat": 4, 860 | "nbformat_minor": 5 861 | } 862 | -------------------------------------------------------------------------------- /train-test-ml-model/README.md: 
-------------------------------------------------------------------------------- 1 | ## Results from tests conducted on four different outcomes 2 | 3 | The train-test-model.ipynb notebook shows the test results for four different outcomes on the entire dataset of clinical, genomic, and imaging features. The four outcomes tested are Hypertension, Stroke, Alzheimer's disease, and Coronary heart disease. Tests were conducted for each of the modalities separately and finally by combining all three. In the majority of cases, using features from all three modalities improves the relevant metrics. Please note: because the models were trained on synthetic data, these results may not reflect real-world performance. 4 | 5 | | **Hypertension** | **Accuracy** | **Precision** | **Recall** | **F1** | 6 | |:--------------------------------:|:------------:|:-------------:|:----------:|:------:| 7 | | **Clinical** | 0.80 | 0.79 | 0.80 | 0.79 | 8 | | **Genomic** | 0.70 | 0.49 | 0.70 | 0.58 | 9 | | **Imaging** | 0.70 | 0.49 | 0.70 | 0.58 | 10 | | **Clinical + Genomic + Imaging** | 0.87 | 0.91 | 0.87 | 0.87 | 11 | | | | | | | 12 | | **Coronary heart disease** | **Accuracy** | **Precision** | **Recall** | **F1** | 13 | | **Clinical** | 0.63 | 0.76 | 0.63 | 0.68 | 14 | | **Genomic** | 0.87 | 0.75 | 0.87 | 0.80 | 15 | | **Imaging** | 0.80 | 0.74 | 0.80 | 0.77 | 16 | | **Clinical + Genomic + Imaging** | 0.83 | 0.85 | 0.83 | 0.84 | 17 | | | | | | | 18 | | **Stroke** | **Accuracy** | **Precision** | **Recall** | **F1** | 19 | | **Clinical** | 0.53 | 0.54 | 0.53 | 0.53 | 20 | | **Genomic** | 0.50 | 0.50 | 0.50 | 0.50 | 21 | | **Imaging** | 0.47 | 0.48 | 0.47 | 0.45 | 22 | | **Clinical + Genomic + Imaging** | 0.97 | 0.97 | 0.97 | 0.97 | 23 | | | | | | | 24 | | **Alzheimer's disease** | **Accuracy** | **Precision** | **Recall** | **F1** | 25 | | **Clinical** | 0.73 | 0.53 | 0.73 | 0.62 | 26 | | **Genomic** | 0.73 | 0.53 | 0.73 | 0.62 | 27 | | **Imaging** | 0.97 | 0.93 | 0.97 | 0.95 | 
28 | | **Clinical + Genomic + Imaging** | 0.90 | 0.84 | 0.80 | 0.75 | 29 | 30 | The hypertension-train-test-deploy.ipynb notebook shows an example of how to deploy the hypertension AutoGluon model on an Amazon SageMaker endpoint and run inference against it. A similar approach can be used to deploy models for other outcomes. -------------------------------------------------------------------------------- /train-test-ml-model/train-test-model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "825b615a-d04a-4c3e-8a14-a80257e2ed18", 6 | "metadata": {}, 7 | "source": [ 8 | "# Clinical, Genomic, and Imaging data - Training and Testing " 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "c062ea3f-bdbf-4496-9cf6-d0893261c440", 14 | "metadata": { 15 | "tags": [] 16 | }, 17 | "source": [ 18 | "---\n", 19 | "This notebook demonstrates the use of the Amazon SageMaker [AutoGluon-Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) algorithm to train and test a tabular binary classification model. Tabular classification is the task of assigning a class to an example of structured or relational data. The Amazon SageMaker API for tabular classification can be used to classify an example into two classes (binary classification) or more than two classes (multi-class classification).\n", 20 | "\n", 21 | "In this notebook, we demonstrate the following with tabular classification models on the [Synthea Coherent Data Set](https://registry.opendata.aws/synthea-coherent-data/):\n", 22 | "\n", 23 | "* How to get features from Amazon SageMaker FeatureStore. The preprocess-multimodal-data notebooks for clinical, genomic, and imaging data need to be run before running this notebook.\n", 24 | "* How to train a tabular model on a multimodal dataset to do binary classification. 
This notebook shows examples for four different outcomes: Alzheimer's disease, coronary heart disease, stroke, and hypertension.\n", 25 | "* How to evaluate predictions on the out-of-sample test data.\n", 26 | "\n", 27 | "Note: This notebook was tested in Amazon SageMaker Studio on an ml.t3.xlarge instance with the Python 3 (Data Science 3.0) kernel.\n", 28 | "\n", 29 | "---" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "id": "9515c767-f825-4691-a8dc-ebe2016e6ae2", 36 | "metadata": { 37 | "tags": [] 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "import boto3\n", 42 | "import sagemaker\n", 43 | "from sagemaker.session import Session\n", 44 | "from sagemaker import get_execution_role\n", 45 | "import pandas as pd\n", 46 | "import io, os\n", 47 | "import sys\n", 48 | "from sklearn.model_selection import train_test_split" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "outputs": [], 55 | "source": [ 56 | "!pip install autogluon" 57 | ], 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "id": "339915720e97ea08" 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "id": "55f9a0b7-fc8f-4c71-8732-9ac79f0faee6", 67 | "metadata": { 68 | "tags": [] 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "from autogluon.tabular import TabularPredictor\n", 73 | "import autogluon as ag" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "id": "aaa341b3-3db4-42ca-b40d-ed3f904af38e", 79 | "metadata": {}, 80 | "source": [ 81 | "## Get data type to train model" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "id": "9588fcd1-5671-4266-bc97-8607c4d6b6ca", 88 | "metadata": { 89 | "tags": [] 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "data_type = 'genomic-clinical-imaging'\n", 94 | "PatientID = 'patientid'" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "id": "f648bda3-d7c7-4c0e-984a-d96e925e25fd", 100 | "metadata": {}, 101 | 
"source": [ 102 | "## Set up S3 buckets and session" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "id": "191ef27d-efbe-4229-a2c8-2f4fcd02b581", 109 | "metadata": { 110 | "tags": [] 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "sm_session = sagemaker.Session()\n", 115 | "bucket = sm_session.default_bucket()\n", 116 | "region = boto3.Session().region_name\n", 117 | "role = get_execution_role()\n", 118 | "\n", 119 | "boto_session = boto3.Session(region_name=region)\n", 120 | "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n", 121 | "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n", 122 | "\n", 123 | "feature_store_session = Session(\n", 124 | " boto_session=boto_session,\n", 125 | " sagemaker_client=sagemaker_client,\n", 126 | " sagemaker_featurestore_runtime_client=featurestore_runtime\n", 127 | ")\n", 128 | "\n", 129 | "s3_client = boto3.client('s3', region_name=region)\n", 130 | "\n", 131 | "default_s3_bucket_name = sm_session.default_bucket()\n", 132 | "prefix = 'multi-model-health-ml'\n" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "043669dd-325a-46ac-a5ee-21cca32a9470", 138 | "metadata": {}, 139 | "source": [ 140 | "## Get features from SageMaker FeatureStore based on data type" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "id": "0e6a0bbe-aae3-4fdd-bca0-7507c935012e", 147 | "metadata": { 148 | "tags": [] 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "from sagemaker.feature_store.feature_group import FeatureGroup\n", 153 | "\n", 154 | "genomic_feature_group_name = 'genomic-feature-group'\n", 155 | "clinical_feature_group_name = 'clinical-feature-group'\n", 156 | "imaging_feature_group_name = 'imaging-feature-group'\n", 157 | "\n", 158 | "genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, 
sagemaker_session=feature_store_session)\n", 159 | "clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)\n", 160 | "imaging_feature_group = FeatureGroup(name=imaging_feature_group_name, sagemaker_session=feature_store_session)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "id": "eef950fe-e6cd-46f7-ab9a-ba2416af69be", 167 | "metadata": { 168 | "tags": [] 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "genomic_query = genomic_feature_group.athena_query()\n", 173 | "clinical_query = clinical_feature_group.athena_query()\n", 174 | "imaging_query = imaging_feature_group.athena_query()\n", 175 | "\n", 176 | "genomic_table = genomic_query.table_name\n", 177 | "clinical_table = clinical_query.table_name\n", 178 | "imaging_table = imaging_query.table_name\n", 179 | "\n", 180 | "print('Table names')\n", 181 | "print(genomic_table)\n", 182 | "print(clinical_table)\n", 183 | "print(imaging_table)\n" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "id": "c25be564-ebb7-429e-936e-6e00e4fd89d2", 190 | "metadata": { 191 | "tags": [] 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "def get_features(data_type, output_location): \n", 196 | " if (data_type == 'genomic-clinical-imaging'):\n", 197 | " query_string = f'''SELECT * FROM \"{genomic_table}\", \"{clinical_table}\", \"{imaging_table}\"\n", 198 | " WHERE \"{genomic_table}\".{PatientID} = \"{clinical_table}\".{PatientID}\n", 199 | " AND \"{genomic_table}\".{PatientID} = \"{imaging_table}\".{PatientID}\n", 200 | " ORDER BY \"{clinical_table}\".{PatientID} ASC''' \n", 201 | " print(query_string)\n", 202 | " \n", 203 | " genomic_query.run(query_string=query_string, output_location=output_location)\n", 204 | " genomic_query.wait()\n", 205 | " dataset = genomic_query.as_dataframe()\n", 206 | " \n", 207 | " else:\n", 208 | " raise 
KeyError(f'data_type {data_type} is not supported for this analysis.')\n", 209 | " \n", 210 | " return dataset" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "id": "7e6f4687-3bd3-4c1d-8db7-9d164a8841a3", 217 | "metadata": { 218 | "tags": [] 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "fs_output_location = f's3://{default_s3_bucket_name}/{prefix}/feature-store-queries'\n", 223 | "dataset = get_features(data_type, fs_output_location)\n", 224 | "dataset = dataset.astype(str).replace({\"{\":\"\", \"}\":\"\"}, regex=True)\n", 225 | "\n", 226 | "# Write the dataset to a local CSV file (headers and index included) and upload it to S3.\n", 227 | "filename=f'{data_type}-dataset.csv'\n", 228 | "dataset_uri_prefix = f's3://{default_s3_bucket_name}/{prefix}/training_input/'\n", 229 | "\n", 230 | "dataset.to_csv(filename)\n", 231 | "s3_client.upload_file(filename, default_s3_bucket_name, f'{prefix}/training_input/{filename}')\n", 232 | "print(\"Observing the different features in the dataset\")\n", 233 | "dataset.head(3)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "id": "79588825-d49c-4309-b27e-7c79bc3f866c", 240 | "metadata": { 241 | "tags": [] 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "ag.core.utils.random.seed(25)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "id": "1d2deae2-386a-49a0-9662-e81a1c17d550", 251 | "metadata": {}, 252 | "source": [ 253 | "## Alzheimers Prediction\n", 254 | "Splitting data for training and testing" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "id": "544d6b79-7ea9-4082-bc95-cacd57015a13", 261 | "metadata": { 262 | "tags": [] 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "#Alzheimers Prediction\n", 267 | "#Splitting data into training and testing 80:20\n", 268 | "dataset = dataset.loc[:, ~dataset.columns.str.startswith('diagnostics')]\n", 269 | "dataset = dataset.drop(columns = ['eventtime', 'write_time', 
'api_invocation_time', 'is_deleted', 'eventtime.1', 'write_time.1', 'api_invocation_time.1', 'is_deleted.1', 'alzheimers_prediction.1',\n", 270 | " 'coronary_heart_disease_prediction.1', 'stroke_prediction.1', 'hypertension_prediction.1', 'patientid.1', 'eventtime.2', 'write_time.2', 'api_invocation_time.2', 'is_deleted.2', \n", 271 | " 'alzheimers_prediction.2', 'coronary_heart_disease_prediction.2', 'stroke_prediction.2', 'hypertension_prediction.2', 'patientid.2'])\n", 272 | "training= dataset.sample(frac=0.8, random_state=23)\n", 273 | "training = training.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])\n", 274 | "testing = dataset.drop(training.index)\n", 275 | "testing = testing.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])\n", 276 | "X_test = testing.drop(columns = ['alzheimers_prediction'])\n", 277 | "print(\"Training size = \", len(training))\n", 278 | "print(\"Out of sample testing size = \", len(testing))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "id": "0deacd5b-1dd8-4ee3-9b67-395e2298b52c", 284 | "metadata": {}, 285 | "source": [ 286 | "### Alzheimers prediction on clinical, genomic, and imaging data using Autogluon" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "id": "e29cd4ea-2d3b-4e94-bdbb-19b8b64e9b92", 293 | "metadata": { 294 | "tags": [] 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "import time\n", 299 | "start_time = time.time()\n", 300 | "buckt = sm_session.default_bucket()\n", 301 | "prefix= \"genomic-clinical-imaging-alzheimers-prediction\"\n", 302 | "save_file = 's3://{}/{}'.format(buckt, prefix)\n", 303 | "predictor = TabularPredictor(label= 'alzheimers_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n", 304 | "print(\"--- Training time= %s seconds 
---\" % (time.time() - start_time))" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "id": "3a8cc026-7387-4b82-9a9c-1bc2b7818a94", 311 | "metadata": { 312 | "tags": [] 313 | }, 314 | "outputs": [], 315 | "source": [ 316 | "predictor.evaluate_predictions(y_true=testing['alzheimers_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "id": "c1630bd6-ff2f-4b36-be9b-c8b30b6bbf70", 322 | "metadata": {}, 323 | "source": [ 324 | "## Coronary heart disease Prediction\n", 325 | "Splitting data for training and testing" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "id": "accbdc2f-b152-46fb-ab09-f6ce919a8c95", 332 | "metadata": { 333 | "tags": [] 334 | }, 335 | "outputs": [], 336 | "source": [ 337 | "#coronary_heart_disease_prediction\n", 338 | "#Splitting data into training and testing 80:20\n", 339 | "training = dataset.sample(frac=0.8, random_state=25)\n", 340 | "training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])\n", 341 | "testing = dataset.drop(training.index)\n", 342 | "testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])\n", 343 | "X_test = testing.drop(columns = ['coronary_heart_disease_prediction'])\n", 344 | "print(\"Training size = \", len(training))\n", 345 | "print(\"Out of sample testing size = \", len(testing))" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "id": "6e242581-457b-43e3-a279-9fe488391fab", 351 | "metadata": {}, 352 | "source": [ 353 | "### Coronary heart disease prediction on clinical, genomic, and imaging data using Autogluon" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": null, 359 | "id": "645bf140-25de-443f-b852-34066b33d203", 360 | "metadata": { 361 | "tags": [] 362 | }, 363 
| "outputs": [], 364 | "source": [ 365 | "import time\n", 366 | "start_time = time.time()\n", 367 | "buckt = sm_session.default_bucket()\n", 368 | "prefix= \"genomic-clinical-imaging-coronary-heart-disease-prediction\"\n", 369 | "save_file = 's3://{}/{}'.format(buckt, prefix)\n", 370 | "predictor = TabularPredictor(label= 'coronary_heart_disease_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n", 371 | "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "id": "9eacb091-7a47-4efc-a360-4a265efbfe0c", 378 | "metadata": { 379 | "tags": [] 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "predictor.evaluate_predictions(y_true=testing['coronary_heart_disease_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "id": "7deb337c-9b26-4d70-b700-c8eda40fa115", 389 | "metadata": {}, 390 | "source": [ 391 | "## Stroke Prediction\n", 392 | "Splitting data for training and testing" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "id": "b0ef9c2d-eafc-4f05-a155-eb5237eb5052", 399 | "metadata": { 400 | "tags": [] 401 | }, 402 | "outputs": [], 403 | "source": [ 404 | "#stroke_prediction\n", 405 | "#Splitting data into training and testing 80:20\n", 406 | "training = dataset.sample(frac=0.8, random_state=30)\n", 407 | "training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])\n", 408 | "testing = dataset.drop(training.index)\n", 409 | "testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])\n", 410 | "X_test = testing.drop(columns = ['stroke_prediction'])\n", 411 | "print(\"Training 
size = \", len(training))\n", 412 | "print(\"Out of sample testing size = \", len(testing))" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "0a753dc2-452f-44d0-92af-2c335a2478a9", 418 | "metadata": {}, 419 | "source": [ 420 | "### Stroke prediction on clinical, genomic, and imaging data using Autogluon" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "id": "3e828813-1792-434b-9f84-c1c7efb4c236", 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "import time\n", 431 | "start_time = time.time()\n", 432 | "buckt = sm_session.default_bucket()\n", 433 | "prefix= \"genomic-clinical-imaging-stroke_prediction\"\n", 434 | "save_file = 's3://{}/{}'.format(buckt, prefix)\n", 435 | "predictor = TabularPredictor(label= 'stroke_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n", 436 | "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "id": "fe947544-7daa-4836-b74a-7291cb83d421", 443 | "metadata": { 444 | "tags": [] 445 | }, 446 | "outputs": [], 447 | "source": [ 448 | "predictor.evaluate_predictions(y_true=testing['stroke_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "id": "8cda76e3-db6c-44d3-a6e5-339a8344f55a", 454 | "metadata": {}, 455 | "source": [ 456 | "## Hypertension Prediction\n", 457 | "Splitting data for training and testing" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "id": "926e139c-ec74-4c58-84c1-dca1c46f6022", 464 | "metadata": { 465 | "tags": [] 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "#hypertension_prediction\n", 470 | "#Splitting data into training and testing 80:20\n", 471 | "training = dataset.sample(frac=0.8, 
random_state=25)\n", 472 | "training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])\n", 473 | "testing = dataset.drop(training.index)\n", 474 | "testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])\n", 475 | "X_test = testing.drop(columns = ['hypertension_prediction'])\n", 476 | "print(\"Training size = \", len(training))\n", 477 | "print(\"Out of sample testing size = \", len(testing))" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "id": "a6bf0905-5b45-4943-82a5-9211e40e010e", 483 | "metadata": {}, 484 | "source": [ 485 | "### Hypertension prediction on clinical, genomic, and imaging data using Autogluon" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "id": "dcdc67d7-d74c-4810-bccb-943ab9cedee6", 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "import time\n", 496 | "start_time = time.time()\n", 497 | "buckt = sm_session.default_bucket()\n", 498 | "prefix= \"genomic-clinical-imaging-hypertension-prediction\"\n", 499 | "save_file = 's3://{}/{}'.format(buckt, prefix)\n", 500 | "predictor = TabularPredictor(label= 'hypertension_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n", 501 | "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "id": "1611732f-6432-4713-b8da-90cbf404d3ec", 508 | "metadata": { 509 | "tags": [] 510 | }, 511 | "outputs": [], 512 | "source": [ 513 | "predictor.evaluate_predictions(y_true=testing['hypertension_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "id": 
"b6669950-15c3-4231-b10d-69971fdd656a", 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [] 523 | } 524 | ], 525 | "metadata": { 526 | "availableInstances": [ 527 | { 528 | "_defaultOrder": 0, 529 | "_isFastLaunch": true, 530 | "category": "General purpose", 531 | "gpuNum": 0, 532 | "hideHardwareSpecs": false, 533 | "memoryGiB": 4, 534 | "name": "ml.t3.medium", 535 | "vcpuNum": 2 536 | }, 537 | { 538 | "_defaultOrder": 1, 539 | "_isFastLaunch": false, 540 | "category": "General purpose", 541 | "gpuNum": 0, 542 | "hideHardwareSpecs": false, 543 | "memoryGiB": 8, 544 | "name": "ml.t3.large", 545 | "vcpuNum": 2 546 | }, 547 | { 548 | "_defaultOrder": 2, 549 | "_isFastLaunch": false, 550 | "category": "General purpose", 551 | "gpuNum": 0, 552 | "hideHardwareSpecs": false, 553 | "memoryGiB": 16, 554 | "name": "ml.t3.xlarge", 555 | "vcpuNum": 4 556 | }, 557 | { 558 | "_defaultOrder": 3, 559 | "_isFastLaunch": false, 560 | "category": "General purpose", 561 | "gpuNum": 0, 562 | "hideHardwareSpecs": false, 563 | "memoryGiB": 32, 564 | "name": "ml.t3.2xlarge", 565 | "vcpuNum": 8 566 | }, 567 | { 568 | "_defaultOrder": 4, 569 | "_isFastLaunch": true, 570 | "category": "General purpose", 571 | "gpuNum": 0, 572 | "hideHardwareSpecs": false, 573 | "memoryGiB": 8, 574 | "name": "ml.m5.large", 575 | "vcpuNum": 2 576 | }, 577 | { 578 | "_defaultOrder": 5, 579 | "_isFastLaunch": false, 580 | "category": "General purpose", 581 | "gpuNum": 0, 582 | "hideHardwareSpecs": false, 583 | "memoryGiB": 16, 584 | "name": "ml.m5.xlarge", 585 | "vcpuNum": 4 586 | }, 587 | { 588 | "_defaultOrder": 6, 589 | "_isFastLaunch": false, 590 | "category": "General purpose", 591 | "gpuNum": 0, 592 | "hideHardwareSpecs": false, 593 | "memoryGiB": 32, 594 | "name": "ml.m5.2xlarge", 595 | "vcpuNum": 8 596 | }, 597 | { 598 | "_defaultOrder": 7, 599 | "_isFastLaunch": false, 600 | "category": "General purpose", 601 | "gpuNum": 0, 602 | "hideHardwareSpecs": false, 603 | "memoryGiB": 64, 604 | "name": 
"ml.m5.4xlarge", 605 | "vcpuNum": 16 606 | }, 607 | { 608 | "_defaultOrder": 8, 609 | "_isFastLaunch": false, 610 | "category": "General purpose", 611 | "gpuNum": 0, 612 | "hideHardwareSpecs": false, 613 | "memoryGiB": 128, 614 | "name": "ml.m5.8xlarge", 615 | "vcpuNum": 32 616 | }, 617 | { 618 | "_defaultOrder": 9, 619 | "_isFastLaunch": false, 620 | "category": "General purpose", 621 | "gpuNum": 0, 622 | "hideHardwareSpecs": false, 623 | "memoryGiB": 192, 624 | "name": "ml.m5.12xlarge", 625 | "vcpuNum": 48 626 | }, 627 | { 628 | "_defaultOrder": 10, 629 | "_isFastLaunch": false, 630 | "category": "General purpose", 631 | "gpuNum": 0, 632 | "hideHardwareSpecs": false, 633 | "memoryGiB": 256, 634 | "name": "ml.m5.16xlarge", 635 | "vcpuNum": 64 636 | }, 637 | { 638 | "_defaultOrder": 11, 639 | "_isFastLaunch": false, 640 | "category": "General purpose", 641 | "gpuNum": 0, 642 | "hideHardwareSpecs": false, 643 | "memoryGiB": 384, 644 | "name": "ml.m5.24xlarge", 645 | "vcpuNum": 96 646 | }, 647 | { 648 | "_defaultOrder": 12, 649 | "_isFastLaunch": false, 650 | "category": "General purpose", 651 | "gpuNum": 0, 652 | "hideHardwareSpecs": false, 653 | "memoryGiB": 8, 654 | "name": "ml.m5d.large", 655 | "vcpuNum": 2 656 | }, 657 | { 658 | "_defaultOrder": 13, 659 | "_isFastLaunch": false, 660 | "category": "General purpose", 661 | "gpuNum": 0, 662 | "hideHardwareSpecs": false, 663 | "memoryGiB": 16, 664 | "name": "ml.m5d.xlarge", 665 | "vcpuNum": 4 666 | }, 667 | { 668 | "_defaultOrder": 14, 669 | "_isFastLaunch": false, 670 | "category": "General purpose", 671 | "gpuNum": 0, 672 | "hideHardwareSpecs": false, 673 | "memoryGiB": 32, 674 | "name": "ml.m5d.2xlarge", 675 | "vcpuNum": 8 676 | }, 677 | { 678 | "_defaultOrder": 15, 679 | "_isFastLaunch": false, 680 | "category": "General purpose", 681 | "gpuNum": 0, 682 | "hideHardwareSpecs": false, 683 | "memoryGiB": 64, 684 | "name": "ml.m5d.4xlarge", 685 | "vcpuNum": 16 686 | }, 687 | { 688 | "_defaultOrder": 16, 689 | 
"_isFastLaunch": false, 690 | "category": "General purpose", 691 | "gpuNum": 0, 692 | "hideHardwareSpecs": false, 693 | "memoryGiB": 128, 694 | "name": "ml.m5d.8xlarge", 695 | "vcpuNum": 32 696 | }, 697 | { 698 | "_defaultOrder": 17, 699 | "_isFastLaunch": false, 700 | "category": "General purpose", 701 | "gpuNum": 0, 702 | "hideHardwareSpecs": false, 703 | "memoryGiB": 192, 704 | "name": "ml.m5d.12xlarge", 705 | "vcpuNum": 48 706 | }, 707 | { 708 | "_defaultOrder": 18, 709 | "_isFastLaunch": false, 710 | "category": "General purpose", 711 | "gpuNum": 0, 712 | "hideHardwareSpecs": false, 713 | "memoryGiB": 256, 714 | "name": "ml.m5d.16xlarge", 715 | "vcpuNum": 64 716 | }, 717 | { 718 | "_defaultOrder": 19, 719 | "_isFastLaunch": false, 720 | "category": "General purpose", 721 | "gpuNum": 0, 722 | "hideHardwareSpecs": false, 723 | "memoryGiB": 384, 724 | "name": "ml.m5d.24xlarge", 725 | "vcpuNum": 96 726 | }, 727 | { 728 | "_defaultOrder": 20, 729 | "_isFastLaunch": false, 730 | "category": "General purpose", 731 | "gpuNum": 0, 732 | "hideHardwareSpecs": true, 733 | "memoryGiB": 0, 734 | "name": "ml.geospatial.interactive", 735 | "supportedImageNames": [ 736 | "sagemaker-geospatial-v1-0" 737 | ], 738 | "vcpuNum": 0 739 | }, 740 | { 741 | "_defaultOrder": 21, 742 | "_isFastLaunch": true, 743 | "category": "Compute optimized", 744 | "gpuNum": 0, 745 | "hideHardwareSpecs": false, 746 | "memoryGiB": 4, 747 | "name": "ml.c5.large", 748 | "vcpuNum": 2 749 | }, 750 | { 751 | "_defaultOrder": 22, 752 | "_isFastLaunch": false, 753 | "category": "Compute optimized", 754 | "gpuNum": 0, 755 | "hideHardwareSpecs": false, 756 | "memoryGiB": 8, 757 | "name": "ml.c5.xlarge", 758 | "vcpuNum": 4 759 | }, 760 | { 761 | "_defaultOrder": 23, 762 | "_isFastLaunch": false, 763 | "category": "Compute optimized", 764 | "gpuNum": 0, 765 | "hideHardwareSpecs": false, 766 | "memoryGiB": 16, 767 | "name": "ml.c5.2xlarge", 768 | "vcpuNum": 8 769 | }, 770 | { 771 | "_defaultOrder": 24, 772 | 
"_isFastLaunch": false, 773 | "category": "Compute optimized", 774 | "gpuNum": 0, 775 | "hideHardwareSpecs": false, 776 | "memoryGiB": 32, 777 | "name": "ml.c5.4xlarge", 778 | "vcpuNum": 16 779 | }, 780 | { 781 | "_defaultOrder": 25, 782 | "_isFastLaunch": false, 783 | "category": "Compute optimized", 784 | "gpuNum": 0, 785 | "hideHardwareSpecs": false, 786 | "memoryGiB": 72, 787 | "name": "ml.c5.9xlarge", 788 | "vcpuNum": 36 789 | }, 790 | { 791 | "_defaultOrder": 26, 792 | "_isFastLaunch": false, 793 | "category": "Compute optimized", 794 | "gpuNum": 0, 795 | "hideHardwareSpecs": false, 796 | "memoryGiB": 96, 797 | "name": "ml.c5.12xlarge", 798 | "vcpuNum": 48 799 | }, 800 | { 801 | "_defaultOrder": 27, 802 | "_isFastLaunch": false, 803 | "category": "Compute optimized", 804 | "gpuNum": 0, 805 | "hideHardwareSpecs": false, 806 | "memoryGiB": 144, 807 | "name": "ml.c5.18xlarge", 808 | "vcpuNum": 72 809 | }, 810 | { 811 | "_defaultOrder": 28, 812 | "_isFastLaunch": false, 813 | "category": "Compute optimized", 814 | "gpuNum": 0, 815 | "hideHardwareSpecs": false, 816 | "memoryGiB": 192, 817 | "name": "ml.c5.24xlarge", 818 | "vcpuNum": 96 819 | }, 820 | { 821 | "_defaultOrder": 29, 822 | "_isFastLaunch": true, 823 | "category": "Accelerated computing", 824 | "gpuNum": 1, 825 | "hideHardwareSpecs": false, 826 | "memoryGiB": 16, 827 | "name": "ml.g4dn.xlarge", 828 | "vcpuNum": 4 829 | }, 830 | { 831 | "_defaultOrder": 30, 832 | "_isFastLaunch": false, 833 | "category": "Accelerated computing", 834 | "gpuNum": 1, 835 | "hideHardwareSpecs": false, 836 | "memoryGiB": 32, 837 | "name": "ml.g4dn.2xlarge", 838 | "vcpuNum": 8 839 | }, 840 | { 841 | "_defaultOrder": 31, 842 | "_isFastLaunch": false, 843 | "category": "Accelerated computing", 844 | "gpuNum": 1, 845 | "hideHardwareSpecs": false, 846 | "memoryGiB": 64, 847 | "name": "ml.g4dn.4xlarge", 848 | "vcpuNum": 16 849 | }, 850 | { 851 | "_defaultOrder": 32, 852 | "_isFastLaunch": false, 853 | "category": "Accelerated 
computing", 854 | "gpuNum": 1, 855 | "hideHardwareSpecs": false, 856 | "memoryGiB": 128, 857 | "name": "ml.g4dn.8xlarge", 858 | "vcpuNum": 32 859 | }, 860 | { 861 | "_defaultOrder": 33, 862 | "_isFastLaunch": false, 863 | "category": "Accelerated computing", 864 | "gpuNum": 4, 865 | "hideHardwareSpecs": false, 866 | "memoryGiB": 192, 867 | "name": "ml.g4dn.12xlarge", 868 | "vcpuNum": 48 869 | }, 870 | { 871 | "_defaultOrder": 34, 872 | "_isFastLaunch": false, 873 | "category": "Accelerated computing", 874 | "gpuNum": 1, 875 | "hideHardwareSpecs": false, 876 | "memoryGiB": 256, 877 | "name": "ml.g4dn.16xlarge", 878 | "vcpuNum": 64 879 | }, 880 | { 881 | "_defaultOrder": 35, 882 | "_isFastLaunch": false, 883 | "category": "Accelerated computing", 884 | "gpuNum": 1, 885 | "hideHardwareSpecs": false, 886 | "memoryGiB": 61, 887 | "name": "ml.p3.2xlarge", 888 | "vcpuNum": 8 889 | }, 890 | { 891 | "_defaultOrder": 36, 892 | "_isFastLaunch": false, 893 | "category": "Accelerated computing", 894 | "gpuNum": 4, 895 | "hideHardwareSpecs": false, 896 | "memoryGiB": 244, 897 | "name": "ml.p3.8xlarge", 898 | "vcpuNum": 32 899 | }, 900 | { 901 | "_defaultOrder": 37, 902 | "_isFastLaunch": false, 903 | "category": "Accelerated computing", 904 | "gpuNum": 8, 905 | "hideHardwareSpecs": false, 906 | "memoryGiB": 488, 907 | "name": "ml.p3.16xlarge", 908 | "vcpuNum": 64 909 | }, 910 | { 911 | "_defaultOrder": 38, 912 | "_isFastLaunch": false, 913 | "category": "Accelerated computing", 914 | "gpuNum": 8, 915 | "hideHardwareSpecs": false, 916 | "memoryGiB": 768, 917 | "name": "ml.p3dn.24xlarge", 918 | "vcpuNum": 96 919 | }, 920 | { 921 | "_defaultOrder": 39, 922 | "_isFastLaunch": false, 923 | "category": "Memory Optimized", 924 | "gpuNum": 0, 925 | "hideHardwareSpecs": false, 926 | "memoryGiB": 16, 927 | "name": "ml.r5.large", 928 | "vcpuNum": 2 929 | }, 930 | { 931 | "_defaultOrder": 40, 932 | "_isFastLaunch": false, 933 | "category": "Memory Optimized", 934 | "gpuNum": 0, 935 | 
"hideHardwareSpecs": false, 936 | "memoryGiB": 32, 937 | "name": "ml.r5.xlarge", 938 | "vcpuNum": 4 939 | }, 940 | { 941 | "_defaultOrder": 41, 942 | "_isFastLaunch": false, 943 | "category": "Memory Optimized", 944 | "gpuNum": 0, 945 | "hideHardwareSpecs": false, 946 | "memoryGiB": 64, 947 | "name": "ml.r5.2xlarge", 948 | "vcpuNum": 8 949 | }, 950 | { 951 | "_defaultOrder": 42, 952 | "_isFastLaunch": false, 953 | "category": "Memory Optimized", 954 | "gpuNum": 0, 955 | "hideHardwareSpecs": false, 956 | "memoryGiB": 128, 957 | "name": "ml.r5.4xlarge", 958 | "vcpuNum": 16 959 | }, 960 | { 961 | "_defaultOrder": 43, 962 | "_isFastLaunch": false, 963 | "category": "Memory Optimized", 964 | "gpuNum": 0, 965 | "hideHardwareSpecs": false, 966 | "memoryGiB": 256, 967 | "name": "ml.r5.8xlarge", 968 | "vcpuNum": 32 969 | }, 970 | { 971 | "_defaultOrder": 44, 972 | "_isFastLaunch": false, 973 | "category": "Memory Optimized", 974 | "gpuNum": 0, 975 | "hideHardwareSpecs": false, 976 | "memoryGiB": 384, 977 | "name": "ml.r5.12xlarge", 978 | "vcpuNum": 48 979 | }, 980 | { 981 | "_defaultOrder": 45, 982 | "_isFastLaunch": false, 983 | "category": "Memory Optimized", 984 | "gpuNum": 0, 985 | "hideHardwareSpecs": false, 986 | "memoryGiB": 512, 987 | "name": "ml.r5.16xlarge", 988 | "vcpuNum": 64 989 | }, 990 | { 991 | "_defaultOrder": 46, 992 | "_isFastLaunch": false, 993 | "category": "Memory Optimized", 994 | "gpuNum": 0, 995 | "hideHardwareSpecs": false, 996 | "memoryGiB": 768, 997 | "name": "ml.r5.24xlarge", 998 | "vcpuNum": 96 999 | }, 1000 | { 1001 | "_defaultOrder": 47, 1002 | "_isFastLaunch": false, 1003 | "category": "Accelerated computing", 1004 | "gpuNum": 1, 1005 | "hideHardwareSpecs": false, 1006 | "memoryGiB": 16, 1007 | "name": "ml.g5.xlarge", 1008 | "vcpuNum": 4 1009 | }, 1010 | { 1011 | "_defaultOrder": 48, 1012 | "_isFastLaunch": false, 1013 | "category": "Accelerated computing", 1014 | "gpuNum": 1, 1015 | "hideHardwareSpecs": false, 1016 | "memoryGiB": 32, 1017 | 
"name": "ml.g5.2xlarge", 1018 | "vcpuNum": 8 1019 | }, 1020 | { 1021 | "_defaultOrder": 49, 1022 | "_isFastLaunch": false, 1023 | "category": "Accelerated computing", 1024 | "gpuNum": 1, 1025 | "hideHardwareSpecs": false, 1026 | "memoryGiB": 64, 1027 | "name": "ml.g5.4xlarge", 1028 | "vcpuNum": 16 1029 | }, 1030 | { 1031 | "_defaultOrder": 50, 1032 | "_isFastLaunch": false, 1033 | "category": "Accelerated computing", 1034 | "gpuNum": 1, 1035 | "hideHardwareSpecs": false, 1036 | "memoryGiB": 128, 1037 | "name": "ml.g5.8xlarge", 1038 | "vcpuNum": 32 1039 | }, 1040 | { 1041 | "_defaultOrder": 51, 1042 | "_isFastLaunch": false, 1043 | "category": "Accelerated computing", 1044 | "gpuNum": 1, 1045 | "hideHardwareSpecs": false, 1046 | "memoryGiB": 256, 1047 | "name": "ml.g5.16xlarge", 1048 | "vcpuNum": 64 1049 | }, 1050 | { 1051 | "_defaultOrder": 52, 1052 | "_isFastLaunch": false, 1053 | "category": "Accelerated computing", 1054 | "gpuNum": 4, 1055 | "hideHardwareSpecs": false, 1056 | "memoryGiB": 192, 1057 | "name": "ml.g5.12xlarge", 1058 | "vcpuNum": 48 1059 | }, 1060 | { 1061 | "_defaultOrder": 53, 1062 | "_isFastLaunch": false, 1063 | "category": "Accelerated computing", 1064 | "gpuNum": 4, 1065 | "hideHardwareSpecs": false, 1066 | "memoryGiB": 384, 1067 | "name": "ml.g5.24xlarge", 1068 | "vcpuNum": 96 1069 | }, 1070 | { 1071 | "_defaultOrder": 54, 1072 | "_isFastLaunch": false, 1073 | "category": "Accelerated computing", 1074 | "gpuNum": 8, 1075 | "hideHardwareSpecs": false, 1076 | "memoryGiB": 768, 1077 | "name": "ml.g5.48xlarge", 1078 | "vcpuNum": 192 1079 | } 1080 | ], 1081 | "instance_type": "ml.t3.xlarge", 1082 | "kernelspec": { 1083 | "display_name": "Python 3 (Data Science 3.0)", 1084 | "language": "python", 1085 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" 1086 | }, 1087 | "language_info": { 1088 | "codemirror_mode": { 1089 | "name": "ipython", 1090 | "version": 3 1091 | }, 1092 | 
"file_extension": ".py", 1093 | "mimetype": "text/x-python", 1094 | "name": "python", 1095 | "nbconvert_exporter": "python", 1096 | "pygments_lexer": "ipython3", 1097 | "version": "3.10.6" 1098 | } 1099 | }, 1100 | "nbformat": 4, 1101 | "nbformat_minor": 5 1102 | } 1103 | -------------------------------------------------------------------------------- /visualization-dashboard/README.md: -------------------------------------------------------------------------------- 1 | ## [Visualization Dashboard for Population-level Analysis](https://us-east-1.quicksight.aws.amazon.com/sn/accounts/659535263284/dashboards/774bd843-f43f-4faf-858f-accd8376c95b/sheets/774bd843-f43f-4faf-858f-accd8376c95b_7871daf0-08f6-4c73-b85e-b7e88b3da794) 2 | 3 | 4 | 5 | 6 | This dashboard, built with Amazon QuickSight, offers an interactive visual interface to help users (e.g. clinicians, bioinformaticians, radiologists) get a comprehensive view of patients at the population or cohort level. It includes the following sheets: 7 | 8 | * Clinical Analysis - Provides an overview of clinical data at the population level 9 | * Genomic Analysis - Provides an overview of genomic data at the population level 10 | * Imaging Analysis - Provides an overview of medical imaging data at the population level 11 | 12 | 13 | ## [Visualization Dashboard for Patient-level Analysis](https://us-east-1.quicksight.aws.amazon.com/sn/accounts/659535263284/dashboards/9b340b41-9b19-4496-bde1-dcf3638edd39?directory_alias=hcls-multimodal) 14 | 15 | 16 | 17 | 18 | This dashboard, built with Amazon QuickSight, offers a single, interactive visual interface to help users (e.g. clinicians) get a complete view of a patient across multiple data modalities (clinical, genomic, and imaging). To use this dashboard, select the Patient ID of interest from the menu at the top right of the dashboard. This generates visualizations across multiple data types, filtered for the selected patient. 
19 | 20 | Prior to any production deployment, please coordinate with your local security team to ensure appropriate authentication/authorization is in place. 21 | --------------------------------------------------------------------------------