├── .github
    ├── solutionid_validator.sh
    └── workflows
    │   └── maintainer_workflows.yml
├── .gitignore
├── CODEOWNERS
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── architecture
    ├── LaunchStack.jpg
    └── architecture-diagram.png
├── cfn_template
├── preprocess-multimodal-data
    ├── README.md
    ├── clinical
    │   └── preprocess-clinical.ipynb
    ├── genomic
    │   └── preprocess-genomic.ipynb
    ├── medical-imaging
    │   ├── imaging-radiomics.ipynb
    │   ├── preprocess-imaging.ipynb
    │   └── src
    │   │   ├── Api.py
    │   │   └── radiogenomics-imaging-workflow.json
    └── single-patient-records.ipynb
├── store-multimodal-data
    ├── clinical
    │   └── README.md
    ├── genomic
    │   ├── store-analyze-genomicdata-with-awshealthomics.ipynb
    │   └── utils.py
    └── medical-imaging
    │   ├── src
    │       └── Api.py
    │   └── store-imagingdata-with-awshealthimaging.ipynb
├── train-test-ml-model
    ├── README.md
    ├── hypertension-train-test-deploy.ipynb
    └── train-test-model.ipynb
└── visualization-dashboard
    └── README.md


/.github/solutionid_validator.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/sh 
 2 | #set -e 
 3 | 
 4 | echo "checking solution id $1"
 5 | echo "grep -nr --exclude-dir='.github' \"$1\" ./.."
 6 | result=$(grep -nr --exclude-dir='.github' "$1" ./..)
 7 | if [ $? -eq 0 ]
 8 | then
 9 |   printf 'Solution ID %s found\n\n' "$1"
10 |   echo "$result"
11 |   exit 0
12 | else
13 |   echo "Solution ID $1 not found"
14 |   exit 1
15 | fi
16 | 
17 | export result
18 | 
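For readers who prefer to run the same check locally without a shell, the following is a minimal Python sketch of what the validator does: walk a directory tree, skip `.github`, and report every line that contains the solution ID. The function name `find_solution_id` and its parameters are hypothetical and not part of the repository. Unlike `grep`, this sketch does a plain substring match rather than a regular-expression match.

```python
import os


def find_solution_id(solution_id, root, exclude_dirs=(".github",)):
    """Return (path, line_number, line) for each line under root containing solution_id."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if solution_id in line:
                            matches.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable files are skipped rather than treated as failures.
                continue
    return matches
```

A caller could mirror the script's exit codes with `sys.exit(0 if find_solution_id(solution_id, "..") else 1)`.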


--------------------------------------------------------------------------------
/.github/workflows/maintainer_workflows.yml:
--------------------------------------------------------------------------------
 1 | # Workflows managed by aws-solutions-library-samples maintainers
 2 | name: Maintainer Workflows
 3 | on:
 4 |   # Triggers the workflow on push or pull request events but only for the "main" branch
 5 |   push:
 6 |     branches: [ "main" ]
 7 |   pull_request:
 8 |     branches: [ "main" ]
 9 |     types: [opened, reopened, edited]
10 | 
11 | jobs:
12 |   CheckSolutionId:
13 |     runs-on: ubuntu-latest
14 |     steps:
15 |       - uses: actions/checkout@v4
16 |       - name: Run solutionid validator
17 |         run: |
18 |           chmod u+x ./.github/solutionid_validator.sh
19 |           ./.github/solutionid_validator.sh ${{ vars.SOLUTIONID }}


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | ### vsCode ###
 2 | .vscode
 3 | 
 4 | ### macOS ###
 5 | # General
 6 | .DS_Store
 7 | .AppleDouble
 8 | .LSOverride
 9 | # Files that might appear in the root of a volume
10 | .DocumentRevisions-V100
11 | .fseventsd
12 | .Spotlight-V100
13 | .TemporaryItems
14 | .Trashes
15 | .VolumeIcon.icns
16 | .com.apple.timemachine.donotpresent
17 | 
18 | **/node_modules
19 | **/dist
20 | **/build
21 | **/.DS_Store
22 | **/.angular
23 | **/.idea
24 | cdk.out
25 | **/data/**.csv
26 | 
27 | .env
28 | 


--------------------------------------------------------------------------------
/CODEOWNERS:
--------------------------------------------------------------------------------
1 | CODEOWNERS @aws-solutions-library-samples/maintainers
2 | /.github/workflows/maintainer_workflows.yml @aws-solutions-library-samples/maintainers
3 | /.github/solutionid_validator.sh @aws-solutions-library-samples/maintainers
4 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing Guidelines
 2 | 
 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
 4 | documentation, we greatly value feedback and contributions from our community.
 5 | 
 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
 7 | information to effectively respond to your bug report or contribution.
 8 | 
 9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT No Attribution
 2 | 
 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Multimodal Data Analysis with AWS Health and Machine Learning Services
 2 | 
 3 | This repository contains code samples related to the AWS [Guidance for Multimodal Data Analysis with AWS Health and Machine Learning Services](https://aws.amazon.com/solutions/guidance/multi-modal-data-analysis-with-aws-health-and-ml-services/). You can follow the given instructions to build an end-to-end framework for storing, integrating, and analyzing genomic, clinical, and medical imaging data. 
 4 | 
 5 | As an example, we will use the [Synthea Coherent Data Set](https://registry.opendata.aws/synthea-coherent-data/), an open-source, synthetic dataset that includes FHIR resources, MRI DICOM images, genomic data, physiological data (i.e., ECGs), and simple clinical notes. All the data types are linked together by FHIR. For the purpose of demonstration, we have already converted the genomic data (originally released as CSV files with variant and annotation information) to VCF format, which is compatible with AWS HealthOmics. Similarly, we reformatted the clinical data into NDJSON files that are compatible with AWS HealthLake. For ease of use, we have made this derived dataset available in a publicly accessible Amazon S3 bucket (s3://guidance-multimodal-hcls-healthai-machinelearning). Since the Coherent Data Set focuses on cardiovascular disease, we consider the use case of predicting patients' outcomes, such as hypertension, stroke, coronary heart disease, and Alzheimer's disease. 
 6 | 
 7 | 
 8 | #### Architecture Overview 
 9 | 
10 | The following diagram provides an overview of the architecture and the steps followed to ingest, store, integrate, and analyze multimodal data leveraging AWS services. 
11 | 
12 | ![Figure 1: Architecture for multimodal data analysis with purpose-built Health AI and machine learning services on AWS](./architecture/architecture-diagram.png)
13 | 
14 | #### Instructions 
15 | 
16 | All of the data analytics and AI modeling are done using [Amazon SageMaker](https://aws.amazon.com/sagemaker/). You can create the Amazon SageMaker domain with a one-click deployment. If you already have an S3 bucket or IAM role with the same name as one used in the template, adjust the corresponding template parameter values:
17 | 
18 | [![deployment](architecture/LaunchStack.jpg)](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/template?stackName=sagemakerstack&templateURL=https://aws-opendata-demo.s3.amazonaws.com/sagemaker_template.yaml)  
19 | 
20 | Once the CloudFormation stack is created, go to the Studio domain created in the Amazon SageMaker management console, select the provisioned user profile, and launch the Studio application. Once the Studio application is ready, follow [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-git.html) to clone [this code repository](https://github.com/aws-solutions-library-samples/guidance-for-multimodal-hcls-data-analysis-with-omics-healthlake-imaging-and-sagemaker-on-aws.git) in SageMaker Studio and run the following notebooks.
21 | 
22 | 1. Store multimodal data with purpose-built Health AI services ([AWS HealthOmics](https://aws.amazon.com/omics/), [AWS HealthLake](https://aws.amazon.com/healthlake/), and [AWS HealthImaging](https://aws.amazon.com/healthlake/imaging/))
23 |     * To store each data type in the purpose-built Health AI service, follow the artifacts in the corresponding folders. 
24 |         * genomic - Run the notebook store-multimodal-data/genomic/store-analyze-genomicdata-with-awshealthomics.ipynb. This creates AWS HealthOmics data stores (Reference Store, Variant Store, and Annotation Store) to import the reference genome, VCF files, and the ClinVar annotation file. 
25 |         * clinical - Follow the instructions in store-multimodal-data/clinical/README.md to create an AWS HealthLake data store and import NDJSON files. 
26 |         * medical imaging - First, run the notebook store-multimodal-data/medical-imaging/store-imagingdata-with-awshealthimaging.ipynb to create AWS HealthImaging data stores and import DICOM files. Then, run preprocess-multimodal-data/medical-imaging/imaging-radiomics.ipynb to generate radiomic features from multiple images in parallel using Amazon SageMaker Processing. 
27 | 
28 | 2. Preprocess and analyze multimodal data with [AWS Lake Formation](https://aws.amazon.com/lake-formation/), [Amazon Athena](https://aws.amazon.com/athena/), and [Amazon SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/?sagemaker-data-wrangler-whats-new.sort-by=item.additionalFields.postDateTime&sagemaker-data-wrangler-whats-new.sort-order=desc)
29 |     * To prepare and analyze the multimodal data for downstream analysis (e.g., querying with Amazon Athena, training a machine learning (ML) model with Amazon SageMaker), follow the artifacts in the corresponding folders.
30 |         * genomic - Run the notebook preprocess-multimodal-data/genomic/preprocess-genomic.ipynb to execute feature engineering and store the final set of genomic features in SageMaker Feature Store, a fully managed service for machine learning features. 
31 |         * clinical - Run the notebook preprocess-multimodal-data/clinical/preprocess-clinical.ipynb to execute feature engineering and store the final set of clinical features in SageMaker Feature Store. 
32 |         * medical imaging - Run the notebook preprocess-multimodal-data/medical-imaging/preprocess-imaging.ipynb to execute feature engineering and store the final set of radiomic features in SageMaker Feature Store.
33 | 
34 |     At the end of this step, you have created three Feature Groups in SageMaker Feature Store, one for each data modality. We will use these features in the following steps to train a machine learning model.
35 | 
36 | 3. Build, train, test, and deploy machine learning models with Amazon SageMaker 
37 |     * For the given use case of patients' outcome prediction, we will use [SageMaker AutoGluon](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html), an AutoML framework that ensembles multiple ML models to improve predictive performance. To train and test the ML model on the multimodal feature set, run the notebook train-test-ml-model/train-test-model.ipynb. This generates evaluation metrics (accuracy, precision, recall, and f1 score) for the four outcomes (hypertension, stroke, coronary heart disease, and Alzheimer's disease).
38 | 
39 | 4. Create data visualization dashboards with Amazon QuickSight 
40 |     * Follow the instructions in visualization-dashboard/README.md to use the interactive dashboards and get a comprehensive view of patients across all three data modalities. These dashboards mitigate the challenge of data silos and help end users (e.g., clinicians, bioinformaticians, radiologists) easily access, view, and compare clinical, genomic, and medical imaging data across individual patients and cohorts. Prior to any production deployment, please coordinate with your local security team to ensure appropriate authentication/authorization is in place.
41 | 
42 | 
43 | #### Security
44 | 
45 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
46 | 
47 | #### License
48 | 
49 | This library is licensed under the MIT-0 License. See the LICENSE file.
50 | 
51 | 
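As a convenience, the notebook order described in the instructions can be written down as data and checked against a local clone. This is a minimal sketch, not part of the repository: the names `NOTEBOOK_ORDER` and `missing_notebooks` are hypothetical, and the paths are taken from the repository's directory tree.

```python
import os

# Notebooks in the order the README instructions run them (paths relative to the repo root).
NOTEBOOK_ORDER = [
    "store-multimodal-data/genomic/store-analyze-genomicdata-with-awshealthomics.ipynb",
    "store-multimodal-data/medical-imaging/store-imagingdata-with-awshealthimaging.ipynb",
    "preprocess-multimodal-data/medical-imaging/imaging-radiomics.ipynb",
    "preprocess-multimodal-data/genomic/preprocess-genomic.ipynb",
    "preprocess-multimodal-data/clinical/preprocess-clinical.ipynb",
    "preprocess-multimodal-data/medical-imaging/preprocess-imaging.ipynb",
    "train-test-ml-model/train-test-model.ipynb",
]


def missing_notebooks(repo_root):
    """Return the notebooks from NOTEBOOK_ORDER that are not present under repo_root."""
    return [
        relative_path
        for relative_path in NOTEBOOK_ORDER
        if not os.path.isfile(os.path.join(repo_root, *relative_path.split("/")))
    ]
```

Calling `missing_notebooks(".")` from the root of a clone should return an empty list if the checkout is complete.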


--------------------------------------------------------------------------------
/architecture/LaunchStack.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-solutions-library-samples/guidance-for-multi-modal-data-analysis-with-aws-health-and-ml-services/bb7c42edb5991c733fbf96382293a0820192a3f2/architecture/LaunchStack.jpg


--------------------------------------------------------------------------------
/architecture/architecture-diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-solutions-library-samples/guidance-for-multi-modal-data-analysis-with-aws-health-and-ml-services/bb7c42edb5991c733fbf96382293a0820192a3f2/architecture/architecture-diagram.png


--------------------------------------------------------------------------------
/cfn_template:
--------------------------------------------------------------------------------
  1 | AWSTemplateFormatVersion: '2010-09-09'
  2 | Description: 'Create resources needed for the AWS Solution: SO9254'
  3 | 
  4 | Parameters:
  5 |   ParameterVPCId:
  6 |     Description: ID of the AWS Virtual Private Cloud (VPC)
  7 |     Type: AWS::EC2::VPC::Id
  8 |   ParameterSubnet1Id:
  9 |     Description: SubnetId, for Availability Zone 1 in the region in your VPC
 10 |     Type: AWS::EC2::Subnet::Id
 11 |   ParameterSubnet2Id:
 12 |     Description: SubnetId, for Availability Zone 2 in the region in your VPC
 13 |     Type: AWS::EC2::Subnet::Id
 14 |   HealthImagingImportRoleName:
 15 |     Type: String
 16 |     Default: HealthImagingImportJobRole
 17 |     Description: This is an IAM role used by AWS HealthImaging to import data. If you have an IAM role with the same name, please change this name.
 18 |   HealthOmicsImportRoleName:
 19 |     Type: String
 20 |     Default: OmicsUnifiedJobRole
 21 |     Description: This is an IAM role used by AWS HealthOmics to import data. If you have an IAM role with the same name, please change this name.
 22 |   CreateS3BucketforSageMaker:
 23 |     Description: Whether to create an S3 bucket named sagemaker-<AWS::Region>-<AWS::AccountId>
 24 |     Type: String
 25 |     Default: false
 26 |     AllowedValues:
 27 |       - true
 28 |       - false
 29 | 
 30 | Conditions:
 31 |   CreateS3Bucket: !Equals
 32 |     - true
 33 |     - !Ref 'CreateS3BucketforSageMaker'
 34 | 
 35 | Resources:
 36 |   SageMakerBucket:
 37 |     Condition: CreateS3Bucket
 38 |     Type: AWS::S3::Bucket
 39 |     Properties:
 40 |       BucketName: !Sub sagemaker-${AWS::Region}-${AWS::AccountId}
 41 |       VersioningConfiguration:
 42 |         Status: Enabled
 43 |       AccessControl: Private
 44 |       PublicAccessBlockConfiguration:
 45 |         BlockPublicAcls: TRUE
 46 |         BlockPublicPolicy: TRUE
 47 |         IgnorePublicAcls: TRUE
 48 |         RestrictPublicBuckets: TRUE
 49 |       BucketEncryption:
 50 |         ServerSideEncryptionConfiguration:
 51 |           - ServerSideEncryptionByDefault:
 52 |               SSEAlgorithm: AES256
 53 | 
 54 |   SageMakerExecutionRole:
 55 |     Type: "AWS::IAM::Role"
 56 |     DependsOn: HealthImagingImportJobRole
 57 |     Properties:
 58 |       RoleName: SageMakerStudioExecutionRole
 59 |       AssumeRolePolicyDocument:
 60 |         Version: 2012-10-17
 61 |         Statement:
 62 |           - Effect: Allow
 63 |             Principal:
 64 |               Service:
 65 |                 - sagemaker.amazonaws.com
 66 |                 - omics.amazonaws.com
 67 |                 - medical-imaging.amazonaws.com
 68 |                 - states.amazonaws.com
 69 |                 - glue.amazonaws.com
 70 |                 - codepipeline.amazonaws.com
 71 |                 - codebuild.amazonaws.com
 72 |             Action:
 73 |               - "sts:AssumeRole"
 74 |       Policies:
 75 |         - PolicyName: S3andMedicalImaging
 76 |           PolicyDocument:
 77 |             Version: '2012-10-17'
 78 |             Statement:
 79 |             - Effect: Allow
 80 |               Action:
 81 |                 - "s3:GetObject"
 82 |                 - "s3:PutObject"
 83 |                 - "s3:DeleteObject"
 84 |                 - "s3:ListBucket"
 85 |                 - "s3:GetEncryptionConfiguration"
 86 |               Resource: "arn:aws:s3:::*" 
 87 |             - Effect: Allow
 88 |               Action:
 89 |                 - "medical-imaging:CreateDatastore"
 90 |                 - "medical-imaging:Get*"
 91 |                 - "medical-imaging:List*"
 92 |                 - "medical-imaging:Update*"
 93 |                 - "medical-imaging:StartDICOMImportJob"
 94 |                 - "medical-imaging:DeleteDatastore"
 95 |                 - "medical-imaging:DeleteImageSet"
 96 |               Resource: "*"
 97 |             - Effect: Allow
 98 |               Action:
 99 |                 - "iam:GetUser"
100 |                 - "iam:GetPolicy"
101 |                 - "iam:CreatePolicy"
102 |                 - "iam:GetRole"
103 |                 - "iam:CreateRole"
104 |                 - "iam:AttachRolePolicy"
105 |               Resource: "*"
106 |             - Effect: Allow
107 |               Action:
108 |                 - "glue:CreateCrawler"
109 |                 - "glue:GetCrawler"
110 |                 - "glue:StartCrawler"
111 |                 - "glue:DeleteCrawler"
112 |               Resource: "*"
113 |             - Effect: "Allow"
114 |               Action: 
115 |                 - "iam:PassRole"
116 |               Resource: 
117 |                 - !GetAtt HealthImagingImportJobRole.Arn
118 |                 - !GetAtt OmicsAccessRole.Arn
119 |                 - !Sub arn:aws:iam::${AWS::AccountId}:role/SageMakerStudioExecutionRole
120 |       Path: /
121 |       ManagedPolicyArns:
122 |         - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
123 |         - arn:aws:iam::aws:policy/AmazonHealthLakeFullAccess
124 |         - arn:aws:iam::aws:policy/AmazonOmicsFullAccess
125 |         - arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin
126 |         - arn:aws:iam::aws:policy/AmazonAthenaFullAccess
127 | 
128 |   HealthImagingImportJobRole:
129 |     Type: "AWS::IAM::Role"
130 |     Properties:
131 |       RoleName: !Ref HealthImagingImportRoleName
132 |       AssumeRolePolicyDocument:
133 |         Version: 2012-10-17
134 |         Statement:
135 |           - Effect: Allow
136 |             Principal:
137 |               Service:
138 |                 - medical-imaging.amazonaws.com
139 |             Action:
140 |               - "sts:AssumeRole"
141 |       Policies:
142 |         - PolicyName: S3andMedicalImaging
143 |           PolicyDocument:
144 |             Version: '2012-10-17'
145 |             Statement:
146 |             - Effect: Allow
147 |               Action:
148 |                 - "s3:GetObject"
149 |                 - "s3:PutObject"
150 |                 - "s3:DeleteObject"
151 |                 - "s3:ListBucket"
152 |                 - "s3:GetEncryptionConfiguration"
153 |               Resource: arn:aws:s3:::* 
154 |             - Effect: Allow
155 |               Action:
156 |                 - "medical-imaging:StartDICOMImportJob"
157 |               Resource: "*" 
158 |   
159 |   OmicsAccessRole:
160 |     Type: AWS::IAM::Role
161 |     Properties:
162 |       RoleName: !Ref HealthOmicsImportRoleName
163 |       AssumeRolePolicyDocument:
164 |         Statement:
165 |         - Action:
166 |           - sts:AssumeRole
167 |           Effect: Allow
168 |           Principal:
169 |             Service:
170 |             - omics.amazonaws.com
171 |         Version: '2012-10-17'
172 |       Path: /
173 |       Policies:
174 |         - PolicyName: s3-access
175 |           PolicyDocument:
176 |             Statement:
177 |             - Action:
178 |                 - s3:GetObject
179 |                 - s3:GetBucketLocation
180 |                 - s3:ListBucket
181 |                 - s3:Put*
182 |               Effect: Allow
183 |               Resource:
184 |                 - arn:aws:s3:::*
185 |                 - arn:aws:s3:::*/*
186 |       ManagedPolicyArns:
187 |         - arn:aws:iam::aws:policy/AmazonOmicsFullAccess
188 | 
189 |   WorkshopDomain: 
190 |     Type: AWS::SageMaker::Domain
191 |     DependsOn: SageMakerExecutionRole
192 |     Properties: 
193 |       AuthMode: IAM
194 |       DefaultUserSettings: 
195 |         ExecutionRole: !GetAtt SageMakerExecutionRole.Arn
196 |       DomainName: "SageMakerStudioWorkshopDomain"
197 |       SubnetIds: 
198 |         - !Ref ParameterSubnet1Id
199 |         - !Ref ParameterSubnet2Id
200 |       VpcId: !Ref ParameterVPCId
201 |       
202 |   DefaultUser:
203 |     Type: AWS::SageMaker::UserProfile
204 |     Properties: 
205 |       DomainId: !Ref WorkshopDomain
206 |       UserProfileName: sagemaker-user
207 |     DependsOn: WorkshopDomain
208 |     
209 |       
210 | 
211 | 
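When the template is deployed programmatically instead of through the console, its six parameters have to be supplied as a list of key/value pairs. The sketch below builds that list in the shape accepted by the `Parameters` argument of boto3's CloudFormation `create_stack` call. The helper name `build_stack_parameters` is hypothetical, the VPC and subnet IDs in the usage note are placeholders, and the defaults simply repeat the defaults declared in the template.

```python
def build_stack_parameters(
    vpc_id,
    subnet1_id,
    subnet2_id,
    imaging_role_name="HealthImagingImportJobRole",
    omics_role_name="OmicsUnifiedJobRole",
    create_bucket=False,
):
    """Build the Parameters list for the template, overriding role names when they clash."""
    values = {
        "ParameterVPCId": vpc_id,
        "ParameterSubnet1Id": subnet1_id,
        "ParameterSubnet2Id": subnet2_id,
        "HealthImagingImportRoleName": imaging_role_name,
        "HealthOmicsImportRoleName": omics_role_name,
        # The template declares this parameter as a String limited to true/false.
        "CreateS3BucketforSageMaker": "true" if create_bucket else "false",
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]
```

For example, `build_stack_parameters("vpc-0example", "subnet-0a", "subnet-0b", omics_role_name="MyOmicsRole")` keeps the bucket creation switched off and renames only the HealthOmics import role.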


--------------------------------------------------------------------------------
/preprocess-multimodal-data/README.md:
--------------------------------------------------------------------------------
 1 | Before running the preprocessing notebooks, you will need to create a SageMaker domain if you haven't already and provide the necessary Lake Formation and Athena permissions.
 2 | 
 3 | The preprocessing notebooks directly take data from S3. For queries used to generate the S3 files used in the preprocessing notebooks, refer to the single-patient-records.ipynb notebook.
 4 | 
 5 | ### Create a SageMaker domain
 6 | 
 7 | You can skip this step if you already have a SageMaker domain set up and move on to providing the necessary permissions.
 8 | 
 9 | 1. Go to the **Amazon SageMaker Console**
10 | 2. Select **Admin configurations** on the left and then **Domains**.
11 | 3. Click **Create domain**. 
12 | 4. Choose **Quick setup**
13 | 5. Click on **Set up**. It takes a couple of minutes for the domain setup to complete. Remain on the same page.
14 | 
15 | ### Provide necessary permissions to the role
16 | 
17 | 1. Get the execution role of your SageMaker domain. Click on the default user profile and note down the execution role on the right hand side of the page. The role will look like `arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-XXXX`. Note down the `AmazonSageMaker-ExecutionRole-XXXX` part as it will be used in the next steps.
18 | 2. Go to the **Lake Formation** service page and choose **Administrative roles and tasks**.
19 | 3. Go to the **Data lake administrators** section and click on **Add**. Under **IAM users and roles**, choose the execution role noted in step 1. Click on **Confirm**.
20 | 4. Go to the **IAM** service page. Click on **Roles**. Search for the execution role noted in step 1. Click on the execution role.
21 | 5. Click on **Add permissions** and **Attach policies**. Select **AmazonAthenaFullAccess** and **Add permissions**.
22 | 6. Go back to the SageMaker domain in your SageMaker console. Under **User profiles**, click on **Launch** and select **Studio**. Wait for the Studio to launch.
23 | 
24 | 
25 | 
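Step 1 above asks you to note the role name, which is the last path segment of the execution role ARN. A minimal sketch of that extraction is shown here; the function name `role_name_from_arn` is hypothetical, and the ARN in the usage note is the placeholder from the instructions.

```python
def role_name_from_arn(role_arn):
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    prefix, _, resource = role_arn.partition(":role/")
    if not prefix.startswith("arn:") or not resource:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    # Drop any path such as "service-role/" that precedes the role name.
    return resource.rsplit("/", 1)[-1]
```

Passing `arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-XXXX` returns `AmazonSageMaker-ExecutionRole-XXXX`, which is the value to search for in the Lake Formation and IAM consoles.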


--------------------------------------------------------------------------------
/preprocess-multimodal-data/clinical/preprocess-clinical.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "c39ef86c-3c08-4a76-b852-939248bcd94d",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "\n",
  9 |     "# Read and preprocess clinical data from S3 and store features in SageMaker FeatureStore\n"
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": null,
 15 |    "id": "64e72b0f-db13-48b4-8d71-0e7c2d2bb7b9",
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "!pip install --no-dependencies s3fs"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": null,
 25 |    "id": "ff045821-8bf7-46f8-8186-08d9e013875a",
 26 |    "metadata": {
 27 |     "tags": []
 28 |    },
 29 |    "outputs": [],
 30 |    "source": [
 31 |     "import boto3\n",
 32 |     "import numpy as np\n",
 33 |     "import pandas as pd\n",
 34 |     "import matplotlib.pyplot as plt\n",
 35 |     "import io, os\n",
 36 |     "from time import gmtime, strftime, sleep\n",
 37 |     "import time\n",
 38 |     "import sagemaker\n",
 39 |     "from sagemaker.session import Session\n",
 40 |     "from sagemaker import get_execution_role\n",
 41 |     "from sagemaker.feature_store.feature_group import FeatureGroup"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "markdown",
 46 |    "id": "58fd3667-9479-46ca-833f-96e1948fca39",
 47 |    "metadata": {},
 48 |    "source": [
 49 |     "\n",
 50 |     "## Set up SageMaker FeatureStore\n"
 51 |    ]
 52 |   },
 53 |   {
 54 |    "cell_type": "code",
 55 |    "execution_count": null,
 56 |    "id": "b044786e-8c94-4755-b84f-8941c7e4f05d",
 57 |    "metadata": {
 58 |     "tags": []
 59 |    },
 60 |    "outputs": [],
 61 |    "source": [
 62 |     "region = boto3.Session().region_name\n",
 63 |     "\n",
 64 |     "boto_session = boto3.Session(region_name=region)\n",
 65 |     "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n",
 66 |     "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n",
 67 |     "\n",
 68 |     "feature_store_session = Session(\n",
 69 |     "    boto_session=boto_session,\n",
 70 |     "    sagemaker_client=sagemaker_client,\n",
 71 |     "    sagemaker_featurestore_runtime_client=featurestore_runtime\n",
 72 |     ")\n",
 73 |     "\n",
 74 |     "role = get_execution_role()\n",
 75 |     "s3_client = boto3.client('s3', region_name=region)\n",
 76 |     "\n",
 77 |     "default_s3_bucket_name = feature_store_session.default_bucket()\n",
 78 |     "prefix = 'sagemaker-featurestore-demo'"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "markdown",
 83 |    "id": "6df3eac8-a59d-4a34-9a49-b24ec97045d2",
 84 |    "metadata": {},
 85 |    "source": [
 86 |     "\n",
 87 |     "## Get data from S3\n"
 88 |    ]
 89 |   },
 90 |   {
 91 |    "cell_type": "code",
 92 |    "execution_count": null,
 93 |    "id": "56d71658-97bf-42a8-882b-38d4d2ccbc28",
 94 |    "metadata": {
 95 |     "tags": []
 96 |    },
 97 |    "outputs": [],
 98 |    "source": [
 99 |     "# Get data from S3. \n",
100 |     "bucket_clin = 'guidance-multimodal-hcls-healthai-machinelearning/preprocessing'\n",
101 |     "#bucket_clin = <S3-bucket-name>\n",
102 |     "\n",
103 |     "# Clinical data \n",
104 |     "data_key_clin = 'final_clinical_df.csv'\n",
105 |     "#data_key_clin = <file-name.csv>\n",
106 |     "\n",
107 |     "data_location_clin = 's3://{}/{}'.format(bucket_clin, data_key_clin)\n",
108 |     "data_clinical = pd.read_csv(data_location_clin)"
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "markdown",
113 |    "id": "ab13547e-ca00-4f74-a806-d053cbb34c1c",
114 |    "metadata": {},
115 |    "source": [
116 |     "\n",
117 |     "## Preprocess Data\n"
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": null,
123 |    "id": "4bd20e05-596b-410d-a027-ddf2eb70678f",
124 |    "metadata": {
125 |     "tags": []
126 |    },
127 |    "outputs": [],
128 |    "source": [
129 |     "# Replacing NaN with zeros\n",
130 |     "data_clinical_1 = data_clinical.copy()\n",
131 |     "data_clinical_1 = data_clinical_1.replace(np.nan, 0)\n",
132 |     "data_clinical_1 = data_clinical_1.astype(str)\n",
133 |     "\n",
134 |     "# Initialize the prediction label columns to '0', then convert each diagnosisCode string to a set of codes\n",
135 |     "data_clinical_1[['alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction']] = '0'\n",
136 |     "data_clinical_pred = data_clinical_1.copy()\n",
137 |     "for i in range(len(data_clinical_1)):\n",
138 |     "    data_clinical_pred.loc[i, 'diagnosisCode'] = set(data_clinical_pred.loc[i, 'diagnosisCode'].replace('\\'', '').replace(' ', '').replace('{', '').replace('}', '').split(','))\n",
139 |     "\n",
140 |     "# Adding a column for prediction of Alzheimer's disease code '26929004'\n",
141 |     "# Adding a column for prediction of Coronary heart disease '53741008'\n",
142 |     "# Adding a column for prediction of Stroke code '230690007'\n",
143 |     "# Adding a column for prediction of Hypertension code '59621000'\n",
144 |     "for i in range(len(data_clinical_pred)):\n",
145 |     "    if \"26929004\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n",
146 |     "        data_clinical_pred.loc[i, 'alzheimers_prediction']  =  '1'\n",
147 |     "    if \"53741008\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n",
148 |     "        data_clinical_pred.loc[i, 'coronary_heart_disease_prediction']  =  '1'\n",
149 |     "    if \"230690007\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n",
150 |     "        data_clinical_pred.loc[i, 'stroke_prediction']  =  '1'\n",
151 |     "    if \"59621000\" in (data_clinical_pred.loc[i, 'diagnosisCode']):\n",
152 |     "        data_clinical_pred.loc[i, 'hypertension_prediction']  =  '1'\n",
153 |     "print(\"Patients with Alzheimer's disease = \", len(data_clinical_pred[data_clinical_pred['alzheimers_prediction']=='1']))\n",
154 |     "print(\"Patients with Coronary Heart disease = \", len(data_clinical_pred[data_clinical_pred['coronary_heart_disease_prediction']=='1']))\n",
155 |     "print(\"Patients with Stroke = \", len(data_clinical_pred[data_clinical_pred['stroke_prediction']=='1']))\n",
156 |     "print(\"Patients with Hypertension = \", len(data_clinical_pred[data_clinical_pred['hypertension_prediction']=='1']))\n",
157 |     "\n",
158 |     "# Delete columns with leakage and features irrelevant for model training\n",
159 |     "list_delete_cols = ['diagnosisDescription', 'diagnosisCode', 'onsetdatetime', 'name', 'addressline',\n",
160 |     "       'city', 'state', 'country', 'latitude', 'longitude']\n",
161 |     "data_clinical_pred.drop(list_delete_cols, axis=1, inplace=True)\n",
162 |     "\n",
163 |     "data_clinical_pred.head(10)"
164 |    ]
165 |   },
166 |   {
167 |    "cell_type": "markdown",
168 |    "id": "1c8f4398-53df-4465-9259-ce80406f893e",
169 |    "metadata": {
170 |     "tags": []
171 |    },
172 |    "source": [
173 |     "\n",
174 |     "## Ingest data into FeatureStore\n"
175 |    ]
176 |   },
177 |   {
178 |    "cell_type": "code",
179 |    "execution_count": null,
180 |    "id": "b8f82d22-8ad1-4258-be19-3f5d1dc0f4d5",
181 |    "metadata": {
182 |     "tags": []
183 |    },
184 |    "outputs": [],
185 |    "source": [
186 |     "clinical_feature_group_name = 'clinical-feature-group'\n",
187 |     "clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)\n",
188 |     "\n",
189 |     "current_time_sec = int(round(time.time()))\n",
190 |     "\n",
191 |     "def cast_object_to_string(data_frame):\n",
192 |     "    for label in data_frame.columns:\n",
193 |     "        print(label)\n",
194 |     "        if data_frame.dtypes[label] == 'object':\n",
195 |     "            data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n",
196 |     "\n",
197 |     "# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n",
198 |     "cast_object_to_string(data_clinical_pred)\n",
199 |     "\n",
200 |     "# Record identifier and event time feature names\n",
201 |     "record_identifier_feature_name = \"patientID\"\n",
202 |     "event_time_feature_name = \"EventTime\"\n",
203 |     "\n",
204 |     "# Append EventTime feature\n",
205 |     "data_clinical_pred[event_time_feature_name] = pd.Series([current_time_sec]*len(data_clinical_pred), dtype=\"float64\")\n",
206 |     "\n",
207 |     "# Fill any NaN event times with 0 (NaN can appear if the DataFrame index is not the default 0..n-1 range)\n",
208 |     "data_clinical_pred[event_time_feature_name] = data_clinical_pred[event_time_feature_name].fillna(0)\n",
209 |     "\n",
210 |     "# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n",
211 |     "clinical_feature_group.load_feature_definitions(data_frame=data_clinical_pred); # output is suppressed\n",
212 |     "\n",
213 |     "\n",
214 |     "def wait_for_feature_group_creation_complete(feature_group):\n",
215 |     "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
216 |     "    while status == \"Creating\":\n",
217 |     "        print(\"Waiting for Feature Group Creation\")\n",
218 |     "        time.sleep(15)\n",
219 |     "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
220 |     "    if status != \"Created\":\n",
221 |     "        raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n",
222 |     "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n",
223 |     "\n",
224 |     "clinical_feature_group.create(\n",
225 |     "    s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n",
226 |     "    record_identifier_name=record_identifier_feature_name,\n",
227 |     "    event_time_feature_name=event_time_feature_name,\n",
228 |     "    role_arn=role,\n",
229 |     "    enable_online_store=True\n",
230 |     ")\n",
231 |     "\n",
232 |     "wait_for_feature_group_creation_complete(feature_group=clinical_feature_group)\n",
233 |     "\n",
234 |     "clinical_feature_group.ingest(\n",
235 |     "    data_frame=data_clinical_pred, max_workers=5, wait=True\n",
236 |     ")"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "code",
241 |    "execution_count": null,
242 |    "id": "aa2a8752-7e16-4880-b6e0-32e7882216ba",
243 |    "metadata": {},
244 |    "outputs": [],
245 |    "source": []
246 |   }
247 |  ],
248 |  "metadata": {
249 |   "availableInstances": [
250 |    {
251 |     "_defaultOrder": 0,
252 |     "_isFastLaunch": true,
253 |     "category": "General purpose",
254 |     "gpuNum": 0,
255 |     "hideHardwareSpecs": false,
256 |     "memoryGiB": 4,
257 |     "name": "ml.t3.medium",
258 |     "vcpuNum": 2
259 |    },
260 |    {
261 |     "_defaultOrder": 1,
262 |     "_isFastLaunch": false,
263 |     "category": "General purpose",
264 |     "gpuNum": 0,
265 |     "hideHardwareSpecs": false,
266 |     "memoryGiB": 8,
267 |     "name": "ml.t3.large",
268 |     "vcpuNum": 2
269 |    },
270 |    {
271 |     "_defaultOrder": 2,
272 |     "_isFastLaunch": false,
273 |     "category": "General purpose",
274 |     "gpuNum": 0,
275 |     "hideHardwareSpecs": false,
276 |     "memoryGiB": 16,
277 |     "name": "ml.t3.xlarge",
278 |     "vcpuNum": 4
279 |    },
280 |    {
281 |     "_defaultOrder": 3,
282 |     "_isFastLaunch": false,
283 |     "category": "General purpose",
284 |     "gpuNum": 0,
285 |     "hideHardwareSpecs": false,
286 |     "memoryGiB": 32,
287 |     "name": "ml.t3.2xlarge",
288 |     "vcpuNum": 8
289 |    },
290 |    {
291 |     "_defaultOrder": 4,
292 |     "_isFastLaunch": true,
293 |     "category": "General purpose",
294 |     "gpuNum": 0,
295 |     "hideHardwareSpecs": false,
296 |     "memoryGiB": 8,
297 |     "name": "ml.m5.large",
298 |     "vcpuNum": 2
299 |    },
300 |    {
301 |     "_defaultOrder": 5,
302 |     "_isFastLaunch": false,
303 |     "category": "General purpose",
304 |     "gpuNum": 0,
305 |     "hideHardwareSpecs": false,
306 |     "memoryGiB": 16,
307 |     "name": "ml.m5.xlarge",
308 |     "vcpuNum": 4
309 |    },
310 |    {
311 |     "_defaultOrder": 6,
312 |     "_isFastLaunch": false,
313 |     "category": "General purpose",
314 |     "gpuNum": 0,
315 |     "hideHardwareSpecs": false,
316 |     "memoryGiB": 32,
317 |     "name": "ml.m5.2xlarge",
318 |     "vcpuNum": 8
319 |    },
320 |    {
321 |     "_defaultOrder": 7,
322 |     "_isFastLaunch": false,
323 |     "category": "General purpose",
324 |     "gpuNum": 0,
325 |     "hideHardwareSpecs": false,
326 |     "memoryGiB": 64,
327 |     "name": "ml.m5.4xlarge",
328 |     "vcpuNum": 16
329 |    },
330 |    {
331 |     "_defaultOrder": 8,
332 |     "_isFastLaunch": false,
333 |     "category": "General purpose",
334 |     "gpuNum": 0,
335 |     "hideHardwareSpecs": false,
336 |     "memoryGiB": 128,
337 |     "name": "ml.m5.8xlarge",
338 |     "vcpuNum": 32
339 |    },
340 |    {
341 |     "_defaultOrder": 9,
342 |     "_isFastLaunch": false,
343 |     "category": "General purpose",
344 |     "gpuNum": 0,
345 |     "hideHardwareSpecs": false,
346 |     "memoryGiB": 192,
347 |     "name": "ml.m5.12xlarge",
348 |     "vcpuNum": 48
349 |    },
350 |    {
351 |     "_defaultOrder": 10,
352 |     "_isFastLaunch": false,
353 |     "category": "General purpose",
354 |     "gpuNum": 0,
355 |     "hideHardwareSpecs": false,
356 |     "memoryGiB": 256,
357 |     "name": "ml.m5.16xlarge",
358 |     "vcpuNum": 64
359 |    },
360 |    {
361 |     "_defaultOrder": 11,
362 |     "_isFastLaunch": false,
363 |     "category": "General purpose",
364 |     "gpuNum": 0,
365 |     "hideHardwareSpecs": false,
366 |     "memoryGiB": 384,
367 |     "name": "ml.m5.24xlarge",
368 |     "vcpuNum": 96
369 |    },
370 |    {
371 |     "_defaultOrder": 12,
372 |     "_isFastLaunch": false,
373 |     "category": "General purpose",
374 |     "gpuNum": 0,
375 |     "hideHardwareSpecs": false,
376 |     "memoryGiB": 8,
377 |     "name": "ml.m5d.large",
378 |     "vcpuNum": 2
379 |    },
380 |    {
381 |     "_defaultOrder": 13,
382 |     "_isFastLaunch": false,
383 |     "category": "General purpose",
384 |     "gpuNum": 0,
385 |     "hideHardwareSpecs": false,
386 |     "memoryGiB": 16,
387 |     "name": "ml.m5d.xlarge",
388 |     "vcpuNum": 4
389 |    },
390 |    {
391 |     "_defaultOrder": 14,
392 |     "_isFastLaunch": false,
393 |     "category": "General purpose",
394 |     "gpuNum": 0,
395 |     "hideHardwareSpecs": false,
396 |     "memoryGiB": 32,
397 |     "name": "ml.m5d.2xlarge",
398 |     "vcpuNum": 8
399 |    },
400 |    {
401 |     "_defaultOrder": 15,
402 |     "_isFastLaunch": false,
403 |     "category": "General purpose",
404 |     "gpuNum": 0,
405 |     "hideHardwareSpecs": false,
406 |     "memoryGiB": 64,
407 |     "name": "ml.m5d.4xlarge",
408 |     "vcpuNum": 16
409 |    },
410 |    {
411 |     "_defaultOrder": 16,
412 |     "_isFastLaunch": false,
413 |     "category": "General purpose",
414 |     "gpuNum": 0,
415 |     "hideHardwareSpecs": false,
416 |     "memoryGiB": 128,
417 |     "name": "ml.m5d.8xlarge",
418 |     "vcpuNum": 32
419 |    },
420 |    {
421 |     "_defaultOrder": 17,
422 |     "_isFastLaunch": false,
423 |     "category": "General purpose",
424 |     "gpuNum": 0,
425 |     "hideHardwareSpecs": false,
426 |     "memoryGiB": 192,
427 |     "name": "ml.m5d.12xlarge",
428 |     "vcpuNum": 48
429 |    },
430 |    {
431 |     "_defaultOrder": 18,
432 |     "_isFastLaunch": false,
433 |     "category": "General purpose",
434 |     "gpuNum": 0,
435 |     "hideHardwareSpecs": false,
436 |     "memoryGiB": 256,
437 |     "name": "ml.m5d.16xlarge",
438 |     "vcpuNum": 64
439 |    },
440 |    {
441 |     "_defaultOrder": 19,
442 |     "_isFastLaunch": false,
443 |     "category": "General purpose",
444 |     "gpuNum": 0,
445 |     "hideHardwareSpecs": false,
446 |     "memoryGiB": 384,
447 |     "name": "ml.m5d.24xlarge",
448 |     "vcpuNum": 96
449 |    },
450 |    {
451 |     "_defaultOrder": 20,
452 |     "_isFastLaunch": false,
453 |     "category": "General purpose",
454 |     "gpuNum": 0,
455 |     "hideHardwareSpecs": true,
456 |     "memoryGiB": 0,
457 |     "name": "ml.geospatial.interactive",
458 |     "supportedImageNames": [
459 |      "sagemaker-geospatial-v1-0"
460 |     ],
461 |     "vcpuNum": 0
462 |    },
463 |    {
464 |     "_defaultOrder": 21,
465 |     "_isFastLaunch": true,
466 |     "category": "Compute optimized",
467 |     "gpuNum": 0,
468 |     "hideHardwareSpecs": false,
469 |     "memoryGiB": 4,
470 |     "name": "ml.c5.large",
471 |     "vcpuNum": 2
472 |    },
473 |    {
474 |     "_defaultOrder": 22,
475 |     "_isFastLaunch": false,
476 |     "category": "Compute optimized",
477 |     "gpuNum": 0,
478 |     "hideHardwareSpecs": false,
479 |     "memoryGiB": 8,
480 |     "name": "ml.c5.xlarge",
481 |     "vcpuNum": 4
482 |    },
483 |    {
484 |     "_defaultOrder": 23,
485 |     "_isFastLaunch": false,
486 |     "category": "Compute optimized",
487 |     "gpuNum": 0,
488 |     "hideHardwareSpecs": false,
489 |     "memoryGiB": 16,
490 |     "name": "ml.c5.2xlarge",
491 |     "vcpuNum": 8
492 |    },
493 |    {
494 |     "_defaultOrder": 24,
495 |     "_isFastLaunch": false,
496 |     "category": "Compute optimized",
497 |     "gpuNum": 0,
498 |     "hideHardwareSpecs": false,
499 |     "memoryGiB": 32,
500 |     "name": "ml.c5.4xlarge",
501 |     "vcpuNum": 16
502 |    },
503 |    {
504 |     "_defaultOrder": 25,
505 |     "_isFastLaunch": false,
506 |     "category": "Compute optimized",
507 |     "gpuNum": 0,
508 |     "hideHardwareSpecs": false,
509 |     "memoryGiB": 72,
510 |     "name": "ml.c5.9xlarge",
511 |     "vcpuNum": 36
512 |    },
513 |    {
514 |     "_defaultOrder": 26,
515 |     "_isFastLaunch": false,
516 |     "category": "Compute optimized",
517 |     "gpuNum": 0,
518 |     "hideHardwareSpecs": false,
519 |     "memoryGiB": 96,
520 |     "name": "ml.c5.12xlarge",
521 |     "vcpuNum": 48
522 |    },
523 |    {
524 |     "_defaultOrder": 27,
525 |     "_isFastLaunch": false,
526 |     "category": "Compute optimized",
527 |     "gpuNum": 0,
528 |     "hideHardwareSpecs": false,
529 |     "memoryGiB": 144,
530 |     "name": "ml.c5.18xlarge",
531 |     "vcpuNum": 72
532 |    },
533 |    {
534 |     "_defaultOrder": 28,
535 |     "_isFastLaunch": false,
536 |     "category": "Compute optimized",
537 |     "gpuNum": 0,
538 |     "hideHardwareSpecs": false,
539 |     "memoryGiB": 192,
540 |     "name": "ml.c5.24xlarge",
541 |     "vcpuNum": 96
542 |    },
543 |    {
544 |     "_defaultOrder": 29,
545 |     "_isFastLaunch": true,
546 |     "category": "Accelerated computing",
547 |     "gpuNum": 1,
548 |     "hideHardwareSpecs": false,
549 |     "memoryGiB": 16,
550 |     "name": "ml.g4dn.xlarge",
551 |     "vcpuNum": 4
552 |    },
553 |    {
554 |     "_defaultOrder": 30,
555 |     "_isFastLaunch": false,
556 |     "category": "Accelerated computing",
557 |     "gpuNum": 1,
558 |     "hideHardwareSpecs": false,
559 |     "memoryGiB": 32,
560 |     "name": "ml.g4dn.2xlarge",
561 |     "vcpuNum": 8
562 |    },
563 |    {
564 |     "_defaultOrder": 31,
565 |     "_isFastLaunch": false,
566 |     "category": "Accelerated computing",
567 |     "gpuNum": 1,
568 |     "hideHardwareSpecs": false,
569 |     "memoryGiB": 64,
570 |     "name": "ml.g4dn.4xlarge",
571 |     "vcpuNum": 16
572 |    },
573 |    {
574 |     "_defaultOrder": 32,
575 |     "_isFastLaunch": false,
576 |     "category": "Accelerated computing",
577 |     "gpuNum": 1,
578 |     "hideHardwareSpecs": false,
579 |     "memoryGiB": 128,
580 |     "name": "ml.g4dn.8xlarge",
581 |     "vcpuNum": 32
582 |    },
583 |    {
584 |     "_defaultOrder": 33,
585 |     "_isFastLaunch": false,
586 |     "category": "Accelerated computing",
587 |     "gpuNum": 4,
588 |     "hideHardwareSpecs": false,
589 |     "memoryGiB": 192,
590 |     "name": "ml.g4dn.12xlarge",
591 |     "vcpuNum": 48
592 |    },
593 |    {
594 |     "_defaultOrder": 34,
595 |     "_isFastLaunch": false,
596 |     "category": "Accelerated computing",
597 |     "gpuNum": 1,
598 |     "hideHardwareSpecs": false,
599 |     "memoryGiB": 256,
600 |     "name": "ml.g4dn.16xlarge",
601 |     "vcpuNum": 64
602 |    },
603 |    {
604 |     "_defaultOrder": 35,
605 |     "_isFastLaunch": false,
606 |     "category": "Accelerated computing",
607 |     "gpuNum": 1,
608 |     "hideHardwareSpecs": false,
609 |     "memoryGiB": 61,
610 |     "name": "ml.p3.2xlarge",
611 |     "vcpuNum": 8
612 |    },
613 |    {
614 |     "_defaultOrder": 36,
615 |     "_isFastLaunch": false,
616 |     "category": "Accelerated computing",
617 |     "gpuNum": 4,
618 |     "hideHardwareSpecs": false,
619 |     "memoryGiB": 244,
620 |     "name": "ml.p3.8xlarge",
621 |     "vcpuNum": 32
622 |    },
623 |    {
624 |     "_defaultOrder": 37,
625 |     "_isFastLaunch": false,
626 |     "category": "Accelerated computing",
627 |     "gpuNum": 8,
628 |     "hideHardwareSpecs": false,
629 |     "memoryGiB": 488,
630 |     "name": "ml.p3.16xlarge",
631 |     "vcpuNum": 64
632 |    },
633 |    {
634 |     "_defaultOrder": 38,
635 |     "_isFastLaunch": false,
636 |     "category": "Accelerated computing",
637 |     "gpuNum": 8,
638 |     "hideHardwareSpecs": false,
639 |     "memoryGiB": 768,
640 |     "name": "ml.p3dn.24xlarge",
641 |     "vcpuNum": 96
642 |    },
643 |    {
644 |     "_defaultOrder": 39,
645 |     "_isFastLaunch": false,
646 |     "category": "Memory Optimized",
647 |     "gpuNum": 0,
648 |     "hideHardwareSpecs": false,
649 |     "memoryGiB": 16,
650 |     "name": "ml.r5.large",
651 |     "vcpuNum": 2
652 |    },
653 |    {
654 |     "_defaultOrder": 40,
655 |     "_isFastLaunch": false,
656 |     "category": "Memory Optimized",
657 |     "gpuNum": 0,
658 |     "hideHardwareSpecs": false,
659 |     "memoryGiB": 32,
660 |     "name": "ml.r5.xlarge",
661 |     "vcpuNum": 4
662 |    },
663 |    {
664 |     "_defaultOrder": 41,
665 |     "_isFastLaunch": false,
666 |     "category": "Memory Optimized",
667 |     "gpuNum": 0,
668 |     "hideHardwareSpecs": false,
669 |     "memoryGiB": 64,
670 |     "name": "ml.r5.2xlarge",
671 |     "vcpuNum": 8
672 |    },
673 |    {
674 |     "_defaultOrder": 42,
675 |     "_isFastLaunch": false,
676 |     "category": "Memory Optimized",
677 |     "gpuNum": 0,
678 |     "hideHardwareSpecs": false,
679 |     "memoryGiB": 128,
680 |     "name": "ml.r5.4xlarge",
681 |     "vcpuNum": 16
682 |    },
683 |    {
684 |     "_defaultOrder": 43,
685 |     "_isFastLaunch": false,
686 |     "category": "Memory Optimized",
687 |     "gpuNum": 0,
688 |     "hideHardwareSpecs": false,
689 |     "memoryGiB": 256,
690 |     "name": "ml.r5.8xlarge",
691 |     "vcpuNum": 32
692 |    },
693 |    {
694 |     "_defaultOrder": 44,
695 |     "_isFastLaunch": false,
696 |     "category": "Memory Optimized",
697 |     "gpuNum": 0,
698 |     "hideHardwareSpecs": false,
699 |     "memoryGiB": 384,
700 |     "name": "ml.r5.12xlarge",
701 |     "vcpuNum": 48
702 |    },
703 |    {
704 |     "_defaultOrder": 45,
705 |     "_isFastLaunch": false,
706 |     "category": "Memory Optimized",
707 |     "gpuNum": 0,
708 |     "hideHardwareSpecs": false,
709 |     "memoryGiB": 512,
710 |     "name": "ml.r5.16xlarge",
711 |     "vcpuNum": 64
712 |    },
713 |    {
714 |     "_defaultOrder": 46,
715 |     "_isFastLaunch": false,
716 |     "category": "Memory Optimized",
717 |     "gpuNum": 0,
718 |     "hideHardwareSpecs": false,
719 |     "memoryGiB": 768,
720 |     "name": "ml.r5.24xlarge",
721 |     "vcpuNum": 96
722 |    },
723 |    {
724 |     "_defaultOrder": 47,
725 |     "_isFastLaunch": false,
726 |     "category": "Accelerated computing",
727 |     "gpuNum": 1,
728 |     "hideHardwareSpecs": false,
729 |     "memoryGiB": 16,
730 |     "name": "ml.g5.xlarge",
731 |     "vcpuNum": 4
732 |    },
733 |    {
734 |     "_defaultOrder": 48,
735 |     "_isFastLaunch": false,
736 |     "category": "Accelerated computing",
737 |     "gpuNum": 1,
738 |     "hideHardwareSpecs": false,
739 |     "memoryGiB": 32,
740 |     "name": "ml.g5.2xlarge",
741 |     "vcpuNum": 8
742 |    },
743 |    {
744 |     "_defaultOrder": 49,
745 |     "_isFastLaunch": false,
746 |     "category": "Accelerated computing",
747 |     "gpuNum": 1,
748 |     "hideHardwareSpecs": false,
749 |     "memoryGiB": 64,
750 |     "name": "ml.g5.4xlarge",
751 |     "vcpuNum": 16
752 |    },
753 |    {
754 |     "_defaultOrder": 50,
755 |     "_isFastLaunch": false,
756 |     "category": "Accelerated computing",
757 |     "gpuNum": 1,
758 |     "hideHardwareSpecs": false,
759 |     "memoryGiB": 128,
760 |     "name": "ml.g5.8xlarge",
761 |     "vcpuNum": 32
762 |    },
763 |    {
764 |     "_defaultOrder": 51,
765 |     "_isFastLaunch": false,
766 |     "category": "Accelerated computing",
767 |     "gpuNum": 1,
768 |     "hideHardwareSpecs": false,
769 |     "memoryGiB": 256,
770 |     "name": "ml.g5.16xlarge",
771 |     "vcpuNum": 64
772 |    },
773 |    {
774 |     "_defaultOrder": 52,
775 |     "_isFastLaunch": false,
776 |     "category": "Accelerated computing",
777 |     "gpuNum": 4,
778 |     "hideHardwareSpecs": false,
779 |     "memoryGiB": 192,
780 |     "name": "ml.g5.12xlarge",
781 |     "vcpuNum": 48
782 |    },
783 |    {
784 |     "_defaultOrder": 53,
785 |     "_isFastLaunch": false,
786 |     "category": "Accelerated computing",
787 |     "gpuNum": 4,
788 |     "hideHardwareSpecs": false,
789 |     "memoryGiB": 384,
790 |     "name": "ml.g5.24xlarge",
791 |     "vcpuNum": 96
792 |    },
793 |    {
794 |     "_defaultOrder": 54,
795 |     "_isFastLaunch": false,
796 |     "category": "Accelerated computing",
797 |     "gpuNum": 8,
798 |     "hideHardwareSpecs": false,
799 |     "memoryGiB": 768,
800 |     "name": "ml.g5.48xlarge",
801 |     "vcpuNum": 192
802 |    }
803 |   ],
804 |   "instance_type": "ml.t3.medium",
805 |   "kernelspec": {
806 |    "display_name": "conda_mxnet_p38",
807 |    "language": "python",
808 |    "name": "conda_mxnet_p38"
809 |   },
810 |   "language_info": {
811 |    "codemirror_mode": {
812 |     "name": "ipython",
813 |     "version": 3
814 |    },
815 |    "file_extension": ".py",
816 |    "mimetype": "text/x-python",
817 |    "name": "python",
818 |    "nbconvert_exporter": "python",
819 |    "pygments_lexer": "ipython3",
820 |    "version": "3.8.15"
821 |   }
822 |  },
823 |  "nbformat": 4,
824 |  "nbformat_minor": 5
825 | }
826 | 


--------------------------------------------------------------------------------
/preprocess-multimodal-data/genomic/preprocess-genomic.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "81a85a9a-7642-422c-b4f9-f15d95b46a5b",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "\n",
  9 |     "# Read and preprocess genomic data from S3 and store features in SageMaker FeatureStore\n"
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": null,
 15 |    "id": "ff045821-8bf7-46f8-8186-08d9e013875a",
 16 |    "metadata": {
 17 |     "tags": []
 18 |    },
 19 |    "outputs": [],
 20 |    "source": [
 21 |     "import boto3\n",
 22 |     "import numpy as np\n",
 23 |     "import pandas as pd\n",
 24 |     "import matplotlib.pyplot as plt\n",
 25 |     "import io, os\n",
 26 |     "from time import gmtime, strftime, sleep\n",
 27 |     "import time\n",
 28 |     "import sagemaker\n",
 29 |     "from sagemaker.session import Session\n",
 30 |     "from sagemaker import get_execution_role\n",
 31 |     "from sagemaker.feature_store.feature_group import FeatureGroup"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "markdown",
 36 |    "id": "69d1dabe-1be7-4d30-aaf3-9b053aff6af5",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "\n",
 40 |     "## Set up SageMaker FeatureStore\n"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": null,
 46 |    "id": "b044786e-8c94-4755-b84f-8941c7e4f05d",
 47 |    "metadata": {
 48 |     "tags": []
 49 |    },
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "region = boto3.Session().region_name\n",
 53 |     "\n",
 54 |     "boto_session = boto3.Session(region_name=region)\n",
 55 |     "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n",
 56 |     "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n",
 57 |     "\n",
 58 |     "feature_store_session = Session(\n",
 59 |     "    boto_session=boto_session,\n",
 60 |     "    sagemaker_client=sagemaker_client,\n",
 61 |     "    sagemaker_featurestore_runtime_client=featurestore_runtime\n",
 62 |     ")\n",
 63 |     "\n",
 64 |     "role = get_execution_role()\n",
 65 |     "s3_client = boto3.client('s3', region_name=region)\n",
 66 |     "\n",
 67 |     "default_s3_bucket_name = feature_store_session.default_bucket()\n",
 68 |     "prefix = 'sagemaker-featurestore-demo'"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "markdown",
 73 |    "id": "becf6d9e-39b5-4a28-97a5-254e8fae34ae",
 74 |    "metadata": {},
 75 |    "source": [
 76 |     "\n",
 77 |     "## Get data from S3\n"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": null,
 83 |    "id": "56d71658-97bf-42a8-882b-38d4d2ccbc28",
 84 |    "metadata": {
 85 |     "tags": []
 86 |    },
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "# Get data from S3. bucket_gen holds the bucket name plus the key prefix of the preprocessed data.\n",
 90 |     "bucket_gen = 'guidance-multimodal-hcls-healthai-machinelearning/preprocessing'\n",
 91 |     "#bucket_gen = <S3-bucket-name>\n",
 92 |     "\n",
 93 |     "# Genomic data \n",
 94 |     "data_key_gen = 'final_genomic_df.csv'\n",
 95 |     "#data_key_gen = <file-name.csv>\n",
 96 |     "\n",
 97 |     "data_location_gen = 's3://{}/{}'.format(bucket_gen, data_key_gen)\n",
 98 |     "data_genomic = pd.read_csv(data_location_gen)"
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "markdown",
103 |    "id": "20e1c644-31a9-4d35-8197-74111504e269",
104 |    "metadata": {},
105 |    "source": [
106 |     "\n",
107 |     "## Preprocess Data\n"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": null,
113 |    "id": "4bd20e05-596b-410d-a027-ddf2eb70678f",
114 |    "metadata": {
115 |     "tags": []
116 |    },
117 |    "outputs": [],
118 |    "source": [
119 |     "# Replacing NaN with zeros\n",
120 |     "data_genomic_1 = data_genomic.copy()\n",
121 |     "data_genomic_1 = data_genomic_1.replace(np.nan, 0)\n",
122 |     "data_genomic_1 = data_genomic_1.astype(str)\n",
123 |     "\n",
124 |     "# Initialize the prediction label columns to '0', then convert each diagnosisCode string to a set of codes\n",
125 |     "data_genomic_1[['alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction']] = '0'\n",
126 |     "data_genomic_pred = data_genomic_1.copy()\n",
127 |     "for i in range(len(data_genomic_1)):\n",
128 |     "    data_genomic_pred.loc[i, 'diagnosisCode'] = set(data_genomic_pred.loc[i, 'diagnosisCode'].replace('\\'', '').replace(' ', '').replace('{', '').replace('}', '').split(','))\n",
129 |     "\n",
130 |     "# Adding a column for prediction of Alzheimer's disease code '26929004'\n",
131 |     "# Adding a column for prediction of Coronary heart disease '53741008'\n",
132 |     "# Adding a column for prediction of Stroke code '230690007'\n",
133 |     "# Adding a column for prediction of Hypertension code '59621000'\n",
134 |     "for i in range(len(data_genomic_pred)):\n",
135 |     "    if \"26929004\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n",
136 |     "        data_genomic_pred.loc[i, 'alzheimers_prediction']  =  '1'\n",
137 |     "    if \"53741008\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n",
138 |     "        data_genomic_pred.loc[i, 'coronary_heart_disease_prediction']  =  '1'\n",
139 |     "    if \"230690007\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n",
140 |     "        data_genomic_pred.loc[i, 'stroke_prediction']  =  '1'\n",
141 |     "    if \"59621000\" in (data_genomic_pred.loc[i, 'diagnosisCode']):\n",
142 |     "        data_genomic_pred.loc[i, 'hypertension_prediction']  =  '1'\n",
143 |     "print(\"Patients with Alzheimer's disease = \", len(data_genomic_pred[data_genomic_pred['alzheimers_prediction']=='1']))\n",
144 |     "print(\"Patients with Coronary Heart disease = \", len(data_genomic_pred[data_genomic_pred['coronary_heart_disease_prediction']=='1']))\n",
145 |     "print(\"Patients with Stroke = \", len(data_genomic_pred[data_genomic_pred['stroke_prediction']=='1']))\n",
146 |     "print(\"Patients with Hypertension = \", len(data_genomic_pred[data_genomic_pred['hypertension_prediction']=='1']))\n",
147 |     "\n",
148 |     "# Delete columns with leakage and features irrelevant for model training\n",
149 |     "list_delete_cols = ['diagnosisDescription', 'diagnosisCode']\n",
150 |     "data_genomic_pred.drop(list_delete_cols, axis=1, inplace=True)\n",
151 |     "\n",
152 |     "data_genomic_pred.head(10)"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "markdown",
157 |    "id": "79ca5c9e-7689-4fe9-a668-6439425f13c7",
158 |    "metadata": {},
159 |    "source": [
160 |     "\n",
161 |     "## Ingest data into FeatureStore\n"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": null,
167 |    "id": "b8f82d22-8ad1-4258-be19-3f5d1dc0f4d5",
168 |    "metadata": {
169 |     "tags": []
170 |    },
171 |    "outputs": [],
172 |    "source": [
173 |     "genomic_feature_group_name = 'genomic-feature-group'\n",
174 |     "genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)\n",
175 |     "\n",
176 |     "current_time_sec = int(round(time.time()))\n",
177 |     "\n",
178 |     "def cast_object_to_string(data_frame):\n",
179 |     "    for label in data_frame.columns:\n",
180 |     "        print(label)\n",
181 |     "        if data_frame.dtypes[label] == 'object':\n",
182 |     "            data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n",
183 |     "\n",
184 |     "# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n",
185 |     "cast_object_to_string(data_genomic_pred)\n",
186 |     "\n",
187 |     "# Record identifier and event time feature names\n",
188 |     "record_identifier_feature_name = \"patientID\"\n",
189 |     "event_time_feature_name = \"EventTime\"\n",
190 |     "\n",
191 |     "# Append EventTime feature\n",
192 |     "data_genomic_pred[event_time_feature_name] = pd.Series([current_time_sec]*len(data_genomic_pred), dtype=\"float64\")\n",
193 |     "\n",
194 |     "# Fill any NaN event times with 0 (NaN can appear if the DataFrame index is not the default 0..n-1 range)\n",
195 |     "data_genomic_pred[event_time_feature_name] = data_genomic_pred[event_time_feature_name].fillna(0)\n",
196 |     "\n",
197 |     "# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n",
198 |     "genomic_feature_group.load_feature_definitions(data_frame=data_genomic_pred); # output is suppressed\n",
199 |     "\n",
200 |     "\n",
201 |     "def wait_for_feature_group_creation_complete(feature_group):\n",
202 |     "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
203 |     "    while status == \"Creating\":\n",
204 |     "        print(\"Waiting for Feature Group Creation\")\n",
205 |     "        time.sleep(15)\n",
206 |     "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
207 |     "    if status != \"Created\":\n",
208 |     "        raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n",
209 |     "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n",
210 |     "\n",
211 |     "genomic_feature_group.create(\n",
212 |     "    s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n",
213 |     "    record_identifier_name=record_identifier_feature_name,\n",
214 |     "    event_time_feature_name=event_time_feature_name,\n",
215 |     "    role_arn=role,\n",
216 |     "    enable_online_store=True\n",
217 |     ")\n",
218 |     "\n",
219 |     "wait_for_feature_group_creation_complete(feature_group=genomic_feature_group)\n",
220 |     "\n",
221 |     "genomic_feature_group.ingest(\n",
222 |     "    data_frame=data_genomic_pred, max_workers=5, wait=True\n",
223 |     ")"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": null,
229 |    "id": "aa2a8752-7e16-4880-b6e0-32e7882216ba",
230 |    "metadata": {},
231 |    "outputs": [],
232 |    "source": []
233 |   }
234 |  ],
235 |  "metadata": {
236 |   "availableInstances": [
237 |    {
238 |     "_defaultOrder": 0,
239 |     "_isFastLaunch": true,
240 |     "category": "General purpose",
241 |     "gpuNum": 0,
242 |     "hideHardwareSpecs": false,
243 |     "memoryGiB": 4,
244 |     "name": "ml.t3.medium",
245 |     "vcpuNum": 2
246 |    },
247 |    {
248 |     "_defaultOrder": 1,
249 |     "_isFastLaunch": false,
250 |     "category": "General purpose",
251 |     "gpuNum": 0,
252 |     "hideHardwareSpecs": false,
253 |     "memoryGiB": 8,
254 |     "name": "ml.t3.large",
255 |     "vcpuNum": 2
256 |    },
257 |    {
258 |     "_defaultOrder": 2,
259 |     "_isFastLaunch": false,
260 |     "category": "General purpose",
261 |     "gpuNum": 0,
262 |     "hideHardwareSpecs": false,
263 |     "memoryGiB": 16,
264 |     "name": "ml.t3.xlarge",
265 |     "vcpuNum": 4
266 |    },
267 |    {
268 |     "_defaultOrder": 3,
269 |     "_isFastLaunch": false,
270 |     "category": "General purpose",
271 |     "gpuNum": 0,
272 |     "hideHardwareSpecs": false,
273 |     "memoryGiB": 32,
274 |     "name": "ml.t3.2xlarge",
275 |     "vcpuNum": 8
276 |    },
277 |    {
278 |     "_defaultOrder": 4,
279 |     "_isFastLaunch": true,
280 |     "category": "General purpose",
281 |     "gpuNum": 0,
282 |     "hideHardwareSpecs": false,
283 |     "memoryGiB": 8,
284 |     "name": "ml.m5.large",
285 |     "vcpuNum": 2
286 |    },
287 |    {
288 |     "_defaultOrder": 5,
289 |     "_isFastLaunch": false,
290 |     "category": "General purpose",
291 |     "gpuNum": 0,
292 |     "hideHardwareSpecs": false,
293 |     "memoryGiB": 16,
294 |     "name": "ml.m5.xlarge",
295 |     "vcpuNum": 4
296 |    },
297 |    {
298 |     "_defaultOrder": 6,
299 |     "_isFastLaunch": false,
300 |     "category": "General purpose",
301 |     "gpuNum": 0,
302 |     "hideHardwareSpecs": false,
303 |     "memoryGiB": 32,
304 |     "name": "ml.m5.2xlarge",
305 |     "vcpuNum": 8
306 |    },
307 |    {
308 |     "_defaultOrder": 7,
309 |     "_isFastLaunch": false,
310 |     "category": "General purpose",
311 |     "gpuNum": 0,
312 |     "hideHardwareSpecs": false,
313 |     "memoryGiB": 64,
314 |     "name": "ml.m5.4xlarge",
315 |     "vcpuNum": 16
316 |    },
317 |    {
318 |     "_defaultOrder": 8,
319 |     "_isFastLaunch": false,
320 |     "category": "General purpose",
321 |     "gpuNum": 0,
322 |     "hideHardwareSpecs": false,
323 |     "memoryGiB": 128,
324 |     "name": "ml.m5.8xlarge",
325 |     "vcpuNum": 32
326 |    },
327 |    {
328 |     "_defaultOrder": 9,
329 |     "_isFastLaunch": false,
330 |     "category": "General purpose",
331 |     "gpuNum": 0,
332 |     "hideHardwareSpecs": false,
333 |     "memoryGiB": 192,
334 |     "name": "ml.m5.12xlarge",
335 |     "vcpuNum": 48
336 |    },
337 |    {
338 |     "_defaultOrder": 10,
339 |     "_isFastLaunch": false,
340 |     "category": "General purpose",
341 |     "gpuNum": 0,
342 |     "hideHardwareSpecs": false,
343 |     "memoryGiB": 256,
344 |     "name": "ml.m5.16xlarge",
345 |     "vcpuNum": 64
346 |    },
347 |    {
348 |     "_defaultOrder": 11,
349 |     "_isFastLaunch": false,
350 |     "category": "General purpose",
351 |     "gpuNum": 0,
352 |     "hideHardwareSpecs": false,
353 |     "memoryGiB": 384,
354 |     "name": "ml.m5.24xlarge",
355 |     "vcpuNum": 96
356 |    },
357 |    {
358 |     "_defaultOrder": 12,
359 |     "_isFastLaunch": false,
360 |     "category": "General purpose",
361 |     "gpuNum": 0,
362 |     "hideHardwareSpecs": false,
363 |     "memoryGiB": 8,
364 |     "name": "ml.m5d.large",
365 |     "vcpuNum": 2
366 |    },
367 |    {
368 |     "_defaultOrder": 13,
369 |     "_isFastLaunch": false,
370 |     "category": "General purpose",
371 |     "gpuNum": 0,
372 |     "hideHardwareSpecs": false,
373 |     "memoryGiB": 16,
374 |     "name": "ml.m5d.xlarge",
375 |     "vcpuNum": 4
376 |    },
377 |    {
378 |     "_defaultOrder": 14,
379 |     "_isFastLaunch": false,
380 |     "category": "General purpose",
381 |     "gpuNum": 0,
382 |     "hideHardwareSpecs": false,
383 |     "memoryGiB": 32,
384 |     "name": "ml.m5d.2xlarge",
385 |     "vcpuNum": 8
386 |    },
387 |    {
388 |     "_defaultOrder": 15,
389 |     "_isFastLaunch": false,
390 |     "category": "General purpose",
391 |     "gpuNum": 0,
392 |     "hideHardwareSpecs": false,
393 |     "memoryGiB": 64,
394 |     "name": "ml.m5d.4xlarge",
395 |     "vcpuNum": 16
396 |    },
397 |    {
398 |     "_defaultOrder": 16,
399 |     "_isFastLaunch": false,
400 |     "category": "General purpose",
401 |     "gpuNum": 0,
402 |     "hideHardwareSpecs": false,
403 |     "memoryGiB": 128,
404 |     "name": "ml.m5d.8xlarge",
405 |     "vcpuNum": 32
406 |    },
407 |    {
408 |     "_defaultOrder": 17,
409 |     "_isFastLaunch": false,
410 |     "category": "General purpose",
411 |     "gpuNum": 0,
412 |     "hideHardwareSpecs": false,
413 |     "memoryGiB": 192,
414 |     "name": "ml.m5d.12xlarge",
415 |     "vcpuNum": 48
416 |    },
417 |    {
418 |     "_defaultOrder": 18,
419 |     "_isFastLaunch": false,
420 |     "category": "General purpose",
421 |     "gpuNum": 0,
422 |     "hideHardwareSpecs": false,
423 |     "memoryGiB": 256,
424 |     "name": "ml.m5d.16xlarge",
425 |     "vcpuNum": 64
426 |    },
427 |    {
428 |     "_defaultOrder": 19,
429 |     "_isFastLaunch": false,
430 |     "category": "General purpose",
431 |     "gpuNum": 0,
432 |     "hideHardwareSpecs": false,
433 |     "memoryGiB": 384,
434 |     "name": "ml.m5d.24xlarge",
435 |     "vcpuNum": 96
436 |    },
437 |    {
438 |     "_defaultOrder": 20,
439 |     "_isFastLaunch": false,
440 |     "category": "General purpose",
441 |     "gpuNum": 0,
442 |     "hideHardwareSpecs": true,
443 |     "memoryGiB": 0,
444 |     "name": "ml.geospatial.interactive",
445 |     "supportedImageNames": [
446 |      "sagemaker-geospatial-v1-0"
447 |     ],
448 |     "vcpuNum": 0
449 |    },
450 |    {
451 |     "_defaultOrder": 21,
452 |     "_isFastLaunch": true,
453 |     "category": "Compute optimized",
454 |     "gpuNum": 0,
455 |     "hideHardwareSpecs": false,
456 |     "memoryGiB": 4,
457 |     "name": "ml.c5.large",
458 |     "vcpuNum": 2
459 |    },
460 |    {
461 |     "_defaultOrder": 22,
462 |     "_isFastLaunch": false,
463 |     "category": "Compute optimized",
464 |     "gpuNum": 0,
465 |     "hideHardwareSpecs": false,
466 |     "memoryGiB": 8,
467 |     "name": "ml.c5.xlarge",
468 |     "vcpuNum": 4
469 |    },
470 |    {
471 |     "_defaultOrder": 23,
472 |     "_isFastLaunch": false,
473 |     "category": "Compute optimized",
474 |     "gpuNum": 0,
475 |     "hideHardwareSpecs": false,
476 |     "memoryGiB": 16,
477 |     "name": "ml.c5.2xlarge",
478 |     "vcpuNum": 8
479 |    },
480 |    {
481 |     "_defaultOrder": 24,
482 |     "_isFastLaunch": false,
483 |     "category": "Compute optimized",
484 |     "gpuNum": 0,
485 |     "hideHardwareSpecs": false,
486 |     "memoryGiB": 32,
487 |     "name": "ml.c5.4xlarge",
488 |     "vcpuNum": 16
489 |    },
490 |    {
491 |     "_defaultOrder": 25,
492 |     "_isFastLaunch": false,
493 |     "category": "Compute optimized",
494 |     "gpuNum": 0,
495 |     "hideHardwareSpecs": false,
496 |     "memoryGiB": 72,
497 |     "name": "ml.c5.9xlarge",
498 |     "vcpuNum": 36
499 |    },
500 |    {
501 |     "_defaultOrder": 26,
502 |     "_isFastLaunch": false,
503 |     "category": "Compute optimized",
504 |     "gpuNum": 0,
505 |     "hideHardwareSpecs": false,
506 |     "memoryGiB": 96,
507 |     "name": "ml.c5.12xlarge",
508 |     "vcpuNum": 48
509 |    },
510 |    {
511 |     "_defaultOrder": 27,
512 |     "_isFastLaunch": false,
513 |     "category": "Compute optimized",
514 |     "gpuNum": 0,
515 |     "hideHardwareSpecs": false,
516 |     "memoryGiB": 144,
517 |     "name": "ml.c5.18xlarge",
518 |     "vcpuNum": 72
519 |    },
520 |    {
521 |     "_defaultOrder": 28,
522 |     "_isFastLaunch": false,
523 |     "category": "Compute optimized",
524 |     "gpuNum": 0,
525 |     "hideHardwareSpecs": false,
526 |     "memoryGiB": 192,
527 |     "name": "ml.c5.24xlarge",
528 |     "vcpuNum": 96
529 |    },
530 |    {
531 |     "_defaultOrder": 29,
532 |     "_isFastLaunch": true,
533 |     "category": "Accelerated computing",
534 |     "gpuNum": 1,
535 |     "hideHardwareSpecs": false,
536 |     "memoryGiB": 16,
537 |     "name": "ml.g4dn.xlarge",
538 |     "vcpuNum": 4
539 |    },
540 |    {
541 |     "_defaultOrder": 30,
542 |     "_isFastLaunch": false,
543 |     "category": "Accelerated computing",
544 |     "gpuNum": 1,
545 |     "hideHardwareSpecs": false,
546 |     "memoryGiB": 32,
547 |     "name": "ml.g4dn.2xlarge",
548 |     "vcpuNum": 8
549 |    },
550 |    {
551 |     "_defaultOrder": 31,
552 |     "_isFastLaunch": false,
553 |     "category": "Accelerated computing",
554 |     "gpuNum": 1,
555 |     "hideHardwareSpecs": false,
556 |     "memoryGiB": 64,
557 |     "name": "ml.g4dn.4xlarge",
558 |     "vcpuNum": 16
559 |    },
560 |    {
561 |     "_defaultOrder": 32,
562 |     "_isFastLaunch": false,
563 |     "category": "Accelerated computing",
564 |     "gpuNum": 1,
565 |     "hideHardwareSpecs": false,
566 |     "memoryGiB": 128,
567 |     "name": "ml.g4dn.8xlarge",
568 |     "vcpuNum": 32
569 |    },
570 |    {
571 |     "_defaultOrder": 33,
572 |     "_isFastLaunch": false,
573 |     "category": "Accelerated computing",
574 |     "gpuNum": 4,
575 |     "hideHardwareSpecs": false,
576 |     "memoryGiB": 192,
577 |     "name": "ml.g4dn.12xlarge",
578 |     "vcpuNum": 48
579 |    },
580 |    {
581 |     "_defaultOrder": 34,
582 |     "_isFastLaunch": false,
583 |     "category": "Accelerated computing",
584 |     "gpuNum": 1,
585 |     "hideHardwareSpecs": false,
586 |     "memoryGiB": 256,
587 |     "name": "ml.g4dn.16xlarge",
588 |     "vcpuNum": 64
589 |    },
590 |    {
591 |     "_defaultOrder": 35,
592 |     "_isFastLaunch": false,
593 |     "category": "Accelerated computing",
594 |     "gpuNum": 1,
595 |     "hideHardwareSpecs": false,
596 |     "memoryGiB": 61,
597 |     "name": "ml.p3.2xlarge",
598 |     "vcpuNum": 8
599 |    },
600 |    {
601 |     "_defaultOrder": 36,
602 |     "_isFastLaunch": false,
603 |     "category": "Accelerated computing",
604 |     "gpuNum": 4,
605 |     "hideHardwareSpecs": false,
606 |     "memoryGiB": 244,
607 |     "name": "ml.p3.8xlarge",
608 |     "vcpuNum": 32
609 |    },
610 |    {
611 |     "_defaultOrder": 37,
612 |     "_isFastLaunch": false,
613 |     "category": "Accelerated computing",
614 |     "gpuNum": 8,
615 |     "hideHardwareSpecs": false,
616 |     "memoryGiB": 488,
617 |     "name": "ml.p3.16xlarge",
618 |     "vcpuNum": 64
619 |    },
620 |    {
621 |     "_defaultOrder": 38,
622 |     "_isFastLaunch": false,
623 |     "category": "Accelerated computing",
624 |     "gpuNum": 8,
625 |     "hideHardwareSpecs": false,
626 |     "memoryGiB": 768,
627 |     "name": "ml.p3dn.24xlarge",
628 |     "vcpuNum": 96
629 |    },
630 |    {
631 |     "_defaultOrder": 39,
632 |     "_isFastLaunch": false,
633 |     "category": "Memory Optimized",
634 |     "gpuNum": 0,
635 |     "hideHardwareSpecs": false,
636 |     "memoryGiB": 16,
637 |     "name": "ml.r5.large",
638 |     "vcpuNum": 2
639 |    },
640 |    {
641 |     "_defaultOrder": 40,
642 |     "_isFastLaunch": false,
643 |     "category": "Memory Optimized",
644 |     "gpuNum": 0,
645 |     "hideHardwareSpecs": false,
646 |     "memoryGiB": 32,
647 |     "name": "ml.r5.xlarge",
648 |     "vcpuNum": 4
649 |    },
650 |    {
651 |     "_defaultOrder": 41,
652 |     "_isFastLaunch": false,
653 |     "category": "Memory Optimized",
654 |     "gpuNum": 0,
655 |     "hideHardwareSpecs": false,
656 |     "memoryGiB": 64,
657 |     "name": "ml.r5.2xlarge",
658 |     "vcpuNum": 8
659 |    },
660 |    {
661 |     "_defaultOrder": 42,
662 |     "_isFastLaunch": false,
663 |     "category": "Memory Optimized",
664 |     "gpuNum": 0,
665 |     "hideHardwareSpecs": false,
666 |     "memoryGiB": 128,
667 |     "name": "ml.r5.4xlarge",
668 |     "vcpuNum": 16
669 |    },
670 |    {
671 |     "_defaultOrder": 43,
672 |     "_isFastLaunch": false,
673 |     "category": "Memory Optimized",
674 |     "gpuNum": 0,
675 |     "hideHardwareSpecs": false,
676 |     "memoryGiB": 256,
677 |     "name": "ml.r5.8xlarge",
678 |     "vcpuNum": 32
679 |    },
680 |    {
681 |     "_defaultOrder": 44,
682 |     "_isFastLaunch": false,
683 |     "category": "Memory Optimized",
684 |     "gpuNum": 0,
685 |     "hideHardwareSpecs": false,
686 |     "memoryGiB": 384,
687 |     "name": "ml.r5.12xlarge",
688 |     "vcpuNum": 48
689 |    },
690 |    {
691 |     "_defaultOrder": 45,
692 |     "_isFastLaunch": false,
693 |     "category": "Memory Optimized",
694 |     "gpuNum": 0,
695 |     "hideHardwareSpecs": false,
696 |     "memoryGiB": 512,
697 |     "name": "ml.r5.16xlarge",
698 |     "vcpuNum": 64
699 |    },
700 |    {
701 |     "_defaultOrder": 46,
702 |     "_isFastLaunch": false,
703 |     "category": "Memory Optimized",
704 |     "gpuNum": 0,
705 |     "hideHardwareSpecs": false,
706 |     "memoryGiB": 768,
707 |     "name": "ml.r5.24xlarge",
708 |     "vcpuNum": 96
709 |    },
710 |    {
711 |     "_defaultOrder": 47,
712 |     "_isFastLaunch": false,
713 |     "category": "Accelerated computing",
714 |     "gpuNum": 1,
715 |     "hideHardwareSpecs": false,
716 |     "memoryGiB": 16,
717 |     "name": "ml.g5.xlarge",
718 |     "vcpuNum": 4
719 |    },
720 |    {
721 |     "_defaultOrder": 48,
722 |     "_isFastLaunch": false,
723 |     "category": "Accelerated computing",
724 |     "gpuNum": 1,
725 |     "hideHardwareSpecs": false,
726 |     "memoryGiB": 32,
727 |     "name": "ml.g5.2xlarge",
728 |     "vcpuNum": 8
729 |    },
730 |    {
731 |     "_defaultOrder": 49,
732 |     "_isFastLaunch": false,
733 |     "category": "Accelerated computing",
734 |     "gpuNum": 1,
735 |     "hideHardwareSpecs": false,
736 |     "memoryGiB": 64,
737 |     "name": "ml.g5.4xlarge",
738 |     "vcpuNum": 16
739 |    },
740 |    {
741 |     "_defaultOrder": 50,
742 |     "_isFastLaunch": false,
743 |     "category": "Accelerated computing",
744 |     "gpuNum": 1,
745 |     "hideHardwareSpecs": false,
746 |     "memoryGiB": 128,
747 |     "name": "ml.g5.8xlarge",
748 |     "vcpuNum": 32
749 |    },
750 |    {
751 |     "_defaultOrder": 51,
752 |     "_isFastLaunch": false,
753 |     "category": "Accelerated computing",
754 |     "gpuNum": 1,
755 |     "hideHardwareSpecs": false,
756 |     "memoryGiB": 256,
757 |     "name": "ml.g5.16xlarge",
758 |     "vcpuNum": 64
759 |    },
760 |    {
761 |     "_defaultOrder": 52,
762 |     "_isFastLaunch": false,
763 |     "category": "Accelerated computing",
764 |     "gpuNum": 4,
765 |     "hideHardwareSpecs": false,
766 |     "memoryGiB": 192,
767 |     "name": "ml.g5.12xlarge",
768 |     "vcpuNum": 48
769 |    },
770 |    {
771 |     "_defaultOrder": 53,
772 |     "_isFastLaunch": false,
773 |     "category": "Accelerated computing",
774 |     "gpuNum": 4,
775 |     "hideHardwareSpecs": false,
776 |     "memoryGiB": 384,
777 |     "name": "ml.g5.24xlarge",
778 |     "vcpuNum": 96
779 |    },
780 |    {
781 |     "_defaultOrder": 54,
782 |     "_isFastLaunch": false,
783 |     "category": "Accelerated computing",
784 |     "gpuNum": 8,
785 |     "hideHardwareSpecs": false,
786 |     "memoryGiB": 768,
787 |     "name": "ml.g5.48xlarge",
788 |     "vcpuNum": 192
789 |    }
790 |   ],
791 |   "instance_type": "ml.t3.medium",
792 |   "kernelspec": {
793 |    "display_name": "conda_mxnet_p38",
794 |    "language": "python",
795 |    "name": "conda_mxnet_p38"
796 |   },
797 |   "language_info": {
798 |    "codemirror_mode": {
799 |     "name": "ipython",
800 |     "version": 3
801 |    },
802 |    "file_extension": ".py",
803 |    "mimetype": "text/x-python",
804 |    "name": "python",
805 |    "nbconvert_exporter": "python",
806 |    "pygments_lexer": "ipython3",
807 |    "version": "3.8.15"
808 |   }
809 |  },
810 |  "nbformat": 4,
811 |  "nbformat_minor": 5
812 | }
813 | 


--------------------------------------------------------------------------------
/preprocess-multimodal-data/medical-imaging/preprocess-imaging.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "a4094068-1e01-41da-a9a7-15ee28c0022d",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "\n",
  9 |     "# Read and preprocess imaging data from S3 and store features in SageMaker FeatureStore\n"
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": null,
 15 |    "id": "ff045821-8bf7-46f8-8186-08d9e013875a",
 16 |    "metadata": {
 17 |     "tags": []
 18 |    },
 19 |    "outputs": [],
 20 |    "source": [
 21 |     "import boto3\n",
 22 |     "import numpy as np\n",
 23 |     "import pandas as pd\n",
 24 |     "import matplotlib.pyplot as plt\n",
 25 |     "import io, os\n",
 26 |     "from time import gmtime, strftime, sleep\n",
 27 |     "import time\n",
 28 |     "import sagemaker\n",
 29 |     "from sagemaker.session import Session\n",
 30 |     "from sagemaker import get_execution_role\n",
 31 |     "from sagemaker.feature_store.feature_group import FeatureGroup"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "markdown",
 36 |    "id": "3a0ce6f0-eb7e-4420-b994-843f436ff660",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "\n",
 40 |     "## Set up SageMaker FeatureStore\n"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "code",
 45 |    "execution_count": null,
 46 |    "id": "b044786e-8c94-4755-b84f-8941c7e4f05d",
 47 |    "metadata": {
 48 |     "tags": []
 49 |    },
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "region = boto3.Session().region_name\n",
 53 |     "\n",
 54 |     "boto_session = boto3.Session(region_name=region)\n",
 55 |     "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n",
 56 |     "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n",
 57 |     "\n",
 58 |     "feature_store_session = Session(\n",
 59 |     "    boto_session=boto_session,\n",
 60 |     "    sagemaker_client=sagemaker_client,\n",
 61 |     "    sagemaker_featurestore_runtime_client=featurestore_runtime\n",
 62 |     ")\n",
 63 |     "\n",
 64 |     "role = get_execution_role()\n",
 65 |     "s3_client = boto3.client('s3', region_name=region)\n",
 66 |     "\n",
 67 |     "default_s3_bucket_name = feature_store_session.default_bucket()\n",
 68 |     "prefix = 'sagemaker-featurestore-demo'"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "markdown",
 73 |    "id": "9211f510-8be4-4877-b4f5-fa778aae298a",
 74 |    "metadata": {},
 75 |    "source": [
 76 |     "\n",
 77 |     "## Get data from S3\n"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": null,
 83 |    "id": "56d71658-97bf-42a8-882b-38d4d2ccbc28",
 84 |    "metadata": {
 85 |     "tags": []
 86 |    },
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "# Get data from S3 \n",
 90 |     "bucket_imag = 'guidance-multimodal-hcls-healthai-machinelearning/preprocessing'\n",
 91 |     "#bucket_imag = <S3-bucket-name>\n",
 92 |     "\n",
 93 |     "# imaging data \n",
 94 |     "data_key_imag = 'final_imaging_df.csv'\n",
 95 |     "#data_key_imag = <file-name.csv>\n",
 96 |     "\n",
 97 |     "data_location_imag = 's3://{}/{}'.format(bucket_imag, data_key_imag)\n",
 98 |     "data_imaging = pd.read_csv(data_location_imag)"
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "markdown",
103 |    "id": "7744c961-807f-4451-b6ba-1fa24025e652",
104 |    "metadata": {},
105 |    "source": [
106 |     "\n",
107 |     "## Preprocess Data\n"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": null,
113 |    "id": "4bd20e05-596b-410d-a027-ddf2eb70678f",
114 |    "metadata": {
115 |     "tags": []
116 |    },
117 |    "outputs": [],
118 |    "source": [
119 |     "# Replace NaN with zeros and cast every column to string\n",
120 |     "data_imaging_1 = data_imaging.copy()\n",
121 |     "data_imaging_1 = data_imaging_1.replace(np.nan, 0)\n",
122 |     "data_imaging_1 = data_imaging_1.astype(str)\n",
123 |     "\n",
124 |     "# Initialize the prediction columns to '0', then convert each diagnosisCode string to a set of codes\n",
125 |     "data_imaging_1[['alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction']] = '0'\n",
126 |     "data_imaging_pred = data_imaging_1.copy()\n",
127 |     "for i in range(len(data_imaging_1)):\n",
128 |     "    data_imaging_pred.loc[i, 'diagnosisCode'] = set(data_imaging_pred.loc[i, 'diagnosisCode'].replace('\\'', '').replace(' ', '').replace('{', '').replace('}', '').split(','))\n",
129 |     "\n",
130 |     "# Set alzheimers_prediction to '1' when the Alzheimer's disease code '26929004' is present\n",
131 |     "# Set coronary_heart_disease_prediction to '1' when the coronary heart disease code '53741008' is present\n",
132 |     "# Set stroke_prediction to '1' when the stroke code '230690007' is present\n",
133 |     "# Set hypertension_prediction to '1' when the hypertension code '59621000' is present\n",
134 |     "for i in range(len(data_imaging_pred)):\n",
135 |     "    if \"26929004\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n",
136 |     "        data_imaging_pred.loc[i, 'alzheimers_prediction']  =  '1'\n",
137 |     "    if \"53741008\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n",
138 |     "        data_imaging_pred.loc[i, 'coronary_heart_disease_prediction']  =  '1'\n",
139 |     "    if \"230690007\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n",
140 |     "        data_imaging_pred.loc[i, 'stroke_prediction']  =  '1'\n",
141 |     "    if \"59621000\" in (data_imaging_pred.loc[i, 'diagnosisCode']):\n",
142 |     "        data_imaging_pred.loc[i, 'hypertension_prediction']  =  '1'\n",
143 |     "print(\"Patients with Alzheimer's disease = \", len(data_imaging_pred[data_imaging_pred['alzheimers_prediction']=='1']))\n",
144 |     "print(\"Patients with Coronary Heart disease = \", len(data_imaging_pred[data_imaging_pred['coronary_heart_disease_prediction']=='1']))\n",
145 |     "print(\"Patients with Stroke = \", len(data_imaging_pred[data_imaging_pred['stroke_prediction']=='1']))\n",
146 |     "print(\"Patients with Hypertension = \", len(data_imaging_pred[data_imaging_pred['hypertension_prediction']=='1']))\n",
147 |     "\n",
148 |     "# Delete columns with leakage and features irrelevant for model training\n",
149 |     "list_delete_cols = ['diagnosisDescription', 'diagnosisCode']\n",
150 |     "data_imaging_pred.drop(list_delete_cols, axis=1, inplace=True)\n",
151 |     "\n",
152 |     "data_imaging_pred.head(10)"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "markdown",
157 |    "id": "7d80bf0c-bee9-4f46-b05f-496ae95ae0a8",
158 |    "metadata": {
159 |     "tags": []
160 |    },
161 |    "source": [
162 |     "\n",
163 |     "## Ingest data into FeatureStore\n"
164 |    ]
165 |   },
166 |   {
167 |    "cell_type": "code",
168 |    "execution_count": null,
169 |    "id": "b8f82d22-8ad1-4258-be19-3f5d1dc0f4d5",
170 |    "metadata": {
171 |     "tags": []
172 |    },
173 |    "outputs": [],
174 |    "source": [
175 |     "imaging_feature_group_name = 'imaging-feature-group' \n",
176 |     "imaging_feature_group = FeatureGroup(name=imaging_feature_group_name, sagemaker_session=feature_store_session)\n",
177 |     "\n",
178 |     "current_time_sec = int(round(time.time()))\n",
179 |     "\n",
180 |     "def cast_object_to_string(data_frame):\n",
181 |     "    for label in data_frame.columns:\n",
182 |     "        print(label)\n",
183 |     "        if data_frame.dtypes[label] == 'object':\n",
184 |     "            data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n",
185 |     "\n",
186 |     "# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n",
187 |     "cast_object_to_string(data_imaging_pred)\n",
188 |     "\n",
189 |     "# Record identifier and event time feature names\n",
190 |     "record_identifier_feature_name = \"patientID\"\n",
191 |     "event_time_feature_name = \"EventTime\"\n",
192 |     "\n",
193 |     "# Append EventTime feature\n",
194 |     "data_imaging_pred[event_time_feature_name] = pd.Series([current_time_sec]*len(data_imaging_pred), dtype=\"float64\")\n",
195 |     "\n",
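196 |     "# Replace any NaN event times with 0 (NaN can appear if the DataFrame index is not 0..n-1, because pandas aligns the Series on index)\n",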
197 |     "data_imaging_pred[event_time_feature_name] = data_imaging_pred[event_time_feature_name].fillna(0)\n",
198 |     "\n",
199 |     "# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n",
200 |     "imaging_feature_group.load_feature_definitions(data_frame=data_imaging_pred); # output is suppressed\n",
201 |     "\n",
202 |     "\n",
203 |     "def wait_for_feature_group_creation_complete(feature_group):\n",
204 |     "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
205 |     "    while status == \"Creating\":\n",
206 |     "        print(\"Waiting for Feature Group Creation\")\n",
207 |     "        time.sleep(15)\n",
208 |     "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
209 |     "    if status != \"Created\":\n",
210 |     "        raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n",
211 |     "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n",
212 |     "\n",
213 |     "imaging_feature_group.create(\n",
214 |     "    s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n",
215 |     "    record_identifier_name=record_identifier_feature_name,\n",
216 |     "    event_time_feature_name=event_time_feature_name,\n",
217 |     "    role_arn=role,\n",
218 |     "    enable_online_store=True\n",
219 |     ")\n",
220 |     "\n",
221 |     "wait_for_feature_group_creation_complete(feature_group=imaging_feature_group)\n",
222 |     "\n",
223 |     "imaging_feature_group.ingest(\n",
224 |     "    data_frame=data_imaging_pred, max_workers=5, wait=True\n",
225 |     ")"
226 |    ]
227 |   },
228 |   {
229 |    "cell_type": "code",
230 |    "execution_count": null,
231 |    "id": "aa2a8752-7e16-4880-b6e0-32e7882216ba",
232 |    "metadata": {},
233 |    "outputs": [],
234 |    "source": []
235 |   }
236 |  ],
237 |  "metadata": {
238 |   "availableInstances": [
239 |    {
240 |     "_defaultOrder": 0,
241 |     "_isFastLaunch": true,
242 |     "category": "General purpose",
243 |     "gpuNum": 0,
244 |     "hideHardwareSpecs": false,
245 |     "memoryGiB": 4,
246 |     "name": "ml.t3.medium",
247 |     "vcpuNum": 2
248 |    },
249 |    {
250 |     "_defaultOrder": 1,
251 |     "_isFastLaunch": false,
252 |     "category": "General purpose",
253 |     "gpuNum": 0,
254 |     "hideHardwareSpecs": false,
255 |     "memoryGiB": 8,
256 |     "name": "ml.t3.large",
257 |     "vcpuNum": 2
258 |    },
259 |    {
260 |     "_defaultOrder": 2,
261 |     "_isFastLaunch": false,
262 |     "category": "General purpose",
263 |     "gpuNum": 0,
264 |     "hideHardwareSpecs": false,
265 |     "memoryGiB": 16,
266 |     "name": "ml.t3.xlarge",
267 |     "vcpuNum": 4
268 |    },
269 |    {
270 |     "_defaultOrder": 3,
271 |     "_isFastLaunch": false,
272 |     "category": "General purpose",
273 |     "gpuNum": 0,
274 |     "hideHardwareSpecs": false,
275 |     "memoryGiB": 32,
276 |     "name": "ml.t3.2xlarge",
277 |     "vcpuNum": 8
278 |    },
279 |    {
280 |     "_defaultOrder": 4,
281 |     "_isFastLaunch": true,
282 |     "category": "General purpose",
283 |     "gpuNum": 0,
284 |     "hideHardwareSpecs": false,
285 |     "memoryGiB": 8,
286 |     "name": "ml.m5.large",
287 |     "vcpuNum": 2
288 |    },
289 |    {
290 |     "_defaultOrder": 5,
291 |     "_isFastLaunch": false,
292 |     "category": "General purpose",
293 |     "gpuNum": 0,
294 |     "hideHardwareSpecs": false,
295 |     "memoryGiB": 16,
296 |     "name": "ml.m5.xlarge",
297 |     "vcpuNum": 4
298 |    },
299 |    {
300 |     "_defaultOrder": 6,
301 |     "_isFastLaunch": false,
302 |     "category": "General purpose",
303 |     "gpuNum": 0,
304 |     "hideHardwareSpecs": false,
305 |     "memoryGiB": 32,
306 |     "name": "ml.m5.2xlarge",
307 |     "vcpuNum": 8
308 |    },
309 |    {
310 |     "_defaultOrder": 7,
311 |     "_isFastLaunch": false,
312 |     "category": "General purpose",
313 |     "gpuNum": 0,
314 |     "hideHardwareSpecs": false,
315 |     "memoryGiB": 64,
316 |     "name": "ml.m5.4xlarge",
317 |     "vcpuNum": 16
318 |    },
319 |    {
320 |     "_defaultOrder": 8,
321 |     "_isFastLaunch": false,
322 |     "category": "General purpose",
323 |     "gpuNum": 0,
324 |     "hideHardwareSpecs": false,
325 |     "memoryGiB": 128,
326 |     "name": "ml.m5.8xlarge",
327 |     "vcpuNum": 32
328 |    },
329 |    {
330 |     "_defaultOrder": 9,
331 |     "_isFastLaunch": false,
332 |     "category": "General purpose",
333 |     "gpuNum": 0,
334 |     "hideHardwareSpecs": false,
335 |     "memoryGiB": 192,
336 |     "name": "ml.m5.12xlarge",
337 |     "vcpuNum": 48
338 |    },
339 |    {
340 |     "_defaultOrder": 10,
341 |     "_isFastLaunch": false,
342 |     "category": "General purpose",
343 |     "gpuNum": 0,
344 |     "hideHardwareSpecs": false,
345 |     "memoryGiB": 256,
346 |     "name": "ml.m5.16xlarge",
347 |     "vcpuNum": 64
348 |    },
349 |    {
350 |     "_defaultOrder": 11,
351 |     "_isFastLaunch": false,
352 |     "category": "General purpose",
353 |     "gpuNum": 0,
354 |     "hideHardwareSpecs": false,
355 |     "memoryGiB": 384,
356 |     "name": "ml.m5.24xlarge",
357 |     "vcpuNum": 96
358 |    },
359 |    {
360 |     "_defaultOrder": 12,
361 |     "_isFastLaunch": false,
362 |     "category": "General purpose",
363 |     "gpuNum": 0,
364 |     "hideHardwareSpecs": false,
365 |     "memoryGiB": 8,
366 |     "name": "ml.m5d.large",
367 |     "vcpuNum": 2
368 |    },
369 |    {
370 |     "_defaultOrder": 13,
371 |     "_isFastLaunch": false,
372 |     "category": "General purpose",
373 |     "gpuNum": 0,
374 |     "hideHardwareSpecs": false,
375 |     "memoryGiB": 16,
376 |     "name": "ml.m5d.xlarge",
377 |     "vcpuNum": 4
378 |    },
379 |    {
380 |     "_defaultOrder": 14,
381 |     "_isFastLaunch": false,
382 |     "category": "General purpose",
383 |     "gpuNum": 0,
384 |     "hideHardwareSpecs": false,
385 |     "memoryGiB": 32,
386 |     "name": "ml.m5d.2xlarge",
387 |     "vcpuNum": 8
388 |    },
389 |    {
390 |     "_defaultOrder": 15,
391 |     "_isFastLaunch": false,
392 |     "category": "General purpose",
393 |     "gpuNum": 0,
394 |     "hideHardwareSpecs": false,
395 |     "memoryGiB": 64,
396 |     "name": "ml.m5d.4xlarge",
397 |     "vcpuNum": 16
398 |    },
399 |    {
400 |     "_defaultOrder": 16,
401 |     "_isFastLaunch": false,
402 |     "category": "General purpose",
403 |     "gpuNum": 0,
404 |     "hideHardwareSpecs": false,
405 |     "memoryGiB": 128,
406 |     "name": "ml.m5d.8xlarge",
407 |     "vcpuNum": 32
408 |    },
409 |    {
410 |     "_defaultOrder": 17,
411 |     "_isFastLaunch": false,
412 |     "category": "General purpose",
413 |     "gpuNum": 0,
414 |     "hideHardwareSpecs": false,
415 |     "memoryGiB": 192,
416 |     "name": "ml.m5d.12xlarge",
417 |     "vcpuNum": 48
418 |    },
419 |    {
420 |     "_defaultOrder": 18,
421 |     "_isFastLaunch": false,
422 |     "category": "General purpose",
423 |     "gpuNum": 0,
424 |     "hideHardwareSpecs": false,
425 |     "memoryGiB": 256,
426 |     "name": "ml.m5d.16xlarge",
427 |     "vcpuNum": 64
428 |    },
429 |    {
430 |     "_defaultOrder": 19,
431 |     "_isFastLaunch": false,
432 |     "category": "General purpose",
433 |     "gpuNum": 0,
434 |     "hideHardwareSpecs": false,
435 |     "memoryGiB": 384,
436 |     "name": "ml.m5d.24xlarge",
437 |     "vcpuNum": 96
438 |    },
439 |    {
440 |     "_defaultOrder": 20,
441 |     "_isFastLaunch": false,
442 |     "category": "General purpose",
443 |     "gpuNum": 0,
444 |     "hideHardwareSpecs": true,
445 |     "memoryGiB": 0,
446 |     "name": "ml.geospatial.interactive",
447 |     "supportedImageNames": [
448 |      "sagemaker-geospatial-v1-0"
449 |     ],
450 |     "vcpuNum": 0
451 |    },
452 |    {
453 |     "_defaultOrder": 21,
454 |     "_isFastLaunch": true,
455 |     "category": "Compute optimized",
456 |     "gpuNum": 0,
457 |     "hideHardwareSpecs": false,
458 |     "memoryGiB": 4,
459 |     "name": "ml.c5.large",
460 |     "vcpuNum": 2
461 |    },
462 |    {
463 |     "_defaultOrder": 22,
464 |     "_isFastLaunch": false,
465 |     "category": "Compute optimized",
466 |     "gpuNum": 0,
467 |     "hideHardwareSpecs": false,
468 |     "memoryGiB": 8,
469 |     "name": "ml.c5.xlarge",
470 |     "vcpuNum": 4
471 |    },
472 |    {
473 |     "_defaultOrder": 23,
474 |     "_isFastLaunch": false,
475 |     "category": "Compute optimized",
476 |     "gpuNum": 0,
477 |     "hideHardwareSpecs": false,
478 |     "memoryGiB": 16,
479 |     "name": "ml.c5.2xlarge",
480 |     "vcpuNum": 8
481 |    },
482 |    {
483 |     "_defaultOrder": 24,
484 |     "_isFastLaunch": false,
485 |     "category": "Compute optimized",
486 |     "gpuNum": 0,
487 |     "hideHardwareSpecs": false,
488 |     "memoryGiB": 32,
489 |     "name": "ml.c5.4xlarge",
490 |     "vcpuNum": 16
491 |    },
492 |    {
493 |     "_defaultOrder": 25,
494 |     "_isFastLaunch": false,
495 |     "category": "Compute optimized",
496 |     "gpuNum": 0,
497 |     "hideHardwareSpecs": false,
498 |     "memoryGiB": 72,
499 |     "name": "ml.c5.9xlarge",
500 |     "vcpuNum": 36
501 |    },
502 |    {
503 |     "_defaultOrder": 26,
504 |     "_isFastLaunch": false,
505 |     "category": "Compute optimized",
506 |     "gpuNum": 0,
507 |     "hideHardwareSpecs": false,
508 |     "memoryGiB": 96,
509 |     "name": "ml.c5.12xlarge",
510 |     "vcpuNum": 48
511 |    },
512 |    {
513 |     "_defaultOrder": 27,
514 |     "_isFastLaunch": false,
515 |     "category": "Compute optimized",
516 |     "gpuNum": 0,
517 |     "hideHardwareSpecs": false,
518 |     "memoryGiB": 144,
519 |     "name": "ml.c5.18xlarge",
520 |     "vcpuNum": 72
521 |    },
522 |    {
523 |     "_defaultOrder": 28,
524 |     "_isFastLaunch": false,
525 |     "category": "Compute optimized",
526 |     "gpuNum": 0,
527 |     "hideHardwareSpecs": false,
528 |     "memoryGiB": 192,
529 |     "name": "ml.c5.24xlarge",
530 |     "vcpuNum": 96
531 |    },
532 |    {
533 |     "_defaultOrder": 29,
534 |     "_isFastLaunch": true,
535 |     "category": "Accelerated computing",
536 |     "gpuNum": 1,
537 |     "hideHardwareSpecs": false,
538 |     "memoryGiB": 16,
539 |     "name": "ml.g4dn.xlarge",
540 |     "vcpuNum": 4
541 |    },
542 |    {
543 |     "_defaultOrder": 30,
544 |     "_isFastLaunch": false,
545 |     "category": "Accelerated computing",
546 |     "gpuNum": 1,
547 |     "hideHardwareSpecs": false,
548 |     "memoryGiB": 32,
549 |     "name": "ml.g4dn.2xlarge",
550 |     "vcpuNum": 8
551 |    },
552 |    {
553 |     "_defaultOrder": 31,
554 |     "_isFastLaunch": false,
555 |     "category": "Accelerated computing",
556 |     "gpuNum": 1,
557 |     "hideHardwareSpecs": false,
558 |     "memoryGiB": 64,
559 |     "name": "ml.g4dn.4xlarge",
560 |     "vcpuNum": 16
561 |    },
562 |    {
563 |     "_defaultOrder": 32,
564 |     "_isFastLaunch": false,
565 |     "category": "Accelerated computing",
566 |     "gpuNum": 1,
567 |     "hideHardwareSpecs": false,
568 |     "memoryGiB": 128,
569 |     "name": "ml.g4dn.8xlarge",
570 |     "vcpuNum": 32
571 |    },
572 |    {
573 |     "_defaultOrder": 33,
574 |     "_isFastLaunch": false,
575 |     "category": "Accelerated computing",
576 |     "gpuNum": 4,
577 |     "hideHardwareSpecs": false,
578 |     "memoryGiB": 192,
579 |     "name": "ml.g4dn.12xlarge",
580 |     "vcpuNum": 48
581 |    },
582 |    {
583 |     "_defaultOrder": 34,
584 |     "_isFastLaunch": false,
585 |     "category": "Accelerated computing",
586 |     "gpuNum": 1,
587 |     "hideHardwareSpecs": false,
588 |     "memoryGiB": 256,
589 |     "name": "ml.g4dn.16xlarge",
590 |     "vcpuNum": 64
591 |    },
592 |    {
593 |     "_defaultOrder": 35,
594 |     "_isFastLaunch": false,
595 |     "category": "Accelerated computing",
596 |     "gpuNum": 1,
597 |     "hideHardwareSpecs": false,
598 |     "memoryGiB": 61,
599 |     "name": "ml.p3.2xlarge",
600 |     "vcpuNum": 8
601 |    },
602 |    {
603 |     "_defaultOrder": 36,
604 |     "_isFastLaunch": false,
605 |     "category": "Accelerated computing",
606 |     "gpuNum": 4,
607 |     "hideHardwareSpecs": false,
608 |     "memoryGiB": 244,
609 |     "name": "ml.p3.8xlarge",
610 |     "vcpuNum": 32
611 |    },
612 |    {
613 |     "_defaultOrder": 37,
614 |     "_isFastLaunch": false,
615 |     "category": "Accelerated computing",
616 |     "gpuNum": 8,
617 |     "hideHardwareSpecs": false,
618 |     "memoryGiB": 488,
619 |     "name": "ml.p3.16xlarge",
620 |     "vcpuNum": 64
621 |    },
622 |    {
623 |     "_defaultOrder": 38,
624 |     "_isFastLaunch": false,
625 |     "category": "Accelerated computing",
626 |     "gpuNum": 8,
627 |     "hideHardwareSpecs": false,
628 |     "memoryGiB": 768,
629 |     "name": "ml.p3dn.24xlarge",
630 |     "vcpuNum": 96
631 |    },
632 |    {
633 |     "_defaultOrder": 39,
634 |     "_isFastLaunch": false,
635 |     "category": "Memory Optimized",
636 |     "gpuNum": 0,
637 |     "hideHardwareSpecs": false,
638 |     "memoryGiB": 16,
639 |     "name": "ml.r5.large",
640 |     "vcpuNum": 2
641 |    },
642 |    {
643 |     "_defaultOrder": 40,
644 |     "_isFastLaunch": false,
645 |     "category": "Memory Optimized",
646 |     "gpuNum": 0,
647 |     "hideHardwareSpecs": false,
648 |     "memoryGiB": 32,
649 |     "name": "ml.r5.xlarge",
650 |     "vcpuNum": 4
651 |    },
652 |    {
653 |     "_defaultOrder": 41,
654 |     "_isFastLaunch": false,
655 |     "category": "Memory Optimized",
656 |     "gpuNum": 0,
657 |     "hideHardwareSpecs": false,
658 |     "memoryGiB": 64,
659 |     "name": "ml.r5.2xlarge",
660 |     "vcpuNum": 8
661 |    },
662 |    {
663 |     "_defaultOrder": 42,
664 |     "_isFastLaunch": false,
665 |     "category": "Memory Optimized",
666 |     "gpuNum": 0,
667 |     "hideHardwareSpecs": false,
668 |     "memoryGiB": 128,
669 |     "name": "ml.r5.4xlarge",
670 |     "vcpuNum": 16
671 |    },
672 |    {
673 |     "_defaultOrder": 43,
674 |     "_isFastLaunch": false,
675 |     "category": "Memory Optimized",
676 |     "gpuNum": 0,
677 |     "hideHardwareSpecs": false,
678 |     "memoryGiB": 256,
679 |     "name": "ml.r5.8xlarge",
680 |     "vcpuNum": 32
681 |    },
682 |    {
683 |     "_defaultOrder": 44,
684 |     "_isFastLaunch": false,
685 |     "category": "Memory Optimized",
686 |     "gpuNum": 0,
687 |     "hideHardwareSpecs": false,
688 |     "memoryGiB": 384,
689 |     "name": "ml.r5.12xlarge",
690 |     "vcpuNum": 48
691 |    },
692 |    {
693 |     "_defaultOrder": 45,
694 |     "_isFastLaunch": false,
695 |     "category": "Memory Optimized",
696 |     "gpuNum": 0,
697 |     "hideHardwareSpecs": false,
698 |     "memoryGiB": 512,
699 |     "name": "ml.r5.16xlarge",
700 |     "vcpuNum": 64
701 |    },
702 |    {
703 |     "_defaultOrder": 46,
704 |     "_isFastLaunch": false,
705 |     "category": "Memory Optimized",
706 |     "gpuNum": 0,
707 |     "hideHardwareSpecs": false,
708 |     "memoryGiB": 768,
709 |     "name": "ml.r5.24xlarge",
710 |     "vcpuNum": 96
711 |    },
712 |    {
713 |     "_defaultOrder": 47,
714 |     "_isFastLaunch": false,
715 |     "category": "Accelerated computing",
716 |     "gpuNum": 1,
717 |     "hideHardwareSpecs": false,
718 |     "memoryGiB": 16,
719 |     "name": "ml.g5.xlarge",
720 |     "vcpuNum": 4
721 |    },
722 |    {
723 |     "_defaultOrder": 48,
724 |     "_isFastLaunch": false,
725 |     "category": "Accelerated computing",
726 |     "gpuNum": 1,
727 |     "hideHardwareSpecs": false,
728 |     "memoryGiB": 32,
729 |     "name": "ml.g5.2xlarge",
730 |     "vcpuNum": 8
731 |    },
732 |    {
733 |     "_defaultOrder": 49,
734 |     "_isFastLaunch": false,
735 |     "category": "Accelerated computing",
736 |     "gpuNum": 1,
737 |     "hideHardwareSpecs": false,
738 |     "memoryGiB": 64,
739 |     "name": "ml.g5.4xlarge",
740 |     "vcpuNum": 16
741 |    },
742 |    {
743 |     "_defaultOrder": 50,
744 |     "_isFastLaunch": false,
745 |     "category": "Accelerated computing",
746 |     "gpuNum": 1,
747 |     "hideHardwareSpecs": false,
748 |     "memoryGiB": 128,
749 |     "name": "ml.g5.8xlarge",
750 |     "vcpuNum": 32
751 |    },
752 |    {
753 |     "_defaultOrder": 51,
754 |     "_isFastLaunch": false,
755 |     "category": "Accelerated computing",
756 |     "gpuNum": 1,
757 |     "hideHardwareSpecs": false,
758 |     "memoryGiB": 256,
759 |     "name": "ml.g5.16xlarge",
760 |     "vcpuNum": 64
761 |    },
762 |    {
763 |     "_defaultOrder": 52,
764 |     "_isFastLaunch": false,
765 |     "category": "Accelerated computing",
766 |     "gpuNum": 4,
767 |     "hideHardwareSpecs": false,
768 |     "memoryGiB": 192,
769 |     "name": "ml.g5.12xlarge",
770 |     "vcpuNum": 48
771 |    },
772 |    {
773 |     "_defaultOrder": 53,
774 |     "_isFastLaunch": false,
775 |     "category": "Accelerated computing",
776 |     "gpuNum": 4,
777 |     "hideHardwareSpecs": false,
778 |     "memoryGiB": 384,
779 |     "name": "ml.g5.24xlarge",
780 |     "vcpuNum": 96
781 |    },
782 |    {
783 |     "_defaultOrder": 54,
784 |     "_isFastLaunch": false,
785 |     "category": "Accelerated computing",
786 |     "gpuNum": 8,
787 |     "hideHardwareSpecs": false,
788 |     "memoryGiB": 768,
789 |     "name": "ml.g5.48xlarge",
790 |     "vcpuNum": 192
791 |    }
792 |   ],
793 |   "instance_type": "ml.t3.medium",
794 |   "kernelspec": {
795 |    "display_name": "conda_mxnet_p38",
796 |    "language": "python",
797 |    "name": "conda_mxnet_p38"
798 |   },
799 |   "language_info": {
800 |    "codemirror_mode": {
801 |     "name": "ipython",
802 |     "version": 3
803 |    },
804 |    "file_extension": ".py",
805 |    "mimetype": "text/x-python",
806 |    "name": "python",
807 |    "nbconvert_exporter": "python",
808 |    "pygments_lexer": "ipython3",
809 |    "version": "3.8.15"
810 |   }
811 |  },
812 |  "nbformat": 4,
813 |  "nbformat_minor": 5
814 | }
815 | 


--------------------------------------------------------------------------------
/preprocess-multimodal-data/medical-imaging/src/Api.py:
--------------------------------------------------------------------------------
  1 | import array
  2 | import pydicom
  3 | from pydicom.sequence import Sequence
  4 | from pydicom import Dataset , DataElement 
  5 | from pydicom.dataset import FileMetaDataset
  6 | from pydicom.uid import UID
  7 | import json
  8 | import logging
  9 | import importlib  
 10 | import boto3
 11 | from openjpeg import decode
 12 | import io
 13 | import sys
 14 | import time
 15 | import os
 16 | import gzip
 17 | 
 18 | logging.basicConfig( level="INFO" )
 19 | 
 20 | class MedicalImaging: 
 21 |     def __init__(self, endpoint=""):
 22 |         session = boto3.Session()
 23 |         if len(endpoint)>1:
 24 |             self.client = boto3.client('medical-imaging', endpoint_url=endpoint)
 25 |         else:
 26 |             self.client = boto3.client('medical-imaging')
 27 |     
 28 |     def stopwatch(self, start_time, end_time):
 29 |         time_lapsed = end_time - start_time
 30 |         return time_lapsed*1000 
 31 |     
 32 |     
 33 |     def getMetadata(self, datastoreId, imageSetId):
 34 |         start_time = time.time()
 35 |         dicom_study_metadata = self.client.get_image_set_metadata(datastoreId=datastoreId , imageSetId=imageSetId )
 36 |         json_study_metadata = json.loads( gzip.decompress(dicom_study_metadata["imageSetMetadataBlob"].read()) )
 37 |         end_time = time.time()
 38 |         logging.info(f"Metadata fetch  : {self.stopwatch(start_time,end_time)} ms")   
 39 |         return json_study_metadata
 40 | 
 41 |     
 42 |     def listDatastores(self):
 43 |         start_time = time.time()
 44 |         response = self.client.list_datastores()
 45 |         end_time = time.time()
 46 |         logging.info(f"List Datastores  : {self.stopwatch(start_time,end_time)} ms")        
 47 |         return response
 48 |     
 49 |     
 50 |     def createDatastore(self, datastoreName):
 51 |         start_time = time.time()
 52 |         response = self.client.create_datastore(datastoreName=datastoreName)
 53 |         end_time = time.time()
 54 |         logging.info(f"Create Datastore  : {self.stopwatch(start_time,end_time)} ms")        
 55 |         return response
 56 |     
 57 |     
 58 |     def getDatastore(self, datastoreId):
 59 |         start_time = time.time()
 60 |         response = self.client.get_datastore(datastoreId=datastoreId)
 61 |         end_time = time.time()
 62 |         logging.info(f"Get Datastore  : {self.stopwatch(start_time,end_time)} ms")        
 63 |         return response
 64 |     
 65 |     
 66 |     def deleteDatastore(self, datastoreId):
 67 |         start_time = time.time()
 68 |         response = self.client.delete_datastore(datastoreId=datastoreId)
 69 |         end_time = time.time()
 70 |         logging.info(f"Delete Datastore  : {self.stopwatch(start_time,end_time)} ms")        
 71 |         return response
 72 |     
 73 |     
 74 |     def startImportJob(self, datastoreId, IamRoleArn, inputS3, outputS3):
 75 |         start_time = time.time()
 76 |         response = self.client.start_dicom_import_job(
 77 |             datastoreId=datastoreId,
 78 |             dataAccessRoleArn = IamRoleArn,
 79 |             inputS3Uri = inputS3,
 80 |             outputS3Uri = outputS3,
 81 |             clientToken = "demoClient"
 82 |         )
 83 |         end_time = time.time()
 84 |         logging.info(f"Start Import Job  : {self.stopwatch(start_time,end_time)} ms")        
 85 |         return response
 86 |     
 87 |     
 88 |     def getImportJob(self, datastoreId, jobId):
 89 |         start_time = time.time()
 90 |         response = self.client.get_dicom_import_job(datastoreId=datastoreId, jobId=jobId)
 91 |         end_time = time.time()
 92 |         logging.info(f"Get Import Job  : {self.stopwatch(start_time,end_time)} ms")        
 93 |         return response
 94 |     
 95 |     
 96 |     def getFramePixels(self, datastoreId, imageSetId, imageFrameId):
 97 |         start_time = time.time()
 98 |         res = self.client.get_image_frame(
 99 |             datastoreId=datastoreId,
100 |             imageSetId=imageSetId,
101 |             imageFrameInformation={
102 |                 'imageFrameId': imageFrameId
103 |             })
104 |         end_time = time.time()
105 |         logging.debug(f"Frame fetch     : {self.stopwatch(start_time,end_time)} ms") 
106 |         start_time = time.time() 
107 |         b = io.BytesIO()
108 |         b.write(res['imageFrameBlob'].read())
109 |         b.seek(0)
110 |         d = decode(b)
111 |         end_time = time.time()
112 |         logging.debug(f"Frame decode    : {self.stopwatch(start_time,end_time)} ms")    
113 |         return d 
114 | 
115 |     def getDICOMdataset(self, datastoreId, imageSetId):
116 |         logging.debug("Reading the JSON metadata file")
117 |         json_dicom_header = self.getMetadata(datastoreId , imageSetId)
118 | 
119 |         vrlist = []
120 |         sop_instances = []
121 |         
122 |         file_meta = FileMetaDataset()
123 |         file_meta.MediaStorageSOPClassUID = UID('1.2.840.10008.5.1.4.1.1.1')  # Placeholder SOP Class UID (Computed Radiography Image Storage)
124 |         file_meta.MediaStorageSOPInstanceUID = UID("1.3.51.5145.5142.20010109.1105627.1.0.1")
125 |         file_meta.ImplementationClassUID = UID("1.2.826.0.1.3680043.9.3811.2.0.1")
126 |         file_meta.TransferSyntaxUID = UID('1.2.840.10008.1.2.1')  # Explicit VR Little Endian
127 |         
128 |         logging.debug("Reading the Pixels")
129 |         for series in json_dicom_header["Study"]["Series"]:
130 |             for instances in json_dicom_header["Study"]["Series"][series]["Instances"]:
131 |                 ds = Dataset()
132 |                 ds.file_meta = file_meta
133 |                 
134 |                 PatientLevel = json_dicom_header["Patient"]["DICOM"]
135 |                 self.getTags(PatientLevel, ds, vrlist)
136 |                 StudyLevel = json_dicom_header["Study"]["DICOM"]
137 |                 self.getTags(StudyLevel, ds, vrlist)
138 |                 self.getDICOMVRs(json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOMVRs"] , vrlist)
139 |                 self.getTags( json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOM"] , ds, vrlist)
140 |                 self.getTags(json_dicom_header["Study"]["Series"][series]["DICOM"], ds, vrlist)
141 |                 
142 |                 ds.file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRLittleEndian
143 |                 ds.file_meta.MediaStorageSOPInstanceUID = UID(instances)
144 |                 ds.is_little_endian = True
145 |                 ds.is_implicit_VR = False
146 |                 
147 |                 frameId = json_dicom_header["Study"]["Series"][series]["Instances"][instances]["ImageFrames"][0]["ID"]
148 |                 pixels = self.getFramePixels(datastoreId, json_dicom_header["ImageSetID"], frameId)
149 |                 
150 |                 start_time = time.time()
151 |                 ds.PixelData = pixels.tobytes()
152 |                 sop_instances.append(ds)
153 |                 vrlist.clear()
154 |                 end_time = time.time()
155 |                 logging.debug(f"Output save     : {self.stopwatch(start_time,end_time)} ms")
156 |         return sop_instances
157 |     
158 |     def getDICOMVRs(self, taglevel, vrlist):
159 |         for theKey in taglevel:
160 |             vrlist.append( [ theKey , taglevel[theKey] ])
161 |             logging.debug(f"[getDICOMVRs] - List of private tags VRs: {vrlist}\r\n")
162 | 
163 | 
164 |     def getTags(self, tagLevel, ds, vrlist):    
165 |         for theKey in tagLevel:
166 |             if theKey in ['PrivateCreatorID', 'FileMetaInformationVersion', '00291203']:
167 |                 continue
168 |             try:
169 |                 try:
170 |                     tagvr = pydicom.datadict.dictionary_VR(theKey)
171 |                 except Exception:  # The VR is not in the pydicom dictionary; it might be a private tag listed in vrlist
172 |                     tagvr = None
173 |                     for vr in vrlist:
174 |                         if theKey == vr[0]:
175 |                             tagvr = vr[1]
176 |                 datavalue=tagLevel[theKey]
177 |                 #print(f"{tagvr} {theKey} : {datavalue}")
178 |                 if(tagvr == 'SQ'):
179 |                     logging.debug(f"{theKey} : {tagLevel[theKey]} , {vrlist}")
180 |                     seqs = []
181 |                     for underSeq in tagLevel[theKey]:
182 |                         seqds = Dataset()
183 |                         self.getTags(underSeq, seqds, vrlist)
184 |                         seqs.append(seqds)
185 |                     datavalue = Sequence(seqs)
186 |                     continue  # Sequences are built but then skipped: datavalue is never added to ds
187 |                 if(tagvr == 'US or SS'):
188 |                     datavalue=tagLevel[theKey]
189 |                     if (int(datavalue) > 32767):
190 |                         tagvr = 'US'
191 |                 if( tagvr == 'OB'):
192 |                     datavalue = self.getOBVRTagValue(tagLevel[theKey] )
193 |                     
194 |                 data_element = DataElement(theKey , tagvr , datavalue )
195 |                 if data_element.tag.group != 2:
196 |                     try:
197 |                         if (int(data_element.tag.group) % 2) == 0 : # we are skipping all the private tags
198 |                             ds.add(data_element) 
199 |                     except Exception:
200 |                         continue
201 |             except Exception as err:
202 |                 logging.warning(f"[HLIDataDICOMizer][getTags] - {err} for Key: {theKey}")
203 |                 continue 
204 | 
205 | 
206 | 
207 |     def getOBVRTagValue(self, datalist):
208 |         bytevals = []
209 |         for byteval in datalist:
210 |             bytevals.append(int(byteval)) 
211 |         OBArray = bytearray(bytevals)
212 |         return bytes(OBArray)
213 | 
214 | 


--------------------------------------------------------------------------------
/preprocess-multimodal-data/medical-imaging/src/radiogenomics-imaging-workflow.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "StartAt": "iterate_over_subjects",
 3 |   "States": {
 4 |     "iterate_over_subjects": {
 5 |       "ItemsPath": "$.Subject",
 6 |       "MaxConcurrency": 50,
 7 |       "Type": "Map",
 8 |       "Next": "Finish",
 9 |       "Iterator": {
10 |         "StartAt": "AHI Radiomic Feature Extraction",
11 |         "States": {
12 |           "Fallback": {
13 |             "Type": "Pass",
14 |             "Result": "This iteration failed for some reason",
15 |             "End": true
16 |           },
17 |           "AHI Radiomic Feature Extraction": {
18 |             "Type": "Task",
19 |             "OutputPath": "$.ProcessingJobArn",
20 |             "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
21 |             "Retry": [
22 |               {
23 |                 "ErrorEquals": [
24 |                   "SageMaker.AmazonSageMakerException"
25 |                 ],
26 |                 "IntervalSeconds": 15,
27 |                 "MaxAttempts": 8,
28 |                 "BackoffRate": 1.5
29 |               }
30 |             ],
31 |             "Catch": [
32 |               {
33 |                 "ErrorEquals": [
34 |                   "States.TaskFailed"
35 |                 ],
36 |                 "Next": "Fallback"
37 |               }
38 |             ],
39 |             "Parameters": {
40 |               "ProcessingJobName.$": "$$.Execution.Input['PreprocessingJobName']",
41 |               "ProcessingInputs": [
42 |                 {
43 |                   "InputName": "JSON",
44 |                   "AppManaged": false,
45 |                   "S3Input": {
46 |                     "S3Uri.$": "States.Format('##INPUT_DATA_S3URI##/{}' , $)",
47 |                     "LocalPath": "/opt/ml/processing/input",
48 |                     "S3DataType": "S3Prefix",
49 |                     "S3InputMode": "File",
50 |                     "S3DataDistributionType": "ShardedByS3Key",
51 |                     "S3CompressionType": "None"
52 |                   }
53 |                 }
54 |               ],
55 |               "ProcessingOutputConfig": {
56 |                 "Outputs": [
57 |                   {
58 |                     "OutputName": "radiomicsfeature",
59 |                     "AppManaged": false,
60 |                     "S3Output": {
61 |                       "S3Uri": "##OUTPUT_DATA_S3URI##",
62 |                       "LocalPath": "/opt/ml/processing/output/",
63 |                       "S3UploadMode": "EndOfJob"
64 |                     }
65 |                   }
66 |                 ]
67 |               },
68 |               "AppSpecification": {
69 |                 "ImageUri": "##ECR_IMAGE_URI##",
70 |                 "ContainerArguments.$": "States.Array('--datastore_id', $, '--feature_store_name', $$.Execution.Input['FeatureStoreName'], '--offline_store_s3uri', $$.Execution.Input['OfflineStoreS3Uri'])",
71 |                 "ContainerEntrypoint": [
72 |                   "python3",
73 |                   "/opt/ahiradiomics.py"
74 |                 ]
75 |               },
76 |               "RoleArn": "##IAM_ROLE_ARN##",
77 |               "ProcessingResources": {
78 |                 "ClusterConfig": {
79 |                   "InstanceCount": 10,
80 |                   "InstanceType": "ml.m5.large",
81 |                   "VolumeSizeInGB": 5
82 |                 }
83 |               }
84 |             },
85 |             "End": true
86 |           }
87 |         }
88 |       }
89 |     },
90 |     "Finish": {
91 |       "Type": "Succeed"
92 |     }
93 |   }
94 | }
95 | 
96 | 


--------------------------------------------------------------------------------
/store-multimodal-data/clinical/README.md:
--------------------------------------------------------------------------------
 1 | ## Store and analyze clinical data with Amazon HealthLake
 2 | 
 3 | To get started with storing clinical data, follow the steps in the guide [here](https://docs.aws.amazon.com/healthlake/latest/devguide/getting-started.html). 
 4 | 
 5 | Log in to your AWS account, search for Amazon HealthLake, and [create an empty Amazon HealthLake datastore](https://docs.aws.amazon.com/healthlake/latest/devguide/create-data-store.html). Provisioning takes about 20 minutes.
 6 | 
 7 | While the datastore is being created, navigate to Amazon S3, create a new bucket to hold the sample clinical data, and then copy the sample FHIR data folder from the public code repo into an S3 bucket in your account. You can run the following CLI command to copy the data:
 8 | `aws s3 sync s3://guidance-multimodal-hcls-healthai-machinelearning/clinical s3://yournewbucketnamehere/clinical`
 9 | 
10 | Once the datastore has been created, navigate to the datastore in the console and click the "Import" button in the top right.
11 | * On the Import page, press the "Browse S3" button, navigate to the sample data bucket in your account, then select the clinical folder. This folder contains various .ndjson files for the patients.
12 | * For an output file location, you can use the "sandbox-data-" folder to store your output job data.
13 | * We have created a HealthLake KMS key you can use throughout the workshop. Select that key for encryption.
14 | * Under the "Access Permissions" section, create an IAM role with a name of your preference.
15 | * Click the "Import data" button.
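
The console import steps above can also be scripted. The following is a minimal sketch, assuming the boto3 `healthlake` client and its `start_fhir_import_job` call. The helper names `build_import_request` and `start_import`, the default job name, and the default region are hypothetical and only for illustration. You supply the datastore ID, S3 URIs, KMS key, and IAM role ARN from your own account.

```python
def build_import_request(datastore_id, input_s3_uri, output_s3_uri,
                         kms_key_id, role_arn, job_name="clinical-import"):
    """Assemble keyword arguments for HealthLake's start_fhir_import_job.

    All values are supplied by the caller; nothing here is specific to the
    workshop account. job_name is an illustrative default.
    """
    return {
        "JobName": job_name,
        "DatastoreId": datastore_id,
        "DataAccessRoleArn": role_arn,
        # Folder that holds the .ndjson files copied with `aws s3 sync`
        "InputDataConfig": {"S3Uri": input_s3_uri},
        # Location for the import job's output logs, encrypted with the KMS key
        "JobOutputDataConfig": {
            "S3Configuration": {
                "S3Uri": output_s3_uri,
                "KmsKeyId": kms_key_id,
            }
        },
    }


def start_import(request, region_name="us-east-1"):
    """Submit the import job. The region is an assumption; use your own."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("healthlake", region_name=region_name)
    return client.start_fhir_import_job(**request)
```

Calling `start_import(build_import_request(...))` with your own identifiers would submit the same import that the "Import data" button starts in the console.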
16 | 
17 | 
18 | 


--------------------------------------------------------------------------------
/store-multimodal-data/genomic/store-analyze-genomicdata-with-awshealthomics.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "id": "f9472cd9",
   6 |    "metadata": {},
   7 |    "source": []
   8 |   },
   9 |   {
  10 |    "cell_type": "markdown",
  11 |    "id": "3c8ef7a7",
  12 |    "metadata": {},
  13 |    "source": [
  14 |     "# Create AWS HealthOmics Analytic Stores to Import Genomic Data and Run Queries"
  15 |    ]
  16 |   },
  17 |   {
  18 |    "cell_type": "markdown",
  19 |    "id": "85d3918d",
  20 |    "metadata": {},
  21 |    "source": [
  22 |     "Follow the steps in this notebook to:\n",
  23 |     "1. Create AWS HealthOmics Reference, Variant, and Annotation Stores\n",
  24 |     "2. Import the reference genome, variant files, and ClinVar annotation file from S3 into the respective data stores\n",
  25 |     "3. Query the variant and annotation data"
  26 |    ]
  27 |   },
  28 |   {
  29 |    "cell_type": "markdown",
  30 |    "id": "aef34065",
  31 |    "metadata": {},
  32 |    "source": [
  33 |     "## Prerequisites and package dependencies"
  34 |    ]
  35 |   },
  36 |   {
  37 |    "cell_type": "code",
  38 |    "execution_count": null,
  39 |    "id": "fee6df5c",
  40 |    "metadata": {},
  41 |    "outputs": [],
  42 |    "source": [
  43 |     "!pip install awswrangler"
  44 |    ]
  45 |   },
  46 |   {
  47 |    "cell_type": "code",
  48 |    "execution_count": 1,
  49 |    "id": "2dee102e",
  50 |    "metadata": {
  51 |     "tags": []
  52 |    },
  53 |    "outputs": [],
  54 |    "source": [
  55 |     "from datetime import datetime\n",
  56 |     "from pprint import pprint\n",
  57 |     "import urllib\n",
  58 |     "\n",
  59 |     "import boto3\n",
  60 |     "import botocore.exceptions\n",
  61 |     "\n",
  62 |     "from utils import *"
  63 |    ]
  64 |   },
  65 |   {
  66 |    "cell_type": "markdown",
  67 |    "id": "86a19a6d",
  68 |    "metadata": {},
  69 |    "source": [
  70 |     "### Create service role"
  71 |    ]
  72 |   },
  73 |   {
  74 |    "cell_type": "code",
  75 |    "execution_count": 2,
  76 |    "id": "a036d0a8",
  77 |    "metadata": {
  78 |     "jupyter": {
  79 |      "source_hidden": true
  80 |     },
  81 |     "tags": []
  82 |    },
  83 |    "outputs": [],
  84 |    "source": [
  85 |     "# set a timestamp\n",
  86 |     "dt_fmt = '%Y%m%dT%H%M%S'\n",
  87 |     "ts = datetime.now().strftime(dt_fmt)\n",
  88 |     "\n",
  89 |     "policy = {\n",
  90 |     "  \"Version\": \"2012-10-17\",\n",
  91 |     "  \"Statement\": [\n",
  92 |     "    {\n",
  93 |     "      \"Effect\": \"Allow\",\n",
  94 |     "      \"Action\": [\n",
  95 |     "        \"omics:*\"\n",
  96 |     "      ],\n",
  97 |     "      \"Resource\": \"*\"\n",
  98 |     "    },\n",
  99 |     "    {\n",
 100 |     "      \"Effect\": \"Allow\",\n",
 101 |     "      \"Action\": [\n",
 102 |     "        \"ram:AcceptResourceShareInvitation\",\n",
 103 |     "        \"ram:GetResourceShareInvitations\"\n",
 104 |     "      ],\n",
 105 |     "      \"Resource\": \"*\"\n",
 106 |     "    },\n",
 107 |     "    {\n",
 108 |     "      \"Effect\": \"Allow\",\n",
 109 |     "      \"Action\": [\n",
 110 |     "        \"s3:GetBucketLocation\",\n",
 111 |     "        \"s3:PutObject\",\n",
 112 |     "        \"s3:GetObject\",\n",
 113 |     "        \"s3:ListBucket\",\n",
 114 |     "        \"s3:AbortMultipartUpload\",\n",
 115 |     "        \"s3:ListMultipartUploadParts\",\n",
 116 |     "        \"s3:GetObjectAcl\",\n",
 117 |     "        \"s3:PutObjectAcl\"\n",
 118 |     "      ],\n",
 119 |     "      \"Resource\": \"*\"\n",
 120 |     "    }\n",
 121 |     "  ]\n",
 122 |     "}\n",
 123 |     "\n",
 124 |     "trust_policy = {\n",
 125 |     "    \"Version\": \"2012-10-17\",\n",
 126 |     "    \"Statement\": [\n",
 127 |     "        {\n",
 128 |     "            \"Effect\": \"Allow\",\n",
 129 |     "            \"Principal\": {\n",
 130 |     "                \"Service\": \"omics.amazonaws.com\"\n",
 131 |     "            },\n",
 132 |     "            \"Action\": \"sts:AssumeRole\"\n",
 133 |     "        }\n",
 134 |     "    ]\n",
 135 |     "}"
 136 |    ]
 137 |   },
 138 |   {
 139 |    "cell_type": "code",
 140 |    "execution_count": 3,
 141 |    "id": "9eb174a7",
 142 |    "metadata": {
 143 |     "tags": []
 144 |    },
 145 |    "outputs": [],
 146 |    "source": [
 147 |     "# Base name for role and policy\n",
 148 |     "omics_iam_name = f'multimodal-omics-{ts}'\n",
 149 |     "create_omics_role(omics_iam_name, policy, trust_policy)"
 150 |    ]
 151 |   },
 152 |   {
 153 |    "cell_type": "markdown",
 154 |    "id": "d7243a7e",
 155 |    "metadata": {},
 156 |    "source": [
 157 |     "### Create Omics client"
 158 |    ]
 159 |   },
 160 |   {
 161 |    "cell_type": "code",
 162 |    "execution_count": 5,
 163 |    "id": "c02fcf1f",
 164 |    "metadata": {
 165 |     "tags": []
 166 |    },
 167 |    "outputs": [],
 168 |    "source": [
 169 |     "omics = boto3.client('omics', region_name='us-east-1')"
 170 |    ]
 171 |   },
 172 |   {
 173 |    "cell_type": "markdown",
 174 |    "id": "94b5014a",
 175 |    "metadata": {},
 176 |    "source": [
 177 |     "### Source data\n",
 178 |     "Set the source data bucket to the regional replica of the Synthea Coherent dataset. If you want to use different source data, replace the bucket name here and any S3 URIs where it is used in the rest of the notebook."
 179 |    ]
 180 |   },
 181 |   {
 182 |    "cell_type": "code",
 183 |    "execution_count": null,
 184 |    "id": "9bfd290d",
 185 |    "metadata": {},
 186 |    "outputs": [],
 187 |    "source": [
 188 |     "SOURCE_BUCKET_NAME = f\"guidance-multimodal-hcls-healthai-machinelearning-{omics.meta.region_name}\""
 189 |    ]
 190 |   },
 191 |   {
 192 |    "cell_type": "markdown",
 193 |    "id": "adda409b",
 194 |    "metadata": {},
 195 |    "source": [
 196 |     "## Create reference store and import reference genome"
 197 |    ]
 198 |   },
 199 |   {
 200 |    "cell_type": "markdown",
 201 |    "id": "ec7c48f4",
 202 |    "metadata": {},
 203 |    "source": [
 204 |     "### Create reference store "
 205 |    ]
 206 |   },
 207 |   {
 208 |    "cell_type": "code",
 209 |    "execution_count": 6,
 210 |    "id": "a3071ab4",
 211 |    "metadata": {
 212 |     "tags": []
 213 |    },
 214 |    "outputs": [
 215 |     {
 216 |      "name": "stdout",
 217 |      "output_type": "stream",
 218 |      "text": [
 219 |       "Checking for a reference store in region: us-east-1\n",
 220 |       "Congratulations, you have an existing reference store!\n"
 221 |      ]
 222 |     }
 223 |    ],
 224 |    "source": [
 225 |     "print(f\"Checking for a reference store in region: {omics.meta.region_name}\")\n",
 226 |     "if get_ref_store_id(omics) is None:\n",
 227 |     "    response = omics.create_reference_store(name='myReferenceStore')\n",
 228 |     "    print(response)\n",
 229 |     "else:\n",
 230 |     "    print(\"Congratulations, you have an existing reference store!\")"
 231 |    ]
 232 |   },
 233 |   {
 234 |    "cell_type": "markdown",
 235 |    "id": "9aee28d4",
 236 |    "metadata": {},
 237 |    "source": [
 238 |     "### Import reference genome to reference store"
 239 |    ]
 240 |   },
 241 |   {
 242 |    "cell_type": "code",
 243 |    "execution_count": 4,
 244 |    "id": "161d02b0",
 245 |    "metadata": {
 246 |     "tags": []
 247 |    },
 248 |    "outputs": [],
 249 |    "source": [
 250 |     "SOURCE_S3_URIS = {\n",
 251 |     "    \"reference\": f\"s3://{SOURCE_BUCKET_NAME}/genomic/reference/hg19.fa\"\n",
 252 |     "}"
 253 |    ]
 254 |   },
 255 |   {
 256 |    "cell_type": "code",
 257 |    "execution_count": 7,
 258 |    "id": "6bf75b4f",
 259 |    "metadata": {
 260 |     "tags": []
 261 |    },
 262 |    "outputs": [],
 263 |    "source": [
 264 |     "# If using a different reference genome, replace \"hg19\" with a different prefix\n",
 265 |     "\n",
 266 |     "ref_name = f'hg19-{ts}'\n",
 267 |     "\n",
 268 |     "ref_import_job = omics.start_reference_import_job(\n",
 269 |     "    referenceStoreId=get_ref_store_id(omics), \n",
 270 |     "    roleArn=get_role_arn(omics_iam_name),\n",
 271 |     "    sources=[{\n",
 272 |     "        'sourceFile': SOURCE_S3_URIS[\"reference\"],\n",
 273 |     "        'name': ref_name,\n",
 274 |     "        'tags': {'SourceLocation': '1kg'}\n",
 275 |     "    }])"
 276 |    ]
 277 |   },
 278 |   {
 279 |    "cell_type": "code",
 280 |    "execution_count": null,
 281 |    "id": "219c738a",
 282 |    "metadata": {
 283 |     "tags": []
 284 |    },
 285 |    "outputs": [],
 286 |    "source": [
 287 |     "ref_import_job = omics.get_reference_import_job(\n",
 288 |     "    referenceStoreId=get_ref_store_id(omics), \n",
 289 |     "    id=ref_import_job['id'])\n",
 290 |     "ref_import_job"
 291 |    ]
 292 |   },
 293 |   {
 294 |    "cell_type": "code",
 295 |    "execution_count": null,
 296 |    "id": "5e1b56ff-92db-4db7-8e5f-f51f5733b557",
 297 |    "metadata": {
 298 |     "tags": []
 299 |    },
 300 |    "outputs": [],
 301 |    "source": [
 302 |     "try:\n",
 303 |     "    waiter = omics.get_waiter('reference_import_job_completed')\n",
 304 |     "    waiter.wait(id=ref_import_job['id'], referenceStoreId=ref_import_job['referenceStoreId'])\n",
 305 |     "    \n",
 306 |     "    print(f\"reference import job {ref_import_job['id']} complete\")\n",
 307 |     "except botocore.exceptions.WaiterError as e:\n",
 308 |     "    print(f\"reference import job {ref_import_job['id']} FAILED\")\n",
 309 |     "    print(e)"
 310 |    ]
 311 |   },
 312 |   {
 313 |    "cell_type": "markdown",
 314 |    "id": "b0f91c51",
 315 |    "metadata": {},
 316 |    "source": [
 317 |     "### !!! Wait until the above import job has finished !!!"
 318 |    ]
 319 |   },
 320 |   {
 321 |    "cell_type": "code",
 322 |    "execution_count": null,
 323 |    "id": "055d8c70",
 324 |    "metadata": {
 325 |     "tags": []
 326 |    },
 327 |    "outputs": [],
 328 |    "source": [
 329 |     "resp = omics.list_references(referenceStoreId=get_ref_store_id(omics), filter={\"name\": ref_name})\n",
 330 |     "\n",
 331 |     "ref_list = resp\n",
 332 |     "pprint(resp)"
 333 |    ]
 334 |   },
 335 |   {
 336 |    "cell_type": "code",
 337 |    "execution_count": null,
 338 |    "id": "41525f3c",
 339 |    "metadata": {
 340 |     "tags": []
 341 |    },
 342 |    "outputs": [],
 343 |    "source": [
 344 |     "# Store this reference\n",
 345 |     "ref = omics.get_reference_metadata(\n",
 346 |     "    referenceStoreId=get_ref_store_id(omics), \n",
 347 |     "    id=ref_list['references'][0]['id'])\n",
 348 |     "ref"
 349 |    ]
 350 |   },
 351 |   {
 352 |    "cell_type": "markdown",
 353 |    "id": "c9ed9b75",
 354 |    "metadata": {},
 355 |    "source": [
 356 |     "## Create Variant Store and import VCF files"
 357 |    ]
 358 |   },
 359 |   {
 360 |    "cell_type": "code",
 361 |    "execution_count": 12,
 362 |    "id": "01ec4e3d",
 363 |    "metadata": {
 364 |     "tags": []
 365 |    },
 366 |    "outputs": [],
 367 |    "source": [
 368 |     "SOURCE_VARIANT_URI = f\"s3://{SOURCE_BUCKET_NAME}\""
 369 |    ]
 370 |   },
 371 |   {
 372 |    "cell_type": "code",
 373 |    "execution_count": 13,
 374 |    "id": "801a2d03",
 375 |    "metadata": {
 376 |     "tags": []
 377 |    },
 378 |    "outputs": [],
 379 |    "source": [
 380 |     "# generate a list of VCF files to import\n",
 381 |     "\n",
 382 |     "source = urllib.parse.urlparse(SOURCE_VARIANT_URI)\n",
 383 |     "bucket = source.netloc\n",
 384 |     "prefix = source.path[1:]\n",
 385 |     "\n",
 386 |     "s3r = boto3.resource('s3')\n",
 387 |     "\n",
 388 |     "bucket = s3r.Bucket(bucket)\n",
 389 |     "objects = bucket.objects.filter(Prefix=prefix, MaxKeys=10_000)\n",
 390 |     "ext = '_dna.vcf'\n",
 391 |     "\n",
 392 |     "vcf_list = [f\"s3://{o.bucket_name}/{o.key}\" for o in objects if o.key.endswith(ext)]"
 393 |    ]
 394 |   },
 395 |   {
 396 |    "cell_type": "markdown",
 397 |    "id": "ff428d41",
 398 |    "metadata": {},
 399 |    "source": [
 400 |     "### Create Variant Store"
 401 |    ]
 402 |   },
 403 |   {
 404 |    "cell_type": "code",
 405 |    "execution_count": null,
 406 |    "id": "9c1fa8fd",
 407 |    "metadata": {
 408 |     "tags": []
 409 |    },
 410 |    "outputs": [],
 411 |    "source": [
 412 |     "var_store_name = f'synthea_newvariants_{ts.lower()}'\n",
 413 |     "\n",
 414 |     "response = omics.create_variant_store(\n",
 415 |     "    name=var_store_name, \n",
 416 |     "    reference={\"referenceArn\": get_reference_arn(ref_name, omics)}\n",
 417 |     ")\n",
 418 |     "\n",
 419 |     "var_store = response\n",
 420 |     "response"
 421 |    ]
 422 |   },
 423 |   {
 424 |    "cell_type": "markdown",
 425 |    "id": "82351bce",
 426 |    "metadata": {},
 427 |    "source": [
 428 |     "### !!! Wait until the Variant Store is created !!!"
 429 |    ]
 430 |   },
 431 |   {
 432 |    "cell_type": "code",
 433 |    "execution_count": null,
 434 |    "id": "3aed46b5",
 435 |    "metadata": {
 436 |     "tags": []
 437 |    },
 438 |    "outputs": [],
 439 |    "source": [
 440 |     "try:\n",
 441 |     "    waiter = omics.get_waiter('variant_store_created')\n",
 442 |     "    waiter.wait(name=var_store['name'])\n",
 443 |     "\n",
 444 |     "    print(f\"variant store {var_store['name']} ready for use\")\n",
 445 |     "except botocore.exceptions.WaiterError as e:\n",
 446 |     "    print(f\"variant store {var_store['name']} FAILED:\")\n",
 447 |     "    print(e)\n",
 448 |     "\n",
 449 |     "var_store = omics.get_variant_store(name=var_store['name'])"
 450 |    ]
 451 |   },
 452 |   {
 453 |    "cell_type": "markdown",
 454 |    "id": "c6bb0979",
 455 |    "metadata": {},
 456 |    "source": [
 457 |     "### Import VCF files"
 458 |    ]
 459 |   },
 460 |   {
 461 |    "cell_type": "code",
 462 |    "execution_count": 17,
 463 |    "id": "73a348de",
 464 |    "metadata": {
 465 |     "tags": []
 466 |    },
 467 |    "outputs": [],
 468 |    "source": [
 469 |     "l_vcf = [{\"source\": uri} for uri in vcf_list]\n",
 470 |     "\n",
 471 |     "response = omics.start_variant_import_job(destinationName=var_store['name'], \n",
 472 |     "                                          roleArn=get_role_arn(omics_iam_name),\n",
 473 |     "                                          items=l_vcf)"
 474 |    ]
 475 |   },
 476 |   {
 477 |    "cell_type": "markdown",
 478 |    "id": "cc6a56d1",
 479 |    "metadata": {},
 480 |    "source": [
 481 |     "## Query Variant Store with Amazon Athena"
 482 |    ]
 483 |   },
 484 |   {
 485 |    "cell_type": "code",
 486 |    "execution_count": null,
 487 |    "id": "b2db8f94-c240-45da-8671-8370f0067407",
 488 |    "metadata": {
 489 |     "tags": []
 490 |    },
 491 |    "outputs": [],
 492 |    "source": [
 493 |     "# To run Athena queries on the data, use AWS Lake Formation to create resource links to the database.\n",
 494 |     "# For the following function to work, the IAM user running this notebook must be a Data Lake Administrator.\n",
 495 |     "\n",
 496 |     "create_resource_link('omicsdb', var_store, store_type='variant')"
 497 |    ]
 498 |   },
 499 |   {
 500 |    "cell_type": "code",
 501 |    "execution_count": 19,
 502 |    "id": "e570e5d2",
 503 |    "metadata": {
 504 |     "tags": []
 505 |    },
 506 |    "outputs": [
 507 |     {
 508 |      "name": "stdout",
 509 |      "output_type": "stream",
 510 |      "text": [
 511 |       "Athena engine version 3\n",
 512 |       "Workgroup 'omics' found using Athena engine version 3\n"
 513 |      ]
 514 |     },
 515 |     {
 516 |      "data": {
 517 |       "text/plain": [
 518 |        "{'Name': 'omics',\n",
 519 |        " 'State': 'ENABLED',\n",
 520 |        " 'Description': '',\n",
 521 |        " 'CreationTime': datetime.datetime(2023, 3, 21, 20, 36, 59, 255000, tzinfo=tzlocal()),\n",
 522 |        " 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n",
 523 |        "  'EffectiveEngineVersion': 'Athena engine version 3'}}"
 524 |       ]
 525 |      },
 526 |      "execution_count": 19,
 527 |      "metadata": {},
 528 |      "output_type": "execute_result"
 529 |     }
 530 |    ],
 531 |    "source": [
 532 |     "# Omics analytics stores require Athena engine version 3 for querying\n",
 533 |     "# https://docs.aws.amazon.com/athena/latest/ug/versions.html\n",
 534 |     "\n",
 535 |     "# Locate or create a suitable workgroup for Athena queries\n",
 536 |     "\n",
 537 |     "athena = boto3.client('athena')\n",
 538 |     "\n",
 539 |     "athena_workgroups = athena.list_work_groups()['WorkGroups']\n",
 540 |     "\n",
 541 |     "athena_workgroup = None\n",
 542 |     "for wg in athena_workgroups:\n",
 543 |     "    print(wg['EngineVersion']['EffectiveEngineVersion'])\n",
 544 |     "    if wg['EngineVersion']['EffectiveEngineVersion'] == 'Athena engine version 3':\n",
 545 |     "        print(f\"Workgroup '{wg['Name']}' found using Athena engine version 3\")\n",
 546 |     "        athena_workgroup = wg\n",
 547 |     "        break\n",
 548 |     "else:\n",
 549 |     "    print(\"No workgroup with Athena engine version 3 found; creating one\")\n",
 550 |     "    athena_workgroup = athena.create_work_group(\n",
 551 |     "        Name='omics',\n",
 552 |     "        Configuration={\n",
 553 |     "            \"EngineVersion\": {\n",
 554 |     "                \"SelectedEngineVersion\": \"Athena engine version 3\"\n",
 555 |     "            }\n",
 556 |     "        }\n",
 557 |     "    )\n",
 558 |     "\n",
 559 |     "athena_workgroup"
 560 |    ]
 561 |   },
 562 |   {
 563 |    "cell_type": "code",
 564 |    "execution_count": 26,
 565 |    "id": "8dbf203d",
 566 |    "metadata": {
 567 |     "tags": []
 568 |    },
 569 |    "outputs": [
 570 |     {
 571 |      "data": {
 572 |       "text/html": [
 573 |        "<div>\n",
 574 |        "<style scoped>\n",
 575 |        "    .dataframe tbody tr th:only-of-type {\n",
 576 |        "        vertical-align: middle;\n",
 577 |        "    }\n",
 578 |        "\n",
 579 |        "    .dataframe tbody tr th {\n",
 580 |        "        vertical-align: top;\n",
 581 |        "    }\n",
 582 |        "\n",
 583 |        "    .dataframe thead th {\n",
 584 |        "        text-align: right;\n",
 585 |        "    }\n",
 586 |        "</style>\n",
 587 |        "<table border=\"1\" class=\"dataframe\">\n",
 588 |        "  <thead>\n",
 589 |        "    <tr style=\"text-align: right;\">\n",
 590 |        "      <th></th>\n",
 591 |        "      <th>sampleid</th>\n",
 592 |        "      <th>contigname</th>\n",
 593 |        "      <th>start</th>\n",
 594 |        "      <th>referenceallele</th>\n",
 595 |        "      <th>alternatealleles</th>\n",
 596 |        "      <th>calls</th>\n",
 597 |        "    </tr>\n",
 598 |        "  </thead>\n",
 599 |        "  <tbody>\n",
 600 |        "    <tr>\n",
 601 |        "      <th>0</th>\n",
 602 |        "      <td>69eab197-6c14-7fcf-16d8-a18a222b82a4</td>\n",
 603 |        "      <td>1</td>\n",
 604 |        "      <td>46932823</td>\n",
 605 |        "      <td>A</td>\n",
 606 |        "      <td>[G]</td>\n",
 607 |        "      <td>[0, 1]</td>\n",
 608 |        "    </tr>\n",
 609 |        "    <tr>\n",
 610 |        "      <th>1</th>\n",
 611 |        "      <td>c7fff683-fd1b-f937-71ae-a490a80c9197</td>\n",
 612 |        "      <td>1</td>\n",
 613 |        "      <td>46932823</td>\n",
 614 |        "      <td>A</td>\n",
 615 |        "      <td>[G]</td>\n",
 616 |        "      <td>[0, 1]</td>\n",
 617 |        "    </tr>\n",
 618 |        "    <tr>\n",
 619 |        "      <th>2</th>\n",
 620 |        "      <td>c7fff683-fd1b-f937-71ae-a490a80c9197</td>\n",
 621 |        "      <td>1</td>\n",
 622 |        "      <td>55039973</td>\n",
 623 |        "      <td>G</td>\n",
 624 |        "      <td>[A, T]</td>\n",
 625 |        "      <td>[0, 1]</td>\n",
 626 |        "    </tr>\n",
 627 |        "    <tr>\n",
 628 |        "      <th>3</th>\n",
 629 |        "      <td>c7fff683-fd1b-f937-71ae-a490a80c9197</td>\n",
 630 |        "      <td>1</td>\n",
 631 |        "      <td>46932823</td>\n",
 632 |        "      <td>A</td>\n",
 633 |        "      <td>[G]</td>\n",
 634 |        "      <td>[0, 1]</td>\n",
 635 |        "    </tr>\n",
 636 |        "    <tr>\n",
 637 |        "      <th>4</th>\n",
 638 |        "      <td>1c906349-d5f7-3b79-9385-291d6ca12ddc</td>\n",
 639 |        "      <td>1</td>\n",
 640 |        "      <td>46932823</td>\n",
 641 |        "      <td>A</td>\n",
 642 |        "      <td>[G]</td>\n",
 643 |        "      <td>[0, 1]</td>\n",
 644 |        "    </tr>\n",
 645 |        "    <tr>\n",
 646 |        "      <th>5</th>\n",
 647 |        "      <td>69eab197-6c14-7fcf-16d8-a18a222b82a4</td>\n",
 648 |        "      <td>1</td>\n",
 649 |        "      <td>46932823</td>\n",
 650 |        "      <td>A</td>\n",
 651 |        "      <td>[G]</td>\n",
 652 |        "      <td>[0, 1]</td>\n",
 653 |        "    </tr>\n",
 654 |        "    <tr>\n",
 655 |        "      <th>6</th>\n",
 656 |        "      <td>69eab197-6c14-7fcf-16d8-a18a222b82a4</td>\n",
 657 |        "      <td>1</td>\n",
 658 |        "      <td>55039973</td>\n",
 659 |        "      <td>G</td>\n",
 660 |        "      <td>[A, T]</td>\n",
 661 |        "      <td>[0, 1]</td>\n",
 662 |        "    </tr>\n",
 663 |        "    <tr>\n",
 664 |        "      <th>7</th>\n",
 665 |        "      <td>1c906349-d5f7-3b79-9385-291d6ca12ddc</td>\n",
 666 |        "      <td>1</td>\n",
 667 |        "      <td>46932823</td>\n",
 668 |        "      <td>A</td>\n",
 669 |        "      <td>[G]</td>\n",
 670 |        "      <td>[0, 1]</td>\n",
 671 |        "    </tr>\n",
 672 |        "    <tr>\n",
 673 |        "      <th>8</th>\n",
 674 |        "      <td>1c906349-d5f7-3b79-9385-291d6ca12ddc</td>\n",
 675 |        "      <td>1</td>\n",
 676 |        "      <td>55039973</td>\n",
 677 |        "      <td>G</td>\n",
 678 |        "      <td>[A, T]</td>\n",
 679 |        "      <td>[0, 1]</td>\n",
 680 |        "    </tr>\n",
 681 |        "    <tr>\n",
 682 |        "      <th>9</th>\n",
 683 |        "      <td>dccfd9ed-8080-2743-5c17-7888e93617d5</td>\n",
 684 |        "      <td>1</td>\n",
 685 |        "      <td>46932823</td>\n",
 686 |        "      <td>A</td>\n",
 687 |        "      <td>[G]</td>\n",
 688 |        "      <td>[0, 1]</td>\n",
 689 |        "    </tr>\n",
 690 |        "  </tbody>\n",
 691 |        "</table>\n",
 692 |        "</div>"
 693 |       ],
 694 |       "text/plain": [
 695 |        "                               sampleid contigname     start referenceallele  \\\n",
 696 |        "0  69eab197-6c14-7fcf-16d8-a18a222b82a4          1  46932823               A   \n",
 697 |        "1  c7fff683-fd1b-f937-71ae-a490a80c9197          1  46932823               A   \n",
 698 |        "2  c7fff683-fd1b-f937-71ae-a490a80c9197          1  55039973               G   \n",
 699 |        "3  c7fff683-fd1b-f937-71ae-a490a80c9197          1  46932823               A   \n",
 700 |        "4  1c906349-d5f7-3b79-9385-291d6ca12ddc          1  46932823               A   \n",
 701 |        "5  69eab197-6c14-7fcf-16d8-a18a222b82a4          1  46932823               A   \n",
 702 |        "6  69eab197-6c14-7fcf-16d8-a18a222b82a4          1  55039973               G   \n",
 703 |        "7  1c906349-d5f7-3b79-9385-291d6ca12ddc          1  46932823               A   \n",
 704 |        "8  1c906349-d5f7-3b79-9385-291d6ca12ddc          1  55039973               G   \n",
 705 |        "9  dccfd9ed-8080-2743-5c17-7888e93617d5          1  46932823               A   \n",
 706 |        "\n",
 707 |        "  alternatealleles   calls  \n",
 708 |        "0              [G]  [0, 1]  \n",
 709 |        "1              [G]  [0, 1]  \n",
 710 |        "2           [A, T]  [0, 1]  \n",
 711 |        "3              [G]  [0, 1]  \n",
 712 |        "4              [G]  [0, 1]  \n",
 713 |        "5              [G]  [0, 1]  \n",
 714 |        "6           [A, T]  [0, 1]  \n",
 715 |        "7              [G]  [0, 1]  \n",
 716 |        "8           [A, T]  [0, 1]  \n",
 717 |        "9              [G]  [0, 1]  "
 718 |       ]
 719 |      },
 720 |      "execution_count": 26,
 721 |      "metadata": {},
 722 |      "output_type": "execute_result"
 723 |     }
 724 |    ],
 725 |    "source": [
 726 |     "# Use AWS SDK for pandas (awswrangler) to submit the query and get the results as a pandas DataFrame\n",
 727 |     "\n",
 728 |     "import awswrangler as wr\n",
 729 |     "\n",
 730 |     "df_var = wr.athena.read_sql_query(\n",
 731 |     "    f\"select sampleid, contigname, start, referenceallele, alternatealleles, calls from {var_store['name']} limit 10;\", \n",
 732 |     "    database=\"omicsdb\", workgroup = \"omics\")\n",
 733 |     "df_var"
 734 |    ]
 735 |   },
 736 |   {
 737 |    "cell_type": "markdown",
 738 |    "id": "ef5d4866",
 739 |    "metadata": {},
 740 |    "source": [
 741 |     "## Create Annotation Store and import ClinVar annotation file"
 742 |    ]
 743 |   },
 744 |   {
 745 |    "cell_type": "code",
 746 |    "execution_count": null,
 747 |    "id": "4b747f19",
 748 |    "metadata": {
 749 |     "tags": []
 750 |    },
 751 |    "outputs": [],
 752 |    "source": [
 753 |     "SOURCE_ANNOTATION_URI = f\"s3://{SOURCE_BUCKET_NAME}/genomic/annotation/clinvar.vcf.gz\"\n",
 754 |     "\n",
 755 |     "ann_store_name = f'synthea_annotations_{ts.lower()}'\n",
 756 |     "\n",
 757 |     "response = omics.create_annotation_store(\n",
 758 |     "    name=ann_store_name, \n",
 759 |     "    reference={\"referenceArn\": get_reference_arn(ref_name, omics)},\n",
 760 |     "    storeFormat='VCF'\n",
 761 |     ")\n",
 762 |     "\n",
 763 |     "ann_store = response\n",
 764 |     "response\n",
 765 |     "\n",
 766 |     "try:\n",
 767 |     "    waiter = omics.get_waiter('annotation_store_created')\n",
 768 |     "    waiter.wait(name=ann_store['name'])\n",
 769 |     "\n",
 770 |     "    print(f\"annotation store {ann_store['name']} ready for use\")\n",
 771 |     "except botocore.exceptions.WaiterError as e:\n",
 772 |     "    print(f\"annotation store {ann_store['name']} FAILED:\")\n",
 773 |     "    print(e)\n",
 774 |     "\n",
 775 |     "ann_store = omics.get_annotation_store(name=ann_store['name'])\n"
 776 |    ]
 777 |   },
 778 |   {
 779 |    "cell_type": "markdown",
 780 |    "id": "29db5e49",
 781 |    "metadata": {},
 782 |    "source": [
 783 |     "### !!! Wait until the Annotation Store is created !!!"
 784 |    ]
 785 |   },
 786 |   {
 787 |    "cell_type": "markdown",
 788 |    "id": "167610f7",
 789 |    "metadata": {},
 790 |    "source": [
 791 |     "### Import annotation file"
 792 |    ]
 793 |   },
 794 |   {
 795 |    "cell_type": "code",
 796 |    "execution_count": null,
 797 |    "id": "330bff87",
 798 |    "metadata": {
 799 |     "tags": []
 800 |    },
 801 |    "outputs": [],
 802 |    "source": [
 803 |     "response = omics.start_annotation_import_job(\n",
 804 |     "    destinationName=ann_store['name'],\n",
 805 |     "    roleArn=get_role_arn(omics_iam_name),\n",
 806 |     "    items=[{\"source\": SOURCE_ANNOTATION_URI}]\n",
 807 |     ")\n",
 808 |     "response"
 809 |    ]
 810 |   },
 811 |   {
 812 |    "cell_type": "markdown",
 813 |    "id": "8b1199c0",
 814 |    "metadata": {},
 815 |    "source": [
 816 |     "## Query Annotation Store with Amazon Athena"
 817 |    ]
 818 |   },
 819 |   {
 820 |    "cell_type": "code",
 821 |    "execution_count": null,
 822 |    "id": "8345b58b-bfbd-45b0-93a7-882891f4f53a",
 823 |    "metadata": {
 824 |     "tags": []
 825 |    },
 826 |    "outputs": [],
 827 |    "source": [
 828 |     "create_resource_link('omicsdb', ann_store, store_type='annotation')"
 829 |    ]
 830 |   },
 831 |   {
 832 |    "cell_type": "code",
 833 |    "execution_count": 27,
 834 |    "id": "b0f4bf17",
 835 |    "metadata": {
 836 |     "tags": []
 837 |    },
 838 |    "outputs": [
 839 |     {
 840 |      "data": {
 841 |       "text/html": [
 842 |        "<div>\n",
 843 |        "<style scoped>\n",
 844 |        "    .dataframe tbody tr th:only-of-type {\n",
 845 |        "        vertical-align: middle;\n",
 846 |        "    }\n",
 847 |        "\n",
 848 |        "    .dataframe tbody tr th {\n",
 849 |        "        vertical-align: top;\n",
 850 |        "    }\n",
 851 |        "\n",
 852 |        "    .dataframe thead th {\n",
 853 |        "        text-align: right;\n",
 854 |        "    }\n",
 855 |        "</style>\n",
 856 |        "<table border=\"1\" class=\"dataframe\">\n",
 857 |        "  <thead>\n",
 858 |        "    <tr style=\"text-align: right;\">\n",
 859 |        "      <th></th>\n",
 860 |        "      <th>contigname</th>\n",
 861 |        "      <th>start</th>\n",
 862 |        "      <th>referenceallele</th>\n",
 863 |        "      <th>alternatealleles</th>\n",
 864 |        "      <th>attributes</th>\n",
 865 |        "    </tr>\n",
 866 |        "  </thead>\n",
 867 |        "  <tbody>\n",
 868 |        "    <tr>\n",
 869 |        "      <th>0</th>\n",
 870 |        "      <td>1</td>\n",
 871 |        "      <td>926009</td>\n",
 872 |        "      <td>G</td>\n",
 873 |        "      <td>[T]</td>\n",
 874 |        "      <td>[(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...</td>\n",
 875 |        "    </tr>\n",
 876 |        "    <tr>\n",
 877 |        "      <th>1</th>\n",
 878 |        "      <td>1</td>\n",
 879 |        "      <td>926026</td>\n",
 880 |        "      <td>C</td>\n",
 881 |        "      <td>[T]</td>\n",
 882 |        "      <td>[(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...</td>\n",
 883 |        "    </tr>\n",
 884 |        "    <tr>\n",
 885 |        "      <th>2</th>\n",
 886 |        "      <td>1</td>\n",
 887 |        "      <td>925975</td>\n",
 888 |        "      <td>T</td>\n",
 889 |        "      <td>[C]</td>\n",
 890 |        "      <td>[(CLNSIG, Uncertain_significance), (GENEINFO, ...</td>\n",
 891 |        "    </tr>\n",
 892 |        "    <tr>\n",
 893 |        "      <th>3</th>\n",
 894 |        "      <td>1</td>\n",
 895 |        "      <td>926002</td>\n",
 896 |        "      <td>C</td>\n",
 897 |        "      <td>[T]</td>\n",
 898 |        "      <td>[(CLNSIG, Uncertain_significance), (GENEINFO, ...</td>\n",
 899 |        "    </tr>\n",
 900 |        "    <tr>\n",
 901 |        "      <th>4</th>\n",
 902 |        "      <td>1</td>\n",
 903 |        "      <td>926013</td>\n",
 904 |        "      <td>G</td>\n",
 905 |        "      <td>[A]</td>\n",
 906 |        "      <td>[(CLNSIG, Uncertain_significance), (GENEINFO, ...</td>\n",
 907 |        "    </tr>\n",
 908 |        "    <tr>\n",
 909 |        "      <th>5</th>\n",
 910 |        "      <td>1</td>\n",
 911 |        "      <td>926024</td>\n",
 912 |        "      <td>G</td>\n",
 913 |        "      <td>[A]</td>\n",
 914 |        "      <td>[(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...</td>\n",
 915 |        "    </tr>\n",
 916 |        "    <tr>\n",
 917 |        "      <th>6</th>\n",
 918 |        "      <td>1</td>\n",
 919 |        "      <td>925955</td>\n",
 920 |        "      <td>C</td>\n",
 921 |        "      <td>[T]</td>\n",
 922 |        "      <td>[(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...</td>\n",
 923 |        "    </tr>\n",
 924 |        "    <tr>\n",
 925 |        "      <th>7</th>\n",
 926 |        "      <td>1</td>\n",
 927 |        "      <td>925968</td>\n",
 928 |        "      <td>C</td>\n",
 929 |        "      <td>[T]</td>\n",
 930 |        "      <td>[(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...</td>\n",
 931 |        "    </tr>\n",
 932 |        "    <tr>\n",
 933 |        "      <th>8</th>\n",
 934 |        "      <td>1</td>\n",
 935 |        "      <td>925985</td>\n",
 936 |        "      <td>C</td>\n",
 937 |        "      <td>[T]</td>\n",
 938 |        "      <td>[(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...</td>\n",
 939 |        "    </tr>\n",
 940 |        "    <tr>\n",
 941 |        "      <th>9</th>\n",
 942 |        "      <td>1</td>\n",
 943 |        "      <td>925951</td>\n",
 944 |        "      <td>G</td>\n",
 945 |        "      <td>[A]</td>\n",
 946 |        "      <td>[(RS, 1640863258), (CLNSIG, Uncertain_signific...</td>\n",
 947 |        "    </tr>\n",
 948 |        "  </tbody>\n",
 949 |        "</table>\n",
 950 |        "</div>"
 951 |       ],
 952 |       "text/plain": [
 953 |        "  contigname   start referenceallele alternatealleles  \\\n",
 954 |        "0          1  926009               G              [T]   \n",
 955 |        "1          1  926026               C              [T]   \n",
 956 |        "2          1  925975               T              [C]   \n",
 957 |        "3          1  926002               C              [T]   \n",
 958 |        "4          1  926013               G              [A]   \n",
 959 |        "5          1  926024               G              [A]   \n",
 960 |        "6          1  925955               C              [T]   \n",
 961 |        "7          1  925968               C              [T]   \n",
 962 |        "8          1  925985               C              [T]   \n",
 963 |        "9          1  925951               G              [A]   \n",
 964 |        "\n",
 965 |        "                                          attributes  \n",
 966 |        "0  [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...  \n",
 967 |        "1  [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...  \n",
 968 |        "2  [(CLNSIG, Uncertain_significance), (GENEINFO, ...  \n",
 969 |        "3  [(CLNSIG, Uncertain_significance), (GENEINFO, ...  \n",
 970 |        "4  [(CLNSIG, Uncertain_significance), (GENEINFO, ...  \n",
 971 |        "5  [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...  \n",
 972 |        "6  [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...  \n",
 973 |        "7  [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...  \n",
 974 |        "8  [(CLNSIG, Likely_benign), (GENEINFO, SAMD11:14...  \n",
 975 |        "9  [(RS, 1640863258), (CLNSIG, Uncertain_signific...  "
 976 |       ]
 977 |      },
 978 |      "execution_count": 27,
 979 |      "metadata": {},
 980 |      "output_type": "execute_result"
 981 |     }
 982 |    ],
 983 |    "source": [
 984 |     "df_ann = wr.athena.read_sql_query(\n",
 985 |     "    f\"select contigname, start, referenceallele, alternatealleles, attributes from {ann_store['name']} order by contigname limit 10;\", \n",
 986 |     "    database=\"omicsdb\", workgroup = \"omics\")\n",
 987 |     "df_ann"
 988 |    ]
 989 |   },
 990 |   {
 991 |    "cell_type": "code",
 992 |    "execution_count": null,
 993 |    "id": "77f9922d",
 994 |    "metadata": {},
 995 |    "outputs": [],
 996 |    "source": []
 997 |   },
 998 |   {
 999 |    "cell_type": "code",
1000 |    "execution_count": null,
1001 |    "id": "065dc264",
1002 |    "metadata": {},
1003 |    "outputs": [],
1004 |    "source": []
1005 |   }
1006 |  ],
1007 |  "metadata": {
1008 |   "kernelspec": {
1009 |    "display_name": "conda_python3",
1010 |    "language": "python",
1011 |    "name": "conda_python3"
1012 |   },
1013 |   "language_info": {
1014 |    "codemirror_mode": {
1015 |     "name": "ipython",
1016 |     "version": 3
1017 |    },
1018 |    "file_extension": ".py",
1019 |    "mimetype": "text/x-python",
1020 |    "name": "python",
1021 |    "nbconvert_exporter": "python",
1022 |    "pygments_lexer": "ipython3",
1023 |    "version": "3.10.12"
1024 |   }
1025 |  },
1026 |  "nbformat": 4,
1027 |  "nbformat_minor": 5
1028 | }
1029 | 
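
A minimal sketch of the VCF-selection step from the notebook above, written as a pure function so it can be checked without AWS access. The helper name `build_vcf_import_items`, the bucket `example-bucket`, and the sample keys are hypothetical and not part of the repository; the `{"source": uri}` item shape and the `_dna.vcf` suffix are taken from the notebook.

```python
def build_vcf_import_items(keys, bucket, suffix="_dna.vcf"):
    """Build the `items` list passed to omics.start_variant_import_job.

    keys   -- iterable of S3 object keys (strings)
    bucket -- name of the bucket that holds the keys
    suffix -- only keys ending with this suffix are kept
    """
    return [
        {"source": f"s3://{bucket}/{key}"}
        for key in keys
        if key.endswith(suffix)
    ]


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    example_keys = [
        "genomic/variants/patient_a_dna.vcf",
        "genomic/reference/hg19.fa",
        "genomic/variants/patient_b_dna.vcf",
    ]
    for item in build_vcf_import_items(example_keys, "example-bucket"):
        print(item)
```

Keeping the filter separate from the S3 listing makes it straightforward to swap in a different suffix or bucket when using source data other than the Synthea Coherent replica.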


--------------------------------------------------------------------------------
/store-multimodal-data/genomic/utils.py:
--------------------------------------------------------------------------------
  1 | import json
  2 | 
  3 | 
  4 | import boto3
  5 | import botocore.exceptions
  6 | 
  7 | 
  8 | def create_omics_role(rolename, policy, trust_policy, iam=None):
  9 |     # Create the IAM resource if one was not supplied
 10 |     if not iam:
 11 |         iam = boto3.resource('iam')
 12 | 
 13 |     # Check if the role already exists. If not, create it
 14 |     try:
 15 |         role = iam.Role(rolename)
 16 |         role.load()
 17 | 
 18 |     except botocore.exceptions.ClientError as ex:
 19 |         if ex.response["Error"]["Code"] == "NoSuchEntity":
 20 |             #Create the role with the corresponding trust policy
 21 |             role = iam.create_role(
 22 |                 RoleName=rolename, 
 23 |                 AssumeRolePolicyDocument=json.dumps(trust_policy))
 24 | 
 25 |             #Create policy
 26 |             policy = iam.create_policy(
 27 |                 PolicyName='{}-policy'.format(rolename), 
 28 |                 Description="Policy for Amazon Omics",
 29 |                 PolicyDocument=json.dumps(policy))
 30 | 
 31 |             #Attach the policy to the role
 32 |             policy.attach_role(RoleName=rolename)
 33 |         else:
 34 |             print (ex)
 35 |             print('Something went wrong; please retry and check your account settings and permissions')
 36 | 
 37 | 
 38 | def get_role_arn(rolename, client=None):
 39 |     """retrieves the arn for an iam role name"""
 40 |     if not client:
 41 |         client = boto3.client('iam')
 42 |     
 43 |     role = client.get_role(RoleName=rolename)['Role']
 44 |     return role['Arn']
 45 | 
 46 | 
 47 | def get_ref_store_id(client=None):
 48 |     if not client:
 49 |         client = boto3.client('omics')
 50 |     
 51 |     resp = client.list_reference_stores(maxResults=10)
 52 |     list_of_stores = resp.get('referenceStores')
 53 |     store_id = None
 54 |     
 55 |     if list_of_stores:
 56 |         # Since there can only be one store per region, if there is a store present use the first one
 57 |         store_id = list_of_stores[0].get('id')
 58 |     
 59 |     return store_id
 60 | 
 61 | 
 62 | def get_reference_arn(ref_name, client=None):
 63 |     if not client:
 64 |         client = boto3.client('omics')
 65 |     
 66 |     resp = client.list_reference_stores(maxResults=10)
 67 |     ref_stores = resp.get('referenceStores')
 68 |     
 69 |     # There can only be one reference store per account per region
 70 |     # if there is a store present, it is the first one
 71 |     ref_store = ref_stores[0] if ref_stores else None
 72 |     
 73 |     if not ref_store:
 74 |         raise RuntimeError("You have not created a reference store, please go to the Amazon Omics Storage tutorial to learn how to create one. Do not continue with this notebook")
 75 |         
 76 |     ref_arn = None
 77 |     resp = client.list_references(referenceStoreId=ref_store['id'])
 78 |     ref_list = resp.get('references') or []
 79 |     
 80 |     for ref in ref_list:
 81 |         if ref['name'] == ref_name:
 82 |             ref_arn = ref['arn']
 83 |     
 84 |     if ref_arn is None:
 85 |         raise RuntimeError(f"Could not find {ref_name}.")
 86 |     
 87 |     return ref_arn
 88 | 
 89 | 
 90 | def create_resource_link(database_name, store, store_type='variant'):
 91 |     ram = boto3.client('ram')
 92 |     glue = boto3.client('glue')
 93 | 
 94 |     caller_identity = boto3.client('sts').get_caller_identity()
 95 |     AWS_ACCOUNT_ID = caller_identity['Account']
 96 |     AWS_IDENTITY_ARN = caller_identity['Arn']
 97 | 
 98 |     response = ram.list_resources(resourceOwner='OTHER-ACCOUNTS', resourceType='glue:Database')
 99 | 
100 |     if not response.get('resources'):
101 |         raise RuntimeError('no shared resources found. verify that you have successfully created an Omics Analytics store')
102 | 
103 |     store_resources = [resource for resource in response['resources'] if store['id'] in resource['arn']]
104 |     if not store_resources:
105 |         raise RuntimeError(f"no shared resources matching {store_type} store id {store['id']} found")
106 | 
107 |     store_resource = store_resources[0]
108 | 
109 |     resource_share = ram.get_resource_shares(
110 |         resourceOwner='OTHER-ACCOUNTS', 
111 |         resourceShareArns=[store_resource['resourceShareArn']])['resourceShares'][0]
112 |     
113 |     # this creates a resource link to the table for the variant store and adds it to the `omicsdb` database
114 |     response = glue.create_table(
115 |         DatabaseName=database_name,
116 |         TableInput = {
117 |             "Name": store['name'],
118 |             "TargetTable": {
119 |                 "CatalogId": resource_share['owningAccountId'],
120 |                 "DatabaseName": f"{store_type}_{AWS_ACCOUNT_ID}_{store['id']}",
121 |                 "Name": store['name'],
122 |             }
123 |         }
124 |     )
125 |     
126 |     return store_resource, resource_share, response
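

`get_reference_arn` above inspects only the first page returned by `list_references`. The sketch below shows one way the lookup could follow `nextToken` across pages. The function name `find_reference_arn` is hypothetical and not part of this repository, and unlike the loop above it returns the first match rather than the last. It assumes the client passed in behaves like a boto3 `omics` client.

```python
def find_reference_arn(client, reference_store_id, ref_name):
    """Return the ARN of the first reference named ref_name, or None.

    Hypothetical helper: `client` is assumed to expose list_references()
    with the same keyword arguments and response shape as the boto3
    'omics' client (referenceStoreId, nextToken, 'references').
    """
    kwargs = {"referenceStoreId": reference_store_id}
    while True:
        resp = client.list_references(**kwargs)
        for ref in resp.get("references", []):
            if ref.get("name") == ref_name:
                return ref["arn"]
        token = resp.get("nextToken")
        if not token:
            # No more pages and no match was found.
            return None
        kwargs["nextToken"] = token
```

A caller would typically pass `boto3.client('omics')` and the store id returned by `get_ref_store_id()`, then raise if the result is `None`.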


--------------------------------------------------------------------------------
/store-multimodal-data/medical-imaging/src/Api.py:
--------------------------------------------------------------------------------
  1 | import array
  2 | import pydicom
  3 | from pydicom.sequence import Sequence
  4 | from pydicom import Dataset , DataElement 
  5 | from pydicom.dataset import FileMetaDataset
  6 | from pydicom.uid import UID
  7 | import json
  8 | import logging
  9 | import importlib  
 10 | import boto3
 11 | from openjpeg import decode
 12 | import io
 13 | import sys
 14 | import time
 15 | import os
 16 | import gzip
 17 | 
 18 | logging.basicConfig( level="INFO" )
 19 | 
 20 | class MedicalImaging: 
 21 |     def __init__(self, endpoint=""):
 22 |         session = boto3.Session()
 23 |         if len(endpoint)>1:
 24 |             self.client = boto3.client('medical-imaging', endpoint_url=endpoint)
 25 |         else:
 26 |             self.client = boto3.client('medical-imaging')
 27 |     
 28 |     def stopwatch(self, start_time, end_time):
 29 |         time_lapsed = end_time - start_time
 30 |         return time_lapsed*1000 
 31 |     
 32 |     
 33 |     def getMetadata(self, datastoreId, imageSetId):
 34 |         start_time = time.time()
 35 |         dicom_study_metadata = self.client.get_image_set_metadata(datastoreId=datastoreId , imageSetId=imageSetId )
 36 |         json_study_metadata = json.loads( gzip.decompress(dicom_study_metadata["imageSetMetadataBlob"].read()) )
 37 |         end_time = time.time()
 38 |         logging.info(f"Metadata fetch  : {self.stopwatch(start_time,end_time)} ms")   
 39 |         return json_study_metadata
 40 | 
 41 |     
 42 |     def listDatastores(self):
 43 |         start_time = time.time()
 44 |         response = self.client.list_datastores()
 45 |         end_time = time.time()
 46 |         logging.info(f"List Datastores  : {self.stopwatch(start_time,end_time)} ms")        
 47 |         return response
 48 |     
 49 |     
 50 |     def createDatastore(self, datastoreName):
 51 |         start_time = time.time()
 52 |         response = self.client.create_datastore(datastoreName=datastoreName)
 53 |         end_time = time.time()
 54 |         logging.info(f"Create Datastore  : {self.stopwatch(start_time,end_time)} ms")        
 55 |         return response
 56 |     
 57 |     
 58 |     def getDatastore(self, datastoreId):
 59 |         start_time = time.time()
 60 |         response = self.client.get_datastore(datastoreId=datastoreId)
 61 |         end_time = time.time()
 62 |         logging.info(f"Get Datastore  : {self.stopwatch(start_time,end_time)} ms")        
 63 |         return response
 64 |     
 65 |     
 66 |     def deleteDatastore(self, datastoreId):
 67 |         start_time = time.time()
 68 |         response = self.client.delete_datastore(datastoreId=datastoreId)
 69 |         end_time = time.time()
 70 |         logging.info(f"Delete Datastore  : {self.stopwatch(start_time,end_time)} ms")        
 71 |         return response
 72 |     
 73 |     
 74 |     def startImportJob(self, datastoreId, IamRoleArn, inputS3, outputS3):
 75 |         start_time = time.time()
 76 |         response = self.client.start_dicom_import_job(
 77 |             datastoreId=datastoreId,
 78 |             dataAccessRoleArn = IamRoleArn,
 79 |             inputS3Uri = inputS3,
 80 |             outputS3Uri = outputS3,
 81 |             clientToken = "demoClient"
 82 |         )
 83 |         end_time = time.time()
 84 |         logging.info(f"Start Import Job  : {self.stopwatch(start_time,end_time)} ms")        
 85 |         return response
 86 |     
 87 |     
 88 |     def getImportJob(self, datastoreId, jobId):
 89 |         start_time = time.time()
 90 |         response = self.client.get_dicom_import_job(datastoreId=datastoreId, jobId=jobId)
 91 |         end_time = time.time()
 92 |         logging.info(f"Get Import Job  : {self.stopwatch(start_time,end_time)} ms")        
 93 |         return response
 94 |     
 95 |     
 96 |     def getFramePixels(self, datastoreId, imageSetId, imageFrameId):
 97 |         start_time = time.time()
 98 |         res = self.client.get_image_frame(
 99 |             datastoreId=datastoreId,
100 |             imageSetId=imageSetId,
101 |             imageFrameInformation={
102 |                 'imageFrameId': imageFrameId
103 |             })
104 |         end_time = time.time()
105 |         logging.debug(f"Frame fetch     : {self.stopwatch(start_time,end_time)} ms") 
106 |         start_time = time.time() 
107 |         b = io.BytesIO()
108 |         b.write(res['imageFrameBlob'].read())
109 |         b.seek(0)
110 |         d = decode(b)
111 |         end_time = time.time()
112 |         logging.debug(f"Frame decode    : {self.stopwatch(start_time,end_time)} ms")    
113 |         return d 
114 | 
115 |     def getDICOMdataset(self, datastoreId, imageSetId):
116 |         logging.debug("Reading the JSON metadata file")
117 |         json_dicom_header = self.getMetadata(datastoreId , imageSetId)
118 | 
119 |         vrlist = []
120 |         sop_instances = []
121 |         
122 |         file_meta = FileMetaDataset()
123 |         file_meta.MediaStorageSOPClassUID = UID('1.2.840.10008.5.1.4.1.1.1')  # Computed Radiography Image Storage SOP Class UID
124 |         file_meta.MediaStorageSOPInstanceUID = UID("1.3.51.5145.5142.20010109.1105627.1.0.1")
125 |         file_meta.ImplementationClassUID = UID("1.2.826.0.1.3680043.9.3811.2.0.1")
126 |         file_meta.TransferSyntaxUID = UID('1.2.840.10008.1.2.1')  # Explicit VR Little Endian
127 |         
128 |         logging.debug("Reading the Pixels")
129 |         for series in json_dicom_header["Study"]["Series"]:
130 |             for instances in json_dicom_header["Study"]["Series"][series]["Instances"]:
131 |                 ds = Dataset()
132 |                 ds.file_meta = file_meta
133 |                 
134 |                 PatientLevel = json_dicom_header["Patient"]["DICOM"]
135 |                 self.getTags(PatientLevel, ds, vrlist)
136 |                 StudyLevel = json_dicom_header["Study"]["DICOM"]
137 |                 self.getTags(StudyLevel, ds, vrlist)
138 |                 self.getDICOMVRs(json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOMVRs"] , vrlist)
139 |                 self.getTags( json_dicom_header["Study"]["Series"][series]["Instances"][instances]["DICOM"] , ds, vrlist)
140 |                 self.getTags(json_dicom_header["Study"]["Series"][series]["DICOM"], ds, vrlist)
141 |                 
142 |                 ds.file_meta.TransferSyntaxUID = pydicom.uid.ExplicitVRLittleEndian
143 |                 ds.file_meta.MediaStorageSOPInstanceUID = UID(instances)
144 |                 ds.is_little_endian = True
145 |                 ds.is_implicit_VR = False
146 |                 
147 |                 frameId = json_dicom_header["Study"]["Series"][series]["Instances"][instances]["ImageFrames"][0]["ID"]
148 |                 pixels = self.getFramePixels(datastoreId, json_dicom_header["ImageSetID"], frameId)
149 |                 
150 |                 start_time = time.time()
151 |                 ds.PixelData = pixels.tobytes()
152 |                 sop_instances.append(ds)
153 |                 vrlist.clear()
154 |                 end_time = time.time()
155 |                 logging.debug(f"Output save     : {self.stopwatch(start_time,end_time)} ms")     
156 |         return sop_instances
157 |     
158 |     def getDICOMVRs(self, taglevel, vrlist):
159 |         for theKey in taglevel:
160 |             vrlist.append( [ theKey , taglevel[theKey] ])
161 |             logging.debug(f"[getDICOMVRs] - List of private tags VRs: {vrlist}\r\n")
162 | 
163 | 
164 |     def getTags(self, tagLevel, ds, vrlist):    
165 |         for theKey in tagLevel:
166 |             if theKey in ['PrivateCreatorID', 'FileMetaInformationVersion', '00291203']:
167 |                 continue
168 |             try:
169 |                 try:
170 |                     tagvr = pydicom.datadict.dictionary_VR(theKey)
171 |                 except Exception:  # If the VR is not in the pydicom dictionary, it might be a private tag listed in vrlist
172 |                     tagvr = None
173 |                     for vr in vrlist:
174 |                         if theKey == vr[0]:
175 |                             tagvr = vr[1]
176 |                 datavalue=tagLevel[theKey]
177 |                 #print(f"{tagvr} {theKey} : {datavalue}")
178 |                 if(tagvr == 'SQ'):
179 |                     logging.debug(f"{theKey} : {tagLevel[theKey]} , {vrlist}")
180 |                     seqs = []
181 |                     for underSeq in tagLevel[theKey]:
182 |                         seqds = Dataset()
183 |                         self.getTags(underSeq, seqds, vrlist)
184 |                         seqs.append(seqds)
185 |                     datavalue = Sequence(seqs)
186 |                     continue
187 |                 if(tagvr == 'US or SS'):
188 |                     datavalue=tagLevel[theKey]
189 |                     if (int(datavalue) > 32767):
190 |                         tagvr = 'US'
191 |                 if( tagvr == 'OB'):
192 |                     datavalue = self.getOBVRTagValue(tagLevel[theKey] )
193 |                     
194 |                 data_element = DataElement(theKey , tagvr , datavalue )
195 |                 if data_element.tag.group != 2:
196 |                     try:
197 |                         if (int(data_element.tag.group) % 2) == 0 : # we are skipping all the private tags
198 |                             ds.add(data_element) 
199 |                     except:
200 |                         continue
201 |             except Exception as err:
202 |                 logging.warning(f"[HLIDataDICOMizer][getTags] - {err} for Key: {theKey}")
203 |                 continue 
204 | 
205 | 
206 | 
207 |     def getOBVRTagValue(self, datalist):
208 |         bytevals = []
209 |         for byteval in datalist:
210 |             bytevals.append(int(byteval)) 
211 |         OBArray = bytearray(bytevals)
212 |         return bytes(OBArray)
213 | 
214 | 
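
Callers of `getDatastore` and `getImportJob` usually poll until a status is reached. A loop that waits for a single success value never exits if the resource ends in a failure state. The sketch below is one possible polling helper with explicit failure states and a timeout. The name `wait_for_status` and all of its parameters are hypothetical and not part of this module, and the status strings in the usage note are assumptions about the service's values.

```python
import time


def wait_for_status(get_status, success, failure=(), poll_seconds=30,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a value in `success`.

    Raises RuntimeError if a value in `failure` is seen, and
    TimeoutError once roughly `timeout_seconds` of waiting has elapsed.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    waited = 0
    while True:
        status = get_status()
        if status in success:
            return status
        if status in failure:
            raise RuntimeError(f"reached terminal status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With a `MedicalImaging` instance named `medicalimaging`, a call might look like `wait_for_status(lambda: medicalimaging.getImportJob(datastoreId, jobId)['jobProperties']['jobStatus'], success={'COMPLETED'}, failure={'FAILED'})`.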


--------------------------------------------------------------------------------
/store-multimodal-data/medical-imaging/store-imagingdata-with-awshealthimaging.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "3c073628-e7a4-4598-98cc-b6134c104446",
  7 |    "metadata": {
  8 |     "tags": []
  9 |    },
 10 |    "outputs": [],
 11 |    "source": [
 12 |     "%%sh\n",
 13 |     "pip install -q --upgrade pip\n",
 14 |     "pip install -q --upgrade boto3 botocore\n",
 15 |     "pip install -q tqdm nibabel pydicom numpy pylibjpeg-openjpeg "
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": null,
 21 |    "id": "a1971f92-2784-46f7-8528-47cdd530ae19",
 22 |    "metadata": {
 23 |     "tags": []
 24 |    },
 25 |    "outputs": [],
 26 |    "source": [
 27 |     "import pydicom\n",
 28 |     "from pydicom.sequence import Sequence\n",
 29 |     "from pydicom import Dataset , DataElement \n",
 30 |     "from pydicom.dataset import FileDataset, FileMetaDataset\n",
 31 |     "from pydicom.uid import UID\n",
 32 |     "from pydicom.pixel_data_handlers.util import convert_color_space , apply_color_lut\n",
 33 |     "from openjpeg import decode\n",
 34 |     "import array\n",
 35 |     "import json\n",
 36 |     "import logging\n",
 37 |     "import importlib  \n",
 38 |     "import boto3\n",
 39 |     "import sagemaker\n",
 40 |     "from sagemaker import get_execution_role\n",
 41 |     "import io\n",
 42 |     "import sys\n",
 43 |     "import time\n",
 44 |     "import os\n",
 45 |     "import pandas as pd\n",
 46 |     "from botocore.exceptions import ClientError\n",
 47 |     "logging.basicConfig( level=\"INFO\" )\n",
 48 |     "# logging.basicConfig( level=\"DEBUG\" )\n",
 49 |     "from src.Api import MedicalImaging \n",
 50 |     "medicalimaging = MedicalImaging()\n",
 51 |     "\n",
 52 |     "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n",
 53 |     "region = boto3.Session().region_name\n",
 54 |     "bucket = sagemaker.Session().default_bucket()\n",
 55 |     "role = get_execution_role()\n",
 56 |     "print(f\"S3 Bucket is {bucket}\")\n",
 57 |     "print(f\"IAM role is {role}\")"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "id": "af33e224-5a2e-42f4-b35e-70c04a93106c",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "Copy over Coherent DICOM images to Default SageMaker S3 bucket"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": null,
 71 |    "id": "6994f21d-f0d8-461a-9353-f288125ed474",
 72 |    "metadata": {
 73 |     "tags": []
 74 |    },
 75 |    "outputs": [],
 76 |    "source": [
 77 |     "!aws s3 sync s3://guidance-multimodal-hcls-healthai-machinelearning-{region}/imaging s3://{bucket}/imaging/ 2>&1 > /dev/null"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": null,
 83 |    "id": "fd781e6e-606e-4bad-aa36-c2a48fe7bd91",
 84 |    "metadata": {
 85 |     "tags": []
 86 |    },
 87 |    "outputs": [],
 88 |    "source": [
 89 |     "DatastoreName = \"WorkshopDataStore\"\n",
 90 |     "datastoreList = medicalimaging.listDatastores()\n",
 91 |     "\n",
 92 |     "res_createstore = None\n",
 93 |     "for datastore in datastoreList[\"datastoreSummaries\"]:\n",
 94 |     "    if datastore[\"datastoreName\"] == DatastoreName:\n",
 95 |     "        res_createstore = datastore\n",
 96 |     "        break\n",
 97 |     "if res_createstore is None:        \n",
 98 |     "    res_createstore = medicalimaging.createDatastore(DatastoreName)\n",
 99 |     "\n",
100 |     "datastoreId = res_createstore['datastoreId']\n",
101 |     "res_getstore = medicalimaging.getDatastore(res_createstore['datastoreId'])    \n",
102 |     "status = res_getstore['datastoreProperties']['datastoreStatus']\n",
103 |     "while status not in ('ACTIVE', 'CREATE_FAILED'):\n",
104 |     "    time.sleep(30)\n",
105 |     "    res_getstore = medicalimaging.getDatastore(res_createstore['datastoreId'])    \n",
106 |     "    status = res_getstore['datastoreProperties']['datastoreStatus']\n",
107 |     "    print(status)\n",
108 |     "print(f\"datastoreId: {datastoreId}; status: {status}\")"
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "code",
113 |    "execution_count": null,
114 |    "id": "ede3adfd-4f26-4ada-a75a-3cbff1ed9cc4",
115 |    "metadata": {
116 |     "tags": []
117 |    },
118 |    "outputs": [],
119 |    "source": [
120 |     "res_startimportjob = medicalimaging.startImportJob(\n",
121 |     "    res_createstore['datastoreId'],\n",
122 |     "    f\"arn:aws:iam::{account_id}:role/HealthImagingImportJobRole\",\n",
123 |     "    's3://'+bucket+'/imaging/', \n",
124 |     "    's3://'+bucket+'/ahi_importjob_output/'\n",
125 |     ")\n",
126 |     "\n",
127 |     "jobId = res_startimportjob['jobId']\n",
128 |     "jobstatus = medicalimaging.getImportJob(datastoreId, jobId)['jobProperties']['jobStatus']\n",
129 |     "while jobstatus not in ('COMPLETED', 'FAILED'):\n",
130 |     "    time.sleep(30)\n",
131 |     "    jobstatus = medicalimaging.getImportJob(datastoreId, jobId)['jobProperties']['jobStatus']\n",
132 |     "print(f\"jobstatus is {jobstatus}\")"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "code",
137 |    "execution_count": null,
138 |    "id": "525b2359-60ec-4395-830a-639709af8614",
139 |    "metadata": {
140 |     "tags": []
141 |    },
142 |    "outputs": [],
143 |    "source": [
144 |     "imageSetIds = {}\n",
145 |     "s3=boto3.client('s3')\n",
146 |     "try:\n",
147 |     "    response = s3.head_object(Bucket=bucket, Key=f\"ahi_importjob_output/{datastoreId}-DicomImport-{jobId}/job-output-manifest.json\")\n",
148 |     "    if response['ResponseMetadata']['HTTPStatusCode'] == 200:\n",
149 |     "        data = s3.get_object(Bucket=bucket, Key=f\"ahi_importjob_output/{datastoreId}-DicomImport-{jobId}/SUCCESS/success.ndjson\")\n",
150 |     "        contents = data['Body'].read().decode(\"utf-8\")\n",
151 |     "        for l in contents.splitlines():\n",
152 |     "            isid = json.loads(l)['importResponse']['imageSetId']\n",
153 |     "            if isid in imageSetIds:\n",
154 |     "                imageSetIds[isid]+=1\n",
155 |     "            else:\n",
156 |     "                imageSetIds[isid]=1\n",
157 |     "except ClientError:\n",
158 |     "    pass\n",
159 |     "\n",
160 |     "\n",
161 |     "print(\"number of image sets: {}\".format(len(imageSetIds)))"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": null,
167 |    "id": "378fee7b-2b32-4985-a6b0-de722495c966",
168 |    "metadata": {
169 |     "tags": []
170 |    },
171 |    "outputs": [],
172 |    "source": [
173 |     "%store datastoreId\n",
174 |     "%store imageSetIds\n",
175 |     "%store jobId"
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "markdown",
180 |    "id": "1f5a7471-e9ad-45ea-a96c-572a774a7529",
181 |    "metadata": {},
182 |    "source": [
183 |     "## (Optional) Save JSON to S3"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": null,
189 |    "id": "58885cdc-a662-4e43-9f3c-d215719a7931",
190 |    "metadata": {
191 |     "tags": []
192 |    },
193 |    "outputs": [],
194 |    "source": [
195 |     "for s in imageSetIds.keys():\n",
196 |     "    json_dicom_header = medicalimaging.getMetadata(datastoreId, s)\n",
197 |     "    patient = json_dicom_header['Patient']['DICOM']\n",
198 |     "    patient['imagesetid'] = s\n",
199 |     "    s3.put_object(\n",
200 |     "        Body=json.dumps(patient),\n",
201 |     "        Bucket=bucket,\n",
202 |     "        Key='dicom_header/json/patient/{}'.format(s)\n",
203 |     "    )\n",
204 |     "    study=json_dicom_header['Study']['DICOM']\n",
205 |     "    study['imagesetid'] = s\n",
206 |     "    s3.put_object(\n",
207 |     "        Body=json.dumps(study),\n",
208 |     "        Bucket=bucket,\n",
209 |     "        Key='dicom_header/json/study/{}'.format(s)\n",
210 |     "    )\n",
211 |     "    for se in list(json_dicom_header['Study']['Series'].keys()):\n",
212 |     "        s3.put_object(\n",
213 |     "            Body=json.dumps(json_dicom_header['Study']['Series'][se]['DICOM']),\n",
214 |     "            Bucket=bucket,\n",
215 |     "            Key='dicom_header/json/series/{}'.format(s)\n",
216 |     "        )\n",
217 |     "        for i in list(json_dicom_header['Study']['Series'][se]['Instances']):\n",
218 |     "            s3.put_object(\n",
219 |     "                Body=json.dumps(json_dicom_header['Study']['Series'][se]['Instances'][i]),\n",
220 |     "                Bucket=bucket,\n",
221 |     "                Key='dicom_header/json/series/{}'.format(s)\n",
222 |     "            )"
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "markdown",
227 |    "id": "1b99d75c-f3a5-42f1-b4f8-b46f756dffbe",
228 |    "metadata": {},
229 |    "source": [
230 |     "## Clean Up"
231 |    ]
232 |   },
233 |   {
234 |    "cell_type": "code",
235 |    "execution_count": null,
236 |    "id": "ec1c8cbb-8d90-4806-a873-655ccb9b20fa",
237 |    "metadata": {},
238 |    "outputs": [],
239 |    "source": [
240 |     "try:\n",
241 |     "    s3res = boto3.resource('s3')\n",
242 |     "    bucket = s3res.Bucket(InputBucketName)\n",
243 |     "    bucket.object_versions.delete()\n",
244 |     "    s3.delete_bucket(Bucket=InputBucketName)\n",
245 |     "    bucket = s3res.Bucket(OutputBucketName)\n",
246 |     "    bucket.object_versions.delete()\n",
247 |     "    s3.delete_bucket(Bucket=OutputBucketName)\n",
248 |     "except ClientError  as e:\n",
249 |     "    if e.response['Error']['Code'] == 'NoSuchBucket':\n",
250 |     "        print(\"Bucket already deleted\")\n",
251 |     "    \n",
252 |     "try: \n",
253 |     "    resp = iam.detach_role_policy(PolicyArn=respons_createpolicy['Policy']['Arn'],RoleName=response_createrole['Role']['RoleName'])\n",
254 |     "    resp = iam.delete_policy(PolicyArn=respons_createpolicy['Policy']['Arn'])\n",
255 |     "    resp = iam.delete_role(RoleName=response_createrole['Role']['RoleName'])\n",
256 |     "except ClientError as ee:\n",
257 |     "    if ee.response['Error']['Code'] == 'NoSuchEntity':\n",
258 |     "        print(\"Policy not attached, ignore\")\n",
259 |     "    else: \n",
260 |     "        print(ee)"
261 |    ]
262 |   }
263 |  ],
264 |  "metadata": {
265 |   "availableInstances": [
266 |    {
267 |     "_defaultOrder": 0,
268 |     "_isFastLaunch": true,
269 |     "category": "General purpose",
270 |     "gpuNum": 0,
271 |     "hideHardwareSpecs": false,
272 |     "memoryGiB": 4,
273 |     "name": "ml.t3.medium",
274 |     "vcpuNum": 2
275 |    },
276 |    {
277 |     "_defaultOrder": 1,
278 |     "_isFastLaunch": false,
279 |     "category": "General purpose",
280 |     "gpuNum": 0,
281 |     "hideHardwareSpecs": false,
282 |     "memoryGiB": 8,
283 |     "name": "ml.t3.large",
284 |     "vcpuNum": 2
285 |    },
286 |    {
287 |     "_defaultOrder": 2,
288 |     "_isFastLaunch": false,
289 |     "category": "General purpose",
290 |     "gpuNum": 0,
291 |     "hideHardwareSpecs": false,
292 |     "memoryGiB": 16,
293 |     "name": "ml.t3.xlarge",
294 |     "vcpuNum": 4
295 |    },
296 |    {
297 |     "_defaultOrder": 3,
298 |     "_isFastLaunch": false,
299 |     "category": "General purpose",
300 |     "gpuNum": 0,
301 |     "hideHardwareSpecs": false,
302 |     "memoryGiB": 32,
303 |     "name": "ml.t3.2xlarge",
304 |     "vcpuNum": 8
305 |    },
306 |    {
307 |     "_defaultOrder": 4,
308 |     "_isFastLaunch": true,
309 |     "category": "General purpose",
310 |     "gpuNum": 0,
311 |     "hideHardwareSpecs": false,
312 |     "memoryGiB": 8,
313 |     "name": "ml.m5.large",
314 |     "vcpuNum": 2
315 |    },
316 |    {
317 |     "_defaultOrder": 5,
318 |     "_isFastLaunch": false,
319 |     "category": "General purpose",
320 |     "gpuNum": 0,
321 |     "hideHardwareSpecs": false,
322 |     "memoryGiB": 16,
323 |     "name": "ml.m5.xlarge",
324 |     "vcpuNum": 4
325 |    },
326 |    {
327 |     "_defaultOrder": 6,
328 |     "_isFastLaunch": false,
329 |     "category": "General purpose",
330 |     "gpuNum": 0,
331 |     "hideHardwareSpecs": false,
332 |     "memoryGiB": 32,
333 |     "name": "ml.m5.2xlarge",
334 |     "vcpuNum": 8
335 |    },
336 |    {
337 |     "_defaultOrder": 7,
338 |     "_isFastLaunch": false,
339 |     "category": "General purpose",
340 |     "gpuNum": 0,
341 |     "hideHardwareSpecs": false,
342 |     "memoryGiB": 64,
343 |     "name": "ml.m5.4xlarge",
344 |     "vcpuNum": 16
345 |    },
346 |    {
347 |     "_defaultOrder": 8,
348 |     "_isFastLaunch": false,
349 |     "category": "General purpose",
350 |     "gpuNum": 0,
351 |     "hideHardwareSpecs": false,
352 |     "memoryGiB": 128,
353 |     "name": "ml.m5.8xlarge",
354 |     "vcpuNum": 32
355 |    },
356 |    {
357 |     "_defaultOrder": 9,
358 |     "_isFastLaunch": false,
359 |     "category": "General purpose",
360 |     "gpuNum": 0,
361 |     "hideHardwareSpecs": false,
362 |     "memoryGiB": 192,
363 |     "name": "ml.m5.12xlarge",
364 |     "vcpuNum": 48
365 |    },
366 |    {
367 |     "_defaultOrder": 10,
368 |     "_isFastLaunch": false,
369 |     "category": "General purpose",
370 |     "gpuNum": 0,
371 |     "hideHardwareSpecs": false,
372 |     "memoryGiB": 256,
373 |     "name": "ml.m5.16xlarge",
374 |     "vcpuNum": 64
375 |    },
376 |    {
377 |     "_defaultOrder": 11,
378 |     "_isFastLaunch": false,
379 |     "category": "General purpose",
380 |     "gpuNum": 0,
381 |     "hideHardwareSpecs": false,
382 |     "memoryGiB": 384,
383 |     "name": "ml.m5.24xlarge",
384 |     "vcpuNum": 96
385 |    },
386 |    {
387 |     "_defaultOrder": 12,
388 |     "_isFastLaunch": false,
389 |     "category": "General purpose",
390 |     "gpuNum": 0,
391 |     "hideHardwareSpecs": false,
392 |     "memoryGiB": 8,
393 |     "name": "ml.m5d.large",
394 |     "vcpuNum": 2
395 |    },
396 |    {
397 |     "_defaultOrder": 13,
398 |     "_isFastLaunch": false,
399 |     "category": "General purpose",
400 |     "gpuNum": 0,
401 |     "hideHardwareSpecs": false,
402 |     "memoryGiB": 16,
403 |     "name": "ml.m5d.xlarge",
404 |     "vcpuNum": 4
405 |    },
406 |    {
407 |     "_defaultOrder": 14,
408 |     "_isFastLaunch": false,
409 |     "category": "General purpose",
410 |     "gpuNum": 0,
411 |     "hideHardwareSpecs": false,
412 |     "memoryGiB": 32,
413 |     "name": "ml.m5d.2xlarge",
414 |     "vcpuNum": 8
415 |    },
416 |    {
417 |     "_defaultOrder": 15,
418 |     "_isFastLaunch": false,
419 |     "category": "General purpose",
420 |     "gpuNum": 0,
421 |     "hideHardwareSpecs": false,
422 |     "memoryGiB": 64,
423 |     "name": "ml.m5d.4xlarge",
424 |     "vcpuNum": 16
425 |    },
426 |    {
427 |     "_defaultOrder": 16,
428 |     "_isFastLaunch": false,
429 |     "category": "General purpose",
430 |     "gpuNum": 0,
431 |     "hideHardwareSpecs": false,
432 |     "memoryGiB": 128,
433 |     "name": "ml.m5d.8xlarge",
434 |     "vcpuNum": 32
435 |    },
436 |    {
437 |     "_defaultOrder": 17,
438 |     "_isFastLaunch": false,
439 |     "category": "General purpose",
440 |     "gpuNum": 0,
441 |     "hideHardwareSpecs": false,
442 |     "memoryGiB": 192,
443 |     "name": "ml.m5d.12xlarge",
444 |     "vcpuNum": 48
445 |    },
446 |    {
447 |     "_defaultOrder": 18,
448 |     "_isFastLaunch": false,
449 |     "category": "General purpose",
450 |     "gpuNum": 0,
451 |     "hideHardwareSpecs": false,
452 |     "memoryGiB": 256,
453 |     "name": "ml.m5d.16xlarge",
454 |     "vcpuNum": 64
455 |    },
456 |    {
457 |     "_defaultOrder": 19,
458 |     "_isFastLaunch": false,
459 |     "category": "General purpose",
460 |     "gpuNum": 0,
461 |     "hideHardwareSpecs": false,
462 |     "memoryGiB": 384,
463 |     "name": "ml.m5d.24xlarge",
464 |     "vcpuNum": 96
465 |    },
466 |    {
467 |     "_defaultOrder": 20,
468 |     "_isFastLaunch": false,
469 |     "category": "General purpose",
470 |     "gpuNum": 0,
471 |     "hideHardwareSpecs": true,
472 |     "memoryGiB": 0,
473 |     "name": "ml.geospatial.interactive",
474 |     "supportedImageNames": [
475 |      "sagemaker-geospatial-v1-0"
476 |     ],
477 |     "vcpuNum": 0
478 |    },
479 |    {
480 |     "_defaultOrder": 21,
481 |     "_isFastLaunch": true,
482 |     "category": "Compute optimized",
483 |     "gpuNum": 0,
484 |     "hideHardwareSpecs": false,
485 |     "memoryGiB": 4,
486 |     "name": "ml.c5.large",
487 |     "vcpuNum": 2
488 |    },
489 |    {
490 |     "_defaultOrder": 22,
491 |     "_isFastLaunch": false,
492 |     "category": "Compute optimized",
493 |     "gpuNum": 0,
494 |     "hideHardwareSpecs": false,
495 |     "memoryGiB": 8,
496 |     "name": "ml.c5.xlarge",
497 |     "vcpuNum": 4
498 |    },
499 |    {
500 |     "_defaultOrder": 23,
501 |     "_isFastLaunch": false,
502 |     "category": "Compute optimized",
503 |     "gpuNum": 0,
504 |     "hideHardwareSpecs": false,
505 |     "memoryGiB": 16,
506 |     "name": "ml.c5.2xlarge",
507 |     "vcpuNum": 8
508 |    },
509 |    {
510 |     "_defaultOrder": 24,
511 |     "_isFastLaunch": false,
512 |     "category": "Compute optimized",
513 |     "gpuNum": 0,
514 |     "hideHardwareSpecs": false,
515 |     "memoryGiB": 32,
516 |     "name": "ml.c5.4xlarge",
517 |     "vcpuNum": 16
518 |    },
519 |    {
520 |     "_defaultOrder": 25,
521 |     "_isFastLaunch": false,
522 |     "category": "Compute optimized",
523 |     "gpuNum": 0,
524 |     "hideHardwareSpecs": false,
525 |     "memoryGiB": 72,
526 |     "name": "ml.c5.9xlarge",
527 |     "vcpuNum": 36
528 |    },
529 |    {
530 |     "_defaultOrder": 26,
531 |     "_isFastLaunch": false,
532 |     "category": "Compute optimized",
533 |     "gpuNum": 0,
534 |     "hideHardwareSpecs": false,
535 |     "memoryGiB": 96,
536 |     "name": "ml.c5.12xlarge",
537 |     "vcpuNum": 48
538 |    },
539 |    {
540 |     "_defaultOrder": 27,
541 |     "_isFastLaunch": false,
542 |     "category": "Compute optimized",
543 |     "gpuNum": 0,
544 |     "hideHardwareSpecs": false,
545 |     "memoryGiB": 144,
546 |     "name": "ml.c5.18xlarge",
547 |     "vcpuNum": 72
548 |    },
549 |    {
550 |     "_defaultOrder": 28,
551 |     "_isFastLaunch": false,
552 |     "category": "Compute optimized",
553 |     "gpuNum": 0,
554 |     "hideHardwareSpecs": false,
555 |     "memoryGiB": 192,
556 |     "name": "ml.c5.24xlarge",
557 |     "vcpuNum": 96
558 |    },
559 |    {
560 |     "_defaultOrder": 29,
561 |     "_isFastLaunch": true,
562 |     "category": "Accelerated computing",
563 |     "gpuNum": 1,
564 |     "hideHardwareSpecs": false,
565 |     "memoryGiB": 16,
566 |     "name": "ml.g4dn.xlarge",
567 |     "vcpuNum": 4
568 |    },
569 |    {
570 |     "_defaultOrder": 30,
571 |     "_isFastLaunch": false,
572 |     "category": "Accelerated computing",
573 |     "gpuNum": 1,
574 |     "hideHardwareSpecs": false,
575 |     "memoryGiB": 32,
576 |     "name": "ml.g4dn.2xlarge",
577 |     "vcpuNum": 8
578 |    },
579 |    {
580 |     "_defaultOrder": 31,
581 |     "_isFastLaunch": false,
582 |     "category": "Accelerated computing",
583 |     "gpuNum": 1,
584 |     "hideHardwareSpecs": false,
585 |     "memoryGiB": 64,
586 |     "name": "ml.g4dn.4xlarge",
587 |     "vcpuNum": 16
588 |    },
589 |    {
590 |     "_defaultOrder": 32,
591 |     "_isFastLaunch": false,
592 |     "category": "Accelerated computing",
593 |     "gpuNum": 1,
594 |     "hideHardwareSpecs": false,
595 |     "memoryGiB": 128,
596 |     "name": "ml.g4dn.8xlarge",
597 |     "vcpuNum": 32
598 |    },
599 |    {
600 |     "_defaultOrder": 33,
601 |     "_isFastLaunch": false,
602 |     "category": "Accelerated computing",
603 |     "gpuNum": 4,
604 |     "hideHardwareSpecs": false,
605 |     "memoryGiB": 192,
606 |     "name": "ml.g4dn.12xlarge",
607 |     "vcpuNum": 48
608 |    },
609 |    {
610 |     "_defaultOrder": 34,
611 |     "_isFastLaunch": false,
612 |     "category": "Accelerated computing",
613 |     "gpuNum": 1,
614 |     "hideHardwareSpecs": false,
615 |     "memoryGiB": 256,
616 |     "name": "ml.g4dn.16xlarge",
617 |     "vcpuNum": 64
618 |    },
619 |    {
620 |     "_defaultOrder": 35,
621 |     "_isFastLaunch": false,
622 |     "category": "Accelerated computing",
623 |     "gpuNum": 1,
624 |     "hideHardwareSpecs": false,
625 |     "memoryGiB": 61,
626 |     "name": "ml.p3.2xlarge",
627 |     "vcpuNum": 8
628 |    },
629 |    {
630 |     "_defaultOrder": 36,
631 |     "_isFastLaunch": false,
632 |     "category": "Accelerated computing",
633 |     "gpuNum": 4,
634 |     "hideHardwareSpecs": false,
635 |     "memoryGiB": 244,
636 |     "name": "ml.p3.8xlarge",
637 |     "vcpuNum": 32
638 |    },
639 |    {
640 |     "_defaultOrder": 37,
641 |     "_isFastLaunch": false,
642 |     "category": "Accelerated computing",
643 |     "gpuNum": 8,
644 |     "hideHardwareSpecs": false,
645 |     "memoryGiB": 488,
646 |     "name": "ml.p3.16xlarge",
647 |     "vcpuNum": 64
648 |    },
649 |    {
650 |     "_defaultOrder": 38,
651 |     "_isFastLaunch": false,
652 |     "category": "Accelerated computing",
653 |     "gpuNum": 8,
654 |     "hideHardwareSpecs": false,
655 |     "memoryGiB": 768,
656 |     "name": "ml.p3dn.24xlarge",
657 |     "vcpuNum": 96
658 |    },
659 |    {
660 |     "_defaultOrder": 39,
661 |     "_isFastLaunch": false,
662 |     "category": "Memory Optimized",
663 |     "gpuNum": 0,
664 |     "hideHardwareSpecs": false,
665 |     "memoryGiB": 16,
666 |     "name": "ml.r5.large",
667 |     "vcpuNum": 2
668 |    },
669 |    {
670 |     "_defaultOrder": 40,
671 |     "_isFastLaunch": false,
672 |     "category": "Memory Optimized",
673 |     "gpuNum": 0,
674 |     "hideHardwareSpecs": false,
675 |     "memoryGiB": 32,
676 |     "name": "ml.r5.xlarge",
677 |     "vcpuNum": 4
678 |    },
679 |    {
680 |     "_defaultOrder": 41,
681 |     "_isFastLaunch": false,
682 |     "category": "Memory Optimized",
683 |     "gpuNum": 0,
684 |     "hideHardwareSpecs": false,
685 |     "memoryGiB": 64,
686 |     "name": "ml.r5.2xlarge",
687 |     "vcpuNum": 8
688 |    },
689 |    {
690 |     "_defaultOrder": 42,
691 |     "_isFastLaunch": false,
692 |     "category": "Memory Optimized",
693 |     "gpuNum": 0,
694 |     "hideHardwareSpecs": false,
695 |     "memoryGiB": 128,
696 |     "name": "ml.r5.4xlarge",
697 |     "vcpuNum": 16
698 |    },
699 |    {
700 |     "_defaultOrder": 43,
701 |     "_isFastLaunch": false,
702 |     "category": "Memory Optimized",
703 |     "gpuNum": 0,
704 |     "hideHardwareSpecs": false,
705 |     "memoryGiB": 256,
706 |     "name": "ml.r5.8xlarge",
707 |     "vcpuNum": 32
708 |    },
709 |    {
710 |     "_defaultOrder": 44,
711 |     "_isFastLaunch": false,
712 |     "category": "Memory Optimized",
713 |     "gpuNum": 0,
714 |     "hideHardwareSpecs": false,
715 |     "memoryGiB": 384,
716 |     "name": "ml.r5.12xlarge",
717 |     "vcpuNum": 48
718 |    },
719 |    {
720 |     "_defaultOrder": 45,
721 |     "_isFastLaunch": false,
722 |     "category": "Memory Optimized",
723 |     "gpuNum": 0,
724 |     "hideHardwareSpecs": false,
725 |     "memoryGiB": 512,
726 |     "name": "ml.r5.16xlarge",
727 |     "vcpuNum": 64
728 |    },
729 |    {
730 |     "_defaultOrder": 46,
731 |     "_isFastLaunch": false,
732 |     "category": "Memory Optimized",
733 |     "gpuNum": 0,
734 |     "hideHardwareSpecs": false,
735 |     "memoryGiB": 768,
736 |     "name": "ml.r5.24xlarge",
737 |     "vcpuNum": 96
738 |    },
739 |    {
740 |     "_defaultOrder": 47,
741 |     "_isFastLaunch": false,
742 |     "category": "Accelerated computing",
743 |     "gpuNum": 1,
744 |     "hideHardwareSpecs": false,
745 |     "memoryGiB": 16,
746 |     "name": "ml.g5.xlarge",
747 |     "vcpuNum": 4
748 |    },
749 |    {
750 |     "_defaultOrder": 48,
751 |     "_isFastLaunch": false,
752 |     "category": "Accelerated computing",
753 |     "gpuNum": 1,
754 |     "hideHardwareSpecs": false,
755 |     "memoryGiB": 32,
756 |     "name": "ml.g5.2xlarge",
757 |     "vcpuNum": 8
758 |    },
759 |    {
760 |     "_defaultOrder": 49,
761 |     "_isFastLaunch": false,
762 |     "category": "Accelerated computing",
763 |     "gpuNum": 1,
764 |     "hideHardwareSpecs": false,
765 |     "memoryGiB": 64,
766 |     "name": "ml.g5.4xlarge",
767 |     "vcpuNum": 16
768 |    },
769 |    {
770 |     "_defaultOrder": 50,
771 |     "_isFastLaunch": false,
772 |     "category": "Accelerated computing",
773 |     "gpuNum": 1,
774 |     "hideHardwareSpecs": false,
775 |     "memoryGiB": 128,
776 |     "name": "ml.g5.8xlarge",
777 |     "vcpuNum": 32
778 |    },
779 |    {
780 |     "_defaultOrder": 51,
781 |     "_isFastLaunch": false,
782 |     "category": "Accelerated computing",
783 |     "gpuNum": 1,
784 |     "hideHardwareSpecs": false,
785 |     "memoryGiB": 256,
786 |     "name": "ml.g5.16xlarge",
787 |     "vcpuNum": 64
788 |    },
789 |    {
790 |     "_defaultOrder": 52,
791 |     "_isFastLaunch": false,
792 |     "category": "Accelerated computing",
793 |     "gpuNum": 4,
794 |     "hideHardwareSpecs": false,
795 |     "memoryGiB": 192,
796 |     "name": "ml.g5.12xlarge",
797 |     "vcpuNum": 48
798 |    },
799 |    {
800 |     "_defaultOrder": 53,
801 |     "_isFastLaunch": false,
802 |     "category": "Accelerated computing",
803 |     "gpuNum": 4,
804 |     "hideHardwareSpecs": false,
805 |     "memoryGiB": 384,
806 |     "name": "ml.g5.24xlarge",
807 |     "vcpuNum": 96
808 |    },
809 |    {
810 |     "_defaultOrder": 54,
811 |     "_isFastLaunch": false,
812 |     "category": "Accelerated computing",
813 |     "gpuNum": 8,
814 |     "hideHardwareSpecs": false,
815 |     "memoryGiB": 768,
816 |     "name": "ml.g5.48xlarge",
817 |     "vcpuNum": 192
818 |    },
819 |    {
820 |     "_defaultOrder": 55,
821 |     "_isFastLaunch": false,
822 |     "category": "Accelerated computing",
823 |     "gpuNum": 8,
824 |     "hideHardwareSpecs": false,
825 |     "memoryGiB": 1152,
826 |     "name": "ml.p4d.24xlarge",
827 |     "vcpuNum": 96
828 |    },
829 |    {
830 |     "_defaultOrder": 56,
831 |     "_isFastLaunch": false,
832 |     "category": "Accelerated computing",
833 |     "gpuNum": 8,
834 |     "hideHardwareSpecs": false,
835 |     "memoryGiB": 1152,
836 |     "name": "ml.p4de.24xlarge",
837 |     "vcpuNum": 96
838 |    }
839 |   ],
840 |   "instance_type": "ml.t3.medium",
841 |   "kernelspec": {
842 |    "display_name": "Python 3 (Data Science 3.0)",
843 |    "language": "python",
844 |    "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
845 |   },
846 |   "language_info": {
847 |    "codemirror_mode": {
848 |     "name": "ipython",
849 |     "version": 3
850 |    },
851 |    "file_extension": ".py",
852 |    "mimetype": "text/x-python",
853 |    "name": "python",
854 |    "nbconvert_exporter": "python",
855 |    "pygments_lexer": "ipython3",
856 |    "version": "3.10.6"
857 |   }
858 |  },
859 |  "nbformat": 4,
860 |  "nbformat_minor": 5
861 | }
862 | 


--------------------------------------------------------------------------------
/train-test-ml-model/README.md:
--------------------------------------------------------------------------------
 1 | ## Results from tests conducted on four different outcomes
 2 | 
 3 | The train-test-model.ipynb notebook shows the test results for four different outcomes on the entire dataset of clinical, genomic, and imaging features. The four outcomes tested are Hypertension, Stroke, Alzheimer's disease, and Coronary heart disease. Tests were conducted for each modality separately and then with all three combined. In the majority of cases, using features from all modalities improves the relevant metrics. Please note: because the models were trained on synthetic data, the results may not reflect real-world performance.
 4 | 
 5 | |         **Hypertension**         | **Accuracy** | **Precision** | **Recall** | **F1** |
 6 | |:--------------------------------:|:------------:|:-------------:|:----------:|:------:|
 7 | | **Clinical**                     |         0.80 |          0.79 |       0.80 |   0.79 |
 8 | | **Genomic**                      |         0.70 |          0.49 |       0.70 |   0.58 |
 9 | | **Imaging**                      |         0.70 |          0.49 |       0.70 |   0.58 |
10 | | **Clinical + Genomic + Imaging** |         0.87 |          0.91 |       0.87 |   0.87 |
11 | |                                  |              |               |            |        |
12 | |    **Coronary heart disease**    | **Accuracy** | **Precision** | **Recall** | **F1** |
13 | | **Clinical**                     |         0.63 |          0.76 |       0.63 |   0.68 |
14 | | **Genomic**                      |         0.87 |          0.75 |       0.87 |   0.80 |
15 | | **Imaging**                      |         0.80 |          0.74 |       0.80 |   0.77 |
16 | | **Clinical + Genomic + Imaging** |         0.83 |          0.85 |       0.83 |   0.84 |
17 | |                                  |              |               |            |        |
18 | |            **Stroke**            | **Accuracy** | **Precision** | **Recall** | **F1** |
19 | | **Clinical**                     |         0.53 |          0.54 |       0.53 |   0.53 |
20 | | **Genomic**                      |         0.50 |          0.50 |       0.50 |   0.50 |
21 | | **Imaging**                      |         0.47 |          0.48 |       0.47 |   0.45 |
22 | | **Clinical + Genomic + Imaging** |         0.97 |          0.97 |       0.97 |   0.97 |
23 | |                                  |              |               |            |        |
24 | |      **Alzheimer's disease**     | **Accuracy** | **Precision** | **Recall** | **F1** |
25 | | **Clinical**                     |         0.73 |          0.53 |       0.73 |   0.62 |
26 | | **Genomic**                      |         0.73 |          0.53 |       0.73 |   0.62 |
27 | | **Imaging**                      |         0.97 |          0.93 |       0.97 |   0.95 |
28 | | **Clinical + Genomic + Imaging** |         0.90 |          0.84 |       0.80 |   0.75 |
29 | 
30 | The hypertension-train-test-deploy.ipynb notebook shows an example of how to deploy the hypertension AutoGluon model to an Amazon SageMaker endpoint and run inference against it. A similar approach can be used to deploy models for the other outcomes.
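
The Accuracy, Precision, Recall, and F1 columns in the tables above can be computed for any set of predictions with a few lines of Python. The sketch below is a minimal, dependency-free illustration. It assumes the per-class scores are averaged with weights proportional to class support; this README does not state which averaging the notebook uses, so treat that as an assumption. The function name `weighted_metrics` and the ten example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0  # 0 when the class is never predicted
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # weight each class by its share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 7 negatives, 3 positives, and a model that
# always predicts the majority (negative) class.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0] * 10

metrics = weighted_metrics(y_true, y_pred)
print({name: round(value, 2) for name, value in metrics.items()})
# prints {'accuracy': 0.7, 'precision': 0.49, 'recall': 0.7, 'f1': 0.58}
```

Under support weighting, recall always equals accuracy, and a model that predicts only the majority class on a 70/30 split scores 0.70 / 0.49 / 0.70 / 0.58. That is one possible reading of the identical Genomic and Imaging rows in the Hypertension table, not a confirmed explanation.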


--------------------------------------------------------------------------------
/train-test-ml-model/train-test-model.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "id": "825b615a-d04a-4c3e-8a14-a80257e2ed18",
   6 |    "metadata": {},
   7 |    "source": [
   8 |     "# Clinical, Genomic, and Imaging data - Training and Testing "
   9 |    ]
  10 |   },
  11 |   {
  12 |    "cell_type": "markdown",
  13 |    "id": "c062ea3f-bdbf-4496-9cf6-d0893261c440",
  14 |    "metadata": {
  15 |     "tags": []
  16 |    },
  17 |    "source": [
  18 |     "---\n",
  19 |     "This notebook demonstrates the use of Amazon SageMaker [AutoGluon-Tabular](https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html) algorithm to train and test a tabular binary classification model. Tabular classification is the task of assigning a class to an example of structured or relational data. The Amazon SageMaker API for tabular classification can be used for classification of an example in two classes (binary classification) or more than two classes (multi-class classification).\n",
  20 |     "\n",
  21 |     "In this notebook, we demonstrate the following steps for building tabular classification models using the [Synthea Coherent Data Set](https://registry.opendata.aws/synthea-coherent-data/):\n",
  22 |     "\n",
  23 |     "* How to get features from Amazon SageMaker FeatureStore. The clinical, genomic, and imaging notebooks in preprocess-multimodal-data need to be run before running this notebook.\n",
  24 |     "* How to train a tabular model on a multimodal dataset to do binary classification. This notebook shows examples for four different outcomes: Alzheimer's disease, Coronary heart disease, Stroke, and Hypertension.\n",
  25 |     "* How to evaluate predictions on the out-of-sample test data.\n",
  26 |     "\n",
  27 |     "Note: This notebook was tested in Amazon SageMaker Studio on an ml.t3.xlarge instance with the Python 3 (Data Science 3.0) kernel.\n",
  28 |     "\n",
  29 |     "---"
  30 |    ]
  31 |   },
  32 |   {
  33 |    "cell_type": "code",
  34 |    "execution_count": null,
  35 |    "id": "9515c767-f825-4691-a8dc-ebe2016e6ae2",
  36 |    "metadata": {
  37 |     "tags": []
  38 |    },
  39 |    "outputs": [],
  40 |    "source": [
  41 |     "import boto3\n",
  42 |     "import sagemaker\n",
  43 |     "from sagemaker.session import Session\n",
  44 |     "from sagemaker import get_execution_role\n",
  45 |     "import pandas as pd\n",
  46 |     "import io, os\n",
  47 |     "import sys\n",
  48 |     "from sklearn.model_selection import train_test_split"
  49 |    ]
  50 |   },
  51 |   {
  52 |    "cell_type": "code",
  53 |    "execution_count": null,
  54 |    "outputs": [],
  55 |    "source": [
  56 |     "!pip install autogluon"
  57 |    ],
  58 |    "metadata": {
  59 |     "collapsed": false
  60 |    },
  61 |    "id": "339915720e97ea08"
  62 |   },
  63 |   {
  64 |    "cell_type": "code",
  65 |    "execution_count": null,
  66 |    "id": "55f9a0b7-fc8f-4c71-8732-9ac79f0faee6",
  67 |    "metadata": {
  68 |     "tags": []
  69 |    },
  70 |    "outputs": [],
  71 |    "source": [
  72 |     "from autogluon.tabular import TabularPredictor\n",
  73 |     "import autogluon as ag"
  74 |    ]
  75 |   },
  76 |   {
  77 |    "cell_type": "markdown",
  78 |    "id": "aaa341b3-3db4-42ca-b40d-ed3f904af38e",
  79 |    "metadata": {},
  80 |    "source": [
  81 |     "## Get data type to train model"
  82 |    ]
  83 |   },
  84 |   {
  85 |    "cell_type": "code",
  86 |    "execution_count": null,
  87 |    "id": "9588fcd1-5671-4266-bc97-8607c4d6b6ca",
  88 |    "metadata": {
  89 |     "tags": []
  90 |    },
  91 |    "outputs": [],
  92 |    "source": [
  93 |     "data_type = 'genomic-clinical-imaging'\n",
  94 |     "PatientID = 'patientid'"
  95 |    ]
  96 |   },
  97 |   {
  98 |    "cell_type": "markdown",
  99 |    "id": "f648bda3-d7c7-4c0e-984a-d96e925e25fd",
 100 |    "metadata": {},
 101 |    "source": [
 102 |     "## Set up S3 buckets and session"
 103 |    ]
 104 |   },
 105 |   {
 106 |    "cell_type": "code",
 107 |    "execution_count": null,
 108 |    "id": "191ef27d-efbe-4229-a2c8-2f4fcd02b581",
 109 |    "metadata": {
 110 |     "tags": []
 111 |    },
 112 |    "outputs": [],
 113 |    "source": [
 114 |     "sm_session = sagemaker.Session()\n",
 115 |     "bucket = sm_session.default_bucket()\n",
 116 |     "region = boto3.Session().region_name\n",
 117 |     "role = get_execution_role()\n",
 118 |     "\n",
 119 |     "boto_session = boto3.Session(region_name=region)\n",
 120 |     "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n",
 121 |     "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n",
 122 |     "\n",
 123 |     "feature_store_session = Session(\n",
 124 |     "    boto_session=boto_session,\n",
 125 |     "    sagemaker_client=sagemaker_client,\n",
 126 |     "    sagemaker_featurestore_runtime_client=featurestore_runtime\n",
 127 |     ")\n",
 128 |     "\n",
 129 |     "s3_client = boto3.client('s3', region_name=region)\n",
 130 |     "\n",
 131 |     "default_s3_bucket_name = sm_session.default_bucket()\n",
 132 |     "prefix = 'multi-model-health-ml'\n"
 133 |    ]
 134 |   },
 135 |   {
 136 |    "cell_type": "markdown",
 137 |    "id": "043669dd-325a-46ac-a5ee-21cca32a9470",
 138 |    "metadata": {},
 139 |    "source": [
 140 |     "## Get features from SageMaker FeatureStore based on data type"
 141 |    ]
 142 |   },
 143 |   {
 144 |    "cell_type": "code",
 145 |    "execution_count": null,
 146 |    "id": "0e6a0bbe-aae3-4fdd-bca0-7507c935012e",
 147 |    "metadata": {
 148 |     "tags": []
 149 |    },
 150 |    "outputs": [],
 151 |    "source": [
 152 |     "from sagemaker.feature_store.feature_group import FeatureGroup\n",
 153 |     "\n",
 154 |     "genomic_feature_group_name = 'genomic-feature-group'\n",
 155 |     "clinical_feature_group_name = 'clinical-feature-group'\n",
 156 |     "imaging_feature_group_name = 'imaging-feature-group'\n",
 157 |     "\n",
 158 |     "genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)\n",
 159 |     "clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)\n",
 160 |     "imaging_feature_group = FeatureGroup(name=imaging_feature_group_name, sagemaker_session=feature_store_session)"
 161 |    ]
 162 |   },
 163 |   {
 164 |    "cell_type": "code",
 165 |    "execution_count": null,
 166 |    "id": "eef950fe-e6cd-46f7-ab9a-ba2416af69be",
 167 |    "metadata": {
 168 |     "tags": []
 169 |    },
 170 |    "outputs": [],
 171 |    "source": [
 172 |     "genomic_query = genomic_feature_group.athena_query()\n",
 173 |     "clinical_query = clinical_feature_group.athena_query()\n",
 174 |     "imaging_query = imaging_feature_group.athena_query()\n",
 175 |     "\n",
 176 |     "genomic_table = genomic_query.table_name\n",
 177 |     "clinical_table = clinical_query.table_name\n",
 178 |     "imaging_table = imaging_query.table_name\n",
 179 |     "\n",
 180 |     "print('Table names')\n",
 181 |     "print(genomic_table)\n",
 182 |     "print(clinical_table)\n",
 183 |     "print(imaging_table)\n"
 184 |    ]
 185 |   },
 186 |   {
 187 |    "cell_type": "code",
 188 |    "execution_count": null,
 189 |    "id": "c25be564-ebb7-429e-936e-6e00e4fd89d2",
 190 |    "metadata": {
 191 |     "tags": []
 192 |    },
 193 |    "outputs": [],
 194 |    "source": [
 195 |     "def get_features(data_type, output_location):   \n",
 196 |     "    if (data_type == 'genomic-clinical-imaging'):\n",
 197 |     "        query_string = f'''SELECT * FROM \"{genomic_table}\", \"{clinical_table}\", \"{imaging_table}\"\n",
 198 |     "                           WHERE \"{genomic_table}\".{PatientID} = \"{clinical_table}\".{PatientID}\n",
 199 |     "                           AND \"{genomic_table}\".{PatientID} = \"{imaging_table}\".{PatientID}\n",
 200 |     "                           ORDER BY \"{clinical_table}\".{PatientID} ASC'''                   \n",
 201 |     "        print(query_string)\n",
 202 |     "        \n",
 203 |     "        genomic_query.run(query_string=query_string, output_location=output_location)\n",
 204 |     "        genomic_query.wait()\n",
 205 |     "        dataset = genomic_query.as_dataframe()\n",
 206 |     "        \n",
 207 |     "    else:\n",
 208 |     "        raise KeyError(f'data_type {data_type} is not supported for this analysis.')\n",
 209 |     "        \n",
 210 |     "    return dataset"
 211 |    ]
 212 |   },
 213 |   {
 214 |    "cell_type": "code",
 215 |    "execution_count": null,
 216 |    "id": "7e6f4687-3bd3-4c1d-8db7-9d164a8841a3",
 217 |    "metadata": {
 218 |     "tags": []
 219 |    },
 220 |    "outputs": [],
 221 |    "source": [
 222 |     "fs_output_location = f's3://{default_s3_bucket_name}/{prefix}/feature-store-queries'\n",
 223 |     "dataset = get_features(data_type, fs_output_location)\n",
 224 |     "dataset = dataset.astype(str).replace({\"{\":\"\", \"}\":\"\"}, regex=True)\n",
 225 |     "\n",
 226 |     "# Write the dataset to a local CSV file (pandas includes the header and index by default) and upload it to S3.\n",
 227 |     "filename=f'{data_type}-dataset.csv'\n",
 228 |     "dataset_uri_prefix = f's3://{default_s3_bucket_name}/{prefix}/training_input/'\n",
 229 |     "\n",
 230 |     "dataset.to_csv(filename)\n",
 231 |     "s3_client.upload_file(filename, default_s3_bucket_name, f'{prefix}/training_input/{filename}')\n",
 232 |     "print(\"Observing the different features in the dataset\")\n",
 233 |     "dataset.head(3)"
 234 |    ]
 235 |   },
 236 |   {
 237 |    "cell_type": "code",
 238 |    "execution_count": null,
 239 |    "id": "79588825-d49c-4309-b27e-7c79bc3f866c",
 240 |    "metadata": {
 241 |     "tags": []
 242 |    },
 243 |    "outputs": [],
 244 |    "source": [
 245 |     "ag.core.utils.random.seed(25)"
 246 |    ]
 247 |   },
 248 |   {
 249 |    "cell_type": "markdown",
 250 |    "id": "1d2deae2-386a-49a0-9662-e81a1c17d550",
 251 |    "metadata": {},
 252 |    "source": [
 253 |     "## Alzheimer's Prediction\n",
 254 |     "Splitting data for training and testing"
 255 |    ]
 256 |   },
 257 |   {
 258 |    "cell_type": "code",
 259 |    "execution_count": null,
 260 |    "id": "544d6b79-7ea9-4082-bc95-cacd57015a13",
 261 |    "metadata": {
 262 |     "tags": []
 263 |    },
 264 |    "outputs": [],
 265 |    "source": [
 266 |     "#Alzheimers Prediction\n",
 267 |     "#Splitting data into training and testing 80:20\n",
 268 |     "dataset = dataset.loc[:, ~dataset.columns.str.startswith('diagnostics')]\n",
 269 |     "dataset = dataset.drop(columns = ['eventtime', 'write_time', 'api_invocation_time', 'is_deleted', 'eventtime.1', 'write_time.1', 'api_invocation_time.1', 'is_deleted.1', 'alzheimers_prediction.1',\n",
 270 |     "                                    'coronary_heart_disease_prediction.1', 'stroke_prediction.1', 'hypertension_prediction.1', 'patientid.1', 'eventtime.2', 'write_time.2', 'api_invocation_time.2', 'is_deleted.2', \n",
 271 |     "                                   'alzheimers_prediction.2', 'coronary_heart_disease_prediction.2', 'stroke_prediction.2', 'hypertension_prediction.2', 'patientid.2'])\n",
 272 |     "training= dataset.sample(frac=0.8, random_state=23)\n",
 273 |     "training = training.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])\n",
 274 |     "testing = dataset.drop(training.index)\n",
 275 |     "testing = testing.drop(columns = ['patientid', 'coronary_heart_disease_prediction', 'stroke_prediction', 'hypertension_prediction'])\n",
 276 |     "X_test = testing.drop(columns = ['alzheimers_prediction'])\n",
 277 |     "print(\"Training size = \", len(training))\n",
 278 |     "print(\"Out of sample testing size = \", len(testing))"
 279 |    ]
 280 |   },
 281 |   {
 282 |    "cell_type": "markdown",
 283 |    "id": "0deacd5b-1dd8-4ee3-9b67-395e2298b52c",
 284 |    "metadata": {},
 285 |    "source": [
 286 |     "### Alzheimer's prediction on clinical, genomic, and imaging data using AutoGluon"
 287 |    ]
 288 |   },
 289 |   {
 290 |    "cell_type": "code",
 291 |    "execution_count": null,
 292 |    "id": "e29cd4ea-2d3b-4e94-bdbb-19b8b64e9b92",
 293 |    "metadata": {
 294 |     "tags": []
 295 |    },
 296 |    "outputs": [],
 297 |    "source": [
 298 |     "import time\n",
 299 |     "start_time = time.time()\n",
 300 |     "buckt = sm_session.default_bucket()\n",
 301 |     "prefix= \"genomic-clinical-imaging-alzheimers-prediction\"\n",
 302 |     "save_file = 's3://{}/{}'.format(buckt, prefix)\n",
 303 |     "predictor = TabularPredictor(label= 'alzheimers_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n",
 304 |     "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))"
 305 |    ]
 306 |   },
 307 |   {
 308 |    "cell_type": "code",
 309 |    "execution_count": null,
 310 |    "id": "3a8cc026-7387-4b82-9a9c-1bc2b7818a94",
 311 |    "metadata": {
 312 |     "tags": []
 313 |    },
 314 |    "outputs": [],
 315 |    "source": [
 316 |     "predictor.evaluate_predictions(y_true=testing['alzheimers_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)"
 317 |    ]
 318 |   },
 319 |   {
 320 |    "cell_type": "markdown",
 321 |    "id": "c1630bd6-ff2f-4b36-be9b-c8b30b6bbf70",
 322 |    "metadata": {},
 323 |    "source": [
 324 |     "## Coronary heart disease Prediction\n",
 325 |     "Splitting data for training and testing"
 326 |    ]
 327 |   },
 328 |   {
 329 |    "cell_type": "code",
 330 |    "execution_count": null,
 331 |    "id": "accbdc2f-b152-46fb-ab09-f6ce919a8c95",
 332 |    "metadata": {
 333 |     "tags": []
 334 |    },
 335 |    "outputs": [],
 336 |    "source": [
 337 |     "#coronary_heart_disease_prediction\n",
 338 |     "#Splitting data into training and testing 80:20\n",
 339 |     "training = dataset.sample(frac=0.8, random_state=25)\n",
 340 |     "training =  training.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])\n",
 341 |     "testing = dataset.drop(training.index)\n",
 342 |     "testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'stroke_prediction', 'hypertension_prediction'])\n",
 343 |     "X_test = testing.drop(columns = ['coronary_heart_disease_prediction'])\n",
 344 |     "print(\"Training size = \", len(training))\n",
 345 |     "print(\"Out of sample testing size = \", len(testing))"
 346 |    ]
 347 |   },
 348 |   {
 349 |    "cell_type": "markdown",
 350 |    "id": "6e242581-457b-43e3-a279-9fe488391fab",
 351 |    "metadata": {},
 352 |    "source": [
 353 |     "### Coronary heart disease prediction on clinical, genomic, and imaging data using AutoGluon"
 354 |    ]
 355 |   },
 356 |   {
 357 |    "cell_type": "code",
 358 |    "execution_count": null,
 359 |    "id": "645bf140-25de-443f-b852-34066b33d203",
 360 |    "metadata": {
 361 |     "tags": []
 362 |    },
 363 |    "outputs": [],
 364 |    "source": [
 365 |     "import time\n",
 366 |     "start_time = time.time()\n",
 367 |     "buckt = sm_session.default_bucket()\n",
 368 |     "prefix= \"genomic-clinical-imaging-coronary-heart-disease-prediction\"\n",
 369 |     "save_file = 's3://{}/{}'.format(buckt, prefix)\n",
 370 |     "predictor = TabularPredictor(label= 'coronary_heart_disease_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n",
 371 |     "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))"
 372 |    ]
 373 |   },
 374 |   {
 375 |    "cell_type": "code",
 376 |    "execution_count": null,
 377 |    "id": "9eacb091-7a47-4efc-a360-4a265efbfe0c",
 378 |    "metadata": {
 379 |     "tags": []
 380 |    },
 381 |    "outputs": [],
 382 |    "source": [
 383 |     "predictor.evaluate_predictions(y_true=testing['coronary_heart_disease_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)"
 384 |    ]
 385 |   },
 386 |   {
 387 |    "cell_type": "markdown",
 388 |    "id": "7deb337c-9b26-4d70-b700-c8eda40fa115",
 389 |    "metadata": {},
 390 |    "source": [
 391 |     "## Stroke Prediction\n",
 392 |     "Splitting data for training and testing"
 393 |    ]
 394 |   },
 395 |   {
 396 |    "cell_type": "code",
 397 |    "execution_count": null,
 398 |    "id": "b0ef9c2d-eafc-4f05-a155-eb5237eb5052",
 399 |    "metadata": {
 400 |     "tags": []
 401 |    },
 402 |    "outputs": [],
 403 |    "source": [
 404 |     "#stroke_prediction\n",
 405 |     "#Splitting data into training and testing 80:20\n",
 406 |     "training = dataset.sample(frac=0.8, random_state=30)\n",
 407 |     "training =  training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])\n",
 408 |     "testing = dataset.drop(training.index)\n",
 409 |     "testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'hypertension_prediction'])\n",
 410 |     "X_test = testing.drop(columns = ['stroke_prediction'])\n",
 411 |     "print(\"Training size = \", len(training))\n",
 412 |     "print(\"Out of sample testing size = \", len(testing))"
 413 |    ]
 414 |   },
 415 |   {
 416 |    "cell_type": "markdown",
 417 |    "id": "0a753dc2-452f-44d0-92af-2c335a2478a9",
 418 |    "metadata": {},
 419 |    "source": [
 420 |     "### Stroke prediction on clinical, genomic, and imaging data using AutoGluon"
 421 |    ]
 422 |   },
 423 |   {
 424 |    "cell_type": "code",
 425 |    "execution_count": null,
 426 |    "id": "3e828813-1792-434b-9f84-c1c7efb4c236",
 427 |    "metadata": {},
 428 |    "outputs": [],
 429 |    "source": [
 430 |     "import time\n",
 431 |     "start_time = time.time()\n",
 432 |     "buckt = sm_session.default_bucket()\n",
 433 |     "prefix= \"genomic-clinical-imaging-stroke_prediction\"\n",
 434 |     "save_file = 's3://{}/{}'.format(buckt, prefix)\n",
 435 |     "predictor = TabularPredictor(label= 'stroke_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n",
 436 |     "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))"
 437 |    ]
 438 |   },
 439 |   {
 440 |    "cell_type": "code",
 441 |    "execution_count": null,
 442 |    "id": "fe947544-7daa-4836-b74a-7291cb83d421",
 443 |    "metadata": {
 444 |     "tags": []
 445 |    },
 446 |    "outputs": [],
 447 |    "source": [
 448 |     "predictor.evaluate_predictions(y_true=testing['stroke_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)"
 449 |    ]
 450 |   },
 451 |   {
 452 |    "cell_type": "markdown",
 453 |    "id": "8cda76e3-db6c-44d3-a6e5-339a8344f55a",
 454 |    "metadata": {},
 455 |    "source": [
 456 |     "## Hypertension Prediction\n",
 457 |     "Splitting data for training and testing"
 458 |    ]
 459 |   },
 460 |   {
 461 |    "cell_type": "code",
 462 |    "execution_count": null,
 463 |    "id": "926e139c-ec74-4c58-84c1-dca1c46f6022",
 464 |    "metadata": {
 465 |     "tags": []
 466 |    },
 467 |    "outputs": [],
 468 |    "source": [
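 469 |     "# Target label: hypertension_prediction\n",
 470 |     "# Split the data 80:20 into training and testing sets; the patient ID and the other prediction labels are dropped below so they are not used as features\n",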
 471 |     "training = dataset.sample(frac=0.8, random_state=25)\n",
 472 |     "training = training.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])\n",
 473 |     "testing = dataset.drop(training.index)\n",
 474 |     "testing = testing.drop(columns = ['patientid', 'alzheimers_prediction', 'coronary_heart_disease_prediction', 'stroke_prediction'])\n",
 475 |     "X_test = testing.drop(columns = ['hypertension_prediction'])\n",
 476 |     "print(\"Training size = \", len(training))\n",
 477 |     "print(\"Out of sample testing size = \", len(testing))"
 478 |    ]
 479 |   },
 480 |   {
 481 |    "cell_type": "markdown",
 482 |    "id": "a6bf0905-5b45-4943-82a5-9211e40e010e",
 483 |    "metadata": {},
 484 |    "source": [
 485 |     "### Hypertension prediction on clinical, genomic, and imaging data using AutoGluon"
 486 |    ]
 487 |   },
 488 |   {
 489 |    "cell_type": "code",
 490 |    "execution_count": null,
 491 |    "id": "dcdc67d7-d74c-4810-bccb-943ab9cedee6",
 492 |    "metadata": {},
 493 |    "outputs": [],
 494 |    "source": [
 495 |     "import time\n",
 496 |     "start_time = time.time()\n",
 497 |     "buckt = sm_session.default_bucket()\n",
 498 |     "prefix= \"genomic-clinical-imaging-hypertension-prediction\"\n",
 499 |     "save_file = 's3://{}/{}'.format(buckt, prefix)\n",
 500 |     "predictor = TabularPredictor(label= 'hypertension_prediction', problem_type= 'binary', path=save_file).fit(train_data=training, holdout_frac=0.1, excluded_model_types=['CAT', 'XGB'])\n",
 501 |     "print(\"--- Training time= %s seconds ---\" % (time.time() - start_time))"
 502 |    ]
 503 |   },
 504 |   {
 505 |    "cell_type": "code",
 506 |    "execution_count": null,
 507 |    "id": "1611732f-6432-4713-b8da-90cbf404d3ec",
 508 |    "metadata": {
 509 |     "tags": []
 510 |    },
 511 |    "outputs": [],
 512 |    "source": [
 513 |     "predictor.evaluate_predictions(y_true=testing['hypertension_prediction'], y_pred=predictor.predict(X_test), auxiliary_metrics=True, detailed_report=True)"
 514 |    ]
 515 |   },
 516 |   {
 517 |    "cell_type": "code",
 518 |    "execution_count": null,
 519 |    "id": "b6669950-15c3-4231-b10d-69971fdd656a",
 520 |    "metadata": {},
 521 |    "outputs": [],
 522 |    "source": []
 523 |   }
 524 |  ],
 525 |  "metadata": {
 526 |   "availableInstances": [
 527 |    {
 528 |     "_defaultOrder": 0,
 529 |     "_isFastLaunch": true,
 530 |     "category": "General purpose",
 531 |     "gpuNum": 0,
 532 |     "hideHardwareSpecs": false,
 533 |     "memoryGiB": 4,
 534 |     "name": "ml.t3.medium",
 535 |     "vcpuNum": 2
 536 |    },
 537 |    {
 538 |     "_defaultOrder": 1,
 539 |     "_isFastLaunch": false,
 540 |     "category": "General purpose",
 541 |     "gpuNum": 0,
 542 |     "hideHardwareSpecs": false,
 543 |     "memoryGiB": 8,
 544 |     "name": "ml.t3.large",
 545 |     "vcpuNum": 2
 546 |    },
 547 |    {
 548 |     "_defaultOrder": 2,
 549 |     "_isFastLaunch": false,
 550 |     "category": "General purpose",
 551 |     "gpuNum": 0,
 552 |     "hideHardwareSpecs": false,
 553 |     "memoryGiB": 16,
 554 |     "name": "ml.t3.xlarge",
 555 |     "vcpuNum": 4
 556 |    },
 557 |    {
 558 |     "_defaultOrder": 3,
 559 |     "_isFastLaunch": false,
 560 |     "category": "General purpose",
 561 |     "gpuNum": 0,
 562 |     "hideHardwareSpecs": false,
 563 |     "memoryGiB": 32,
 564 |     "name": "ml.t3.2xlarge",
 565 |     "vcpuNum": 8
 566 |    },
 567 |    {
 568 |     "_defaultOrder": 4,
 569 |     "_isFastLaunch": true,
 570 |     "category": "General purpose",
 571 |     "gpuNum": 0,
 572 |     "hideHardwareSpecs": false,
 573 |     "memoryGiB": 8,
 574 |     "name": "ml.m5.large",
 575 |     "vcpuNum": 2
 576 |    },
 577 |    {
 578 |     "_defaultOrder": 5,
 579 |     "_isFastLaunch": false,
 580 |     "category": "General purpose",
 581 |     "gpuNum": 0,
 582 |     "hideHardwareSpecs": false,
 583 |     "memoryGiB": 16,
 584 |     "name": "ml.m5.xlarge",
 585 |     "vcpuNum": 4
 586 |    },
 587 |    {
 588 |     "_defaultOrder": 6,
 589 |     "_isFastLaunch": false,
 590 |     "category": "General purpose",
 591 |     "gpuNum": 0,
 592 |     "hideHardwareSpecs": false,
 593 |     "memoryGiB": 32,
 594 |     "name": "ml.m5.2xlarge",
 595 |     "vcpuNum": 8
 596 |    },
 597 |    {
 598 |     "_defaultOrder": 7,
 599 |     "_isFastLaunch": false,
 600 |     "category": "General purpose",
 601 |     "gpuNum": 0,
 602 |     "hideHardwareSpecs": false,
 603 |     "memoryGiB": 64,
 604 |     "name": "ml.m5.4xlarge",
 605 |     "vcpuNum": 16
 606 |    },
 607 |    {
 608 |     "_defaultOrder": 8,
 609 |     "_isFastLaunch": false,
 610 |     "category": "General purpose",
 611 |     "gpuNum": 0,
 612 |     "hideHardwareSpecs": false,
 613 |     "memoryGiB": 128,
 614 |     "name": "ml.m5.8xlarge",
 615 |     "vcpuNum": 32
 616 |    },
 617 |    {
 618 |     "_defaultOrder": 9,
 619 |     "_isFastLaunch": false,
 620 |     "category": "General purpose",
 621 |     "gpuNum": 0,
 622 |     "hideHardwareSpecs": false,
 623 |     "memoryGiB": 192,
 624 |     "name": "ml.m5.12xlarge",
 625 |     "vcpuNum": 48
 626 |    },
 627 |    {
 628 |     "_defaultOrder": 10,
 629 |     "_isFastLaunch": false,
 630 |     "category": "General purpose",
 631 |     "gpuNum": 0,
 632 |     "hideHardwareSpecs": false,
 633 |     "memoryGiB": 256,
 634 |     "name": "ml.m5.16xlarge",
 635 |     "vcpuNum": 64
 636 |    },
 637 |    {
 638 |     "_defaultOrder": 11,
 639 |     "_isFastLaunch": false,
 640 |     "category": "General purpose",
 641 |     "gpuNum": 0,
 642 |     "hideHardwareSpecs": false,
 643 |     "memoryGiB": 384,
 644 |     "name": "ml.m5.24xlarge",
 645 |     "vcpuNum": 96
 646 |    },
 647 |    {
 648 |     "_defaultOrder": 12,
 649 |     "_isFastLaunch": false,
 650 |     "category": "General purpose",
 651 |     "gpuNum": 0,
 652 |     "hideHardwareSpecs": false,
 653 |     "memoryGiB": 8,
 654 |     "name": "ml.m5d.large",
 655 |     "vcpuNum": 2
 656 |    },
 657 |    {
 658 |     "_defaultOrder": 13,
 659 |     "_isFastLaunch": false,
 660 |     "category": "General purpose",
 661 |     "gpuNum": 0,
 662 |     "hideHardwareSpecs": false,
 663 |     "memoryGiB": 16,
 664 |     "name": "ml.m5d.xlarge",
 665 |     "vcpuNum": 4
 666 |    },
 667 |    {
 668 |     "_defaultOrder": 14,
 669 |     "_isFastLaunch": false,
 670 |     "category": "General purpose",
 671 |     "gpuNum": 0,
 672 |     "hideHardwareSpecs": false,
 673 |     "memoryGiB": 32,
 674 |     "name": "ml.m5d.2xlarge",
 675 |     "vcpuNum": 8
 676 |    },
 677 |    {
 678 |     "_defaultOrder": 15,
 679 |     "_isFastLaunch": false,
 680 |     "category": "General purpose",
 681 |     "gpuNum": 0,
 682 |     "hideHardwareSpecs": false,
 683 |     "memoryGiB": 64,
 684 |     "name": "ml.m5d.4xlarge",
 685 |     "vcpuNum": 16
 686 |    },
 687 |    {
 688 |     "_defaultOrder": 16,
 689 |     "_isFastLaunch": false,
 690 |     "category": "General purpose",
 691 |     "gpuNum": 0,
 692 |     "hideHardwareSpecs": false,
 693 |     "memoryGiB": 128,
 694 |     "name": "ml.m5d.8xlarge",
 695 |     "vcpuNum": 32
 696 |    },
 697 |    {
 698 |     "_defaultOrder": 17,
 699 |     "_isFastLaunch": false,
 700 |     "category": "General purpose",
 701 |     "gpuNum": 0,
 702 |     "hideHardwareSpecs": false,
 703 |     "memoryGiB": 192,
 704 |     "name": "ml.m5d.12xlarge",
 705 |     "vcpuNum": 48
 706 |    },
 707 |    {
 708 |     "_defaultOrder": 18,
 709 |     "_isFastLaunch": false,
 710 |     "category": "General purpose",
 711 |     "gpuNum": 0,
 712 |     "hideHardwareSpecs": false,
 713 |     "memoryGiB": 256,
 714 |     "name": "ml.m5d.16xlarge",
 715 |     "vcpuNum": 64
 716 |    },
 717 |    {
 718 |     "_defaultOrder": 19,
 719 |     "_isFastLaunch": false,
 720 |     "category": "General purpose",
 721 |     "gpuNum": 0,
 722 |     "hideHardwareSpecs": false,
 723 |     "memoryGiB": 384,
 724 |     "name": "ml.m5d.24xlarge",
 725 |     "vcpuNum": 96
 726 |    },
 727 |    {
 728 |     "_defaultOrder": 20,
 729 |     "_isFastLaunch": false,
 730 |     "category": "General purpose",
 731 |     "gpuNum": 0,
 732 |     "hideHardwareSpecs": true,
 733 |     "memoryGiB": 0,
 734 |     "name": "ml.geospatial.interactive",
 735 |     "supportedImageNames": [
 736 |      "sagemaker-geospatial-v1-0"
 737 |     ],
 738 |     "vcpuNum": 0
 739 |    },
 740 |    {
 741 |     "_defaultOrder": 21,
 742 |     "_isFastLaunch": true,
 743 |     "category": "Compute optimized",
 744 |     "gpuNum": 0,
 745 |     "hideHardwareSpecs": false,
 746 |     "memoryGiB": 4,
 747 |     "name": "ml.c5.large",
 748 |     "vcpuNum": 2
 749 |    },
 750 |    {
 751 |     "_defaultOrder": 22,
 752 |     "_isFastLaunch": false,
 753 |     "category": "Compute optimized",
 754 |     "gpuNum": 0,
 755 |     "hideHardwareSpecs": false,
 756 |     "memoryGiB": 8,
 757 |     "name": "ml.c5.xlarge",
 758 |     "vcpuNum": 4
 759 |    },
 760 |    {
 761 |     "_defaultOrder": 23,
 762 |     "_isFastLaunch": false,
 763 |     "category": "Compute optimized",
 764 |     "gpuNum": 0,
 765 |     "hideHardwareSpecs": false,
 766 |     "memoryGiB": 16,
 767 |     "name": "ml.c5.2xlarge",
 768 |     "vcpuNum": 8
 769 |    },
 770 |    {
 771 |     "_defaultOrder": 24,
 772 |     "_isFastLaunch": false,
 773 |     "category": "Compute optimized",
 774 |     "gpuNum": 0,
 775 |     "hideHardwareSpecs": false,
 776 |     "memoryGiB": 32,
 777 |     "name": "ml.c5.4xlarge",
 778 |     "vcpuNum": 16
 779 |    },
 780 |    {
 781 |     "_defaultOrder": 25,
 782 |     "_isFastLaunch": false,
 783 |     "category": "Compute optimized",
 784 |     "gpuNum": 0,
 785 |     "hideHardwareSpecs": false,
 786 |     "memoryGiB": 72,
 787 |     "name": "ml.c5.9xlarge",
 788 |     "vcpuNum": 36
 789 |    },
 790 |    {
 791 |     "_defaultOrder": 26,
 792 |     "_isFastLaunch": false,
 793 |     "category": "Compute optimized",
 794 |     "gpuNum": 0,
 795 |     "hideHardwareSpecs": false,
 796 |     "memoryGiB": 96,
 797 |     "name": "ml.c5.12xlarge",
 798 |     "vcpuNum": 48
 799 |    },
 800 |    {
 801 |     "_defaultOrder": 27,
 802 |     "_isFastLaunch": false,
 803 |     "category": "Compute optimized",
 804 |     "gpuNum": 0,
 805 |     "hideHardwareSpecs": false,
 806 |     "memoryGiB": 144,
 807 |     "name": "ml.c5.18xlarge",
 808 |     "vcpuNum": 72
 809 |    },
 810 |    {
 811 |     "_defaultOrder": 28,
 812 |     "_isFastLaunch": false,
 813 |     "category": "Compute optimized",
 814 |     "gpuNum": 0,
 815 |     "hideHardwareSpecs": false,
 816 |     "memoryGiB": 192,
 817 |     "name": "ml.c5.24xlarge",
 818 |     "vcpuNum": 96
 819 |    },
 820 |    {
 821 |     "_defaultOrder": 29,
 822 |     "_isFastLaunch": true,
 823 |     "category": "Accelerated computing",
 824 |     "gpuNum": 1,
 825 |     "hideHardwareSpecs": false,
 826 |     "memoryGiB": 16,
 827 |     "name": "ml.g4dn.xlarge",
 828 |     "vcpuNum": 4
 829 |    },
 830 |    {
 831 |     "_defaultOrder": 30,
 832 |     "_isFastLaunch": false,
 833 |     "category": "Accelerated computing",
 834 |     "gpuNum": 1,
 835 |     "hideHardwareSpecs": false,
 836 |     "memoryGiB": 32,
 837 |     "name": "ml.g4dn.2xlarge",
 838 |     "vcpuNum": 8
 839 |    },
 840 |    {
 841 |     "_defaultOrder": 31,
 842 |     "_isFastLaunch": false,
 843 |     "category": "Accelerated computing",
 844 |     "gpuNum": 1,
 845 |     "hideHardwareSpecs": false,
 846 |     "memoryGiB": 64,
 847 |     "name": "ml.g4dn.4xlarge",
 848 |     "vcpuNum": 16
 849 |    },
 850 |    {
 851 |     "_defaultOrder": 32,
 852 |     "_isFastLaunch": false,
 853 |     "category": "Accelerated computing",
 854 |     "gpuNum": 1,
 855 |     "hideHardwareSpecs": false,
 856 |     "memoryGiB": 128,
 857 |     "name": "ml.g4dn.8xlarge",
 858 |     "vcpuNum": 32
 859 |    },
 860 |    {
 861 |     "_defaultOrder": 33,
 862 |     "_isFastLaunch": false,
 863 |     "category": "Accelerated computing",
 864 |     "gpuNum": 4,
 865 |     "hideHardwareSpecs": false,
 866 |     "memoryGiB": 192,
 867 |     "name": "ml.g4dn.12xlarge",
 868 |     "vcpuNum": 48
 869 |    },
 870 |    {
 871 |     "_defaultOrder": 34,
 872 |     "_isFastLaunch": false,
 873 |     "category": "Accelerated computing",
 874 |     "gpuNum": 1,
 875 |     "hideHardwareSpecs": false,
 876 |     "memoryGiB": 256,
 877 |     "name": "ml.g4dn.16xlarge",
 878 |     "vcpuNum": 64
 879 |    },
 880 |    {
 881 |     "_defaultOrder": 35,
 882 |     "_isFastLaunch": false,
 883 |     "category": "Accelerated computing",
 884 |     "gpuNum": 1,
 885 |     "hideHardwareSpecs": false,
 886 |     "memoryGiB": 61,
 887 |     "name": "ml.p3.2xlarge",
 888 |     "vcpuNum": 8
 889 |    },
 890 |    {
 891 |     "_defaultOrder": 36,
 892 |     "_isFastLaunch": false,
 893 |     "category": "Accelerated computing",
 894 |     "gpuNum": 4,
 895 |     "hideHardwareSpecs": false,
 896 |     "memoryGiB": 244,
 897 |     "name": "ml.p3.8xlarge",
 898 |     "vcpuNum": 32
 899 |    },
 900 |    {
 901 |     "_defaultOrder": 37,
 902 |     "_isFastLaunch": false,
 903 |     "category": "Accelerated computing",
 904 |     "gpuNum": 8,
 905 |     "hideHardwareSpecs": false,
 906 |     "memoryGiB": 488,
 907 |     "name": "ml.p3.16xlarge",
 908 |     "vcpuNum": 64
 909 |    },
 910 |    {
 911 |     "_defaultOrder": 38,
 912 |     "_isFastLaunch": false,
 913 |     "category": "Accelerated computing",
 914 |     "gpuNum": 8,
 915 |     "hideHardwareSpecs": false,
 916 |     "memoryGiB": 768,
 917 |     "name": "ml.p3dn.24xlarge",
 918 |     "vcpuNum": 96
 919 |    },
 920 |    {
 921 |     "_defaultOrder": 39,
 922 |     "_isFastLaunch": false,
 923 |     "category": "Memory Optimized",
 924 |     "gpuNum": 0,
 925 |     "hideHardwareSpecs": false,
 926 |     "memoryGiB": 16,
 927 |     "name": "ml.r5.large",
 928 |     "vcpuNum": 2
 929 |    },
 930 |    {
 931 |     "_defaultOrder": 40,
 932 |     "_isFastLaunch": false,
 933 |     "category": "Memory Optimized",
 934 |     "gpuNum": 0,
 935 |     "hideHardwareSpecs": false,
 936 |     "memoryGiB": 32,
 937 |     "name": "ml.r5.xlarge",
 938 |     "vcpuNum": 4
 939 |    },
 940 |    {
 941 |     "_defaultOrder": 41,
 942 |     "_isFastLaunch": false,
 943 |     "category": "Memory Optimized",
 944 |     "gpuNum": 0,
 945 |     "hideHardwareSpecs": false,
 946 |     "memoryGiB": 64,
 947 |     "name": "ml.r5.2xlarge",
 948 |     "vcpuNum": 8
 949 |    },
 950 |    {
 951 |     "_defaultOrder": 42,
 952 |     "_isFastLaunch": false,
 953 |     "category": "Memory Optimized",
 954 |     "gpuNum": 0,
 955 |     "hideHardwareSpecs": false,
 956 |     "memoryGiB": 128,
 957 |     "name": "ml.r5.4xlarge",
 958 |     "vcpuNum": 16
 959 |    },
 960 |    {
 961 |     "_defaultOrder": 43,
 962 |     "_isFastLaunch": false,
 963 |     "category": "Memory Optimized",
 964 |     "gpuNum": 0,
 965 |     "hideHardwareSpecs": false,
 966 |     "memoryGiB": 256,
 967 |     "name": "ml.r5.8xlarge",
 968 |     "vcpuNum": 32
 969 |    },
 970 |    {
 971 |     "_defaultOrder": 44,
 972 |     "_isFastLaunch": false,
 973 |     "category": "Memory Optimized",
 974 |     "gpuNum": 0,
 975 |     "hideHardwareSpecs": false,
 976 |     "memoryGiB": 384,
 977 |     "name": "ml.r5.12xlarge",
 978 |     "vcpuNum": 48
 979 |    },
 980 |    {
 981 |     "_defaultOrder": 45,
 982 |     "_isFastLaunch": false,
 983 |     "category": "Memory Optimized",
 984 |     "gpuNum": 0,
 985 |     "hideHardwareSpecs": false,
 986 |     "memoryGiB": 512,
 987 |     "name": "ml.r5.16xlarge",
 988 |     "vcpuNum": 64
 989 |    },
 990 |    {
 991 |     "_defaultOrder": 46,
 992 |     "_isFastLaunch": false,
 993 |     "category": "Memory Optimized",
 994 |     "gpuNum": 0,
 995 |     "hideHardwareSpecs": false,
 996 |     "memoryGiB": 768,
 997 |     "name": "ml.r5.24xlarge",
 998 |     "vcpuNum": 96
 999 |    },
1000 |    {
1001 |     "_defaultOrder": 47,
1002 |     "_isFastLaunch": false,
1003 |     "category": "Accelerated computing",
1004 |     "gpuNum": 1,
1005 |     "hideHardwareSpecs": false,
1006 |     "memoryGiB": 16,
1007 |     "name": "ml.g5.xlarge",
1008 |     "vcpuNum": 4
1009 |    },
1010 |    {
1011 |     "_defaultOrder": 48,
1012 |     "_isFastLaunch": false,
1013 |     "category": "Accelerated computing",
1014 |     "gpuNum": 1,
1015 |     "hideHardwareSpecs": false,
1016 |     "memoryGiB": 32,
1017 |     "name": "ml.g5.2xlarge",
1018 |     "vcpuNum": 8
1019 |    },
1020 |    {
1021 |     "_defaultOrder": 49,
1022 |     "_isFastLaunch": false,
1023 |     "category": "Accelerated computing",
1024 |     "gpuNum": 1,
1025 |     "hideHardwareSpecs": false,
1026 |     "memoryGiB": 64,
1027 |     "name": "ml.g5.4xlarge",
1028 |     "vcpuNum": 16
1029 |    },
1030 |    {
1031 |     "_defaultOrder": 50,
1032 |     "_isFastLaunch": false,
1033 |     "category": "Accelerated computing",
1034 |     "gpuNum": 1,
1035 |     "hideHardwareSpecs": false,
1036 |     "memoryGiB": 128,
1037 |     "name": "ml.g5.8xlarge",
1038 |     "vcpuNum": 32
1039 |    },
1040 |    {
1041 |     "_defaultOrder": 51,
1042 |     "_isFastLaunch": false,
1043 |     "category": "Accelerated computing",
1044 |     "gpuNum": 1,
1045 |     "hideHardwareSpecs": false,
1046 |     "memoryGiB": 256,
1047 |     "name": "ml.g5.16xlarge",
1048 |     "vcpuNum": 64
1049 |    },
1050 |    {
1051 |     "_defaultOrder": 52,
1052 |     "_isFastLaunch": false,
1053 |     "category": "Accelerated computing",
1054 |     "gpuNum": 4,
1055 |     "hideHardwareSpecs": false,
1056 |     "memoryGiB": 192,
1057 |     "name": "ml.g5.12xlarge",
1058 |     "vcpuNum": 48
1059 |    },
1060 |    {
1061 |     "_defaultOrder": 53,
1062 |     "_isFastLaunch": false,
1063 |     "category": "Accelerated computing",
1064 |     "gpuNum": 4,
1065 |     "hideHardwareSpecs": false,
1066 |     "memoryGiB": 384,
1067 |     "name": "ml.g5.24xlarge",
1068 |     "vcpuNum": 96
1069 |    },
1070 |    {
1071 |     "_defaultOrder": 54,
1072 |     "_isFastLaunch": false,
1073 |     "category": "Accelerated computing",
1074 |     "gpuNum": 8,
1075 |     "hideHardwareSpecs": false,
1076 |     "memoryGiB": 768,
1077 |     "name": "ml.g5.48xlarge",
1078 |     "vcpuNum": 192
1079 |    }
1080 |   ],
1081 |   "instance_type": "ml.t3.xlarge",
1082 |   "kernelspec": {
1083 |    "display_name": "Python 3 (Data Science 3.0)",
1084 |    "language": "python",
1085 |    "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
1086 |   },
1087 |   "language_info": {
1088 |    "codemirror_mode": {
1089 |     "name": "ipython",
1090 |     "version": 3
1091 |    },
1092 |    "file_extension": ".py",
1093 |    "mimetype": "text/x-python",
1094 |    "name": "python",
1095 |    "nbconvert_exporter": "python",
1096 |    "pygments_lexer": "ipython3",
1097 |    "version": "3.10.6"
1098 |   }
1099 |  },
1100 |  "nbformat": 4,
1101 |  "nbformat_minor": 5
1102 | }
1103 | 


--------------------------------------------------------------------------------
/visualization-dashboard/README.md:
--------------------------------------------------------------------------------
 1 | ## [Visualization Dashboard for Population-level Analysis](https://us-east-1.quicksight.aws.amazon.com/sn/accounts/659535263284/dashboards/774bd843-f43f-4faf-858f-accd8376c95b/sheets/774bd843-f43f-4faf-858f-accd8376c95b_7871daf0-08f6-4c73-b85e-b7e88b3da794)
 2 | 
 3 | <!-- **HCLS customers are constantly striving to find ways of improving clinical health outcomes of a defined group of individuals for Population Health Management(PHM). Possessing and analyzing data allows providers to identify the greatest needs of the patient population. For example, if the majority of a patient population is suffering from a particular disease, say diabetes, obesity and also the corresponding social determinants of health. PHM allows providers to predict and identify patients at risk for hospital admissions, allows providers to create patient-specific care plans, and helps providers understand their patient population health trends.** -->
 4 | 
 5 | 
 6 | This dashboard, built with Amazon QuickSight, offers an interactive visual interface to help users (e.g. clinicians, bioinformaticians, radiologists) get a comprehensive view of patients at the population or cohort level. It includes the following sheets:
 7 | 
 8 | * Clinical Analysis - Provides an overview of clinical data at the population level
 9 | * Genomic Analysis - Provides an overview of genomic data at the population level
10 | * Imaging Analysis - Provides an overview of medical imaging data at the population level
11 | 
12 | 
13 | ## [Visualization Dashboard for Patient-level Analysis](https://us-east-1.quicksight.aws.amazon.com/sn/accounts/659535263284/dashboards/9b340b41-9b19-4496-bde1-dcf3638edd39?directory_alias=hcls-multimodal)
14 | 
15 | <!-- **HCLS customers are seeing a rapid growth in patient-level data size and diversity, that include genomic, clinical, medical imaging, medical claims, and sensor data. While multimodal data offers a comprehensive view that can improve patient outcomes and care, analyzing multiple modalities at scale is challenging, preventing customers from adopting multimodal analytics for precision health applications.** -->
16 | 
17 | 
18 | This dashboard, built with Amazon QuickSight, offers a single, interactive visual interface to help users (e.g. clinicians) get a complete view of a patient across multiple data modalities (clinical, genomic, and imaging). To use this dashboard, select the Patient ID of interest using the menu located at the top-right of the dashboard. This will generate visualizations across multiple data types, filtered for the selected patient.
19 | 
20 | Prior to any production deployment, please coordinate with your local security team to ensure appropriate authentication/authorization is in place.
21 | 


--------------------------------------------------------------------------------