├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── container
│   ├── inference
│   │   ├── Dockerfile
│   │   ├── inference.py
│   │   ├── nginx.conf
│   │   ├── serve
│   │   └── wsgi.py
│   └── training
│       ├── Dockerfile
│       ├── changehostname.c
│       ├── requirements.txt
│       └── start_with_right_hostname.sh
├── deploy-ESM-embeddings-server.ipynb
├── img
│   ├── 1-setup.png
│   ├── 2-api-key.png
│   ├── 3-generate.png
│   ├── 4-sm.png
│   ├── 5-secret-type.png
│   └── bionemo-sm-arch.png
├── src
│   ├── esm1nv-training.yaml
│   └── train.py
└── train-ESM.ipynb

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .venv
3 | .scratch
4 | 

--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 

--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute to. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT No Attribution
2 | 
3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 
18 | 

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Amazon SageMaker with NVIDIA BioNeMo
2 | 
3 | ## Description
4 | 
5 | Code examples for running NVIDIA BioNeMo inference and training on Amazon SageMaker.
6 | 
7 | ## Introduction
8 | 
9 | Proteins are complex biomolecules that carry out most of the essential functions in cells, from metabolism to cellular signaling and structure. A deep understanding of protein structure and function is critical for advancing fields like personalized medicine, biomanufacturing, and synthetic biology.
10 | 
11 | Recent advances in natural language processing (NLP) have enabled breakthroughs in computational biology through the development of protein language models (pLMs). Similar to how word tokens are the building blocks of sentences in NLP models, amino acids are the building blocks that make up protein sequences. When exposed to millions of protein sequences during training, pLMs develop attention patterns that represent the evolutionary relationships between amino acids. This learned representation of primary sequence can then be fine-tuned to predict protein properties and higher-order structure.
12 | 
13 | At re:Invent 2023, NVIDIA announced that its BioNeMo generative AI platform for drug discovery is now available on AWS services including Amazon SageMaker, AWS ParallelCluster, and the upcoming NVIDIA DGX Cloud on AWS. BioNeMo provides pre-trained large language models, data loaders, and optimized training frameworks to help speed up target identification, protein structure prediction, and drug candidate screening in the drug discovery process. Researchers and developers at pharmaceutical and biotech companies that use AWS will be able to leverage BioNeMo and AWS's scalable GPU cloud computing capabilities to rapidly build and train generative AI models on biomolecular data. Several biotech companies and startups are already using BioNeMo for AI-accelerated drug discovery, and this announcement will enable them to easily scale up resources as needed.
14 | 
15 | This repository contains examples of how to use the [BioNeMo framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/containers/bionemo-framework) on Amazon SageMaker.
16 | 
17 | ## Architecture
18 | 
19 | ![Solution architecture for running BioNeMo training and inference on Amazon SageMaker](img/bionemo-sm-arch.png)
20 | 
21 | ## Configuration
22 | 
23 | BioNeMo uses [Hydra](https://hydra.cc/docs/intro/) to manage the training job parameters. These are stored in .yaml files and passed to the training job at runtime. For this reason, you do not need to pass many hyperparameters to your SageMaker Estimator - sometimes only the name of your configuration file! You can find a basic Hydra tutorial [here](https://hydra.cc/docs/tutorials/basic/your_first_app/simple_cli/).
24 | 
25 | ## Setup
26 | 
27 | Before you create a BioNeMo training job, follow these steps to generate some NGC API credentials and store them in AWS Secrets Manager.
28 | 
29 | 1. Sign in or create a new account at NVIDIA [NGC](https://ngc.nvidia.com/signin).
30 | 2. Select your name in the top-right corner of the screen and then "Setup".
31 | 
32 | ![Select Setup from the top-right menu](img/1-setup.png)
33 | 
34 | 3. Select "Generate API Key".
35 | 
36 | ![Select Generate API Key](img/2-api-key.png)
37 | 
38 | 4. Select the green "+ Generate API Key" button and confirm.
39 | 
40 | ![Select green Generate API Key button](img/3-generate.png)
41 | 
42 | 5. Copy the API key - this is the last time you can retrieve it!
43 | 
44 | 6. Before you leave the NVIDIA NGC site, also take note of your organization ID listed under your name in the top-right corner of the screen. You'll need this, plus your API key, to download BioNeMo artifacts.
45 | 
46 | 7. Navigate to the AWS Console and then to AWS Secrets Manager.
47 | 
48 | ![Navigate to AWS Secrets Manager](img/4-sm.png)
49 | 
50 | 8. Select "Store a new secret".
51 | 9. Under "Secret type" select "Other type of secret".
52 | 
53 | ![Select other type of secret](img/5-secret-type.png)
54 | 
55 | 10. Under "Key/value" pairs, add a key named "NGC_CLI_API_KEY" with a value of your NGC API key. Add another key named "NGC_CLI_ORG" with a value of your NGC organization. Select Next.
56 | 
57 | 11. Under "Configure secret - Secret name and description", name your secret "NVIDIA_NGC_CREDS" and select Next. You'll use this secret name when submitting BioNeMo jobs to SageMaker.
58 | 
59 | 12. Select the remaining default options to create your secret.
60 | 
61 | ## Examples
62 | 
63 | ### Generate ESM-1nv sequence embeddings using an Amazon SageMaker Real-Time Inference Endpoint
64 | 
65 | The **deploy-ESM-embeddings-server.ipynb** notebook describes how to deploy the pretrained esm-1nv model as an endpoint for generating sequence embeddings. In this case, all of the required configuration files are already included in the BioNeMo framework. You only need to specify the name of your model and your NGC API secret name in AWS Secrets Manager.
66 | 
67 | ### Train ESM-1nv on Protein Sequences from UniProt
68 | 
69 | The **train-ESM.ipynb** notebook describes how to pretrain or fine-tune the esm-1nv model using sequence data from the UniProt sequence database. In this case, you will need to create a configuration file and upload it with the training script when creating the job. You should not need to modify the training script. Once the training has finished, the NeMo checkpoints will be available in Amazon S3.
70 | 
71 | ## Security
72 | 
73 | Amazon S3 now applies server-side encryption with Amazon S3 managed keys (SSE-S3) as the base level of encryption for every bucket in Amazon S3. However, for particularly sensitive data or models you may want to apply a different server- or client-side encryption method, [as described in the Amazon S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingEncryption.html).
74 | 
75 | Additional security best practices, such as disabling access control lists (ACLs) and S3 Block Public Access, can be found in the [Amazon S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html).
76 | 
77 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
78 | 
79 | ## License
80 | 
81 | This library is licensed under the MIT-0 License. See the LICENSE file.

--------------------------------------------------------------------------------
/container/inference/Dockerfile:
--------------------------------------------------------------------------------
1 | # Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
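The notebooks above drive everything through the SageMaker Python SDK, but once the embeddings endpoint from **deploy-ESM-embeddings-server.ipynb** is deployed, any client with access to the SageMaker runtime API can call it. Here is a minimal sketch of what that looks like on the wire; the helper names are illustrative (not part of this repo), and it assumes an endpoint named `esm-embeddings` whose handler returns the embeddings as a JSON list, matching the `/invocations` route in `container/inference/inference.py`:

```python
import json


def build_csv_payload(sequences):
    """Serialize sequences the way the container's /invocations handler
    expects: a single text/csv body that the server splits on commas."""
    if any("," in seq for seq in sequences):
        raise ValueError("sequences must not contain commas")
    return ",".join(sequences)


def fetch_embeddings(runtime, endpoint_name, sequences):
    """Invoke the endpoint through a boto3 'sagemaker-runtime' client and
    decode the JSON array of embeddings returned by the Flask handler."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_csv_payload(sequences),
    )
    return json.loads(response["Body"].read())


# Usage (requires a deployed endpoint and AWS credentials):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   embeddings = fetch_embeddings(runtime, "esm-embeddings",
#                                 ["MSLKRKNIAL", "MIQSQINRNI"])
```

The `Predictor` created in the notebook does the same serialization for you; this sketch only makes the request/response contract explicit.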
2 | # SPDX-License-Identifier: MIT-0
3 | 
4 | FROM nvcr.io/nvidia/clara/bionemo-framework:1.10
5 | 
6 | # Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
7 | LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
8 | 
9 | ENV PYTHON=python3
10 | ENV PYTHON_VERSION=3.10.12
11 | ENV PYTHON_SHORT_VERSION=3.10
12 | ENV MAMBA_VERSION=23.11.0-0
13 | ENV PYTORCH_VERSION=2.1.0
14 | ENV DEBIAN_FRONTEND=noninteractive
15 | ENV DLC_CONTAINER_TYPE=inference
16 | ENV PYTHONUNBUFFERED=TRUE
17 | ENV PYTHONDONTWRITEBYTECODE=TRUE
18 | ENV PATH="${BIONEMO_HOME}:${PATH}"
19 | ENV TQDM_POSITION=-1
20 | ENV MODEL_PATH $BIONEMO_HOME/models
21 | 
22 | COPY serve .
23 | COPY inference.py .
24 | COPY wsgi.py .
25 | COPY nginx.conf .
26 | 
27 | # Upgrade installed packages
28 | RUN apt-get update && apt-get upgrade -y && apt-get clean \
29 |     && apt-get -y install --no-install-recommends \
30 |     build-essential \
31 |     ca-certificates \
32 |     curl \
33 |     nginx \
34 |     && rm -rf /var/lib/apt/lists/* \
35 |     && pip3 --no-cache-dir install --upgrade pip \
36 |     && pip --no-cache-dir install \
37 |     boto3 \
38 |     "sagemaker>=2,<3" \
39 |     flask \
40 |     gunicorn \
41 |     gevent \
42 |     ujson \
43 |     && rm -rf /root/.cache | true \
44 |     && rm "$HOME/.aws/config"
45 | 
46 | WORKDIR $BIONEMO_HOME
47 | EXPOSE 8080
48 | ENTRYPOINT ["/usr/bin/python"]
49 | CMD ["serve"]

--------------------------------------------------------------------------------
/container/inference/inference.py:
--------------------------------------------------------------------------------
1 | from io import StringIO
2 | import flask
3 | from flask import Flask, Response, Request
4 | import logging
5 | 
6 | from bionemo.triton.inference_wrapper import new_inference_wrapper
7 | import warnings
8 | 
9 | logging.basicConfig(
10 |     format="%(asctime)s - %(levelname)s - %(message)s",
11 |     datefmt="%m/%d/%Y %H:%M:%S",
12 |     level=logging.INFO,
13 | )
14 | 
15 | warnings.filterwarnings("ignore")
16 | warnings.simplefilter("ignore")
17 | 
18 | app = Flask(__name__)
19 | connection = new_inference_wrapper("grpc://localhost:8001")
20 | 
21 | 
22 | @app.route("/ping", methods=["GET"])
23 | def ping():
24 |     """
25 |     Health check route that SageMaker polls to verify the container is responsive.
26 | 
27 |     As written, this handler always returns a 200 status code; it does not
28 |     currently verify that the model itself has loaded.
29 | 
30 |     Returns:
31 |         flask.Response: A response object containing the status code and mimetype.
32 |     """
33 |     status = 200
34 |     return flask.Response(response="\n", status=status, mimetype="application/json")
35 | 
36 | 
37 | @app.route("/invocations", methods=["POST"])
38 | def invocations():
39 |     """
40 |     Handle prediction requests by preprocessing the input data, making predictions,
41 |     and returning the predictions as a JSON object.
42 | 
43 |     This function checks that the request content type is supported (text/csv),
44 |     and if so, decodes the input data, generates embeddings for each sequence, and
45 |     returns them as a JSON array. If the content type is not supported, a 415 status
46 |     code is returned.
47 | 
48 |     Returns:
49 |         flask.Response: A response object containing the predictions, status code, and mimetype.
50 |     """
51 |     print(f"Predictor: received content type: {flask.request.content_type}")
52 |     if flask.request.content_type == "text/csv":
53 |         payload = flask.request.data.decode("utf-8")
54 |         print(f"Predictor: received input: {payload}")
55 |         seqs = payload.split(",")
56 |         embeddings = connection.seqs_to_embedding(seqs)
57 |         print(f"{embeddings.shape=}")
58 |         print(f"Predictor: output: {embeddings}")
59 |         # Return the predictions as a list
60 |         return embeddings.tolist()
61 |     else:
62 |         print(f"Received: {flask.request.content_type}", flush=True)
63 |         return flask.Response(
64 |             response=f"This predictor only supports CSV data; Received: {flask.request.content_type}",
65 |             status=415,
66 |             mimetype="text/plain",
67 |         )
68 | 

--------------------------------------------------------------------------------
/container/inference/nginx.conf:
--------------------------------------------------------------------------------
1 | worker_processes 1;
2 | daemon off; # Prevent forking
3 | 
4 | 
5 | pid /tmp/nginx.pid;
6 | error_log /var/log/nginx/error.log;
7 | 
8 | events {
9 |   # defaults
10 | }
11 | 
12 | http {
13 |   include /etc/nginx/mime.types;
14 |   default_type application/octet-stream;
15 |   access_log /var/log/nginx/access.log combined;
16 | 
17 |   upstream gunicorn {
18 |     server unix:/tmp/gunicorn.sock;
19 |   }
20 | 
21 |   server {
22 |     listen 8080 deferred;
23 |     client_max_body_size 5m;
24 | 
25 |     keepalive_timeout 5;
26 |     proxy_read_timeout 1200s;
27 | 
28 |     location ~ ^/(ping|invocations) {
29 |       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
30 |       proxy_set_header Host $http_host;
31 |       proxy_redirect off;
32 |       proxy_pass http://gunicorn;
33 |     }
34 | 
35 |     location / {
36 |       return 404 "{}";
37 |     }
38 |   }
39 | }

--------------------------------------------------------------------------------
/container/inference/serve:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | # This file implements the scoring service shell. You don't necessarily need to modify it for various
4 | # algorithms.
5 | 
6 | # Environment variables read by this script (all optional, with defaults):
7 | # SM_SECRET_NAME, MODEL_NAME, MODEL_PATH, MODEL_SERVER_TIMEOUT, MODEL_SERVER_WORKERS
8 | 
9 | import multiprocessing
10 | import os
11 | import signal
12 | import subprocess
13 | import sys
14 | import re
15 | import boto3
16 | import logging
17 | import shutil
18 | from botocore.exceptions import ClientError
19 | import json
20 | from time import sleep
21 | cpu_count = multiprocessing.cpu_count()
22 | 
23 | model_server_timeout = os.environ.get("MODEL_SERVER_TIMEOUT", 60)
24 | model_server_workers = int(os.environ.get("MODEL_SERVER_WORKERS", cpu_count))
25 | 
26 | logging.basicConfig(
27 |     format="%(asctime)s - %(levelname)s - %(message)s",
28 |     datefmt="%m/%d/%Y %H:%M:%S",
29 |     level=logging.INFO,
30 | )
31 | 
32 | def sigterm_handler(nginx_pid, gunicorn_pid, pytriton_pid):
33 |     try:
34 |         os.kill(nginx_pid, signal.SIGQUIT)
35 |     except OSError:
36 |         pass
37 |     try:
38 |         os.kill(gunicorn_pid, signal.SIGQUIT)
39 |     except OSError:
40 |         pass
41 |     try:
42 |         os.kill(pytriton_pid, signal.SIGQUIT)
43 |     except OSError:
44 |         pass
45 |     sys.exit(0)
46 | 
47 | 
48 | def parse_conf_path(
49 |     model_name: str,
50 |     root_path: str = "/workspace/bionemo/examples",
51 | ) -> str:
52 |     """Parse the conf path from the model name."""
53 | 
54 |     if model_name == "megamolbart":
55 |         conf_path = "molecule/megamolbart/conf"
56 |     elif model_name == "prott5nv":
57 |         conf_path = "protein/prott5nv/conf"
58 |     elif model_name == "esm1nv":
59 |         conf_path = "protein/esm1nv/conf"
60 |     elif re.match(r"diffdock", model_name):
61 |         conf_path = "molecule/diffdock/conf"
62 |     elif re.match(r"esm2", model_name):
63 |         conf_path = "protein/esm2nv/conf"
64 |     elif re.match(r"equidock", model_name):
65 |         conf_path = "protein/equidock/conf"
66 |     else:
67 |         raise ValueError(f"Invalid model name: {model_name}")
68 | 
69 |     return os.path.join(root_path, conf_path)
70 | 
71 | 
72 | def set_ngc_credentials(secret_name: str) -> None:
73 |     """Get NVIDIA NGC API Key and org from AWS Secrets Manager"""
74 | 
75 |     # Create a Secrets Manager client
76 |     client = boto3.client(
77 |         "secretsmanager", region_name=os.environ.get("AWS_REGION", "us-west-2")
78 |     )
79 | 
80 |     logging.info("Retrieving NGC credentials from AWS Secrets Manager.")
81 | 
82 |     try:
83 |         get_secret_value_response = client.get_secret_value(SecretId=secret_name)
84 |     except ClientError as e:
85 |         # For a list of exceptions thrown, see
86 |         # https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
87 |         raise e
88 | 
89 |     creds = json.loads(get_secret_value_response["SecretString"])
90 | 
91 |     logging.info("Setting NGC credentials as environment variables.")
92 |     os.environ["NGC_CLI_API_KEY"] = creds.get("NGC_CLI_API_KEY", "")
93 |     os.environ["NGC_CLI_ORG"] = creds.get("NGC_CLI_ORG", "")
94 |     os.environ["NGC_CLI_TEAM"] = creds.get("NGC_CLI_TEAM", "")
95 |     os.environ["NGC_CLI_FORMAT_TYPE"] = creds.get("NGC_CLI_FORMAT_TYPE", "ascii")
96 | 
97 |     return None
98 | 
99 | 
100 | def download_model_weights(
101 |     secret_name=os.environ.get("SM_SECRET_NAME", "NVIDIA_NGC_CREDS"),
102 |     model_name=os.environ.get("MODEL_NAME", "all"),
103 |     model_path=os.environ.get("MODEL_PATH", "/workspace/bionemo/models"),
104 | ):
105 |     set_ngc_credentials(secret_name)
106 |     logging.info("Downloading pre-trained model checkpoint")
107 |     if not os.path.exists(model_path):
108 |         os.makedirs(model_path)
109 | 
110 |     if not os.path.exists("artifact_paths.yaml"):
111 |         shutil.copy(
112 |             "/workspace/bionemo/artifact_paths.yaml",
113 |             os.getcwd(),
114 |         )
115 |     subprocess.run(
116 |         [
117 |             "/usr/bin/python",
118 |             "/workspace/bionemo/download_artifacts.py",
119 |             "--models",
120 |             model_name,
121 |             "--source",
122 |             "ngc",
123 |             "--model_dir",
124 |             model_path,
125 |         ],
126 |         check=True,
127 |     )
128 |     downloaded_nemo_files = [f for f in os.listdir(model_path) if f.endswith(".nemo")]
129 |     checkpoint_path = os.path.join(model_path, downloaded_nemo_files[0])
130 |     logging.info(f"Pre-trained model checkpoint downloaded to {checkpoint_path}")
131 | 
132 | 
133 | def start_server():
134 |     logging.info("Starting the inference server with {} workers.".format(model_server_workers))
135 | 
136 |     # link the log streams to stdout/err so they will be logged to the container logs
137 |     subprocess.check_call(
138 |         ["/usr/bin/ln", "-sf", "/dev/stdout", "/var/log/nginx/access.log"]
139 |     )
140 |     subprocess.check_call(
141 |         ["/usr/bin/ln", "-sf", "/dev/stderr", "/var/log/nginx/error.log"]
142 |     )
143 | 
144 |     download_model_weights()
145 | 
146 |     config_path = parse_conf_path(os.environ.get("MODEL_NAME", "esm1nv"))
147 | 
148 |     logging.info("Starting nginx")
149 |     nginx = subprocess.Popen(
150 |         [
151 |             "/usr/sbin/nginx",
152 |             "-c",
153 |             os.path.join(os.environ.get("BIONEMO_HOME"), "nginx.conf"),
154 |         ]
155 |     )
156 |     sleep(5)
157 | 
158 |     logging.info("Starting gunicorn")
159 |     gunicorn = subprocess.Popen(
160 |         [
161 |             "/usr/local/bin/gunicorn",
162 |             "--timeout",
163 |             str(model_server_timeout),
164 |             "-k",
165 |             "sync",
166 |             "-b",
167 |             "unix:/tmp/gunicorn.sock",
168 |             "-w",
169 |             str(model_server_workers),
170 |             "wsgi:app",
171 |         ]
172 |     )
173 |     sleep(5)
174 | 
175 |     logging.info("Starting pytriton inference wrapper")
176 |     pytriton = subprocess.Popen(
177 |         [
178 |             "/usr/bin/python",
179 |             "-m",
180 |             "bionemo.triton.inference_wrapper",
181 |             "--config-path",
182 |             config_path,
183 |         ]
184 |     )
185 |     sleep(5)
186 | 
187 |     signal.signal(
188 |         signal.SIGTERM,
189 |         lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid, pytriton.pid),
190 |     )
191 | 
192 |     # If any subprocess exits, so do we.
193 |     pids = set([nginx.pid, gunicorn.pid, pytriton.pid])
194 |     while True:
195 |         pid, _ = os.wait()
196 |         if pid in pids:
197 |             break
198 | 
199 |     sigterm_handler(nginx.pid, gunicorn.pid, pytriton.pid)
200 |     logging.info("Inference server exiting")
201 | 
202 | 
203 | # The main routine just invokes the start function.
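The `serve` script keeps three processes alive — nginx, gunicorn, and the Triton inference wrapper — and exits as soon as any of them dies, which is what signals SageMaker to replace the container. That wait-on-any-child pattern, isolated into a runnable sketch (the `supervise` helper name is illustrative and not part of this repo; POSIX-only, since it relies on `os.wait()`):

```python
import os
import subprocess


def supervise(commands):
    """Start one child process per command and block until any child
    exits, then terminate the survivors -- mirroring the os.wait()
    loop at the end of start_server() in the serve script."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    pids = {p.pid for p in procs}
    try:
        while True:
            pid, _ = os.wait()  # returns when *any* child terminates
            if pid in pids:
                return pid
    finally:
        for p in procs:
            if p.poll() is None:  # still running -> shut it down
                p.terminate()
```

In the real script the supervised children are long-running servers, so reaching the `os.wait()` return path almost always means something crashed; exiting promptly is the correct failure mode for a managed endpoint.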
204 | 
205 | if __name__ == "__main__":
206 |     start_server()
207 | 

--------------------------------------------------------------------------------
/container/inference/wsgi.py:
--------------------------------------------------------------------------------
1 | import inference as myapp
2 | 
3 | # This is just a simple wrapper for gunicorn to find your app.
4 | # If you want to change the algorithm file, simply change "inference" above to the
5 | # new file.
6 | 
7 | app = myapp.app

--------------------------------------------------------------------------------
/container/training/Dockerfile:
--------------------------------------------------------------------------------
1 | # Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
2 | # SPDX-License-Identifier: MIT-0
3 | 
4 | FROM nvcr.io/nvidia/clara/bionemo-framework:1.10 as common
5 | 
6 | ENV PYTHON=python3
7 | ENV PYTHON_VERSION=3.10.12
8 | ENV PYTHON_SHORT_VERSION=3.10
9 | ENV MAMBA_VERSION=23.11.0-0
10 | ENV PYTORCH_VERSION=2.1.0
11 | ENV SMP_URL=https://smppy.s3.amazonaws.com/pytorch/cu121/smprof-0.3.334-cp310-cp310-linux_x86_64.whl
12 | ENV EFA_PATH=/opt/amazon/efa
13 | ENV PYTHONDONTWRITEBYTECODE=1
14 | ENV PYTHONUNBUFFERED=1
15 | ENV PYTHONIOENCODING=UTF-8
16 | ENV LANG=C.UTF-8
17 | ENV LC_ALL=C.UTF-8
18 | ENV DEBIAN_FRONTEND=noninteractive
19 | ENV TORCH_CUDA_ARCH_LIST="5.2;7.0+PTX;7.5;8.0;8.6;9.0"
20 | ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
21 | ENV CUDNN_VERSION=8.9.2.26
22 | ENV EFA_VERSION=1.30.0
23 | ENV GDRCOPY_VERSION=2.3.1
24 | ENV OPEN_MPI_PATH=/opt/amazon/openmpi
25 | ENV DGLBACKEND=pytorch
26 | ENV MANUAL_BUILD=0
27 | ENV RDMAV_FORK_SAFE=1
28 | ENV DLC_CONTAINER_TYPE=training
29 | ENV NCCL_ASYNC_ERROR_HANDLING=1
30 | ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
31 | ENV PATH="$OPEN_MPI_PATH/bin:$EFA_PATH/bin:$PATH"
32 | ENV LD_LIBRARY_PATH=$OPEN_MPI_PATH/lib/:$EFA_PATH/lib/:$LD_LIBRARY_PATH
33 | ENV PYTORCH_API_USAGE_STDERR=1
34 | ENV TORCH_LOGS=+dynamo,+aot,+inductor
35 | ENV TQDM_POSITION=-1
36 | ENV MODEL_PATH $BIONEMO_HOME/models
37 | # makes AllToAll complete successfully. Update will be included in NCCL 2.20.*
38 | ENV NCCL_CUMEM_ENABLE=0
39 | ENV OFI_URI="https://github.com/aws/aws-ofi-nccl/releases/download/v1.8.0-aws/aws-ofi-nccl-1.8.0-aws.tar.gz"
40 | 
41 | COPY changehostname.c /
42 | COPY start_with_right_hostname.sh /usr/local/bin/start_with_right_hostname.sh
43 | 
44 | RUN apt-get update \
45 |     && apt-get upgrade -y \
46 |     && apt-get autoremove -y \
47 |     && apt-get clean \
48 |     && rm -rf /var/lib/apt/lists/* \
49 |     && mkdir /tmp/efa \
50 |     && cd /tmp/efa \
51 |     && curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \
52 |     && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
53 |     && cd aws-efa-installer \
54 |     && apt-get update \
55 |     && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify \
56 |     && rm -rf /tmp/efa \
57 |     && rm -rf /tmp/aws-efa-installer-${EFA_VERSION}.tar.gz \
58 |     && rm -rf /var/lib/apt/lists/* \
59 |     && apt-get clean \
60 |     && wget -O /tmp/ofi-aws.tar.gz ${OFI_URI} \
61 |     && tar -xvzf /tmp/ofi-aws.tar.gz -C /usr/local/bin --no-same-owner \
62 |     && rm /tmp/ofi-aws.tar.gz
63 | 
64 | COPY requirements.txt /
65 | 
66 | RUN pip install --upgrade pip --no-cache-dir --trusted-host pypi.org --trusted-host files.pythonhosted.org \
67 |     && pip install --no-cache-dir -U ${SMP_URL} \
68 |     && pip install --no-cache-dir -r /requirements.txt \
69 |     && rm -rf /root/.cache | true \
70 |     && rm "$HOME/.aws/config" \
71 |     && chmod +x /usr/local/bin/start_with_right_hostname.sh
72 | 
73 | WORKDIR /
74 | 
75 | ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]
76 | CMD ["/bin/bash"]

--------------------------------------------------------------------------------
/container/training/changehostname.c:
--------------------------------------------------------------------------------
1 | #include <string.h>
2 | #include <stdlib.h>
3 | 
4 | /*
5 |  * Modifies gethostname to return algo-1, algo-2, etc. when running on SageMaker.
6 |  *
7 |  * Without this gethostname() on SageMaker returns 'aws', leading NCCL/MPI to think there is only one host,
8 |  * not realizing that it needs to use NET/Socket.
9 |  *
10 |  * When the docker container starts, we read the 'current_host' value from /opt/ml/input/config/resourceconfig.json
11 |  * and replace PLACEHOLDER_HOSTNAME with it before compiling this code into a shared library.
12 |  */
13 | int gethostname(char *name, size_t len)
14 | {
15 |     const char *val = PLACEHOLDER_HOSTNAME;
16 |     strncpy(name, val, len);
17 |     return 0;
18 | }

--------------------------------------------------------------------------------
/container/training/requirements.txt:
--------------------------------------------------------------------------------
1 | accelerate==1.1.0
2 | fastai==2.7.18
3 | huggingface-hub<0.24.0
4 | numba
5 | opencv-python
6 | pandas
7 | pillow
8 | requests>=2.31.0
9 | sagemaker>=2,<3
10 | sagemaker-pytorch-training
11 | sagemaker-training
12 | scikit-learn
13 | shap
14 | smclarify

--------------------------------------------------------------------------------
/container/training/start_with_right_hostname.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | 
3 | if [[ "$1" = "train" ]]; then
4 |     CURRENT_HOST=$(jq .current_host /opt/ml/input/config/resourceconfig.json)
5 |     sed -ie "s/PLACEHOLDER_HOSTNAME/$CURRENT_HOST/g" changehostname.c
6 |     gcc -o changehostname.o -c -fPIC -Wall changehostname.c
7 |     gcc -o libchangehostname.so -shared -export-dynamic changehostname.o -ldl
8 |     LD_PRELOAD=/libchangehostname.so train
9 | else
10 |     eval "$@"
11 | fi

--------------------------------------------------------------------------------
/deploy-ESM-embeddings-server.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "f9586f20",
6 |    "metadata": {},
7 |    "source": [
8 |     "# Deploy ESM Embeddings Server on Amazon SageMaker\n",
9 |     "\n",
10 |     "Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
11 |     "SPDX-License-Identifier: MIT-0"
12 |    ]
13 |   },
14 |   {
15 |    "cell_type": "markdown",
16 |    "id": "97b562fb",
17 |    "metadata": {},
18 |    "source": [
19 |     "---\n",
20 |     "## 1. Setup"
21 |    ]
22 |   },
23 |   {
24 |    "cell_type": "markdown",
25 |    "id": "88a64f1b",
26 |    "metadata": {},
27 |    "source": [
28 |     "### 1.1. Create clients"
29 |    ]
30 |   },
31 |   {
32 |    "cell_type": "code",
33 |    "execution_count": null,
34 |    "id": "1c273482-ffb7-49af-a83f-19a7759a7621",
35 |    "metadata": {
36 |     "tags": []
37 |    },
38 |    "outputs": [],
39 |    "source": [
40 |     "import boto3\n",
41 |     "import sagemaker\n",
42 |     "\n",
43 |     "boto_session = boto3.session.Session()\n",
44 |     "sagemaker_session = sagemaker.session.Session(boto_session)\n",
45 |     "s3 = boto_session.resource(\"s3\")\n",
46 |     "region = boto_session.region_name\n",
47 |     "role = sagemaker.get_execution_role()"
48 |    ]
49 |   },
50 |   {
51 |    "cell_type": "markdown",
52 |    "id": "a6c5b022",
53 |    "metadata": {},
54 |    "source": [
55 |     "### 1.2. Build BioNeMo-Inference Container Image\n",
56 |     "\n",
57 |     "If you don't already have access to the BioNeMo-SageMaker container image, run the following cell to build and deploy it to your AWS account. Take note of the image URI - you'll use it for the processing and training steps below.\n",
58 |     "\n",
59 |     "Here is an example shell script you can use in your environment (including SageMaker Notebook Instances) to build the container.\n",
60 |     "\n",
61 |     "Once you have built and pushed the container, we strongly recommend using [ECR image scanning](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html) to ensure that it meets your security requirements."
62 |    ]
63 |   },
64 |   {
65 |    "cell_type": "code",
66 |    "execution_count": null,
67 |    "id": "2d24d513",
68 |    "metadata": {
69 |     "scrolled": true,
70 |     "tags": []
71 |    },
72 |    "outputs": [],
73 |    "source": [
74 |     "%%bash\n",
75 |     "\n",
76 |     "# The name of our algorithm\n",
77 |     "algorithm_name=bionemo-inference\n",
78 |     "\n",
79 |     "pushd container/inference\n",
80 |     "\n",
81 |     "account=$(aws sts get-caller-identity --query Account --output text)\n",
82 |     "\n",
83 |     "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n",
84 |     "region=$(aws configure get region)\n",
85 |     "region=${region:-us-west-2}\n",
86 |     "\n",
87 |     "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest\"\n",
88 |     "\n",
89 |     "# If the repository doesn't exist in ECR, create it.\n",
90 |     "aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n",
91 |     "\n",
92 |     "if [ $? -ne 0 ]\n",
93 |     "then\n",
94 |     "    aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n",
95 |     "fi\n",
96 |     "\n",
97 |     "# Get the login command from ECR and execute it directly\n",
98 |     "$(aws ecr get-login --region ${region} --no-include-email)\n",
99 |     "\n",
100 |     "# Build the docker image locally with the image name and then push it to ECR\n",
101 |     "# with the full name.\n",
102 |     "\n",
103 |     "docker build -t ${algorithm_name} .\n",
104 |     "docker tag ${algorithm_name} ${fullname}\n",
105 |     "\n",
106 |     "docker push ${fullname}\n",
107 |     "\n",
108 |     "popd"
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "markdown",
113 |    "id": "8f7bd546",
114 |    "metadata": {},
115 |    "source": [
116 |     "---\n",
117 |     "## 2. Deploy Real-Time Inference Endpoint"
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "markdown",
122 |    "id": "80794653",
123 |    "metadata": {},
124 |    "source": [
125 |     "### 2.1. Create esm1nv model"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "code",
130 |    "execution_count": null,
131 |    "id": "c89024a7-f1fa-47df-bd1a-987fa6e647ea",
132 |    "metadata": {
133 |     "tags": []
134 |    },
135 |    "outputs": [],
136 |    "source": [
137 |     "from sagemaker.model import Model\n",
138 |     "\n",
139 |     "# Replace this with your ECR repository URI from above\n",
140 |     "BIONEMO_IMAGE_URI = (\n",
141 |     "    \"<account-id>.dkr.ecr.<region>.amazonaws.com/bionemo-inference:latest\"\n",
142 |     ")\n",
143 |     "\n",
144 |     "esm_embeddings = Model(\n",
145 |     "    image_uri=BIONEMO_IMAGE_URI,\n",
146 |     "    name=\"esm-embeddings\",\n",
147 |     "    model_data=None,\n",
148 |     "    role=role,\n",
149 |     "    predictor_cls=sagemaker.predictor.Predictor,\n",
150 |     "    sagemaker_session=sagemaker_session,\n",
151 |     "    env={\"SM_SECRET_NAME\": \"NVIDIA_NGC_CREDS\", \"MODEL_NAME\": \"esm1nv\"},\n",
152 |     ")"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "markdown",
157 |    "id": "88734416",
158 |    "metadata": {},
159 |    "source": [
160 |     "### 2.2. Deploy model to SageMaker endpoint"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": null,
166 |    "id": "9e557430-556f-4185-8f43-f90c691ed7db",
167 |    "metadata": {
168 |     "tags": []
169 |    },
170 |    "outputs": [],
171 |    "source": [
172 |     "esm_embeddings_predictor = esm_embeddings.deploy(\n",
173 |     "    initial_instance_count=1,\n",
174 |     "    instance_type='ml.g5.xlarge',\n",
175 |     "    serializer = sagemaker.base_serializers.CSVSerializer(),\n",
176 |     "    deserializer = sagemaker.base_deserializers.NumpyDeserializer()\n",
177 |     ")"
178 |    ]
179 |   },
180 |   {
181 |    "cell_type": "markdown",
182 |    "id": "483388b3",
183 |    "metadata": {},
184 |    "source": [
185 |     "### 2.3. Test model"
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "code",
190 |    "execution_count": null,
191 |    "id": "61852f67-ae4e-4f17-86d0-5039e7fa94bd",
192 |    "metadata": {
193 |     "tags": []
194 |    },
195 |    "outputs": [],
196 |    "source": [
197 |     "esm_embeddings_predictor.predict(\"MSLKRKNIALIPAAGIGVRFGADKPKQYVEIGSKTVLEHVL,MIQSQINRNIRLDLADAILLSKAKKDLSFAEIADGTGLA\")"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "code",
202 |    "execution_count": null,
203 |    "id": "9a0ac500-4e63-41f7-9b52-9acdda34f84f",
204 |    "metadata": {},
205 |    "outputs": [],
206 |    "source": []
207 |   }
208 |  ],
209 |  "metadata": {
210 |   "kernelspec": {
211 |    "display_name": "conda_python3",
212 |    "language": "python",
213 |    "name": "conda_python3"
214 |   },
215 |   "language_info": {
216 |    "codemirror_mode": {
217 |     "name": "ipython",
218 |     "version": 3
219 |    },
220 |    "file_extension": ".py",
221 |    "mimetype": "text/x-python",
222 |    "name": "python",
223 |    "nbconvert_exporter": "python",
224 |    "pygments_lexer": "ipython3",
225 |    "version": "3.10.15"
226 |   }
227 |  },
228 |  "nbformat": 4,
229 |  "nbformat_minor": 5
230 | }
231 | 

--------------------------------------------------------------------------------
/img/1-setup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-with-nvidia-bionemo/218b86ab0e1202a6d1de8d02669190eda58289c7/img/1-setup.png

--------------------------------------------------------------------------------
/img/2-api-key.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-with-nvidia-bionemo/218b86ab0e1202a6d1de8d02669190eda58289c7/img/2-api-key.png

--------------------------------------------------------------------------------
/img/3-generate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-with-nvidia-bionemo/218b86ab0e1202a6d1de8d02669190eda58289c7/img/3-generate.png -------------------------------------------------------------------------------- /img/4-sm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-with-nvidia-bionemo/218b86ab0e1202a6d1de8d02669190eda58289c7/img/4-sm.png -------------------------------------------------------------------------------- /img/5-secret-type.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-with-nvidia-bionemo/218b86ab0e1202a6d1de8d02669190eda58289c7/img/5-secret-type.png -------------------------------------------------------------------------------- /img/bionemo-sm-arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-with-nvidia-bionemo/218b86ab0e1202a6d1de8d02669190eda58289c7/img/bionemo-sm-arch.png -------------------------------------------------------------------------------- /src/esm1nv-training.yaml: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | 4 | name: esm1nv 5 | do_training: True # set to false if data preprocessing steps must be completed 6 | do_testing: False # set to true to run evaluation on test data after training, requires test_dataset section 7 | restore_from_path: null # used when starting from a .nemo file 8 | 9 | trainer: 10 | # devices: 1 # number of GPUs or CPUs. Don't define here unless you want to override SM. 11 | # num_nodes: 1 # Number of instances. Don't define here unless you want to override SM. 
12 | accelerator: gpu # gpu or cpu 13 | precision: 16-mixed # 16-mixed or bf16-mixed 14 | logger: False # logger is provided by NeMo exp_manager 15 | enable_checkpointing: False # checkpointing is done by NeMo exp_manager 16 | use_distributed_sampler: False # use NeMo Megatron samplers 17 | max_epochs: null # use max_steps instead with NeMo Megatron model 18 | max_steps: 100 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches 19 | log_every_n_steps: 1 # number of iterations between logging 20 | val_check_interval: 25 21 | limit_val_batches: 1.0 # number of batches in validation step, use fraction for fraction of data, 0 to disable 22 | limit_test_batches: 0 # number of batches in test step, use fraction for fraction of data, 0 to disable 23 | accumulate_grad_batches: 1 24 | gradient_clip_val: 1.0 25 | benchmark: False 26 | 27 | exp_manager: 28 | name: ${name} 29 | exp_dir: ${oc.env:BIONEMO_HOME}/results/nemo_experiments/${.name}/${.wandb_logger_kwargs.name} 30 | explicit_log_dir: ${.exp_dir} 31 | create_wandb_logger: True 32 | create_tensorboard_logger: True 33 | wandb_logger_kwargs: 34 | project: ${name}_pretraining 35 | name: ${name}_pretraining 36 | group: ${name} 37 | job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices} 38 | notes: "date: ${now:%y%m%d-%H%M%S}" 39 | tags: 40 | - ${name} 41 | offline: True # set to True if there are issues uploading to WandB during training 42 | resume_if_exists: True # automatically resume if checkpoint exists 43 | resume_ignore_no_checkpoint: True # leave as True, will start new training if resume_if_exists is True but no checkpoint exists 44 | create_checkpoint_callback: True # leave as True, use exp_manager for checkpoints 45 | checkpoint_callback_params: 46 | monitor: val_loss 47 | save_top_k: 3 # number of checkpoints to save 48 | mode: min # use min or max of monitored metric to select best checkpoints 49 | always_save_nemo: False # saves nemo file during 
validation, not implemented for model parallel 50 | filename: "${name}--{val_loss:.2f}-{step}-{consumed_samples}" 51 | model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}} 52 | 53 | model: 54 | micro_batch_size: 96 # NOTE: adjust to occupy ~ 90% of GPU memory 55 | tensor_model_parallel_size: 1 # model parallelism 56 | pipeline_model_parallel_size: 1 # model parallelism. If enabled, you need to set data.dynamic_padding to False as pipeline parallelism requires fixed-length padding. 57 | # model architecture 58 | seq_length: 512 59 | max_position_embeddings: ${.seq_length} 60 | encoder_seq_length: ${.seq_length} 61 | num_layers: 6 62 | hidden_size: 768 63 | ffn_hidden_size: 3072 # Transformer FFN hidden size. Usually 4 * hidden_size. 64 | num_attention_heads: 12 65 | init_method_std: 0.02 # Standard deviation of the zero mean normal distribution used for weight initialization. 66 | hidden_dropout: 0.1 # Dropout probability for hidden state transformer. 67 | kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null 68 | apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number. 69 | layernorm_epsilon: 1e-5 70 | make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency. 71 | pre_process: True # add embedding 72 | post_process: True # add pooler 73 | bert_binary_head: False # BERT binary head 74 | resume_from_checkpoint: null # manually set the checkpoint file to load from 75 | masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask. 
76 | 77 | tokenizer: 78 | # Use ESM2 tokenizers from HF 79 | library: huggingface 80 | type: BertWordPieceLowerCase 81 | model_name: facebook/esm2_t33_650M_UR50D 82 | mask_id: 32 83 | model: null 84 | vocab_file: null 85 | merge_file: null 86 | 87 | # precision 88 | native_amp_init_scale: 4294967296 # 2 ** 32 89 | native_amp_growth_interval: 1000 90 | fp32_residual_connection: False # Move residual connections to fp32 91 | fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16 92 | 93 | # miscellaneous 94 | seed: 1234 95 | use_cpu_initialization: False # Init weights on the CPU (slow for large model) 96 | onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter. 97 | 98 | # not implemented in NeMo yet 99 | activations_checkpoint_method: null # 'uniform', 'block' 100 | activations_checkpoint_num_layers: 1 101 | 102 | data: 103 | dataset_path: /opt/ml/input/data 104 | dataset: 105 | train: x000 106 | val: x001 107 | # These control the MLM token probabilities. The following settings are commonly used in literature. 108 | modify_percent: 0.15 # Fraction of characters in a protein sequence to modify. 109 | perturb_percent: 0.1 # Of the modify_percent, what fraction of characters are to be replaced with another amino acid. 110 | mask_percent: 0.8 # Of the modify_percent, what fraction of characters are to be replaced with a mask token. 111 | identity_percent: 0.1 # Of the modify_percent, what fraction of characters are to be unchanged as the original amino acid. 112 | 113 | data_prefix: "" # must be null or "" 114 | num_workers: 1 115 | dataloader_type: single # cyclic 116 | reset_position_ids: False # Reset position ids after end-of-document token 117 | reset_attention_mask: False # Reset attention mask after end-of-document token 118 | eod_mask_loss: False # Mask loss for the end of document tokens 119 | masked_lm_prob: 0.15 # Probability of replacing a token with mask. 
120 | short_seq_prob: 0.1 # Probability of producing a short sequence. 121 | skip_lines: 0 122 | drop_last: False 123 | pin_memory: False 124 | index_mapping_dir: null # path to store cached indexing files (if empty, will be stored in the same directory as dataset_path) 125 | data_impl: "csv_mmap" 126 | data_impl_kwargs: 127 | csv_mmap: 128 | header_lines: 1 129 | newline_int: 10 # byte-value of newline 130 | workers: ${model.data.num_workers} # number of workers when creating missing index files (null defaults to cpu_num // 2) 131 | sort_dataset_paths: True # if True datasets will be sorted by name 132 | data_sep: "," # string to split text into columns 133 | data_col: 1 134 | use_upsampling: False # if the data should be upsampled to max number of steps in the training 135 | seed: ${model.seed} # Random seed 136 | max_seq_length: ${model.seq_length} # Maximum input sequence length. Longer sequences are truncated 137 | dynamic_padding: 138 | False # If True, each batch is padded to the maximum sequence length within that batch. 139 | # Set it to False when model.pipeline_model_parallel_size > 1, as pipeline parallelism requires fixed-length padding. 140 | 141 | optim: 142 | name: fused_adam # fused optimizers used by Megatron model 143 | lr: 2e-4 144 | weight_decay: 0.01 145 | betas: 146 | - 0.9 147 | - 0.98 148 | sched: 149 | name: CosineAnnealing 150 | warmup_steps: 500 # use to set warmup_steps explicitly or leave as null to calculate 151 | constant_steps: 50000 152 | min_lr: 2e-5 153 | 154 | dwnstr_task_validation: 155 | enabled: False 156 | -------------------------------------------------------------------------------- /src/train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | import argparse 5 | import boto3 6 | from botocore.exceptions import ClientError 7 | import json 8 | import logging 9 | import os 10 | import re 11 | import shutil 12 | import subprocess 13 | from datetime import timedelta 14 | import yaml 15 | 16 | import torch.distributed as dist 17 | 18 | NUM_GPUS = int(os.environ.get("SM_NUM_GPUS", 0)) 19 | HOSTS = json.loads(os.environ.get("SM_HOSTS", f'["{os.uname()[1]}"]')) 20 | NUM_HOSTS = len(HOSTS) 21 | 22 | os.environ["HYDRA_FULL_ERROR"] = "1" 23 | 24 | logging.basicConfig( 25 | format="%(asctime)s - %(levelname)s - %(message)s", 26 | datefmt="%m/%d/%Y %H:%M:%S", 27 | level=logging.INFO, 28 | ) 29 | 30 | 31 | def parse_args(): 32 | """Parse the arguments.""" 33 | logging.info("Parsing arguments") 34 | parser = argparse.ArgumentParser() 35 | 36 | parser.add_argument( 37 | "--config-path", 38 | type=str, 39 | default="/opt/ml/code", 40 | help="Path to config files in the container", 41 | ) 42 | parser.add_argument( 43 | "--config-name", 44 | type=str, 45 | default="train", 46 | help="Name of the config file for the run (without file extension)", 47 | ) 48 | 49 | parser.add_argument( 50 | "--model-name", 51 | type=str, 52 | default=None, 53 | choices=[ 54 | "diffdock_confidence", 55 | "diffdock_score", 56 | "equidock_db5", 57 | "equidock_dips", 58 | "esm1nv", 59 | "esm2nv_3b", 60 | "esm2nv_650m", 61 | "esm2_650m_huggingface", 62 | "esm2_3b_huggingface", 63 | "megamolbart", 64 | "prott5nv", 65 | ], 66 | help="Name of BioNeMo model to use for training", 67 | ) 68 | 69 | parser.add_argument( 70 | "--download-pretrained-weights", 71 | type=str, # keep as a string: SageMaker passes hyperparameters as strings, and argparse's type=bool would coerce any non-empty string (even "False") to True 72 | default="False", 73 | help="Download the pre-trained model checkpoint for fine-tuning?", 74 | ) 75 | 76 | parser.add_argument( 77 | "--ngc-cli-secret-name", 78 | type=str, 79 | default="NVIDIA_NGC_CREDS", 80 | help="Name of an AWS Secrets Manager secret containing NGC_CLI_API_KEY and NGC_CLI_ORG key/value pairs.", 81 | ) 82 | 83 | args, _ 
= parser.parse_known_args() 84 | return args 85 | 86 | 87 | def parse_model_path( 88 | model_name: str, 89 | root_path: str = "/workspace/bionemo/examples", 90 | ) -> str: 91 | """Parse the model path from the model name.""" 92 | 93 | if model_name == "megamolbart": 94 | model_path = "molecule/megamolbart/pretrain.py" 95 | elif model_name == "prott5nv": 96 | model_path = "protein/prott5nv/pretrain.py" 97 | elif model_name == "esm1nv": 98 | model_path = "protein/esm1nv/pretrain.py" 99 | elif re.match(r"diffdock", model_name): 100 | model_path = "molecule/diffdock/train.py" 101 | elif re.match(r"esm2", model_name): 102 | model_path = "protein/esm2nv/pretrain.py" 103 | elif re.match(r"equidock", model_name): 104 | model_path = "protein/equidock/pretrain.py" 105 | else: 106 | raise ValueError(f"Invalid model name: {model_name}") 107 | 108 | return os.path.join(root_path, model_path) 109 | 110 | 111 | def main(args): 112 | """Main function.""" 113 | 114 | # logging.info(f"Current environment variables are:\n{os.environ}") 115 | 116 | parsed_model_name = args.model_name or get_model_name_from_config( 117 | args.config_path, args.config_name 118 | ) 119 | 120 | training_script = parse_model_path(parsed_model_name) 121 | 122 | run_cmd = [ 123 | "/usr/bin/python", 124 | training_script, 125 | "--config-path", 126 | args.config_path, 127 | "--config-name", 128 | args.config_name, 129 | ] 130 | 131 | if args.download_pretrained_weights == "True": 132 | 133 | set_ngc_credentials(args.ngc_cli_secret_name) 134 | 135 | logging.info("Downloading pre-trained model checkpoint") 136 | model_path = os.getenv("MODEL_PATH") 137 | if not os.path.exists(model_path): 138 | os.makedirs(model_path) 139 | 140 | if not os.path.exists("artifact_paths.yaml"): 141 | shutil.copy( 142 | "/workspace/bionemo/artifact_paths.yaml", 143 | os.getcwd(), 144 | ) 145 | subprocess.run( 146 | [ 147 | "/usr/bin/python", 148 | "/workspace/bionemo/download_artifacts.py", 149 | parsed_model_name, 150 | 
"--source", 151 | "ngc", 152 | "--download_dir", 153 | model_path, 154 | ], 155 | check=True, 156 | ) 157 | downloaded_nemo_files = [ 158 | f for f in os.listdir(model_path) if f.endswith(".nemo") 159 | ] 160 | checkpoint_path = os.path.join(model_path, downloaded_nemo_files[0]) 161 | logging.info(f"Pre-trained model checkpoint downloaded to {checkpoint_path}") 162 | run_cmd.append(f"++restore_from_path={checkpoint_path}") 163 | 164 | run_cmd.append(f"++trainer.devices={NUM_GPUS}") 165 | run_cmd.append(f"++trainer.num_nodes={NUM_HOSTS}") 166 | 167 | logging.info( 168 | f"Running training script located at {training_script} with command:\n{run_cmd}" 169 | ) 170 | 171 | subprocess.run( 172 | run_cmd, 173 | check=True, 174 | ) 175 | 176 | logging.info("Training process complete") 177 | 178 | if os.environ["LOCAL_RANK"] == 0: 179 | 180 | results_path = os.path.join( 181 | os.getenv("BIONEMO_HOME"), "results/nemo_experiments" 182 | ) 183 | shutil.copytree(results_path, "/opt/ml/model/") 184 | 185 | 186 | def set_ngc_credentials(secret_name: str) -> None: 187 | """Get NVIDIA NGC API Key and org from AWS Secrets Manager""" 188 | 189 | # Create a Secrets Manager client 190 | client = boto3.client("secretsmanager", region_name=os.getenv("AWS_REGION")) 191 | 192 | logging.info("Retrieving NGC credentials from AWS Secrets Manager.") 193 | 194 | try: 195 | get_secret_value_response = client.get_secret_value(SecretId=secret_name) 196 | except ClientError as e: 197 | # For a list of exceptions thrown, see 198 | # https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html 199 | raise e 200 | 201 | creds = json.loads(get_secret_value_response["SecretString"]) 202 | 203 | logging.info("Setting NGC credentials as environment variables.") 204 | os.environ["NGC_CLI_API_KEY"] = creds.get("NGC_CLI_API_KEY", "") 205 | os.environ["NGC_CLI_ORG"] = creds.get("NGC_CLI_ORG", "") 206 | os.environ["NGC_CLI_TEAM"] = creds.get("NGC_CLI_TEAM", "") 207 | 
os.environ["NGC_CLI_FORMAT_TYPE"] = creds.get("NGC_CLI_FORMAT_TYPE", "ascii") 208 | 209 | return None 210 | 211 | 212 | def get_model_name_from_config( 213 | config_path: str = "/opt/ml/input/data/config", config_name: str = "train" 214 | ) -> str: 215 | """Get the model name from the config file.""" 216 | with open(os.path.join(config_path, f"{config_name}.yaml")) as f: 217 | config = yaml.safe_load(f) 218 | return config["name"] 219 | 220 | 221 | def init_distributed_training(args): 222 | """Initializes distributed training settings.""" 223 | 224 | try: 225 | backend = "smddp" 226 | import smdistributed.dataparallel.torch.torch_smddp 227 | except ModuleNotFoundError: 228 | backend = "nccl" 229 | print("Warning: SMDDP not found on this image, falling back to NCCL!") 230 | 231 | local_rank = int(os.environ["LOCAL_RANK"]) 232 | world_size = int(os.environ["WORLD_SIZE"]) 233 | global_rank = int(os.environ["RANK"]) 234 | 235 | if local_rank == 0: 236 | logging.info("Local Rank is : {}".format(os.environ["LOCAL_RANK"])) 237 | logging.info("Worldsize is : {}".format(os.environ["WORLD_SIZE"])) 238 | logging.info("Rank is : {}".format(os.environ["RANK"])) 239 | 240 | logging.info("Master address is : {}".format(os.environ["MASTER_ADDR"])) 241 | logging.info("Master port is : {}".format(os.environ["MASTER_PORT"])) 242 | 243 | dist.init_process_group( 244 | backend=backend, 245 | world_size=world_size, 246 | rank=global_rank, 247 | init_method="env://", 248 | timeout=timedelta(seconds=120), 249 | ) 250 | 251 | return local_rank, world_size, global_rank 252 | 253 | 254 | if __name__ == "__main__": 255 | args = parse_args() 256 | local_rank, world_size, global_rank = init_distributed_training(args) 257 | main(args) 258 | -------------------------------------------------------------------------------- /train-ESM.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | 
"source": [ 7 | "# ESM-1nv Training with BioNeMo on Amazon SageMaker\n", 8 | "\n", 9 | "Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", 10 | "SPDX-License-Identifier: MIT-0" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "---\n", 18 | "## 1. Setup" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### 1.1. Create clients" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": { 32 | "tags": [] 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "import boto3\n", 37 | "import os\n", 38 | "import sagemaker\n", 39 | "from time import strftime\n", 40 | "\n", 41 | "boto_session = boto3.session.Session()\n", 42 | "sagemaker_session = sagemaker.session.Session(boto_session)\n", 43 | "REGION_NAME = sagemaker_session.boto_region_name\n", 44 | "S3_BUCKET = sagemaker_session.default_bucket()\n", 45 | "S3_PREFIX = \"bionemo-training\"\n", 46 | "S3_FOLDER = sagemaker.s3.s3_path_join(\"s3://\", S3_BUCKET, S3_PREFIX)\n", 47 | "print(f\"S3 uri is {S3_FOLDER}\")\n", 48 | "\n", 49 | "EXPERIMENT_NAME = \"bionemo-training-\" + strftime(\"%Y-%m-%d\")\n", 50 | "\n", 51 | "SAGEMAKER_EXECUTION_ROLE = sagemaker.session.get_execution_role(sagemaker_session)\n", 52 | "print(f\"Assumed SageMaker role is {SAGEMAKER_EXECUTION_ROLE}\")" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "tags": [] 59 | }, 60 | "source": [ 61 | "### 1.2. Build BioNeMo-Training Container Image\n", 62 | "\n", 63 | "If you don't already have access to the BioNeMo-SageMaker container image, run the following cell to build and deploy it to your AWS account. 
Take note of the image URI - you'll use it for the processing and training steps below.\n", 64 | "\n", 65 | "Here is an example shell script you can use in your environment (including SageMaker Notebook Instances) to build the container.\n", 66 | "\n", 67 | "Once you have built and pushed the container, we strongly recommend using [ECR image scanning](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html) to ensure that it meets your security requirements." 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "NOTE: If you don't have access to a container build environment, one alternative is the [Amazon SageMaker Studio Image Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli)." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "scrolled": true, 82 | "tags": [] 83 | }, 84 | "outputs": [], 85 | "source": [ 86 | "%%bash\n", 87 | "\n", 88 | "# The name of our algorithm\n", 89 | "algorithm_name=bionemo-training\n", 90 | "\n", 91 | "pushd container/training\n", 92 | "\n", 93 | "account=$(aws sts get-caller-identity --query Account --output text)\n", 94 | "\n", 95 | "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n", 96 | "region=$(aws configure get region)\n", 97 | "region=${region:-us-west-2}\n", 98 | "\n", 99 | "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest\"\n", 100 | "\n", 101 | "# If the repository doesn't exist in ECR, create it.\n", 102 | "aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", 103 | "\n", 104 | "if [ $? 
-ne 0 ]\n", 105 | "then\n", 106 | " aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n", 107 | "fi\n", 108 | "\n", 109 | "# Get the login command from ECR and execute it directly\n", 110 | "$(aws ecr get-login --region ${region} --no-include-email)\n", 111 | "\n", 112 | "# Build the docker image locally with the image name and then push it to ECR\n", 113 | "# with the full name.\n", 114 | "\n", 115 | "docker build -t ${algorithm_name} .\n", 116 | "docker tag ${algorithm_name} ${fullname}\n", 117 | "\n", 118 | "docker push ${fullname}\n", 119 | "\n", 120 | "popd" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "---\n", 128 | "## 2. Data Processing" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "### 2.1. Query UniProt for human amino acid sequences between 100 and 500 residues in length" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": { 142 | "tags": [] 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "from io import BytesIO\n", 147 | "import pandas as pd\n", 148 | "import requests\n", 149 | "\n", 150 | "query_url = \"https://rest.uniprot.org/uniprotkb/stream?query=organism_id:9606+AND+reviewed=True+AND+length=[100+TO+500]&format=tsv&compressed=true&fields=accession,sequence\"\n", 151 | "uniprot_request = requests.get(query_url)\n", 152 | "bio = BytesIO(uniprot_request.content)\n", 153 | "\n", 154 | "df = pd.read_csv(bio, compression=\"gzip\", sep=\"\\t\")\n", 155 | "display(df)" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "### 2.2. 
Split Data and Upload to S3" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "tags": [] 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "train = df.sample(n=9600, random_state=42)\n", 174 | "val_test = df.drop(train.index)\n", 175 | "val = val_test.sample(n=960, random_state=42)\n", 176 | "test = val_test.drop(val.index).sample(n=960, random_state=42)\n", 177 | "del val_test\n", 178 | "\n", 179 | "print(f\"Training data size: {train.shape}\")\n", 180 | "print(f\"Validation data size: {val.shape}\")\n", 181 | "print(f\"Test data size: {test.shape}\")\n", 182 | "\n", 183 | "for dir in [\"train\", \"val\", \"test\"]:\n", 184 | " if not os.path.exists(os.path.join(\"data\", dir)):\n", 185 | " os.makedirs(os.path.join(\"data\", dir))\n", 186 | "\n", 187 | "train.to_csv(os.path.join(\"data\", \"train\", \"x000.csv\"), index=False)\n", 188 | "val.to_csv(os.path.join(\"data\", \"val\", \"x001.csv\"), index=False)\n", 189 | "test.to_csv(os.path.join(\"data\", \"test\", \"x002.csv\"), index=False)\n", 190 | "\n", 191 | "DATA_PREFIX = os.path.join(S3_PREFIX, \"data\")\n", 192 | "DATA_URI = sagemaker_session.upload_data(\n", 193 | " path=\"data\", bucket=S3_BUCKET, key_prefix=DATA_PREFIX\n", 194 | ")\n", 195 | "print(f\"Sequence data available at {DATA_URI}\")" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": { 201 | "tags": [] 202 | }, 203 | "source": [ 204 | "---\n", 205 | "## 3. Configure NVIDIA NGC API Credentials" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "Before you create a BioNeMo training job, follow these steps to generate some NGC API credentials and store them in AWS Secrets Manager. \n", 213 | "\n", 214 | "1. Sign in to or create a new account at NVIDIA [NGC](https://ngc.nvidia.com/signin).\n", 215 | "2. 
Select your name in the top-right corner of the screen and then \"Setup\".\n", 216 | "\n", 217 | "![Select Setup from the top-right menu](img/1-setup.png)\n", 218 | "\n", 219 | "3. Select \"Generate API Key\".\n", 220 | "\n", 221 | "![Select Generate API Key](img/2-api-key.png)\n", 222 | "\n", 223 | "4. Select the green \"+ Generate API Key\" button and confirm.\n", 224 | "\n", 225 | "![Select green Generate API Key button ](img/3-generate.png)\n", 226 | "\n", 227 | "5. Copy the API key - this is the last time you can retrieve it!\n", 228 | "\n", 229 | "6. Before you leave the NVIDIA NGC site, also take note of your organization ID listed under your name in the top-right corner of the screen. You'll need this, plus your API key, to download BioNeMo artifacts.\n", 230 | "\n", 231 | "7. Navigate to the AWS Console and then to AWS Secrets Manager.\n", 232 | "\n", 233 | "![Navigate to AWS Secrets Manager](img/4-sm.png)\n", 234 | "\n", 235 | "8. Select \"Store a new secret\".\n", 236 | "9. Under \"Secret type\" select \"Other type of secret\".\n", 237 | "\n", 238 | "![Select other type of secret](img/5-secret-type.png)\n", 239 | "\n", 240 | "10. Under \"Key/value\" pairs, add a key named \"NGC_CLI_API_KEY\" with a value of your NGC API key. Add another key named \"NGC_CLI_ORG\" with a value of your NGC organization. Select Next.\n", 241 | "\n", 242 | "11. Under \"Configure secret - Secret name and description\", name your secret \"NVIDIA_NGC_CREDS\" and select Next. You'll use this secret name when submitting BioNeMo jobs to SageMaker.\n", 243 | "\n", 244 | "12. Select the remaining default options to create your secret.\n" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "## 4. 
Submit ESM-1nv Training Job" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": { 258 | "scrolled": true, 259 | "tags": [] 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "import os\n", 264 | "from sagemaker.experiments.run import Run\n", 265 | "from sagemaker.pytorch import PyTorch\n", 266 | "\n", 267 | "# Replace this with your ECR repository URI from above\n", 268 | "BIONEMO_IMAGE_URI = (\n", 269 | " \".dkr.ecr..amazonaws.com/bionemo-training:latest\"\n", 270 | ")\n", 271 | "\n", 272 | "bionemo_estimator = PyTorch(\n", 273 | " base_job_name=\"bionemo-training\",\n", 274 | " distribution={\"torch_distributed\": {\"enabled\": True}},\n", 275 | " entry_point=\"train.py\",\n", 276 | " hyperparameters={\n", 277 | " \"config-name\": \"esm1nv-training\", # This is the name of your config file, without the extension\n", 278 | " \"model-name\": \"esm1nv\", # If you don't provide this as a hyperparameter, it will be inferred from the name field in the config file\n", 279 | " \"download-pretrained-weights\": True, # Required to fine-tune from pretrained weights. 
Set to False for pretraining.\n", 280 | " \"ngc-cli-secret-name\": \"NVIDIA_NGC_CREDS\" # Replace this if you used a different name above.\n", 281 | " },\n", 282 | " image_uri=BIONEMO_IMAGE_URI,\n", 283 | " instance_count=1, # Update this value for multi-node training\n", 284 | " instance_type=\"ml.g5.2xlarge\", # Update this value for other instance types\n", 285 | " output_path=os.path.join(S3_FOLDER, \"model\"),\n", 286 | " role=SAGEMAKER_EXECUTION_ROLE,\n", 287 | " sagemaker_session=sagemaker_session,\n", 288 | " source_dir=\"src\",\n", 289 | ")\n", 290 | "\n", 291 | "with Run(\n", 292 | " experiment_name=EXPERIMENT_NAME,\n", 293 | " sagemaker_session=sagemaker_session,\n", 294 | ") as run:\n", 295 | " bionemo_estimator.fit(\n", 296 | " inputs={\n", 297 | " \"train\": os.path.join(DATA_URI, \"train\"),\n", 298 | " \"val\": os.path.join(DATA_URI, \"val\"),\n", 299 | " },\n", 300 | " wait=False,\n", 301 | " )" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [] 317 | } 318 | ], 319 | "metadata": { 320 | "kernelspec": { 321 | "display_name": "conda_pytorch_p310", 322 | "language": "python", 323 | "name": "conda_pytorch_p310" 324 | }, 325 | "language_info": { 326 | "codemirror_mode": { 327 | "name": "ipython", 328 | "version": 3 329 | }, 330 | "file_extension": ".py", 331 | "mimetype": "text/x-python", 332 | "name": "python", 333 | "nbconvert_exporter": "python", 334 | "pygments_lexer": "ipython3", 335 | "version": "3.10.14" 336 | } 337 | }, 338 | "nbformat": 4, 339 | "nbformat_minor": 4 340 | } 341 | --------------------------------------------------------------------------------
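A note on calling the deployed endpoint from outside the notebooks: the deploy-ESM-embeddings-server notebook exercises the endpoint through `sagemaker.predictor.Predictor`, but once the endpoint is live any application can reach it with the low-level boto3 runtime client. Below is a minimal sketch; the endpoint name `esm-embeddings` is an assumption (SageMaker assigns the actual name at deploy time - check `esm_embeddings_predictor.endpoint_name` or the console), and the comma-separated payload mirrors the `CSVSerializer` the notebook attaches at deploy time.

```python
def build_csv_payload(sequences):
    """Join protein sequences into the comma-separated body that the
    notebook's CSVSerializer produces for predictor.predict()."""
    return ",".join(seq.strip() for seq in sequences)


def get_embeddings(sequences, endpoint_name="esm-embeddings"):
    """Invoke the deployed embeddings endpoint via the low-level API.

    `endpoint_name` is an assumption; substitute the name SageMaker
    assigned when esm_embeddings.deploy(...) ran.
    """
    import boto3  # AWS SDK for Python; preinstalled on SageMaker notebook kernels

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_csv_payload(sequences),
    )
    # Raw response bytes; the notebook's NumpyDeserializer performs the
    # equivalent parsing on the Predictor path.
    return response["Body"].read()
```

Calling `get_embeddings` with a list of amino-acid sequences sends the same kind of payload as the `esm_embeddings_predictor.predict(...)` test cell in the deploy notebook.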