├── .github ├── CODEOWNERS └── workflows │ ├── build-and-deploy.yml │ └── pre-commit.yml ├── .gitignore ├── .markdownlint.json ├── .pre-commit-config.yaml ├── .readthedocs.yaml ├── Makefile ├── README.md ├── ci └── release │ └── update-version.sh ├── extensions ├── rapids_admonitions.py ├── rapids_grid_toctree.py ├── rapids_notebook_files.py ├── rapids_related_examples.py └── rapids_version_templating.py ├── make.bat ├── pyproject.toml ├── scripts ├── gen_release_checklist_issue.py └── unused_images.sh ├── source ├── _includes │ ├── check-gpu-pod-works.md │ ├── install-rapids-with-docker.md │ ├── menus │ │ ├── aws.md │ │ ├── azure.md │ │ ├── ci.md │ │ ├── gcp.md │ │ ├── ibm.md │ │ └── nvidia.md │ └── test-rapids-docker-vm.md ├── _static │ ├── RAPIDS-logo-purple.png │ ├── azure-set-ports-inbound-sec.png │ ├── azure_availability_zone.PNG │ ├── css │ │ └── custom.css │ ├── daskworker.PNG │ ├── eightworkers.PNG │ ├── images │ │ ├── developer │ │ │ └── ci │ │ │ │ └── github-actions │ │ │ │ ├── new-hosted-runner.png │ │ │ │ └── new-runner-config.png │ │ ├── examples │ │ │ ├── rapids-1brc-single-node │ │ │ │ ├── dask-labextension-graphs.png │ │ │ │ ├── dask-labextension-processing.png │ │ │ │ ├── nvdashboard-resources.png │ │ │ │ └── nvdashboard-sidebar.png │ │ │ ├── rapids-sagemaker-hpo │ │ │ │ ├── cpu_hpo_100x10.png │ │ │ │ ├── gpu_hpo_100x10.png │ │ │ │ ├── hpo.png │ │ │ │ ├── ml_workflow.png │ │ │ │ ├── results.png │ │ │ │ ├── results_analysis.png │ │ │ │ └── run_hpo.png │ │ │ └── xgboost-rf-gpu-cpu-benchmark │ │ │ │ └── amazon-deeplearning-ami.png │ │ └── platforms │ │ │ ├── brev │ │ │ ├── brev1.png │ │ │ ├── brev2.png │ │ │ ├── brev3.png │ │ │ ├── brev4.png │ │ │ ├── brev5.png │ │ │ ├── brev6.png │ │ │ └── brev8.png │ │ │ ├── coiled │ │ │ ├── coiled-jupyter.png │ │ │ └── jupyter-on-coiled.png │ │ │ └── nvidia-ai-workbench │ │ │ ├── add-remote-system-dialog.png │ │ │ ├── create-project.png │ │ │ ├── cudf-example.png │ │ │ ├── new-project.png │ │ │ ├── open-jupyter.png │ │ │ ├── project-building.png │ │ │ └── rapids-with-cuda.png │ ├── js │ │ ├── nav.js │ │ └── notebook-gallery.js │ └── workingdask.PNG ├── _templates │ ├── feedback.html │ ├── notebooks-extra-files-nav.html │ ├── notebooks-tag-filter.html │ └── notebooks-tags.html ├── cloud │ ├── aws │ │ ├── ec2-multi.md │ │ ├── ec2.md │ │ ├── ecs.md │ │ ├── eks.md │ │ ├── index.md │ │ └── sagemaker.md │ ├── azure │ │ ├── aks.md │ │ ├── azure-vm-multi.md │ │ ├── azure-vm.md │ │ ├── azureml.md │ │ └── index.md │ ├── gcp │ │ ├── compute-engine.md │ │ ├── dataproc.md │ │ ├── gke.md │ │ ├── index.md │ │ └── vertex-ai.md │ ├── ibm │ │ ├── index.md │ │ └── virtual-server.md │ ├── index.md │ └── nvidia │ │ ├── brev.md │ │ └── index.md ├── conf.py ├── developer │ ├── ci │ │ ├── github-actions.md │ │ └── index.md │ └── index.md ├── examples │ ├── index.md │ ├── rapids-1brc-single-node │ │ ├── lookup.csv │ │ └── notebook.ipynb │ ├── rapids-autoscaling-multi-tenant-kubernetes │ │ ├── image-prepuller.yaml │ │ ├── notebook.ipynb │ │ ├── prometheus-stack-values.yaml │ │ └── rapids-notebook.yaml │ ├── rapids-azureml-hpo │ │ ├── notebook.ipynb │ │ ├── rapids_csp_azure.py │ │ └── train_rapids.py │ ├── rapids-coiled-cudf │ │ └── notebook.ipynb │ ├── rapids-ec2-mnmg │ │ └── notebook.ipynb │ ├── rapids-morpheus-pipeline │ │ ├── k8s │ │ │ ├── kafka-producer │ │ │ │ └── kafka-producer.yaml │ │ │ ├── kafka │ │ │ │ ├── kafka-create-topics.yaml │ │ │ │ ├── kafka-single-node.yaml │ │ │ │ └── kafka-ui.yaml │ │ │ ├── morpheus-pipeline │ │ │ │ └── 
morpheus-pipeline-deployment.yaml │ │ │ └── triton │ │ │ │ └── morpheus-triton-server.yaml │ │ ├── notebook.ipynb │ │ └── scripts │ │ │ ├── pipeline-dockerfile │ │ │ ├── Dockerfile │ │ │ ├── message_filter_stage.py │ │ │ ├── morpheus-nightly-env.yaml │ │ │ ├── network_traffic_analyzer_stage.py │ │ │ └── run_pipeline_kafka.py │ │ │ └── producer-dockerfile │ │ │ ├── Dockerfile │ │ │ ├── pcap_dump.jsonlines │ │ │ └── producer.py │ ├── rapids-optuna-hpo │ │ └── notebook.ipynb │ ├── rapids-sagemaker-higgs │ │ ├── .dockerignore │ │ ├── Dockerfile │ │ ├── entrypoint.sh │ │ ├── notebook.ipynb │ │ └── rapids-higgs.py │ ├── rapids-sagemaker-hpo │ │ ├── HPOConfig.py │ │ ├── HPODatasets.py │ │ ├── MLWorkflow.py │ │ ├── entrypoint.sh │ │ ├── helper_functions.py │ │ ├── notebook.ipynb │ │ ├── serve.py │ │ ├── train.py │ │ └── workflows │ │ │ ├── MLWorkflowMultiCPU.py │ │ │ ├── MLWorkflowMultiGPU.py │ │ │ ├── MLWorkflowSingleCPU.py │ │ │ └── MLWorkflowSingleGPU.py │ ├── rapids-snowflake-cudf │ │ └── notebook.ipynb │ ├── time-series-forecasting-with-hpo │ │ └── notebook.ipynb │ ├── xgboost-azure-mnmg-daskcloudprovider │ │ ├── configs │ │ │ └── cloud_init.yaml.j2 │ │ ├── notebook.ipynb │ │ └── trained-model_nyctaxi.xgb │ ├── xgboost-dask-databricks │ │ └── notebook.ipynb │ ├── xgboost-gpu-hpo-job-parallel-k8s │ │ └── notebook.ipynb │ ├── xgboost-gpu-hpo-mnmg-parallel-k8s │ │ └── notebook.ipynb │ ├── xgboost-randomforest-gpu-hpo-dask │ │ ├── notebook.ipynb │ │ └── rapids_hpo │ │ │ └── data │ │ │ └── airlines.parquet │ └── xgboost-rf-gpu-cpu-benchmark │ │ ├── hpo.py │ │ └── notebook.ipynb ├── guides │ ├── azure │ │ └── infiniband.md │ ├── caching-docker-images.md │ ├── colocate-workers.md │ ├── index.md │ ├── l4-gcp.md │ ├── mig.md │ ├── scheduler-gpu-optimization.md │ └── scheduler-gpu-requirements.md ├── hpc.md ├── images │ ├── azureml-access-datastore-uri.png │ ├── azureml-create-notebook-instance.png │ ├── azureml-provision-setup-script.png │ ├── azureml_returned_job_completed.png │ ├── databricks-choose-gpu-node.png │ ├── databricks-create-compute.png │ ├── databricks-custom-container.png │ ├── databricks-dask-cudf-example.png │ ├── databricks-dask-init-script.png │ ├── databricks-dask-logging.png │ ├── databricks-mnmg-dask-client.png │ ├── databricks-standard-runtime.png │ ├── databricks-worker-driver-node.png │ ├── docref-admonition.png │ ├── googlecolab-output-nvidia-smi.png │ ├── googlecolab-select-gpu-hardware-accelerator.png │ ├── googlecolab-select-runtime-type.png │ ├── kubeflow-configure-dashboard-option.png │ ├── kubeflow-create-notebook.png │ ├── kubeflow-dask-dashboard.png │ ├── kubeflow-jupyter-dask-cluster-widget.png │ ├── kubeflow-jupyter-dask-labextension.png │ ├── kubeflow-jupyter-example-notebook.png │ ├── kubeflow-jupyter-nvidia-smi.png │ ├── kubeflow-jupyter-using-dask.png │ ├── kubeflow-new-notebook.png │ ├── kubeflow-notebook-running.png │ ├── kubernetes-jupyter.png │ ├── morpheus-pipeline-KafkaUI_9MB.gif │ ├── sagemaker-choose-rapids-kernel.png │ ├── sagemaker-create-lifecycle-configuration.png │ ├── sagemaker-create-notebook-instance.png │ ├── snowflake_jupyter.png │ ├── theme-notebook-tags.png │ ├── theme-tag-style.png │ └── vertex-ai-launcher.png ├── index.md ├── local.md ├── nims.md ├── platforms │ ├── coiled.md │ ├── colab.md │ ├── databricks.md │ ├── index.md │ ├── kserve.md │ ├── kubeflow.md │ ├── kubernetes.md │ ├── nvidia-ai-workbench.md │ └── snowflake.md └── tools │ ├── dask-cuda.md │ ├── index.md │ ├── kubernetes │ ├── dask-helm-chart.md │ └── dask-operator.md │ └── 
rapids-docker.md └── uv.lock /.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @rapidsai/deployment-write 2 | -------------------------------------------------------------------------------- /.github/workflows/build-and-deploy.yml: -------------------------------------------------------------------------------- 1 | name: Build and deploy 2 | on: 3 | push: 4 | tags: 5 | - "*" 6 | branches: 7 | - main 8 | pull_request: 9 | 10 | # Required shell entrypoint to have properly activated conda environments 11 | defaults: 12 | run: 13 | shell: bash -l {0} 14 | 15 | permissions: 16 | id-token: write 17 | contents: read 18 | 19 | jobs: 20 | conda: 21 | name: Build (and deploy) 22 | runs-on: ubuntu-latest 23 | steps: 24 | - uses: actions/checkout@v4 25 | with: 26 | fetch-depth: 0 27 | 28 | - name: Install uv 29 | uses: astral-sh/setup-uv@v5 30 | 31 | - name: Build 32 | env: 33 | DEPLOYMENT_DOCS_BUILD_STABLE: ${{ startsWith(github.event.ref, 'refs/tags/') && 'true' || 'false' }} 34 | run: uv run make dirhtml SPHINXOPTS="-W --keep-going -n" 35 | 36 | - uses: aws-actions/configure-aws-credentials@v4 37 | if: ${{ github.repository == 'rapidsai/deployment' && github.event_name == 'push' }} 38 | with: 39 | role-to-assume: ${{ vars.AWS_ROLE_ARN }} 40 | aws-region: ${{ vars.AWS_REGION }} 41 | role-duration-seconds: 3600 # 1h 42 | 43 | - name: Sync HTML files to S3 44 | if: ${{ github.repository == 'rapidsai/deployment' && github.event_name == 'push' }} 45 | env: 46 | DESTINATION_DIR: ${{ startsWith(github.event.ref, 'refs/tags/') && 'stable' || 'nightly' }} 47 | run: aws s3 sync --no-progress --delete build/dirhtml "s3://rapidsai-docs/deployment/html/${DESTINATION_DIR}" 48 | -------------------------------------------------------------------------------- /.github/workflows/pre-commit.yml: -------------------------------------------------------------------------------- 1 | name: Linting 2 | 3 | on: 4 | push: 5 | pull_request: 6 | 7 | jobs: 8 | checks: 9 | name: "pre-commit hooks" 10 | runs-on: ubuntu-latest 11 | steps: 12 | - uses: actions/checkout@v4 13 | - uses: actions/setup-python@v5 14 | with: 15 | python-version: "3.12" 16 | - uses: pre-commit/action@v3.0.1 17 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.bz2 2 | *.crt 3 | *.csv 4 | *.csv.gz 5 | *.csv.zip 6 | data/ 7 | *.env 8 | *.parquet 9 | *.pem 10 | *.pub 11 | .ruff_cache 12 | *.tar.gz 13 | xgboost_hpo_logs.txt 14 | *.zip 15 | *.7z 16 | 17 | build 18 | *.swp 19 | 20 | __pycache__ 21 | .ipynb_checkpoints 22 | 23 | .DS_Store 24 | 25 | cufile.log 26 | node_modules/ 27 | jupyter_execute/ 28 | 29 | # files manually written by example code 30 | source/examples/rapids-azureml-hpo/Dockerfile 31 | source/examples/rapids-sagemaker-hpo/Dockerfile 32 | 33 | # exclusions 34 | !source/examples/rapids-1brc-single-node/lookup.csv 35 | -------------------------------------------------------------------------------- /.markdownlint.json: -------------------------------------------------------------------------------- 1 | { 2 | "default": true, 3 | "MD013": false, 4 | "MD014": false, 5 | "MD029": false, 6 | "MD033": false, 7 | "MD041": false 8 | } 9 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | # See https://pre-commit.com for more 
information 2 | # See https://pre-commit.com/hooks.html for more hooks 3 | repos: 4 | - repo: https://github.com/psf/black 5 | rev: 25.1.0 6 | hooks: 7 | - id: black-jupyter 8 | - repo: https://github.com/adamchainz/blacken-docs 9 | rev: 1.19.1 10 | hooks: 11 | - id: blacken-docs 12 | additional_dependencies: 13 | - black==23.1.0 14 | args: [--skip-errors] 15 | - repo: https://github.com/pre-commit/mirrors-prettier 16 | rev: "v4.0.0-alpha.8" 17 | hooks: 18 | - id: prettier 19 | - repo: https://github.com/igorshubovych/markdownlint-cli 20 | rev: v0.45.0 21 | hooks: 22 | - id: markdownlint 23 | - repo: https://github.com/charliermarsh/ruff-pre-commit 24 | rev: "v0.11.10" 25 | hooks: 26 | - id: ruff 27 | types_or: [jupyter, python] 28 | - repo: https://github.com/shellcheck-py/shellcheck-py 29 | rev: v0.10.0.1 30 | hooks: 31 | - id: shellcheck 32 | - repo: local 33 | hooks: 34 | - id: unused-images 35 | name: unused-images 36 | entry: bash scripts/unused_images.sh 37 | language: system 38 | pass_filenames: false 39 | always_run: true 40 | - repo: https://github.com/codespell-project/codespell 41 | rev: v2.4.1 42 | hooks: 43 | - id: codespell 44 | additional_dependencies: [tomli] 45 | exclude: "^.*.jsonlines$" 46 | args: ["--toml", "pyproject.toml"] 47 | 48 | default_language_version: 49 | python: python3 50 | -------------------------------------------------------------------------------- /.readthedocs.yaml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | build: 4 | os: "ubuntu-lts-latest" 5 | tools: 6 | python: "3.12" 7 | jobs: 8 | create_environment: 9 | - asdf plugin add uv 10 | - asdf install uv latest 11 | - asdf global uv latest 12 | - UV_PROJECT_ENVIRONMENT=$READTHEDOCS_VIRTUALENV_PATH uv sync 13 | install: 14 | - "true" 15 | 16 | sphinx: 17 | configuration: source/conf.py 18 | builder: dirhtml 19 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = source 9 | BUILDDIR = build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /ci/release/update-version.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright (c) 2024-2025, NVIDIA CORPORATION. 3 | ############################### 4 | # Deployment Version Updater # 5 | ############################### 6 | 7 | ## Usage 8 | # bash update-version.sh 9 | 10 | # Format is YY.MM.PP - no leading 'v' or trailing 'a' 11 | NEXT_FULL_TAG=$1 12 | 13 | # Get . 
for next version 14 | NEXT_MAJOR=$(echo "$NEXT_FULL_TAG" | awk '{split($0, a, "."); print a[1]}') 15 | NEXT_MINOR=$(echo "$NEXT_FULL_TAG" | awk '{split($0, a, "."); print a[2]}') 16 | NEXT_SHORT_TAG=${NEXT_MAJOR}.${NEXT_MINOR} 17 | 18 | # Calculate the next nightly version 19 | NEXT_MINOR_INT=$((10#$NEXT_MINOR)) 20 | NEXT_NIGHTLY_MINOR=$((NEXT_MINOR_INT + 2)) 21 | NEXT_NIGHTLY_MINOR=$(printf "%02d" $NEXT_NIGHTLY_MINOR) 22 | NEXT_NIGHTLY_TAG=${NEXT_MAJOR}.${NEXT_NIGHTLY_MINOR} 23 | 24 | echo "Preparing release $NEXT_FULL_TAG with next nightly version $NEXT_NIGHTLY_TAG" 25 | 26 | # Inplace sed replace; workaround for Linux and Mac 27 | function sed_runner() { 28 | sed -i.bak ''"$1"'' "$2" && rm -f "${2}".bak 29 | } 30 | 31 | # Update stable_version and nightly_version in conf.py 32 | sed_runner "s/stable_version = \"[0-9.]*\"/stable_version = \"${NEXT_SHORT_TAG}\"/" source/conf.py 33 | sed_runner "s/nightly_version = \"[0-9.]*\"/nightly_version = \"${NEXT_NIGHTLY_TAG}\"/" source/conf.py 34 | 35 | # Update container references in README.md 36 | sed_runner "s/\"rapids_container\": \"nvcr.io\/nvidia\/rapidsai\/base:[0-9.]*-/\"rapids_container\": \"nvcr.io\/nvidia\/rapidsai\/base:${NEXT_SHORT_TAG}-/" README.md 37 | sed_runner "s/\"rapids_container\": \"rapidsai\/base:[0-9.]*a-/\"rapids_container\": \"rapidsai\/base:${NEXT_NIGHTLY_TAG}a-/" README.md 38 | 39 | echo "Version update complete" 40 | -------------------------------------------------------------------------------- /extensions/rapids_admonitions.py: -------------------------------------------------------------------------------- 1 | from docutils.nodes import Text, admonition, inline, paragraph 2 | from docutils.parsers.rst.directives.admonitions import BaseAdmonition 3 | from sphinx.addnodes import pending_xref 4 | from sphinx.application import Sphinx 5 | from sphinx.util.docutils import SphinxDirective 6 | 7 | 8 | class Docref(BaseAdmonition, SphinxDirective): 9 | node_class = admonition 10 | required_arguments = 1 11 | 12 | def run(self): 13 | doc = self.arguments[0] 14 | self.arguments = ["See Documentation"] 15 | self.options["classes"] = ["admonition-docref"] 16 | nodes = super().run() 17 | custom_xref = pending_xref( 18 | reftype="myst", 19 | refdomain="std", 20 | refexplicit=True, 21 | reftarget=doc, 22 | refdoc=self.env.docname, 23 | refwarn=True, 24 | ) 25 | text_wrapper = inline() 26 | text_wrapper += Text("Visit the documentation >>") 27 | custom_xref += text_wrapper 28 | wrapper = paragraph() 29 | wrapper["classes"] = ["visit-link"] 30 | wrapper += custom_xref 31 | nodes[0] += wrapper 32 | return nodes 33 | 34 | 35 | def setup(app: Sphinx) -> dict: 36 | app.add_directive("docref", Docref) 37 | 38 | return { 39 | "version": "0.1", 40 | "parallel_read_safe": True, 41 | "parallel_write_safe": True, 42 | } 43 | -------------------------------------------------------------------------------- /extensions/rapids_grid_toctree.py: -------------------------------------------------------------------------------- 1 | from functools import partial 2 | 3 | from docutils import nodes 4 | from sphinx.application import Sphinx 5 | from sphinx.directives.other import TocTree 6 | from sphinx_design.grids import GridDirective 7 | 8 | 9 | def find_linked_documents(node): 10 | """Find all referenced documents in a node tree. 11 | 12 | Walks the nodes and yield the reftarget attribute for any that have it set. 
13 | 14 | """ 15 | for child in node.traverse(): 16 | try: 17 | if child.attributes["reftarget"]: 18 | yield child.attributes["reftarget"] 19 | except (AttributeError, KeyError): 20 | pass 21 | 22 | 23 | class CardGridTocTree(GridDirective): 24 | """An extension of sphinx_design.grids.GridDirective that also add referenced docs to the toctree. 25 | 26 | For any element within the grid which links to another page with the ``link-type`` ``doc`` the 27 | doc gets added to the toctree of that page. 28 | 29 | """ 30 | 31 | def run(self) -> list[nodes.Node]: 32 | output = nodes.container() 33 | 34 | # Generate the card grid 35 | grid = nodes.section(ids=["toctreegrid"]) 36 | grid += super().run()[0] 37 | output += grid 38 | 39 | # Update the content with the document names referenced in the card grid ready for toctree generation 40 | self.content.data = [doc for doc in find_linked_documents(grid)] 41 | 42 | # Generate the actual toctree but ensure it is hidden 43 | self.options["hidden"] = True 44 | self.parse_content = partial(TocTree.parse_content, self) 45 | toctree = TocTree.run(self)[0] 46 | output += toctree 47 | 48 | return [output] 49 | 50 | 51 | def setup(app: Sphinx) -> dict: 52 | app.add_directive("gridtoctree", CardGridTocTree) 53 | 54 | return { 55 | "version": "0.1", 56 | "parallel_read_safe": True, 57 | "parallel_write_safe": True, 58 | } 59 | -------------------------------------------------------------------------------- /extensions/rapids_notebook_files.py: -------------------------------------------------------------------------------- 1 | import contextlib 2 | import os 3 | import pathlib 4 | import re 5 | import shutil 6 | import tempfile 7 | from functools import partial 8 | 9 | 10 | def template_func(app, match): 11 | return app.builder.templates.render_string(match.group(), app.config.rapids_version) 12 | 13 | 14 | def walk_files(app, dir, outdir): 15 | outdir.mkdir(parents=True, exist_ok=True) 16 | related_notebook_files = {} 17 | for page in dir.glob("*"): 18 | if page.is_dir(): 19 | related_notebook_files[page.name] = walk_files( 20 | app, page, outdir / page.name 21 | ) 22 | else: 23 | with contextlib.suppress(OSError): 24 | os.remove(str(outdir / page.name)) 25 | if "ipynb" in page.name: 26 | with open(str(page)) as reader: 27 | notebook = reader.read() 28 | with open(str(outdir / page.name), "w") as writer: 29 | writer.write( 30 | re.sub( 31 | r"(? 
1: 68 | archive_path = path_to_output_parent / "all_files.zip" 69 | with contextlib.suppress(OSError): 70 | os.remove(str(archive_path)) 71 | with tempfile.NamedTemporaryFile() as tmpf: 72 | shutil.make_archive( 73 | tmpf.name, 74 | "zip", 75 | str(path_to_output_parent.parent), 76 | str(path_to_output_parent.name), 77 | ) 78 | shutil.move(tmpf.name + ".zip", str(archive_path)) 79 | context["related_notebook_files_archive"] = archive_path.name 80 | context["related_notebook_files"] = related_notebook_files 81 | 82 | 83 | def setup(app): 84 | app.add_config_value("rapids_deployment_notebooks_base_url", "", "html") 85 | app.connect("html-page-context", find_notebook_related_files) 86 | 87 | return { 88 | "version": "0.1", 89 | "parallel_read_safe": True, 90 | "parallel_write_safe": True, 91 | } 92 | -------------------------------------------------------------------------------- /extensions/rapids_version_templating.py: -------------------------------------------------------------------------------- 1 | import re 2 | from copy import deepcopy 3 | from typing import TYPE_CHECKING 4 | 5 | from docutils import nodes 6 | 7 | if TYPE_CHECKING: 8 | import sphinx 9 | 10 | 11 | class RapidsCustomNodeVisitor(nodes.SparseNodeVisitor): 12 | """ 13 | Post-process the text generated by Sphinx. 14 | 15 | ``docutils`` breaks documents down into different Python classes that 16 | roughly correspond to the HTML document object model ("DOM"). 17 | 18 | The only node types that will be modified by this class are those with 19 | a corresponding ``visit_{node_class}`` method defined. 20 | 21 | For a list of all the available types, see 22 | https://sourceforge.net/p/docutils/code/9881/tree/trunk/docutils/docutils/nodes.py#l2630 23 | """ 24 | 25 | def __init__(self, app: "sphinx.application.Sphinx", *args, **kwargs): 26 | self.app = app 27 | super().__init__(*args, **kwargs) 28 | 29 | def visit_reference(self, node: nodes.reference) -> None: 30 | """ 31 | Replace template strings in URLs. These are ``docutils.nodes.reference`` objects. 32 | 33 | See https://sourceforge.net/p/docutils/code/9881/tree/trunk/docutils/docutils/nodes.py#l2599 34 | """ 35 | # references to anchors will not have the "refuri" attribute. For example, markdown like this: 36 | # 37 | # [Option 1](use-an-Azure-marketplace-VM-image) 38 | # 39 | # Will have attributes like this: 40 | # 41 | # {'ids': [], 'classes': [], 'names': [], 'dupnames': [], 'backrefs': [], 42 | # 'internal': True, 'refid': 'use-an-azure-marketplace-vm-image'} 43 | # 44 | if "refuri" not in node.attributes: 45 | return 46 | 47 | # find templated bits in the URI and replace them with '{{' template markers that Jinja2 will understand 48 | uri_str = deepcopy(node.attributes)["refuri"] 49 | uri_str = re.sub(r"~~~(.*?)~~~", r"{{ \1 }}", uri_str) 50 | 51 | # fill in appropriate values based on app context 52 | node.attributes["refuri"] = re.sub( 53 | r"(? None: 61 | """ 62 | Replace template strings in generic text. 63 | This roughly corresponds to HTML ``

<p>``, ``<div>``, and similar elements.
 64 |         """
 65 |         new_node = nodes.Text(
 66 |             re.sub(r"(? str:
 72 |         """
 73 |         Replace template strings like ``{{ rapids_version }}`` with real
 74 |         values like ``24.10``.
 75 |         """
 76 |         return self.app.builder.templates.render_string(
 77 |             source=match.group(), context=self.app.config.rapids_version
 78 |         )
 79 | 
 80 | 
 81 | def version_template(
 82 |     app: "sphinx.application.Sphinx",
 83 |     doctree: "sphinx.addnodes.document",
 84 |     docname: str,
 85 | ) -> None:
 86 |     """Substitute versions into each page.
 87 | 
 88 |     This allows documentation pages and notebooks to substitute in values like
 89 |     the latest container image using jinja2 syntax.
 90 | 
 91 |     E.g.
 92 | 
 93 |         # My doc page
 94 | 
 95 |         The latest container image is {{ rapids_container }}.
 96 | 
 97 |     """
 98 |     doctree.walk(RapidsCustomNodeVisitor(app, doctree))
 99 | 
100 | 
101 | def setup(app: "sphinx.application.Sphinx") -> dict:
102 |     app.add_config_value("rapids_version", {}, "html")
103 |     app.connect("doctree-resolved", version_template)
104 | 
105 |     return {
106 |         "version": "0.1",
107 |         "parallel_read_safe": True,
108 |         "parallel_write_safe": True,
109 |     }
110 | 
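
For reference, here is a minimal sketch of the `rapids_version` Sphinx config value that this extension consumes. The real mapping is defined in `source/conf.py` (not reproduced in this listing); the keys below come from the docstrings above, and the values are illustrative assumptions only.

```python
# Hypothetical excerpt of what source/conf.py provides to this extension.
# The keys follow the docstrings above; the values are assumptions, not the
# pinned release values from the real conf.py.
rapids_version = {
    "rapids_version": "24.10",
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12",
}
```

A page can then write `{{ rapids_container }}` in body text, or `~~~rapids_container~~~` inside a link URL, and `RapidsCustomNodeVisitor` substitutes the real value when the doctree is resolved.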


--------------------------------------------------------------------------------
/make.bat:
--------------------------------------------------------------------------------
 1 | @ECHO OFF
 2 | 
 3 | pushd %~dp0
 4 | 
 5 | REM Command file for Sphinx documentation
 6 | 
 7 | if "%SPHINXBUILD%" == "" (
 8 | 	set SPHINXBUILD=sphinx-build
 9 | )
10 | set SOURCEDIR=source
11 | set BUILDDIR=build
12 | 
13 | %SPHINXBUILD% >NUL 2>NUL
14 | if errorlevel 9009 (
15 | 	echo.
16 | 	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
17 | 	echo.installed, then set the SPHINXBUILD environment variable to point
18 | 	echo.to the full path of the 'sphinx-build' executable. Alternatively you
19 | 	echo.may add the Sphinx directory to PATH.
20 | 	echo.
21 | 	echo.If you don't have Sphinx installed, grab it from
22 | 	echo.https://www.sphinx-doc.org/
23 | 	exit /b 1
24 | )
25 | 
26 | if "%1" == "" goto help
27 | 
28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29 | goto end
30 | 
31 | :help
32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33 | 
34 | :end
35 | popd
36 | 


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [project]
 2 | name = "rapids-deployment-docs"
 3 | version = "0.1.0"
 4 | requires-python = ">=3.12"
 5 | dependencies = [
 6 |     "sphinx>=8.2.3",
 7 |     "myst-nb>=1.1.2",
 8 |     "myst-parser>=4.0.0",
 9 |     "nbsphinx>=0.9.5",
10 |     "numpydoc>=1.8.0",
11 |     "pydata-sphinx-theme>=0.15.4",
12 |     "pre-commit>=3.8.0",
13 |     "sphinx>=8.0.2",
14 |     "sphinx-autobuild>=2024.9.19",
15 |     "sphinx-copybutton>=0.5.2",
16 |     "sphinx-design>=0.6.1",
17 |     "sphinxcontrib-mermaid>=1.0.0",
18 |     "python-frontmatter>=1.1.0",
19 |     "sphinx-reredirects"
20 | ]
21 | 
22 | [tool.codespell]
23 | # note: pre-commit passes explicit lists of files here, which this skip file list doesn't override -
24 | skip = "./.git,./pyproject.toml,./.ruff_cache"
25 | ignore-regex = "\\b(.{1,4}|[A-Z]\\w*T)\\b"
26 | builtin = "clear"
27 | quiet-level = 3
28 | 
29 | [tool.ruff]
30 | fix = true
31 | line-length = 120
32 | 
33 | [tool.ruff.lint]
34 | select = [
35 |     # pycodestyle
36 |     "E",
37 |     # pyflakes
38 |     "F",
39 |     # isort
40 |     "I",
41 |     # numpy
42 |     "NPY",
43 |     # pyupgrade
44 |     "UP",
45 |     # flake8-bugbear
46 |     "B"
47 | ]
48 | 
49 | [tool.ruff.lint.per-file-ignores]
50 | "source/examples/**/*.ipynb" = [
51 |     # "module level import not at top of cell".
52 |     # This is sometimes necessary, for example to ship a self-contained function
53 |     # around with Dask.
54 |     "E402",
55 | ]
56 | "source/examples/rapids-ec2-mnmg/notebook.ipynb" = [
57 |     # "undefined name cluster", because in this notebook we recommend, in a markdown
58 |     # cell, creating a Dask cluster separately and then running the rest of the notebook's code
59 |     "F821",
60 | ]
61 | "source/examples/rapids-sagemaker-higgs/notebook.ipynb" = [
62 |     # "Line too long", because of a 1-liner shell command starting with '!'
63 |     "E501",
64 | ]
65 | "source/examples/xgboost-dask-databricks/notebook.ipynb" = [
66 |     # "undefined name spark" because Databricks magically makes a SparkSession
67 |     # available with name 'spark'
68 |     # ref: https://docs.databricks.com/en/migration/spark.html#remove-sparksession-creation-commands
69 |     "F821",
70 | ]
71 | 


--------------------------------------------------------------------------------
/scripts/gen_release_checklist_issue.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | # Run this script to generate the release issue checklist for easy pasting into GitHub
 3 | 
 4 | from pathlib import Path
 5 | 
 6 | import frontmatter
 7 | 
 8 | # Get the full path to the directory where this script lives
 9 | script_name = Path(__file__).resolve()
10 | script_dir = script_name.parent
11 | 
12 | print(
13 |     """# Release checklist
14 | 
 15 | For the upcoming release we need to verify our documentation. This is a best-effort activity,
16 | so please refer to the checklist from the previous release and focus on pages that were not
17 | verified last time.
18 | 
19 | ## Verify pages
20 | 
21 | - Look at the nightly build of each page listed below
22 | - Check page renders correctly
23 | - Check for spelling/grammar problems
24 | - Check that the instructions work as expected
 25 | - Ensure legacy pages with out-of-date instructions have a content warning
 26 | - If a page needs updating, convert the task to an issue and open a PR that closes the issue
27 | 
28 | """
29 | )
30 | 
31 | priority_lists = {
32 |     "index": {"name": "Index/Non-technical", "pages": []},
33 |     "p0": {"name": "P0", "pages": []},
34 |     "p1": {"name": "P1", "pages": []},
35 |     "p2": {"name": "P2", "pages": []},
36 | }
37 | 
38 | # Walk all files recursively in the source directory
39 | for file in (script_dir.parent / "source").rglob("*"):
40 |     if file.is_file() and file.suffix in [".ipynb", ".md"]:
41 |         if "_includes" in file.parts:
42 |             continue
43 |         if ".ipynb_checkpoints" in file.parts:
44 |             continue
45 |         if "index.md" in file.parts:
46 |             rel_path = file.parent
47 |         else:
48 |             rel_path = file
49 |         rel_path = rel_path.relative_to(script_dir.parent / "source")
50 |         priority = "p2"
51 |         if file.suffix == ".md":
52 |             try:
53 |                 priority = str(frontmatter.load(file).metadata["review_priority"])
54 |             except KeyError:
55 |                 pass
56 |         elif file.suffix == ".ipynb":
57 |             # TODO - add support for ipynb review_priority
58 |             pass
59 | 
60 |         if rel_path.name:
61 |             rel_path = str(rel_path.with_suffix(""))
62 |         elif str(rel_path) == ".":
63 |             rel_path = ""
64 |         else:
65 |             rel_path = str(rel_path)
66 | 
67 |         file_info = {
68 |             "file": file,
69 |             "url": "https://docs.rapids.ai/deployment/nightly/" + rel_path,
70 |             "priority": priority,
71 |         }
72 |         if priority in priority_lists:
73 |             priority_lists[priority]["pages"].append(file_info)
74 |         else:
75 |             raise ValueError(f"Unknown review_priority '{priority}' for page {file}")
76 | 
77 | for data in priority_lists.values():
78 |     pages = data["pages"]
79 |     if not pages:
80 |         continue
81 |     print(f"### {data['name']}\n")
82 |     for page in sorted(pages, key=lambda x: x["url"]):
83 |         print(f"- [ ] {page['url']}")
84 |     print()
85 | 
86 | print(f"_Issue text generated by {script_name.parent.name}/{script_name.name}._")
87 | 


--------------------------------------------------------------------------------
/scripts/unused_images.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | set -e -u -o pipefail
 4 | 
 5 | imagepaths=$(find source -type f \( -iname "*.jpg" -o -iname "*.jpeg" -o -iname "*.png" -o -iname "*.gif" -o -iname "*.svg" \))
 6 | counter=0
 7 | 
 8 | for imagepath in $imagepaths; do
 9 |     filename=$(basename -- "$imagepath")
10 |     if ! grep -q -r "$filename" source README.md; then
11 |         echo "Found unused image $imagepath"
12 |         counter=$((counter+1))
13 |     fi
14 | done
15 | 
16 | if [ "$counter" -eq "0" ]; then
17 |     echo "No unused images found!"
18 | else
19 |     echo "Found $counter unused images"
20 |     exit 1
21 | fi
22 | 


--------------------------------------------------------------------------------
/source/_includes/check-gpu-pod-works.md:
--------------------------------------------------------------------------------
 1 | Let's create a sample Pod that uses some GPU compute to make sure that everything is working as expected.
 2 | 
 3 | ```bash
 4 | cat << EOF | kubectl create -f -
 5 | apiVersion: v1
 6 | kind: Pod
 7 | metadata:
 8 |   name: cuda-vectoradd
 9 | spec:
10 |   restartPolicy: OnFailure
11 |   containers:
12 |   - name: cuda-vectoradd
13 |     image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04"
14 |     resources:
15 |        limits:
16 |          nvidia.com/gpu: 1
17 | EOF
18 | ```
19 | 
20 | ```console
21 | $ kubectl logs pod/cuda-vectoradd
22 | [Vector addition of 50000 elements]
23 | Copy input data from the host memory to the CUDA device
24 | CUDA kernel launch with 196 blocks of 256 threads
25 | Copy output data from the CUDA device to the host memory
26 | Test PASSED
27 | Done
28 | ```
29 | 
30 | If you see `Test PASSED` in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly.
31 | 
32 | Next, clean up that Pod.
33 | 
34 | ```console
35 | $ kubectl delete pod cuda-vectoradd
36 | pod "cuda-vectoradd" deleted
37 | ```
38 | 


--------------------------------------------------------------------------------
/source/_includes/install-rapids-with-docker.md:
--------------------------------------------------------------------------------
 1 | There are several methods you can use to install RAPIDS, which you can see via the [RAPIDS release selector](https://docs.rapids.ai/install#selector).
 2 | 
 3 | For this example we are going to run the RAPIDS Docker container, so we need to know the name of the most recent container.
 4 | On the release selector choose **Docker** in the **Method** column.
 5 | 
 6 | Then copy the commands shown:
 7 | 
 8 | ```bash
 9 | docker pull {{ rapids_notebooks_container }}
10 | docker run --gpus all --rm -it \
11 |     --shm-size=1g --ulimit memlock=-1 \
12 |     -p 8888:8888 -p 8787:8787 -p 8786:8786 \
13 |     {{ rapids_notebooks_container }}
14 | ```
15 | 
16 | ```{note}
17 | If you see a "docker socket permission denied" error while running these commands try closing and reconnecting your
18 | SSH window. This happens because your user was added to the `docker` group only after you signed in.
19 | ```
20 | 


--------------------------------------------------------------------------------
/source/_includes/menus/aws.md:
--------------------------------------------------------------------------------
 1 | `````{grid} 1 2 2 3
 2 | :gutter: 2 2 2 2
 3 | 
 4 | ````{grid-item-card}
 5 | :link: /cloud/aws/ec2
 6 | :link-type: doc
 7 | Elastic Compute Cloud (EC2)
 8 | ^^^
 9 | Launch an EC2 instance and run RAPIDS.
10 | 
11 | {bdg}`single-node`
12 | ````
13 | 
14 | ````{grid-item-card}
15 | :link: /cloud/aws/ec2-multi
16 | :link-type: doc
17 | EC2 Cluster (with Dask)
18 | ^^^
19 | Launch a RAPIDS cluster on EC2 with Dask.
20 | 
21 | {bdg}`multi-node`
22 | ````
23 | 
24 | ````{grid-item-card}
25 | :link: /cloud/aws/eks
26 | :link-type: doc
27 | Elastic Kubernetes Service (EKS)
28 | ^^^
29 | Launch a RAPIDS cluster on managed Kubernetes.
30 | 
31 | {bdg}`multi-node`
32 | ````
33 | 
34 | ````{grid-item-card}
35 | :link: /cloud/aws/ecs
36 | :link-type: doc
37 | Elastic Container Service (ECS)
38 | ^^^
39 | Launch a RAPIDS cluster on managed container service.
40 | 
41 | {bdg}`multi-node`
42 | ````
43 | 
44 | ````{grid-item-card}
45 | :link: /cloud/aws/sagemaker
46 | :link-type: doc
47 | SageMaker
48 | ^^^
49 | Launch the RAPIDS container as a SageMaker notebook.
50 | 
51 | {bdg}`single-node`
52 | {bdg}`multi-node`
53 | ````
54 | 
55 | `````
56 | 


--------------------------------------------------------------------------------
/source/_includes/menus/azure.md:
--------------------------------------------------------------------------------
 1 | `````{grid} 1 2 2 3
 2 | :gutter: 2 2 2 2
 3 | 
 4 | ````{grid-item-card}
 5 | :link: /cloud/azure/azure-vm
 6 | :link-type: doc
 7 | Azure Virtual Machine
 8 | ^^^
 9 | Launch an Azure VM instance and run RAPIDS.
10 | 
11 | {bdg}`single-node`
12 | ````
13 | 
14 | ````{grid-item-card}
15 | :link: /cloud/azure/aks
16 | :link-type: doc
17 | Azure Kubernetes Service (AKS)
18 | ^^^
19 | Launch a RAPIDS cluster on managed Kubernetes.
20 | 
21 | {bdg}`multi-node`
22 | ````
23 | 
24 | ````{grid-item-card}
25 | :link: /cloud/azure/azure-vm-multi
26 | :link-type: doc
27 | Azure Cluster via Dask
28 | ^^^
29 | Launch a RAPIDS cluster on Azure VMs or Azure ML with Dask.
30 | 
31 | {bdg}`multi-node`
32 | ````
33 | 
34 | ````{grid-item-card}
35 | :link: /cloud/azure/azureml
36 | :link-type: doc
37 | Azure Machine Learning (Azure ML)
38 | ^^^
39 | Launch RAPIDS Experiment on Azure ML.
40 | 
41 | {bdg}`single-node`
42 | {bdg}`multi-node`
43 | ````
44 | 
45 | `````
46 | 


--------------------------------------------------------------------------------
/source/_includes/menus/ci.md:
--------------------------------------------------------------------------------
 1 | `````{grid} 1 2 2 3
 2 | :gutter: 2 2 2 2
 3 | 
 4 | ````{grid-item-card}
 5 | :link: /developer/ci/github-actions
 6 | :link-type: doc
 7 | GitHub Actions
 8 | ^^^
 9 | Run tests in GitHub Actions that depend on RAPIDS and NVIDIA GPUs.
10 | 
11 | {bdg}`single-node`
12 | ````
13 | `````
14 | 


--------------------------------------------------------------------------------
/source/_includes/menus/gcp.md:
--------------------------------------------------------------------------------
 1 | `````{grid} 1 2 2 3
 2 | :gutter: 2 2 2 2
 3 | 
 4 | ````{grid-item-card}
 5 | :link: /cloud/gcp/compute-engine
 6 | :link-type: doc
 7 | Compute Engine Instance
 8 | ^^^
 9 | Launch a Compute Engine instance and run RAPIDS.
10 | 
11 | {bdg}`single-node`
12 | ````
13 | 
14 | ````{grid-item-card}
15 | :link: /cloud/gcp/vertex-ai
16 | :link-type: doc
17 | Vertex AI
18 | ^^^
19 | Launch the RAPIDS container in Vertex AI managed notebooks.
20 | 
21 | {bdg}`single-node`
22 | ````
23 | 
24 | ````{grid-item-card}
25 | :link: /cloud/gcp/gke
26 | :link-type: doc
27 | Google Kubernetes Engine (GKE)
28 | ^^^
29 | Launch a RAPIDS cluster on managed Kubernetes.
30 | 
31 | {bdg}`multi-node`
32 | ````
33 | 
34 | ````{grid-item-card}
35 | :link: /cloud/gcp/dataproc
36 | :link-type: doc
37 | Dataproc
38 | ^^^
39 | Launch a RAPIDS cluster on Dataproc.
40 | 
41 | {bdg}`multi-node`
42 | ````
43 | 
44 | `````
45 | 


--------------------------------------------------------------------------------
/source/_includes/menus/ibm.md:
--------------------------------------------------------------------------------
 1 | `````{grid} 1 2 2 3
 2 | :gutter: 2 2 2 2
 3 | 
 4 | ````{grid-item-card}
 5 | :link: /cloud/ibm/virtual-server
 6 | :link-type: doc
 7 | IBM Virtual Server
 8 | ^^^
 9 | Launch a virtual server and run RAPIDS.
10 | 
11 | {bdg}`single-node`
12 | ````
13 | 
14 | `````
15 | 


--------------------------------------------------------------------------------
/source/_includes/menus/nvidia.md:
--------------------------------------------------------------------------------
 1 | `````{grid} 1 2 2 3
 2 | :gutter: 2 2 2 2
 3 | 
 4 | ````{grid-item-card}
 5 | :link: /cloud/nvidia/brev
 6 | :link-type: doc
 7 | Brev.dev
 8 | ^^^
 9 | Deploy and run RAPIDS on NVIDIA Brev.
10 | 
11 | {bdg}`single-node`
12 | ````
13 | 
14 | `````
15 | 


--------------------------------------------------------------------------------
/source/_includes/test-rapids-docker-vm.md:
--------------------------------------------------------------------------------
 1 | To access Jupyter, navigate to `<VM ip address>:8888` in the browser.
 2 | 
 3 | In a Python notebook, check that you can import and use RAPIDS libraries like `cudf`.
 4 | 
 5 | ```ipython
 6 | In [1]: import cudf
 7 | In [2]: df = cudf.datasets.timeseries()
 8 | In [3]: df.head()
 9 | Out[3]:
10 |                        id     name         x         y
11 | timestamp
12 | 2000-01-01 00:00:00  1020    Kevin  0.091536  0.664482
13 | 2000-01-01 00:00:01   974    Frank  0.683788 -0.467281
14 | 2000-01-01 00:00:02  1000  Charlie  0.419740 -0.796866
15 | 2000-01-01 00:00:03  1019    Edith  0.488411  0.731661
16 | 2000-01-01 00:00:04   998    Quinn  0.651381 -0.525398
17 | ```
18 | 
19 | Open `cudf/10min.ipynb` and execute the cells to explore more of how `cudf` works.
20 | 
 21 | When running a Dask cluster you can also visit `<VM ip address>:8787` to monitor the Dask cluster status.
22 | 
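
If you want the dashboard on port 8787 to have something to show, here is a minimal sketch for starting a local GPU Dask cluster from a notebook inside the container (this assumes `dask-cuda` is available, as it is in the RAPIDS containers):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per visible GPU; the dashboard listens on port 8787 by default.
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)
```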


--------------------------------------------------------------------------------
/source/_static/RAPIDS-logo-purple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/RAPIDS-logo-purple.png


--------------------------------------------------------------------------------
/source/_static/azure-set-ports-inbound-sec.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/azure-set-ports-inbound-sec.png


--------------------------------------------------------------------------------
/source/_static/azure_availability_zone.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/azure_availability_zone.PNG


--------------------------------------------------------------------------------
/source/_static/css/custom.css:
--------------------------------------------------------------------------------
  1 | nav.bd-links fieldset legend {
  2 |   color: var(--pst-color-text-base);
  3 |   font-weight: var(--pst-sidebar-header-font-weight);
  4 |   font-size: 1em;
  5 | }
  6 | 
  7 | nav.bd-links fieldset input {
  8 |   margin-left: 1em;
  9 |   margin-right: 0.25em;
 10 | }
 11 | 
 12 | .bd-links__title small {
 13 |   float: right;
 14 |   padding-right: 2em;
 15 | }
 16 | 
 17 | nav.related-files {
 18 |   font-family: var(--pst-font-family-monospace);
 19 | }
 20 | 
 21 | nav.bd-links fieldset .sd-badge {
 22 |   font-size: 0.9em;
 23 | }
 24 | 
 25 | /* Admonitions */
 26 | div.admonition.admonition-docref {
 27 |   border-color: var(--sd-color-primary);
 28 | }
 29 | 
 30 | div.admonition.admonition-docref > .admonition-title {
 31 |   color: var(--sd-color-primary-text);
 32 |   background: var(--sd-color-primary);
 33 | }
 34 | 
 35 | div.admonition.admonition-docref > .admonition-title::after {
 36 |   color: var(--sd-color-primary-text);
 37 |   content: "\f02d";
 38 | }
 39 | 
 40 | .docref .visit-link {
 41 |   width: 100%;
 42 |   text-align: right;
 43 |   padding-right: 2em;
 44 |   margin-top: -1.15rem;
 45 | }
 46 | 
 47 | .tagwrapper {
 48 |   margin-top: 0.25em;
 49 | }
 50 | 
 51 | /* Tag colours */
 52 | .sd-badge {
 53 |   /* Defaults */
 54 |   color: var(--sd-color-primary-text);
 55 |   background-color: var(--sd-color-primary);
 56 |   border-left: 0.5em var(--sd-color-primary) solid;
 57 |   padding-left: 0.25em !important;
 58 | }
 59 | 
 60 | .tag-dask,
 61 | .tag-dask-kubernetes,
 62 | .tag-dask-operator,
 63 | .tag-dask-yarn,
 64 | .tag-dask-gateway,
 65 | .tag-dask-jobqueue,
 66 | .tag-dask-helm-chart,
 67 | .tag-dask-cloudprovider,
 68 | .tag-dask-ml {
 69 |   color: #262326;
 70 |   background-color: #ffc11e !important;
 71 |   border-left: 0.5em #ffc11e solid;
 72 | }
 73 | 
 74 | .tag-kubernetes,
 75 | .tag-kubeflow {
 76 |   background-color: #3069de;
 77 |   border-left: 0.5em #3069de solid;
 78 | }
 79 | 
 80 | .tag-aws {
 81 |   color: #222e3c;
 82 |   background-color: #f79700;
 83 |   border-left: 0.5em #f79700 solid;
 84 | }
 85 | 
 86 | .tag-gcp {
 87 |   background-color: #0f9d58;
 88 |   border-left: 0.5em #0f9d58 solid;
 89 | }
 90 | 
 91 | .tag-optuna {
 92 |   background-color: #045895;
 93 |   border-left: 0.5em #045895 solid;
 94 | }
 95 | 
 96 | .tag-numpy {
 97 |   background-color: #4ba6c9;
 98 |   border-left: 0.5em #4670c8 solid;
 99 | }
100 | 
101 | .tag-scikit-learn {
102 |   color: #030200;
103 |   background-color: #f09436;
104 |   border-left: 0.5em #3194c7 solid;
105 | }
106 | 
107 | .tag-data-format {
108 |   background-color: #cc539d;
109 |   border-left: 0.5em #cc539d solid;
110 | }
111 | 
112 | .tag-data-storage {
113 |   background-color: #53a8cc;
114 |   border-left: 0.5em #53a8cc solid;
115 | }
116 | 
117 | .tag-workflow {
118 |   background-color: #348653;
119 |   border-left: 0.5em #348653 solid;
120 | }
121 | 


--------------------------------------------------------------------------------
/source/_static/daskworker.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/daskworker.PNG


--------------------------------------------------------------------------------
/source/_static/eightworkers.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/eightworkers.PNG


--------------------------------------------------------------------------------
/source/_static/images/developer/ci/github-actions/new-hosted-runner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/developer/ci/github-actions/new-hosted-runner.png


--------------------------------------------------------------------------------
/source/_static/images/developer/ci/github-actions/new-runner-config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/developer/ci/github-actions/new-runner-config.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-1brc-single-node/dask-labextension-graphs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-1brc-single-node/dask-labextension-graphs.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-1brc-single-node/dask-labextension-processing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-1brc-single-node/dask-labextension-processing.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-1brc-single-node/nvdashboard-resources.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-1brc-single-node/nvdashboard-resources.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-1brc-single-node/nvdashboard-sidebar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-1brc-single-node/nvdashboard-sidebar.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/cpu_hpo_100x10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/cpu_hpo_100x10.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/gpu_hpo_100x10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/gpu_hpo_100x10.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/hpo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/hpo.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/ml_workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/ml_workflow.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/results.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/results_analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/results_analysis.png


--------------------------------------------------------------------------------
/source/_static/images/examples/rapids-sagemaker-hpo/run_hpo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/rapids-sagemaker-hpo/run_hpo.png


--------------------------------------------------------------------------------
/source/_static/images/examples/xgboost-rf-gpu-cpu-benchmark/amazon-deeplearning-ami.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/examples/xgboost-rf-gpu-cpu-benchmark/amazon-deeplearning-ami.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev1.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev2.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev3.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev4.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev5.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev6.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/brev/brev8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/brev/brev8.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/coiled/coiled-jupyter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/coiled/coiled-jupyter.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/coiled/jupyter-on-coiled.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/coiled/jupyter-on-coiled.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/add-remote-system-dialog.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/add-remote-system-dialog.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/create-project.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/create-project.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/cudf-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/cudf-example.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/new-project.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/new-project.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/open-jupyter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/open-jupyter.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/project-building.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/project-building.png


--------------------------------------------------------------------------------
/source/_static/images/platforms/nvidia-ai-workbench/rapids-with-cuda.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/images/platforms/nvidia-ai-workbench/rapids-with-cuda.png


--------------------------------------------------------------------------------
/source/_static/js/nav.js:
--------------------------------------------------------------------------------
 1 | document.addEventListener("DOMContentLoaded", function () {
 2 |   let sidebar = document.getElementsByClassName("bd-sidebar-primary")[0];
 3 |   sidebar.innerHTML =
 4 |     `
 5-19 |     <!-- HTML markup lost in extraction: the injected block adds "Docs Home" and "Deployment Home" links plus a "nightly" / "stable" version selector (rapids-selector__container-version) -->
20 | ` + sidebar.innerHTML; 21 | 22 | let versionSection = document.getElementById( 23 | "rapids-selector__container-version", 24 | ); 25 | let selectorSelected = versionSection.getElementsByClassName( 26 | "rapids-selector__selected", 27 | )[0]; 28 | if (window.location.href.includes("/deployment/stable")) { 29 | selectorSelected.innerHTML = "stable"; 30 | versionSection 31 | .getElementsByClassName("rapids-selector__menu-item") 32 | .forEach((element) => { 33 | if (element.innerHTML.includes("stable")) { 34 | element.classList.add("rapids-selector__menu-item--selected"); 35 | } 36 | }); 37 | } else if (window.location.href.includes("/deployment/nightly")) { 38 | selectorSelected.innerHTML = "nightly"; 39 | versionSection 40 | .getElementsByClassName("rapids-selector__menu-item") 41 | .forEach((element) => { 42 | if (element.innerHTML.includes("nightly")) { 43 | element.classList.add("rapids-selector__menu-item--selected"); 44 | } 45 | }); 46 | } else { 47 | selectorSelected.innerHTML = "dev"; 48 | let menu = versionSection.getElementsByClassName( 49 | "rapids-selector__menu", 50 | )[0]; 51 | menu.innerHTML = 52 | menu.innerHTML + 53 | 'dev'; 54 | menu.style["height"] = "97px"; 55 | } 56 | }); 57 | -------------------------------------------------------------------------------- /source/_static/js/notebook-gallery.js: -------------------------------------------------------------------------------- 1 | document.addEventListener("DOMContentLoaded", function () { 2 | var setURLFilters = function (filters) { 3 | var newAdditionalURL = ""; 4 | var tempArray = window.location.href.split("?"); 5 | var baseURL = tempArray[0]; 6 | var additionalURL = tempArray[1]; 7 | var temp = ""; 8 | if (additionalURL) { 9 | tempArray = additionalURL.split("&"); 10 | for (var i = 0; i < tempArray.length; i++) { 11 | if (tempArray[i].split("=")[0] != "filters") { 12 | newAdditionalURL += temp + tempArray[i]; 13 | temp = "&"; 14 | } 15 | } 16 | } 17 | if (filters.length) { 18 | newAdditionalURL += temp + "filters=" + filters.join(","); 19 | } 20 | if (newAdditionalURL) { 21 | window.history.replaceState("", "", baseURL + "?" 
+ newAdditionalURL); 22 | } else { 23 | window.history.replaceState("", "", baseURL); 24 | } 25 | }; 26 | 27 | var getUrlFilters = function () { 28 | let search = new URLSearchParams(window.location.search); 29 | let filters = search.get("filters"); 30 | if (filters) { 31 | return filters.split(","); 32 | } 33 | }; 34 | 35 | var tagFilterListener = function () { 36 | // Get filter checkbox status 37 | filterTagRoots = []; // Which sections are we filtering on 38 | filterTags = []; // Which tags are being selected 39 | Array.from(document.getElementsByClassName("tag-filter")).forEach( 40 | (checkbox) => { 41 | if (checkbox.checked) { 42 | let tag = checkbox.getAttribute("id"); 43 | filterTags.push(checkbox.getAttribute("id")); 44 | let root = tag.split("/")[0]; 45 | if (!filterTagRoots.includes(root)) { 46 | filterTagRoots.push(root); 47 | } 48 | } 49 | }, 50 | ); 51 | 52 | setURLFilters(filterTags); 53 | 54 | // Iterate notebook cards 55 | Array.from(document.getElementsByClassName("sd-col")).forEach( 56 | (notebook) => { 57 | let isFiltered = false; 58 | 59 | // Get tags from the card 60 | let tags = []; 61 | Array.from(notebook.getElementsByClassName("sd-badge")).forEach( 62 | (tag) => { 63 | tags.push(tag.getAttribute("aria-label")); 64 | }, 65 | ); 66 | 67 | // Iterate each of the sections we are filtering on 68 | filterTagRoots.forEach((rootTag) => { 69 | // If a notebook has no tags with the current root tag then it is definitely filtered 70 | if ( 71 | !tags.some((tag) => { 72 | return tag.startsWith(rootTag); 73 | }) 74 | ) { 75 | isFiltered = true; 76 | } else { 77 | // Get filter tags with the current root we are testing 78 | let tagsWithRoot = []; 79 | filterTags.forEach((filteredTag) => { 80 | if (filteredTag.startsWith(rootTag)) { 81 | tagsWithRoot.push(filteredTag); 82 | } 83 | }); 84 | 85 | // If the notebook tags and filter tags don't intersect it is filtered 86 | if (!tags.some((item) => tagsWithRoot.includes(item))) { 87 | isFiltered = true; 88 | } 89 | } 90 | }); 91 | 92 | // Show/hide the card 93 | if (isFiltered) { 94 | notebook.setAttribute("style", "display:none !important"); 95 | } else { 96 | notebook.setAttribute("style", "display:flex"); 97 | } 98 | }, 99 | ); 100 | }; 101 | 102 | // Add listener for resetting the filters 103 | let resetButton = document.getElementById("resetfilters"); 104 | if (resetButton != undefined) { 105 | resetButton.addEventListener( 106 | "click", 107 | function () { 108 | Array.from(document.getElementsByClassName("tag-filter")).forEach( 109 | (checkbox) => { 110 | checkbox.checked = false; 111 | }, 112 | ); 113 | tagFilterListener(); 114 | }, 115 | false, 116 | ); 117 | } 118 | 119 | // Add listeners to all checkboxes for triggering filtering 120 | Array.from(document.getElementsByClassName("tag-filter")).forEach( 121 | (checkbox) => { 122 | checkbox.addEventListener("change", tagFilterListener, false); 123 | }, 124 | ); 125 | 126 | // Simplify tags and add class for styling 127 | // It's not possible to control these attributes in Sphinx otherwise we would 128 | Array.from(document.getElementsByClassName("sd-badge")).forEach((tag) => { 129 | tag.setAttribute("aria-label", tag.innerHTML); 130 | try { 131 | tag 132 | .getAttribute("aria-label") 133 | .split("/") 134 | .forEach((subtag) => tag.classList.add(`tag-${subtag}`)); 135 | } catch (err) {} 136 | 137 | if (tag.innerHTML.includes("/")) { 138 | tag.innerHTML = tag.innerHTML.split("/").slice(1).join("/"); 139 | } 140 | }); 141 | 142 | // Set checkboxes initial state 143 | var 
initFilters = getUrlFilters(); 144 | if (initFilters) { 145 | Array.from(document.getElementsByClassName("tag-filter")).forEach( 146 | (checkbox) => { 147 | if (initFilters.includes(checkbox.id)) { 148 | checkbox.checked = true; 149 | } 150 | }, 151 | ); 152 | tagFilterListener(); 153 | } 154 | }); 155 | -------------------------------------------------------------------------------- /source/_static/workingdask.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/_static/workingdask.PNG -------------------------------------------------------------------------------- /source/_templates/feedback.html: -------------------------------------------------------------------------------- 1 |

2-10 | <!-- HTML markup lost in extraction: a feedback box with the text "Spotted a mistake?" and a "Let us know!" link -->

11 | -------------------------------------------------------------------------------- /source/_templates/notebooks-extra-files-nav.html: -------------------------------------------------------------------------------- 1 | {% if related_notebook_files %} {% macro gen_list(root, dir, related_files) -%} 2 | {{ dir }} 3 | 18 | {%- endmacro %} 19 |
20 | Related files 21 |
22 | 32 | {% endif %} 33 | -------------------------------------------------------------------------------- /source/_templates/notebooks-tag-filter.html: -------------------------------------------------------------------------------- 1 | 28 | -------------------------------------------------------------------------------- /source/_templates/notebooks-tags.html: -------------------------------------------------------------------------------- 1 | {% if notebook_tags %} 2 |
Tags
3 | 12 | {% endif %} 13 | -------------------------------------------------------------------------------- /source/cloud/aws/ec2-multi.md: -------------------------------------------------------------------------------- 1 | # EC2 Cluster (via Dask) 2 | 3 | To launch a multi-node cluster on AWS EC2 we recommend you use [Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/), a native cloud integration for Dask. It helps manage Dask clusters on different cloud platforms. 4 | 5 | ## Local Environment Setup 6 | 7 | Before running these instructions, ensure you have installed RAPIDS. 8 | 9 | ```{note} 10 | This method of deploying RAPIDS effectively allows you to burst beyond the node you are on into a cluster of EC2 VMs. This does come with the caveat that you are on a RAPIDS capable environment with GPUs. 11 | ``` 12 | 13 | If you are using a machine with an NVIDIA GPU then follow the [local install instructions](https://docs.rapids.ai/install). Alternatively if you do not have a GPU locally consider using a remote environment like a [SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html). 14 | 15 | ### Install the AWS CLI 16 | 17 | Install the AWS CLI tools following the [official instructions](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). 18 | 19 | ### Install Dask Cloud Provider 20 | 21 | Also install `dask-cloudprovider` and ensure you select the `aws` optional extras. 22 | 23 | ```console 24 | $ pip install "dask-cloudprovider[aws]" 25 | ``` 26 | 27 | ## Cluster setup 28 | 29 | We'll now setup the [EC2Cluster](https://cloudprovider.dask.org/en/latest/aws.html#elastic-compute-cloud-ec2) from Dask Cloud Provider. 30 | 31 | To do this, you'll first need to run `aws configure` and ensure the credentials are updated. [Learn more about the setup](https://cloudprovider.dask.org/en/latest/aws.html#authentication). The API also expects a security group that allows access to ports 8786-8787 and all traffic between instances in the security group. If you do not pass a group here, `dask-cloudprovider` will create one for you. 32 | 33 | ```python 34 | from dask_cloudprovider.aws import EC2Cluster 35 | 36 | cluster = EC2Cluster( 37 | instance_type="g4dn.12xlarge", # 4 T4 GPUs 38 | docker_image="{{ rapids_container }}", 39 | worker_class="dask_cuda.CUDAWorker", 40 | worker_options={"rmm-managed-memory": True}, 41 | security_groups=[""], 42 | docker_args="--shm-size=256m", 43 | n_workers=3, 44 | security=False, 45 | availability_zone="us-east-1a", 46 | region="us-east-1", 47 | ) 48 | ``` 49 | 50 | ```{warning} 51 | Instantiating this class can take upwards of 30 minutes. See the [Dask docs](https://cloudprovider.dask.org/en/latest/packer.html) on prebuilding AMIs to speed this up. 52 | ``` 53 | 54 | ````{dropdown} If you have non-default credentials you may need to pass your credentials manually. 55 | :color: info 56 | :icon: info 57 | 58 | Here's a small utility for parsing credential profiles. 
59 | 60 | ```python 61 | import os 62 | import configparser 63 | import contextlib 64 | 65 | 66 | def get_aws_credentials(*, aws_profile="default"): 67 | parser = configparser.RawConfigParser() 68 | parser.read(os.path.expanduser("~/.aws/config")) 69 | config = parser.items( 70 | f"profile {aws_profile}" if aws_profile != "default" else "default" 71 | ) 72 | parser.read(os.path.expanduser("~/.aws/credentials")) 73 | credentials = parser.items(aws_profile) 74 | all_credentials = {key.upper(): value for key, value in [*config, *credentials]} 75 | with contextlib.suppress(KeyError): 76 | all_credentials["AWS_REGION"] = all_credentials.pop("REGION") 77 | return all_credentials 78 | ``` 79 | 80 | ```python 81 | cluster = EC2Cluster(..., env_vars=get_aws_credentials(aws_profile="foo")) 82 | ``` 83 | 84 | ```` 85 | 86 | ## Connecting a client 87 | 88 | Once your cluster has started you can connect a Dask client to submit work. 89 | 90 | ```python 91 | from dask.distributed import Client 92 | 93 | client = Client(cluster) 94 | ``` 95 | 96 | ```python 97 | import cudf 98 | import dask_cudf 99 | 100 | df = dask_cudf.from_cudf(cudf.datasets.timeseries(), npartitions=2) 101 | df.x.mean().compute() 102 | ``` 103 | 104 | ## Clean up 105 | 106 | When you create your cluster Dask Cloud Provider will register a finalizer to shutdown the cluster. So when your Python process exits the cluster will be cleaned up. 107 | 108 | You can also explicitly shutdown the cluster with: 109 | 110 | ```python 111 | client.close() 112 | cluster.close() 113 | ``` 114 | 115 | ```{relatedexamples} 116 | 117 | ``` 118 | -------------------------------------------------------------------------------- /source/cloud/aws/ec2.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p1" 3 | --- 4 | 5 | # Elastic Compute Cloud (EC2) 6 | 7 | ## Create Instance 8 | 9 | Create a new [EC2 Instance](https://aws.amazon.com/ec2/) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). 10 | 11 | NVIDIA maintains an [Amazon Machine Image (AMI) that pre-installs NVIDIA drivers and container runtimes](https://aws.amazon.com/marketplace/pp/prodview-7ikjtg3um26wq), we recommend using this image as the starting point. 12 | 13 | 1. Open the [**EC2 Dashboard**](https://console.aws.amazon.com/ec2/home). 14 | 1. Select **Launch Instance**. 15 | 1. In the AMI selection box search for "nvidia", then switch to the **AWS Marketplace AMIs** tab. 16 | 1. Select **NVIDIA GPU-Optimized AMI** and click "Select". Then, in the new popup, select **Subscribe on Instance Launch**. 17 | 1. In **Key pair** select your SSH keys (create these first if you haven't already). 18 | 1. Under network settings create a security group (or choose an existing) with inbound rules that allows SSH access on 19 | port `22` and also allow ports `8888,8786,8787` to access Jupyter and Dask. For outbound rules, allow all traffic. 20 | 1. Select **Launch**. 21 | 22 | ## Connect to the instance 23 | 24 | Next we need to connect to the instance. 25 | 26 | 1. Open the [**EC2 Dashboard**](https://console.aws.amazon.com/ec2/home). 27 | 2. Locate your VM and note the **Public IP Address**. 28 | 3. In your terminal run `ssh ubuntu@`. 29 | 30 | ```{note} 31 | If you use the AWS Console, please use the default `ubuntu` user to ensure the NVIDIA driver installs on the first boot. 
32 | ``` 33 | 34 | ````{tip} 35 | Depending on where your ssh key is, when connecting via SSH you might need to do 36 | 37 | ```bash 38 | ssh -i /your-key-file.pem ubuntu@ 39 | ``` 40 | 41 | If you get prompted with a `WARNING: UNPROTECTED PRIVATE KEY FILE!`, and get a 42 | **"Permission denied"** as a result of this. 43 | 44 | Change the permissions of your key file to be less permissive by doing 45 | `chmod 400 your_key_file.pem`, and you should be good to go. 46 | ```` 47 | 48 | ## Install RAPIDS 49 | 50 | ```{include} ../../_includes/install-rapids-with-docker.md 51 | 52 | ``` 53 | 54 | ```{note} 55 | If you see a "modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-1011-aws" while first connecting to the EC2 instance, try logging out and reconnecting again. 56 | ``` 57 | 58 | ## Test RAPIDS 59 | 60 | ```{include} ../../_includes/test-rapids-docker-vm.md 61 | 62 | ``` 63 | 64 | ```{relatedexamples} 65 | 66 | ``` 67 | -------------------------------------------------------------------------------- /source/cloud/aws/ecs.md: -------------------------------------------------------------------------------- 1 | # Elastic Container Service (ECS) 2 | 3 | RAPIDS can be deployed on a multi-node ECS cluster using Dask’s dask-cloudprovider management tools. For more details, see our **[blog post on 4 | deploying on ECS.](https://medium.com/rapids-ai/getting-started-with-rapids-on-aws-ecs-using-dask-cloud-provider-b1adfdbc9c6e)** 5 | 6 | ## Run from within AWS 7 | 8 | The following steps assume you are running from within the same AWS VPC. One way to ensure this is to use 9 | [AWS EC2 Single Instance](https://docs.rapids.ai/deployment/stable/cloud/aws/ec2.html) as your development environment. 10 | 11 | ### Setup AWS credentials 12 | 13 | First, you will need AWS credentials to interact with the AWS CLI. If someone else manages your AWS account, you will need to 14 | get these keys from them.
15 | 16 | You can provide these credentials to dask-cloudprovider in a number of ways, but the easiest is to setup your 17 | local environment using the AWS command line tools: 18 | 19 | ```shell 20 | $ pip install awscli 21 | $ aws configure 22 | ``` 23 | 24 | ### Install dask-cloudprovider 25 | 26 | To install, you will need to run the following: 27 | 28 | ```shell 29 | $ pip install dask-cloudprovider[aws] 30 | ``` 31 | 32 | ## Create an ECS cluster 33 | 34 | In the AWS console, visit the ECS dashboard and on the left-hand side, click “Clusters” then **Create Cluster** 35 | 36 | Give the cluster a name e.g.`rapids-cluster` 37 | 38 | For Networking, select the default VPC and all the subnets available in that VPC 39 | 40 | Select "Amazon EC2 instances" for the Infrastructure type and configure your settings: 41 | 42 | - Operating system: must be Linux-based architecture 43 | - EC2 instance type: must support RAPIDS-compatible GPUs ([see the RAPIDS docs](https://docs.rapids.ai/install#system-req)) 44 | - Desired capacity: number of maximum instances to launch (default maximum 5) 45 | - SSH Key pair 46 | 47 | Review your settings then click on the "Create" button and wait for the cluster creation to complete. 48 | 49 | ## Create a Dask cluster 50 | 51 | Get the Amazon Resource Name (ARN) for the cluster you just created. 52 | 53 | Set `AWS_REGION` environment variable to your **[default region](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions)**, for instance `us-east-1` 54 | 55 | ```shell 56 | AWS_REGION=[REGION] 57 | ``` 58 | 59 | Create the ECSCluster object in your Python session: 60 | 61 | ```python 62 | from dask_cloudprovider.aws import ECSCluster 63 | 64 | cluster = ECSCluster( 65 | cluster_arn= "", 66 | n_workers=, 67 | worker_gpu=, 68 | skip_cleaup=True, 69 | scheduler_timeout="20 minutes", 70 | ) 71 | ``` 72 | 73 | ````{note} 74 | When you call this command for the first time, `ECSCluster()` will automatically create a **security group** with the same name as the ECS cluster you created above.. 75 | 76 | However, if the Dask cluster creation fails or you'd like to reuse the same ECS cluster for subsequent runs of `ECSCluster()`, then you will need to provide this security group value. 77 | 78 | ```shell 79 | security_groups=["sg-0fde781be42651"] 80 | 81 | ```` 82 | 83 | [**cluster_arn**] = ARN of an existing ECS cluster to use for launching tasks
84 | 85 | [**num_workers**] = number of workers to start on cluster creation
86 | 87 | [**num_gpus**] = number of GPUs to expose to the worker; this must be less than or equal to the number of GPUs in the instance type you selected for the ECS cluster (e.g. `1` for `p3.2xlarge`).<br />
88 | 89 | [**skip_cleanup**] = if `True`, Dask workers won't be automatically terminated when the cluster is shut down<br />
90 | 91 | [**execution_role_arn**] = ARN of the IAM role that allows the Dask cluster to create and manage ECS resources
92 | 93 | [**task_role_arn**] = ARN of the IAM role that the Dask workers assume when they run
94 | 95 | [**scheduler_timeout**] = maximum time scheduler will wait for workers to connect to the cluster 96 | 97 | ## Test RAPIDS 98 | 99 | Create a distributed client for our cluster: 100 | 101 | ```python 102 | from dask.distributed import Client 103 | 104 | client = Client(cluster) 105 | ``` 106 | 107 | Load sample data and test the cluster! 108 | 109 | ```python 110 | import dask, cudf, dask_cudf 111 | 112 | ddf = dask.datasets.timeseries() 113 | gdf = ddf.map_partitions(cudf.from_pandas) 114 | gdf.groupby("name").id.count().compute().head() 115 | ``` 116 | 117 | ```shell 118 | Out[34]: 119 | Xavier 99495 120 | Oliver 100251 121 | Charlie 99354 122 | Zelda 99709 123 | Alice 100106 124 | Name: id, dtype: int64 125 | ``` 126 | 127 | ## Cleanup 128 | 129 | You can scale down or delete the Dask cluster, but the ECS cluster will continue to run (and incur charges!) until you also scale it down or shut down altogether.
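As a rough illustration, here is one way to do that from the AWS CLI. This is a hedged sketch: it assumes the cluster's EC2 capacity is managed by an Auto Scaling group, and the cluster and group names below are placeholders.

```shell
# Scale the underlying EC2 capacity to zero (the ECS cluster itself is kept)
$ aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name <your-ecs-asg-name> \
    --min-size 0 --desired-capacity 0

# Or delete the ECS cluster entirely once its container instances are deregistered
$ aws ecs delete-cluster --cluster <your-cluster-name>
```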
130 | 131 | If you are planning to use the ECS cluster again soon, it is probably preferable to reduce the nodes to zero. 132 | 133 | ```{relatedexamples} 134 | 135 | ``` 136 | -------------------------------------------------------------------------------- /source/cloud/aws/eks.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p1" 3 | --- 4 | 5 | # AWS Elastic Kubernetes Service (EKS) 6 | 7 | RAPIDS can be deployed on AWS via the [Elastic Kubernetes Service](https://aws.amazon.com/eks/) (EKS). 8 | 9 | To run RAPIDS you'll need a Kubernetes cluster with GPUs available. 10 | 11 | ## Prerequisites 12 | 13 | First you'll need to have the [`aws` CLI tool](https://aws.amazon.com/cli/) and [`eksctl` CLI tool](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), for managing Kubernetes. 14 | 15 | Ensure you are logged into the `aws` CLI. 16 | 17 | ```console 18 | $ aws configure 19 | ``` 20 | 21 | ## Create the Kubernetes cluster 22 | 23 | Now we can launch a GPU enabled EKS cluster with `eksctl`. 24 | 25 | ```{note} 26 | 1. You will need to create or import a public SSH key to be able to execute the following command. 27 | In your aws console under `EC2` in the side panel under Network & Security > Key Pairs, you can create a 28 | key pair or import (see "Actions" dropdown) one you've created locally. 29 | 30 | 2. If you are not using your default AWS profile, add `--profile ` to the following command. 31 | ``` 32 | 33 | ```console 34 | $ eksctl create cluster rapids \ 35 | --version 1.30 \ 36 | --nodes 3 \ 37 | --node-type=g4dn.xlarge \ 38 | --timeout=40m \ 39 | --ssh-access \ 40 | --ssh-public-key \ # Name assigned during creation of your key in aws console 41 | --region us-east-1 \ 42 | --zones=us-east-1c,us-east-1b,us-east-1d \ 43 | --auto-kubeconfig 44 | ``` 45 | 46 | With this command, you’ve launched an EKS cluster called `rapids`. You’ve specified that it should use nodes of type `p3.8xlarge`. We also specified that we don't want to install the NVIDIA drivers as we will do that with the NVIDIA operator. 47 | 48 | To access the cluster we need to pull down the credentials. 49 | Add `--profile ` if you are not using the default profile. 50 | 51 | ```console 52 | $ aws eks --region us-east-1 update-kubeconfig --name rapids 53 | ``` 54 | 55 | ## Install drivers 56 | 57 | As we selected a GPU node type EKS will automatically install drivers for us. We can verify this by listing the NVIDIA driver plugin Pods. 58 | 59 | ```console 60 | $ kubectl get po -n kube-system -l name=nvidia-device-plugin-ds 61 | NAME READY STATUS RESTARTS AGE 62 | nvidia-device-plugin-daemonset-kv7t5 1/1 Running 0 52m 63 | nvidia-device-plugin-daemonset-rhmvx 1/1 Running 0 52m 64 | nvidia-device-plugin-daemonset-thjhc 1/1 Running 0 52m 65 | ``` 66 | 67 | ```{note} 68 | By default this plugin will install the latest version on the NVIDIA drivers on every Node. If you need more control over your driver installation we recommend that when creating your cluster you set `eksctl create cluster --install-nvidia-plugin=false ...` and then install drivers yourself using the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html). 69 | ``` 70 | 71 | After you have confirmed your drivers are installed, you are ready to test your cluster. 
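Before running the full test below, a quick sanity check is to confirm that each node actually advertises GPUs to the scheduler. This is a minimal sketch; the escaped `nvidia\.com/gpu` key in the `custom-columns` expression is an assumption that may need adjusting for your `kubectl` version.

```console
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu"
```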
72 | 73 | ```{include} ../../_includes/check-gpu-pod-works.md 74 | 75 | ``` 76 | 77 | ## Install RAPIDS 78 | 79 | Now that you have a GPU enabled Kubernetes cluster on EKS you can install RAPIDS with [any of the supported methods](../../platforms/kubernetes). 80 | 81 | ## Clean up 82 | 83 | You can also delete the EKS cluster to stop billing with the following command. 84 | 85 | ```console 86 | $ eksctl delete cluster --region=us-east-1 --name=rapids 87 | Deleting cluster rapids...⠼ 88 | ``` 89 | 90 | ```{relatedexamples} 91 | 92 | ``` 93 | -------------------------------------------------------------------------------- /source/cloud/aws/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Amazon Web Services 7 | 8 | ```{include} ../../_includes/menus/aws.md 9 | 10 | ``` 11 | 12 | RAPIDS can be deployed on Amazon Web Services (AWS) in several ways. See the 13 | [list of accelerated instance types](https://aws.amazon.com/ec2/instance-types/) below: 14 | 15 | | Cloud
Provider | Inst. Type | Inst. Name | GPU Count | GPU Type | xGPU RAM | xGPU
RAM Total | 16 | | :------------------ | --------------- | --------------- | -------------- | ------------- | ------------- | ------------------: | 17 | | AWS | P5 | p5\.48xlarge | 8 | H100 | 80 (GB) | 640 (GB) | 18 | | AWS | P4 | p4d\.24xlarge | 8 | A100 | 40 (GB) | 320 (GB) | 19 | | AWS | P3 | p3dn\.24xlarge | 8 | V100 | 32 (GB) | 256 (GB) | 20 | | AWS | P3 | p3\.16xlarge | 8 | V100 | 16 (GB) | 128 (GB) | 21 | | AWS | P3 | p3\.8xlarge | 4 | V100 | 16 (GB) | 64 (GB) | 22 | | AWS | P3 | p3\.2xlarge | 1 | V100 | 16 (GB) | 16 (GB) | 23 | | AWS | G6 | g6\.48xlarge | 8 | L4 | 24 (GB) | 192 (GB) | 24 | | AWS | G6 | g6\.24xlarge | 4 | L4 | 24 (GB) | 96 (GB) | 25 | | AWS | G6 | gr6\.8xlarge | 1 | L4 | 24 (GB) | 24 (GB) | 26 | | AWS | G5 | g5\.48xlarge | 8 | A10G | 24 (GB) | 192 (GB) | 27 | | AWS | G5 | g5\.24xlarge | 4 | A10G | 24 (GB) | 96 (GB) | 28 | | AWS | G5 | g5\.16xlarge | 1 | A10G | 24 (GB) | 24 (GB) | 29 | | AWS | G4dn | g4dn\.metal | 8 | T4 | 16 (GB) | 128 (GB) | 30 | | AWS | G4dn | g4dn\.12xlarge | 4 | T4 | 16 (GB) | 64 (GB) | 31 | | AWS | G4dn | g4dn\.xlarge | 1 | T4 | 16 (GB) | 16 (GB) | 32 | 33 | ```{toctree} 34 | --- 35 | hidden: true 36 | --- 37 | ec2 38 | ec2-multi 39 | eks 40 | ecs 41 | sagemaker 42 | ``` 43 | -------------------------------------------------------------------------------- /source/cloud/aws/sagemaker.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p0" 3 | --- 4 | 5 | # SageMaker 6 | 7 | RAPIDS can be used in a few ways with [AWS SageMaker](https://aws.amazon.com/sagemaker/). 8 | 9 | ## SageMaker Notebooks 10 | 11 | To get started head to [the SageMaker console](https://console.aws.amazon.com/sagemaker/) and create a [new SageMaker Notebook Instance](https://console.aws.amazon.com/sagemaker/home#/notebook-instances/create). 12 | 13 | Choose `Applications and IDEs > Notebooks > Create notebook instance`. 14 | 15 | ### Select your instance 16 | 17 | If a field is not mentioned below, leave the default values: 18 | 19 | - **Notebook instance name** = Name of the notebook instance 20 | - **Notebook instance type** = Type of notebook instance. Select a RAPIDS-compatible GPU ([see the RAPIDS docs](https://docs.rapids.ai/install#system-req)) as the SageMaker Notebook instance type (e.g., `ml.p3.2xlarge`). 21 | - **Platform identifier** = 'Amazon Linux 2, Jupyter Lab 4' 22 | 23 | ![Screenshot of the create new notebook screen with a ml.p3.2xlarge selected](../../images/sagemaker-create-notebook-instance.png) 24 | 25 | ### Create a RAPIDS lifecycle configuration 26 | 27 | [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) can be augmented with a RAPIDS conda environment. 28 | 29 | We can add a RAPIDS conda environment to the set of Jupyter ipython kernels available in our SageMaker notebook instance by installing in a [lifecycle configuration script](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html). 30 | 31 | Create a new lifecycle configuration (via the 'Additional Configuration' dropdown). 32 | 33 | ![Screenshot of the create lifecycle configuration screen](../../images/sagemaker-create-lifecycle-configuration.png) 34 | 35 | Give your configuration a name like `rapids` and paste the following script into the "start notebook" script. 
36 | 37 | ```bash 38 | #!/bin/bash 39 | 40 | set -e 41 | 42 | sudo -u ec2-user -i <<'EOF' 43 | 44 | mamba create -y -n rapids -c rapidsai -c conda-forge -c nvidia rapids=24.12 python=3.12 cuda-version=12.4 \ 45 | boto3 \ 46 | ipykernel \ 47 | 'sagemaker-python-sdk>=2.239.0' 48 | 49 | conda activate rapids 50 | 51 | python -m ipykernel install --user --name rapids 52 | echo "kernel install completed" 53 | EOF 54 | ``` 55 | 56 | ```{warning} 57 | RAPIDS `>24.12` will not be installable on SageMaker Notebook Instances until those instances support 58 | Amazon Linux 2023 or other Linux distributions with GLIBC of at least 2.28. 59 | For more details, see [rapidsai/deployment#520](https://github.com/rapidsai/deployment/issues/520). 60 | ``` 61 | 62 | Set the volume size to at least `15GB`, to accommodate the conda environment. 63 | 64 | Then launch the instance. 65 | 66 | ### Select the RAPIDS environment 67 | 68 | Once your Notebook Instance is `InService` select "Open JupyterLab" 69 | 70 | ```{note} 71 | If you see Pending to the right of the notebook instance in the Status column, your notebook is still being created. The status will change to InService when the notebook is ready for use. 72 | ``` 73 | 74 | Then in Jupyter select the `rapids` kernel when working with a new notebook. 75 | 76 | ![Screenshot of Jupyter with the rapids kernel highlighted](../../images/sagemaker-choose-rapids-kernel.png) 77 | 78 | ### Run the Example Notebook 79 | 80 | Once inside JupyterLab you should be able to upload the [Running RAPIDS hyperparameter experiments at scale](/examples/rapids-sagemaker-higgs/notebook) example notebook and continue following those instructions. 81 | 82 | ## SageMaker Estimators 83 | 84 | RAPIDS can also be used in [SageMaker Estimators](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). 85 | Estimators allow you to launch training jobs on ephemeral VMs which SageMaker manages for you. 86 | With this option, your Notebook Instance doesn't need to have a GPU... you are only charged for GPU instances for the time that your training job is running. 87 | 88 | All you’ll need to do is bring in your RAPIDS training script and libraries as a Docker container image and ask Amazon SageMaker to run copies of it in parallel on a specified number of GPU instances. 89 | 90 | Let’s take a closer look at how this works through a step-by-step approach: 91 | 92 | - Training script should accept hyperparameters as command line arguments. Starting with the base RAPIDS container (pulled from [Docker Hub](https://hub.docker.com/u/rapidsai)), use a `Dockerfile` to augment it by copying your training code and set `WORKDIR` path to the code. 93 | 94 | - Install [sagemaker-training toolkit](https://github.com/aws/sagemaker-training-toolkit) to make the container compatible with Sagemaker. Add other packages as needed for your workflow needs e.g. python, flask (model serving), dask-ml etc. 95 | 96 | - Push the image to a container registry (ECR). 97 | 98 | - Having built our container and custom logic, we can now assemble all components into an Estimator. We can now test the Estimator and run parallel hyperparameter optimization tuning jobs. 
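As a rough sketch of the container step, a minimal `Dockerfile` might look like the following. The base image tag, script name, and paths here are placeholder assumptions rather than the exact files used in the example notebooks.

```dockerfile
# Start from a RAPIDS base image (tag is a placeholder; pick a current release)
FROM rapidsai/base:24.12-cuda12.5-py3.12

# Make the container compatible with SageMaker training jobs
RUN pip install sagemaker-training

# Copy in your training code and tell the toolkit which program to run
COPY train.py /opt/ml/code/train.py
ENV SAGEMAKER_PROGRAM=train.py
WORKDIR /opt/ml/code
```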
99 | 100 | Estimators follow an API roughly like this: 101 | 102 | ```python 103 | # set up configuration for the estimator 104 | estimator = sagemaker.estimator.Estimator( 105 | image_uri, 106 | role, 107 | instance_type, 108 | instance_count, 109 | input_mode, 110 | output_path, 111 | use_spot_instances, 112 | max_run=86400, 113 | sagemaker_session, 114 | ) 115 | 116 | # launch a single remote training job 117 | estimator.fit(inputs=s3_data_input, job_name=job_name) 118 | 119 | # set up configuration for HyperparameterTuner 120 | hpo = sagemaker.tuner.HyperparameterTuner( 121 | estimator, 122 | metric_definitions, 123 | objective_metric_name, 124 | objective_type="Maximize", 125 | hyperparameter_ranges, 126 | strategy, 127 | max_jobs, 128 | max_parallel_jobs, 129 | ) 130 | 131 | # launch multiple training jobs (one per combination of hyperparameters) 132 | hpo.fit(inputs=s3_data_input, job_name=tuning_job_name, wait=True, logs="All") 133 | ``` 134 | 135 | For a hands-on demo of this, try ["Deep Dive into running Hyper Parameter Optimization on AWS SageMaker"]/examples/rapids-sagemaker-higgs/notebook). 136 | 137 | ## Further reading 138 | 139 | We’ve also written a **[detailed blog post](https://medium.com/rapids-ai/running-rapids-experiments-at-scale-using-amazon-sagemaker-d516420f165b)** on how to use SageMaker with RAPIDS. 140 | 141 | ```{relatedexamples} 142 | 143 | ``` 144 | -------------------------------------------------------------------------------- /source/cloud/azure/aks.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p1" 3 | --- 4 | 5 | # Azure Kubernetes Service 6 | 7 | RAPIDS can be deployed on Azure via the [Azure Kubernetes Service](https://azure.microsoft.com/en-us/products/kubernetes-service/) (AKS). 8 | 9 | To run RAPIDS you'll need a Kubernetes cluster with GPUs available. 10 | 11 | ## Prerequisites 12 | 13 | First you'll need to have the [`az` CLI tool](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc for managing Kubernetes. 14 | 15 | Ensure you are logged into the `az` CLI. 16 | 17 | ```console 18 | $ az login 19 | ``` 20 | 21 | ## Create the Kubernetes cluster 22 | 23 | Now we can launch a GPU enabled AKS cluster. First launch an AKS cluster. 24 | 25 | ```bash 26 | az aks create -g -n rapids \ 27 | --enable-managed-identity \ 28 | --node-count 1 \ 29 | --enable-addons monitoring \ 30 | --enable-msi-auth-for-monitoring \ 31 | --generate-ssh-keys 32 | ``` 33 | 34 | Once the cluster has created we need to pull the credentials into our local config. 35 | 36 | ```console 37 | $ az aks get-credentials -g --name rapids 38 | Merged "rapids" as current context in ~/.kube/config 39 | ``` 40 | 41 | Next we need to add an additional node group with GPUs which you can [learn more about in the Azure docs](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster). 42 | 43 | `````{note} 44 | You will need the `GPUDedicatedVHDPreview` feature enabled so that NVIDIA drivers are installed automatically. 
45 | 46 | You can check if this is enabled with: 47 | 48 | ````console 49 | $ az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/GPUDedicatedVHDPreview')].{Name:name,State:properties.state}" 50 | Name State 51 | ------------------------------------------------- ------------- 52 | Microsoft.ContainerService/GPUDedicatedVHDPreview NotRegistered 53 | ```` 54 | 55 | ````{dropdown} If you see NotRegistered follow these instructions 56 | :color: info 57 | :icon: info 58 | 59 | If it is not registered for you you'll need to register it which can take a few minutes. 60 | 61 | ```console 62 | $ az feature register --name GPUDedicatedVHDPreview --namespace Microsoft.ContainerService 63 | Once the feature 'GPUDedicatedVHDPreview' is registered, invoking 'az provider register -n Microsoft.ContainerService' is required to get the change propagated 64 | Name 65 | ------------------------------------------------- 66 | Microsoft.ContainerService/GPUDedicatedVHDPreview 67 | ``` 68 | 69 | Keep checking until it does into a registered state. 70 | 71 | ```console 72 | $ az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/GPUDedicatedVHDPreview')].{Name:name,State:properties.state}" 73 | Name State 74 | ------------------------------------------------- ----------- 75 | Microsoft.ContainerService/GPUDedicatedVHDPreview Registered 76 | ``` 77 | 78 | When the status shows as registered, refresh the registration of the `Microsoft.ContainerService` resource provider by using the `az provider register` command: 79 | 80 | ```console 81 | $ az provider register --namespace Microsoft.ContainerService 82 | ``` 83 | 84 | Then install the aks-preview CLI extension, use the following Azure CLI commands: 85 | 86 | ```console 87 | $ az extension add --name aks-preview 88 | ``` 89 | 90 | ```` 91 | 92 | ````` 93 | 94 | ```bash 95 | az aks nodepool add \ 96 | --resource-group \ 97 | --cluster-name rapids \ 98 | --name gpunp \ 99 | --node-count 1 \ 100 | --node-vm-size Standard_NC48ads_A100_v4 \ 101 | --enable-cluster-autoscaler \ 102 | --min-count 1 \ 103 | --max-count 3 104 | ``` 105 | 106 | Here we have added a new pool made up of `Standard_NC48ads_A100_v4` instances which each have two A100 GPUs. We've also enabled autoscaling between one and three nodes on the pool. 107 | 108 | Then we can install the NVIDIA drivers. 109 | 110 | ```bash 111 | helm install --wait --generate-name --repo https://helm.ngc.nvidia.com/nvidia \ 112 | -n gpu-operator --create-namespace \ 113 | gpu-operator \ 114 | --set operator.runtimeClass=nvidia-container-runtime 115 | ``` 116 | 117 | Once our new pool has been created and configured, we can test the cluster. 118 | 119 | ```{include} ../../_includes/check-gpu-pod-works.md 120 | 121 | ``` 122 | 123 | we should be able to test that we can schedule GPU pods. 124 | 125 | ## Install RAPIDS 126 | 127 | Now that you have a GPU enables Kubernetes cluster on AKS you can install RAPIDS with [any of the supported methods](../../platforms/kubernetes). 128 | 129 | ## Clean up 130 | 131 | You can also delete the AKS cluster to stop billing with the following command. 132 | 133 | ```console 134 | $ az aks delete -g -n rapids 135 | / Running .. 
136 | ``` 137 | 138 | ```{relatedexamples} 139 | 140 | ``` 141 | -------------------------------------------------------------------------------- /source/cloud/azure/azure-vm-multi.md: -------------------------------------------------------------------------------- 1 | # Azure VM Cluster (via Dask) 2 | 3 | ## Create a Cluster using Dask Cloud Provider 4 | 5 | The easiest way to setup a multi-node, multi-GPU cluster on Azure is to use [Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/azure.html). 6 | 7 | ### 1. Install Dask Cloud Provider 8 | 9 | Dask Cloud Provider can be installed via `conda` or `pip`. The Azure-specific capabilities will need to be installed via the `[azure]` pip extra. 10 | 11 | ```shell 12 | $ pip install dask-cloudprovider[azure] 13 | ``` 14 | 15 | ### 2. Configure your Azure Resources 16 | 17 | Set up your [Azure Resource Group](https://cloudprovider.dask.org/en/latest/azure.html#resource-groups), [Virtual Network](https://cloudprovider.dask.org/en/latest/azure.html#virtual-networks), and [Security Group](https://cloudprovider.dask.org/en/latest/azure.html#security-groups) according to [Dask Cloud Provider instructions](https://cloudprovider.dask.org/en/latest/azure.html#authentication). 18 | 19 | ### 3. Create a Cluster 20 | 21 | In Python terminal, a cluster can be created using the `dask_cloudprovider` package. The below example creates a cluster with 2 workers in `westus2` with `Standard_NC12s_v3` VMs. The VMs should have at least 100GB of disk space in order to accommodate the RAPIDS container image and related dependencies. 22 | 23 | ```python 24 | from dask_cloudprovider.azure import AzureVMCluster 25 | 26 | resource_group = "" 27 | vnet = "" 28 | security_group = "" 29 | subscription_id = "" 30 | cluster = AzureVMCluster( 31 | resource_group=resource_group, 32 | vnet=vnet, 33 | security_group=security_group, 34 | subscription_id=subscription_id, 35 | location="westus2", 36 | vm_size="Standard_NC12s_v3", 37 | public_ingress=True, 38 | disk_size=100, 39 | n_workers=2, 40 | worker_class="dask_cuda.CUDAWorker", 41 | docker_image="{{rapids_container}}", 42 | docker_args="-p 8787:8787 -p 8786:8786", 43 | ) 44 | ``` 45 | 46 | ### 4. Test RAPIDS 47 | 48 | To test RAPIDS, create a distributed client for the cluster and query for the GPU model. 49 | 50 | ```python 51 | from dask.distributed import Client 52 | 53 | client = Client(cluster) 54 | 55 | 56 | def get_gpu_model(): 57 | import pynvml 58 | 59 | pynvml.nvmlInit() 60 | return pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0)) 61 | 62 | 63 | client.submit(get_gpu_model).result() 64 | ``` 65 | 66 | ```shell 67 | Out[5]: b'Tesla V100-PCIE-16GB' 68 | ``` 69 | 70 | ### 5. Cleanup 71 | 72 | Once done with the cluster, ensure the `cluster` and `client` are closed: 73 | 74 | ```python 75 | client.close() 76 | cluster.close() 77 | ``` 78 | 79 | ```{relatedexamples} 80 | 81 | ``` 82 | -------------------------------------------------------------------------------- /source/cloud/azure/azure-vm.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p1" 3 | --- 4 | 5 | # Azure Virtual Machine 6 | 7 | ## Create Virtual Machine 8 | 9 | Create a new [Azure Virtual Machine](https://azure.microsoft.com/en-gb/products/virtual-machines/) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). 
10 | 11 | NVIDIA maintains a [Virtual Machine Image (VMI) that pre-installs NVIDIA drivers and container runtimes](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/nvidia.ngc_azure_17_11?tab=Overview), we recommend using this image as the starting point. 12 | 13 | `````{tab-set} 14 | 15 | ````{tab-item} via Azure Portal 16 | :sync: portal 17 | 18 | 1. Select a resource group or create one if needed. 19 | 2. Select the latest **NVIDIA GPU-Optimized VMI** version from the drop down list, then select **Get It Now** (if there are multiple `Gen` versions, select the latest). 20 | 3. If already logged in on Azure, select continue clicking **Create**. 21 | 4. In **Create a virtual machine** interface, fill in required information for the vm. 22 | - Select a GPU enabled VM size (see [recommended VM types](https://docs.rapids.ai/deployment/stable/cloud/azure/)). 23 | - In "Configure security features" select Standard. 24 | - Make sure you create ssh keys and download them. 25 | 26 | ```{dropdown} Note that not all regions support availability zones with GPU VMs. 27 | :color: info 28 | :icon: info 29 | 30 | When the GPU VM size is not selectable 31 | with notice: **The size is not available in zone x. No zones are supported.** It means the GPU VM does not 32 | support availability zone. Try other availability options. 33 | 34 | ![azure-gpuvm-availability-zone-error](../../_static/azure_availability_zone.PNG) 35 | ``` 36 | 37 | Click **Review+Create** to start the virtual machine. 38 | 39 | ```` 40 | 41 | ````{tab-item} via Azure CLI 42 | :sync: cli 43 | 44 | Prepare the following environment variables. 45 | 46 | | Name | Description | Example | 47 | | ------------------ | -------------------- | -------------------------------------------------------------- | 48 | | `AZ_VMNAME` | Name for VM | `RapidsAI-V100` | 49 | | `AZ_RESOURCEGROUP` | Resource group of VM | `rapidsai-deployment` | 50 | | `AZ_LOCATION` | Region of VM | `westus2` | 51 | | `AZ_IMAGE` | URN of image | `nvidia:ngc_azure_17_11:ngc-base-version-22_06_0-gen2:22.06.0` | 52 | | `AZ_SIZE` | VM Size | `Standard_NC6s_v3` | 53 | | `AZ_USERNAME` | User name of VM | `rapidsai` | 54 | | `AZ_SSH_KEY` | public ssh key | `~/.ssh/id_rsa.pub` | 55 | 56 | ```bash 57 | az vm create \ 58 | --name ${AZ_VMNAME} \ 59 | --resource-group ${AZ_RESOURCEGROUP} \ 60 | --image ${AZ_IMAGE} \ 61 | --location ${AZ_LOCATION} \ 62 | --size ${AZ_SIZE} \ 63 | --admin-username ${AZ_USERNAME} \ 64 | --ssh-key-value ${AZ_SSH_KEY} 65 | ``` 66 | 67 | ```{note} 68 | Use `az vm image list --publisher Nvidia --all --output table` to inspect URNs of official 69 | NVIDIA images on Azure. 70 | ``` 71 | 72 | ```{note} 73 | See [this link](https://learn.microsoft.com/en-us/azure/virtual-machines/linux/mac-create-ssh-keys) 74 | for supported ssh keys on Azure. 75 | ``` 76 | 77 | ```` 78 | 79 | ````` 80 | 81 | ## Create Network Security Group 82 | 83 | Next we need to allow network traffic to the VM so we can access Jupyter and Dask. 84 | 85 | `````{tab-set} 86 | 87 | ````{tab-item} via Azure Portal 88 | :sync: portal 89 | 90 | 1. After creating VM, select **Go to resource** to access VM. 91 | 2. Select **Networking** -> **Networking Settings** in the left panel. 92 | 3. Select **+Create port rule** -> **Add inbound port rule**. 93 | 4. Set **Destination port ranges** to `8888,8787`. 94 | 5. Modify the "Name" to avoid the `,` or any other symbols. 95 | 96 | ```{dropdown} See example of port setting. 
97 | :color: info 98 | :icon: info 99 | ![set-ports-inbound-sec](../../_static/azure-set-ports-inbound-sec.png) 100 | ``` 101 | 5. Keep rest unchanged. Select **Add**. 102 | ```` 103 | 104 | ````{tab-item} via Azure CLI 105 | :sync: cli 106 | 107 | | Name | Description | Example | 108 | | ---------------- | ------------------- | -------------------------- | 109 | | `AZ_NSGNAME` | NSG name for the VM | `${AZ_VMNAME}NSG` | 110 | | `AZ_NSGRULENAME` | Name for NSG rule | `Allow-Dask-Jupyter-ports` | 111 | 112 | ```bash 113 | az network nsg rule create \ 114 | -g ${AZ_RESOURCEGROUP} \ 115 | --nsg-name ${AZ_NSGNAME} \ 116 | -n ${AZ_NSGRULENAME} \ 117 | --priority 1050 \ 118 | --destination-port-ranges 8888 8787 119 | ``` 120 | 121 | ```` 122 | ````` 123 | 124 | ## Install RAPIDS 125 | 126 | Next, we can SSH into our VM to install RAPIDS. SSH instructions can be found by selecting **Connect** in the left panel. 127 | 128 | ````{tip} 129 | When connecting via SSH by doing 130 | 131 | ```bash 132 | ssh -i /your-key-file.pem azureuser@ 133 | ``` 134 | 135 | you might get prompted with a `WARNING: UNPROTECTED PRIVATE KEY FILE!`, and get a 136 | **"Permission denied"** as a result of this. 137 | 138 | Change the permissions of your key file to be less permissive by 139 | doing `chmod 600 your_key_file.pem`, and you should be good to go. 140 | ```` 141 | 142 | ```{include} ../../_includes/install-rapids-with-docker.md 143 | 144 | ``` 145 | 146 | ## Test RAPIDS 147 | 148 | ```{include} ../../_includes/test-rapids-docker-vm.md 149 | 150 | ``` 151 | 152 | ### Useful Links 153 | 154 | - [Using NGC with Azure](https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-azure/index.html) 155 | 156 | ```{relatedexamples} 157 | 158 | ``` 159 | -------------------------------------------------------------------------------- /source/cloud/azure/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Microsoft Azure 7 | 8 | ```{include} ../../_includes/menus/azure.md 9 | 10 | ``` 11 | 12 | RAPIDS can be deployed on Microsoft Azure in several ways. Azure supports various kinds of GPU VMs for different needs. 13 | For RAPIDS users we recommend NC/ND VMs for computation and deep learning optimized instances. 
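To see which GPU sizes are offered in a particular region, one option is the `az` CLI. This is a minimal sketch; the region and the `Standard_NC` filter are placeholders, and the listing does not account for your subscription's quota.

```console
$ az vm list-sizes --location westus2 --output table \
    --query "[?contains(name, 'Standard_NC')]"
```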
14 | 15 | NC (>=v3) series 16 | 17 | | Size | vCPU | Memory: GiB | Temp Storage (with NVMe) : GiB | GPU | GPU Memory: GiB | Max data disks | Max uncached disk throughput: IOPS / MBps | Max NICs/network bandwidth (MBps) | 18 | | ------------------------ | ---- | ----------- | ------------------------------ | --- | --------------- | -------------- | ----------------------------------------- | --------------------------------- | 19 | | Standard_ND96isr_H100_v5 | 96 | 1900 | 1000 | 8 | 80 | 32 | 40800/612 | 8/80,000 | 20 | | Standard_NC24ads_A100_v4 | 24 | 220 | 1123 | 1 | 80 | 12 | 30000/1000 | 2/20,000 | 21 | | Standard_NC48ads_A100_v4 | 48 | 440 | 2246 | 2 | 160 | 24 | 60000/2000 | 4/40,000 | 22 | | Standard_NC96ads_A100_v4 | 96 | 880 | 4492 | 4 | 320 | 32 | 120000/4000 | 8/80,000 | 23 | | Standard_NC4as_T4_v3 | 4 | 28 | 180 | 1 | 16 | 8 | 2 / 8000 | | 24 | | Standard_NC8as_T4_v3 | 8 | 56 | 360 | 1 | 16 | 16 | 4 / 8000 | | 25 | | Standard_NC16as_T4_v3 | 16 | 110 | 360 | 1 | 16 | 32 | 8 / 8000 | | 26 | | Standard_NC64as_T4_v3 | 64 | 440 | 2880 | 4 | 64 | 32 | 8 / 32000 | | 27 | | Standard_NC6s_v3 | 6 | 112 | 736 | 1 | 16 | 12 | 20000/200 | 4 | 28 | | Standard_NC12s_v3 | 12 | 224 | 1474 | 2 | 32 | 24 | 40000/400 | 8 | 29 | | Standard_NC24s_v3 | 24 | 448 | 2948 | 4 | 64 | 32 | 80000/800 | 8 | 30 | | Standard_NC24rs_v3\* | 24 | 448 | 2948 | 4 | 64 | 32 | 80000/800 | 8 | 31 | 32 | \* RDMA capable 33 | 34 | ND (>=v2) series 35 | 36 | | Size | vCPU | Memory: GiB | Temp Storage (with NVMe) : GiB | GPU | GPU Memory: GiB | Max data disks | Max uncached disk throughput: IOPS / MBps | Max NICs/network bandwidth (MBps) | 37 | | ------------------------- | ---- | ----------- | ------------------------------ | ------------------------------ | --------------- | -------------- | ----------------------------------------- | --------------------------------- | 38 | | Standard_ND96asr_v4 | 96 | 900 | 6000 | 8 A100 40 GB GPUs (NVLink 3.0) | 40 | 32 | 80,000 / 800 | 8/24,000 | 39 | | Standard_ND96amsr_A100_v4 | 96 | 1900 | 6400 | 8 A100 80 GB GPUs (NVLink 3.0) | 80 | 32 | 80,000 / 800 | 8/24,000 | 40 | | Standard_ND40rs_v2 | 40 | 672 | 2948 | 8 V100 32 GB (NVLink) | 32 | 32 | 80,000 / 800 | 8/24,000 | 41 | 42 | ## Useful Links 43 | 44 | - [GPU VM availability by region](https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?products=virtual-machines) 45 | - [For GPU VM sizes overview](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) 46 | 47 | ```{toctree} 48 | --- 49 | hidden: true 50 | --- 51 | azure-vm 52 | aks 53 | azure-vm-multi 54 | azureml 55 | ``` 56 | -------------------------------------------------------------------------------- /source/cloud/gcp/compute-engine.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p1" 3 | --- 4 | 5 | # Compute Engine Instance 6 | 7 | ## Create Virtual Machine 8 | 9 | Create a new [Compute Engine Instance](https://cloud.google.com/compute/docs/instances) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). 10 | 11 | NVIDIA maintains a [Virtual Machine Image (VMI) that pre-installs NVIDIA drivers and container runtimes](https://console.cloud.google.com/marketplace/product/nvidia-ngc-public/nvidia-gpu-optimized-vmi), we recommend using this image. 12 | 13 | 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 14 | 1. 
Select **Create Instance**. 15 | 1. Select the **Create VM from..** option at the top. 16 | 1. Select **Marketplace**. 17 | 1. Search for "nvidia" and select **NVIDIA GPU-Optimized VMI**, then select **Launch**. 18 | 1. In the **New NVIDIA GPU-Optimized VMI deployment** interface, fill in the name and any required information for the vm (the defaults should be fine for most users). 19 | 1. **Read and accept** the Terms of Service 20 | 1. Select **Deploy** to start the virtual machine. 21 | 22 | ## Allow network access 23 | 24 | To access Jupyter and Dask we will need to set up some firewall rules to open up some ports. 25 | 26 | ### Create the firewall rule 27 | 28 | 1. Open [**VPC Network**](https://console.cloud.google.com/networking/networks/list). 29 | 2. Select **Firewall** and **Create firewall rule** 30 | 3. Give the rule a name like `rapids` and ensure the network matches the one you selected for the VM. 31 | 4. Add a tag like `rapids` which we will use to assign the rule to our VM. 32 | 5. Set your source IP range. We recommend you restrict this to your own IP address or your corporate network rather than `0.0.0.0/0` which will allow anyone to access your VM. 33 | 6. Under **Protocols and ports** allow TCP connections on ports `22,8786,8787,8888`. 34 | 35 | ### Assign it to the VM 36 | 37 | 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 38 | 2. Select your VM and press **Edit**. 39 | 3. Scroll down to **Networking** and add the `rapids` network tag you gave your firewall rule. 40 | 4. Select **Save**. 41 | 42 | ## Connect to the VM 43 | 44 | Next we need to connect to the VM. 45 | 46 | 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 47 | 2. Locate your VM and press the **SSH** button which will open a new browser tab with a terminal. 48 | 3. **Read and accept** the NVIDIA installer prompts. 49 | 50 | ## Install RAPIDS 51 | 52 | ```{include} ../../_includes/install-rapids-with-docker.md 53 | 54 | ``` 55 | 56 | ## Test RAPIDS 57 | 58 | ```{include} ../../_includes/test-rapids-docker-vm.md 59 | 60 | ``` 61 | 62 | ## Clean up 63 | 64 | Once you are finished head back to the [Deployments](https://console.cloud.google.com/dm/deployments) page and delete the marketplace deployment you created. 65 | 66 | ```{relatedexamples} 67 | 68 | ``` 69 | -------------------------------------------------------------------------------- /source/cloud/gcp/dataproc.md: -------------------------------------------------------------------------------- 1 | # Dataproc 2 | 3 | RAPIDS can be deployed on Google Cloud Dataproc using Dask. For more details, see our **[detailed instructions and helper scripts.](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids)** 4 | 5 | **0. Copy initialization actions to your own Cloud Storage bucket.** Don't create clusters that reference initialization actions located in `gs://goog-dataproc-initialization-actions-REGION` public buckets. These scripts are provided as reference implementations and are synchronized with ongoing [GitHub repository](https://github.com/GoogleCloudDataproc/initialization-actions) changes. 
6 | 7 | It is strongly recommended that you copy the initialization scripts into your own Storage bucket to prevent unintended upgrades from upstream in the cluster: 8 | 9 | ```console 10 | $ REGION= 11 | $ GCS_BUCKET= 12 | $ gcloud storage buckets create gs://$GCS_BUCKET 13 | $ gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh gs://$GCS_BUCKET 14 | $ gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/dask/dask.sh gs://$GCS_BUCKET 15 | $ gsutil cp gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh gs://$GCS_BUCKET 16 | 17 | ``` 18 | 19 | **1. Create Dataproc cluster with Dask RAPIDS.** Use the gcloud command to create a new cluster. Because of an Anaconda version conflict, script deployment on older images is slow, we recommend using Dask with Dataproc 2.0+. 20 | 21 | ```{warning} 22 | At the time of writing [Dataproc only supports RAPIDS version 23.12 and earlier with CUDA<=11.8 and Ubuntu 18.04](https://github.com/GoogleCloudDataproc/initialization-actions/issues/1137). 23 | 24 | Please ensure that your setup complies with this compatibility requirement. Using newer RAPIDS versions may result in unexpected behavior or errors. 25 | ``` 26 | 27 | ```console 28 | $ CLUSTER_NAME= 29 | $ DASK_RUNTIME=yarn 30 | $ RAPIDS_VERSION=23.12 31 | $ CUDA_VERSION=11.8 32 | 33 | $ gcloud dataproc clusters create $CLUSTER_NAME\ 34 | --region $REGION\ 35 | --image-version 2.0-ubuntu18\ 36 | --master-machine-type n1-standard-32\ 37 | --master-accelerator type=nvidia-tesla-t4,count=2\ 38 | --worker-machine-type n1-standard-32\ 39 | --worker-accelerator type=nvidia-tesla-t4,count=2\ 40 | --initialization-actions=gs://$GCS_BUCKET/install_gpu_driver.sh,gs://$GCS_BUCKET/dask.sh,gs://$GCS_BUCKET/rapids.sh\ 41 | --initialization-action-timeout 60m\ 42 | --optional-components=JUPYTER\ 43 | --metadata gpu-driver-provider=NVIDIA,dask-runtime=$DASK_RUNTIME,rapids-runtime=DASK,rapids-version=$RAPIDS_VERSION,cuda-version=$CUDA_VERSION\ 44 | --enable-component-gateway 45 | 46 | ``` 47 | 48 | [GCS_BUCKET] = name of the bucket to use.\ 49 | [CLUSTER_NAME] = name of the cluster.\ 50 | [REGION] = name of region where cluster is to be created.\ 51 | [DASK_RUNTIME] = Dask runtime could be set to either yarn or standalone. 52 | 53 | **2. Run Dask RAPIDS Workload.** Once the cluster has been created, the Dask scheduler listens for workers on `port 8786`, and its status dashboard is on `port 8787` on the Dataproc master node. 54 | 55 | To connect to the Dask web interface, you will need to create an SSH tunnel as described in the [Dataproc web interfaces documentation.](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) You can also connect using the Dask Client Python API from a Jupyter notebook, or from a Python script or interpreter session. 56 | 57 | ```{relatedexamples} 58 | 59 | ``` 60 | -------------------------------------------------------------------------------- /source/cloud/gcp/gke.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p1" 3 | --- 4 | 5 | # Google Kubernetes Engine 6 | 7 | RAPIDS can be deployed on Google Cloud via the [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine) (GKE). 8 | 9 | To run RAPIDS you'll need a Kubernetes cluster with GPUs available. 
10 | 11 | ## Prerequisites 12 | 13 | First you'll need to have the [`gcloud` CLI tool](https://cloud.google.com/sdk/gcloud) installed along with [`kubectl`](https://kubernetes.io/docs/tasks/tools/), [`helm`](https://helm.sh/docs/intro/install/), etc for managing Kubernetes. 14 | 15 | Ensure you are logged into the `gcloud` CLI. 16 | 17 | ```console 18 | $ gcloud init 19 | ``` 20 | 21 | ## Create the Kubernetes cluster 22 | 23 | Now we can launch a GPU enabled GKE cluster. 24 | 25 | ```console 26 | gcloud container clusters create rapids-gpu-kubeflow \ 27 | --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \ 28 | --zone us-central1-c --release-channel stable 29 | ``` 30 | 31 | With this command, you’ve launched a GKE cluster called `rapids-gpu-kubeflow`. You’ve specified that it should use nodes of type a2-highgpu-2g, each with two A100 GPUs. 32 | 33 | ````{note} 34 | After creating your cluster, if you get a message saying 35 | 36 | ```text 37 | CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not 38 | executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin 39 | ``` 40 | you will need to install the `gke-gcloud-auth-plugin` to be able to get the credentials. To do so, 41 | 42 | ```bash 43 | gcloud components install gke-gcloud-auth-plugin 44 | ``` 45 | ```` 46 | 47 | ## Get the cluster credentials 48 | 49 | ```console 50 | gcloud container clusters get-credentials rapids-gpu-kubeflow \ 51 | --region=us-central1-c 52 | ``` 53 | 54 | With this command, your `kubeconfig` is updated with credentials and endpoint information for the `rapids-gpu-kubeflow` cluster. 55 | 56 | ## Install drivers 57 | 58 | Next, [install the NVIDIA drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) onto each node. 59 | 60 | ```console 61 | $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml 62 | daemonset.apps/nvidia-driver-installer created 63 | ``` 64 | 65 | Verify that the NVIDIA drivers are successfully installed. 66 | 67 | ```console 68 | $ kubectl get po -A --watch | grep nvidia 69 | kube-system nvidia-gpu-device-plugin-medium-cos-h5kkz 2/2 Running 0 3m42s 70 | kube-system nvidia-gpu-device-plugin-medium-cos-pw89w 2/2 Running 0 3m42s 71 | kube-system nvidia-gpu-device-plugin-medium-cos-wdnm9 2/2 Running 0 3m42s 72 | ``` 73 | 74 | After your drivers are installed, you are ready to test your cluster. 75 | 76 | ```{include} ../../_includes/check-gpu-pod-works.md 77 | 78 | ``` 79 | 80 | ## Install RAPIDS 81 | 82 | Now that you have a GPU enables Kubernetes cluster on GKE you can install RAPIDS with [any of the supported methods](../../platforms/kubernetes). 83 | 84 | ## Clean up 85 | 86 | You can also delete the GKE cluster to stop billing with the following command. 
87 | 88 | ```console 89 | $ gcloud container clusters delete rapids-gpu-kubeflow --zone us-central1-c 90 | Deleting cluster rapids...⠼ 91 | ``` 92 | 93 | ```{relatedexamples} 94 | 95 | ``` 96 | -------------------------------------------------------------------------------- /source/cloud/gcp/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Google Cloud Platform 7 | 8 | ```{include} ../../_includes/menus/gcp.md 9 | 10 | ``` 11 | 12 | RAPIDS can be deployed on Google Cloud Platform in several ways. Google Cloud supports various kinds of GPU VMs for different needs. Please visit the Google Cloud documentation for [an overview of GPU VM sizes](https://cloud.google.com/compute/docs/gpus) and [GPU VM availability by region](https://cloud.google.com/compute/docs/gpus/gpu-regions-zones). 13 | 14 | ```{toctree} 15 | --- 16 | hidden: true 17 | --- 18 | compute-engine 19 | vertex-ai 20 | gke 21 | dataproc 22 | ``` 23 | -------------------------------------------------------------------------------- /source/cloud/gcp/vertex-ai.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p0" 3 | --- 4 | 5 | # Vertex AI 6 | 7 | RAPIDS can be deployed on [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench). 8 | 9 | ## Create a new Notebook Instance 10 | 11 | 1. From the Google Cloud UI, navigate to [**Vertex AI**](https://console.cloud.google.com/vertex-ai/workbench/user-managed) -> Notebook -> **Workbench** 12 | 2. Select **Instances** and select **+ CREATE NEW**. 13 | 3. In the **Details** section give the instance a name. 14 | 4. Check the "Attach 1 NVIDIA T4 GPU" option. 15 | 5. After customizing any other aspects of the machine you wish, click **CREATE**. 16 | 17 | ```{tip} 18 | If you want to select a different GPU or select other hardware options you can select "Advanced Options" at the bottom and then make changes in the "Machine type" section. 19 | ``` 20 | 21 | ## Install RAPIDS 22 | 23 | Once the instance has started select **OPEN JUPYTER LAB** and at the top of a notebook install the RAPIDS libraries you wish to use. 24 | 25 | ```{warning} 26 | Installing RAPIDS via `pip` in the default environment is [not currently possible](https://github.com/rapidsai/deployment/issues/517), for now you must create a new `conda` environment. 27 | 28 | Vertex AI currently ships with CUDA Toolkit 11 system packages as of the [Jan 2025 Vertex AI release](https://cloud.google.com/vertex-ai/docs/release-notes#January_31_2025). 29 | The default Python environment also contains the `cupy-cuda12x` package. This means it's not possible to install RAPIDS package like `cudf` via `pip` as `cudf-cu12` will conflict with the CUDA Toolkit version but `cudf-cu11` will conflict with the `cupy` version. 30 | 31 | You can find out your current system CUDA Toolkit version by running `ls -ld /usr/local/cuda*`. 32 | ``` 33 | 34 | You can create a new RAPIDS conda environment and register it with `ipykernel` for use in Jupyter Lab. Open a new terminal in Jupyter and run the following commands. 
35 | 36 | ```bash 37 | # Create a new environment 38 | conda create -y -n rapids \ 39 | {{ rapids_conda_channels }} \ 40 | {{ rapids_conda_packages }} \ 41 | ipykernel 42 | 43 | # Activate the environment 44 | conda activate rapids 45 | 46 | # Register the environment with Jupyter 47 | python -m ipykernel install --prefix "${DL_ANACONDA_HOME}/envs/rapids" --name rapids --display-name rapids 48 | ``` 49 | 50 | Then refresh the Jupyter Lab page and open the launcher. You will see a new "rapids" kernel available. 51 | 52 | ![Screenshot of the Jupyter Lab launcher showing the RAPIDS kernel](../../images/vertex-ai-launcher.png) 53 | 54 | ```{tip} 55 | If you don't see the new kernel wait a minute and refresh the page again, it can take a little while to show up. 56 | ``` 57 | 58 | ## Test RAPIDS 59 | 60 | You should now be able to open a notebook and use RAPIDS. 61 | 62 | For example we could import and use RAPIDS libraries like `cudf`. 63 | 64 | ```ipython 65 | In [1]: import cudf 66 | In [2]: df = cudf.datasets.timeseries() 67 | In [3]: df.head() 68 | Out[3]: 69 | id name x y 70 | timestamp 71 | 2000-01-01 00:00:00 1020 Kevin 0.091536 0.664482 72 | 2000-01-01 00:00:01 974 Frank 0.683788 -0.467281 73 | 2000-01-01 00:00:02 1000 Charlie 0.419740 -0.796866 74 | 2000-01-01 00:00:03 1019 Edith 0.488411 0.731661 75 | 2000-01-01 00:00:04 998 Quinn 0.651381 -0.525398 76 | ``` 77 | 78 | ```{relatedexamples} 79 | 80 | ``` 81 | -------------------------------------------------------------------------------- /source/cloud/ibm/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # IBM Cloud 7 | 8 | ```{include} ../../_includes/menus/ibm.md 9 | 10 | ``` 11 | 12 | RAPIDS can be deployed on IBM Cloud in several ways. See the 13 | list of accelerated instance types below: 14 | 15 | | Cloud
Provider | Inst.<br>Type | vCPUs | Inst.<br>Name | GPU<br>Count | GPU<br>Type | xGPU<br>RAM | xGPU<br>
RAM Total | 16 | | :------------------ | --------------------- | ----- | ------------------ | -------------- | ------------- | ------------- | ------------------: | 17 | | IBM | V100 GPU Virtual | 8 | gx2-8x64x1v100 | 1 | NVIDIA Tesla | 16 (GB) | 64 (GB) | 18 | | IBM | V100 GPU Virtual | 16 | gx2-16x128x1v100 | 1 | NVIDIA Tesla | 16 (GB) | 128 (GB) | 19 | | IBM | V100 GPU Virtual | 16 | gx2-16x128x2v100 | 2 | NVIDIA Tesla | 16 (GB) | 128 (GB) | 20 | | IBM | V100 GPU Virtual | 32 | gx2-32x256x2v100 | 2 | NVIDIA Tesla | 16 (GB) | 256 (GB) | 21 | | IBM | P100 GPU Bare Metal\* | 32 | mg4c.32x384.2xp100 | 2 | NVIDIA Tesla | 16 (GB) | 384 (GB) | 22 | | IBM | V100 GPU Bare Metal\* | 48 | mg4c.48x384.2xv100 | 2 | NVIDIA Tesla | 16 (GB) | 384 (GB) | 23 | 24 | ```{warning} 25 | *Bare Metal instances are billed in monthly intervals rather than hourly intervals. 26 | ``` 27 | 28 | ```{toctree} 29 | --- 30 | hidden: true 31 | --- 32 | virtual-server 33 | ``` 34 | -------------------------------------------------------------------------------- /source/cloud/ibm/virtual-server.md: -------------------------------------------------------------------------------- 1 | # Virtual Server for VPC 2 | 3 | ## Create Instance 4 | 5 | Create a new [Virtual Server (for VPC)](https://www.ibm.com/cloud/virtual-servers) with GPUs, the [NVIDIA Driver](https://www.nvidia.co.uk/Download/index.aspx) and the [NVIDIA Container Runtime](https://developer.nvidia.com/nvidia-container-runtime). 6 | 7 | 1. Open the [**Virtual Server Dashboard**](https://cloud.ibm.com/vpc-ext/compute/vs). 8 | 1. Select **Create**. 9 | 1. Give the server a **name** and select your **resource group**. 10 | 1. Under **Operating System** choose **Ubuntu Linux**. 11 | 1. Under **Profile** select **View all profiles** and select a profile with NVIDIA GPUs. 12 | 1. Under **SSH Keys** choose your SSH key. 13 | 1. Under network settings create a security group (or choose an existing) that allows SSH access on port `22` and also allow ports `8888,8786,8787` to access Jupyter and Dask. 14 | 1. Select **Create Virtual Server**. 15 | 16 | ## Create floating IP 17 | 18 | To access the virtual server we need to attach a public IP address. 19 | 20 | 1. Open [**Floating IPs**](https://cloud.ibm.com/vpc-ext/network/floatingIPs) 21 | 1. Select **Reserve**. 22 | 1. Give the Floating IP a **name**. 23 | 1. Under **Resource to bind** select the virtual server you just created. 24 | 25 | ## Connect to the instance 26 | 27 | Next we need to connect to the instance. 28 | 29 | 1. Open [**Floating IPs**](https://cloud.ibm.com/vpc-ext/network/floatingIPs) 30 | 1. Locate the IP you just created and note the address. 31 | 1. In your terminal run `ssh root@` 32 | 33 | ```{note} 34 | For a short guide on launching your instance and accessing it, read the 35 | [Getting Started with IBM Virtual Server Documentation](https://cloud.ibm.com/docs/virtual-servers?topic=virtual-servers-getting-started-tutorial). 36 | ``` 37 | 38 | ## Install NVIDIA Drivers 39 | 40 | Next we need to install the NVIDIA drivers and container runtime. 41 | 42 | 1. Ensure build essentials are installed `apt-get update && apt-get install build-essential -y`. 43 | 1. Install the [NVIDIA drivers](https://www.nvidia.com/Download/index.aspx?lang=en-us). 44 | 1. Install [Docker and the NVIDIA Docker runtime](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). 45 | 46 | ````{dropdown} How do I check everything installed successfully? 
47 | :color: info 48 | :icon: info 49 | 50 | You can check everything installed correctly by running `nvidia-smi` in a container. 51 | 52 | ```console 53 | $ docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi 54 | +-----------------------------------------------------------------------------+ 55 | | NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 | 56 | |-------------------------------+----------------------+----------------------+ 57 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 58 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 59 | | | | MIG M. | 60 | |===============================+======================+======================| 61 | | 0 Tesla V100-PCIE... Off | 00000000:04:01.0 Off | 0 | 62 | | N/A 33C P0 36W / 250W | 0MiB / 16384MiB | 0% Default | 63 | | | | N/A | 64 | +-------------------------------+----------------------+----------------------+ 65 | 66 | +-----------------------------------------------------------------------------+ 67 | | Processes: | 68 | | GPU GI CI PID Type Process name GPU Memory | 69 | | ID ID Usage | 70 | |=============================================================================| 71 | | No running processes found | 72 | +-----------------------------------------------------------------------------+ 73 | ``` 74 | 75 | ```` 76 | 77 | ## Install RAPIDS 78 | 79 | ```{include} ../../_includes/install-rapids-with-docker.md 80 | 81 | ``` 82 | 83 | ## Test RAPIDS 84 | 85 | ```{include} ../../_includes/test-rapids-docker-vm.md 86 | 87 | ``` 88 | 89 | ```{relatedexamples} 90 | 91 | ``` 92 | -------------------------------------------------------------------------------- /source/cloud/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Cloud 7 | 8 | ## NVIDIA Cloud Platforms 9 | 10 | ```{include} ../_includes/menus/nvidia.md 11 | 12 | ``` 13 | 14 | ## Amazon Web Services 15 | 16 | ```{include} ../_includes/menus/aws.md 17 | 18 | ``` 19 | 20 | ## Microsoft Azure 21 | 22 | ```{include} ../_includes/menus/azure.md 23 | 24 | ``` 25 | 26 | ## Google Cloud Platform 27 | 28 | ```{include} ../_includes/menus/gcp.md 29 | 30 | ``` 31 | 32 | ## IBM Cloud 33 | 34 | ```{include} ../_includes/menus/ibm.md 35 | 36 | ``` 37 | 38 | ```{toctree} 39 | :maxdepth: 2 40 | :caption: Cloud 41 | :hidden: 42 | 43 | nvidia/index 44 | aws/index 45 | azure/index 46 | gcp/index 47 | ibm/index 48 | ``` 49 | -------------------------------------------------------------------------------- /source/cloud/nvidia/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # NVIDIA Cloud Platforms 7 | 8 | ```{include} ../../_includes/menus/nvidia.md 9 | 10 | ``` 11 | 12 | ```{toctree} 13 | --- 14 | hidden: true 15 | --- 16 | brev 17 | ``` 18 | -------------------------------------------------------------------------------- /source/developer/ci/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Continuous Integration 7 | 8 | ```{include} ../../_includes/menus/ci.md 9 | 10 | ``` 11 | 12 | ```{toctree} 13 | --- 14 | hidden: true 15 | --- 16 | github-actions 17 | ``` 18 | 
-------------------------------------------------------------------------------- /source/developer/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Developer 7 | 8 | ## Continuous Integration 9 | 10 | ```{include} ../_includes/menus/ci.md 11 | 12 | ``` 13 | 14 | ```{toctree} 15 | :maxdepth: 2 16 | :caption: Developer 17 | :hidden: 18 | 19 | ci/index 20 | ``` 21 | -------------------------------------------------------------------------------- /source/examples/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Workflow Examples 7 | 8 | ```{notebookgallerytoctree} 9 | xgboost-gpu-hpo-job-parallel-k8s/notebook 10 | xgboost-gpu-hpo-mnmg-parallel-k8s/notebook 11 | rapids-optuna-hpo/notebook 12 | rapids-sagemaker-higgs/notebook 13 | rapids-sagemaker-hpo/notebook 14 | rapids-ec2-mnmg/notebook 15 | rapids-autoscaling-multi-tenant-kubernetes/notebook 16 | xgboost-randomforest-gpu-hpo-dask/notebook 17 | rapids-azureml-hpo/notebook 18 | time-series-forecasting-with-hpo/notebook 19 | xgboost-rf-gpu-cpu-benchmark/notebook 20 | xgboost-dask-databricks/notebook 21 | xgboost-azure-mnmg-daskcloudprovider/notebook 22 | rapids-1brc-single-node/notebook 23 | rapids-snowflake-cudf/notebook 24 | rapids-coiled-cudf/notebook 25 | rapids-morpheus-pipeline/notebook 26 | ``` 27 | -------------------------------------------------------------------------------- /source/examples/rapids-autoscaling-multi-tenant-kubernetes/image-prepuller.yaml: -------------------------------------------------------------------------------- 1 | # image-prepuller.yaml 2 | apiVersion: apps/v1 3 | kind: DaemonSet 4 | metadata: 5 | name: prepull-rapids 6 | spec: 7 | selector: 8 | matchLabels: 9 | name: prepull-rapids 10 | template: 11 | metadata: 12 | labels: 13 | name: prepull-rapids 14 | spec: 15 | initContainers: 16 | - name: prepull-rapids 17 | image: us-central1-docker.pkg.dev/nv-ai-infra/rapidsai/rapidsai/base:example 18 | command: ["sh", "-c", "'true'"] 19 | containers: 20 | - name: pause 21 | image: gcr.io/google_containers/pause 22 | -------------------------------------------------------------------------------- /source/examples/rapids-autoscaling-multi-tenant-kubernetes/prometheus-stack-values.yaml: -------------------------------------------------------------------------------- 1 | # prometheus-stack-values.yaml 2 | serviceMonitorSelectorNilUsesHelmValues: false 3 | 4 | prometheus: 5 | prometheusSpec: 6 | # Setting this to a high frequency so that we have richer data for analysis later 7 | scrapeInterval: 1s 8 | -------------------------------------------------------------------------------- /source/examples/rapids-autoscaling-multi-tenant-kubernetes/rapids-notebook.yaml: -------------------------------------------------------------------------------- 1 | # rapids-notebook.yaml (extended) 2 | apiVersion: v1 3 | kind: ServiceAccount 4 | metadata: 5 | name: rapids-dask 6 | --- 7 | apiVersion: rbac.authorization.k8s.io/v1 8 | kind: Role 9 | metadata: 10 | name: rapids-dask 11 | rules: 12 | - apiGroups: [""] 13 | resources: ["events"] 14 | verbs: ["get", "list", "watch"] 15 | - apiGroups: [""] 16 | resources: ["pods", "services"] 17 | verbs: ["get", "list", "watch", "create", "delete"] 18 | - apiGroups: [""] 19 | resources: ["pods/log"] 20 | verbs: 
["get", "list"] 21 | - apiGroups: [kubernetes.dask.org] 22 | resources: ["*"] 23 | verbs: ["*"] 24 | --- 25 | apiVersion: rbac.authorization.k8s.io/v1 26 | kind: RoleBinding 27 | metadata: 28 | name: rapids-dask 29 | roleRef: 30 | apiGroup: rbac.authorization.k8s.io 31 | kind: Role 32 | name: rapids-dask 33 | subjects: 34 | - kind: ServiceAccount 35 | name: rapids-dask 36 | --- 37 | apiVersion: v1 38 | kind: ConfigMap 39 | metadata: 40 | name: jupyter-server-proxy-config 41 | data: 42 | jupyter_server_config.py: | 43 | c.ServerProxy.host_allowlist = lambda app, host: True 44 | --- 45 | apiVersion: v1 46 | kind: Service 47 | metadata: 48 | name: rapids-notebook 49 | labels: 50 | app: rapids-notebook 51 | spec: 52 | type: ClusterIP 53 | ports: 54 | - port: 8888 55 | name: http 56 | targetPort: notebook 57 | selector: 58 | app: rapids-notebook 59 | --- 60 | apiVersion: v1 61 | kind: Pod 62 | metadata: 63 | name: rapids-notebook 64 | labels: 65 | app: rapids-notebook 66 | spec: 67 | serviceAccountName: rapids-dask 68 | securityContext: 69 | fsGroup: 0 70 | containers: 71 | - name: rapids-notebook 72 | image: us-central1-docker.pkg.dev/nv-ai-infra/rapidsai/rapidsai/base:example 73 | resources: 74 | limits: 75 | nvidia.com/gpu: 1 76 | ports: 77 | - containerPort: 8888 78 | name: notebook 79 | env: 80 | - name: DASK_DISTRIBUTED__DASHBOARD__LINK 81 | value: "/proxy/{host}:{port}/status" 82 | volumeMounts: 83 | - name: jupyter-server-proxy-config 84 | mountPath: /root/.jupyter/jupyter_server_config.py 85 | subPath: jupyter_server_config.py 86 | volumes: 87 | - name: jupyter-server-proxy-config 88 | configMap: 89 | name: jupyter-server-proxy-config 90 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/k8s/kafka-producer/kafka-producer.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: kafka-producer 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: kafka-producer 10 | template: 11 | metadata: 12 | labels: 13 | app: kafka-producer 14 | spec: 15 | containers: 16 | - name: kafka-producer 17 | image: ncclementi/kafka-producer-image:latest 18 | # args: ["--message-limit", "50000"] # uncomment for message limit otherwise unlimited 19 | env: 20 | - name: KAFKA_CLUSTER_BOOTSTRAP_SERVER 21 | value: "kafka-cluster-kafka-bootstrap:9092" 22 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/k8s/kafka/kafka-create-topics.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: kafka.strimzi.io/v1beta2 2 | kind: KafkaTopic 3 | metadata: 4 | name: network-traffic-input 5 | labels: 6 | strimzi.io/cluster: kafka-cluster 7 | spec: 8 | topicName: network-traffic-input 9 | partitions: 3 10 | replicas: 1 11 | config: 12 | retention.ms: 3600000 # 60 minutes 13 | segment.bytes: 157286400 # 150 MB 14 | --- 15 | apiVersion: kafka.strimzi.io/v1beta2 16 | kind: KafkaTopic 17 | metadata: 18 | name: network-traffic-results 19 | labels: 20 | strimzi.io/cluster: kafka-cluster 21 | spec: 22 | topicName: network-traffic-results 23 | partitions: 3 24 | replicas: 1 25 | config: 26 | retention.ms: 1200000 # 20 minutes 27 | segment.bytes: 157286400 # 150 MB 28 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/k8s/kafka/kafka-single-node.yaml: 
-------------------------------------------------------------------------------- 1 | apiVersion: kafka.strimzi.io/v1beta2 2 | kind: KafkaNodePool 3 | metadata: 4 | name: dual-role 5 | labels: 6 | strimzi.io/cluster: kafka-cluster 7 | spec: 8 | replicas: 1 9 | roles: 10 | - controller 11 | - broker 12 | storage: 13 | type: jbod 14 | volumes: 15 | - id: 0 16 | type: ephemeral 17 | sizeLimit: 5Gi 18 | kraftMetadata: shared 19 | --- 20 | apiVersion: kafka.strimzi.io/v1beta2 21 | kind: Kafka 22 | metadata: 23 | name: kafka-cluster 24 | annotations: 25 | strimzi.io/node-pools: enabled 26 | strimzi.io/kraft: enabled 27 | spec: 28 | kafka: 29 | version: 4.0.0 30 | metadataVersion: 4.0-IV3 31 | listeners: 32 | - name: plain 33 | port: 9092 34 | type: internal 35 | tls: false 36 | - name: tls 37 | port: 9093 38 | type: internal 39 | tls: true 40 | config: 41 | offsets.topic.replication.factor: 1 42 | transaction.state.log.replication.factor: 1 43 | transaction.state.log.min.isr: 1 44 | default.replication.factor: 1 45 | min.insync.replicas: 1 46 | entityOperator: 47 | topicOperator: {} 48 | userOperator: {} 49 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/k8s/kafka/kafka-ui.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: kafka-ui 5 | labels: 6 | app: kafka-ui 7 | spec: 8 | replicas: 1 9 | selector: 10 | matchLabels: 11 | app: kafka-ui 12 | template: 13 | metadata: 14 | labels: 15 | app: kafka-ui 16 | spec: 17 | containers: 18 | - name: kafka-ui 19 | image: provectuslabs/kafka-ui:latest 20 | ports: 21 | - containerPort: 8080 22 | env: 23 | - name: KAFKA_CLUSTERS_0_NAME 24 | value: "kafka-cluster" 25 | - name: KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS 26 | value: "kafka-cluster-kafka-bootstrap:9092" # if not on default namespace: kafka-cluster-kafka-bootstrap..svc:9092 27 | - name: DYNAMIC_CONFIG_ENABLED 28 | value: "true" 29 | --- 30 | apiVersion: v1 31 | kind: Service 32 | metadata: 33 | name: kafka-ui 34 | spec: 35 | selector: 36 | app: kafka-ui 37 | ports: 38 | - protocol: TCP 39 | port: 80 40 | targetPort: 8080 41 | type: ClusterIP 42 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/k8s/morpheus-pipeline/morpheus-pipeline-deployment.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: morpheus-pipeline 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: morpheus-pipeline 10 | template: 11 | metadata: 12 | labels: 13 | app: morpheus-pipeline 14 | spec: 15 | containers: 16 | - name: morpheus-pipeline 17 | image: ncclementi/morpheus-pipeline-image:latest 18 | env: 19 | - name: TRITON_SERVER 20 | value: "tritonserver:8000" 21 | - name: KAFKA_CLUSTER_BOOTSTRAP_SERVER 22 | value: "kafka-cluster-kafka-bootstrap:9092" 23 | resources: 24 | limits: 25 | nvidia.com/gpu: 1 26 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/k8s/triton/morpheus-triton-server.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: tritonserver 5 | spec: 6 | replicas: 1 7 | selector: 8 | matchLabels: 9 | app: tritonserver 10 | template: 11 | metadata: 12 | labels: 13 | app: tritonserver 14 
| spec: 15 | containers: 16 | - name: tritonserver 17 | image: nvcr.io/nvidia/morpheus/morpheus-tritonserver-models:25.02 18 | command: ["tritonserver"] 19 | args: 20 | - "--model-repository=/models/triton-model-repo" 21 | - "--exit-on-error=false" 22 | - "--model-control-mode=explicit" 23 | - "--load-model" 24 | - "sid-minibert-onnx" 25 | ports: 26 | - containerPort: 8000 27 | name: http 28 | - containerPort: 8001 29 | name: grpc 30 | - containerPort: 8002 31 | name: metrics 32 | resources: 33 | limits: 34 | nvidia.com/gpu: 1 35 | --- 36 | apiVersion: v1 37 | kind: Service 38 | metadata: 39 | name: tritonserver 40 | spec: 41 | selector: 42 | app: tritonserver 43 | ports: 44 | - name: http 45 | port: 8000 46 | targetPort: 8000 47 | - name: grpc 48 | port: 8001 49 | targetPort: 8001 50 | - name: metrics 51 | port: 8002 52 | targetPort: 8002 53 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/pipeline-dockerfile/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM nvidia/cuda:12.8.1-runtime-ubuntu24.04 2 | 3 | # Install curl (needed to fetch miniforge installer) 4 | RUN apt-get update && apt-get install -y \ 5 | curl \ 6 | && rm -rf /var/lib/apt/lists/* 7 | 8 | # Download and install Miniforge from GitHub 9 | RUN curl -L "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" -o miniforge.sh && \ 10 | bash miniforge.sh -b -p /opt/conda && \ 11 | rm miniforge.sh 12 | 13 | # Set up miniforge environment 14 | ENV PATH=/opt/conda/bin:$PATH 15 | 16 | RUN <> /etc/skel/.bashrc 20 | echo ". /opt/conda/etc/profile.d/conda.sh; conda activate base" >> ~/.bashrc 21 | EOF 22 | 23 | # Copy the environment YAML file into the container 24 | COPY morpheus-nightly-env.yaml /tmp/env.yaml 25 | 26 | ARG CUDA_VERSION=12.8 27 | 28 | # Install dependencies from the YAML file using mamba 29 | RUN CONDA_OVERRIDE_CUDA=$CUDA_VERSION conda env create -n morpheus_env -f /tmp/env.yaml && \ 30 | conda clean --all --yes && \ 31 | echo ". /opt/conda/etc/profile.d/conda.sh; conda activate morpheus_env" >> ~/.bashrc 32 | 33 | # Copy pipeline script 34 | COPY run_pipeline_kafka.py /workspace/run_pipeline_kafka.py 35 | COPY network_traffic_analyzer_stage.py /workspace/network_traffic_analyzer_stage.py 36 | COPY message_filter_stage.py /workspace/message_filter_stage.py 37 | WORKDIR /workspace 38 | 39 | # Set entrypoint to run the script in the morpheus environment 40 | ENTRYPOINT ["/bin/bash", "-c", "\ 41 | source /opt/conda/etc/profile.d/conda.sh && \ 42 | conda activate morpheus_env && \ 43 | python run_pipeline_kafka.py"] -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/pipeline-dockerfile/message_filter_stage.py: -------------------------------------------------------------------------------- 1 | import mrc 2 | from morpheus.messages import ControlMessage, MessageMeta 3 | from morpheus.pipeline.single_port_stage import SinglePortStage 4 | from morpheus.pipeline.stage_schema import StageSchema 5 | from mrc.core import operators as ops 6 | 7 | 8 | class MessageFilterStage(SinglePortStage): 9 | """ 10 | A stage that filters out messages shorter than a specified length using GPU-accelerated operations. 
11 | 12 | Parameters 13 | ---------- 14 | column : str, default = 'data' 15 | The column containing the text to filter 16 | min_length : int, default = 10 17 | Minimum length of messages to keep. Messages shorter than this will be filtered out. 18 | """ 19 | 20 | def __init__(self, c, column: str = "data", min_length: int = 50): 21 | super().__init__(c) 22 | self._column = column 23 | self._min_length = min_length 24 | 25 | @property 26 | def name(self) -> str: 27 | return "message-filter" 28 | 29 | def accepted_types(self) -> tuple: 30 | return (ControlMessage,) 31 | 32 | def supports_cpp_node(self) -> bool: 33 | return False 34 | 35 | def compute_schema(self, schema: StageSchema): 36 | schema.output_schema.set_type(ControlMessage) 37 | 38 | def on_data(self, message: ControlMessage) -> ControlMessage: 39 | # Get the payload from the ControlMessage 40 | if message is None: 41 | return None 42 | 43 | with message.payload().mutable_dataframe() as cudf_df: 44 | # Filter based on column length 45 | mask = cudf_df[self._column].str.len() >= self._min_length 46 | 47 | new_meta = MessageMeta(cudf_df[mask]) 48 | 49 | # Set the new metadata as the payload 50 | message.payload(new_meta) 51 | 52 | return message 53 | 54 | def _build_single( 55 | self, builder: mrc.Builder, input_node: mrc.SegmentObject 56 | ) -> mrc.SegmentObject: 57 | node = builder.make_node(self.unique_name, ops.map(self.on_data)) 58 | builder.make_edge(input_node, node) 59 | return node 60 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/pipeline-dockerfile/morpheus-nightly-env.yaml: -------------------------------------------------------------------------------- 1 | channels: 2 | - conda-forge 3 | - nvidia/label/dev 4 | - rapidsai 5 | dependencies: 6 | - python>=3.12 7 | - morpheus-core=25.06 8 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/pipeline-dockerfile/network_traffic_analyzer_stage.py: -------------------------------------------------------------------------------- 1 | import mrc 2 | from morpheus.messages import ControlMessage, MessageMeta 3 | from morpheus.pipeline.single_port_stage import SinglePortStage 4 | from morpheus.pipeline.stage_schema import StageSchema 5 | from mrc.core import operators as ops 6 | 7 | 8 | class NetworkTrafficAnalyzerStage(SinglePortStage): 9 | """ 10 | A stage that analyzes network traffic patterns using GPU-accelerated operations. 11 | This stage adds insights about high-volume sources/destinations and common port pairs. 12 | """ 13 | 14 | def __init__(self, c): 15 | super().__init__(c) 16 | 17 | @property 18 | def name(self) -> str: 19 | return "network-traffic-analyzer" 20 | 21 | def accepted_types(self) -> tuple: 22 | return (ControlMessage,) 23 | 24 | def supports_cpp_node(self) -> bool: 25 | return False 26 | 27 | def compute_schema(self, schema: StageSchema): 28 | schema.output_schema.set_type(ControlMessage) 29 | 30 | def on_data(self, message: ControlMessage) -> ControlMessage: 31 | if message is None: 32 | return None 33 | 34 | with message.payload().mutable_dataframe() as cudf_df: 35 | # Convert data_len to numeric type 36 | cudf_df["data_len"] = cudf_df["data_len"].astype("int64") 37 | 38 | # 1. 
Identify high-volume source IPs 39 | src_ip_stats = cudf_df.groupby("src_ip")["data_len"].sum().reset_index() 40 | high_volume_src_ips = src_ip_stats[ 41 | src_ip_stats["data_len"] > src_ip_stats["data_len"].mean() 42 | ]["src_ip"] 43 | cudf_df["is_high_volume_src"] = cudf_df["src_ip"].isin(high_volume_src_ips) 44 | 45 | # 2. Identify high-volume destination IPs 46 | dest_ip_stats = cudf_df.groupby("dest_ip")["data_len"].sum().reset_index() 47 | high_volume_dest_ips = dest_ip_stats[ 48 | dest_ip_stats["data_len"] > dest_ip_stats["data_len"].mean() 49 | ]["dest_ip"] 50 | cudf_df["is_high_volume_dest"] = cudf_df["dest_ip"].isin( 51 | high_volume_dest_ips 52 | ) 53 | 54 | # 3. Identify common port pairs 55 | # Create port pair identifiers using string concatenation 56 | cudf_df["port_pair"] = cudf_df["src_port"] + ":" + cudf_df["dest_port"] 57 | 58 | # Count occurrences of each port pair 59 | port_stats = cudf_df["port_pair"].value_counts().reset_index() 60 | port_stats.columns = ["port_pair", "count"] 61 | 62 | # Identify common port pairs (above average frequency) 63 | common_port_pairs = port_stats[ 64 | port_stats["count"] > port_stats["count"].mean() 65 | ]["port_pair"] 66 | 67 | # Check if each port pair is common 68 | cudf_df["is_common_port_pair"] = cudf_df["port_pair"].isin( 69 | common_port_pairs 70 | ) 71 | 72 | # Remove temporary column 73 | cudf_df = cudf_df.drop("port_pair", axis=1) 74 | 75 | # Create new metadata with the analysis results 76 | new_meta = MessageMeta(cudf_df) 77 | message.payload(new_meta) 78 | 79 | return message 80 | 81 | def _build_single( 82 | self, builder: mrc.Builder, input_node: mrc.SegmentObject 83 | ) -> mrc.SegmentObject: 84 | node = builder.make_node(self.unique_name, ops.map(self.on_data)) 85 | builder.make_edge(input_node, node) 86 | return node 87 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/pipeline-dockerfile/run_pipeline_kafka.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import logging 4 | import os 5 | from pprint import pprint 6 | 7 | from message_filter_stage import MessageFilterStage 8 | from morpheus.common import FilterSource 9 | from morpheus.config import Config, PipelineModes 10 | from morpheus.pipeline import LinearPipeline 11 | from morpheus.stages.general.monitor_stage import MonitorStage 12 | from morpheus.stages.inference.triton_inference_stage import TritonInferenceStage 13 | from morpheus.stages.input.kafka_source_stage import KafkaSourceStage 14 | from morpheus.stages.output.write_to_kafka_stage import WriteToKafkaStage 15 | from morpheus.stages.postprocess.add_classifications_stage import ( 16 | AddClassificationsStage, 17 | ) 18 | from morpheus.stages.postprocess.filter_detections_stage import FilterDetectionsStage 19 | from morpheus.stages.postprocess.serialize_stage import SerializeStage 20 | from morpheus.stages.preprocess.deserialize_stage import DeserializeStage 21 | from morpheus.stages.preprocess.preprocess_nlp_stage import PreprocessNLPStage 22 | from morpheus.utils.file_utils import get_data_file_path, load_labels_file 23 | from morpheus.utils.logger import configure_logging 24 | from network_traffic_analyzer_stage import NetworkTrafficAnalyzerStage 25 | 26 | 27 | def main(): 28 | # Get the Kafka bootstrap server from the environment variable 29 | bootstrap_server = os.getenv("KAFKA_CLUSTER_BOOTSTRAP_SERVER") 30 | if not bootstrap_server: 31 | raise 
RuntimeError( 32 | """KAFKA_CLUSTER_BOOTSTRAP_SERVER environment variable 33 | is not set. Please set it to your Kafka bootstrap service address.""" 34 | ) 35 | 36 | # Get the Triton server URL from the environment variable 37 | triton_server = os.getenv("TRITON_SERVER") 38 | if not triton_server: 39 | raise RuntimeError( 40 | """TRITON_SERVER environment variable 41 | is not set. Please set it to your Triton Inference Server address.""" 42 | ) 43 | 44 | # Configure logging 45 | configure_logging(log_level=logging.DEBUG) 46 | 47 | # Create a pipeline configuration 48 | config = Config() 49 | config.mode = PipelineModes.NLP 50 | config.pipeline_batch_size = 1024 51 | config.model_max_batch_size = 32 52 | config.feature_length = 256 53 | config.num_threads = min( 54 | len(os.sched_getaffinity(0)), 16 55 | ) # choose threads = num cores unless more than 16 56 | config.class_labels = load_labels_file(get_data_file_path("data/labels_nlp.txt")) 57 | 58 | # Print the config dictionary 59 | pprint(vars(config)) 60 | 61 | # Confirm we are using right kafka bootstrap server 62 | print(f"Using Kafka bootstrap server: {bootstrap_server}") 63 | 64 | # Create the pipeline 65 | pipeline = LinearPipeline(config) 66 | 67 | # Add stages to the pipeline 68 | pipeline.set_source( 69 | KafkaSourceStage( 70 | config, 71 | bootstrap_servers=bootstrap_server, 72 | input_topic=["network-traffic-input"], 73 | group_id="network-traffic-group", 74 | auto_offset_reset="latest", 75 | ) 76 | ) 77 | pipeline.add_stage(DeserializeStage(config)) 78 | pipeline.add_stage(MessageFilterStage(config, column="data", min_length=50)) 79 | pipeline.add_stage( 80 | PreprocessNLPStage( 81 | config, 82 | vocab_hash_file="data/bert-base-uncased-hash.txt", 83 | do_lower_case=True, 84 | truncation=True, 85 | add_special_tokens=False, 86 | ) 87 | ) 88 | pipeline.add_stage( 89 | TritonInferenceStage( 90 | config, 91 | model_name="sid-minibert-onnx", 92 | server_url=triton_server, 93 | force_convert_inputs=True, 94 | ) 95 | ) 96 | pipeline.add_stage( 97 | MonitorStage(config, description="Inference Rate", smoothing=0.001, unit="inf") 98 | ) 99 | pipeline.add_stage(AddClassificationsStage(config)) 100 | pipeline.add_stage(FilterDetectionsStage(config, filter_source=FilterSource.TENSOR)) 101 | pipeline.add_stage(NetworkTrafficAnalyzerStage(config)) 102 | pipeline.add_stage(SerializeStage(config, exclude=["^_ts_"])) 103 | 104 | # Add Kafka sink stage 105 | pipeline.add_stage( 106 | WriteToKafkaStage( 107 | config, 108 | bootstrap_servers=bootstrap_server, 109 | output_topic="network-traffic-results", 110 | ) 111 | ) 112 | 113 | # Run the pipeline 114 | pipeline.run() 115 | 116 | 117 | if __name__ == "__main__": 118 | main() 119 | -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/producer-dockerfile/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.12-slim 2 | 3 | # Copy application files 4 | COPY producer.py /workspace/producer.py 5 | COPY pcap_dump.jsonlines /workspace/pcap_dump.jsonlines 6 | 7 | WORKDIR /workspace 8 | 9 | # Install kafka-python 10 | RUN pip install kafka-python==2.2.3 11 | 12 | # Run the script (this uses the --message-limit default to override add in docker run) 13 | ENTRYPOINT ["python", "producer.py"] -------------------------------------------------------------------------------- /source/examples/rapids-morpheus-pipeline/scripts/producer-dockerfile/producer.py: 
-------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import random 5 | import time 6 | 7 | from kafka import KafkaProducer 8 | 9 | 10 | def main(): 11 | # Parse command line arguments 12 | parser = argparse.ArgumentParser( 13 | description="Kafka producer for network traffic data" 14 | ) 15 | parser.add_argument( 16 | "--message-limit", 17 | type=int, 18 | default=0, 19 | help="Maximum number of messages to send (default: 0, run indefinitely)", 20 | ) 21 | args = parser.parse_args() 22 | 23 | # Get Kafka bootstrap server from environment variable 24 | bootstrap_servers = os.getenv("KAFKA_CLUSTER_BOOTSTRAP_SERVER") 25 | if not bootstrap_servers: 26 | raise RuntimeError( 27 | """KAFKA_CLUSTER_BOOTSTRAP_SERVER environment variable 28 | is not set. Please set it to your Kafka bootstrap service address.""" 29 | ) 30 | 31 | # Initialize Kafka producer with optimized settings 32 | producer = KafkaProducer( 33 | bootstrap_servers=bootstrap_servers, 34 | value_serializer=lambda v: json.dumps(v).encode("utf-8"), 35 | # Performance optimizations 36 | batch_size=16384, # batch size 16 KB 37 | linger_ms=5, # Wait up to 5ms for more messages to batch 38 | compression_type="gzip", # Enable compression 39 | buffer_memory=33554432, # 32MB buffer 40 | max_request_size=1048576, # 1MB max request size 41 | retries=3, 42 | acks=1, # Leader acknowledgment only for better throughput 43 | ) 44 | 45 | print("Starting to send messages...") 46 | start_time = time.time() 47 | message_count = 0 48 | 49 | # First, read all lines into memory for random sampling 50 | with open("pcap_dump.jsonlines") as file: 51 | all_lines = file.readlines() 52 | 53 | print(f"Loaded {len(all_lines)} lines into memory for sampling") 54 | 55 | try: 56 | while True: 57 | # Check if we've reached the message limit 58 | if args.message_limit > 0 and message_count >= args.message_limit: 59 | print(f"\nReached message limit of {args.message_limit}") 60 | break 61 | 62 | # Randomly sample a line 63 | random_line = random.choice(all_lines) 64 | 65 | try: 66 | # Parse the line as JSON 67 | data = json.loads(random_line.strip()) 68 | # Send to Kafka asynchronously 69 | producer.send("network-traffic-input", data) 70 | message_count += 1 71 | 72 | # Print progress every 10000 messages 73 | if message_count % 10000 == 0: 74 | print(f"Sent {message_count} messages...") 75 | except json.JSONDecodeError as e: 76 | print(f"Error decoding JSON: {e}") 77 | continue 78 | 79 | except KeyboardInterrupt: 80 | print("\nReceived keyboard interrupt, stopping...") 81 | 82 | finally: 83 | # Flush and close the producer 84 | producer.flush() 85 | producer.close() 86 | 87 | end_time = time.time() 88 | duration = end_time - start_time 89 | messages_per_second = message_count / duration 90 | 91 | print("\nData publishing complete!") 92 | print(f"Total messages sent: {message_count}") 93 | print(f"Total time: {duration:.2f} seconds") 94 | print(f"Messages per second: {messages_per_second:.2f}") 95 | 96 | 97 | if __name__ == "__main__": 98 | main() 99 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-higgs/.dockerignore: -------------------------------------------------------------------------------- 1 | # ignore everything by default 2 | * 3 | 4 | # except these specific things 5 | !entrypoint.sh 6 | !rapids-higgs.py 7 | -------------------------------------------------------------------------------- 
/source/examples/rapids-sagemaker-higgs/Dockerfile: -------------------------------------------------------------------------------- 1 | ARG RAPIDS_IMAGE 2 | 3 | FROM $RAPIDS_IMAGE as rapids 4 | 5 | # Installs a few more dependencies 6 | RUN conda install --yes -n base \ 7 | cupy \ 8 | flask \ 9 | protobuf \ 10 | 'sagemaker-python-sdk>=2.239.0' 11 | 12 | # Copies the training code inside the container 13 | COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py 14 | 15 | # Defines rapids-higgs.py as script entry point 16 | # ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html 17 | ENV SAGEMAKER_PROGRAM rapids-higgs.py 18 | 19 | # override entrypoint from the base image with one that accepts 20 | # 'train' and 'serve' (as SageMaker expects to provide) 21 | COPY entrypoint.sh /opt/entrypoint.sh 22 | ENTRYPOINT ["/opt/entrypoint.sh"] 23 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-higgs/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # [description] 4 | # 5 | # SageMaker runs your image like 'docker run train'. 6 | # ref: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image 7 | # 8 | # This entrypoint is used to override the entrypoint in the base image, to ensure 9 | # that that works as expected. 10 | # 11 | 12 | set -e 13 | 14 | if [[ "$1" == "train" ]]; then 15 | echo -e "@ entrypoint -> launching training script \n" 16 | python /opt/ml/code/rapids-higgs.py 17 | else 18 | echo -e "@ entrypoint -> did not recognize option '${1}' \n" 19 | exit 1 20 | fi 21 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-higgs/rapids-higgs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import argparse 4 | 5 | import cudf 6 | from cuml import RandomForestClassifier as cuRF 7 | from cuml.metrics import accuracy_score 8 | from cuml.model_selection import train_test_split 9 | 10 | 11 | def main(args): 12 | # SageMaker options 13 | data_dir = args.data_dir 14 | 15 | col_names = ["label"] + [f"col-{i}" for i in range(2, 30)] # Assign column names 16 | dtypes_ls = ["int32"] + [ 17 | "float32" for _ in range(2, 30) 18 | ] # Assign dtypes to each column 19 | 20 | data = cudf.read_csv(data_dir + "HIGGS.csv", names=col_names, dtype=dtypes_ls) 21 | X_train, X_test, y_train, y_test = train_test_split(data, "label", train_size=0.70) 22 | 23 | # Hyper-parameters 24 | hyperparams = { 25 | "n_estimators": args.n_estimators, 26 | "max_depth": args.max_depth, 27 | "n_bins": args.n_bins, 28 | "split_criterion": args.split_criterion, 29 | "bootstrap": args.bootstrap, 30 | "max_leaves": args.max_leaves, 31 | "max_features": args.max_features, 32 | } 33 | 34 | cu_rf = cuRF(**hyperparams) 35 | cu_rf.fit(X_train, y_train) 36 | 37 | print("test_acc:", accuracy_score(cu_rf.predict(X_test), y_test)) 38 | 39 | 40 | if __name__ == "__main__": 41 | parser = argparse.ArgumentParser() 42 | 43 | # Hyper-parameters 44 | parser.add_argument("--n_estimators", type=int, default=20) 45 | parser.add_argument("--max_depth", type=int, default=16) 46 | parser.add_argument("--n_bins", type=int, default=8) 47 | parser.add_argument("--split_criterion", type=int, default=0) 48 | parser.add_argument("--bootstrap", type=bool, default=True) 49 | parser.add_argument("--max_leaves", 
type=int, default=-1) 50 | parser.add_argument("--max_features", type=float, default=0.2) 51 | 52 | # SageMaker parameters 53 | # ref: https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html 54 | parser.add_argument("--model_output_dir", type=str, default="/opt/ml/output/") 55 | parser.add_argument("--data_dir", type=str, default="/opt/ml/input/data/training/") 56 | 57 | args = parser.parse_args() 58 | main(args) 59 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-hpo/HPODatasets.py: -------------------------------------------------------------------------------- 1 | """Airline Dataset target label and feature column names""" 2 | 3 | airline_label_column = "ArrDel15" 4 | airline_feature_columns = [ 5 | "Year", 6 | "Quarter", 7 | "Month", 8 | "DayOfWeek", 9 | "Flight_Number_Reporting_Airline", 10 | "DOT_ID_Reporting_Airline", 11 | "OriginCityMarketID", 12 | "DestCityMarketID", 13 | "DepTime", 14 | "DepDelay", 15 | "DepDel15", 16 | "ArrDel15", 17 | "AirTime", 18 | "Distance", 19 | ] 20 | airline_dtype = "float32" 21 | 22 | """ NYC TLC Trip Record Data target label and feature column names """ 23 | nyctaxi_label_column = "above_average_tip" 24 | nyctaxi_feature_columns = [ 25 | "VendorID", 26 | "tpep_pickup_datetime", 27 | "tpep_dropoff_datetime", 28 | "passenger_count", 29 | "trip_distance", 30 | "RatecodeID", 31 | "store_and_fwd_flag", 32 | "PULocationID", 33 | "DOLocationID", 34 | "payment_type", 35 | "fare_amount", 36 | "extra", 37 | "mta_tax", 38 | "tolls_amount", 39 | "improvement_surcharge", 40 | "total_amount", 41 | "congestion_surcharge", 42 | "above_average_tip", 43 | ] 44 | nyctaxi_dtype = "float32" 45 | 46 | 47 | """ Insert your dataset here! """ 48 | 49 | BYOD_label_column = "" # e.g., nyctaxi_label_column 50 | BYOD_feature_columns = [] # e.g., nyctaxi_feature_columns 51 | BYOD_dtype = None # e.g., nyctaxi_dtype 52 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-hpo/MLWorkflow.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2019-2021, NVIDIA CORPORATION. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | import functools 18 | import logging 19 | import time 20 | from abc import abstractmethod 21 | 22 | hpo_log = logging.getLogger("hpo_log") 23 | 24 | 25 | def create_workflow(hpo_config): 26 | """Workflow Factory [instantiate MLWorkflow based on config]""" 27 | if hpo_config.compute_type == "single-CPU": 28 | from workflows.MLWorkflowSingleCPU import MLWorkflowSingleCPU 29 | 30 | return MLWorkflowSingleCPU(hpo_config) 31 | 32 | if hpo_config.compute_type == "multi-CPU": 33 | from workflows.MLWorkflowMultiCPU import MLWorkflowMultiCPU 34 | 35 | return MLWorkflowMultiCPU(hpo_config) 36 | 37 | if hpo_config.compute_type == "single-GPU": 38 | from workflows.MLWorkflowSingleGPU import MLWorkflowSingleGPU 39 | 40 | return MLWorkflowSingleGPU(hpo_config) 41 | 42 | if hpo_config.compute_type == "multi-GPU": 43 | from workflows.MLWorkflowMultiGPU import MLWorkflowMultiGPU 44 | 45 | return MLWorkflowMultiGPU(hpo_config) 46 | 47 | 48 | class MLWorkflow: 49 | @abstractmethod 50 | def ingest_data(self): 51 | pass 52 | 53 | @abstractmethod 54 | def handle_missing_data(self, dataset): 55 | pass 56 | 57 | @abstractmethod 58 | def split_dataset(self, dataset, i_fold): 59 | pass 60 | 61 | @abstractmethod 62 | def fit(self, X_train, y_train): 63 | pass 64 | 65 | @abstractmethod 66 | def predict(self, trained_model, X_test): 67 | pass 68 | 69 | @abstractmethod 70 | def score(self, y_test, predictions): 71 | pass 72 | 73 | @abstractmethod 74 | def save_trained_model(self, score, trained_model): 75 | pass 76 | 77 | @abstractmethod 78 | def cleanup(self, i_fold): 79 | pass 80 | 81 | @abstractmethod 82 | def emit_final_score(self): 83 | pass 84 | 85 | 86 | def timer_decorator(target_function): 87 | @functools.wraps(target_function) 88 | def timed_execution_wrapper(*args, **kwargs): 89 | start_time = time.perf_counter() 90 | result = target_function(*args, **kwargs) 91 | exec_time = time.perf_counter() - start_time 92 | hpo_log.info( 93 | f" --- {target_function.__name__}" f" completed in {exec_time:.5f} s" 94 | ) 95 | return result 96 | 97 | return timed_execution_wrapper 98 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-hpo/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [[ "$1" == "serve" ]]; then 4 | echo -e "@ entrypoint -> launching serving script \n" 5 | python serve.py 6 | else 7 | echo -e "@ entrypoint -> launching training script \n" 8 | python train.py 9 | fi 10 | -------------------------------------------------------------------------------- /source/examples/rapids-sagemaker-hpo/train.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2019-2021, NVIDIA CORPORATION. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | import logging 18 | import sys 19 | import traceback 20 | 21 | from HPOConfig import HPOConfig 22 | from MLWorkflow import create_workflow 23 | 24 | 25 | def train(): 26 | hpo_config = HPOConfig(input_args=sys.argv[1:]) 27 | ml_workflow = create_workflow(hpo_config) 28 | 29 | # cross-validation to improve robustness via multiple train/test reshuffles 30 | for i_fold in range(hpo_config.cv_folds): 31 | # ingest 32 | dataset = ml_workflow.ingest_data() 33 | 34 | # handle missing samples [ drop ] 35 | dataset = ml_workflow.handle_missing_data(dataset) 36 | 37 | # split into train and test set 38 | X_train, X_test, y_train, y_test = ml_workflow.split_dataset( 39 | dataset, random_state=i_fold 40 | ) 41 | 42 | # train model 43 | trained_model = ml_workflow.fit(X_train, y_train) 44 | 45 | # use trained model to predict target labels of test data 46 | predictions = ml_workflow.predict(trained_model, X_test) 47 | 48 | # score test set predictions against ground truth 49 | score = ml_workflow.score(y_test, predictions) 50 | 51 | # save trained model [ if it sets a new-high score ] 52 | ml_workflow.save_best_model(score, trained_model) 53 | 54 | # restart cluster to avoid memory creep [ for multi-CPU/GPU ] 55 | ml_workflow.cleanup(i_fold) 56 | 57 | # emit final score to cloud HPO [i.e., SageMaker] 58 | ml_workflow.emit_final_score() 59 | 60 | 61 | def configure_logging(): 62 | hpo_log = logging.getLogger("hpo_log") 63 | log_handler = logging.StreamHandler() 64 | log_handler.setFormatter( 65 | logging.Formatter("%(asctime)-15s %(levelname)8s %(name)s %(message)s") 66 | ) 67 | hpo_log.addHandler(log_handler) 68 | hpo_log.setLevel(logging.DEBUG) 69 | hpo_log.propagate = False 70 | 71 | 72 | if __name__ == "__main__": 73 | configure_logging() 74 | try: 75 | train() 76 | sys.exit(0) # success exit code 77 | except Exception: 78 | traceback.print_exc() 79 | sys.exit(-1) # failure exit code 80 | -------------------------------------------------------------------------------- /source/examples/xgboost-azure-mnmg-daskcloudprovider/configs/cloud_init.yaml.j2: -------------------------------------------------------------------------------- 1 | #cloud-config 2 | 3 | 4 | # Bootstrap 5 | packages: 6 | - apt-transport-https 7 | - ca-certificates 8 | - curl 9 | - gnupg-agent 10 | - software-properties-common 11 | - ubuntu-drivers-common 12 | 13 | # Enable ipv4 forwarding, required on CIS hardened machines 14 | write_files: 15 | - path: /etc/sysctl.d/enabled_ipv4_forwarding.conf 16 | content: | 17 | net.ipv4.conf.all.forwarding=1 18 | 19 | # create the docker group 20 | groups: 21 | - docker 22 | 23 | # Add default auto created user to docker group 24 | system_info: 25 | default_user: 26 | groups: [docker] 27 | 28 | 29 | runcmd: 30 | 31 | # Install Docker 32 | - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - 33 | - add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" 34 | - apt-get update -y 35 | - apt-get install -y docker-ce docker-ce-cli containerd.io 36 | - systemctl start docker 37 | - systemctl enable docker 38 | 39 | 40 | 41 | # Install NVIDIA driver 42 | - DEBIAN_FRONTEND=noninteractive ubuntu-drivers install 43 | 44 | # Install NVIDIA docker 45 | - curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - 46 | - curl -s -L https://nvidia.github.io/nvidia-docker/$(. 
/etc/os-release;echo $ID$VERSION_ID)/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list 47 | - apt-get update -y 48 | - apt-get install -y nvidia-docker2 49 | - systemctl restart docker 50 | 51 | # Attempt to run a RAPIDS container to download the container layers and decompress them 52 | - 'docker run --net=host --gpus=all --shm-size=256m rapidsai/base:latest --version' 53 | -------------------------------------------------------------------------------- /source/examples/xgboost-azure-mnmg-daskcloudprovider/trained-model_nyctaxi.xgb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/examples/xgboost-azure-mnmg-daskcloudprovider/trained-model_nyctaxi.xgb -------------------------------------------------------------------------------- /source/examples/xgboost-randomforest-gpu-hpo-dask/rapids_hpo/data/airlines.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/examples/xgboost-randomforest-gpu-hpo-dask/rapids_hpo/data/airlines.parquet -------------------------------------------------------------------------------- /source/guides/caching-docker-images.md: -------------------------------------------------------------------------------- 1 | # Caching Docker Images For Autoscaling Workloads 2 | 3 | The [Dask Autoscaler](https://kubernetes.dask.org/en/latest/operator_resources.html#daskautoscaler) leverages Dask's adaptive mode and allows the scheduler to scale the number of workers up and down based on the task graph. 4 | 5 | When scaling the Dask cluster up or down, there is no guarantee that newly created worker Pods will be scheduled on the same node as previously removed workers. As a result, when a new node is allocated for a worker Pod, the cluster will incur a pull penalty due to the need to download the Docker image. 6 | 7 | ## Using a Daemonset to cache images 8 | 9 | To guarantee that each node runs a consistent workload, we will deploy a Kubernetes [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) utilizing the RAPIDS image. This DaemonSet will prevent Dask worker Pods created from this image from entering a pending state when tasks are scheduled. 10 | 11 | This is an example manifest to deploy a Daemonset with the RAPIDS container. 12 | 13 | ```yaml 14 | #caching-daemonset.yaml 15 | apiVersion: apps/v1 16 | kind: DaemonSet 17 | metadata: 18 | name: prepuller 19 | namespace: image-cache 20 | spec: 21 | selector: 22 | matchLabels: 23 | name: prepuller 24 | template: 25 | metadata: 26 | labels: 27 | name: prepuller 28 | spec: 29 | initContainers: 30 | - name: prepuller-1 31 | image: "{{ rapids_container }}" 32 | command: ["sh", "-c", "'true'"] 33 | 34 | containers: 35 | - name: pause 36 | image: gcr.io/google_containers/pause:3.2 37 | resources: 38 | limits: 39 | cpu: 1m 40 | memory: 8Mi 41 | requests: 42 | cpu: 1m 43 | memory: 8Mi 44 | ``` 45 | 46 | You can create this Daemonset with `kubectl`. 47 | 48 | ```console 49 | $ kubectl apply -f caching-daemonset.yaml 50 | ``` 51 | 52 | The DaemonSet is deployed in the `image-cache` namespace. In the `initContainers` section, we specify the image to be pulled and cached within the cluster, utilizing any executable command that terminates successfully. 
Additionally, the `pause` container is used to ensure the Pod transitions into a Running state without consuming resources or running any processes. 53 | 54 | When deploying the DaemonSet, after all pre-puller Pods are running successfully, you can confirm that the images have been cached across all nodes in the cluster. As the Kubernetes cluster is scaled up or down, the DaemonSet will automatically pull and cache the necessary images on any newly added nodes, ensuring consistent image availability throughout the cluster. 55 | -------------------------------------------------------------------------------- /source/guides/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Guides 7 | 8 | `````{gridtoctree} 1 2 2 3 9 | :gutter: 2 2 2 2 10 | 11 | ````{grid-item-card} 12 | :link: mig 13 | :link-type: doc 14 | Multi-Instance GPUs 15 | ^^^ 16 | Use RAPIDS with Multi-Instance GPUs 17 | 18 | {bdg}`Dask Cluster` 19 | {bdg}`XGBoost with Dask Cluster` 20 | ```` 21 | 22 | ````{grid-item-card} 23 | :link: azure/infiniband 24 | :link-type: doc 25 | Infiniband on Azure 26 | ^^^ 27 | How to set up InfiniBand on Azure. 28 | 29 | {bdg}`Microsoft Azure` 30 | ```` 31 | 32 | ````{grid-item-card} 33 | :link: scheduler-gpu-requirements 34 | :link-type: doc 35 | Does the Dask scheduler need a GPU? 36 | ^^^ 37 | Guidance on Dask scheduler software and hardware requirements. 38 | 39 | {bdg-primary}`Dask` 40 | ```` 41 | 42 | ````{grid-item-card} 43 | :link: scheduler-gpu-optimization 44 | :link-type: doc 45 | Optimizing the Dask Scheduler on Kubernetes 46 | ^^^ 47 | Use a T4 for the scheduler to optimize resource costs on Kubernetes 48 | 49 | {bdg-primary}`Dask` 50 | {bdg-primary}`Kubernetes` 51 | {bdg-primary}`dask-operator` 52 | ```` 53 | 54 | ````{grid-item-card} 55 | :link: colocate-workers 56 | :link-type: doc 57 | Colocate worker pods on Kubernetes 58 | ^^^ 59 | Use Pod affinity for the workers to optimize communication overhead on Kubernetes 60 | 61 | {bdg-primary}`Dask` 62 | {bdg-primary}`Kubernetes` 63 | {bdg-primary}`dask-operator` 64 | ```` 65 | 66 | ````{grid-item-card} 67 | :link: caching-docker-images 68 | :link-type: doc 69 | Caching Docker Images for autoscaling workloads 70 | ^^^ 71 | Prepull Docker Images while using the Dask Autoscaler on Kubernetes 72 | 73 | {bdg-primary}`Dask` 74 | {bdg-primary}`Kubernetes` 75 | {bdg-primary}`dask-operator` 76 | ```` 77 | 78 | ````{grid-item-card} 79 | :link: l4-gcp 80 | :link-type: doc 81 | L4 on Google Cloud Platform 82 | ^^^ 83 | How to set up a VM instance on GCP with an L4 GPU. 84 | 85 | {bdg-primary}`Google Cloud Platform` 86 | ```` 87 | 88 | ````` 89 | -------------------------------------------------------------------------------- /source/guides/l4-gcp.md: -------------------------------------------------------------------------------- 1 | # L4 GPUs on Google Cloud Platform (GCP) 2 | 3 | [L4 GPUs](https://www.nvidia.com/en-us/data-center/l4/) are a more energy and computationally efficient option compared to T4 GPUs. L4 GPUs are [generally available on GCP](https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus) to run your workflows with RAPIDS. 4 | 5 | ## Compute Engine Instance 6 | 7 | ### Create the Virtual Machine 8 | 9 | To create a VM instance with an L4 GPU to run RAPIDS: 10 | 11 | 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 12 | 1.
Select **Create Instance**. 13 | 1. Under the **Machine configuration** section, select **GPUs** and then select `NVIDIA L4` in the **GPU type** dropdown. 14 | 1. Under the **Boot Disk** section, click **CHANGE** and select `Deep Learning on Linux` in the **Operating System** dropdown. 15 | 1. It is also recommended to increase the default boot disk size to something like `100GB`. 16 | 1. Once you have customized other attributes of the instance, click **CREATE**. 17 | 18 | ### Allow network access 19 | 20 | To access Jupyter and Dask we will need to set up some firewall rules to open up some ports. 21 | 22 | #### Create the firewall rule 23 | 24 | 1. Open [**VPC Network**](https://console.cloud.google.com/networking/networks/list). 25 | 2. Select **Firewall** and **Create firewall rule** 26 | 3. Give the rule a name like `rapids` and ensure the network matches the one you selected for the VM. 27 | 4. Add a tag like `rapids` which we will use to assign the rule to our VM. 28 | 5. Set your source IP range. We recommend you restrict this to your own IP address or your corporate network rather than `0.0.0.0/0` which will allow anyone to access your VM. 29 | 6. Under **Protocols and ports** allow TCP connections on ports `22,8786,8787,8888`. 30 | 31 | #### Assign it to the VM 32 | 33 | 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 34 | 2. Select your VM and press **Edit**. 35 | 3. Scroll down to **Networking** and add the `rapids` network tag you gave your firewall rule. 36 | 4. Select **Save**. 37 | 38 | ### Connect to the VM 39 | 40 | Next we need to connect to the VM. 41 | 42 | 1. Open [**Compute Engine**](https://console.cloud.google.com/compute/instances). 43 | 2. Locate your VM and press the **SSH** button which will open a new browser tab with a terminal. 44 | 45 | ### Install CUDA and NVIDIA Container Toolkit 46 | 47 | Since [GCP recommends CUDA 12](https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#no-secure-boot) on L4 VM, we will be upgrading CUDA. 48 | 49 | 1. [Install CUDA Toolkit 12](https://developer.nvidia.com/cuda-downloads) in your VM and accept the default prompts with the following commands. 50 | 51 | ```bash 52 | $ wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run 53 | $ sudo sh cuda_12.1.1_530.30.02_linux.run 54 | ``` 55 | 56 | 1. [Install NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit) with the following commands. 57 | 58 | ```bash 59 | $ sudo apt-get update 60 | $ sudo apt-get install -y nvidia-container-toolkit 61 | $ sudo nvidia-ctk runtime configure --runtime=docker 62 | $ sudo systemctl restart docker 63 | ``` 64 | 65 | ### Install RAPIDS 66 | 67 | ```{include} ../_includes/install-rapids-with-docker.md 68 | 69 | ``` 70 | 71 | ### Test RAPIDS 72 | 73 | ```{include} ../_includes/test-rapids-docker-vm.md 74 | 75 | ``` 76 | 77 | ### Clean up 78 | 79 | Once you are finished head back to the [Deployments](https://console.cloud.google.com/compute/instances) page and delete the instance you created. 
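If you prefer the command line to the console, the VM, firewall rule, and cleanup above can also be scripted with the `gcloud` CLI. This is only a minimal sketch under stated assumptions: the zone, instance name, machine type (`g2-standard-4`, a G2 shape that includes one L4 GPU), and the `<image-family>`/`<image-project>`/`<your-ip>` values are placeholders to adapt to the console choices described in this guide.

```bash
# Create the VM with an L4 GPU (bundled with g2-* machine types).
# <image-family>/<image-project> stand in for the "Deep Learning on Linux" image.
gcloud compute instances create rapids-l4 \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=100GB \
  --image-family=<image-family> \
  --image-project=<image-project> \
  --tags=rapids

# Firewall rule opening SSH, Dask, and Jupyter ports for VMs tagged "rapids";
# restrict the source range to your own IP rather than 0.0.0.0/0.
gcloud compute firewall-rules create rapids \
  --allow=tcp:22,tcp:8786,tcp:8787,tcp:8888 \
  --target-tags=rapids \
  --source-ranges=<your-ip>/32

# Clean up when you are finished.
gcloud compute instances delete rapids-l4 --zone=us-central1-a
```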
80 | 81 | ```{relatedexamples} 82 | 83 | ``` 84 | -------------------------------------------------------------------------------- /source/guides/mig.md: -------------------------------------------------------------------------------- 1 | # Multi-Instance GPU (MIG) 2 | 3 | [Multi-Instance GPU](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/) is a technology that allows partitioning a single GPU into multiple instances, making each one appear to be a completely independent GPU. Each instance then receives a certain slice of the GPU computational resources and a pre-defined block of memory that is detached from the other instances by on-chip protections. 4 | 5 | Due to the protection layer that makes MIG secure, certain limitations exist. One such limitation that is generally important for HPC applications is the lack of support for [CUDA Inter-Process Communication (IPC)](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#interprocess-communication), which enables transfers over NVLink and NVSwitch to greatly speed up communication between physical GPUs. When using MIG, [NVLink and NVSwitch](https://www.nvidia.com/en-us/data-center/nvlink/) are thus completely unavailable, forcing the application to take a more expensive communication channel via the system (CPU) memory. 6 | 7 | Given these limitations in communication capability, we advise users to first understand the tradeoffs that have to be made when attempting to set up a cluster of MIG instances. While the partitioning could be beneficial to certain applications that need only a certain amount of compute capability, communication bottlenecks may be a problem and thus need to be thought through carefully. 8 | 9 | ## Dask Cluster 10 | 11 | Dask clusters of MIG instances are supported via Dask-CUDA as long as all MIG instances are identical with respect to memory. Much like a cluster of physical GPUs, mixing GPUs with different memory sizes is generally not a good idea as Dask may not be able to balance work correctly and eventually could lead to more frequent out-of-memory errors. 12 | 13 | For example, partitioning two GPUs into 7 x 10GB instances each and setting up a cluster with all 14 instances should be fine. However, partitioning one of the GPUs into 7 x 10GB instances and the other into 3 x 20GB instances should be avoided. 14 | 15 | Unlike for a system composed of unpartitioned GPUs, Dask-CUDA cannot automatically infer the GPUs to be utilized for the cluster. In a MIG setup, the user is then required to specify the GPU instances to be used by the cluster. This is achieved by specifying the `CUDA_VISIBLE_DEVICES` environment variable for either {class}`dask_cuda.LocalCUDACluster` or `dask-cuda-worker`, or the argument of the same name for {class}`dask_cuda.LocalCUDACluster`. 16 | 17 | Physical GPUs can be addressed by their indices `[0..N)` (where `N` is the total number of GPUs installed) or by their names, composed of the `GPU-` prefix followed by the UUID. MIG instances have no indices and can only be addressed by their names, composed of the `MIG-` prefix followed by the UUID. The name of a MIG instance will then look similar to: `MIG-41b3359c-e721-56e5-8009-12e5797ed514`. 18 | 19 | ### Determine MIG Names 20 | 21 | The simplest way to determine the names of MIG instances is to run `nvidia-smi -L` on the command line.
22 | 23 | ```bash 24 | $ nvidia-smi -L 25 | GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb) 26 | MIG 2g.10gb Device 0: (UUID: MIG-41b3359c-e721-56e5-8009-12e5797ed514) 27 | MIG 2g.10gb Device 1: (UUID: MIG-65b79fff-6d3c-5490-a288-b31ec705f310) 28 | MIG 2g.10gb Device 2: (UUID: MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0) 29 | ``` 30 | 31 | In the example above, the system has one NVIDIA A100 with 3 x 10GB MIG instances. In the next sections we will see how to use the instance names to start up a Dask cluster composed of MIG GPUs. Please note that once a GPU is partitioned, the physical GPU (named `GPU-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb` above) is inaccessible for CUDA compute and cannot be used as part of a Dask cluster. 32 | 33 | Alternatively, MIG instance names can be obtained programmatically using [NVML](https://developer.nvidia.com/nvidia-management-library-nvml) or [PyNVML](https://pypi.org/project/nvidia-ml-py/). Please refer to the [NVML API](https://docs.nvidia.com/deploy/nvml-api/) to write appropriate utilities for that purpose. 34 | 35 | ### LocalCUDACluster 36 | 37 | Suppose you have 3 MIG instances on the local system: 38 | 39 | - `MIG-41b3359c-e721-56e5-8009-12e5797ed514` 40 | - `MIG-65b79fff-6d3c-5490-a288-b31ec705f310` 41 | - `MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0` 42 | 43 | To start a {class}`dask_cuda.LocalCUDACluster`, the user would run the following: 44 | 45 | ```python 46 | from dask_cuda import LocalCUDACluster 47 | 48 | cluster = LocalCUDACluster( 49 | CUDA_VISIBLE_DEVICES=[ 50 | "MIG-41b3359c-e721-56e5-8009-12e5797ed514", 51 | "MIG-65b79fff-6d3c-5490-a288-b31ec705f310", 52 | "MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0", 53 | ], 54 | # Other `LocalCUDACluster` arguments 55 | ) 56 | ``` 57 | 58 | ### dask-cuda-worker 59 | 60 | Suppose you have 3 MIG instances on the local system: 61 | 62 | - `MIG-41b3359c-e721-56e5-8009-12e5797ed514` 63 | - `MIG-65b79fff-6d3c-5490-a288-b31ec705f310` 64 | - `MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0` 65 | 66 | To start a `dask-cuda-worker` whose scheduler address is stored in the `scheduler.json` file, the user would run the following: 67 | 68 | ```bash 69 | CUDA_VISIBLE_DEVICES="MIG-41b3359c-e721-56e5-8009-12e5797ed514,MIG-65b79fff-6d3c-5490-a288-b31ec705f310,MIG-c6e2bae8-46d4-5a7e-9a68-c6cf1f680ba0" dask-cuda-worker scheduler.json # --other-arguments 70 | ``` 71 | 72 | Please note that in the example above we created 3 Dask-CUDA workers on one node. For a multi-node cluster, the correct MIG names need to be specified, and they will always be different for each host. 73 | 74 | ## XGBoost with Dask Cluster 75 | 76 | Currently [XGBoost](https://www.nvidia.com/en-us/glossary/data-science/xgboost/) only exposes support for GPU communication via NCCL, which does not support MIG. For this reason, a Dask cluster that utilizes XGBoost would have to use TCP for all communication instead, which will likely cause considerable performance degradation. Therefore, using XGBoost with MIG is not recommended. 77 | 78 | ```{relatedexamples} 79 | 80 | ``` 81 | -------------------------------------------------------------------------------- /source/hpc.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | --- 4 | 5 | # HPC 6 | 7 | RAPIDS works extremely well in traditional HPC (High Performance Computing) environments where GPUs are often co-located with accelerated networking hardware such as InfiniBand.
Deploying on HPC often means using queue management systems such as SLURM, LSF, PBS, etc. 8 | 9 | ## SLURM 10 | 11 | ```{warning} 12 | This is a legacy page and may contain outdated information. We are working hard to update our documentation with the latest and greatest information; thank you for bearing with us. 13 | ``` 14 | 15 | If you are unfamiliar with SLURM or need a refresher, we recommend the [quickstart guide](https://slurm.schedmd.com/quickstart.html). 16 | Depending on how your nodes are configured, additional settings may be required such as defining the number of GPUs `(--gpus)` desired or the number of GPUs per node `(--gpus-per-node)`. 17 | In the following example, we assume each allocation runs on a DGX1 with access to all eight GPUs. 18 | 19 | ### Start Scheduler 20 | 21 | First, start the scheduler with the following SLURM script. This and the following scripts can be deployed with `salloc` for interactive usage or `sbatch` for batched runs. 22 | 23 | ```bash 24 | #!/usr/bin/env bash 25 | 26 | #SBATCH -J dask-scheduler 27 | #SBATCH -n 1 28 | #SBATCH -t 00:10:00 29 | 30 | module load cuda/11.0.3 31 | CONDA_ROOT=/nfs-mount/user/miniconda3 32 | source $CONDA_ROOT/etc/profile.d/conda.sh 33 | conda activate rapids 34 | 35 | LOCAL_DIRECTORY=/nfs-mount/dask-local-directory 36 | mkdir $LOCAL_DIRECTORY 37 | CUDA_VISIBLE_DEVICES=0 dask-scheduler \ 38 | --protocol tcp \ 39 | --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" & 40 | 41 | dask-cuda-worker \ 42 | --rmm-pool-size 14GB \ 43 | --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" 44 | ``` 45 | 46 | Notice that we configure the scheduler to write a `scheduler-file` to an NFS-accessible location. This file contains metadata about the scheduler and will 47 | include the IP address and port for the scheduler. The file will serve as input to the workers, informing them of the address and port to connect to. 48 | 49 | The scheduler doesn't need the whole node to itself, so we can also start a worker on this node to fill out the unused resources. 50 | 51 | ### Start Dask CUDA Workers 52 | 53 | Next, start the other [dask-cuda workers](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/). Dask-CUDA extends the traditional Dask `Worker` class with specific options and enhancements for GPU environments. Unlike the scheduler and client, the worker script should be scalable and allow users to tune how many workers are created. 54 | For example, we can scale the number of nodes to 3: `sbatch/salloc -N3 dask-cuda-worker.script`. In this case, because we have 8 GPUs per node and we have 3 nodes, 55 | our job will have 24 workers. 56 | 57 | ```bash 58 | #!/usr/bin/env bash 59 | 60 | #SBATCH -J dask-cuda-workers 61 | #SBATCH -t 00:10:00 62 | 63 | module load cuda/11.0.3 64 | CONDA_ROOT=/nfs-mount/miniconda3 65 | source $CONDA_ROOT/etc/profile.d/conda.sh 66 | conda activate rapids 67 | 68 | LOCAL_DIRECTORY=/nfs-mount/dask-local-directory 69 | mkdir $LOCAL_DIRECTORY 70 | dask-cuda-worker \ 71 | --rmm-pool-size 14GB \ 72 | --scheduler-file "$LOCAL_DIRECTORY/dask-scheduler.json" 73 | ``` 74 | 75 | ### cuDF Example Workflow 76 | 77 | Lastly, we can now run a job on the established Dask Cluster.
78 | 79 | ```bash 80 | #!/usr/bin/env bash 81 | 82 | #SBATCH -J dask-client 83 | #SBATCH -n 1 84 | #SBATCH -t 00:10:00 85 | 86 | module load cuda/11.0.3 87 | CONDA_ROOT=/nfs-mount/miniconda3 88 | source $CONDA_ROOT/etc/profile.d/conda.sh 89 | conda activate rapids 90 | 91 | LOCAL_DIRECTORY=/nfs-mount/dask-local-directory 92 | 93 | cat <<EOF > /tmp/dask-cudf-example.py 94 | import cudf 95 | import dask.dataframe as dd 96 | from dask.distributed import Client 97 | 98 | client = Client(scheduler_file="$LOCAL_DIRECTORY/dask-scheduler.json") 99 | cdf = cudf.datasets.timeseries() 100 | 101 | ddf = dd.from_pandas(cdf, npartitions=10) 102 | res = ddf.groupby(['id', 'name']).agg(['mean', 'sum', 'count']).compute() 103 | print(res) 104 | EOF 105 | 106 | python /tmp/dask-cudf-example.py 107 | ``` 108 | 109 | ### Confirm Output 110 | 111 | Putting the above together will result in the following output: 112 | 113 | ```bash 114 | x y 115 | mean sum count mean sum count 116 | id name 117 | 1077 Laura 0.028305 1.868120 66 -0.098905 -6.527731 66 118 | 1026 Frank 0.001536 1.414839 921 -0.017223 -15.862306 921 119 | 1082 Patricia 0.072045 3.602228 50 0.081853 4.092667 50 120 | 1007 Wendy 0.009837 11.676199 1187 0.022978 27.275216 1187 121 | 976 Wendy -0.003663 -3.267674 892 0.008262 7.369577 892 122 | ... ... ... ... ... ... ... 123 | 912 Michael 0.012409 0.459119 37 0.002528 0.093520 37 124 | 1103 Ingrid -0.132714 -1.327142 10 0.108364 1.083638 10 125 | 998 Tim 0.000587 0.747745 1273 0.001777 2.262094 1273 126 | 941 Yvonne 0.050258 11.358393 226 0.080584 18.212019 226 127 | 900 Michael -0.134216 -1.073729 8 0.008701 0.069610 8 128 | 129 | [6449 rows x 6 columns] 130 | ``` 131 |
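For reference, a typical end-to-end submission of the three scripts above might look like the following. The script file names are placeholders (only `dask-cuda-worker.script` is named explicitly above); submit the scheduler first so that `dask-scheduler.json` exists on the NFS mount before the workers and the client start.

```bash
# Submit the scheduler job (which also starts one worker on its node)
sbatch dask-scheduler.script

# Submit the remaining Dask-CUDA workers across three nodes (8 GPUs each)
sbatch -N3 dask-cuda-worker.script

# Submit the client job that runs the cuDF example
sbatch dask-client.script

# Monitor the jobs
squeue -u $USER
```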

133 | -------------------------------------------------------------------------------- /source/images/azureml-access-datastore-uri.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/azureml-access-datastore-uri.png -------------------------------------------------------------------------------- /source/images/azureml-create-notebook-instance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/azureml-create-notebook-instance.png -------------------------------------------------------------------------------- /source/images/azureml-provision-setup-script.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/azureml-provision-setup-script.png -------------------------------------------------------------------------------- /source/images/azureml_returned_job_completed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/azureml_returned_job_completed.png -------------------------------------------------------------------------------- /source/images/databricks-choose-gpu-node.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-choose-gpu-node.png -------------------------------------------------------------------------------- /source/images/databricks-create-compute.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-create-compute.png -------------------------------------------------------------------------------- /source/images/databricks-custom-container.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-custom-container.png -------------------------------------------------------------------------------- /source/images/databricks-dask-cudf-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-dask-cudf-example.png -------------------------------------------------------------------------------- /source/images/databricks-dask-init-script.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-dask-init-script.png -------------------------------------------------------------------------------- /source/images/databricks-dask-logging.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-dask-logging.png 
-------------------------------------------------------------------------------- /source/images/databricks-mnmg-dask-client.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-mnmg-dask-client.png -------------------------------------------------------------------------------- /source/images/databricks-standard-runtime.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-standard-runtime.png -------------------------------------------------------------------------------- /source/images/databricks-worker-driver-node.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/databricks-worker-driver-node.png -------------------------------------------------------------------------------- /source/images/docref-admonition.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/docref-admonition.png -------------------------------------------------------------------------------- /source/images/googlecolab-output-nvidia-smi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/googlecolab-output-nvidia-smi.png -------------------------------------------------------------------------------- /source/images/googlecolab-select-gpu-hardware-accelerator.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/googlecolab-select-gpu-hardware-accelerator.png -------------------------------------------------------------------------------- /source/images/googlecolab-select-runtime-type.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/googlecolab-select-runtime-type.png -------------------------------------------------------------------------------- /source/images/kubeflow-configure-dashboard-option.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-configure-dashboard-option.png -------------------------------------------------------------------------------- /source/images/kubeflow-create-notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-create-notebook.png -------------------------------------------------------------------------------- /source/images/kubeflow-dask-dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-dask-dashboard.png 
-------------------------------------------------------------------------------- /source/images/kubeflow-jupyter-dask-cluster-widget.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-jupyter-dask-cluster-widget.png -------------------------------------------------------------------------------- /source/images/kubeflow-jupyter-dask-labextension.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-jupyter-dask-labextension.png -------------------------------------------------------------------------------- /source/images/kubeflow-jupyter-example-notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-jupyter-example-notebook.png -------------------------------------------------------------------------------- /source/images/kubeflow-jupyter-nvidia-smi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-jupyter-nvidia-smi.png -------------------------------------------------------------------------------- /source/images/kubeflow-jupyter-using-dask.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-jupyter-using-dask.png -------------------------------------------------------------------------------- /source/images/kubeflow-new-notebook.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-new-notebook.png -------------------------------------------------------------------------------- /source/images/kubeflow-notebook-running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubeflow-notebook-running.png -------------------------------------------------------------------------------- /source/images/kubernetes-jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/kubernetes-jupyter.png -------------------------------------------------------------------------------- /source/images/morpheus-pipeline-KafkaUI_9MB.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/morpheus-pipeline-KafkaUI_9MB.gif -------------------------------------------------------------------------------- /source/images/sagemaker-choose-rapids-kernel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/sagemaker-choose-rapids-kernel.png 
-------------------------------------------------------------------------------- /source/images/sagemaker-create-lifecycle-configuration.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/sagemaker-create-lifecycle-configuration.png -------------------------------------------------------------------------------- /source/images/sagemaker-create-notebook-instance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/sagemaker-create-notebook-instance.png -------------------------------------------------------------------------------- /source/images/snowflake_jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/snowflake_jupyter.png -------------------------------------------------------------------------------- /source/images/theme-notebook-tags.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/theme-notebook-tags.png -------------------------------------------------------------------------------- /source/images/theme-tag-style.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/theme-tag-style.png -------------------------------------------------------------------------------- /source/images/vertex-ai-launcher.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rapidsai/deployment/38aca2668a83ee6632f6536d0d57e92d82c67726/source/images/vertex-ai-launcher.png -------------------------------------------------------------------------------- /source/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Deploying RAPIDS 7 | 8 | Deployment documentation to get you up and running with RAPIDS anywhere. 9 | 10 | `````{gridtoctree} 1 2 2 3 11 | :gutter: 2 2 2 2 12 | 13 | ````{grid-item-card} 14 | :link: local 15 | :link-type: doc 16 | {fas}`desktop;sd-text-primary` Local Machine 17 | ^^^ 18 | Use RAPIDS on your local workstation or server. 19 | 20 | {bdg}`docker` 21 | {bdg}`conda` 22 | {bdg}`pip` 23 | {bdg}`WSL2` 24 | ```` 25 | 26 | ````{grid-item-card} 27 | :link: cloud/index 28 | :link-type: doc 29 | {fas}`cloud;sd-text-primary` Cloud 30 | ^^^ 31 | Use RAPIDS on the cloud. 32 | 33 | {bdg}`Amazon Web Services` 34 | {bdg}`Google Cloud Platform` 35 | {bdg}`Microsoft Azure` 36 | {bdg}`IBM Cloud` 37 | ```` 38 | 39 | ````{grid-item-card} 40 | :link: hpc 41 | :link-type: doc 42 | {fas}`server;sd-text-primary` HPC 43 | ^^^ 44 | Use RAPIDS on high performance computers and supercomputers. 45 | 46 | {bdg}`SLURM` 47 | ```` 48 | 49 | ````{grid-item-card} 50 | :link: platforms/index 51 | :link-type: doc 52 | {fas}`network-wired;sd-text-primary` Platforms 53 | ^^^ 54 | Use RAPIDS on compute platforms. 
55 | 56 | {bdg}`Kubernetes` 57 | {bdg}`Kubeflow` 58 | {bdg}`Coiled` 59 | {bdg}`Databricks` 60 | {bdg}`Google Colab` 61 | ```` 62 | 63 | ````{grid-item-card} 64 | :link: tools/index 65 | :link-type: doc 66 | {fas}`hammer;sd-text-primary` Tools 67 | ^^^ 68 | There are many tools to deploy RAPIDS. 69 | 70 | {bdg}`containers` 71 | {bdg}`dask-kubernetes` 72 | {bdg}`dask-operator` 73 | {bdg}`dask-helm-chart` 74 | {bdg}`dask-gateway` 75 | ```` 76 | 77 | ````{grid-item-card} 78 | :link: examples/index 79 | :link-type: doc 80 | {fas}`book;sd-text-primary` Workflow Examples 81 | ^^^ 82 | For inspiration see our example notebooks with opinionated deployments of RAPIDS to boost machine learning workflows. 83 | 84 | {bdg}`xgboost` 85 | {bdg}`optuna` 86 | {bdg}`mlflow` 87 | {bdg}`ray tune` 88 | ```` 89 | 90 | ````{grid-item-card} 91 | :link: guides/index 92 | :link-type: doc 93 | {fas}`book;sd-text-primary` Guides 94 | ^^^ 95 | Detailed guides on how to deploy and optimize RAPIDS. 96 | 97 | {bdg}`Microsoft Azure` 98 | {bdg}`Infiniband` 99 | {bdg}`MIG` 100 | ```` 101 | 102 | ````{grid-item-card} 103 | :link: nims 104 | :link-type: doc 105 | {fas}`zap;sd-text-primary` NVIDIA NIM Microservices 106 | ^^^ 107 | NVIDIA NIM Microservices using RAPIDS to accelerate your AI deployment. 108 | 109 | {bdg}`Natural Language Processing` 110 | {bdg}`Data Processing` 111 | 112 | ```` 113 | 114 | ````{grid-item-card} 115 | :link: developer/index 116 | :link-type: doc 117 | {fas}`wrench;sd-text-primary` Developer 118 | ^^^ 119 | Build on RAPIDS in your development environments. 120 | 121 | {bdg}`CI` 122 | ```` 123 | ````` 124 | -------------------------------------------------------------------------------- /source/local.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Local 7 | 8 | ## Conda 9 | 10 | Installation instructions for conda are hosted at the [RAPIDS Conda Installation Docs Page](https://docs.rapids.ai/install#conda). 11 | 12 | ## Docker 13 | 14 | Installation instructions for Docker are hosted at the [RAPIDS Docker Installation Docs Page](https://docs.rapids.ai/install#docker). 15 | 16 | ## pip 17 | 18 | RAPIDS packages can be installed with pip. See [RAPIDS pip Installation Docs Page](https://docs.rapids.ai/install#pip) for installation instructions and requirements. 19 | 20 | ## WSL2 21 | 22 | RAPIDS can be installed on Windows using Windows Subsystem for Linux version 2 (WSL2). See [RAPIDS WSL2 Installation Docs Page](https://docs.rapids.ai/install#wsl2) for installation instructions and requirements. 23 | -------------------------------------------------------------------------------- /source/nims.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # NVIDIA NIM Microservices 7 | 8 | ## Natural Language Processing 9 | 10 | `````{gridtoctree} 1 2 2 3 11 | :gutter: 2 2 2 2 12 | 13 | ````{grid-item-card} 14 | :link: https://docs.nvidia.com/nim/index.html#nemo-retriever 15 | :link-type: url 16 | NeMo Retriever 17 | ^^^ 18 | Get access to state-of-the-art models for building text Q&A retrieval pipelines with high accuracy. 
19 | 20 | {bdg}`Text Embedding` 21 | {bdg}`Text Reranking` 22 | ```` 23 | 24 | ````` 25 | -------------------------------------------------------------------------------- /source/platforms/colab.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "p0" 3 | --- 4 | 5 | # RAPIDS on Google Colab 6 | 7 | ## Overview 8 | 9 | RAPIDS cuDF is preinstalled on Google Colab and instantly accelerates Pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This guide is applicable for users who want to utilize the full suite of the RAPIDS libraries for their workflows. It is broken into two sections: 10 | 11 | 1. [RAPIDS Quick Install](colab-quick) - applicable for most users and quickly installs all the RAPIDS Stable packages. 12 | 2. [RAPIDS Custom Setup Instructions](colab-custom) - step by step set up instructions covering the **must haves** for when a user needs to adapt instance to their workflows. 13 | 14 | In both sections, we will be installing RAPIDS on colab using pip. The pip installation allows users to install cuDF, cuML, cuGraph, cuXfilter, and cuSpatial stable versions in a few minutes. 15 | 16 | RAPIDS install on Colab strives to be an "always working" solution, and sometimes will **pin** RAPIDS versions to ensure compatibility. 17 | 18 | (colab-quick)= 19 | 20 | ## Section 1: RAPIDS Quick Install 21 | 22 | ### Links 23 | 24 | Please follow the links below to our install templates: 25 | 26 | #### Pip 27 | 28 | 1. Open the pip template link by clicking this button --> 29 | 30 | Open In Colab 31 | . 32 | 1. Click **Runtime** > **Run All**. 33 | 1. Wait a few minutes for the installation to complete without errors. 34 | 1. Add your code in the cells below the template. 35 | 36 | (colab-custom)= 37 | 38 | ## Section 2: User Customizable RAPIDS Install Instructions 39 | 40 | ### 1. Launch notebook 41 | 42 | To get started in [Google Colab](https://colab.research.google.com/), click `File` at the top toolbar to Create new or Upload existing notebook 43 | 44 | ### 2. Set the Runtime 45 | 46 | Click the `Runtime` dropdown and select `Change Runtime Type` 47 | 48 | ![Screenshot of create runtime and runtime type](../images/googlecolab-select-runtime-type.png) 49 | 50 | Choose GPU for Hardware Accelerator 51 | 52 | ![Screenshot of gpu for hardware accelerator](../images/googlecolab-select-gpu-hardware-accelerator.png) 53 | 54 | ### 3. Check GPU type 55 | 56 | Check the output of `!nvidia-smi` to make sure you've been allocated a Rapids Compatible GPU ([see the RAPIDS install docs](https://docs.rapids.ai/install/#system-req)). 57 | 58 | ![Screenshot of nvidia-smi](../images/googlecolab-output-nvidia-smi.png) 59 | 60 | ### 4. Install RAPIDS on Colab 61 | 62 | You can install RAPIDS using pip. The script first checks GPU compatibility with RAPIDS, then installs the latest **stable** versions of some core RAPIDS libraries (e.g. cuDF, cuML, cuGraph, and xgboost) using `pip`. 63 | 64 | ```bash 65 | # Colab warns and provides remediation steps if the GPUs is not compatible with RAPIDS. 66 | 67 | !git clone https://github.com/rapidsai/rapidsai-csp-utils.git 68 | !python rapidsai-csp-utils/colab/pip-install.py 69 | ``` 70 | 71 | ### 5. Test RAPIDS 72 | 73 | Run the following in a Python cell. 74 | 75 | ```python 76 | import cudf 77 | 78 | gdf = cudf.DataFrame({"a":[1,2,3], "b":[4,5,6]}) 79 | gdf 80 | a b 81 | 0 1 4 82 | 1 2 5 83 | 2 3 6 84 | 85 | ``` 86 | 87 | ### 6. 
Next steps 88 | 89 | Try a more thorough example of using cuDF on Google Colab, "10 Minutes to RAPIDS cuDF's pandas accelerator mode (cudf.pandas)" ([Google Colab link](https://nvda.ws/rapids-cudf)). 90 | -------------------------------------------------------------------------------- /source/platforms/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | html_theme.sidebar_secondary.remove: true 4 | --- 5 | 6 | # Platforms 7 | 8 | `````{gridtoctree} 1 2 2 3 9 | :gutter: 2 2 2 2 10 | 11 | ````{grid-item-card} 12 | :link: nvidia-ai-workbench 13 | :link-type: doc 14 | NVIDIA AI Workbench 15 | ^^^ 16 | Run RAPIDS in NVIDIA AI Workbench, GPU workstation setup tool that enables developers to work, manage, and collaborate across heterogeneous platforms. 17 | 18 | {bdg}`single-node` 19 | ```` 20 | 21 | ````{grid-item-card} 22 | :link: kubernetes 23 | :link-type: doc 24 | Kubernetes 25 | ^^^ 26 | Launch RAPIDS containers and cluster on Kubernetes with various tools. 27 | 28 | {bdg}`single-node` 29 | {bdg}`multi-node` 30 | ```` 31 | 32 | ````{grid-item-card} 33 | :link: kubeflow 34 | :link-type: doc 35 | Kubeflow 36 | ^^^ 37 | Integrate RAPIDS with Kubeflow notebooks and pipelines. 38 | 39 | {bdg}`single-node` 40 | {bdg}`multi-node` 41 | ```` 42 | 43 | ````{grid-item-card} 44 | :link: kserve 45 | :link-type: doc 46 | KServe 47 | ^^^ 48 | Deploy RAPIDS models with KServe, a standard model inference platform 49 | for Kubernetes. 50 | 51 | {bdg}`multi-node` 52 | ```` 53 | 54 | ````{grid-item-card} 55 | :link: coiled 56 | :link-type: doc 57 | Coiled 58 | ^^^ 59 | Run RAPIDS on Coiled. 60 | 61 | {bdg}`multi-node` 62 | ```` 63 | 64 | ````{grid-item-card} 65 | :link: databricks 66 | :link-type: doc 67 | Databricks 68 | ^^^ 69 | Run RAPIDS on Databricks. 70 | 71 | {bdg}`single-node` 72 | ```` 73 | 74 | ````{grid-item-card} 75 | :link: colab 76 | :link-type: doc 77 | Google Colab 78 | ^^^ 79 | Run RAPIDS on Google Colab. 80 | 81 | {bdg}`single-node` 82 | ```` 83 | 84 | ````{grid-item-card} 85 | :link: snowflake 86 | :link-type: doc 87 | Snowflake 88 | ^^^ 89 | Run RAPIDS on Snowflake. 90 | 91 | {bdg}`single-node` 92 | ```` 93 | 94 | ````` 95 | -------------------------------------------------------------------------------- /source/platforms/nvidia-ai-workbench.md: -------------------------------------------------------------------------------- 1 | # NVIDIA AI Workbench 2 | 3 | [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/) is a developer toolkit for data science, machine learning, and AI projects. It lets you develop on your laptop/workstation and then easily transition workloads to scalable GPU resources in a data center or the cloud. AI Workbench is free, you can install it in minutes on both local or remote computers, and offers a desktop application as well as a command-line interface (CLI). 4 | 5 | ## Installation 6 | 7 | You can install AI Workbench locally, or on a remote computer that you have SSH access to. 8 | 9 | Follow the [AI Workbench installation](https://docs.nvidia.com/ai-workbench/user-guide/latest/installation/overview.html) documentation for instructions on installing on different operating systems. 10 | 11 | ```{admonition} Local GPU System 12 | :class: note 13 | 14 | If you are working on a system that has an NVIDIA GPU you can use the AI Workbench 15 | by installing the desktop application and configure Docker if you don't have it installed already. 
Then you can run notebooks in Python environments with access to your GPU. 16 | 17 | ``` 18 | 19 | ```{admonition} Remote GPU System 20 | :class: note 21 | 22 | If you don't have an NVIDIA GPU in your system, but have remote SSH access to a system that does, then you can use AI Workbench to connect to that system. Your code will be executed on the remote system, but files will be synced between your local and remote environments automatically. This allows you to burst from a system without NVIDIA GPUs like a lightweight laptop to a powerful remote AI system. 23 | 24 | To use AI Workbench in this way you need to install the desktop application on your local system and the CLI application on the remote system. 25 | ``` 26 | 27 | ## Configure your system 28 | 29 | Once you have installed AI Workbench you can launch the desktop application. On first run it will talk you through installing some dependencies if they aren't available already. 30 | 31 | Then you will be able to choose between using your local environment or working on a remote system (you can switch between them later very easily). 32 | 33 | If you wish to configure a remote system click the "Add Remote System" button and enter the configuration information for that system. 34 | 35 | ![Screenshot of adding a new remote location with a form where you can enter SSH information](../_static/images/platforms/nvidia-ai-workbench/add-remote-system-dialog.png) 36 | 37 | Once configured select the system you wish to use. You will then be greeted with a screen where you can create a new project or clone an existing one. 38 | 39 | ![Screenshot of ](../_static/images/platforms/nvidia-ai-workbench/new-project.png) 40 | 41 | Select "Start a new project" and give it a name and description. You can also change the default location to store the project files. 42 | 43 | ![Screenshot of the "Start a new project" button](../_static/images/platforms/nvidia-ai-workbench/create-project.png) 44 | 45 | Then scroll down and select "RAPIDS with CUDA" from the list of templates. 46 | 47 | ![Screenshot of the template selector with "RAPIDS with CUDA" highlighted](../_static/images/platforms/nvidia-ai-workbench/rapids-with-cuda.png) 48 | 49 | The new project will then be created. AI Workbench will automatically build a container for this project, this may take a few minutes. 50 | 51 | ![Screenshot of the AI workbench UI. In the bottom corner the build status says "Building" and the "Open Jupyterlab" button is greyed out](../_static/images/platforms/nvidia-ai-workbench/project-building.png) 52 | 53 | Once the project has built you can select "Open Jupyterlab" to launch Jupyter in your RAPIDS environment. 54 | 55 | ![Screenshot of the AI workbench UI. In the bottom corner the build status says "Build Ready" and the "Open Jupyterlab" button is highlighted](../_static/images/platforms/nvidia-ai-workbench/open-jupyter.png) 56 | 57 | Then you can start working with the RAPIDS libraries in your notebooks. 58 | 59 | ![Screenshot of Jupyterlab running some cudf code to demonstrate that the RAPIDS libraries are available and working](../_static/images/platforms/nvidia-ai-workbench/cudf-example.png) 60 | 61 | ## Further reading 62 | 63 | For more information and to learn more about what you can do with NVIDIA AI Workbench [see the documentation](https://docs.nvidia.com/ai-workbench/user-guide/latest/overview/introduction.html). 
64 | -------------------------------------------------------------------------------- /source/tools/dask-cuda.md: -------------------------------------------------------------------------------- 1 | # dask-cuda 2 | 3 | [Dask-CUDA](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/) is a library extending `LocalCluster` from `dask.distributed` to enable multi-GPU workloads. 4 | 5 | ## LocalCUDACluster 6 | 7 | You can use `LocalCUDACluster` to create a cluster of one or more GPUs on your local machine. You can launch a Dask scheduler on LocalCUDACluster to parallelize and distribute your RAPIDS workflows across multiple GPUs on a single node. 8 | 9 | In addition to enabling multi-GPU computation, `LocalCUDACluster` also provides a simple interface for managing the cluster, such as starting and stopping the cluster, querying the status of the nodes, and monitoring the workload distribution. 10 | 11 | ## Pre-requisites 12 | 13 | Before running these instructions, ensure you have installed the [`dask`](https://docs.dask.org/en/stable/install.html) and [`dask-cuda`](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/install.html) packages in your local environment. 14 | 15 | ## Cluster setup 16 | 17 | ### Instantiate a LocalCUDACluster object 18 | 19 | The `LocalCUDACluster` class autodetects the GPUs in your system, so if you create it on a machine with two GPUs it will create a cluster with two workers, each of which is responsible for executing tasks on a separate GPU. 20 | 21 | ```python 22 | from dask_cuda import LocalCUDACluster 23 | from dask.distributed import Client 24 | 25 | cluster = LocalCUDACluster() 26 | ``` 27 | 28 | You can also restrict your cluster to use specific GPUs by setting the `CUDA_VISIBLE_DEVICES` environment variable, or as a keyword argument. 29 | 30 | ```python 31 | cluster = LocalCUDACluster( 32 | CUDA_VISIBLE_DEVICES="0,1" 33 | ) # Creates one worker for GPUs 0 and 1 34 | ``` 35 | 36 | ### Connecting a Dask client 37 | 38 | The Dask scheduler coordinates the execution of tasks, whereas the Dask client is the user-facing interface that submits tasks to the scheduler and monitors their progress. 39 | 40 | ```python 41 | client = Client(cluster) 42 | ``` 43 | 44 | ## Test RAPIDS 45 | 46 | To test RAPIDS, create a `distributed` client for the cluster and query for the GPU model. 47 | 48 | ```python 49 | def get_gpu_model(): 50 | import pynvml 51 | 52 | pynvml.nvmlInit() 53 | return pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0)) 54 | 55 | 56 | result = client.submit(get_gpu_model).result() 57 | 58 | print(result) 59 | # b'Tesla V100-SXM2-16GB 60 | ``` 61 | -------------------------------------------------------------------------------- /source/tools/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | review_priority: "index" 3 | --- 4 | 5 | # Tools 6 | 7 | ## Packages 8 | 9 | `````{gridtoctree} 1 2 2 3 10 | :gutter: 2 2 2 2 11 | 12 | ````{grid-item-card} 13 | :link: rapids-docker 14 | :link-type: doc 15 | Container Images 16 | ^^^ 17 | Container images containing the RAPIDS software environment. 18 | ```` 19 | 20 | ````{grid-item-card} 21 | :link: dask-cuda 22 | :link-type: doc 23 | Dask CUDA 24 | ^^^ 25 | Dask-CUDA is a library extending Dask.distributed’s single-machine LocalCluster and Worker for use in distributed GPU workloads. 
26 | ```` 27 | 28 | ````` 29 | 30 | ## Kubernetes 31 | 32 | `````{gridtoctree} 1 2 2 3 33 | :gutter: 2 2 2 2 34 | 35 | ````{grid-item-card} 36 | :link: kubernetes/dask-operator 37 | :link-type: doc 38 | Dask Kubernetes Operator 39 | ^^^ 40 | Launch RAPIDS containers and clusters as native Kubernetes resources with the Dask Operator. 41 | ```` 42 | 43 | ````{grid-item-card} 44 | :link: kubernetes/dask-helm-chart 45 | :link-type: doc 46 | Dask Helm Chart 47 | ^^^ 48 | Install a single user notebook and cluster on Kubernetes with the Dask Helm Chart. 49 | ```` 50 | 51 | ````` 52 | -------------------------------------------------------------------------------- /source/tools/kubernetes/dask-helm-chart.md: -------------------------------------------------------------------------------- 1 | # Dask Helm Chart 2 | 3 | Dask has a [Helm Chart](https://github.com/dask/helm-chart) that creates the following resources: 4 | 5 | - 1 x Jupyter server (preconfigured to access the Dask cluster) 6 | - 1 x Dask scheduler 7 | - 3 x Dask workers that connect to the scheduler (scalable) 8 | 9 | This helm chart can be configured to run RAPIDS by providing GPUs to the Jupyter server and Dask workers and by using container images with the RAPIDS libraries available. 10 | 11 | ## Configuring RAPIDS 12 | 13 | Built on top of the Dask Helm Chart, the `rapids-config.yaml` file contains the additional configuration required to set up the RAPIDS environment. 14 | 15 | ```yaml 16 | # rapids-config.yaml 17 | scheduler: 18 | image: 19 | repository: "{{ rapids_container.split(":")[0] }}" 20 | tag: "{{ rapids_container.split(":")[1] }}" 21 | 22 | worker: 23 | image: 24 | repository: "{{ rapids_container.split(":")[0] }}" 25 | tag: "{{ rapids_container.split(":")[1] }}" 26 | dask_worker: "dask_cuda_worker" 27 | replicas: 3 28 | resources: 29 | limits: 30 | nvidia.com/gpu: 1 31 | 32 | jupyter: 33 | image: 34 | repository: "{{ rapids_container.split(":")[0].replace('base', 'notebooks') }}" 35 | tag: "{{ rapids_container.split(":")[1] }}" 36 | servicePort: 8888 37 | # Default password hash for "rapids" 38 | password: "argon2:$argon2id$v=19$m=10240,t=10,p=8$TBbhubLuX7efZGRKQqIWtw$RG+jCBB2KYF2VQzxkhMNvHNyJU9MzNGTm2Eu2/f7Qpc" 39 | resources: 40 | limits: 41 | nvidia.com/gpu: 1 42 | 43 | ``` 44 | 45 | `[jupyter|scheduler|worker].image.*` is updated with the RAPIDS "runtime" image from the stable release, 46 | which includes the environment necessary to run the accelerated libraries in RAPIDS and to scale up and down via Dask. 47 | Note that all scheduler, worker and jupyter Pods are required to use the same image. 48 | This ensures that the Dask scheduler and worker versions match. 49 | 50 | `[jupyter|worker].resources` explicitly requests a GPU for each worker Pod and the Jupyter Pod, required by many accelerated libraries in RAPIDS. 51 | 52 | `worker.dask_worker` is the launch command for the Dask worker inside the worker Pod. 53 | To leverage the GPUs assigned to each Pod, the [`dask_cuda_worker`](https://docs.rapids.ai/api/dask-cuda/~~~rapids_api_docs_version~~~/index.html) command is launched in place of the regular `dask_worker`. 54 | 55 | If you would like a Jupyter notebook password other than the default, compute the hash for your chosen password and update `jupyter.password`. 56 | You can compute the password hash by following the [jupyter notebook guide](https://jupyter-notebook.readthedocs.io/en/stable/public_server.html?highlight=passwd#preparing-a-hashed-password).
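For example, assuming Jupyter is installed in your local Python environment, a one-liner like the one below prints a hash that can be pasted into `jupyter.password`. The `jupyter_server.auth.security` import is an assumption that matches recent Jupyter versions (and the argon2-style hash shown in `rapids-config.yaml`); older installations expose the same `passwd` helper from `notebook.auth`. Replace `<your-password>` with the password you want to use; the output below is truncated for illustration.

```console
$ python -c "from jupyter_server.auth.security import passwd; print(passwd('<your-password>'))"
argon2:$argon2id$v=19$...
```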
57 | 58 | ### Installing the Helm Chart 59 | 60 | ```console 61 | $ helm install rapids-release --repo https://helm.dask.org dask -f rapids-config.yaml 62 | ``` 63 | 64 | This will deploy the cluster with the same topography as dask helm chart, 65 | see [dask helm chart documentation for detail](https://artifacthub.io/packages/helm/dask/dask). 66 | 67 | ```{note} 68 | By default, the Dask Helm Chart will not create an `Ingress` resource. 69 | A custom `Ingress` may be configured to consume external traffic and redirect to corresponding services. 70 | ``` 71 | 72 | For simplicity, this guide will setup access to the Jupyter server via port forwarding. 73 | 74 | ## Running Rapids Notebook 75 | 76 | First, setup port forwarding from the cluster to external port: 77 | 78 | ```console 79 | # For the Jupyter server 80 | $ kubectl port-forward --address 127.0.0.1 service/rapids-release-dask-jupyter 8888:8888 81 | 82 | # For the Dask dashboard 83 | $ kubectl port-forward --address 127.0.0.1 service/rapids-release-dask-scheduler 8787:8787 84 | ``` 85 | 86 | Open a browser and visit `localhost:8888` to access Jupyter, 87 | and `localhost:8787` for the dask dashboard. 88 | Enter the password (default is `rapids`) and access the notebook environment. 89 | 90 | ### Notebooks and Cluster Scaling 91 | 92 | Now we can verify that everything is working correctly by running some of the example notebooks. 93 | 94 | Open the `10 Minutes to cuDF and Dask-cuDF` notebook under `cudf/10-min.ipynb`. 95 | 96 | Add a new cell at the top to connect to the Dask cluster. Conveniently, the helm chart preconfigures the scheduler address in client's environment. 97 | So you do not need to pass any config to the `Client` object. 98 | 99 | ```python 100 | from dask.distributed import Client 101 | 102 | client = Client() 103 | client 104 | ``` 105 | 106 | By default, we can see 3 workers are created and each has 1 GPU assigned. 107 | 108 | ![dask worker](../../_static/daskworker.PNG) 109 | 110 | Walk through the examples to validate that the dask cluster is setup correctly, and that GPUs are accessible for the workers. 111 | Worker metrics can be examined in dask dashboard. 112 | 113 | ![dask worker](../../_static/workingdask.PNG) 114 | 115 | In case you want to scale up the cluster with more GPU workers, you may do so via `kubectl` or via `helm upgrade`. 116 | 117 | ```bash 118 | $ kubectl scale deployment rapids-release-dask-worker --replicas=8 119 | 120 | # or 121 | 122 | $ helm upgrade --set worker.replicas=8 rapids-release dask/dask 123 | ``` 124 | 125 | ![dask worker](../../_static/eightworkers.PNG) 126 | 127 | ```{relatedexamples} 128 | 129 | ``` 130 | -------------------------------------------------------------------------------- /source/tools/rapids-docker.md: -------------------------------------------------------------------------------- 1 | # Container Images 2 | 3 | Installation instructions for Docker are hosted at the [RAPIDS Container Installation Docs Page](https://docs.rapids.ai/install#docker). 4 | 5 | ```{relatedexamples} 6 | 7 | ``` 8 | --------------------------------------------------------------------------------