├── bin
│   ├── .gitkeep
│   ├── create-conda-env.sh
│   ├── create-conda-env.sbatch
│   ├── launch-jupyter-server.sbatch
│   ├── launch-nvdashboard-server.srun
│   ├── launch-jupyter-server.srun
│   └── README.md
├── data
│   └── .gitkeep
├── doc
│   └── .gitkeep
├── src
│   └── .gitkeep
├── docker
│   ├── .gitkeep
│   ├── entrypoint.sh
│   ├── img
│   │   └── creating-dockerhub-repo-screenshot.png
│   ├── hooks
│   │   └── build
│   ├── docker-compose.yml
│   ├── Dockerfile
│   └── README.md
├── notebooks
│   └── .gitkeep
├── results
│   └── .gitkeep
├── requirements.txt
├── environment.yml
├── LICENSE
├── .gitignore
└── README.md

/bin/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/doc/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/src/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/docker/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/notebooks/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/results/.gitkeep:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/docker/entrypoint.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash --login
2 | set -e
3 | 
4 | conda activate $HOME/app/env
5 | exec "$@"
6 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas-bokeh
2 | jax[cuda11_cudnn82]
3 | 
4 | --find-links https://storage.googleapis.com/jax-releases/jax_releases.html
5 | 
6 | 
--------------------------------------------------------------------------------
/docker/img/creating-dockerhub-repo-screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidrpugh/jax-gpu-data-science-project/master/docker/img/creating-dockerhub-repo-screenshot.png
--------------------------------------------------------------------------------
/bin/create-conda-env.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash --login
2 | 
3 | set -euo pipefail
4 | 
5 | # create the conda environment
6 | export ENV_PREFIX=$PWD/env
7 | mamba env create --prefix $ENV_PREFIX --file environment.yml --force
8 | 
--------------------------------------------------------------------------------
/docker/hooks/build:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -ex
3 | 
4 | docker image build \
5 |     --build-arg username=al-khawarizmi \
6 |     --build-arg uid=1000 \
7 |     --build-arg gid=100 \
8 |     --tag "$DOCKER_REPO:latest" \
9 |     --file Dockerfile \
10 |     ../
11 | 
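If you want to try the build hook locally before wiring up DockerHub, a minimal sketch is to run it from the `docker` sub-directory; note that DockerHub normally injects `$DOCKER_REPO` itself, so the repository name below is only a placeholder.

```bash
# run from the docker/ sub-directory so the Dockerfile path and ../ build context resolve
cd docker
DOCKER_REPO=your-dockerhub-username/your-project ./hooks/build
```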
-------------------------------------------------------------------------------- /bin/create-conda-env.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --time=2:00:00 3 | #SBATCH --gpus-per-node=v100:1 4 | #SBATCH --cpus-per-gpu=6 5 | #SBATCH --mem=32G 6 | #SBATCH --partition=debug 7 | #SBATCH --job-name=create-conda-env 8 | #SBATCH --mail-type=ALL 9 | #SBATCH --output=bin/%x-%j-slurm.out 10 | #SBATCH --error=bin/%x-%j-slurm.err 11 | 12 | 13 | # create the conda environment 14 | ./bin/create-conda-env.sh 15 | -------------------------------------------------------------------------------- /bin/launch-jupyter-server.sbatch: -------------------------------------------------------------------------------- 1 | #!/bin/bash --login 2 | #SBATCH --time=2:00:00 3 | #SBATCH --nodes=1 4 | #SBATCH --gpus-per-node=v100:1 5 | #SBATCH --cpus-per-gpu=6 6 | #SBATCH --mem=64G 7 | #SBATCH --constraint=intel 8 | #SBATCH --partition=debug 9 | #SBATCH --job-name=launch-jupyter-server 10 | #SBATCH --mail-type=ALL 11 | #SBATCH --output=bin/%x-%j-slurm.out 12 | #SBATCH --error=bin/%x-%j-slurm.err 13 | 14 | # use srun to launch Jupyter server in order to reserve a port 15 | srun --resv-ports=1 ./bin/launch-jupyter-server.srun 16 | -------------------------------------------------------------------------------- /docker/docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: "2.3" 2 | 3 | services: 4 | jupyterlab-server: 5 | build: 6 | args: 7 | - username=${USER} 8 | - uid=${UID} 9 | - gid=${GID} 10 | context: ../ 11 | dockerfile: docker/Dockerfile 12 | ports: 13 | - "8888:8888" 14 | runtime: nvidia 15 | volumes: 16 | - ../bin:/home/${USER}/app/bin 17 | - ../data:/home/${USER}/app/data 18 | - ../doc:/home/${USER}/app/doc 19 | - ../notebooks:/home/${USER}/app/notebooks 20 | - ../results:/home/${USER}/app/results 21 | - ../src:/home/${USER}/app/src 22 | init: true 23 | stdin_open: true 24 | tty: true 25 | -------------------------------------------------------------------------------- /bin/launch-nvdashboard-server.srun: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | NVDASHBOARD_PORT=$SLURM_STEP_RESV_PORTS 4 | IBEX_NODE=$(hostname -s) 5 | 6 | echo " 7 | To connect to the compute node ${IBEX_NODE} on Ibex running your NVDashboard server, 8 | you need to create an ssh tunnel from your local machine to login node on Ibex 9 | using the following command. 10 | 11 | ssh -L ${NVDASHBOARD_PORT}:${IBEX_NODE}:${NVDASHBOARD_PORT} ${USER}@glogin.ibex.kaust.edu.sa 12 | 13 | Next, you need to copy the url provided below and paste it into the browser on your 14 | local machine. 
15 | 16 | http://127.0.0.1:${NVDASHBOARD_PORT} 17 | " >&2 18 | 19 | # Start the nvdashboard server 20 | python -m jupyterlab_nvdashboard.server $NVDASHBOARD_PORT 21 | -------------------------------------------------------------------------------- /bin/launch-jupyter-server.srun: -------------------------------------------------------------------------------- 1 | #!/bin/bash --login 2 | 3 | # setup the environment 4 | module purge 5 | conda activate ./env 6 | 7 | # setup ssh tunneling 8 | export XDG_RUNTIME_DIR=/tmp IBEX_NODE=$(hostname -s) 9 | KAUST_USER=$(whoami) 10 | JUPYTER_PORT=$SLURM_STEP_RESV_PORTS 11 | 12 | echo " 13 | To connect to the compute node ${IBEX_NODE} on Ibex running your Jupyter server, 14 | you need to create an ssh tunnel from your local machine to login node on Ibex 15 | using the following command. 16 | 17 | ssh -L ${JUPYTER_PORT}:${IBEX_NODE}:${JUPYTER_PORT} ${KAUST_USER}@glogin.ibex.kaust.edu.sa 18 | 19 | Next, you need to copy the second url provided below and paste it into the browser 20 | on your local machine. 21 | " >&2 22 | 23 | # launch jupyter server 24 | jupyter ${1:-lab} --no-browser --port=${JUPYTER_PORT} --ip=${IBEX_NODE} 25 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: null 2 | 3 | channels: 4 | - conda-forge 5 | - defaults 6 | 7 | dependencies: 8 | - bokeh 9 | - cudnn=8.2 10 | - cudatoolkit=11.2 11 | - cudatoolkit-dev=11.2 12 | - cupy 13 | - dask 14 | - dask-ml 15 | - dask-labextension 16 | - datashader 17 | - featuretools 18 | - gfortran_linux-64 19 | - gh 20 | - git 21 | - gxx_linux-64 22 | - h5py 23 | - holoviews 24 | - hvplot 25 | - imbalanced-learn 26 | - jupyterlab 27 | - jupyterlab-git 28 | - jupyterlab-lsp 29 | - jupyterlab-nvdashboard 30 | - lightgbm 31 | - matplotlib 32 | - nccl 33 | - numba 34 | - numpy 35 | - optuna 36 | - pandas 37 | - panel 38 | - pip 39 | - pip: 40 | - -r requirements.txt 41 | - pyarrow 42 | - python=3.8 43 | - python-language-server 44 | - pyviz_comms 45 | - scikit-learn 46 | - scipy 47 | - shap 48 | - tensorboard 49 | - umap-learn 50 | - wandb 51 | - xarray 52 | - xeus-python 53 | - xgboost 54 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) [year], [fullname] 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | 53 | # Translations 54 | *.mo 55 | *.pot 56 | 57 | # Django stuff: 58 | *.log 59 | local_settings.py 60 | db.sqlite3 61 | db.sqlite3-journal 62 | 63 | # Flask stuff: 64 | instance/ 65 | .webassets-cache 66 | 67 | # Scrapy stuff: 68 | .scrapy 69 | 70 | # Sphinx documentation 71 | docs/_build/ 72 | 73 | # PyBuilder 74 | target/ 75 | 76 | # Jupyter Notebook 77 | .ipynb_checkpoints 78 | 79 | # IPython 80 | profile_default/ 81 | ipython_config.py 82 | 83 | # pyenv 84 | .python-version 85 | 86 | # pipenv 87 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 88 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 89 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 90 | # install all needed dependencies. 
91 | #Pipfile.lock 92 | 93 | # celery beat schedule file 94 | celerybeat-schedule 95 | 96 | # SageMath parsed files 97 | *.sage.py 98 | 99 | # Environments 100 | .env 101 | .venv 102 | env/ 103 | venv/ 104 | ENV/ 105 | env.bak/ 106 | venv.bak/ 107 | 108 | # Spyder project settings 109 | .spyderproject 110 | .spyproject 111 | 112 | # Rope project settings 113 | .ropeproject 114 | 115 | # mkdocs documentation 116 | /site 117 | 118 | # mypy 119 | .mypy_cache/ 120 | .dmypy.json 121 | dmypy.json 122 | 123 | # Pyre type checker 124 | .pyre/ 125 | 126 | -------------------------------------------------------------------------------- /docker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ubuntu:16.04 2 | 3 | LABEL maintainer="pughdr " 4 | 5 | SHELL [ "/bin/bash", "-c" ] 6 | 7 | RUN apt-get update --fix-missing && \ 8 | apt-get install -y wget bzip2 curl git gcc && \ 9 | apt-get clean && \ 10 | rm -rf /var/lib/apt/lists/* 11 | 12 | # Create a non-root user 13 | ARG username=al-khawarizmi 14 | ARG uid=1000 15 | ARG gid=100 16 | ENV USER ${username} 17 | ENV UID ${uid} 18 | ENV GID ${gid} 19 | ENV HOME /home/$USER 20 | 21 | RUN adduser --disabled-password \ 22 | --gecos "Non-root user" \ 23 | --uid $UID \ 24 | --gid $GID \ 25 | --home $HOME \ 26 | $USER 27 | 28 | # Dockerhub not yet supporting COPY --chown $UID:$GID syntax 29 | COPY environment.yml /tmp/environment.yml 30 | RUN chown $UID:$GID /tmp/environment.yml 31 | 32 | COPY postBuild /usr/local/postBuild.sh 33 | RUN chown $UID:$GID /usr/local/postBuild.sh && \ 34 | chmod +x /usr/local/postBuild.sh 35 | 36 | COPY docker/entrypoint.sh /usr/local/bin/docker-entrypoint.sh 37 | RUN chown $UID:$GID /usr/local/bin/docker-entrypoint.sh && \ 38 | chmod +x /usr/local/bin/docker-entrypoint.sh 39 | 40 | # install Miniconda as non-root user 41 | USER $USER 42 | 43 | ENV MINICONDA_VERSION 4.8.2 44 | ENV CONDA_DIR $HOME/miniconda3 45 | RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-$MINICONDA_VERSION-Linux-x86_64.sh -O ~/miniconda.sh && \ 46 | chmod +x ~/miniconda.sh && \ 47 | ~/miniconda.sh -b -p $CONDA_DIR && \ 48 | rm ~/miniconda.sh 49 | 50 | # make non-activate conda commands available 51 | ENV PATH=$CONDA_DIR/bin:$PATH 52 | 53 | # make conda activate command available from /bin/bash --login shells 54 | RUN echo ". 
$CONDA_DIR/etc/profile.d/conda.sh" >> ~/.profile
55 | 
56 | # make conda activate command available from /bin/bash --interactive shells
57 | RUN conda init bash
58 | 
59 | # create a project directory inside user home
60 | ENV PROJECT_DIR $HOME/app
61 | RUN mkdir $PROJECT_DIR
62 | WORKDIR $PROJECT_DIR
63 | 
64 | # build the conda environment
65 | ENV ENV_PREFIX $PROJECT_DIR/env
66 | RUN conda update --name base --channel defaults conda && \
67 |     conda env create --prefix $ENV_PREFIX --file /tmp/environment.yml --force && \
68 |     conda clean --all --yes
69 | 
70 | # run the postBuild script to install the JupyterLab extensions
71 | RUN conda activate $ENV_PREFIX && \
72 |     /usr/local/postBuild.sh && \
73 |     conda deactivate
74 | 
75 | # use an entrypoint script to ensure the conda environment is properly activated at runtime
76 | ENTRYPOINT [ "/usr/local/bin/docker-entrypoint.sh" ]
77 | 
78 | # default command will be to launch JupyterLab server for development
79 | CMD [ "jupyter", "lab", "--no-browser", "--ip", "0.0.0.0" ]
80 | 
--------------------------------------------------------------------------------
/docker/README.md:
--------------------------------------------------------------------------------
1 | ## Building a new image for your project
2 | 
3 | If you wish to add (or remove) dependencies in your project's `environment.yml` (or if you wish to have a custom user defined inside the image), then you will need to build a new Docker image for your project. The following command builds a new image for your project with a custom `$USER` (with associated `$UID` and `$GID`) as well as a particular `$IMAGE_NAME` and `$IMAGE_TAG`. This command should be run within the `docker` sub-directory of the project.
4 | 
5 | ```bash
6 | $ docker image build \
7 |     --build-arg username=$USER \
8 |     --build-arg uid=$UID \
9 |     --build-arg gid=$GID \
10 |     --file Dockerfile \
11 |     --tag $IMAGE_NAME:$IMAGE_TAG \
12 |     ../
13 | ```
14 | 
15 | ### Automating the build process with DockerHub
16 | 
17 | 1. Create a new (or log in to your existing) [DockerHub](https://hub.docker.com) account.
18 | 2. [Link your GitHub account with your DockerHub account](https://docs.docker.com/docker-hub/builds/link-source/) (if you have not already done so).
19 | 3. Create a new DockerHub repository.
20 |     1. Under "Build Settings" click the GitHub logo and then select your project's GitHub repository.
21 |     2. Select "Click here to customize build settings" and specify the location of the Dockerfile for your build as `docker/Dockerfile`.
22 |     3. Give the DockerHub repository the same name as your project's GitHub repository.
23 |     4. Give the DockerHub repository a brief description (something like "Automated builds for $PROJECT" or similar).
24 |     5. Click the "Create and Build" button.
25 | 4. Edit the `hooks/build` script with your project's `$USER`, `$UID`, and `$GID` build args in place of the corresponding default values.
26 | 
27 | Below is a screenshot which should give you an idea of how the form ought to be filled out prior to clicking "Create and Build".
28 | 
29 | ![Creating a new DockerHub repository for your project](./img/creating-dockerhub-repo-screenshot.png)
30 | 
31 | DockerHub is now configured to re-build your project's image whenever commits are pushed to your project's GitHub repository! Specifically, whenever you push new commits to your project's GitHub repository, GitHub will notify DockerHub, and DockerHub will then run the `./hooks/build` script to re-build your project's image. For more details on the whole process, see the [official documentation](https://docs.docker.com/docker-hub/builds/advanced/#build-hook-examples) on advanced DockerHub build options.
32 | 
33 | ### Running a container
34 | 
35 | Once you have built the image, the following command will run a container based on the image `$IMAGE_NAME:$IMAGE_TAG`. This command should be run from within the project's root directory.
36 | 
37 | ```bash
38 | $ docker container run \
39 |     --rm \
40 |     --tty \
41 |     --volume $(pwd)/bin:/home/$USER/app/bin \
42 |     --volume $(pwd)/data:/home/$USER/app/data \
43 |     --volume $(pwd)/doc:/home/$USER/app/doc \
44 |     --volume $(pwd)/notebooks:/home/$USER/app/notebooks \
45 |     --volume $(pwd)/results:/home/$USER/app/results \
46 |     --volume $(pwd)/src:/home/$USER/app/src \
47 |     --runtime nvidia \
48 |     --publish 8888:8888 \
49 |     $IMAGE_NAME:$IMAGE_TAG
50 | ```
51 | 
52 | ### Using Docker Compose
53 | 
54 | It is quite easy to make a typo whilst writing the above docker commands by hand. A less error-prone approach is to use [Docker Compose](https://docs.docker.com/compose/). The above docker commands have been encapsulated into the `docker-compose.yml` configuration file. You will need to store your project-specific values for `$USER`, `$UID`, and `$GID` in a file called `.env` as follows.
55 | 
56 | ```
57 | USER=$USER
58 | UID=$UID
59 | GID=$GID
60 | ```
61 | 
62 | For more details on how variable substitution works with Docker Compose, see the [official documentation](https://docs.docker.com/compose/environment-variables/#the-env-file).
63 | 
64 | Note that you can test your `docker-compose.yml` file by running the following command in the `docker` sub-directory of the project.
65 | 
66 | ```bash
67 | $ docker-compose config
68 | ```
69 | 
70 | This command takes the `docker-compose.yml` file and substitutes the values provided in the `.env` file and then returns the result.
71 | 
72 | Once you are confident that the values in the `.env` file are being substituted properly into the `docker-compose.yml` file, the following command can be used to bring up a container based on your project's Docker image and launch the JupyterLab server. This command should also be run from within the `docker` sub-directory of the project.
73 | 
74 | ```bash
75 | $ docker-compose up --build
76 | ```
77 | 
78 | When you are done developing and have shut down the JupyterLab server, the following command tears down the networking infrastructure for the running container.
79 | 
80 | ```bash
81 | $ docker-compose down
82 | ```
83 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # jax-gpu-data-science-project
2 | 
3 | Repository containing scaffolding for a Python 3-based data science project with GPU acceleration using the [JAX](https://jax.readthedocs.io/en/latest/index.html) ecosystem.
4 | 
5 | ## Creating a new project from this template
6 | 
7 | Simply follow the [instructions](https://help.github.com/en/articles/creating-a-repository-from-a-template) to create a new project repository from this template.
8 | 
9 | ## Project organization
10 | 
11 | Project organization is based on ideas from [_Good Enough Practices for Scientific Computing_](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510).
12 | 
13 | 1. Put each project in its own directory, which is named after the project.
14 | 2. Put external scripts or compiled programs in the `bin` directory.
15 | 3. Put raw data and metadata in a `data` directory.
16 | 4. Put text documents associated with the project in the `doc` directory.
17 | 5. Put all Docker-related files in the `docker` directory.
18 | 6. Install the Conda environment into an `env` directory.
19 | 7. Put all notebooks in the `notebooks` directory.
20 | 8. Put files generated during cleanup and analysis in a `results` directory.
21 | 9. Put project source code in the `src` directory.
22 | 10. Name all files to reflect their content or function.
23 | 
24 | ## Building the Conda environment
25 | 
26 | After adding any necessary dependencies that should be downloaded via `conda` to the
27 | `environment.yml` file and any dependencies that should be downloaded via `pip` to the
28 | `requirements.txt` file, you create the Conda environment in a sub-directory `./env` of your project
29 | directory by running the following commands.
30 | 
31 | ```bash
32 | export ENV_PREFIX=$PWD/env
33 | mamba env create --prefix $ENV_PREFIX --file environment.yml --force
34 | ```
35 | 
36 | Once the new environment has been created, you can activate the environment with the following
37 | command.
38 | 
39 | ```bash
40 | conda activate $ENV_PREFIX
41 | ```
42 | 
43 | Note that the `ENV_PREFIX` directory is *not* under version control as it can always be re-created as
44 | necessary.
45 | 
46 | For your convenience these commands have been combined in a shell script `./bin/create-conda-env.sh`.
47 | Running the shell script will create the Conda environment, activate the Conda environment, and build
48 | JupyterLab with any additional extensions. The script should be run from the project root directory
49 | as follows.
50 | 
51 | ```bash
52 | ./bin/create-conda-env.sh
53 | ```
54 | 
55 | ### Ibex
56 | 
57 | The most efficient way to build Conda environments on Ibex is to launch the environment creation script
58 | as a job on the debug partition via Slurm. For your convenience a Slurm job script
59 | `./bin/create-conda-env.sbatch` is included. The script should be run from the project root directory
60 | as follows.
61 | 
62 | ```bash
63 | sbatch ./bin/create-conda-env.sbatch
64 | ```
65 | 
66 | ### Listing the full contents of the Conda environment
67 | 
68 | The explicit dependencies for the project are listed in the `environment.yml` file. To see
69 | the full list of packages installed into the environment, run the following command.
70 | 
71 | ```bash
72 | conda list --prefix $ENV_PREFIX
73 | ```
74 | 
75 | ### Updating the Conda environment
76 | 
77 | If you add (or remove) dependencies to (or from) the `environment.yml` file or the `requirements.txt` file
78 | after the environment has already been created, then you can re-create the environment with the
79 | following command.
80 | 
81 | ```bash
82 | $ mamba env create --prefix $ENV_PREFIX --file environment.yml --force
83 | ```
84 | 
85 | ## Installing the NVIDIA CUDA Compiler (NVCC) (Optional)
86 | 
87 | Installing the NVIDIA CUDA Toolkit manually is only required if your project needs to use the `nvcc` compiler.
88 | Note that even if you have not written any custom CUDA code that needs to be compiled with `nvcc`, if your project
89 | uses packages that include custom CUDA extensions, then you will need `nvcc` installed in order to build these packages.
90 | 
91 | If you don't need `nvcc`, then you can skip this section as `conda` will install a `cudatoolkit` package
92 | which includes all the necessary runtime CUDA dependencies (but not the `nvcc` compiler). 
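If you are unsure whether you actually need `nvcc`, a quick sanity check (a sketch, assuming the project Conda environment is activated on a machine with a GPU) is to see whether the compiler is already on your `PATH` and whether JAX can see the GPU using only the runtime packages installed above.

```bash
# is the nvcc compiler already available?
which nvcc && nvcc --version

# can JAX see a GPU using only the runtime CUDA dependencies?
python -c "import jax; print(jax.devices())"
```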
93 | 
94 | ### Workstation
95 | 
96 | You will need to have the [appropriate version](https://developer.nvidia.com/cuda-toolkit-archive)
97 | of the NVIDIA CUDA Toolkit installed on your workstation. If using the most recent version of JAX, then you
98 | should install [NVIDIA CUDA Toolkit 11.2.2](https://developer.nvidia.com/cuda-11.2.2-download-archive)
99 | [(documentation)](https://docs.nvidia.com/cuda/archive/11.2.2/).
100 | 
101 | After installing the appropriate version of the NVIDIA CUDA Toolkit you will need to set the
102 | following environment variables.
103 | 
104 | ```bash
105 | $ export CUDA_HOME=/usr/local/cuda-11.2.2
106 | $ export PATH=$CUDA_HOME/bin:$PATH
107 | $ export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
108 | ```
109 | 
110 | ### Ibex
111 | 
112 | Ibex users do not need to install the NVIDIA CUDA Toolkit as the relevant versions have already been
113 | made available on Ibex by the Ibex Systems team. Users simply need to load the appropriate version
114 | using the `module` tool.
115 | 
116 | ```bash
117 | $ module load cuda/11.2.2
118 | ```
119 | 
120 | ## Using Docker
121 | 
122 | In order to build Docker images for your project and run containers with GPU acceleration you will
123 | need to install
124 | [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/),
125 | [Docker Compose](https://docs.docker.com/compose/install/) and the
126 | [NVIDIA Docker runtime](https://github.com/NVIDIA/nvidia-docker).
127 | 
128 | Detailed instructions for using Docker to build an image and launch containers can be found in
129 | the `docker/README.md`.
130 | 
--------------------------------------------------------------------------------
/bin/README.md:
--------------------------------------------------------------------------------
1 | ## Creating the Conda environment
2 | 
3 | For your convenience the commands to create the Conda environment have been combined in a shell script. The script should be run from the project root directory as follows.
4 | 
5 | ```bash
6 | ./bin/create-conda-env.sh
7 | ```
8 | 
9 | ## Launching a job via Slurm to create the Conda environment
10 | 
11 | While running the shell script above on a login node will create the Conda environment, you may prefer to launch a job via Slurm
12 | to create the Conda environment. If you lose your connection to the Ibex login node whilst your Conda environment script is running,
13 | the environment will be left in an inconsistent state and you will need to start over. Depending on the load on the Ibex login nodes,
14 | launching a job via Slurm to create your Conda environment can also be faster.
15 | 
16 | For your convenience the commands to launch a job via Slurm to create the Conda environment have been combined into a job script. The script should be run from the project root directory as follows.
17 | 
18 | ```bash
19 | sbatch ./bin/create-conda-env.sbatch
20 | ```
21 | 
22 | ## Launching a Jupyter server for interactive work
23 | 
24 | The job script `launch-jupyter-server.sbatch` launches a Jupyter server for interactive prototyping. To launch a JupyterLab server,
25 | use `sbatch` to submit the job script by running the following command from the project root directory.
26 | 
27 | ```bash
28 | sbatch ./bin/launch-jupyter-server.sbatch
29 | ```
30 | 
31 | If you prefer the classic Jupyter Notebook interface, then you can launch the Jupyter notebook server with the following command in
32 | the project root directory. 
33 | 
34 | ```bash
35 | sbatch ./bin/launch-jupyter-server.sbatch notebook
36 | ```
37 | 
38 | Once the job has started, you can inspect the `./bin/launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file where you will find
39 | instructions on how to access the server from the browser on your local machine.
40 | 
41 | ### SSH tunneling between your local machine and Ibex compute node(s)
42 | To connect to the compute node on Ibex running your Jupyter server, you need to create an SSH tunnel from your local machine
43 | to a login node on Ibex using a command similar to the following.
44 | 
45 | ```
46 | ssh -L ${JUPYTER_PORT}:${IBEX_NODE}:${JUPYTER_PORT} ${USER}@glogin.ibex.kaust.edu.sa
47 | ```
48 | 
49 | The exact command for your job can be copied from the `./bin/launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file.
50 | 
51 | ### Accessing the Jupyter server from your local machine
52 | 
53 | Once you have set up your SSH tunnel, in order to access the Jupyter server from your local machine you need to copy the
54 | second URL provided in the Jupyter server logs in the `launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file and paste it into
55 | the browser on your local machine. The URL will look similar to the following.
56 | 
57 | ```
58 | http://127.0.0.1:$JUPYTER_PORT/lab?token=$JUPYTER_TOKEN
59 | ```
60 | 
61 | The exact URL for your job, containing both your assigned `$JUPYTER_PORT` as well as your specific `$JUPYTER_TOKEN`, can
62 | be copied from the `launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file.
63 | 
64 | ## Launching a training job via Slurm
65 | 
66 | The `src` directory contains an example training script, `train.py`, that trains a classification pipeline on the
67 | [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset. You can launch this training script as a batch
68 | job on Ibex via Slurm using the following command in the project root directory.
69 | 
70 | ```bash
71 | ./bin/launch-train.sh
72 | ```
73 | 
74 | At present the `./bin/launch-train.sh` script only defines a job name for Slurm and creates a separate
75 | sub-directory within the `results/` directory for any output generated by the Slurm job script. However, wrapping
76 | your job submission inside a `launch-train.sh` script is an Ibex "best practice" that will help you automate more
77 | complex machine learning workflows.
78 | 
79 | The script `./bin/train.sbatch` is the actual Slurm job script. This script can be broken down into several parts that
80 | are common to all machine learning jobs on Ibex.
81 | 
82 | ### Request resources from Slurm
83 | 
84 | You will request resources for your job using Slurm headers. The headers below request 4 Intel CPU cores and 36G of
85 | CPU memory for 2 hours. Requesting only Intel CPU cores is important because the Conda environment has been optimized
86 | for performance on Intel CPUs. Most Scikit-Learn algorithms are parallelized and by default will take advantage of all
87 | available CPUs; therefore, you will typically want to request more than one CPU for your Scikit-Learn training jobs. You
88 | should typically request at most 9G of CPU memory per CPU (each Intel node has 40 CPUs and roughly 366G of usable CPU
89 | memory, which works out to a little more than 9G per CPU). 
90 | 
91 | ```bash
92 | #!/bin/bash --login
93 | #SBATCH --time=2:00:00
94 | #SBATCH --cpus-per-task=4
95 | #SBATCH --mem-per-cpu=9G
96 | #SBATCH --constraint=intel
97 | #SBATCH --partition=batch
98 | #SBATCH --mail-type=ALL
99 | #SBATCH --output=results/%x/%j-slurm.out
100 | #SBATCH --error=results/%x/%j-slurm.err
101 | ```
102 | 
103 | ### Activate the Conda environment
104 | 
105 | Activating the Conda environment is done in the usual way; however, it is critical for the job script to run inside a
106 | login shell in order for the `conda activate` command to work as expected (this is why the first line of the job script
107 | is `#!/bin/bash --login`). It is also a good practice to purge any modules that you might have loaded prior to launching
108 | the training job.
109 | 
110 | ```bash
111 | module purge
112 | ENV_PREFIX=$PROJECT_DIR/env
113 | conda activate $ENV_PREFIX
114 | ```
115 | 
116 | ### Starting the NVDashboard monitoring server
117 | 
118 | After activating the Conda environment, but before launching your training script, you should start the
119 | NVDashboard monitoring server to run in the background using `srun` to reserve a free port. The
120 | `./bin/launch-nvdashboard-server.srun` script launches the monitoring server and logs the assigned
121 | port to the `slurm.err` file for the job.
122 | 
123 | ```bash
124 | # use srun to launch NVDashboard server in order to reserve a port
125 | srun --resv-ports=1 ./bin/launch-nvdashboard-server.srun &
126 | NVDASHBOARD_PID=$!
127 | ```
128 | 
129 | ### Launch a training script
130 | 
131 | Finally, you launch the training job! Note that we use the special Bash variable `$1` to refer to the first argument
132 | passed to the Slurm job script. This allows you to reuse the same Slurm job script for other training jobs!
133 | 
134 | ```bash
135 | python $1
136 | ```
137 | 
138 | ### Stopping the NVDashboard monitoring server
139 | 
140 | Once the training script has finished running, you should stop the NVDashboard server so that your job exits. If
141 | you do not stop the server, the job will continue to run until it reaches its time limit (which wastes resources).
142 | 
143 | ```bash
144 | # shutdown the NVDashboard server
145 | kill $NVDASHBOARD_PID
146 | ```
147 | 
--------------------------------------------------------------------------------
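Putting these pieces together, below is a minimal sketch of what the complete `./bin/train.sbatch` job script could look like. It is assembled from the snippets above; the resource requests mirror the example headers, and `$PROJECT_DIR` is assumed to be set in the environment (for example, exported by the `./bin/launch-train.sh` wrapper), so adapt both to your project.

```bash
#!/bin/bash --login
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=9G
#SBATCH --constraint=intel
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --output=results/%x/%j-slurm.out
#SBATCH --error=results/%x/%j-slurm.err

# activate the project Conda environment (requires a login shell, hence the shebang above)
module purge
ENV_PREFIX=$PROJECT_DIR/env
conda activate $ENV_PREFIX

# use srun to launch the NVDashboard server in the background in order to reserve a port
srun --resv-ports=1 ./bin/launch-nvdashboard-server.srun &
NVDASHBOARD_PID=$!

# launch the training script passed as the first argument to this job script
python $1

# shutdown the NVDashboard server so the job exits cleanly
kill $NVDASHBOARD_PID
```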