├── bin
│   ├── .gitkeep
│   ├── create-conda-env.sh
│   ├── create-conda-env.sbatch
│   ├── launch-code-server.sbatch
│   ├── launch-jupyter-server.sbatch
│   ├── launch-nvdashboard-server.srun
│   ├── train.sbatch
│   ├── launch-jupyter-server.srun
│   ├── launch-code-server.srun
│   ├── launch-train.sh
│   ├── launch-checkpoint-and-resubmit.sh
│   └── README.md
├── data
│   └── .gitkeep
├── doc
│   └── .gitkeep
├── src
│   ├── .gitkeep
│   ├── train.py
│   ├── train-argparse.py
│   └── train-checkpoint-restart.py
├── docker
│   ├── .gitkeep
│   ├── entrypoint.sh
│   ├── img
│   │   └── creating-dockerhub-repo-screenshot.png
│   ├── hooks
│   │   └── build
│   ├── docker-compose.yml
│   ├── Dockerfile
│   └── README.md
├── notebooks
│   └── .gitkeep
├── results
│   └── .gitkeep
├── requirements.txt
├── environment.yml
├── LICENSE
├── .gitignore
└── README.md

/bin/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/doc/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/src/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/docker/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/notebooks/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/results/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/docker/entrypoint.sh:
--------------------------------------------------------------------------------
#!/bin/bash --login
set -e

conda activate $HOME/app/env
exec "$@"
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
kornia
pandas-bokeh

# install NVIDIA DALI
--extra-index-url https://developer.download.nvidia.com/compute/redist
nvidia-dali-cuda110
--------------------------------------------------------------------------------
/docker/img/creating-dockerhub-repo-screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidrpugh/introduction-to-recurrent-neural-networks/master/docker/img/creating-dockerhub-repo-screenshot.png
--------------------------------------------------------------------------------
/docker/hooks/build:
--------------------------------------------------------------------------------
#!/bin/bash
set -ex

docker image build \
    --build-arg username=al-khawarizmi \
    --build-arg uid=1000 \
    --build-arg gid=100 \
    --tag "$DOCKER_REPO:latest" \
    --file Dockerfile \
    ../
--------------------------------------------------------------------------------
/bin/create-conda-env.sh:
--------------------------------------------------------------------------------
#!/bin/bash --login

# entire script fails if a single command fails
set -e

# create the conda environment
PROJECT_DIR="$PWD"
ENV_PREFIX="$PROJECT_DIR"/env
mamba env create --prefix "$ENV_PREFIX" --file "$PROJECT_DIR"/environment.yml --force
--------------------------------------------------------------------------------
/bin/create-conda-env.sbatch:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --partition=debug
#SBATCH --job-name=create-conda-env
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# create the conda environment
./bin/create-conda-env.sh
--------------------------------------------------------------------------------
/bin/launch-code-server.sbatch:
--------------------------------------------------------------------------------
#!/bin/bash --login
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=6
#SBATCH --mem=64G
#SBATCH --partition=debug
#SBATCH --job-name=launch-code-server
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# use srun to launch code server in order to reserve a port
srun --resv-ports=1 ./bin/launch-code-server.srun
--------------------------------------------------------------------------------
/bin/launch-jupyter-server.sbatch:
--------------------------------------------------------------------------------
#!/bin/bash --login
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=6
#SBATCH --mem=64G
#SBATCH --partition=debug
#SBATCH --job-name=launch-jupyter-server
#SBATCH --mail-type=ALL
#SBATCH --output=bin/%x-%j-slurm.out
#SBATCH --error=bin/%x-%j-slurm.err

# use srun to launch Jupyter server in order to reserve a port
srun --resv-ports=1 ./bin/launch-jupyter-server.srun
--------------------------------------------------------------------------------
/docker/docker-compose.yml:
--------------------------------------------------------------------------------
version: "2.3"

services:
  jupyterlab-server:
    build:
      args:
        - username=${USER}
        - uid=${UID}
        - gid=${GID}
      context: ../
      dockerfile: docker/Dockerfile
    ports:
      - "8888:8888"
    runtime: nvidia
    volumes:
      - ../bin:/home/${USER}/app/bin
      - ../data:/home/${USER}/app/data
      - ../doc:/home/${USER}/app/doc
      - ../notebooks:/home/${USER}/app/notebooks
      - ../results:/home/${USER}/app/results
      - ../src:/home/${USER}/app/src
    init: true
    stdin_open: true
    tty: true
--------------------------------------------------------------------------------
/bin/launch-nvdashboard-server.srun:
--------------------------------------------------------------------------------
#!/bin/bash

NVDASHBOARD_PORT=$SLURM_STEP_RESV_PORTS
IBEX_NODE=$(hostname -s)

echo "
To connect to the compute node ${IBEX_NODE} on Ibex running your NVDashboard server,
you need to create an ssh tunnel from your local machine to a login node on Ibex
using the following command.

ssh -L ${NVDASHBOARD_PORT}:${IBEX_NODE}:${NVDASHBOARD_PORT} ${USER}@glogin.ibex.kaust.edu.sa

Next, you need to copy the url provided below and paste it into the browser on your
local machine.

http://127.0.0.1:${NVDASHBOARD_PORT}
" >&2

# Start the nvdashboard server
python -m jupyterlab_nvdashboard.server $NVDASHBOARD_PORT
--------------------------------------------------------------------------------
/bin/train.sbatch:
--------------------------------------------------------------------------------
#!/bin/bash --login
#SBATCH --time=2:00:00
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=4
#SBATCH --mem=64G
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --output=results/%x/%j-slurm.out
#SBATCH --error=results/%x/%j-slurm.err

# entire script fails if single command fails
set -e

# activate the conda environment
module purge
conda activate "$1"

# use srun to launch NVDashboard server in order to reserve a port
srun --resv-ports=1 ./bin/launch-nvdashboard-server.srun &
NVDASHBOARD_PID=$!

# launch the training script
python "${@:2}"

# shutdown the NVDashboard server
kill $NVDASHBOARD_PID
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: pytorch-env

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - bokeh
  - captum
  - cudatoolkit=11.1
  - datashader
  - gh
  - git
  - holoviews
  - hvplot
  - ipywidgets
  - jupyterlab
  - jupyterlab-git
  - jupyterlab-nvdashboard
  - jupyterlab-lsp
  - matplotlib
  - numba
  - numpy
  - optuna
  - pandas
  - panel
  - pip
  - pip:
    - -r requirements.txt
  - pyarrow
  - python=3.9
  - python-language-server
  - pytorch=1.9
  - pytorch-lightning
  - pyviz_comms
  - scikit-learn
  - scipy
  - tensorboard
  - torchaudio
  - torchmetrics
  - torchtext
  - torchvision
  - wandb
  - xeus-python
--------------------------------------------------------------------------------
/bin/launch-jupyter-server.srun:
--------------------------------------------------------------------------------
#!/bin/bash --login

# setup the environment
module purge
conda activate ./env

# setup ssh tunneling
export XDG_RUNTIME_DIR=/tmp IBEX_NODE=$(hostname -s)
KAUST_USER=$(whoami)
JUPYTER_PORT=$SLURM_STEP_RESV_PORTS

echo "
To connect to the compute node ${IBEX_NODE} on Ibex running your Jupyter server,
you need to create an ssh tunnel from your local machine to a login node on Ibex
using the following command.

ssh -L ${JUPYTER_PORT}:${IBEX_NODE}:${JUPYTER_PORT} ${KAUST_USER}@glogin.ibex.kaust.edu.sa

Next, you need to copy the second url provided below and paste it into the browser
on your local machine.
21 | " >&2 22 | 23 | # launch jupyter server 24 | jupyter ${1:-lab} --no-browser --port=${JUPYTER_PORT} --ip=${IBEX_NODE} 25 | -------------------------------------------------------------------------------- /bin/launch-code-server.srun: -------------------------------------------------------------------------------- 1 | #!/bin/bash --login 2 | 3 | set -e 4 | 5 | # setup the environment 6 | PROJECT_DIR="$PWD" 7 | ENV_PREFIX="$PROJECT_DIR"/env 8 | 9 | module purge 10 | conda activate "$ENV_PREFIX" 11 | 12 | # setup ssh tunneling 13 | COMPUTE_NODE=$(hostname -s) 14 | CODE_SERVER_PORT=$SLURM_STEP_RESV_PORTS 15 | 16 | echo " 17 | To connect to the compute node ${COMPUTE_NODE} on Ibex running your Code Server, 18 | you need to create an ssh tunnel from your local machine to login node on Ibex 19 | using the following command. 20 | 21 | ssh -L ${CODE_SERVER_PORT}:${COMPUTE_NODE}:${CODE_SERVER_PORT} ${USER}@glogin.ibex.kaust.edu.sa 22 | 23 | Next, you need to copy the url provided below and paste it into the browser 24 | on your local machine. 25 | 26 | localhost:${CODE_SERVER_PORT} 27 | 28 | " >&2 29 | 30 | # launch code server 31 | code-server --auth none --bind-addr ${COMPUTE_NODE}:${CODE_SERVER_PORT} "$PROJECT_DIR" 32 | 33 | 34 | -------------------------------------------------------------------------------- /bin/launch-train.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # entire script fails if a single command fails 4 | set -e 5 | 6 | # script should be run from the project directory 7 | PROJECT_DIR="$PWD" 8 | 9 | # path to the Conda environment 10 | ENV_PREFIX="$PROJECT_DIR"/env 11 | 12 | # project should have a data directory 13 | DATA_DIR="$PROJECT_DIR"/data 14 | 15 | # creates a separate directory for each job 16 | JOB_NAME=example-training-job 17 | JOB_RESULTS_DIR="$PROJECT_DIR"/results/"$JOB_NAME" 18 | mkdir -p "$JOB_RESULTS_DIR" 19 | 20 | # launch the training job 21 | CPUS_PER_GPU=6 22 | sbatch --job-name "$JOB_NAME" --cpus-per-gpu $CPUS_PER_GPU \ 23 | "$PROJECT_DIR"/bin/train.sbatch "$ENV_PREFIX" \ 24 | "$PROJECT_DIR"/src/train-argparse.py \ 25 | --dataloader-num-workers $CPUS_PER_GPU \ 26 | --data-dir "$DATA_DIR" \ 27 | --num-training-epochs 10 \ 28 | --output-dir "$JOB_RESULTS_DIR" \ 29 | --tqdm-disable 30 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) [year], [fullname] 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/docker/Dockerfile:
--------------------------------------------------------------------------------
FROM ubuntu:18.04

LABEL maintainer="pughdr"

SHELL [ "/bin/bash", "--login", "-c" ]

RUN apt-get update --fix-missing && \
    apt-get install -y wget bzip2 curl git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Create a non-root user
ARG username=al-khawarizmi
ARG uid=1000
ARG gid=100
ENV USER $username
ENV UID $uid
ENV GID $gid
ENV HOME /home/$USER

RUN adduser --disabled-password \
    --gecos "Non-root user" \
    --uid $UID \
    --gid $GID \
    --home $HOME \
    $USER

COPY environment.yml requirements.txt /tmp/
RUN chown $UID:$GID /tmp/environment.yml /tmp/requirements.txt

COPY postBuild docker/entrypoint.sh /usr/local/bin/
RUN chown $UID:$GID /usr/local/bin/postBuild /usr/local/bin/entrypoint.sh && \
    chmod u+x /usr/local/bin/postBuild /usr/local/bin/entrypoint.sh

USER $USER

# install miniconda
ENV MINICONDA_VERSION 4.8.2
ENV CONDA_DIR $HOME/miniconda3
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-$MINICONDA_VERSION-Linux-x86_64.sh -O ~/miniconda.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_DIR && \
    rm ~/miniconda.sh

# make non-activate conda commands available
ENV PATH=$CONDA_DIR/bin:$PATH

# make conda activate command available from /bin/bash --login shells
RUN echo ". $CONDA_DIR/etc/profile.d/conda.sh" >> ~/.profile

# make conda activate command available from /bin/bash --interactive shells
RUN conda init bash

# create a project directory inside user home
ENV PROJECT_DIR $HOME/app
RUN mkdir $PROJECT_DIR
WORKDIR $PROJECT_DIR

# build the conda environment
ENV ENV_PREFIX $PROJECT_DIR/env
RUN conda update --name base --channel defaults conda && \
    conda env create --prefix $ENV_PREFIX --file /tmp/environment.yml --force && \
    conda activate $ENV_PREFIX && \
    . /usr/local/bin/postBuild && \
    conda clean --all --yes

# use an entrypoint script to ensure conda environment is properly activated at runtime
ENTRYPOINT [ "/usr/local/bin/entrypoint.sh" ]

# default command will be to launch JupyterLab server for development
CMD [ "jupyter", "lab", "--no-browser", "--ip", "0.0.0.0" ]
--------------------------------------------------------------------------------
/bin/launch-checkpoint-and-resubmit.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# entire script fails if a single command fails
set -e

# script should be run from the project directory
PROJECT_DIR="$PWD"

# path to Conda environment
ENV_PREFIX="$PROJECT_DIR"/env

# data should be read from a data directory
DATA_DIR="$PROJECT_DIR"/data

# creates a separate directory for each job
JOB_NAME=example-training-job
JOB_RESULTS_DIR="$PROJECT_DIR"/results/"$JOB_NAME"
mkdir -p "$JOB_RESULTS_DIR"

# create a directory to store the checkpoints
CHECKPOINTS_DIR="$JOB_RESULTS_DIR"/checkpoints
mkdir -p "$CHECKPOINTS_DIR"

# use a single file to track intermediate checkpoints
CHECKPOINT_FILEPATH="$CHECKPOINTS_DIR"/checkpoint.pt

# define number of training periods and training epochs (per period)
NUM_TRAINING_PERIODS=10
NUM_EPOCHS_PER_PERIOD=1

# launch the training job for the initial period
CPUS_PER_GPU=4
TRAIN_JOBID=$(
    sbatch --job-name "$JOB_NAME" --cpus-per-gpu $CPUS_PER_GPU --parsable \
        "$PROJECT_DIR"/bin/train.sbatch "$ENV_PREFIX" \
        "$PROJECT_DIR"/src/train-checkpoint-restart.py \
            --dataloader-num-workers $CPUS_PER_GPU \
            --data-dir "$DATA_DIR" \
            --num-training-epochs $NUM_EPOCHS_PER_PERIOD \
            --tqdm-disable \
            --write-checkpoint-to "$CHECKPOINT_FILEPATH"
)

# queue the training jobs for the remaining periods; each job starts only after
# the previous one succeeds and resumes from the shared checkpoint file (copying
# the checkpoint at submission time would not work because sbatch returns
# before the submitted job has actually run and written its checkpoint)
for ((PERIOD=1;PERIOD<$NUM_TRAINING_PERIODS;PERIOD++))
do

    TRAIN_JOBID=$(
        sbatch --job-name "$JOB_NAME" --cpus-per-gpu $CPUS_PER_GPU --parsable --dependency=afterok:$TRAIN_JOBID --kill-on-invalid-dep=yes \
            "$PROJECT_DIR"/bin/train.sbatch "$ENV_PREFIX" \
            "$PROJECT_DIR"/src/train-checkpoint-restart.py \
                --checkpoint-filepath "$CHECKPOINT_FILEPATH" \
                --dataloader-num-workers $CPUS_PER_GPU \
                --data-dir "$DATA_DIR" \
                --num-training-epochs $NUM_EPOCHS_PER_PERIOD \
                --tqdm-disable \
                --write-checkpoint-to "$CHECKPOINT_FILEPATH"
    )

done
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# ignore Slurm .out .err files for certain jobs
bin/create-conda-env-*-slurm.out
bin/create-conda-env-*-slurm.err
bin/launch-code-server-*-slurm.out
bin/launch-code-server-*-slurm.err
bin/launch-jupyter-server-*-slurm.out
bin/launch-jupyter-server-*-slurm.err

# ignore DCGM reports
dcgm/

# ignore vscode settings
.vscode/
--------------------------------------------------------------------------------
/src/train.py:
--------------------------------------------------------------------------------
import os
import pathlib

from sklearn import metrics
import torch
from torch import nn, optim, utils
from torchvision import datasets, models, transforms
from tqdm import tqdm


BATCH_SIZE = 256
DATA_DIR = pathlib.Path("data/")
DATALOADER_NUM_WORKERS = 6
DEVICE = torch.device("cuda")
NUM_CLASSES = 10
NUM_TRAIN_EPOCHS = 10
OPTIMIZER_LEARNING_RATE = 1e-3
OPTIMIZER_MOMENTUM = 0.9
OUTPUT_DIR = pathlib.Path("results/example-training-job/")
OUTPUT_FILENAME = OUTPUT_DIR / "model.pt"
PREFETCH_FACTOR = 2
RESIZE_SIZE = 224
SEED = 42
TQDM_DISABLE = True


# create the output directory
if not OUTPUT_DIR.exists():
    os.mkdir(OUTPUT_DIR)

# set seed for reproducibility
torch.manual_seed(SEED)

# create the train and test datasets
_transform = transforms.Compose([
    transforms.Resize(RESIZE_SIZE),
    transforms.ToTensor(),
])
train_dataset = datasets.CIFAR10(root=DATA_DIR,
                                 train=True,
                                 download=True,
                                 transform=_transform)

test_dataset = datasets.CIFAR10(root=DATA_DIR,
                                train=False,
                                download=True,
                                transform=_transform)

# create the train and test dataloaders
train_dataloader = (utils.data
                    .DataLoader(train_dataset,
                                batch_size=BATCH_SIZE,
                                shuffle=True,
                                num_workers=DATALOADER_NUM_WORKERS,
                                persistent_workers=True,
                                pin_memory=True,
                                prefetch_factor=PREFETCH_FACTOR))
test_dataloader = (utils.data
                   .DataLoader(test_dataset,
                               batch_size=BATCH_SIZE,
                               shuffle=False,
                               num_workers=DATALOADER_NUM_WORKERS,
                               persistent_workers=True,
                               pin_memory=True,
                               prefetch_factor=PREFETCH_FACTOR))

# define a model_fn, loss function, and an optimizer
model_fn = models.resnet50(pretrained=False,
                           num_classes=NUM_CLASSES)
model_fn.to(DEVICE)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_fn.parameters(),
                      lr=OPTIMIZER_LEARNING_RATE,
                      momentum=OPTIMIZER_MOMENTUM)

# train the model
print("Training started...")
for epoch in range(NUM_TRAIN_EPOCHS):

    with tqdm(train_dataloader, unit="batch", disable=TQDM_DISABLE) as tepoch:

        for (features, targets) in tepoch:
            tepoch.set_description(f"Epoch {epoch}")

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            predictions = model_fn(features.to(DEVICE))
            loss = loss_fn(predictions, targets.to(DEVICE))
            loss.backward()
            optimizer.step()

print("...training finished!")

# save the trained model
torch.save(model_fn.state_dict(), OUTPUT_FILENAME)

# compute the predictions on the test data
batch_targets = []
batch_predicted_targets = []

with torch.no_grad():
    for (features, targets) in test_dataloader:
        predicted_probs = model_fn(features.to(DEVICE))
        predicted_targets = predicted_probs.argmax(axis=1)
        batch_targets.append(targets)
        batch_predicted_targets.append(predicted_targets)

# generate a classification report
test_target = (torch.cat(batch_targets)
               .cpu())
test_predicted_targets = (torch.cat(batch_predicted_targets)
                          .cpu())

classification_report = metrics.classification_report(
    test_target,
    test_predicted_targets,
)
print(classification_report)
--------------------------------------------------------------------------------
/docker/README.md:
--------------------------------------------------------------------------------
## Using the `kaustvl/pytorch-gpu-data-science-project` image

If you are not adding any additional dependencies to your project's `environment.yml` file, then you can run containers for your project based on the `kaustvl/pytorch-gpu-data-science-project` image hosted on DockerHub. Run the following command within your project's root directory to run a container for your project based on this existing Docker image.

```bash
$ docker container run \
    --rm \
    --tty \
    --volume ${PWD}/bin:/home/$USER/app/bin \
    --volume ${PWD}/data:/home/$USER/app/data \
    --volume ${PWD}/doc:/home/$USER/app/doc \
    --volume ${PWD}/notebooks:/home/$USER/app/notebooks \
    --volume ${PWD}/results:/home/$USER/app/results \
    --volume ${PWD}/src:/home/$USER/app/src \
    --runtime nvidia \
    --publish 8888:8888 \
    kaustvl/pytorch-gpu-data-science-project:latest
```

## Building a new image for your project

If you wish to add (remove) dependencies in your project's `environment.yml` (or if you wish to have a custom user defined inside the image), then you will need to build a new Docker image for your project. The following command builds a new image for your project with a custom `$USER` (with associated `$UID` and `$GID`) as well as a particular `$IMAGE_NAME` and `$IMAGE_TAG`. This command should be run within the `docker` sub-directory of the project.

```bash
$ docker image build \
    --build-arg username=$USER \
    --build-arg uid=$UID \
    --build-arg gid=$GID \
    --file Dockerfile \
    --tag $IMAGE_NAME:$IMAGE_TAG \
    ../
```

### Automating the build process with DockerHub

1. Create a new (or log in to your existing) [DockerHub](https://hub.docker.com) account.
2. [Link your GitHub account with your DockerHub account](https://docs.docker.com/docker-hub/builds/link-source/) (if you have not already done so).
3. Create a new DockerHub repository.
   1. Under "Build Settings" click the GitHub logo and then select your project's GitHub repository.
   2. Select "Click here to customize build settings" and specify the location of the Dockerfile for your build as `docker/Dockerfile`.
   3. Give the DockerHub repository the same name as your project's GitHub repository.
   4. Give the DockerHub repository a brief description (something like "Automated builds for $PROJECT" or similar).
   5. Click the "Create and Build" button.
4. Edit the `hooks/build` script with your project's `$USER`, `$UID`, and `$GID` build args in place of the corresponding default values.

Below is a screenshot which should give you an idea of how the form ought to be filled out prior to clicking "Create and Build".

![Creating a new DockerHub repository for your project](./img/creating-dockerhub-repo-screenshot.png)

DockerHub is now configured to re-build your project's image whenever commits are pushed to your project's GitHub repository! Specifically, whenever you push new commits to your project's GitHub repository, GitHub will notify DockerHub and DockerHub will then run the `./hooks/build` script to re-build your project's image. For more details on the whole process see the [official documentation](https://docs.docker.com/docker-hub/builds/advanced/#build-hook-examples) on advanced DockerHub build options.

### Running a container

Once you have built the image, the following command will run a container based on the image `$IMAGE_NAME:$IMAGE_TAG`. This command should be run from within the project's root directory.

```bash
$ docker container run \
    --rm \
    --tty \
    --volume ${PWD}/bin:/home/$USER/app/bin \
    --volume ${PWD}/data:/home/$USER/app/data \
    --volume ${PWD}/doc:/home/$USER/app/doc \
    --volume ${PWD}/notebooks:/home/$USER/app/notebooks \
    --volume ${PWD}/results:/home/$USER/app/results \
    --volume ${PWD}/src:/home/$USER/app/src \
    --runtime nvidia \
    --publish 8888:8888 \
    $IMAGE_NAME:$IMAGE_TAG
```

### Using Docker Compose

It is quite easy to make a typo whilst writing the above docker commands by hand; a less error-prone approach is to use [Docker Compose](https://docs.docker.com/compose/). The above docker commands have been encapsulated into the `docker-compose.yml` configuration file. You will need to store your project-specific values for `$USER`, `$UID`, and `$GID` in a file called `.env` as follows.

```
USER=$USER
UID=$UID
GID=$GID
```

For more details on how variable substitution works with Docker Compose, see the [official documentation](https://docs.docker.com/compose/environment-variables/#the-env-file).

Note that you can test your `docker-compose.yml` file by running the following command in the `docker` sub-directory of the project.

```bash
$ docker-compose config
```

This command takes the `docker-compose.yml` file, substitutes the values provided in the `.env` file, and then returns the result.

Once you are confident that values in the `.env` file are being substituted properly into the `docker-compose.yml` file, the following command can be used to bring up a container based on your project's Docker image and launch the JupyterLab server. This command should also be run from within the `docker` sub-directory of the project.

```bash
$ docker-compose up --build
```

When you are done developing and have shut down the JupyterLab server, the following command tears down the networking infrastructure for the running container.

```bash
$ docker-compose down
```
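
Because the image's entrypoint script activates the Conda environment before `exec`-ing whatever command it is given, the same Compose service can also be used to run one-off commands inside the container. A minimal sketch (run from the `docker` sub-directory; `jupyterlab-server` is the service name defined in `docker-compose.yml`, and the relative script path assumes the volume mounts shown there):

```bash
# run the example training script in a throwaway container instead of JupyterLab
$ docker-compose run --rm jupyterlab-server python src/train.py
```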
--------------------------------------------------------------------------------
/src/train-argparse.py:
--------------------------------------------------------------------------------
import argparse
import os
import pathlib

from sklearn import metrics
import torch
from torch import nn, optim, utils
from torchvision import datasets, models, transforms
from tqdm import tqdm


parser = argparse.ArgumentParser()
parser.add_argument("--batch-size",
                    default=256,
                    type=int,
                    help="Number of training samples per batch.")
parser.add_argument("--data-dir",
                    required=True,
                    type=str,
                    help="Path to directory containing the train, val, test data.")
parser.add_argument("--dataloader-num-workers",
                    required=True,
                    type=int,
                    help="Number of workers to use for loading data.")
parser.add_argument("--dataloader-prefetch-factor",
                    default=2,
                    type=int,
                    help="Number of data batches to prefetch per worker.")
parser.add_argument("--disable-gpu",
                    action="store_true",
                    help="Disable GPU(s) for training and inference.")
parser.add_argument("--num-training-epochs",
                    default=1,
                    type=int,
                    help="Number of training epochs.")
parser.add_argument("--optimizer-learning-rate",
                    default=1e-3,
                    type=float,
                    help="Learning rate for optimizer.")
parser.add_argument("--optimizer-momentum",
                    default=0.9,
                    type=float,
                    help="Momentum for optimizer.")
parser.add_argument("--output-dir",
                    required=True,
                    type=str,
                    help="Path to directory where output should be written.")
parser.add_argument("--output-filename",
                    default="model.pt",
                    type=str,
                    help="Filename for model checkpoint.")
parser.add_argument("--seed",
                    type=int,
                    help="Seed used for pseudorandom number generation.")
parser.add_argument("--tqdm-disable",
                    action="store_true",
                    help="Disables the training progress bar.")
args = parser.parse_args()


# no need to expose these as command line args
DATA_DIR = pathlib.Path(args.data_dir)
DEVICE = torch.device("cpu") if args.disable_gpu else torch.device("cuda")
NUM_CLASSES = 10
OUTPUT_DIR = pathlib.Path(args.output_dir)
OUTPUT_FILEPATH = OUTPUT_DIR / args.output_filename
RESIZE_SIZE = 224


# create the output directory
if not OUTPUT_DIR.exists():
    os.mkdir(OUTPUT_DIR)

# set seed for reproducibility
if args.seed is not None:
    torch.manual_seed(args.seed)

# create the train and test datasets
_transform = transforms.Compose([
    transforms.Resize(RESIZE_SIZE),
    transforms.ToTensor(),
])
train_dataset = datasets.CIFAR10(root=DATA_DIR,
                                 train=True,
                                 download=True,
                                 transform=_transform)

test_dataset = datasets.CIFAR10(root=DATA_DIR,
                                train=False,
                                download=True,
                                transform=_transform)

# create the train and test dataloaders
train_dataloader = (utils.data
                    .DataLoader(train_dataset,
                                batch_size=args.batch_size,
                                shuffle=True,
                                num_workers=args.dataloader_num_workers,
                                persistent_workers=True,
                                pin_memory=True,
                                prefetch_factor=args.dataloader_prefetch_factor))
test_dataloader = (utils.data
                   .DataLoader(test_dataset,
                               batch_size=args.batch_size,
                               shuffle=False,
                               num_workers=args.dataloader_num_workers,
                               persistent_workers=True,
                               pin_memory=True,
                               prefetch_factor=args.dataloader_prefetch_factor))

# define a model_fn, loss function, and an optimizer
model_fn = models.resnet50(pretrained=False,
                           num_classes=NUM_CLASSES)
model_fn.to(DEVICE)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_fn.parameters(),
                      lr=args.optimizer_learning_rate,
                      momentum=args.optimizer_momentum)

# train the model
print("Training started...")
for epoch in range(args.num_training_epochs):

    with tqdm(train_dataloader, unit="batch", disable=args.tqdm_disable) as tepoch:

        for (features, targets) in tepoch:
            tepoch.set_description(f"Epoch {epoch}")

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            predictions = model_fn(features.to(DEVICE))
            loss = loss_fn(predictions, targets.to(DEVICE))
            loss.backward()
            optimizer.step()

print("...training finished!")

# save the trained model
torch.save(model_fn.state_dict(), OUTPUT_FILEPATH)

# compute the predictions on the test data
batch_targets = []
batch_predicted_targets = []

with torch.no_grad():
    for (features, targets) in test_dataloader:
        predicted_probs = model_fn(features.to(DEVICE))
        predicted_targets = predicted_probs.argmax(axis=1)
        batch_targets.append(targets)
        batch_predicted_targets.append(predicted_targets)

# generate a classification report
test_target = (torch.cat(batch_targets)
               .cpu())
test_predicted_targets = (torch.cat(batch_predicted_targets)
                          .cpu())

classification_report = metrics.classification_report(
    test_target,
    test_predicted_targets,
)
print(classification_report)
--------------------------------------------------------------------------------
/src/train-checkpoint-restart.py:
--------------------------------------------------------------------------------
import argparse
import pathlib

from sklearn import metrics
import torch
from torch import nn, optim, utils
from torchvision import datasets, models, transforms
from tqdm import tqdm


parser = argparse.ArgumentParser()
parser.add_argument("--batch-size",
                    default=256,
                    type=int,
                    help="Number of training samples per batch.")
parser.add_argument("--checkpoint-filepath",
                    type=str,
                    help="Path to a file containing the current checkpoint.")
parser.add_argument("--data-dir",
                    required=True,
                    type=str,
                    help="Path to directory containing the train, val, test data.")
parser.add_argument("--dataloader-num-workers",
                    required=True,
                    type=int,
                    help="Number of workers to use for loading data.")
parser.add_argument("--dataloader-prefetch-factor",
                    default=2,
                    type=int,
                    help="Number of data batches to prefetch per worker.")
parser.add_argument("--disable-gpu",
                    action="store_true",
                    help="Disable GPU(s) for training and inference.")
parser.add_argument("--num-training-epochs",
                    default=1,
                    type=int,
                    help="Number of training epochs.")
parser.add_argument("--optimizer-learning-rate",
                    default=1e-3,
                    type=float,
                    help="Learning rate for optimizer.")
parser.add_argument("--optimizer-momentum",
                    default=0.9,
                    type=float,
                    help="Momentum for optimizer.")
| parser.add_argument("--seed", 47 | type=int, 48 | help="Seed used for pseudorandom number generation.") 49 | parser.add_argument("--tqdm-disable", 50 | action="store_true", 51 | help="Disables the training progress bar.") 52 | parser.add_argument("--write-checkpoint-to", 53 | type=str, 54 | help="Path to the file where checkpoint should be written") 55 | args = parser.parse_args() 56 | 57 | 58 | # no need to expose these as command line args 59 | DATA_DIR = pathlib.Path(args.data_dir) 60 | DEVICE = torch.device("cpu") if args.disable_gpu else torch.device("cuda") 61 | NUM_CLASSES = 10 62 | RESIZE_SIZE = 224 63 | 64 | 65 | # set up checkpointing 66 | if args.checkpoint_filepath is not None: 67 | CHECKPOINT_FILEPATH = pathlib.Path(args.checkpoint_filepath) 68 | else: 69 | CHECKPOINT_FILEPATH = None 70 | 71 | if args.write_checkpoint_to is not None: 72 | WRITE_CHECKPOINT_TO = pathlib.Path(args.write_checkpoint_to) 73 | else: 74 | WRITE_CHECKPOINT_TO = None 75 | 76 | # set seed for reproducibility 77 | if args.seed is not None: 78 | torch.manual_seed(args.seed) 79 | 80 | # create the train and test datasets 81 | _transform = transforms.Compose([ 82 | transforms.Resize(RESIZE_SIZE), 83 | transforms.ToTensor(), 84 | ]) 85 | train_dataset = datasets.CIFAR10(root=DATA_DIR, 86 | train=True, 87 | download=True, 88 | transform=_transform) 89 | 90 | test_dataset = datasets.CIFAR10(root=DATA_DIR, 91 | train=False, 92 | download=True, 93 | transform=_transform) 94 | 95 | # create the train and test dataloaders 96 | train_dataloader = (utils.data 97 | .DataLoader(train_dataset, 98 | batch_size=args.batch_size, 99 | shuffle=True, 100 | num_workers=args.dataloader_num_workers, 101 | persistent_workers=True, 102 | pin_memory=True, 103 | prefetch_factor=args.dataloader_prefetch_factor)) 104 | test_dataloader = (utils.data 105 | .DataLoader(test_dataset, 106 | batch_size=args.batch_size, 107 | shuffle=False, 108 | num_workers=args.dataloader_num_workers, 109 | persistent_workers=True, 110 | pin_memory=True, 111 | prefetch_factor=args.dataloader_prefetch_factor)) 112 | 113 | # define a model_fn, loss function, and an optimizer 114 | model_fn = models.resnet50(pretrained=False, 115 | num_classes=NUM_CLASSES) 116 | model_fn.to(DEVICE) 117 | loss_fn = nn.CrossEntropyLoss() 118 | optimizer = optim.SGD(model_fn.parameters(), 119 | lr=args.optimizer_learning_rate, 120 | momentum=args.optimizer_momentum) 121 | 122 | # load model checkpoint (if available) 123 | if CHECKPOINT_FILEPATH is not None: 124 | checkpoint_file = torch.load(CHECKPOINT_FILEPATH) 125 | model_fn.load_state_dict(checkpoint_file["model_state_dict"]) 126 | optimizer.load_state_dict(checkpoint_file["optimizer_state_dict"]) 127 | 128 | # train the model 129 | print("Training started...") 130 | for epoch in range(args.num_training_epochs): 131 | 132 | with tqdm(train_dataloader, unit="batch", disable=args.tqdm_disable) as tepoch: 133 | 134 | for (features, targets) in tepoch: 135 | tepoch.set_description(f"Epoch {epoch}") 136 | 137 | # zero the parameter gradients 138 | optimizer.zero_grad() 139 | 140 | # forward + backward + optimize 141 | predictions = model_fn(features.to(DEVICE)) 142 | loss = loss_fn(predictions, targets.to(DEVICE)) 143 | loss.backward() 144 | optimizer.step() 145 | 146 | if WRITE_CHECKPOINT_TO is not None: 147 | checkpoint = { 148 | "model_state_dict": model_fn.state_dict(), 149 | "optimizer_state_dict": optimizer.state_dict() 150 | } 151 | torch.save(checkpoint, WRITE_CHECKPOINT_TO) 152 | 153 | print("...training finished!") 154 

# compute the predictions on the test data
batch_targets = []
batch_predicted_targets = []

with torch.no_grad():
    for (features, targets) in test_dataloader:
        predicted_probs = model_fn(features.to(DEVICE))
        predicted_targets = predicted_probs.argmax(axis=1)
        batch_targets.append(targets)
        batch_predicted_targets.append(predicted_targets)

# generate a classification report
test_target = (torch.cat(batch_targets)
               .cpu())
test_predicted_targets = (torch.cat(batch_predicted_targets)
                          .cpu())

classification_report = metrics.classification_report(
    test_target,
    test_predicted_targets,
)
print(classification_report)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/KAUST-Academy/introduction-to-machine-learning/HEAD)

# Introduction to Recurrent Neural Networks

There is strong demand for deep learning (DL) skills and expertise to solve challenging business problems both globally and locally in KSA. This course will help learners build capacity in core DL tools and methods and enable them to develop their own applications that use recurrent neural networks. This course covers the basic theory behind RNN algorithms, but the majority of the focus is on hands-on examples using [PyTorch](https://pytorch.org/).

## Learning Objectives

The primary learning objective of this course is to provide students with practical, hands-on experience with state-of-the-art machine learning and deep learning tools that are widely used in industry.

This course covers portions of [Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) and [Machine Learning with PyTorch and Scikit-Learn](https://www.packtpub.com/product/machine-learning-with-pytorch-and-scikit-learn/9781801819312). The following topics will be discussed.

* Processing Sequences using Recurrent Neural Networks (RNNs)
* Natural Language Processing using Attention and Transformers

## Lessons

The lessons are organized into modules with the idea that they can be taught somewhat independently to accommodate specific audiences.

### Module 0: Recap of Deep Learning Fundamentals

### Module 1: [Introduction to Deep Learning, Part III](https://kaust-my.sharepoint.com/:p:/g/personal/pughdr_kaust_edu_sa/EQ-T0E5AcXJLsgYw0VPhomYBN4l1YLAc3hK0UXQUNZ4N9g?e=R5ZTCJ)

* Consolidation of previous days' content via Q/A and live coding demonstrations.
* The morning session will focus on the theory behind Recurrent Neural Networks (RNNs) by covering portions of [Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) and [Machine Learning with PyTorch and Scikit-Learn](https://www.packtpub.com/product/machine-learning-with-pytorch-and-scikit-learn/9781801819312).
* The afternoon session will focus on applying the techniques learned in the morning session using [PyTorch](https://pytorch.org/), followed by a short assessment on the Kaggle data science competition platform.

| **Tutorial** | **Open in Google Colab** | **Open in Kaggle** |
|--------------|:------------------------:|:------------------:|
| Univariate Time Series Forecasting with RNNs | [![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KAUST-Academy/introduction-to-recurrent-neural-networks/blob/master/notebooks/03a-forecasting-univariate-time-series-with-rnns.ipynb) | [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/KAUST-Academy/introduction-to-recurrent-neural-networks/blob/master/notebooks/03a-forecasting-univariate-time-series-with-rnns.ipynb) |
| Multivariate Time Series Forecasting with RNNs | [![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KAUST-Academy/introduction-to-recurrent-neural-networks/blob/master/notebooks/03b-forecasting-multivariate-time-series-with-rnns.ipynb) | [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/KAUST-Academy/introduction-to-recurrent-neural-networks/blob/master/notebooks/03b-forecasting-multivariate-time-series-with-rnns.ipynb) |

### Module 2: [Transformers and Attention](https://kaust-my.sharepoint.com/:p:/g/personal/pughdr_kaust_edu_sa/EZUsh7VkuIlIhfp0KmKA61EBnb6cLPlEoVsicJmkZNRr9w?e=J34Jrt)

* Consolidation of previous days' content via Q/A and live coding demonstrations.
* The morning session will focus on the theory behind Transformers and Attention by covering portions of [Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) and [Machine Learning with PyTorch and Scikit-Learn](https://www.packtpub.com/product/machine-learning-with-pytorch-and-scikit-learn/9781801819312).
* The afternoon session will focus on applying the techniques learned in the morning session using [PyTorch](https://pytorch.org/), followed by a short assessment on the Kaggle data science competition platform.

## Assessment

Student performance on the course will be assessed through participation in a Kaggle classroom competition.

# Repository Organization

Repository organization is based on ideas from [_Good Enough Practices for Scientific Computing_](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510).

1. Put each project in its own directory, which is named after the project.
2. Put external scripts or compiled programs in the `bin` directory.
3. Put raw data and metadata in a `data` directory.
4. Put text documents associated with the project in the `doc` directory.
5. Put all Docker related files in the `docker` directory.
6. Install the Conda environment into an `env` directory.
7. Put all notebooks in the `notebooks` directory.
8. Put files generated during cleanup and analysis in a `results` directory.
9. Put project source code in the `src` directory.
10. Name all files to reflect their content or function.

## Building the Conda environment

After adding any necessary dependencies that should be downloaded via `conda` to the
`environment.yml` file and any dependencies that should be downloaded via `pip` to the
`requirements.txt` file, you can create the Conda environment in a sub-directory `./env` of your
project directory by running the following commands.

```bash
export ENV_PREFIX=$PWD/env
mamba env create --prefix $ENV_PREFIX --file environment.yml --force
```

Once the new environment has been created you can activate the environment with the following
command.

```bash
conda activate $ENV_PREFIX
```

Note that the `ENV_PREFIX` directory is *not* under version control as it can always be re-created as
necessary.

For your convenience these commands have been combined in a shell script `./bin/create-conda-env.sh`.
Running the shell script will create the Conda environment, activate the Conda environment, and build
JupyterLab with any additional extensions. The script should be run from the project root directory
as follows.

```bash
./bin/create-conda-env.sh
```

### Ibex

The most efficient way to build Conda environments on Ibex is to launch the environment creation script
as a job on the debug partition via Slurm. For your convenience a Slurm job script
`./bin/create-conda-env.sbatch` is included. The script should be run from the project root directory
as follows.

```bash
sbatch ./bin/create-conda-env.sbatch
```

### Listing the full contents of the Conda environment

The explicit dependencies for the project are listed in the `environment.yml` file. To see
the full list of packages installed into the environment run the following command.

```bash
conda list --prefix $ENV_PREFIX
```

### Updating the Conda environment

If you add (remove) dependencies to (from) the `environment.yml` file or the `requirements.txt` file
after the environment has already been created, then you can re-create the environment with the
following command.

```bash
mamba env create --prefix $ENV_PREFIX --file environment.yml --force
```

## Using Docker

In order to build Docker images for your project and run containers with GPU acceleration you will
need to install
[Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/),
[Docker Compose](https://docs.docker.com/compose/install/) and the
[NVIDIA Docker runtime](https://github.com/NVIDIA/nvidia-docker).

Detailed instructions for using Docker to build an image and launch containers can be found in
`docker/README.md`.
--------------------------------------------------------------------------------
/bin/README.md:
--------------------------------------------------------------------------------
## Creating the Conda environment

For your convenience the commands to create the Conda environment have been combined in a shell script. The script should be run from the project root directory as follows.

```bash
./bin/create-conda-env.sh
```

## Launching a job via Slurm to create the Conda environment

While running the shell script above on a login node will create the Conda environment, you may prefer to launch a job via Slurm
to create the Conda environment. If you lose your connection to the Ibex login node whilst your Conda environment script is running,
the environment will be left in an inconsistent state and you will need to start over. Depending on the load on the Ibex login nodes,
launching a job via Slurm to create your Conda environment can also be faster.
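
Once you have submitted the job (the `sbatch` command for doing so is given below), you can keep an eye on it using standard Slurm tooling. A minimal sketch (`create-conda-env` is the job name set by the `--job-name` header in `./bin/create-conda-env.sbatch`):

```bash
# list your pending and running jobs
squeue -u $USER

# show the state, elapsed time, and exit code of the environment-creation job
sacct --name=create-conda-env --format=JobID,JobName,State,Elapsed,ExitCode
```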

For your convenience the commands to launch a job via Slurm to create the Conda environment have been combined into a job script. The script should be run from the project root directory as follows.

```bash
sbatch ./bin/create-conda-env.sbatch
```

## Launching a Jupyter server for interactive work

The job script `launch-jupyter-server.sbatch` launches a Jupyter server for interactive prototyping. To launch a JupyterLab server,
use `sbatch` to submit the job script by running the following command from the project root directory.

```bash
sbatch ./bin/launch-jupyter-server.sbatch
```

If you prefer the classic Jupyter Notebook interface, then you can launch the Jupyter Notebook server with the following command in
the project root directory.

```bash
sbatch ./bin/launch-jupyter-server.sbatch notebook
```

Once the job has started, you can inspect the `./bin/launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file where you will find
instructions on how to access the server from the browser on your local machine.

### SSH tunneling between your local machine and Ibex compute node(s)

To connect to the compute node on Ibex running your Jupyter server, you need to create an SSH tunnel from your local machine
to a login node on Ibex using a command similar to the following.

```
ssh -L ${JUPYTER_PORT}:${IBEX_NODE}:${JUPYTER_PORT} ${USER}@glogin.ibex.kaust.edu.sa
```

The exact command for your job can be copied from the `./bin/launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file.

### Accessing the Jupyter server from your local machine

Once you have set up your SSH tunnel, in order to access the Jupyter server from your local machine you need to copy the
second URL provided in the Jupyter server logs in the `launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file and paste it into
the browser on your local machine. The URL will look similar to the following.

```
http://127.0.0.1:$JUPYTER_PORT/lab?token=$JUPYTER_TOKEN
```

The exact URL for your job containing both your assigned `$JUPYTER_PORT` as well as your specific `$JUPYTER_TOKEN` can
be copied from the `launch-jupyter-server-$SLURM_JOB_ID-slurm.err` file.

## Launching a VS Code server for development work

The job script `launch-code-server.sbatch` launches a Microsoft Visual Studio (VS) Code server for development work. In order to
use VS Code server, you will first need to install the server package in your Ibex home directory following the instructions
provided on [GitHub](https://github.com/kaust-rccl/ibex-code-server-install). Once you have installed VS Code server, you can
launch a server by running the following command from the project root directory.

```bash
sbatch ./bin/launch-code-server.sbatch
```

Once the job has started, you can inspect the `./bin/launch-code-server-$SLURM_JOB_ID-slurm.err` file where you will find
instructions on how to access the server from the browser on your local machine.

### SSH tunneling between your local machine and Ibex compute node(s)

To connect to the compute node on Ibex running your VS Code server, you need to create an SSH tunnel from your local machine
to a login node on Ibex using a command similar to the following.

```
ssh -L ${CODE_SERVER_PORT}:${COMPUTE_NODE}:${CODE_SERVER_PORT} ${USER}@glogin.ibex.kaust.edu.sa
```

The exact command for your job can be copied from the `./bin/launch-code-server-$SLURM_JOB_ID-slurm.err` file.

### Accessing the VS Code server from your local machine

Once you have set up your SSH tunnel, in order to access the VS Code server from your local machine you need to copy the
second URL provided in the `launch-code-server-$SLURM_JOB_ID-slurm.err` file and paste it into the browser on your local
machine. The URL will look similar to the following.

```
localhost:$CODE_SERVER_PORT
```

The exact URL for your job containing your assigned `$CODE_SERVER_PORT` can be copied from the
`launch-code-server-$SLURM_JOB_ID-slurm.err` file.

## Launching a training job via Slurm

The `src` directory contains an example training script, `train.py`, that trains a classification pipeline on the
[CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset. You can launch this training script as a batch
job on Ibex via Slurm using the following command in the project root directory.

```bash
./bin/launch-train.sh
```

### The `./bin/launch-train.sh` script

Wrapping your job submission inside a `launch-train.sh` script is an Ibex "best practice" that will help you automate more
complex machine learning workflows. In particular, the companion script `./bin/launch-checkpoint-and-resubmit.sh` extends
this idea by breaking up a single training job into training periods (where each training period consists of one or more
training epochs) and launching a sequence of dependent jobs, one per training period. Breaking up large training jobs into
smaller jobs can dramatically improve your job throughput!

### The `./bin/train.sbatch` script

The script `./bin/train.sbatch` is the actual Slurm job script. This script can be broken down into several parts that
are common to all machine learning jobs on Ibex.

#### Request resources from Slurm

You will request resources for your job using Slurm headers. It is important to request a balanced allocation of CPUs, GPUs,
and CPU memory in order to ensure good overall job performance. You should typically request resources that are roughly
proportional to the number of GPUs you are requesting. Most of our nodes have 8 V100 GPUs, 48 CPU cores, and 748 GB of CPU
memory. The headers below request 4 Intel CPU cores per NVIDIA V100 GPU and 64G of CPU memory for 2 hours.

```bash
#!/bin/bash --login
#SBATCH --time=2:00:00
#SBATCH --gpus-per-node=v100:1
#SBATCH --cpus-per-gpu=4
#SBATCH --mem=64G
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --output=results/%x/%j-slurm.out
#SBATCH --error=results/%x/%j-slurm.err
```

#### Activate the Conda environment

Activating the Conda environment is done in the usual way; however, it is critical for the job script to run inside a
login shell in order for the `conda activate` command to work as expected (this is why the first line of the job script
is `#!/bin/bash --login`). It is also a good practice to purge any modules that you might have loaded prior to launching
the training job. Note that this script expects the first argument to be the path to the Conda environment.

```bash
module purge
conda activate "$1"
```
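
For reference, this positional convention is exactly how `./bin/launch-train.sh` (shown earlier in this repository) invokes the job script: the Conda environment prefix comes first, followed by the training script and its arguments.

```bash
sbatch --job-name "$JOB_NAME" --cpus-per-gpu $CPUS_PER_GPU \
    "$PROJECT_DIR"/bin/train.sbatch "$ENV_PREFIX" \
    "$PROJECT_DIR"/src/train-argparse.py \
        --dataloader-num-workers $CPUS_PER_GPU \
        --data-dir "$DATA_DIR" \
        --num-training-epochs 10 \
        --output-dir "$JOB_RESULTS_DIR" \
        --tqdm-disable
```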

#### Starting the NVDashboard monitoring server

After activating the Conda environment, but before launching your training script, you should start the
NVDashboard monitoring server to run in the background using `srun` to reserve a free port. The
`./bin/launch-nvdashboard-server.srun` script launches the monitoring server and logs the assigned
port to the `slurm.err` file for the job.

```bash
# use srun to launch NVDashboard server in order to reserve a port
srun --resv-ports=1 ./bin/launch-nvdashboard-server.srun &
NVDASHBOARD_PID=$!
```

#### Launch a Python training script

Finally, you launch the training job! Note that we use the special Bash variable `"${@:2}"` to refer to all of the
command line arguments (other than the first!) passed to the Slurm job script. This allows you to reuse the same Slurm
job script for other training jobs!

```bash
python "${@:2}"
```

#### Stopping the NVDashboard monitoring server

Once the training script has finished running, you should stop the NVDashboard server so that your job exits. If
you do not stop the server, the job will continue to run until it reaches its time limit (which wastes resources).

```bash
# shutdown the NVDashboard server
kill $NVDASHBOARD_PID
```
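
Because `train.sbatch` runs with `set -e`, a failure in the training script terminates the job script before the explicit `kill` is reached, leaving the server running until the time limit. One way to guard against this is to register the cleanup with a `trap` instead; a minimal sketch of the relevant fragment of the job script, under that assumption:

```bash
# use srun to launch NVDashboard server in order to reserve a port
srun --resv-ports=1 ./bin/launch-nvdashboard-server.srun &
NVDASHBOARD_PID=$!

# shut down the NVDashboard server on any exit (success or failure) so that
# a failed training run does not leave the job idling until its time limit
trap 'kill $NVDASHBOARD_PID' EXIT

# launch the training script
python "${@:2}"
```
--------------------------------------------------------------------------------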