├── scripts ├── nscc │ ├── multi_node │ │ ├── experiments │ │ │ ├── scripts │ │ │ │ ├── sshcont │ │ │ │ │ ├── testnodes │ │ │ │ │ ├── testnodes_dgx │ │ │ │ │ ├── invocation │ │ │ │ │ ├── container_slaves_tensorflow.sh │ │ │ │ │ ├── tensorflow.sh │ │ │ │ │ ├── job_tensorflow_gloo.sh │ │ │ │ │ └── ssh_and_run.py │ │ │ │ └── dropbear │ │ │ │ │ ├── dropbear_dss_host_key │ │ │ │ │ ├── dropbear_rsa_host_key │ │ │ │ │ ├── dropbear_ecdsa_host_key │ │ │ │ │ ├── dropbear_ed25519_host_key │ │ │ │ │ └── start-dropbear.sh │ │ │ ├── test │ │ │ │ ├── test_mpi.sh │ │ │ │ └── comm.py │ │ │ └── apps │ │ │ │ └── software │ │ │ │ └── dropbear │ │ │ │ └── 2020.80 │ │ │ │ ├── bin │ │ │ │ ├── dbclient │ │ │ │ ├── dropbearkey │ │ │ │ └── dropbearconvert │ │ │ │ ├── sbin │ │ │ │ └── dropbear │ │ │ │ └── share │ │ │ │ └── man │ │ │ │ ├── man1 │ │ │ │ ├── dropbearconvert.1 │ │ │ │ ├── dropbearkey.1 │ │ │ │ └── dbclient.1 │ │ │ │ └── man8 │ │ │ │ └── dropbear.8 │ │ └── pbs_script │ │ │ ├── start_jupyter.sh │ │ │ └── multinode-jupyter-non-dev.pbs │ └── monitor_job │ │ ├── monitor.sh │ │ └── nscc_monitor.py ├── cscs │ ├── cscs-pytorch.slurm │ ├── cscs-spark.slurm │ └── cscs-pytorch-gloo.slurm ├── tacc_longhorn │ ├── multicard-node-longhorn.slurm │ ├── cp_imagenet_to_temp_bin.sh │ └── multicard-node-longhorn-large.slurm └── tacc_frontera │ ├── frontera-test.slurm │ ├── frontera-btest.slurm │ └── frontera-gloo.slurm ├── docs ├── doc_figures │ ├── cscs │ │ ├── globus.jpg │ │ ├── jupyter_step1.jpg │ │ ├── jupyter_step2.jpg │ │ ├── jupyter_step3.jpg │ │ └── cscs_project_information.jpg │ └── nscc │ │ └── NSCC_jupyter.png ├── initialize │ ├── cluster_software.md │ └── vm_software.md ├── get_started │ ├── hardware_keywords.md │ └── sysinfo.md ├── hpc_system_notes │ ├── nus_apollo.md │ ├── general.md │ ├── tacc_longhorn.md │ ├── nscc.md │ ├── cscs.md │ └── tacc_frontera.md └── resources │ └── dl_libraries.md ├── README.md └── .gitignore /scripts/nscc/multi_node/experiments/scripts/sshcont/testnodes: -------------------------------------------------------------------------------- 1 | ntu01 2 | ntu02 3 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/testnodes_dgx: -------------------------------------------------------------------------------- 1 | dgx4106 2 | dgx4105 3 | -------------------------------------------------------------------------------- /docs/doc_figures/cscs/globus.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/globus.jpg -------------------------------------------------------------------------------- /docs/doc_figures/cscs/jupyter_step1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/jupyter_step1.jpg -------------------------------------------------------------------------------- /docs/doc_figures/cscs/jupyter_step2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/jupyter_step2.jpg -------------------------------------------------------------------------------- /docs/doc_figures/cscs/jupyter_step3.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/jupyter_step3.jpg -------------------------------------------------------------------------------- /docs/doc_figures/nscc/NSCC_jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/nscc/NSCC_jupyter.png -------------------------------------------------------------------------------- /docs/doc_figures/cscs/cscs_project_information.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/cscs_project_information.jpg -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/test/test_mpi.sh: -------------------------------------------------------------------------------- 1 | python /home/users/ntu/c170166/scratch/projects/dl-auto-load-balance/auto-ml-load-balance/scripts/nscc/jupyter/mpi_testing/comm.py -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_dss_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_dss_host_key -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_rsa_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_rsa_host_key -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ecdsa_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ecdsa_host_key -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dbclient: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dbclient -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/sbin/dropbear: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/sbin/dropbear -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ed25519_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ed25519_host_key -------------------------------------------------------------------------------- 
/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearkey: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearkey -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearconvert: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearconvert -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/start-dropbear.sh: -------------------------------------------------------------------------------- 1 | ROOT="${HOME}/scripts/dropbear" 2 | 3 | exec "${HOME}/apps/software/dropbear/2020.80/sbin/dropbear" -r "${ROOT}/dropbear_dss_host_key" \ 4 | -r "${ROOT}/dropbear_ecdsa_host_key" -r "${ROOT}/dropbear_ed25519_host_key" \ 5 | -r "${ROOT}/dropbear_rsa_host_key" -FE -p 41017 6 | -------------------------------------------------------------------------------- /scripts/nscc/monitor_job/monitor.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | job=${1:-""} 4 | email=${2:-"c170166@e.ntu.edu.sg"} 5 | interval=${3:-"600"} 6 | 7 | 8 | source ~/anaconda3/bin/activate base 9 | nohup python /home/users/ntu/c170166/pbs_scripts/job_monitor/nscc_monitor.py --job $job --email $email --interval $interval > ./nohup.log 2>&1 & 10 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/invocation: -------------------------------------------------------------------------------- 1 | 2 | # load dropbear: add dropbear bin and sbin to system path 3 | export PATH=$HOME/apps/software/dropbear/2020.80/bin:$HOME/apps/software/dropbear/2020.80/sbin:$PATH 4 | 5 | echo $PATH 6 | 7 | ~/scripts/sshcont/ssh_and_run.py ~/scripts/sshcont/tensorflow.sh ~/scripts/sshcont/job_tensorflow_gloo.sh 8 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/pbs_script/start_jupyter.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /home/users/ntu/c170166/anaconda3/bin/activate base 4 | DATESTAMP=`date +'%y%m%d%H%M%S'` 5 | LOGFILE=./jupyter.o$DATESTAMP 6 | jupyter lab --no-browser --ip=0.0.0.0 --port=8888 \ 7 | --NotebookApp.terminado_settings='{"shell_command": ["/bin/bash"]}' \ 8 | --NotebookApp.allow_remote_access=True --FileContentsManager.delete_to_trash=False |& tee $LOGFILE 9 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/container_slaves_tensorflow.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | # replace with your favourite container image 3 | # IMAGE_NAME="/home/projects/ai/singularity/nvcr.io/nvidia/tensorflow:20.02-tf1-py3.sif" 4 | IMAGE_NAME="/home/project/ai/singularity/nvcr.io/nvidia/pytorch:latest-py3.sif" 5 | pbs-attach "${PBS_JOBID}" 6 | singularity run --nv ${NTUHPC_CONT_EXTRA_ARGS} -e "${IMAGE_NAME}" "${HOME}/scripts/dropbear/start-dropbear.sh" 7 | 
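# (Added notes -- a sketch of how this script fits together, based on the other files in scripts/sshcont and scripts/dropbear:)
# - It is launched once per node by tensorflow.sh via mpirun, with NTUHPC_CONT_EXTRA_ARGS exported there.
# - pbs-attach is assumed to attach this process to the running PBS job on the node.
# - The singularity command then starts a Dropbear SSH server (start-dropbear.sh, port 41017) inside the
#   container, giving ssh_and_run.py and horovodrun an SSH endpoint on every worker node.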
-------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/test/comm.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import horovod.torch as hvd 4 | import socket 5 | 6 | hvd.init() 7 | 8 | os.environ['MASTER_ADDR']="localhost" 9 | os.environ['MASTER_PORT']="6002" 10 | 11 | hostname = socket.gethostname() 12 | 13 | # rank = os.getenv('OMPI_COMM_WORLD_RANK', '0') 14 | rank = hvd.local_rank() 15 | world = hvd.size() 16 | # world = os.getenv("OMPI_COMM_WORLD_SIZE", '1') 17 | 18 | print("rank: {}, world size: {}, hostname: {}".format(rank, world, hostname)) 19 | print("trying to initliaze dist") 20 | torch.distributed.init_process_group( 21 | backend="nccl", 22 | world_size=world, rank=rank) 23 | print("init successful on hostname: {}".format(hostname)) -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/tensorflow.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT 3 | 4 | IMAGE_NAME="/home/project/ai/singularity/nvcr.io/nvidia/pytorch:latest-py3.sif" 5 | export NTUHPC_CONT_EXTRA_ARGS="--bind ${NTUHPC_SSH_DIR}:$HOME/.ssh" 6 | echo "Binding ${NTUHPC_CONT_EXTRA_ARGS}" 7 | 8 | pbsdsh hostname 9 | `which mpirun` --bind-to none --tag-output -x NTUHPC_CONT_EXTRA_ARGS \ 10 | -H "${NTUHPC_OPENMPI_HOSTSPEC}" -N 1 -np "${NTUHPC_OPENMPI_HOSTCOUNT}" \ 11 | "${HOME}/scripts/sshcont/container_slaves_tensorflow.sh" & 12 | 13 | echo "Waiting 180s for SSH servers to be up" 14 | sleep 180s 15 | echo "Logging in" 16 | 17 | echo $@ 18 | singularity run --nv ${NTUHPC_CONT_EXTRA_ARGS} "${IMAGE_NAME}" $@ 19 | -------------------------------------------------------------------------------- /scripts/cscs/cscs-pytorch.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | 3 | #SBATCH --job-name=my_cscs_job 4 | #SBATCH --time=00:10:00 5 | #SBATCH --nodes=2 6 | #SBATCH --ntasks-per-core=1 7 | #SBATCH --ntasks-per-node=1 8 | #SBATCH --cpus-per-task=12 9 | #SBATCH --constraint=gpu 10 | #SBATCH --account= 11 | 12 | module load daint-gpu 13 | module load PyTorch 14 | 15 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 16 | 17 | # Environment variables needed by the NCCL backend 18 | export NCCL_DEBUG=INFO 19 | export NCCL_IB_HCA=ipogif0 20 | export NCCL_IB_CUDA_SUPPORT=1 21 | 22 | . 
/users/your_account/miniconda3/bin/activate pt38 23 | 24 | cd your_code_path 25 | 26 | srun \ 27 | python -u your_code.py \ 28 | --epochs 90 \ 29 | --model resnet50 \ 30 | > ${SLURM_JOBID}.out 2> ${SLURM_JOBID}.err 31 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/job_tensorflow_gloo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -li 2 | 3 | # DATESTAMP=`date +'%y%m%d%H%M%S'` 4 | 5 | # Edit this 6 | RUN_SCRIPT=/home/users/ntu/c170166/scratch/projects/dl-auto-load-balance/auto-ml-load-balance/scripts/nscc/jupyter/mpi_testing/test_mpi.sh 7 | # RUN_SCRIPT=/home/users/ntu/c170166/scratch/projects/dl-auto-load-balance/auto-ml-load-balance/scripts/nscc/jupyter/pretrain_bert_mpi.sh 8 | 9 | # Clear environment and run 10 | env -i - "/home/users/ntu/c170166/.local/bin/horovodrun" \ 11 | -H "${NTUHPC_OPENMPI_HOSTSPEC}" -np "${NTUHPC_OPENMPI_SLOTCOUNT}" --gloo $RUN_SCRIPT 12 | 13 | 14 | #env -i - "/home/users/ntu/c170166/.local/bin/horovodrun" \ 15 | # -H "${NTUHPC_OPENMPI_HOSTSPEC}" -np "${NTUHPC_OPENMPI_SLOTCOUNT}" --gloo $GLUE_SCRIPT 16 | -------------------------------------------------------------------------------- /scripts/tacc_longhorn/multicard-node-longhorn.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob.o%j # Name of stdout output file 5 | #SBATCH -e myjob.e%j # Name of stderr error file 6 | #SBATCH -p v100 # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | cd /scratch/07801/nusbin20/tacc-our 18 | module load conda 19 | conda activate py36pt 20 | 21 | ibrun -np 8 \ 22 | python examples/pytorch_imagenet_resnet.py \ 23 | --epochs 90 \ 24 | --model resnet50 25 | -------------------------------------------------------------------------------- /scripts/tacc_longhorn/cp_imagenet_to_temp_bin.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copy imagenet tar file to /tmp/ and extract. 4 | # Use --tiny to copy a smaller version of imagenet instead. 5 | # Produces directories: /tmp/imagenet/ILSVRC2012_img_{train,val} 6 | 7 | tiny=${tiny:-false} 8 | 9 | while [ $# -gt 0 ]; do 10 | if [[ $1 == *"--"* ]]; then 11 | param="${1/--/}" 12 | declare $param=true 13 | fi 14 | shift 15 | done 16 | 17 | DEST=/tmp 18 | if [ "$tiny" == true ]; then 19 | SOURCE=/scratch/07801/nusbin20/imagenet-tar/imagenet-tiny.tar 20 | else 21 | SOURCE=/scratch/07801/nusbin20/imagenet-tar/imagenet-1k.tar 22 | fi 23 | 24 | echo Copying $SOURCE to $DEST 25 | cp $SOURCE $DEST/imagenet.tar 26 | echo "Done copying. Extracting" 27 | mkdir $DEST/imagenet 28 | tar xf $DEST/imagenet.tar -C $DEST/imagenet 29 | echo "Done." 
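# (Added note) Usage sketch -- run once per node, e.g. via "mpiexec -hostfile /tmp/hostfile -N 1"
# as in multicard-node-longhorn-large.slurm:
#   bash cp_imagenet_to_temp_bin.sh          # copy and extract the full ImageNet tar
#   bash cp_imagenet_to_temp_bin.sh --tiny   # copy and extract the smaller imagenet-tiny tar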
30 | -------------------------------------------------------------------------------- /scripts/tacc_frontera/frontera-test.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob.o%j # Name of stdout output file 5 | #SBATCH -e myjob.e%j # Name of stderr error file 6 | #SBATCH -p rtx # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:10:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | source ~/python-env/cuda10-home/bin/activate 18 | 19 | cd /scratch1/07801/nusbin20/tacc-our 20 | module load intel/18.0.5 impi/18.0.5 21 | module load cuda/10.1 cudnn nccl 22 | 23 | ibrun -np 8 \ 24 | python pytorch_imagenet_resnet.py \ 25 | --epochs 90 \ 26 | --model resnet50 27 | 28 | -------------------------------------------------------------------------------- /scripts/cscs/cscs-spark.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name="spark" 4 | #SBATCH --time=00:30:00 5 | #SBATCH --nodes=4 6 | #SBATCH --constraint=gpu 7 | #SBATCH --output=sparkjob.%j.log 8 | #SBATCH --account= 9 | 10 | # set some variables for Spark 11 | export SPARK_WORKER_CORES=12 12 | export SPARK_LOCAL_DIRS="/tmp" 13 | 14 | # load modules 15 | module load slurm 16 | module load daint-gpu 17 | module load Spark 18 | 19 | # deploy of spark 20 | start-all.sh 21 | 22 | # some extra Spark configuration 23 | SPARK_CONF=" 24 | --conf spark.default.parallelism=10 25 | --conf spark.executor.cores=8 26 | --conf spark.executor.memory=15g 27 | " 28 | 29 | # submit a Spark job 30 | spark-submit ${SPARK_CONF} --master $SPARKURL \ 31 | --class org.apache.spark.examples.SparkPi \ 32 | $EBROOTSPARK/examples/jars/spark-examples_2.11-2.4.7.jar 10000; 33 | 34 | # clean out Spark deployment 35 | stop-all.sh -------------------------------------------------------------------------------- /scripts/cscs/cscs-pytorch-gloo.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | 3 | #SBATCH --job-name=gloo-eb 4 | #SBATCH --time=00:30:00 5 | #SBATCH --nodes=4 6 | #SBATCH --ntasks-per-node=1 7 | #SBATCH --constraint=gpu 8 | #SBATCH --account= 9 | #SBATCH --output=test_pt_hvd_%j.out 10 | 11 | module load daint-gpu PyTorch 12 | 13 | . 
/users/lyongbin/miniconda3/bin/activate your_env_name 14 | 15 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 16 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 17 | 18 | for node in $(scontrol show hostnames); do 19 | HOSTS="$HOSTS$node:$SLURM_NTASKS_PER_NODE," 20 | done 21 | HOSTS=${HOSTS%?} # trim trailing comma 22 | echo HOSTS $HOSTS 23 | 24 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ipogif0 \ 25 | --start-timeout 120 --gloo-timeout-seconds 120 \ 26 | python -u your_code.py \ 27 | --epochs 90 \ 28 | --model resnet50 29 | -------------------------------------------------------------------------------- /scripts/tacc_longhorn/multicard-node-longhorn-large.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob.o%j # Name of stdout output file 5 | #SBATCH -e myjob.e%j # Name of stderr error file 6 | #SBATCH -p v100 # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | cd /scratch/07801/nusbin20/tacc-our 18 | module load conda 19 | conda activate py36pt 20 | 21 | scontrol show hostnames $SLURM_NODELIST > /tmp/hostfile 22 | 23 | cat /tmp/hostfile 24 | 25 | mpiexec -hostfile /tmp/hostfile -N 1 ./cp_imagenet_to_temp_bin.sh 26 | 27 | ibrun -np 8 \ 28 | python examples/pytorch_imagenet_resnet.py \ 29 | --epochs 90 \ 30 | --model resnet50 31 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # oh-my-server 2 | 3 | This repository aims to provide some guidance on creating and managing your own server and cluster. It includes some 4 | personal debugging experiences on different HPC systems as well. 5 | 6 | ## Quick Reference 7 | 8 | - Get Started 9 | - [Understanding Your System](docs/get_started/sysinfo.md) 10 | - [Hardware Keywords](docs/get_started/hardware_keywords.md) 11 | - Initialization 12 | - [Initializing Your Server](docs/initialize/vm_software.md) 13 | - [Initializing Your Cluster](docs/initialize/cluster_software.md) 14 | - HPC System Notes 15 | - [General Notes](docs/hpc_system_notes/general.md) 16 | - [NSCC AI System](docs/hpc_system_notes/nscc.md) 17 | - [TACC Longhorn](docs/hpc_system_notes/tacc_longhorn.md) 18 | - [TACC Frontera](docs/hpc_system_notes/tacc_frontera.md) 19 | - [CSCS](docs/hpc_system_notes/cscs.md) 20 | - [NUS Apollo](docs/hpc_system_notes/nus_apollo.md) 21 | - Resources 22 | - [Awesome Deep Learning Libraries](docs/resources/dl_libraries.md) 23 | 24 | -------------------------------------------------------------------------------- /docs/initialize/cluster_software.md: -------------------------------------------------------------------------------- 1 | # Software for Cluster 2 | 3 | ## Table Of Contents 4 | 5 | - [Ansible](#ansible) 6 | - [BeeGFS](#beegfs) 7 | 8 | ## Ansible 9 | 10 | Ansible is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates. 
Ansible does not require you to deploy on every node and it can access other nodes from master node via ssh and execute operations on other nodes automatically. For example, if you wish to install CUDA on your execute nodes, you can just write a simple Ansible playbook script and specify your execute nodes IPs. Then Ansible will access these execute nodes via ssh and install CUDA on each node as instructed in your script. This avoids the trouble of installing on each node manually. 11 | 12 | > [Ansible Documentation](https://docs.ansible.com/ansible/latest/index.html) 13 | 14 | ## BeeGFS 15 | 16 | BeeGFS is a distributed optimized parallel file system. 17 | 18 | > [BeeGFS documentation](https://www.beegfs.io/c/) 19 | -------------------------------------------------------------------------------- /scripts/tacc_frontera/frontera-btest.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o lars-t.o%j # Name of stdout output file 5 | #SBATCH -e lars-t.e%j # Name of stderr error file 6 | #SBATCH -p rtx # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:20:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | source ~/python-env/cuda10-home/bin/activate 18 | export FS_ROOT=/tmp/fs_`id -u` 19 | ibrun -np 2 /work/00410/huang/share/read_remote_file 16 /scratch1/07801/nusbin20/imagenet-16parts & sleep 500 20 | 21 | module load intel/18.0.5 impi/18.0.5 22 | module load cuda/10.1 cudnn nccl 23 | 24 | cd /scratch1/07801/nusbin20/tacc-our 25 | 26 | ibrun -np 8 LD_PRELOAD=/work/00410/huang/share/wrapper.so python pytorch_imagenet_resnet.py --epochs 90 --model resnet50 --train-dir=/tmp/fs_871009/ILSVRC2012_img_train --val-dir=/tmp/fs_871009/ILSVRC2012_img_val -------------------------------------------------------------------------------- /scripts/tacc_frontera/frontera-gloo.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob-t.o%j # Name of stdout output file 5 | #SBATCH -e myjob-t.e%j # Name of stderr error file 6 | #SBATCH -p rtx # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 10 | 11 | pwd 12 | date 13 | 14 | source ~/python-env/cuda10-home/bin/activate 15 | 16 | module load intel/18.0.5 impi/18.0.5 17 | module load cuda/10.1 cudnn nccl 18 | 19 | cd /file_path/ 20 | 21 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 22 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 23 | 24 | # 4 means $SLURM_NTASKS_PER_NODE 25 | for node in $(scontrol show hostnames); do 26 | HOSTS="$HOSTS$node:4," 27 | done 28 | HOSTS=${HOSTS%?} # trim trailing comma 29 | echo HOSTS $HOSTS 30 | 31 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ib0 \ 32 | --start-timeout 120 --gloo-timeout-seconds 120 \ 33 | python you_file.py \ 34 | --epochs 90 \ 35 | --model resnet50 36 | 37 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man1/dropbearconvert.1: 
-------------------------------------------------------------------------------- 1 | .TH dropbearconvert 1 2 | .SH NAME 3 | dropbearconvert \- convert between Dropbear and OpenSSH private key formats 4 | .SH SYNOPSIS 5 | .B dropbearconvert 6 | .I input_type 7 | .I output_type 8 | .I input_file 9 | .I output_file 10 | .SH DESCRIPTION 11 | .B Dropbear 12 | and 13 | .B OpenSSH 14 | SSH implementations have different private key formats. 15 | .B dropbearconvert 16 | can convert between the two. 17 | .P 18 | Dropbear uses the same SSH public key format as OpenSSH, it can be extracted 19 | from a private key by using 20 | .B dropbearkey \-y 21 | .P 22 | Encrypted private keys are not supported, use ssh-keygen(1) to decrypt them 23 | first. 24 | .SH ARGUMENTS 25 | .TP 26 | .I input_type 27 | Either 28 | .I dropbear 29 | or 30 | .I openssh 31 | .TP 32 | .I output_type 33 | Either 34 | .I dropbear 35 | or 36 | .I openssh 37 | .TP 38 | .I input_file 39 | An existing Dropbear or OpenSSH private key file 40 | .TP 41 | .I output_file 42 | The path to write the converted private key file. For client authentication ~/.ssh/id_dropbear is loaded by default 43 | .SH EXAMPLE 44 | # dropbearconvert openssh dropbear ~/.ssh/id_rsa ~/.ssh/id_dropbear 45 | .SH AUTHOR 46 | Matt Johnston (matt@ucc.asn.au). 47 | .SH SEE ALSO 48 | dropbearkey(1), ssh-keygen(1) 49 | .P 50 | https://matt.ucc.asn.au/dropbear/dropbear.html 51 | -------------------------------------------------------------------------------- /docs/get_started/hardware_keywords.md: -------------------------------------------------------------------------------- 1 | # Hardware Keywords 2 | 3 | ## PCIe 4 | 5 | [PCIe](https://en.wikipedia.org/wiki/PCI_Express) stands for "Peripheral Component Interconnect Express". It is the common motherboard interface for personal computers' graphics cards, hard disk drive host adapters, SSDs, and Wi-Fi and Ethernet hardware connections. The most common PCIe standards are PCIe 3.0 and 4.0: PCIe 3.0 provides 8.0 GT/s per lane (roughly 8 GB/s over an 8-lane link), while PCIe 4.0 doubles this to 16.0 GT/s per lane (roughly 16 GB/s over 8 lanes). 6 | 7 | ## InfiniBand 8 | 9 | [InfiniBand](https://en.wikipedia.org/wiki/InfiniBand), normally abbreviated as IB, is commonly seen in high-performance computing systems. InfiniBand is a computer networking communication standard featuring very high throughput and very low latency. It can be used as an interconnect within servers, among servers, between servers and storage systems, and between storage systems. 4-link (4x) EDR and HDR InfiniBand provide 100 Gb/s and 200 Gb/s of throughput respectively. 10 | 11 | ## NVLink 12 | 13 | [NVLink](https://en.wikipedia.org/wiki/NVLink) is the GPU interconnect designed by NVIDIA. It is for near-range communication, i.e. between GPUs, and provides a high-speed direct GPU-to-GPU interconnect; the 12 links on an Ampere-based A100 GPU support up to 600 GB/s of total bandwidth. 14 | 15 | ## NVMe 16 | 17 | [NVMe](https://en.wikipedia.org/wiki/NVM_Express) is the short form of NVM Express. It is an open, logical-device interface specification for accessing a computer's non-volatile storage media, usually attached via the PCI Express (PCIe) bus. It is optimized for I/O-intensive applications.
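## Checking these interconnects on a node

A quick way to see which of these interconnects a node actually provides is to query them from the shell. The commands below are only a sketch: `lspci` and `nvme` may need to be installed or run with elevated privileges on some clusters, and `<bus_id>` is a placeholder you replace with a real PCI address (from `lspci | grep -i nvidia`, for example).

```shell
# PCIe: show the negotiated link speed/width of a device (e.g. a GPU)
lspci -vv -s <bus_id> | grep LnkSta

# InfiniBand: show port state and rate (e.g. 100 Gb/s for EDR, 200 Gb/s for HDR)
ibstat

# NVLink: show GPU-to-GPU topology and per-link NVLink status
nvidia-smi topo -m
nvidia-smi nvlink -s

# NVMe: list NVMe drives attached to the node
nvme list
```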
-------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man1/dropbearkey.1: -------------------------------------------------------------------------------- 1 | .TH dropbearkey 1 2 | .SH NAME 3 | dropbearkey \- create private keys for the use with dropbear(8) or dbclient(1) 4 | .SH SYNOPSIS 5 | .B dropbearkey 6 | \-t 7 | .I type 8 | \-f 9 | .I file 10 | [\-s 11 | .IR bits ] 12 | [\-y] 13 | .SH DESCRIPTION 14 | .B dropbearkey 15 | generates a 16 | \fIRSA\fR, \fIDSS\fR, \fIECDSA\fR, or \fIEd25519\fR 17 | format SSH private key, and saves it to a file for the use with the 18 | Dropbear client or server. 19 | Note that 20 | some SSH implementations 21 | use the term "DSA" rather than "DSS", they mean the same thing. 22 | .SH OPTIONS 23 | .TP 24 | .B \-t \fItype 25 | Type of key to generate. 26 | Must be one of 27 | .I rsa 28 | .I ecdsa 29 | .I ed25519 30 | or 31 | .IR dss . 32 | .TP 33 | .B \-f \fIfile 34 | Write the secret key to the file 35 | \fIfile\fR. For client authentication ~/.ssh/id_dropbear is loaded by default 36 | .TP 37 | .B \-s \fIbits 38 | Set the key size to 39 | .I bits 40 | bits, should be multiple of 8 (optional). 41 | .TP 42 | .B \-y 43 | Just print the publickey and fingerprint for the private key in \fIfile\fR. 44 | .SH NOTES 45 | The program dropbearconvert(1) can be used to convert between Dropbear and OpenSSH key formats. 46 | .P 47 | Dropbear does not support encrypted keys. 48 | .SH EXAMPLE 49 | generate a host-key: 50 | # dropbearkey -t rsa -f /etc/dropbear/dropbear_rsa_host_key 51 | 52 | extract a public key suitable for authorized_keys from private key: 53 | # dropbearkey -y -f id_rsa | grep "^ssh-rsa " >> authorized_keys 54 | .SH AUTHOR 55 | Matt Johnston (matt@ucc.asn.au). 56 | .br 57 | Gerrit Pape (pape@smarden.org) wrote this manual page. 58 | .SH SEE ALSO 59 | dropbear(8), dbclient(1), dropbearconvert(1) 60 | .P 61 | https://matt.ucc.asn.au/dropbear/dropbear.html 62 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/pbs_script/multinode-jupyter-non-dev.pbs: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | ## The following line specifies the resources request 4 | #PBS -l select=2:ncpus=5:ngpus=1,place=scatter 5 | 6 | ### Run in the shared test and development area 7 | ### Note that you do not have exclusive access to the GPUs in the dgx-dev queue 8 | #PBS -q dgx 9 | #PBS -l walltime=12:00:00 10 | 11 | ### Specify project code 12 | ### e.g. 41000001 was the pilot project code 13 | ### Job will not submit unless this is changed 14 | #PBS -P 11002046 15 | 16 | ### Specify name for job 17 | #PBS -N multinode_lsg_jupyter 18 | 19 | ### Standard output by default goes to file $PBS_JOBNAME.o$PBS_JOBID 20 | ### Standard error by default goes to file $PBS_JOBNAME.e$PBS_JOBID 21 | ### To merge standard output and error use the following 22 | #PBS -j oe 23 | 24 | ### Start of commands to be run 25 | 26 | # Change directory to where job was submitted 27 | cd $PBS_O_WORKDIR || exit $? 28 | echo $0 29 | hostname 30 | bash /home/users/ntu/c170166/pbs_scripts/jupyterlab/start_jupyter.sh 31 | 32 | 33 | ### Notebook is now running on compute node 34 | ### However you cannot directly access port 35 | ### There are two methods that you can use 36 | ### 1. ssh port forwarding 37 | ### 2. reverse proxy (FRP, etc.) 
38 | 39 | ### Using reverse proxies is a security risk, user is responsible for any data loss or unauthorized access. 40 | 41 | ### ssh port forwarding 42 | ### On local machine use ssh port forwarding to tunnel to node and port where job is running: 43 | ### ssh -L$PORT:$HOST:$PORT aspire.nscc.sg ### e.g. ssh -L8888:dgx4106:8888 aspire.nscc.sg 44 | ### On local machine go to http://localhost:$PORT and use token from found in file stderr.$PBS_JOBID 45 | ### Alternatively, pass a pre-determined token using --NotebookApp.token=... (visible to ps command on node) 46 | -------------------------------------------------------------------------------- /scripts/nscc/monitor_job/nscc_monitor.py: -------------------------------------------------------------------------------- 1 | import smtplib 2 | from email.mime.text import MIMEText 3 | from email.header import Header 4 | import subprocess 5 | import time 6 | import argparse 7 | import datetime 8 | import logging 9 | 10 | FROM_ADDR = 'franklee_9@163.com' 11 | PASSWORD = 'MQFKZUTJQBMKMWWT' 12 | SMTP_SERVER = 'smtp.163.com' 13 | 14 | 15 | def send_email(server, to_addr, job_id): 16 | current_time = datetime.datetime.now() 17 | content = 'Job {} is currently running. Detected at {}'.format( 18 | job_id, current_time) 19 | msg = MIMEText(content, 'plain', 'utf-8') 20 | 21 | msg['From'] = Header(FROM_ADDR) 22 | msg['To'] = Header(to_addr) 23 | msg['Subject'] = Header('Your NSCC Job is Running') 24 | server.sendmail(FROM_ADDR, to_addr, msg.as_string()) 25 | logging.info("Email alert sent to {}".format(to_addr)) 26 | server.quit() 27 | 28 | 29 | def listen_to_qstat(job_id, interval=600): 30 | logging.info("Start listening for {}".format(job_id)) 31 | while True: 32 | qstat_info = subprocess.getoutput('qstat') 33 | qstat_info = qstat_info.split('\n') 34 | for line in qstat_info: 35 | if job_id in line and line.split()[-2] == "R": 36 | logging.info("Detected job start for: {}".format(job_id)) 37 | return 38 | 39 | time.sleep(interval) 40 | 41 | 42 | def parse_args(): 43 | parser = argparse.ArgumentParser() 44 | parser.add_argument("--job", type=str) 45 | parser.add_argument("--interval", type=int, default=600) 46 | parser.add_argument("--email", type=str) 47 | return parser.parse_args() 48 | 49 | 50 | def main(): 51 | args = parse_args() 52 | logging.basicConfig( 53 | filename="./monitor_{}.log".format(args.job), level=logging.DEBUG) 54 | listen_to_qstat(args.job, args.interval) 55 | 56 | # set up email server 57 | logging.info("Setting up server") 58 | server = smtplib.SMTP_SSL(host=SMTP_SERVER) 59 | server.connect(host=SMTP_SERVER, port=465) 60 | server.login(FROM_ADDR, PASSWORD) 61 | send_email(server, args.email, args.job) 62 | 63 | 64 | if __name__ == "__main__": 65 | main() 66 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/ssh_and_run.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2 2 | 3 | from __future__ import print_function 4 | 5 | SSH_PORT = 41017 6 | 7 | import subprocess 8 | import os 9 | import tempfile 10 | import shutil 11 | import stat 12 | import sys 13 | 14 | with open(os.environ["PBS_NODEFILE"], "r") as f: 15 | hosts = dict() 16 | for host in f: 17 | host = host.strip() 18 | cnt = hosts.get(host, 0) 19 | cnt += 1 20 | hosts[host] = cnt 21 | 22 | hstrings = ["{0}:{1}".format(h, s) for (h, s) in hosts.items()] 23 | 24 | def generate_ssh_config_file(): 25 | sections = [] 26 | 27 | for host in 
hosts: 28 | sections.append( 29 | ("Host {0}\n" 30 | "\tPort {1}\n" 31 | "\tStrictHostKeyChecking no\n".format(host, SSH_PORT))) 32 | return "\n".join(sections) 33 | 34 | 35 | tempdir = tempfile.mkdtemp(dir=os.getenv("HOME") + "/sshcont") 36 | print("Created temporary config directory:", tempdir) 37 | print("bind mount that directory under $HOME/.ssh") 38 | 39 | os.chmod(tempdir, stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR) 40 | 41 | configfile = tempdir + "/config" 42 | with open(configfile, "w") as f: 43 | f.write(generate_ssh_config_file()) 44 | os.chmod(configfile, stat.S_IRUSR | stat.S_IWUSR) 45 | 46 | 47 | keyfile = tempdir + "/id_rsa" 48 | pubkey = tempdir + "/authorized_keys" 49 | out = subprocess.Popen("dropbearkey -t rsa -f {0}".format(keyfile), shell=True, stderr=subprocess.PIPE, stdout=subprocess.PIPE) 50 | out.wait() 51 | if out.returncode != 0: 52 | raise RuntimeError("return not zero") 53 | out = out.stdout.read() 54 | 55 | with open(pubkey, "w") as f: 56 | f.write(out.splitlines()[1]) 57 | subprocess.check_call("dropbearconvert dropbear openssh {0} {1}".format(keyfile, keyfile), shell=True) 58 | os.chmod(pubkey, stat.S_IRUSR | stat.S_IWUSR) 59 | 60 | os.environ["NTUHPC_OPENMPI_HOSTSPEC"] = ",".join(hstrings) 61 | os.environ["NTUHPC_SSH_DIR"] = tempdir 62 | os.environ["NTUHPC_OPENMPI_HOSTCOUNT"] = str(len(hosts)) 63 | os.environ["NTUHPC_OPENMPI_SLOTCOUNT"] = str(sum(hosts.values())) 64 | try: 65 | subprocess.check_call(" ".join(sys.argv[1:]), shell=True) 66 | finally: 67 | shutil.rmtree(tempdir) 68 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 
92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # IDE 132 | .vscode/ 133 | .idea/ 134 | 135 | # macos 136 | .DS_Store -------------------------------------------------------------------------------- /docs/hpc_system_notes/nus_apollo.md: -------------------------------------------------------------------------------- 1 | # NUS Apollo 2 | 3 | ## Table of Contents 4 | 5 | - [Connecting](#connecting) 6 | - [Software Stack](#software-stack) 7 | - [Hardware Overview](#hardware-overview) 8 | 9 | ## Connecting 10 | 11 | Connect to the SoC VPN, then you can use ssh to connect to the lab machine directly. 12 | 13 | ```shell 14 | ssh USERNAME@hpc-ai-01.d2.comp.nus.edu.sg 15 | ``` 16 | 17 | ## :heavy_exclamation_mark: Download Rules 18 | 19 | :heavy_exclamation_mark: 20 | :heavy_exclamation_mark: 21 | :heavy_exclamation_mark: 22 | **Please read this section before use** 23 | 24 | 25 | If you want to download a large dataset, please limit your download rate so that the transfer does not flood (effectively DDoS) the shared network. 26 | You can use trickle to control how fast you download your dataset. In general, you should limit your download rate to 10 MB/s to be safe. 27 | We use **`trickle`** to control the download rate. 28 | 29 | Some options of trickle are: 30 | - `-s`: run in standalone mode 31 | - `-u`: upload rate in KB/s 32 | - `-d`: download rate in KB/s 33 | 34 | 35 | ```bash 36 | # this will download the CIFAR10 dataset at around 4 MB/s 37 | trickle -s -u 1024 -d 4096 wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz 38 | 39 | # this will download the CIFAR10 dataset at around 2 MB/s 40 | trickle -s -u 1024 -d 2048 wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz 41 | ``` 42 | 43 | ## Storage 44 | 45 | We store datasets which are commonly used by many users in `/data/common`. 46 | If you feel that your dataset is common but not yet in `/data/common`, you can ask the system admin to download it for you, or move your downloaded dataset to `/data/common`. 47 | 48 | Please store large files (e.g. personal datasets, model weights) in `/data/personal`. You need to create your own directory there. 49 | 50 | 51 | 52 | ## Software Stack 53 | 54 | ### Environment Module 55 | 56 | We use **Environment Modules** to manage the software stack. 57 | You can use `module av` to list the available software packages and `module load` to load the software you want to use. 58 | 59 | ```shell 60 | module load miniconda3 cuda/11.3.1 61 | ``` 62 | 63 | ### Singularity Image 64 | 65 | You can use Singularity to run containers. 66 | 67 | 1. Run Image: `singularity shell --nv` 68 | 2. Clone Environment: `conda create --clone` 69 | 3. Install Other Python Dependencies with `pip` or `conda` 70 | 71 | The new environment will be created under `$HOME/.conda/envs`. The next time you need it, there is no need to reconfigure anything: just activate it directly with `source activate` inside the Singularity shell.
72 | 73 | ```shell 74 | $ module load singularity 75 | $ singularity shell --nv /opt/nussif/pytorch_21.12-py3.sif 76 | Singularity> conda create --clone base --name ngc-torch-21.11 # only needed the first time 77 | Singularity> source activate ngc-torch-21.11 78 | Singularity> python -c "import torch;print(torch.cuda.is_available())" 79 | ``` 80 | 81 | ## Hardware Overview 82 | 83 | | Node Name | GPU | CPU | Memory | Storage | 84 | |-----------|------------------------|----------------------------|---------------------|------------------------------------| 85 | | hpc-ai-01 | NVIDIA HGX A100 8-GPU | 2 x AMD EPYC 7742 64-Core | 1TB 3200 MT/s DDR4 | 1.8 TB System NVMe + 11 TB Data NVMe | 86 | -------------------------------------------------------------------------------- /docs/get_started/sysinfo.md: -------------------------------------------------------------------------------- 1 | # Understanding Your System 2 | 3 | ## Table Of Contents 4 | 5 | - [Common Linux Commands](#common-linux-commands) 6 | - [More info in /proc](#more-info-in-proc) 7 | - [More info in /sys](#more-info-in-sys) 8 | - [Scripts](#scripts) 9 | 10 | ## Common Linux Commands 11 | 12 | ```shell 13 | # get operating system info (any of the commands below) 14 | cat /etc/os-release 15 | lsb_release -a 16 | hostnamectl 17 | 18 | # get linux kernel version 19 | uname -r 20 | 21 | # get free RAM 22 | free -m 23 | 24 | # Disk 25 | df -h /home 26 | 27 | # CPU 28 | lscpu 29 | 30 | # check GPU status 31 | nvidia-smi 32 | 33 | # check CUDA/nvcc version 34 | nvcc --version 35 | 36 | # check GPU topology 37 | nvidia-smi topo -m 38 | 39 | # environment variables 40 | env 41 | 42 | # check available modules 43 | module avail 44 | 45 | # check infiniband (any of the commands below) 46 | ibv_devinfo 47 | ibstat 48 | ibstatus 49 | 50 | # check if hyperthreading is enabled 51 | dmidecode -t processor | grep HTT 52 | 53 | ``` 54 | 55 | ## More info in proc 56 | 57 | `/proc` is a virtual file system on Linux which stores information about the current status of the Linux kernel. Users can view and modify the running status of the kernel through these files. Because `/proc` is virtual, its files report a size of 0 KB even though they can display lots of information. Many of them are kept in RAM and constantly updated. 58 | 59 | If you change your directory to `/proc`, you will see many folders named with numbers. These are the folders of the running processes. The following commands, run inside a process folder, give you some basic information about that process, and you can get more by viewing the other files. 60 | 61 | ```shell 62 | # see the command which launched the process 63 | cat cmdline 64 | 65 | # check the environment variables of the process 66 | cat environ 67 | 68 | # check the memory usage 69 | cat mem 70 | 71 | # view the current status of the process 72 | # cat status has better readability 73 | cat stat 74 | cat status 75 | 76 | # check the info about threads run by the current process 77 | cat task 78 | 79 | ``` 80 | 81 | Besides the process folders, there are many files located directly under `/proc`; they provide status information about the whole operating system.
For example: 82 | 83 | ```shell 84 | # check power management and battery info 85 | cat /proc/apm 86 | 87 | # get memory usage 88 | cat /proc/meminfo 89 | 90 | # get virtual memory info 91 | cat /proc/vmstat 92 | 93 | # get mounted devices info 94 | cat /proc/devices 95 | 96 | # get cpu info 97 | cat /proc/cpuinfo 98 | 99 | # get system uptime since last boot 100 | cat /proc/uptime 101 | 102 | # get kernel version 103 | cat /proc/version 104 | 105 | ``` 106 | 107 | ## More info in sys 108 | 109 | `/sys` is another directory which allows you to view information about your devices and system. You can view and modify some settings such as power, module, and hypervisor options. 110 | 111 | > You can view/change simultaneous multi-threading in `/sys/devices/system/cpu/smt` 112 | 113 | ## Scripts 114 | 115 | `Author-Kit` provides an easy-to-use script to gather system information on the machine. 116 | 117 | ```shell 118 | git clone https://github.com/SC-Tech-Program/Author-Kit 119 | bash ./Author-Kit/collect_environment.sh > ./sysinfo.txt 120 | ``` 121 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/general.md: -------------------------------------------------------------------------------- 1 | # General Notes 2 | 3 | ## Get to Know Your Scheduler 4 | 5 | HPC clusters always have a job scheduler which dispatches jobs submitted to a queue. You can refer to the official 6 | documentation below for how to use the job scheduler. 7 | 8 | - PBS Pro: [Documentation](https://help.nscc.sg/pbspro-quickstartguide/) 9 | - SLURM: [Documentation](https://slurm.schedmd.com/documentation.html) 10 | 11 | ## How to Debug on HPC Cluster 12 | 13 | ### Use debug node 14 | 15 | In general, HPC clusters provide a development node for you to test your code. For example, you can use the `dgx-dev` 16 | queue on NSCC to run an interactive job and debug your code on one node with 4 GPUs. 17 | 18 | ### Use JupyterLab 19 | 20 | Development nodes usually have resource limitations such as wall time. It is also inconvenient when your terminal gets stuck, as 21 | you have to quit and re-submit the job. Thus, I would recommend using JupyterLab to debug your job. This has 22 | several advantages: 23 | 24 | 1. You can have multiple terminals 25 | 2. You can edit your files 26 | 3. Your job is still running if you disconnect 27 | 28 | If you are not a vim enthusiast (even though I highly recommend it as you can make it a pro IDE if you install plugins), 29 | you can use JupyterLab for convenience. 30 | 31 | You can refer to the [NSCC notes](nscc.md) on how to set up JupyterLab. 32 | 33 | ### Use IDE 34 | 35 | You can refer to the following documentation on how to use an IDE on clusters. 36 | - VS Code: [documentation](https://code.visualstudio.com/docs/remote/ssh) 37 | - PyCharm: [documentation](https://www.jetbrains.com/help/pycharm/tutorial-deployment-in-product.html#comparing) 38 | 39 | ## How to Run Many Short Experiments With a Single Job Submission 40 | 41 | Sometimes, you may need to run many experiments while each experiment only takes several minutes. Thus, it is definitely 42 | not a good idea to submit a job script for each experiment. For example, I may only want to profile the peak memory usage 43 | of one iteration of my training process. In this case, it is recommended to use JupyterLab for experiments. This has 44 | several advantages: 45 | 46 | 1. You don't have to worry about program errors when you run many experiments.
For example, when you only submit a job 47 | to the scheduler, the job will get stuck if your program runs into an out-of-memory error. However, if you run JupyterLab, 48 | you can just terminate it with `ctrl+c` and continue with your next try. 49 | 2. JupyterLab runs inside the job dispatched by the scheduler. Thus, the environment inherits variables passed by the 50 | scheduler, for example, the `PBS_NODEFILE` variable of PBS Pro. 51 | 52 | ## How to Find Network Interface 53 | 54 | The network interface often needs to be specified for distributed training, to tell the framework which interface to rely on for 55 | communication. For example, when running PyTorch distributed, you may find that your script gets stuck at the initialization 56 | of the default process group. Usually this is because PyTorch does not know which interface to use for cross-node 57 | communication, or some of the detected interfaces have issues. You can specify the network interface for communication by setting the 58 | environment variables `NCCL_SOCKET_IFNAME` or `GLOO_SOCKET_IFNAME` depending on the backend of your choice. 59 | You can check the available network interfaces with the command `ifconfig`, or use 60 | [PyRoute2](https://github.com/svinota/pyroute2) if this command is not available. You can also get the host address 61 | this way. 62 | 63 | ```python 64 | from pyroute2 import NDB 65 | 66 | ndb = NDB(log='debug') 67 | print(ndb.addresses.summary()) 68 | ``` 69 | 70 | For example, you can find `ib0` in the list on NSCC. It refers to the InfiniBand interface, and you can set `NCCL_SOCKET_IFNAME=ib0` in 71 | your script. Based on our tests, if you do not set this environment variable, your PyTorch initialization will get stuck. 72 | 73 | ## Use ProxyJump to connect directly to the compute node (requires a job running on the compute node) 74 | 75 | ``` 76 | Host computenode_hostname 77 | HostName computenode_hostname 78 | User username 79 | ProxyJump loginnode_hostname 80 | ServerAliveInterval 60 81 | 82 | Host loginnode_hostname 83 | HostName loginnode_hostname 84 | User username 85 | ``` 86 | -------------------------------------------------------------------------------- /docs/resources/dl_libraries.md: -------------------------------------------------------------------------------- 1 | # Awesome Deep Learning Libraries 2 | 3 | I have listed some awesome libraries which I have found useful for most machine learning practices. These libraries can make 4 | things easier and boost your productivity. 5 | 6 | 7 | ## General 8 | 9 | - [DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples): Source code for many deep learning models by Nvidia 10 | 11 | 12 | ## Computer Vision 13 | 14 | - [mmcv](https://github.com/open-mmlab/mmcv): OpenMMLab Computer Vision Foundation 15 | - [MMClassification](https://github.com/open-mmlab/mmclassification): OpenMMLab Image Classification Toolbox and Benchmark 16 | - [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab Detection Toolbox and Benchmark 17 | - [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark 18 | - [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab Semantic Segmentation Toolbox and Benchmark.
19 | - [OpenSelfSup](https://github.com/open-mmlab/OpenSelfSup): Self-Supervised Learning Toolbox and Benchmark 20 | 21 | 22 | ## Natural Language Processing 23 | 24 | - [autonlp](https://github.com/huggingface/autonlp): AutoNLP: train state-of-the-art natural language processing models and deploy them in a scalable environment automatically 25 | - [HuggingFaceTransformer](https://github.com/huggingface/transformers): Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX. 26 | - [fairseq](https://github.com/pytorch/fairseq): Facebook AI Research Sequence-to-Sequence Toolkit written in Python. 27 | 28 | 29 | ## Data 30 | 31 | - [DALI](https://github.com/NVIDIA/DALI): A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications. 32 | - [AugLy](https://github.com/facebookresearch/AugLy): A data augmentations library for audio, image, text, and video. 33 | - [Open3D](https://github.com/intel-isl/Open3D): Open3D: A Modern Library for 3D Data Processing 34 | - [HuggingFaceTokenizer](https://github.com/huggingface/tokenizers): Fast State-of-the-Art Tokenizers optimized for Research and Production 35 | - [HuggingFaceDatasets](https://github.com/huggingface/datasets): The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools 36 | 37 | 38 | ## Accelerating Training 39 | 40 | - [Apex](https://github.com/NVIDIA/apex): A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch 41 | - [ApexDataPrefetcher](https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py#L265): prefetch data to hide data I/O cost. 42 | - [Horovod](https://github.com/horovod/horovod): Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. 43 | - [Checkpoint](https://pytorch.org/docs/stable/checkpoint.html): A PyTorch function which implements [activation checkpointing](https://arxiv.org/abs/1604.06174). 44 | - [TorchPipe](https://github.com/kakaobrain/torchgpipe): A GPipe implementation in PyTorch 45 | - [PowerSGD Communication Hook](https://pytorch.org/docs/stable/ddp_comm_hooks.html): PowerSGD (Vogels et al., NeurIPS 2019) is a gradient compression algorithm, which can provide very high compression rates and accelerate bandwidth-bound distributed training. 46 | - [Accelerate](https://github.com/huggingface/accelerate): A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision 47 | - [lightseq](https://github.com/bytedance/lightseq): LightSeq: A High Performance Library for Sequence Processing and Generation 48 | 49 | 50 | ## Large-Scale Distributed Training 51 | 52 | - [Megatron](https://github.com/NVIDIA/Megatron-LM): Ongoing research training transformer language models at scale, including: BERT & GPT-2 53 | - [DeepSpeed](https://github.com/microsoft/deepspeed): DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective. 54 | - [Ray](https://github.com/ray-project/ray): An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. 
55 | 56 | 57 | ## Utilities 58 | 59 | - [Tensorboard](https://github.com/tensorflow/tensorboard): TensorFlow's Visualization Toolkit 60 | - [KnockKnock](https://github.com/huggingface/knockknock): Get notified when your training ends with only two additional lines of code 61 | - [Neptune](https://github.com/neptune-ai/neptune-client): Lightweight experiment tracking tool for AI/ML individuals and teams. Fits any workflow. 62 | - [netron](https://github.com/lutzroeder/netron): Visualizer for neural network, deep learning, and machine learning models 63 | - [scalene](https://github.com/plasma-umass/scalene): Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python 64 | 65 | 66 | -------------------------------------------------------------------------------- /docs/initialize/vm_software.md: -------------------------------------------------------------------------------- 1 | # Software for Single VM 2 | 3 | I use CentOS for my personal cloud server, thus the commands will be a bit different if you using other operating system. For software you wish to make available for all users, I would recommend you to put them under `/opt/apps`. As I will be using `Lmod` to manage all the modules, I build all the software from source. If you wish to install directly, you can just use `yum install`. 4 | 5 | ## Table Of Contents 6 | 7 | - [GCC](#gcc) 8 | - [Tmux](#common-linux-commands) 9 | - [Lmod](#lmod) 10 | - [Docker](#docker) 11 | - [Singularity](#singularity) 12 | - [OpenMPI](#openmpi) 13 | - [Node.js](#node.js) 14 | - [Go](#go) 15 | - [Useful Plug-ins](#useful-plug-ins) 16 | 17 | ## GCC 18 | 19 | GCC compiler is required for compilation for many programs. To build GCC from source, you can follow the steps below. 20 | 21 | ```shell 22 | # setup workspace 23 | cd /opt/apps 24 | mkdir -p gcc/source 25 | cd ./gcc/source 26 | 27 | # setup source code 28 | git clone git://gcc.gnu.org/git/gcc.git 29 | git branch -a # view all branch 30 | git tag -l # view all tags 31 | git checkout 32 | ./contrib/download_prerequisites 33 | 34 | # make and install 35 | cd /opt/apps/gcc 36 | mkdir # e.g. mkdir 8.3.0 37 | /opt/apps/gcc/source/configure --prefix=/opt/apps/gcc/ --enable-languages=c,c++,fortran,go 38 | make 39 | make install 40 | ``` 41 | 42 | **Compilation can take quite long** 43 | 44 | ## Tmux 45 | 46 | Install tmux 47 | 48 | ```shell 49 | sudo yum install tmux 50 | ``` 51 | 52 | However, this is version 1.8 and is a bit too old. 
To install Tmux 2.8 from source: 53 | 54 | ```shell 55 | # install deps 56 | yum install gcc kernel-devel make ncurses-devel 57 | 58 | # DOWNLOAD SOURCES FOR LIBEVENT AND MAKE AND INSTALL 59 | curl -LOk https://github.com/libevent/libevent/releases/download/release-2.1.8-stable/libevent-2.1.8-stable.tar.gz 60 | tar -xf libevent-2.1.8-stable.tar.gz 61 | cd libevent-2.1.8-stable 62 | ./configure --prefix=/usr/local 63 | make 64 | make install 65 | 66 | # DOWNLOAD SOURCES FOR TMUX AND MAKE AND INSTALL 67 | 68 | curl -LOk https://github.com/tmux/tmux/releases/download/2.8/tmux-2.8.tar.gz 69 | tar -xf tmux-2.8.tar.gz 70 | cd tmux-2.8 71 | LDFLAGS="-L/usr/local/lib -Wl,-rpath=/usr/local/lib" ./configure --prefix=/usr/local 72 | make 73 | make install 74 | 75 | # pkill tmux 76 | # close your terminal window (flushes cached tmux executable) 77 | # open new shell and check tmux version 78 | tmux -V 79 | ``` 80 | 81 | tmux cheatsheet: https://tmuxcheatsheet.com/ 82 | 83 | ## Lmod 84 | 85 | You may refer to [Lmod Installation](https://lmod.readthedocs.io/en/latest/030_installing.html) for detailed instructions. For a quick summary on the installation process, you can refer to my procedure. 86 | 87 | 1. Try `lua -v` to check if you have lua installed. If not, install lua. 88 | 89 | ```shell 90 | curl -R -O http://www.lua.org/ftp/lua-5.4.1.tar.gz 91 | tar xf lua-5.4.1.tar.gz 92 | cd lua-5.4.1 93 | # if you want to set where you want to install 94 | # e.g. change INSTALL_TOP to /opt/apps/lua/5.4.1 95 | # and lua will be installed there 96 | vim Makefile 97 | make 98 | make install 99 | ``` 100 | 101 | 2. Download Lmod from [here](https://sourceforge.net/projects/lmod/files/) and install (e.g. Lmod-8.4). 102 | 103 | ```shell 104 | tar -xf Lmod-8.4.tar.bz2 105 | cd Lmod-8.4 106 | ./configure --prefix=/opt/apps 107 | make install 108 | ``` 109 | 110 | 3. Then Configure the shell setup. 111 | 112 | ``` 113 | ln -s /opt/apps/lmod/lmod/init/profile /etc/profile.d/z00_lmod.sh 114 | ln -s /opt/apps/lmod/lmod/init/cshrc /etc/profile.d/z00_lmod.csh 115 | ln -s /opt/apps/lmod/lmod/init/profile.fish /etc/fish/conf.d/z00_lmod.fish 116 | ``` 117 | 118 | 4. When you start you shell again, you can type `module` and see the option list and see some environment variables such as `LMOD_CMD`. 119 | 120 | ``` 121 | $ echo $LMOD_CMD 122 | /opt/apps/lmod/lmod/libexec/lmod 123 | ``` 124 | 125 | You can refer to [Write your own modulefiles](https://lmod.readthedocs.io/en/latest/020_advanced.html) for more information. You can add `module use ` in `/etc/profile` so all users can initialize to the same bunch of paths upon start. 126 | 127 | ## Docker 128 | 129 | Install Docker 130 | 131 | ```shell 132 | sudo yum install -y yum-utils 133 | sudo yum-config-manager \ 134 | --add-repo \ 135 | https://download.docker.com/linux/centos/docker-ce.repo 136 | sudo systemctl start docker 137 | ``` 138 | 139 | Docker will create a group called docker, other users can be granted access to docker by 140 | 141 | ```shell 142 | sudo usermod -aG docker $USER 143 | ``` 144 | 145 | docker cheatsheet: https://www.docker.com/sites/default/files/d8/2019-09/docker-cheat-sheet.pdf 146 | 147 | ## Singularity 148 | 149 | Singularity is a containerization tool similar to Docker. It is more for HPC community. Its image can be obtained by conversion from Docker Image. Follow the steps below to install Singularity. 
150 | 151 | ```shell 152 | sudo yum update && \ 153 | sudo yum groupinstall 'Development Tools' && \ 154 | sudo yum install libarchive-devel 155 | 156 | git clone https://github.com/singularityware/singularity.git 157 | cd singularity 158 | ./autogen.sh 159 | ./configure --prefix=/opt/apps/singularity --sysconfdir=/etc 160 | make 161 | sudo make install 162 | 163 | ``` 164 | 165 | ## OpenMPI 166 | 167 | OpenMPI lets you spawn mutltiple processes simultaneously for distributed job. As my VM has too few cores to make OpenMPI effective. I didn't not install it. You may refer to this [guide](https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX) if you want to build OpenMPI with UCX. 168 | 169 | ## Node.js 170 | 171 | If you are a web developer, then most likely you will need Node.js for your frontend and backend devepment. To install Node.js on the server, you can download the binary files from the [Node.js website](https://nodejs.org/en/download/) with [instructions](https://github.com/nodejs/help/wiki/Installation). 172 | 173 | ## Go 174 | 175 | To install Go, you can also download the binary directly at [Go Installation](https://golang.org/doc/install). 176 | 177 | ## Useful Plug-ins 178 | 179 | [oh-my-zsh](https://github.com/ohmyzsh/ohmyzsh) 180 | [oh-my-tmux](https://github.com/gpakosz/.tmux) 181 | [vimrc-configuration](https://github.com/amix/vimrc) 182 | [vim-plugin-code-autocomplete](https://github.com/ycm-core/YouCompleteMe) 183 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man8/dropbear.8: -------------------------------------------------------------------------------- 1 | .TH dropbear 8 2 | .SH NAME 3 | dropbear \- lightweight SSH server 4 | .SH SYNOPSIS 5 | .B dropbear 6 | [\fIflag arguments\fR] [\-b 7 | .I banner\fR] 8 | [\-r 9 | .I hostkeyfile\fR] [\-p [\fIaddress\fR:]\fIport\fR] 10 | .SH DESCRIPTION 11 | .B dropbear 12 | is a small SSH server 13 | .SH OPTIONS 14 | .TP 15 | .B \-b \fIbanner 16 | bannerfile. 17 | Display the contents of the file 18 | .I banner 19 | before user login (default: none). 20 | .TP 21 | .B \-r \fIhostkey 22 | Use the contents of the file 23 | .I hostkey 24 | for the SSH hostkey. 25 | This file is generated with 26 | .BR dropbearkey (1) 27 | or automatically with the '-R' option. See "Host Key Files" below. 28 | .TP 29 | .B \-R 30 | Generate hostkeys automatically. See "Host Key Files" below. 31 | .TP 32 | .B \-F 33 | Don't fork into background. 34 | .TP 35 | .B \-E 36 | Log to standard error rather than syslog. 37 | .TP 38 | .B \-m 39 | Don't display the message of the day on login. 40 | .TP 41 | .B \-w 42 | Disallow root logins. 43 | .TP 44 | .B \-s 45 | Disable password logins. 46 | .TP 47 | .B \-g 48 | Disable password logins for root. 49 | .TP 50 | .B \-j 51 | Disable local port forwarding. 52 | .TP 53 | .B \-k 54 | Disable remote port forwarding. 55 | .TP 56 | .B \-p\fR [\fIaddress\fR:]\fIport 57 | Listen on specified 58 | .I address 59 | and TCP 60 | .I port. 61 | If just a port is given listen 62 | on all addresses. 63 | up to 10 can be specified (default 22 if none specified). 64 | .TP 65 | .B \-i 66 | Service program mode. 67 | Use this option to run 68 | .B dropbear 69 | under TCP/IP servers like inetd, tcpsvd, or tcpserver. 70 | In program mode the \-F option is implied, and \-p options are ignored. 71 | .TP 72 | .B \-P \fIpidfile 73 | Specify a pidfile to create when running as a daemon. 
If not specified, the 74 | default is /var/run/dropbear.pid 75 | .TP 76 | .B \-a 77 | Allow remote hosts to connect to forwarded ports. 78 | .TP 79 | .B \-W \fIwindowsize 80 | Specify the per-channel receive window buffer size. Increasing this 81 | may improve network performance at the expense of memory use. Use -h to see the 82 | default buffer size. 83 | .TP 84 | .B \-K \fItimeout_seconds 85 | Ensure that traffic is transmitted at a certain interval in seconds. This is 86 | useful for working around firewalls or routers that drop connections after 87 | a certain period of inactivity. The trade-off is that a session may be 88 | closed if there is a temporary lapse of network connectivity. A setting 89 | if 0 disables keepalives. If no response is received for 3 consecutive keepalives the connection will be closed. 90 | .TP 91 | .B \-I \fIidle_timeout 92 | Disconnect the session if no traffic is transmitted or received for \fIidle_timeout\fR seconds. 93 | .TP 94 | .B \-T \fImax_authentication_attempts 95 | Set the number of authentication attempts allowed per connection. If unspecified the default is 10 (MAX_AUTH_TRIES) 96 | .TP 97 | .B \-c \fIforced_command 98 | Disregard the command provided by the user and always run \fIforced_command\fR. This also 99 | overrides any authorized_keys command= option. 100 | .TP 101 | .B \-V 102 | Print the version 103 | 104 | .SH FILES 105 | 106 | .TP 107 | Authorized Keys 108 | 109 | ~/.ssh/authorized_keys can be set up to allow remote login with a RSA, 110 | ECDSA, Ed25519 or DSS 111 | key. Each line is of the form 112 | .TP 113 | [restrictions] ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIgAsp... [comment] 114 | 115 | and can be extracted from a Dropbear private host key with "dropbearkey -y". This is the same format as used by OpenSSH, though the restrictions are a subset (keys with unknown restrictions are ignored). 116 | Restrictions are comma separated, with double quotes around spaces in arguments. 117 | Available restrictions are: 118 | 119 | .TP 120 | .B no-port-forwarding 121 | Don't allow port forwarding for this connection 122 | 123 | .TP 124 | .B no-agent-forwarding 125 | Don't allow agent forwarding for this connection 126 | 127 | .TP 128 | .B no-X11-forwarding 129 | Don't allow X11 forwarding for this connection 130 | 131 | .TP 132 | .B no-pty 133 | Disable PTY allocation. Note that a user can still obtain most of the 134 | same functionality with other means even if no-pty is set. 135 | 136 | .TP 137 | .B command=\fR"\fIforced_command\fR" 138 | Disregard the command provided by the user and always run \fIforced_command\fR. 139 | The -c command line option overrides this. 140 | 141 | The authorized_keys file and its containing ~/.ssh directory must only be 142 | writable by the user, otherwise Dropbear will not allow a login using public 143 | key authentication. 144 | 145 | .TP 146 | Host Key Files 147 | 148 | Host key files are read at startup from a standard location, by default 149 | /etc/dropbear/dropbear_dss_host_key, /etc/dropbear/dropbear_rsa_host_key, 150 | /etc/dropbear/dropbear_ecdsa_host_key and /etc/dropbear/dropbear_ed25519_host_key 151 | 152 | If the -r command line option is specified the default files are not loaded. 153 | Host key files are of the form generated by dropbearkey. 154 | The -R option can be used to automatically generate keys 155 | in the default location - keys will be generated after startup when the first 156 | connection is established. 
This had the benefit that the system /dev/urandom 157 | random number source has a better chance of being securely seeded. 158 | 159 | .TP 160 | Message Of The Day 161 | 162 | By default the file /etc/motd will be printed for any login shell (unless 163 | disabled at compile-time). This can also be disabled per-user 164 | by creating a file ~/.hushlogin . 165 | 166 | .SH ENVIRONMENT VARIABLES 167 | Dropbear sets the standard variables USER, LOGNAME, HOME, SHELL, PATH, and TERM. 168 | 169 | The variables below are set for sessions as appropriate. 170 | 171 | .TP 172 | .B SSH_TTY 173 | This is set to the allocated TTY if a PTY was used. 174 | 175 | .TP 176 | .B SSH_CONNECTION 177 | Contains " ". 178 | 179 | .TP 180 | .B DISPLAY 181 | Set X11 forwarding is used. 182 | 183 | .TP 184 | .B SSH_ORIGINAL_COMMAND 185 | If a 'command=' authorized_keys option was used, the original command is specified 186 | in this variable. If a shell was requested this is set to an empty value. 187 | 188 | .TP 189 | .B SSH_AUTH_SOCK 190 | Set to a forwarded ssh-agent connection. 191 | 192 | .SH NOTES 193 | Dropbear only supports SSH protocol version 2. 194 | 195 | .SH AUTHOR 196 | Matt Johnston (matt@ucc.asn.au). 197 | .br 198 | Gerrit Pape (pape@smarden.org) wrote this manual page. 199 | .SH SEE ALSO 200 | dropbearkey(1), dbclient(1), dropbearconvert(1) 201 | .P 202 | https://matt.ucc.asn.au/dropbear/dropbear.html 203 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/tacc_longhorn.md: -------------------------------------------------------------------------------- 1 | # TACC-Longhorn 2 | 3 | - [Build Conda Environment ](#build-conda-environment ) 4 | - [Common Commands](#common-commands) 5 | - [Job Script Example](#job-script-example) 6 | - [Dataset and Transfer files](#dataset-and-transfer-files) 7 | - [Large Scale Experiment](#large-scale-experiment) 8 | - [DALI](#dali) 9 | - [Question Ticket](#question-ticket) 10 | 11 | ## Build Conda Environment 12 | 13 | You can follow the file "TACC distributed pytorch.pdf", which show an example about how to build your conda environment and run the code with interactive usage. But the example use PyTorch 1.1, and the newest version in the used [link](https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/) seems only PyTorch 1.3. In addition, unlike other TACC machines, Longhorn nodes are a PowerPC architecture (Power PC 64 LE). Thus, when pulling images from (e.g.) Docker Hub, make sure the image is Power PC 64 LE compatible. Here, you can use the method in this [link](https://stackoverflow.com/questions/52750622/how-to-install-pytorch-on-power-8-or-ppc64-machine/64528124#64528124?newreg=1b10fc8fcbed4beca9cdc3d4238359a5), to bulid a Python 3.6 and PyTorch 1.5 conda environment. 14 | 15 | ## Common Commands 16 | 17 | ```shell 18 | # To directory $SCRATCH. The default directory is $HOME when you login in. 19 | # Longhorn users must run all jobs in Longhorn's $SCRATCH file system. 
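# ($SCRATCH is an environment variable set by the system; if you are unsure
# where it points, print it first)
echo $SCRATCH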
20 | cd $SCRATCH 21 | 22 | # interactive usage 23 | idev -t 02:00:00 -N 1 -n 4 -p development 24 | 25 | # submit a job script 26 | sbatch /PATH/TO/job.slurm 27 | 28 | # check your job status 29 | squeue -u your_account 30 | 31 | # delete your job 32 | scancel job_ID 33 | 34 | # display estimated start time for job 35 | squeue --start -j job_ID 36 | 37 | # view loaded module 38 | # module related commands need at computing node 39 | module list 40 | 41 | # view avail module 42 | module available 43 | 44 | # active your conda environment 45 | # you need run it after you login in computing nodes everytime 46 | module load conda 47 | conda activate your_env_name 48 | 49 | # generate hostfile according to your current avaliable computing resources 50 | # Note, you maybe need run it after you login in computing nodes everytime, 51 | # because you may get different computing resources. 52 | scontrol show hostname > hostfile 53 | 54 | # run your code with interactive usage 55 | # -N means GPU numbers per node, -np means total GPU numbers 56 | mpiexec -hostfile hostfile -N 4 -np 8 python your_file.py 57 | ``` 58 | 59 | For more detailed information, please refer to [TACC Longhorn User Guide](https://portal.tacc.utexas.edu/user-guides/longhorn). 60 | 61 | ## Job Script Example 62 | 63 | ``` 64 | #!/bin/sh 65 | 66 | #SBATCH -J myjob # Job name 67 | #SBATCH -o myjob.o%j # Name of stdout output file 68 | #SBATCH -e myjob.e%j # Name of stderr error file 69 | #SBATCH -p v100 # Queue (partition) name 70 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 71 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 72 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 73 | #SBATCH --mail-user=your email address 74 | #SBATCH --mail-type=all # Send email at begin and end of job 75 | 76 | pwd 77 | date 78 | 79 | cd your_code_path 80 | module load conda 81 | conda activate your_env_name 82 | 83 | ibrun -np 8 \ 84 | python your_code.py 85 | ``` 86 | 87 | You should use ibrun instead of mpirun or mpiexec here, and -np means total GPU numbers. 88 | 89 | Note: Although in Longhorn tutorial, they use "#SBATCH -A myproject # Allocation name", you can actually delete it. If you incorrect set it, you will meet permission error when submit job. 90 | 91 | ## Dataset and Transfer files 92 | 93 | **Dataset** 94 | 95 | ```shell 96 | cp /scratch/00946/zzhang/data/imagenet-1k.tar /your/path 97 | tar xf imagenet-1k.tar 98 | # TFRecord format file 99 | /scratch/07801/nusbin20/imagenet-tar/ILSVRC2012_1k_TFRecord.tar 100 | ``` 101 | 102 | You can get a prepared ImageNet-1K (ILSVRC2012) use above command. 103 | 104 | Maybe you can find other prepared dataset around this path. 105 | 106 | **Transfer files** 107 | 108 | You can use scp or git clone command, or WinSCP, a visible tool can input TACC token, to transfer files. 109 | 110 | ```shell 111 | # scp command 112 | # tested in Git Bash on Windows10 113 | scp G:/your_path/xxx.tar your_account@longhorn.tacc.utexas.edu:/your_path/imagenet-tar 114 | ``` 115 | 116 | ## Large Scale Experiment 117 | 118 | (You can also consider DALI rather than this part) 119 | 120 | If a job uses many nodes (eg. 32 nodes with 128 GPUs) and reads large dataset(eg. 
ImageNet) directly from the hard disk, it will cause huge IO pressure on the file system, which may cause the job to be killed by the system : ( 121 | 122 | So, for large scale experiment, we need first move the data to the /tmp file system of nodes for temporary storage and then run the program, to reduce the IO pressure of the main file system. 123 | 124 | large scale experiment job script example 125 | 126 | ```shell 127 | #!/bin/sh 128 | 129 | #SBATCH -J myjob # Job name 130 | #SBATCH -o myjob.o%j # Name of stdout output file 131 | #SBATCH -e myjob.e%j # Name of stderr error file 132 | #SBATCH -p v100 # Queue (partition) name 133 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 134 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 135 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 136 | #SBATCH --mail-user=your email 137 | #SBATCH --mail-type=all # Send email at begin and end of job 138 | 139 | pwd 140 | date 141 | 142 | cd your_code_path 143 | module load conda 144 | conda activate py36pt 145 | 146 | scontrol show hostnames $SLURM_NODELIST > /tmp/hostfile 147 | 148 | cat /tmp/hostfile 149 | 150 | mpiexec -hostfile /tmp/hostfile -N 1 ./cp_imagenet_to_temp_bin.sh 151 | 152 | ibrun -np 8 \ 153 | python your_code.py 154 | ``` 155 | 156 | cp_imagenet_to_temp_bin.sh copy and extract imagenet data to following path, which may take 50 minutes :( 157 | 158 | The data in /tmp will be automatically deleted when job finished, which is located in the following path 159 | 160 | ```shell 161 | --train-dir=/tmp/imagenet/ILSVRC2012_img_train/ 162 | --val-dir=/tmp/imagenet/ILSVRC2012_img_val/ 163 | ``` 164 | 165 | ## DALI 166 | 167 | NVIDIA DALI can accelerate data loading and pre-processing using GPU rather than CPU, although with GPU memory tradeoff. 168 | 169 | It can also avoid some potential conflicts between MPI libraries and Horovod on some GPU clusters. 170 | 171 | **Install** 172 | 173 | ```shell 174 | conda install dali 175 | # because longhorn is Power architecture, we cannot use following command as other cluster. 176 | # pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110 177 | ``` 178 | 179 | **Usage** 180 | 181 | You need replace default PyTorch dataloader with dali_dataloader, I provide a PyTorch DALI example using ImageNet-1k at [here](https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch). This example has been tested with nvidia-dali-cuda110, maybe it needs some changes if you use it with Longhorn CUDA10 and Power architecture. 182 | 183 | DALI requires data in *TFRecord format* in the following structure: 184 | 185 | ``` 186 | train-recs 'path/train/*' 187 | val-recs 'path/validation/*' 188 | train-idx 'path/idx_files/train/*' 189 | val-idx 'path/idx_files/validation/*' 190 | ``` 191 | 192 | On longhorn, if you want use ImageNet-1k TFRecord data, you can directly use 193 | 194 | data-dir=/scratch/07801/nusbin20/ILSVRC2012_1k_TFRecord/ 195 | 196 | ```shell 197 | # TFRecord format tar file 198 | /scratch/07801/nusbin20/imagenet-tar/ILSVRC2012_1k_TFRecord.tar 199 | ``` 200 | 201 | About the parameters on DALI: 202 | 203 | - *prefetch_queue_depth* and *num_threads* might also be something to explore, as it can speed up your loading a lot, with some memory tradeoff. 
204 | - *last_batch_policy* you probably want PARTIAL on validation, and DROP during training: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html?highlight=last_batch_policy, just as I set in above example link. 205 | - *device* Above example link use device="mixed/gpu", for ImageNet-1k and GPU with 16GB, default PyTorch dataloader allows batchsize 128, while DALI can only use batchsize 64. If you set device="mixed/gpu" to "cpu", it won't need extra GPU memory, however copying directly to gpu makes the loading much faster. 206 | 207 | ## Question Ticket 208 | 209 | If you have some specific questions, you can sent them to [TACC Frontera Help Desk](https://frontera-portal.tacc.utexas.edu/user-guide/help/). -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man1/dbclient.1: -------------------------------------------------------------------------------- 1 | .TH dbclient 1 2 | .SH NAME 3 | dbclient \- lightweight SSH client 4 | .SH SYNOPSIS 5 | .B dbclient 6 | [\fIflag arguments\fR] [\-p 7 | .I port\fR] [\-i 8 | .I id\fR] [\-L 9 | .I l\fR:\fIh\fR:\fIp\fR] [\-R 10 | .I l\fR:\fIh\fR:\fIp\fR] [\-l 11 | .IR user ] 12 | .I host 13 | .RI [ \fImore\ flags\fR ] 14 | .RI [ command ] 15 | 16 | .B dbclient 17 | [\fIargs\fR] 18 | [\fIuser1\fR]@\fIhost1\fR[^\fIport1\fR],[\fIuser2\fR]@\fIhost2\fR[^\fIport2\fR],... 19 | 20 | .SH DESCRIPTION 21 | .B dbclient 22 | is a small SSH client 23 | .SH OPTIONS 24 | .TP 25 | .TP 26 | .B command 27 | A command to run on the remote host. This will normally be run by the remote host 28 | using the user's shell. The command begins at the first hyphen argument after the 29 | host argument. If no command is specified an interactive terminal will be opened 30 | (see -t and -T). 31 | .TP 32 | .B \-p \fIport 33 | Connect to 34 | .I port 35 | on the remote host. Alternatively a port can be specified as hostname^port. 36 | Default is 22. 37 | .TP 38 | .B \-i \fIidfile 39 | Identity file. 40 | Read the identity key from file 41 | .I idfile 42 | (multiple allowed). This file is created with dropbearkey(1) or converted 43 | from OpenSSH with dropbearconvert(1). The default path ~/.ssh/id_dropbear is used 44 | .TP 45 | .B \-L\fR [\fIlistenaddress\fR]:\fIlistenport\fR:\fIhost\fR:\fIport\fR 46 | Local port forwarding. 47 | Forward the port 48 | .I listenport 49 | on the local host through the SSH connection to port 50 | .I port 51 | on the host 52 | .IR host . 53 | .TP 54 | .B \-R\fR [\fIlistenaddress\fR]:\fIlistenport\fR:\fIhost\fR:\fIport\fR 55 | Remote port forwarding. 56 | Forward the port 57 | .I listenport 58 | on the remote host through the SSH connection to port 59 | .I port 60 | on the host 61 | .IR host . 62 | .TP 63 | .B \-l \fIuser 64 | Username. 65 | Login as 66 | .I user 67 | on the remote host. 68 | .TP 69 | .B \-t 70 | Allocate a PTY. This is the default when no command is given, it gives a full 71 | interactive remote session. The main effect is that keystrokes are sent remotely 72 | immediately as opposed to local line-based editing. 73 | .TP 74 | .B \-T 75 | Don't allocate a PTY. This is the default a command is given. See -t. 76 | .TP 77 | .B \-N 78 | Don't request a remote shell or run any commands. Any command arguments are ignored. 79 | .TP 80 | .B \-f 81 | Fork into the background after authentication. A command argument (or -N) is required. 82 | This is useful when using password authentication. 
83 | .TP 84 | .B \-g 85 | Allow non-local hosts to connect to forwarded ports. Applies to -L and -R 86 | forwarded ports, though remote connections to -R forwarded ports may be limited 87 | by the ssh server. 88 | .TP 89 | .B \-y 90 | Always accept hostkeys if they are unknown. If a hostkey mismatch occurs the 91 | connection will abort as normal. If specified a second time no host key checking 92 | is performed at all, this is usually undesirable. 93 | .TP 94 | .B \-A 95 | Forward agent connections to the remote host. dbclient will use any 96 | OpenSSH-style agent program if available ($SSH_AUTH_SOCK will be set) for 97 | public key authentication. Forwarding is only enabled if -A is specified. 98 | .TP 99 | .B \-W \fIwindowsize 100 | Specify the per-channel receive window buffer size. Increasing this 101 | may improve network performance at the expense of memory use. Use -h to see the 102 | default buffer size. 103 | .TP 104 | .B \-K \fItimeout_seconds 105 | Ensure that traffic is transmitted at a certain interval in seconds. This is 106 | useful for working around firewalls or routers that drop connections after 107 | a certain period of inactivity. The trade-off is that a session may be 108 | closed if there is a temporary lapse of network connectivity. A setting 109 | if 0 disables keepalives. If no response is received for 3 consecutive keepalives the connection will be closed. 110 | .TP 111 | .B \-I \fIidle_timeout 112 | Disconnect the session if no traffic is transmitted or received for \fIidle_timeout\fR seconds. 113 | .TP 114 | 115 | .\" TODO: how to avoid a line break between these two -J arguments? 116 | .B \-J \fIproxy_command 117 | .TP 118 | .B \-J \fI&fd 119 | .br 120 | Use the standard input/output of the program \fIproxy_command\fR rather than using 121 | a normal TCP connection. A hostname should be still be provided, as this is used for 122 | comparing saved hostkeys. This command will be executed as "exec proxy_command ..." with the 123 | default shell. 124 | 125 | The second form &fd will make dbclient use the numeric file descriptor as a socket. This 126 | can be used for more complex tunnelling scenarios. Example usage with socat is 127 | 128 | socat EXEC:'dbclient -J &38 ev',fdin=38,fdout=38 TCP4:host.example.com:22 129 | 130 | .TP 131 | .B \-B \fIendhost:endport 132 | "Netcat-alike" mode, where Dropbear will connect to the given host, then create a 133 | forwarded connection to \fIendhost\fR. This will then be presented as dbclient's 134 | standard input/output. 135 | .TP 136 | .B \-c \fIcipherlist 137 | Specify a comma separated list of ciphers to enable. Use \fI-c help\fR to list possibilities. 138 | .TP 139 | .B \-m \fIMAClist 140 | Specify a comma separated list of authentication MACs to enable. Use \fI-m help\fR to list possibilities. 141 | .TP 142 | .B \-o \fIoption 143 | Can be used to give options in the format used by OpenSSH config file. This is 144 | useful for specifying options for which there is no separate command-line flag. 145 | For full details of the options listed below, and their possible values, see 146 | ssh_config(5). 147 | The following options have currently been implemented: 148 | 149 | .RS 150 | .TP 151 | .B ExitOnForwardFailure 152 | Specifies whether dbclient should terminate the connection if it cannot set up all requested local and remote port forwardings. The argument must be “yes” or “no”. The default is “no”. 153 | .TP 154 | .B UseSyslog 155 | Send dbclient log messages to syslog in addition to stderr. 
156 | .RE 157 | .TP 158 | .B \-s 159 | The specified command will be requested as a subsystem, used for sftp. Dropbear doesn't implement sftp itself but the OpenSSH sftp client can be used eg \fIsftp -S dbclient user@host\fR 160 | .TP 161 | .B \-b \fI[address][:port] 162 | Bind to a specific local address when connecting to the remote host. This can be used to choose from 163 | multiple outgoing interfaces. Either address or port (or both) can be given. 164 | .TP 165 | .B \-V 166 | Print the version 167 | 168 | .SH MULTI-HOP 169 | Dropbear will also allow multiple "hops" to be specified, separated by commas. In 170 | this case a connection will be made to the first host, then a TCP forwarded 171 | connection will be made through that to the second host, and so on. Hosts other than 172 | the final destination will not see anything other than the encrypted SSH stream. 173 | A port for a host can be specified with a caret (eg matt@martello^44 ). 174 | This syntax can also be used with scp or rsync (specifying dbclient as the 175 | ssh/rsh command). A file can be "bounced" through multiple SSH hops, eg 176 | 177 | scp -S dbclient matt@martello,root@wrt,canyons:/tmp/dump . 178 | 179 | Note that hostnames are resolved by the prior hop (so "canyons" would be resolved by the host "wrt") 180 | in the example above, the same way as other -L TCP forwarded hosts are. Host keys are 181 | checked locally based on the given hostname. 182 | 183 | .SH ESCAPE CHARACTERS 184 | Typing a newline followed by the key sequence \fI~.\fR (tilde, dot) will terminate a connection. 185 | The sequence \fI~^Z\fR (tilde, ctrl-z) will background the connection. This behaviour only 186 | applies when a PTY is used. 187 | 188 | .SH ENVIRONMENT 189 | .TP 190 | .B DROPBEAR_PASSWORD 191 | A password to use for remote authentication can be specified in the environment 192 | variable DROPBEAR_PASSWORD. Care should be taken that the password is not 193 | exposed to other users on a multi-user system, or stored in accessible files. 194 | .TP 195 | .B SSH_ASKPASS 196 | dbclient can use an external program to request a password from a user. 197 | SSH_ASKPASS should be set to the path of a program that will return a password 198 | on standard output. This program will only be used if either DISPLAY is set and 199 | standard input is not a TTY, or the environment variable SSH_ASKPASS_ALWAYS is 200 | set. 201 | .SH NOTES 202 | If compiled with zlib support and if the server supports it, dbclient will 203 | always use compression. 204 | 205 | .SH AUTHOR 206 | Matt Johnston (matt@ucc.asn.au). 207 | .br 208 | Mihnea Stoenescu wrote initial Dropbear client support 209 | .br 210 | Gerrit Pape (pape@smarden.org) wrote this manual page. 
211 | .SH SEE ALSO 212 | dropbear(8), dropbearkey(1) 213 | .P 214 | https://matt.ucc.asn.au/dropbear/dropbear.html 215 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/nscc.md: -------------------------------------------------------------------------------- 1 | # NSCC AI System 2 | 3 | ## Table Of Contents 4 | 5 | - [Common Commands](#common-commands) 6 | - [Scheduler](#scheduler) 7 | - [Job Status Email Alert](#job-status-email-alert) 8 | - [Jupyter Lab](#jupyter-lab) 9 | - [Horovod](#horovod) 10 | - [Container](#container) 11 | - [Python Package Management](#python-package-management) 12 | - [Multinode Experiments](#multinode-experiments) 13 | - [Dataset](#dataset) 14 | 15 | ## Common Commands 16 | 17 | ```shell 18 | # submit interactive job 19 | qsub -I -q dgx-dev -l walltime=3:00:00 -P 20 | 21 | # submit a job script 22 | qsub /PATH/TO/job.pbs 23 | 24 | # check your job status 25 | qstat 26 | 27 | # view job info 28 | qstat -f 2603175.wlm01 29 | 30 | # delete your job 31 | qdel 2603175.wlm01 32 | 33 | # view job queue 34 | gstat -dgx 35 | 36 | # run interactive containers 37 | nscc-docker run -t nvcr.io/nvidia/pytorch 38 | singularity run pytorch\:latest.sif 39 | 40 | # run container with executable 41 | nscc-docker run nvcr.io/nvidia/pytorch 42 | singularity run pytorch\:latest.sif 43 | 44 | ``` 45 | 46 | > Some default options are added automatically for nscc-docker run 47 | > 48 | > > -u UID:GID --group-add GROUP –v /home:/home –v /raid:/raid -v /scratch:/scratch --rm –i --ulimit memlock=-1 --ulimit stack=67108864 49 | > > 50 | > > If --ipc=host is not specified then the following option is also added: --shm-size=1g 51 | 52 | ## Scheduler 53 | 54 | NSCC uses PBS Pro as the job scheduler, you can refer to the 55 | [NSCC official guide](https://help.nscc.sg/pbspro-quickstartguide/) for detailed instructions. 56 | 57 | ## Job Status Email Alert 58 | 59 | To receive notification when your job status changes, you can add the following to your PBS job script. 60 | 61 | ```shell 62 | #PBS -M username@x.y.z 63 | ``` 64 | 65 | ## Jupyter Lab 66 | 67 | If you want to debug your code or run many short experiments with exclusive GPU resources, you can launch a jupyter lab 68 | on the NSCC server. The procedure is as follows: 69 | 70 | ![](../doc_figures/nscc/NSCC_jupyter.png) 71 | 72 | 1. Setup NSCC VPN for your [Mac](https://help.nscc.sg/vpnmac/) or [Windows](https://help.nscc.sg/vpnmicrosoft/). You 73 | should not login via the school VPN as they do not provide outgoing internet access. The NSCC VPN allows you to 74 | access NSCC server from `aspire.nscc.sg`. You can send data to outside machine with this IP but not 75 | with `ntu.nscc.sg` or 76 | `nus.nscc.sg`. This ensures that you can access jupyter lab from your local machine later. 77 | 78 | 2. Log in to your nscc account by 79 | ```shell 80 | ssh @aspire.nscc.sg 81 | ``` 82 | 83 | 3. Submit a jupyter job. There are two ways to start the Jupyter Lab. There are two ways to start the jupyter lab. 84 | - The first way is to start Jupyter Lab in a container 85 | (You can refer to the [official guide](https://help.nscc.sg/wp-content/uploads/AI_System_QuickStart.pdf) 86 | for how to write this kind of script). 87 | - The second way is to start Jupyter Lab only and run containers in Jupyter Lab. You can refer 88 | to [my script](https://github.com/FrankLeeeee/oh-my-server/tree/main/scripts/nscc/multi_node/pbs_script). 
89 | 90 | The second way is highly recommended because you can access both your base environment in the compute node and the 91 | environment in containers and you can run containers with different images. For example, if you choose a wrong 92 | container via the first method, you have to quit the job and re-submit but this is not necessary for the second 93 | method. 94 | 95 | 4. Once your job is running, you need to check which host on which jupyter is running by: 96 | ```shell 97 | qstat -f 98 | ``` 99 | 100 | You can also check the job output file and get the port on which your jupyter is running. 101 | 102 | 5. Then, you can connect to your jupyter lab via port forwarding. The command is like 103 | ```shell 104 | ssh -L :: @aspire.nscc.sg 105 | 106 | # example 107 | ssh -L 8888:dgx4106:8888 u12345@aspire.nscc.sg 108 | ``` 109 | The `NSCC_HOST` and `NSCC_PORT` are found in step 4. `LOCAL_PORT` can be any port. 110 | 111 | 6. Finally, open your browser and enter `localhost:` to access the remote jupyter lab. 112 | 113 | ## Horovod 114 | 115 | --- 116 | 117 | **UPDATE: You can try this new Singularity 118 | image `/home/projects/ai/singularity/nscc/horovod_0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1.sif` which 119 | has Horovod pre-installed now** 120 | 121 | --- 122 | 123 | If you are using TensorFlow, you can use docker/singularity image directly as Horovod is pre-installed. 124 | 125 | If you are using PyTorch, you need to install horovod manually. To install Horovod on NSCC, just run the following 126 | script in the container. I used Singularity as it is more sutiable for multinode training. 127 | 128 | ```shell 129 | #!/bin/bash 130 | 131 | export HOROVOD_NCCL_INCLUDE_=/usr/include 132 | export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu 133 | export HOROVOD_NCCL_LINK=SHARED 134 | 135 | export HOROVOD_GPU_ALLREDUCE=NCCL 136 | export HOROVOD_WITH_PYTORCH=1 137 | pip install --no-cache-dir --user horovod==0.18.2 138 | ``` 139 | 140 | ## Container 141 | 142 | NSCC provides both Singularity and Docker containers. The containers used mostly are NGC contianers. The Singularity 143 | ones are obtained by converting Docker image to Singularity image. Some points that I have observed are that: 144 | 145 | 1. You do not have root access in both Docker and Singularity container. 146 | 2. `singularity shell xxx.sif` behaves interestingly different from `singularity run xxx.sif`. Even though both will 147 | start bash as the default shell, `singularity shell` does not source your bashrc file while `singularity run` 148 | actually does. So your init script in bashrc will be omitted when you execute `singularity shell`. It is desgined to 149 | be so and look at https://github.com/hpcng/singularity/issues/643 for more info. 150 | 151 | ## Python Package Management 152 | 153 | It is sometimes possible that the container does not provide the Python libraries you need. In this case, you need to be 154 | careful in managing these libraries. 155 | 156 | You can install your library to your user directory by: 157 | 158 | ```shell 159 | pip install --user 160 | ``` 161 | 162 | This should install Python library to `~/.local/lib` of your home directory. 163 | 164 | You can also check the directories where Python searches for libraries during `import` by running 165 | 166 | ```shell 167 | python -m site 168 | ``` 169 | 170 | If you container Python does not source your Python package, you can perform one of the following methods to add these 171 | libraries to PYTHONPATH. 
172 | 173 | - export PYTHONPATH=:$PYTHONPATH 174 | - add `sys.path.insert(0, )` in your python file For example, the `LIBRARY_ROOT_PATH` can 175 | be `/home/users/ntu/c170166/.local/lib/python3.6/site-packages`. 176 | 177 | If you are not sure where a specific library is imported from when you run your script, you can check like below: 178 | 179 | ```python 180 | # take pytorch as an example 181 | import torch 182 | 183 | print(torch.__file__) 184 | ``` 185 | 186 | ## Multinode Experiments 187 | 188 | First of all, it takes a long time to queue for resources on NSCC. It is normal to wait for a few days if you request 189 | for 2 nodes with 4 GPUs each. At the initial debugging stage, it is definitely crazy if you submit a job script again 190 | and again. Thus, I would highly recommend you to debug using the Jupyter Lab as mentioned above. 191 | 192 | To run experiments on multinode on NSCC, you need to follow these steps. 193 | 194 | ``` 195 | # Use OMPI integrated with PBS 196 | export PATH=/home/app/dgx/openmpi-3.1.3-gnu/bin:$PATH 197 | 198 | # run the script in container 199 | mpirun --mca btl_openib_warn_default_gid_prefix 0 \ 200 | --host dgx4106:1,dgx4105:1 -N 1 --np 2 \ 201 | /opt/singularity/bin/singularity exec --nv \ 202 | /home/project/ai/singularity/nvcr.io/nvidia/pytorch\:latest.sif \ 203 | python 204 | ``` 205 | 206 | For PyTorch users, you can use `horovod` for cross-node communication. However, if you are using `torch.distributed`, 207 | you need to specify the communication interface by setting environment variables `NCCL_SOCKET_IFNAME`. Currently, `enp1s0f1` 208 | is for InfiniBand. You can do `NCCL_SOCKET_IFNAME=enp1s0f1 python your_script.py` or add `os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f1'` 209 | in your Python file. If you are using `gloo` backend, the environment variable will be `GLOO_SOCKET_IFNAME`. This is to 210 | force PyTorch use InfiniBand for communication. Otherwise, the program will be stuck at initialization based on my test. 211 | 212 | ## Dataset and Transfer files 213 | 214 | The ImageNet dataset is placed at `/scratch/users/nus/e0575577/ImageNet` if you are using the shared account. 215 | 216 | Maybe you can find other prepared dataset around this path. 217 | 218 | **Transfer files** 219 | 220 | You can use scp or git clone command, FileZilla or WinSCP, a visible tool, to transfer files. 221 | 222 | ```shell 223 | # scp command 224 | # tested in Git Bash on Windows10 225 | scp G:/globus_share/xxx.tar your_account@aspire.nscc.sg:/your_path/imagenet-tar 226 | ``` 227 | 228 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/cscs.md: -------------------------------------------------------------------------------- 1 | # CSCS 2 | 3 | - [Before Usage](#before-usage ) 4 | - [Interactive Usage](#interactive-usage) 5 | - [Common Commands](#common-commands) 6 | - [PyTorch Job Script Example](#pytorch-job-script-example) 7 | - [Horovod Gloo](#horovod_gloo) 8 | - [Spark](#spark) 9 | - [TensorFlow](#tensorflow) 10 | - [Dataset and Transfer files](#dataset-and-transfer-files) 11 | - [Large Scale Experiment](#large-scale-experiment) 12 | - [DALI](#dali) 13 | - [Question Ticket](#question-ticket) 14 | 15 | ## Before Usage 16 | 17 | **Get Your Account** 18 | 19 | 1. Request the Prof invites you to current CSCS project (each project can only has a very limited members) 20 | 2. Waiting for CSCS administrator approval 21 | 3. 
Receive the invitation e-mail and provide your personal information (your passport PDF is needed) 22 | 4. Waiting for CSCS administrator approval again 23 | 5. Receive account e-mail with password 24 | 25 | According to [CSCS User Regulations](https://www.cscs.ch/services/user-regulations/), **CSCS does not allow sharing of accounts; the applicant will be immediately barred from all present and future use of CSCS facilities and is fully liable for all consequences arising from the infraction if such activity occurs**. 26 | 27 | **Connect the System** 28 | 29 | ssh ela.cscs.ch 30 | 31 | This is just the front end of the system; you cannot access the $SCRATCH file system or submit jobs from it. Then, 32 | 33 | ssh daint.cscs.ch 34 | 35 | You can access the $SCRATCH file system and submit jobs from here. You cannot connect to daint directly without connecting to ela first. 36 | 37 | **Check Project Status** 38 | 39 | You can go to [https://account.cscs.ch](https://account.cscs.ch/) to check the remaining quota hours, as shown below 40 | 41 | ![](../doc_figures/cscs/cscs_project_information.jpg) 42 | 43 | The quota here is the whole lab's quota rather than your own account's, so resources are very limited; please be careful and economical. One node h means occupying one node (one P100 GPU) for one hour, so we need to balance job walltime against the requested resources. In my test, training ImageNet-1k with PyTorch for 90 epochs needs about two to three hundred node h per run. 44 | 45 | ## Interactive Usage 46 | 47 | **Interactive Usage** 48 | 49 | For interactive debugging, unlike other clusters where you use commands, you can access computing resources from your browser through a Jupyter-based user interface. 50 | 51 | 1. Login at https://jupyter.cscs.ch/hub/home 52 | 53 | 2. Right click 'Start My Server' and select 'Open link in new tab' 54 | 55 | ![](../doc_figures/cscs/jupyter_step1.jpg) 56 | 57 | 3. Select the nodes and duration you need. 58 | 59 | Each node has only one P100 GPU, and requesting 4 or more nodes here often fails; please consider using a job.slurm script for large tests. 60 | 61 | ![](../doc_figures/cscs/jupyter_step2.jpg) 62 | 63 | 4. Cancel the interactive resource usage 64 | 65 | Directly closing the JupyterLab will **not** stop the job; it will keep consuming resources in the background. 66 | 67 | Click the refresh button in the upper left corner of the previous page, and you will see 'Stop My Server'. Click it and refresh again; if the page looks like step 2 again, the job has been successfully terminated. 68 | 69 | ![](../doc_figures/cscs/jupyter_step3.jpg) 70 | 71 | **Build Conda Environment** 72 | 73 | After entering JupyterLab, you can build your conda environment just as you would on any other Linux system, e.g. using miniconda. 74 | 75 | Most environments, such as PyTorch, TensorFlow and Spark, can be loaded directly as modules; you only need conda for packages that are not available as modules. 76 | 77 | ## Common Commands 78 | 79 | ```shell 80 | # Go to directory $SCRATCH. 81 | # put and run all job-related files in the $SCRATCH file system. 82 | cd $SCRATCH 83 | 84 | # submit a job script 85 | sbatch /PATH/TO/job.slurm 86 | 87 | # check your job status 88 | squeue -u your_account 89 | 90 | # cancel your job 91 | scancel job_ID 92 | 93 | # view loaded modules 94 | # module related commands need to be run on a computing node 95 | module list 96 | 97 | # view available modules 98 | module available 99 | 100 | # load CUDA 101 | module load daint-gpu 102 | 103 | # activate your conda environment 104 | .
/users/your_account/miniconda3/bin/activate your_env_name 105 | 106 | # run your code 107 | # don't need to select how many nodes or GPUs, default use all available resources that you requested. 108 | srun python your_file.py 109 | ``` 110 | 111 | For more detailed information, please refer to [CSCS User Guide](https://user.cscs.ch/access/running/). 112 | 113 | According to [CSCS Regulations](https://user.cscs.ch/access/running/#slurm-best-practices-on-cscs-cray-systems), **Users are not supposed to submit arbitrary amounts of Slurm jobs and commands at the same time**. 114 | 115 | **It is possible to manually check the output file or log generated from your code or job, and in general, they actually has been finished for a while before the Slurm job finished. So you can cancel the Slurm job in advance manually, which means you can save time waiting for the Slurm job start and finish (because request less walltime) and save our computing resources, especially when you use a lot nodes to parallel.** At least it was so when I tested PyTorch and Spark. 116 | 117 | ## PyTorch Job Script Example 118 | 119 | ``` 120 | #!/bin/bash -l 121 | 122 | #SBATCH --job-name=my_cscs_job 123 | #SBATCH --time=00:10:00 124 | #SBATCH --nodes=2 125 | #SBATCH --ntasks-per-core=1 126 | #SBATCH --ntasks-per-node=1 127 | #SBATCH --cpus-per-task=12 128 | #SBATCH --constraint=gpu 129 | #SBATCH --account= 130 | 131 | module load daint-gpu 132 | module load PyTorch 133 | 134 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 135 | 136 | # Environment variables needed by the NCCL backend 137 | export NCCL_DEBUG=INFO 138 | export NCCL_IB_HCA=ipogif0 139 | export NCCL_IB_CUDA_SUPPORT=1 140 | 141 | . /users/your_account/miniconda3/bin/activate your_env_name 142 | 143 | cd your_code_path 144 | 145 | srun \ 146 | python -u your_code.py \ 147 | --epochs 90 \ 148 | --model resnet50 \ 149 | > ${SLURM_JOBID}.out 2> ${SLURM_JOBID}.err 150 | ``` 151 | 152 | Number of nodes define by --nodes, and each node only has one P100 GPU. 153 | 154 | We can noly use the 'normal' queue, because the project is too small and this kind of projects are not entitled to use the "low' queue. So once we have used our quarterly allocation we can only wait until the next allocation period to start. 155 | 156 | CSCS also provides a general PyTorch guide at https://user.cscs.ch/computing/data_science/pytorch/, but some content maybe not works at least when I tested. 157 | 158 | **Default PyTorch dataloader does not work if you use Horovod for distribution** (maybe because the specific MPI), please **use torch.distributed, Horovod Gloo, or Horovod + DALI** (please refer to DALI or Horovod Gloo section). 159 | 160 | ## Horovod Gloo 161 | 162 | ``` 163 | #!/bin/bash -l 164 | 165 | #SBATCH --job-name=gloo-eb 166 | #SBATCH --time=00:30:00 167 | #SBATCH --nodes=4 168 | #SBATCH --ntasks-per-node=1 169 | #SBATCH --constraint=gpu 170 | #SBATCH --account= 171 | #SBATCH --output=test_pt_hvd_%j.out 172 | 173 | module load daint-gpu PyTorch 174 | 175 | . 
/users/lyongbin/miniconda3/bin/activate your_env_name 176 | 177 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 178 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 179 | 180 | for node in $(scontrol show hostnames); do 181 | HOSTS="$HOSTS$node:$SLURM_NTASKS_PER_NODE," 182 | done 183 | HOSTS=${HOSTS%?} # trim trailing comma 184 | echo HOSTS $HOSTS 185 | 186 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ipogif0 \ 187 | --start-timeout 120 --gloo-timeout-seconds 120 \ 188 | python -u your_code.py \ 189 | --epochs 90 \ 190 | --model resnet50 191 | ``` 192 | 193 | Horovod Gloo works without using MPI, so you can use default PyTorch dataloader as on other clusters. 194 | 195 | ## Spark 196 | 197 | Please refer to https://user.cscs.ch/computing/data_science/spark/, it looks like they have fixed the bugs here I encountered before. 198 | 199 | Please note that the example in above link use CPU nodes, while our project only can access GPU nodes when I used, so we need appropriate modify it, such as SPARK_WORKER_CORES=12 and module load. 200 | 201 | In the example, the version number of spark-examples_2.11-2.4.7.jar maybe inappropriate. If you meet 'ERROR: file doesn't exist' , please use ‘module available’ to check the Spark location, and go to that path to find the right jar name. 202 | 203 | When generate yourself jar, please do not include packages in your local environment into the jar file. 204 | 205 | They also don't have HDFS on Piz Daint, we need put our data directly on $SCRATCH file system. 206 | 207 | ## TensorFlow 208 | 209 | Please refer to https://user.cscs.ch/computing/data_science/tensorflow/, I have test the examples here. 210 | 211 | ## Dataset and Transfer files 212 | 213 | **Dataset** 214 | 215 | Each user only has 1M files quota on Piz Daint $SCRATCH, users with over 1 million files and folders will be warned at submit time and will not be able to submit new jobs. 216 | 217 | So for large dataset like ImageNet-1K (ILSVRC2012), which has about 1.4M, we cannot use it as what we do in other cluster. Although the administrator temporarily raised the limit of another member, I was denied when I requested the same operation : ( 218 | 219 | You can access the ImageNet-1K (ILSVRC2012) at following path temporarily: 220 | 221 | ```shell 222 | --train-dir /scratch/snx3000/hliu/imagenet/train 223 | --val-dir /scratch/snx3000/hliu/imagenet/val 224 | ``` 225 | 226 | The long-term method is to use NVIDIA DALI with TFRecord, please refer to DALI section. 227 | 228 | If you need use other large dataset has more than 1M files, you need to convert them to TFRecord or LMDB format to reduce the number of files occupied, which maybe need use the appropriate dataloader when use them. 229 | 230 | **Transfer files** 231 | 232 | Unlike other cluster where we can use scp command, FileZilla or WinSCP, on CSCS, we cannot connect the daint.cscs.ch $SCRATCH file system directly. 233 | 234 | You can use first transfer them to $HOME, then cp them to \$SCRATCH, or use git clone. 235 | 236 | **If you use your own personal account**( login need google account and it's pretty wired if one CSCS account use multiple google account at different IP), a better way is to use globus to transfer files, which actually works like FileZilla or WinSCP and can connect the daint.cscs.ch $SCRATCH file system directly. 
Please refer to https://user.cscs.ch/storage/transfer/external/ 237 | 238 | You need to install the Globus personal client on your local machine and set a visible path. 239 | 240 | ![](../doc_figures/cscs/globus.jpg) 241 | 242 | 243 | 244 | ## DALI 245 | 246 | NVIDIA DALI can accelerate data loading and pre-processing by using the GPU rather than the CPU, although with a GPU memory tradeoff. 247 | 248 | It can also avoid some potential conflicts between MPI libraries and Horovod on some GPU clusters. 249 | 250 | **Install** 251 | 252 | ```shell 253 | module load daint-gpu PyTorch 254 | . /users/your_account/miniconda3/bin/activate your_env_name 255 | pip install tensorboard tqdm 256 | pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110 257 | ``` 258 | 259 | **Usage** 260 | 261 | You need to replace the default PyTorch dataloader with a DALI dataloader; I provide a PyTorch DALI example using ImageNet-1k [here](https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch). 262 | 263 | DALI requires data in *TFRecord format* in the following structure: 264 | 265 | ``` 266 | train-recs 'path/train/*' 267 | val-recs 'path/validation/*' 268 | train-idx 'path/idx_files/train/*' 269 | val-idx 'path/idx_files/validation/*' 270 | ``` 271 | 272 | On CSCS, if you want to use the ImageNet-1k TFRecord data, you can directly use data-dir=/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/ 273 | 274 | About the parameters of DALI: 275 | 276 | - *prefetch_queue_depth* and *num_threads* might also be something to explore, as they can speed up your loading a lot, with some memory tradeoff. 277 | - *last_batch_policy*: you probably want PARTIAL for validation and DROP during training: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html?highlight=last_batch_policy, just as set in the example linked above. 278 | - *device*: the example linked above uses device="mixed"/"gpu"; for ImageNet-1k on a GPU with 16 GB, the default PyTorch dataloader allows batch size 128, while DALI can only use batch size 64. If you change device from "mixed"/"gpu" to "cpu", no extra GPU memory is needed; however, copying directly to the GPU makes loading much faster. 279 | 280 | ## Question Ticket 281 | 282 | If you have a specific question, you can send a question ticket to the CSCS staff. (Do not send them an email; although the CSCS User Guide says so, the e-mail method has actually been retired.) 283 | 284 | Go to https://jira.cscs.ch/plugins/servlet/desk/site/global , click 'Generic request' under 'Open a case', and describe the problem you encountered together with the location of the relevant code and log files. The staff will answer it soon during working hours on European working days. -------------------------------------------------------------------------------- /docs/hpc_system_notes/tacc_frontera.md: -------------------------------------------------------------------------------- 1 | # TACC-Frontera 2 | 3 | - [Build Virtualenv Environment ](#build-virtualenv-environment ) 4 | - [Common Commands](#common-commands) 5 | - [Job Script Example](#job-script-example) 6 | - [Horovod Gloo](#horovod_gloo) 7 | - [Dataset and Transfer files](#dataset-and-transfer-files) 8 | - [Large Scale Experiment](#large-scale-experiment) 9 | - [DALI](#dali) 10 | - [Potential Error](#potential-error) 11 | - [Question Ticket](#question-ticket) 12 | ## Build Virtualenv Environment 13 | 14 | According to the TACC staff's personal instructions, we should **use Python virtualenv instead of Conda** to build the environment on TACC-Frontera.
15 | 16 | ```shell 17 | # interactive usage 18 | idev -p rtx-dev -N 1 -n 4 -t 02:00:00 19 | 20 | # bulid Python virtualenv 21 | cd ~ 22 | mkdir python-env 23 | cd python-env 24 | # you can use other names 25 | virtualenv cuda10-home 26 | 27 | # active environment 28 | source ~/python-env/cuda10-home/bin/activate 29 | 30 | # deactivate environment 31 | deactivate  32 | 33 | # you can use pip to install packages 34 | ``` 35 | 36 | Build Horovod with Pytorch. 37 | 38 | ```shell 39 | # login in computing nodes and virtualenv 40 | idev -p rtx-dev -N 1 -n 4 -t 02:00:00 41 | 42 | # load modules (cannot use default intel/19 impi/19) 43 | module load intel/18.0.5 impi/18.0.5 44 | module load cuda/10.1 cudnn nccl 45 | 46 | source ~/python-env/cuda10-home/bin/activate 47 | 48 | # install Pytorch, example 1.5.1 49 | pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html --force-reinstall 50 | 51 | # bulid Horovod, maybe spend ten minutes 52 | HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install horovod --no-cache-dir --force-reinstall 53 | ``` 54 | 55 | **Default PyTorch dataloader does not work if you use Horovod for distribution directly** (maybe because the specific MPI), please **use torch.distributed, Horovod Gloo, or Horovod + DALI** (please refer to DALI or Horovod Gloo section). 56 | 57 | ## Common Commands 58 | 59 | ```shell 60 | # To directory $SCRATCH. The default directory is $HOME when you login in. 61 | cd $SCRATCH 62 | 63 | # interactive usage 64 | idev -p rtx-dev -N 1 -n 4 -t 02:00:00 65 | 66 | # submit a job script 67 | sbatch /PATH/TO/job.slurm 68 | 69 | # check your job status 70 | squeue -u your_account 71 | 72 | # view requested nodes ID 73 | squeue | grep your_account 74 | 75 | # view nodes local file space 76 | ssh c196-032 df 77 | 78 | # delete your job 79 | scancel job_ID 80 | 81 | # display estimated start time for job 82 | squeue --start -j job_ID 83 | 84 | # view loaded module 85 | # module related commands need at computing node 86 | module list 87 | 88 | # view avail module 89 | module avail 90 | 91 | # active your environment and load cuda components 92 | # you need run it after you login in computing nodes everytime 93 | source ~/python-env/cuda10-home/bin/activate 94 | module load cuda/10.1 cudnn nccl 95 | # module load cuda/10.0 cudnn nccl 96 | # module load cuda/11.0 cudnn/8.0.5 nccl/2.8.3 97 | 98 | # run your code, n is the number of total GPUs 99 | ibrun -np 4 python pytorch_imagenet_resnet.py 100 | ``` 101 | 102 | For more detailed information, please refer to [TACC Frontera User Guide](https://frontera-portal.tacc.utexas.edu/user-guide/quickstart/). 
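Before queueing long jobs, it can help to sanity-check that the Horovod you just built actually sees NCCL and all four GPUs on a node. Below is a minimal sketch (not part of this repo); save it under any name, e.g. `check_hvd.py`, and run it with `ibrun -np 4 python check_hvd.py`:

```python
import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU under ibrun
torch.cuda.set_device(hvd.local_rank())  # pin each rank to its own GPU

# a tiny allreduce exercises NCCL across all ranks
x = torch.ones(1, device="cuda") * hvd.rank()
avg = hvd.allreduce(x, name="sanity_check")

print(f"rank {hvd.rank()}/{hvd.size()} on GPU {hvd.local_rank()}: allreduce -> {avg.item()}")
```

If every rank prints the same averaged value, the NCCL build and GPU pinning are working; if the script hangs, re-check the module and virtualenv setup above.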
103 | 104 | ## Job Script Example 105 | 106 | ``` 107 | #!/bin/sh 108 | 109 | #SBATCH -J myjob # Job name 110 | #SBATCH -o myjob.o%j # Name of stdout output file 111 | #SBATCH -e myjob.e%j # Name of stderr error file 112 | #SBATCH -p rtx # Queue (partition) name 113 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 114 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 115 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 116 | #SBATCH --mail-user=your email address 117 | #SBATCH --mail-type=all # Send email at begin and end of job 118 | 119 | pwd 120 | date 121 | 122 | cd your_code_path 123 | 124 | source ~/python-env/cuda10-home/bin/activate 125 | module load intel/18.0.5 impi/18.0.5 126 | module load cuda/10.1 cudnn nccl 127 | 128 | ibrun -np 8 \ 129 | python your_code.py 130 | ``` 131 | 132 | You should use ibrun instead of mpirun or mpiexec here, and -np means total GPU numbers. 133 | 134 | Note: Although in Frontera tutorial, they use "#SBATCH -A myproject # Allocation name", you can actually delete it. If you incorrect set it, you will meet permission error when submit job. 135 | 136 | From 6 April 2021, for Frontera jobs, a new queue named *small* has been created specifically for one and two node jobs. Jobs of one or two nodes that will run for up to 48 hours should be submitted to the *small* queue. The *normal* queue now has a lower limit of 3 nodes for all jobs. This should improve the turnaround time for jobs in the *normal* queue and small jobs in the *small* queue. 137 | 138 | ## Horovod Gloo 139 | 140 | ``` 141 | #!/bin/sh 142 | 143 | #SBATCH -J myjob # Job name 144 | #SBATCH -o myjob-t.o%j # Name of stdout output file 145 | #SBATCH -e myjob-t.e%j # Name of stderr error file 146 | #SBATCH -p rtx # Queue (partition) name 147 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 148 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 149 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 150 | 151 | pwd 152 | date 153 | 154 | source ~/python-env/cuda10-home/bin/activate 155 | 156 | module load intel/18.0.5 impi/18.0.5 157 | module load cuda/10.1 cudnn nccl 158 | 159 | cd /file_path/ 160 | 161 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 162 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 163 | 164 | # 4 means $SLURM_NTASKS_PER_NODE 165 | for node in $(scontrol show hostnames); do 166 | HOSTS="$HOSTS$node:4," 167 | done 168 | HOSTS=${HOSTS%?} # trim trailing comma 169 | echo HOSTS $HOSTS 170 | 171 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ib0 \ 172 | --start-timeout 120 --gloo-timeout-seconds 120 \ 173 | python you_file.py \ 174 | --epochs 90 \ 175 | --model resnet50 176 | ``` 177 | 178 | Horovod Gloo works without using MPI, so you can use default PyTorch dataloader as on other clusters. 179 | 180 | ## Dataset and Transfer files 181 | 182 | **Dataset** 183 | 184 | ```shell 185 | cp /scratch1/00946/zzhang/imagenet/imagenet-1k.tar /your/path 186 | tar xf imagenet-1k.tar 187 | 188 | --train-dir=/your/path/ILSVRC2012_img_train/ 189 | --val-dir=/your/path/ILSVRC2012_img_val/ 190 | ``` 191 | 192 | You can get a prepared ImageNet-1K (ILSVRC2012) use above command. 193 | 194 | Maybe you can find other prepared dataset around this path. 195 | 196 | **Transfer files** 197 | 198 | You can use scp or git clone command, or WinSCP, a visible tool can input TACC token, to transfer files. 
## Large Scale Experiment

(You can also consider DALI instead of the approach in this section.)

If a job uses many nodes (e.g. 32 nodes with 128 GPUs) and reads a large dataset (e.g. ImageNet) directly from disk, it puts huge I/O pressure on the shared file system, which may get the job killed by the system.

Unlike TACC-Longhorn, where /tmp can hold ImageNet directly, the /tmp file system on TACC-Frontera is only around 100 GB, which is not enough to copy and extract the whole ImageNet.

Since we can use at most 16 nodes with 64 GPUs, it may still be acceptable to read data from $SCRATCH, which is the simplest option. The following method is fairly tricky and only works for ImageNet.

According to instructions from TACC staff, there is a tool called FanStore that can load the preprocessed ImageNet directly into the /tmp file system.

Preprocessed ImageNet

```shell
# copy the preprocessed ImageNet-1K (ILSVRC2012), which is a folder of binary files
cp -r /scratch1/00946/zzhang/imagenet/imagenet-16parts /your/path

# imagenet-tiny has the same structure as ImageNet-1K (ILSVRC2012) but is much smaller
cp -r /scratch1/00946/zzhang/imagenet/imagenet-tiny-16parts /your/path
```

Job Script Example

```shell
source ~/python-env/cuda10-home/bin/activate

# load the data
# In interactive usage you do not need "& sleep 500"; when you see "Ready" in the output,
# press Ctrl+Z and then run "bg 1".
# Sometimes this step may get stuck.
export FS_ROOT=/tmp/fs_`id -u`
ibrun -np 2 /work/00410/huang/share/read_remote_file 16 /scratch1/07801/nusbin20/imagenet-16parts & sleep 500

# load modules
module load cuda/10.1 cudnn nccl

cd /scratch1/07801/nusbin20/tacc-our

# read_remote_file and wrapper.so are binaries provided by TACC staff, so their internals are not documented here
ibrun -np 8 LD_PRELOAD=/work/00410/huang/share/wrapper.so python pytorch_imagenet_resnet.py --epochs 90 --model resnet50 --batch-size 128 --train-dir=/tmp/fs_871009/ILSVRC2012_img_train --val-dir=/tmp/fs_871009/ILSVRC2012_img_val
```

Run `id -u` to find your user ID and replace `871009` above with it.
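Rather than hard-coding the numeric user ID, the same launch can reuse the `$FS_ROOT` variable exported above (an untested variation of the command; everything else is unchanged):

```shell
# FS_ROOT was exported as /tmp/fs_<uid> above, so reuse it instead of the literal ID
ibrun -np 8 LD_PRELOAD=/work/00410/huang/share/wrapper.so python pytorch_imagenet_resnet.py \
    --epochs 90 --model resnet50 --batch-size 128 \
    --train-dir=$FS_ROOT/ILSVRC2012_img_train --val-dir=$FS_ROOT/ILSVRC2012_img_val
```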
## DALI

NVIDIA DALI can accelerate data loading and pre-processing by using the GPU rather than the CPU, at the cost of some GPU memory.

It can also avoid some potential conflicts between MPI libraries and Horovod on some GPU clusters.

On TACC-Frontera, DALI with PyTorch + Horovod only works on a single node with 4 GPUs, presumably because of the cluster's specific MPI; with multiple nodes it still does not work.

**Install**

Build DALI with PyTorch, Horovod, and CUDA 11.0.

```shell
# log in to a compute node and activate the virtualenv
idev -p rtx-dev -N 1 -n 4 -t 02:00:00

source ~/python-env/cuda110-home/bin/activate

# load modules (the default intel/19 impi/19 cannot be used)
module load intel/18.0.5 impi/18.0.5
module load cuda/11.0 cudnn nccl

# install PyTorch, e.g. 1.7.1
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html --force-reinstall

# build Horovod; this may take around ten minutes
HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install horovod --no-cache-dir --force-reinstall

# install DALI
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110
# pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda100

# installing with --user puts packages in a different location
# export PYTHONPATH=~/.local/lib/python3.7/site-packages
```

**Usage**

You need to replace the default PyTorch dataloader with a DALI dataloader. A PyTorch DALI example using ImageNet-1k is provided [here](https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch). This example has been tested with nvidia-dali-cuda110 and may need some changes if you use it with CUDA 10.

DALI requires the data in *TFRecord format* with the following structure:

```
train-recs 'path/train/*'
val-recs 'path/validation/*'
train-idx 'path/idx_files/train/*'
val-idx 'path/idx_files/validation/*'
```

On Frontera, if you want to use the ImageNet-1k TFRecord data, you can directly use

```shell
# TFRecord directory
data-dir=/scratch1/07801/nusbin20/ILSVRC2012_1k_TFRecord/

# TFRecord format tar file
/scratch1/07801/nusbin20/imagenet-tar/ILSVRC2012_1k_TFRecord.tar
```

About the DALI parameters:

- *prefetch_queue_depth* and *num_threads* are worth exploring, since they can speed up loading a lot at the cost of some memory.
- *last_batch_policy*: you probably want PARTIAL for validation and DROP during training (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html?highlight=last_batch_policy), as set in the example linked above.
- *device*: the example linked above uses device="mixed"/"gpu". For ImageNet-1k on a 16 GB GPU, the default PyTorch dataloader allows batch size 128 while DALI can only use batch size 64. If you change device="mixed"/"gpu" to "cpu", no extra GPU memory is needed, but copying directly to the GPU makes loading much faster.

## Potential Error

- **Program works well on one node but gets stuck right after it starts running on two nodes.**

For a PyTorch program with Horovod, check `num_workers` in `torch.utils.data.DataLoader`; you may need to set it to 0 to force synchronous I/O, because there are potential conflicts between Horovod and asynchronous I/O on TACC-Frontera. In addition, you may not be able to use `torch.multiprocessing.set_start_method('spawn')`. Setting `num_workers=0` makes the program run but seriously slows it down, so you are better off using torch.distributed rather than Horovod on TACC-Frontera.
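As a rough sketch, a single-node torch.distributed run can be started with the stock PyTorch launcher (the script name is a placeholder; multi-node runs additionally need `--nnodes`, `--node_rank`, `--master_addr`, and `--master_port`, along with the init_method discussed below):

```shell
# launch 4 processes on one node; the script must accept the --local_rank argument
# that the launcher passes to each process
python -m torch.distributed.launch --nproc_per_node=4 your_code.py
```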
For a PyTorch program with torch.distributed, check your `init_method`. For shared file-system initialization, you need to point at a file that does **not** yet exist, e.g. init_method='file:///mnt/nfs/sharedfile'; you may need to delete a stale file manually. For TCP initialization:

```python
# MASTER_ADDR is the host address
# MASTER_PORT is an unused port on that host
# init_method = 'tcp://MASTER_ADDR:MASTER_PORT'

# example under SLURM
master_addr = os.getenv('SLURM_LAUNCH_NODE_IPADDR')
master_port = '6006'
init_method = 'tcp://' + master_addr + ':' + master_port
```

- **torch.distributed.new_group(ranks=ranks) must be called by all processes**

Wrong example

```python
if rank == 0:
    torch.distributed.new_group(ranks=ranks_1)
else:
    torch.distributed.new_group(ranks=ranks_2)
```

Correct example

```python
global _DATA_PARALLEL_GROUP
assert _DATA_PARALLEL_GROUP is None, \
    'data parallel group is already initialized'
for i in range(model_parallel_size):
    ranks = range(i, world_size, model_parallel_size)
    group = torch.distributed.new_group(ranks)
    if i == (rank % model_parallel_size):
        _DATA_PARALLEL_GROUP = group
```

## Question Ticket

If you have specific questions, you can send them to the [TACC Longhorn Help Desk](https://portal.tacc.utexas.edu/user-guides/longhorn#help-desk).