├── scripts ├── nscc │ ├── multi_node │ │ ├── experiments │ │ │ ├── scripts │ │ │ │ ├── sshcont │ │ │ │ │ ├── testnodes │ │ │ │ │ ├── testnodes_dgx │ │ │ │ │ ├── invocation │ │ │ │ │ ├── container_slaves_tensorflow.sh │ │ │ │ │ ├── tensorflow.sh │ │ │ │ │ ├── job_tensorflow_gloo.sh │ │ │ │ │ └── ssh_and_run.py │ │ │ │ └── dropbear │ │ │ │ │ ├── dropbear_dss_host_key │ │ │ │ │ ├── dropbear_rsa_host_key │ │ │ │ │ ├── dropbear_ecdsa_host_key │ │ │ │ │ ├── dropbear_ed25519_host_key │ │ │ │ │ └── start-dropbear.sh │ │ │ ├── test │ │ │ │ ├── test_mpi.sh │ │ │ │ └── comm.py │ │ │ └── apps │ │ │ │ └── software │ │ │ │ └── dropbear │ │ │ │ └── 2020.80 │ │ │ │ ├── bin │ │ │ │ ├── dbclient │ │ │ │ ├── dropbearkey │ │ │ │ └── dropbearconvert │ │ │ │ ├── sbin │ │ │ │ └── dropbear │ │ │ │ └── share │ │ │ │ └── man │ │ │ │ ├── man1 │ │ │ │ ├── dropbearconvert.1 │ │ │ │ ├── dropbearkey.1 │ │ │ │ └── dbclient.1 │ │ │ │ └── man8 │ │ │ │ └── dropbear.8 │ │ └── pbs_script │ │ │ ├── start_jupyter.sh │ │ │ └── multinode-jupyter-non-dev.pbs │ └── monitor_job │ │ ├── monitor.sh │ │ └── nscc_monitor.py ├── cscs │ ├── cscs-pytorch.slurm │ ├── cscs-spark.slurm │ └── cscs-pytorch-gloo.slurm ├── tacc_longhorn │ ├── multicard-node-longhorn.slurm │ ├── cp_imagenet_to_temp_bin.sh │ └── multicard-node-longhorn-large.slurm └── tacc_frontera │ ├── frontera-test.slurm │ ├── frontera-btest.slurm │ └── frontera-gloo.slurm ├── docs ├── doc_figures │ ├── cscs │ │ ├── globus.jpg │ │ ├── jupyter_step1.jpg │ │ ├── jupyter_step2.jpg │ │ ├── jupyter_step3.jpg │ │ └── cscs_project_information.jpg │ └── nscc │ │ └── NSCC_jupyter.png ├── initialize │ ├── cluster_software.md │ └── vm_software.md ├── get_started │ ├── hardware_keywords.md │ └── sysinfo.md ├── hpc_system_notes │ ├── nus_apollo.md │ ├── general.md │ ├── tacc_longhorn.md │ ├── nscc.md │ ├── cscs.md │ └── tacc_frontera.md └── resources │ └── dl_libraries.md ├── README.md └── .gitignore /scripts/nscc/multi_node/experiments/scripts/sshcont/testnodes: -------------------------------------------------------------------------------- 1 | ntu01 2 | ntu02 3 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/testnodes_dgx: -------------------------------------------------------------------------------- 1 | dgx4106 2 | dgx4105 3 | -------------------------------------------------------------------------------- /docs/doc_figures/cscs/globus.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/globus.jpg -------------------------------------------------------------------------------- /docs/doc_figures/cscs/jupyter_step1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/jupyter_step1.jpg -------------------------------------------------------------------------------- /docs/doc_figures/cscs/jupyter_step2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/jupyter_step2.jpg -------------------------------------------------------------------------------- /docs/doc_figures/cscs/jupyter_step3.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/jupyter_step3.jpg -------------------------------------------------------------------------------- /docs/doc_figures/nscc/NSCC_jupyter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/nscc/NSCC_jupyter.png -------------------------------------------------------------------------------- /docs/doc_figures/cscs/cscs_project_information.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/docs/doc_figures/cscs/cscs_project_information.jpg -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/test/test_mpi.sh: -------------------------------------------------------------------------------- 1 | python /home/users/ntu/c170166/scratch/projects/dl-auto-load-balance/auto-ml-load-balance/scripts/nscc/jupyter/mpi_testing/comm.py -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_dss_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_dss_host_key -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_rsa_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_rsa_host_key -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ecdsa_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ecdsa_host_key -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dbclient: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dbclient -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/sbin/dropbear: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/sbin/dropbear -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ed25519_host_key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/scripts/dropbear/dropbear_ed25519_host_key -------------------------------------------------------------------------------- 
/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearkey: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearkey -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearconvert: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NUS-HPC-AI-Lab/oh-my-server/HEAD/scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/bin/dropbearconvert -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/dropbear/start-dropbear.sh: -------------------------------------------------------------------------------- 1 | ROOT="${HOME}/scripts/dropbear" 2 | 3 | exec "${HOME}/apps/software/dropbear/2020.80/sbin/dropbear" -r "${ROOT}/dropbear_dss_host_key" \ 4 | -r "${ROOT}/dropbear_ecdsa_host_key" -r "${ROOT}/dropbear_ed25519_host_key" \ 5 | -r "${ROOT}/dropbear_rsa_host_key" -FE -p 41017 6 | -------------------------------------------------------------------------------- /scripts/nscc/monitor_job/monitor.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | job=${1:-""} 4 | email=${2:-"c170166@e.ntu.edu.sg"} 5 | interval=${3:-"600"} 6 | 7 | 8 | source ~/anaconda3/bin/activate base 9 | nohup python /home/users/ntu/c170166/pbs_scripts/job_monitor/nscc_monitor.py --job $job --email $email --interval $interval > ./nohup.log 2>&1 & 10 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/invocation: -------------------------------------------------------------------------------- 1 | 2 | # load dropbear: add dropbear bin and sbin to system path 3 | export PATH=$HOME/apps/software/dropbear/2020.80/bin:$HOME/apps/software/dropbear/2020.80/sbin:$PATH 4 | 5 | echo $PATH 6 | 7 | ~/scripts/sshcont/ssh_and_run.py ~/scripts/sshcont/tensorflow.sh ~/scripts/sshcont/job_tensorflow_gloo.sh 8 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/pbs_script/start_jupyter.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /home/users/ntu/c170166/anaconda3/bin/activate base 4 | DATESTAMP=`date +'%y%m%d%H%M%S'` 5 | LOGFILE=./jupyter.o$DATESTAMP 6 | jupyter lab --no-browser --ip=0.0.0.0 --port=8888 \ 7 | --NotebookApp.terminado_settings='{"shell_command": ["/bin/bash"]}' \ 8 | --NotebookApp.allow_remote_access=True --FileContentsManager.delete_to_trash=False |& tee $LOGFILE 9 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/container_slaves_tensorflow.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | # replace with your favourite container image 3 | # IMAGE_NAME="/home/projects/ai/singularity/nvcr.io/nvidia/tensorflow:20.02-tf1-py3.sif" 4 | IMAGE_NAME="/home/project/ai/singularity/nvcr.io/nvidia/pytorch:latest-py3.sif" 5 | pbs-attach "${PBS_JOBID}" 6 | singularity run --nv ${NTUHPC_CONT_EXTRA_ARGS} -e "${IMAGE_NAME}" "${HOME}/scripts/dropbear/start-dropbear.sh" 7 | 
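# (Added notes -- a sketch of how this script fits together, based on the other files in scripts/sshcont and scripts/dropbear:)
# - It is launched once per node by tensorflow.sh via mpirun, with NTUHPC_CONT_EXTRA_ARGS exported there.
# - pbs-attach is assumed to attach this process to the running PBS job on the node.
# - The singularity command then starts a Dropbear SSH server (start-dropbear.sh, port 41017) inside the
#   container, giving ssh_and_run.py and horovodrun an SSH endpoint on every worker node.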
-------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/test/comm.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import os 3 | import horovod.torch as hvd 4 | import socket 5 | 6 | hvd.init() 7 | 8 | os.environ['MASTER_ADDR']="localhost" 9 | os.environ['MASTER_PORT']="6002" 10 | 11 | hostname = socket.gethostname() 12 | 13 | # rank = os.getenv('OMPI_COMM_WORLD_RANK', '0') 14 | rank = hvd.local_rank() 15 | world = hvd.size() 16 | # world = os.getenv("OMPI_COMM_WORLD_SIZE", '1') 17 | 18 | print("rank: {}, world size: {}, hostname: {}".format(rank, world, hostname)) 19 | print("trying to initliaze dist") 20 | torch.distributed.init_process_group( 21 | backend="nccl", 22 | world_size=world, rank=rank) 23 | print("init successful on hostname: {}".format(hostname)) -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/tensorflow.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT 3 | 4 | IMAGE_NAME="/home/project/ai/singularity/nvcr.io/nvidia/pytorch:latest-py3.sif" 5 | export NTUHPC_CONT_EXTRA_ARGS="--bind ${NTUHPC_SSH_DIR}:$HOME/.ssh" 6 | echo "Binding ${NTUHPC_CONT_EXTRA_ARGS}" 7 | 8 | pbsdsh hostname 9 | `which mpirun` --bind-to none --tag-output -x NTUHPC_CONT_EXTRA_ARGS \ 10 | -H "${NTUHPC_OPENMPI_HOSTSPEC}" -N 1 -np "${NTUHPC_OPENMPI_HOSTCOUNT}" \ 11 | "${HOME}/scripts/sshcont/container_slaves_tensorflow.sh" & 12 | 13 | echo "Waiting 180s for SSH servers to be up" 14 | sleep 180s 15 | echo "Logging in" 16 | 17 | echo $@ 18 | singularity run --nv ${NTUHPC_CONT_EXTRA_ARGS} "${IMAGE_NAME}" $@ 19 | -------------------------------------------------------------------------------- /scripts/cscs/cscs-pytorch.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | 3 | #SBATCH --job-name=my_cscs_job 4 | #SBATCH --time=00:10:00 5 | #SBATCH --nodes=2 6 | #SBATCH --ntasks-per-core=1 7 | #SBATCH --ntasks-per-node=1 8 | #SBATCH --cpus-per-task=12 9 | #SBATCH --constraint=gpu 10 | #SBATCH --account= 11 | 12 | module load daint-gpu 13 | module load PyTorch 14 | 15 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 16 | 17 | # Environment variables needed by the NCCL backend 18 | export NCCL_DEBUG=INFO 19 | export NCCL_IB_HCA=ipogif0 20 | export NCCL_IB_CUDA_SUPPORT=1 21 | 22 | . 
/users/your_account/miniconda3/bin/activate pt38 23 | 24 | cd your_code_path 25 | 26 | srun \ 27 | python -u your_code.py \ 28 | --epochs 90 \ 29 | --model resnet50 \ 30 | > ${SLURM_JOBID}.out 2> ${SLURM_JOBID}.err 31 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/job_tensorflow_gloo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -li 2 | 3 | # DATESTAMP=`date +'%y%m%d%H%M%S'` 4 | 5 | # Edit this 6 | RUN_SCRIPT=/home/users/ntu/c170166/scratch/projects/dl-auto-load-balance/auto-ml-load-balance/scripts/nscc/jupyter/mpi_testing/test_mpi.sh 7 | # RUN_SCRIPT=/home/users/ntu/c170166/scratch/projects/dl-auto-load-balance/auto-ml-load-balance/scripts/nscc/jupyter/pretrain_bert_mpi.sh 8 | 9 | # Clear environment and run 10 | env -i - "/home/users/ntu/c170166/.local/bin/horovodrun" \ 11 | -H "${NTUHPC_OPENMPI_HOSTSPEC}" -np "${NTUHPC_OPENMPI_SLOTCOUNT}" --gloo $RUN_SCRIPT 12 | 13 | 14 | #env -i - "/home/users/ntu/c170166/.local/bin/horovodrun" \ 15 | # -H "${NTUHPC_OPENMPI_HOSTSPEC}" -np "${NTUHPC_OPENMPI_SLOTCOUNT}" --gloo $GLUE_SCRIPT 16 | -------------------------------------------------------------------------------- /scripts/tacc_longhorn/multicard-node-longhorn.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob.o%j # Name of stdout output file 5 | #SBATCH -e myjob.e%j # Name of stderr error file 6 | #SBATCH -p v100 # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | cd /scratch/07801/nusbin20/tacc-our 18 | module load conda 19 | conda activate py36pt 20 | 21 | ibrun -np 8 \ 22 | python examples/pytorch_imagenet_resnet.py \ 23 | --epochs 90 \ 24 | --model resnet50 25 | -------------------------------------------------------------------------------- /scripts/tacc_longhorn/cp_imagenet_to_temp_bin.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copy imagenet tar file to /tmp/ and extract. 4 | # Use --tiny to copy a smaller version of imagenet instead. 5 | # Produces directories: /tmp/imagenet/ILSVRC2012_img_{train,val} 6 | 7 | tiny=${tiny:-false} 8 | 9 | while [ $# -gt 0 ]; do 10 | if [[ $1 == *"--"* ]]; then 11 | param="${1/--/}" 12 | declare $param=true 13 | fi 14 | shift 15 | done 16 | 17 | DEST=/tmp 18 | if [ "$tiny" == true ]; then 19 | SOURCE=/scratch/07801/nusbin20/imagenet-tar/imagenet-tiny.tar 20 | else 21 | SOURCE=/scratch/07801/nusbin20/imagenet-tar/imagenet-1k.tar 22 | fi 23 | 24 | echo Copying $SOURCE to $DEST 25 | cp $SOURCE $DEST/imagenet.tar 26 | echo "Done copying. Extracting" 27 | mkdir $DEST/imagenet 28 | tar xf $DEST/imagenet.tar -C $DEST/imagenet 29 | echo "Done." 
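# (Added note) Usage sketch -- run once per node, e.g. via "mpiexec -hostfile /tmp/hostfile -N 1"
# as in multicard-node-longhorn-large.slurm:
#   bash cp_imagenet_to_temp_bin.sh          # copy and extract the full ImageNet tar
#   bash cp_imagenet_to_temp_bin.sh --tiny   # copy and extract the smaller imagenet-tiny tar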
30 | -------------------------------------------------------------------------------- /scripts/tacc_frontera/frontera-test.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob.o%j # Name of stdout output file 5 | #SBATCH -e myjob.e%j # Name of stderr error file 6 | #SBATCH -p rtx # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:10:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | source ~/python-env/cuda10-home/bin/activate 18 | 19 | cd /scratch1/07801/nusbin20/tacc-our 20 | module load intel/18.0.5 impi/18.0.5 21 | module load cuda/10.1 cudnn nccl 22 | 23 | ibrun -np 8 \ 24 | python pytorch_imagenet_resnet.py \ 25 | --epochs 90 \ 26 | --model resnet50 27 | 28 | -------------------------------------------------------------------------------- /scripts/cscs/cscs-spark.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name="spark" 4 | #SBATCH --time=00:30:00 5 | #SBATCH --nodes=4 6 | #SBATCH --constraint=gpu 7 | #SBATCH --output=sparkjob.%j.log 8 | #SBATCH --account= 9 | 10 | # set some variables for Spark 11 | export SPARK_WORKER_CORES=12 12 | export SPARK_LOCAL_DIRS="/tmp" 13 | 14 | # load modules 15 | module load slurm 16 | module load daint-gpu 17 | module load Spark 18 | 19 | # deploy of spark 20 | start-all.sh 21 | 22 | # some extra Spark configuration 23 | SPARK_CONF=" 24 | --conf spark.default.parallelism=10 25 | --conf spark.executor.cores=8 26 | --conf spark.executor.memory=15g 27 | " 28 | 29 | # submit a Spark job 30 | spark-submit ${SPARK_CONF} --master $SPARKURL \ 31 | --class org.apache.spark.examples.SparkPi \ 32 | $EBROOTSPARK/examples/jars/spark-examples_2.11-2.4.7.jar 10000; 33 | 34 | # clean out Spark deployment 35 | stop-all.sh -------------------------------------------------------------------------------- /scripts/cscs/cscs-pytorch-gloo.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | 3 | #SBATCH --job-name=gloo-eb 4 | #SBATCH --time=00:30:00 5 | #SBATCH --nodes=4 6 | #SBATCH --ntasks-per-node=1 7 | #SBATCH --constraint=gpu 8 | #SBATCH --account= 9 | #SBATCH --output=test_pt_hvd_%j.out 10 | 11 | module load daint-gpu PyTorch 12 | 13 | . 
/users/lyongbin/miniconda3/bin/activate your_env_name 14 | 15 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 16 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 17 | 18 | for node in $(scontrol show hostnames); do 19 | HOSTS="$HOSTS$node:$SLURM_NTASKS_PER_NODE," 20 | done 21 | HOSTS=${HOSTS%?} # trim trailing comma 22 | echo HOSTS $HOSTS 23 | 24 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ipogif0 \ 25 | --start-timeout 120 --gloo-timeout-seconds 120 \ 26 | python -u your_code.py \ 27 | --epochs 90 \ 28 | --model resnet50 29 | -------------------------------------------------------------------------------- /scripts/tacc_longhorn/multicard-node-longhorn-large.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob.o%j # Name of stdout output file 5 | #SBATCH -e myjob.e%j # Name of stderr error file 6 | #SBATCH -p v100 # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | cd /scratch/07801/nusbin20/tacc-our 18 | module load conda 19 | conda activate py36pt 20 | 21 | scontrol show hostnames $SLURM_NODELIST > /tmp/hostfile 22 | 23 | cat /tmp/hostfile 24 | 25 | mpiexec -hostfile /tmp/hostfile -N 1 ./cp_imagenet_to_temp_bin.sh 26 | 27 | ibrun -np 8 \ 28 | python examples/pytorch_imagenet_resnet.py \ 29 | --epochs 90 \ 30 | --model resnet50 31 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # oh-my-server 2 | 3 | This repository aims to provide some guidance on creating and managing your own server and cluster. It includes some 4 | personal debugging experiences on different HPC systems as well. 5 | 6 | ## Quick Reference 7 | 8 | - Get Started 9 | - [Understanding Your System](docs/get_started/sysinfo.md) 10 | - [Hardware Keywords](docs/get_started/hardware_keywords.md) 11 | - Initialization 12 | - [Initializing Your Server](docs/initialize/vm_software.md) 13 | - [Initializing Your Cluster](docs/initialize/cluster_software.md) 14 | - HPC System Notes 15 | - [General Notes](docs/hpc_system_notes/general.md) 16 | - [NSCC AI System](docs/hpc_system_notes/nscc.md) 17 | - [TACC Longhorn](docs/hpc_system_notes/tacc_longhorn.md) 18 | - [TACC Frontera](docs/hpc_system_notes/tacc_frontera.md) 19 | - [CSCS](docs/hpc_system_notes/cscs.md) 20 | - [NUS Apollo](docs/hpc_system_notes/nus_apollo.md) 21 | - Resources 22 | - [Awesome Deep Learning Libraries](docs/resources/dl_libraries.md) 23 | 24 | -------------------------------------------------------------------------------- /docs/initialize/cluster_software.md: -------------------------------------------------------------------------------- 1 | # Software for Cluster 2 | 3 | ## Table Of Contents 4 | 5 | - [Ansible](#ansible) 6 | - [BeeGFS](#beegfs) 7 | 8 | ## Ansible 9 | 10 | Ansible is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates. 
Ansible does not require you to deploy on every node and it can access other nodes from master node via ssh and execute operations on other nodes automatically. For example, if you wish to install CUDA on your execute nodes, you can just write a simple Ansible playbook script and specify your execute nodes IPs. Then Ansible will access these execute nodes via ssh and install CUDA on each node as instructed in your script. This avoids the trouble of installing on each node manually. 11 | 12 | > [Ansible Documentation](https://docs.ansible.com/ansible/latest/index.html) 13 | 14 | ## BeeGFS 15 | 16 | BeeGFS is a distributed optimized parallel file system. 17 | 18 | > [BeeGFS documentation](https://www.beegfs.io/c/) 19 | -------------------------------------------------------------------------------- /scripts/tacc_frontera/frontera-btest.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o lars-t.o%j # Name of stdout output file 5 | #SBATCH -e lars-t.e%j # Name of stderr error file 6 | #SBATCH -p rtx # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:20:00 # Run time (hh:mm:ss) 10 | #SBATCH --mail-user=your email 11 | #SBATCH --mail-type=all # Send email at begin and end of job 12 | 13 | 14 | pwd 15 | date 16 | 17 | source ~/python-env/cuda10-home/bin/activate 18 | export FS_ROOT=/tmp/fs_`id -u` 19 | ibrun -np 2 /work/00410/huang/share/read_remote_file 16 /scratch1/07801/nusbin20/imagenet-16parts & sleep 500 20 | 21 | module load intel/18.0.5 impi/18.0.5 22 | module load cuda/10.1 cudnn nccl 23 | 24 | cd /scratch1/07801/nusbin20/tacc-our 25 | 26 | ibrun -np 8 LD_PRELOAD=/work/00410/huang/share/wrapper.so python pytorch_imagenet_resnet.py --epochs 90 --model resnet50 --train-dir=/tmp/fs_871009/ILSVRC2012_img_train --val-dir=/tmp/fs_871009/ILSVRC2012_img_val -------------------------------------------------------------------------------- /scripts/tacc_frontera/frontera-gloo.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | #SBATCH -J myjob # Job name 4 | #SBATCH -o myjob-t.o%j # Name of stdout output file 5 | #SBATCH -e myjob-t.e%j # Name of stderr error file 6 | #SBATCH -p rtx # Queue (partition) name 7 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 8 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 9 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 10 | 11 | pwd 12 | date 13 | 14 | source ~/python-env/cuda10-home/bin/activate 15 | 16 | module load intel/18.0.5 impi/18.0.5 17 | module load cuda/10.1 cudnn nccl 18 | 19 | cd /file_path/ 20 | 21 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 22 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 23 | 24 | # 4 means $SLURM_NTASKS_PER_NODE 25 | for node in $(scontrol show hostnames); do 26 | HOSTS="$HOSTS$node:4," 27 | done 28 | HOSTS=${HOSTS%?} # trim trailing comma 29 | echo HOSTS $HOSTS 30 | 31 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ib0 \ 32 | --start-timeout 120 --gloo-timeout-seconds 120 \ 33 | python you_file.py \ 34 | --epochs 90 \ 35 | --model resnet50 36 | 37 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man1/dropbearconvert.1: 
-------------------------------------------------------------------------------- 1 | .TH dropbearconvert 1 2 | .SH NAME 3 | dropbearconvert \- convert between Dropbear and OpenSSH private key formats 4 | .SH SYNOPSIS 5 | .B dropbearconvert 6 | .I input_type 7 | .I output_type 8 | .I input_file 9 | .I output_file 10 | .SH DESCRIPTION 11 | .B Dropbear 12 | and 13 | .B OpenSSH 14 | SSH implementations have different private key formats. 15 | .B dropbearconvert 16 | can convert between the two. 17 | .P 18 | Dropbear uses the same SSH public key format as OpenSSH, it can be extracted 19 | from a private key by using 20 | .B dropbearkey \-y 21 | .P 22 | Encrypted private keys are not supported, use ssh-keygen(1) to decrypt them 23 | first. 24 | .SH ARGUMENTS 25 | .TP 26 | .I input_type 27 | Either 28 | .I dropbear 29 | or 30 | .I openssh 31 | .TP 32 | .I output_type 33 | Either 34 | .I dropbear 35 | or 36 | .I openssh 37 | .TP 38 | .I input_file 39 | An existing Dropbear or OpenSSH private key file 40 | .TP 41 | .I output_file 42 | The path to write the converted private key file. For client authentication ~/.ssh/id_dropbear is loaded by default 43 | .SH EXAMPLE 44 | # dropbearconvert openssh dropbear ~/.ssh/id_rsa ~/.ssh/id_dropbear 45 | .SH AUTHOR 46 | Matt Johnston (matt@ucc.asn.au). 47 | .SH SEE ALSO 48 | dropbearkey(1), ssh-keygen(1) 49 | .P 50 | https://matt.ucc.asn.au/dropbear/dropbear.html 51 | -------------------------------------------------------------------------------- /docs/get_started/hardware_keywords.md: -------------------------------------------------------------------------------- 1 | # Hardware Keywords 2 | 3 | ## PCIe 4 | 5 | [PCIe](https://en.wikipedia.org/wiki/PCI_Express) stands for "Peripheral Component Interconnect Express". It is the common motherboard interface for personal computers' graphics cards, hard disk drive host adapters, SSDs, and Wi-Fi and Ethernet hardware connections. The most common PCIe standards are PCIe 3.0 and 4.0: PCIe 3.0 provides 8.0 GT/s per lane (roughly 8 GB/s over an 8-lane link), while PCIe 4.0 doubles this to 16.0 GT/s per lane (roughly 16 GB/s over 8 lanes). 6 | 7 | ## InfiniBand 8 | 9 | [InfiniBand](https://en.wikipedia.org/wiki/InfiniBand), normally abbreviated as IB, is commonly seen in high-performance computing systems. InfiniBand is a computer networking communication standard featuring very high throughput and very low latency. It can be used as an interconnect within servers, among servers, between servers and storage systems, and between storage systems. 4-link (4x) EDR and HDR InfiniBand provide 100 Gb/s and 200 Gb/s of throughput respectively. 10 | 11 | ## NVLink 12 | 13 | [NVLink](https://en.wikipedia.org/wiki/NVLink) is the GPU interconnect designed by NVIDIA. It is for near-range communication, i.e. between GPUs, and provides a high-speed direct GPU-to-GPU interconnect; the 12 links on an Ampere-based A100 GPU support up to 600 GB/s of total bandwidth. 14 | 15 | ## NVMe 16 | 17 | [NVMe](https://en.wikipedia.org/wiki/NVM_Express) is the short form of NVM Express. It is an open, logical-device interface specification for accessing a computer's non-volatile storage media, usually attached via the PCI Express (PCIe) bus. It is optimized for I/O-intensive applications.
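## Checking these interconnects on a node

A quick way to see which of these interconnects a node actually provides is to query them from the shell. The commands below are only a sketch: `lspci` and `nvme` may need to be installed or run with elevated privileges on some clusters, and `<bus_id>` is a placeholder you replace with a real PCI address (from `lspci | grep -i nvidia`, for example).

```shell
# PCIe: show the negotiated link speed/width of a device (e.g. a GPU)
lspci -vv -s <bus_id> | grep LnkSta

# InfiniBand: show port state and rate (e.g. 100 Gb/s for EDR, 200 Gb/s for HDR)
ibstat

# NVLink: show GPU-to-GPU topology and per-link NVLink status
nvidia-smi topo -m
nvidia-smi nvlink -s

# NVMe: list NVMe drives attached to the node
nvme list
```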
-------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man1/dropbearkey.1: -------------------------------------------------------------------------------- 1 | .TH dropbearkey 1 2 | .SH NAME 3 | dropbearkey \- create private keys for the use with dropbear(8) or dbclient(1) 4 | .SH SYNOPSIS 5 | .B dropbearkey 6 | \-t 7 | .I type 8 | \-f 9 | .I file 10 | [\-s 11 | .IR bits ] 12 | [\-y] 13 | .SH DESCRIPTION 14 | .B dropbearkey 15 | generates a 16 | \fIRSA\fR, \fIDSS\fR, \fIECDSA\fR, or \fIEd25519\fR 17 | format SSH private key, and saves it to a file for the use with the 18 | Dropbear client or server. 19 | Note that 20 | some SSH implementations 21 | use the term "DSA" rather than "DSS", they mean the same thing. 22 | .SH OPTIONS 23 | .TP 24 | .B \-t \fItype 25 | Type of key to generate. 26 | Must be one of 27 | .I rsa 28 | .I ecdsa 29 | .I ed25519 30 | or 31 | .IR dss . 32 | .TP 33 | .B \-f \fIfile 34 | Write the secret key to the file 35 | \fIfile\fR. For client authentication ~/.ssh/id_dropbear is loaded by default 36 | .TP 37 | .B \-s \fIbits 38 | Set the key size to 39 | .I bits 40 | bits, should be multiple of 8 (optional). 41 | .TP 42 | .B \-y 43 | Just print the publickey and fingerprint for the private key in \fIfile\fR. 44 | .SH NOTES 45 | The program dropbearconvert(1) can be used to convert between Dropbear and OpenSSH key formats. 46 | .P 47 | Dropbear does not support encrypted keys. 48 | .SH EXAMPLE 49 | generate a host-key: 50 | # dropbearkey -t rsa -f /etc/dropbear/dropbear_rsa_host_key 51 | 52 | extract a public key suitable for authorized_keys from private key: 53 | # dropbearkey -y -f id_rsa | grep "^ssh-rsa " >> authorized_keys 54 | .SH AUTHOR 55 | Matt Johnston (matt@ucc.asn.au). 56 | .br 57 | Gerrit Pape (pape@smarden.org) wrote this manual page. 58 | .SH SEE ALSO 59 | dropbear(8), dbclient(1), dropbearconvert(1) 60 | .P 61 | https://matt.ucc.asn.au/dropbear/dropbear.html 62 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/pbs_script/multinode-jupyter-non-dev.pbs: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | ## The following line specifies the resources request 4 | #PBS -l select=2:ncpus=5:ngpus=1,place=scatter 5 | 6 | ### Run in the shared test and development area 7 | ### Note that you do not have exclusive access to the GPUs in the dgx-dev queue 8 | #PBS -q dgx 9 | #PBS -l walltime=12:00:00 10 | 11 | ### Specify project code 12 | ### e.g. 41000001 was the pilot project code 13 | ### Job will not submit unless this is changed 14 | #PBS -P 11002046 15 | 16 | ### Specify name for job 17 | #PBS -N multinode_lsg_jupyter 18 | 19 | ### Standard output by default goes to file $PBS_JOBNAME.o$PBS_JOBID 20 | ### Standard error by default goes to file $PBS_JOBNAME.e$PBS_JOBID 21 | ### To merge standard output and error use the following 22 | #PBS -j oe 23 | 24 | ### Start of commands to be run 25 | 26 | # Change directory to where job was submitted 27 | cd $PBS_O_WORKDIR || exit $? 28 | echo $0 29 | hostname 30 | bash /home/users/ntu/c170166/pbs_scripts/jupyterlab/start_jupyter.sh 31 | 32 | 33 | ### Notebook is now running on compute node 34 | ### However you cannot directly access port 35 | ### There are two methods that you can use 36 | ### 1. ssh port forwarding 37 | ### 2. reverse proxy (FRP, etc.) 
38 | 39 | ### Using reverse proxies is a security risk, user is responsible for any data loss or unauthorized access. 40 | 41 | ### ssh port forwarding 42 | ### On local machine use ssh port forwarding to tunnel to node and port where job is running: 43 | ### ssh -L$PORT:$HOST:$PORT aspire.nscc.sg ### e.g. ssh -L8888:dgx4106:8888 aspire.nscc.sg 44 | ### On local machine go to http://localhost:$PORT and use token from found in file stderr.$PBS_JOBID 45 | ### Alternatively, pass a pre-determined token using --NotebookApp.token=... (visible to ps command on node) 46 | -------------------------------------------------------------------------------- /scripts/nscc/monitor_job/nscc_monitor.py: -------------------------------------------------------------------------------- 1 | import smtplib 2 | from email.mime.text import MIMEText 3 | from email.header import Header 4 | import subprocess 5 | import time 6 | import argparse 7 | import datetime 8 | import logging 9 | 10 | FROM_ADDR = 'franklee_9@163.com' 11 | PASSWORD = 'MQFKZUTJQBMKMWWT' 12 | SMTP_SERVER = 'smtp.163.com' 13 | 14 | 15 | def send_email(server, to_addr, job_id): 16 | current_time = datetime.datetime.now() 17 | content = 'Job {} is currently running. Detected at {}'.format( 18 | job_id, current_time) 19 | msg = MIMEText(content, 'plain', 'utf-8') 20 | 21 | msg['From'] = Header(FROM_ADDR) 22 | msg['To'] = Header(to_addr) 23 | msg['Subject'] = Header('Your NSCC Job is Running') 24 | server.sendmail(FROM_ADDR, to_addr, msg.as_string()) 25 | logging.info("Email alert sent to {}".format(to_addr)) 26 | server.quit() 27 | 28 | 29 | def listen_to_qstat(job_id, interval=600): 30 | logging.info("Start listening for {}".format(job_id)) 31 | while True: 32 | qstat_info = subprocess.getoutput('qstat') 33 | qstat_info = qstat_info.split('\n') 34 | for line in qstat_info: 35 | if job_id in line and line.split()[-2] == "R": 36 | logging.info("Detected job start for: {}".format(job_id)) 37 | return 38 | 39 | time.sleep(interval) 40 | 41 | 42 | def parse_args(): 43 | parser = argparse.ArgumentParser() 44 | parser.add_argument("--job", type=str) 45 | parser.add_argument("--interval", type=int, default=600) 46 | parser.add_argument("--email", type=str) 47 | return parser.parse_args() 48 | 49 | 50 | def main(): 51 | args = parse_args() 52 | logging.basicConfig( 53 | filename="./monitor_{}.log".format(args.job), level=logging.DEBUG) 54 | listen_to_qstat(args.job, args.interval) 55 | 56 | # set up email server 57 | logging.info("Setting up server") 58 | server = smtplib.SMTP_SSL(host=SMTP_SERVER) 59 | server.connect(host=SMTP_SERVER, port=465) 60 | server.login(FROM_ADDR, PASSWORD) 61 | send_email(server, args.email, args.job) 62 | 63 | 64 | if __name__ == "__main__": 65 | main() 66 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/scripts/sshcont/ssh_and_run.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2 2 | 3 | from __future__ import print_function 4 | 5 | SSH_PORT = 41017 6 | 7 | import subprocess 8 | import os 9 | import tempfile 10 | import shutil 11 | import stat 12 | import sys 13 | 14 | with open(os.environ["PBS_NODEFILE"], "r") as f: 15 | hosts = dict() 16 | for host in f: 17 | host = host.strip() 18 | cnt = hosts.get(host, 0) 19 | cnt += 1 20 | hosts[host] = cnt 21 | 22 | hstrings = ["{0}:{1}".format(h, s) for (h, s) in hosts.items()] 23 | 24 | def generate_ssh_config_file(): 25 | sections = [] 26 | 27 | for host in 
hosts: 28 | sections.append( 29 | ("Host {0}\n" 30 | "\tPort {1}\n" 31 | "\tStrictHostKeyChecking no\n".format(host, SSH_PORT))) 32 | return "\n".join(sections) 33 | 34 | 35 | tempdir = tempfile.mkdtemp(dir=os.getenv("HOME") + "/sshcont") 36 | print("Created temporary config directory:", tempdir) 37 | print("bind mount that directory under $HOME/.ssh") 38 | 39 | os.chmod(tempdir, stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR) 40 | 41 | configfile = tempdir + "/config" 42 | with open(configfile, "w") as f: 43 | f.write(generate_ssh_config_file()) 44 | os.chmod(configfile, stat.S_IRUSR | stat.S_IWUSR) 45 | 46 | 47 | keyfile = tempdir + "/id_rsa" 48 | pubkey = tempdir + "/authorized_keys" 49 | out = subprocess.Popen("dropbearkey -t rsa -f {0}".format(keyfile), shell=True, stderr=subprocess.PIPE, stdout=subprocess.PIPE) 50 | out.wait() 51 | if out.returncode != 0: 52 | raise RuntimeError("return not zero") 53 | out = out.stdout.read() 54 | 55 | with open(pubkey, "w") as f: 56 | f.write(out.splitlines()[1]) 57 | subprocess.check_call("dropbearconvert dropbear openssh {0} {1}".format(keyfile, keyfile), shell=True) 58 | os.chmod(pubkey, stat.S_IRUSR | stat.S_IWUSR) 59 | 60 | os.environ["NTUHPC_OPENMPI_HOSTSPEC"] = ",".join(hstrings) 61 | os.environ["NTUHPC_SSH_DIR"] = tempdir 62 | os.environ["NTUHPC_OPENMPI_HOSTCOUNT"] = str(len(hosts)) 63 | os.environ["NTUHPC_OPENMPI_SLOTCOUNT"] = str(sum(hosts.values())) 64 | try: 65 | subprocess.check_call(" ".join(sys.argv[1:]), shell=True) 66 | finally: 67 | shutil.rmtree(tempdir) 68 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 
92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # IDE 132 | .vscode/ 133 | .idea/ 134 | 135 | # macos 136 | .DS_Store -------------------------------------------------------------------------------- /docs/hpc_system_notes/nus_apollo.md: -------------------------------------------------------------------------------- 1 | # NUS Apollo 2 | 3 | ## Table of Contents 4 | 5 | - [Connecting](#connecting) 6 | - [Software Stack](#software-stack) 7 | - [Hardware Overview](#hardware-overview) 8 | 9 | ## Connecting 10 | 11 | Connect to the SoC VPN, then you can use ssh to connect to the lab machine directly. 12 | 13 | ```shell 14 | ssh USERNAME@hpc-ai-01.d2.comp.nus.edu.sg 15 | ``` 16 | 17 | ## :heavy_exclamation_mark: Download Rules 18 | 19 | :heavy_exclamation_mark: 20 | :heavy_exclamation_mark: 21 | :heavy_exclamation_mark: 22 | **Please read this section before use** 23 | 24 | 25 | If you want to download a large dataset, please limit your download rate so that the transfer does not flood (effectively DDoS) the shared network. 26 | You can use trickle to control how fast you download your dataset. In general, you should limit your download rate to 10 MB/s to be safe. 27 | We use **`trickle`** to control the download rate. 28 | 29 | Some options of trickle are: 30 | - `-s`: run in standalone mode 31 | - `-u`: upload rate in KB/s 32 | - `-d`: download rate in KB/s 33 | 34 | 35 | ```bash 36 | # this will download the CIFAR10 dataset at around 4 MB/s 37 | trickle -s -u 1024 -d 4096 wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz 38 | 39 | # this will download the CIFAR10 dataset at around 2 MB/s 40 | trickle -s -u 1024 -d 2048 wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz 41 | ``` 42 | 43 | ## Storage 44 | 45 | We store datasets which are commonly used by many users in `/data/common`. 46 | If you feel that your dataset is common but not yet in `/data/common`, you can ask the system admin to download it for you, or move your downloaded dataset to `/data/common`. 47 | 48 | Please store large files (e.g. personal datasets, model weights) in `/data/personal`. You need to create your own directory there. 49 | 50 | 51 | 52 | ## Software Stack 53 | 54 | ### Environment Module 55 | 56 | We use **Environment Modules** to manage the software stack. 57 | You can use `module av` to list the available software packages and `module load` to load the software you want to use. 58 | 59 | ```shell 60 | module load miniconda3 cuda/11.3.1 61 | ``` 62 | 63 | ### Singularity Image 64 | 65 | You can use Singularity to run containers. 66 | 67 | 1. Run Image: `singularity shell --nv` 68 | 2. Clone Environment: `conda create --clone` 69 | 3. Install Other Python Dependencies with `pip` or `conda` 70 | 71 | The new environment will be created under `$HOME/.conda/envs`. The next time you need it, there is no need to reconfigure anything: just activate it directly with `source activate` inside the Singularity shell.
72 | 73 | ```shell 74 | $ module load singularity 75 | $ singularity shell --nv /opt/nussif/pytorch_21.12-py3.sif 76 | Singularity> conda create --clone base --name ngc-torch-21.11 # only needed the first time 77 | Singularity> source activate ngc-torch-21.11 78 | Singularity> python -c "import torch;print(torch.cuda.is_available())" 79 | ``` 80 | 81 | ## Hardware Overview 82 | 83 | | Node Name | GPU | CPU | Memory | Storage | 84 | |-----------|------------------------|----------------------------|---------------------|------------------------------------| 85 | | hpc-ai-01 | NVIDIA HGX A100 8-GPU | 2 x AMD EPYC 7742 64-Core | 1TB 3200 MT/s DDR4 | 1.8 TB System NVMe + 11 TB Data NVMe | 86 | -------------------------------------------------------------------------------- /docs/get_started/sysinfo.md: -------------------------------------------------------------------------------- 1 | # Understanding Your System 2 | 3 | ## Table Of Contents 4 | 5 | - [Common Linux Commands](#common-linux-commands) 6 | - [More info in /proc](#more-info-in-proc) 7 | - [More info in /sys](#more-info-in-sys) 8 | - [Scripts](#scripts) 9 | 10 | ## Common Linux Commands 11 | 12 | ```shell 13 | # get operating system info (any of the commands below) 14 | cat /etc/os-release 15 | lsb_release -a 16 | hostnamectl 17 | 18 | # get linux kernel version 19 | uname -r 20 | 21 | # get free RAM 22 | free -m 23 | 24 | # Disk 25 | df -h /home 26 | 27 | # CPU 28 | lscpu 29 | 30 | # check GPU status 31 | nvidia-smi 32 | 33 | # check CUDA/nvcc version 34 | nvcc --version 35 | 36 | # check GPU topology 37 | nvidia-smi topo -m 38 | 39 | # environment variables 40 | env 41 | 42 | # check available modules 43 | module avail 44 | 45 | # check infiniband (any of the commands below) 46 | ibv_devinfo 47 | ibstat 48 | ibstatus 49 | 50 | # check if hyperthreading is enabled 51 | dmidecode -t processor | grep HTT 52 | 53 | ``` 54 | 55 | ## More info in proc 56 | 57 | `/proc` is a virtual file system on Linux which stores information about the current status of the Linux kernel. Users can view and modify the running status of the kernel through these files. Because `/proc` is virtual, its files report a size of 0 KB even though they can display lots of information. Many of them are kept in RAM and constantly updated. 58 | 59 | If you change your directory to `/proc`, you will see many folders named with numbers. These are the folders of the running processes. The following commands, run inside a process folder, give you some basic information about that process, and you can get more by viewing the other files. 60 | 61 | ```shell 62 | # see the command which launched the process 63 | cat cmdline 64 | 65 | # check the environment variables of the process 66 | cat environ 67 | 68 | # check the memory usage 69 | cat mem 70 | 71 | # view the current status of the process 72 | # cat status has better readability 73 | cat stat 74 | cat status 75 | 76 | # check the info about threads run by the current process 77 | cat task 78 | 79 | ``` 80 | 81 | Besides the process folders, there are many files located directly under `/proc`; they provide status information about the whole operating system.
For example: 82 | 83 | ```shell 84 | # check power management and battery info 85 | cat /proc/apm 86 | 87 | # get memory usage 88 | cat /proc/meminfo 89 | 90 | # get virtual memory info 91 | cat /proc/vmstat 92 | 93 | # get mounted devices info 94 | cat /proc/devices 95 | 96 | # get cpu info 97 | cat /proc/cpuinfo 98 | 99 | # get system uptime since last boot 100 | cat /proc/uptime 101 | 102 | # get kernel version 103 | cat /proc/version 104 | 105 | ``` 106 | 107 | ## More info in sys 108 | 109 | `/sys` is another directory which allows you to view information about your devices and system. You can view and modify some settings such as power, module, and hypervisor options. 110 | 111 | > You can view/change simultaneous multi-threading in `/sys/devices/system/cpu/smt` 112 | 113 | ## Scripts 114 | 115 | `Author-Kit` provides an easy-to-use script to gather system information on the machine. 116 | 117 | ```shell 118 | git clone https://github.com/SC-Tech-Program/Author-Kit 119 | bash ./Author-Kit/collect_environment.sh > ./sysinfo.txt 120 | ``` 121 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/general.md: -------------------------------------------------------------------------------- 1 | # General Notes 2 | 3 | ## Get to Know Your Scheduler 4 | 5 | HPC clusters always have a job scheduler which dispatches jobs submitted to a queue. You can refer to the official 6 | documentation below for how to use the job scheduler. 7 | 8 | - PBS Pro: [Documentation](https://help.nscc.sg/pbspro-quickstartguide/) 9 | - SLURM: [Documentation](https://slurm.schedmd.com/documentation.html) 10 | 11 | ## How to Debug on HPC Cluster 12 | 13 | ### Use debug node 14 | 15 | In general, HPC clusters provide a development node for you to test your code. For example, you can use the `dgx-dev` 16 | queue on NSCC to run an interactive job and debug your code on one node with 4 GPUs. 17 | 18 | ### Use JupyterLab 19 | 20 | Development nodes usually have resource limitations such as wall time. It is also inconvenient when your terminal gets stuck, as 21 | you have to quit and re-submit the job. Thus, I would recommend using JupyterLab to debug your job. This has 22 | several advantages: 23 | 24 | 1. You can have multiple terminals 25 | 2. You can edit your files 26 | 3. Your job is still running if you disconnect 27 | 28 | If you are not a vim enthusiast (even though I highly recommend it as you can make it a pro IDE if you install plugins), 29 | you can use JupyterLab for convenience. 30 | 31 | You can refer to the [NSCC notes](nscc.md) on how to set up JupyterLab. 32 | 33 | ### Use IDE 34 | 35 | You can refer to the following documentation on how to use an IDE on clusters. 36 | - VS Code: [documentation](https://code.visualstudio.com/docs/remote/ssh) 37 | - PyCharm: [documentation](https://www.jetbrains.com/help/pycharm/tutorial-deployment-in-product.html#comparing) 38 | 39 | ## How to Run Many Short Experiments With a Single Job Submission 40 | 41 | Sometimes, you may need to run many experiments while each experiment only takes several minutes. Thus, it is definitely 42 | not a good idea to submit a job script for each experiment. For example, I may only want to profile the peak memory usage 43 | of one iteration of my training process. In this case, it is recommended to use JupyterLab for experiments. This has 44 | several advantages: 45 | 46 | 1. You don't have to worry about program errors when you run many experiments.
For example, when you only submit a job 47 | to the scheduler, the job will get stuck if your program runs into an out-of-memory error. However, if you run JupyterLab, 48 | you can just terminate it with `ctrl+c` and continue with your next try. 49 | 2. JupyterLab runs inside the job dispatched by the scheduler. Thus, the environment inherits variables passed by the 50 | scheduler, for example, the `PBS_NODEFILE` variable of PBS Pro. 51 | 52 | ## How to Find Network Interface 53 | 54 | The network interface often needs to be specified for distributed training, to tell the framework which interface to rely on for 55 | communication. For example, when running PyTorch distributed, you may find that your script gets stuck at the initialization 56 | of the default process group. Usually this is because PyTorch does not know which interface to use for cross-node 57 | communication, or some of the detected interfaces have issues. You can specify the network interface for communication by setting the 58 | environment variables `NCCL_SOCKET_IFNAME` or `GLOO_SOCKET_IFNAME` depending on the backend of your choice. 59 | You can check the available network interfaces with the command `ifconfig`, or use 60 | [PyRoute2](https://github.com/svinota/pyroute2) if this command is not available. You can also get the host address 61 | this way. 62 | 63 | ```python 64 | from pyroute2 import NDB 65 | 66 | ndb = NDB(log='debug') 67 | print(ndb.addresses.summary()) 68 | ``` 69 | 70 | For example, you can find `ib0` in the list on NSCC. It refers to the InfiniBand interface, and you can set `NCCL_SOCKET_IFNAME=ib0` in 71 | your script. Based on our tests, if you do not set this environment variable, your PyTorch initialization will get stuck. 72 | 73 | ## Use ProxyJump to connect directly to the compute node (requires a job running on the compute node) 74 | 75 | ``` 76 | Host computenode_hostname 77 | HostName computenode_hostname 78 | User username 79 | ProxyJump loginnode_hostname 80 | ServerAliveInterval 60 81 | 82 | Host loginnode_hostname 83 | HostName loginnode_hostname 84 | User username 85 | ``` 86 | -------------------------------------------------------------------------------- /docs/resources/dl_libraries.md: -------------------------------------------------------------------------------- 1 | # Awesome Deep Learning Libraries 2 | 3 | I have listed some awesome libraries which I have found useful for most machine learning practices. These libraries can make 4 | things easier and boost your productivity. 5 | 6 | 7 | ## General 8 | 9 | - [DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples): Source code for many deep learning models by Nvidia 10 | 11 | 12 | ## Computer Vision 13 | 14 | - [mmcv](https://github.com/open-mmlab/mmcv): OpenMMLab Computer Vision Foundation 15 | - [MMClassification](https://github.com/open-mmlab/mmclassification): OpenMMLab Image Classification Toolbox and Benchmark 16 | - [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab Detection Toolbox and Benchmark 17 | - [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark 18 | - [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab Semantic Segmentation Toolbox and Benchmark.
19 | - [OpenSelfSup](https://github.com/open-mmlab/OpenSelfSup): Self-Supervised Learning Toolbox and Benchmark 20 | 21 | 22 | ## Natural Language Processing 23 | 24 | - [autonlp](https://github.com/huggingface/autonlp): AutoNLP: train state-of-the-art natural language processing models and deploy them in a scalable environment automatically 25 | - [HuggingFaceTransformer](https://github.com/huggingface/transformers): Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX. 26 | - [fairseq](https://github.com/pytorch/fairseq): Facebook AI Research Sequence-to-Sequence Toolkit written in Python. 27 | 28 | 29 | ## Data 30 | 31 | - [DALI](https://github.com/NVIDIA/DALI): A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications. 32 | - [AugLy](https://github.com/facebookresearch/AugLy): A data augmentations library for audio, image, text, and video. 33 | - [Open3D](https://github.com/intel-isl/Open3D): Open3D: A Modern Library for 3D Data Processing 34 | - [HuggingFaceTokenizer](https://github.com/huggingface/tokenizers): Fast State-of-the-Art Tokenizers optimized for Research and Production 35 | - [HuggingFaceDatasets](https://github.com/huggingface/datasets): The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools 36 | 37 | 38 | ## Accelerating Training 39 | 40 | - [Apex](https://github.com/NVIDIA/apex): A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch 41 | - [ApexDataPrefetcher](https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py#L265): prefetch data to hide data I/O cost. 42 | - [Horovod](https://github.com/horovod/horovod): Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. 43 | - [Checkpoint](https://pytorch.org/docs/stable/checkpoint.html): A PyTorch function which implements [activation checkpointing](https://arxiv.org/abs/1604.06174). 44 | - [TorchPipe](https://github.com/kakaobrain/torchgpipe): A GPipe implementation in PyTorch 45 | - [PowerSGD Communication Hook](https://pytorch.org/docs/stable/ddp_comm_hooks.html): PowerSGD (Vogels et al., NeurIPS 2019) is a gradient compression algorithm, which can provide very high compression rates and accelerate bandwidth-bound distributed training. 46 | - [Accelerate](https://github.com/huggingface/accelerate): A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision 47 | - [lightseq](https://github.com/bytedance/lightseq): LightSeq: A High Performance Library for Sequence Processing and Generation 48 | 49 | 50 | ## Large-Scale Distributed Training 51 | 52 | - [Megatron](https://github.com/NVIDIA/Megatron-LM): Ongoing research training transformer language models at scale, including: BERT & GPT-2 53 | - [DeepSpeed](https://github.com/microsoft/deepspeed): DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective. 54 | - [Ray](https://github.com/ray-project/ray): An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. 
55 | 56 | 57 | ## Utilities 58 | 59 | - [Tensorboard](https://github.com/tensorflow/tensorboard): TensorFlow's Visualization Toolkit 60 | - [KnockKnock](https://github.com/huggingface/knockknock): Get notified when your training ends with only two additional lines of code 61 | - [Neptune](https://github.com/neptune-ai/neptune-client): Lightweight experiment tracking tool for AI/ML individuals and teams. Fits any workflow. 62 | - [netron](https://github.com/lutzroeder/netron): Visualizer for neural network, deep learning, and machine learning models 63 | - [scalene](https://github.com/plasma-umass/scalene): Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python 64 | 65 | 66 | -------------------------------------------------------------------------------- /docs/initialize/vm_software.md: -------------------------------------------------------------------------------- 1 | # Software for Single VM 2 | 3 | I use CentOS for my personal cloud server, thus the commands will be a bit different if you using other operating system. For software you wish to make available for all users, I would recommend you to put them under `/opt/apps`. As I will be using `Lmod` to manage all the modules, I build all the software from source. If you wish to install directly, you can just use `yum install`. 4 | 5 | ## Table Of Contents 6 | 7 | - [GCC](#gcc) 8 | - [Tmux](#common-linux-commands) 9 | - [Lmod](#lmod) 10 | - [Docker](#docker) 11 | - [Singularity](#singularity) 12 | - [OpenMPI](#openmpi) 13 | - [Node.js](#node.js) 14 | - [Go](#go) 15 | - [Useful Plug-ins](#useful-plug-ins) 16 | 17 | ## GCC 18 | 19 | GCC compiler is required for compilation for many programs. To build GCC from source, you can follow the steps below. 20 | 21 | ```shell 22 | # setup workspace 23 | cd /opt/apps 24 | mkdir -p gcc/source 25 | cd ./gcc/source 26 | 27 | # setup source code 28 | git clone git://gcc.gnu.org/git/gcc.git 29 | git branch -a # view all branch 30 | git tag -l # view all tags 31 | git checkout 32 | ./contrib/download_prerequisites 33 | 34 | # make and install 35 | cd /opt/apps/gcc 36 | mkdir # e.g. mkdir 8.3.0 37 | /opt/apps/gcc/source/configure --prefix=/opt/apps/gcc/ --enable-languages=c,c++,fortran,go 38 | make 39 | make install 40 | ``` 41 | 42 | **Compilation can take quite long** 43 | 44 | ## Tmux 45 | 46 | Install tmux 47 | 48 | ```shell 49 | sudo yum install tmux 50 | ``` 51 | 52 | However, this is version 1.8 and is a bit too old. 
To install Tmux 2.8 from source: 53 | 54 | ```shell 55 | # install deps 56 | yum install gcc kernel-devel make ncurses-devel 57 | 58 | # DOWNLOAD SOURCES FOR LIBEVENT AND MAKE AND INSTALL 59 | curl -LOk https://github.com/libevent/libevent/releases/download/release-2.1.8-stable/libevent-2.1.8-stable.tar.gz 60 | tar -xf libevent-2.1.8-stable.tar.gz 61 | cd libevent-2.1.8-stable 62 | ./configure --prefix=/usr/local 63 | make 64 | make install 65 | 66 | # DOWNLOAD SOURCES FOR TMUX AND MAKE AND INSTALL 67 | 68 | curl -LOk https://github.com/tmux/tmux/releases/download/2.8/tmux-2.8.tar.gz 69 | tar -xf tmux-2.8.tar.gz 70 | cd tmux-2.8 71 | LDFLAGS="-L/usr/local/lib -Wl,-rpath=/usr/local/lib" ./configure --prefix=/usr/local 72 | make 73 | make install 74 | 75 | # pkill tmux 76 | # close your terminal window (flushes cached tmux executable) 77 | # open new shell and check tmux version 78 | tmux -V 79 | ``` 80 | 81 | tmux cheatsheet: https://tmuxcheatsheet.com/ 82 | 83 | ## Lmod 84 | 85 | You may refer to [Lmod Installation](https://lmod.readthedocs.io/en/latest/030_installing.html) for detailed instructions. For a quick summary on the installation process, you can refer to my procedure. 86 | 87 | 1. Try `lua -v` to check if you have lua installed. If not, install lua. 88 | 89 | ```shell 90 | curl -R -O http://www.lua.org/ftp/lua-5.4.1.tar.gz 91 | tar xf lua-5.4.1.tar.gz 92 | cd lua-5.4.1 93 | # if you want to set where you want to install 94 | # e.g. change INSTALL_TOP to /opt/apps/lua/5.4.1 95 | # and lua will be installed there 96 | vim Makefile 97 | make 98 | make install 99 | ``` 100 | 101 | 2. Download Lmod from [here](https://sourceforge.net/projects/lmod/files/) and install (e.g. Lmod-8.4). 102 | 103 | ```shell 104 | tar -xf Lmod-8.4.tar.bz2 105 | cd Lmod-8.4 106 | ./configure --prefix=/opt/apps 107 | make install 108 | ``` 109 | 110 | 3. Then Configure the shell setup. 111 | 112 | ``` 113 | ln -s /opt/apps/lmod/lmod/init/profile /etc/profile.d/z00_lmod.sh 114 | ln -s /opt/apps/lmod/lmod/init/cshrc /etc/profile.d/z00_lmod.csh 115 | ln -s /opt/apps/lmod/lmod/init/profile.fish /etc/fish/conf.d/z00_lmod.fish 116 | ``` 117 | 118 | 4. When you start you shell again, you can type `module` and see the option list and see some environment variables such as `LMOD_CMD`. 119 | 120 | ``` 121 | $ echo $LMOD_CMD 122 | /opt/apps/lmod/lmod/libexec/lmod 123 | ``` 124 | 125 | You can refer to [Write your own modulefiles](https://lmod.readthedocs.io/en/latest/020_advanced.html) for more information. You can add `module use ` in `/etc/profile` so all users can initialize to the same bunch of paths upon start. 126 | 127 | ## Docker 128 | 129 | Install Docker 130 | 131 | ```shell 132 | sudo yum install -y yum-utils 133 | sudo yum-config-manager \ 134 | --add-repo \ 135 | https://download.docker.com/linux/centos/docker-ce.repo 136 | sudo systemctl start docker 137 | ``` 138 | 139 | Docker will create a group called docker, other users can be granted access to docker by 140 | 141 | ```shell 142 | sudo usermod -aG docker $USER 143 | ``` 144 | 145 | docker cheatsheet: https://www.docker.com/sites/default/files/d8/2019-09/docker-cheat-sheet.pdf 146 | 147 | ## Singularity 148 | 149 | Singularity is a containerization tool similar to Docker. It is more for HPC community. Its image can be obtained by conversion from Docker Image. Follow the steps below to install Singularity. 
150 | 151 | ```shell 152 | sudo yum update && \ 153 | sudo yum groupinstall 'Development Tools' && \ 154 | sudo yum install libarchive-devel 155 | 156 | git clone https://github.com/singularityware/singularity.git 157 | cd singularity 158 | ./autogen.sh 159 | ./configure --prefix=/opt/apps/singularity --sysconfdir=/etc 160 | make 161 | sudo make install 162 | 163 | ``` 164 | 165 | ## OpenMPI 166 | 167 | OpenMPI lets you spawn mutltiple processes simultaneously for distributed job. As my VM has too few cores to make OpenMPI effective. I didn't not install it. You may refer to this [guide](https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX) if you want to build OpenMPI with UCX. 168 | 169 | ## Node.js 170 | 171 | If you are a web developer, then most likely you will need Node.js for your frontend and backend devepment. To install Node.js on the server, you can download the binary files from the [Node.js website](https://nodejs.org/en/download/) with [instructions](https://github.com/nodejs/help/wiki/Installation). 172 | 173 | ## Go 174 | 175 | To install Go, you can also download the binary directly at [Go Installation](https://golang.org/doc/install). 176 | 177 | ## Useful Plug-ins 178 | 179 | [oh-my-zsh](https://github.com/ohmyzsh/ohmyzsh) 180 | [oh-my-tmux](https://github.com/gpakosz/.tmux) 181 | [vimrc-configuration](https://github.com/amix/vimrc) 182 | [vim-plugin-code-autocomplete](https://github.com/ycm-core/YouCompleteMe) 183 | -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man8/dropbear.8: -------------------------------------------------------------------------------- 1 | .TH dropbear 8 2 | .SH NAME 3 | dropbear \- lightweight SSH server 4 | .SH SYNOPSIS 5 | .B dropbear 6 | [\fIflag arguments\fR] [\-b 7 | .I banner\fR] 8 | [\-r 9 | .I hostkeyfile\fR] [\-p [\fIaddress\fR:]\fIport\fR] 10 | .SH DESCRIPTION 11 | .B dropbear 12 | is a small SSH server 13 | .SH OPTIONS 14 | .TP 15 | .B \-b \fIbanner 16 | bannerfile. 17 | Display the contents of the file 18 | .I banner 19 | before user login (default: none). 20 | .TP 21 | .B \-r \fIhostkey 22 | Use the contents of the file 23 | .I hostkey 24 | for the SSH hostkey. 25 | This file is generated with 26 | .BR dropbearkey (1) 27 | or automatically with the '-R' option. See "Host Key Files" below. 28 | .TP 29 | .B \-R 30 | Generate hostkeys automatically. See "Host Key Files" below. 31 | .TP 32 | .B \-F 33 | Don't fork into background. 34 | .TP 35 | .B \-E 36 | Log to standard error rather than syslog. 37 | .TP 38 | .B \-m 39 | Don't display the message of the day on login. 40 | .TP 41 | .B \-w 42 | Disallow root logins. 43 | .TP 44 | .B \-s 45 | Disable password logins. 46 | .TP 47 | .B \-g 48 | Disable password logins for root. 49 | .TP 50 | .B \-j 51 | Disable local port forwarding. 52 | .TP 53 | .B \-k 54 | Disable remote port forwarding. 55 | .TP 56 | .B \-p\fR [\fIaddress\fR:]\fIport 57 | Listen on specified 58 | .I address 59 | and TCP 60 | .I port. 61 | If just a port is given listen 62 | on all addresses. 63 | up to 10 can be specified (default 22 if none specified). 64 | .TP 65 | .B \-i 66 | Service program mode. 67 | Use this option to run 68 | .B dropbear 69 | under TCP/IP servers like inetd, tcpsvd, or tcpserver. 70 | In program mode the \-F option is implied, and \-p options are ignored. 71 | .TP 72 | .B \-P \fIpidfile 73 | Specify a pidfile to create when running as a daemon. 
If not specified, the 74 | default is /var/run/dropbear.pid 75 | .TP 76 | .B \-a 77 | Allow remote hosts to connect to forwarded ports. 78 | .TP 79 | .B \-W \fIwindowsize 80 | Specify the per-channel receive window buffer size. Increasing this 81 | may improve network performance at the expense of memory use. Use -h to see the 82 | default buffer size. 83 | .TP 84 | .B \-K \fItimeout_seconds 85 | Ensure that traffic is transmitted at a certain interval in seconds. This is 86 | useful for working around firewalls or routers that drop connections after 87 | a certain period of inactivity. The trade-off is that a session may be 88 | closed if there is a temporary lapse of network connectivity. A setting 89 | if 0 disables keepalives. If no response is received for 3 consecutive keepalives the connection will be closed. 90 | .TP 91 | .B \-I \fIidle_timeout 92 | Disconnect the session if no traffic is transmitted or received for \fIidle_timeout\fR seconds. 93 | .TP 94 | .B \-T \fImax_authentication_attempts 95 | Set the number of authentication attempts allowed per connection. If unspecified the default is 10 (MAX_AUTH_TRIES) 96 | .TP 97 | .B \-c \fIforced_command 98 | Disregard the command provided by the user and always run \fIforced_command\fR. This also 99 | overrides any authorized_keys command= option. 100 | .TP 101 | .B \-V 102 | Print the version 103 | 104 | .SH FILES 105 | 106 | .TP 107 | Authorized Keys 108 | 109 | ~/.ssh/authorized_keys can be set up to allow remote login with a RSA, 110 | ECDSA, Ed25519 or DSS 111 | key. Each line is of the form 112 | .TP 113 | [restrictions] ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIgAsp... [comment] 114 | 115 | and can be extracted from a Dropbear private host key with "dropbearkey -y". This is the same format as used by OpenSSH, though the restrictions are a subset (keys with unknown restrictions are ignored). 116 | Restrictions are comma separated, with double quotes around spaces in arguments. 117 | Available restrictions are: 118 | 119 | .TP 120 | .B no-port-forwarding 121 | Don't allow port forwarding for this connection 122 | 123 | .TP 124 | .B no-agent-forwarding 125 | Don't allow agent forwarding for this connection 126 | 127 | .TP 128 | .B no-X11-forwarding 129 | Don't allow X11 forwarding for this connection 130 | 131 | .TP 132 | .B no-pty 133 | Disable PTY allocation. Note that a user can still obtain most of the 134 | same functionality with other means even if no-pty is set. 135 | 136 | .TP 137 | .B command=\fR"\fIforced_command\fR" 138 | Disregard the command provided by the user and always run \fIforced_command\fR. 139 | The -c command line option overrides this. 140 | 141 | The authorized_keys file and its containing ~/.ssh directory must only be 142 | writable by the user, otherwise Dropbear will not allow a login using public 143 | key authentication. 144 | 145 | .TP 146 | Host Key Files 147 | 148 | Host key files are read at startup from a standard location, by default 149 | /etc/dropbear/dropbear_dss_host_key, /etc/dropbear/dropbear_rsa_host_key, 150 | /etc/dropbear/dropbear_ecdsa_host_key and /etc/dropbear/dropbear_ed25519_host_key 151 | 152 | If the -r command line option is specified the default files are not loaded. 153 | Host key files are of the form generated by dropbearkey. 154 | The -R option can be used to automatically generate keys 155 | in the default location - keys will be generated after startup when the first 156 | connection is established. 
This had the benefit that the system /dev/urandom 157 | random number source has a better chance of being securely seeded. 158 | 159 | .TP 160 | Message Of The Day 161 | 162 | By default the file /etc/motd will be printed for any login shell (unless 163 | disabled at compile-time). This can also be disabled per-user 164 | by creating a file ~/.hushlogin . 165 | 166 | .SH ENVIRONMENT VARIABLES 167 | Dropbear sets the standard variables USER, LOGNAME, HOME, SHELL, PATH, and TERM. 168 | 169 | The variables below are set for sessions as appropriate. 170 | 171 | .TP 172 | .B SSH_TTY 173 | This is set to the allocated TTY if a PTY was used. 174 | 175 | .TP 176 | .B SSH_CONNECTION 177 | Contains " ". 178 | 179 | .TP 180 | .B DISPLAY 181 | Set X11 forwarding is used. 182 | 183 | .TP 184 | .B SSH_ORIGINAL_COMMAND 185 | If a 'command=' authorized_keys option was used, the original command is specified 186 | in this variable. If a shell was requested this is set to an empty value. 187 | 188 | .TP 189 | .B SSH_AUTH_SOCK 190 | Set to a forwarded ssh-agent connection. 191 | 192 | .SH NOTES 193 | Dropbear only supports SSH protocol version 2. 194 | 195 | .SH AUTHOR 196 | Matt Johnston (matt@ucc.asn.au). 197 | .br 198 | Gerrit Pape (pape@smarden.org) wrote this manual page. 199 | .SH SEE ALSO 200 | dropbearkey(1), dbclient(1), dropbearconvert(1) 201 | .P 202 | https://matt.ucc.asn.au/dropbear/dropbear.html 203 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/tacc_longhorn.md: -------------------------------------------------------------------------------- 1 | # TACC-Longhorn 2 | 3 | - [Build Conda Environment ](#build-conda-environment ) 4 | - [Common Commands](#common-commands) 5 | - [Job Script Example](#job-script-example) 6 | - [Dataset and Transfer files](#dataset-and-transfer-files) 7 | - [Large Scale Experiment](#large-scale-experiment) 8 | - [DALI](#dali) 9 | - [Question Ticket](#question-ticket) 10 | 11 | ## Build Conda Environment 12 | 13 | You can follow the file "TACC distributed pytorch.pdf", which show an example about how to build your conda environment and run the code with interactive usage. But the example use PyTorch 1.1, and the newest version in the used [link](https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/) seems only PyTorch 1.3. In addition, unlike other TACC machines, Longhorn nodes are a PowerPC architecture (Power PC 64 LE). Thus, when pulling images from (e.g.) Docker Hub, make sure the image is Power PC 64 LE compatible. Here, you can use the method in this [link](https://stackoverflow.com/questions/52750622/how-to-install-pytorch-on-power-8-or-ppc64-machine/64528124#64528124?newreg=1b10fc8fcbed4beca9cdc3d4238359a5), to bulid a Python 3.6 and PyTorch 1.5 conda environment. 14 | 15 | ## Common Commands 16 | 17 | ```shell 18 | # To directory $SCRATCH. The default directory is $HOME when you login in. 19 | # Longhorn users must run all jobs in Longhorn's $SCRATCH file system. 
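# ($SCRATCH is an environment variable set by the system; if you are unsure
# where it points, print it first)
echo $SCRATCH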
20 | cd $SCRATCH 21 | 22 | # interactive usage 23 | idev -t 02:00:00 -N 1 -n 4 -p development 24 | 25 | # submit a job script 26 | sbatch /PATH/TO/job.slurm 27 | 28 | # check your job status 29 | squeue -u your_account 30 | 31 | # delete your job 32 | scancel job_ID 33 | 34 | # display estimated start time for job 35 | squeue --start -j job_ID 36 | 37 | # view loaded module 38 | # module related commands need at computing node 39 | module list 40 | 41 | # view avail module 42 | module available 43 | 44 | # active your conda environment 45 | # you need run it after you login in computing nodes everytime 46 | module load conda 47 | conda activate your_env_name 48 | 49 | # generate hostfile according to your current avaliable computing resources 50 | # Note, you maybe need run it after you login in computing nodes everytime, 51 | # because you may get different computing resources. 52 | scontrol show hostname > hostfile 53 | 54 | # run your code with interactive usage 55 | # -N means GPU numbers per node, -np means total GPU numbers 56 | mpiexec -hostfile hostfile -N 4 -np 8 python your_file.py 57 | ``` 58 | 59 | For more detailed information, please refer to [TACC Longhorn User Guide](https://portal.tacc.utexas.edu/user-guides/longhorn). 60 | 61 | ## Job Script Example 62 | 63 | ``` 64 | #!/bin/sh 65 | 66 | #SBATCH -J myjob # Job name 67 | #SBATCH -o myjob.o%j # Name of stdout output file 68 | #SBATCH -e myjob.e%j # Name of stderr error file 69 | #SBATCH -p v100 # Queue (partition) name 70 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 71 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 72 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 73 | #SBATCH --mail-user=your email address 74 | #SBATCH --mail-type=all # Send email at begin and end of job 75 | 76 | pwd 77 | date 78 | 79 | cd your_code_path 80 | module load conda 81 | conda activate your_env_name 82 | 83 | ibrun -np 8 \ 84 | python your_code.py 85 | ``` 86 | 87 | You should use ibrun instead of mpirun or mpiexec here, and -np means total GPU numbers. 88 | 89 | Note: Although in Longhorn tutorial, they use "#SBATCH -A myproject # Allocation name", you can actually delete it. If you incorrect set it, you will meet permission error when submit job. 90 | 91 | ## Dataset and Transfer files 92 | 93 | **Dataset** 94 | 95 | ```shell 96 | cp /scratch/00946/zzhang/data/imagenet-1k.tar /your/path 97 | tar xf imagenet-1k.tar 98 | # TFRecord format file 99 | /scratch/07801/nusbin20/imagenet-tar/ILSVRC2012_1k_TFRecord.tar 100 | ``` 101 | 102 | You can get a prepared ImageNet-1K (ILSVRC2012) use above command. 103 | 104 | Maybe you can find other prepared dataset around this path. 105 | 106 | **Transfer files** 107 | 108 | You can use scp or git clone command, or WinSCP, a visible tool can input TACC token, to transfer files. 109 | 110 | ```shell 111 | # scp command 112 | # tested in Git Bash on Windows10 113 | scp G:/your_path/xxx.tar your_account@longhorn.tacc.utexas.edu:/your_path/imagenet-tar 114 | ``` 115 | 116 | ## Large Scale Experiment 117 | 118 | (You can also consider DALI rather than this part) 119 | 120 | If a job uses many nodes (eg. 32 nodes with 128 GPUs) and reads large dataset(eg. 
ImageNet) directly from the hard disk, it will cause huge IO pressure on the file system, which may cause the job to be killed by the system : ( 121 | 122 | So, for large scale experiment, we need first move the data to the /tmp file system of nodes for temporary storage and then run the program, to reduce the IO pressure of the main file system. 123 | 124 | large scale experiment job script example 125 | 126 | ```shell 127 | #!/bin/sh 128 | 129 | #SBATCH -J myjob # Job name 130 | #SBATCH -o myjob.o%j # Name of stdout output file 131 | #SBATCH -e myjob.e%j # Name of stderr error file 132 | #SBATCH -p v100 # Queue (partition) name 133 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 134 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 135 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 136 | #SBATCH --mail-user=your email 137 | #SBATCH --mail-type=all # Send email at begin and end of job 138 | 139 | pwd 140 | date 141 | 142 | cd your_code_path 143 | module load conda 144 | conda activate py36pt 145 | 146 | scontrol show hostnames $SLURM_NODELIST > /tmp/hostfile 147 | 148 | cat /tmp/hostfile 149 | 150 | mpiexec -hostfile /tmp/hostfile -N 1 ./cp_imagenet_to_temp_bin.sh 151 | 152 | ibrun -np 8 \ 153 | python your_code.py 154 | ``` 155 | 156 | cp_imagenet_to_temp_bin.sh copy and extract imagenet data to following path, which may take 50 minutes :( 157 | 158 | The data in /tmp will be automatically deleted when job finished, which is located in the following path 159 | 160 | ```shell 161 | --train-dir=/tmp/imagenet/ILSVRC2012_img_train/ 162 | --val-dir=/tmp/imagenet/ILSVRC2012_img_val/ 163 | ``` 164 | 165 | ## DALI 166 | 167 | NVIDIA DALI can accelerate data loading and pre-processing using GPU rather than CPU, although with GPU memory tradeoff. 168 | 169 | It can also avoid some potential conflicts between MPI libraries and Horovod on some GPU clusters. 170 | 171 | **Install** 172 | 173 | ```shell 174 | conda install dali 175 | # because longhorn is Power architecture, we cannot use following command as other cluster. 176 | # pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110 177 | ``` 178 | 179 | **Usage** 180 | 181 | You need replace default PyTorch dataloader with dali_dataloader, I provide a PyTorch DALI example using ImageNet-1k at [here](https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch). This example has been tested with nvidia-dali-cuda110, maybe it needs some changes if you use it with Longhorn CUDA10 and Power architecture. 182 | 183 | DALI requires data in *TFRecord format* in the following structure: 184 | 185 | ``` 186 | train-recs 'path/train/*' 187 | val-recs 'path/validation/*' 188 | train-idx 'path/idx_files/train/*' 189 | val-idx 'path/idx_files/validation/*' 190 | ``` 191 | 192 | On longhorn, if you want use ImageNet-1k TFRecord data, you can directly use 193 | 194 | data-dir=/scratch/07801/nusbin20/ILSVRC2012_1k_TFRecord/ 195 | 196 | ```shell 197 | # TFRecord format tar file 198 | /scratch/07801/nusbin20/imagenet-tar/ILSVRC2012_1k_TFRecord.tar 199 | ``` 200 | 201 | About the parameters on DALI: 202 | 203 | - *prefetch_queue_depth* and *num_threads* might also be something to explore, as it can speed up your loading a lot, with some memory tradeoff. 
204 | - *last_batch_policy* you probably want PARTIAL on validation, and DROP during training: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html?highlight=last_batch_policy, just as I set in above example link. 205 | - *device* Above example link use device="mixed/gpu", for ImageNet-1k and GPU with 16GB, default PyTorch dataloader allows batchsize 128, while DALI can only use batchsize 64. If you set device="mixed/gpu" to "cpu", it won't need extra GPU memory, however copying directly to gpu makes the loading much faster. 206 | 207 | ## Question Ticket 208 | 209 | If you have some specific questions, you can sent them to [TACC Frontera Help Desk](https://frontera-portal.tacc.utexas.edu/user-guide/help/). -------------------------------------------------------------------------------- /scripts/nscc/multi_node/experiments/apps/software/dropbear/2020.80/share/man/man1/dbclient.1: -------------------------------------------------------------------------------- 1 | .TH dbclient 1 2 | .SH NAME 3 | dbclient \- lightweight SSH client 4 | .SH SYNOPSIS 5 | .B dbclient 6 | [\fIflag arguments\fR] [\-p 7 | .I port\fR] [\-i 8 | .I id\fR] [\-L 9 | .I l\fR:\fIh\fR:\fIp\fR] [\-R 10 | .I l\fR:\fIh\fR:\fIp\fR] [\-l 11 | .IR user ] 12 | .I host 13 | .RI [ \fImore\ flags\fR ] 14 | .RI [ command ] 15 | 16 | .B dbclient 17 | [\fIargs\fR] 18 | [\fIuser1\fR]@\fIhost1\fR[^\fIport1\fR],[\fIuser2\fR]@\fIhost2\fR[^\fIport2\fR],... 19 | 20 | .SH DESCRIPTION 21 | .B dbclient 22 | is a small SSH client 23 | .SH OPTIONS 24 | .TP 25 | .TP 26 | .B command 27 | A command to run on the remote host. This will normally be run by the remote host 28 | using the user's shell. The command begins at the first hyphen argument after the 29 | host argument. If no command is specified an interactive terminal will be opened 30 | (see -t and -T). 31 | .TP 32 | .B \-p \fIport 33 | Connect to 34 | .I port 35 | on the remote host. Alternatively a port can be specified as hostname^port. 36 | Default is 22. 37 | .TP 38 | .B \-i \fIidfile 39 | Identity file. 40 | Read the identity key from file 41 | .I idfile 42 | (multiple allowed). This file is created with dropbearkey(1) or converted 43 | from OpenSSH with dropbearconvert(1). The default path ~/.ssh/id_dropbear is used 44 | .TP 45 | .B \-L\fR [\fIlistenaddress\fR]:\fIlistenport\fR:\fIhost\fR:\fIport\fR 46 | Local port forwarding. 47 | Forward the port 48 | .I listenport 49 | on the local host through the SSH connection to port 50 | .I port 51 | on the host 52 | .IR host . 53 | .TP 54 | .B \-R\fR [\fIlistenaddress\fR]:\fIlistenport\fR:\fIhost\fR:\fIport\fR 55 | Remote port forwarding. 56 | Forward the port 57 | .I listenport 58 | on the remote host through the SSH connection to port 59 | .I port 60 | on the host 61 | .IR host . 62 | .TP 63 | .B \-l \fIuser 64 | Username. 65 | Login as 66 | .I user 67 | on the remote host. 68 | .TP 69 | .B \-t 70 | Allocate a PTY. This is the default when no command is given, it gives a full 71 | interactive remote session. The main effect is that keystrokes are sent remotely 72 | immediately as opposed to local line-based editing. 73 | .TP 74 | .B \-T 75 | Don't allocate a PTY. This is the default a command is given. See -t. 76 | .TP 77 | .B \-N 78 | Don't request a remote shell or run any commands. Any command arguments are ignored. 79 | .TP 80 | .B \-f 81 | Fork into the background after authentication. A command argument (or -N) is required. 82 | This is useful when using password authentication. 
83 | .TP 84 | .B \-g 85 | Allow non-local hosts to connect to forwarded ports. Applies to -L and -R 86 | forwarded ports, though remote connections to -R forwarded ports may be limited 87 | by the ssh server. 88 | .TP 89 | .B \-y 90 | Always accept hostkeys if they are unknown. If a hostkey mismatch occurs the 91 | connection will abort as normal. If specified a second time no host key checking 92 | is performed at all, this is usually undesirable. 93 | .TP 94 | .B \-A 95 | Forward agent connections to the remote host. dbclient will use any 96 | OpenSSH-style agent program if available ($SSH_AUTH_SOCK will be set) for 97 | public key authentication. Forwarding is only enabled if -A is specified. 98 | .TP 99 | .B \-W \fIwindowsize 100 | Specify the per-channel receive window buffer size. Increasing this 101 | may improve network performance at the expense of memory use. Use -h to see the 102 | default buffer size. 103 | .TP 104 | .B \-K \fItimeout_seconds 105 | Ensure that traffic is transmitted at a certain interval in seconds. This is 106 | useful for working around firewalls or routers that drop connections after 107 | a certain period of inactivity. The trade-off is that a session may be 108 | closed if there is a temporary lapse of network connectivity. A setting 109 | if 0 disables keepalives. If no response is received for 3 consecutive keepalives the connection will be closed. 110 | .TP 111 | .B \-I \fIidle_timeout 112 | Disconnect the session if no traffic is transmitted or received for \fIidle_timeout\fR seconds. 113 | .TP 114 | 115 | .\" TODO: how to avoid a line break between these two -J arguments? 116 | .B \-J \fIproxy_command 117 | .TP 118 | .B \-J \fI&fd 119 | .br 120 | Use the standard input/output of the program \fIproxy_command\fR rather than using 121 | a normal TCP connection. A hostname should be still be provided, as this is used for 122 | comparing saved hostkeys. This command will be executed as "exec proxy_command ..." with the 123 | default shell. 124 | 125 | The second form &fd will make dbclient use the numeric file descriptor as a socket. This 126 | can be used for more complex tunnelling scenarios. Example usage with socat is 127 | 128 | socat EXEC:'dbclient -J &38 ev',fdin=38,fdout=38 TCP4:host.example.com:22 129 | 130 | .TP 131 | .B \-B \fIendhost:endport 132 | "Netcat-alike" mode, where Dropbear will connect to the given host, then create a 133 | forwarded connection to \fIendhost\fR. This will then be presented as dbclient's 134 | standard input/output. 135 | .TP 136 | .B \-c \fIcipherlist 137 | Specify a comma separated list of ciphers to enable. Use \fI-c help\fR to list possibilities. 138 | .TP 139 | .B \-m \fIMAClist 140 | Specify a comma separated list of authentication MACs to enable. Use \fI-m help\fR to list possibilities. 141 | .TP 142 | .B \-o \fIoption 143 | Can be used to give options in the format used by OpenSSH config file. This is 144 | useful for specifying options for which there is no separate command-line flag. 145 | For full details of the options listed below, and their possible values, see 146 | ssh_config(5). 147 | The following options have currently been implemented: 148 | 149 | .RS 150 | .TP 151 | .B ExitOnForwardFailure 152 | Specifies whether dbclient should terminate the connection if it cannot set up all requested local and remote port forwardings. The argument must be “yes” or “no”. The default is “no”. 153 | .TP 154 | .B UseSyslog 155 | Send dbclient log messages to syslog in addition to stderr. 
156 | .RE 157 | .TP 158 | .B \-s 159 | The specified command will be requested as a subsystem, used for sftp. Dropbear doesn't implement sftp itself but the OpenSSH sftp client can be used eg \fIsftp -S dbclient user@host\fR 160 | .TP 161 | .B \-b \fI[address][:port] 162 | Bind to a specific local address when connecting to the remote host. This can be used to choose from 163 | multiple outgoing interfaces. Either address or port (or both) can be given. 164 | .TP 165 | .B \-V 166 | Print the version 167 | 168 | .SH MULTI-HOP 169 | Dropbear will also allow multiple "hops" to be specified, separated by commas. In 170 | this case a connection will be made to the first host, then a TCP forwarded 171 | connection will be made through that to the second host, and so on. Hosts other than 172 | the final destination will not see anything other than the encrypted SSH stream. 173 | A port for a host can be specified with a caret (eg matt@martello^44 ). 174 | This syntax can also be used with scp or rsync (specifying dbclient as the 175 | ssh/rsh command). A file can be "bounced" through multiple SSH hops, eg 176 | 177 | scp -S dbclient matt@martello,root@wrt,canyons:/tmp/dump . 178 | 179 | Note that hostnames are resolved by the prior hop (so "canyons" would be resolved by the host "wrt") 180 | in the example above, the same way as other -L TCP forwarded hosts are. Host keys are 181 | checked locally based on the given hostname. 182 | 183 | .SH ESCAPE CHARACTERS 184 | Typing a newline followed by the key sequence \fI~.\fR (tilde, dot) will terminate a connection. 185 | The sequence \fI~^Z\fR (tilde, ctrl-z) will background the connection. This behaviour only 186 | applies when a PTY is used. 187 | 188 | .SH ENVIRONMENT 189 | .TP 190 | .B DROPBEAR_PASSWORD 191 | A password to use for remote authentication can be specified in the environment 192 | variable DROPBEAR_PASSWORD. Care should be taken that the password is not 193 | exposed to other users on a multi-user system, or stored in accessible files. 194 | .TP 195 | .B SSH_ASKPASS 196 | dbclient can use an external program to request a password from a user. 197 | SSH_ASKPASS should be set to the path of a program that will return a password 198 | on standard output. This program will only be used if either DISPLAY is set and 199 | standard input is not a TTY, or the environment variable SSH_ASKPASS_ALWAYS is 200 | set. 201 | .SH NOTES 202 | If compiled with zlib support and if the server supports it, dbclient will 203 | always use compression. 204 | 205 | .SH AUTHOR 206 | Matt Johnston (matt@ucc.asn.au). 207 | .br 208 | Mihnea Stoenescu wrote initial Dropbear client support 209 | .br 210 | Gerrit Pape (pape@smarden.org) wrote this manual page. 
211 | .SH SEE ALSO 212 | dropbear(8), dropbearkey(1) 213 | .P 214 | https://matt.ucc.asn.au/dropbear/dropbear.html 215 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/nscc.md: -------------------------------------------------------------------------------- 1 | # NSCC AI System 2 | 3 | ## Table Of Contents 4 | 5 | - [Common Commands](#common-commands) 6 | - [Scheduler](#scheduler) 7 | - [Job Status Email Alert](#job-status-email-alert) 8 | - [Jupyter Lab](#jupyter-lab) 9 | - [Horovod](#horovod) 10 | - [Container](#container) 11 | - [Python Package Management](#python-package-management) 12 | - [Multinode Experiments](#multinode-experiments) 13 | - [Dataset](#dataset) 14 | 15 | ## Common Commands 16 | 17 | ```shell 18 | # submit interactive job 19 | qsub -I -q dgx-dev -l walltime=3:00:00 -P 20 | 21 | # submit a job script 22 | qsub /PATH/TO/job.pbs 23 | 24 | # check your job status 25 | qstat 26 | 27 | # view job info 28 | qstat -f 2603175.wlm01 29 | 30 | # delete your job 31 | qdel 2603175.wlm01 32 | 33 | # view job queue 34 | gstat -dgx 35 | 36 | # run interactive containers 37 | nscc-docker run -t nvcr.io/nvidia/pytorch 38 | singularity run pytorch\:latest.sif 39 | 40 | # run container with executable 41 | nscc-docker run nvcr.io/nvidia/pytorch 42 | singularity run pytorch\:latest.sif 43 | 44 | ``` 45 | 46 | > Some default options are added automatically for nscc-docker run 47 | > 48 | > > -u UID:GID --group-add GROUP –v /home:/home –v /raid:/raid -v /scratch:/scratch --rm –i --ulimit memlock=-1 --ulimit stack=67108864 49 | > > 50 | > > If --ipc=host is not specified then the following option is also added: --shm-size=1g 51 | 52 | ## Scheduler 53 | 54 | NSCC uses PBS Pro as the job scheduler, you can refer to the 55 | [NSCC official guide](https://help.nscc.sg/pbspro-quickstartguide/) for detailed instructions. 56 | 57 | ## Job Status Email Alert 58 | 59 | To receive notification when your job status changes, you can add the following to your PBS job script. 60 | 61 | ```shell 62 | #PBS -M username@x.y.z 63 | ``` 64 | 65 | ## Jupyter Lab 66 | 67 | If you want to debug your code or run many short experiments with exclusive GPU resources, you can launch a jupyter lab 68 | on the NSCC server. The procedure is as follows: 69 | 70 | ![](../doc_figures/nscc/NSCC_jupyter.png) 71 | 72 | 1. Setup NSCC VPN for your [Mac](https://help.nscc.sg/vpnmac/) or [Windows](https://help.nscc.sg/vpnmicrosoft/). You 73 | should not login via the school VPN as they do not provide outgoing internet access. The NSCC VPN allows you to 74 | access NSCC server from `aspire.nscc.sg`. You can send data to outside machine with this IP but not 75 | with `ntu.nscc.sg` or 76 | `nus.nscc.sg`. This ensures that you can access jupyter lab from your local machine later. 77 | 78 | 2. Log in to your nscc account by 79 | ```shell 80 | ssh @aspire.nscc.sg 81 | ``` 82 | 83 | 3. Submit a jupyter job. There are two ways to start the Jupyter Lab. There are two ways to start the jupyter lab. 84 | - The first way is to start Jupyter Lab in a container 85 | (You can refer to the [official guide](https://help.nscc.sg/wp-content/uploads/AI_System_QuickStart.pdf) 86 | for how to write this kind of script). 87 | - The second way is to start Jupyter Lab only and run containers in Jupyter Lab. You can refer 88 | to [my script](https://github.com/FrankLeeeee/oh-my-server/tree/main/scripts/nscc/multi_node/pbs_script). 
89 | 90 | The second way is highly recommended because you can access both your base environment in the compute node and the 91 | environment in containers and you can run containers with different images. For example, if you choose a wrong 92 | container via the first method, you have to quit the job and re-submit but this is not necessary for the second 93 | method. 94 | 95 | 4. Once your job is running, you need to check which host on which jupyter is running by: 96 | ```shell 97 | qstat -f 98 | ``` 99 | 100 | You can also check the job output file and get the port on which your jupyter is running. 101 | 102 | 5. Then, you can connect to your jupyter lab via port forwarding. The command is like 103 | ```shell 104 | ssh -L :: @aspire.nscc.sg 105 | 106 | # example 107 | ssh -L 8888:dgx4106:8888 u12345@aspire.nscc.sg 108 | ``` 109 | The `NSCC_HOST` and `NSCC_PORT` are found in step 4. `LOCAL_PORT` can be any port. 110 | 111 | 6. Finally, open your browser and enter `localhost:` to access the remote jupyter lab. 112 | 113 | ## Horovod 114 | 115 | --- 116 | 117 | **UPDATE: You can try this new Singularity 118 | image `/home/projects/ai/singularity/nscc/horovod_0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1.sif` which 119 | has Horovod pre-installed now** 120 | 121 | --- 122 | 123 | If you are using TensorFlow, you can use docker/singularity image directly as Horovod is pre-installed. 124 | 125 | If you are using PyTorch, you need to install horovod manually. To install Horovod on NSCC, just run the following 126 | script in the container. I used Singularity as it is more sutiable for multinode training. 127 | 128 | ```shell 129 | #!/bin/bash 130 | 131 | export HOROVOD_NCCL_INCLUDE_=/usr/include 132 | export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu 133 | export HOROVOD_NCCL_LINK=SHARED 134 | 135 | export HOROVOD_GPU_ALLREDUCE=NCCL 136 | export HOROVOD_WITH_PYTORCH=1 137 | pip install --no-cache-dir --user horovod==0.18.2 138 | ``` 139 | 140 | ## Container 141 | 142 | NSCC provides both Singularity and Docker containers. The containers used mostly are NGC contianers. The Singularity 143 | ones are obtained by converting Docker image to Singularity image. Some points that I have observed are that: 144 | 145 | 1. You do not have root access in both Docker and Singularity container. 146 | 2. `singularity shell xxx.sif` behaves interestingly different from `singularity run xxx.sif`. Even though both will 147 | start bash as the default shell, `singularity shell` does not source your bashrc file while `singularity run` 148 | actually does. So your init script in bashrc will be omitted when you execute `singularity shell`. It is desgined to 149 | be so and look at https://github.com/hpcng/singularity/issues/643 for more info. 150 | 151 | ## Python Package Management 152 | 153 | It is sometimes possible that the container does not provide the Python libraries you need. In this case, you need to be 154 | careful in managing these libraries. 155 | 156 | You can install your library to your user directory by: 157 | 158 | ```shell 159 | pip install --user 160 | ``` 161 | 162 | This should install Python library to `~/.local/lib` of your home directory. 163 | 164 | You can also check the directories where Python searches for libraries during `import` by running 165 | 166 | ```shell 167 | python -m site 168 | ``` 169 | 170 | If you container Python does not source your Python package, you can perform one of the following methods to add these 171 | libraries to PYTHONPATH. 
172 | 173 | - export PYTHONPATH=:$PYTHONPATH 174 | - add `sys.path.insert(0, )` in your python file For example, the `LIBRARY_ROOT_PATH` can 175 | be `/home/users/ntu/c170166/.local/lib/python3.6/site-packages`. 176 | 177 | If you are not sure where a specific library is imported from when you run your script, you can check like below: 178 | 179 | ```python 180 | # take pytorch as an example 181 | import torch 182 | 183 | print(torch.__file__) 184 | ``` 185 | 186 | ## Multinode Experiments 187 | 188 | First of all, it takes a long time to queue for resources on NSCC. It is normal to wait for a few days if you request 189 | for 2 nodes with 4 GPUs each. At the initial debugging stage, it is definitely crazy if you submit a job script again 190 | and again. Thus, I would highly recommend you to debug using the Jupyter Lab as mentioned above. 191 | 192 | To run experiments on multinode on NSCC, you need to follow these steps. 193 | 194 | ``` 195 | # Use OMPI integrated with PBS 196 | export PATH=/home/app/dgx/openmpi-3.1.3-gnu/bin:$PATH 197 | 198 | # run the script in container 199 | mpirun --mca btl_openib_warn_default_gid_prefix 0 \ 200 | --host dgx4106:1,dgx4105:1 -N 1 --np 2 \ 201 | /opt/singularity/bin/singularity exec --nv \ 202 | /home/project/ai/singularity/nvcr.io/nvidia/pytorch\:latest.sif \ 203 | python 204 | ``` 205 | 206 | For PyTorch users, you can use `horovod` for cross-node communication. However, if you are using `torch.distributed`, 207 | you need to specify the communication interface by setting environment variables `NCCL_SOCKET_IFNAME`. Currently, `enp1s0f1` 208 | is for InfiniBand. You can do `NCCL_SOCKET_IFNAME=enp1s0f1 python your_script.py` or add `os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f1'` 209 | in your Python file. If you are using `gloo` backend, the environment variable will be `GLOO_SOCKET_IFNAME`. This is to 210 | force PyTorch use InfiniBand for communication. Otherwise, the program will be stuck at initialization based on my test. 211 | 212 | ## Dataset and Transfer files 213 | 214 | The ImageNet dataset is placed at `/scratch/users/nus/e0575577/ImageNet` if you are using the shared account. 215 | 216 | Maybe you can find other prepared dataset around this path. 217 | 218 | **Transfer files** 219 | 220 | You can use scp or git clone command, FileZilla or WinSCP, a visible tool, to transfer files. 221 | 222 | ```shell 223 | # scp command 224 | # tested in Git Bash on Windows10 225 | scp G:/globus_share/xxx.tar your_account@aspire.nscc.sg:/your_path/imagenet-tar 226 | ``` 227 | 228 | -------------------------------------------------------------------------------- /docs/hpc_system_notes/cscs.md: -------------------------------------------------------------------------------- 1 | # CSCS 2 | 3 | - [Before Usage](#before-usage ) 4 | - [Interactive Usage](#interactive-usage) 5 | - [Common Commands](#common-commands) 6 | - [PyTorch Job Script Example](#pytorch-job-script-example) 7 | - [Horovod Gloo](#horovod_gloo) 8 | - [Spark](#spark) 9 | - [TensorFlow](#tensorflow) 10 | - [Dataset and Transfer files](#dataset-and-transfer-files) 11 | - [Large Scale Experiment](#large-scale-experiment) 12 | - [DALI](#dali) 13 | - [Question Ticket](#question-ticket) 14 | 15 | ## Before Usage 16 | 17 | **Get Your Account** 18 | 19 | 1. Request the Prof invites you to current CSCS project (each project can only has a very limited members) 20 | 2. Waiting for CSCS administrator approval 21 | 3. 
Receive the invitation e-mail and provide your personal information (your passport PDF is needed) 22 | 4. Waiting for CSCS administrator approval again 23 | 5. Receive account e-mail with password 24 | 25 | According to [CSCS User Regulations](https://www.cscs.ch/services/user-regulations/), **CSCS does not allow sharing of accounts; the applicant will be immediately barred from all present and future use of CSCS facilities and is fully liable for all consequences arising from the infraction if such activity occurs**. 26 | 27 | **Connect the System** 28 | 29 | ssh ela.cscs.ch 30 | 31 | This is just the front end of the system; you cannot access the $SCRATCH file system or submit jobs from it. Then, 32 | 33 | ssh daint.cscs.ch 34 | 35 | You can access the $SCRATCH file system and submit jobs from here. You cannot connect to daint directly without connecting to ela first. 36 | 37 | **Check Project Status** 38 | 39 | You can go to [https://account.cscs.ch](https://account.cscs.ch/) to check the remaining quota hours, as shown below 40 | 41 | ![](../doc_figures/cscs/cscs_project_information.jpg) 42 | 43 | The quota here is the whole lab's quota rather than your own account's, so resources are very limited; please be careful and economical. One node h means occupying one node (one P100 GPU) for one hour, so we need to balance job walltime against the requested resources. In my test, training ImageNet-1k with PyTorch for 90 epochs needs about two to three hundred node h per run. 44 | 45 | ## Interactive Usage 46 | 47 | **Interactive Usage** 48 | 49 | For interactive debugging, unlike other clusters where you use commands, you can access computing resources from your browser through a Jupyter-based user interface. 50 | 51 | 1. Login at https://jupyter.cscs.ch/hub/home 52 | 53 | 2. Right click 'Start My Server' and select 'Open link in new tab' 54 | 55 | ![](../doc_figures/cscs/jupyter_step1.jpg) 56 | 57 | 3. Select the nodes and duration you need. 58 | 59 | Each node has only one P100 GPU, and requesting 4 or more nodes here often fails; please consider using a job.slurm script for large tests. 60 | 61 | ![](../doc_figures/cscs/jupyter_step2.jpg) 62 | 63 | 4. Cancel the interactive resource usage 64 | 65 | Directly closing the JupyterLab will **not** stop the job; it will keep consuming resources in the background. 66 | 67 | Click the refresh button in the upper left corner of the previous page, and you will see 'Stop My Server'. Click it and refresh again; if the page looks like step 2 again, the job has been successfully terminated. 68 | 69 | ![](../doc_figures/cscs/jupyter_step3.jpg) 70 | 71 | **Build Conda Environment** 72 | 73 | After entering JupyterLab, you can build your conda environment just as you would on any other Linux system, e.g. using miniconda. 74 | 75 | Most environments, such as PyTorch, TensorFlow and Spark, can be loaded directly as modules; you only need conda for packages that are not available as modules. 76 | 77 | ## Common Commands 78 | 79 | ```shell 80 | # Go to directory $SCRATCH. 81 | # put and run all job-related files in the $SCRATCH file system. 82 | cd $SCRATCH 83 | 84 | # submit a job script 85 | sbatch /PATH/TO/job.slurm 86 | 87 | # check your job status 88 | squeue -u your_account 89 | 90 | # cancel your job 91 | scancel job_ID 92 | 93 | # view loaded modules 94 | # module related commands need to be run on a computing node 95 | module list 96 | 97 | # view available modules 98 | module available 99 | 100 | # load CUDA 101 | module load daint-gpu 102 | 103 | # activate your conda environment 104 | .
/users/your_account/miniconda3/bin/activate your_env_name 105 | 106 | # run your code 107 | # don't need to select how many nodes or GPUs, default use all available resources that you requested. 108 | srun python your_file.py 109 | ``` 110 | 111 | For more detailed information, please refer to [CSCS User Guide](https://user.cscs.ch/access/running/). 112 | 113 | According to [CSCS Regulations](https://user.cscs.ch/access/running/#slurm-best-practices-on-cscs-cray-systems), **Users are not supposed to submit arbitrary amounts of Slurm jobs and commands at the same time**. 114 | 115 | **It is possible to manually check the output file or log generated from your code or job, and in general, they actually has been finished for a while before the Slurm job finished. So you can cancel the Slurm job in advance manually, which means you can save time waiting for the Slurm job start and finish (because request less walltime) and save our computing resources, especially when you use a lot nodes to parallel.** At least it was so when I tested PyTorch and Spark. 116 | 117 | ## PyTorch Job Script Example 118 | 119 | ``` 120 | #!/bin/bash -l 121 | 122 | #SBATCH --job-name=my_cscs_job 123 | #SBATCH --time=00:10:00 124 | #SBATCH --nodes=2 125 | #SBATCH --ntasks-per-core=1 126 | #SBATCH --ntasks-per-node=1 127 | #SBATCH --cpus-per-task=12 128 | #SBATCH --constraint=gpu 129 | #SBATCH --account= 130 | 131 | module load daint-gpu 132 | module load PyTorch 133 | 134 | export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 135 | 136 | # Environment variables needed by the NCCL backend 137 | export NCCL_DEBUG=INFO 138 | export NCCL_IB_HCA=ipogif0 139 | export NCCL_IB_CUDA_SUPPORT=1 140 | 141 | . /users/your_account/miniconda3/bin/activate your_env_name 142 | 143 | cd your_code_path 144 | 145 | srun \ 146 | python -u your_code.py \ 147 | --epochs 90 \ 148 | --model resnet50 \ 149 | > ${SLURM_JOBID}.out 2> ${SLURM_JOBID}.err 150 | ``` 151 | 152 | Number of nodes define by --nodes, and each node only has one P100 GPU. 153 | 154 | We can noly use the 'normal' queue, because the project is too small and this kind of projects are not entitled to use the "low' queue. So once we have used our quarterly allocation we can only wait until the next allocation period to start. 155 | 156 | CSCS also provides a general PyTorch guide at https://user.cscs.ch/computing/data_science/pytorch/, but some content maybe not works at least when I tested. 157 | 158 | **Default PyTorch dataloader does not work if you use Horovod for distribution** (maybe because the specific MPI), please **use torch.distributed, Horovod Gloo, or Horovod + DALI** (please refer to DALI or Horovod Gloo section). 159 | 160 | ## Horovod Gloo 161 | 162 | ``` 163 | #!/bin/bash -l 164 | 165 | #SBATCH --job-name=gloo-eb 166 | #SBATCH --time=00:30:00 167 | #SBATCH --nodes=4 168 | #SBATCH --ntasks-per-node=1 169 | #SBATCH --constraint=gpu 170 | #SBATCH --account= 171 | #SBATCH --output=test_pt_hvd_%j.out 172 | 173 | module load daint-gpu PyTorch 174 | 175 | . 
/users/lyongbin/miniconda3/bin/activate your_env_name 176 | 177 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 178 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 179 | 180 | for node in $(scontrol show hostnames); do 181 | HOSTS="$HOSTS$node:$SLURM_NTASKS_PER_NODE," 182 | done 183 | HOSTS=${HOSTS%?} # trim trailing comma 184 | echo HOSTS $HOSTS 185 | 186 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ipogif0 \ 187 | --start-timeout 120 --gloo-timeout-seconds 120 \ 188 | python -u your_code.py \ 189 | --epochs 90 \ 190 | --model resnet50 191 | ``` 192 | 193 | Horovod Gloo works without using MPI, so you can use default PyTorch dataloader as on other clusters. 194 | 195 | ## Spark 196 | 197 | Please refer to https://user.cscs.ch/computing/data_science/spark/, it looks like they have fixed the bugs here I encountered before. 198 | 199 | Please note that the example in above link use CPU nodes, while our project only can access GPU nodes when I used, so we need appropriate modify it, such as SPARK_WORKER_CORES=12 and module load. 200 | 201 | In the example, the version number of spark-examples_2.11-2.4.7.jar maybe inappropriate. If you meet 'ERROR: file doesn't exist' , please use ‘module available’ to check the Spark location, and go to that path to find the right jar name. 202 | 203 | When generate yourself jar, please do not include packages in your local environment into the jar file. 204 | 205 | They also don't have HDFS on Piz Daint, we need put our data directly on $SCRATCH file system. 206 | 207 | ## TensorFlow 208 | 209 | Please refer to https://user.cscs.ch/computing/data_science/tensorflow/, I have test the examples here. 210 | 211 | ## Dataset and Transfer files 212 | 213 | **Dataset** 214 | 215 | Each user only has 1M files quota on Piz Daint $SCRATCH, users with over 1 million files and folders will be warned at submit time and will not be able to submit new jobs. 216 | 217 | So for large dataset like ImageNet-1K (ILSVRC2012), which has about 1.4M, we cannot use it as what we do in other cluster. Although the administrator temporarily raised the limit of another member, I was denied when I requested the same operation : ( 218 | 219 | You can access the ImageNet-1K (ILSVRC2012) at following path temporarily: 220 | 221 | ```shell 222 | --train-dir /scratch/snx3000/hliu/imagenet/train 223 | --val-dir /scratch/snx3000/hliu/imagenet/val 224 | ``` 225 | 226 | The long-term method is to use NVIDIA DALI with TFRecord, please refer to DALI section. 227 | 228 | If you need use other large dataset has more than 1M files, you need to convert them to TFRecord or LMDB format to reduce the number of files occupied, which maybe need use the appropriate dataloader when use them. 229 | 230 | **Transfer files** 231 | 232 | Unlike other cluster where we can use scp command, FileZilla or WinSCP, on CSCS, we cannot connect the daint.cscs.ch $SCRATCH file system directly. 233 | 234 | You can use first transfer them to $HOME, then cp them to \$SCRATCH, or use git clone. 235 | 236 | **If you use your own personal account**( login need google account and it's pretty wired if one CSCS account use multiple google account at different IP), a better way is to use globus to transfer files, which actually works like FileZilla or WinSCP and can connect the daint.cscs.ch $SCRATCH file system directly. 
Please refer to https://user.cscs.ch/storage/transfer/external/ 237 | 238 | You need to install the Globus personal client on your local machine and set a visible path. 239 | 240 | ![](../doc_figures/cscs/globus.jpg) 241 | 242 | 243 | 244 | ## DALI 245 | 246 | NVIDIA DALI can accelerate data loading and pre-processing by using the GPU rather than the CPU, although with a GPU memory tradeoff. 247 | 248 | It can also avoid some potential conflicts between MPI libraries and Horovod on some GPU clusters. 249 | 250 | **Install** 251 | 252 | ```shell 253 | module load daint-gpu PyTorch 254 | . /users/your_account/miniconda3/bin/activate your_env_name 255 | pip install tensorboard tqdm 256 | pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110 257 | ``` 258 | 259 | **Usage** 260 | 261 | You need to replace the default PyTorch dataloader with a DALI dataloader; I provide a PyTorch DALI example using ImageNet-1k [here](https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch). 262 | 263 | DALI requires data in *TFRecord format* in the following structure: 264 | 265 | ``` 266 | train-recs 'path/train/*' 267 | val-recs 'path/validation/*' 268 | train-idx 'path/idx_files/train/*' 269 | val-idx 'path/idx_files/validation/*' 270 | ``` 271 | 272 | On CSCS, if you want to use the ImageNet-1k TFRecord data, you can directly use data-dir=/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/ 273 | 274 | About the parameters of DALI: 275 | 276 | - *prefetch_queue_depth* and *num_threads* might also be something to explore, as they can speed up your loading a lot, with some memory tradeoff. 277 | - *last_batch_policy*: you probably want PARTIAL for validation and DROP during training: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html?highlight=last_batch_policy, just as set in the example linked above. 278 | - *device*: the example linked above uses device="mixed"/"gpu"; for ImageNet-1k on a GPU with 16 GB, the default PyTorch dataloader allows batch size 128, while DALI can only use batch size 64. If you change device from "mixed"/"gpu" to "cpu", no extra GPU memory is needed; however, copying directly to the GPU makes loading much faster. 279 | 280 | ## Question Ticket 281 | 282 | If you have a specific question, you can send a question ticket to the CSCS staff. (Do not send them an email; although the CSCS User Guide says so, the e-mail method has actually been retired.) 283 | 284 | Go to https://jira.cscs.ch/plugins/servlet/desk/site/global , click 'Generic request' under 'Open a case', and describe the problem you encountered together with the location of the relevant code and log files. The staff will answer it soon during working hours on European working days. -------------------------------------------------------------------------------- /docs/hpc_system_notes/tacc_frontera.md: -------------------------------------------------------------------------------- 1 | # TACC-Frontera 2 | 3 | - [Build Virtualenv Environment ](#build-virtualenv-environment ) 4 | - [Common Commands](#common-commands) 5 | - [Job Script Example](#job-script-example) 6 | - [Horovod Gloo](#horovod_gloo) 7 | - [Dataset and Transfer files](#dataset-and-transfer-files) 8 | - [Large Scale Experiment](#large-scale-experiment) 9 | - [DALI](#dali) 10 | - [Potential Error](#potential-error) 11 | - [Question Ticket](#question-ticket) 12 | ## Build Virtualenv Environment 13 | 14 | According to the TACC staff's personal instructions, we should **use Python virtualenv instead of Conda** to build the environment on TACC-Frontera.
15 | 16 | ```shell 17 | # interactive usage 18 | idev -p rtx-dev -N 1 -n 4 -t 02:00:00 19 | 20 | # bulid Python virtualenv 21 | cd ~ 22 | mkdir python-env 23 | cd python-env 24 | # you can use other names 25 | virtualenv cuda10-home 26 | 27 | # active environment 28 | source ~/python-env/cuda10-home/bin/activate 29 | 30 | # deactivate environment 31 | deactivate  32 | 33 | # you can use pip to install packages 34 | ``` 35 | 36 | Build Horovod with Pytorch. 37 | 38 | ```shell 39 | # login in computing nodes and virtualenv 40 | idev -p rtx-dev -N 1 -n 4 -t 02:00:00 41 | 42 | # load modules (cannot use default intel/19 impi/19) 43 | module load intel/18.0.5 impi/18.0.5 44 | module load cuda/10.1 cudnn nccl 45 | 46 | source ~/python-env/cuda10-home/bin/activate 47 | 48 | # install Pytorch, example 1.5.1 49 | pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html --force-reinstall 50 | 51 | # bulid Horovod, maybe spend ten minutes 52 | HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install horovod --no-cache-dir --force-reinstall 53 | ``` 54 | 55 | **Default PyTorch dataloader does not work if you use Horovod for distribution directly** (maybe because the specific MPI), please **use torch.distributed, Horovod Gloo, or Horovod + DALI** (please refer to DALI or Horovod Gloo section). 56 | 57 | ## Common Commands 58 | 59 | ```shell 60 | # To directory $SCRATCH. The default directory is $HOME when you login in. 61 | cd $SCRATCH 62 | 63 | # interactive usage 64 | idev -p rtx-dev -N 1 -n 4 -t 02:00:00 65 | 66 | # submit a job script 67 | sbatch /PATH/TO/job.slurm 68 | 69 | # check your job status 70 | squeue -u your_account 71 | 72 | # view requested nodes ID 73 | squeue | grep your_account 74 | 75 | # view nodes local file space 76 | ssh c196-032 df 77 | 78 | # delete your job 79 | scancel job_ID 80 | 81 | # display estimated start time for job 82 | squeue --start -j job_ID 83 | 84 | # view loaded module 85 | # module related commands need at computing node 86 | module list 87 | 88 | # view avail module 89 | module avail 90 | 91 | # active your environment and load cuda components 92 | # you need run it after you login in computing nodes everytime 93 | source ~/python-env/cuda10-home/bin/activate 94 | module load cuda/10.1 cudnn nccl 95 | # module load cuda/10.0 cudnn nccl 96 | # module load cuda/11.0 cudnn/8.0.5 nccl/2.8.3 97 | 98 | # run your code, n is the number of total GPUs 99 | ibrun -np 4 python pytorch_imagenet_resnet.py 100 | ``` 101 | 102 | For more detailed information, please refer to [TACC Frontera User Guide](https://frontera-portal.tacc.utexas.edu/user-guide/quickstart/). 
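Before queueing long jobs, it can help to sanity-check that the Horovod you just built actually sees NCCL and all four GPUs on a node. Below is a minimal sketch (not part of this repo); save it under any name, e.g. `check_hvd.py`, and run it with `ibrun -np 4 python check_hvd.py`:

```python
import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU under ibrun
torch.cuda.set_device(hvd.local_rank())  # pin each rank to its own GPU

# a tiny allreduce exercises NCCL across all ranks
x = torch.ones(1, device="cuda") * hvd.rank()
avg = hvd.allreduce(x, name="sanity_check")

print(f"rank {hvd.rank()}/{hvd.size()} on GPU {hvd.local_rank()}: allreduce -> {avg.item()}")
```

If every rank prints the same averaged value, the NCCL build and GPU pinning are working; if the script hangs, re-check the module and virtualenv setup above.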
103 | 104 | ## Job Script Example 105 | 106 | ``` 107 | #!/bin/sh 108 | 109 | #SBATCH -J myjob # Job name 110 | #SBATCH -o myjob.o%j # Name of stdout output file 111 | #SBATCH -e myjob.e%j # Name of stderr error file 112 | #SBATCH -p rtx # Queue (partition) name 113 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 114 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 115 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 116 | #SBATCH --mail-user=your email address 117 | #SBATCH --mail-type=all # Send email at begin and end of job 118 | 119 | pwd 120 | date 121 | 122 | cd your_code_path 123 | 124 | source ~/python-env/cuda10-home/bin/activate 125 | module load intel/18.0.5 impi/18.0.5 126 | module load cuda/10.1 cudnn nccl 127 | 128 | ibrun -np 8 \ 129 | python your_code.py 130 | ``` 131 | 132 | You should use ibrun instead of mpirun or mpiexec here, and -np means total GPU numbers. 133 | 134 | Note: Although in Frontera tutorial, they use "#SBATCH -A myproject # Allocation name", you can actually delete it. If you incorrect set it, you will meet permission error when submit job. 135 | 136 | From 6 April 2021, for Frontera jobs, a new queue named *small* has been created specifically for one and two node jobs. Jobs of one or two nodes that will run for up to 48 hours should be submitted to the *small* queue. The *normal* queue now has a lower limit of 3 nodes for all jobs. This should improve the turnaround time for jobs in the *normal* queue and small jobs in the *small* queue. 137 | 138 | ## Horovod Gloo 139 | 140 | ``` 141 | #!/bin/sh 142 | 143 | #SBATCH -J myjob # Job name 144 | #SBATCH -o myjob-t.o%j # Name of stdout output file 145 | #SBATCH -e myjob-t.e%j # Name of stderr error file 146 | #SBATCH -p rtx # Queue (partition) name 147 | #SBATCH -N 2 # Total # of nodes (must be 1 for serial) 148 | #SBATCH -n 8 # Total # of mpi tasks (should be 1 for serial) 149 | #SBATCH -t 00:30:00 # Run time (hh:mm:ss) 150 | 151 | pwd 152 | date 153 | 154 | source ~/python-env/cuda10-home/bin/activate 155 | 156 | module load intel/18.0.5 impi/18.0.5 157 | module load cuda/10.1 cudnn nccl 158 | 159 | cd /file_path/ 160 | 161 | export PMI_NO_PREINITIALIZE=1 # avoid warnings on fork 162 | # unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS 163 | 164 | # 4 means $SLURM_NTASKS_PER_NODE 165 | for node in $(scontrol show hostnames); do 166 | HOSTS="$HOSTS$node:4," 167 | done 168 | HOSTS=${HOSTS%?} # trim trailing comma 169 | echo HOSTS $HOSTS 170 | 171 | horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ib0 \ 172 | --start-timeout 120 --gloo-timeout-seconds 120 \ 173 | python you_file.py \ 174 | --epochs 90 \ 175 | --model resnet50 176 | ``` 177 | 178 | Horovod Gloo works without using MPI, so you can use default PyTorch dataloader as on other clusters. 179 | 180 | ## Dataset and Transfer files 181 | 182 | **Dataset** 183 | 184 | ```shell 185 | cp /scratch1/00946/zzhang/imagenet/imagenet-1k.tar /your/path 186 | tar xf imagenet-1k.tar 187 | 188 | --train-dir=/your/path/ILSVRC2012_img_train/ 189 | --val-dir=/your/path/ILSVRC2012_img_val/ 190 | ``` 191 | 192 | You can get a prepared ImageNet-1K (ILSVRC2012) use above command. 193 | 194 | Maybe you can find other prepared dataset around this path. 195 | 196 | **Transfer files** 197 | 198 | You can use scp or git clone command, or WinSCP, a visible tool can input TACC token, to transfer files. 
## Large Scale Experiment

(You can also consider DALI instead of the approach in this section.)

If a job uses many nodes (e.g. 32 nodes with 128 GPUs) and reads a large dataset (e.g. ImageNet) directly from disk, it puts huge I/O pressure on the shared file system, which may get the job killed by the system.

Unlike TACC-Longhorn, where /tmp can hold ImageNet directly, the /tmp file system on TACC-Frontera is only around 100 GB, which is not enough to copy and extract the whole ImageNet.

Since we can use at most 16 nodes with 64 GPUs, it may still be acceptable to read data from $SCRATCH, which is the simplest option. The following method is fairly tricky and only works for ImageNet.

According to instructions from TACC staff, there is a tool called FanStore that can load the preprocessed ImageNet directly into the /tmp file system.

Preprocessed ImageNet

```shell
# copy the preprocessed ImageNet-1K (ILSVRC2012), which is a folder of binary files
cp -r /scratch1/00946/zzhang/imagenet/imagenet-16parts /your/path

# imagenet-tiny has the same structure as ImageNet-1K (ILSVRC2012) but is much smaller
cp -r /scratch1/00946/zzhang/imagenet/imagenet-tiny-16parts /your/path
```

Job Script Example

```shell
source ~/python-env/cuda10-home/bin/activate

# load the data
# In interactive usage you do not need "& sleep 500"; when you see "Ready" in the output,
# press Ctrl+Z and then run "bg 1".
# Sometimes this step may get stuck.
export FS_ROOT=/tmp/fs_`id -u`
ibrun -np 2 /work/00410/huang/share/read_remote_file 16 /scratch1/07801/nusbin20/imagenet-16parts & sleep 500

# load modules
module load cuda/10.1 cudnn nccl

cd /scratch1/07801/nusbin20/tacc-our

# read_remote_file and wrapper.so are binaries provided by TACC staff, so their internals are not documented here
ibrun -np 8 LD_PRELOAD=/work/00410/huang/share/wrapper.so python pytorch_imagenet_resnet.py --epochs 90 --model resnet50 --batch-size 128 --train-dir=/tmp/fs_871009/ILSVRC2012_img_train --val-dir=/tmp/fs_871009/ILSVRC2012_img_val
```

Run `id -u` to find your user ID and replace `871009` above with it.
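Rather than hard-coding the numeric user ID, the same launch can reuse the `$FS_ROOT` variable exported above (an untested variation of the command; everything else is unchanged):

```shell
# FS_ROOT was exported as /tmp/fs_<uid> above, so reuse it instead of the literal ID
ibrun -np 8 LD_PRELOAD=/work/00410/huang/share/wrapper.so python pytorch_imagenet_resnet.py \
    --epochs 90 --model resnet50 --batch-size 128 \
    --train-dir=$FS_ROOT/ILSVRC2012_img_train --val-dir=$FS_ROOT/ILSVRC2012_img_val
```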
## DALI

NVIDIA DALI can accelerate data loading and pre-processing by using the GPU rather than the CPU, at the cost of some GPU memory.

It can also avoid some potential conflicts between MPI libraries and Horovod on some GPU clusters.

On TACC-Frontera, DALI with PyTorch + Horovod only works on a single node with 4 GPUs, presumably because of the cluster's specific MPI; with multiple nodes it still does not work.

**Install**

Build DALI with PyTorch, Horovod, and CUDA 11.0.

```shell
# log in to a compute node and activate the virtualenv
idev -p rtx-dev -N 1 -n 4 -t 02:00:00

source ~/python-env/cuda110-home/bin/activate

# load modules (the default intel/19 impi/19 cannot be used)
module load intel/18.0.5 impi/18.0.5
module load cuda/11.0 cudnn nccl

# install PyTorch, e.g. 1.7.1
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html --force-reinstall

# build Horovod; this may take around ten minutes
HOROVOD_CUDA_HOME=$TACC_CUDA_DIR HOROVOD_NCCL_HOME=$TACC_NCCL_DIR CC=gcc HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install horovod --no-cache-dir --force-reinstall

# install DALI
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110
# pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda100

# installing with --user puts packages in a different location
# export PYTHONPATH=~/.local/lib/python3.7/site-packages
```

**Usage**

You need to replace the default PyTorch dataloader with a DALI dataloader. A PyTorch DALI example using ImageNet-1k is provided [here](https://github.com/NUS-HPC-AI-Lab/LARS-ImageNet-PyTorch). This example has been tested with nvidia-dali-cuda110 and may need some changes if you use it with CUDA 10.

DALI requires the data in *TFRecord format* with the following structure:

```
train-recs 'path/train/*'
val-recs 'path/validation/*'
train-idx 'path/idx_files/train/*'
val-idx 'path/idx_files/validation/*'
```

On Frontera, if you want to use the ImageNet-1k TFRecord data, you can directly use

```shell
# TFRecord directory
data-dir=/scratch1/07801/nusbin20/ILSVRC2012_1k_TFRecord/

# TFRecord format tar file
/scratch1/07801/nusbin20/imagenet-tar/ILSVRC2012_1k_TFRecord.tar
```

About the DALI parameters:

- *prefetch_queue_depth* and *num_threads* are worth exploring, since they can speed up loading a lot at the cost of some memory.
- *last_batch_policy*: you probably want PARTIAL for validation and DROP during training (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html?highlight=last_batch_policy), as set in the example linked above.
- *device*: the example linked above uses device="mixed"/"gpu". For ImageNet-1k on a 16 GB GPU, the default PyTorch dataloader allows batch size 128 while DALI can only use batch size 64. If you change device="mixed"/"gpu" to "cpu", no extra GPU memory is needed, but copying directly to the GPU makes loading much faster.

## Potential Error

- **Program works well on one node but gets stuck right after it starts running on two nodes.**

For a PyTorch program with Horovod, check `num_workers` in `torch.utils.data.DataLoader`; you may need to set it to 0 to force synchronous I/O, because there are potential conflicts between Horovod and asynchronous I/O on TACC-Frontera. In addition, you may not be able to use `torch.multiprocessing.set_start_method('spawn')`. Setting `num_workers=0` makes the program run but seriously slows it down, so you are better off using torch.distributed rather than Horovod on TACC-Frontera.
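As a rough sketch, a single-node torch.distributed run can be started with the stock PyTorch launcher (the script name is a placeholder; multi-node runs additionally need `--nnodes`, `--node_rank`, `--master_addr`, and `--master_port`, along with the init_method discussed below):

```shell
# launch 4 processes on one node; the script must accept the --local_rank argument
# that the launcher passes to each process
python -m torch.distributed.launch --nproc_per_node=4 your_code.py
```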
For a PyTorch program with torch.distributed, check your `init_method`. For shared file-system initialization, you need to point at a file that does **not** yet exist, e.g. init_method='file:///mnt/nfs/sharedfile'; you may need to delete a stale file manually. For TCP initialization:

```python
# MASTER_ADDR is the host address
# MASTER_PORT is an unused port on that host
# init_method = 'tcp://MASTER_ADDR:MASTER_PORT'

# example under SLURM
master_addr = os.getenv('SLURM_LAUNCH_NODE_IPADDR')
master_port = '6006'
init_method = 'tcp://' + master_addr + ':' + master_port
```

- **torch.distributed.new_group(ranks=ranks) must be called by all processes**

Wrong example

```python
if rank == 0:
    torch.distributed.new_group(ranks=ranks_1)
else:
    torch.distributed.new_group(ranks=ranks_2)
```

Correct example

```python
global _DATA_PARALLEL_GROUP
assert _DATA_PARALLEL_GROUP is None, \
    'data parallel group is already initialized'
for i in range(model_parallel_size):
    ranks = range(i, world_size, model_parallel_size)
    group = torch.distributed.new_group(ranks)
    if i == (rank % model_parallel_size):
        _DATA_PARALLEL_GROUP = group
```

## Question Ticket

If you have specific questions, you can send them to the [TACC Longhorn Help Desk](https://portal.tacc.utexas.edu/user-guides/longhorn#help-desk).