├── test.py
├── example_job.sh
├── job_generator.py
└── README.md

/test.py:
--------------------------------------------------------------------------------
import numpy as np

x = np.max(np.random.randn(100000))
print(x)

--------------------------------------------------------------------------------
/example_job.sh:
--------------------------------------------------------------------------------
#!/bin/bash

####################################
#   ARIS slurm script template     #
#                                  #
#   Submit script: sbatch filename #
#                                  #
####################################

#SBATCH --job-name=test_job         # DO NOT FORGET TO CHANGE THIS
#SBATCH --output=test_job.%j.out    # DO NOT FORGET TO CHANGE THIS. The job stdout will be dumped here (%j expands to jobId).
#SBATCH --error=test_job.%j.err     # DO NOT FORGET TO CHANGE THIS. The job stderr will be dumped here (%j expands to jobId).
#SBATCH --ntasks=1                  # How many times the command will run. Leave this at 1 unless you know what you are doing
#SBATCH --nodes=1                   # The job will be split across this many nodes. Use this if you need many GPUs
#SBATCH --gres=gpu:1                # GPUs per node to be allocated
#SBATCH --ntasks-per-node=1         # Same as ntasks
#SBATCH --cpus-per-task=1           # If you need multithreading
#SBATCH --time=0:01:00              # HH:MM:SS. Estimated time the job will take. It will be killed if it exceeds the time limit
#SBATCH --mem=1G                    # Memory to be allocated per NODE
#SBATCH --partition=gpu             # gpu: job will run on one or more nodes of the gpu partition. ml: job will run on the ml node
#SBATCH --account=pa181004          # DO NOT CHANGE THIS

export I_MPI_FABRICS=shm:dapl

# Fall back to a single OpenMP thread if SLURM does not set the CPU count
if [ -z "${SLURM_CPUS_PER_TASK:-}" ]; then
    export OMP_NUM_THREADS=1
else
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
fi


## LOAD MODULES ##
module purge                        # clean up loaded modules

# load necessary modules
module use ${HOME}/modulefiles
module load gnu/8.3.0
module load intel/18.0.5
module load intelmpi/2018.5
module load cuda/10.1.168
module load python/3.6.5
module load pytorch/1.3.1
module load slp/1.3.1


## RUN YOUR PROGRAM ##
srun python test.py

--------------------------------------------------------------------------------
/job_generator.py:
--------------------------------------------------------------------------------
import os
import sys


def query_yes_no(question, default="yes"):
    """Ask a yes/no question via input() and return the answer.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
    It must be "yes" (the default), "no" or None (meaning
    an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True,
             "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = input().lower()
        if default is not None and choice == '':
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' "
                             "(or 'y' or 'n').\n")


##########################################################
# Read arguments
##########################################################
job_name = sys.argv[1]
command = ' '.join(sys.argv[2:])

header = """#!/bin/bash

####################################
#   ARIS slurm script template     #
#                                  #
#   Submit script: sbatch filename #
#                                  #
####################################

#SBATCH --job-name={0}              # DO NOT FORGET TO CHANGE THIS
#SBATCH --output={0}.%j.out         # DO NOT FORGET TO CHANGE THIS. The job stdout will be dumped here (%j expands to jobId).
#SBATCH --error={0}.%j.err          # DO NOT FORGET TO CHANGE THIS. The job stderr will be dumped here (%j expands to jobId).
#SBATCH --ntasks=1                  # How many times the command will run. Leave this at 1 unless you know what you are doing
#SBATCH --nodes=1                   # The job will be split across this many nodes. Use this if you need many GPUs
#SBATCH --gres=gpu:1                # GPUs per node to be allocated
#SBATCH --ntasks-per-node=1         # Same as ntasks
#SBATCH --cpus-per-task=1           # If you need multithreading
#SBATCH --time=32:00:00             # HH:MM:SS. Estimated time the job will take. It will be killed if it exceeds the time limit
#SBATCH --mem=32G                   # Memory to be allocated per NODE
#SBATCH --partition=gpu             # gpu: job will run on one or more nodes of the gpu partition. ml: job will run on the ml node
#SBATCH --account=pa181004          # DO NOT CHANGE THIS
""".format(job_name)

body = """

export I_MPI_FABRICS=shm:dapl

# Fall back to a single OpenMP thread if SLURM does not set the CPU count
if [ -z "${SLURM_CPUS_PER_TASK:-}" ]; then
    export OMP_NUM_THREADS=1
else
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
fi


## LOAD MODULES ##
module purge                        # clean up loaded modules

# load necessary modules
module use ${HOME}/modulefiles
module load gnu/8.3.0
module load intel/18.0.5
module load intelmpi/2018.5
module load cuda/10.1.168
module load python/3.6.5
module load pytorch/1.3.1
module load slp/1.3.1


"""

footer = """

## RUN YOUR PROGRAM ##
srun python {0}

""".format(command)

runner = header + body + footer

write_approval = query_yes_no("IS THE GENERATED SCRIPT OK? \n\n" + "=" * 50 +
                              "\n\n\n {0}".format(runner), default="no")

if write_approval:
    with open("{0}.sh".format(job_name), "w") as f:
        f.write(runner)

    ex_approval = query_yes_no("Execute the job '{0}' ?".format(job_name),
                               default="no")

    if ex_approval:
        os.system("sbatch {0}.sh".format(job_name))

else:
    print("Exiting...")

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# grnet_guide

## Step 1: RTFM

Read the infrastructure documentation [here](http://doc.aris.grnet.gr/)

Don't even continue reading (or, even worse, try to use the cluster) before you have finished this document.
## Step 2: Copy your data

We all share a common $USER in the cluster, so we need to be careful to keep the filesystem organized.

1. Create a source dir in /users/pa18/geopar/${YOUR_NAME}. This is your home. Put your code etc. here.
2. Create a data dir in /work2/pa18/geopar/${YOUR_NAME}. This is where your data are kept.
3. scp or rsync your data into /work2/pa18/geopar/${YOUR_NAME}.

As you read in the docs, you are not supposed to run your jobs (or keep data) in /users. All data should be in /work2. If you did not know this, RTFM before continuing.

## Step 3: Creating a batch job

Now that you have RTFM'd, you know that ARIS is a batch system and that you need to submit batch jobs to it.

An example batch script is contained in `example_job.sh`.

Let's say we need to run `test.py`. First we need to configure the batch job:

```bash
#SBATCH --job-name=test_job         # DO NOT FORGET TO CHANGE THIS
#SBATCH --output=test_job.%j.out    # DO NOT FORGET TO CHANGE THIS. The job stdout will be dumped here (%j expands to jobId).
#SBATCH --error=test_job.%j.err     # DO NOT FORGET TO CHANGE THIS. The job stderr will be dumped here (%j expands to jobId).
#SBATCH --ntasks=1                  # How many times the command will run. Leave this at 1 unless you know what you are doing
#SBATCH --nodes=1                   # The job will be split across this many nodes. Use this if you need many GPUs
#SBATCH --gres=gpu:1                # GPUs per node to be allocated
#SBATCH --ntasks-per-node=1         # Same as ntasks
#SBATCH --cpus-per-task=1           # If you need multithreading
#SBATCH --time=0:01:00              # HH:MM:SS. Estimated time the job will take. It will be killed if it exceeds the time limit
#SBATCH --mem=1G                    # Memory to be allocated per NODE
#SBATCH --partition=gpu             # gpu: job will run on one or more nodes of the gpu partition. ml: job will run on the ml node
#SBATCH --account=pa181004          # DO NOT CHANGE THIS
```

After this, the script loads the required modules. Modules are a way to dynamically configure your dependencies.

caffe2/201809 is for pytorch. There are also some tensorflow modules. You can see all available modules by running `module avail`.

There is also the slp/0.1.0 module, which loads some useful tools that did not exist on the system, like nltk, gensim, librosa and ekphrasis, courtesy of @geopar.

You normally won't need to change this section, unless you want tensorflow.

```bash
## LOAD MODULES ##
module purge                        # clean up loaded modules

# load necessary modules
module use ${HOME}/modulefiles
module load gnu/6.4.0
module load intel/19.0.0
module load openblas/0.2.20
module load cuda/9.2.148
module load caffe2/201809
module load slp/0.1.0
```

Now you specify the script you want to run:

```bash
## RUN YOUR PROGRAM ##
srun python test.py
```

Save this in a `job.sh` script, and submit it to the queue with

```bash
sbatch job.sh
```

## Hacking the system

As you probably already know, because you definitely read the documentation...
Ok, seriously, go read it.

As you know, SLURM is a batch system and there is a job queue. This means that you may have to wait for other jobs to complete before enough resources become available.

Here are some tips and tricks to jump the queue:

1. Don't spam jobs. This is not a development environment, it's a production one. Make sure your code runs correctly before submitting. There is some fair sharing built into the queue, so if you spam, other users will jump ahead of you.
2. Do not oversize your jobs. Just because you **can** request 256GB of RAM and 10 GPUs doesn't mean you **should**.
   Have a basic knowledge of your actual hardware requirements.
3. Do not overestimate the time your job will need to complete. You should know approximately how long it will take. If it takes 3 hours, request 4-5 hours to be safe, not a week.

## Access

1. You need to connect to the NTUA VPN to access the infrastructure. Here's a
   [guide](http://www.noc.ntua.gr/el/help-page/vpn/linux) on how to connect.
2. Send your ssh key to @geopar and he will grant you access, provided you have read the docs. There will be an exam.


## How to install a dependency

1. Send a mail to support
2. If this fails, notify @geopar
3. DIY:

   ```bash
   git clone http://github.com/<username>/my_repo
   cd my_repo
   module purge                     # clean up loaded modules
   # load necessary modules
   module use ${HOME}/modulefiles
   module load gnu/6.4.0
   module load intel/19.0.0
   module load openblas/0.2.20
   module load cuda/9.2.148
   module load caffe2/201809
   module load slp/0.1.0
   pip install . --prefix /users/pa18/geopar/packages/python/
   ```
--------------------------------------------------------------------------------
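One caveat with the DIY `pip install --prefix` route in the README: the installed package is only importable if the prefix's `site-packages` directory is on `PYTHONPATH` (the `slp` module presumably takes care of this for the shared prefix, but a fresh prefix will not be picked up automatically). A minimal sketch; the `python3.6` path segment is an assumption based on the `python/3.6.5` module, so adjust it if you load a different Python:

```shell
# Sketch (assumption): expose a `pip install --prefix` location to Python.
# The lib/python3.6/site-packages layout matches the python/3.6.5 module.
PREFIX=/users/pa18/geopar/packages/python
export PYTHONPATH="${PREFIX}/lib/python3.6/site-packages${PYTHONPATH:+:${PYTHONPATH}}"
echo "${PYTHONPATH}"
```

Putting the `export` in your job script, after the `module load` lines, makes the package visible to `srun python` as well.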