├── test.py
├── example_job.sh
├── job_generator.py
└── README.md

/test.py:
--------------------------------------------------------------------------------
import numpy as np

x = np.max(np.random.randn(100000))
print(x)

--------------------------------------------------------------------------------
/example_job.sh:
--------------------------------------------------------------------------------
#!/bin/bash

####################################
#   ARIS slurm script template     #
#                                  #
#   Submit script: sbatch filename #
#                                  #
####################################

#SBATCH --job-name=test_job         # DO NOT FORGET TO CHANGE THIS
#SBATCH --output=test_job.%j.out    # DO NOT FORGET TO CHANGE THIS. The job stdout will be dumped here (%j expands to jobId).
#SBATCH --error=test_job.%j.err     # DO NOT FORGET TO CHANGE THIS. The job stderr will be dumped here (%j expands to jobId).
#SBATCH --ntasks=1                  # How many times the command will run. Leave this at 1 unless you know what you are doing
#SBATCH --nodes=1                   # The job will be split across this many nodes. Use this if you need many GPUs
#SBATCH --gres=gpu:1                # GPUs per node to be allocated
#SBATCH --ntasks-per-node=1         # Same as ntasks
#SBATCH --cpus-per-task=1           # If you need multithreading
#SBATCH --time=0:01:00              # HH:MM:SS. Estimated time the job will take. It will be killed if it exceeds the time limit
#SBATCH --mem=1G                    # Memory to be allocated per NODE
#SBATCH --partition=gpu             # gpu: job will run on one or more nodes of the gpu partition. ml: job will run on the ml node
#SBATCH --account=pa181004          # DO NOT CHANGE THIS

export I_MPI_FABRICS=shm:dapl

# Fall back to a single OpenMP thread if SLURM does not set the CPU count
if [ -z "${SLURM_CPUS_PER_TASK:-}" ]; then
    export OMP_NUM_THREADS=1
else
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
fi


## LOAD MODULES ##
module purge                        # clean up loaded modules

# load necessary modules
module use ${HOME}/modulefiles
module load gnu/8.3.0
module load intel/18.0.5
module load intelmpi/2018.5
module load cuda/10.1.168
module load python/3.6.5
module load pytorch/1.3.1
module load slp/1.3.1


## RUN YOUR PROGRAM ##
srun python test.py

--------------------------------------------------------------------------------
/job_generator.py:
--------------------------------------------------------------------------------
import os
import sys


def query_yes_no(question, default="yes"):
    """Ask a yes/no question via input() and return the answer.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
    It must be "yes" (the default), "no" or None (meaning
    an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True,
             "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = input().lower()
        if default is not None and choice == '':
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' "
                             "(or 'y' or 'n').\n")


##########################################################
# Read arguments
##########################################################
job_name = sys.argv[1]
command = ' '.join(sys.argv[2:])

header = """#!/bin/bash

####################################
#   ARIS slurm script template     #
#                                  #
#   Submit script: sbatch filename #
#                                  #
####################################

#SBATCH --job-name={0}              # DO NOT FORGET TO CHANGE THIS
#SBATCH --output={0}.%j.out         # DO NOT FORGET TO CHANGE THIS. The job stdout will be dumped here (%j expands to jobId).
#SBATCH --error={0}.%j.err          # DO NOT FORGET TO CHANGE THIS. The job stderr will be dumped here (%j expands to jobId).
#SBATCH --ntasks=1                  # How many times the command will run. Leave this at 1 unless you know what you are doing
#SBATCH --nodes=1                   # The job will be split across this many nodes. Use this if you need many GPUs
#SBATCH --gres=gpu:1                # GPUs per node to be allocated
#SBATCH --ntasks-per-node=1         # Same as ntasks
#SBATCH --cpus-per-task=1           # If you need multithreading
#SBATCH --time=32:00:00             # HH:MM:SS. Estimated time the job will take. It will be killed if it exceeds the time limit
#SBATCH --mem=32G                   # Memory to be allocated per NODE
#SBATCH --partition=gpu             # gpu: job will run on one or more nodes of the gpu partition. ml: job will run on the ml node
#SBATCH --account=pa181004          # DO NOT CHANGE THIS
""".format(job_name)

body = """

export I_MPI_FABRICS=shm:dapl

# Fall back to a single OpenMP thread if SLURM does not set the CPU count
if [ -z "${SLURM_CPUS_PER_TASK:-}" ]; then
    export OMP_NUM_THREADS=1
else
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
fi


## LOAD MODULES ##
module purge                        # clean up loaded modules

# load necessary modules
module use ${HOME}/modulefiles
module load gnu/8.3.0
module load intel/18.0.5
module load intelmpi/2018.5
module load cuda/10.1.168
module load python/3.6.5
module load pytorch/1.3.1
module load slp/1.3.1


"""

footer = """

## RUN YOUR PROGRAM ##
srun python {0}

""".format(command)

runner = header + body + footer

write_approval = query_yes_no("IS THE GENERATED SCRIPT OK? \n\n" + "=" * 50 +
                              "\n\n\n {0}".format(runner), default="no")

if write_approval:
    with open("{0}.sh".format(job_name), "w") as f:
        f.write(runner)

    ex_approval = query_yes_no("Execute the job '{0}' ?".format(job_name),
                               default="no")

    if ex_approval:
        os.system("sbatch {0}.sh".format(job_name))

else:
    print("Exiting...")

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# grnet_guide

## Step 1: RTFM

Read the infrastructure documentation [here](http://doc.aris.grnet.gr/)

Don't even continue reading (or, even worse, try to use the cluster) before you have finished this document.
## Step 2: Copy your data

We all share a common $USER in the cluster, so we need to be careful to keep the filesystem organized.

1. Create a source dir in /users/pa18/geopar/${YOUR_NAME}. This is your home. Put your code etc. here.
2. Create a data dir in /work2/pa18/geopar/${YOUR_NAME}. This is where your data are kept.
3. scp or rsync your data into /work2/pa18/geopar/${YOUR_NAME}.

As you read in the docs, you are not supposed to run your jobs (or keep data) in /users. All data should be in /work2. If you did not know this, RTFM before continuing.

## Step 3: Creating a batch job

Now that you have RTFM'd, you know that ARIS is a batch system and that you need to submit batch jobs to it.

An example batch script is contained in `example_job.sh`.

Let's say we need to run `test.py`. First we need to configure the batch job:

```bash
#SBATCH --job-name=test_job         # DO NOT FORGET TO CHANGE THIS
#SBATCH --output=test_job.%j.out    # DO NOT FORGET TO CHANGE THIS. The job stdout will be dumped here (%j expands to jobId).
#SBATCH --error=test_job.%j.err     # DO NOT FORGET TO CHANGE THIS. The job stderr will be dumped here (%j expands to jobId).
#SBATCH --ntasks=1                  # How many times the command will run. Leave this at 1 unless you know what you are doing
#SBATCH --nodes=1                   # The job will be split across this many nodes. Use this if you need many GPUs
#SBATCH --gres=gpu:1                # GPUs per node to be allocated
#SBATCH --ntasks-per-node=1         # Same as ntasks
#SBATCH --cpus-per-task=1           # If you need multithreading
#SBATCH --time=0:01:00              # HH:MM:SS. Estimated time the job will take. It will be killed if it exceeds the time limit
#SBATCH --mem=1G                    # Memory to be allocated per NODE
#SBATCH --partition=gpu             # gpu: job will run on one or more nodes of the gpu partition. ml: job will run on the ml node
#SBATCH --account=pa181004          # DO NOT CHANGE THIS
```

After this, the script loads the required modules. Modules are a way to dynamically configure your dependencies.

caffe2/201809 is for pytorch. There are also some tensorflow modules. You can see all available modules by running `module avail`.

There is also the slp/0.1.0 module, which loads some useful tools that did not exist on the system, like nltk, gensim, librosa and ekphrasis, courtesy of @geopar.

You normally won't need to change this section, unless you want tensorflow.

```bash
## LOAD MODULES ##
module purge                        # clean up loaded modules

# load necessary modules
module use ${HOME}/modulefiles
module load gnu/6.4.0
module load intel/19.0.0
module load openblas/0.2.20
module load cuda/9.2.148
module load caffe2/201809
module load slp/0.1.0
```

Now you specify the script you want to run:

```bash
## RUN YOUR PROGRAM ##
srun python test.py
```

Save this in a `job.sh` script, and submit it to the queue with

```bash
sbatch job.sh
```

## Hacking the system

As you probably already know, because you definitely read the documentation...
Ok, seriously, go read it.

As you know, SLURM is a batch system and there is a job queue. This means that you may have to wait for other jobs to complete before enough resources become available.

Here are some tips and tricks to jump the queue:

1. Don't spam jobs. This is not a development environment, it's a production one. Make sure your code runs correctly before submitting. There is some fair sharing built into the queue, so if you spam, other users will jump ahead of you.
2. Do not oversize your jobs. Just because you **can** request 256GB of RAM and 10 GPUs doesn't mean you **should**.
   Have a basic knowledge of your actual hardware requirements.
3. Do not overestimate the time your job will need to complete. You should know approximately how long it will take. If it takes 3 hours, request 4-5 hours to be safe, not a week.

## Access

1. You need to connect to the NTUA VPN to access the infrastructure. Here's a
   [guide](http://www.noc.ntua.gr/el/help-page/vpn/linux) on how to connect.
2. Send your ssh key to @geopar and he will grant you access, provided you have read the docs. There will be an exam.


## How to install a dependency

1. Send a mail to support
2. If this fails, notify @geopar
3. DIY:

   ```bash
   git clone http://github.com/<username>/my_repo
   cd my_repo
   module purge                     # clean up loaded modules
   # load necessary modules
   module use ${HOME}/modulefiles
   module load gnu/6.4.0
   module load intel/19.0.0
   module load openblas/0.2.20
   module load cuda/9.2.148
   module load caffe2/201809
   module load slp/0.1.0
   pip install . --prefix /users/pa18/geopar/packages/python/
   ```
--------------------------------------------------------------------------------
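One caveat with the DIY `pip install --prefix` route in the README: the installed package is only importable if the prefix's `site-packages` directory is on `PYTHONPATH` (the `slp` module presumably takes care of this for the shared prefix, but a fresh prefix will not be picked up automatically). A minimal sketch; the `python3.6` path segment is an assumption based on the `python/3.6.5` module, so adjust it if you load a different Python:

```shell
# Sketch (assumption): expose a `pip install --prefix` location to Python.
# The lib/python3.6/site-packages layout matches the python/3.6.5 module.
PREFIX=/users/pa18/geopar/packages/python
export PYTHONPATH="${PREFIX}/lib/python3.6/site-packages${PYTHONPATH:+:${PYTHONPATH}}"
echo "${PYTHONPATH}"
```

Putting the `export` in your job script, after the `module load` lines, makes the package visible to `srun python` as well.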