├── Queues.md
├── README.md
├── Topic Outline.md
├── applications.md
├── cluster_overview.md
├── cluster_utilization.md
├── code_of_conduct.md
├── directory_structure.md
├── facilities_statement.md
├── intro_to_bash.md
├── nodes.md
├── questions_comments.md
├── researchIT_team.md
└── storage.md

/Queues.md:
--------------------------------------------------------------------------------
1 | The high performance computing (HPC) resources of The Jackson Laboratory are shared among all JAX researchers. To keep these resources available to everyone in a consistent and fair manner, a number of walltime-based queues have been implemented on them. These queues let the Information Technology department plan maintenance and schedule upgrade windows, while providing a more consistent and stable operating environment for JAX HPC users.
2 |
3 | ### Identifying the partitions
4 |
5 | ### squeue
6 |
7 | The `squeue` command lists the jobs in the scheduling queue; its PARTITION column shows which of the currently configured partitions each job was submitted to.
8 |
9 | ~~~
10 | squeue
11 | squeue -u <username>
12 | squeue --help
13 | ~~~
14 |
15 | Examples:
16 |
17 | squeue
18 |
19 | ~~~
20 |              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
21 |             643231       gpu RnnTrain   zhaoyu PD       0:00      1 (Resources)
22 |             643232       gpu RnnTrain   zhaoyu PD       0:00      1 (Priority)
23 | 639633_[68-135%30]       gpu track_gr   sheppk PD       0:00      1 (JobArrayTaskLimit)
24 |        552775_7337   compute checkBug   zhaoyu PD       0:00      1 (launch failed requeued held)
25 |             627649   compute    build   tewher  R 1-03:00:18      1 sumner054
26 |             627650   compute    build   tewher  R 1-03:00:18      1 sumner054
27 |             627651   compute    build   tewher  R 1-03:00:18      1 sumner054
28 |             627652   compute    build   tewher  R 1-03:00:18      1 sumner054
29 |             627648   compute    build   tewher  R 1-03:00:28      1 sumner054
30 | ~~~
31 |
32 | squeue -u tewher
33 |
34 | ~~~
35 |              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
36 |             627649   compute    build   tewher  R 1-03:02:14      1 sumner054
37 |             627650   compute    build   tewher  R 1-03:02:14      1 sumner054
38 |             627651   compute    build   tewher  R 1-03:02:14      1 sumner054
39 |             627652   compute    build   tewher  R 1-03:02:14      1 sumner054
40 |             627648   compute    build   tewher  R 1-03:02:24      1 sumner054
41 |             638914   compute    build   tewher  R   22:07:11      1 sumner014
42 |             638981   compute    build   tewher  R   22:07:11      1 sumner014
43 |             638982   compute    build   tewher  R   22:07:11      1 sumner039
44 |             638984   compute    build   tewher  R   22:07:11      1 sumner054
45 |             638985   compute    build   tewher  R   22:07:11      1 sumner014
46 | ~~~
47 |
48 | squeue --help
49 |
50 | ~~~
51 | Usage: squeue [OPTIONS]
52 |   -A, --account=account(s)        comma separated list of accounts
53 |                                   to view, default is all accounts
54 |   -a, --all                       display jobs in hidden partitions
55 |       --array-unique              display one unique pending job array
56 |                                   element per line
57 |       --federation                Report federated information if a member
58 |                                   of one
59 |   -h, --noheader                  no headers on output
60 |       --hide                      do not display jobs in hidden partitions
61 |   -i, --iterate=seconds           specify an interation period
62 |   -j, --job=job(s)                comma separated list of jobs IDs
63 |                                   to view, default is all
64 |       --local                     Report information only about jobs on the
65 |                                   local cluster. Overrides --federation.
66 |   -l, --long                      long report
67 |   -L, --licenses=(license names)  comma separated list of license names to view
68 |   -M, --clusters=cluster_name     cluster to issue commands to. Default is
69 |                                   current cluster. cluster with no name will
70 |                                   reset to default. Implies --local.
71 |   -n, --name=job_name(s)          comma separated list of job names to view
72 |       --noconvert                 don't convert units from their original type
73 |                                   (e.g. 2048M won't be converted to 2G).
74 |   -o, --format=format             format specification
75 |   -O, --Format=format             format specification
76 |   -p, --partition=partition(s)    comma separated list of partitions
77 |                                   to view, default is all partitions
78 |   -q, --qos=qos(s)                comma separated list of qos's
79 |                                   to view, default is all qos's
80 |   -R, --reservation=name          reservation to view, default is all
81 |   -r, --array                     display one job array element per line
82 |       --sibling                   Report information about all sibling jobs
83 |                                   on a federated cluster. Implies --federation.
84 |   -s, --step=step(s)              comma separated list of job steps
85 |                                   to view, default is all
86 |   -S, --sort=fields               comma separated list of fields to sort on
87 |       --start                     print expected start times of pending jobs
88 |   -t, --states=states             comma separated list of states to view,
89 |                                   default is pending and running,
90 |                                   '--states=all' reports all states
91 |   -u, --user=user_name(s)         comma separated list of users to view
92 |       --name=job_name(s)          comma separated list of job names to view
93 |   -v, --verbose                   verbosity level
94 |   -V, --version                   output version information and exit
95 |   -w, --nodelist=hostlist         list of nodes to view, default is
96 |                                   all nodes
97 |
98 | Help options:
99 |   --help                          show this help message
100 |   --usage                         display a brief summary of squeue options
101 | ~~~
102 |
103 | ### Identifying the QOS (Queues)
104 |
105 | ### `sacctmgr`
106 |
107 | The `sacctmgr` (or Slurm Account Manager) command shows the current QOS (aka queues) on the HPC environment.
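
When only the walltime limit of each QOS is of interest, the wide `sacctmgr show qos` table can be narrowed with the `format=` option. The snippet below is a minimal sketch rather than site-provided tooling: it assumes the walltime limit of interest is stored in the `MaxWall` field, and `print_qos_walltimes` is a hypothetical helper name introduced here for illustration. It relies on the `-n` (no header) and `-P` (pipe-delimited) flags of `sacctmgr`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads pipe-delimited "Name|MaxWall" rows on stdin
# and prints one "name walltime" pair per line. A QOS whose MaxWall field
# is empty is reported as "none".
print_qos_walltimes() {
    awk -F'|' 'NF >= 2 { print $1, ($2 == "" ? "none" : $2) }'
}

# Run the live query only where Slurm is installed (for example, on a login node).
if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr -n -P show qos format=Name,MaxWall | print_qos_walltimes
fi
```

Keeping the parsing in a function that reads standard input means it can be checked against saved output without access to the cluster.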
108 |
109 | Examples:
110 |
111 | sacctmgr
112 |
113 | ~~~
114 | sacctmgr
115 | sacctmgr show qos
116 | sacctmgr --help
117 | ~~~
118 |
119 | sacctmgr show qos
120 |
121 | ~~~
122 | $ sacctmgr show qos
123 | ~~~
124 |
125 | ~~~
126 | Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
127 | ---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
128 | long 0 00:00:00 cluster 1.000000 14-00:00:00 cpu=3600
129 | batch 0 00:00:00 cluster 1.000000 3-00:00:00 cpu=3600
130 | ~~~
131 |
132 | sacctmgr --help
133 |
134 | ~~~
135 | $ sacctmgr --help
136 | ~~~
137 |
138 | ~~~
139 | sacctmgr [