├── molecule ├── default ├── test4 │ ├── requirements.yml │ ├── prepare.yml │ ├── verify.yml │ ├── converge.yml │ └── molecule.yml ├── images │ └── Dockerfile ├── test13 │ ├── slurm.extra.conf │ ├── converge.yml │ ├── verify.yml │ └── molecule.yml ├── test6 │ ├── testohpc-login-0 │ │ └── etc │ │ │ └── munge │ │ │ └── munge.key │ ├── molecule.yml │ ├── converge.yml │ └── verify.yml ├── requirements.txt ├── test1c │ ├── verify.yml │ ├── converge.yml │ └── molecule.yml ├── test1 │ ├── verify.yml │ ├── converge.yml │ └── molecule.yml ├── test1b │ ├── verify.yml │ ├── converge.yml │ └── molecule.yml ├── test10 │ ├── converge.yml │ ├── molecule.yml │ └── verify.yml ├── test2 │ ├── converge.yml │ ├── verify.yml │ └── molecule.yml ├── test8 │ ├── verify.yml │ ├── converge.yml │ └── molecule.yml ├── test11 │ ├── converge.yml │ ├── molecule.yml │ └── verify.yml ├── test9 │ ├── converge.yml │ ├── verify.yml │ └── molecule.yml ├── test12 │ ├── converge.yml │ ├── molecule.yml │ └── verify.yml ├── test15 │ ├── verify.yml │ ├── converge.yml │ └── molecule.yml ├── test3 │ ├── converge.yml │ ├── verify.yml │ └── molecule.yml └── README.md ├── module_utils ├── __init__.py └── slurm_utils.py ├── tests ├── filter_plugins ├── test.yml ├── inventory ├── filter.yml └── inventory-mock-groups ├── .github ├── CODEOWNERS └── workflows │ ├── publish-role.yml │ └── ci.yml ├── .gitignore ├── .gemini └── config.yaml ├── .ansible-lint ├── templates ├── mpi.conf.j2 ├── cgroup.conf.j2 ├── gres.conf.j2 ├── slurmctld.service.j2 ├── slurmd.service.j2 ├── slurmdbd.service.j2 ├── slurmdbd.conf.j2 └── slurm.conf.j2 ├── tasks ├── pre.yml ├── main.yml ├── facts.yml ├── install-ohpc.yml ├── install-generic.yml ├── validate.yml ├── upgrade.yml └── runtime.yml ├── meta └── main.yml ├── vars └── main.yml ├── .yamllint ├── handlers └── main.yml ├── library ├── gpu_info.py └── sacct_cluster.py ├── files └── slurm.conf.ohpc ├── filter_plugins └── slurm_conf.py ├── defaults └── main.yml ├── LICENSE └── README.md /molecule/default: -------------------------------------------------------------------------------- 1 | test1 -------------------------------------------------------------------------------- /module_utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/filter_plugins: -------------------------------------------------------------------------------- 1 | ../filter_plugins -------------------------------------------------------------------------------- /.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @stackhpc/batch 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.retry 3 | *.pyc 4 | venv 5 | -------------------------------------------------------------------------------- /molecule/test4/requirements.yml: -------------------------------------------------------------------------------- 1 | - name: geerlingguy.mysql 2 | -------------------------------------------------------------------------------- /.gemini/config.yaml: -------------------------------------------------------------------------------- 1 | code_review: 2 | pull_request_opened: 3 | summary: false 4 | -------------------------------------------------------------------------------- /molecule/images/Dockerfile: 
-------------------------------------------------------------------------------- 1 | FROM rockylinux/rockylinux:9 2 | RUN dnf install -y systemd && dnf clean all 3 | -------------------------------------------------------------------------------- /molecule/test13/slurm.extra.conf: -------------------------------------------------------------------------------- 1 | LaunchParameters=use_interactive_step 2 | FirstJobId={{ test_first_job_id }} 3 | -------------------------------------------------------------------------------- /tests/test.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: openstack 3 | connection: local 4 | roles: 5 | - stackhpc.openhpc 6 | -------------------------------------------------------------------------------- /tests/inventory: -------------------------------------------------------------------------------- 1 | [openstack] 2 | localhost ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 3 | -------------------------------------------------------------------------------- /molecule/test6/testohpc-login-0/etc/munge/munge.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/stackhpc/ansible-role-openhpc/HEAD/molecule/test6/testohpc-login-0/etc/munge/munge.key -------------------------------------------------------------------------------- /molecule/requirements.txt: -------------------------------------------------------------------------------- 1 | pip 2 | setuptools 3 | molecule[lint,ansible] 4 | molecule-plugins[podman]==23.5.0 5 | ansible>=2.9.0 6 | yamllint 7 | ansible-lint 8 | jmespath 9 | -------------------------------------------------------------------------------- /.ansible-lint: -------------------------------------------------------------------------------- 1 | skip_list: 2 | - '106' # Role name {} does not match ``^[a-z][a-z0-9_]+$`` pattern' 3 | - 'fqcn-builtins' 4 | - 'key-order' 5 | - 'name[missing]' 6 | warn_list: 7 | - 'deprecated-module' 8 | -------------------------------------------------------------------------------- /templates/mpi.conf.j2: -------------------------------------------------------------------------------- 1 | # {{ ansible_managed }} 2 | {% for k, v in openhpc_mpi_config.items() %} 3 | {% if v != "omit" %}{# allow removing items using setting key: omit #} 4 | {{ k }}={{ v | join(',') if (v is sequence and v is not string) else v }} 5 | {% endif %} 6 | {% endfor %} 7 | -------------------------------------------------------------------------------- /tasks/pre.yml: -------------------------------------------------------------------------------- 1 | - name: Enable batch on login-only nodes 2 | # TODO: why can't we remove this by just setting openhpc_enable.batch: true for appliance login nodes?? 
3 | set_fact: 4 | openhpc_enable: "{{ openhpc_enable | combine({'batch': true}) }}" 5 | when: 6 | - openhpc_login_only_nodes in group_names 7 | -------------------------------------------------------------------------------- /.github/workflows/publish-role.yml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Publish Ansible Role 3 | 'on': 4 | push: 5 | tags: 6 | - "v?[0-9]+.[0-9]+.[0-9]+" 7 | workflow_dispatch: 8 | jobs: 9 | publish_role: 10 | uses: stackhpc/.github/.github/workflows/publish-role.yml@main 11 | secrets: 12 | GALAXY_API_KEY: ${{ secrets.GALAXY_API_KEY }} 13 | -------------------------------------------------------------------------------- /molecule/test1c/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Check slurm hostlist 3 | hosts: testohpc_login 4 | tasks: 5 | - name: Get slurm partition info 6 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 7 | register: sinfo 8 | - name: 9 | assert: 10 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,compute10,compute-a']" 11 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 12 | -------------------------------------------------------------------------------- /templates/cgroup.conf.j2: -------------------------------------------------------------------------------- 1 | ### 2 | # 3 | # Slurm cgroup support configuration file 4 | # 5 | # See man slurm.conf and man cgroup.conf for further 6 | # information on cgroup configuration parameters 7 | #-- 8 | {% for k, v in openhpc_cgroup_default_config | combine(openhpc_cgroup_config) | items %} 9 | {% if v != "omit" %}{# allow removing items using setting key: omit #} 10 | {{ k }}={{ v | join(',') if (v is sequence and v is not string) else v }} 11 | {% endif %} 12 | {% endfor %} 13 | -------------------------------------------------------------------------------- /meta/main.yml: -------------------------------------------------------------------------------- 1 | --- 2 | galaxy_info: 3 | role_name: openhpc 4 | namespace: stackhpc 5 | author: StackHPC Ltd. 6 | description: > 7 | This role provisions an existing cluster to support OpenHPC control 8 | and batch compute functions and installs necessary runtime packages. 9 | company: StackHPC Ltd 10 | license: Apache2 11 | min_ansible_version: 2.5.0 12 | platforms: 13 | - name: EL 14 | versions: ['7', '8'] 15 | galaxy_tags: 16 | - openhpc 17 | - slurm 18 | 19 | dependencies: [] 20 | -------------------------------------------------------------------------------- /vars/main.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # OpenHPC 2 on CentOS 8 3 | 4 | ohpc_slurm_packages: 5 | control: 6 | - "ohpc-slurm-server" 7 | - "slurm-slurmctld-ohpc" 8 | - "slurm-example-configs-ohpc" 9 | batch: 10 | - "ohpc-base-compute" 11 | - "ohpc-slurm-client" 12 | runtime: 13 | - "slurm-ohpc" 14 | - "munge" 15 | - "slurm-slurmd-ohpc" 16 | - "slurm-example-configs-ohpc" 17 | - "{{ 'lmod-ohpc' if openhpc_module_system_install else '' }}" 18 | database: 19 | - "slurm-slurmdbd-ohpc" 20 | ...
21 | -------------------------------------------------------------------------------- /molecule/test6/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | provisioner: 18 | name: ansible 19 | inventory: 20 | hosts: 21 | testohpc_compute: {} 22 | 23 | verifier: 24 | name: ansible 25 | -------------------------------------------------------------------------------- /tasks/main.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Validate configuration 4 | block: 5 | - include_tasks: validate.yml 6 | when: openhpc_enable.runtime | default(false) | bool 7 | tags: install 8 | 9 | - name: Install packages 10 | block: 11 | - include_tasks: install-ohpc.yml 12 | when: openhpc_enable.runtime | default(false) | bool 13 | tags: install 14 | 15 | - name: Configure 16 | block: 17 | - include_tasks: runtime.yml 18 | when: openhpc_enable.runtime | default(false) | bool 19 | tags: configure 20 | 21 | ... 22 | -------------------------------------------------------------------------------- /molecule/test6/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | runtime: true 8 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 9 | openhpc_nodegroups: 10 | - name: "n/a" 11 | openhpc_cluster_name: testohpc 12 | tasks: 13 | - name: "Include ansible-role-openhpc" 14 | include_role: 15 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 16 | -------------------------------------------------------------------------------- /molecule/test1/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | - name: 10 | assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 11 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']" 12 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 13 | -------------------------------------------------------------------------------- /molecule/test1b/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | - name: 10 | assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 11 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,1,idle,testohpc-compute-0']" 12 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 13 | -------------------------------------------------------------------------------- /molecule/test1/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 
| vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | openhpc_cluster_name: testohpc 13 | tasks: 14 | - name: "Include ansible-role-openhpc" 15 | include_role: 16 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 17 | -------------------------------------------------------------------------------- /molecule/test1b/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | openhpc_cluster_name: testohpc 13 | tasks: 14 | - name: "Include ansible-role-openhpc" 15 | include_role: 16 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 17 | -------------------------------------------------------------------------------- /molecule/test10/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Create initial cluster 3 | hosts: initial 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | openhpc_cluster_name: testohpc 13 | tasks: 14 | - name: "Include ansible-role-openhpc" 15 | include_role: 16 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 17 | -------------------------------------------------------------------------------- /molecule/test2/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "part1" 12 | - name: "part2" 13 | openhpc_cluster_name: testohpc 14 | tasks: 15 | - name: "Include ansible-role-openhpc" 16 | include_role: 17 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 18 | -------------------------------------------------------------------------------- /molecule/test4/prepare.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Prepare 4 | hosts: testohpc_login 5 | vars: 6 | # Slurm recommends larger than default values: https://slurm.schedmd.com/accounting.html 7 | mysql_innodb_buffer_pool_size: 1024M 8 | mysql_innodb_lock_wait_timeout: 900 9 | mysql_root_password: super-secure-password 10 | mysql_databases: 11 | - name: slurm_acct_db 12 | mysql_users: 13 | - name: slurm 14 | host: "{{ groups['testohpc_login'] | first }}" 15 | password: secure-password 16 | priv: "slurm_acct_db.*:ALL" 17 | tasks: 18 | - include_role: 19 | name: geerlingguy.mysql -------------------------------------------------------------------------------- 
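The prepare step above provisions the MariaDB database that slurmdbd needs for the test4 accounting test; its verify step later checks the job record by splitting `sacct -p` output by hand. The repository also ships a slurm_parse helper in module_utils/slurm_utils.py for exactly this format. A minimal usage sketch, assuming it is run from the repository root; the sample sacct line is illustrative only, not captured from a real run:

from module_utils.slurm_utils import slurm_parse

# Illustrative `sacct -j <jobid> -p` output: a heading line plus one record,
# every field terminated by '|' as slurm_parse expects.
sample = (
    "JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode|\n"
    "1|wrap|compute|root|1|COMPLETED|0:0|"
)

records = slurm_parse(sample)  # -> list of dicts keyed by the heading row
assert records[0]["State"] == "COMPLETED"
assert records[0]["ExitCode"] == "0:0"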
/molecule/test8/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login # NB for this test this is 2x non-control nodes, so tests they can contact slurmctld too 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | - name: 10 | assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 11 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']" 12 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 13 | -------------------------------------------------------------------------------- /molecule/test11/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | tasks: 5 | - name: "Include ansible-role-openhpc" 6 | include_role: 7 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 8 | vars: 9 | openhpc_enable: 10 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 11 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 12 | runtime: true 13 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 14 | openhpc_nodegroups: 15 | - name: "compute_orig" 16 | openhpc_cluster_name: testohpc 17 | -------------------------------------------------------------------------------- /molecule/test1c/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_service_enabled: true 10 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 11 | openhpc_nodegroups: 12 | - name: "compute" 13 | openhpc_cluster_name: testohpc 14 | tasks: 15 | - name: "Include ansible-role-openhpc" 16 | include_role: 17 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 18 | -------------------------------------------------------------------------------- /molecule/test8/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_control'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_control'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | openhpc_cluster_name: testohpc 13 | openhpc_login_only_nodes: 'testohpc_login' 14 | tasks: 15 | - name: "Include ansible-role-openhpc" 16 | include_role: 17 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 18 | -------------------------------------------------------------------------------- /molecule/test9/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_control'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_control'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | openhpc_cluster_name: testohpc 
13 | openhpc_login_only_nodes: 'testohpc_login' 14 | tasks: 15 | - name: "Include ansible-role-openhpc" 16 | include_role: 17 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 18 | -------------------------------------------------------------------------------- /molecule/test12/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | tasks: 5 | - name: "Include ansible-role-openhpc" 6 | include_role: 7 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 8 | vars: 9 | openhpc_enable: 10 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 11 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 12 | runtime: true 13 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 14 | openhpc_nodegroups: 15 | - name: "compute" 16 | openhpc_cluster_name: testohpc 17 | openhpc_slurm_job_comp_type: jobcomp/filetxt 18 | -------------------------------------------------------------------------------- /molecule/test15/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | vars: 6 | expected_sinfo: | # NB compute is default (*) 7 | 'compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]' 8 | 'beta,up,60-00:00:00,2,idle,testohpc-compute-[0-1]' 9 | tasks: 10 | - name: Get slurm partition info 11 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 12 | register: sinfo 13 | - name: 14 | assert: 15 | that: "sinfo.stdout.split() == expected_sinfo.split()" 16 | fail_msg: "FAILED - got {{ sinfo.stdout.split() }} expected {{ expected_sinfo.split() }}" 17 | -------------------------------------------------------------------------------- /.yamllint: -------------------------------------------------------------------------------- 1 | --- 2 | # Based on ansible-lint config 3 | extends: default 4 | 5 | rules: 6 | braces: 7 | max-spaces-inside: 1 8 | level: error 9 | brackets: 10 | max-spaces-inside: 1 11 | level: error 12 | colons: 13 | max-spaces-after: -1 14 | level: error 15 | commas: 16 | max-spaces-after: -1 17 | level: error 18 | comments: disable 19 | comments-indentation: disable 20 | document-start: disable 21 | empty-lines: 22 | max: 3 23 | level: error 24 | hyphens: 25 | level: error 26 | indentation: disable 27 | key-duplicates: enable 28 | line-length: disable 29 | new-line-at-end-of-file: disable 30 | new-lines: 31 | type: unix 32 | trailing-spaces: disable 33 | truthy: disable 34 | -------------------------------------------------------------------------------- /molecule/test13/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_control'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_control'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | openhpc_cluster_name: testohpc 13 | openhpc_login_only_nodes: 'testohpc_login' 14 | openhpc_config: 15 | FirstJobId: 13 16 | SlurmctldSyslogDebug: error 17 | tasks: 18 | - name: "Include ansible-role-openhpc" 19 | include_role: 20 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 21 | 
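The openhpc_config overrides above (FirstJobId, SlurmctldSyslogDebug) are expected to surface as Key=Value lines in the rendered Slurm configuration, which test13's verify play later checks via `scontrol show config`. The conf templates shown earlier (mpi.conf.j2, cgroup.conf.j2) render such dicts with a common rule: skip keys whose value is "omit" and join list values with commas. A rough Python equivalent of that Jinja logic, for illustration only; render_conf is a hypothetical helper, not part of the role:

def render_conf(config):
    """Render a mapping to Key=Value lines, mirroring the mpi.conf/cgroup.conf template rule."""
    lines = []
    for key, value in config.items():
        if value == "omit":
            continue  # `key: omit` removes the entry from the rendered file
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)  # sequences become comma-separated
        lines.append(f"{key}={value}")
    return "\n".join(lines)

print(render_conf({"FirstJobId": 13, "SlurmctldSyslogDebug": "error"}))
# FirstJobId=13
# SlurmctldSyslogDebug=error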
-------------------------------------------------------------------------------- /molecule/test3/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 10 | openhpc_nodegroups: 11 | - name: grp1 12 | - name: grp2 13 | openhpc_partitions: 14 | - name: compute 15 | nodegroups: 16 | - grp1 17 | - grp2 18 | openhpc_cluster_name: testohpc 19 | tasks: 20 | - name: "Include ansible-role-openhpc" 21 | include_role: 22 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 23 | -------------------------------------------------------------------------------- /molecule/test3/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | vars: 6 | expected_sinfo: | 7 | testohpc-grp1-0 1 compute* idle 8 | testohpc-grp1-1 1 compute* idle 9 | testohpc-grp2-0 1 compute* idle 10 | testohpc-grp2-1 1 compute* idle 11 | tasks: 12 | - name: Get slurm partition info 13 | command: sinfo -h --Node -S "#P,+N" # node-oriented output, sort by partition in order defined in slurm.conf then increasing node name 14 | register: sinfo 15 | - name: 16 | assert: 17 | that: "sinfo.stdout.split() == expected_sinfo.split()" 18 | fail_msg: "FAILED - got {{ sinfo.stdout.split() }} expected {{ expected_sinfo.split() }}" 19 | -------------------------------------------------------------------------------- /molecule/test9/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - hosts: testohpc_control 4 | tasks: 5 | - name: Check both compute nodes are listed and compute-0 is up 6 | shell: 'sinfo --noheader --Node --format="%P,%a,%l,%D,%t,%N"' # using --format ensures we control whitespace 7 | register: sinfo 8 | - assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 9 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,1,idle,testohpc-compute-0','compute*,up,60-00:00:00,1,unk*,testohpc-compute-1']" # NB: compute-1 goes 'down' after a while!
10 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 11 | - name: Check login nodes in config 12 | command: "grep NodeName={{ item }} /etc/slurm/slurm.conf" 13 | loop: "{{ groups['testohpc_login'] }}" 14 | -------------------------------------------------------------------------------- /templates/gres.conf.j2: -------------------------------------------------------------------------------- 1 | AutoDetect={{ openhpc_gres_autodetect }} 2 | {% for nodegroup in openhpc_nodegroups %} 3 | {% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %} 4 | {% set inventory_group_hosts = groups.get(inventory_group_name, []) %} 5 | {% set hostlist_string = inventory_group_hosts | hostlist_expression | join(',') %} 6 | {% for gres in nodegroup.gres | default([]) %} 7 | {% set gres_name, gres_type, _ = gres.conf.split(':') %} 8 | NodeName={{ hostlist_string }}{% if 'gres_autodetect' in nodegroup %} AutoDetect={{ nodegroup.gres_autodetect }}{% endif %} Name={{ gres_name }} Type={{ gres_type }}{% if 'file' in gres %} File={{ gres.file }}{% endif %} 9 | 10 | {% endfor %}{# gres #} 11 | {% endfor %}{# nodegroup #} 12 | -------------------------------------------------------------------------------- /templates/slurmctld.service.j2: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Slurm controller daemon 3 | After=network-online.target munge.service 4 | Wants=network-online.target 5 | ConditionPathExists={{ openhpc_slurm_conf_path }} 6 | 7 | [Service] 8 | Type=simple 9 | EnvironmentFile=-/etc/sysconfig/slurmctld 10 | EnvironmentFile=-/etc/default/slurmctld 11 | ExecStart={{ openhpc_sbin_dir }}/slurmctld -D -s -f {{ openhpc_slurm_conf_path }} $SLURMCTLD_OPTIONS 12 | ExecReload=/bin/kill -HUP $MAINPID 13 | LimitNOFILE=65536 14 | TasksMax=infinity 15 | 16 | # Uncomment the following lines to disable logging through journald. 17 | # NOTE: It may be preferable to set these through an override file instead. 18 | #StandardOutput=null 19 | #StandardError=null 20 | 21 | [Install] 22 | WantedBy=multi-user.target 23 | -------------------------------------------------------------------------------- /templates/slurmd.service.j2: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Slurm node daemon 3 | After=munge.service network-online.target remote-fs.target 4 | Wants=network-online.target 5 | 6 | [Service] 7 | Type=simple 8 | EnvironmentFile=-/etc/sysconfig/slurmd 9 | EnvironmentFile=-/etc/default/slurmd 10 | ExecStart={{ openhpc_sbin_dir }}/slurmd -D -s $SLURMD_OPTIONS 11 | ExecReload=/bin/kill -HUP $MAINPID 12 | KillMode=process 13 | LimitNOFILE=131072 14 | LimitMEMLOCK=infinity 15 | LimitSTACK=infinity 16 | Delegate=yes 17 | TasksMax=infinity 18 | 19 | # Uncomment the following lines to disable logging through journald. 20 | # NOTE: It may be preferable to set these through an override file instead. 
21 | #StandardOutput=null 22 | #StandardError=null 23 | 24 | [Install] 25 | WantedBy=multi-user.target 26 | -------------------------------------------------------------------------------- /molecule/test1b/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | dependency: 3 | name: galaxy 4 | driver: 5 | name: podman 6 | platforms: 7 | - name: testohpc-login-0 8 | image: ${MOLECULE_IMAGE} 9 | pre_build_image: true 10 | groups: 11 | - testohpc_login 12 | command: /sbin/init 13 | tmpfs: 14 | /run: rw 15 | /tmp: rw 16 | volumes: 17 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 18 | network: net1 19 | - name: testohpc-compute-0 20 | image: ${MOLECULE_IMAGE} 21 | pre_build_image: true 22 | groups: 23 | - testohpc_compute 24 | command: /sbin/init 25 | tmpfs: 26 | /run: rw 27 | /tmp: rw 28 | volumes: 29 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 30 | network: net1 31 | provisioner: 32 | name: ansible 33 | verifier: 34 | name: ansible 35 | -------------------------------------------------------------------------------- /molecule/test2/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | vars: 6 | expected_sinfo: | # NB part2 is the default (*): slurm.conf.j2 marks both as default, so the last one wins 7 | testohpc-part1-0 1 part1 idle 8 | testohpc-part1-1 1 part1 idle 9 | testohpc-part2-0 1 part2* idle 10 | testohpc-part2-1 1 part2* idle 11 | tasks: 12 | - name: Get slurm partition info 13 | command: sinfo -h --Node -S "#P,+N" # node-oriented output, sort by partition in order defined in slurm.conf then increasing node name 14 | register: sinfo 15 | - name: 16 | assert: 17 | that: "sinfo.stdout.split() == expected_sinfo.split()" 18 | fail_msg: "FAILED - got {{ sinfo.stdout.split() }} expected {{ expected_sinfo.split() }}" 19 | -------------------------------------------------------------------------------- /templates/slurmdbd.service.j2: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Slurm DBD accounting daemon 3 | After=network-online.target munge.service mysql.service mysqld.service mariadb.service 4 | Wants=network-online.target 5 | ConditionPathExists={{ openhpc_slurm_conf_path | dirname + '/slurmdbd.conf' }} 6 | 7 | [Service] 8 | Type=simple 9 | EnvironmentFile=-/etc/sysconfig/slurmdbd 10 | EnvironmentFile=-/etc/default/slurmdbd 11 | ExecStart={{ openhpc_sbin_dir }}/slurmdbd -D -s $SLURMDBD_OPTIONS 12 | ExecReload=/bin/kill -HUP $MAINPID 13 | LimitNOFILE=65536 14 | TasksMax=infinity 15 | 16 | # Uncomment the following lines to disable logging through journald. 17 | # NOTE: It may be preferable to set these through an override file instead.
18 | #StandardOutput=null 19 | #StandardError=null 20 | 21 | [Install] 22 | WantedBy=multi-user.target 23 | -------------------------------------------------------------------------------- /molecule/test4/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | tasks: 6 | - name: Submit a test job 7 | command: sbatch --wrap hostname --parsable 8 | register: job 9 | 10 | - name: Allow some time for the job to complete 11 | pause: 12 | seconds: 10 13 | 14 | - name: Query sacct for job that was submitted 15 | command: sacct -j {{ job.stdout }} -p 16 | register: sacct 17 | 18 | - name: Check that job is in database 19 | vars: 20 | record: "{{ sacct.stdout_lines[1].split('|') }}" 21 | assert: 22 | # Expected: JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode| 23 | that: 24 | - record[0] == job.stdout 25 | - record[5] == "COMPLETED" 26 | fail_msg: "FAILED - actual value: {{ sacct.stdout_lines[1] }}" 27 | -------------------------------------------------------------------------------- /tasks/facts.yml: -------------------------------------------------------------------------------- 1 | - name: Capture configuration from scontrol 2 | # this includes any dynamically-generated config, not just what is set in 3 | # slurm.conf 4 | ansible.builtin.command: scontrol show config 5 | changed_when: false 6 | register: _scontrol_config 7 | 8 | - name: Create facts directory 9 | ansible.builtin.file: 10 | path: /etc/ansible/facts.d/ 11 | state: directory 12 | owner: root 13 | group: root 14 | mode: ugo=rwX 15 | 16 | - name: Template slurm configuration facts 17 | copy: 18 | dest: /etc/ansible/facts.d/slurm.fact 19 | content: "{{ _scontrol_config.stdout_lines | config2dict | to_nice_json }}" 20 | owner: slurm 21 | group: slurm 22 | mode: ug=rw,o=r # any user can run scontrol show config anyway 23 | register: _template_facts 24 | notify: Reload facts 25 | -------------------------------------------------------------------------------- /molecule/test6/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | - name: 10 | assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 11 | that: "sinfo.stdout_lines == ['n/a*,up,60-00:00:00,0,n/a,']" 12 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 13 | - name: Check munge key copied 14 | hosts: localhost 15 | tasks: 16 | - stat: 17 | path: "./testohpc-login-0/etc/munge/munge.key" 18 | register: local_mungekey 19 | - assert: 20 | that: local_mungekey.stat.exists 21 | fail_msg: "Failed to find munge key copied from node on ansible control host" 22 | -------------------------------------------------------------------------------- /tests/filter.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - hosts: openstack 3 | connection: local 4 | gather_facts: false 5 | vars: 6 | grouped_0: "{{ groups['mock_group_0'] | hostlist_expression }}" 7 | grouped_1: "{{ groups['mock_group_1'] | hostlist_expression }}" 8 | grouped_2: "{{ groups['mock_group_2'] | hostlist_expression }}" 9 | tasks: 10 | - name: Test filter 11 | assert: 12 | that: item.result == item.expected 13 | fail_msg: | 14 | expected: {{ item.expected }} 15 | got: 
{{ item.result }} 16 | loop: 17 | - result: "{{ grouped_0 }}" 18 | expected: ['localhost-0-[0-3,5]', 'localhost-non-numerical'] 19 | - result: "{{ grouped_1 }}" 20 | expected: ['localhost-1-[1-2,4-5,10]', 'localhost-2-[1-3]'] 21 | - result: "{{ grouped_2 }}" 22 | expected: ['localhost-[1,0001-0003,0008,0010]', 'localhost-admin'] 23 | ... 24 | -------------------------------------------------------------------------------- /molecule/test15/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | runtime: true 9 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 10 | openhpc_nodegroups: 11 | - name: "compute" 12 | partition_params: 13 | PreemptMode: requeue 14 | - name: beta 15 | groups: 16 | - name: "compute" 17 | partition_params: 18 | PreemptMode: 'OFF' 19 | Priority: 1000 20 | Default: false 21 | AllowAccounts: Group_own_thePartition 22 | openhpc_cluster_name: testohpc 23 | tasks: 24 | - name: "Include ansible-role-openhpc" 25 | include_role: 26 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 27 | -------------------------------------------------------------------------------- /molecule/test13/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Slurm checks 4 | hosts: testohpc_login # NB for this test this is 2x non-control nodes, so tests they can contact slurmctld too 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | - assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 10 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']" 11 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 12 | - name: Get slurm config info 13 | command: scontrol show config 14 | register: slurm_config 15 | - assert: 16 | that: "item in (slurm_config.stdout_lines | map('replace', ' ', ''))" 17 | fail_msg: "FAILED - {{ item }} not found in slurm config" 18 | loop: 19 | - SlurmctldSyslogDebug=error 20 | - FirstJobId=13 21 | -------------------------------------------------------------------------------- /molecule/test4/converge.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Converge 3 | hosts: all 4 | vars: 5 | openhpc_enable: 6 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 7 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 8 | database: "{{ inventory_hostname in groups['testohpc_login'] }}" 9 | runtime: true 10 | openhpc_slurm_accounting_storage_type: 'accounting_storage/slurmdbd' 11 | openhpc_slurmdbd_mysql_database: slurm_acct_db 12 | openhpc_slurmdbd_mysql_password: secure-password 13 | openhpc_slurmdbd_mysql_username: slurm 14 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 15 | openhpc_nodegroups: 16 | - name: "compute" 17 | openhpc_cluster_name: testohpc 18 | openhpc_slurm_accounting_storage_client_package: mariadb 19 | tasks: 20 | - name: "Include ansible-role-openhpc" 21 | include_role: 22 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 23 | -------------------------------------------------------------------------------- 
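Several checks in this suite (test13/verify.yml above, and tasks/facts.yml) work from `scontrol show config`, whose output lines have the form `Key = Value` padded with spaces. The role converts that output to structured data with a config2dict filter, presumably defined in filter_plugins/slurm_conf.py (the only filter plugin in the tree), which is not reproduced here; the following is only a simplified sketch of the idea, not the actual implementation:

def scontrol_config_to_dict(lines):
    """Parse `scontrol show config` style lines ("Key = Value") into a dict."""
    config = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip header/footer lines that contain no '='
        config[key.strip()] = value.strip()
    return config

cfg = scontrol_config_to_dict([
    "FirstJobId              = 13",
    "SlurmctldSyslogDebug    = error",
])
assert cfg["FirstJobId"] == "13"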
/molecule/test1c/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: compute-a 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | command: /sbin/init 23 | tmpfs: 24 | /run: rw 25 | /tmp: rw 26 | volumes: 27 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 28 | network: net1 29 | - name: compute10 30 | image: ${MOLECULE_IMAGE} 31 | pre_build_image: true 32 | groups: 33 | - testohpc_compute 34 | command: /sbin/init 35 | tmpfs: 36 | /run: rw 37 | /tmp: rw 38 | volumes: 39 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 40 | network: net1 41 | provisioner: 42 | name: ansible 43 | verifier: 44 | name: ansible 45 | -------------------------------------------------------------------------------- /molecule/test1/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: testohpc-compute-0 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | command: /sbin/init 23 | tmpfs: 24 | /run: rw 25 | /tmp: rw 26 | volumes: 27 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 28 | network: net1 29 | - name: testohpc-compute-1 30 | image: ${MOLECULE_IMAGE} 31 | pre_build_image: true 32 | groups: 33 | - testohpc_compute 34 | command: /sbin/init 35 | tmpfs: 36 | /run: rw 37 | /tmp: rw 38 | volumes: 39 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 40 | network: net1 41 | provisioner: 42 | name: ansible 43 | verifier: 44 | name: ansible 45 | -------------------------------------------------------------------------------- /molecule/test12/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: testohpc-compute-0 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | command: /sbin/init 23 | tmpfs: 24 | /run: rw 25 | /tmp: rw 26 | volumes: 27 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 28 | network: net1 29 | - name: testohpc-compute-1 30 | image: ${MOLECULE_IMAGE} 31 | pre_build_image: true 32 | groups: 33 | - testohpc_compute 34 | command: /sbin/init 35 | tmpfs: 36 | /run: rw 37 | /tmp: rw 38 | volumes: 39 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 40 | network: net1 41 | provisioner: 42 | name: ansible 43 | verifier: 44 | name: ansible 45 | -------------------------------------------------------------------------------- /molecule/test15/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 
| - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: testohpc-compute-0 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | command: /sbin/init 23 | tmpfs: 24 | /run: rw 25 | /tmp: rw 26 | volumes: 27 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 28 | network: net1 29 | - name: testohpc-compute-1 30 | image: ${MOLECULE_IMAGE} 31 | pre_build_image: true 32 | groups: 33 | - testohpc_compute 34 | command: /sbin/init 35 | tmpfs: 36 | /run: rw 37 | /tmp: rw 38 | volumes: 39 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 40 | network: net1 41 | provisioner: 42 | name: ansible 43 | verifier: 44 | name: ansible 45 | -------------------------------------------------------------------------------- /molecule/test11/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: testohpc-compute-0 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | - testohpc_compute_orig 23 | - testohpc_compute_new 24 | command: /sbin/init 25 | tmpfs: 26 | /run: rw 27 | /tmp: rw 28 | volumes: 29 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 30 | network: net1 31 | - name: testohpc-compute-1 32 | image: ${MOLECULE_IMAGE} 33 | pre_build_image: true 34 | groups: 35 | - testohpc_compute 36 | - testohpc_compute_orig 37 | command: /sbin/init 38 | tmpfs: 39 | /run: rw 40 | /tmp: rw 41 | volumes: 42 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 43 | network: net1 44 | provisioner: 45 | name: ansible 46 | verifier: 47 | name: ansible 48 | -------------------------------------------------------------------------------- /molecule/test4/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | hostname: testohpc-login-0 7 | image: ${MOLECULE_IMAGE} 8 | pre_build_image: true 9 | groups: 10 | - testohpc_login 11 | command: /sbin/init 12 | tmpfs: 13 | /run: rw 14 | /tmp: rw 15 | volumes: 16 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 17 | network: net1 18 | - name: testohpc-compute-0 19 | hostname: testohpc-compute-0 20 | image: ${MOLECULE_IMAGE} 21 | pre_build_image: true 22 | groups: 23 | - testohpc_compute 24 | command: /sbin/init 25 | tmpfs: 26 | /run: rw 27 | /tmp: rw 28 | volumes: 29 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 30 | network: net1 31 | - name: testohpc-compute-1 32 | hostname: testohpc-compute-1 33 | image: ${MOLECULE_IMAGE} 34 | pre_build_image: true 35 | groups: 36 | - testohpc_compute 37 | command: /sbin/init 38 | tmpfs: 39 | /run: rw 40 | /tmp: rw 41 | volumes: 42 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 43 | network: net1 44 | provisioner: 45 | name: ansible 46 | verifier: 47 | name: ansible 48 | -------------------------------------------------------------------------------- /templates/slurmdbd.conf.j2: -------------------------------------------------------------------------------- 1 | {{ ansible_managed | comment }} 2 | # 3 | # Example slurmdbd.conf file. 4 | # 5 | # See the slurmdbd.conf man page for more information. 
6 | # 7 | # Archive info 8 | #ArchiveJobs=yes 9 | #ArchiveDir="/tmp" 10 | #ArchiveSteps=yes 11 | #ArchiveScript= 12 | #JobPurge=12 13 | #StepPurge=1 14 | # 15 | # Authentication info 16 | AuthType=auth/munge 17 | #AuthInfo=/var/run/munge/munge.socket.2 18 | # 19 | # slurmDBD info 20 | DbdHost={{ openhpc_slurmdbd_host }} 21 | DbdAddr={{ openhpc_slurmdbd_host }} 22 | DbdPort={{ openhpc_slurmdbd_port }} 23 | SlurmUser=slurm 24 | #MessageTimeout=300 25 | DebugLevel=4 26 | #DefaultQOS=normal,standby 27 | # NOTE: By default, slurmdbd will log to syslog 28 | #LogFile=/var/log/slurm/slurmdbd.log 29 | PidFile=/var/run/slurmdbd.pid 30 | #PluginDir=/usr/lib/slurm 31 | #PrivateData=accounts,users,usage,jobs 32 | #TrackWCKey=yes 33 | # 34 | # Database info 35 | StorageType=accounting_storage/mysql 36 | StorageHost={{ openhpc_slurmdbd_mysql_host }} 37 | StorageUser={{ openhpc_slurmdbd_mysql_username }} 38 | StoragePass={{ openhpc_slurmdbd_mysql_password | mandatory('You must set openhpc_slurmdbd_mysql_password') }} 39 | StorageLoc={{ openhpc_slurmdbd_mysql_database }} 40 | -------------------------------------------------------------------------------- /molecule/test12/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check slurm hostlist 4 | hosts: testohpc_login 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | changed_when: false 10 | - name: Assert slurm running ok 11 | assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 12 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']" 13 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 14 | - name: Run a slurm job 15 | command: 16 | cmd: "sbatch -N2 --wrap 'srun hostname'" 17 | register: sbatch 18 | - name: Set fact for slurm jobid 19 | set_fact: 20 | jobid: "{{ sbatch.stdout.split()[-1] }}" 21 | - name: Get job completion info 22 | command: 23 | cmd: "sacct --completion --noheader --parsable2" 24 | changed_when: false 25 | register: sacct 26 | until: "sacct.stdout.strip() != ''" 27 | retries: 5 28 | delay: 1 29 | - assert: 30 | that: "(jobid + '|0|wrap|compute|2|testohpc-compute-[0-1]|COMPLETED') in sacct.stdout" 31 | fail_msg: "Didn't find expected output for {{ jobid }} in sacct output: {{ sacct.stdout }}" 32 | 33 | -------------------------------------------------------------------------------- /molecule/test10/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | - initial 11 | command: /sbin/init 12 | tmpfs: 13 | /run: rw 14 | /tmp: rw 15 | volumes: 16 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 17 | network: net1 18 | - name: testohpc-compute-0 19 | image: ${MOLECULE_IMAGE} 20 | pre_build_image: true 21 | groups: 22 | - testohpc_compute 23 | - initial 24 | command: /sbin/init 25 | tmpfs: 26 | /run: rw 27 | /tmp: rw 28 | volumes: 29 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 30 | network: net1 31 | - name: testohpc-compute-1 32 | image: ${MOLECULE_IMAGE} 33 | pre_build_image: true 34 | groups: 35 | - testohpc_compute 36 | - initial 37 | command: /sbin/init 38 | tmpfs: 39 | /run: rw 40 | /tmp: rw 41 | volumes: 42 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 43 | network: net1 44 | - name: testohpc-compute-2 45 | 
image: ${MOLECULE_IMAGE} 46 | pre_build_image: true 47 | groups: # NB this is NOT in the "testohpc_compute" so that it isn't added to slurm.conf initially 48 | - new 49 | command: /sbin/init 50 | tmpfs: 51 | /run: rw 52 | /tmp: rw 53 | volumes: 54 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 55 | network: net1 56 | provisioner: 57 | name: ansible 58 | verifier: 59 | name: ansible 60 | -------------------------------------------------------------------------------- /module_utils/slurm_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Copyright: (c) 2020, StackHPC 4 | # Apache 2 License 5 | 6 | def slurm_parse(output): 7 | """Parse sacct parsable output(i.e. using -p) 8 | 9 | :param output: a string containing the full output of sacct or 10 | an iterable yielding lines of sacct output 11 | :return List of dictionaries which map column headiings to values. 12 | """ 13 | # Example input: 14 | # Cluster|ControlHost|ControlPort|RPC|Share|GrpJobs|GrpTRES|GrpSubmit|MaxJobs|MaxTRES|MaxSubmit|MaxWall|QOS|Def QOS| 15 | # testohpc|172.20.0.2|6817|8448|1||||||||normal|| 16 | result = [] 17 | if isinstance(output, str): 18 | output = output.splitlines() 19 | lines = iter(output) 20 | first_line = next(lines) 21 | if "|" not in first_line: 22 | raise ValueError("Could not parse headings") 23 | # Last value is empty due to trailing '|' 24 | headings = first_line.split("|")[:-1] 25 | for line in lines: 26 | record = {} 27 | # Last value is empty due to trailing '|' 28 | values = line.split("|")[:-1] 29 | for i, value in enumerate(values): 30 | record[headings[i]] = value 31 | result.append(record) 32 | return result 33 | 34 | if __name__ == "__main__": 35 | slurm_parse_input = """Cluster|ControlHost|ControlPort|RPC|Share|GrpJobs|GrpTRES|GrpSubmit|MaxJobs|MaxTRES|MaxSubmit|MaxWall|QOS|Def QOS| 36 | testohpc|172.20.0.2|6817|8448|1||||||||normal||""" 37 | slurm_parse_result = slurm_parse(slurm_parse_input) 38 | assert slurm_parse_result[0]["Cluster"] == "testohpc" 39 | assert "" not in slurm_parse_result 40 | -------------------------------------------------------------------------------- /molecule/test3/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: testohpc-grp1-0 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | - testohpc_grp1 23 | command: /sbin/init 24 | tmpfs: 25 | /run: rw 26 | /tmp: rw 27 | volumes: 28 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 29 | network: net1 30 | - name: testohpc-grp1-1 31 | image: ${MOLECULE_IMAGE} 32 | pre_build_image: true 33 | groups: 34 | - testohpc_compute 35 | - testohpc_grp1 36 | command: /sbin/init 37 | tmpfs: 38 | /run: rw 39 | /tmp: rw 40 | volumes: 41 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 42 | network: net1 43 | - name: testohpc-grp2-0 44 | image: ${MOLECULE_IMAGE} 45 | pre_build_image: true 46 | groups: 47 | - testohpc_compute 48 | - testohpc_grp2 49 | command: /sbin/init 50 | tmpfs: 51 | /run: rw 52 | /tmp: rw 53 | volumes: 54 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 55 | network: net1 56 | - name: testohpc-grp2-1 57 | image: ${MOLECULE_IMAGE} 58 | pre_build_image: true 59 | groups: 60 | - 
testohpc_compute 61 | - testohpc_grp2 62 | command: /sbin/init 63 | tmpfs: 64 | /run: rw 65 | /tmp: rw 66 | volumes: 67 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 68 | network: net1 69 | provisioner: 70 | name: ansible 71 | verifier: 72 | name: ansible 73 | -------------------------------------------------------------------------------- /molecule/test9/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-control 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_control 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | 18 | - name: testohpc-login-0 19 | image: ${MOLECULE_IMAGE} 20 | pre_build_image: true 21 | groups: 22 | - testohpc_login 23 | command: /sbin/init 24 | tmpfs: 25 | /run: rw 26 | /tmp: rw 27 | volumes: 28 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 29 | network: net1 30 | 31 | - name: testohpc-login-1 32 | image: ${MOLECULE_IMAGE} 33 | pre_build_image: true 34 | groups: 35 | - testohpc_login 36 | command: /sbin/init 37 | tmpfs: 38 | /run: rw 39 | /tmp: rw 40 | volumes: 41 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 42 | network: net1 43 | 44 | - name: testohpc-compute-0 45 | image: ${MOLECULE_IMAGE} 46 | pre_build_image: true 47 | groups: 48 | - testohpc_compute 49 | command: /sbin/init 50 | tmpfs: 51 | /run: rw 52 | /tmp: rw 53 | volumes: 54 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 55 | network: net1 56 | - name: testohpc-compute-1 57 | image: ${MOLECULE_IMAGE} 58 | pre_build_image: true 59 | groups: 60 | - testohpc_compute 61 | command: /sbin/init 62 | tmpfs: 63 | /run: rw 64 | /tmp: rw 65 | volumes: 66 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 67 | network: net1 68 | provisioner: 69 | name: ansible 70 | ansible_args: 71 | - --limit=testohpc-control,testohpc-compute-0 72 | verifier: 73 | name: ansible 74 | -------------------------------------------------------------------------------- /molecule/test13/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-control 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_control 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | 18 | - name: testohpc-login-0 19 | image: ${MOLECULE_IMAGE} 20 | pre_build_image: true 21 | groups: 22 | - testohpc_login 23 | command: /sbin/init 24 | tmpfs: 25 | /run: rw 26 | /tmp: rw 27 | volumes: 28 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 29 | network: net1 30 | 31 | - name: testohpc-login-1 32 | image: ${MOLECULE_IMAGE} 33 | pre_build_image: true 34 | groups: 35 | - testohpc_login 36 | command: /sbin/init 37 | tmpfs: 38 | /run: rw 39 | /tmp: rw 40 | volumes: 41 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 42 | network: net1 43 | 44 | - name: testohpc-compute-0 45 | image: ${MOLECULE_IMAGE} 46 | pre_build_image: true 47 | groups: 48 | - testohpc_compute 49 | command: /sbin/init 50 | tmpfs: 51 | /run: rw 52 | /tmp: rw 53 | volumes: 54 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 55 | network: net1 56 | - name: testohpc-compute-1 57 | image: ${MOLECULE_IMAGE} 58 | pre_build_image: true 59 | groups: 60 | - testohpc_compute 61 | command: /sbin/init 62 | tmpfs: 63 | /run: rw 64 | /tmp: rw 65 | volumes: 66 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 67 | network: net1 68 | provisioner: 69 | 
name: ansible 70 | # ansible_args: 71 | # - --limit=testohpc-control,testohpc-compute-0 72 | verifier: 73 | name: ansible 74 | -------------------------------------------------------------------------------- /molecule/test8/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-control 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_control 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | 18 | - name: testohpc-login-0 19 | image: ${MOLECULE_IMAGE} 20 | pre_build_image: true 21 | groups: 22 | - testohpc_login 23 | command: /sbin/init 24 | tmpfs: 25 | /run: rw 26 | /tmp: rw 27 | volumes: 28 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 29 | network: net1 30 | 31 | - name: testohpc-login-1 32 | image: ${MOLECULE_IMAGE} 33 | pre_build_image: true 34 | groups: 35 | - testohpc_login 36 | command: /sbin/init 37 | tmpfs: 38 | /run: rw 39 | /tmp: rw 40 | volumes: 41 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 42 | network: net1 43 | 44 | - name: testohpc-compute-0 45 | image: ${MOLECULE_IMAGE} 46 | pre_build_image: true 47 | groups: 48 | - testohpc_compute 49 | command: /sbin/init 50 | tmpfs: 51 | /run: rw 52 | /tmp: rw 53 | volumes: 54 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 55 | network: net1 56 | - name: testohpc-compute-1 57 | image: ${MOLECULE_IMAGE} 58 | pre_build_image: true 59 | groups: 60 | - testohpc_compute 61 | command: /sbin/init 62 | tmpfs: 63 | /run: rw 64 | /tmp: rw 65 | volumes: 66 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 67 | network: net1 68 | provisioner: 69 | name: ansible 70 | # ansible_args: 71 | # - --limit=testohpc-control,testohpc-compute-0 72 | verifier: 73 | name: ansible 74 | -------------------------------------------------------------------------------- /molecule/test2/molecule.yml: -------------------------------------------------------------------------------- 1 | --- 2 | driver: 3 | name: podman 4 | platforms: 5 | - name: testohpc-login-0 6 | image: ${MOLECULE_IMAGE} 7 | pre_build_image: true 8 | groups: 9 | - testohpc_login 10 | command: /sbin/init 11 | tmpfs: 12 | /run: rw 13 | /tmp: rw 14 | volumes: 15 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 16 | network: net1 17 | - name: testohpc-part1-0 18 | image: ${MOLECULE_IMAGE} 19 | pre_build_image: true 20 | groups: 21 | - testohpc_compute 22 | - testohpc_part1 23 | command: /sbin/init 24 | tmpfs: 25 | /run: rw 26 | /tmp: rw 27 | volumes: 28 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 29 | network: net1 30 | - name: testohpc-part1-1 31 | image: ${MOLECULE_IMAGE} 32 | pre_build_image: true 33 | groups: 34 | - testohpc_compute 35 | - testohpc_part1 36 | command: /sbin/init 37 | tmpfs: 38 | /run: rw 39 | /tmp: rw 40 | volumes: 41 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 42 | network: net1 43 | - name: testohpc-part2-0 44 | image: ${MOLECULE_IMAGE} 45 | pre_build_image: true 46 | groups: 47 | - testohpc_compute 48 | - testohpc_part2 49 | command: /sbin/init 50 | tmpfs: 51 | /run: rw 52 | /tmp: rw 53 | volumes: 54 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 55 | network: net1 56 | - name: testohpc-part2-1 57 | image: ${MOLECULE_IMAGE} 58 | pre_build_image: true 59 | groups: 60 | - testohpc_compute 61 | - testohpc_part2 62 | command: /sbin/init 63 | tmpfs: 64 | /run: rw 65 | /tmp: rw 66 | volumes: 67 | - /sys/fs/cgroup:/sys/fs/cgroup:ro 68 | network: net1 69 | provisioner: 70 | name: ansible 71 | verifier: 72 | name: ansible 73 | 
-------------------------------------------------------------------------------- /molecule/test11/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - name: Check initial cluster has 2x nodes 4 | hosts: testohpc_login 5 | tasks: 6 | - name: Get slurm partition info 7 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 8 | register: sinfo 9 | changed_when: false 10 | - assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 11 | that: "sinfo.stdout_lines == ['compute_orig*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']" 12 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 13 | success_msg: "OK - 2x nodes idle" 14 | 15 | - name: Rerun with smaller compute group 16 | hosts: 17 | - testohpc_login 18 | - testohpc_compute_new 19 | tasks: 20 | - name: "Include ansible-role-openhpc" 21 | include_role: 22 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 23 | vars: 24 | openhpc_enable: 25 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 26 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 27 | runtime: true 28 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 29 | openhpc_nodegroups: 30 | - name: "compute_new" 31 | openhpc_cluster_name: testohpc 32 | 33 | - name: Check modified cluster has 1x nodes 34 | hosts: testohpc_login 35 | tasks: 36 | - name: Get slurm partition info 37 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 38 | register: sinfo 39 | changed_when: false 40 | - assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 41 | that: "sinfo.stdout_lines == ['compute_new*,up,60-00:00:00,1,idle,testohpc-compute-0']" 42 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 43 | success_msg: "OK - 1x nodes idle" 44 | -------------------------------------------------------------------------------- /tasks/install-ohpc.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - include_tasks: pre.yml 4 | 5 | - name: Ensure OpenHPC repos 6 | ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module] 7 | loop: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] }}" 8 | loop_control: 9 | label: "{{ item.name }}" 10 | 11 | - name: Ensure extra repos 12 | ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module] 13 | loop: "{{ ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}" 14 | loop_control: 15 | label: "{{ item.name }}" 16 | 17 | - name: Enable PowerTools repo 18 | # NB: doesn't run command `dnf config-manager --set-enabled PowerTools` as can't make that idempotent 19 | community.general.ini_file: 20 | path: /etc/yum.repos.d/Rocky-PowerTools.repo 21 | section: powertools 22 | option: enabled 23 | value: "1" 24 | create: false 25 | no_extra_spaces: true 26 | when: ansible_distribution_major_version == '8' 27 | 28 | - name: Enable CRB repo 29 | community.general.ini_file: 30 | path: /etc/yum.repos.d/rocky.repo 31 | section: crb 32 | option: enabled 33 | value: "1" 34 | create: false 35 | no_extra_spaces: true 36 | when: ansible_distribution_major_version == '9' 37 | 38 | - name: Build host-specific list of required slurm packages 39 | set_fact: 40 | openhpc_slurm_pkglist: "{{ openhpc_slurm_pkglist | default([]) + item.value }}" 41 | loop: "{{ ohpc_slurm_packages | dict2items }}" 42 | when: (openhpc_enable.get(item.key, false)) or () 43 | 44 
| - name: Install required slurm packages 45 | dnf: 46 | name: "{{ openhpc_slurm_pkglist | reject('eq', '') }}" 47 | install_weak_deps: false # avoids getting recommended packages 48 | when: openhpc_slurm_pkglist | default(false, true) 49 | 50 | - name: Install other packages 51 | yum: 52 | name: "{{ openhpc_packages + [openhpc_slurm_accounting_storage_client_package] }}" 53 | 54 | ... 55 | -------------------------------------------------------------------------------- /molecule/test10/verify.yml: -------------------------------------------------------------------------------- 1 | --- 2 | - name: Check initial cluster has 2x nodes 3 | hosts: testohpc_login 4 | tasks: 5 | - name: Get slurm partition info 6 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 7 | register: sinfo 8 | changed_when: false 9 | - assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 10 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,2,idle,testohpc-compute-[0-1]']" 11 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 12 | success_msg: "OK - 2x nodes idle" 13 | 14 | - name: Add new host(s) to cluster 15 | hosts: all 16 | tasks: 17 | - name: Add new host(s) to group for slurm partition 18 | add_host: 19 | name: "{{ item }}" 20 | groups: testohpc_compute 21 | loop: "{{ groups['new'] }}" 22 | run_once: true 23 | - name: "Include ansible-role-openhpc" 24 | include_role: 25 | name: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}" 26 | vars: 27 | openhpc_enable: 28 | control: "{{ inventory_hostname in groups['testohpc_login'] }}" 29 | batch: "{{ inventory_hostname in groups['testohpc_compute'] }}" 30 | runtime: true 31 | openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}" 32 | openhpc_nodegroups: 33 | - name: "compute" 34 | openhpc_cluster_name: testohpc 35 | 36 | - name: Check modified cluster has 3x nodes 37 | hosts: testohpc_login 38 | tasks: 39 | - name: Get slurm partition info 40 | command: sinfo --noheader --format="%P,%a,%l,%D,%t,%N" # using --format ensures we control whitespace 41 | register: sinfo 42 | changed_when: false 43 | - assert: # PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 44 | that: "sinfo.stdout_lines == ['compute*,up,60-00:00:00,3,idle,testohpc-compute-[0-2]']" 45 | fail_msg: "FAILED - actual value: {{ sinfo.stdout_lines }}" 46 | success_msg: "OK - 3x nodes idle" 47 | -------------------------------------------------------------------------------- /tests/inventory-mock-groups: -------------------------------------------------------------------------------- 1 | [mock_group_0] 2 | localhost-0-0 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 3 | localhost-0-1 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 4 | localhost-0-2 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 5 | localhost-0-3 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 6 | localhost-0-5 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 7 | localhost-non-numerical ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 8 | 9 | [mock_group_1] 10 | localhost-1-1 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 11 | localhost-1-2 ansible_host=127.0.0.1 ansible_connection='local' 
ansible_python_interpreter='/usr/bin/env python' 12 | localhost-1-4 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 13 | localhost-1-5 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 14 | localhost-1-10 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 15 | localhost-2-1 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 16 | localhost-2-2 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 17 | localhost-2-3 ansible_host=127.0.0.1 ansible_connection='local' ansible_python_interpreter='/usr/bin/env python' 18 | 19 | [mock_group_2] 20 | # test padding: 21 | # $ scontrol show hostlist localhost-admin,localhost-0001,localhost-0002,localhost-0003,localhost-0008,localhost-0010,localhost-1 22 | # localhost-admin,localhost-[0001-0003,0008,0010,1] 23 | localhost-0001 24 | localhost-1 25 | localhost-0010 26 | localhost-admin 27 | localhost-0002 28 | localhost-0008 29 | localhost-0003 30 | -------------------------------------------------------------------------------- /handlers/main.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | # NOTE: we need this running before slurmctld start 4 | - name: Issue slurmdbd restart command 5 | service: 6 | name: "slurmdbd" 7 | state: restarted 8 | delegate_to: "{{ openhpc_slurmdbd_host }}" 9 | when: 10 | - openhpc_slurmdbd_host is defined 11 | - openhpc_slurm_service_started | bool 12 | - openhpc_slurmdbd_host in play_hosts 13 | run_once: true 14 | listen: Restart slurmdbd service 15 | 16 | - name: Check slurmdbd actually restarted 17 | wait_for: 18 | port: "{{ openhpc_slurmdbd_port }}" 19 | delegate_to: "{{ openhpc_slurmdbd_host }}" 20 | run_once: true 21 | when: 22 | - openhpc_slurmdbd_host is defined 23 | - openhpc_slurm_service_started | bool 24 | - openhpc_slurmdbd_host in play_hosts 25 | listen: Restart slurmdbd service 26 | 27 | # NOTE: we need this running before slurmd 28 | # Allows you to reconfigure slurmctld from another host 29 | - name: Issue slurmctld restart command 30 | service: 31 | name: "slurmctld" 32 | state: restarted 33 | delegate_to: "{{ openhpc_slurm_control_host }}" 34 | run_once: true 35 | when: 36 | - openhpc_slurm_service_started | bool 37 | - openhpc_slurm_control_host in play_hosts 38 | listen: Restart slurmctld service 39 | 40 | - name: Check slurmctld actually restarted 41 | wait_for: 42 | port: 6817 43 | delay: 10 44 | delegate_to: "{{ openhpc_slurm_control_host }}" 45 | run_once: true 46 | when: 47 | - openhpc_slurm_service_started | bool 48 | - openhpc_slurm_control_host in play_hosts 49 | listen: Restart slurmctld service 50 | 51 | - name: Restart slurmd service 52 | service: 53 | name: "slurmd" 54 | state: restarted 55 | retries: 5 56 | register: slurmd_restart 57 | until: slurmd_restart is success 58 | delay: 30 59 | when: 60 | - openhpc_slurm_service_started | bool 61 | - openhpc_enable.batch | default(false) | bool 62 | # 2nd condition required as notification happens on controller, which isn't necessarily a compute note 63 | 64 | - name: Reload facts 65 | ansible.builtin.setup: 66 | filter: ansible_local 67 | -------------------------------------------------------------------------------- /tasks/install-generic.yml: -------------------------------------------------------------------------------- 1 | - include_tasks: pre.yml 2 | 3 
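# NB: the dict2items/selectattr/items2dict chain in the next task reduces
# _ohpc_daemon_map to just the names of the daemons enabled on this host,
# e.g. ['slurmctld'] on a control-only node or ['slurmd'] on a compute node.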
| - name: Create a list of slurm daemons 4 | set_fact: 5 | _ohpc_daemons: "{{ _ohpc_daemon_map | dict2items | selectattr('value') | items2dict | list }}" 6 | vars: 7 | _ohpc_daemon_map: 8 | slurmctld: "{{ openhpc_enable.control }}" 9 | slurmd: "{{ openhpc_enable.batch }}" 10 | slurmdbd: "{{ openhpc_enable.database }}" 11 | 12 | - name: Ensure extra repos 13 | ansible.builtin.yum_repository: "{{ item }}" # noqa: args[module] 14 | loop: "{{ openhpc_extra_repos }}" 15 | loop_control: 16 | label: "{{ item.name }}" 17 | 18 | - name: Install system packages 19 | dnf: 20 | name: "{{ openhpc_generic_packages }}" 21 | 22 | - name: Create Slurm user 23 | user: 24 | name: slurm 25 | comment: SLURM resource manager 26 | home: /etc/slurm 27 | shell: /sbin/nologin 28 | 29 | - name: Create Slurm unit files 30 | template: 31 | src: "{{ item }}.service.j2" 32 | dest: /lib/systemd/system/{{ item }}.service 33 | owner: root 34 | group: root 35 | mode: ug=rw,o=r 36 | loop: "{{ _ohpc_daemons }}" 37 | register: _slurm_systemd_units 38 | 39 | - name: Get current library locations 40 | shell: 41 | cmd: "ldconfig -v | grep -v ^$'\t'" # noqa: no-tabs risky-shell-pipe 42 | register: _slurm_ldconfig 43 | changed_when: false 44 | 45 | - name: Add library locations to ldd search path 46 | copy: 47 | dest: /etc/ld.so.conf.d/slurm.conf 48 | content: "{{ openhpc_lib_dir }}" 49 | owner: root 50 | group: root 51 | mode: ug=rw,o=r 52 | when: openhpc_lib_dir not in _ldd_paths 53 | vars: 54 | _ldd_paths: "{{ _slurm_ldconfig.stdout_lines | map('split', ':') | map('first') }}" 55 | 56 | - name: Reload Slurm unit files 57 | # Can't do just this from systemd module 58 | command: systemctl daemon-reload # noqa: command-instead-of-module no-changed-when no-handler 59 | when: _slurm_systemd_units.changed 60 | 61 | - name: Prepend $PATH with slurm user binary location 62 | lineinfile: 63 | path: /etc/environment 64 | line: "{{ new_path }}" 65 | regexp: "^{{ new_path | regex_escape }}" 66 | owner: root 67 | group: root 68 | mode: u=gw,go=r 69 | vars: 70 | new_path: PATH="{{ openhpc_bin_dir }}:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin" 71 | 72 | - meta: reset_connection # to get new environment 73 | -------------------------------------------------------------------------------- /library/gpu_info.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Copyright: (c) 2025, StackHPC 4 | # Apache 2 License 5 | 6 | from ansible.module_utils.basic import AnsibleModule 7 | 8 | ANSIBLE_METADATA = { 9 | "metadata_version": "0.1", 10 | "status": ["preview"], 11 | "supported_by": "community", 12 | } 13 | 14 | DOCUMENTATION = """ 15 | --- 16 | module: gpu_info 17 | short_description: Gathers information about NVIDIA GPUs on a node 18 | description: 19 | - "This module queries for NVIDIA GPUs using `nvidia-smi` and returns information about them. It is designed to fail gracefully if `nvidia-smi` is not present or if the NVIDIA driver is not running." 
20 | options: {} 21 | requirements: 22 | - "python >= 3.6" 23 | author: 24 | - Steve Brasier, StackHPC 25 | """ 26 | 27 | EXAMPLES = """ 28 | """ 29 | 30 | import collections 31 | 32 | def run_module(): 33 | module_args = dict({}) 34 | 35 | module = AnsibleModule(argument_spec=module_args, supports_check_mode=True) 36 | 37 | try: 38 | rc ,stdout, stderr = module.run_command("nvidia-smi --query-gpu=name --format=noheader", check_rc=False, handle_exceptions=False) 39 | except FileNotFoundError: # nvidia-smi not installed 40 | rc = None 41 | 42 | # nvidia-smi return codes: https://docs.nvidia.com/deploy/nvidia-smi/index.html 43 | gpus = {} 44 | result = {'changed': False, 'gpus': gpus, 'gres':''} 45 | if rc == 0: 46 | # stdout line e.g. 'NVIDIA H200' for each GPU 47 | lines = [line for line in stdout.splitlines() if line != ''] # defensive: currently no blank lines 48 | models = [line.split()[1] for line in lines] 49 | gpus.update(collections.Counter(models)) 50 | elif rc == 9: 51 | # nvidia-smi installed but driver not running 52 | pass 53 | elif rc == None: 54 | # nvidia-smi not installed 55 | pass 56 | else: 57 | result.update({'stdout': stdout, 'rc': rc, 'stderr':stderr}) 58 | module.fail_json(**result) 59 | 60 | if len(gpus) > 0: 61 | gres_parts = [] 62 | for model, count in gpus.items(): 63 | gres_parts.append(f"gpu:{model}:{count}") 64 | result.update({'gres': ','.join(gres_parts)}) 65 | 66 | module.exit_json(**result) 67 | 68 | 69 | def main(): 70 | run_module() 71 | 72 | 73 | if __name__ == "__main__": 74 | main() 75 | -------------------------------------------------------------------------------- /library/sacct_cluster.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Copyright: (c) 2020, StackHPC 4 | # Apache 2 License 5 | 6 | from ansible.module_utils.basic import AnsibleModule 7 | from ansible.module_utils.slurm_utils import slurm_parse 8 | 9 | ANSIBLE_METADATA = { 10 | "metadata_version": "0.1", 11 | "status": ["preview"], 12 | "supported_by": "community", 13 | } 14 | 15 | DOCUMENTATION = """ 16 | --- 17 | module: sacct_cluster 18 | short_description: Manages clusters in the accounting database 19 | version_added: "2.9" 20 | description: 21 | - "Adds/removes a cluster from the accounting database" 22 | options: 23 | name: 24 | description: 25 | - Name of the cluster 26 | required: true 27 | type: str 28 | state: 29 | description: 30 | - If C(present), cluster will be added if it does't already exist 31 | - If C(absent), cluster will be removed if it exists 32 | type: str 33 | required: true 34 | choices: [ absent, present] 35 | 36 | requirements: 37 | - "python >= 3.6" 38 | author: 39 | - Will Szumski, StackHPC 40 | """ 41 | 42 | EXAMPLES = """ 43 | - name: Create a cluster 44 | slurm_acct: 45 | name: test123 46 | state: present 47 | """ 48 | 49 | def run_module(): 50 | module_args = dict( 51 | name=dict(type="str", required=True), 52 | state=dict(type="str", required=True, choices=['absent', 'present']), 53 | ) 54 | 55 | module = AnsibleModule(argument_spec=module_args, supports_check_mode=True) 56 | result = {"changed": False} 57 | 58 | cluster = module.params["name"] 59 | state = module.params["state"] 60 | 61 | if module.check_mode: 62 | module.exit_json(**result) 63 | 64 | _,stdout,_ = module.run_command("sacctmgr list cluster -p", check_rc=True) 65 | records = slurm_parse(stdout) 66 | clusters = [record["Cluster"] for record in records] 67 | 68 | if (cluster not in clusters and state == "present") or 
(cluster in clusters and state == "absent"): 69 | result["changed"] = True 70 | 71 | if module.check_mode or not result["changed"]: 72 | module.exit_json(**result) 73 | 74 | if state == "present": 75 | module.run_command("sacctmgr --immediate add cluster name=%s" % cluster, check_rc=True) 76 | else: 77 | module.run_command("sacctmgr --immediate delete cluster name=%s" % cluster, check_rc=True) 78 | 79 | module.exit_json(**result) 80 | 81 | 82 | def main(): 83 | run_module() 84 | 85 | 86 | if __name__ == "__main__": 87 | main() 88 | -------------------------------------------------------------------------------- /tasks/validate.yml: -------------------------------------------------------------------------------- 1 | - name: Check openhpc_slurm_control_host and openhpc_cluster_name 2 | assert: 3 | that: 4 | - openhpc_slurm_control_host is defined 5 | - openhpc_slurm_control_host != '' 6 | - openhpc_cluster_name is defined 7 | - openhpc_cluster_name != '' 8 | fail_msg: openhpc role variables not correctly defined, see detail above 9 | delegate_to: localhost 10 | run_once: true 11 | 12 | - name: Validate nodegroups define names 13 | # Needed for the next check.. 14 | # NB: Don't validate names against inventory groups as those are allowed to be missing 15 | ansible.builtin.assert: 16 | that: "'name' in item" 17 | fail_msg: "A mapping in openhpc_nodegroups is missing 'name'" 18 | loop: "{{ openhpc_nodegroups }}" 19 | delegate_to: localhost 20 | run_once: true 21 | 22 | - name: Check no host appears in more than one nodegroup 23 | assert: 24 | that: "{{ _openhpc_check_hosts.values() | select('greaterthan', 1) | length == 0 }}" 25 | fail_msg: | 26 | Some hosts appear more than once in inventory groups {{ _openhpc_node_inventory_groups | join(', ') }}: 27 | {{ _openhpc_check_hosts | dict2items | rejectattr('value', 'equalto', 1) | items2dict | to_nice_yaml }} 28 | vars: 29 | _openhpc_node_inventory_groups: "{{ openhpc_nodegroups | map(attribute='name') | map('regex_replace', '^', openhpc_cluster_name ~ '_') }}" 30 | _openhpc_check_hosts: "{{ groups | dict2items | list | selectattr('key', 'in', _openhpc_node_inventory_groups) | map(attribute='value') | flatten | community.general.counter }}" 31 | delegate_to: localhost 32 | run_once: true 33 | 34 | - name: Validate GRES definitions 35 | ansible.builtin.assert: 36 | that: "(item.gres | select('contains', 'file') | length) == (item.gres | length)" 37 | fail_msg: "GRES configuration(s) in openhpc_nodegroups '{{ item.name }}' do not include 'file' but GRES autodetection is not enabled" 38 | loop: "{{ openhpc_nodegroups }}" 39 | when: 40 | - item.gres_autodetect | default(openhpc_gres_autodetect) == 'off' 41 | - "'gres' in item" 42 | delegate_to: localhost 43 | run_once: true 44 | 45 | - name: Fail if partition configuration is outdated 46 | assert: 47 | that: openhpc_slurm_partitions is not defined 48 | fail_msg: stackhpc.openhpc parameter openhpc_slurm_partitions has been replaced - see openhpc_nodegroups and openhpc_partitions 49 | delegate_to: localhost 50 | run_once: true 51 | 52 | - name: Fail if munge key configuration is outdated 53 | assert: 54 | that: openhpc_munge_key is not defined 55 | fail_msg: stackhpc.openhpc parameter openhpc_munge_key has been replaced with openhpc_munge_key_b64 56 | delegate_to: localhost 57 | run_once: true 58 | -------------------------------------------------------------------------------- /molecule/README.md: -------------------------------------------------------------------------------- 1 | Molecule tests for 
the role. 2 | 3 | # Test Matrix 4 | 5 | Test options in "Other" column flow down through table unless changed. 6 | 7 | Test | # Partitions | Groups in partitions? | Other 8 | --- | --- | --- | --- 9 | test1 | 1 | N | 2x compute node, sequential names (default test) 10 | test1b | 1 | N | 1x compute node 11 | test1c | 1 | N | 2x compute nodes, nonsequential names 12 | test2 | 2 | N | 4x compute node, sequential names 13 | test3 | 1 | Y | 4x compute nodes in 2x groups, single partition 14 | test4 | 1 | N | 2x compute node, accounting enabled 15 | test5 | - | - | [removed, now always configless] 16 | test6 | 1 | N | 0x compute nodes 17 | test7 | 1 | N | [removed, image build should just run install.yml task, this is not expected to work] 18 | test8 | 1 | N | 2x compute node, 2x login-only nodes 19 | test9 | 1 | N | As test8 but uses `--limit=testohpc-control,testohpc-compute-0` and checks login nodes still end up in slurm.conf 20 | test10 | 1 | N | As for #1 but then tries to add an additional node 21 | test11 | 1 | N | As for #1 but then deletes a node (actually changes the partition due to molecule/ansible limitations) 22 | test12 | 1 | N | As for #1 but enabling job completion and testing `sacct -c` 23 | test13 | 1 | N | As for #1 but tests `openhpc_config` variable. 24 | test14 | - | - | [removed, extra_nodes removed] 25 | test15 | 1 | Y | As for #1 but also tests partitions with different name but with the same NodeName. 26 | 27 | # Local Installation & Running 28 | 29 | Local installation on a Rocky Linux 8.x machine looks like: 30 | 31 | sudo dnf install -y podman 32 | sudo dnf install podman-plugins # required for DNS 33 | sudo yum install -y git 34 | git clone git@github.com:stackhpc/ansible-role-openhpc.git 35 | cd ansible-role-openhpc/ 36 | python3.9 -m venv venv 37 | . venv/bin/activate 38 | pip install -U pip 39 | pip install -r molecule/requirements.txt 40 | ansible-galaxy collection install containers.podman:>=1.10.1 41 | 42 | Build a Rocky Linux 9 image with systemd included: 43 | 44 | cd ansible-role-openhpc/molecule/images 45 | podman build -t rocky9systemd:latest . 46 | 47 | Run tests, e.g.: 48 | 49 | cd ansible-role-openhpc/ 50 | MOLECULE_NO_LOG="false" MOLECULE_IMAGE=rockylinux:8 molecule test --all 51 | 52 | where the image may be `rockylinux:8` or `localhost/rocky9systemd`. 53 | 54 | Other useful options during development: 55 | - Prevent destroying instances by using `molecule test --destroy never` 56 | - Run only a single test using e.g. `molecule test --scenario test5` 57 | -------------------------------------------------------------------------------- /tasks/upgrade.yml: -------------------------------------------------------------------------------- 1 | - name: Check if slurm database has been initialised 2 | # DB is initialised on the first slurmdbd startup (without -u option). 3 | # If it is not initialised, `slurmdbd -u` errors with something like 4 | # > Slurm Database is somehow higher than expected '4294967294' but I only 5 | # > know as high as '16'. Conversion needed. 
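# The SHOW TABLES query below therefore just detects whether the database has
# been initialised at all: an empty result means the `slurmdbd -u` check in the
# next task is skipped (see its `when`), avoiding that spurious error.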
6 | community.mysql.mysql_query: 7 | login_db: "{{ openhpc_slurmdbd_mysql_database }}" 8 | login_user: "{{ openhpc_slurmdbd_mysql_username }}" 9 | login_password: "{{ openhpc_slurmdbd_mysql_password }}" 10 | login_host: "{{ openhpc_slurmdbd_host }}" 11 | query: SHOW TABLES 12 | config_file: '' 13 | register: _openhpc_slurmdb_tables 14 | 15 | - name: Check if slurm database requires an upgrade 16 | ansible.builtin.command: slurmdbd -u 17 | register: _openhpc_slurmdbd_check 18 | changed_when: false 19 | failed_when: >- 20 | _openhpc_slurmdbd_check.rc > 1 or 21 | 'Slurm Database is somehow higher than expected' in _openhpc_slurmdbd_check.stdout 22 | # from https://github.com/SchedMD/slurm/blob/master/src/plugins/accounting_storage/mysql/as_mysql_convert.c 23 | when: _openhpc_slurmdb_tables.query_result | flatten | length > 0 # i.e. when db is initialised 24 | 25 | - name: Set fact for slurm database upgrade 26 | # Explanation of ifs below: 27 | # - `slurmdbd -u` rc == 0 then no conversion required (from manpage) 28 | # - default of 0 on rc skips upgrade steps if check was skipped because 29 | # db is not initialised 30 | # - Usage message (and rc == 1) if -u option doesn't exist, in which case 31 | # it can't be a major upgrade due to existing openhpc versions 32 | set_fact: 33 | _openhpc_slurmdb_upgrade: >- 34 | {{ false 35 | if ( 36 | ( _openhpc_slurmdbd_check.rc | default(0) == 0) 37 | or 38 | ( 'Usage: slurmdbd' in _openhpc_slurmdbd_check.stderr ) 39 | ) else 40 | true 41 | }} 42 | 43 | - name: Ensure Slurm database service stopped 44 | ansible.builtin.systemd: 45 | name: "{{ openhpc_slurm_accounting_storage_service }}" 46 | state: stopped 47 | register: _openhpc_slurmdb_state 48 | when: 49 | - _openhpc_slurmdb_upgrade 50 | - openhpc_slurm_accounting_storage_service != '' 51 | 52 | - name: Backup Slurm database 53 | ansible.builtin.shell: # noqa: command-instead-of-shell 54 | cmd: "{{ openhpc_slurm_accounting_storage_backup_cmd }}" 55 | delegate_to: "{{ openhpc_slurm_accounting_storage_backup_host }}" 56 | become: "{{ openhpc_slurm_accounting_storage_backup_become }}" 57 | changed_when: true 58 | run_once: true 59 | when: 60 | - _openhpc_slurmdb_upgrade 61 | - openhpc_slurm_accounting_storage_backup_cmd != '' 62 | 63 | - name: Ensure Slurm database service started 64 | ansible.builtin.systemd: 65 | name: "{{ openhpc_slurm_accounting_storage_service }}" 66 | state: started 67 | when: 68 | - openhpc_slurm_accounting_storage_service != '' 69 | - _openhpc_slurmdb_state.changed | default(false) 70 | 71 | - name: Run slurmdbd in foreground for upgrade 72 | ansible.builtin.expect: 73 | command: /usr/sbin/slurmdbd -D -vvv 74 | responses: 75 | (?i)Everything rolled up: 76 | # See https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrade-slurmdbd 77 | # and 78 | # https://github.com/SchedMD/slurm/blob/0ce058c5adcf63001ec2ad211c65e67b0e7682a8/src/plugins/accounting_storage/mysql/as_mysql_usage.c#L1042 79 | become: true 80 | become_user: slurm 81 | when: _openhpc_slurmdb_upgrade 82 | -------------------------------------------------------------------------------- /templates/slurm.conf.j2: -------------------------------------------------------------------------------- 1 | ClusterName={{ openhpc_cluster_name }} 2 | 3 | # PARAMETERS 4 | {% for k, v in openhpc_default_config | combine(openhpc_config) | items %} 5 | {% if v != "omit" %}{# allow removing items using setting key: omit #} 6 | {% if k != 'SlurmctldParameters' %}{# handled separately due to configless mode #} 7 | {{ k }}={{ v | 
join(',') if (v is sequence and v is not string) else v }} 8 | {% endif %} 9 | {% endif %} 10 | {% endfor %} 11 | 12 | {% set slurmctldparameters = ((openhpc_config.get('SlurmctldParameters', []) + ['enable_configless']) | unique) %} 13 | SlurmctldParameters={{ slurmctldparameters | join(',') }} 14 | 15 | # LOGIN-ONLY NODES 16 | # Define slurmd nodes not in partitions for login-only nodes in "configless" mode: 17 | {%if openhpc_login_only_nodes %}{% for node in groups[openhpc_login_only_nodes] %} 18 | NodeName={{ node }} 19 | {% endfor %}{% endif %} 20 | 21 | 22 | # COMPUTE NODES 23 | {% for nodegroup in openhpc_nodegroups %} 24 | # nodegroup: {{ nodegroup.name }} 25 | {% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %} 26 | {% set inventory_group_hosts = groups.get(inventory_group_name, []) %} 27 | {% if inventory_group_hosts | length > 0 %} 28 | {% set play_group_hosts = inventory_group_hosts | intersect (play_hosts) %} 29 | {% set first_host = play_group_hosts | first | mandatory('Inventory group "' ~ inventory_group_name ~ '" contains no hosts in this play - was --limit used?') %} 30 | {% set first_host_hv = hostvars[first_host] %} 31 | {% set ram_mb = (first_host_hv['ansible_memory_mb']['real']['total'] * (nodegroup.ram_multiplier | default(openhpc_ram_multiplier))) | int %} 32 | {% set hostlists = (inventory_group_hosts | hostlist_expression) %}{# hosts in inventory group aren't necessarily a single hostlist expression #} 33 | NodeName={{ hostlists | join(',') }} {{ '' -}} 34 | Features={{ (['nodegroup_' ~ nodegroup.name] + nodegroup.features | default([]) ) | join(',') }} {{ '' -}} 35 | State=UNKNOWN {{ '' -}} 36 | RealMemory={{ nodegroup.ram_mb | default(ram_mb) }} {{ '' -}} 37 | Sockets={{ first_host_hv['ansible_processor_count'] }} {{ '' -}} 38 | CoresPerSocket={{ first_host_hv['ansible_processor_cores'] }} {{ '' -}} 39 | ThreadsPerCore={{ first_host_hv['ansible_processor_threads_per_core'] }} {{ '' -}} 40 | {{ nodegroup.node_params | default({}) | dict2parameters }} {{ '' -}} 41 | {% if 'gres' in nodegroup -%} 42 | Gres={{ ','.join(nodegroup.gres | map(attribute='conf')) -}} 43 | {% elif nodegroup.gres_autodetect | default(openhpc_gres_autodetect) == 'nvml' and first_host_hv['ohpc_node_gpu_gres'] != '' -%} 44 | Gres={{ first_host_hv['ohpc_node_gpu_gres'] -}} 45 | {% endif %} 46 | 47 | {% endif %}{# 1 or more hosts in inventory #} 48 | NodeSet=nodegroup_{{ nodegroup.name }} Feature=nodegroup_{{ nodegroup.name }} 49 | 50 | {% endfor %} 51 | 52 | # Define a non-existent node, in no partition, so that slurmctld starts even with all partitions empty 53 | NodeName=nonesuch 54 | 55 | # PARTITIONS 56 | {% for partition in openhpc_partitions %} 57 | PartitionName={{partition.name}} {{ '' -}} 58 | Default={{ partition.get('default', 'YES') }} {{ '' -}} 59 | MaxTime={{ partition.get('maxtime', openhpc_job_maxtime) }} {{ '' -}} 60 | State=UP {{ '' -}} 61 | Nodes={{ partition.get('nodegroups', [partition.name]) | map('regex_replace', '^', 'nodegroup_') | join(',') }} {{ '' -}} 62 | {{ partition.partition_params | default({}) | dict2parameters }} 63 | {% endfor %}{# openhpc_partitions #} 64 | -------------------------------------------------------------------------------- /files/slurm.conf.ohpc: -------------------------------------------------------------------------------- 1 | # 2 | # Example slurm.conf file. Please run configurator.html 3 | # (in doc/html) to build a configuration file customized 4 | # for your environment. 
5 | # 6 | # 7 | # slurm.conf file generated by configurator.html. 8 | # Put this file on all nodes of your cluster. 9 | # See the slurm.conf man page for more information. 10 | # 11 | ClusterName=cluster 12 | SlurmctldHost=linux0 13 | #SlurmctldHost= 14 | # 15 | #DisableRootJobs=NO 16 | #EnforcePartLimits=NO 17 | #Epilog= 18 | #EpilogSlurmctld= 19 | #FirstJobId=1 20 | #MaxJobId=67043328 21 | #GresTypes= 22 | #GroupUpdateForce=0 23 | #GroupUpdateTime=600 24 | #JobFileAppend=0 25 | #JobRequeue=1 26 | #JobSubmitPlugins=lua 27 | #KillOnBadExit=0 28 | #LaunchType=launch/slurm 29 | #Licenses=foo*4,bar 30 | #MailProg=/bin/mail 31 | #MaxJobCount=10000 32 | #MaxStepCount=40000 33 | #MaxTasksPerNode=512 34 | MpiDefault=none 35 | #MpiParams=ports=#-# 36 | #PluginDir= 37 | #PlugStackConfig= 38 | #PrivateData=jobs 39 | ProctrackType=proctrack/cgroup 40 | #Prolog= 41 | #PrologFlags= 42 | #PrologSlurmctld= 43 | #PropagatePrioProcess=0 44 | #PropagateResourceLimits= 45 | #PropagateResourceLimitsExcept= 46 | #RebootProgram= 47 | SlurmctldPidFile=/var/run/slurmctld.pid 48 | SlurmctldPort=6817 49 | SlurmdPidFile=/var/run/slurmd.pid 50 | SlurmdPort=6818 51 | SlurmdSpoolDir=/var/spool/slurmd 52 | SlurmUser=slurm 53 | #SlurmdUser=root 54 | #SrunEpilog= 55 | #SrunProlog= 56 | StateSaveLocation=/var/spool/slurmctld 57 | SwitchType=switch/none 58 | #TaskEpilog= 59 | TaskPlugin=task/affinity 60 | #TaskProlog= 61 | #TopologyPlugin=topology/tree 62 | #TmpFS=/tmp 63 | #TrackWCKey=no 64 | #TreeWidth= 65 | #UnkillableStepProgram= 66 | #UsePAM=0 67 | # 68 | # 69 | # TIMERS 70 | #BatchStartTimeout=10 71 | #CompleteWait=0 72 | #EpilogMsgTime=2000 73 | #GetEnvTimeout=2 74 | #HealthCheckInterval=0 75 | #HealthCheckProgram= 76 | InactiveLimit=0 77 | KillWait=30 78 | #MessageTimeout=10 79 | #ResvOverRun=0 80 | MinJobAge=300 81 | #OverTimeLimit=0 82 | SlurmctldTimeout=120 83 | SlurmdTimeout=300 84 | #UnkillableStepTimeout=60 85 | #VSizeFactor=0 86 | Waittime=0 87 | # 88 | # 89 | # SCHEDULING 90 | #DefMemPerCPU=0 91 | #MaxMemPerCPU=0 92 | #SchedulerTimeSlice=30 93 | SchedulerType=sched/backfill 94 | SelectType=select/cons_tres 95 | SelectTypeParameters=CR_Core 96 | # 97 | # 98 | # JOB PRIORITY 99 | #PriorityFlags= 100 | #PriorityType=priority/basic 101 | #PriorityDecayHalfLife= 102 | #PriorityCalcPeriod= 103 | #PriorityFavorSmall= 104 | #PriorityMaxAge= 105 | #PriorityUsageResetPeriod= 106 | #PriorityWeightAge= 107 | #PriorityWeightFairshare= 108 | #PriorityWeightJobSize= 109 | #PriorityWeightPartition= 110 | #PriorityWeightQOS= 111 | # 112 | # 113 | # LOGGING AND ACCOUNTING 114 | #AccountingStorageEnforce=0 115 | #AccountingStorageHost= 116 | #AccountingStoragePass= 117 | #AccountingStoragePort= 118 | AccountingStorageType=accounting_storage/none 119 | #AccountingStorageUser= 120 | #AccountingStoreFlags= 121 | #JobCompHost= 122 | #JobCompLoc= 123 | #JobCompPass= 124 | #JobCompPort= 125 | JobCompType=jobcomp/none 126 | #JobCompUser= 127 | #JobContainerType=job_container/none 128 | JobAcctGatherFrequency=30 129 | JobAcctGatherType=jobacct_gather/none 130 | SlurmctldDebug=info 131 | SlurmctldLogFile=/var/log/slurmctld.log 132 | SlurmdDebug=info 133 | SlurmdLogFile=/var/log/slurmd.log 134 | #SlurmSchedLogFile= 135 | #SlurmSchedLogLevel= 136 | #DebugFlags= 137 | # 138 | # 139 | # POWER SAVE SUPPORT FOR IDLE NODES (optional) 140 | #SuspendProgram= 141 | #ResumeProgram= 142 | #SuspendTimeout= 143 | #ResumeTimeout= 144 | #ResumeRate= 145 | #SuspendExcNodes= 146 | #SuspendExcParts= 147 | #SuspendRate= 148 | #SuspendTime= 149 | # 150 | # 
151 | # COMPUTE NODES 152 | # OpenHPC default configuration 153 | TaskPlugin=task/affinity 154 | PropagateResourceLimitsExcept=MEMLOCK 155 | JobCompType=jobcomp/filetxt 156 | Epilog=/etc/slurm/slurm.epilog.clean 157 | NodeName=c[1-4] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN 158 | PartitionName=normal Nodes=c[1-4] Default=YES MaxTime=24:00:00 State=UP Oversubscribe=EXCLUSIVE 159 | SlurmctldParameters=enable_configless 160 | ReturnToService=1 161 | -------------------------------------------------------------------------------- /filter_plugins/slurm_conf.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2019 StackHPC Ltd. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); you may 4 | # not use this file except in compliance with the License. You may obtain 5 | # a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 11 | # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 12 | # License for the specific language governing permissions and limitations 13 | # under the License. 14 | 15 | # NB: To test this from the repo root run: 16 | # ansible-playbook -i tests/inventory -i tests/inventory-mock-groups tests/filter.yml 17 | 18 | from ansible import errors 19 | import jinja2 20 | import re 21 | 22 | # Pattern to match a hostname with numerical ending 23 | pattern = re.compile("^(.*\D(?=\d))(\d+)$") 24 | 25 | def hostlist_expression(hosts): 26 | """ Group hostnames using Slurm's hostlist expression format. 27 | 28 | E.g. with an inventory containing: 29 | 30 | [compute] 31 | dev-foo-00 ansible_host=localhost 32 | dev-foo-3 ansible_host=localhost 33 | my-random-host 34 | dev-foo-04 ansible_host=localhost 35 | dev-foo-05 ansible_host=localhost 36 | dev-compute-000 ansible_host=localhost 37 | dev-compute-001 ansible_host=localhost 38 | 39 | Then "{{ groups[compute] | hostlist_expression }}" will return: 40 | 41 | ['dev-foo-[00,04-05,3]', 'dev-compute-[000-001]', 'my-random-host'] 42 | 43 | NB: This does not guranteed to return parts in the same order as `scontrol hostlist`, but its output should return the same hosts when passed to `scontrol hostnames`. 44 | """ 45 | 46 | results = {} 47 | unmatchable = [] 48 | for v in hosts: 49 | m = pattern.match(v) 50 | if m: 51 | prefix, suffix = m.groups() 52 | r = results.setdefault(prefix, []) 53 | r.append(suffix) 54 | else: 55 | unmatchable.append(v) 56 | return ['{}[{}]'.format(k, _group_numbers(v)) for k, v in results.items()] + unmatchable 57 | 58 | def _group_numbers(numbers): 59 | units = [] 60 | ints = [int(n) for n in numbers] 61 | lengths = [len(n) for n in numbers] 62 | # sort numbers by int value and length: 63 | ints, lengths, numbers = zip(*sorted(zip(ints, lengths, numbers))) 64 | prev = min(ints) 65 | for i, v in enumerate(sorted(ints)): 66 | if v == prev + 1: 67 | units[-1].append(numbers[i]) 68 | else: 69 | units.append([numbers[i]]) 70 | prev = v 71 | return ','.join(['{}-{}'.format(u[0], u[-1]) if len(u) > 1 else str(u[0]) for u in units]) 72 | 73 | def error(condition, msg): 74 | """ Raise an error if condition is not True """ 75 | 76 | if not condition: 77 | raise errors.AnsibleFilterError(msg) 78 | 79 | def dict2parameters(d): 80 | """ Convert a dict into a str in 'k1=v1 k2=v2 ...' 
format """ 81 | parts = ['%s=%s' % (k, v) for k, v in d.items()] 82 | return ' '.join(parts) 83 | 84 | def config2dict(lines): 85 | """ Convert a sequence of output lines from `scontrol show config` to a dict. 86 | 87 | As per man page uppercase keys are derived parameters, mixed case are from 88 | from config files. 89 | 90 | The following case-insensitive conversions of values are carried out: 91 | - '(null)' and 'n/a' are converted to None. 92 | - yes and no are converted to True and False respectively 93 | 94 | Except for these, values are always strings. 95 | """ 96 | cfg = {} 97 | for line in lines: 98 | if '=' not in line: # ditch blank/info lines 99 | continue 100 | else: 101 | parts = [x.strip() for x in line.split('=', maxsplit=1)] # maxplit handles '=' in values 102 | if len(parts) != 2: 103 | raise errors.AnsibleFilterError(f'line {line} cannot be split into key=value') 104 | k, v = parts 105 | small_v = v.lower() 106 | if small_v == '(null)': 107 | v = None 108 | elif small_v == 'n/a': 109 | v = None 110 | elif small_v == 'no': 111 | v = False 112 | elif small_v == 'yes': 113 | v = True 114 | cfg[k] = v 115 | return cfg 116 | 117 | 118 | class FilterModule(object): 119 | 120 | def filters(self): 121 | return { 122 | 'hostlist_expression': hostlist_expression, 123 | 'error': error, 124 | 'dict2parameters': dict2parameters, 125 | 'config2dict': config2dict, 126 | } 127 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | --- 2 | name: CI 3 | 'on': 4 | pull_request: 5 | push: 6 | branches: 7 | - master 8 | 9 | jobs: 10 | build: 11 | name: Build Rocky Linux 9 container image 12 | # Upstream rockylinux/rockylinux:9 images don't contain systemd, which means /sbin/init fails. 13 | # A workaround of using "/bin/bash -c 'dnf -y install systemd && /sbin/init'" 14 | # as the container command is flaky. 15 | # This job builds an image using the upstream rockylinux/rockylinux:9 image which ensures 16 | # that the image used for the molecule workflow is always updated. 17 | runs-on: ubuntu-latest 18 | defaults: 19 | run: 20 | working-directory: molecule/images 21 | steps: 22 | - name: Check out the codebase. 23 | uses: actions/checkout@v4 24 | 25 | - name: Build image 26 | run: podman build -t rocky9systemd:latest . 27 | 28 | - name: Save image 29 | run: podman save --output rocky9systemd.docker rocky9systemd:latest 30 | 31 | - name: Upload rocky9 image 32 | uses: actions/upload-artifact@v4 33 | with: 34 | name: rocky9systemd 35 | path: molecule/images/rocky9systemd.docker 36 | 37 | molecule: 38 | name: Molecule 39 | runs-on: ubuntu-latest 40 | needs: build 41 | strategy: 42 | fail-fast: false 43 | matrix: 44 | image: 45 | - 'rockylinux/rockylinux:8' 46 | - 'localhost/rocky9systemd' 47 | scenario: 48 | - test1 49 | - test1b 50 | - test1c 51 | - test2 52 | - test3 53 | - test4 54 | - test6 55 | - test8 56 | - test9 57 | - test10 58 | - test11 59 | - test12 60 | - test13 61 | exclude: 62 | # mariadb package provides /usr/bin/mysql on RL8 which doesn't work with geerlingguy/mysql role 63 | - scenario: test4 64 | image: 'rockylinux/rockylinux:8' 65 | 66 | steps: 67 | - name: Check out the codebase. 
68 | uses: actions/checkout@v4 69 | 70 | - name: Download rocky9 container image 71 | uses: actions/download-artifact@v4 72 | with: 73 | name: rocky9systemd 74 | path: molecule/images/rocky9systemd.docker 75 | if: matrix.image == 'localhost/rocky9systemd' 76 | 77 | - name: Load rocky9 container image 78 | run: podman load --input rocky9systemd.docker/rocky9systemd.docker 79 | working-directory: molecule/images 80 | if: matrix.image == 'localhost/rocky9systemd' 81 | 82 | - name: Set up Python 3. 83 | uses: actions/setup-python@v4 84 | with: 85 | python-version: '3.9' 86 | 87 | - name: Install test dependencies. 88 | run: | 89 | pip3 install -U pip 90 | pip install -r molecule/requirements.txt 91 | ansible-galaxy collection install containers.podman:>=1.10.1 # otherwise get https://github.com/containers/ansible-podman-collections/issues/428 92 | 93 | - name: Display ansible version 94 | run: ansible --version 95 | 96 | - name: Compensate for repo name being different to the role 97 | run: ln -s $(pwd) ../stackhpc.openhpc 98 | 99 | - name: Create ansible.cfg with correct roles_path 100 | run: printf '[defaults]\nroles_path=../' >ansible.cfg 101 | 102 | - name: Run Molecule tests. 103 | run: molecule test -s ${{ matrix.scenario }} 104 | env: 105 | PY_COLORS: '1' 106 | ANSIBLE_FORCE_COLOR: '1' 107 | MOLECULE_IMAGE: ${{ matrix.image }} 108 | 109 | checks: 110 | name: Checks 111 | runs-on: ubuntu-latest 112 | steps: 113 | - name: Check out the codebase. 114 | uses: actions/checkout@v3 115 | 116 | - name: Set up Python 3. 117 | uses: actions/setup-python@v5 118 | with: 119 | python-version: '3.9' 120 | 121 | - name: Install test dependencies. 122 | run: | 123 | pip3 install -U ansible ansible-lint 124 | ansible-galaxy collection install containers.podman:>=1.10.1 # otherwise get https://github.com/containers/ansible-podman-collections/issues/428 125 | 126 | - name: Display ansible version 127 | run: ansible --version 128 | 129 | - name: Compensate for repo name being different to the role 130 | run: ln -s $(pwd) ../stackhpc.openhpc 131 | 132 | - name: Create ansible.cfg with correct roles_path 133 | run: printf '[defaults]\nroles_path=../' >ansible.cfg 134 | 135 | - name: Run Ansible syntax check 136 | run: ansible-playbook tests/test.yml -i tests/inventory --syntax-check 137 | 138 | - name: Run Ansible lint 139 | run: ansible-lint . 
140 | 141 | - name: Test custom filters 142 | run: ansible-playbook tests/filter.yml -i tests/inventory -i tests/inventory-mock-groups 143 | -------------------------------------------------------------------------------- /defaults/main.yml: -------------------------------------------------------------------------------- 1 | --- 2 | openhpc_slurm_service_enabled: true 3 | openhpc_slurm_service_started: "{{ openhpc_slurm_service_enabled }}" 4 | openhpc_slurm_service: 5 | openhpc_slurm_control_host: "{{ inventory_hostname }}" 6 | #openhpc_slurm_control_host_address: 7 | openhpc_partitions: "{{ openhpc_nodegroups }}" 8 | openhpc_nodegroups: [] 9 | openhpc_cluster_name: 10 | openhpc_packages: 11 | - slurm-libpmi-ohpc 12 | openhpc_resume_timeout: 300 13 | openhpc_retry_delay: 10 14 | openhpc_job_maxtime: '60-0' # quote this to avoid ansible converting some formats to seconds, which is interpreted as minutes by Slurm 15 | openhpc_gres_autodetect: 'off' 16 | openhpc_default_config: 17 | # This only defines values which are not Slurm defaults 18 | SlurmctldHost: "{{ openhpc_slurm_control_host }}{% if openhpc_slurm_control_host_address is defined %}({{ openhpc_slurm_control_host_address }}){% endif %}" 19 | ProctrackType: proctrack/linuxproc # TODO: really want cgroup but needs cgroup.conf and workaround for CI 20 | SlurmdSpoolDir: /var/spool/slurm # NB: not OpenHPC default! 21 | SlurmUser: slurm 22 | StateSaveLocation: "{{ openhpc_state_save_location }}" 23 | SlurmctldTimeout: 300 24 | SchedulerType: sched/backfill 25 | SelectType: select/cons_tres 26 | SelectTypeParameters: CR_Core 27 | PriorityWeightPartition: 1000 28 | PreemptType: preempt/partition_prio 29 | PreemptMode: SUSPEND,GANG 30 | AccountingStoragePass: "{{ openhpc_slurm_accounting_storage_pass | default('omit') }}" 31 | AccountingStorageHost: "{{ openhpc_slurm_accounting_storage_host }}" 32 | AccountingStoragePort: "{{ openhpc_slurm_accounting_storage_port }}" 33 | AccountingStorageType: "{{ openhpc_slurm_accounting_storage_type }}" 34 | AccountingStorageUser: "{{ openhpc_slurm_accounting_storage_user }}" 35 | JobCompLoc: "{{ openhpc_slurm_job_comp_loc }}" 36 | JobCompType: "{{ openhpc_slurm_job_comp_type }}" 37 | JobAcctGatherFrequency: "{{ openhpc_slurm_job_acct_gather_frequency }}" 38 | JobAcctGatherType: "{{ openhpc_slurm_job_acct_gather_type }}" 39 | SlurmctldSyslogDebug: info 40 | SlurmdSyslogDebug: info 41 | PropagateResourceLimitsExcept: MEMLOCK 42 | Epilog: /etc/slurm/slurm.epilog.clean 43 | ReturnToService: 2 44 | GresTypes: "{{ ohpc_gres_types if ohpc_gres_types != '' else 'omit' }}" 45 | openhpc_cgroup_default_config: 46 | ConstrainCores: "yes" 47 | ConstrainDevices: "yes" 48 | ConstrainRAMSpace: "yes" 49 | ConstrainSwapSpace: "yes" 50 | 51 | openhpc_config: {} 52 | openhpc_cgroup_config: {} 53 | ohpc_gres_types: >- 54 | {{ 55 | ( 56 | ['gpu'] if openhpc_gres_autodetect == 'nvml' else [] + 57 | ['gpu'] if openhpc_nodegroups | map(attribute='gres_autodetect', default='') | unique | select('eq', 'nvml') else [] + 58 | openhpc_nodegroups | 59 | community.general.json_query('[].gres[].conf') | 60 | map('regex_search', '^(\w+)') 61 | ) | flatten | reject('eq', '') | sort | unique | join(',') 62 | }} 63 | openhpc_gres_template: gres.conf.j2 64 | openhpc_cgroup_template: cgroup.conf.j2 65 | 66 | openhpc_mpi_template: mpi.conf.j2 67 | openhpc_mpi_config: {} 68 | 69 | openhpc_state_save_location: /var/spool/slurm 70 | openhpc_slurmd_spool_dir: /var/spool/slurm 71 | openhpc_slurm_conf_path: /etc/slurm/slurm.conf 72 | 
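# Illustrative sketch only (not defaults): a minimal nodegroup/partition
# definition using only keys consumed by templates/slurm.conf.j2 and
# tasks/validate.yml; the cluster/group names are assumptions matching the
# molecule test inventories.
#
# openhpc_cluster_name: testohpc
# openhpc_nodegroups:
#   - name: compute            # hosts come from inventory group testohpc_compute
#     features: [highmem]
#     gres_autodetect: nvml    # or list gres entries with explicit conf/file keys
#     node_params:
#       CoreSpecCount: 2
# openhpc_partitions:
#   - name: compute
#     nodegroups: [compute]
#     maxtime: '60-0'
#     partition_params:
#       PreemptMode: 'OFF'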
openhpc_slurm_conf_template: slurm.conf.j2 73 | 74 | # Accounting 75 | openhpc_slurm_accounting_storage_host: "{{ openhpc_slurmdbd_host }}" 76 | openhpc_slurm_accounting_storage_port: 6819 77 | openhpc_slurm_accounting_storage_type: accounting_storage/none 78 | # NOTE: You only need to set these if using accounting_storage/mysql 79 | openhpc_slurm_accounting_storage_user: slurm 80 | #openhpc_slurm_accounting_storage_pass: 81 | 82 | # Job accounting 83 | openhpc_slurm_job_acct_gather_type: jobacct_gather/linux 84 | openhpc_slurm_job_acct_gather_frequency: 30 85 | openhpc_slurm_job_comp_type: jobcomp/none 86 | openhpc_slurm_job_comp_loc: /var/log/slurm_jobacct.log 87 | 88 | # slurmdbd configuration 89 | openhpc_slurmdbd_host: "{{ openhpc_slurm_control_host }}" 90 | openhpc_slurmdbd_port: "{{ openhpc_slurm_accounting_storage_port }}" 91 | openhpc_slurmdbd_mysql_host: "{{ openhpc_slurm_control_host }}" 92 | openhpc_slurmdbd_mysql_database: slurm_acct_db 93 | #openhpc_slurmdbd_mysql_password: 94 | openhpc_slurmdbd_mysql_username: slurm 95 | 96 | openhpc_enable: 97 | control: false 98 | batch: false 99 | database: false 100 | runtime: false 101 | 102 | # Only used for install-generic.yml: 103 | openhpc_generic_packages: 104 | - munge 105 | - mariadb-connector-c # only required on slurmdbd 106 | - hwloc-libs # only required on slurmd 107 | openhpc_sbin_dir: /usr/sbin # path to slurm daemon binaries (e.g. slurmctld) 108 | openhpc_bin_dir: /usr/bin # path to slurm user binaries (e.g sinfo) 109 | openhpc_lib_dir: /usr/lib64/slurm # path to slurm libraries 110 | 111 | # Repository configuration 112 | openhpc_extra_repos: [] 113 | 114 | ohpc_openhpc_repos: 115 | "9": 116 | - name: OpenHPC 117 | file: OpenHPC 118 | description: OpenHPC-3 - Base 119 | baseurl: "http://repos.openhpc.community/OpenHPC/3/EL_9" 120 | gpgcheck: true 121 | gpgkey: https://raw.githubusercontent.com/openhpc/ohpc/v3.0.GA/components/admin/ohpc-release/SOURCES/RPM-GPG-KEY-OpenHPC-3 122 | - name: OpenHPC-updates 123 | file: OpenHPC 124 | description: OpenHPC-3 - Updates 125 | baseurl: "http://repos.openhpc.community/OpenHPC/3/updates/EL_9" 126 | gpgcheck: true 127 | gpgkey: https://raw.githubusercontent.com/openhpc/ohpc/v3.0.GA/components/admin/ohpc-release/SOURCES/RPM-GPG-KEY-OpenHPC-3 128 | "8": 129 | - name: OpenHPC 130 | file: OpenHPC 131 | description: OpenHPC-2 - Base 132 | baseurl: "http://repos.openhpc.community/OpenHPC/2/CentOS_8" 133 | gpgcheck: true 134 | gpgkey: https://raw.githubusercontent.com/openhpc/ohpc/v2.6.1.GA/components/admin/ohpc-release/SOURCES/RPM-GPG-KEY-OpenHPC-2 135 | - name: OpenHPC-updates 136 | file: OpenHPC 137 | description: OpenHPC-2 - Updates 138 | baseurl: "http://repos.openhpc.community/OpenHPC/2/updates/CentOS_8" 139 | gpgcheck: true 140 | gpgkey: https://raw.githubusercontent.com/openhpc/ohpc/v2.6.1.GA/components/admin/ohpc-release/SOURCES/RPM-GPG-KEY-OpenHPC-2 141 | 142 | ohpc_default_extra_repos: 143 | "9": 144 | - name: epel 145 | file: epel 146 | description: "Extra Packages for Enterprise Linux $releasever - $basearch" 147 | metalink: "https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=$basearch&infra=$infra&content=$contentdir" 148 | gpgcheck: true 149 | gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-9" 150 | "8": 151 | - name: epel 152 | file: epel 153 | description: "Extra Packages for Enterprise Linux 8 - $basearch" 154 | metalink: "https://mirrors.fedoraproject.org/metalink?repo=epel-8&arch=$basearch&infra=$infra&content=$contentdir" 155 | gpgcheck: 
true 156 | gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-8" 157 | 158 | openhpc_munge_key_b64: 159 | openhpc_login_only_nodes: '' 160 | openhpc_module_system_install: true # only works for install-ohpc.yml/main.yml 161 | 162 | # Auto detection 163 | openhpc_ram_multiplier: 0.95 164 | 165 | # Database upgrade 166 | openhpc_slurm_accounting_storage_service: '' 167 | openhpc_slurm_accounting_storage_backup_cmd: '' 168 | openhpc_slurm_accounting_storage_backup_host: "{{ openhpc_slurm_accounting_storage_host }}" 169 | openhpc_slurm_accounting_storage_backup_become: true 170 | openhpc_slurm_accounting_storage_client_package: mysql 171 | -------------------------------------------------------------------------------- /tasks/runtime.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - include_tasks: pre.yml 4 | 5 | - name: Fail if control host not in play and munge key not specified 6 | fail: 7 | msg: "Either the slurm control node must be in the play or `openhpc_munge_key` must be set" 8 | when: 9 | - openhpc_slurm_control_host not in ansible_play_hosts 10 | - not openhpc_munge_key 11 | 12 | - name: Ensure Slurm directories exists 13 | file: 14 | path: "{{ item.path }}" 15 | owner: slurm 16 | group: slurm 17 | mode: '0755' 18 | state: directory 19 | loop: 20 | - path: "{{ openhpc_state_save_location }}" # StateSaveLocation 21 | enable: control 22 | - path: "{{ openhpc_slurm_conf_path | dirname }}" 23 | enable: control 24 | - path: "{{ openhpc_slurmd_spool_dir }}" # SlurmdSpoolDir 25 | enable: batch 26 | when: "openhpc_enable[item.enable] | default(false) | bool" 27 | 28 | - name: Retrieve Munge key from control host 29 | # package install generates a node-unique one 30 | slurp: 31 | src: "/etc/munge/munge.key" 32 | register: openhpc_control_munge_key 33 | delegate_to: "{{ openhpc_slurm_control_host }}" 34 | when: openhpc_slurm_control_host in ansible_play_hosts 35 | 36 | - name: Write Munge key 37 | copy: 38 | content: "{{ (openhpc_munge_key_b64 or openhpc_control_munge_key.content) | b64decode }}" 39 | dest: "/etc/munge/munge.key" 40 | owner: munge 41 | group: munge 42 | mode: '0400' 43 | register: _openhpc_munge_key_copy 44 | 45 | - name: Ensure JobComp logfile exists 46 | file: 47 | path: "{{ openhpc_slurm_job_comp_loc }}" 48 | state: touch 49 | owner: slurm 50 | group: slurm 51 | mode: '0644' 52 | access_time: preserve 53 | modification_time: preserve 54 | when: openhpc_slurm_job_comp_type == 'jobcomp/filetxt' 55 | 56 | - name: Template slurmdbd.conf 57 | template: 58 | src: slurmdbd.conf.j2 59 | dest: "{{ openhpc_slurm_conf_path | dirname }}/slurmdbd.conf" 60 | mode: "0600" 61 | owner: slurm 62 | group: slurm 63 | notify: Restart slurmdbd service 64 | when: openhpc_enable.database | default(false) | bool 65 | 66 | - name: Query GPU info 67 | gpu_info: 68 | register: _gpu_info 69 | when: openhpc_enable.batch | default(false) 70 | 71 | - name: Set fact for node GPU GRES 72 | set_fact: 73 | ohpc_node_gpu_gres: "{{ _gpu_info.gres }}" 74 | when: openhpc_enable.batch | default(false) 75 | 76 | - name: Template slurm.conf 77 | template: 78 | src: "{{ openhpc_slurm_conf_template }}" 79 | dest: "{{ openhpc_slurm_conf_path }}" 80 | owner: root 81 | group: root 82 | mode: '0644' 83 | when: openhpc_enable.control | default(false) 84 | notify: 85 | - Restart slurmctld service 86 | register: ohpc_slurm_conf 87 | # NB uses restart rather than reload as number of nodes might have changed 88 | 89 | - name: Create gres.conf 90 | template: 91 
| src: "{{ openhpc_gres_template }}" 92 | dest: "{{ openhpc_slurm_conf_path | dirname }}/gres.conf" 93 | mode: "0600" 94 | owner: slurm 95 | group: slurm 96 | when: openhpc_enable.control | default(false) 97 | notify: 98 | - Restart slurmctld service 99 | register: ohpc_gres_conf 100 | # NB uses restart rather than reload as this is needed in some cases 101 | 102 | - name: Template cgroup.conf 103 | # appears to be required even with NO cgroup plugins: https://slurm.schedmd.com/cgroups.html#cgroup_design 104 | template: 105 | src: "{{ openhpc_cgroup_template }}" 106 | dest: "{{ openhpc_slurm_conf_path | dirname }}/cgroup.conf" 107 | mode: "0644" # perms/ownership based off src from ohpc package 108 | owner: root 109 | group: root 110 | when: openhpc_enable.control | default(false) 111 | notify: 112 | - Restart slurmctld service 113 | register: ohpc_cgroup_conf 114 | # NB uses restart rather than reload as this is needed in some cases 115 | 116 | - name: Template mpi.conf 117 | template: 118 | src: "{{ openhpc_mpi_template }}" 119 | dest: "{{ openhpc_slurm_conf_path | dirname }}/mpi.conf" 120 | owner: root 121 | group: root 122 | mode: "0644" 123 | when: 124 | - openhpc_enable.control | default(false) 125 | - openhpc_mpi_config | length > 0 126 | notify: 127 | - Restart slurmctld service 128 | register: ohpc_mpi_conf 129 | 130 | # Workaround for https://bugs.rockylinux.org/view.php?id=10165 131 | - name: Fix permissions on /etc for Munge service 132 | ansible.builtin.file: 133 | mode: g-w 134 | path: /etc 135 | 136 | - name: Ensure Munge service is running 137 | service: 138 | name: munge 139 | state: "{{ 'restarted' if _openhpc_munge_key_copy.changed else 'started' }}" 140 | when: openhpc_slurm_service_started | bool 141 | 142 | - name: Check slurmdbd state 143 | command: systemctl is-active slurmdbd # noqa: command-instead-of-module 144 | changed_when: false 145 | failed_when: false # rc = 0 when active 146 | register: _openhpc_slurmdbd_state 147 | 148 | - name: Ensure slurm database is upgraded if slurmdbd inactive 149 | import_tasks: upgrade.yml # need import for conditional support 150 | when: 151 | - "_openhpc_slurmdbd_state.stdout == 'inactive'" 152 | - openhpc_enable.database | default(false) 153 | 154 | - name: Notify handler for slurmd restart 155 | debug: 156 | msg: "notifying handlers" # meta: noop doesn't support 'when' 157 | changed_when: true 158 | when: 159 | - openhpc_slurm_control_host in ansible_play_hosts 160 | - hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or 161 | hostvars[openhpc_slurm_control_host].ohpc_mpi_conf.changed or 162 | hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or 163 | hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler 164 | notify: 165 | - Restart slurmd service 166 | 167 | - name: Configure slurmd command line options 168 | vars: 169 | slurmd_options_configless: "--conf-server {{ openhpc_slurm_control_host_address | default(openhpc_slurm_control_host) }}" 170 | lineinfile: 171 | path: /etc/sysconfig/slurmd 172 | line: "SLURMD_OPTIONS='{{ slurmd_options_configless }}'" 173 | regexp: "^SLURMD_OPTIONS=" 174 | create: yes 175 | owner: root 176 | group: root 177 | mode: '0644' 178 | when: 179 | - openhpc_enable.batch | default(false) 180 | notify: 181 | - Restart slurmd service 182 | # Reloading is sufficent, but using a single handler means no bounce. Realistically this won't regularly change on a running slurmd so restarting is ok. 
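# For illustration only (the value depends on inventory/role variables): the task
# above leaves /etc/sysconfig/slurmd containing a single line such as
#   SLURMD_OPTIONS='--conf-server testohpc-control'
# so slurmd fetches its configuration from slurmctld ("configless" mode) rather
# than needing a local slurm.conf.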
183 | 184 | # Munge state could be unchanged but the service is not running. 185 | # Handle that here. 186 | - name: Configure Munge service 187 | service: 188 | name: munge 189 | enabled: "{{ openhpc_slurm_service_enabled | bool }}" 190 | state: "{{ 'started' if openhpc_slurm_service_started | bool else 'stopped' }}" 191 | 192 | - name: Flush handler 193 | meta: flush_handlers # as then subsequent "ensure" is a no-op if slurm services bounced 194 | 195 | - name: Ensure slurmdbd state 196 | service: 197 | name: slurmdbd 198 | enabled: "{{ openhpc_slurm_service_enabled | bool }}" 199 | state: "{{ 'started' if openhpc_slurm_service_started | bool else 'stopped' }}" 200 | when: openhpc_enable.database | default(false) | bool 201 | 202 | - name: Ensure slurmctld state 203 | service: 204 | name: slurmctld 205 | enabled: "{{ openhpc_slurm_service_enabled | bool }}" 206 | state: "{{ 'started' if openhpc_slurm_service_started | bool else 'stopped' }}" 207 | when: openhpc_enable.control | default(false) | bool 208 | 209 | - name: Ensure slurmd state 210 | service: 211 | name: slurmd 212 | enabled: "{{ openhpc_slurm_service_enabled | bool }}" 213 | state: "{{ 'started' if openhpc_slurm_service_started | bool else 'stopped' }}" 214 | when: openhpc_enable.batch | default(false) | bool 215 | 216 | - import_tasks: facts.yml 217 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Build Status](https://github.com/stackhpc/ansible-role-openhpc/workflows/CI/badge.svg)](https://github.com/stackhpc/ansible-role-openhpc/actions) 2 | 3 | # stackhpc.openhpc 4 | 5 | This Ansible role installs packages and performs configuration to provide a Slurm cluster. By default this uses packages from [OpenHPC](https://openhpc.community/) but it can also use user-provided Slurm binaries. 6 | 7 | As a role it must be used from a playbook, for which a simple example is given below. This approach means it is totally modular with no assumptions about available networks or any cluster features except for some hostname conventions. Any desired cluster fileystem or other required functionality may be freely integrated using additional Ansible roles or other approaches. 8 | 9 | The minimal image for nodes is a Rocky Linux 8 GenericCloud image. 10 | 11 | ## Task files 12 | This role provides four task files which can be selected by using the `tasks_from` parameter of Ansible's `import_role` or `include_role` modules: 13 | - `main.yml`: Runs `install-ohpc.yml` and `runtime.yml`. Default if no `tasks_from` parameter is used. 14 | - `install-ohpc.yml`: Installs repos and packages for OpenHPC. 15 | - `install-generic.yml`: Installs systemd units etc. for user-provided binaries. 16 | - `runtime.yml`: Slurm/service configuration. 17 | 18 | ## Role Variables 19 | 20 | Variables only relevant for `install-ohpc.yml` or `install-generic.yml` task files are marked as such below. 21 | 22 | `openhpc_extra_repos`: Optional list. 
Extra Yum repository definitions to configure, following the format of the Ansible
23 | [yum_repository](https://docs.ansible.com/ansible/2.9/modules/yum_repository_module.html) module.
24 |
25 | `openhpc_slurm_service_enabled`: Optional boolean, whether to enable the appropriate slurm service (slurmd/slurmctld). Default `true`.
26 |
27 | `openhpc_slurm_service_started`: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to `openhpc_slurm_service_enabled`.
28 |
29 | `openhpc_slurm_control_host`: Required string. Ansible inventory hostname (and short hostname) of the controller, e.g. `"{{ groups['cluster_control'] | first }}"`.
30 |
31 | `openhpc_slurm_control_host_address`: Optional string. IP address or name to use for the `openhpc_slurm_control_host`, e.g. to use a different interface than is resolved from `openhpc_slurm_control_host`.
32 |
33 | `openhpc_packages`: Optional list. Additional OpenHPC packages to install (`install-ohpc.yml` only).
34 |
35 | `openhpc_enable`:
36 | * `control`: whether to enable the control host
37 | * `database`: whether to enable slurmdbd
38 | * `batch`: whether to enable compute nodes
39 | * `runtime`: whether to enable the OpenHPC runtime
40 |
41 | `openhpc_slurmdbd_host`: Optional. Where to deploy slurmdbd if you are using this role to deploy slurmdbd, otherwise where an existing slurmdbd is running. This should be the name of a host in your inventory. Set this to `none` to prevent the role from managing slurmdbd. Defaults to `openhpc_slurm_control_host`.
42 |
43 | `openhpc_munge_key_b64`: Optional. A base64-encoded Munge key. If not provided then the one generated on package install is used, but the `openhpc_slurm_control_host` must be in the play.
44 |
45 | `openhpc_login_only_nodes`: Optional. If using "configless" mode, specify the name of an Ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run `slurmd` to contact the control node for config.
46 |
47 | `openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one (`install-ohpc.yml` only).
48 |
49 | `openhpc_generic_packages`: Optional. List of system packages to install, see `defaults/main.yml` for details (`install-generic.yml` only).
50 |
51 | `openhpc_sbin_dir`: Optional. Path to Slurm daemon binaries such as `slurmctld`, default `/usr/sbin` (`install-generic.yml` only).
52 |
53 | `openhpc_bin_dir`: Optional. Path to Slurm user binaries such as `sinfo`, default `/usr/bin` (`install-generic.yml` only).
54 |
55 | `openhpc_lib_dir`: Optional. Path to Slurm libraries, default `/usr/lib64/slurm` (`install-generic.yml` only).
56 |
57 | ### slurm.conf
58 |
59 | Note this role always operates in Slurm's [configless mode](https://slurm.schedmd.com/configless_slurm.html)
60 | where the `slurm.conf` configuration file is only present on the control node.
61 |
62 | `openhpc_nodegroups`: Optional, default `[]`. List of mappings, each defining a
63 | unique set of homogeneous nodes:
64 | * `name`: Required. Name of node group.
65 | * `ram_mb`: Optional. The physical RAM available in each node of this group
66 | ([slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `RealMemory`)
67 | in MiB. This is set using Ansible facts if not defined, equivalent to
68 | `free --mebi` total * `openhpc_ram_multiplier`.
69 | * `ram_multiplier`: Optional. An override for the top-level definition
70 | `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
71 | * `gres_autodetect`: Optional. The [hardware autodetection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect)
72 | to use for [generic resources](https://slurm.schedmd.com/gres.html).
73 | **NB:** A value of `'off'` (the default) must be quoted to avoid YAML
74 | conversion to `false`.
75 | * `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html).
76 | Not required if using `nvml` GRES autodetection. Keys/values in dicts are:
77 | - `conf`: A string defining the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1)
78 | in the format `<name>:<type>:<count>`, e.g. `gpu:A100:2`.
79 | - `file`: A string defining device path(s) as per [File](https://slurm.schedmd.com/gres.conf.html#OPT_File),
80 | e.g. `/dev/nvidia[0-1]`. Not required if using any GRES autodetection.
81 |
82 | Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) is
83 | automatically set from the defined GRES or GRES autodetection. See [GRES Configuration](#gres-configuration)
84 | for more discussion.
85 | * `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
86 | * `node_params`: Optional. Mapping of additional parameters and values for
87 | [node configuration](https://slurm.schedmd.com/slurm.conf.html#lbAE).
88 | **NB:** Parameters which can be set via the keys above must not be included here.
89 |
90 | Each nodegroup will contain hosts from an Ansible inventory group named
91 | `{{ openhpc_cluster_name }}_{{ name }}`, where `name` is the nodegroup name.
92 | Note that:
93 | - Each host may only appear in one nodegroup.
94 | - Hosts in a nodegroup are assumed to be homogeneous in terms of processor and memory.
95 | - Hosts may have arbitrary hostnames, but these should be lowercase to avoid a
96 | mismatch between inventory and actual hostname.
97 | - An inventory group may be missing or empty, in which case the node group
98 | contains no hosts.
99 | - If the inventory group is not empty, the play must contain at least one host from it.
100 | This is used to set `Sockets`, `CoresPerSocket`, `ThreadsPerCore` and
101 | optionally `RealMemory` for the nodegroup.
102 |
103 | `openhpc_partitions`: Optional. List of mappings, each defining a
104 | partition. Each partition mapping may contain:
105 | * `name`: Required. Name of partition.
106 | * `nodegroups`: Optional. List of node group names. If omitted, the node group
107 | with the same name as the partition is used.
108 | * `default`: Optional. A flag for whether this partition is the default. Valid settings are `YES` and `NO`.
109 | * `maxtime`: Optional. A partition-specific time limit overriding `openhpc_job_maxtime`.
110 | * `partition_params`: Optional. Mapping of additional parameters and values for
111 | [partition configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION).
112 | **NB:** Parameters which can be set via the keys above must not be included here.
113 |
114 | If this variable is not set, one partition per nodegroup is created, with default
115 | partition configuration for each.
116 |
117 | `openhpc_gres_autodetect`: Optional. A global default for `openhpc_nodegroups.gres_autodetect`
118 | defined above. **NB:** A value of `'off'` (the default) must be quoted to avoid
119 | YAML conversion to `false`.
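As a brief sketch of how these variables fit together (the nodegroup name, feature and parameter values below are purely illustrative, not defaults of this role), a nodegroup with additional node-level parameters and a matching partition might look like:

```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: small                # hosts are taken from inventory group hpc_small
    features: ['highmem']      # illustrative Features string
    node_params:
      Weight: 10               # any other slurm.conf node-level parameters
openhpc_partitions:
  - name: small                # uses the nodegroup of the same name
    maxtime: '12:0:0'          # 12 hours, overriding openhpc_job_maxtime
    partition_params:
      PreemptMode: 'OFF'       # any other slurm.conf partition-level parameters
```

Fuller worked configurations are shown under Example below.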
120 |
121 | `openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days), see
122 | [slurm.conf:MaxTime](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
123 | **NB:** This should be quoted to avoid Ansible conversions.
124 |
125 | `openhpc_cluster_name`: Name of the cluster.
126 |
127 | `openhpc_config`: Optional. Mapping of additional parameters and values for
128 | [slurm.conf](https://slurm.schedmd.com/slurm.conf.html). Keys are slurm.conf
129 | parameter names and values are lists or strings as appropriate. This can be
130 | used to supplement or override the template defaults. Templated parameters can
131 | also be removed by setting the value to the literal string `'omit'` - note
132 | that this is *not the same* as the Ansible `omit` [special variable](https://docs.ansible.com/ansible/latest/reference_appendices/special_variables.html#term-omit). See the example at the end of this section.
133 |
134 | `openhpc_cgroup_config`: Optional. Mapping of additional parameters and values for
135 | [cgroup.conf](https://slurm.schedmd.com/cgroup.conf.html). Keys are cgroup.conf
136 | parameter names and values are lists or strings as appropriate. This can be
137 | used to supplement or override the template defaults. Templated parameters can
138 | also be removed by setting the value to the literal string `'omit'` - note
139 | that this is *not the same* as the Ansible `omit` [special variable](https://docs.ansible.com/ansible/latest/reference_appendices/special_variables.html#term-omit).
140 |
141 | `openhpc_mpi_config`: Optional. Mapping of additional parameters and values for
142 | [mpi.conf](https://slurm.schedmd.com/mpi.conf.html). Keys are mpi.conf
143 | parameter names and values are lists or strings as appropriate. This can be
144 | used to supplement or override the template defaults. Templated parameters can
145 | also be removed by setting the value to the literal string `'omit'` - note
146 | that this is *not the same* as the Ansible `omit` [special variable](https://docs.ansible.com/ansible/latest/reference_appendices/special_variables.html#term-omit).
147 |
148 | `openhpc_ram_multiplier`: Optional, default `0.95`. Multiplier used in the calculation: `total_memory * openhpc_ram_multiplier` when setting `RealMemory` for nodes in slurm.conf. Can be overridden on a per-nodegroup basis using `openhpc_nodegroups.ram_multiplier`. Has no effect if `openhpc_nodegroups.ram_mb` is set.
149 |
150 | `openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation)).
151 |
152 | `openhpc_slurmd_spool_dir`: Optional. Absolute path for slurmd state (`slurm.conf` parameter [SlurmdSpoolDir](https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir)).
153 |
154 | `openhpc_slurm_conf_template`: Optional. Path of Jinja template for the `slurm.conf` configuration file. Default is the `slurm.conf.j2` template in the role. **NB:** The required templating is complex; if just setting specific parameters, use `openhpc_config` instead.
155 |
156 | `openhpc_slurm_conf_path`: Optional. Path to template the `slurm.conf` configuration file to. Default `/etc/slurm/slurm.conf`.
157 |
158 | `openhpc_gres_template`: Optional. Path of Jinja template for the `gres.conf` configuration file. Default is the `gres.conf.j2` template in the role.
159 |
160 | `openhpc_cgroup_template`: Optional. Path of Jinja template for the `cgroup.conf` configuration file. Default is the `cgroup.conf.j2` template in the role.
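For example, a minimal sketch of using `openhpc_config` as described above (the parameter names and values here are illustrative only, and whether a given parameter is templated by default depends on the role's `slurm.conf.j2` template):

```yaml
openhpc_config:
  SlurmctldDebug: info      # set or override a single slurm.conf parameter
  SchedulerParameters:      # a list value provides multiple values for one parameter
    - defer
    - max_rpc_cnt=150
  JobCompType: 'omit'       # remove this parameter if the template would otherwise set it
```

The same pattern applies to `openhpc_cgroup_config` and `openhpc_mpi_config` for `cgroup.conf` and `mpi.conf` respectively.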
161 |
162 | #### Accounting
163 |
164 | By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`[1](#slurm_ver_footnote). Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#Job-accounting)). To enable accounting:
165 |
166 | * Configure a MariaDB or MySQL server as described in the Slurm accounting [documentation](https://slurm.schedmd.com/accounting.html) on one of the nodes in your inventory and set `openhpc_enable.database` to `true` for this node.
167 | * Set `openhpc_slurm_accounting_storage_type` to `accounting_storage/slurmdbd`.
168 | * Configure the variables for `slurmdbd.conf` below.
169 |
170 | The role will take care of configuring the following variables for you:
171 |
172 | `openhpc_slurm_accounting_storage_host`: Where the accounting storage service is running, i.e. where slurmdbd is running.
173 |
174 | `openhpc_slurm_accounting_storage_port`: Which port to use to connect to the accounting storage.
175 |
176 | `openhpc_slurm_accounting_storage_user`: Username for authenticating with the accounting storage.
177 |
178 | `openhpc_slurm_accounting_storage_pass`: Munge key or database password to use for authenticating.
179 |
180 | For more advanced customisation or to configure another storage type, you might want to modify these values manually.
181 |
182 | #### Job accounting
183 |
184 | This is largely redundant if you are using the accounting plugin above, but will give you basic
185 | accounting data such as start and end times. By default no job accounting is configured.
186 |
187 | `openhpc_slurm_job_comp_type`: Logging mechanism for job accounting. Can be one of
188 | `jobcomp/filetxt`, `jobcomp/none`, `jobcomp/elasticsearch`.
189 |
190 | `openhpc_slurm_job_acct_gather_type`: Mechanism for collecting job accounting data. Can be one
191 | of `jobacct_gather/linux`, `jobacct_gather/cgroup` and `jobacct_gather/none`.
192 |
193 | `openhpc_slurm_job_acct_gather_frequency`: Sampling period for job accounting (seconds).
194 |
195 | `openhpc_slurm_job_comp_loc`: Location to store the job accounting records. Depends on the value of
196 | `openhpc_slurm_job_comp_type`, e.g. for `jobcomp/filetxt` this is a path on disk.
197 |
198 | ### slurmdbd
199 |
200 | When the Slurm database daemon (`slurmdbd`) is enabled by setting
201 | `openhpc_enable.database` to `true`, the following options must be configured.
202 | See the documentation for [slurmdbd.conf](https://slurm.schedmd.com/slurmdbd.conf.html)
203 | for more details.
204 |
205 | `openhpc_slurmdbd_port`: Port for slurmdbd to listen on, defaults to `6819`.
206 |
207 | `openhpc_slurmdbd_mysql_host`: Hostname or IP where MariaDB is running, defaults to `openhpc_slurm_control_host`.
208 |
209 | `openhpc_slurmdbd_mysql_database`: Database to use for accounting, defaults to `slurm_acct_db`.
210 |
211 | `openhpc_slurmdbd_mysql_password`: Password for authenticating with the database. You must set this variable.
212 |
213 | `openhpc_slurmdbd_mysql_username`: Username for authenticating with the database, defaults to `slurm`.
214 |
215 | Before starting `slurmdbd`, the role will check if a database upgrade is
216 | required due to a Slurm major version upgrade and carry it out if so.
217 | Slurm versions before 24.11 do not support this check and so no upgrade will
218 | occur. The following variables control behaviour during this upgrade:
219 |
220 | `openhpc_slurm_accounting_storage_client_package`: Optional. String giving the
221 | name of the database client package to install, e.g. `mariadb`. Default `mysql`.
222 |
223 | `openhpc_slurm_accounting_storage_backup_cmd`: Optional. String (possibly
224 | multi-line) giving a command for `ansible.builtin.shell` to run a backup of the
225 | Slurm database before performing the database upgrade. Default is the empty
226 | string, which performs no backup.
227 |
228 | `openhpc_slurm_accounting_storage_backup_host`: Optional. Inventory hostname
229 | defining the host to run the backup command on. Default is `openhpc_slurm_accounting_storage_host`.
230 |
231 | `openhpc_slurm_accounting_storage_backup_become`: Optional. Whether to run the
232 | backup command as root. Default `true`.
233 |
234 | `openhpc_slurm_accounting_storage_service`: Optional. Name of the systemd service
235 | for the accounting storage database, e.g. `mysql`. If this is defined, this
236 | service is stopped before the backup and restarted after, to allow for physical
237 | backups. Default is the empty string, which does not stop/restart any service.
238 |
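As an illustrative sketch only, enabling slurmdbd-based accounting on the control node might combine the variables above roughly as follows. The vaulted password variable and the backup command are hypothetical, not defaults of this role, and must be adapted to your database setup:

```yaml
openhpc_enable:
  control: true
  database: true     # run slurmdbd on this host
  batch: false
  runtime: true
openhpc_slurm_accounting_storage_type: accounting_storage/slurmdbd
openhpc_slurmdbd_mysql_password: "{{ vault_slurmdbd_password }}"  # hypothetical vaulted secret
# Optionally back up the Slurm database before a major version upgrade;
# this assumes passwordless root access to the local MariaDB instance:
openhpc_slurm_accounting_storage_backup_cmd: |
  mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db.sql
```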
239 | ## Facts
240 |
241 | This role creates local facts from the live Slurm configuration, which can be
242 | accessed (with facts gathering enabled) using `ansible_local.slurm`. As per the
243 | `scontrol show config` man page, uppercase keys are derived parameters and keys
244 | in mixed case are from config files. Note the facts are only refreshed
245 | when this role is run.
246 |
247 | ## Example
248 |
249 | ### Simple
250 |
251 | The following creates a cluster with a single partition `compute`
252 | containing two nodes:
253 |
254 | ```ini
255 | # inventory/hosts:
256 | [hpc_login]
257 | cluster-login-0
258 |
259 | [hpc_compute]
260 | cluster-compute-0
261 | cluster-compute-1
262 |
263 | [hpc_control]
264 | cluster-control
265 | ```
266 |
267 | ```yaml
268 | # playbook.yml
269 | ---
270 | - hosts: all
271 |   become: yes
272 |   tasks:
273 |     - import_role:
274 |         name: stackhpc.openhpc
275 |       vars:
276 |         openhpc_cluster_name: hpc
277 |         openhpc_enable:
278 |           control: "{{ inventory_hostname in groups['hpc_control'] }}"
279 |           batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
280 |           runtime: true
281 |         openhpc_slurm_control_host: "{{ groups['hpc_control'] | first }}"
282 |         openhpc_nodegroups:
283 |           - name: compute
284 |         openhpc_partitions:
285 |           - name: compute
286 |
287 | ```
288 |
289 | ### Multiple nodegroups
290 |
291 | This example shows how partitions can span multiple types of compute node.
292 |
293 | Assume an inventory containing two types of compute node (login and
294 | control nodes are omitted for brevity):
295 |
296 | ```ini
297 | # inventory/hosts:
298 | ...
299 | [hpc_general]
300 | # standard compute nodes
301 | cluster-general-0
302 | cluster-general-1
303 |
304 | [hpc_large]
305 | # large memory nodes
306 | cluster-largemem-0
307 | cluster-largemem-1
308 | ...
309 | ```
310 |
311 | Firstly `openhpc_nodegroups` maps to these inventory groups and applies any
312 | node-level parameters - in this case the nodes in the `large` group have two
313 | cores reserved for system use (`CoreSpecCount`):
314 |
315 | ```yaml
316 | openhpc_cluster_name: hpc
317 | openhpc_nodegroups:
318 |   - name: general
319 |   - name: large
320 |     node_params:
321 |       CoreSpecCount: 2
322 | ```
323 |
324 | Now two partitions can be configured using `openhpc_partitions`: a default
325 | partition for testing jobs with a short time limit and no large memory nodes,
326 | and another partition with all hardware and longer job runtime for "production"
327 | jobs:
328 |
329 | ```yaml
330 | openhpc_partitions:
331 |   - name: test
332 |     nodegroups:
333 |       - general
334 |     maxtime: '1:0:0' # 1 hour
335 |     default: 'YES'
336 |   - name: general
337 |     nodegroups:
338 |       - general
339 |       - large
340 |     maxtime: '2-0' # 2 days
341 |     default: 'NO'
342 | ```
343 | Users will select the partition using the `--partition` argument and request nodes
344 | with appropriate memory using the `--mem` option for `sbatch` or `srun`.
345 |
346 | ## GRES Configuration
347 |
348 | ### Autodetection
349 |
350 | Some autodetection mechanisms require recompilation of Slurm packages to link
351 | against external libraries. Examples are shown in the sections below.
352 |
353 | #### Recompiling Slurm binaries against the [NVIDIA Management library](https://developer.nvidia.com/management-library-nvml)
354 |
355 | This allows using `openhpc_gres_autodetect: nvml` or `openhpc_nodegroups.gres_autodetect: nvml`.
356 |
357 | First, [install the complete CUDA toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
358 | You can then recompile the Slurm packages from the source RPMs as follows:
359 |
360 | ```sh
361 | dnf download --source slurm-slurmd-ohpc
362 | rpm -i slurm-ohpc-*.src.rpm
363 | cd /root/rpmbuild/SPECS
364 | dnf builddep slurm.spec
365 | rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
366 | ```
367 |
368 | NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
369 |
370 | The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
371 | each compute node is out of scope of this document.
372 |
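One possible approach, sketched below purely as an illustration (the repository name and URL are hypothetical), is to serve the rebuilt RPMs from an internal Yum repository and add it to nodes via the `openhpc_extra_repos` variable described above:

```yaml
openhpc_extra_repos:
  - name: slurm-nvml
    description: Locally rebuilt Slurm packages with NVML support
    baseurl: http://repo.example.internal/slurm-nvml/   # hypothetical internal repository
    gpgcheck: false
    enabled: true
```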
373 | ### GRES configuration examples
374 |
375 | For NVIDIA GPUs, `nvml` GRES autodetection can be used. This requires:
376 | - The relevant GPU nodes to have the `nvidia-smi` binary installed
377 | - Slurm to be compiled against the NVIDIA management library as above
378 |
379 | Autodetection can then be enabled either for all nodegroups:
380 |
381 | ```yaml
382 | openhpc_gres_autodetect: nvml
383 | ```
384 |
385 | or for individual nodegroups, e.g.:
386 | ```yaml
387 | openhpc_nodegroups:
388 |   - name: example
389 |     gres_autodetect: nvml
390 | ...
391 | ```
392 |
393 | In either case no additional configuration of GRES is required. Any nodegroups
394 | with NVIDIA GPUs will automatically get `gpu` GRES defined for all GPUs found.
395 | GPUs within a node do not need to be the same model, but nodes in a nodegroup
396 | must be homogeneous. GRES types are set to the autodetected model names, e.g. `H100`.
397 |
398 | For `nvml` GRES autodetection, per-nodegroup `gres_autodetect` and/or `gres` keys
399 | can still be provided. These can be used to disable/override the default
400 | autodetection method, or to allow checking autodetected resources against
401 | expectations as described in the [gres.conf documentation](https://slurm.schedmd.com/gres.conf.html).
402 |
403 | Without any autodetection, a GRES configuration for NVIDIA GPUs might be:
404 |
405 | ```yaml
406 | openhpc_nodegroups:
407 |   - name: general
408 |   - name: gpu
409 |     gres:
410 |       - conf: gpu:H200:2
411 |         file: /dev/nvidia[0-1]
412 | ```
413 |
414 | Note that the `nvml` autodetection is special in this role. Other autodetection
415 | mechanisms, e.g. `nvidia` or `rsmi`, allow the `gres.file:` specification to be
416 | omitted but still require `gres.conf:` to be defined.
417 |
418 | 1 Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)
419 | --------------------------------------------------------------------------------