├── LICENSE
├── README.md
├── ansible.cfg
├── clean-os-error.sh
├── cluster_create.sh
├── cluster_create_local.sh
├── cluster_destroy.sh
├── cluster_destroy_local.sh
├── compute_build_base_img.yml
├── compute_take_snapshot.sh
├── cron-node-check.sh
├── figures
│   └── virtual-clusters.jpeg
├── install.sh
├── install_jupyterhub.yml
├── install_local.sh
├── jhub_files
│   ├── https_redirect.conf.j2
│   ├── jhub_conf.py
│   ├── jhub_service.j2
│   ├── jhub_sudoers
│   ├── jupyterhub.conf.j2
│   └── python_mod_3.8
├── prevent-updates.ci
├── slurm-logrotate.conf
├── slurm.conf
├── slurm_prolog.sh
├── slurm_resume.sh
├── slurm_suspend.sh
├── slurm_test.job
└── ssh.cfg

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 XSEDE
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Elastic Slurm Cluster on the Jetstream 2 Cloud
2 |
3 | ## Intro
4 |
5 | This repo contains scripts and Ansible playbooks for creating a virtual
6 | cluster in an Openstack environment, specifically aimed at the Jetstream2 resource.
7 |
8 | The basic structure is to have a single instance act as the headnode, with
9 | compute nodes managed by Slurm via the Openstack API. A customized
10 | image is created for worker nodes, which contains configuration
11 | and software specific to that cluster. The Slurm daemon on the
12 | headnode dynamically creates and destroys worker nodes in response to
13 | jobs in the queue (refer to the figure below). The current version is based on Rocky Linux 8, using
14 | RPMs from the [OpenHPC project](https://openhpc.community).
15 |
16 | Since the current installation scripts target the Rocky Linux 8 distribution, you are expected to have
17 | a virtual machine created from the latest Rocky Linux 8 base image on Jetstream2 before proceeding with the installation.
18 |
19 | ![Integration Diagram](figures/virtual-clusters.jpeg)
20 |
21 | ### Installation
22 | 1. Log in to the Rocky Linux virtual machine. This is the installation host and the headnode for the virtual cluster.
23 | 2. Switch to the rocky user if you are logged in as a different user: ```sudo su - rocky```
24 | 3. 
If you have not already done so, create an openrc file for your Jetstream2 account by following the [Jetstream2 Documentation](https://docs.jetstream-cloud.org/ui/cli/openrc/)
25 | 4. Copy the generated openrc file to the home directory of the rocky user
26 | 5. Clone the [Virtual Cluster repository](https://github.com/access-ci-org/Jetstream_Cluster)
27 | 6. If you'd like to modify your cluster, now is a good time!
28 |
29 |    * The number of nodes can be set in the slurm.conf file, by editing
30 |      the NodeName and PartitionName lines.
31 |    * If you'd like to change the default node size, the ```node_size=``` line
32 |      in ```slurm_resume.sh``` must be changed.
33 |    * If you'd like to enable any specific software, you should edit
34 |      ```compute_build_base_img.yml```. The task named "install basic packages"
35 |      can be easily extended to install anything available from a yum
36 |      repository. If you need to *add* a repo, you can copy the task
37 |      titled "Add OpenHPC 2.0 repo". For more detailed configuration,
38 |      it may be easiest to build your software in /export on the headnode,
39 |      and only install the necessary libraries via ```compute_build_base_img.yml```
40 |      (or ensure that they're available in the shared filesystem).
41 |    * For other modifications, feel free to get in touch!
42 | 7. Now you are all set to install the cluster. Run ```cluster_create_local.sh``` as the rocky user. This will take around
43 |    30 minutes to fully install the cluster.
44 | 8. If you need to destroy the cluster, run ```cluster_destroy_local.sh```. This will decommission the SLURM cluster and any
45 |    running compute nodes. If you need to delete the headnode as well, pass the -d parameter to the cluster destroy script.
46 |
47 |
48 | ### Usage note:
49 | Slurm will run the suspend/resume scripts in response to
50 | ```
51 | scontrol update nodename=compute-[0-1] state=power_down
52 | ```
53 | or
54 | ```
55 | scontrol update nodename=compute-[0-1] state=power_up
56 | ```
57 |
58 | If compute instances get stuck in a bad state, it's often helpful to
59 | cycle through the following:
60 |
61 | ```
62 | scontrol update nodename=compute-[?] state=down reason=resetting
63 | ```
64 | ```
65 | scontrol update nodename=compute-[?] state=idle
66 | ```
67 |
68 | or to re-run the suspend/resume scripts as above (if the instance
69 | power state doesn't match the current state as seen by slurm). Instances
70 | in a failed state within Openstack may simply be deleted, as they will
71 | be built anew by slurm the next time they are needed.
72 |
73 |
74 | This work is supported by [![NSF-1548562](https://img.shields.io/badge/NSF-1548562-blue.svg)](https://nsf.gov/awardsearch/showAward?AWD_ID=1548562)
75 |
--------------------------------------------------------------------------------
/ansible.cfg:
--------------------------------------------------------------------------------
1 | # config file for ansible -- https://ansible.com/
2 | # ===============================================
3 |
4 | # nearly all parameters can be overridden in ansible-playbook
5 | # or with command line flags. ansible will read ANSIBLE_CONFIG,
6 | # ansible.cfg in the current working directory, .ansible.cfg in
7 | # the home directory or /etc/ansible/ansible.cfg, whichever it
8 | # finds first
9 |
10 | [defaults]
11 |
12 | # some basic default values...
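# (Relative to the stock template, the settings actually changed in this file
#  appear to be remote_tmp/local_tmp under /tmp, retry_files_enabled = False,
#  the [privilege_escalation] section, and the ssh_args / control_path_dir
#  values under [ssh_connection], which point at the repo's ssh.cfg.)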
13 | 14 | #inventory = /etc/ansible/hosts 15 | #library = /usr/share/my_modules/ 16 | #module_utils = /usr/share/my_module_utils/ 17 | remote_tmp = /tmp/.ansible/tmp 18 | local_tmp = /tmp/.ansible/tmp 19 | #forks = 5 20 | #poll_interval = 15 21 | #sudo_user = root 22 | #ask_sudo_pass = True 23 | #ask_pass = True 24 | #transport = smart 25 | #remote_port = 22 26 | #module_lang = C 27 | #module_set_locale = False 28 | 29 | # plays will gather facts by default, which contain information about 30 | # the remote system. 31 | # 32 | # smart - gather by default, but don't regather if already gathered 33 | # implicit - gather by default, turn off with gather_facts: False 34 | # explicit - do not gather by default, must say gather_facts: True 35 | #gathering = implicit 36 | 37 | # This only affects the gathering done by a play's gather_facts directive, 38 | # by default gathering retrieves all facts subsets 39 | # all - gather all subsets 40 | # network - gather min and network facts 41 | # hardware - gather hardware facts (longest facts to retrieve) 42 | # virtual - gather min and virtual facts 43 | # facter - import facts from facter 44 | # ohai - import facts from ohai 45 | # You can combine them using comma (ex: network,virtual) 46 | # You can negate them using ! (ex: !hardware,!facter,!ohai) 47 | # A minimal set of facts is always gathered. 48 | #gather_subset = all 49 | 50 | # some hardware related facts are collected 51 | # with a maximum timeout of 10 seconds. This 52 | # option lets you increase or decrease that 53 | # timeout to something more suitable for the 54 | # environment. 55 | # gather_timeout = 10 56 | 57 | # additional paths to search for roles in, colon separated 58 | #roles_path = /etc/ansible/roles 59 | 60 | # uncomment this to disable SSH key host checking 61 | #host_key_checking = False 62 | 63 | # change the default callback 64 | #stdout_callback = skippy 65 | # enable additional callbacks 66 | #callback_whitelist = timer, mail 67 | 68 | # Determine whether includes in tasks and handlers are "static" by 69 | # default. As of 2.0, includes are dynamic by default. Setting these 70 | # values to True will make includes behave more like they did in the 71 | # 1.x versions. 72 | #task_includes_static = True 73 | #handler_includes_static = True 74 | 75 | # Controls if a missing handler for a notification event is an error or a warning 76 | #error_on_missing_handler = True 77 | 78 | # change this for alternative sudo implementations 79 | #sudo_exe = sudo 80 | 81 | # What flags to pass to sudo 82 | # WARNING: leaving out the defaults might create unexpected behaviours 83 | #sudo_flags = -H -S -n 84 | 85 | # SSH timeout 86 | #timeout = 10 87 | 88 | # default user to use for playbooks if user is not specified 89 | # (/usr/bin/ansible will use current user as default) 90 | #remote_user = root 91 | 92 | # logging is off by default unless this path is defined 93 | # if so defined, consider logrotate 94 | #log_path = /var/log/ansible.log 95 | 96 | # default module name for /usr/bin/ansible 97 | #module_name = command 98 | 99 | # use this shell for commands executed under sudo 100 | # you may need to change this to bin/bash in rare instances 101 | # if sudo is constrained 102 | #executable = /bin/sh 103 | 104 | # if inventory variables overlap, does the higher precedence one win 105 | # or are hash values merged together? The default is 'replace' but 106 | # this can also be set to 'merge'. 
107 | #hash_behaviour = replace 108 | 109 | # by default, variables from roles will be visible in the global variable 110 | # scope. To prevent this, the following option can be enabled, and only 111 | # tasks and handlers within the role will see the variables there 112 | #private_role_vars = yes 113 | 114 | # list any Jinja2 extensions to enable here: 115 | #jinja2_extensions = jinja2.ext.do,jinja2.ext.i18n 116 | 117 | # if set, always use this private key file for authentication, same as 118 | # if passing --private-key to ansible or ansible-playbook 119 | #private_key_file = /path/to/file 120 | 121 | # If set, configures the path to the Vault password file as an alternative to 122 | # specifying --vault-password-file on the command line. 123 | #vault_password_file = /path/to/vault_password_file 124 | 125 | # format of string {{ ansible_managed }} available within Jinja2 126 | # templates indicates to users editing templates files will be replaced. 127 | # replacing {file}, {host} and {uid} and strftime codes with proper values. 128 | #ansible_managed = Ansible managed: {file} modified on %Y-%m-%d %H:%M:%S by {uid} on {host} 129 | # {file}, {host}, {uid}, and the timestamp can all interfere with idempotence 130 | # in some situations so the default is a static string: 131 | #ansible_managed = Ansible managed 132 | 133 | # by default, ansible-playbook will display "Skipping [host]" if it determines a task 134 | # should not be run on a host. Set this to "False" if you don't want to see these "Skipping" 135 | # messages. NOTE: the task header will still be shown regardless of whether or not the 136 | # task is skipped. 137 | #display_skipped_hosts = True 138 | 139 | # by default, if a task in a playbook does not include a name: field then 140 | # ansible-playbook will construct a header that includes the task's action but 141 | # not the task's args. This is a security feature because ansible cannot know 142 | # if the *module* considers an argument to be no_log at the time that the 143 | # header is printed. If your environment doesn't have a problem securing 144 | # stdout from ansible-playbook (or you have manually specified no_log in your 145 | # playbook on all of the tasks where you have secret information) then you can 146 | # safely set this to True to get more informative messages. 147 | #display_args_to_stdout = False 148 | 149 | # by default (as of 1.3), Ansible will raise errors when attempting to dereference 150 | # Jinja2 variables that are not set in templates or action lines. Uncomment this line 151 | # to revert the behavior to pre-1.3. 152 | #error_on_undefined_vars = False 153 | 154 | # by default (as of 1.6), Ansible may display warnings based on the configuration of the 155 | # system running ansible itself. This may include warnings about 3rd party packages or 156 | # other conditions that should be resolved if possible. 157 | # to disable these warnings, set the following value to False: 158 | #system_warnings = True 159 | 160 | # by default (as of 1.4), Ansible may display deprecation warnings for language 161 | # features that should no longer be used and will be removed in future versions. 162 | # to disable these warnings, set the following value to False: 163 | #deprecation_warnings = True 164 | 165 | # (as of 1.8), Ansible can optionally warn when usage of the shell and 166 | # command module appear to be simplified by using a default Ansible module 167 | # instead. 
These warnings can be silenced by adjusting the following 168 | # setting or adding warn=yes or warn=no to the end of the command line 169 | # parameter string. This will for example suggest using the git module 170 | # instead of shelling out to the git command. 171 | # command_warnings = False 172 | 173 | 174 | # set plugin path directories here, separate with colons 175 | #action_plugins = /usr/share/ansible/plugins/action 176 | #cache_plugins = /usr/share/ansible/plugins/cache 177 | #callback_plugins = /usr/share/ansible/plugins/callback 178 | #connection_plugins = /usr/share/ansible/plugins/connection 179 | #lookup_plugins = /usr/share/ansible/plugins/lookup 180 | #inventory_plugins = /usr/share/ansible/plugins/inventory 181 | #vars_plugins = /usr/share/ansible/plugins/vars 182 | #filter_plugins = /usr/share/ansible/plugins/filter 183 | #test_plugins = /usr/share/ansible/plugins/test 184 | #terminal_plugins = /usr/share/ansible/plugins/terminal 185 | #strategy_plugins = /usr/share/ansible/plugins/strategy 186 | 187 | 188 | # by default, ansible will use the 'linear' strategy but you may want to try 189 | # another one 190 | #strategy = free 191 | 192 | # by default callbacks are not loaded for /bin/ansible, enable this if you 193 | # want, for example, a notification or logging callback to also apply to 194 | # /bin/ansible runs 195 | #bin_ansible_callbacks = False 196 | 197 | 198 | # don't like cows? that's unfortunate. 199 | # set to 1 if you don't want cowsay support or export ANSIBLE_NOCOWS=1 200 | #nocows = 1 201 | 202 | # set which cowsay stencil you'd like to use by default. When set to 'random', 203 | # a random stencil will be selected for each task. The selection will be filtered 204 | # against the `cow_whitelist` option below. 205 | #cow_selection = default 206 | #cow_selection = random 207 | 208 | # when using the 'random' option for cowsay, stencils will be restricted to this list. 209 | # it should be formatted as a comma-separated list with no spaces between names. 210 | # NOTE: line continuations here are for formatting purposes only, as the INI parser 211 | # in python does not support them. 212 | #cow_whitelist=bud-frogs,bunny,cheese,daemon,default,dragon,elephant-in-snake,elephant,eyes,\ 213 | # hellokitty,kitty,luke-koala,meow,milk,moofasa,moose,ren,sheep,small,stegosaurus,\ 214 | # stimpy,supermilker,three-eyes,turkey,turtle,tux,udder,vader-koala,vader,www 215 | 216 | # don't like colors either? 217 | # set to 1 if you don't want colors, or export ANSIBLE_NOCOLOR=1 218 | #nocolor = 1 219 | 220 | # if set to a persistent type (not 'memory', for example 'redis') fact values 221 | # from previous runs in Ansible will be stored. This may be useful when 222 | # wanting to use, for example, IP information from one group of servers 223 | # without having to talk to them in the same playbook run to get their 224 | # current IP information. 225 | #fact_caching = memory 226 | 227 | 228 | # retry files 229 | # When a playbook fails by default a .retry file will be created in ~/ 230 | # You can disable this feature by setting retry_files_enabled to False 231 | # and you can change the location of the files by setting retry_files_save_path 232 | 233 | retry_files_enabled = False 234 | #retry_files_save_path = ~/.ansible-retry 235 | 236 | # squash actions 237 | # Ansible can optimise actions that call modules with list parameters 238 | # when looping. Instead of calling the module once per with_ item, the 239 | # module is called once with all items at once. 
Currently this only works 240 | # under limited circumstances, and only with parameters named 'name'. 241 | #squash_actions = apk,apt,dnf,homebrew,pacman,pkgng,yum,zypper 242 | 243 | # prevents logging of task data, off by default 244 | #no_log = False 245 | 246 | # prevents logging of tasks, but only on the targets, data is still logged on the master/controller 247 | #no_target_syslog = False 248 | 249 | # controls whether Ansible will raise an error or warning if a task has no 250 | # choice but to create world readable temporary files to execute a module on 251 | # the remote machine. This option is False by default for security. Users may 252 | # turn this on to have behaviour more like Ansible prior to 2.1.x. See 253 | # https://docs.ansible.com/ansible/become.html#becoming-an-unprivileged-user 254 | # for more secure ways to fix this than enabling this option. 255 | #allow_world_readable_tmpfiles = False 256 | 257 | # controls the compression level of variables sent to 258 | # worker processes. At the default of 0, no compression 259 | # is used. This value must be an integer from 0 to 9. 260 | #var_compression_level = 9 261 | 262 | # controls what compression method is used for new-style ansible modules when 263 | # they are sent to the remote system. The compression types depend on having 264 | # support compiled into both the controller's python and the client's python. 265 | # The names should match with the python Zipfile compression types: 266 | # * ZIP_STORED (no compression. available everywhere) 267 | # * ZIP_DEFLATED (uses zlib, the default) 268 | # These values may be set per host via the ansible_module_compression inventory 269 | # variable 270 | #module_compression = 'ZIP_DEFLATED' 271 | 272 | # This controls the cutoff point (in bytes) on --diff for files 273 | # set to 0 for unlimited (RAM may suffer!). 274 | #max_diff_size = 1048576 275 | 276 | # This controls how ansible handles multiple --tags and --skip-tags arguments 277 | # on the CLI. If this is True then multiple arguments are merged together. If 278 | # it is False, then the last specified argument is used and the others are ignored. 279 | #merge_multiple_cli_flags = False 280 | 281 | # Controls showing custom stats at the end, off by default 282 | #show_custom_stats = True 283 | 284 | # Controls which files to ignore when using a directory as inventory with 285 | # possibly multiple sources (both static and dynamic) 286 | #inventory_ignore_extensions = ~, .orig, .bak, .ini, .cfg, .retry, .pyc, .pyo 287 | 288 | # This family of modules use an alternative execution path optimized for network appliances 289 | # only update this setting if you know how this works, otherwise it can break module execution 290 | #network_group_modules=['eos', 'nxos', 'ios', 'iosxr', 'junos', 'vyos'] 291 | 292 | # When enabled, this option allows lookups (via variables like {{lookup('foo')}} or when used as 293 | # a loop with `with_foo`) to return data that is not marked "unsafe". This means the data may contain 294 | # jinja2 templating language which will be run through the templating engine. 295 | # ENABLING THIS COULD BE A SECURITY RISK 296 | #allow_unsafe_lookups = False 297 | 298 | [privilege_escalation] 299 | become=True 300 | become_method=sudo 301 | become_user=root 302 | become_ask_pass=False 303 | 304 | [paramiko_connection] 305 | 306 | # uncomment this line to cause the paramiko connection plugin to not record new host 307 | # keys encountered. Increases performance on new host additions. 
Setting works independently of the 308 | # host key checking setting above. 309 | #record_host_keys=False 310 | 311 | # by default, Ansible requests a pseudo-terminal for commands executed under sudo. Uncomment this 312 | # line to disable this behaviour. 313 | #pty=False 314 | 315 | # paramiko will default to looking for SSH keys initially when trying to 316 | # authenticate to remote devices. This is a problem for some network devices 317 | # that close the connection after a key failure. Uncomment this line to 318 | # disable the Paramiko look for keys function 319 | #look_for_keys = False 320 | 321 | # When using persistent connections with Paramiko, the connection runs in a 322 | # background process. If the host doesn't already have a valid SSH key, by 323 | # default Ansible will prompt to add the host key. This will cause connections 324 | # running in background processes to fail. Uncomment this line to have 325 | # Paramiko automatically add host keys. 326 | #host_key_auto_add = True 327 | 328 | [ssh_connection] 329 | 330 | # ssh arguments to use 331 | # Leaving off ControlPersist will result in poor performance, so use 332 | # paramiko on older platforms rather than removing it, -C controls compression use 333 | ssh_args = -F /etc/ansible/ssh.cfg -C -o ControlMaster=auto -o ControlPersist=60s 334 | 335 | # The base directory for the ControlPath sockets. 336 | # This is the "%(directory)s" in the control_path option 337 | # 338 | # Example: 339 | # control_path_dir = /tmp/.ansible/cp 340 | control_path_dir = /tmp/.ansible/cp 341 | 342 | # The path to use for the ControlPath sockets. This defaults to a hashed string of the hostname, 343 | # port and username (empty string in the config). The hash mitigates a common problem users 344 | # found with long hostames and the conventional %(directory)s/ansible-ssh-%%h-%%p-%%r format. 345 | # In those cases, a "too long for Unix domain socket" ssh error would occur. 346 | # 347 | # Example: 348 | # control_path = %(directory)s/%%h-%%r 349 | #control_path = 350 | 351 | # Enabling pipelining reduces the number of SSH operations required to 352 | # execute a module on the remote server. This can result in a significant 353 | # performance improvement when enabled, however when using "sudo:" you must 354 | # first disable 'requiretty' in /etc/sudoers 355 | # 356 | # By default, this option is disabled to preserve compatibility with 357 | # sudoers configurations that have requiretty (the default on many distros). 358 | # 359 | #pipelining = False 360 | 361 | # Control the mechanism for transferring files (old) 362 | # * smart = try sftp and then try scp [default] 363 | # * True = use scp only 364 | # * False = use sftp only 365 | #scp_if_ssh = smart 366 | 367 | # Control the mechanism for transferring files (new) 368 | # If set, this will override the scp_if_ssh option 369 | # * sftp = use sftp to transfer files 370 | # * scp = use scp to transfer files 371 | # * piped = use 'dd' over SSH to transfer files 372 | # * smart = try sftp, scp, and piped, in that order [default] 373 | #transfer_method = smart 374 | 375 | # if False, sftp will not use batch mode to transfer files. This may cause some 376 | # types of file transfer failures impossible to catch however, and should 377 | # only be disabled if your sftp version has problems with batch mode 378 | #sftp_batch_mode = False 379 | 380 | [persistent_connection] 381 | 382 | # Configures the persistent connection timeout value in seconds. 
This value is 383 | # how long the persistent connection will remain idle before it is destroyed. 384 | # If the connection doesn't receive a request before the timeout value 385 | # expires, the connection is shutdown. The default value is 30 seconds. 386 | connect_timeout = 30 387 | 388 | # Configures the persistent connection retries. This value configures the 389 | # number of attempts the ansible-connection will make when trying to connect 390 | # to the local domain socket. The default value is 30. 391 | connect_retries = 30 392 | 393 | # Configures the amount of time in seconds to wait between connection attempts 394 | # to the local unix domain socket. This value works in conjunction with the 395 | # connect_retries value to define how long to try to connect to the local 396 | # domain socket when setting up a persistent connection. The default value is 397 | # 1 second. 398 | connect_interval = 1 399 | 400 | [accelerate] 401 | #accelerate_port = 5099 402 | #accelerate_timeout = 30 403 | #accelerate_connect_timeout = 5.0 404 | 405 | # The daemon timeout is measured in minutes. This time is measured 406 | # from the last activity to the accelerate daemon. 407 | #accelerate_daemon_timeout = 30 408 | 409 | # If set to yes, accelerate_multi_key will allow multiple 410 | # private keys to be uploaded to it, though each user must 411 | # have access to the system via SSH to add a new key. The default 412 | # is "no". 413 | #accelerate_multi_key = yes 414 | 415 | [selinux] 416 | # file systems that require special treatment when dealing with security context 417 | # the default behaviour that copies the existing context or uses the user default 418 | # needs to be changed to use the file system dependent context. 419 | #special_context_filesystems=nfs,vboxsf,fuse,ramfs,9p 420 | 421 | # Set this to yes to allow libvirt_lxc connections to work without SELinux. 422 | #libvirt_lxc_noseclabel = yes 423 | 424 | [colors] 425 | #highlight = white 426 | #verbose = blue 427 | #warn = bright purple 428 | #error = red 429 | #debug = dark gray 430 | #deprecate = purple 431 | #skip = cyan 432 | #unreachable = red 433 | #ok = green 434 | #changed = yellow 435 | #diff_add = green 436 | #diff_remove = red 437 | #diff_lines = cyan 438 | 439 | 440 | [diff] 441 | # Always print diff when running ( same as always running with -D/--diff ) 442 | # always = no 443 | 444 | # Set how many context lines to show in diff 445 | # context = 3 446 | -------------------------------------------------------------------------------- /clean-os-error.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /etc/slurm/openrc.sh 4 | 5 | os_error_list=$(openstack server list --status ERROR -f value -c ID) 6 | logfile=/var/log/slurm/os_clean.log 7 | 8 | for host_id in $os_error_list 9 | do 10 | echo "Removing OS_HOST $host_id" >> $logfile 2>&1 11 | openstack server show $host_id >> $logfile 2>&1 12 | openstack server delete $host_id >> $logfile 2>&1 13 | done 14 | -------------------------------------------------------------------------------- /cluster_create.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #This script makes several assumptions: 4 | # 1. Running on a host with openstack client tools installed 5 | # 2. Using a default ssh key in ~/.ssh/ 6 | # 3. The user knows what they're doing. 7 | # 4. 
Take some options: 8 | # openrc file 9 | # headnode size 10 | # cluster name 11 | # volume size 12 | 13 | show_help() { 14 | echo "Options: 15 | -n: HEADNODE_NAME: required, name of the cluster 16 | -o: OPENRC_PATH: optional, path to a valid openrc file, default is ./openrc.sh 17 | -s: HEADNODE_SIZE: optional, size of the headnode in Openstack flavor (default: m1.small) 18 | -v: VOLUME_SIZE: optional, size of storage volume in GB, volume not created if 0 19 | -d: DOCKER_ALLOW: optional flag, leave docker installed on headnode if set. 20 | -j: JUPYTERHUB_BUILD: optional flag, install jupyterhub with SSL certs. 21 | 22 | Usage: $0 -n [HEADNODE_NAME] -o [OPENRC_PATH] -v [VOLUME_SIZE] -s [HEADNODE_SIZE] [-d]" 23 | } 24 | 25 | OPTIND=1 26 | 27 | openrc_path="./openrc.sh" 28 | headnode_size="m1.small" 29 | headnode_name="noname" 30 | volume_size="0" 31 | install_opts="" 32 | 33 | while getopts ":jdhhelp:n:o:s:v:" opt; do 34 | case ${opt} in 35 | h|help|\?) show_help 36 | exit 0 37 | ;; 38 | d) install_opts+="-d " 39 | ;; 40 | j) install_opts+="-j " 41 | ;; 42 | o) openrc_path=${OPTARG} 43 | ;; 44 | s) headnode_size=${OPTARG} 45 | ;; 46 | v) volume_size=${OPTARG} 47 | ;; 48 | n) headnode_name=${OPTARG} 49 | ;; 50 | :) echo "Option -$OPTARG requires an argument." 51 | exit 1 52 | ;; 53 | 54 | esac 55 | done 56 | 57 | 58 | if [[ ! -f ${openrc_path} ]]; then 59 | echo "openrc path: ${openrc_path} \n does not point to a file!" 60 | exit 1 61 | fi 62 | 63 | #Move this to allow for error checking of OS conflicts 64 | source ${openrc_path} 65 | 66 | if [[ -z $( echo ${headnode_size} | grep -E '^m1|^m2|^g1|^g2' ) ]]; then 67 | echo "Headnode size ${headnode_size} is not a valid JS instance size!" 68 | exit 1 69 | elif [[ -n "$(echo ${volume_size} | tr -d [0-9])" ]]; then 70 | echo "Volume size must be numeric only, in units of GB." 71 | exit 1 72 | elif [[ ${headnode_name} == "noname" ]]; then 73 | echo "No headnode name provided with -n, exiting!" 74 | exit 1 75 | elif [[ -n $(openstack server list | grep -i ${headnode_name}) ]]; then 76 | echo "Cluster name [${headnode_name}] conficts with existing Openstack entity!" 77 | exit 1 78 | elif [[ -n $(openstack volume list | grep -i ${headnode_name}-storage) ]]; then 79 | echo "Volume name [${headnode_name}-storage] conficts with existing Openstack entity!" 80 | exit 1 81 | fi 82 | 83 | if [[ ! -e ${HOME}/.ssh/id_rsa.pub ]]; then 84 | #This may be temporary... but seems fairly reasonable. 85 | echo "NO KEY FOUND IN ${HOME}/.ssh/id_rsa.pub! - please create one and re-run!" 86 | exit 87 | fi 88 | 89 | volume_name="${headnode_name}-storage" 90 | 91 | # Defining a function here to check for quotas, and exit if this script will cause problems! 92 | # also, storing 'quotas' in a global var, so we're not calling it every single time 93 | quotas=$(openstack quota show) 94 | quota_check () 95 | { 96 | quota_name=$1 97 | type_name=$2 #the name for a quota and the name for the thing itself are not the same 98 | number_created=$3 #number of the thing that we'll create here. 
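# Intended to return 0 when the project quota (from `openstack quota show`)
# still has room to create ${number_created} more of the given resource, and 1
# otherwise; note that the callers below do not currently act on this return value.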
99 | 100 | current_num=$(openstack ${type_name} list -f value | wc -l) 101 | 102 | max_types=$(echo "${quotas}" | awk -v quota=${quota_name} '$0 ~ quota {print $4}') 103 | 104 | #echo "checking quota for ${quota_name} of ${type_name} to create ${number_created} - want ${current_num} to be less than ${max_types}" 105 | 106 | if [[ "${current_num}" -lt "$((max_types + number_created))" ]]; then 107 | return 0 108 | fi 109 | return 1 110 | } 111 | 112 | 113 | quota_check "secgroups" "security group" 1 114 | quota_check "networks" "network" 1 115 | quota_check "subnets" "subnet" 1 116 | quota_check "routers" "router" 1 117 | quota_check "key-pairs" "keypair" 1 118 | quota_check "instances" "server" 1 119 | 120 | #These must match those defined in install.sh, slurm_resume.sh, compute_build_base_img.yml 121 | # and compute_take_snapshot.sh, which ASSUME the headnode_name convention has not been deviated from. 122 | 123 | OS_PREFIX=${headnode_name} 124 | OS_NETWORK_NAME=${OS_PREFIX}-elastic-net 125 | OS_SUBNET_NAME=${OS_PREFIX}-elastic-subnet 126 | OS_ROUTER_NAME=${OS_PREFIX}-elastic-router 127 | OS_SSH_SECGROUP_NAME=${OS_PREFIX}-ssh-global 128 | OS_INTERNAL_SECGROUP_NAME=${OS_PREFIX}-internal 129 | OS_HTTP_S_SECGROUP_NAME=${OS_PREFIX}-http-s 130 | OS_KEYPAIR_NAME=${OS_USERNAME}-elastic-key 131 | OS_APP_CRED=${OS_PREFIX}-slurm-app-cred 132 | 133 | # This will allow for customization of the 1st 24 bits of the subnet range 134 | # The last 8 will be assumed open (netmask 255.255.255.0 or /24) 135 | # because going beyond that requires a general mechanism for translation from CIDR 136 | # to wildcard notation for ssh.cfg and compute_build_base_img.yml 137 | # which is assumed to be beyond the scope of this project. 138 | # If there is a maintainable mechanism for this, of course, please let us know! 139 | SUBNET_PREFIX=10.0.0 140 | 141 | 142 | # Ensure that the correct private network/router/subnet exists 143 | if [[ -z "$(openstack network list | grep ${OS_NETWORK_NAME})" ]]; then 144 | openstack network create ${OS_NETWORK_NAME} 145 | openstack subnet create --network ${OS_NETWORK_NAME} --subnet-range ${SUBNET_PREFIX}.0/24 ${OS_SUBNET_NAME} 146 | fi 147 | ##openstack subnet list 148 | if [[ -z "$(openstack router list | grep ${OS_ROUTER_NAME})" ]]; then 149 | openstack router create ${OS_ROUTER_NAME} 150 | openstack router add subnet ${OS_ROUTER_NAME} ${OS_SUBNET_NAME} 151 | openstack router set --external-gateway public ${OS_ROUTER_NAME} 152 | fi 153 | 154 | security_groups=$(openstack security group list -f value) 155 | if [[ ! ("${security_groups}" =~ "${OS_SSH_SECGROUP_NAME}") ]]; then 156 | openstack security group create --description "ssh \& icmp enabled" ${OS_SSH_SECGROUP_NAME} 157 | openstack security group rule create --protocol tcp --dst-port 22:22 --remote-ip 0.0.0.0/0 ${OS_SSH_SECGROUP_NAME} 158 | openstack security group rule create --protocol icmp ${OS_SSH_SECGROUP_NAME} 159 | fi 160 | if [[ ! ("${security_groups}" =~ "${OS_INTERNAL_SECGROUP_NAME}") ]]; then 161 | openstack security group create --description "internal group for cluster" ${OS_INTERNAL_SECGROUP_NAME} 162 | openstack security group rule create --protocol tcp --dst-port 1:65535 --remote-ip ${SUBNET_PREFIX}.0/24 ${OS_INTERNAL_SECGROUP_NAME} 163 | openstack security group rule create --protocol icmp ${OS_INTERNAL_SECGROUP_NAME} 164 | fi 165 | if [[ (! 
("${security_groups}" =~ "${OS_HTTP_S_SECGROUP_NAME}")) && "${install_opts}" =~ "j" ]]; then 166 | openstack security group create --description "http/s for jupyterhub" ${OS_HTTP_S_SECGROUP_NAME} 167 | openstack security group rule create --protocol tcp --dst-port 80 --remote-ip 0.0.0.0/0 ${OS_HTTP_S_SECGROUP_NAME} 168 | openstack security group rule create --protocol tcp --dst-port 443 --remote-ip 0.0.0.0/0 ${OS_HTTP_S_SECGROUP_NAME} 169 | fi 170 | 171 | #Check if ${HOME}/.ssh/id_rsa.pub exists in JS 172 | if [[ -e ${HOME}/.ssh/id_rsa.pub ]]; then 173 | home_key_fingerprint=$(ssh-keygen -l -E md5 -f ${HOME}/.ssh/id_rsa.pub | sed 's/.*MD5:\(\S*\) .*/\1/') 174 | fi 175 | openstack_keys=$(openstack keypair list -f value) 176 | 177 | home_key_in_OS=$(echo "${openstack_keys}" | awk -v mykey="${home_key_fingerprint}" '$2 ~ mykey {print $1}') 178 | 179 | if [[ -n "${home_key_in_OS}" ]]; then 180 | #RESET this to key that's already in OS 181 | OS_KEYPAIR_NAME=${home_key_in_OS} 182 | elif [[ -n $(echo "${openstack_keys}" | grep ${OS_KEYPAIR_NAME}) ]]; then 183 | openstack keypair delete ${OS_KEYPAIR_NAME} 184 | # This doesn't need to depend on the OS_PROJECT_NAME, as the slurm-key does, in install.sh and slurm_resume 185 | openstack keypair create --public-key ${HOME}/.ssh/id_rsa.pub ${OS_KEYPAIR_NAME} 186 | else 187 | # This doesn't need to depend on the OS_PROJECT_NAME, as the slurm-key does, in install.sh and slurm_resume 188 | openstack keypair create --public-key ${HOME}/.ssh/id_rsa.pub ${OS_KEYPAIR_NAME} 189 | fi 190 | 191 | #centos_base_image=$(openstack image list --status active | grep -iE "API-Featured-centos7-[[:alpha:]]{3,4}-[0-9]{2}-[0-9]{4}" | awk '{print $4}' | tail -n 1) 192 | centos_base_image="JS-API-Featured-CentOS8-Latest" 193 | 194 | #Now, generate an Openstack Application Credential for use on the cluster 195 | export $(openstack application credential create -f shell ${OS_APP_CRED} | sed 's/^\(.*\)/OS_ac_\1/') 196 | 197 | #Write it to a temporary file 198 | echo -e "export OS_AUTH_TYPE=v3applicationcredential 199 | export OS_AUTH_URL=${OS_AUTH_URL} 200 | export OS_IDENTITY_API_VERSION=3 201 | export OS_REGION_NAME="RegionOne" 202 | export OS_INTERFACE=public 203 | export OS_APPLICATION_CREDENTIAL_ID=${OS_ac_id} 204 | export OS_APPLICATION_CREDENTIAL_SECRET=${OS_ac_secret}" > ./openrc-app.sh 205 | 206 | #Function to generate file: sections for cloud-init config files 207 | # arguments are owner path permissions file_to_be_copied 208 | # All calls to this must come after an "echo "write_files:\n" 209 | generate_write_files () { 210 | #This is generating YAML, so... spaces are important. 211 | echo -e " - encoding: b64\n owner: $1\n path: $2\n permissions: $3\n content: |\n$(cat $4 | base64 | sed 's/^/ /')" 212 | } 213 | 214 | user_data="$(cat ./prevent-updates.ci)\n" 215 | user_data+="$(echo -e "write_files:")\n" 216 | user_data+="$(generate_write_files "slurm" "/etc/slurm/openrc.sh" "0400" "./openrc-app.sh")\n" 217 | 218 | #Clean up! 
219 | rm ./openrc-app.sh 220 | 221 | echo -e "openstack server create\ 222 | --user-data <(echo -e "${user_data}") \ 223 | --flavor ${headnode_size} \ 224 | --image ${centos_base_image} \ 225 | --key-name ${OS_KEYPAIR_NAME} \ 226 | --security-group ${OS_SSH_SECGROUP_NAME} \ 227 | --security-group ${OS_INTERNAL_SECGROUP_NAME} \ 228 | --nic net-id=${OS_NETWORK_NAME} \ 229 | ${headnode_name}" 230 | 231 | openstack server create \ 232 | --user-data <(echo -e "${user_data}") \ 233 | --flavor ${headnode_size} \ 234 | --image ${centos_base_image} \ 235 | --key-name ${OS_KEYPAIR_NAME} \ 236 | --security-group ${OS_SSH_SECGROUP_NAME} \ 237 | --security-group ${OS_INTERNAL_SECGROUP_NAME} \ 238 | --nic net-id=${OS_NETWORK_NAME} \ 239 | ${headnode_name} 240 | 241 | public_ip=$(openstack floating ip create public | awk '/floating_ip_address/ {print $4}') 242 | #For some reason there's a time issue here - adding a sleep command to allow network to become ready 243 | sleep 10 244 | openstack server add floating ip ${headnode_name} ${public_ip} 245 | 246 | hostname_test=$(ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no centos@${public_ip} 'hostname') 247 | echo "test1: ${hostname_test}" 248 | until [[ ${hostname_test} =~ "${headnode_name}" ]]; do 249 | sleep 2 250 | hostname_test=$(ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no centos@${public_ip} 'hostname') 251 | echo "ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no centos@${public_ip} 'hostname'" 252 | echo "test2: ${hostname_test}" 253 | done 254 | 255 | rsync -qa --exclude="openrc.sh" -e 'ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' ${PWD} centos@${public_ip}: 256 | 257 | if [[ "${volume_size}" != "0" ]]; then 258 | echo "Creating volume ${volume_name} of ${volume_size} GB" 259 | openstack volume create --size ${volume_size} ${volume_name} 260 | openstack server add volume --device /dev/sdb ${headnode_name} ${volume_name} 261 | sleep 5 # To fix a wait issue in volume creation 262 | ssh -o StrictHostKeyChecking=no centos@${public_ip} 'sudo mkfs.xfs /dev/sdb && sudo mkdir -m 777 /export' 263 | vol_uuid=$(ssh centos@${public_ip} 'sudo blkid /dev/sdb | sed "s|.*UUID=\"\(.\{36\}\)\" .*|\1|"') 264 | echo "volume uuid is: ${vol_uuid}" 265 | ssh centos@${public_ip} "echo -e \"UUID=${vol_uuid} /export xfs defaults 0 0\" | sudo tee -a /etc/fstab && sudo mount -a" 266 | echo "Volume sdb has UUID ${vol_uuid} on ${public_ip}" 267 | if [[ ${docker_allow} == 1 ]]; then 268 | ssh centos@${public_ip} "echo -E '{ \"data-root\": \"/export/docker\" }' | sudo tee -a /etc/docker/daemon.json && sudo systemctl restart docker" 269 | fi 270 | 271 | fi 272 | 273 | if [[ "${install_opts}" =~ "-j" ]]; then 274 | openstack server add security group ${headnode_name} ${OS_HTTP_S_SECGROUP_NAME} 275 | fi 276 | 277 | echo "Copied over VC files, beginning Slurm installation and Compute Image configuration - should take 8-10 minutes." 278 | 279 | #Since PWD on localhost has the full path, we only want the current directory name 280 | ssh -o StrictHostKeyChecking=no centos@${public_ip} "cd ./${PWD##*/} && sudo ./install.sh ${install_opts}" 281 | 282 | echo "You should be able to login to your headnode with your Jetstream key: ${OS_KEYPAIR_NAME}, at ${public_ip}" 283 | 284 | if [[ ${install_opts} =~ "-j" ]]; then 285 | echo "You will need to edit the file ${PWD}/install_jupyterhub.yml to reflect the public hostname of your new cluster, and use your email for SSL certs." 
286 | echo "Then, run the following command from the directory ${PWD} ON THE NEW HEADNODE to complete your jupyterhub setup:" 287 | echo "sudo ansible-playbook -v --ssh-common-args='-o StrictHostKeyChecking=no' install_jupyterhub.yml" 288 | fi 289 | -------------------------------------------------------------------------------- /cluster_create_local.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Uncomment below to help with debugging 4 | set -x 5 | 6 | #This script makes several assumptions: 7 | # 1. Running on a host with openstack client tools installed 8 | # 2. Using a default ssh key in ~/.ssh/ 9 | # 3. The user knows what they're doing. 10 | # 4. Take some options: 11 | # openrc file 12 | # volume size 13 | 14 | show_help() { 15 | echo "Options: 16 | -o: OPENRC_PATH: optional, path to a valid openrc file, default is ~/openrc.sh 17 | -v: VOLUME_SIZE: optional, size of storage volume in GB, volume not created if 0 18 | -d: DOCKER_ALLOW: optional flag, leave docker installed on headnode if set. 19 | -j: JUPYTERHUB_BUILD: optional flag, install jupyterhub with SSL certs. 20 | 21 | Usage: $0 -o [OPENRC_PATH] -v [VOLUME_SIZE] [-d]" 22 | } 23 | 24 | OPTIND=1 25 | 26 | openrc_path="${HOME}/openrc.sh" 27 | volume_size="0" 28 | install_opts="" 29 | 30 | while getopts ":jdhhelp:n:o:s:v:" opt; do 31 | case ${opt} in 32 | h|help|\?) show_help 33 | exit 0 34 | ;; 35 | d) install_opts+="-d " 36 | ;; 37 | j) install_opts+="-j " 38 | ;; 39 | o) openrc_path=${OPTARG} 40 | ;; 41 | v) volume_size=${OPTARG} 42 | ;; 43 | :) echo "Option -$OPTARG requires an argument." 44 | exit 1 45 | ;; 46 | 47 | esac 48 | done 49 | 50 | sudo pip3 install openstacksdk==0.61.0 51 | sudo pip3 install python-openstackclient 52 | sudo ln -s /usr/local/bin/openstack /usr/bin/openstack 53 | 54 | if [[ ! -f ${openrc_path} ]]; then 55 | echo "openrc path: ${openrc_path} \n does not point to a file!" 56 | exit 1 57 | fi 58 | 59 | headnode_name="$(hostname --short)" 60 | 61 | #Move this to allow for error checking of OS conflicts 62 | source ${openrc_path} 63 | 64 | if [[ -n "$(echo ${volume_size} | tr -d [0-9])" ]]; then 65 | echo "Volume size must be numeric only, in units of GB." 66 | exit 1 67 | elif [[ -n $(openstack volume list | grep -i ${headnode_name}-storage) ]]; then 68 | echo "Volume name [${headnode_name}-storage] conficts with existing Openstack entity!" 69 | exit 1 70 | fi 71 | 72 | if [[ ! -e ${HOME}/.ssh/id_rsa.pub ]]; then 73 | ssh-keygen -q -N "" -f ${HOME}/.ssh/id_rsa 74 | fi 75 | 76 | volume_name="${headnode_name}-storage" 77 | 78 | # Defining a function here to check for quotas, and exit if this script will cause problems! 79 | # also, storing 'quotas' in a global var, so we're not calling it every single time 80 | quotas=$(openstack quota show) 81 | quota_check () 82 | { 83 | quota_name=$1 84 | type_name=$2 #the name for a quota and the name for the thing itself are not the same 85 | number_created=$3 #number of the thing that we'll create here. 
86 | 87 | current_num=$(openstack ${type_name} list -f value | wc -l) 88 | 89 | max_types=$(echo "${quotas}" | awk -v quota=${quota_name} '$0 ~ quota {print $4}') 90 | 91 | #echo "checking quota for ${quota_name} of ${type_name} to create ${number_created} - want ${current_num} to be less than ${max_types}" 92 | 93 | if [[ "${current_num}" -lt "$((max_types + number_created))" ]]; then 94 | return 0 95 | fi 96 | return 1 97 | } 98 | 99 | 100 | quota_check "secgroups" "security group" 1 101 | quota_check "networks" "network" 1 102 | quota_check "subnets" "subnet" 1 103 | quota_check "routers" "router" 1 104 | quota_check "key-pairs" "keypair" 1 105 | quota_check "instances" "server" 1 106 | 107 | #These must match those defined in install.sh, slurm_resume.sh, compute_build_base_img.yml 108 | # and compute_take_snapshot.sh, which ASSUME the headnode_name convention has not been deviated from. 109 | 110 | OS_PREFIX=${headnode_name} 111 | OS_SSH_SECGROUP_NAME=${OS_PREFIX}-ssh-global 112 | OS_INTERNAL_SECGROUP_NAME=${OS_PREFIX}-internal 113 | OS_HTTP_S_SECGROUP_NAME=${OS_PREFIX}-http-s 114 | OS_KEYPAIR_NAME=${OS_PREFIX}-elastic-key 115 | 116 | HEADNODE_NETWORK=$(openstack server show $(hostname -s) | grep addresses | awk -F'|' '{print $3}' | awk -F'=' '{print $1}' | awk '{$1=$1};1') 117 | HEADNODE_IP=$(openstack server show $(hostname -s) | grep addresses | awk -F'|' '{print $3}' | awk -F'=' '{print $2}' | awk -F',' '{print $1}') 118 | SUBNET=$(ip addr | grep $HEADNODE_IP | awk '{print $2}') 119 | 120 | echo "Headnode network name ${HEADNODE_NETWORK}" 121 | echo "Headnode ip ${HEADNODE_IP}" 122 | echo "Subnet ${SUBNET}" 123 | 124 | # This will allow for customization of the 1st 24 bits of the subnet range 125 | # The last 8 will be assumed open (netmask 255.255.255.0 or /24) 126 | # because going beyond that requires a general mechanism for translation from CIDR 127 | # to wildcard notation for ssh.cfg and compute_build_base_img.yml 128 | # which is assumed to be beyond the scope of this project. 129 | # If there is a maintainable mechanism for this, of course, please let us know! 130 | 131 | security_groups=$(openstack security group list -f value) 132 | if [[ ! ("${security_groups}" =~ "${OS_SSH_SECGROUP_NAME}") ]]; then 133 | openstack security group create --description "ssh \& icmp enabled" ${OS_SSH_SECGROUP_NAME} 134 | openstack security group rule create --protocol tcp --dst-port 22:22 --remote-ip 0.0.0.0/0 ${OS_SSH_SECGROUP_NAME} 135 | openstack security group rule create --protocol icmp ${OS_SSH_SECGROUP_NAME} 136 | fi 137 | if [[ ! ("${security_groups}" =~ "${OS_INTERNAL_SECGROUP_NAME}") ]]; then 138 | openstack security group create --description "internal group for cluster" ${OS_INTERNAL_SECGROUP_NAME} 139 | openstack security group rule create --protocol tcp --dst-port 1:65535 --remote-ip ${SUBNET} ${OS_INTERNAL_SECGROUP_NAME} 140 | openstack security group rule create --protocol icmp ${OS_INTERNAL_SECGROUP_NAME} 141 | fi 142 | if [[ (! 
("${security_groups}" =~ "${OS_HTTP_S_SECGROUP_NAME}")) && "${install_opts}" =~ "j" ]]; then 143 | openstack security group create --description "http/s for jupyterhub" ${OS_HTTP_S_SECGROUP_NAME} 144 | openstack security group rule create --protocol tcp --dst-port 80 --remote-ip 0.0.0.0/0 ${OS_HTTP_S_SECGROUP_NAME} 145 | openstack security group rule create --protocol tcp --dst-port 443 --remote-ip 0.0.0.0/0 ${OS_HTTP_S_SECGROUP_NAME} 146 | fi 147 | 148 | #Check if ${HOME}/.ssh/id_rsa.pub exists in JS 149 | if [[ -e ${HOME}/.ssh/id_rsa.pub ]]; then 150 | home_key_fingerprint=$(ssh-keygen -l -E md5 -f ${HOME}/.ssh/id_rsa.pub | sed 's/.*MD5:\(\S*\) .*/\1/') 151 | fi 152 | openstack_keys=$(openstack keypair list -f value) 153 | 154 | home_key_in_OS=$(echo "${openstack_keys}" | awk -v mykey="${home_key_fingerprint}" '$2 ~ mykey {print $1}') 155 | 156 | if [[ -n "${home_key_in_OS}" ]]; then 157 | #RESET this to key that's already in OS 158 | OS_KEYPAIR_NAME=${home_key_in_OS} 159 | elif [[ -n $(echo "${openstack_keys}" | grep ${OS_KEYPAIR_NAME}) ]]; then 160 | openstack keypair delete ${OS_KEYPAIR_NAME} 161 | # This doesn't need to depend on the OS_PROJECT_NAME, as the slurm-key does, in install.sh and slurm_resume 162 | openstack keypair create --public-key ${HOME}/.ssh/id_rsa.pub ${OS_KEYPAIR_NAME} 163 | else 164 | # This doesn't need to depend on the OS_PROJECT_NAME, as the slurm-key does, in install.sh and slurm_resume 165 | openstack keypair create --public-key ${HOME}/.ssh/id_rsa.pub ${OS_KEYPAIR_NAME} 166 | fi 167 | 168 | SERVER_UUID=$(curl http://169.254.169.254/openstack/latest/meta_data.json | jq '.uuid' | sed -e 's#"##g') 169 | 170 | server_security_groups=$(openstack server show -f value -c security_groups ${SERVER_UUID} | sed -e "s#name=##" -e "s#'##g" | paste -s -) 171 | 172 | if [[ ! ("${server_security_groups}" =~ "${OS_SSH_SECGROUP_NAMEOS_SSH_SECGROUP_NAME}") ]]; then 173 | echo -e "openstack server add security group ${SERVER_UUID} ${OS_SSH_SECGROUP_NAME}" 174 | openstack server add security group ${SERVER_UUID} ${OS_SSH_SECGROUP_NAME} 175 | fi 176 | 177 | if [[ ! ("${server_security_groups}" =~ "${OS_INTERNAL_SECGROUP_NAME}") ]]; then 178 | echo -e "openstack server add security group ${SERVER_UUID} ${OS_INTERNAL_SECGROUP_NAME}" 179 | openstack server add security group ${SERVER_UUID} ${OS_INTERNAL_SECGROUP_NAME} 180 | fi 181 | 182 | if [[ "${volume_size}" != "0" ]]; then 183 | echo "Creating volume ${volume_name} of ${volume_size} GB" 184 | openstack volume create --size ${volume_size} ${volume_name} 185 | openstack server add volume --device /dev/sdb ${SERVER_UUID} ${volume_name} 186 | sleep 5 # To fix a wait issue in volume creation 187 | sudo mkfs.xfs /dev/sdb && sudo mkdir -m 777 /export 188 | vol_uuid=$(sudo blkid /dev/sdb | sed "s|.*UUID=\"\(.\{36\}\)\" .*|\1|") 189 | echo "volume uuid is: ${vol_uuid}" 190 | echo -e \"UUID=${vol_uuid} /export xfs defaults 0 0\" | sudo tee -a /etc/fstab && sudo mount -a 191 | echo "Volume sdb has UUID ${vol_uuid}" 192 | if [[ ${docker_allow} == 1 ]]; then 193 | echo -E '{ \"data-root\": \"/export/docker\" }' | sudo tee -a /etc/docker/daemon.json && sudo systemctl restart docker 194 | fi 195 | 196 | fi 197 | 198 | if [[ (! ("${server_security_groups}" =~ "${OS_HTTP_S_SECGROUP_NAME}")) && "${install_opts}" =~ "-j" ]]; then 199 | openstack server add security group ${SERVER_UUID} ${OS_HTTP_S_SECGROUP_NAME} 200 | fi 201 | 202 | echo "Beginning Slurm installation and Compute Image configuration - should take 8-10 minutes." 
203 | 204 | sudo mkdir -p /etc/slurm 205 | sudo cp "${openrc_path}" /etc/slurm/openrc.sh 206 | sudo chmod 400 /etc/slurm/openrc.sh 207 | 208 | sudo ./install_local.sh ${install_opts} 209 | 210 | if [[ ${install_opts} =~ "-j" ]]; then 211 | echo "You will need to edit the file ${PWD}/install_jupyterhub.yml to reflect the public hostname of your new cluster, and use your email for SSL certs." 212 | echo "Then, run the following command from the directory ${PWD} on this instance to complete your jupyterhub setup:" 213 | echo "sudo ansible-playbook -v --ssh-common-args='-o StrictHostKeyChecking=no' install_jupyterhub.yml" 214 | fi 215 | -------------------------------------------------------------------------------- /cluster_destroy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #This script makes several assumptions: 4 | # 1. Running on a host with openstack client tools installed 5 | # 2. Using a default ssh key in ~/.ssh/ 6 | # 3. The user knows what they're doing. 7 | # 4. Take some options: 8 | # openrc file 9 | # headnode size 10 | # cluster name 11 | # volume size 12 | 13 | show_help() { 14 | echo "Options: 15 | HEADNODE_NAME: required, name of the cluster to delete 16 | OPENRC_PATH: optional, path to a valid openrc file, defaults to ./openrc.sh 17 | VOLUME_DELETE: optional flag, set to delete storage volumes, default false 18 | 19 | Usage: $0 -n -o [OPENRC_PATH] [-v] " 20 | } 21 | 22 | OPTIND=1 23 | 24 | openrc_path="./openrc.sh" 25 | headnode_name="noname" 26 | volume_delete="0" 27 | 28 | while getopts ":hhelp:o:n:v" opt; do 29 | case ${opt} in 30 | h|help|\?) show_help 31 | exit 0 32 | ;; 33 | o) openrc_path=${OPTARG} 34 | ;; 35 | n) headnode_name=${OPTARG} 36 | ;; 37 | v) volume_delete=1 38 | ;; 39 | :) echo "Option -$OPTARG requires an argument." 40 | exit 1 41 | ;; 42 | 43 | esac 44 | done 45 | 46 | 47 | if [[ ! -f ${openrc_path} ]]; then 48 | echo "openrc path: ${openrc_path} \n does not point to a file!" 49 | exit 1 50 | elif [[ ${headnode_name} == "noname" ]]; then 51 | echo "No headnode name provided with -n, exiting!" 52 | exit 1 53 | elif [[ "${volume_delete}" != "0" && "${volume_delete}" != "1" ]]; then 54 | echo "Volume_delete parameter must be 0 or 1 instead of ${volume_delete}" 55 | exit 1 56 | fi 57 | 58 | source ${openrc_path} 59 | 60 | headnode_ip=$(openstack server list -f value -c Networks --name ${headnode_name} | sed 's/.*\(149.[0-9]\{1,3\}.[0-9]\{1,3\}.[0-9]\{1,3\}\).*/\1/') 61 | echo "Removing cluster based on ${headnode_name} at ${headnode_ip}" 62 | 63 | #remove 1st instance of "id", then the only alphanumeric chars left are the openstack id of the instance 64 | volume_id=$(openstack server show -f value -c volumes_attached ${headnode_name} | sed 's/id//' | tr -dc [:alnum:]-) 65 | 66 | openstack server delete ${headnode_name} 67 | 68 | openstack floating ip delete ${headnode_ip} 69 | 70 | #There's only one of each thing floating around, potentially some compute instances that weren't cleaned up and some images... SO! 
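# These names are derived from the headnode name using the same convention as
# cluster_create.sh, so that the resources it created can be found and removed.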
71 | OS_PREFIX=${headnode_name} 72 | OS_SSH_SECGROUP_NAME=${OS_PREFIX}-ssh-global 73 | OS_INTERNAL_SECGROUP_NAME=${OS_PREFIX}-internal 74 | OS_SLURM_KEYPAIR=${OS_PREFIX}-slurm-key 75 | OS_ROUTER_NAME=${OS_PREFIX}-elastic-router 76 | OS_SUBNET_NAME=${OS_PREFIX}-elastic-subnet 77 | OS_NETWORK_NAME=${OS_PREFIX}-elastic-net 78 | OS_APP_CRED=${OS_PREFIX}-slurm-app-cred 79 | 80 | compute_nodes=$(openstack server list -f value -c Name | grep -E "compute-${headnode_name}-base-instance|${headnode_name}-compute" ) 81 | if [[ -n "${compute_nodes}" ]]; then 82 | for node in "${compute_nodes}" 83 | do 84 | echo "Deleting compute node: ${node}" 85 | openstack server delete ${node} 86 | done 87 | fi 88 | 89 | sleep 5 # seems like there are issues with the network deleting correctly 90 | 91 | if [[ "${volume_delete}" == "1" ]]; then 92 | echo "DELETING VOLUME: ${volume_id}" 93 | openstack volume delete ${volume_id} 94 | fi 95 | 96 | openstack security group delete ${OS_SSH_SECGROUP_NAME} 97 | openstack security group delete ${OS_INTERNAL_SECGROUP_NAME} 98 | openstack keypair delete ${OS_SLURM_KEYPAIR} # We don't delete the elastic-key, since it could be a user's key used for other stuff 99 | openstack router unset --external-gateway ${OS_ROUTER_NAME} 100 | openstack router remove subnet ${OS_ROUTER_NAME} ${OS_SUBNET_NAME} 101 | openstack router delete ${OS_ROUTER_NAME} 102 | openstack subnet delete ${OS_SUBNET_NAME} 103 | openstack network delete ${OS_NETWORK_NAME} 104 | 105 | 106 | headnode_images=$(openstack image list --private -f value -c Name | grep ${headnode_name}-compute-image- ) 107 | for image in "${headnode_images}" 108 | do 109 | openstack image delete ${image} 110 | done 111 | 112 | openstack application credential delete ${OS_APP_CRED} 113 | -------------------------------------------------------------------------------- /cluster_destroy_local.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Uncomment below to help with debugging 4 | # set -x 5 | 6 | #This script makes several assumptions: 7 | # 1. Running on a host with openstack client tools installed 8 | # 2. Using a default ssh key in ~/.ssh/ 9 | # 3. The user knows what they're doing. 10 | # 4. Take some options: 11 | # openrc file 12 | # cluster name 13 | # volume size 14 | 15 | show_help() { 16 | echo "Options: 17 | -n: HEADNODE_NAME: required, name of the cluster to delete 18 | -o: OPENRC_PATH: optional, path to a valid openrc file, defaults to ~/openrc.sh 19 | -v: VOLUME_DELETE: optional flag, set to delete storage volumes, default false 20 | -d: HEADNODE_DELETE: delete the headnode once the cluster is deleted 21 | 22 | Usage: $0 -n -o [OPENRC_PATH] [-v] [-d]" 23 | } 24 | 25 | OPTIND=1 26 | 27 | openrc_path="${HOME}/openrc.sh" 28 | headnode_name="$(hostname --short)" 29 | volume_delete="0" 30 | headnode_delete="0" 31 | 32 | while getopts ":hhelp:o:n:v:d" opt; do 33 | case ${opt} in 34 | h|help|\?) show_help 35 | exit 0 36 | ;; 37 | o) openrc_path=${OPTARG} 38 | ;; 39 | n) headnode_name=${OPTARG} 40 | ;; 41 | v) volume_delete=1 42 | ;; 43 | d) headnode_delete=1 44 | ;; 45 | :) echo "Option -$OPTARG requires an argument." 46 | exit 1 47 | ;; 48 | 49 | esac 50 | done 51 | 52 | 53 | if [[ ! -f ${openrc_path} ]]; then 54 | echo "openrc path: ${openrc_path} \n does not point to a file!" 
55 | exit 1 56 | elif [[ "${volume_delete}" != "0" && "${volume_delete}" != "1" ]]; then 57 | echo "Volume_delete parameter must be 0 or 1 instead of ${volume_delete}" 58 | exit 1 59 | elif [[ "${headnode_delete}" != "0" && "${headnode_delete}" != "1" ]]; then 60 | echo "Headnode_delete parameter must be 0 or 1 instead of ${headnode_delete}" 61 | exit 1 62 | fi 63 | 64 | source ${openrc_path} 65 | 66 | #There's only one of each thing floating around, potentially some compute instances that weren't cleaned up and some images... SO! 67 | OS_PREFIX=${headnode_name} 68 | OS_SSH_SECGROUP_NAME=${OS_PREFIX}-ssh-global 69 | OS_INTERNAL_SECGROUP_NAME=${OS_PREFIX}-internal 70 | OS_SLURM_KEYPAIR=${OS_PREFIX}-slurm-key 71 | OS_KEYPAIR_NAME=${OS_PREFIX}-elastic-key 72 | 73 | compute_nodes=$(openstack server list -f value -c Name | grep -E "compute-${headnode_name}-base-instance|${headnode_name}-compute" ) 74 | if [[ -n "${compute_nodes}" ]]; then 75 | for node in "${compute_nodes}" 76 | do 77 | echo "Deleting compute node: ${node}" 78 | openstack server delete ${node} 79 | done 80 | fi 81 | 82 | sleep 5 # seems like there are issues with the network deleting correctly 83 | 84 | SERVER_UUID=$(curl http://169.254.169.254/openstack/latest/meta_data.json | jq '.uuid' | sed -e 's#"##g') 85 | 86 | openstack server remove security group ${SERVER_UUID} ${OS_SSH_SECGROUP_NAME} || true 87 | openstack server remove security group ${SERVER_UUID} ${OS_INTERNAL_SECGROUP_NAME} || true 88 | 89 | openstack security group delete ${OS_SSH_SECGROUP_NAME} 90 | openstack security group delete ${OS_INTERNAL_SECGROUP_NAME} 91 | openstack keypair delete ${OS_SLURM_KEYPAIR} 92 | # We DO delete the elastic-key, since we created it from scratch before 93 | openstack keypair delete ${OS_KEYPAIR_NAME} 94 | 95 | headnode_images=$(openstack image list --private -f value -c Name | grep ${headnode_name}-compute-image- ) 96 | for image in "${headnode_images}" 97 | do 98 | openstack image delete ${image} 99 | done 100 | 101 | if [[ "${headnode_delete}" == "1" ]]; then 102 | echo "DELETING HEADNODE: ${headnode_name}" 103 | openstack server delete ${headnode_name} 104 | fi 105 | -------------------------------------------------------------------------------- /compute_build_base_img.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - hosts: localhost 4 | 5 | vars: 6 | compute_base_image: "Featured-RockyLinux8" 7 | sec_group_global: "{{ ansible_facts.hostname }}-ssh-global" 8 | sec_group_internal: "{{ ansible_facts.hostname }}-internal" 9 | compute_base_size: "m3.tiny" 10 | network_name: "{{ ansible_facts.hostname }}-elastic-net" 11 | JS_ssh_keyname: "{{ ansible_facts.hostname }}-slurm-key" 12 | openstack_cloud: "openstack" 13 | 14 | vars_files: 15 | - clouds.yaml 16 | 17 | tasks: 18 | 19 | - name: build compute base instance 20 | os_server: 21 | timeout: 300 22 | state: present 23 | name: "compute-{{ ansible_facts.hostname }}-base-instance" 24 | cloud: "{{ openstack_cloud }}" 25 | image: "{{ compute_base_image }}" 26 | key_name: "{{ JS_ssh_keyname }}" 27 | security_groups: "{{ sec_group_global }},{{ sec_group_internal }}" 28 | flavor: "{{ compute_base_size }}" 29 | meta: { compute: "base" } 30 | auto_ip: "no" 31 | user_data: | 32 | #cloud-config 33 | packages: [] 34 | package_update: false 35 | package_upgrade: false 36 | package_reboot_if_required: false 37 | final_message: "Boot completed in $UPTIME seconds" 38 | network: "{{ network_name }}" 39 | wait: yes 40 | register: "os_host" 41 | 
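    # os_host (registered above) holds the Openstack metadata for the new
    # instance; add_host below uses its private_v4 address so the second play
    # can reach the node over the cluster network.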
42 | - debug: 43 | var: os_host 44 | 45 | - name: add compute instance to inventory 46 | add_host: 47 | name: "{{ os_host['openstack']['name'] }}" 48 | groups: "compute-base" 49 | ansible_host: "{{ os_host.openstack.private_v4 }}" 50 | 51 | - name: pause for ssh to come up 52 | pause: 53 | seconds: 90 54 | 55 | 56 | - hosts: compute-base 57 | 58 | vars: 59 | compute_base_package_list: 60 | - "python3-libselinux" 61 | - "telnet" 62 | - "bind-utils" 63 | - "vim" 64 | - "openmpi4-gnu9-ohpc" 65 | - "ohpc-slurm-client" 66 | - "lmod-ohpc" 67 | - "ceph-common" 68 | packages_to_remove: 69 | - "environment-modules" 70 | - "containerd.io.x86_64" 71 | - "docker-ce.x86_64" 72 | - "docker-ce-cli.x86_64" 73 | - "docker-ce-rootless-extras.x86_64" 74 | - "Lmod" 75 | 76 | tasks: 77 | 78 | - name: Get the headnode private IP 79 | local_action: 80 | module: shell source /etc/slurm/openrc.sh && openstack server show $(hostname -s) | grep addresses | awk -F'|' '{print $3}' | awk -F'=' '{print $2}' | awk -F',' '{print $1}' 81 | register: headnode_private_ip 82 | become: False # for running as slurm, since no sudo on localhost 83 | 84 | - name: Get the slurmctld uid 85 | local_action: 86 | module: shell getent passwd slurm | awk -F':' '{print $3}' 87 | register: headnode_slurm_uid 88 | become: False # for running as slurm, since no sudo on localhost 89 | 90 | - name: turn off the firewall 91 | service: 92 | name: firewalld 93 | state: stopped 94 | enabled: no 95 | 96 | - name: Add OpenHPC 2.0 repo 97 | dnf: 98 | name: "http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm" 99 | state: present 100 | lock_timeout: 900 101 | disable_gpg_check: yes 102 | 103 | 104 | - name: Enable CentOS PowerTools repo 105 | command: dnf config-manager --set-enabled powertools 106 | 107 | - name: Disable docker-ce repo 108 | command: dnf config-manager --set-disabled docker-ce-stable 109 | 110 | - name: remove env-modules and docker packages 111 | dnf: 112 | name: "{{ packages_to_remove }}" 113 | state: absent 114 | lock_timeout: 300 115 | 116 | # There is an issue in removing Lmod in early call. 
Seems like we need to run it twice 117 | - name: remove Lmod packages 118 | dnf: 119 | name: Lmod 120 | state: absent 121 | lock_timeout: 300 122 | 123 | - name: install basic packages 124 | dnf: 125 | name: "{{ compute_base_package_list }}" 126 | state: present 127 | lock_timeout: 300 128 | 129 | - name: fix slurm user uid 130 | user: 131 | name: slurm 132 | uid: "{{ headnode_slurm_uid.stdout}}" 133 | shell: "/sbin/nologin" 134 | home: "/etc/slurm" 135 | 136 | - name: create slurm spool directories 137 | file: 138 | path: /var/spool/slurm/ctld 139 | state: directory 140 | owner: slurm 141 | group: slurm 142 | mode: 0755 143 | recurse: yes 144 | 145 | - name: change ownership of slurm files 146 | file: 147 | path: "{{ item }}" 148 | owner: slurm 149 | group: slurm 150 | with_items: 151 | - "/var/spool/slurm" 152 | - "/var/spool/slurm/ctld" 153 | # - "/var/log/slurm_jobacct.log" 154 | 155 | - name: disable selinux 156 | selinux: state=permissive policy=targeted 157 | 158 | # - name: allow use_nfs_home_dirs 159 | # seboolean: name=use_nfs_home_dirs state=yes persistent=yes 160 | 161 | - name: import /home on compute nodes 162 | lineinfile: 163 | dest: /etc/fstab 164 | line: "{{ headnode_private_ip.stdout }}:/home /home nfs defaults,nfsvers=4.0 0 0" 165 | state: present 166 | 167 | - name: ensure /opt/ohpc/pub exists 168 | file: path=/opt/ohpc/pub state=directory mode=777 recurse=yes 169 | 170 | - name: import /opt/ohpc/pub on compute nodes 171 | lineinfile: 172 | dest: /etc/fstab 173 | line: "{{ headnode_private_ip.stdout }}:/opt/ohpc/pub /opt/ohpc/pub nfs defaults,nfsvers=4.0 0 0" 174 | state: present 175 | 176 | - name: ensure /export exists 177 | file: path=/export state=directory mode=777 178 | 179 | - name: import /export on compute nodes 180 | lineinfile: 181 | dest: /etc/fstab 182 | line: "{{ headnode_private_ip.stdout }}:/export /export nfs defaults,nfsvers=4.0 0 0" 183 | state: present 184 | 185 | - name: fix sda1 mount in fstab 186 | lineinfile: 187 | dest: /etc/fstab 188 | regex: "/ xfs defaults" 189 | line: "/dev/sda1 / xfs defaults 0 0" 190 | state: present 191 | 192 | - name: add local users to compute node 193 | script: /tmp/add_users.sh 194 | ignore_errors: True 195 | 196 | - name: copy munge key from headnode 197 | synchronize: 198 | mode: push 199 | src: /etc/munge/munge.key 200 | dest: /etc/munge/munge.key 201 | set_remote_user: no 202 | use_ssh_args: yes 203 | 204 | - name: fix perms on munge key 205 | file: 206 | path: /etc/munge/munge.key 207 | owner: munge 208 | group: munge 209 | mode: 0600 210 | 211 | - name: copy slurm.conf from headnode 212 | synchronize: 213 | mode: push 214 | src: /etc/slurm/slurm.conf 215 | dest: /etc/slurm/slurm.conf 216 | set_remote_user: no 217 | use_ssh_args: yes 218 | 219 | - name: copy slurm_prolog.sh from headnode 220 | synchronize: 221 | mode: push 222 | src: /usr/local/sbin/slurm_prolog.sh 223 | dest: /usr/local/sbin/slurm_prolog.sh 224 | set_remote_user: no 225 | use_ssh_args: yes 226 | 227 | - name: enable munge 228 | service: name=munge.service enabled=yes 229 | 230 | - name: enable slurmd 231 | service: name=slurmd enabled=yes 232 | 233 | #cat /etc/systemd/system/multi-user.target.wants/slurmd.service 234 | #[Unit] 235 | #Description=Slurm node daemon 236 | #After=network.target munge.service #CHANGING TO: network-online.target 237 | #ConditionPathExists=/etc/slurm/slurm.conf 238 | # 239 | #[Service] 240 | #Type=forking 241 | #EnvironmentFile=-/etc/sysconfig/slurmd 242 | #ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS 243 | 
#ExecReload=/bin/kill -HUP $MAINPID 244 | #PIDFile=/var/run/slurmd.pid 245 | #KillMode=process 246 | #LimitNOFILE=51200 247 | #LimitMEMLOCK=infinity 248 | #LimitSTACK=infinity 249 | #Delegate=yes 250 | # 251 | # 252 | #[Install] 253 | #WantedBy=multi-user.target 254 | 255 | - name: change slurmd service "After" to sshd and remote filesystems 256 | command: sed -i 's/network.target/sshd.service remote-fs.target/' /usr/lib/systemd/system/slurmd.service 257 | 258 | - name: add slurmd service "Requires" of sshd and remote filesystems 259 | command: sed -i '/After=network/aRequires=sshd.service remote-fs.target' /usr/lib/systemd/system/slurmd.service 260 | 261 | # - name: mount -a on compute nodes 262 | # command: "mount -a" 263 | 264 | - hosts: localhost 265 | 266 | vars_files: 267 | - clouds.yaml 268 | 269 | tasks: 270 | 271 | - name: create compute instance snapshot 272 | command: ./compute_take_snapshot.sh 273 | 274 | # os_server no longer handles instance state correctly 275 | # - name: remove compute instance 276 | # os_server: 277 | # timeout: 200 278 | # state: absent 279 | # name: "compute-{{ inventory_hostname_short }}-base-instance" 280 | # cloud: "tacc" 281 | -------------------------------------------------------------------------------- /compute_take_snapshot.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /etc/slurm/openrc.sh 4 | 5 | compute_image="$(hostname -s)-compute-image-latest" 6 | compute_instance="compute-$(hostname -s)-base-instance" 7 | 8 | openstack server stop ${compute_instance} 9 | 10 | count=0 11 | declare -i count 12 | until [[ ${count} -ge 12 || "${shutoff_check}" =~ "SHUTOFF" ]]; 13 | do 14 | shutoff_check=$(openstack server show -f value -c status ${compute_instance}) 15 | count+=1 16 | sleep 5 17 | done 18 | 19 | image_check=$(openstack image show -f value -c name ${compute_image}) 20 | # If there is already a -latest image, re-name it with the date of its creation 21 | if [[ -n ${image_check} ]]; 22 | then 23 | old_image_date=$(openstack image show ${compute_image} -f value -c created_at | cut -d'T' -f 1) 24 | backup_image_name=${compute_image::-7}-${old_image_date} 25 | 26 | if [[ ${old_image_date} == "$(date +%Y-%m-%d)" && -n "$(openstack image show -f value -c name ${backup_image_name})" ]]; 27 | then 28 | openstack image delete ${backup_image_name} 29 | fi 30 | 31 | openstack image set --name ${backup_image_name} ${compute_image} 32 | fi 33 | 34 | openstack server image create --name ${compute_image} ${compute_instance} 35 | 36 | count=0 37 | declare -i count 38 | until [[ ${count} -ge 20 || "${instance_check}" =~ "active" ]]; 39 | do 40 | instance_check=$(openstack image show -f value -c status ${compute_image}) 41 | count+=1 42 | sleep 15 43 | done 44 | 45 | if [[ ${count} -ge 20 ]]; 46 | then 47 | echo "Image still in queued status after 300 seconds" 48 | exit 2 49 | fi 50 | 51 | echo "Done after ${count} sleeps." 
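# Illustrative note (not part of the original script): at this point the new
# "-latest" snapshot should be ACTIVE, and any pre-existing "-latest" image was
# renamed earlier in the script with its creation date, e.g. for a hypothetical
# hostname "headnode":
#   headnode-compute-image-latest  ->  headnode-compute-image-2024-05-01
# A quick manual sanity check before the base instance is deleted below:
#   openstack image show -f value -c status headnode-compute-image-latest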
52 | 53 | openstack image list | grep $(hostname -s) 54 | openstack server delete ${compute_instance} 55 | -------------------------------------------------------------------------------- /cron-node-check.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | sinfo_check=$(sinfo | grep -iE "drain|down") 4 | 5 | #mail_domain=$(curl -s https://ipinfo.io/hostname) 6 | mail_domain=$(host $(curl -s http://169.254.169.254/latest/meta-data/public-ipv4) | sed 's/.*pointer \(.*\)./\1/') 7 | 8 | email_addr="" 9 | 10 | try_count=0 11 | declare -i try_count 12 | until [[ -n $mail_domain || $try_count -ge 10 ]]; 13 | do 14 | sleep 3 15 | mail_domain=$(curl -s https://ipinfo.io/hostname) 16 | try_count=$try_count+1 17 | echo $mail_domain, $try_count 18 | done 19 | 20 | if [[ $try_count -ge 10 ]]; then 21 | echo "failed to get domain name!" 22 | exit 1 23 | fi 24 | 25 | if [[ -n $sinfo_check ]]; then 26 | echo $sinfo_check | mailx -r "node-check@$mail_domain" -s "NODE IN BAD STATE - $mail_domain" $email_addr 27 | # echo "$sinfo_check mailx -r "node-check@$mail_domain" -s "NODE IN BAD STATE - $mail_domain" $email_addr" # TESTING LINE 28 | fi 29 | 30 | #Check for ACTIVE nodes without running/cf/cg jobs 31 | squeue_check=$(squeue -h -t CF,CG,R) 32 | 33 | #source the openrc.sh for instance check 34 | $(sudo cat /etc/slurm/openrc.sh) 35 | compute_node_check=$(openstack server list | awk '/compute/ && /ACTIVE/') 36 | 37 | if [[ -n $compute_node_check && -z $squeue_check ]]; then 38 | echo $compute_node_check $squeue_check | mailx -r "node-check@$mail_domain" -s "NODE IN ACTIVE STATE WITHOUT JOBS- $mail_domain" $email_addr 39 | fi 40 | -------------------------------------------------------------------------------- /figures/virtual-clusters.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/access-ci-org/Jetstream_Cluster/29d3c0dfe54ed09c0501f9547c5b860e87efbbe3/figures/virtual-clusters.jpeg -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | OPTIND=1 4 | 5 | docker_allow=0 #default to NOT installing docker; must be 0 or 1 6 | jhub_build=0 #default to NOT installing jupyterhub; must be 0 or 1 7 | 8 | while getopts ":jd" opt; do 9 | case ${opt} in 10 | d) docker_allow=1 11 | ;; 12 | j) jhub_build=1 13 | ;; 14 | \?) echo "BAD OPTION! $opt TRY AGAIN" 15 | exit 1 16 | ;; 17 | esac 18 | done 19 | 20 | if [[ ! -e /etc/slurm/openrc.sh ]]; then 21 | echo "NO OPENRC FOUND! CREATE ONE, AND TRY AGAIN!" 22 | exit 1 23 | fi 24 | 25 | if [[ $EUID -ne 0 ]]; then 26 | echo "This script must be run as root" 27 | exit 1 28 | fi 29 | 30 | #do this early, allow the user to leave while the rest runs! 
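# Illustrative sketch (not part of the original script): /etc/slurm/openrc.sh is
# expected to export the OpenStack auth variables consumed below when clouds.yaml
# is generated; roughly, with placeholder values:
#   export OS_AUTH_TYPE=v3applicationcredential
#   export OS_AUTH_URL=https://your-cloud.example.org:5000/v3
#   export OS_IDENTITY_API_VERSION=3
#   export OS_APPLICATION_CREDENTIAL_ID=<application-credential-id>
#   export OS_APPLICATION_CREDENTIAL_SECRET=<application-credential-secret>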
31 | source /etc/slurm/openrc.sh 32 | 33 | OS_PREFIX=$(hostname -s) 34 | OS_SLURM_KEYPAIR=${OS_PREFIX}-slurm-key 35 | 36 | SUBNET_PREFIX=10.0.0 37 | 38 | #Open the firewall on the internal network for Cent8 39 | firewall-cmd --permanent --add-rich-rule="rule source address="${SUBNET_PREFIX}.0/24" family='ipv4' accept" 40 | firewall-cmd --add-rich-rule="rule source address="${SUBNET_PREFIX}.0/24" family='ipv4' accept" 41 | 42 | dnf -y install http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm \ 43 | centos-release-openstack-train 44 | 45 | dnf config-manager --set-enabled powertools 46 | 47 | if [[ ${docker_allow} == 0 ]]; then 48 | dnf config-manager --set-disabled docker-ce-stable 49 | 50 | dnf -y remove containerd.io.x86_64 docker-ce.x86_64 docker-ce-cli.x86_64 docker-ce-rootless-extras.x86_64 51 | fi 52 | 53 | dnf -y --allowerasing install \ 54 | ohpc-slurm-server \ 55 | vim \ 56 | ansible \ 57 | mailx \ 58 | lmod-ohpc \ 59 | bash-completion \ 60 | gnu9-compilers-ohpc \ 61 | openmpi4-gnu9-ohpc \ 62 | singularity-ohpc \ 63 | lmod-defaults-gnu9-openmpi4-ohpc \ 64 | moreutils \ 65 | bind-utils \ 66 | python3-openstackclient \ 67 | python3-pexpect 68 | 69 | dnf -y update # until the base python2-openstackclient install works out of the box! 70 | 71 | #create user that can be used to submit jobs 72 | [ ! -d /home/gateway-user ] && useradd -m gateway-user 73 | 74 | [ ! -f slurm-key ] && ssh-keygen -b 2048 -t rsa -P "" -f slurm-key 75 | 76 | # generate a local key for centos for after homedirs are mounted! 77 | [ ! -f /home/centos/.ssh/id_rsa ] && su centos - -c 'ssh-keygen -t rsa -b 2048 -P "" -f /home/centos/.ssh/id_rsa && cat /home/centos/.ssh/id_rsa.pub >> /home/centos/.ssh/authorized_keys' 78 | 79 | 80 | #create clouds.yaml file from contents of openrc 81 | echo -e "clouds: 82 | tacc: 83 | auth: 84 | auth_url: '${OS_AUTH_URL}' 85 | application_credential_id: '${OS_APPLICATION_CREDENTIAL_ID}' 86 | application_credential_secret: '${OS_APPLICATION_CREDENTIAL_SECRET}' 87 | user_domain_name: tacc 88 | identity_api_version: 3 89 | project_domain_name: tacc 90 | auth_type: 'v3applicationcredential'" > clouds.yaml 91 | 92 | #Make sure only root can read this 93 | chmod 400 clouds.yaml 94 | 95 | if [[ -n $(openstack keypair list | grep ${OS_SLURM_KEYPAIR}) ]]; then 96 | openstack keypair delete ${OS_SLURM_KEYPAIR} 97 | openstack keypair create --public-key slurm-key.pub ${OS_SLURM_KEYPAIR} 98 | else 99 | openstack keypair create --public-key slurm-key.pub ${OS_SLURM_KEYPAIR} 100 | fi 101 | 102 | #TACC-specific changes: 103 | 104 | if [[ $OS_AUTH_URL =~ "tacc" ]]; then 105 | #Insert headnode into /etc/hosts 106 | echo "$(ip add show dev eth0 | awk '/inet / {sub("/24","",$2); print $2}') $(hostname) $(hostname -s)" >> /etc/hosts 107 | fi 108 | 109 | #Get OS Network name of *this* server, and set as the network for compute-nodes 110 | # Only need this if you've changed the subnet name for some reason 111 | #headnode_os_subnet=$(openstack server show $(hostname | cut -f 1 -d'.') | awk '/addresses/ {print $4}' | cut -f 1 -d'=') 112 | #sed -i "s/network_name=.*/network_name=$headnode_os_subnet/" ./slurm_resume.sh 113 | 114 | #Set compute node names to $OS_PREFIX-compute- 115 | sed -i "s/=compute-*/=${OS_PREFIX}-compute-/" ./slurm.conf 116 | sed -i "s/Host compute-*/Host ${OS_PREFIX}-compute-/" ./ssh.cfg 117 | 118 | #set the subnet in ssh.cfg and compute_build_base_img.yml 119 | sed -i "s/Host 10.0.0.\*/Host ${SUBNET_PREFIX}.\*/" ./ssh.cfg 120 | sed -i 
"s/^\(.*\)10.0.0\(.*\)$/\1${SUBNET_PREFIX}\2/" ./compute_build_base_img.yml 121 | 122 | # Deal with files required by slurm - better way to encapsulate this section? 123 | 124 | mkdir -p -m 700 /etc/slurm/.ssh 125 | 126 | cp slurm-key slurm-key.pub /etc/slurm/.ssh/ 127 | 128 | #Make sure slurm-user will still be valid after the nfs mount happens! 129 | cat slurm-key.pub >> /home/centos/.ssh/authorized_keys 130 | 131 | chown -R slurm:slurm /etc/slurm/.ssh 132 | 133 | setfacl -m u:slurm:rw /etc/hosts 134 | setfacl -m u:slurm:rwx /etc/ 135 | 136 | chmod +t /etc 137 | 138 | #The following may be removed when appcred gen during cluster_create is working 139 | ##Possible to handle this at the cloud-init level? From a machine w/ 140 | ## pre-loaded openrc, possible via user-data and write_files, yes. 141 | ## This needs a check for success, and if not, fail? 142 | ##export $(openstack application credential create -f shell ${OS_APP_CRED} | sed 's/^\(.*\)/OS_ac_\1/') 143 | ##echo -e "export OS_AUTH_TYPE=v3applicationcredential 144 | ##export OS_AUTH_URL=${OS_AUTH_URL} 145 | ##export OS_IDENTITY_API_VERSION=3 146 | ##export OS_REGION_NAME="RegionOne" 147 | ##export OS_INTERFACE=public 148 | ##export OS_APPLICATION_CREDENTIAL_ID=${OS_ac_id} 149 | ##export OS_APPLICATION_CREDENTIAL_SECRET=${OS_ac_secret} > /etc/slurm/openrc.sh 150 | # 151 | #echo -e "export OS_PROJECT_DOMAIN_NAME=tacc 152 | #export OS_USER_DOMAIN_NAME=tacc 153 | #export OS_PROJECT_NAME=${OS_PROJECT_NAME} 154 | #export OS_USERNAME=${OS_USERNAME} 155 | #export OS_PASSWORD=${OS_PASSWORD} 156 | #export OS_AUTH_URL=${OS_AUTH_URL} 157 | #export OS_IDENTITY_API_VERSION=3" > /etc/slurm/openrc.sh 158 | 159 | #chown slurm:slurm /etc/slurm/openrc.sh 160 | 161 | #chmod 400 /etc/slurm/openrc.sh 162 | 163 | cp prevent-updates.ci /etc/slurm/ 164 | 165 | chown slurm:slurm /etc/slurm/openrc.sh 166 | chown slurm:slurm /etc/slurm/prevent-updates.ci 167 | 168 | mkdir -p /var/log/slurm 169 | 170 | touch /var/log/slurm/slurm_elastic.log 171 | touch /var/log/slurm/os_clean.log 172 | 173 | chown -R slurm:slurm /var/log/slurm 174 | 175 | cp slurm-logrotate.conf /etc/logrotate.d/slurm 176 | 177 | setfacl -m u:slurm:rw /etc/ansible/hosts 178 | setfacl -m u:slurm:rwx /etc/ansible/ 179 | 180 | cp slurm_*.sh /usr/local/sbin/ 181 | 182 | cp cron-node-check.sh /usr/local/sbin/ 183 | cp clean-os-error.sh /usr/local/sbin/ 184 | 185 | chown slurm:slurm /usr/local/sbin/slurm_*.sh 186 | chown slurm:slurm /usr/local/sbin/clean-os-error.sh 187 | 188 | chown centos:centos /usr/local/sbin/cron-node-check.sh 189 | 190 | echo "#13 */6 * * * centos /usr/local/sbin/cron-node-check.sh" >> /etc/crontab 191 | echo "#*/4 * * * * slurm /usr/local/sbin/clean-os-error.sh" >> /etc/crontab 192 | 193 | #"dynamic" hostname adjustment 194 | sed -i "s/ControlMachine=slurm-example/ControlMachine=$(hostname -s)/" ./slurm.conf 195 | cp slurm.conf /etc/slurm/slurm.conf 196 | 197 | cp ansible.cfg /etc/ansible/ 198 | 199 | cp ssh.cfg /etc/ansible/ 200 | 201 | cp slurm_test.job ${HOME} 202 | 203 | #create share directory 204 | mkdir -m 777 -p /export 205 | 206 | #create export of homedirs and /export and /opt/ohpc/pub 207 | echo -e "/home ${SUBNET_PREFIX}.0/24(rw,no_root_squash) \n/export ${SUBNET_PREFIX}.0/24(rw,no_root_squash)" > /etc/exports 208 | echo -e "/opt/ohpc/pub ${SUBNET_PREFIX}.0/24(rw,no_root_squash)" >> /etc/exports 209 | 210 | #Get latest CentOS7 minimal image for base - if os_image_facts or the os API allowed for wildcards, 211 | # this would be different. 
But this is the world we live in. 212 | # After the naming convention change of May 5, 2020, this is no longer necessary - JS-API-Featured-CentOS7-Latest is the default. 213 | # These lines remain as a testament to past struggles. 214 | #centos_base_image=$(openstack image list --status active | grep -iE "API-Featured-centos7-[[:alpha:]]{3,4}-[0-9]{2}-[0-9]{4}" | awk '{print $4}' | tail -n 1) 215 | #centos_base_image="JS-API-Featured-CentOS7-Latest" 216 | #sed -i "s/\(\s*compute_base_image: \).*/\1\"${centos_base_image}\"/" compute_build_base_img.yml | head -n 10 217 | 218 | #create temporary script to add local users 219 | echo "#!/bin/bash" > /tmp/add_users.sh 220 | cat /etc/passwd | awk -F':' '$4 >= 1001 && $4 < 65000 {print "useradd -M -u", $3, $1}' >> /tmp/add_users.sh 221 | 222 | # build instance for compute base image generation, take snapshot, and destroy it 223 | echo "Creating compute image! based on $centos_base_image" 224 | 225 | ansible-playbook -v --ssh-common-args='-o StrictHostKeyChecking=no' compute_build_base_img.yml 226 | 227 | #to allow other users to run ansible! 228 | rm -r /tmp/.ansible 229 | 230 | if [[ ${jhub_build} == 1 ]]; then 231 | ansible-galaxy collection install community.general 232 | ansible-galaxy collection install ansible.posix 233 | ansible-galaxy install geerlingguy.certbot 234 | # ansible-playbook -v --ssh-common-args='-o StrictHostKeyChecking=no' install_jupyterhub.yml 235 | fi 236 | 237 | #Start required services 238 | systemctl enable slurmctld munge nfs-server rpcbind 239 | systemctl restart munge slurmctld nfs-server rpcbind 240 | 241 | echo -e "If you wish to enable an email when node state is drain or down, please uncomment \nthe cron-node-check.sh job in /etc/crontab, and place your email of choice in the 'email_addr' variable \nat the beginning of /usr/local/sbin/cron-node-check.sh" 242 | -------------------------------------------------------------------------------- /install_jupyterhub.yml: -------------------------------------------------------------------------------- 1 | --- 2 | 3 | - hosts: localhost 4 | 5 | vars: 6 | headnode_public_hostname: FILL-ME-IN 7 | headnode_alternate_hostname: "" #Optional addition DNS entry pointing to your host 8 | certbot_create_if_missing: yes 9 | certbot_admin_email: FILL-ME-IN 10 | certbot_install_method: snap 11 | certbot_create_method: standalone 12 | certbot_certs: 13 | - domains: 14 | - "{{ headnode_public_hostname }}" 15 | certbot_create_standalone_stop_services: 16 | - httpd 17 | 18 | roles: 19 | - geerlingguy.certbot 20 | 21 | pre_tasks: 22 | 23 | - name: disable selinux 24 | ansible.posix.selinux: 25 | policy: targeted 26 | state: permissive 27 | 28 | - name: install httpd bits 29 | dnf: 30 | state: latest 31 | name: 32 | - nodejs 33 | - npm 34 | - httpd 35 | - httpd-filesystem 36 | - httpd-tools 37 | - python3-certbot-apache 38 | - snapd 39 | - snap-confine 40 | - snapd-selinux 41 | 42 | - name: start and enable snapd 43 | service: 44 | name: snapd 45 | state: started 46 | enabled: yes 47 | 48 | - name: add http/s to firewalld 49 | shell: firewall-cmd --add-service http --zone=public --permanent && \ 50 | firewall-cmd --add-service https --zone=public --permanent && \ 51 | firewall-cmd --reload 52 | 53 | tasks: 54 | 55 | - name: Get the headnode private IP 56 | local_action: 57 | module: shell ip addr | grep -Eo '10.0.0.[0-9]*' | head -1 58 | register: headnode_private_ip 59 | 60 | - name: Get the headnode hostname 61 | local_action: 62 | module: shell hostname -s 63 | register: 
headnode_hostname 64 | 65 | - name: https redirect config 66 | template: 67 | src: jhub_files/https_redirect.conf.j2 68 | dest: /etc/httpd/conf.d/https_redirect.conf 69 | owner: root 70 | mode: 0644 71 | 72 | - name: jupyterhub proxy config 73 | template: 74 | src: jhub_files/jupyterhub.conf.j2 75 | dest: /etc/httpd/conf.d/jupyterhub.conf 76 | owner: root 77 | mode: 0644 78 | 79 | - name: restart httpd 80 | service: 81 | name: httpd 82 | state: restarted 83 | enabled: yes 84 | 85 | - name: create a shadow group 86 | group: 87 | name: shadow 88 | state: present 89 | 90 | - name: let shadow group read /etc/shadow 91 | file: 92 | path: /etc/shadow 93 | mode: 0040 94 | group: shadow 95 | owner: root 96 | 97 | - name: create jupyterhub user and group 98 | user: 99 | name: jupyterhub 100 | state: present 101 | groups: shadow 102 | 103 | - name: create jupyterhub-users group 104 | group: 105 | name: jupyterhub-users 106 | state: present 107 | 108 | - name: create sudoers directory 109 | file: 110 | path: /etc/sudoers.d 111 | owner: root 112 | group: root 113 | mode: 0750 114 | state: directory 115 | 116 | - name: set sudoers permissions for jupyterhub non-root 117 | copy: 118 | src: jhub_files/jhub_sudoers 119 | dest: /etc/sudoers.d/ 120 | owner: root 121 | group: root 122 | mode: 0440 123 | 124 | - name: create jupyterhub config dir 125 | file: 126 | path: /etc/jupyterhub 127 | owner: jupyterhub 128 | group: jupyterhub 129 | mode: 0755 130 | state: directory 131 | 132 | - name: install devel deps for building Python 133 | dnf: 134 | state: latest 135 | name: 136 | - bzip2-devel 137 | - ncurses-devel 138 | - gdbm-devel 139 | - libsqlite3x-devel 140 | - sqlite-devel 141 | - libuuid-devel 142 | - uuid-devel 143 | - openssl-devel 144 | - readline-devel 145 | - zlib-devel 146 | - libffi-devel 147 | - xz-devel 148 | - tk-devel 149 | 150 | - name: install configurable-http-proxy 151 | npm: 152 | name: configurable-http-proxy 153 | global: yes 154 | 155 | - name: create tmp builddir 156 | file: 157 | path: /tmp/build/ 158 | state: directory 159 | 160 | - name: fetch python source 161 | unarchive: 162 | src: https://www.python.org/ftp/python/3.8.10/Python-3.8.10.tgz 163 | dest: /tmp/build/ 164 | remote_src: yes 165 | 166 | - name: run python configure 167 | command: 168 | cmd: ./configure --prefix=/opt/python3 169 | chdir: /tmp/build/Python-3.8.10 170 | 171 | - name: build python source 172 | community.general.make: 173 | target: all 174 | chdir: /tmp/build/Python-3.8.10 175 | 176 | - name: install python 177 | community.general.make: 178 | target: install 179 | chdir: /tmp/build/Python-3.8.10 180 | become: yes 181 | 182 | - name: run python configure for public build 183 | command: 184 | cmd: ./configure --prefix=/opt/ohpc/pub/compiler/python3 185 | chdir: /tmp/build/Python-3.8.10 186 | 187 | - name: install python publicly 188 | community.general.make: 189 | target: install 190 | chdir: /tmp/build/Python-3.8.10 191 | become: yes 192 | 193 | - name: install jupyterhub 194 | pip: 195 | executable: /opt/python3/bin/pip3 196 | name: jupyterhub 197 | 198 | - name: install wrapspawner 199 | pip: 200 | executable: /opt/python3/bin/pip3 201 | name: 202 | - wrapspawner 203 | - traitlets<5 204 | 205 | - name: install jupyterlab 206 | pip: 207 | executable: /opt/ohpc/pub/compiler/python3/bin/pip3 208 | name: jupyterlab 209 | 210 | - name: create jupyterhub service 211 | template: 212 | src: jhub_files/jhub_service.j2 213 | dest: /etc/systemd/system/jupyterhub.service 214 | mode: 0644 215 | owner: root 216 | 
group: root 217 | 218 | #This is hard b/c of Batchspawner config 219 | - name: install base jupyterhub config 220 | copy: 221 | src: jhub_files/jhub_conf.py 222 | dest: /etc/jupyterhub/jupyterhub_config.py 223 | owner: jupyterhub 224 | group: jupyterhub 225 | mode: 0644 226 | 227 | - name: set headnode ip in jhub_config 228 | lineinfile: 229 | regexp: JEC_HEADNODE_IP 230 | line: "c.JupyterHub.hub_ip = \'{{ headnode_private_ip.stdout }}\' #JEC_HEADNODE_IP" 231 | path: /etc/jupyterhub/jupyterhub_config.py 232 | 233 | - name: set hostname in jhub_config for batchspawner 234 | lineinfile: 235 | regexp: JEC_SPAWNER_HOSTNAME 236 | line: "c.BatchSpawnerBase.req_host = \'{{ headnode_hostname.stdout }}\' #JEC_SPAWNER_HOSTNAME " 237 | path: /etc/jupyterhub/jupyterhub_config.py 238 | 239 | - name: set hostname in jhub_config for batchspawner 240 | lineinfile: 241 | regexp: JEC_PUBLIC_HOSTNAME 242 | line: "public_hostname = \'{{ headnode_public_hostname }}\' #JEC_PUBLIC_HOSTNAME" 243 | path: /etc/jupyterhub/jupyterhub_config.py 244 | 245 | - name: install batchspawner to jhub python 246 | pip: 247 | name: batchspawner 248 | executable: /opt/python3/bin/pip3 249 | 250 | - name: install batchspawner to public python 251 | pip: 252 | name: batchspawner 253 | executable: /opt/ohpc/pub/compiler/python3/bin/pip3 254 | 255 | - name: create python module dir 256 | file: 257 | state: directory 258 | path: /opt/ohpc/pub/modulefiles/python3.8 259 | 260 | - name: create python module 261 | copy: 262 | src: jhub_files/python_mod_3.8 263 | dest: /opt/ohpc/pub/modulefiles/python3.8/3.8.10 264 | mode: 0777 265 | owner: root 266 | group: root 267 | 268 | - name: start the jupyterhub service 269 | service: 270 | name: jupyterhub 271 | enabled: yes 272 | state: started 273 | -------------------------------------------------------------------------------- /install_local.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Uncomment below to help with debugging 4 | # set -x 5 | 6 | OPTIND=1 7 | 8 | docker_allow=0 #default to NOT installing docker; must be 0 or 1 9 | jhub_build=0 #default to NOT installing jupyterhub; must be 0 or 1 10 | 11 | while getopts ":jd" opt; do 12 | case ${opt} in 13 | d) docker_allow=1 14 | ;; 15 | j) jhub_build=1 16 | ;; 17 | \?) echo "BAD OPTION! $opt TRY AGAIN" 18 | exit 1 19 | ;; 20 | esac 21 | done 22 | 23 | if [[ ! -e /etc/slurm/openrc.sh ]]; then 24 | echo "NO OPENRC FOUND! CREATE ONE, AND TRY AGAIN!" 25 | exit 1 26 | fi 27 | 28 | if [[ $EUID -ne 0 ]]; then 29 | echo "This script must be run as root" 30 | exit 1 31 | fi 32 | 33 | #do this early, allow the user to leave while the rest runs! 34 | source /etc/slurm/openrc.sh 35 | 36 | OS_PREFIX=$(hostname -s) 37 | OS_SLURM_KEYPAIR=${OS_PREFIX}-slurm-key 38 | 39 | HEADNODE_NETWORK=$(openstack server show $(hostname -s) | grep addresses | awk -F'|' '{print $3}' | awk -F'=' '{print $1}' | awk '{$1=$1};1') 40 | HEADNODE_IP=$(openstack server show $(hostname -s) | grep addresses | awk -F'|' '{print $3}' | awk -F'=' '{print $2}' | awk -F',' '{print $1}') 41 | SUBNET=$(ip addr | grep $HEADNODE_IP | awk '{print $2}') 42 | SUBNET_PREFIX=$(ip addr | grep $HEADNODE_IP | awk '{print $2}' | awk -F. '{print $1 "." $2 "." $3 ".*"}') 43 | 44 | #Open the firewall on the internal network for Cent8. Use offline tool as this runs as a cloud init script. 
45 | # See the discussion : https://titanwolf.org/Network/Articles/Article?AID=ca474d74-d632-4b1e-9b03-cd10add19633 46 | firewall-offline-cmd --add-rich-rule="rule source address="${SUBNET}" family='ipv4' accept" 47 | systemctl enable firewalld 48 | systemctl restart firewalld 49 | 50 | dnf -y install http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm 51 | 52 | dnf config-manager --set-enabled powertools 53 | 54 | if [[ ${docker_allow} == 0 ]]; then 55 | dnf config-manager --set-disabled docker-ce-stable 56 | 57 | dnf -y remove containerd.io.x86_64 docker-ce.x86_64 docker-ce-cli.x86_64 docker-ce-rootless-extras.x86_64 58 | fi 59 | 60 | dnf -y --allowerasing install \ 61 | ohpc-slurm-server \ 62 | vim \ 63 | mailx \ 64 | lmod-ohpc \ 65 | bash-completion \ 66 | gnu9-compilers-ohpc \ 67 | openmpi4-gnu9-ohpc \ 68 | singularity-ohpc \ 69 | lmod-defaults-gnu9-openmpi4-ohpc \ 70 | moreutils \ 71 | bind-utils \ 72 | python3-pexpect 73 | 74 | pip3 install ansible 75 | mkdir -p /etc/ansible 76 | ln -s /usr/local/bin/ansible-playbook /usr/bin/ansible-playbook 77 | 78 | pip3 install openstacksdk==0.61.0 79 | pip3 install python-openstackclient 80 | 81 | dnf -y update # until the base python2-openstackclient install works out of the box! 82 | 83 | #create user that can be used to submit jobs 84 | [ ! -d /home/gateway-user ] && useradd -m gateway-user 85 | 86 | [ ! -f slurm-key ] && ssh-keygen -b 2048 -t rsa -P "" -f slurm-key 87 | 88 | # generate a local key for centos for after homedirs are mounted! 89 | [ ! -f /home/rocky/.ssh/id_rsa ] && su rocky - -c 'ssh-keygen -t rsa -b 2048 -P "" -f /home/rocky/.ssh/id_rsa && cat /home/rocky/.ssh/id_rsa.pub >> /home/rocky/.ssh/authorized_keys' 90 | 91 | 92 | #create clouds.yaml file from contents of openrc 93 | echo -e "clouds: 94 | openstack: 95 | auth: 96 | auth_url: '${OS_AUTH_URL}' 97 | application_credential_id: '${OS_APPLICATION_CREDENTIAL_ID}' 98 | application_credential_secret: '${OS_APPLICATION_CREDENTIAL_SECRET}' 99 | region_name: '${OS_REGION_NAME}' 100 | interface: '${OS_INTERFACE}' 101 | identity_api_version: '${OS_IDENTITY_API_VERSION}' 102 | auth_type: 'v3applicationcredential'" > clouds.yaml 103 | 104 | #Make sure only root can read this 105 | chmod 400 clouds.yaml 106 | 107 | if [[ -n $(openstack keypair list | grep ${OS_SLURM_KEYPAIR}) ]]; then 108 | openstack keypair delete ${OS_SLURM_KEYPAIR} 109 | openstack keypair create --public-key slurm-key.pub ${OS_SLURM_KEYPAIR} 110 | else 111 | openstack keypair create --public-key slurm-key.pub ${OS_SLURM_KEYPAIR} 112 | fi 113 | 114 | #TACC-specific changes: 115 | 116 | if [[ $OS_AUTH_URL =~ "tacc" ]]; then 117 | #Insert headnode into /etc/hosts 118 | echo "$(ip addr | grep -Eo '10.0.0.[0-9]*' | head -1) $(hostname) $(hostname -s)" >> /etc/hosts 119 | fi 120 | 121 | #Get OS Network name of *this* server, and set as the network for compute-nodes 122 | # Only need this if you've changed the subnet name for some reason 123 | #headnode_os_subnet=$(openstack server show $(hostname | cut -f 1 -d'.') | awk '/addresses/ {print $4}' | cut -f 1 -d'=') 124 | #sed -i "s/network_name=.*/network_name=$headnode_os_subnet/" ./slurm_resume.sh 125 | 126 | #Set compute node names to $OS_PREFIX-compute- 127 | sed -i "s/=compute-*/=${OS_PREFIX}-compute-/" ./slurm.conf 128 | sed -i "s/Host compute-*/Host ${OS_PREFIX}-compute-/" ./ssh.cfg 129 | 130 | #set the subnet in ssh.cfg and compute_build_base_img.yml 131 | sed -i "s/Host 10.0.0.\*/Host ${SUBNET_PREFIX}/" ./ssh.cfg 132 | sed -i 
"s/{{ ansible_facts.hostname }}-elastic-net/${HEADNODE_NETWORK}/" ./compute_build_base_img.yml 133 | 134 | # Deal with files required by slurm - better way to encapsulate this section? 135 | 136 | mkdir -p -m 700 /etc/slurm/.ssh 137 | 138 | cp slurm-key slurm-key.pub /etc/slurm/.ssh/ 139 | 140 | #Make sure slurm-user will still be valid after the nfs mount happens! 141 | cat slurm-key.pub >> /home/rocky/.ssh/authorized_keys 142 | 143 | chown -R slurm:slurm /etc/slurm/.ssh 144 | 145 | setfacl -m u:slurm:rw /etc/hosts 146 | setfacl -m u:slurm:rwx /etc/ 147 | 148 | chmod +t /etc 149 | 150 | #The following may be removed when appcred gen during cluster_create is working 151 | ##Possible to handle this at the cloud-init level? From a machine w/ 152 | ## pre-loaded openrc, possible via user-data and write_files, yes. 153 | ## This needs a check for success, and if not, fail? 154 | ##export $(openstack application credential create -f shell ${OS_APP_CRED} | sed 's/^\(.*\)/OS_ac_\1/') 155 | ##echo -e "export OS_AUTH_TYPE=v3applicationcredential 156 | ##export OS_AUTH_URL=${OS_AUTH_URL} 157 | ##export OS_IDENTITY_API_VERSION=3 158 | ##export OS_REGION_NAME="RegionOne" 159 | ##export OS_INTERFACE=public 160 | ##export OS_APPLICATION_CREDENTIAL_ID=${OS_ac_id} 161 | ##export OS_APPLICATION_CREDENTIAL_SECRET=${OS_ac_secret} > /etc/slurm/openrc.sh 162 | # 163 | #echo -e "export OS_PROJECT_DOMAIN_NAME=tacc 164 | #export OS_USER_DOMAIN_NAME=tacc 165 | #export OS_PROJECT_NAME=${OS_PROJECT_NAME} 166 | #export OS_USERNAME=${OS_USERNAME} 167 | #export OS_PASSWORD=${OS_PASSWORD} 168 | #export OS_AUTH_URL=${OS_AUTH_URL} 169 | #export OS_IDENTITY_API_VERSION=3" > /etc/slurm/openrc.sh 170 | 171 | #chown slurm:slurm /etc/slurm/openrc.sh 172 | 173 | #chmod 400 /etc/slurm/openrc.sh 174 | 175 | cp prevent-updates.ci /etc/slurm/ 176 | 177 | chown slurm:slurm /etc/slurm/openrc.sh 178 | chown slurm:slurm /etc/slurm/prevent-updates.ci 179 | 180 | mkdir -p /var/log/slurm 181 | 182 | touch /var/log/slurm/slurm_elastic.log 183 | touch /var/log/slurm/os_clean.log 184 | 185 | chown -R slurm:slurm /var/log/slurm 186 | 187 | cp slurm-logrotate.conf /etc/logrotate.d/slurm 188 | 189 | setfacl -m u:slurm:rw /etc/ansible/hosts 190 | setfacl -m u:slurm:rwx /etc/ansible/ 191 | 192 | cp slurm_*.sh /usr/local/sbin/ 193 | 194 | cp cron-node-check.sh /usr/local/sbin/ 195 | cp clean-os-error.sh /usr/local/sbin/ 196 | 197 | chown slurm:slurm /usr/local/sbin/slurm_*.sh 198 | chown slurm:slurm /usr/local/sbin/clean-os-error.sh 199 | 200 | chown rocky:rocky /usr/local/sbin/cron-node-check.sh 201 | 202 | echo "#13 */6 * * * rocky /usr/local/sbin/cron-node-check.sh" >> /etc/crontab 203 | echo "#*/4 * * * * slurm /usr/local/sbin/clean-os-error.sh" >> /etc/crontab 204 | 205 | #"dynamic" hostname adjustment 206 | sed -i "s/ControlMachine=slurm-example/ControlMachine=$(hostname -s)/" ./slurm.conf 207 | cp slurm.conf /etc/slurm/slurm.conf 208 | 209 | cp ansible.cfg /etc/ansible/ 210 | 211 | cp ssh.cfg /etc/ansible/ 212 | 213 | cp slurm_test.job ${HOME} 214 | 215 | #create share directory 216 | mkdir -m 777 -p /export 217 | 218 | #create export of homedirs and /export and /opt/ohpc/pub 219 | echo -e "/home ${SUBNET}(rw,no_root_squash) \n/export ${SUBNET}(rw,no_root_squash)" > /etc/exports 220 | echo -e "/opt/ohpc/pub ${SUBNET}(rw,no_root_squash)" >> /etc/exports 221 | 222 | #Get latest CentOS7 minimal image for base - if os_image_facts or the os API allowed for wildcards, 223 | # this would be different. But this is the world we live in. 
224 | # After the naming convention change of May 5, 2020, this is no longer necessary - JS-API-Featured-CentOS7-Latest is the default. 225 | # These lines remain as a testament to past struggles. 226 | #centos_base_image=$(openstack image list --status active | grep -iE "API-Featured-centos7-[[:alpha:]]{3,4}-[0-9]{2}-[0-9]{4}" | awk '{print $4}' | tail -n 1) 227 | #centos_base_image="JS-API-Featured-CentOS7-Latest" 228 | #sed -i "s/\(\s*compute_base_image: \).*/\1\"${centos_base_image}\"/" compute_build_base_img.yml | head -n 10 229 | 230 | #create temporary script to add local users 231 | echo "#!/bin/bash" > /tmp/add_users.sh 232 | cat /etc/passwd | awk -F':' '$4 >= 1001 && $4 < 65000 {print "useradd -M -u", $3, $1}' >> /tmp/add_users.sh 233 | 234 | # build instance for compute base image generation, take snapshot, and destroy it 235 | echo "Creating compute image! based on $centos_base_image" 236 | 237 | ansible-playbook -v --ssh-common-args='-o StrictHostKeyChecking=no' compute_build_base_img.yml 238 | 239 | #to allow other users to run ansible! 240 | rm -r /tmp/.ansible 241 | 242 | if [[ ${jhub_build} == 1 ]]; then 243 | ansible-galaxy collection install community.general 244 | ansible-galaxy collection install ansible.posix 245 | ansible-galaxy install geerlingguy.certbot 246 | # ansible-playbook -v --ssh-common-args='-o StrictHostKeyChecking=no' install_jupyterhub.yml 247 | fi 248 | 249 | #Start required services 250 | systemctl enable slurmctld munge nfs-server rpcbind 251 | systemctl restart munge slurmctld nfs-server rpcbind 252 | 253 | echo -e "If you wish to enable an email when node state is drain or down, please uncomment \nthe cron-node-check.sh job in /etc/crontab, and place your email of choice in the 'email_addr' variable \nat the beginning of /usr/local/sbin/cron-node-check.sh" 254 | -------------------------------------------------------------------------------- /jhub_files/https_redirect.conf.j2: -------------------------------------------------------------------------------- 1 | 2 | ServerName {{ headnode_public_hostname }} 3 | ServerAlias {{ headnode_alternate_hostname }} 4 | {% raw %} 5 | # redirect all port 80 traffic to 443 6 | RewriteEngine on 7 | ReWriteCond %{SERVER_PORT} !^443$ 8 | RewriteRule ^/(.*) https://%{HTTP_HOST}/$1 [NC,R,L] 9 | RewriteCond %{SERVER_NAME} {% endraw %} ={{ headnode_alternate_hostname }} 10 | {% raw %} 11 | RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,NE,R=permanent] 12 | RewriteCond %{SERVER_NAME} {% endraw %} ={{ headnode_public_hostname }} 13 | {% raw %} 14 | RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,NE,R=permanent] 15 | 16 | {% endraw %} 17 | -------------------------------------------------------------------------------- /jhub_files/jhub_conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for jupyterhub. 2 | 3 | #------------------------------------------------------------------------------ 4 | # Application(SingletonConfigurable) configuration 5 | #------------------------------------------------------------------------------ 6 | ## This is an application. 7 | 8 | ## The date format used by logging formatters for %(asctime)s 9 | # Default: '%Y-%m-%d %H:%M:%S' 10 | # c.Application.log_datefmt = '%Y-%m-%d %H:%M:%S' 11 | 12 | ## The Logging format template 13 | # Default: '[%(name)s]%(highlevel)s %(message)s' 14 | # c.Application.log_format = '[%(name)s]%(highlevel)s %(message)s' 15 | 16 | ## Set the log level by value or name. 
17 | # Choices: any of [0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL'] 18 | # Default: 30 19 | # c.Application.log_level = 30 20 | 21 | ## Instead of starting the Application, dump configuration to stdout 22 | # Default: False 23 | # c.Application.show_config = False 24 | 25 | ## Instead of starting the Application, dump configuration to stdout (as JSON) 26 | # Default: False 27 | # c.Application.show_config_json = False 28 | 29 | #------------------------------------------------------------------------------ 30 | # JupyterHub(Application) configuration 31 | #------------------------------------------------------------------------------ 32 | ## An Application for starting a Multi-User Jupyter Notebook server. 33 | 34 | ## Maximum number of concurrent servers that can be active at a time. 35 | # 36 | # Setting this can limit the total resources your users can consume. 37 | # 38 | # An active server is any server that's not fully stopped. It is considered 39 | # active from the time it has been requested until the time that it has 40 | # completely stopped. 41 | # 42 | # If this many user servers are active, users will not be able to launch new 43 | # servers until a server is shutdown. Spawn requests will be rejected with a 429 44 | # error asking them to try again. 45 | # 46 | # If set to 0, no limit is enforced. 47 | # Default: 0 48 | # c.JupyterHub.active_server_limit = 0 49 | 50 | ## Duration (in seconds) to determine the number of active users. 51 | # Default: 1800 52 | # c.JupyterHub.active_user_window = 1800 53 | 54 | ## Resolution (in seconds) for updating activity 55 | # 56 | # If activity is registered that is less than activity_resolution seconds more 57 | # recent than the current value, the new value will be ignored. 58 | # 59 | # This avoids too many writes to the Hub database. 60 | # Default: 30 61 | # c.JupyterHub.activity_resolution = 30 62 | 63 | ## Grant admin users permission to access single-user servers. 64 | # 65 | # Users should be properly informed if this is enabled. 66 | # Default: False 67 | # c.JupyterHub.admin_access = False 68 | 69 | ## DEPRECATED since version 0.7.2, use Authenticator.admin_users instead. 70 | # Default: set() 71 | # c.JupyterHub.admin_users = set() 72 | 73 | ## Allow named single-user servers per user 74 | # Default: False 75 | # c.JupyterHub.allow_named_servers = False 76 | 77 | ## Answer yes to any questions (e.g. confirm overwrite) 78 | # Default: False 79 | # c.JupyterHub.answer_yes = False 80 | 81 | ## PENDING DEPRECATION: consider using services 82 | # 83 | # Dict of token:username to be loaded into the database. 84 | # 85 | # Allows ahead-of-time generation of API tokens for use by externally managed 86 | # services, which authenticate as JupyterHub users. 87 | # 88 | # Consider using services for general services that talk to the JupyterHub API. 89 | # Default: {} 90 | # c.JupyterHub.api_tokens = {} 91 | 92 | ## Authentication for prometheus metrics 93 | # Default: True 94 | # c.JupyterHub.authenticate_prometheus = True 95 | 96 | ## Class for authenticating users. 97 | # 98 | # This should be a subclass of :class:`jupyterhub.auth.Authenticator` 99 | # 100 | # with an :meth:`authenticate` method that: 101 | # 102 | # - is a coroutine (asyncio or tornado) 103 | # - returns username on success, None on failure 104 | # - takes two arguments: (handler, data), 105 | # where `handler` is the calling web.RequestHandler, 106 | # and `data` is the POST form data from the login page. 107 | # 108 | # .. 
versionchanged:: 1.0 109 | # authenticators may be registered via entry points, 110 | # e.g. `c.JupyterHub.authenticator_class = 'pam'` 111 | # 112 | # Currently installed: 113 | # - default: jupyterhub.auth.PAMAuthenticator 114 | # - dummy: jupyterhub.auth.DummyAuthenticator 115 | # - pam: jupyterhub.auth.PAMAuthenticator 116 | # Default: 'jupyterhub.auth.PAMAuthenticator' 117 | # c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator' 118 | 119 | ## The base URL of the entire application. 120 | # 121 | # Add this to the beginning of all JupyterHub URLs. Use base_url to run 122 | # JupyterHub within an existing website. 123 | # 124 | # .. deprecated: 0.9 125 | # Use JupyterHub.bind_url 126 | # Default: '/' 127 | # c.JupyterHub.base_url = '/' 128 | 129 | ## The public facing URL of the whole JupyterHub application. 130 | # 131 | # This is the address on which the proxy will bind. Sets protocol, ip, base_url 132 | # Default: 'http://:8000' 133 | c.JupyterHub.bind_url = 'http://127.0.0.1:8000' 134 | 135 | ## Whether to shutdown the proxy when the Hub shuts down. 136 | # 137 | # Disable if you want to be able to teardown the Hub while leaving the proxy 138 | # running. 139 | # 140 | # Only valid if the proxy was starting by the Hub process. 141 | # 142 | # If both this and cleanup_servers are False, sending SIGINT to the Hub will 143 | # only shutdown the Hub, leaving everything else running. 144 | # 145 | # The Hub should be able to resume from database state. 146 | # Default: True 147 | # c.JupyterHub.cleanup_proxy = True 148 | 149 | ## Whether to shutdown single-user servers when the Hub shuts down. 150 | # 151 | # Disable if you want to be able to teardown the Hub while leaving the single- 152 | # user servers running. 153 | # 154 | # If both this and cleanup_proxy are False, sending SIGINT to the Hub will only 155 | # shutdown the Hub, leaving everything else running. 156 | # 157 | # The Hub should be able to resume from database state. 158 | # Default: True 159 | # c.JupyterHub.cleanup_servers = True 160 | 161 | ## Maximum number of concurrent users that can be spawning at a time. 162 | # 163 | # Spawning lots of servers at the same time can cause performance problems for 164 | # the Hub or the underlying spawning system. Set this limit to prevent bursts of 165 | # logins from attempting to spawn too many servers at the same time. 166 | # 167 | # This does not limit the number of total running servers. See 168 | # active_server_limit for that. 169 | # 170 | # If more than this many users attempt to spawn at a time, their requests will 171 | # be rejected with a 429 error asking them to try again. Users will have to wait 172 | # for some of the spawning services to finish starting before they can start 173 | # their own. 174 | # 175 | # If set to 0, no limit is enforced. 176 | # Default: 100 177 | # c.JupyterHub.concurrent_spawn_limit = 100 178 | 179 | ## The config file to load 180 | # Default: 'jupyterhub_config.py' 181 | # c.JupyterHub.config_file = 'jupyterhub_config.py' 182 | 183 | ## DEPRECATED: does nothing 184 | # Default: False 185 | # c.JupyterHub.confirm_no_ssl = False 186 | 187 | ## Number of days for a login cookie to be valid. Default is two weeks. 188 | # Default: 14 189 | # c.JupyterHub.cookie_max_age_days = 14 190 | 191 | ## The cookie secret to use to encrypt cookies. 192 | # 193 | # Loaded from the JPY_COOKIE_SECRET env variable by default. 194 | # 195 | # Should be exactly 256 bits (32 bytes). 
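# (Illustrative addition, not part of the generated defaults): a suitable
# 32-byte value can be generated with e.g. `openssl rand -hex 32` and supplied
# via the JPY_COOKIE_SECRET environment variable or the cookie_secret_file
# option below.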
196 | # Default: b'' 197 | # c.JupyterHub.cookie_secret = b'' 198 | 199 | ## File in which to store the cookie secret. 200 | # Default: 'jupyterhub_cookie_secret' 201 | # c.JupyterHub.cookie_secret_file = 'jupyterhub_cookie_secret' 202 | 203 | ## The location of jupyterhub data files (e.g. /usr/local/share/jupyterhub) 204 | # Default: '/opt/python3/share/jupyterhub' 205 | # c.JupyterHub.data_files_path = '/opt/python3/share/jupyterhub' 206 | 207 | ## Include any kwargs to pass to the database connection. See 208 | # sqlalchemy.create_engine for details. 209 | # Default: {} 210 | # c.JupyterHub.db_kwargs = {} 211 | 212 | ## url for the database. e.g. `sqlite:///jupyterhub.sqlite` 213 | # Default: 'sqlite:///jupyterhub.sqlite' 214 | # c.JupyterHub.db_url = 'sqlite:///jupyterhub.sqlite' 215 | 216 | ## log all database transactions. This has A LOT of output 217 | # Default: False 218 | # c.JupyterHub.debug_db = False 219 | 220 | ## DEPRECATED since version 0.8: Use ConfigurableHTTPProxy.debug 221 | # Default: False 222 | # c.JupyterHub.debug_proxy = False 223 | 224 | ## If named servers are enabled, default name of server to spawn or open, e.g. by 225 | # user-redirect. 226 | # Default: '' 227 | # c.JupyterHub.default_server_name = '' 228 | 229 | ## The default URL for users when they arrive (e.g. when user directs to "/") 230 | # 231 | # By default, redirects users to their own server. 232 | # 233 | # Can be a Unicode string (e.g. '/hub/home') or a callable based on the handler 234 | # object: 235 | # 236 | # :: 237 | # 238 | # def default_url_fn(handler): 239 | # user = handler.current_user 240 | # if user and user.admin: 241 | # return '/hub/admin' 242 | # return '/hub/home' 243 | # 244 | # c.JupyterHub.default_url = default_url_fn 245 | # Default: traitlets.Undefined 246 | # c.JupyterHub.default_url = traitlets.Undefined 247 | 248 | ## Dict authority:dict(files). Specify the key, cert, and/or ca file for an 249 | # authority. This is useful for externally managed proxies that wish to use 250 | # internal_ssl. 251 | # 252 | # The files dict has this format (you must specify at least a cert):: 253 | # 254 | # { 255 | # 'key': '/path/to/key.key', 256 | # 'cert': '/path/to/cert.crt', 257 | # 'ca': '/path/to/ca.crt' 258 | # } 259 | # 260 | # The authorities you can override: 'hub-ca', 'notebooks-ca', 'proxy-api-ca', 261 | # 'proxy-client-ca', and 'services-ca'. 262 | # 263 | # Use with internal_ssl 264 | # Default: {} 265 | # c.JupyterHub.external_ssl_authorities = {} 266 | 267 | ## Register extra tornado Handlers for jupyterhub. 268 | # 269 | # Should be of the form ``("", Handler)`` 270 | # 271 | # The Hub prefix will be added, so `/my-page` will be served at `/hub/my-page`. 272 | # Default: [] 273 | # c.JupyterHub.extra_handlers = [] 274 | 275 | ## DEPRECATED: use output redirection instead, e.g. 276 | # 277 | # jupyterhub &>> /var/log/jupyterhub.log 278 | # Default: '' 279 | # c.JupyterHub.extra_log_file = '' 280 | 281 | ## Extra log handlers to set on JupyterHub logger 282 | # Default: [] 283 | # c.JupyterHub.extra_log_handlers = [] 284 | 285 | ## Generate certs used for internal ssl 286 | # Default: False 287 | # c.JupyterHub.generate_certs = False 288 | 289 | ## Generate default config file 290 | # Default: False 291 | # c.JupyterHub.generate_config = False 292 | 293 | ## The URL on which the Hub will listen. This is a private URL for internal 294 | # communication. Typically set in combination with hub_connect_url. If a unix 295 | # socket, hub_connect_url **must** also be set. 
296 | # 297 | # For example: 298 | # 299 | # "http://127.0.0.1:8081" 300 | # "unix+http://%2Fsrv%2Fjupyterhub%2Fjupyterhub.sock" 301 | # 302 | # .. versionadded:: 0.9 303 | # Default: '' 304 | # c.JupyterHub.hub_bind_url = '' 305 | 306 | ## The ip or hostname for proxies and spawners to use for connecting to the Hub. 307 | # 308 | # Use when the bind address (`hub_ip`) is 0.0.0.0, :: or otherwise different 309 | # from the connect address. 310 | # 311 | # Default: when `hub_ip` is 0.0.0.0 or ::, use `socket.gethostname()`, otherwise 312 | # use `hub_ip`. 313 | # 314 | # Note: Some spawners or proxy implementations might not support hostnames. 315 | # Check your spawner or proxy documentation to see if they have extra 316 | # requirements. 317 | # 318 | # .. versionadded:: 0.8 319 | # Default: '' 320 | # c.JupyterHub.hub_connect_ip = '' 321 | 322 | ## DEPRECATED 323 | # 324 | # Use hub_connect_url 325 | # 326 | # .. versionadded:: 0.8 327 | # 328 | # .. deprecated:: 0.9 329 | # Use hub_connect_url 330 | # Default: 0 331 | # c.JupyterHub.hub_connect_port = 0 332 | 333 | ## The URL for connecting to the Hub. Spawners, services, and the proxy will use 334 | # this URL to talk to the Hub. 335 | # 336 | # Only needs to be specified if the default hub URL is not connectable (e.g. 337 | # using a unix+http:// bind url). 338 | # 339 | # .. seealso:: 340 | # JupyterHub.hub_connect_ip 341 | # JupyterHub.hub_bind_url 342 | # 343 | # .. versionadded:: 0.9 344 | # Default: '' 345 | # c.JupyterHub.hub_connect_url = '' 346 | 347 | ## The ip address for the Hub process to *bind* to. 348 | # 349 | # By default, the hub listens on localhost only. This address must be accessible 350 | # from the proxy and user servers. You may need to set this to a public ip or '' 351 | # for all interfaces if the proxy or user servers are in containers or on a 352 | # different host. 353 | # 354 | # See `hub_connect_ip` for cases where the bind and connect address should 355 | # differ, or `hub_bind_url` for setting the full bind URL. 356 | # Default: '127.0.0.1' 357 | #jecoulte - this is the ip used by the singlespawner instance of a job to connect to the hub 358 | c.JupyterHub.hub_ip = '{{ headnode_ip }}' #JEC_HEADNODE_IP 359 | #c.JupyterHub.hub_ip = '127.0.0.1' 360 | 361 | ## The internal port for the Hub process. 362 | # 363 | # This is the internal port of the hub itself. It should never be accessed 364 | # directly. See JupyterHub.port for the public port to use when accessing 365 | # jupyterhub. It is rare that this port should be set except in cases of port 366 | # conflict. 367 | # 368 | # See also `hub_ip` for the ip and `hub_bind_url` for setting the full bind URL. 369 | # Default: 8081 370 | # c.JupyterHub.hub_port = 8081 371 | 372 | ## Trigger implicit spawns after this many seconds. 373 | # 374 | # When a user visits a URL for a server that's not running, they are shown a 375 | # page indicating that the requested server is not running with a button to 376 | # spawn the server. 377 | # 378 | # Setting this to a positive value will redirect the user after this many 379 | # seconds, effectively clicking this button automatically for the users, 380 | # automatically beginning the spawn process. 381 | # 382 | # Warning: this can result in errors and surprising behavior when sharing access 383 | # URLs to actual servers, since the wrong server is likely to be started. 
384 | # Default: 0 385 | # c.JupyterHub.implicit_spawn_seconds = 0 386 | 387 | ## Timeout (in seconds) to wait for spawners to initialize 388 | # 389 | # Checking if spawners are healthy can take a long time if many spawners are 390 | # active at hub start time. 391 | # 392 | # If it takes longer than this timeout to check, init_spawner will be left to 393 | # complete in the background and the http server is allowed to start. 394 | # 395 | # A timeout of -1 means wait forever, which can mean a slow startup of the Hub 396 | # but ensures that the Hub is fully consistent by the time it starts responding 397 | # to requests. This matches the behavior of jupyterhub 1.0. 398 | # 399 | # .. versionadded: 1.1.0 400 | # Default: 10 401 | # c.JupyterHub.init_spawners_timeout = 10 402 | 403 | ## The location to store certificates automatically created by JupyterHub. 404 | # 405 | # Use with internal_ssl 406 | # Default: 'internal-ssl' 407 | # c.JupyterHub.internal_certs_location = 'internal-ssl' 408 | 409 | ## Enable SSL for all internal communication 410 | # 411 | # This enables end-to-end encryption between all JupyterHub components. 412 | # JupyterHub will automatically create the necessary certificate authority and 413 | # sign notebook certificates as they're created. 414 | # Default: False 415 | # c.JupyterHub.internal_ssl = False 416 | 417 | ## The public facing ip of the whole JupyterHub application (specifically 418 | # referred to as the proxy). 419 | # 420 | # This is the address on which the proxy will listen. The default is to listen 421 | # on all interfaces. This is the only address through which JupyterHub should be 422 | # accessed by users. 423 | # 424 | # .. deprecated: 0.9 425 | # Use JupyterHub.bind_url 426 | # Default: '' 427 | # c.JupyterHub.ip = '' 428 | 429 | ## Supply extra arguments that will be passed to Jinja environment. 430 | # Default: {} 431 | # c.JupyterHub.jinja_environment_options = {} 432 | 433 | ## Interval (in seconds) at which to update last-activity timestamps. 434 | # Default: 300 435 | # c.JupyterHub.last_activity_interval = 300 436 | 437 | ## Dict of 'group': ['usernames'] to load at startup. 438 | # 439 | # This strictly *adds* groups and users to groups. 440 | # 441 | # Loading one set of groups, then starting JupyterHub again with a different set 442 | # will not remove users or groups from previous launches. That must be done 443 | # through the API. 444 | # Default: {} 445 | # c.JupyterHub.load_groups = {} 446 | 447 | ## The date format used by logging formatters for %(asctime)s 448 | # See also: Application.log_datefmt 449 | # c.JupyterHub.log_datefmt = '%Y-%m-%d %H:%M:%S' 450 | 451 | ## The Logging format template 452 | # See also: Application.log_format 453 | # c.JupyterHub.log_format = '[%(name)s]%(highlevel)s %(message)s' 454 | 455 | ## Set the log level by value or name. 456 | # See also: Application.log_level 457 | # c.JupyterHub.log_level = 30 458 | 459 | ## Specify path to a logo image to override the Jupyter logo in the banner. 460 | # Default: '' 461 | # c.JupyterHub.logo_file = '' 462 | 463 | ## Maximum number of concurrent named servers that can be created by a user at a 464 | # time. 465 | # 466 | # Setting this can limit the total resources a user can consume. 467 | # 468 | # If set to 0, no limit is enforced. 469 | # Default: 0 470 | # c.JupyterHub.named_server_limit_per_user = 0 471 | 472 | ## File to write PID Useful for daemonizing JupyterHub. 
473 | # Default: '' 474 | # c.JupyterHub.pid_file = '' 475 | 476 | ## The public facing port of the proxy. 477 | # 478 | # This is the port on which the proxy will listen. This is the only port through 479 | # which JupyterHub should be accessed by users. 480 | # 481 | # .. deprecated: 0.9 482 | # Use JupyterHub.bind_url 483 | # Default: 8000 484 | # c.JupyterHub.port = 8000 485 | 486 | ## DEPRECATED since version 0.8 : Use ConfigurableHTTPProxy.api_url 487 | # Default: '' 488 | # c.JupyterHub.proxy_api_ip = '' 489 | 490 | ## DEPRECATED since version 0.8 : Use ConfigurableHTTPProxy.api_url 491 | # Default: 0 492 | # c.JupyterHub.proxy_api_port = 0 493 | 494 | ## DEPRECATED since version 0.8: Use ConfigurableHTTPProxy.auth_token 495 | # Default: '' 496 | # c.JupyterHub.proxy_auth_token = '' 497 | 498 | ## Interval (in seconds) at which to check if the proxy is running. 499 | # Default: 30 500 | # c.JupyterHub.proxy_check_interval = 30 501 | 502 | ## The class to use for configuring the JupyterHub proxy. 503 | # 504 | # Should be a subclass of :class:`jupyterhub.proxy.Proxy`. 505 | # 506 | # .. versionchanged:: 1.0 507 | # proxies may be registered via entry points, 508 | # e.g. `c.JupyterHub.proxy_class = 'traefik'` 509 | # 510 | # Currently installed: 511 | # - configurable-http-proxy: jupyterhub.proxy.ConfigurableHTTPProxy 512 | # - default: jupyterhub.proxy.ConfigurableHTTPProxy 513 | # Default: 'jupyterhub.proxy.ConfigurableHTTPProxy' 514 | # c.JupyterHub.proxy_class = 'jupyterhub.proxy.ConfigurableHTTPProxy' 515 | 516 | ## DEPRECATED since version 0.8. Use ConfigurableHTTPProxy.command 517 | # Default: [] 518 | # c.JupyterHub.proxy_cmd = [] 519 | 520 | ## Recreate all certificates used within JupyterHub on restart. 521 | # 522 | # Note: enabling this feature requires restarting all notebook servers. 523 | # 524 | # Use with internal_ssl 525 | # Default: False 526 | # c.JupyterHub.recreate_internal_certs = False 527 | 528 | ## Redirect user to server (if running), instead of control panel. 529 | # Default: True 530 | # c.JupyterHub.redirect_to_server = True 531 | 532 | ## Purge and reset the database. 533 | # Default: False 534 | # c.JupyterHub.reset_db = False 535 | 536 | ## Interval (in seconds) at which to check connectivity of services with web 537 | # endpoints. 538 | # Default: 60 539 | # c.JupyterHub.service_check_interval = 60 540 | 541 | ## Dict of token:servicename to be loaded into the database. 542 | # 543 | # Allows ahead-of-time generation of API tokens for use by externally managed 544 | # services. 545 | # Default: {} 546 | # c.JupyterHub.service_tokens = {} 547 | 548 | ## List of service specification dictionaries. 
549 | # 550 | # A service 551 | # 552 | # For instance:: 553 | # 554 | # services = [ 555 | # { 556 | # 'name': 'cull_idle', 557 | # 'command': ['/path/to/cull_idle_servers.py'], 558 | # }, 559 | # { 560 | # 'name': 'formgrader', 561 | # 'url': 'http://127.0.0.1:1234', 562 | # 'api_token': 'super-secret', 563 | # 'environment': 564 | # } 565 | # ] 566 | # Default: [] 567 | # c.JupyterHub.services = [] 568 | 569 | ## Instead of starting the Application, dump configuration to stdout 570 | # See also: Application.show_config 571 | # c.JupyterHub.show_config = False 572 | 573 | ## Instead of starting the Application, dump configuration to stdout (as JSON) 574 | # See also: Application.show_config_json 575 | # c.JupyterHub.show_config_json = False 576 | 577 | ## Shuts down all user servers on logout 578 | # Default: False 579 | # c.JupyterHub.shutdown_on_logout = False 580 | 581 | ## The class to use for spawning single-user servers. 582 | # 583 | # Should be a subclass of :class:`jupyterhub.spawner.Spawner`. 584 | # 585 | # .. versionchanged:: 1.0 586 | # spawners may be registered via entry points, 587 | # e.g. `c.JupyterHub.spawner_class = 'localprocess'` 588 | # 589 | # Currently installed: 590 | # - default: jupyterhub.spawner.LocalProcessSpawner 591 | # - localprocess: jupyterhub.spawner.LocalProcessSpawner 592 | # - simple: jupyterhub.spawner.SimpleLocalProcessSpawner 593 | # Default: 'jupyterhub.spawner.LocalProcessSpawner' 594 | # c.JupyterHub.spawner_class = 'jupyterhub.spawner.LocalProcessSpawner' 595 | 596 | ## Path to SSL certificate file for the public facing interface of the proxy 597 | # 598 | # When setting this, you should also set ssl_key 599 | # Default: '' 600 | # c.JupyterHub.ssl_cert = '' 601 | 602 | ## Path to SSL key file for the public facing interface of the proxy 603 | # 604 | # When setting this, you should also set ssl_cert 605 | # Default: '' 606 | # c.JupyterHub.ssl_key = '' 607 | 608 | ## Host to send statsd metrics to. An empty string (the default) disables sending 609 | # metrics. 610 | # Default: '' 611 | # c.JupyterHub.statsd_host = '' 612 | 613 | ## Port on which to send statsd metrics about the hub 614 | # Default: 8125 615 | # c.JupyterHub.statsd_port = 8125 616 | 617 | ## Prefix to use for all metrics sent by jupyterhub to statsd 618 | # Default: 'jupyterhub' 619 | # c.JupyterHub.statsd_prefix = 'jupyterhub' 620 | 621 | ## Run single-user servers on subdomains of this host. 622 | # 623 | # This should be the full `https://hub.domain.tld[:port]`. 624 | # 625 | # Provides additional cross-site protections for javascript served by single- 626 | # user servers. 627 | # 628 | # Requires `.hub.domain.tld` to resolve to the same host as 629 | # `hub.domain.tld`. 630 | # 631 | # In general, this is most easily achieved with wildcard DNS. 632 | # 633 | # When using SSL (i.e. always) this also requires a wildcard SSL certificate. 634 | # Default: '' 635 | # c.JupyterHub.subdomain_host = '' 636 | 637 | ## Paths to search for jinja templates, before using the default templates. 638 | # Default: [] 639 | # c.JupyterHub.template_paths = [] 640 | 641 | ## Extra variables to be passed into jinja templates 642 | # Default: {} 643 | # c.JupyterHub.template_vars = {} 644 | 645 | ## Extra settings overrides to pass to the tornado application. 646 | # Default: {} 647 | # c.JupyterHub.tornado_settings = {} 648 | 649 | ## Trust user-provided tokens (via JupyterHub.service_tokens) to have good 650 | # entropy. 
651 | # 652 | # If you are not inserting additional tokens via configuration file, this flag 653 | # has no effect. 654 | # 655 | # In JupyterHub 0.8, internally generated tokens do not pass through additional 656 | # hashing because the hashing is costly and does not increase the entropy of 657 | # already-good UUIDs. 658 | # 659 | # User-provided tokens, on the other hand, are not trusted to have good entropy 660 | # by default, and are passed through many rounds of hashing to stretch the 661 | # entropy of the key (i.e. user-provided tokens are treated as passwords instead 662 | # of random keys). These keys are more costly to check. 663 | # 664 | # If your inserted tokens are generated by a good-quality mechanism, e.g. 665 | # `openssl rand -hex 32`, then you can set this flag to True to reduce the cost 666 | # of checking authentication tokens. 667 | # Default: False 668 | # c.JupyterHub.trust_user_provided_tokens = False 669 | 670 | ## Names to include in the subject alternative name. 671 | # 672 | # These names will be used for server name verification. This is useful if 673 | # JupyterHub is being run behind a reverse proxy or services using ssl are on 674 | # different hosts. 675 | # 676 | # Use with internal_ssl 677 | # Default: [] 678 | # c.JupyterHub.trusted_alt_names = [] 679 | 680 | ## Downstream proxy IP addresses to trust. 681 | # 682 | # This sets the list of IP addresses that are trusted and skipped when 683 | # processing the `X-Forwarded-For` header. For example, if an external proxy is 684 | # used for TLS termination, its IP address should be added to this list to 685 | # ensure the correct client IP addresses are recorded in the logs instead of the 686 | # proxy server's IP address. 687 | # Default: [] 688 | # c.JupyterHub.trusted_downstream_ips = [] 689 | 690 | ## Upgrade the database automatically on start. 691 | # 692 | # Only safe if database is regularly backed up. Only SQLite databases will be 693 | # backed up to a local file automatically. 694 | # Default: False 695 | # c.JupyterHub.upgrade_db = False 696 | 697 | ## Callable to affect behavior of /user-redirect/ 698 | # 699 | # Receives 4 parameters: 1. path - URL path that was provided after /user- 700 | # redirect/ 2. request - A Tornado HTTPServerRequest representing the current 701 | # request. 3. user - The currently authenticated user. 4. base_url - The 702 | # base_url of the current hub, for relative redirects 703 | # 704 | # It should return the new URL to redirect to, or None to preserve current 705 | # behavior. 706 | # Default: None 707 | # c.JupyterHub.user_redirect_hook = None 708 | 709 | #------------------------------------------------------------------------------ 710 | # Spawner(LoggingConfigurable) configuration 711 | #------------------------------------------------------------------------------ 712 | ## Base class for spawning single-user notebook servers. 713 | # 714 | # Subclass this, and override the following methods: 715 | # 716 | # - load_state - get_state - start - stop - poll 717 | # 718 | # As JupyterHub supports multiple users, an instance of the Spawner subclass is 719 | # created for each user. If there are 20 JupyterHub users, there will be 20 720 | # instances of the subclass. 721 | 722 | ## Extra arguments to be passed to the single-user server. 723 | # 724 | # Some spawners allow shell-style expansion here, allowing you to use 725 | # environment variables here. Most, including the default, do not. Consult the 726 | # documentation for your spawner to verify! 
727 | # Default: [] 728 | # c.Spawner.args = [] 729 | 730 | ## An optional hook function that you can implement to pass `auth_state` to the 731 | # spawner after it has been initialized but before it starts. The `auth_state` 732 | # dictionary may be set by the `.authenticate()` method of the authenticator. 733 | # This hook enables you to pass some or all of that information to your spawner. 734 | # 735 | # Example:: 736 | # 737 | # def userdata_hook(spawner, auth_state): 738 | # spawner.userdata = auth_state["userdata"] 739 | # 740 | # c.Spawner.auth_state_hook = userdata_hook 741 | # Default: None 742 | # c.Spawner.auth_state_hook = None 743 | 744 | ## The command used for starting the single-user server. 745 | # 746 | # Provide either a string or a list containing the path to the startup script 747 | # command. Extra arguments, other than this path, should be provided via `args`. 748 | # 749 | # This is usually set if you want to start the single-user server in a different 750 | # python environment (with virtualenv/conda) than JupyterHub itself. 751 | # 752 | # Some spawners allow shell-style expansion here, allowing you to use 753 | # environment variables. Most, including the default, do not. Consult the 754 | # documentation for your spawner to verify! 755 | # Default: ['jupyterhub-singleuser'] 756 | # c.Spawner.cmd = ['jupyterhub-singleuser'] 757 | 758 | ## Maximum number of consecutive failures to allow before shutting down 759 | # JupyterHub. 760 | # 761 | # This helps JupyterHub recover from a certain class of problem preventing 762 | # launch in contexts where the Hub is automatically restarted (e.g. systemd, 763 | # docker, kubernetes). 764 | # 765 | # A limit of 0 means no limit and consecutive failures will not be tracked. 766 | # Default: 0 767 | # c.Spawner.consecutive_failure_limit = 0 768 | 769 | ## Minimum number of cpu-cores a single-user notebook server is guaranteed to 770 | # have available. 771 | # 772 | # If this value is set to 0.5, allows use of 50% of one CPU. If this value is 773 | # set to 2, allows use of up to 2 CPUs. 774 | # 775 | # **This is a configuration setting. Your spawner must implement support for the 776 | # limit to work.** The default spawner, `LocalProcessSpawner`, does **not** 777 | # implement this support. A custom spawner **must** add support for this setting 778 | # for it to be enforced. 779 | # Default: None 780 | # c.Spawner.cpu_guarantee = None 781 | 782 | ## Maximum number of cpu-cores a single-user notebook server is allowed to use. 783 | # 784 | # If this value is set to 0.5, allows use of 50% of one CPU. If this value is 785 | # set to 2, allows use of up to 2 CPUs. 786 | # 787 | # The single-user notebook server will never be scheduled by the kernel to use 788 | # more cpu-cores than this. There is no guarantee that it can access this many 789 | # cpu-cores. 790 | # 791 | # **This is a configuration setting. Your spawner must implement support for the 792 | # limit to work.** The default spawner, `LocalProcessSpawner`, does **not** 793 | # implement this support. A custom spawner **must** add support for this setting 794 | # for it to be enforced. 795 | # Default: None 796 | # c.Spawner.cpu_limit = None 797 | 798 | ## Enable debug-logging of the single-user server 799 | # Default: False 800 | # c.Spawner.debug = False 801 | 802 | ## The URL the single-user server should start in. 
803 | # 804 | # `{username}` will be expanded to the user's username 805 | # 806 | # Example uses: 807 | # 808 | # - You can set `notebook_dir` to `/` and `default_url` to `/tree/home/{username}` to allow people to 809 | # navigate the whole filesystem from their notebook server, but still start in their home directory. 810 | # - Start with `/notebooks` instead of `/tree` if `default_url` points to a notebook instead of a directory. 811 | # - You can set this to `/lab` to have JupyterLab start by default, rather than Jupyter Notebook. 812 | # Default: '' 813 | # c.Spawner.default_url = '' 814 | 815 | ## Disable per-user configuration of single-user servers. 816 | # 817 | # When starting the user's single-user server, any config file found in the 818 | # user's $HOME directory will be ignored. 819 | # 820 | # Note: a user could circumvent this if the user modifies their Python 821 | # environment, such as when they have their own conda environments / virtualenvs 822 | # / containers. 823 | # Default: False 824 | # c.Spawner.disable_user_config = False 825 | 826 | ## List of environment variables for the single-user server to inherit from the 827 | # JupyterHub process. 828 | # 829 | # This list is used to ensure that sensitive information in the JupyterHub 830 | # process's environment (such as `CONFIGPROXY_AUTH_TOKEN`) is not passed to the 831 | # single-user server's process. 832 | # Default: ['PATH', 'PYTHONPATH', 'CONDA_ROOT', 'CONDA_DEFAULT_ENV', 'VIRTUAL_ENV', 'LANG', 'LC_ALL', 'JUPYTERHUB_SINGLEUSER_APP'] 833 | # c.Spawner.env_keep = ['PATH', 'PYTHONPATH', 'CONDA_ROOT', 'CONDA_DEFAULT_ENV', 'VIRTUAL_ENV', 'LANG', 'LC_ALL', 'JUPYTERHUB_SINGLEUSER_APP'] 834 | 835 | ## Extra environment variables to set for the single-user server's process. 836 | # 837 | # Environment variables that end up in the single-user server's process come from 3 sources: 838 | # - This `environment` configurable 839 | # - The JupyterHub process' environment variables that are listed in `env_keep` 840 | # - Variables to establish contact between the single-user notebook and the hub (such as JUPYTERHUB_API_TOKEN) 841 | # 842 | # The `environment` configurable should be set by JupyterHub administrators to 843 | # add installation specific environment variables. It is a dict where the key is 844 | # the name of the environment variable, and the value can be a string or a 845 | # callable. If it is a callable, it will be called with one parameter (the 846 | # spawner instance), and should return a string fairly quickly (no blocking 847 | # operations please!). 848 | # 849 | # Note that the spawner class' interface is not guaranteed to be exactly same 850 | # across upgrades, so if you are using the callable take care to verify it 851 | # continues to work after upgrades! 852 | # 853 | # .. versionchanged:: 1.2 854 | # environment from this configuration has highest priority, 855 | # allowing override of 'default' env variables, 856 | # such as JUPYTERHUB_API_URL. 857 | # Default: {} 858 | # c.Spawner.environment = {} 859 | 860 | ## Timeout (in seconds) before giving up on a spawned HTTP server 861 | # 862 | # Once a server has successfully been spawned, this is the amount of time we 863 | # wait before assuming that the server is unable to accept connections. 864 | # Default: 30 865 | # c.Spawner.http_timeout = 30 866 | 867 | ## The IP address (or hostname) the single-user server should listen on. 868 | # 869 | # The JupyterHub proxy implementation should be able to send packets to this 870 | # interface. 
871 | # Default: '' 872 | # c.Spawner.ip = '' 873 | 874 | ## Minimum number of bytes a single-user notebook server is guaranteed to have 875 | # available. 876 | # 877 | # Allows the following suffixes: 878 | # - K -> Kilobytes 879 | # - M -> Megabytes 880 | # - G -> Gigabytes 881 | # - T -> Terabytes 882 | # 883 | # **This is a configuration setting. Your spawner must implement support for the 884 | # limit to work.** The default spawner, `LocalProcessSpawner`, does **not** 885 | # implement this support. A custom spawner **must** add support for this setting 886 | # for it to be enforced. 887 | # Default: None 888 | # c.Spawner.mem_guarantee = None 889 | 890 | ## Maximum number of bytes a single-user notebook server is allowed to use. 891 | # 892 | # Allows the following suffixes: 893 | # - K -> Kilobytes 894 | # - M -> Megabytes 895 | # - G -> Gigabytes 896 | # - T -> Terabytes 897 | # 898 | # If the single user server tries to allocate more memory than this, it will 899 | # fail. There is no guarantee that the single-user notebook server will be able 900 | # to allocate this much memory - only that it can not allocate more than this. 901 | # 902 | # **This is a configuration setting. Your spawner must implement support for the 903 | # limit to work.** The default spawner, `LocalProcessSpawner`, does **not** 904 | # implement this support. A custom spawner **must** add support for this setting 905 | # for it to be enforced. 906 | # Default: None 907 | # c.Spawner.mem_limit = None 908 | 909 | ## Path to the notebook directory for the single-user server. 910 | # 911 | # The user sees a file listing of this directory when the notebook interface is 912 | # started. The current interface does not easily allow browsing beyond the 913 | # subdirectories in this directory's tree. 914 | # 915 | # `~` will be expanded to the home directory of the user, and {username} will be 916 | # replaced with the name of the user. 917 | # 918 | # Note that this does *not* prevent users from accessing files outside of this 919 | # path! They can do so with many other means. 920 | # Default: '' 921 | # c.Spawner.notebook_dir = '' 922 | 923 | ## An HTML form for options a user can specify on launching their server. 924 | # 925 | # The surrounding `
<form>` element and the submit button are already provided. 926 | # 927 | # For example: 928 | # 929 | # .. code:: html 930 | # 931 | # Set your key: 932 | # <input name="keyword" val="default_key"></input> 933 | # <br>
934 | # Choose a letter: 935 | # 939 | # 940 | # The data from this form submission will be passed on to your spawner in 941 | # `self.user_options` 942 | # 943 | # Instead of a form snippet string, this could also be a callable that takes as 944 | # one parameter the current spawner instance and returns a string. The callable 945 | # will be called asynchronously if it returns a future, rather than a str. Note 946 | # that the interface of the spawner class is not deemed stable across versions, 947 | # so using this functionality might cause your JupyterHub upgrades to break. 948 | # Default: traitlets.Undefined 949 | # c.Spawner.options_form = traitlets.Undefined 950 | 951 | ## Interval (in seconds) on which to poll the spawner for single-user server's 952 | # status. 953 | # 954 | # At every poll interval, each spawner's `.poll` method is called, which checks 955 | # if the single-user server is still running. If it isn't running, then 956 | # JupyterHub modifies its own state accordingly and removes appropriate routes 957 | # from the configurable proxy. 958 | # Default: 30 959 | # c.Spawner.poll_interval = 30 960 | 961 | ## The port for single-user servers to listen on. 962 | # 963 | # Defaults to `0`, which uses a randomly allocated port number each time. 964 | # 965 | # If set to a non-zero value, all Spawners will use the same port, which only 966 | # makes sense if each server is on a different address, e.g. in containers. 967 | # 968 | # New in version 0.7. 969 | # Default: 0 970 | # c.Spawner.port = 0 971 | 972 | ## An optional hook function that you can implement to do work after the spawner 973 | # stops. 974 | # 975 | # This can be set independent of any concrete spawner implementation. 976 | # Default: None 977 | # c.Spawner.post_stop_hook = None 978 | 979 | ## An optional hook function that you can implement to do some bootstrapping work 980 | # before the spawner starts. For example, create a directory for your user or 981 | # load initial content. 982 | # 983 | # This can be set independent of any concrete spawner implementation. 984 | # 985 | # This maybe a coroutine. 986 | # 987 | # Example:: 988 | # 989 | # from subprocess import check_call 990 | # def my_hook(spawner): 991 | # username = spawner.user.name 992 | # check_call(['./examples/bootstrap-script/bootstrap.sh', username]) 993 | # 994 | # c.Spawner.pre_spawn_hook = my_hook 995 | # Default: None 996 | # c.Spawner.pre_spawn_hook = None 997 | 998 | ## List of SSL alt names 999 | # 1000 | # May be set in config if all spawners should have the same value(s), or set at 1001 | # runtime by Spawner that know their names. 1002 | # Default: [] 1003 | # c.Spawner.ssl_alt_names = [] 1004 | 1005 | ## Whether to include DNS:localhost, IP:127.0.0.1 in alt names 1006 | # Default: True 1007 | # c.Spawner.ssl_alt_names_include_local = True 1008 | 1009 | ## Timeout (in seconds) before giving up on starting of single-user server. 1010 | # 1011 | # This is the timeout for start to return, not the timeout for the server to 1012 | # respond. Callers of spawner.start will assume that startup has failed if it 1013 | # takes longer than this. start should return when the server process is started 1014 | # and its location is known. 
1015 | # Default: 60 1016 | # c.Spawner.start_timeout = 60 1017 | 1018 | #------------------------------------------------------------------------------ 1019 | # Authenticator(LoggingConfigurable) configuration 1020 | #------------------------------------------------------------------------------ 1021 | ## Base class for implementing an authentication provider for JupyterHub 1022 | 1023 | ## Set of users that will have admin rights on this JupyterHub. 1024 | # 1025 | # Admin users have extra privileges: 1026 | # - Use the admin panel to see list of users logged in 1027 | # - Add / remove users in some authenticators 1028 | # - Restart / halt the hub 1029 | # - Start / stop users' single-user servers 1030 | # - Can access each individual users' single-user server (if configured) 1031 | # 1032 | # Admin access should be treated the same way root access is. 1033 | # 1034 | # Defaults to an empty set, in which case no user has admin access. 1035 | # Default: set() 1036 | # c.Authenticator.admin_users = set() 1037 | c.Authenticator.admin_users = { 'centos' } 1038 | 1039 | ## Set of usernames that are allowed to log in. 1040 | # 1041 | # Use this with supported authenticators to restrict which users can log in. 1042 | # This is an additional list that further restricts users, beyond whatever 1043 | # restrictions the authenticator has in place. 1044 | # 1045 | # If empty, does not perform any additional restriction. 1046 | # 1047 | # .. versionchanged:: 1.2 1048 | # `Authenticator.whitelist` renamed to `allowed_users` 1049 | # Default: set() 1050 | # c.Authenticator.allowed_users = set() 1051 | 1052 | ## The max age (in seconds) of authentication info before forcing a refresh of 1053 | # user auth info. 1054 | # 1055 | # Refreshing auth info allows, e.g. requesting/re-validating auth tokens. 1056 | # 1057 | # See :meth:`.refresh_user` for what happens when user auth info is refreshed 1058 | # (nothing by default). 1059 | # Default: 300 1060 | # c.Authenticator.auth_refresh_age = 300 1061 | 1062 | ## Automatically begin the login process 1063 | # 1064 | # rather than starting with a "Login with..." link at `/hub/login` 1065 | # 1066 | # To work, `.login_url()` must give a URL other than the default `/hub/login`, 1067 | # such as an oauth handler or another automatic login handler, registered with 1068 | # `.get_handlers()`. 1069 | # 1070 | # .. versionadded:: 0.8 1071 | # Default: False 1072 | # c.Authenticator.auto_login = False 1073 | 1074 | ## Set of usernames that are not allowed to log in. 1075 | # 1076 | # Use this with supported authenticators to restrict which users can not log in. 1077 | # This is an additional block list that further restricts users, beyond whatever 1078 | # restrictions the authenticator has in place. 1079 | # 1080 | # If empty, does not perform any additional restriction. 1081 | # 1082 | # .. versionadded: 0.9 1083 | # 1084 | # .. versionchanged:: 1.2 1085 | # `Authenticator.blacklist` renamed to `blocked_users` 1086 | # Default: set() 1087 | # c.Authenticator.blocked_users = set() 1088 | 1089 | ## Delete any users from the database that do not pass validation 1090 | # 1091 | # When JupyterHub starts, `.add_user` will be called on each user in the 1092 | # database to verify that all users are still valid. 1093 | # 1094 | # If `delete_invalid_users` is True, any users that do not pass validation will 1095 | # be deleted from the database. Use this if users might be deleted from an 1096 | # external system, such as local user accounts. 
1097 | # 1098 | # If False (default), invalid users remain in the Hub's database and a warning 1099 | # will be issued. This is the default to avoid data loss due to config changes. 1100 | # Default: False 1101 | # c.Authenticator.delete_invalid_users = False 1102 | 1103 | ## Enable persisting auth_state (if available). 1104 | # 1105 | # auth_state will be encrypted and stored in the Hub's database. This can 1106 | # include things like authentication tokens, etc. to be passed to Spawners as 1107 | # environment variables. 1108 | # 1109 | # Encrypting auth_state requires the cryptography package. 1110 | # 1111 | # Additionally, the JUPYTERHUB_CRYPT_KEY environment variable must contain one 1112 | # (or more, separated by ;) 32B encryption keys. These can be either base64 or 1113 | # hex-encoded. 1114 | # 1115 | # If encryption is unavailable, auth_state cannot be persisted. 1116 | # 1117 | # New in JupyterHub 0.8 1118 | # Default: False 1119 | # c.Authenticator.enable_auth_state = False 1120 | 1121 | ## An optional hook function that you can implement to do some bootstrapping work 1122 | # during authentication. For example, loading user account details from an 1123 | # external system. 1124 | # 1125 | # This function is called after the user has passed all authentication checks 1126 | # and is ready to successfully authenticate. This function must return the 1127 | # authentication dict reguardless of changes to it. 1128 | # 1129 | # This maybe a coroutine. 1130 | # 1131 | # .. versionadded: 1.0 1132 | # 1133 | # Example:: 1134 | # 1135 | # import os, pwd 1136 | # def my_hook(authenticator, handler, authentication): 1137 | # user_data = pwd.getpwnam(authentication['name']) 1138 | # spawn_data = { 1139 | # 'pw_data': user_data 1140 | # 'gid_list': os.getgrouplist(authentication['name'], user_data.pw_gid) 1141 | # } 1142 | # 1143 | # if authentication['auth_state'] is None: 1144 | # authentication['auth_state'] = {} 1145 | # authentication['auth_state']['spawn_data'] = spawn_data 1146 | # 1147 | # return authentication 1148 | # 1149 | # c.Authenticator.post_auth_hook = my_hook 1150 | # Default: None 1151 | # c.Authenticator.post_auth_hook = None 1152 | 1153 | ## Force refresh of auth prior to spawn. 1154 | # 1155 | # This forces :meth:`.refresh_user` to be called prior to launching a server, to 1156 | # ensure that auth state is up-to-date. 1157 | # 1158 | # This can be important when e.g. auth tokens that may have expired are passed 1159 | # to the spawner via environment variables from auth_state. 1160 | # 1161 | # If refresh_user cannot refresh the user auth data, launch will fail until the 1162 | # user logs in again. 1163 | # Default: False 1164 | # c.Authenticator.refresh_pre_spawn = False 1165 | 1166 | ## Dictionary mapping authenticator usernames to JupyterHub users. 1167 | # 1168 | # Primarily used to normalize OAuth user names to local users. 1169 | # Default: {} 1170 | # c.Authenticator.username_map = {} 1171 | 1172 | ## Regular expression pattern that all valid usernames must match. 1173 | # 1174 | # If a username does not match the pattern specified here, authentication will 1175 | # not be attempted. 1176 | # 1177 | # If not set, allow any username. 
1178 | # Default: '' 1179 | # c.Authenticator.username_pattern = '' 1180 | 1181 | ## Deprecated, use `Authenticator.allowed_users` 1182 | # Default: set() 1183 | # c.Authenticator.whitelist = set() 1184 | 1185 | #------------------------------------------------------------------------------ 1186 | # CryptKeeper(SingletonConfigurable) configuration 1187 | #------------------------------------------------------------------------------ 1188 | ## Encapsulate encryption configuration 1189 | # 1190 | # Use via the encryption_config singleton below. 1191 | 1192 | # Default: [] 1193 | # c.CryptKeeper.keys = [] 1194 | 1195 | ## The number of threads to allocate for encryption 1196 | # Default: 2 1197 | # c.CryptKeeper.n_threads = 2 1198 | 1199 | #------------------------------------------------------------------------------ 1200 | # Pagination(Configurable) configuration 1201 | #------------------------------------------------------------------------------ 1202 | ## Default number of entries per page for paginated results. 1203 | # Default: 100 1204 | # c.Pagination.default_per_page = 100 1205 | 1206 | ## Maximum number of entries per page for paginated results. 1207 | # Default: 250 1208 | # c.Pagination.max_per_page = 250 1209 | 1210 | 1211 | #------------------------------------------------------------------------------ 1212 | # BatchSpawner Configuration 1213 | #------------------------------------------------------------------------------ 1214 | #BatchSpawner config: 1215 | import batchspawner 1216 | #c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner' 1217 | c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner' 1218 | c.Spawner.http_timeout = 300 1219 | c.Spawner.start_timeout = 300 1220 | c.Spawner.poll_interval = 10 1221 | c.BatchSpawnerBase.req_nprocs = '2' 1222 | c.BatchSpawnerBase.req_partition = 'cloud' 1223 | c.BatchSpawnerBase.req_host = '{{ headnode_hostname }}' #JEC_SPAWNER_HOSTNAME 1224 | c.BatchSpawnerBase.req_runtime = '12:00:00' 1225 | 1226 | c.SlurmSpawner.cmd = "jupyter-labhub" 1227 | 1228 | c.SlurmSpawner.batch_script="""#!/bin/bash 1229 | #SBATCH --output={{homedir}}/jupyterhub_slurmspawner_%j.log 1230 | #SBATCH --job-name=spawner-jupyterhub 1231 | #SBATCH --chdir={{homedir}} 1232 | #SBATCH --export={{keepvars}} 1233 | #SBATCH --get-user-env=L 1234 | {% if partition %}#SBATCH --partition={{partition}} 1235 | {% endif %}{% if runtime %}#SBATCH --time={{runtime}} 1236 | {% endif %}{% if memory %}#SBATCH --mem={{memory}} 1237 | {% endif %}{% if gres %}#SBATCH --gres={{gres}} 1238 | {% endif %}{% if nprocs %}#SBATCH --cpus-per-task={{nprocs}} 1239 | {% endif %}{% if reservation%}#SBATCH --reservation={{reservation}} 1240 | {% endif %}{% if options %}#SBATCH {{options}}{% endif %} 1241 | set -euo pipefail 1242 | trap 'echo SIGTERM received' TERM 1243 | module load python3.8 1244 | 1245 | # to avoid https://github.com/jupyter/notebook/issues/1318 1246 | export XDG_RUNTIME_DIR=$HOME/.jupyter-run 1247 | 1248 | {{cmd}} 1249 | 1250 | echo "jupyterlab-singleuser ended gracefully" 1251 | """ 1252 | ##------------------------------------------------------------------------------ 1253 | ## ProfilesSpawner configuration 1254 | ##------------------------------------------------------------------------------ 1255 | ## List of profiles to offer for selection. Signature is: 1256 | ## List(Tuple( Unicode, Unicode, Type(Spawner), Dict )) 1257 | ## corresponding to profile display name, unique key, Spawner class, 1258 | ## dictionary of spawner config options. 
1259 | ## 1260 | ## The first three values will be exposed in the input_template as {display}, 1261 | ## {key}, and {type} 1262 | ## 1263 | #c.ProfilesSpawner.profiles = [ 1264 | # ( "Local server", 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip':'0.0.0.0'} ), 1265 | # ('Mesabi - 2 cores, 4 GB, 8 hours', 'mesabi2c4g12h', 'batchspawner.TorqueSpawner', 1266 | # dict(req_nprocs='2', req_queue='mesabi', req_runtime='8:00:00', req_memory='4gb')), 1267 | # ('Mesabi - 12 cores, 128 GB, 4 hours', 'mesabi128gb', 'batchspawner.TorqueSpawner', 1268 | # dict(req_nprocs='12', req_queue='ram256g', req_runtime='4:00:00', req_memory='125gb')), 1269 | # ('Mesabi - 2 cores, 4 GB, 24 hours', 'mesabi2c4gb24h', 'batchspawner.TorqueSpawner', 1270 | # dict(req_nprocs='2', req_queue='mesabi', req_runtime='24:00:00', req_memory='4gb')), 1271 | # ('Interactive Cluster - 2 cores, 4 GB, 8 hours', 'lab', 'batchspawner.TorqueSpawner', 1272 | # dict(req_nprocs='2', req_host='labhost.xyz.edu', req_queue='lab', 1273 | # req_runtime='8:00:00', req_memory='4gb', state_exechost_exp='')), 1274 | # ] 1275 | c.ProfilesSpawner.profiles = [ 1276 | ( "Cloud queue", 'cloud-norm', 'batchspawner.SlurmSpawner', 1277 | dict(req_nprocs='2', req_partition='cloud', req_runtime='8:00:00')), 1278 | ] 1279 | # ( "Cloud queue 2", 'cloud-2', 'batchspawner.SlurmSpawner', 1280 | # dict(req_nprocs='2', req_partition='cloud2', req_runtime='1:00:00')), 1281 | # ( "Cloud queue 3", 'cloud-3', 'batchspawner.SlurmSpawner', 1282 | # dict(req_nprocs='2', req_partition='cloud3', req_runtime='2:00:00')), 1283 | c.ProfilesSpawner.ip = '0.0.0.0' 1284 | 1285 | #------------------------------------------------------------------------------ 1286 | # Keycloak Configuration 1287 | #------------------------------------------------------------------------------ 1288 | # Values needed by the django portal 1289 | #KEYCLOAK_AUTHORIZE_URL = 'https://iam.scigap.org/auth/realms/delta/protocol/openid-connect/auth' 1290 | #KEYCLOAK_TOKEN_URL = 'https://iam.scigap.org/auth/realms/delta/protocol/openid-connect/token' 1291 | #KEYCLOAK_USERINFO_URL = 'https://iam.scigap.org/auth/realms/delta/protocol/openid-connect/userinfo' 1292 | #KEYCLOAK_LOGOUT_URL = 'https://iam.scigap.org/auth/realms/delta/protocol/openid-connect/logout' 1293 | 1294 | #public_hostname = extract_hostname(routes, application_name) 1295 | # 1296 | #keycloak_name = os.environ.get('KEYCLOAK_SERVICE_NAME') 1297 | #keycloak_hostname = extract_hostname(routes, keycloak_name) 1298 | #print('keycloak_hostname', keycloak_hostname) 1299 | # 1300 | #keycloak_realm = os.environ.get('KEYCLOAK_REALM') 1301 | # 1302 | #keycloak_account_url = 'https://%s/auth/realms/%s/account' % ( 1303 | # keycloak_hostname, keycloak_realm) 1304 | # 1305 | #with open('templates/vars.html', 'w') as fp: 1306 | # fp.write('{%% set keycloak_account_url = "%s" %%}' % keycloak_account_url) 1307 | import os 1308 | public_hostname="{{ headnode_public_hostname }}" #JEC_PUBLIC_HOSTNAME 1309 | -------------------------------------------------------------------------------- /jhub_files/jhub_service.j2: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Jupyterhub 3 | Wants=network-online.target 4 | After=network-online.target 5 | 6 | [Service] 7 | User=jupyterhub 8 | ExecStart=/opt/python3/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py 9 | WorkingDirectory=/etc/jupyterhub 10 | 11 | [Install] 12 | WantedBy=multi-user.target 13 | --------------------------------------------------------------------------------
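A note on the unit template above: install_jupyterhub.yml is expected to render and install it (the installed unit name is assumed here to be jupyterhub.service); once it is in place, the hub can be managed with the usual systemd commands, for example:

    sudo systemctl daemon-reload
    sudo systemctl enable --now jupyterhub
    sudo journalctl -u jupyterhub -f    # follow hub logs while testing a spawn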
/jhub_files/jhub_sudoers: -------------------------------------------------------------------------------- 1 | ##Jupyterhub batchspawner 2 | Cmnd_Alias JUPYTER_CMD = /opt/python3/bin/batchspawner-singleuser, /opt/python3/bin/sudospawner, /bin/sbatch, /bin/squeue, /bin/scancel 3 | 4 | %jupyterhub-users ALL=(jupyterhub) /usr/bin/sudo 5 | jupyterhub ALL=(%jupyterhub-users) NOPASSWD:SETENV:JUPYTER_CMD 6 | -------------------------------------------------------------------------------- /jhub_files/jupyterhub.conf.j2: -------------------------------------------------------------------------------- 1 | <VirtualHost *:443> 2 | ServerName {{ headnode_public_hostname }} 3 | ServerAlias {{ headnode_alternate_hostname }} 4 | {% raw %} 5 | SSLEngine on 6 | SSLProtocol -ALL +TLSv1.2 7 | SSLCipherSuite HIGH:!MEDIUM:!aNULL:!MD5:!SEED:!IDEA:!RC4 8 | SSLHonorCipherOrder on 9 | TraceEnable off 10 | #JupyterHub - provided changes: 11 | RewriteEngine On 12 | RewriteCond %{HTTP:Connection} Upgrade [NC] 13 | RewriteCond %{HTTP:Upgrade} websocket [NC] 14 | RewriteRule /(.*) ws://127.0.0.1:8000/$1 [P,L] 15 | RewriteRule /(.*) http://127.0.0.1:8000/$1 [P,L] 16 | 17 | <Location "/"> 18 | # preserve Host header to avoid cross-origin problems 19 | ProxyPreserveHost on 20 | # proxy to JupyterHub 21 | ProxyPass http://127.0.0.1:8000/ 22 | ProxyPassReverse http://127.0.0.1:8000/ 23 | </Location> 24 | 25 | # Include /etc/letsencrypt/options-ssl-apache.conf 26 | {% endraw %} 27 | SSLCertificateFile /etc/letsencrypt/live/{{ headnode_public_hostname }}/fullchain.pem 28 | SSLCertificateKeyFile /etc/letsencrypt/live/{{ headnode_public_hostname }}/privkey.pem 29 | </VirtualHost> 30 | -------------------------------------------------------------------------------- /jhub_files/python_mod_3.8: -------------------------------------------------------------------------------- 1 | #%Module1.0##################################################################### 2 | 3 | proc ModulesHelp { } { 4 | 5 | puts stderr " " 6 | puts stderr "This module loads Python 3.8.10" 7 | puts stderr " " 8 | puts stderr "See the man pages for python3 for detailed information" 9 | puts stderr "on available compiler options and command-line syntax."
10 | puts stderr " " 11 | 12 | puts stderr "\nVersion 3.8.10\n" 13 | 14 | } 15 | module-whatis "Name: Python 3.8.10 environment" 16 | module-whatis "Version: 3.8.10" 17 | module-whatis "Category: compiler, runtime support" 18 | module-whatis "Description: Python 3.8.10 compiler, REPL, and runtime environment" 19 | module-whatis "URL: http://python.org/" 20 | 21 | set version 3.8.10 22 | 23 | prepend-path PATH /opt/ohpc/pub/compiler/python3/bin 24 | prepend-path MANPATH /opt/ohpc/pub/compiler/python3/share/man 25 | prepend-path INCLUDE /opt/ohpc/pub/compiler/python3/include 26 | prepend-path LD_LIBRARY_PATH /opt/ohpc/pub/compiler/python3/lib 27 | prepend-path MODULEPATH /opt/ohpc/pub/moduledeps/python3.8 28 | 29 | family "python_base" 30 | -------------------------------------------------------------------------------- /prevent-updates.ci: -------------------------------------------------------------------------------- 1 | #cloud-config 2 | packages: [] 3 | 4 | package_update: false 5 | package_upgrade: false 6 | package_reboot_if_required: false 7 | 8 | final_message: "Boot completed in $UPTIME seconds" 9 | -------------------------------------------------------------------------------- /slurm-logrotate.conf: -------------------------------------------------------------------------------- 1 | compress 2 | 3 | /var/log/slurmctld.log { 4 | rotate 999 5 | missingok 6 | notifempty 7 | size 10M 8 | Monthly 9 | } 10 | 11 | /var/log/slurm_elastic.log { 12 | rotate 999 13 | missingok 14 | notifempty 15 | size 10M 16 | Monthly 17 | } 18 | -------------------------------------------------------------------------------- /slurm.conf: -------------------------------------------------------------------------------- 1 | # 2 | # Example slurm.conf file. Please run configurator.html 3 | # (in doc/html) to build a configuration file customized 4 | # for your environment. 5 | # 6 | # 7 | # slurm.conf file generated by configurator.html. 8 | # 9 | # See the slurm.conf man page for more information. 
10 | # 11 | ClusterName=js-slurm-elastic 12 | ControlMachine=slurm-example 13 | #ControlAddr= 14 | #BackupController= 15 | #BackupAddr= 16 | # 17 | SlurmUser=slurm 18 | SlurmdUser=root 19 | SlurmctldPort=6817 20 | SlurmdPort=6818 21 | AuthType=auth/munge 22 | #JobCredentialPrivateKey= 23 | #JobCredentialPublicCertificate= 24 | StateSaveLocation=/tmp 25 | SlurmdSpoolDir=/tmp/slurmd 26 | SwitchType=switch/none 27 | MpiDefault=none 28 | SlurmctldPidFile=/var/run/slurmctld.pid 29 | SlurmdPidFile=/var/run/slurmd.pid 30 | ProctrackType=proctrack/pgid 31 | #PluginDir= 32 | #FirstJobId= 33 | ReturnToService=1 34 | #MaxJobCount= 35 | #PlugStackConfig= 36 | #PropagatePrioProcess= 37 | #PropagateResourceLimits= 38 | #PropagateResourceLimitsExcept= 39 | Prolog=/usr/local/sbin/slurm_prolog.sh 40 | #Epilog= 41 | #SrunProlog= 42 | #SrunEpilog= 43 | #TaskProlog= 44 | #TaskEpilog= 45 | #TaskPlugin= 46 | #TrackWCKey=no 47 | #TreeWidth=50 48 | #TmpFS= 49 | #UsePAM= 50 | # 51 | # TIMERS 52 | SlurmctldTimeout=300 53 | SlurmdTimeout=300 54 | #make slurm a little more tolerant here 55 | MessageTimeout=30 56 | TCPTimeout=15 57 | BatchStartTimeout=20 58 | GetEnvTimeout=20 59 | InactiveLimit=0 60 | MinJobAge=604800 61 | KillWait=30 62 | Waittime=0 63 | # 64 | # SCHEDULING 65 | SchedulerType=sched/backfill 66 | #SchedulerAuth= 67 | #SchedulerPort= 68 | #SchedulerRootFilter= 69 | #SelectType=select/linear 70 | SelectType=select/cons_res 71 | SelectTypeParameters=CR_CPU 72 | #FastSchedule=0 73 | #PriorityType=priority/multifactor 74 | #PriorityDecayHalfLife=14-0 75 | #PriorityUsageResetPeriod=14-0 76 | #PriorityWeightFairshare=100000 77 | #PriorityWeightAge=1000 78 | #PriorityWeightPartition=10000 79 | #PriorityWeightJobSize=1000 80 | #PriorityMaxAge=1-0 81 | # 82 | # LOGGING 83 | SlurmctldDebug=3 84 | SlurmctldLogFile=/var/log/slurm/slurmctld.log 85 | SlurmdDebug=3 86 | SlurmdLogFile=/var/log/slurmd.log 87 | JobCompType=jobcomp/none 88 | #JobCompLoc= 89 | # 90 | # ACCOUNTING 91 | JobAcctGatherType=jobacct_gather/linux 92 | JobAcctGatherFrequency=30 93 | # 94 | #AccountingStorageType=accounting_storage/filetxt 95 | #AccountingStorageLoc=/var/log/slurm/slurm_jobacct.log 96 | #AccountingStorageEnforce=associations,limits 97 | #AccountingStoragePass= 98 | #AccountingStorageUser= 99 | # 100 | #GENERAL RESOURCE 101 | GresTypes="" 102 | # 103 | #CLOUD CONFIGURATION 104 | PrivateData=cloud 105 | ResumeProgram=/usr/local/sbin/slurm_resume.sh 106 | SuspendProgram=/usr/local/sbin/slurm_suspend.sh 107 | ResumeRate=0 #number of nodes per minute that can be created; 0 means no limit 108 | ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use 109 | SuspendRate=0 #number of nodes per minute that can be suspended/destroyed 110 | SuspendTime=60 #time in seconds before an idle node is suspended 111 | SuspendTimeout=30 #time between running SuspendProgram and the node being completely down 112 | #COMPUTE NODES 113 | NodeName=compute-[0-1] State=CLOUD CPUs=2 114 | #PARTITIONS 115 | PartitionName=cloud Nodes=compute-[0-1] Default=YES MaxTime=INFINITE State=UP 116 | -------------------------------------------------------------------------------- /slurm_prolog.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | mount_test=$(grep home /etc/mtab) 4 | count=0 5 | declare -i count 6 | 7 | until [ -n "${mount_test}" -o $count -ge 10 ]; 8 | do 9 | sleep 1 10 | count+=1 11 | mount_test=$(grep home /etc/mtab) 12 | echo "$count test: 
$mount_test" 13 | done 14 | 15 | if [[ $count -ge 10 ]]; then 16 | echo "FAILED TO MOUNT home - $hostname" 17 | exit 1 18 | fi 19 | 20 | echo "HOME IS MOUNTED! $hostname" 21 | 22 | exit 0 23 | -------------------------------------------------------------------------------- /slurm_resume.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /etc/slurm/openrc.sh 4 | 5 | node_size="m3.small" 6 | # See compute_take_snapshot.sh for naming convention; backup snapshots exist with date appended 7 | #node_image=$(openstack image list -f value | grep -i $(hostname -s)-compute-image-latest | cut -f 2 -d' '| tail -n 1) 8 | node_image=$(hostname -s)-compute-image-latest 9 | log_loc=/var/log/slurm/slurm_elastic.log 10 | 11 | OS_PREFIX=$(hostname -s) 12 | OS_SLURM_KEYPAIR=${OS_PREFIX}-slurm-key 13 | HEADNODE_NETWORK=$(openstack server show $(hostname -s) | grep addresses | awk -F'|' '{print $3}' | awk -F'=' '{print $1}' | awk '{$1=$1};1') 14 | OS_SSH_SECGROUP_NAME=${OS_PREFIX}-ssh-global 15 | OS_INTERNAL_SECGROUP_NAME=${OS_PREFIX}-internal 16 | 17 | #def f'n to generate a write_files entry for cloud-config for copying over a file 18 | # arguments are owner path permissions file_to_be_copied 19 | # All calls to this must come after an "echo "write_files:\n" 20 | generate_write_files () { 21 | #This is generating YAML, so... spaces are important. 22 | echo -e " - encoding: b64\n owner: $1\n path: $2\n permissions: $3\n content: |\n$(cat $4 | base64 | sed 's/^/ /')" 23 | } 24 | 25 | user_data_long="$(cat /etc/slurm/prevent-updates.ci)\n" 26 | user_data_long+="$(echo -e "hostname: $host \npreserve_hostname: true\ndebug: \nmanage_etc_hosts: false\n")\n" 27 | user_data_long+="$(echo -e "write_files:")\n" 28 | user_data_long+="$(generate_write_files slurm "/etc/slurm/slurm.conf" "0644" "/etc/slurm/slurm.conf")\n" 29 | user_data_long+="$(generate_write_files root "/etc/hosts" "0664" "/etc/hosts")\n" 30 | user_data_long+="$(generate_write_files root "/etc/passwd" "0644" "/etc/passwd")\n" 31 | #Done generating the cloud-config for compute nodes 32 | 33 | echo "Node resume invoked: $0 $*" >> $log_loc 34 | 35 | #First, loop over hosts and run the openstack create commands for *all* resume hosts at once. 
36 | for host in $(scontrol show hostname $1) 37 | do 38 | 39 | #Launch compute nodes and check for new ip address in same subprocess - with 2s delay between Openstack requests 40 | #--user-data <(cat /etc/slurm/prevent-updates.ci && echo -e "hostname: $host \npreserve_hostname: true\ndebug:") \ 41 | # the current --user-data pulls in the slurm.conf and /etc/passwd as well, to avoid rebuilding node images 42 | # when adding / changing partitions 43 | 44 | (echo "creating $host" >> $log_loc; 45 | openstack server create $host \ 46 | --flavor $node_size \ 47 | --image $node_image \ 48 | --key-name ${OS_SLURM_KEYPAIR} \ 49 | --user-data <(echo -e "${user_data_long}") \ 50 | --security-group ${OS_SSH_SECGROUP_NAME} --security-group ${OS_INTERNAL_SECGROUP_NAME} \ 51 | --nic net-id=${HEADNODE_NETWORK} 2>&1 \ 52 | | tee -a $log_loc | awk '/status/ {print $4}' >> $log_loc 2>&1; 53 | 54 | node_status="UNKOWN"; 55 | stat_count=0 56 | declare -i stat_count; 57 | until [[ $node_status == "ACTIVE" || $stat_count -ge 20 ]]; do 58 | node_state=$(openstack server show $host 2>&1); 59 | node_status=$(echo -e "${node_state}" | awk '/status/ {print $4}'); 60 | # echo "$host status is: $node_status" >> $log_loc; 61 | # echo "$host ip is: $node_ip" >> $log_loc; 62 | stat_count+=1 63 | sleep 3; 64 | done; 65 | if [[ $node_status != "ACTIVE" ]]; then 66 | echo "$host creation failed" >> $log_loc; 67 | exit 1; 68 | fi; 69 | node_ip=$(echo -e "${node_state}" | awk '/addresses/ {print gensub(/^.*=/,"","g",$4)}'); 70 | 71 | echo "$host ip is $node_ip" >> $log_loc; 72 | scontrol update nodename=$host nodeaddr=$node_ip >> $log_loc;)& 73 | sleep 2 # don't send all the JS requests at "once" 74 | done 75 | -------------------------------------------------------------------------------- /slurm_suspend.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /etc/slurm/openrc.sh 4 | 5 | log_loc=/var/log/slurm/slurm_elastic.log 6 | #log_loc=/dev/stdout 7 | 8 | echo "$(date) Node suspend invoked: $0 $*" >> $log_loc 9 | 10 | ############################## 11 | # active_hosts takes in a hostlist, and echos an updated list of instances 12 | # that are still active - this simplifies the count-loop below, which should 13 | # loop only over active instances 14 | ############################## 15 | active_hosts() { 16 | 17 | hostlist="$1" 18 | os_status_list=$(openstack server list -f value -c ID -c Name -c Status) 19 | 20 | updated_hosts="" 21 | 22 | for host in $hostlist 23 | do 24 | # the quotes around os_status_list preserve newlines! 25 | if [[ "$(echo "$os_status_list" | awk -v host="$host " '$0 ~ host {print $3}')" == "ACTIVE" ]]; then 26 | echo -n "$host " 27 | elif [[ $(echo "${os_status_list}" | grep "$host " | wc -l) -ge 2 ]]; then 28 | #switch to using OS id, because we have multiples of the same host 29 | echo -n $(echo "${os_status_list}" | awk -v host="$host " '$0 ~ host {print $1}') 30 | fi 31 | done 32 | 33 | 34 | return 0 35 | 36 | } 37 | ############################## 38 | 39 | count=0 40 | declare -i count 41 | 42 | hostlist=$(scontrol show hostname $1 | tr '\n' ' ' | sed 's/[ ]*$//') 43 | 44 | #Now, try 3 times to ensure all hosts are suspended... 
45 | until [ -z "${hostlist}" -o $count -ge 3 ]; 46 | do 47 | for host in $hostlist 48 | do 49 | if [[ $count == 0 ]]; then 50 | scontrol update nodename=${host} nodeaddr="(null)" >> $log_loc 51 | fi 52 | destroy_result=$(openstack server delete $host 2>&1) 53 | echo "$(date) Deleted $host: $destroy_result" >> $log_loc 54 | done 55 | 56 | sleep 5 #wait a bit for hosts to enter STOP state 57 | count+=1 58 | hostlist="$(active_hosts "${hostlist}")" 59 | echo "$(date) delete Attempt $count: remaining hosts: $hostlist" >> $log_loc 60 | done 61 | -------------------------------------------------------------------------------- /slurm_test.job: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -n 2 3 | #SBATCH -o nodes_%A.out 4 | 5 | module load gnu9 6 | module load openmpi4 7 | 8 | mpirun -n 2 hostname 9 | -------------------------------------------------------------------------------- /ssh.cfg: -------------------------------------------------------------------------------- 1 | Host dimuthu-vc-compute-* 2 | User rocky 3 | StrictHostKeyChecking no 4 | BatchMode yes 5 | UserKnownHostsFile=/dev/null 6 | IdentityFile /etc/slurm/.ssh/slurm-key 7 | 8 | Host 10.0.0.* 9 | User rocky 10 | StrictHostKeyChecking no 11 | BatchMode yes 12 | UserKnownHostsFile=/dev/null 13 | IdentityFile /etc/slurm/.ssh/slurm-key 14 | --------------------------------------------------------------------------------