├── .gitignore ├── .travis.yml ├── HISTORY.md ├── MANIFEST.in ├── README.rst ├── cluster_helper ├── __init__.py ├── cluster.py ├── log │ └── __init__.py ├── lsf.py ├── slurm.py └── utils.py ├── example └── example.py ├── requirements.txt ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .ropeproject 2 | build 3 | *.pyc 4 | dist 5 | *.egg-info 6 | .DS_Store 7 | .idea/ 8 | __pycache__/ 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | sudo: false 2 | language: python 3 | python: 4 | - "2.7" 5 | - "3.4" 6 | - "3.5" 7 | cache: 8 | directories: 9 | - ${HOME}/.cache/pip 10 | install: 11 | - pip install -U setuptools pip 12 | - pip install . 13 | script: python example/example.py --local 14 | -------------------------------------------------------------------------------- /HISTORY.md: -------------------------------------------------------------------------------- 1 | ## 0.6.4 (24 September 2019) 2 | - python3 fixes for PBSPro schedulers. 3 | 4 | ## 0.6.3 (4 April 2019) 5 | 6 | - Python3 fixes for LSF schedulers. 7 | 8 | ## 0.6.2 9 | - Support pip10. 10 | - Fix for SGE parallel environments with queues containing a . in the name. 11 | 12 | ## 0.6.0 (14 Dec 2017) 13 | 14 | - Stop IPython parallel trying to setup .ipython in home directories during runs by 15 | setting IPYTHONDIR. This avoids intermittent InvalidSignature errors. 16 | 17 | ## 0.5.9 (16 Nov 2017) 18 | 19 | - PBSPro: Pass memory default to controller and make configurable with `conmem`, 20 | duplicating SLURM functionality. Thanks to Oliver Hofmann. 21 | 22 | ## 0.5.8 (18 Oct 2017) 23 | 24 | - Fix SLURM resource specifications to avoid missing directives in the case of 25 | empty lines in specifications. 26 | 27 | ## 0.5.7 (11 Oct 2017) 28 | 29 | - Fix for SLURM engine output files not having the proper job IDs. 30 | - Additional fixes for python3 support. 31 | 32 | ## 0.5.6 (2 Aug 2017) 33 | 34 | - Additional work to better resolve local IP addresses, first trying to pick 35 | an IP from the fully qualified domain name if available. Thanks to Linh Vu. 36 | - Support PBSPro systems without select statement enabled. Thanks to Roman Valls 37 | Guimerà. 38 | - Enable python3 support for Slurm. Thanks to Vlad Saveliev. 39 | - Enable support for ipython 5. 40 | 41 | ## 0.5.5 (29 May 2017) 42 | 43 | - Try to better resolve multiple local addresses for controllers. Prioritize 44 | earlier addresses (eth0) over later (eth1) if all are valid. Thanks to Luca Beltrame. 45 | - Provide separate logging of SLURM engine job arrays. Thanks to Oliver Hofmann. 46 | - Set minimum PBSPro controller memory usage to 1GB. Thanks to Guus van Dalum 47 | @vandalum. 48 | 49 | ## 0.5.4 (24 February 2017) 50 | - Ensure /bin/sh used for Torque/PBSPro submissions to prevent overwriting 51 | environment from bashrc. Pass LD_* exports, which are also filtered by Torque. 52 | Thanks to Andrey Tovchigrechko. 53 | - Add option to run controller locally. Thanks to Brent Pederson (@brentp) and 54 | Sven-Eric Schelhorn (@schelhorn). 55 | - Fix for Python 3 compatibility in setup.py. Thanks to Amar Chaudhari 56 | (@Amar-Chaudhari). 57 | 58 | ## 0.5.3 (23 August 2016) 59 | 60 | - Emit warning message instead of failing when we try to shut down a cluster and 61 | it is already shut down. Thanks to Gabriel F. Berriz (@gberriz) for raising 62 | the issue. 
63 | - Python 2/3 compatibility. Thanks to Alain Péteut (@peteut) and Matt De Both 64 | (@mdeboth). 65 | - Pass LD_PRELOAD to bcbio SGE worker jobs to enable compatibility with 66 | PetaGene. 67 | 68 | ## 0.5.2 (8 June 2016) 69 | 70 | - Pin requirements to use IPython < 5.0 until we migrate to new release. 71 | - Fix passing optional resources to PBSPro. Thanks to Endre Sebestyén. 72 | - Spin up individual engines instead of job arrays for PBSPro clusters. Thanks to 73 | Thanks to Endre Sebestyen (@razZ0r), Francesco Ferrari and Tiziana Castrignanò 74 | for raising the issue, providing an account with access to PBSPro on the Cineca 75 | cluster and testing that the fix works. 76 | - Remove sleep command to stagger engine startups in SGE which breaks if bc 77 | command not present. Thanks to Juan Caballero. 78 | 79 | ## 0.5.1 (January 25, 2016) 80 | - Add support for UGE (an open-source fork of SGE) Thanks to Andrew Oler. 81 | 82 | ## 0.5.0 (October 8, 2015) 83 | 84 | - Adjust memory resource fixes for LSF to avoid underscheduling memory on other clusters 85 | - Change timeouts to provide faster startup for IPython clusters (thanks to Maciej Wójcikowski) 86 | - Don't send SIGKILL to SLURM job arrays since it often doesn't kill all of the jobs. (@mwojcikowski) 87 | 88 | ## 0.4.9 (August 17, 2015) 89 | 90 | - Fix memory resources for LSF when jobs requests 1 core and mincore has a value > 1. 91 | 92 | ## 0.4.8 (August 15, 2015) 93 | 94 | - Fix tarball issue, ensuring requirements.txt included. 95 | 96 | ## 0.4.7 (August 15, 2015) 97 | 98 | - Support for IPython 4.0 (@brainstorm, @roryk, @chapmanb, @lpantano) 99 | 100 | ## 0.4.6 (July 22, 2015) 101 | - Support `numengines` for Torque and PBSPro for correct parallelization with bcbio-nextgen. 102 | - Ensure we only install IPython < 4.0 since 4.0 development versions split out IPython parallel 103 | into a new namespace. Will support this when it is officially released. 104 | - Added wait_for_all_engines support to return the view only after all engines are up (@matthias-k) 105 | - IPython.parallel is moving towards separate packages (@brainstorm). 106 | 107 | ## 0.4.5 (May 20, 2015) 108 | - Added --local and --cores-per-job support to the example script (@matthias-k) 109 | - Add ability to get a cluster view without the context manager (@matthias-k) 110 | 111 | ## 0.4.4 (April 23, 2014) 112 | - Python 3 compatibility (@mjdellwo) 113 | 114 | ## 0.4.3 (March 18, 2015) 115 | 116 | - Fix resource specification problem for SGE. Thanks to Zhengqiu Cai. 117 | 118 | ## 0.4.2 (March 7, 2015) 119 | 120 | - Additional IPython preparation cleanups to prevent race conditions. Thanks to 121 | Lorena Pantano. 122 | 123 | ## 0.4.1 (March 6, 2015) 124 | 125 | - Pre-create IPython database directory to prevent race conditions on 126 | new filesystems with a shared home directory, like AWS. 127 | 128 | ## 0.4.0 (February 17, 2015) 129 | 130 | - Support `conmem` resource parameter, which enables control over the memory 131 | supplied to the controller for SLURM runs. 132 | 133 | ## 0.3.7 (December 09, 2014) 134 | 135 | - Fix SLURM installations that has accounts but the user is not listed. 136 | 137 | ## 0.3.6 (October 18, 2014) 138 | 139 | - Support SLURM installations that do not use accounts, skip finding and passing 140 | an account flag. 141 | 142 | ## 0.3.5 (October 12, 2014) 143 | 144 | - Fix `minconcores` specification to be passed correctly to controllers. 
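A minimal usage sketch of the `mincores`, `minconcores` and `conmem` resource flags referenced in the surrounding entries (the queue name, module and input list are hypothetical placeholders; the `resources` string is split on `;` by `_scheduler_resources` in `cluster_helper/cluster.py`, and `mem`/`conmem` are in GB):

    from cluster_helper.cluster import cluster_view
    from yourmodule import long_running_function

    # mincores: minimum cores per engine job; minconcores: minimum cores for
    # the controller; conmem: controller memory in GB; mem: engine memory in GB.
    with cluster_view(scheduler="slurm", queue="general", num_jobs=10,
                      extra_params={"resources": "mincores=4;minconcores=2;conmem=8",
                                    "mem": "2"}) as view:
        view.map(long_running_function, ["sample1.bam", "sample2.bam"])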
145 | 146 | ## 0.3.4 (October 8, 2014) 147 | 148 | - Support `mincores` resource specification for SGE. 149 | - Add `minconcores` resource specification for specifying if controllers should 150 | use more than a single core, instead of coupling this to `mincores`. Enables 151 | running on systems with only minimum requirements for core usage. 152 | 153 | ## 0.3.3 (September 15, 2014) 154 | 155 | - Handle mincore specification for multicore jobs when memory limits cores to 156 | less than the `mincores` amount. 157 | - Improve timeouts for running on interruptible queues to avoid engine failures 158 | when controllers are requeued. 159 | 160 | ## 0.3.2 (September 10, 2104) 161 | 162 | - Support PBSPro based on Torque setup. Thanks to Piet Jones. 163 | - Improve ability to run on interruptible queuing systems by increasing timeout 164 | when killing non-responsive engines from a controller. Now 1 hour instead of 3 165 | minutes, allowing requeueing of engines. 166 | 167 | ## 0.3.1 (August 20, 2014) 168 | 169 | - Add a special resource flag `-r mincores=n` which requires single core jobs to 170 | use at least n cores. Useful for shared queues where we can only run multicore 171 | jobs, and for sharing memory usage across multiple cores for programs with 172 | spiky memory utilization like variant calling. Available on LSF and SLURM for 173 | testing. 174 | - Add hook to enable improved cleanup of controllers/engines from bcbio-nextgen. 175 | 176 | ## 0.3.0 (August 6, 2014) 177 | 178 | - Make a cluster available after a single engine has registered. Avoids need to 179 | wait for entire cluster to become available on busy clusters. 180 | - Change default SLURM time limit to 1 day to avoid asking for excessive 181 | resources. This can be overridden by passing `-r` resource with desired runtime. 182 | 183 | ## 0.2.19 (April 28, 2014) 184 | - Respect RESOURCE_RESERVE_PER_SLOT for LSF. This causes resources to be specified 185 | at the level of a core instead of a job 186 | 187 | ## 0.2.18 (April 17, 2014) 188 | 189 | - Added ability to get direct view to a cluster. 190 | - Use rusage for LSF jobs instead of --mem. This walls of the memory, preventing nodes 191 | from being oversubscribed. 192 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include *.txt 2 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ipython-cluster-helper 2 | ====================== 3 | .. image:: https://travis-ci.org/roryk/ipython-cluster-helper.svg 4 | :target: https://travis-ci.org/roryk/ipython-cluster-helper 5 | .. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3459799.svg 6 | :target: https://doi.org/10.5281/zenodo.3459799 7 | 8 | Quickly and easily parallelize Python functions using IPython on a 9 | cluster, supporting multiple schedulers. Optimizes IPython defaults to 10 | handle larger clusters and simultaneous processes. 11 | 12 | Example 13 | ------- 14 | 15 | Lets say you wrote a program that takes several files in as arguments 16 | and performs some kind of long running computation on them. Your 17 | original implementation used a loop but it was way too slow 18 | 19 | .. 
code-block:: python 20 | 21 | from yourmodule import long_running_function 22 | import sys 23 | 24 | if __name__ == "__main__": 25 | for f in sys.argv[1:]: 26 | long_running_function(f) 27 | 28 | If you have access to one of the supported schedulers you can easily 29 | parallelize your program across 5 nodes with ipython-cluster-helper 30 | 31 | .. code-block:: python 32 | 33 | from cluster_helper.cluster import cluster_view 34 | from yourmodule import long_running_function 35 | import sys 36 | 37 | if __name__ == "__main__": 38 | with cluster_view(scheduler="lsf", queue="hsph", num_jobs=5) as view: 39 | view.map(long_running_function, sys.argv[1:]) 40 | 41 | That's it! No setup required. 42 | 43 | To run a local cluster for testing purposes pass `run_local` as an extra 44 | parameter to the cluster_view function 45 | 46 | .. code-block:: python 47 | 48 | with cluster_view(scheduler=None, queue=None, num_jobs=5, 49 | extra_params={"run_local": True}) as view: 50 | view.map(long_running_function, sys.argv[1:]) 51 | 52 | How it works 53 | ------------ 54 | 55 | ipython-cluster-helper creates a throwaway parallel IPython profile, 56 | launches a cluster and returns a view. On program exit it shuts the 57 | cluster down and deletes the throwaway profile. 58 | 59 | Supported schedulers 60 | -------------------- 61 | 62 | Platform LSF ("lsf"), Sun Grid Engine ("sge"), Torque ("torque"), SLURM ("slurm"). 63 | 64 | Credits 65 | ------- 66 | 67 | The cool parts of this were ripped from `bcbio-nextgen`_. 68 | 69 | Contributors 70 | ------------ 71 | * Brad Chapman (@chapmanb) 72 | * Mario Giovacchini (@mariogiov) 73 | * Valentine Svensson (@vals) 74 | * Roman Valls (@brainstorm) 75 | * Rory Kirchner (@roryk) 76 | * Luca Beltrame (@lbeltrame) 77 | * James Porter (@porterjamesj) 78 | * Billy Ziege (@billyziege) 79 | * ink1 (@ink1) 80 | * @mjdellwo 81 | * @matthias-k 82 | * Andrew Oler (@oleraj) 83 | * Alain Péteut (@peteut) 84 | * Matt De Both (@mdeboth) 85 | * Vlad Saveliev (@vladsaveliev) 86 | 87 | .. _bcbio-nextgen: https://github.com/chapmanb/bcbio-nextgen 88 | -------------------------------------------------------------------------------- /cluster_helper/__init__.py: -------------------------------------------------------------------------------- 1 | """Distributed execution using an IPython cluster. 2 | 3 | Uses IPython parallel to setup a cluster and manage execution: 4 | 5 | https://ipyparallel.readthedocs.io/en/latest/ 6 | Borrowed from Brad Chapman's implementation: 7 | https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/pipeline/ipython.py 8 | """ 9 | -------------------------------------------------------------------------------- /cluster_helper/cluster.py: -------------------------------------------------------------------------------- 1 | """Distributed execution using an IPython cluster. 
2 | 3 | Uses ipyparallel to setup a cluster and manage execution: 4 | 5 | http://ipython.org/ipython-doc/stable/parallel/index.html 6 | Borrowed from Brad Chapman's implementation: 7 | https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/distributed/ipython.py 8 | """ 9 | from __future__ import print_function 10 | import contextlib 11 | import copy 12 | import math 13 | import os 14 | import pipes 15 | import uuid 16 | import shutil 17 | import subprocess 18 | import time 19 | import re 20 | from distutils.version import LooseVersion 21 | import sys 22 | import six 23 | 24 | from ipyparallel import Client 25 | from ipyparallel.apps import launcher 26 | from ipyparallel.apps.launcher import LocalControllerLauncher 27 | from ipyparallel import error as iperror 28 | from IPython.paths import locate_profile, get_ipython_dir 29 | import traitlets 30 | from traitlets import (List, Unicode, CRegExp) 31 | from IPython.core.profiledir import ProfileDir 32 | 33 | from .slurm import get_slurm_attributes 34 | from . import utils 35 | from . import lsf 36 | 37 | DEFAULT_MEM_PER_CPU = 1000 # Mb 38 | 39 | # ## Custom launchers 40 | 41 | # Handles longer timeouts for startup and shutdown ith pings between engine and controller. 42 | # Makes engine pingback shutdown higher, since this is not consecutive misses. 43 | # Gives 1 hour of non-resposiveness before a shutdown to handle interruptable queues. 44 | timeout_params = ["--timeout=960", "--IPEngineApp.wait_for_url_file=960", 45 | "--EngineFactory.max_heartbeat_misses=120"] 46 | controller_params = ["--nodb", "--hwm=1", "--scheme=leastload", 47 | "--HeartMonitor.max_heartmonitor_misses=720", 48 | "--HeartMonitor.period=5000"] 49 | 50 | # ## Work around issues with docker and VM network interfaces 51 | # Can go away when we merge changes into upstream IPython 52 | 53 | import json 54 | import socket 55 | import stat 56 | import netifaces 57 | from ipyparallel.apps.ipcontrollerapp import IPControllerApp 58 | from IPython.utils.data import uniq_stable 59 | 60 | class VMFixIPControllerApp(IPControllerApp): 61 | def _get_public_ip(self): 62 | """Avoid picking up docker and VM network interfaces in IPython 2.0. 63 | 64 | Adjusts _load_ips_netifaces from IPython.utils.localinterfaces. Changes 65 | submitted upstream so we can remove this when incorporated into released IPython. 66 | 67 | Prioritizes a set of common interfaces to try and make better decisions 68 | when choosing from multiple choices. 69 | """ 70 | # First try to get address from domain name 71 | fqdn_ip = socket.gethostbyname(socket.getfqdn()) 72 | if fqdn_ip and not fqdn_ip.startswith("127."): 73 | return fqdn_ip 74 | # otherwise retrieve from interfaces 75 | standard_ips = [] 76 | priority_ips = [] 77 | vm_ifaces = set(["docker0", "virbr0", "lxcbr0"]) # VM/container interfaces we do not want 78 | priority_ifaces = ("eth",) # Interfaces we prefer to get IPs from 79 | 80 | # list of iface names, 'lo0', 'eth0', etc. 
81 | # We get addresses and order based on priorty, with more preferred last 82 | for iface in netifaces.interfaces(): 83 | if iface not in vm_ifaces: 84 | # list of ipv4 addrinfo dicts 85 | ipv4s = netifaces.ifaddresses(iface).get(netifaces.AF_INET, []) 86 | for entry in ipv4s: 87 | addr = entry.get('addr') 88 | if not addr: 89 | continue 90 | if not (iface.startswith('lo') or addr.startswith('127.')): 91 | if iface.startswith(priority_ifaces): 92 | priority_ips.append((iface, addr)) 93 | else: 94 | standard_ips.append(addr) 95 | # Prefer earlier interfaces (eth0) over later (eth1) 96 | priority_ips.sort(reverse=True) 97 | priority_ips = [xs[1] for xs in priority_ips] 98 | public_ips = uniq_stable(standard_ips + priority_ips) 99 | return public_ips[-1] 100 | 101 | def save_connection_dict(self, fname, cdict): 102 | """Override default save_connection_dict to use fixed public IP retrieval. 103 | """ 104 | if not cdict.get("location"): 105 | cdict['location'] = self._get_public_ip() 106 | cdict["patch"] = "VMFixIPControllerApp" 107 | fname = os.path.join(self.profile_dir.security_dir, fname) 108 | self.log.info("writing connection info to %s", fname) 109 | with open(fname, 'w') as f: 110 | f.write(json.dumps(cdict, indent=2)) 111 | os.chmod(fname, stat.S_IRUSR | stat.S_IWUSR) 112 | 113 | # Increase resource limits on engines to handle additional processes 114 | # At scale we can run out of open file handles or run out of user 115 | # processes. This tries to adjust this limits for each IPython worker 116 | # within available hard limits. 117 | # Match target_procs to OSX limits for a default. 118 | target_procs = 10240 119 | resource_cmds = ["import resource", 120 | "cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC)", 121 | "target_proc = min(max_proc, %s) if max_proc > 0 else %s" % (target_procs, target_procs), 122 | "resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc))", 123 | "cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE)", 124 | "target_hdls = min(max_hdls, %s) if max_hdls > 0 else %s" % (target_procs, target_procs), 125 | "resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls))"] 126 | 127 | start_cmd = "from ipyparallel.apps.%s import launch_new_instance" 128 | engine_cmd_argv = [sys.executable, "-E", "-c"] + \ 129 | ["; ".join(resource_cmds + [start_cmd % "ipengineapp", "launch_new_instance()"])] 130 | cluster_cmd_argv = [sys.executable, "-E", "-c"] + \ 131 | ["; ".join(resource_cmds + [start_cmd % "ipclusterapp", "launch_new_instance()"])] 132 | #controller_cmd_argv = [sys.executable, "-E", "-c"] + \ 133 | # ["; ".join(resource_cmds + [start_cmd % "ipcontrollerapp", "launch_new_instance()"])] 134 | controller_cmd_argv = [sys.executable, "-E", "-c"] + \ 135 | ["; ".join(resource_cmds + ["from cluster_helper.cluster import VMFixIPControllerApp", 136 | "VMFixIPControllerApp.launch_instance()"])] 137 | 138 | def get_engine_commands(context, n): 139 | """Retrieve potentially multiple engines running in a single submit script. 140 | 141 | Need to background the initial processes if multiple run. 
142 | """ 143 | assert n > 0 144 | engine_cmd_full = '%s %s --profile-dir="{profile_dir}" --cluster-id="{cluster_id}"' % \ 145 | (' '.join(map(pipes.quote, engine_cmd_argv)), ' '.join(timeout_params)) 146 | out = [engine_cmd_full.format(**context)] 147 | for _ in range(n - 1): 148 | out.insert(0, "(%s &) &&" % (engine_cmd_full.format(**context))) 149 | return "export IPYTHONDIR={profile_dir}\n".format(**context) + "\n".join(out) 150 | 151 | # ## Platform LSF 152 | class BcbioLSFEngineSetLauncher(launcher.LSFEngineSetLauncher): 153 | """Custom launcher handling heterogeneous clusters on LSF. 154 | """ 155 | batch_file_name = Unicode("lsf_engine" + str(uuid.uuid4())) 156 | cores = traitlets.Integer(1, config=True) 157 | numengines = traitlets.Integer(1, config=True) 158 | mem = traitlets.Unicode("", config=True) 159 | tag = traitlets.Unicode("", config=True) 160 | resources = traitlets.Unicode("", config=True) 161 | job_array_template = Unicode('') 162 | queue_template = Unicode('') 163 | default_template = traitlets.Unicode("""#!/bin/sh 164 | #BSUB -q {queue} 165 | #BSUB -J {tag}-e[1-{n}] 166 | #BSUB -oo bcbio-ipengine.bsub.%%J 167 | #BSUB -n {cores} 168 | #BSUB -R "span[hosts=1]" 169 | {mem} 170 | {resources} 171 | {cmd} 172 | """) 173 | 174 | def start(self, n): 175 | self.context["cores"] = self.cores * self.numengines 176 | if self.mem: 177 | # lsf.conf can specify nonstandard units for memory reservation 178 | mem = lsf.parse_memory(float(self.mem)) 179 | # check if memory reservation is per core or per job 180 | if lsf.per_core_reservation(): 181 | mem = mem / (self.cores * self.numengines) 182 | mem = mem * self.numengines 183 | self.context["mem"] = '#BSUB -R "rusage[mem=%s]"' % mem 184 | else: 185 | self.context["mem"] = "" 186 | self.context["tag"] = self.tag if self.tag else "bcbio" 187 | self.context["resources"] = _format_lsf_resources(self.resources) 188 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 189 | return super(BcbioLSFEngineSetLauncher, self).start(n) 190 | 191 | def _format_lsf_resources(resources): 192 | resource_str = "" 193 | for r in str(resources).split(";"): 194 | if r.strip(): 195 | if "=" in r: 196 | arg, val = r.split("=") 197 | arg.strip() 198 | val.strip() 199 | else: 200 | arg = r.strip() 201 | val = "" 202 | resource_str += "#BSUB -%s %s\n" % (arg, val) 203 | return resource_str 204 | 205 | class BcbioLSFControllerLauncher(launcher.LSFControllerLauncher): 206 | batch_file_name = Unicode("lsf_controller" + str(uuid.uuid4())) 207 | tag = traitlets.Unicode("", config=True) 208 | cores = traitlets.Integer(1, config=True) 209 | resources = traitlets.Unicode("", config=True) 210 | job_array_template = Unicode('') 211 | queue_template = Unicode('') 212 | default_template = traitlets.Unicode("""#!/bin/sh 213 | #BSUB -q {queue} 214 | #BSUB -J {tag}-c 215 | #BSUB -n {cores} 216 | #BSUB -oo bcbio-ipcontroller.bsub.%%J 217 | {resources} 218 | export IPYTHONDIR={profile_dir} 219 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 220 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 221 | ' '.join(controller_params))) 222 | def start(self): 223 | self.context["cores"] = self.cores 224 | self.context["tag"] = self.tag if self.tag else "bcbio" 225 | self.context["resources"] = _format_lsf_resources(self.resources) 226 | return super(BcbioLSFControllerLauncher, self).start() 227 | 228 | # ## Sun Grid Engine (SGE) 229 | 230 | def _local_environment_exports(): 231 | """Create export string for a batch 
script with environmental variables to pass. 232 | 233 | Passes additional environmental variables not inherited by some schedulers. 234 | LD_LIBRARY_PATH filtered by '-V' on recent Grid Engine releases. 235 | LD_PRELOAD filtered by SGE without an option to turn off: 236 | 237 | https://github.com/gridengine/gridengine/blob/6a5407d56c85b39290ac2488fb6dec1a4404a974/source/libs/sgeobj/sge_var.c#L994 238 | but used by tools like PetaGene (http://www.petagene.com/). 239 | """ 240 | exports = [] 241 | for envname in ["LD_LIBRARY_PATH", "LD_PRELOAD"]: 242 | envval = os.environ.get(envname) 243 | if envval: 244 | exports.append("export %s=%s" % (envname, envval)) 245 | return "\n".join(exports) 246 | 247 | class BcbioSGEEngineSetLauncher(launcher.SGEEngineSetLauncher): 248 | """Custom launcher handling heterogeneous clusters on SGE. 249 | """ 250 | batch_file_name = Unicode("sge_engine" + str(uuid.uuid4())) 251 | cores = traitlets.Integer(1, config=True) 252 | numengines = traitlets.Integer(1, config=True) 253 | pename = traitlets.Unicode("", config=True) 254 | resources = traitlets.Unicode("", config=True) 255 | mem = traitlets.Unicode("", config=True) 256 | memtype = traitlets.Unicode("mem_free", config=True) 257 | tag = traitlets.Unicode("", config=True) 258 | queue_template = Unicode('') 259 | default_template = traitlets.Unicode("""#$ -V 260 | #$ -cwd 261 | #$ -w w 262 | #$ -j y 263 | #$ -S /bin/sh 264 | #$ -N {tag}-e 265 | #$ -t 1-{n} 266 | #$ -pe {pename} {cores} 267 | {queue} 268 | {mem} 269 | {resources} 270 | {exports} 271 | {cmd} 272 | """) 273 | 274 | def start(self, n): 275 | self.context["cores"] = self.cores * self.numengines 276 | if self.mem: 277 | if self.memtype == "rss": 278 | self.context["mem"] = "#$ -l rss=%sM" % int(float(self.mem) * 1024 / self.cores * self.numengines) 279 | elif self.memtype == "virtual_free": 280 | self.context["mem"] = "#$ -l virtual_free=%sM" % int(float(self.mem) * 1024 * self.numengines) 281 | else: 282 | self.context["mem"] = "#$ -l mem_free=%sM" % int(float(self.mem) * 1024 * self.numengines) 283 | else: 284 | self.context["mem"] = "" 285 | if self.queue: 286 | self.context["queue"] = "#$ -q %s" % self.queue 287 | else: 288 | self.context["queue"] = "" 289 | self.context["tag"] = self.tag if self.tag else "bcbio" 290 | self.context["pename"] = str(self.pename) 291 | self.context["resources"] = "\n".join([_prep_sge_resource(r) 292 | for r in str(self.resources).split(";") 293 | if r.strip()]) 294 | self.context["exports"] = _local_environment_exports() 295 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 296 | return super(BcbioSGEEngineSetLauncher, self).start(n) 297 | 298 | class BcbioSGEControllerLauncher(launcher.SGEControllerLauncher): 299 | batch_file_name = Unicode("sge_controller" + str(uuid.uuid4())) 300 | tag = traitlets.Unicode("", config=True) 301 | cores = traitlets.Integer(1, config=True) 302 | pename = traitlets.Unicode("", config=True) 303 | resources = traitlets.Unicode("", config=True) 304 | queue_template = Unicode('') 305 | default_template = traitlets.Unicode(u"""#$ -V 306 | #$ -cwd 307 | #$ -w w 308 | #$ -j y 309 | #$ -S /bin/sh 310 | #$ -N {tag}-c 311 | {cores} 312 | {queue} 313 | {resources} 314 | {exports} 315 | export IPYTHONDIR={profile_dir} 316 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 317 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 318 | ' '.join(controller_params))) 319 | def start(self): 320 | self.context["cores"] = "#$ -pe %s %s" % 
(self.pename, self.cores) if self.cores > 1 else "" 321 | self.context["tag"] = self.tag if self.tag else "bcbio" 322 | self.context["resources"] = "\n".join([_prep_sge_resource(r) 323 | for r in str(self.resources).split(";") 324 | if r.strip()]) 325 | if self.queue: 326 | self.context["queue"] = "#$ -q %s" % self.queue 327 | else: 328 | self.context["queue"] = "" 329 | self.context["exports"] = _local_environment_exports() 330 | return super(BcbioSGEControllerLauncher, self).start() 331 | 332 | def _prep_sge_resource(resource): 333 | """Prepare SGE resource specifications from the command line handling special cases. 334 | """ 335 | resource = resource.strip() 336 | try: 337 | k, v = resource.split("=") 338 | except ValueError: 339 | k, v = None, None 340 | if k and k in set(["ar", "m", "M"]): 341 | return "#$ -%s %s" % (k, v) 342 | else: 343 | return "#$ -l %s" % resource 344 | 345 | def _find_parallel_environment(queue): 346 | """Find an SGE/OGE parallel environment for running multicore jobs in specified queue. 347 | """ 348 | base_queue, ext = os.path.splitext(queue) 349 | queue = base_queue + ".q" 350 | 351 | available_pes = [] 352 | for name in subprocess.check_output(["qconf", "-spl"]).decode().strip().split(): 353 | if name: 354 | for line in subprocess.check_output(["qconf", "-sp", name]).decode().split("\n"): 355 | if _has_parallel_environment(line): 356 | if _queue_can_access_pe(name, queue) or _queue_can_access_pe(name, base_queue) or _queue_can_access_pe(name, base_queue + ext): 357 | available_pes.append(name) 358 | if len(available_pes) == 0: 359 | raise ValueError("Could not find an SGE environment configured for parallel execution. " 360 | "See %s for SGE setup instructions." % 361 | "https://blogs.oracle.com/templedf/entry/configuring_a_new_parallel_environment") 362 | else: 363 | return _prioritize_pes(available_pes) 364 | 365 | def _prioritize_pes(choices): 366 | """Prioritize and deprioritize paired environments based on names. 367 | 368 | We're looking for multiprocessing friendly environments, so prioritize ones with SMP 369 | in the name and deprioritize those with MPI. 370 | """ 371 | # lower scores = better 372 | ranks = {"smp": -1, "mpi": 1} 373 | sort_choices = [] 374 | for n in choices: 375 | # Identify if it fits in any special cases 376 | special_case = False 377 | for k, val in ranks.items(): 378 | if n.lower().find(k) >= 0: 379 | sort_choices.append((val, n)) 380 | special_case = True 381 | break 382 | if not special_case: # otherwise, no priority/de-priority 383 | sort_choices.append((0, n)) 384 | sort_choices.sort() 385 | return sort_choices[0][1] 386 | 387 | def _parseSGEConf(data): 388 | """Handle SGE multiple line output nastiness. 389 | From: https://github.com/clovr/vappio/blob/master/vappio-twisted/vappio_tx/load/sge_queue.py 390 | """ 391 | lines = data.split('\n') 392 | multiline = False 393 | ret = {} 394 | for line in lines: 395 | line = line.strip() 396 | if line: 397 | if not multiline: 398 | key, value = line.split(' ', 1) 399 | value = value.strip().rstrip('\\') 400 | ret[key] = value 401 | else: 402 | # Making use of the fact that the key was created 403 | # in the previous iteration and is stil lin scope 404 | ret[key] += line 405 | multiline = (line[-1] == '\\') 406 | return ret 407 | 408 | def _queue_can_access_pe(pe_name, queue): 409 | """Check if a queue has access to a specific parallel environment, using qconf. 
410 | """ 411 | try: 412 | queue_config = _parseSGEConf(subprocess.check_output(["qconf", "-sq", queue]).decode()) 413 | except: 414 | return False 415 | for test_pe_name in re.split('\W+|,', queue_config["pe_list"]): 416 | if test_pe_name == pe_name: 417 | return True 418 | return False 419 | 420 | def _has_parallel_environment(line): 421 | if line.startswith("allocation_rule"): 422 | if line.find("$pe_slots") >= 0 or line.find("$fill_up") >= 0: 423 | return True 424 | return False 425 | 426 | # ## SLURM 427 | class SLURMLauncher(launcher.BatchSystemLauncher): 428 | """A BatchSystemLauncher subclass for SLURM 429 | """ 430 | submit_command = List(['sbatch'], config=True, 431 | help="The SLURM submit command ['sbatch']") 432 | delete_command = List(['scancel'], config=True, 433 | help="The SLURM delete command ['scancel']") 434 | job_id_regexp = CRegExp(r'\d+', config=True, 435 | help="A regular expression used to get the job id from the output of 'sbatch'") 436 | batch_file = Unicode(u'', config=True, 437 | help="The string that is the batch script template itself.") 438 | 439 | queue_regexp = CRegExp('#SBATCH\W+-p\W+\w') 440 | queue_template = Unicode('#SBATCH -p {queue}') 441 | 442 | 443 | class BcbioSLURMEngineSetLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 444 | """Custom launcher handling heterogeneous clusters on SLURM 445 | """ 446 | batch_file_name = Unicode("SLURM_engine" + str(uuid.uuid4())) 447 | machines = traitlets.Integer(0, config=True) 448 | cores = traitlets.Integer(1, config=True) 449 | numengines = traitlets.Integer(1, config=True) 450 | mem = traitlets.Unicode("", config=True) 451 | tag = traitlets.Unicode("", config=True) 452 | account = traitlets.Unicode("", config=True) 453 | timelimit = traitlets.Unicode("", config=True) 454 | resources = traitlets.Unicode("", config=True) 455 | default_template = traitlets.Unicode("""#!/bin/sh 456 | #SBATCH -p {queue} 457 | #SBATCH -J {tag}-e[1-{n}] 458 | #SBATCH -o bcbio-ipengine.out.%A_%a 459 | #SBATCH -e bcbio-ipengine.err.%A_%a 460 | #SBATCH --cpus-per-task={cores} 461 | #SBATCH --array=1-{n} 462 | #SBATCH -t {timelimit} 463 | {account}{machines}{mem}{resources} 464 | {cmd} 465 | """) 466 | 467 | def start(self, n): 468 | self.context["cores"] = self.cores * self.numengines 469 | if self.mem: 470 | self.context["mem"] = "#SBATCH --mem=%s\n" % int(float(self.mem) * 1024.0 * self.numengines) 471 | else: 472 | self.context["mem"] = "#SBATCH --mem=%d\n" % int(DEFAULT_MEM_PER_CPU * self.cores * self.numengines) 473 | self.context["tag"] = self.tag if self.tag else "bcbio" 474 | self.context["machines"] = ("#SBATCH %s\n" % (self.machines) if int(self.machines) > 0 else "") 475 | self.context["account"] = ("#SBATCH -A %s\n" % self.account if self.account else "") 476 | self.context["timelimit"] = self.timelimit 477 | self.context["resources"] = "\n".join(["#SBATCH --%s\n" % r.strip() 478 | for r in str(self.resources).split(";") 479 | if r.strip()]) 480 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 481 | return super(BcbioSLURMEngineSetLauncher, self).start(n) 482 | 483 | class BcbioSLURMControllerLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 484 | batch_file_name = Unicode("SLURM_controller" + str(uuid.uuid4())) 485 | account = traitlets.Unicode("", config=True) 486 | cores = traitlets.Integer(1, config=True) 487 | timelimit = traitlets.Unicode("", config=True) 488 | mem = traitlets.Unicode("", config=True) 489 | tag = traitlets.Unicode("", config=True) 490 | resources = 
traitlets.Unicode("", config=True) 491 | default_template = traitlets.Unicode("""#!/bin/sh 492 | #SBATCH -J {tag}-c 493 | #SBATCH -o bcbio-ipcontroller.out.%%A_%%a 494 | #SBATCH -e bcbio-ipcontroller.err.%%A_%%a 495 | #SBATCH -t {timelimit} 496 | #SBATCH --cpus-per-task={cores} 497 | {account}{mem}{resources} 498 | export IPYTHONDIR={profile_dir} 499 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 500 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 501 | ' '.join(controller_params))) 502 | def start(self): 503 | self.context["account"] = self.account 504 | self.context["timelimit"] = self.timelimit 505 | self.context["cores"] = self.cores 506 | if self.mem: 507 | self.context["mem"] = "#SBATCH --mem=%s\n" % int(float(self.mem) * 1024.0) 508 | else: 509 | self.context["mem"] = "#SBATCH --mem=%d\n" % (4 * DEFAULT_MEM_PER_CPU) 510 | self.context["tag"] = self.tag if self.tag else "bcbio" 511 | self.context["account"] = ("#SBATCH -A %s\n" % self.account if self.account else "") 512 | self.context["resources"] = "\n".join(["#SBATCH --%s\n" % r.strip() 513 | for r in str(self.resources).split(";") 514 | if r.strip()]) 515 | return super(BcbioSLURMControllerLauncher, self).start(1) 516 | 517 | 518 | class BcbioOLDSLURMEngineSetLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 519 | """Launch engines using SLURM for version < 2.6""" 520 | machines = traitlets.Integer(1, config=True) 521 | account = traitlets.Unicode("", config=True) 522 | timelimit = traitlets.Unicode("", config=True) 523 | batch_file_name = Unicode("SLURM_engines" + str(uuid.uuid4()), 524 | config=True, help="batch file name for the engine(s) job.") 525 | 526 | default_template = Unicode(u"""#!/bin/sh 527 | #SBATCH -A {account} 528 | #SBATCH --job-name ipengine 529 | #SBATCH -N {machines} 530 | #SBATCH -t {timelimit} 531 | export IPYTHONDIR={profile_dir} 532 | srun -N {machines} -n {n} %s %s --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" 533 | """ % (' '.join(map(pipes.quote, engine_cmd_argv)), 534 | ' '.join(timeout_params))) 535 | 536 | def start(self, n): 537 | """Start n engines by profile or profile_dir.""" 538 | self.context["machines"] = self.machines 539 | self.context["account"] = self.account 540 | self.context["timelimit"] = self.timelimit 541 | return super(BcbioOLDSLURMEngineSetLauncher, self).start(n) 542 | 543 | 544 | class BcbioOLDSLURMControllerLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 545 | """Launch a controller using SLURM for versions < 2.6""" 546 | account = traitlets.Unicode("", config=True) 547 | timelimit = traitlets.Unicode("", config=True) 548 | batch_file_name = Unicode("SLURM_controller" + str(uuid.uuid4()), 549 | config=True, help="batch file name for the engine(s) job.") 550 | 551 | default_template = Unicode("""#!/bin/sh 552 | #SBATCH -A {account} 553 | #SBATCH --job-name ipcontroller 554 | #SBATCH -t {timelimit} 555 | export IPYTHONDIR={profile_dir} 556 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 557 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 558 | ' '.join(controller_params))) 559 | 560 | def start(self): 561 | """Start the controller by profile or profile_dir.""" 562 | self.context["account"] = self.account 563 | self.context["timelimit"] = self.timelimit 564 | return super(BcbioOLDSLURMControllerLauncher, self).start(1) 565 | 566 | # ## Torque 567 | class TORQUELauncher(launcher.BatchSystemLauncher): 568 | """A BatchSystemLauncher subclass for Torque""" 569 | 
570 | submit_command = List(['qsub'], config=True, 571 | help="The PBS submit command ['qsub']") 572 | delete_command = List(['qdel'], config=True, 573 | help="The PBS delete command ['qsub']") 574 | job_id_regexp = CRegExp(r'\d+(\[\])?', config=True, 575 | help="Regular expresion for identifying the job ID [r'\d+']") 576 | 577 | batch_file = Unicode(u'') 578 | #job_array_regexp = CRegExp('#PBS\W+-t\W+[\w\d\-\$]+') 579 | #job_array_template = Unicode('#PBS -t 1-{n}') 580 | queue_regexp = CRegExp('#PBS\W+-q\W+\$?\w+') 581 | queue_template = Unicode('#PBS -q {queue}') 582 | 583 | def _prep_torque_resources(resources): 584 | """Prepare resources passed to torque from input parameters. 585 | """ 586 | out = [] 587 | has_walltime = False 588 | for r in resources.split(";"): 589 | if "=" in r: 590 | k, v = r.split("=") 591 | k.strip() 592 | v.strip() 593 | else: 594 | k = "" 595 | if k.lower() in ["a", "account", "acct"] and v: 596 | out.append("#PBS -A %s" % v) 597 | elif r.strip(): 598 | if k.lower() == "walltime": 599 | has_walltime = True 600 | out.append("#PBS -l %s" % r.strip()) 601 | if not has_walltime: 602 | out.append("#PBS -l walltime=239:00:00") 603 | return out 604 | 605 | class BcbioTORQUEEngineSetLauncher(TORQUELauncher, launcher.BatchClusterAppMixin): 606 | """Launch Engines using Torque""" 607 | cores = traitlets.Integer(1, config=True) 608 | mem = traitlets.Unicode("", config=True) 609 | tag = traitlets.Unicode("", config=True) 610 | numengines = traitlets.Integer(1, config=True) 611 | resources = traitlets.Unicode("", config=True) 612 | batch_file_name = Unicode("torque_engines" + str(uuid.uuid4()), 613 | config=True, help="batch file name for the engine(s) job.") 614 | default_template = Unicode(u"""#!/bin/sh 615 | #PBS -V 616 | #PBS -S /bin/sh 617 | #PBS -j oe 618 | #PBS -N {tag}-e 619 | #PBS -t 1-{n} 620 | #PBS -l nodes=1:ppn={cores} 621 | {mem} 622 | {resources} 623 | {exports} 624 | cd $PBS_O_WORKDIR 625 | {cmd} 626 | """) 627 | 628 | def start(self, n): 629 | """Start n engines by profile or profile_dir.""" 630 | try: 631 | self.context["cores"] = self.cores * self.numengines 632 | if self.mem: 633 | self.context["mem"] = "#PBS -l mem=%smb" % int(float(self.mem) * 1024 * self.numengines) 634 | else: 635 | self.context["mem"] = "" 636 | 637 | self.context["tag"] = self.tag if self.tag else "bcbio" 638 | self.context["resources"] = "\n".join(_prep_torque_resources(self.resources)) 639 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 640 | self.context["exports"] = _local_environment_exports() 641 | return super(BcbioTORQUEEngineSetLauncher, self).start(n) 642 | except: 643 | self.log.exception("Engine start failed") 644 | 645 | class BcbioTORQUEControllerLauncher(TORQUELauncher, launcher.BatchClusterAppMixin): 646 | """Launch a controller using Torque.""" 647 | batch_file_name = Unicode("torque_controller" + str(uuid.uuid4()), 648 | config=True, help="batch file name for the engine(s) job.") 649 | cores = traitlets.Integer(1, config=True) 650 | tag = traitlets.Unicode("", config=True) 651 | resources = traitlets.Unicode("", config=True) 652 | default_template = Unicode("""#!/bin/sh 653 | #PBS -V 654 | #PBS -S /bin/sh 655 | #PBS -N {tag}-c 656 | #PBS -j oe 657 | #PBS -l nodes=1:ppn={cores} 658 | {resources} 659 | {exports} 660 | cd $PBS_O_WORKDIR 661 | export IPYTHONDIR={profile_dir} 662 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 663 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 664 | ' 
'.join(controller_params))) 665 | 666 | def start(self): 667 | """Start the controller by profile or profile_dir.""" 668 | try: 669 | self.context["cores"] = self.cores 670 | self.context["tag"] = self.tag if self.tag else "bcbio" 671 | self.context["resources"] = "\n".join(_prep_torque_resources(self.resources)) 672 | self.context["exports"] = _local_environment_exports() 673 | return super(BcbioTORQUEControllerLauncher, self).start(1) 674 | except: 675 | self.log.exception("Controller start failed") 676 | 677 | # ## PBSPro 678 | class PBSPROLauncher(launcher.PBSLauncher): 679 | """A BatchSystemLauncher subclass for PBSPro.""" 680 | job_array_regexp = CRegExp('#PBS\W+-J\W+[\w\d\-\$]+') 681 | job_array_template = Unicode('') 682 | 683 | def stop(self): 684 | job_ids = self.job_id.split(";") 685 | for job in job_ids: 686 | subprocess.check_call("qdel %s" % job, shell=True) 687 | 688 | def notify_start(self, data): 689 | self.log.debug('Process %r started: %r', self.args[0], data) 690 | self.start_data = data 691 | self.state = 'running' 692 | self.job_id = data 693 | return data 694 | 695 | class BcbioPBSPROEngineSetLauncher(PBSPROLauncher, launcher.BatchClusterAppMixin): 696 | """Launch Engines using PBSPro""" 697 | 698 | batch_file_name = Unicode('pbspro_engines' + str(uuid.uuid4()), 699 | config=True, 700 | help="batch file name for the engine(s) job.") 701 | tag = traitlets.Unicode("", config=True) 702 | cores = traitlets.Integer(1, config=True) 703 | mem = traitlets.Unicode("", config=True) 704 | numengines = traitlets.Integer(1, config=True) 705 | resources = traitlets.Unicode("", config=True) 706 | default_template = Unicode(u"""#!/bin/sh 707 | #PBS -V 708 | #PBS -S /bin/sh 709 | #PBS -N {tag}-e 710 | {resources} 711 | {exports} 712 | cd $PBS_O_WORKDIR 713 | {cmd} 714 | """) 715 | 716 | def start(self, n): 717 | tcores = (self.cores * self.numengines) 718 | tmem = int(float(self.mem) * 1024 * self.numengines) 719 | if _pbspro_noselect(self.resources): 720 | resources = "#PBS -l ncpus=%d\n" % tcores 721 | else: 722 | resources = "#PBS -l select=1:ncpus=%d" % tcores 723 | if self.mem: 724 | if _pbspro_noselect(self.resources): 725 | resources += "#PBS -l mem=%smb" % tmem 726 | else: 727 | resources += ":mem=%smb" % tmem 728 | resources = "\n".join([resources] + _prep_pbspro_resources(self.resources)) 729 | self.context["resources"] = resources 730 | self.context["cores"] = self.cores 731 | self.context["tag"] = self.tag if self.tag else "bcbio" 732 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 733 | self.context["exports"] = _local_environment_exports() 734 | self.write_batch_script(n) 735 | job_ids = [] 736 | for i in range(n): 737 | output = subprocess.check_output("qsub < %s" % self.batch_file_name, 738 | shell=True, universal_newlines=True) 739 | if six.PY3: 740 | if not isinstance(output, str): 741 | output = output.decode('ascii', 'ignore') 742 | job_ids.append(output.strip()) 743 | job_id = ";".join(job_ids) 744 | self.notify_start(job_id) 745 | return job_id 746 | 747 | 748 | class BcbioPBSPROControllerLauncher(PBSPROLauncher, launcher.BatchClusterAppMixin): 749 | """Launch a controller using PBSPro.""" 750 | 751 | batch_file_name = Unicode("pbspro_controller" + str(uuid.uuid4()), 752 | config=True, 753 | help="batch file name for the controller job.") 754 | tag = traitlets.Unicode("", config=True) 755 | cores = traitlets.Integer(1, config=True) 756 | mem = traitlets.Unicode("", config=True) 757 | resources = traitlets.Unicode("", config=True) 758 | 
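    # The template below launches the controller through the patched
    # VMFixIPControllerApp (controller_cmd_argv) with the shared
    # controller_params; {resources} is filled in start() from
    # _prep_pbspro_resources() plus an ncpus/mem request built according to
    # whether the "noselect" resource is set.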
default_template = Unicode("""#!/bin/sh 759 | #PBS -V 760 | #PBS -S /bin/sh 761 | #PBS -N {tag}-c 762 | {resources} 763 | {exports} 764 | cd $PBS_O_WORKDIR 765 | export IPYTHONDIR={profile_dir} 766 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 767 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 768 | ' '.join(controller_params))) 769 | 770 | def start(self): 771 | """Start the controller by profile or profile_dir.""" 772 | pbsproresources = _prep_pbspro_resources(self.resources) 773 | tmem = int(float(self.mem) * 1020.-1) if self.mem else (4 * DEFAULT_MEM_PER_CPU) 774 | if _pbspro_noselect(self.resources): 775 | cpuresource = "#PBS -l ncpus=%d" % self.cores 776 | pbsproresources.append("#PBS -l mem=%smb" % tmem) 777 | else: 778 | cpuresource = "#PBS -l select=1:ncpus=%d" % self.cores 779 | cpuresource += ":mem=%smb" % tmem 780 | pbsproresources.append(cpuresource) 781 | resources = "\n".join(pbsproresources) 782 | self.context["resources"] = resources 783 | self.context["cores"] = self.cores 784 | self.context["tag"] = self.tag if self.tag else "bcbio" 785 | self.context["exports"] = _local_environment_exports() 786 | self.resources = resources 787 | return super(BcbioPBSPROControllerLauncher, self).start(1) 788 | 789 | class BcbioLocalControllerLauncher(LocalControllerLauncher): 790 | def find_args(self): 791 | extra_params = ["--ip=*", "--log-to-file", 792 | "--profile-dir=%s" % self.profile_dir, 793 | "--cluster-id=%s" % self.cluster_id] 794 | return controller_cmd_argv + controller_params + extra_params 795 | def start(self): 796 | return super(BcbioLocalControllerLauncher, self).start() 797 | 798 | def _prep_pbspro_resources(resources): 799 | """Prepare resources passed to pbspro from input parameters. 800 | """ 801 | out = [] 802 | has_walltime = False 803 | for r in resources.split(";"): 804 | if "=" in r: 805 | k, v = r.split("=") 806 | k.strip() 807 | v.strip() 808 | else: 809 | k = "" 810 | if k.lower() in ["a", "account", "acct"] and v: 811 | out.append("#PBS -A %s" % v) 812 | elif r.strip(): 813 | if k.lower() == "walltime": 814 | has_walltime = True 815 | if r.strip() == "noselect": 816 | continue 817 | out.append("#PBS -l %s" % r.strip()) 818 | if not has_walltime: 819 | out.append("#PBS -l walltime=239:00:00") 820 | return out 821 | 822 | def _pbspro_noselect(resources): 823 | """ 824 | handles PBSPro setups which don't support the select statement (NCI) 825 | """ 826 | for r in resources.split(";"): 827 | if r.strip() == "noselect": 828 | return True 829 | return False 830 | 831 | def _get_profile_args(profile): 832 | if os.path.isdir(profile) and os.path.isabs(profile): 833 | return ["--profile-dir=%s" % profile] 834 | else: 835 | return ["--profile=%s" % profile] 836 | 837 | def _scheduler_resources(scheduler, params, queue): 838 | """Retrieve custom resource tweaks for specific schedulers. 839 | Handles SGE parallel environments, which allow multicore jobs 840 | but are specific to different environments. 
841 | Pulls out hacks to work in different environments: 842 | - mincores -- Require a minimum number of cores when submitting jobs 843 | to avoid single core jobs on constrained queues 844 | - conmem -- Memory (in Gb) for the controller to use 845 | """ 846 | orig_resources = copy.deepcopy(params.get("resources", [])) 847 | specials = {} 848 | if not orig_resources: 849 | orig_resources = [] 850 | if isinstance(orig_resources, six.string_types): 851 | orig_resources = orig_resources.split(";") 852 | resources = [] 853 | for r in orig_resources: 854 | if r.startswith(("mincores=", "minconcores=", "conmem=")): 855 | name, val = r.split("=") 856 | specials[name] = int(val) 857 | else: 858 | resources.append(r) 859 | if scheduler in ["SGE"]: 860 | pass_resources = [] 861 | for r in resources: 862 | if r.startswith("pename="): 863 | _, pename = r.split("=") 864 | specials["pename"] = pename 865 | elif r.startswith("memtype="): 866 | _, memtype = r.split("=") 867 | specials["memtype"] = memtype 868 | else: 869 | pass_resources.append(r) 870 | resources = pass_resources 871 | if "pename" not in specials: 872 | specials["pename"] = _find_parallel_environment(queue) 873 | return ";".join(resources), specials 874 | 875 | def _start(scheduler, profile, queue, num_jobs, cores_per_job, cluster_id, 876 | extra_params): 877 | """Starts cluster from commandline. 878 | """ 879 | ns = "cluster_helper.cluster" 880 | scheduler = scheduler.upper() 881 | if scheduler == "SLURM" and _slurm_is_old(): 882 | scheduler = "OLDSLURM" 883 | engine_class = "Bcbio%sEngineSetLauncher" % scheduler 884 | controller_class = "Bcbio%sControllerLauncher" % scheduler 885 | if not (engine_class in globals() and controller_class in globals()): 886 | print ("The engine and controller class %s and %s are not " 887 | "defined. " % (engine_class, controller_class)) 888 | print ("This may be due to ipython-cluster-helper not supporting " 889 | "your scheduler. If it should, please file a bug report at " 890 | "http://github.com/roryk/ipython-cluster-helper. 
Thanks!") 891 | sys.exit(1) 892 | resources, specials = _scheduler_resources(scheduler, extra_params, queue) 893 | if scheduler in ["OLDSLURM", "SLURM"]: 894 | resources, slurm_atrs = get_slurm_attributes(queue, resources) 895 | else: 896 | slurm_atrs = None 897 | mincores = specials.get("mincores", 1) 898 | if mincores > cores_per_job: 899 | if cores_per_job > 1: 900 | mincores = cores_per_job 901 | else: 902 | mincores = int(math.ceil(mincores / float(cores_per_job))) 903 | num_jobs = int(math.ceil(num_jobs / float(mincores))) 904 | 905 | args = cluster_cmd_argv + \ 906 | ["start", 907 | "--daemonize=True", 908 | "--IPClusterEngines.early_shutdown=240", 909 | "--delay=10", 910 | "--log-to-file", 911 | "--debug", 912 | "--n=%s" % num_jobs, 913 | "--%s.cores=%s" % (engine_class, cores_per_job), 914 | "--%s.resources='%s'" % (engine_class, resources), 915 | "--%s.mem='%s'" % (engine_class, extra_params.get("mem", "")), 916 | "--%s.tag='%s'" % (engine_class, extra_params.get("tag", "")), 917 | "--IPClusterStart.engine_launcher_class=%s.%s" % (ns, engine_class), 918 | "--%sLauncher.queue='%s'" % (scheduler, queue), 919 | "--cluster-id=%s" % (cluster_id) 920 | ] 921 | # set controller options 922 | local_controller = extra_params.get("local_controller") 923 | if local_controller: 924 | controller_class = "BcbioLocalControllerLauncher" 925 | args += ["--IPClusterStart.controller_launcher_class=%s.%s" % 926 | (ns, controller_class)] 927 | else: 928 | args += [ 929 | "--%s.mem='%s'" % (controller_class, specials.get("conmem", "")), 930 | "--%s.tag='%s'" % (engine_class, extra_params.get("tag", "")), 931 | "--%s.tag='%s'" % (controller_class, extra_params.get("tag", "")), 932 | "--%s.cores=%s" % (controller_class, specials.get("minconcores", 1)), 933 | "--%s.resources='%s'" % (controller_class, resources), 934 | "--IPClusterStart.controller_launcher_class=%s.%s" % 935 | (ns, controller_class)] 936 | 937 | args += _get_profile_args(profile) 938 | if mincores > 1 and mincores > cores_per_job: 939 | args += ["--%s.numengines=%s" % (engine_class, mincores)] 940 | if specials.get("pename"): 941 | if not local_controller: 942 | args += ["--%s.pename=%s" % (controller_class, specials["pename"])] 943 | args += ["--%s.pename=%s" % (engine_class, specials["pename"])] 944 | if specials.get("memtype"): 945 | args += ["--%s.memtype=%s" % (engine_class, specials["memtype"])] 946 | if slurm_atrs: 947 | args += ["--%s.machines=%s" % (engine_class, slurm_atrs.get("machines", "0"))] 948 | args += ["--%s.timelimit='%s'" % (engine_class, slurm_atrs["timelimit"])] 949 | if not local_controller: 950 | args += ["--%s.timelimit='%s'" % (controller_class, slurm_atrs["timelimit"])] 951 | if slurm_atrs.get("account"): 952 | args += ["--%s.account=%s" % (engine_class, slurm_atrs["account"])] 953 | if not local_controller: 954 | args += ["--%s.account=%s" % (controller_class, slurm_atrs["account"])] 955 | subprocess.check_call(args) 956 | return cluster_id 957 | 958 | def _start_local(cores, profile, cluster_id): 959 | """Start a local non-distributed IPython engine. 
Useful for testing 960 | """ 961 | args = cluster_cmd_argv + \ 962 | ["start", 963 | "--daemonize=True", 964 | "--log-to-file", 965 | "--debug", 966 | "--cluster-id=%s" % cluster_id, 967 | "--n=%s" % cores] 968 | args += _get_profile_args(profile) 969 | subprocess.check_call(args) 970 | return cluster_id 971 | 972 | def stop_from_view(view): 973 | _stop(view.clusterhelper["profile"], view.clusterhelper["cluster_id"]) 974 | 975 | def _stop(profile, cluster_id): 976 | args = cluster_cmd_argv + \ 977 | ["stop", "--cluster-id=%s" % cluster_id] 978 | args += _get_profile_args(profile) 979 | try: 980 | subprocess.check_call(args) 981 | except subprocess.CalledProcessError: 982 | print('Manual shutdown of cluster failed, often this is because the ' 983 | 'cluster was already shutdown.') 984 | 985 | class ClusterView(object): 986 | """Provide a view on an ipython cluster for processing. 987 | 988 | - scheduler: The type of cluster to start (lsf, sge, pbs, torque). 989 | - num_jobs: Number of jobs to start. 990 | - cores_per_job: The number of cores to use for each job. 991 | - start_wait: How long to wait for the cluster to startup, in minutes. 992 | Defaults to 16 minutes. Set to longer for slow starting clusters. 993 | - retries: Number of retries to allow for failed tasks. 994 | - wait_for_all_engines: If False (default), start using cluster 995 | as soon as first engine is up. If True, wait for all 996 | engines. 997 | """ 998 | def __init__(self, scheduler, queue, num_jobs, cores_per_job=1, profile=None, 999 | start_wait=16, extra_params=None, retries=None, direct=False, 1000 | wait_for_all_engines=False): 1001 | self.stopped = False 1002 | self.profile = profile 1003 | num_jobs = int(num_jobs) 1004 | cores_per_job = int(cores_per_job) 1005 | start_wait = int(start_wait) 1006 | 1007 | if extra_params is None: 1008 | extra_params = {} 1009 | max_delay = start_wait * 60 1010 | delay = 5 1011 | max_tries = 10 1012 | _create_base_ipython_dirs() 1013 | if self.profile is None: 1014 | self.has_throwaway = True 1015 | self.profile = create_throwaway_profile() 1016 | else: 1017 | # ensure we have an .ipython directory to prevent issues 1018 | # creating it during parallel startup 1019 | cmd = [sys.executable, "-E", "-c", "from IPython import start_ipython; start_ipython()", 1020 | "profile", "create", "--parallel"] + _get_profile_args(self.profile) 1021 | subprocess.check_call(cmd) 1022 | self.has_throwaway = False 1023 | num_tries = 0 1024 | 1025 | self.cluster_id = str(uuid.uuid4()) 1026 | url_file = get_url_file(self.profile, self.cluster_id) 1027 | 1028 | if profile and os.path.isdir(profile) and os.path.isabs(profile): 1029 | os.environ["IPYTHONDIR"] = profile 1030 | while 1: 1031 | try: 1032 | if extra_params.get("run_local") or queue == "run_local": 1033 | _start_local(num_jobs, self.profile, self.cluster_id) 1034 | else: 1035 | _start(scheduler, self.profile, queue, num_jobs, 1036 | cores_per_job, self.cluster_id, extra_params) 1037 | break 1038 | except subprocess.CalledProcessError: 1039 | if num_tries > max_tries: 1040 | raise 1041 | num_tries += 1 1042 | time.sleep(delay) 1043 | 1044 | try: 1045 | self.client = None 1046 | if wait_for_all_engines: 1047 | # Start using cluster when this many engines are up 1048 | need_engines = num_jobs 1049 | else: 1050 | need_engines = 1 1051 | slept = 0 1052 | max_up = 0 1053 | up = 0 1054 | while up < need_engines: 1055 | up = _nengines_up(url_file) 1056 | print('\r{0} Engines running'.format(up), end="") 1057 | if up < max_up: 1058 | print 
("\nEngine(s) that were up have shutdown prematurely. " 1059 | "Aborting cluster startup.") 1060 | _stop(self.profile, self.cluster_id) 1061 | sys.exit(1) 1062 | max_up = up 1063 | time.sleep(delay) 1064 | slept += delay 1065 | if slept > max_delay: 1066 | raise IOError(""" 1067 | 1068 | The cluster startup timed out. This could be for a couple of reasons. The 1069 | most common reason is that the queue you are submitting jobs to is 1070 | oversubscribed. You can check if this is what is happening by trying again, 1071 | and watching to see if jobs are in a pending state or a running state when 1072 | the startup times out. If they are in the pending state, that means we just 1073 | need to wait longer for them to start, which you can specify by passing 1074 | the --timeout parameter, in minutes. 1075 | 1076 | The second reason is that there is a problem with the controller and engine 1077 | jobs being submitted to the scheduler. In the directory you ran from, 1078 | you should see files that are named YourScheduler_enginesABunchOfNumbers and 1079 | YourScheduler_controllerABunchOfNumbers. If you submit one of those files 1080 | manually to your scheduler (for example bsub < YourScheduler_controllerABunchOfNumbers) 1081 | You will get a more helpful error message that might help you figure out what 1082 | is going wrong. 1083 | 1084 | The third reason is that you need to submit your bcbio_nextgen.py job itself as a job; 1085 | bcbio-nextgen needs to run on a compute node, not the login node. So the 1086 | command you use to run bcbio-nextgen should be submitted as a job to 1087 | the scheduler. You can diagnose this because the controller and engine 1088 | jobs will be in the running state, but the cluster will still timeout. 1089 | 1090 | Finally, it may be an issue with how the cluster is configured-- the controller 1091 | and engine jobs are unable to talk to each other. They need to be able to open 1092 | ports on the machines each of them are running on in order to work. You 1093 | can diagnose this as the possible issue by if you have submitted the bcbio-nextgen 1094 | job to the scheduler, the bcbio-nextgen main job and the controller and 1095 | engine jobs are all in a running state and the cluster still times out. This will 1096 | likely to be something that you'll have to talk to the administrators of the cluster 1097 | you are using about. 
1098 | 1099 | If you need help debugging, please post an issue here and we'll try to help you 1100 | with the detective work: 1101 | 1102 | https://github.com/roryk/ipython-cluster-helper/issues 1103 | 1104 | """) 1105 | print() 1106 | self.client = Client(url_file, timeout=60) 1107 | if direct: 1108 | self.view = _get_direct_view(self.client, retries) 1109 | else: 1110 | self.view = _get_balanced_blocked_view(self.client, retries) 1111 | self.view.clusterhelper = {"profile": self.profile, 1112 | "cluster_id": self.cluster_id} 1113 | except: 1114 | self.stop() 1115 | raise 1116 | 1117 | def stop(self): 1118 | if not self.stopped: 1119 | if self.client: 1120 | _shutdown(self.client) 1121 | _stop(self.profile, self.cluster_id) 1122 | if self.has_throwaway: 1123 | delete_profile(self.profile) 1124 | self.stopped = True 1125 | 1126 | def __enter__(self): 1127 | return self 1128 | 1129 | def __exit__(self, exc_type, exc_value, traceback): 1130 | self.stop() 1131 | 1132 | 1133 | @contextlib.contextmanager 1134 | def cluster_view(scheduler, queue, num_jobs, cores_per_job=1, profile=None, 1135 | start_wait=16, extra_params=None, retries=None, direct=False, 1136 | wait_for_all_engines=False): 1137 | """Provide a view on an ipython cluster for processing. 1138 | 1139 | - scheduler: The type of cluster to start (lsf, sge, slurm, pbs, torque). 1140 | - num_jobs: Number of jobs to start. 1141 | - cores_per_job: The number of cores to use for each job. 1142 | - start_wait: How long to wait for the cluster to start up, in minutes. 1143 | Defaults to 16 minutes. Set to longer for slow starting clusters. 1144 | - retries: Number of retries to allow for failed tasks. 1145 | """ 1146 | cluster_view = ClusterView(scheduler, queue, num_jobs, cores_per_job=cores_per_job, 1147 | profile=profile, start_wait=start_wait, extra_params=extra_params, 1148 | retries=retries, direct=direct, 1149 | wait_for_all_engines=wait_for_all_engines) 1150 | try: 1151 | yield cluster_view.view 1152 | finally: 1153 | cluster_view.stop() 1154 | 1155 | def _nengines_up(url_file): 1156 | "return the number of engines up" 1157 | client = None 1158 | try: 1159 | client = Client(url_file, timeout=60) 1160 | up = len(client.ids) 1161 | client.close() 1162 | # the controller isn't up yet 1163 | except iperror.TimeoutError: 1164 | return 0 1165 | # the JSON file is not available to parse 1166 | except IOError: 1167 | return 0 1168 | else: 1169 | return up 1170 | 1171 | def _get_balanced_blocked_view(client, retries): 1172 | view = client.load_balanced_view() 1173 | view.set_flags(block=True) 1174 | if retries: 1175 | view.set_flags(retries=int(retries)) 1176 | return view 1177 | 1178 | def _create_base_ipython_dirs(): 1179 | """Create default user directories to prevent potential race conditions downstream.
1180 | """ 1181 | utils.safe_makedir(get_ipython_dir()) 1182 | ProfileDir.create_profile_dir_by_name(get_ipython_dir()) 1183 | utils.safe_makedir(os.path.join(get_ipython_dir(), "db")) 1184 | utils.safe_makedir(os.path.join(locate_profile(), "db")) 1185 | 1186 | def _shutdown(client): 1187 | print("Sending a shutdown signal to the controller and engines.") 1188 | client.close() 1189 | 1190 | def _get_direct_view(client, retries): 1191 | view = client[:] 1192 | view.set_flags(block=True) 1193 | if retries: 1194 | view.set_flags(retries=int(retries)) 1195 | return view 1196 | 1197 | def _slurm_is_old(): 1198 | return LooseVersion(_slurm_version()) < LooseVersion("2.6") 1199 | 1200 | def _slurm_version(): 1201 | version_line = subprocess.check_output("sinfo -V", shell=True) 1202 | if six.PY3: 1203 | version_line = version_line.decode('ascii', 'ignore') 1204 | parts = version_line.split() 1205 | if len(parts) > 1: 1206 | return parts[1] 1207 | else: 1208 | return "2.6+" 1209 | 1210 | # ## Temporary profile management 1211 | 1212 | def create_throwaway_profile(): 1213 | profile = str(uuid.uuid1()) 1214 | cmd = [sys.executable, "-E", "-c", "from IPython import start_ipython; start_ipython()", 1215 | "profile", "create", profile, "--parallel"] 1216 | subprocess.check_call(cmd) 1217 | return profile 1218 | 1219 | def get_url_file(profile, cluster_id): 1220 | 1221 | url_file = "ipcontroller-{0}-client.json".format(cluster_id) 1222 | 1223 | if os.path.isdir(profile) and os.path.isabs(profile): 1224 | # Return the full path if one is given 1225 | return os.path.join(profile, "security", url_file) 1226 | 1227 | return os.path.join(locate_profile(profile), "security", url_file) 1228 | 1229 | def delete_profile(profile): 1230 | MAX_TRIES = 10 1231 | dir_to_remove = locate_profile(profile) 1232 | if os.path.exists(dir_to_remove): 1233 | num_tries = 0 1234 | while True: 1235 | try: 1236 | shutil.rmtree(dir_to_remove) 1237 | break 1238 | except OSError: 1239 | if num_tries > MAX_TRIES: 1240 | raise 1241 | time.sleep(5) 1242 | num_tries += 1 1243 | else: 1244 | raise ValueError("Cannot find {0} to remove, " 1245 | "something is wrong.".format(dir_to_remove)) 1246 | -------------------------------------------------------------------------------- /cluster_helper/log/__init__.py: -------------------------------------------------------------------------------- 1 | """Utility functionality for logging. 2 | """ 3 | import os 4 | import logging 5 | from cluster_helper import utils 6 | 7 | LOG_NAME = "cluster_helper" 8 | 9 | logger = logging.getLogger(LOG_NAME) 10 | 11 | def setup_logging(log_dir): 12 | logger.setLevel(logging.INFO) 13 | if not logger.handlers: 14 | formatter = logging.Formatter('[%(asctime)s] %(message)s') 15 | handler = logging.StreamHandler() 16 | handler.setFormatter(formatter) 17 | logger.addHandler(handler) 18 | if log_dir: 19 | logfile = os.path.join(utils.safe_makedir(log_dir), 20 | "{0}.log".format(LOG_NAME)) 21 | handler = logging.FileHandler(logfile) 22 | handler.setFormatter(formatter) 23 | logger.addHandler(handler) 24 | -------------------------------------------------------------------------------- /cluster_helper/lsf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import fnmatch 4 | 5 | import six 6 | 7 | from .
import utils 8 | 9 | LSB_PARAMS_FILENAME = "lsb.params" 10 | LSF_CONF_FILENAME = "lsf.conf" 11 | LSF_CONF_ENV = ["LSF_CONFDIR", "LSF_ENVDIR"] 12 | DEFAULT_LSF_UNITS = "KB" 13 | DEFAULT_RESOURCE_UNITS = "MB" 14 | 15 | def find(basedir, string): 16 | """ 17 | walk basedir and return all files matching string 18 | """ 19 | matches = [] 20 | for root, dirnames, filenames in os.walk(basedir): 21 | for filename in fnmatch.filter(filenames, string): 22 | matches.append(os.path.join(root, filename)) 23 | return matches 24 | 25 | def find_first_match(basedir, string): 26 | """ 27 | return the first file that matches string starting from basedir 28 | """ 29 | matches = find(basedir, string) 30 | return matches[0] if matches else matches 31 | 32 | def get_conf_file(filename, env): 33 | conf_path = os.environ.get(env) 34 | if not conf_path: 35 | return None 36 | conf_file = find_first_match(conf_path, filename) 37 | return conf_file 38 | 39 | def apply_conf_file(fn, conf_filename): 40 | for env in LSF_CONF_ENV: 41 | conf_file = get_conf_file(conf_filename, env) 42 | if conf_file: 43 | with open(conf_file) as conf_handle: 44 | value = fn(conf_handle) 45 | if value: 46 | return value 47 | return None 48 | 49 | def per_core_reserve_from_stream(stream): 50 | for k, v in tokenize_conf_stream(stream): 51 | if k == "RESOURCE_RESERVE_PER_SLOT": 52 | return v.upper() 53 | return None 54 | 55 | def get_lsf_units_from_stream(stream): 56 | for k, v in tokenize_conf_stream(stream): 57 | if k == "LSF_UNIT_FOR_LIMITS": 58 | return v 59 | return None 60 | 61 | def tokenize_conf_stream(conf_handle): 62 | """ 63 | convert the key=val pairs in a LSF config stream to tuples of tokens 64 | """ 65 | for line in conf_handle: 66 | if line.startswith("#"): 67 | continue 68 | tokens = line.split("=") 69 | if len(tokens) != 2: 70 | continue 71 | yield (tokens[0].strip(), tokens[1].strip()) 72 | 73 | def apply_bparams(fn): 74 | """ 75 | apply fn to the output lines of bparams, returning the result 76 | """ 77 | cmd = ["bparams", "-a"] 78 | try: 79 | output = subprocess.check_output(cmd) 80 | except (OSError, subprocess.CalledProcessError): 81 | return None 82 | if six.PY3: 83 | output = output.decode('ascii', 'ignore') 84 | return fn(output.split("\n")) 85 | 86 | def apply_lsadmin(fn): 87 | """ 88 | apply fn to the output lines of lsadmin, returning the result 89 | """ 90 | cmd = ["lsadmin", "showconf", "lim"] 91 | try: 92 | output = subprocess.check_output(cmd) 93 | except (OSError, subprocess.CalledProcessError): 94 | return None 95 | if six.PY3: 96 | output = output.decode('ascii', 'ignore') 97 | return fn(output.split("\n")) 98 | 99 | 100 | def get_lsf_units(resource=False): 101 | """ 102 | check for LSF_UNIT_FOR_LIMITS in bparams, lsadmin and the lsf.conf file, 103 | preferring the value from bparams, then lsadmin, then the lsf.conf file 104 | """ 105 | lsf_units = apply_bparams(get_lsf_units_from_stream) 106 | if lsf_units: 107 | return lsf_units 108 | 109 | lsf_units = apply_lsadmin(get_lsf_units_from_stream) 110 | if lsf_units: 111 | return lsf_units 112 | 113 | lsf_units = apply_conf_file(get_lsf_units_from_stream, LSF_CONF_FILENAME) 114 | if lsf_units: 115 | return lsf_units 116 | 117 | # -R usage units are in MB, not KB by default 118 | if resource: 119 | return DEFAULT_RESOURCE_UNITS 120 | else: 121 | return DEFAULT_LSF_UNITS 122 | 123 | def parse_memory(mem): 124 | """ 125 | convert a memory request in GB to the configured LSF resource units 126 | """ 127 | lsf_unit = get_lsf_units(resource=True) 128 | return utils.convert_mb(float(mem) * 1024, lsf_unit) 129 | 130 | 131 | def per_core_reservation(): 132 | """ 133 | returns True if the cluster is configured
for reservations to be per core, 134 | False if it is per job 135 | """ 136 | per_core = apply_bparams(per_core_reserve_from_stream) 137 | if per_core: 138 | if per_core.upper() == "Y": 139 | return True 140 | else: 141 | return False 142 | 143 | per_core = apply_lsadmin(per_core_reserve_from_stream) 144 | if per_core: 145 | if per_core.upper() == "Y": 146 | return True 147 | else: 148 | return False 149 | 150 | per_core = apply_conf_file(per_core_reserve_from_stream, LSB_PARAMS_FILENAME) 151 | if per_core and per_core.upper() == "Y": 152 | return True 153 | else: 154 | return False 155 | 156 | 157 | if __name__ == "__main__": 158 | print(get_lsf_units()) 159 | print(per_core_reservation()) 160 | -------------------------------------------------------------------------------- /cluster_helper/slurm.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import six 3 | 4 | DEFAULT_TIME = "1-00:00:00" 5 | 6 | class SlurmTime(object): 7 | """ 8 | parse a Slurm time limit of the format days-hours, days-hours:minutes, 9 | days-hours:minutes:seconds, minutes, hours:minutes, 10 | hours:minutes:seconds. assume 'infinite' means 365 days 11 | """ 12 | def __init__(self, time): 13 | self.days = 0 14 | self.hours = 0 15 | self.minutes = 0 16 | self.seconds = 0 17 | self._parse_time(time) 18 | 19 | def _parse_time(self, time): 20 | if time == "infinite": 21 | self.days = 365 22 | else: 23 | day_tokens = time.split("-") 24 | if len(day_tokens) == 2: 25 | self.days = int(day_tokens[0]) 26 | time = day_tokens[1] 27 | time_tokens = time.split(":") 28 | if len(time_tokens) == 3: 29 | self.hours, self.minutes, self.seconds = map(int, time_tokens) 30 | elif len(time_tokens) == 2: 31 | self.hours, self.minutes = map(int, time_tokens) 32 | elif self.days: 33 | self.hours = int(time_tokens[0]) 34 | else: 35 | self.minutes = int(time_tokens[0]) 36 | 37 | # For Python 3 38 | def __lt__(self, slurmtime): 39 | for k in ['days', 'hours', 'minutes', 'seconds']: 40 | v1 = getattr(self, k) 41 | v2 = getattr(slurmtime, k) 42 | if v1 != v2: 43 | return v1 < v2 44 | return False 45 | 46 | # For Python 2 47 | def __cmp__(self, slurmtime): 48 | if self < slurmtime: 49 | return -1 50 | else: 51 | return 1 if slurmtime < self else 0 52 | 53 | def __repr__(self): 54 | return "%02d-%02d:%02d:%02d" % (self.days, self.hours, self.minutes, self.seconds) 55 | 56 | def get_accounts(user): 57 | """ 58 | get all accounts a user can use to submit a job 59 | """ 60 | cmd = "sshare -p --noheader" 61 | out = subprocess.check_output(cmd, shell=True) 62 | if six.PY3: 63 | out = out.decode('ascii', 'ignore') 64 | accounts = [] 65 | has_accounts = False 66 | for line in out.splitlines(): 67 | line = line.split('|') 68 | account = line[0].strip() 69 | account_user = line[1].strip() 70 | if account and account_user == user: 71 | has_accounts = True 72 | accounts.append(account) 73 | 74 | if not has_accounts: 75 | return None 76 | accounts = set(accounts) 77 | try: 78 | assert not set(['']).issuperset(accounts) 79 | except AssertionError: 80 | raise ValueError("No usable accounts were found for user \'{0}\'".format(user)) 81 | 82 | return accounts 83 | 84 | def get_user(): 85 | out = subprocess.check_output("whoami", shell=True) 86 | if six.PY3: 87 | out = out.decode('ascii', 'ignore') 88 | return out.strip() 89 | 90 | def accounts_with_access(queue): 91 | cmd = "sinfo --noheader -p {0} -o %g".format(queue) 92 | out = subprocess.check_output(cmd, shell=True) 93 | if six.PY3: 94 | out = out.decode('ascii',
'ignore') 95 | return set([x.strip() for x in out.split(",")]) 96 | 97 | def get_max_timelimit(queue): 98 | cmd = "sinfo --noheader -p {0}".format(queue) 99 | out = subprocess.check_output(cmd, shell=True) 100 | if six.PY3: 101 | out = out.decode('ascii', 'ignore') 102 | max_limit = out.split()[2].strip() 103 | time_limit = None if max_limit == "infinite" else max_limit 104 | return time_limit 105 | 106 | def get_account_for_queue(queue): 107 | """ 108 | selects all the accounts this user has access to that can submit to 109 | the queue and returns the first one 110 | """ 111 | user = get_user() 112 | possible_accounts = get_accounts(user) 113 | if not possible_accounts: 114 | return None 115 | queue_accounts = accounts_with_access(queue) 116 | try: 117 | assert not set(['']).issuperset(queue_accounts) 118 | except AssertionError: 119 | raise ValueError("No accounts accessible by user \'{0}\' have access to queue \'{1}\'. Accounts found were: {2}".format( 120 | user, queue, ", ".join(possible_accounts))) 121 | accts = list(possible_accounts.intersection(queue_accounts)) 122 | if "all" not in queue_accounts and len(accts) > 0: 123 | return accts[0] 124 | else: 125 | return possible_accounts.pop() 126 | 127 | def set_timelimit(slurm_atrs): 128 | """ 129 | timelimit can be specified as timelimit, t or time. remap 130 | t and time to timelimit, preferring timelimit 131 | """ 132 | if "timelimit" in slurm_atrs: 133 | return slurm_atrs 134 | if "time" in slurm_atrs: 135 | slurm_atrs["timelimit"] = slurm_atrs["time"] 136 | del slurm_atrs["time"] 137 | return slurm_atrs 138 | if "t" in slurm_atrs: 139 | slurm_atrs["timelimit"] = slurm_atrs["t"] 140 | del slurm_atrs["t"] 141 | return slurm_atrs 142 | slurm_atrs["timelimit"] = DEFAULT_TIME 143 | return slurm_atrs 144 | 145 | def get_slurm_attributes(queue, resources): 146 | slurm_atrs = {} 147 | # specially handled resource specifications 148 | special_resources = set(["machines", "account", "timelimit"]) 149 | if resources: 150 | for parm in resources.split(";"): 151 | k, v = [a.strip() for a in parm.split('=', 1)] 152 | slurm_atrs[k] = v 153 | if "account" not in slurm_atrs: 154 | account = get_account_for_queue(queue) 155 | if account: 156 | slurm_atrs["account"] = account 157 | slurm_atrs = set_timelimit(slurm_atrs) 158 | max_limit = get_max_timelimit(queue) 159 | if max_limit: 160 | slurm_atrs["timelimit"] = min(SlurmTime(slurm_atrs["timelimit"]), SlurmTime(max_limit)) 161 | 162 | # reconstitute any general attributes to pass along to slurm 163 | out_resources = [] 164 | for k in slurm_atrs: 165 | if k not in special_resources: 166 | out_resources.append("%s=%s" % (k, slurm_atrs[k])) 167 | return ";".join(out_resources), slurm_atrs 168 | -------------------------------------------------------------------------------- /cluster_helper/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import time 4 | 5 | def convert_mb(mb, unit): 6 | UNITS = {"B": -2, 7 | "KB": -1, 8 | "MB": 0, 9 | "GB": 1, 10 | "TB": 2} 11 | assert unit in UNITS, ("%s not a valid unit, valid units are %s." 12 | % (unit, list(UNITS.keys()))) 13 | return int(float(mb) / float(math.pow(1024, UNITS[unit]))) 14 | 15 | def safe_makedir(dname): 16 | """Make a directory if it doesn't exist, handling concurrent race conditions.
17 | """ 18 | if not dname: 19 | return dname 20 | num_tries = 0 21 | max_tries = 5 22 | while not os.path.exists(dname): 23 | # we could get an error here if multiple processes are creating 24 | # the directory at the same time. Grr, concurrency. 25 | try: 26 | os.makedirs(dname) 27 | except OSError: 28 | if num_tries > max_tries: 29 | raise 30 | num_tries += 1 31 | time.sleep(2) 32 | return dname 33 | -------------------------------------------------------------------------------- /example/example.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import time 3 | import imp 4 | import sys 5 | from ipyparallel import require 6 | from cluster_helper.cluster import cluster_view 7 | 8 | 9 | def long_computation(x, y, z): 10 | import time 11 | import socket 12 | time.sleep(1) 13 | return (socket.gethostname(), x + y + z) 14 | 15 | 16 | @require("cluster_helper") 17 | def require_test(x): 18 | return True 19 | 20 | 21 | def context_test(x): 22 | return True 23 | 24 | if __name__ == "__main__": 25 | parser = argparse.ArgumentParser( 26 | description="example script for doing parallel work with IPython.") 27 | parser.add_argument( 28 | "--scheduler", dest='scheduler', default="", 29 | help="scheduler to use (lsf, sge, torque, slurm, or pbs)") 30 | parser.add_argument("--queue", dest='queue', default="", 31 | help="queue to use on scheduler.") 32 | parser.add_argument("--local_controller", dest='local_controller', default=False, 33 | help="run controller locally", action="store_true") 34 | parser.add_argument("--num_jobs", dest='num_jobs', default=3, 35 | type=int, help="number of jobs to run in parallel.") 36 | parser.add_argument("--cores_per_job", dest="cores_per_job", default=1, 37 | type=int, help="number of cores for each job.") 38 | parser.add_argument("--profile", dest="profile", default=None, 39 | help="Optional profile to test.") 40 | parser.add_argument( 41 | "--resources", dest="resources", default=None, 42 | help=("Native specification flags to the scheduler, pass " 43 | "multiple flags using ; as a separator.")) 44 | parser.add_argument("--timeout", dest="timeout", default=15, 45 | help="Time (in minutes) to wait before timing out.") 46 | parser.add_argument("--memory", dest="mem", default=1, 47 | help="Memory in GB to reserve.") 48 | parser.add_argument( 49 | "--local", dest="local", default=False, action="store_true", help="run locally without a scheduler.") 50 | 51 | args = parser.parse_args() 52 | args.resources = {'resources': args.resources, 53 | 'mem': args.mem, 54 | 'local_controller': args.local_controller} 55 | if args.local: 56 | args.resources["run_local"] = True 57 | 58 | if not (args.local or (args.scheduler and args.queue)): 59 | print("Please specify --local to run locally or a scheduler and queue " 60 | "to run on with --scheduler and --queue.") 61 | sys.exit(1) 62 | 63 | with cluster_view(args.scheduler, args.queue, args.num_jobs, 64 | cores_per_job=args.cores_per_job, 65 | start_wait=args.timeout, profile=args.profile, 66 | extra_params=args.resources) as view: 67 | print("First check to see if we can talk to the engines.") 68 | results = view.map(lambda x: "hello world!", range(5)) 69 | print("This long computation that waits for a second before " 70 | "returning takes a while to run serially...") 71 | start_time = time.time() 72 | results = list(map(long_computation, range(20), range(20, 40), range(40, 60))) 73 | print(results) 74 | print("That took {} seconds.".format(time.time() - start_time)) 75 | print("Running it in parallel goes much
faster...") 76 | start_time = time.time() 77 | results = list(view.map(long_computation, range(20), range(20, 40), range(40, 60))) 78 | print(results) 79 | print("That took {} seconds.".format(time.time() - start_time)) 80 | 81 | try: 82 | imp.find_module('dill') 83 | found = True 84 | except ImportError: 85 | found = False 86 | 87 | if found: 88 | def make_closure(a): 89 | """make a function with a closure, and return it""" 90 | def has_closure(b): 91 | return a * b 92 | return has_closure 93 | closure = make_closure(5) 94 | print("With dill installed, we can pickle closures!") 95 | print(closure) 96 | print(view.map(closure, [3])) 97 | 98 | with open("test", "w") as test_handle: 99 | print("Does context break it?") 100 | print(view.map(context_test, [3])) 101 | 102 | print("Does context break it with a closure?") 103 | print(view.map(closure, [3])) 104 | 105 | print("But wrapping functions with @require is broken.") 106 | print(view.map(require_test, [3])) 107 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | setuptools>=18.5 2 | pyzmq>=2.1.11 3 | ipython<6.0.0 4 | ipyparallel>=6.0.2 5 | netifaces>=0.10.3 6 | six>=1.10.0 7 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | universal = 1 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import subprocess 3 | import sys 4 | from setuptools import setup, find_packages 5 | from io import open 6 | 7 | def parse_requirements(fn): 8 | """ load requirements from a pip requirements file """ 9 | return [line.strip() for line in open(fn) if line.strip() and not line.startswith("#")] 10 | 11 | reqs = parse_requirements("requirements.txt") 12 | 13 | setup(name="ipython-cluster-helper", 14 | version="0.6.4", 15 | author="Rory Kirchner", 16 | author_email="rory.kirchner@gmail.com", 17 | description="Simplify IPython cluster start up and use for " 18 | "multiple schedulers.", 19 | long_description=(open('README.rst', encoding='utf-8').read()), 20 | license="MIT", 21 | zip_safe=False, 22 | url="https://github.com/roryk/ipython-cluster-helper", 23 | packages=find_packages(), 24 | install_requires=reqs) 25 | --------------------------------------------------------------------------------
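
As a usage note: example/example.py above drives everything through the cluster_view context manager. Where the view needs to outlive a single with block, a minimal sketch of using the ClusterView class from cluster_helper/cluster.py directly is shown below. The scheduler, queue, and resource values are placeholders that would need to match your site, and the shape of extra_params follows the dictionary built in example/example.py.

from cluster_helper.cluster import ClusterView

# Placeholder scheduler/queue/resource values; substitute ones valid on your cluster.
cluster = ClusterView(scheduler="slurm", queue="general", num_jobs=2,
                      cores_per_job=1, wait_for_all_engines=True,
                      extra_params={"resources": "timelimit=04:00:00", "mem": 2})
try:
    # cluster.view is a blocking, load-balanced view once the engines are up.
    print(cluster.view.map(lambda x: x * 2, range(10)))
finally:
    # Shut down the engines and controller and remove any throwaway profile.
    cluster.stop()

The try/finally mirrors what the cluster_view context manager does internally, so the scheduler jobs are cleaned up even if the mapped work raises.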