├── .gitignore ├── .travis.yml ├── HISTORY.md ├── MANIFEST.in ├── README.rst ├── cluster_helper ├── __init__.py ├── cluster.py ├── log │ └── __init__.py ├── lsf.py ├── slurm.py └── utils.py ├── example └── example.py ├── requirements.txt ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .ropeproject 2 | build 3 | *.pyc 4 | dist 5 | *.egg-info 6 | .DS_Store 7 | .idea/ 8 | __pycache__/ 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | sudo: false 2 | language: python 3 | python: 4 | - "2.7" 5 | - "3.4" 6 | - "3.5" 7 | cache: 8 | directories: 9 | - ${HOME}/.cache/pip 10 | install: 11 | - pip install -U setuptools pip 12 | - pip install . 13 | script: python example/example.py --local 14 | -------------------------------------------------------------------------------- /HISTORY.md: -------------------------------------------------------------------------------- 1 | ## 0.6.4 (24 September 2019) 2 | - python3 fixes for PBSPro schedulers. 3 | 4 | ## 0.6.3 (4 April 2019) 5 | 6 | - Python3 fixes for LSF schedulers. 7 | 8 | ## 0.6.2 9 | - Support pip10. 10 | - Fix for SGE parallel environments with queues containing a . in the name. 11 | 12 | ## 0.6.0 (14 Dec 2017) 13 | 14 | - Stop IPython parallel trying to setup .ipython in home directories during runs by 15 | setting IPYTHONDIR. This avoids intermittent InvalidSignature errors. 16 | 17 | ## 0.5.9 (16 Nov 2017) 18 | 19 | - PBSPro: Pass memory default to controller and make configurable with `conmem`, 20 | duplicating SLURM functionality. Thanks to Oliver Hofmann. 21 | 22 | ## 0.5.8 (18 Oct 2017) 23 | 24 | - Fix SLURM resource specifications to avoid missing directives in the case of 25 | empty lines in specifications. 26 | 27 | ## 0.5.7 (11 Oct 2017) 28 | 29 | - Fix for SLURM engine output files not having the proper job IDs. 30 | - Additional fixes for python3 support. 31 | 32 | ## 0.5.6 (2 Aug 2017) 33 | 34 | - Additional work to better resolve local IP addresses, first trying to pick 35 | an IP from the fully qualified domain name if available. Thanks to Linh Vu. 36 | - Support PBSPro systems without select statement enabled. Thanks to Roman Valls 37 | Guimerà. 38 | - Enable python3 support for Slurm. Thanks to Vlad Saveliev. 39 | - Enable support for ipython 5. 40 | 41 | ## 0.5.5 (29 May 2017) 42 | 43 | - Try to better resolve multiple local addresses for controllers. Prioritize 44 | earlier addresses (eth0) over later (eth1) if all are valid. Thanks to Luca Beltrame. 45 | - Provide separate logging of SLURM engine job arrays. Thanks to Oliver Hofmann. 46 | - Set minimum PBSPro controller memory usage to 1GB. Thanks to Guus van Dalum 47 | @vandalum. 48 | 49 | ## 0.5.4 (24 February 2017) 50 | - Ensure /bin/sh used for Torque/PBSPro submissions to prevent overwriting 51 | environment from bashrc. Pass LD_* exports, which are also filtered by Torque. 52 | Thanks to Andrey Tovchigrechko. 53 | - Add option to run controller locally. Thanks to Brent Pederson (@brentp) and 54 | Sven-Eric Schelhorn (@schelhorn). 55 | - Fix for Python 3 compatibility in setup.py. Thanks to Amar Chaudhari 56 | (@Amar-Chaudhari). 57 | 58 | ## 0.5.3 (23 August 2016) 59 | 60 | - Emit warning message instead of failing when we try to shut down a cluster and 61 | it is already shut down. Thanks to Gabriel F. Berriz (@gberriz) for raising 62 | the issue. 
63 | - Python 2/3 compatibility. Thanks to Alain Péteut (@peteut) and Matt De Both 64 | (@mdeboth). 65 | - Pass LD_PRELOAD to bcbio SGE worker jobs to enable compatibility with 66 | PetaGene. 67 | 68 | ## 0.5.2 (8 June 2016) 69 | 70 | - Pin requirements to use IPython < 5.0 until we migrate to new release. 71 | - Fix passing optional resources to PBSPro. Thanks to Endre Sebestyén. 72 | - Spin up individual engines instead of job arrays for PBSPro clusters. Thanks to 73 | Thanks to Endre Sebestyen (@razZ0r), Francesco Ferrari and Tiziana Castrignanò 74 | for raising the issue, providing an account with access to PBSPro on the Cineca 75 | cluster and testing that the fix works. 76 | - Remove sleep command to stagger engine startups in SGE which breaks if bc 77 | command not present. Thanks to Juan Caballero. 78 | 79 | ## 0.5.1 (January 25, 2016) 80 | - Add support for UGE (an open-source fork of SGE) Thanks to Andrew Oler. 81 | 82 | ## 0.5.0 (October 8, 2015) 83 | 84 | - Adjust memory resource fixes for LSF to avoid underscheduling memory on other clusters 85 | - Change timeouts to provide faster startup for IPython clusters (thanks to Maciej Wójcikowski) 86 | - Don't send SIGKILL to SLURM job arrays since it often doesn't kill all of the jobs. (@mwojcikowski) 87 | 88 | ## 0.4.9 (August 17, 2015) 89 | 90 | - Fix memory resources for LSF when jobs requests 1 core and mincore has a value > 1. 91 | 92 | ## 0.4.8 (August 15, 2015) 93 | 94 | - Fix tarball issue, ensuring requirements.txt included. 95 | 96 | ## 0.4.7 (August 15, 2015) 97 | 98 | - Support for IPython 4.0 (@brainstorm, @roryk, @chapmanb, @lpantano) 99 | 100 | ## 0.4.6 (July 22, 2015) 101 | - Support `numengines` for Torque and PBSPro for correct parallelization with bcbio-nextgen. 102 | - Ensure we only install IPython < 4.0 since 4.0 development versions split out IPython parallel 103 | into a new namespace. Will support this when it is officially released. 104 | - Added wait_for_all_engines support to return the view only after all engines are up (@matthias-k) 105 | - IPython.parallel is moving towards separate packages (@brainstorm). 106 | 107 | ## 0.4.5 (May 20, 2015) 108 | - Added --local and --cores-per-job support to the example script (@matthias-k) 109 | - Add ability to get a cluster view without the context manager (@matthias-k) 110 | 111 | ## 0.4.4 (April 23, 2014) 112 | - Python 3 compatibility (@mjdellwo) 113 | 114 | ## 0.4.3 (March 18, 2015) 115 | 116 | - Fix resource specification problem for SGE. Thanks to Zhengqiu Cai. 117 | 118 | ## 0.4.2 (March 7, 2015) 119 | 120 | - Additional IPython preparation cleanups to prevent race conditions. Thanks to 121 | Lorena Pantano. 122 | 123 | ## 0.4.1 (March 6, 2015) 124 | 125 | - Pre-create IPython database directory to prevent race conditions on 126 | new filesystems with a shared home directory, like AWS. 127 | 128 | ## 0.4.0 (February 17, 2015) 129 | 130 | - Support `conmem` resource parameter, which enables control over the memory 131 | supplied to the controller for SLURM runs. 132 | 133 | ## 0.3.7 (December 09, 2014) 134 | 135 | - Fix SLURM installations that has accounts but the user is not listed. 136 | 137 | ## 0.3.6 (October 18, 2014) 138 | 139 | - Support SLURM installations that do not use accounts, skip finding and passing 140 | an account flag. 141 | 142 | ## 0.3.5 (October 12, 2014) 143 | 144 | - Fix `minconcores` specification to be passed correctly to controllers. 
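A minimal usage sketch of the `mincores`, `minconcores` and `conmem` resource flags referenced in the surrounding entries (the queue name, module and input list are hypothetical placeholders; the `resources` string is split on `;` by `_scheduler_resources` in `cluster_helper/cluster.py`, and `mem`/`conmem` are in GB):

    from cluster_helper.cluster import cluster_view
    from yourmodule import long_running_function

    # mincores: minimum cores per engine job; minconcores: minimum cores for
    # the controller; conmem: controller memory in GB; mem: engine memory in GB.
    with cluster_view(scheduler="slurm", queue="general", num_jobs=10,
                      extra_params={"resources": "mincores=4;minconcores=2;conmem=8",
                                    "mem": "2"}) as view:
        view.map(long_running_function, ["sample1.bam", "sample2.bam"])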
145 | 146 | ## 0.3.4 (October 8, 2014) 147 | 148 | - Support `mincores` resource specification for SGE. 149 | - Add `minconcores` resource specification for specifying if controllers should 150 | use more than a single core, instead of coupling this to `mincores`. Enables 151 | running on systems with only minimum requirements for core usage. 152 | 153 | ## 0.3.3 (September 15, 2014) 154 | 155 | - Handle mincore specification for multicore jobs when memory limits cores to 156 | less than the `mincores` amount. 157 | - Improve timeouts for running on interruptible queues to avoid engine failures 158 | when controllers are requeued. 159 | 160 | ## 0.3.2 (September 10, 2104) 161 | 162 | - Support PBSPro based on Torque setup. Thanks to Piet Jones. 163 | - Improve ability to run on interruptible queuing systems by increasing timeout 164 | when killing non-responsive engines from a controller. Now 1 hour instead of 3 165 | minutes, allowing requeueing of engines. 166 | 167 | ## 0.3.1 (August 20, 2014) 168 | 169 | - Add a special resource flag `-r mincores=n` which requires single core jobs to 170 | use at least n cores. Useful for shared queues where we can only run multicore 171 | jobs, and for sharing memory usage across multiple cores for programs with 172 | spiky memory utilization like variant calling. Available on LSF and SLURM for 173 | testing. 174 | - Add hook to enable improved cleanup of controllers/engines from bcbio-nextgen. 175 | 176 | ## 0.3.0 (August 6, 2014) 177 | 178 | - Make a cluster available after a single engine has registered. Avoids need to 179 | wait for entire cluster to become available on busy clusters. 180 | - Change default SLURM time limit to 1 day to avoid asking for excessive 181 | resources. This can be overridden by passing `-r` resource with desired runtime. 182 | 183 | ## 0.2.19 (April 28, 2014) 184 | - Respect RESOURCE_RESERVE_PER_SLOT for LSF. This causes resources to be specified 185 | at the level of a core instead of a job 186 | 187 | ## 0.2.18 (April 17, 2014) 188 | 189 | - Added ability to get direct view to a cluster. 190 | - Use rusage for LSF jobs instead of --mem. This walls of the memory, preventing nodes 191 | from being oversubscribed. 192 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include *.txt 2 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ipython-cluster-helper 2 | ====================== 3 | .. image:: https://travis-ci.org/roryk/ipython-cluster-helper.svg 4 | :target: https://travis-ci.org/roryk/ipython-cluster-helper 5 | .. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3459799.svg 6 | :target: https://doi.org/10.5281/zenodo.3459799 7 | 8 | Quickly and easily parallelize Python functions using IPython on a 9 | cluster, supporting multiple schedulers. Optimizes IPython defaults to 10 | handle larger clusters and simultaneous processes. 11 | 12 | Example 13 | ------- 14 | 15 | Lets say you wrote a program that takes several files in as arguments 16 | and performs some kind of long running computation on them. Your 17 | original implementation used a loop but it was way too slow 18 | 19 | .. 
code-block:: python 20 | 21 | from yourmodule import long_running_function 22 | import sys 23 | 24 | if __name__ == "__main__": 25 | for f in sys.argv[1:]: 26 | long_running_function(f) 27 | 28 | If you have access to one of the supported schedulers you can easily 29 | parallelize your program across 5 nodes with ipython-cluster-helper 30 | 31 | .. code-block:: python 32 | 33 | from cluster_helper.cluster import cluster_view 34 | from yourmodule import long_running_function 35 | import sys 36 | 37 | if __name__ == "__main__": 38 | with cluster_view(scheduler="lsf", queue="hsph", num_jobs=5) as view: 39 | view.map(long_running_function, sys.argv[1:]) 40 | 41 | That's it! No setup required. 42 | 43 | To run a local cluster for testing purposes pass `run_local` as an extra 44 | parameter to the cluster_view function 45 | 46 | .. code-block:: python 47 | 48 | with cluster_view(scheduler=None, queue=None, num_jobs=5, 49 | extra_params={"run_local": True}) as view: 50 | view.map(long_running_function, sys.argv[1:]) 51 | 52 | How it works 53 | ------------ 54 | 55 | ipython-cluster-helper creates a throwaway parallel IPython profile, 56 | launches a cluster and returns a view. On program exit it shuts the 57 | cluster down and deletes the throwaway profile. 58 | 59 | Supported schedulers 60 | -------------------- 61 | 62 | Platform LSF ("lsf"), Sun Grid Engine ("sge"), Torque ("torque"), SLURM ("slurm"). 63 | 64 | Credits 65 | ------- 66 | 67 | The cool parts of this were ripped from `bcbio-nextgen`_. 68 | 69 | Contributors 70 | ------------ 71 | * Brad Chapman (@chapmanb) 72 | * Mario Giovacchini (@mariogiov) 73 | * Valentine Svensson (@vals) 74 | * Roman Valls (@brainstorm) 75 | * Rory Kirchner (@roryk) 76 | * Luca Beltrame (@lbeltrame) 77 | * James Porter (@porterjamesj) 78 | * Billy Ziege (@billyziege) 79 | * ink1 (@ink1) 80 | * @mjdellwo 81 | * @matthias-k 82 | * Andrew Oler (@oleraj) 83 | * Alain Péteut (@peteut) 84 | * Matt De Both (@mdeboth) 85 | * Vlad Saveliev (@vladsaveliev) 86 | 87 | .. _bcbio-nextgen: https://github.com/chapmanb/bcbio-nextgen 88 | -------------------------------------------------------------------------------- /cluster_helper/__init__.py: -------------------------------------------------------------------------------- 1 | """Distributed execution using an IPython cluster. 2 | 3 | Uses IPython parallel to setup a cluster and manage execution: 4 | 5 | https://ipyparallel.readthedocs.io/en/latest/ 6 | Borrowed from Brad Chapman's implementation: 7 | https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/pipeline/ipython.py 8 | """ 9 | -------------------------------------------------------------------------------- /cluster_helper/cluster.py: -------------------------------------------------------------------------------- 1 | """Distributed execution using an IPython cluster. 
2 | 3 | Uses ipyparallel to setup a cluster and manage execution: 4 | 5 | http://ipython.org/ipython-doc/stable/parallel/index.html 6 | Borrowed from Brad Chapman's implementation: 7 | https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/distributed/ipython.py 8 | """ 9 | from __future__ import print_function 10 | import contextlib 11 | import copy 12 | import math 13 | import os 14 | import pipes 15 | import uuid 16 | import shutil 17 | import subprocess 18 | import time 19 | import re 20 | from distutils.version import LooseVersion 21 | import sys 22 | import six 23 | 24 | from ipyparallel import Client 25 | from ipyparallel.apps import launcher 26 | from ipyparallel.apps.launcher import LocalControllerLauncher 27 | from ipyparallel import error as iperror 28 | from IPython.paths import locate_profile, get_ipython_dir 29 | import traitlets 30 | from traitlets import (List, Unicode, CRegExp) 31 | from IPython.core.profiledir import ProfileDir 32 | 33 | from .slurm import get_slurm_attributes 34 | from . import utils 35 | from . import lsf 36 | 37 | DEFAULT_MEM_PER_CPU = 1000 # Mb 38 | 39 | # ## Custom launchers 40 | 41 | # Handles longer timeouts for startup and shutdown ith pings between engine and controller. 42 | # Makes engine pingback shutdown higher, since this is not consecutive misses. 43 | # Gives 1 hour of non-resposiveness before a shutdown to handle interruptable queues. 44 | timeout_params = ["--timeout=960", "--IPEngineApp.wait_for_url_file=960", 45 | "--EngineFactory.max_heartbeat_misses=120"] 46 | controller_params = ["--nodb", "--hwm=1", "--scheme=leastload", 47 | "--HeartMonitor.max_heartmonitor_misses=720", 48 | "--HeartMonitor.period=5000"] 49 | 50 | # ## Work around issues with docker and VM network interfaces 51 | # Can go away when we merge changes into upstream IPython 52 | 53 | import json 54 | import socket 55 | import stat 56 | import netifaces 57 | from ipyparallel.apps.ipcontrollerapp import IPControllerApp 58 | from IPython.utils.data import uniq_stable 59 | 60 | class VMFixIPControllerApp(IPControllerApp): 61 | def _get_public_ip(self): 62 | """Avoid picking up docker and VM network interfaces in IPython 2.0. 63 | 64 | Adjusts _load_ips_netifaces from IPython.utils.localinterfaces. Changes 65 | submitted upstream so we can remove this when incorporated into released IPython. 66 | 67 | Prioritizes a set of common interfaces to try and make better decisions 68 | when choosing from multiple choices. 69 | """ 70 | # First try to get address from domain name 71 | fqdn_ip = socket.gethostbyname(socket.getfqdn()) 72 | if fqdn_ip and not fqdn_ip.startswith("127."): 73 | return fqdn_ip 74 | # otherwise retrieve from interfaces 75 | standard_ips = [] 76 | priority_ips = [] 77 | vm_ifaces = set(["docker0", "virbr0", "lxcbr0"]) # VM/container interfaces we do not want 78 | priority_ifaces = ("eth",) # Interfaces we prefer to get IPs from 79 | 80 | # list of iface names, 'lo0', 'eth0', etc. 
81 | # We get addresses and order based on priorty, with more preferred last 82 | for iface in netifaces.interfaces(): 83 | if iface not in vm_ifaces: 84 | # list of ipv4 addrinfo dicts 85 | ipv4s = netifaces.ifaddresses(iface).get(netifaces.AF_INET, []) 86 | for entry in ipv4s: 87 | addr = entry.get('addr') 88 | if not addr: 89 | continue 90 | if not (iface.startswith('lo') or addr.startswith('127.')): 91 | if iface.startswith(priority_ifaces): 92 | priority_ips.append((iface, addr)) 93 | else: 94 | standard_ips.append(addr) 95 | # Prefer earlier interfaces (eth0) over later (eth1) 96 | priority_ips.sort(reverse=True) 97 | priority_ips = [xs[1] for xs in priority_ips] 98 | public_ips = uniq_stable(standard_ips + priority_ips) 99 | return public_ips[-1] 100 | 101 | def save_connection_dict(self, fname, cdict): 102 | """Override default save_connection_dict to use fixed public IP retrieval. 103 | """ 104 | if not cdict.get("location"): 105 | cdict['location'] = self._get_public_ip() 106 | cdict["patch"] = "VMFixIPControllerApp" 107 | fname = os.path.join(self.profile_dir.security_dir, fname) 108 | self.log.info("writing connection info to %s", fname) 109 | with open(fname, 'w') as f: 110 | f.write(json.dumps(cdict, indent=2)) 111 | os.chmod(fname, stat.S_IRUSR | stat.S_IWUSR) 112 | 113 | # Increase resource limits on engines to handle additional processes 114 | # At scale we can run out of open file handles or run out of user 115 | # processes. This tries to adjust this limits for each IPython worker 116 | # within available hard limits. 117 | # Match target_procs to OSX limits for a default. 118 | target_procs = 10240 119 | resource_cmds = ["import resource", 120 | "cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC)", 121 | "target_proc = min(max_proc, %s) if max_proc > 0 else %s" % (target_procs, target_procs), 122 | "resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc))", 123 | "cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE)", 124 | "target_hdls = min(max_hdls, %s) if max_hdls > 0 else %s" % (target_procs, target_procs), 125 | "resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls))"] 126 | 127 | start_cmd = "from ipyparallel.apps.%s import launch_new_instance" 128 | engine_cmd_argv = [sys.executable, "-E", "-c"] + \ 129 | ["; ".join(resource_cmds + [start_cmd % "ipengineapp", "launch_new_instance()"])] 130 | cluster_cmd_argv = [sys.executable, "-E", "-c"] + \ 131 | ["; ".join(resource_cmds + [start_cmd % "ipclusterapp", "launch_new_instance()"])] 132 | #controller_cmd_argv = [sys.executable, "-E", "-c"] + \ 133 | # ["; ".join(resource_cmds + [start_cmd % "ipcontrollerapp", "launch_new_instance()"])] 134 | controller_cmd_argv = [sys.executable, "-E", "-c"] + \ 135 | ["; ".join(resource_cmds + ["from cluster_helper.cluster import VMFixIPControllerApp", 136 | "VMFixIPControllerApp.launch_instance()"])] 137 | 138 | def get_engine_commands(context, n): 139 | """Retrieve potentially multiple engines running in a single submit script. 140 | 141 | Need to background the initial processes if multiple run. 
142 | """ 143 | assert n > 0 144 | engine_cmd_full = '%s %s --profile-dir="{profile_dir}" --cluster-id="{cluster_id}"' % \ 145 | (' '.join(map(pipes.quote, engine_cmd_argv)), ' '.join(timeout_params)) 146 | out = [engine_cmd_full.format(**context)] 147 | for _ in range(n - 1): 148 | out.insert(0, "(%s &) &&" % (engine_cmd_full.format(**context))) 149 | return "export IPYTHONDIR={profile_dir}\n".format(**context) + "\n".join(out) 150 | 151 | # ## Platform LSF 152 | class BcbioLSFEngineSetLauncher(launcher.LSFEngineSetLauncher): 153 | """Custom launcher handling heterogeneous clusters on LSF. 154 | """ 155 | batch_file_name = Unicode("lsf_engine" + str(uuid.uuid4())) 156 | cores = traitlets.Integer(1, config=True) 157 | numengines = traitlets.Integer(1, config=True) 158 | mem = traitlets.Unicode("", config=True) 159 | tag = traitlets.Unicode("", config=True) 160 | resources = traitlets.Unicode("", config=True) 161 | job_array_template = Unicode('') 162 | queue_template = Unicode('') 163 | default_template = traitlets.Unicode("""#!/bin/sh 164 | #BSUB -q {queue} 165 | #BSUB -J {tag}-e[1-{n}] 166 | #BSUB -oo bcbio-ipengine.bsub.%%J 167 | #BSUB -n {cores} 168 | #BSUB -R "span[hosts=1]" 169 | {mem} 170 | {resources} 171 | {cmd} 172 | """) 173 | 174 | def start(self, n): 175 | self.context["cores"] = self.cores * self.numengines 176 | if self.mem: 177 | # lsf.conf can specify nonstandard units for memory reservation 178 | mem = lsf.parse_memory(float(self.mem)) 179 | # check if memory reservation is per core or per job 180 | if lsf.per_core_reservation(): 181 | mem = mem / (self.cores * self.numengines) 182 | mem = mem * self.numengines 183 | self.context["mem"] = '#BSUB -R "rusage[mem=%s]"' % mem 184 | else: 185 | self.context["mem"] = "" 186 | self.context["tag"] = self.tag if self.tag else "bcbio" 187 | self.context["resources"] = _format_lsf_resources(self.resources) 188 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 189 | return super(BcbioLSFEngineSetLauncher, self).start(n) 190 | 191 | def _format_lsf_resources(resources): 192 | resource_str = "" 193 | for r in str(resources).split(";"): 194 | if r.strip(): 195 | if "=" in r: 196 | arg, val = r.split("=") 197 | arg.strip() 198 | val.strip() 199 | else: 200 | arg = r.strip() 201 | val = "" 202 | resource_str += "#BSUB -%s %s\n" % (arg, val) 203 | return resource_str 204 | 205 | class BcbioLSFControllerLauncher(launcher.LSFControllerLauncher): 206 | batch_file_name = Unicode("lsf_controller" + str(uuid.uuid4())) 207 | tag = traitlets.Unicode("", config=True) 208 | cores = traitlets.Integer(1, config=True) 209 | resources = traitlets.Unicode("", config=True) 210 | job_array_template = Unicode('') 211 | queue_template = Unicode('') 212 | default_template = traitlets.Unicode("""#!/bin/sh 213 | #BSUB -q {queue} 214 | #BSUB -J {tag}-c 215 | #BSUB -n {cores} 216 | #BSUB -oo bcbio-ipcontroller.bsub.%%J 217 | {resources} 218 | export IPYTHONDIR={profile_dir} 219 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 220 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 221 | ' '.join(controller_params))) 222 | def start(self): 223 | self.context["cores"] = self.cores 224 | self.context["tag"] = self.tag if self.tag else "bcbio" 225 | self.context["resources"] = _format_lsf_resources(self.resources) 226 | return super(BcbioLSFControllerLauncher, self).start() 227 | 228 | # ## Sun Grid Engine (SGE) 229 | 230 | def _local_environment_exports(): 231 | """Create export string for a batch 
script with environmental variables to pass. 232 | 233 | Passes additional environmental variables not inherited by some schedulers. 234 | LD_LIBRARY_PATH filtered by '-V' on recent Grid Engine releases. 235 | LD_PRELOAD filtered by SGE without an option to turn off: 236 | 237 | https://github.com/gridengine/gridengine/blob/6a5407d56c85b39290ac2488fb6dec1a4404a974/source/libs/sgeobj/sge_var.c#L994 238 | but used by tools like PetaGene (http://www.petagene.com/). 239 | """ 240 | exports = [] 241 | for envname in ["LD_LIBRARY_PATH", "LD_PRELOAD"]: 242 | envval = os.environ.get(envname) 243 | if envval: 244 | exports.append("export %s=%s" % (envname, envval)) 245 | return "\n".join(exports) 246 | 247 | class BcbioSGEEngineSetLauncher(launcher.SGEEngineSetLauncher): 248 | """Custom launcher handling heterogeneous clusters on SGE. 249 | """ 250 | batch_file_name = Unicode("sge_engine" + str(uuid.uuid4())) 251 | cores = traitlets.Integer(1, config=True) 252 | numengines = traitlets.Integer(1, config=True) 253 | pename = traitlets.Unicode("", config=True) 254 | resources = traitlets.Unicode("", config=True) 255 | mem = traitlets.Unicode("", config=True) 256 | memtype = traitlets.Unicode("mem_free", config=True) 257 | tag = traitlets.Unicode("", config=True) 258 | queue_template = Unicode('') 259 | default_template = traitlets.Unicode("""#$ -V 260 | #$ -cwd 261 | #$ -w w 262 | #$ -j y 263 | #$ -S /bin/sh 264 | #$ -N {tag}-e 265 | #$ -t 1-{n} 266 | #$ -pe {pename} {cores} 267 | {queue} 268 | {mem} 269 | {resources} 270 | {exports} 271 | {cmd} 272 | """) 273 | 274 | def start(self, n): 275 | self.context["cores"] = self.cores * self.numengines 276 | if self.mem: 277 | if self.memtype == "rss": 278 | self.context["mem"] = "#$ -l rss=%sM" % int(float(self.mem) * 1024 / self.cores * self.numengines) 279 | elif self.memtype == "virtual_free": 280 | self.context["mem"] = "#$ -l virtual_free=%sM" % int(float(self.mem) * 1024 * self.numengines) 281 | else: 282 | self.context["mem"] = "#$ -l mem_free=%sM" % int(float(self.mem) * 1024 * self.numengines) 283 | else: 284 | self.context["mem"] = "" 285 | if self.queue: 286 | self.context["queue"] = "#$ -q %s" % self.queue 287 | else: 288 | self.context["queue"] = "" 289 | self.context["tag"] = self.tag if self.tag else "bcbio" 290 | self.context["pename"] = str(self.pename) 291 | self.context["resources"] = "\n".join([_prep_sge_resource(r) 292 | for r in str(self.resources).split(";") 293 | if r.strip()]) 294 | self.context["exports"] = _local_environment_exports() 295 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 296 | return super(BcbioSGEEngineSetLauncher, self).start(n) 297 | 298 | class BcbioSGEControllerLauncher(launcher.SGEControllerLauncher): 299 | batch_file_name = Unicode("sge_controller" + str(uuid.uuid4())) 300 | tag = traitlets.Unicode("", config=True) 301 | cores = traitlets.Integer(1, config=True) 302 | pename = traitlets.Unicode("", config=True) 303 | resources = traitlets.Unicode("", config=True) 304 | queue_template = Unicode('') 305 | default_template = traitlets.Unicode(u"""#$ -V 306 | #$ -cwd 307 | #$ -w w 308 | #$ -j y 309 | #$ -S /bin/sh 310 | #$ -N {tag}-c 311 | {cores} 312 | {queue} 313 | {resources} 314 | {exports} 315 | export IPYTHONDIR={profile_dir} 316 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 317 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 318 | ' '.join(controller_params))) 319 | def start(self): 320 | self.context["cores"] = "#$ -pe %s %s" % 
(self.pename, self.cores) if self.cores > 1 else "" 321 | self.context["tag"] = self.tag if self.tag else "bcbio" 322 | self.context["resources"] = "\n".join([_prep_sge_resource(r) 323 | for r in str(self.resources).split(";") 324 | if r.strip()]) 325 | if self.queue: 326 | self.context["queue"] = "#$ -q %s" % self.queue 327 | else: 328 | self.context["queue"] = "" 329 | self.context["exports"] = _local_environment_exports() 330 | return super(BcbioSGEControllerLauncher, self).start() 331 | 332 | def _prep_sge_resource(resource): 333 | """Prepare SGE resource specifications from the command line handling special cases. 334 | """ 335 | resource = resource.strip() 336 | try: 337 | k, v = resource.split("=") 338 | except ValueError: 339 | k, v = None, None 340 | if k and k in set(["ar", "m", "M"]): 341 | return "#$ -%s %s" % (k, v) 342 | else: 343 | return "#$ -l %s" % resource 344 | 345 | def _find_parallel_environment(queue): 346 | """Find an SGE/OGE parallel environment for running multicore jobs in specified queue. 347 | """ 348 | base_queue, ext = os.path.splitext(queue) 349 | queue = base_queue + ".q" 350 | 351 | available_pes = [] 352 | for name in subprocess.check_output(["qconf", "-spl"]).decode().strip().split(): 353 | if name: 354 | for line in subprocess.check_output(["qconf", "-sp", name]).decode().split("\n"): 355 | if _has_parallel_environment(line): 356 | if _queue_can_access_pe(name, queue) or _queue_can_access_pe(name, base_queue) or _queue_can_access_pe(name, base_queue + ext): 357 | available_pes.append(name) 358 | if len(available_pes) == 0: 359 | raise ValueError("Could not find an SGE environment configured for parallel execution. " 360 | "See %s for SGE setup instructions." % 361 | "https://blogs.oracle.com/templedf/entry/configuring_a_new_parallel_environment") 362 | else: 363 | return _prioritize_pes(available_pes) 364 | 365 | def _prioritize_pes(choices): 366 | """Prioritize and deprioritize paired environments based on names. 367 | 368 | We're looking for multiprocessing friendly environments, so prioritize ones with SMP 369 | in the name and deprioritize those with MPI. 370 | """ 371 | # lower scores = better 372 | ranks = {"smp": -1, "mpi": 1} 373 | sort_choices = [] 374 | for n in choices: 375 | # Identify if it fits in any special cases 376 | special_case = False 377 | for k, val in ranks.items(): 378 | if n.lower().find(k) >= 0: 379 | sort_choices.append((val, n)) 380 | special_case = True 381 | break 382 | if not special_case: # otherwise, no priority/de-priority 383 | sort_choices.append((0, n)) 384 | sort_choices.sort() 385 | return sort_choices[0][1] 386 | 387 | def _parseSGEConf(data): 388 | """Handle SGE multiple line output nastiness. 389 | From: https://github.com/clovr/vappio/blob/master/vappio-twisted/vappio_tx/load/sge_queue.py 390 | """ 391 | lines = data.split('\n') 392 | multiline = False 393 | ret = {} 394 | for line in lines: 395 | line = line.strip() 396 | if line: 397 | if not multiline: 398 | key, value = line.split(' ', 1) 399 | value = value.strip().rstrip('\\') 400 | ret[key] = value 401 | else: 402 | # Making use of the fact that the key was created 403 | # in the previous iteration and is stil lin scope 404 | ret[key] += line 405 | multiline = (line[-1] == '\\') 406 | return ret 407 | 408 | def _queue_can_access_pe(pe_name, queue): 409 | """Check if a queue has access to a specific parallel environment, using qconf. 
410 | """ 411 | try: 412 | queue_config = _parseSGEConf(subprocess.check_output(["qconf", "-sq", queue]).decode()) 413 | except: 414 | return False 415 | for test_pe_name in re.split('\W+|,', queue_config["pe_list"]): 416 | if test_pe_name == pe_name: 417 | return True 418 | return False 419 | 420 | def _has_parallel_environment(line): 421 | if line.startswith("allocation_rule"): 422 | if line.find("$pe_slots") >= 0 or line.find("$fill_up") >= 0: 423 | return True 424 | return False 425 | 426 | # ## SLURM 427 | class SLURMLauncher(launcher.BatchSystemLauncher): 428 | """A BatchSystemLauncher subclass for SLURM 429 | """ 430 | submit_command = List(['sbatch'], config=True, 431 | help="The SLURM submit command ['sbatch']") 432 | delete_command = List(['scancel'], config=True, 433 | help="The SLURM delete command ['scancel']") 434 | job_id_regexp = CRegExp(r'\d+', config=True, 435 | help="A regular expression used to get the job id from the output of 'sbatch'") 436 | batch_file = Unicode(u'', config=True, 437 | help="The string that is the batch script template itself.") 438 | 439 | queue_regexp = CRegExp('#SBATCH\W+-p\W+\w') 440 | queue_template = Unicode('#SBATCH -p {queue}') 441 | 442 | 443 | class BcbioSLURMEngineSetLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 444 | """Custom launcher handling heterogeneous clusters on SLURM 445 | """ 446 | batch_file_name = Unicode("SLURM_engine" + str(uuid.uuid4())) 447 | machines = traitlets.Integer(0, config=True) 448 | cores = traitlets.Integer(1, config=True) 449 | numengines = traitlets.Integer(1, config=True) 450 | mem = traitlets.Unicode("", config=True) 451 | tag = traitlets.Unicode("", config=True) 452 | account = traitlets.Unicode("", config=True) 453 | timelimit = traitlets.Unicode("", config=True) 454 | resources = traitlets.Unicode("", config=True) 455 | default_template = traitlets.Unicode("""#!/bin/sh 456 | #SBATCH -p {queue} 457 | #SBATCH -J {tag}-e[1-{n}] 458 | #SBATCH -o bcbio-ipengine.out.%A_%a 459 | #SBATCH -e bcbio-ipengine.err.%A_%a 460 | #SBATCH --cpus-per-task={cores} 461 | #SBATCH --array=1-{n} 462 | #SBATCH -t {timelimit} 463 | {account}{machines}{mem}{resources} 464 | {cmd} 465 | """) 466 | 467 | def start(self, n): 468 | self.context["cores"] = self.cores * self.numengines 469 | if self.mem: 470 | self.context["mem"] = "#SBATCH --mem=%s\n" % int(float(self.mem) * 1024.0 * self.numengines) 471 | else: 472 | self.context["mem"] = "#SBATCH --mem=%d\n" % int(DEFAULT_MEM_PER_CPU * self.cores * self.numengines) 473 | self.context["tag"] = self.tag if self.tag else "bcbio" 474 | self.context["machines"] = ("#SBATCH %s\n" % (self.machines) if int(self.machines) > 0 else "") 475 | self.context["account"] = ("#SBATCH -A %s\n" % self.account if self.account else "") 476 | self.context["timelimit"] = self.timelimit 477 | self.context["resources"] = "\n".join(["#SBATCH --%s\n" % r.strip() 478 | for r in str(self.resources).split(";") 479 | if r.strip()]) 480 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 481 | return super(BcbioSLURMEngineSetLauncher, self).start(n) 482 | 483 | class BcbioSLURMControllerLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 484 | batch_file_name = Unicode("SLURM_controller" + str(uuid.uuid4())) 485 | account = traitlets.Unicode("", config=True) 486 | cores = traitlets.Integer(1, config=True) 487 | timelimit = traitlets.Unicode("", config=True) 488 | mem = traitlets.Unicode("", config=True) 489 | tag = traitlets.Unicode("", config=True) 490 | resources = 
traitlets.Unicode("", config=True) 491 | default_template = traitlets.Unicode("""#!/bin/sh 492 | #SBATCH -J {tag}-c 493 | #SBATCH -o bcbio-ipcontroller.out.%%A_%%a 494 | #SBATCH -e bcbio-ipcontroller.err.%%A_%%a 495 | #SBATCH -t {timelimit} 496 | #SBATCH --cpus-per-task={cores} 497 | {account}{mem}{resources} 498 | export IPYTHONDIR={profile_dir} 499 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 500 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 501 | ' '.join(controller_params))) 502 | def start(self): 503 | self.context["account"] = self.account 504 | self.context["timelimit"] = self.timelimit 505 | self.context["cores"] = self.cores 506 | if self.mem: 507 | self.context["mem"] = "#SBATCH --mem=%s\n" % int(float(self.mem) * 1024.0) 508 | else: 509 | self.context["mem"] = "#SBATCH --mem=%d\n" % (4 * DEFAULT_MEM_PER_CPU) 510 | self.context["tag"] = self.tag if self.tag else "bcbio" 511 | self.context["account"] = ("#SBATCH -A %s\n" % self.account if self.account else "") 512 | self.context["resources"] = "\n".join(["#SBATCH --%s\n" % r.strip() 513 | for r in str(self.resources).split(";") 514 | if r.strip()]) 515 | return super(BcbioSLURMControllerLauncher, self).start(1) 516 | 517 | 518 | class BcbioOLDSLURMEngineSetLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 519 | """Launch engines using SLURM for version < 2.6""" 520 | machines = traitlets.Integer(1, config=True) 521 | account = traitlets.Unicode("", config=True) 522 | timelimit = traitlets.Unicode("", config=True) 523 | batch_file_name = Unicode("SLURM_engines" + str(uuid.uuid4()), 524 | config=True, help="batch file name for the engine(s) job.") 525 | 526 | default_template = Unicode(u"""#!/bin/sh 527 | #SBATCH -A {account} 528 | #SBATCH --job-name ipengine 529 | #SBATCH -N {machines} 530 | #SBATCH -t {timelimit} 531 | export IPYTHONDIR={profile_dir} 532 | srun -N {machines} -n {n} %s %s --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" 533 | """ % (' '.join(map(pipes.quote, engine_cmd_argv)), 534 | ' '.join(timeout_params))) 535 | 536 | def start(self, n): 537 | """Start n engines by profile or profile_dir.""" 538 | self.context["machines"] = self.machines 539 | self.context["account"] = self.account 540 | self.context["timelimit"] = self.timelimit 541 | return super(BcbioOLDSLURMEngineSetLauncher, self).start(n) 542 | 543 | 544 | class BcbioOLDSLURMControllerLauncher(SLURMLauncher, launcher.BatchClusterAppMixin): 545 | """Launch a controller using SLURM for versions < 2.6""" 546 | account = traitlets.Unicode("", config=True) 547 | timelimit = traitlets.Unicode("", config=True) 548 | batch_file_name = Unicode("SLURM_controller" + str(uuid.uuid4()), 549 | config=True, help="batch file name for the engine(s) job.") 550 | 551 | default_template = Unicode("""#!/bin/sh 552 | #SBATCH -A {account} 553 | #SBATCH --job-name ipcontroller 554 | #SBATCH -t {timelimit} 555 | export IPYTHONDIR={profile_dir} 556 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 557 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 558 | ' '.join(controller_params))) 559 | 560 | def start(self): 561 | """Start the controller by profile or profile_dir.""" 562 | self.context["account"] = self.account 563 | self.context["timelimit"] = self.timelimit 564 | return super(BcbioOLDSLURMControllerLauncher, self).start(1) 565 | 566 | # ## Torque 567 | class TORQUELauncher(launcher.BatchSystemLauncher): 568 | """A BatchSystemLauncher subclass for Torque""" 569 | 
570 | submit_command = List(['qsub'], config=True, 571 | help="The PBS submit command ['qsub']") 572 | delete_command = List(['qdel'], config=True, 573 | help="The PBS delete command ['qsub']") 574 | job_id_regexp = CRegExp(r'\d+(\[\])?', config=True, 575 | help="Regular expresion for identifying the job ID [r'\d+']") 576 | 577 | batch_file = Unicode(u'') 578 | #job_array_regexp = CRegExp('#PBS\W+-t\W+[\w\d\-\$]+') 579 | #job_array_template = Unicode('#PBS -t 1-{n}') 580 | queue_regexp = CRegExp('#PBS\W+-q\W+\$?\w+') 581 | queue_template = Unicode('#PBS -q {queue}') 582 | 583 | def _prep_torque_resources(resources): 584 | """Prepare resources passed to torque from input parameters. 585 | """ 586 | out = [] 587 | has_walltime = False 588 | for r in resources.split(";"): 589 | if "=" in r: 590 | k, v = r.split("=") 591 | k.strip() 592 | v.strip() 593 | else: 594 | k = "" 595 | if k.lower() in ["a", "account", "acct"] and v: 596 | out.append("#PBS -A %s" % v) 597 | elif r.strip(): 598 | if k.lower() == "walltime": 599 | has_walltime = True 600 | out.append("#PBS -l %s" % r.strip()) 601 | if not has_walltime: 602 | out.append("#PBS -l walltime=239:00:00") 603 | return out 604 | 605 | class BcbioTORQUEEngineSetLauncher(TORQUELauncher, launcher.BatchClusterAppMixin): 606 | """Launch Engines using Torque""" 607 | cores = traitlets.Integer(1, config=True) 608 | mem = traitlets.Unicode("", config=True) 609 | tag = traitlets.Unicode("", config=True) 610 | numengines = traitlets.Integer(1, config=True) 611 | resources = traitlets.Unicode("", config=True) 612 | batch_file_name = Unicode("torque_engines" + str(uuid.uuid4()), 613 | config=True, help="batch file name for the engine(s) job.") 614 | default_template = Unicode(u"""#!/bin/sh 615 | #PBS -V 616 | #PBS -S /bin/sh 617 | #PBS -j oe 618 | #PBS -N {tag}-e 619 | #PBS -t 1-{n} 620 | #PBS -l nodes=1:ppn={cores} 621 | {mem} 622 | {resources} 623 | {exports} 624 | cd $PBS_O_WORKDIR 625 | {cmd} 626 | """) 627 | 628 | def start(self, n): 629 | """Start n engines by profile or profile_dir.""" 630 | try: 631 | self.context["cores"] = self.cores * self.numengines 632 | if self.mem: 633 | self.context["mem"] = "#PBS -l mem=%smb" % int(float(self.mem) * 1024 * self.numengines) 634 | else: 635 | self.context["mem"] = "" 636 | 637 | self.context["tag"] = self.tag if self.tag else "bcbio" 638 | self.context["resources"] = "\n".join(_prep_torque_resources(self.resources)) 639 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 640 | self.context["exports"] = _local_environment_exports() 641 | return super(BcbioTORQUEEngineSetLauncher, self).start(n) 642 | except: 643 | self.log.exception("Engine start failed") 644 | 645 | class BcbioTORQUEControllerLauncher(TORQUELauncher, launcher.BatchClusterAppMixin): 646 | """Launch a controller using Torque.""" 647 | batch_file_name = Unicode("torque_controller" + str(uuid.uuid4()), 648 | config=True, help="batch file name for the engine(s) job.") 649 | cores = traitlets.Integer(1, config=True) 650 | tag = traitlets.Unicode("", config=True) 651 | resources = traitlets.Unicode("", config=True) 652 | default_template = Unicode("""#!/bin/sh 653 | #PBS -V 654 | #PBS -S /bin/sh 655 | #PBS -N {tag}-c 656 | #PBS -j oe 657 | #PBS -l nodes=1:ppn={cores} 658 | {resources} 659 | {exports} 660 | cd $PBS_O_WORKDIR 661 | export IPYTHONDIR={profile_dir} 662 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 663 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 664 | ' 
'.join(controller_params))) 665 | 666 | def start(self): 667 | """Start the controller by profile or profile_dir.""" 668 | try: 669 | self.context["cores"] = self.cores 670 | self.context["tag"] = self.tag if self.tag else "bcbio" 671 | self.context["resources"] = "\n".join(_prep_torque_resources(self.resources)) 672 | self.context["exports"] = _local_environment_exports() 673 | return super(BcbioTORQUEControllerLauncher, self).start(1) 674 | except: 675 | self.log.exception("Controller start failed") 676 | 677 | # ## PBSPro 678 | class PBSPROLauncher(launcher.PBSLauncher): 679 | """A BatchSystemLauncher subclass for PBSPro.""" 680 | job_array_regexp = CRegExp('#PBS\W+-J\W+[\w\d\-\$]+') 681 | job_array_template = Unicode('') 682 | 683 | def stop(self): 684 | job_ids = self.job_id.split(";") 685 | for job in job_ids: 686 | subprocess.check_call("qdel %s" % job, shell=True) 687 | 688 | def notify_start(self, data): 689 | self.log.debug('Process %r started: %r', self.args[0], data) 690 | self.start_data = data 691 | self.state = 'running' 692 | self.job_id = data 693 | return data 694 | 695 | class BcbioPBSPROEngineSetLauncher(PBSPROLauncher, launcher.BatchClusterAppMixin): 696 | """Launch Engines using PBSPro""" 697 | 698 | batch_file_name = Unicode('pbspro_engines' + str(uuid.uuid4()), 699 | config=True, 700 | help="batch file name for the engine(s) job.") 701 | tag = traitlets.Unicode("", config=True) 702 | cores = traitlets.Integer(1, config=True) 703 | mem = traitlets.Unicode("", config=True) 704 | numengines = traitlets.Integer(1, config=True) 705 | resources = traitlets.Unicode("", config=True) 706 | default_template = Unicode(u"""#!/bin/sh 707 | #PBS -V 708 | #PBS -S /bin/sh 709 | #PBS -N {tag}-e 710 | {resources} 711 | {exports} 712 | cd $PBS_O_WORKDIR 713 | {cmd} 714 | """) 715 | 716 | def start(self, n): 717 | tcores = (self.cores * self.numengines) 718 | tmem = int(float(self.mem) * 1024 * self.numengines) 719 | if _pbspro_noselect(self.resources): 720 | resources = "#PBS -l ncpus=%d\n" % tcores 721 | else: 722 | resources = "#PBS -l select=1:ncpus=%d" % tcores 723 | if self.mem: 724 | if _pbspro_noselect(self.resources): 725 | resources += "#PBS -l mem=%smb" % tmem 726 | else: 727 | resources += ":mem=%smb" % tmem 728 | resources = "\n".join([resources] + _prep_pbspro_resources(self.resources)) 729 | self.context["resources"] = resources 730 | self.context["cores"] = self.cores 731 | self.context["tag"] = self.tag if self.tag else "bcbio" 732 | self.context["cmd"] = get_engine_commands(self.context, self.numengines) 733 | self.context["exports"] = _local_environment_exports() 734 | self.write_batch_script(n) 735 | job_ids = [] 736 | for i in range(n): 737 | output = subprocess.check_output("qsub < %s" % self.batch_file_name, 738 | shell=True, universal_newlines=True) 739 | if six.PY3: 740 | if not isinstance(output, str): 741 | output = output.decode('ascii', 'ignore') 742 | job_ids.append(output.strip()) 743 | job_id = ";".join(job_ids) 744 | self.notify_start(job_id) 745 | return job_id 746 | 747 | 748 | class BcbioPBSPROControllerLauncher(PBSPROLauncher, launcher.BatchClusterAppMixin): 749 | """Launch a controller using PBSPro.""" 750 | 751 | batch_file_name = Unicode("pbspro_controller" + str(uuid.uuid4()), 752 | config=True, 753 | help="batch file name for the controller job.") 754 | tag = traitlets.Unicode("", config=True) 755 | cores = traitlets.Integer(1, config=True) 756 | mem = traitlets.Unicode("", config=True) 757 | resources = traitlets.Unicode("", config=True) 758 | 
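    # The template below launches the controller through the patched
    # VMFixIPControllerApp (controller_cmd_argv) with the shared
    # controller_params; {resources} is filled in start() from
    # _prep_pbspro_resources() plus an ncpus/mem request built according to
    # whether the "noselect" resource is set.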
default_template = Unicode("""#!/bin/sh 759 | #PBS -V 760 | #PBS -S /bin/sh 761 | #PBS -N {tag}-c 762 | {resources} 763 | {exports} 764 | cd $PBS_O_WORKDIR 765 | export IPYTHONDIR={profile_dir} 766 | %s --ip=* --log-to-file --profile-dir="{profile_dir}" --cluster-id="{cluster_id}" %s 767 | """ % (' '.join(map(pipes.quote, controller_cmd_argv)), 768 | ' '.join(controller_params))) 769 | 770 | def start(self): 771 | """Start the controller by profile or profile_dir.""" 772 | pbsproresources = _prep_pbspro_resources(self.resources) 773 | tmem = int(float(self.mem) * 1020.-1) if self.mem else (4 * DEFAULT_MEM_PER_CPU) 774 | if _pbspro_noselect(self.resources): 775 | cpuresource = "#PBS -l ncpus=%d" % self.cores 776 | pbsproresources.append("#PBS -l mem=%smb" % tmem) 777 | else: 778 | cpuresource = "#PBS -l select=1:ncpus=%d" % self.cores 779 | cpuresource += ":mem=%smb" % tmem 780 | pbsproresources.append(cpuresource) 781 | resources = "\n".join(pbsproresources) 782 | self.context["resources"] = resources 783 | self.context["cores"] = self.cores 784 | self.context["tag"] = self.tag if self.tag else "bcbio" 785 | self.context["exports"] = _local_environment_exports() 786 | self.resources = resources 787 | return super(BcbioPBSPROControllerLauncher, self).start(1) 788 | 789 | class BcbioLocalControllerLauncher(LocalControllerLauncher): 790 | def find_args(self): 791 | extra_params = ["--ip=*", "--log-to-file", 792 | "--profile-dir=%s" % self.profile_dir, 793 | "--cluster-id=%s" % self.cluster_id] 794 | return controller_cmd_argv + controller_params + extra_params 795 | def start(self): 796 | return super(BcbioLocalControllerLauncher, self).start() 797 | 798 | def _prep_pbspro_resources(resources): 799 | """Prepare resources passed to pbspro from input parameters. 800 | """ 801 | out = [] 802 | has_walltime = False 803 | for r in resources.split(";"): 804 | if "=" in r: 805 | k, v = r.split("=") 806 | k.strip() 807 | v.strip() 808 | else: 809 | k = "" 810 | if k.lower() in ["a", "account", "acct"] and v: 811 | out.append("#PBS -A %s" % v) 812 | elif r.strip(): 813 | if k.lower() == "walltime": 814 | has_walltime = True 815 | if r.strip() == "noselect": 816 | continue 817 | out.append("#PBS -l %s" % r.strip()) 818 | if not has_walltime: 819 | out.append("#PBS -l walltime=239:00:00") 820 | return out 821 | 822 | def _pbspro_noselect(resources): 823 | """ 824 | handles PBSPro setups which don't support the select statement (NCI) 825 | """ 826 | for r in resources.split(";"): 827 | if r.strip() == "noselect": 828 | return True 829 | return False 830 | 831 | def _get_profile_args(profile): 832 | if os.path.isdir(profile) and os.path.isabs(profile): 833 | return ["--profile-dir=%s" % profile] 834 | else: 835 | return ["--profile=%s" % profile] 836 | 837 | def _scheduler_resources(scheduler, params, queue): 838 | """Retrieve custom resource tweaks for specific schedulers. 839 | Handles SGE parallel environments, which allow multicore jobs 840 | but are specific to different environments. 
841 | Pulls out hacks to work in different environments: 842 | - mincores -- Require a minimum number of cores when submitting jobs 843 | to avoid single core jobs on constrained queues 844 | - conmem -- Memory (in Gb) for the controller to use 845 | """ 846 | orig_resources = copy.deepcopy(params.get("resources", [])) 847 | specials = {} 848 | if not orig_resources: 849 | orig_resources = [] 850 | if isinstance(orig_resources, six.string_types): 851 | orig_resources = orig_resources.split(";") 852 | resources = [] 853 | for r in orig_resources: 854 | if r.startswith(("mincores=", "minconcores=", "conmem=")): 855 | name, val = r.split("=") 856 | specials[name] = int(val) 857 | else: 858 | resources.append(r) 859 | if scheduler in ["SGE"]: 860 | pass_resources = [] 861 | for r in resources: 862 | if r.startswith("pename="): 863 | _, pename = r.split("=") 864 | specials["pename"] = pename 865 | elif r.startswith("memtype="): 866 | _, memtype = r.split("=") 867 | specials["memtype"] = memtype 868 | else: 869 | pass_resources.append(r) 870 | resources = pass_resources 871 | if "pename" not in specials: 872 | specials["pename"] = _find_parallel_environment(queue) 873 | return ";".join(resources), specials 874 | 875 | def _start(scheduler, profile, queue, num_jobs, cores_per_job, cluster_id, 876 | extra_params): 877 | """Starts cluster from commandline. 878 | """ 879 | ns = "cluster_helper.cluster" 880 | scheduler = scheduler.upper() 881 | if scheduler == "SLURM" and _slurm_is_old(): 882 | scheduler = "OLDSLURM" 883 | engine_class = "Bcbio%sEngineSetLauncher" % scheduler 884 | controller_class = "Bcbio%sControllerLauncher" % scheduler 885 | if not (engine_class in globals() and controller_class in globals()): 886 | print ("The engine and controller class %s and %s are not " 887 | "defined. " % (engine_class, controller_class)) 888 | print ("This may be due to ipython-cluster-helper not supporting " 889 | "your scheduler. If it should, please file a bug report at " 890 | "http://github.com/roryk/ipython-cluster-helper. 
Thanks!") 891 | sys.exit(1) 892 | resources, specials = _scheduler_resources(scheduler, extra_params, queue) 893 | if scheduler in ["OLDSLURM", "SLURM"]: 894 | resources, slurm_atrs = get_slurm_attributes(queue, resources) 895 | else: 896 | slurm_atrs = None 897 | mincores = specials.get("mincores", 1) 898 | if mincores > cores_per_job: 899 | if cores_per_job > 1: 900 | mincores = cores_per_job 901 | else: 902 | mincores = int(math.ceil(mincores / float(cores_per_job))) 903 | num_jobs = int(math.ceil(num_jobs / float(mincores))) 904 | 905 | args = cluster_cmd_argv + \ 906 | ["start", 907 | "--daemonize=True", 908 | "--IPClusterEngines.early_shutdown=240", 909 | "--delay=10", 910 | "--log-to-file", 911 | "--debug", 912 | "--n=%s" % num_jobs, 913 | "--%s.cores=%s" % (engine_class, cores_per_job), 914 | "--%s.resources='%s'" % (engine_class, resources), 915 | "--%s.mem='%s'" % (engine_class, extra_params.get("mem", "")), 916 | "--%s.tag='%s'" % (engine_class, extra_params.get("tag", "")), 917 | "--IPClusterStart.engine_launcher_class=%s.%s" % (ns, engine_class), 918 | "--%sLauncher.queue='%s'" % (scheduler, queue), 919 | "--cluster-id=%s" % (cluster_id) 920 | ] 921 | # set controller options 922 | local_controller = extra_params.get("local_controller") 923 | if local_controller: 924 | controller_class = "BcbioLocalControllerLauncher" 925 | args += ["--IPClusterStart.controller_launcher_class=%s.%s" % 926 | (ns, controller_class)] 927 | else: 928 | args += [ 929 | "--%s.mem='%s'" % (controller_class, specials.get("conmem", "")), 930 | "--%s.tag='%s'" % (engine_class, extra_params.get("tag", "")), 931 | "--%s.tag='%s'" % (controller_class, extra_params.get("tag", "")), 932 | "--%s.cores=%s" % (controller_class, specials.get("minconcores", 1)), 933 | "--%s.resources='%s'" % (controller_class, resources), 934 | "--IPClusterStart.controller_launcher_class=%s.%s" % 935 | (ns, controller_class)] 936 | 937 | args += _get_profile_args(profile) 938 | if mincores > 1 and mincores > cores_per_job: 939 | args += ["--%s.numengines=%s" % (engine_class, mincores)] 940 | if specials.get("pename"): 941 | if not local_controller: 942 | args += ["--%s.pename=%s" % (controller_class, specials["pename"])] 943 | args += ["--%s.pename=%s" % (engine_class, specials["pename"])] 944 | if specials.get("memtype"): 945 | args += ["--%s.memtype=%s" % (engine_class, specials["memtype"])] 946 | if slurm_atrs: 947 | args += ["--%s.machines=%s" % (engine_class, slurm_atrs.get("machines", "0"))] 948 | args += ["--%s.timelimit='%s'" % (engine_class, slurm_atrs["timelimit"])] 949 | if not local_controller: 950 | args += ["--%s.timelimit='%s'" % (controller_class, slurm_atrs["timelimit"])] 951 | if slurm_atrs.get("account"): 952 | args += ["--%s.account=%s" % (engine_class, slurm_atrs["account"])] 953 | if not local_controller: 954 | args += ["--%s.account=%s" % (controller_class, slurm_atrs["account"])] 955 | subprocess.check_call(args) 956 | return cluster_id 957 | 958 | def _start_local(cores, profile, cluster_id): 959 | """Start a local non-distributed IPython engine. 
Useful for testing 960 | """ 961 | args = cluster_cmd_argv + \ 962 | ["start", 963 | "--daemonize=True", 964 | "--log-to-file", 965 | "--debug", 966 | "--cluster-id=%s" % cluster_id, 967 | "--n=%s" % cores] 968 | args += _get_profile_args(profile) 969 | subprocess.check_call(args) 970 | return cluster_id 971 | 972 | def stop_from_view(view): 973 | _stop(view.clusterhelper["profile"], view.clusterhelper["cluster_id"]) 974 | 975 | def _stop(profile, cluster_id): 976 | args = cluster_cmd_argv + \ 977 | ["stop", "--cluster-id=%s" % cluster_id] 978 | args += _get_profile_args(profile) 979 | try: 980 | subprocess.check_call(args) 981 | except subprocess.CalledProcessError: 982 | print('Manual shutdown of cluster failed, often this is because the ' 983 | 'cluster was already shutdown.') 984 | 985 | class ClusterView(object): 986 | """Provide a view on an ipython cluster for processing. 987 | 988 | - scheduler: The type of cluster to start (lsf, sge, pbs, torque). 989 | - num_jobs: Number of jobs to start. 990 | - cores_per_job: The number of cores to use for each job. 991 | - start_wait: How long to wait for the cluster to startup, in minutes. 992 | Defaults to 16 minutes. Set to longer for slow starting clusters. 993 | - retries: Number of retries to allow for failed tasks. 994 | - wait_for_all_engines: If False (default), start using cluster 995 | as soon as first engine is up. If True, wait for all 996 | engines. 997 | """ 998 | def __init__(self, scheduler, queue, num_jobs, cores_per_job=1, profile=None, 999 | start_wait=16, extra_params=None, retries=None, direct=False, 1000 | wait_for_all_engines=False): 1001 | self.stopped = False 1002 | self.profile = profile 1003 | num_jobs = int(num_jobs) 1004 | cores_per_job = int(cores_per_job) 1005 | start_wait = int(start_wait) 1006 | 1007 | if extra_params is None: 1008 | extra_params = {} 1009 | max_delay = start_wait * 60 1010 | delay = 5 1011 | max_tries = 10 1012 | _create_base_ipython_dirs() 1013 | if self.profile is None: 1014 | self.has_throwaway = True 1015 | self.profile = create_throwaway_profile() 1016 | else: 1017 | # ensure we have an .ipython directory to prevent issues 1018 | # creating it during parallel startup 1019 | cmd = [sys.executable, "-E", "-c", "from IPython import start_ipython; start_ipython()", 1020 | "profile", "create", "--parallel"] + _get_profile_args(self.profile) 1021 | subprocess.check_call(cmd) 1022 | self.has_throwaway = False 1023 | num_tries = 0 1024 | 1025 | self.cluster_id = str(uuid.uuid4()) 1026 | url_file = get_url_file(self.profile, self.cluster_id) 1027 | 1028 | if profile and os.path.isdir(profile) and os.path.isabs(profile): 1029 | os.environ["IPYTHONDIR"] = profile 1030 | while 1: 1031 | try: 1032 | if extra_params.get("run_local") or queue == "run_local": 1033 | _start_local(num_jobs, self.profile, self.cluster_id) 1034 | else: 1035 | _start(scheduler, self.profile, queue, num_jobs, 1036 | cores_per_job, self.cluster_id, extra_params) 1037 | break 1038 | except subprocess.CalledProcessError: 1039 | if num_tries > max_tries: 1040 | raise 1041 | num_tries += 1 1042 | time.sleep(delay) 1043 | 1044 | try: 1045 | self.client = None 1046 | if wait_for_all_engines: 1047 | # Start using cluster when this many engines are up 1048 | need_engines = num_jobs 1049 | else: 1050 | need_engines = 1 1051 | slept = 0 1052 | max_up = 0 1053 | up = 0 1054 | while up < need_engines: 1055 | up = _nengines_up(url_file) 1056 | print('\r{0} Engines running'.format(up), end="") 1057 | if up < max_up: 1058 | print 
("\nEngine(s) that were up have shutdown prematurely. " 1059 | "Aborting cluster startup.") 1060 | _stop(self.profile, self.cluster_id) 1061 | sys.exit(1) 1062 | max_up = up 1063 | time.sleep(delay) 1064 | slept += delay 1065 | if slept > max_delay: 1066 | raise IOError(""" 1067 | 1068 | The cluster startup timed out. This could be for a couple of reasons. The 1069 | most common reason is that the queue you are submitting jobs to is 1070 | oversubscribed. You can check if this is what is happening by trying again, 1071 | and watching to see if jobs are in a pending state or a running state when 1072 | the startup times out. If they are in the pending state, that means we just 1073 | need to wait longer for them to start, which you can specify by passing 1074 | the --timeout parameter, in minutes. 1075 | 1076 | The second reason is that there is a problem with the controller and engine 1077 | jobs being submitted to the scheduler. In the directory you ran from, 1078 | you should see files that are named YourScheduler_enginesABunchOfNumbers and 1079 | YourScheduler_controllerABunchOfNumbers. If you submit one of those files 1080 | manually to your scheduler (for example bsub < YourScheduler_controllerABunchOfNumbers) 1081 | You will get a more helpful error message that might help you figure out what 1082 | is going wrong. 1083 | 1084 | The third reason is that you need to submit your bcbio_nextgen.py job itself as a job; 1085 | bcbio-nextgen needs to run on a compute node, not the login node. So the 1086 | command you use to run bcbio-nextgen should be submitted as a job to 1087 | the scheduler. You can diagnose this because the controller and engine 1088 | jobs will be in the running state, but the cluster will still timeout. 1089 | 1090 | Finally, it may be an issue with how the cluster is configured-- the controller 1091 | and engine jobs are unable to talk to each other. They need to be able to open 1092 | ports on the machines each of them are running on in order to work. You 1093 | can diagnose this as the possible issue by if you have submitted the bcbio-nextgen 1094 | job to the scheduler, the bcbio-nextgen main job and the controller and 1095 | engine jobs are all in a running state and the cluster still times out. This will 1096 | likely to be something that you'll have to talk to the administrators of the cluster 1097 | you are using about. 
1098 | 1099 | If you need help debugging, please post an issue here and we'll try to help you 1100 | with the detective work: 1101 | 1102 | https://github.com/roryk/ipython-cluster-helper/issues 1103 | 1104 | """) 1105 | print() 1106 | self.client = Client(url_file, timeout=60) 1107 | if direct: 1108 | self.view = _get_direct_view(self.client, retries) 1109 | else: 1110 | self.view = _get_balanced_blocked_view(self.client, retries) 1111 | self.view.clusterhelper = {"profile": self.profile, 1112 | "cluster_id": self.cluster_id} 1113 | except: 1114 | self.stop() 1115 | raise 1116 | 1117 | def stop(self): 1118 | if not self.stopped: 1119 | if self.client: 1120 | _shutdown(self.client) 1121 | _stop(self.profile, self.cluster_id) 1122 | if self.has_throwaway: 1123 | delete_profile(self.profile) 1124 | self.stopped = True 1125 | 1126 | def __enter__(self): 1127 | return self 1128 | 1129 | def __exit__(self, exc_type, exc_value, traceback): 1130 | self.stop() 1131 | 1132 | 1133 | @contextlib.contextmanager 1134 | def cluster_view(scheduler, queue, num_jobs, cores_per_job=1, profile=None, 1135 | start_wait=16, extra_params=None, retries=None, direct=False, 1136 | wait_for_all_engines=False): 1137 | """Provide a view on an ipython cluster for processing. 1138 | 1139 | - scheduler: The type of cluster to start (lsf, sge, slurm, pbs, torque). 1140 | - num_jobs: Number of jobs to start. 1141 | - cores_per_job: The number of cores to use for each job. 1142 | - start_wait: How long to wait for the cluster to start up, in minutes. 1143 | Defaults to 16 minutes. Set to longer for slow starting clusters. 1144 | - retries: Number of retries to allow for failed tasks. 1145 | """ 1146 | cluster_view = ClusterView(scheduler, queue, num_jobs, cores_per_job=cores_per_job, 1147 | profile=profile, start_wait=start_wait, extra_params=extra_params, 1148 | retries=retries, direct=direct, 1149 | wait_for_all_engines=wait_for_all_engines) 1150 | try: 1151 | yield cluster_view.view 1152 | finally: 1153 | cluster_view.stop() 1154 | 1155 | def _nengines_up(url_file): 1156 | "return the number of engines up" 1157 | client = None 1158 | try: 1159 | client = Client(url_file, timeout=60) 1160 | up = len(client.ids) 1161 | client.close() 1162 | # the controller isn't up yet 1163 | except iperror.TimeoutError: 1164 | return 0 1165 | # the JSON file is not available to parse 1166 | except IOError: 1167 | return 0 1168 | else: 1169 | return up 1170 | 1171 | def _get_balanced_blocked_view(client, retries): 1172 | view = client.load_balanced_view() 1173 | view.set_flags(block=True) 1174 | if retries: 1175 | view.set_flags(retries=int(retries)) 1176 | return view 1177 | 1178 | def _create_base_ipython_dirs(): 1179 | """Create default user directories to prevent potential race conditions downstream.
1180 | """ 1181 | utils.safe_makedir(get_ipython_dir()) 1182 | ProfileDir.create_profile_dir_by_name(get_ipython_dir()) 1183 | utils.safe_makedir(os.path.join(get_ipython_dir(), "db")) 1184 | utils.safe_makedir(os.path.join(locate_profile(), "db")) 1185 | 1186 | def _shutdown(client): 1187 | print("Sending a shutdown signal to the controller and engines.") 1188 | client.close() 1189 | 1190 | def _get_direct_view(client, retries): 1191 | view = client[:] 1192 | view.set_flags(block=True) 1193 | if retries: 1194 | view.set_flags(retries=int(retries)) 1195 | return view 1196 | 1197 | def _slurm_is_old(): 1198 | return LooseVersion(_slurm_version()) < LooseVersion("2.6") 1199 | 1200 | def _slurm_version(): 1201 | version_line = subprocess.check_output("sinfo -V", shell=True) 1202 | if six.PY3: 1203 | version_line = version_line.decode('ascii', 'ignore') 1204 | parts = version_line.split() 1205 | if len(parts) > 1: 1206 | return parts[1] 1207 | else: 1208 | return "2.6+" 1209 | 1210 | # ## Temporary profile management 1211 | 1212 | def create_throwaway_profile(): 1213 | profile = str(uuid.uuid1()) 1214 | cmd = [sys.executable, "-E", "-c", "from IPython import start_ipython; start_ipython()", 1215 | "profile", "create", profile, "--parallel"] 1216 | subprocess.check_call(cmd) 1217 | return profile 1218 | 1219 | def get_url_file(profile, cluster_id): 1220 | 1221 | url_file = "ipcontroller-{0}-client.json".format(cluster_id) 1222 | 1223 | if os.path.isdir(profile) and os.path.isabs(profile): 1224 | # Return the full path if one is given 1225 | return os.path.join(profile, "security", url_file) 1226 | 1227 | return os.path.join(locate_profile(profile), "security", url_file) 1228 | 1229 | def delete_profile(profile): 1230 | MAX_TRIES = 10 1231 | dir_to_remove = locate_profile(profile) 1232 | if os.path.exists(dir_to_remove): 1233 | num_tries = 0 1234 | while True: 1235 | try: 1236 | shutil.rmtree(dir_to_remove) 1237 | break 1238 | except OSError: 1239 | if num_tries > MAX_TRIES: 1240 | raise 1241 | time.sleep(5) 1242 | num_tries += 1 1243 | else: 1244 | raise ValueError("Cannot find {0} to remove, " 1245 | "something is wrong.".format(dir_to_remove)) 1246 | -------------------------------------------------------------------------------- /cluster_helper/log/__init__.py: -------------------------------------------------------------------------------- 1 | """Utility functionality for logging. 2 | """ 3 | import os 4 | import logging 5 | from cluster_helper import utils 6 | 7 | LOG_NAME = "cluster_helper" 8 | 9 | logger = logging.getLogger(LOG_NAME) 10 | 11 | def setup_logging(log_dir): 12 | logger.setLevel(logging.INFO) 13 | if not logger.handlers: 14 | formatter = logging.Formatter('[%(asctime)s] %(message)s') 15 | handler = logging.StreamHandler() 16 | handler.setFormatter(formatter) 17 | logger.addHandler(handler) 18 | if log_dir: 19 | logfile = os.path.join(utils.safe_makedir(log_dir), 20 | "{0}.log".format(LOG_NAME)) 21 | handler = logging.FileHandler(logfile) 22 | handler.setFormatter(formatter) 23 | logger.addHandler(handler) 24 | -------------------------------------------------------------------------------- /cluster_helper/lsf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import fnmatch 4 | 5 | import six 6 | 7 | from .
import utils 8 | 9 | LSB_PARAMS_FILENAME = "lsb.params" 10 | LSF_CONF_FILENAME = "lsf.conf" 11 | LSF_CONF_ENV = ["LSF_CONFDIR", "LSF_ENVDIR"] 12 | DEFAULT_LSF_UNITS = "KB" 13 | DEFAULT_RESOURCE_UNITS = "MB" 14 | 15 | def find(basedir, string): 16 | """ 17 | walk basedir and return all files matching string 18 | """ 19 | matches = [] 20 | for root, dirnames, filenames in os.walk(basedir): 21 | for filename in fnmatch.filter(filenames, string): 22 | matches.append(os.path.join(root, filename)) 23 | return matches 24 | 25 | def find_first_match(basedir, string): 26 | """ 27 | return the first file that matches string starting from basedir 28 | """ 29 | matches = find(basedir, string) 30 | return matches[0] if matches else matches 31 | 32 | def get_conf_file(filename, env): 33 | conf_path = os.environ.get(env) 34 | if not conf_path: 35 | return None 36 | conf_file = find_first_match(conf_path, filename) 37 | return conf_file 38 | 39 | def apply_conf_file(fn, conf_filename): 40 | for env in LSF_CONF_ENV: 41 | conf_file = get_conf_file(conf_filename, env) 42 | if conf_file: 43 | with open(conf_file) as conf_handle: 44 | value = fn(conf_handle) 45 | if value: 46 | return value 47 | return None 48 | 49 | def per_core_reserve_from_stream(stream): 50 | for k, v in tokenize_conf_stream(stream): 51 | if k == "RESOURCE_RESERVE_PER_SLOT": 52 | return v.upper() 53 | return None 54 | 55 | def get_lsf_units_from_stream(stream): 56 | for k, v in tokenize_conf_stream(stream): 57 | if k == "LSF_UNIT_FOR_LIMITS": 58 | return v 59 | return None 60 | 61 | def tokenize_conf_stream(conf_handle): 62 | """ 63 | convert the key=val pairs in a LSF config stream to tuples of tokens 64 | """ 65 | for line in conf_handle: 66 | if line.startswith("#"): 67 | continue 68 | tokens = line.split("=") 69 | if len(tokens) != 2: 70 | continue 71 | yield (tokens[0].strip(), tokens[1].strip()) 72 | 73 | def apply_bparams(fn): 74 | """ 75 | apply fn to the output lines of bparams, returning the result 76 | """ 77 | cmd = ["bparams", "-a"] 78 | try: 79 | output = subprocess.check_output(cmd) 80 | except (OSError, subprocess.CalledProcessError): 81 | return None 82 | if six.PY3: 83 | output = output.decode('ascii', 'ignore') 84 | return fn(output.split("\n")) 85 | 86 | def apply_lsadmin(fn): 87 | """ 88 | apply fn to the output lines of lsadmin, returning the result 89 | """ 90 | cmd = ["lsadmin", "showconf", "lim"] 91 | try: 92 | output = subprocess.check_output(cmd) 93 | except (OSError, subprocess.CalledProcessError): 94 | return None 95 | if six.PY3: 96 | output = output.decode('ascii', 'ignore') 97 | return fn(output.split("\n")) 98 | 99 | 100 | def get_lsf_units(resource=False): 101 | """ 102 | check for LSF_UNIT_FOR_LIMITS in bparams, lsadmin and the lsf.conf file, 103 | preferring the value from bparams, then lsadmin, then the lsf.conf file 104 | """ 105 | lsf_units = apply_bparams(get_lsf_units_from_stream) 106 | if lsf_units: 107 | return lsf_units 108 | 109 | lsf_units = apply_lsadmin(get_lsf_units_from_stream) 110 | if lsf_units: 111 | return lsf_units 112 | 113 | lsf_units = apply_conf_file(get_lsf_units_from_stream, LSF_CONF_FILENAME) 114 | if lsf_units: 115 | return lsf_units 116 | 117 | # -R usage units are in MB, not KB by default 118 | if resource: 119 | return DEFAULT_RESOURCE_UNITS 120 | else: 121 | return DEFAULT_LSF_UNITS 122 | 123 | def parse_memory(mem): 124 | """ 125 | convert a memory request in GB to the configured LSF resource units 126 | """ 127 | lsf_unit = get_lsf_units(resource=True) 128 | return utils.convert_mb(float(mem) * 1024, lsf_unit) 129 | 130 | 131 | def per_core_reservation(): 132 | """ 133 | returns True if the cluster is configured
for reservations to be per core, 134 | False if it is per job 135 | """ 136 | per_core = apply_bparams(per_core_reserve_from_stream) 137 | if per_core: 138 | if per_core.upper() == "Y": 139 | return True 140 | else: 141 | return False 142 | 143 | per_core = apply_lsadmin(per_core_reserve_from_stream) 144 | if per_core: 145 | if per_core.upper() == "Y": 146 | return True 147 | else: 148 | return False 149 | 150 | per_core = apply_conf_file(per_core_reserve_from_stream, LSB_PARAMS_FILENAME) 151 | if per_core and per_core.upper() == "Y": 152 | return True 153 | else: 154 | return False 155 | 156 | 157 | if __name__ == "__main__": 158 | print(get_lsf_units()) 159 | print(per_core_reservation()) 160 | -------------------------------------------------------------------------------- /cluster_helper/slurm.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import six 3 | 4 | DEFAULT_TIME = "1-00:00:00" 5 | 6 | class SlurmTime(object): 7 | """ 8 | parse a Slurm time limit of the format days-hours, days-hours:minutes, 9 | days-hours:minutes:seconds, minutes, hours:minutes, 10 | hours:minutes:seconds. assume 'infinite' means 365 days 11 | """ 12 | def __init__(self, time): 13 | self.days = 0 14 | self.hours = 0 15 | self.minutes = 0 16 | self.seconds = 0 17 | self._parse_time(time) 18 | 19 | def _parse_time(self, time): 20 | if time == "infinite": 21 | self.days = 365 22 | else: 23 | day_tokens = time.split("-") 24 | if len(day_tokens) == 2: 25 | self.days = int(day_tokens[0]) 26 | time = day_tokens[1] 27 | time_tokens = time.split(":") 28 | if len(time_tokens) == 3: 29 | self.hours, self.minutes, self.seconds = map(int, time_tokens) 30 | elif len(time_tokens) == 2: 31 | self.hours, self.minutes = map(int, time_tokens) 32 | elif self.days: 33 | self.hours = int(time_tokens[0]) 34 | else: 35 | self.minutes = int(time_tokens[0]) 36 | 37 | # For Python 3 38 | def __lt__(self, slurmtime): 39 | for k in ['days', 'hours', 'minutes', 'seconds']: 40 | v1 = getattr(self, k) 41 | v2 = getattr(slurmtime, k) 42 | if v1 != v2: 43 | return v1 < v2 44 | return False 45 | 46 | # For Python 2 47 | def __cmp__(self, slurmtime): 48 | if self < slurmtime: 49 | return -1 50 | else: 51 | return 1 if slurmtime < self else 0 52 | 53 | def __repr__(self): 54 | return "%02d-%02d:%02d:%02d" % (self.days, self.hours, self.minutes, self.seconds) 55 | 56 | def get_accounts(user): 57 | """ 58 | get all accounts a user can use to submit a job 59 | """ 60 | cmd = "sshare -p --noheader" 61 | out = subprocess.check_output(cmd, shell=True) 62 | if six.PY3: 63 | out = out.decode('ascii', 'ignore') 64 | accounts = [] 65 | has_accounts = False 66 | for line in out.splitlines(): 67 | line = line.split('|') 68 | account = line[0].strip() 69 | account_user = line[1].strip() 70 | if account and account_user == user: 71 | has_accounts = True 72 | accounts.append(account) 73 | 74 | if not has_accounts: 75 | return None 76 | accounts = set(accounts) 77 | try: 78 | assert not set(['']).issuperset(accounts) 79 | except AssertionError: 80 | raise ValueError("No usable accounts were found for user \'{0}\'".format(user)) 81 | 82 | return accounts 83 | 84 | def get_user(): 85 | out = subprocess.check_output("whoami", shell=True) 86 | if six.PY3: 87 | out = out.decode('ascii', 'ignore') 88 | return out.strip() 89 | 90 | def accounts_with_access(queue): 91 | cmd = "sinfo --noheader -p {0} -o %g".format(queue) 92 | out = subprocess.check_output(cmd, shell=True) 93 | if six.PY3: 94 | out = out.decode('ascii',
'ignore') 95 | return set([x.strip() for x in out.split(",")]) 96 | 97 | def get_max_timelimit(queue): 98 | cmd = "sinfo --noheader -p {0}".format(queue) 99 | out = subprocess.check_output(cmd, shell=True) 100 | if six.PY3: 101 | out = out.decode('ascii', 'ignore') 102 | max_limit = out.split()[2].strip() 103 | time_limit = None if max_limit == "infinite" else max_limit 104 | return time_limit 105 | 106 | def get_account_for_queue(queue): 107 | """ 108 | selects all the accounts this user has access to that can submit to 109 | the queue and returns the first one 110 | """ 111 | user = get_user() 112 | possible_accounts = get_accounts(user) 113 | if not possible_accounts: 114 | return None 115 | queue_accounts = accounts_with_access(queue) 116 | try: 117 | assert not set(['']).issuperset(queue_accounts) 118 | except AssertionError: 119 | raise ValueError("No accounts accessible by user \'{0}\' have access to queue \'{1}\'. Accounts found were: {2}".format( 120 | user, queue, ", ".join(possible_accounts))) 121 | accts = list(possible_accounts.intersection(queue_accounts)) 122 | if "all" not in queue_accounts and len(accts) > 0: 123 | return accts[0] 124 | else: 125 | return possible_accounts.pop() 126 | 127 | def set_timelimit(slurm_atrs): 128 | """ 129 | timelimit can be specified as timelimit, t or time. remap 130 | t and time to timelimit, preferring timelimit 131 | """ 132 | if "timelimit" in slurm_atrs: 133 | return slurm_atrs 134 | if "time" in slurm_atrs: 135 | slurm_atrs["timelimit"] = slurm_atrs["time"] 136 | del slurm_atrs["time"] 137 | return slurm_atrs 138 | if "t" in slurm_atrs: 139 | slurm_atrs["timelimit"] = slurm_atrs["t"] 140 | del slurm_atrs["t"] 141 | return slurm_atrs 142 | slurm_atrs["timelimit"] = DEFAULT_TIME 143 | return slurm_atrs 144 | 145 | def get_slurm_attributes(queue, resources): 146 | slurm_atrs = {} 147 | # specially handled resource specifications 148 | special_resources = set(["machines", "account", "timelimit"]) 149 | if resources: 150 | for parm in resources.split(";"): 151 | k, v = [a.strip() for a in parm.split('=', 1)] 152 | slurm_atrs[k] = v 153 | if "account" not in slurm_atrs: 154 | account = get_account_for_queue(queue) 155 | if account: 156 | slurm_atrs["account"] = account 157 | slurm_atrs = set_timelimit(slurm_atrs) 158 | max_limit = get_max_timelimit(queue) 159 | if max_limit: 160 | slurm_atrs["timelimit"] = min(SlurmTime(slurm_atrs["timelimit"]), SlurmTime(max_limit)) 161 | 162 | # reconstitute any general attributes to pass along to slurm 163 | out_resources = [] 164 | for k in slurm_atrs: 165 | if k not in special_resources: 166 | out_resources.append("%s=%s" % (k, slurm_atrs[k])) 167 | return ";".join(out_resources), slurm_atrs 168 | -------------------------------------------------------------------------------- /cluster_helper/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import time 4 | 5 | def convert_mb(mb, unit): 6 | UNITS = {"B": -2, 7 | "KB": -1, 8 | "MB": 0, 9 | "GB": 1, 10 | "TB": 2} 11 | assert unit in UNITS, ("%s not a valid unit, valid units are %s." 12 | % (unit, list(UNITS.keys()))) 13 | return int(float(mb) / float(math.pow(1024, UNITS[unit]))) 14 | 15 | def safe_makedir(dname): 16 | """Make a directory if it doesn't exist, handling concurrent race conditions.
17 | """ 18 | if not dname: 19 | return dname 20 | num_tries = 0 21 | max_tries = 5 22 | while not os.path.exists(dname): 23 | # we could get an error here if multiple processes are creating 24 | # the directory at the same time. Grr, concurrency. 25 | try: 26 | os.makedirs(dname) 27 | except OSError: 28 | if num_tries > max_tries: 29 | raise 30 | num_tries += 1 31 | time.sleep(2) 32 | return dname 33 | -------------------------------------------------------------------------------- /example/example.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import time 3 | import imp 4 | import sys 5 | from ipyparallel import require 6 | from cluster_helper.cluster import cluster_view 7 | 8 | 9 | def long_computation(x, y, z): 10 | import time 11 | import socket 12 | time.sleep(1) 13 | return (socket.gethostname(), x + y + z) 14 | 15 | 16 | @require("cluster_helper") 17 | def require_test(x): 18 | return True 19 | 20 | 21 | def context_test(x): 22 | return True 23 | 24 | if __name__ == "__main__": 25 | parser = argparse.ArgumentParser( 26 | description="example script for doing parallel work with IPython.") 27 | parser.add_argument( 28 | "--scheduler", dest='scheduler', default="", 29 | help="scheduler to use (lsf, sge, torque, slurm, or pbs)") 30 | parser.add_argument("--queue", dest='queue', default="", 31 | help="queue to use on scheduler.") 32 | parser.add_argument("--local_controller", dest='local_controller', default=False, 33 | help="run controller locally", action="store_true") 34 | parser.add_argument("--num_jobs", dest='num_jobs', default=3, 35 | type=int, help="number of jobs to run in parallel.") 36 | parser.add_argument("--cores_per_job", dest="cores_per_job", default=1, 37 | type=int, help="number of cores for each job.") 38 | parser.add_argument("--profile", dest="profile", default=None, 39 | help="Optional profile to test.") 40 | parser.add_argument( 41 | "--resources", dest="resources", default=None, 42 | help=("Native specification flags to the scheduler, pass " 43 | "multiple flags using ; as a separator.")) 44 | parser.add_argument("--timeout", dest="timeout", default=15, 45 | help="Time (in minutes) to wait before timing out.") 46 | parser.add_argument("--memory", dest="mem", default=1, 47 | help="Memory in GB to reserve.") 48 | parser.add_argument( 49 | "--local", dest="local", default=False, action="store_true", help="run locally without a scheduler.") 50 | 51 | args = parser.parse_args() 52 | args.resources = {'resources': args.resources, 53 | 'mem': args.mem, 54 | 'local_controller': args.local_controller} 55 | if args.local: 56 | args.resources["run_local"] = True 57 | 58 | if not (args.local or (args.scheduler and args.queue)): 59 | print("Please specify --local to run locally or a scheduler and queue " 60 | "to run on with --scheduler and --queue.") 61 | sys.exit(1) 62 | 63 | with cluster_view(args.scheduler, args.queue, args.num_jobs, 64 | cores_per_job=args.cores_per_job, 65 | start_wait=args.timeout, profile=args.profile, 66 | extra_params=args.resources) as view: 67 | print("First check to see if we can talk to the engines.") 68 | results = view.map(lambda x: "hello world!", range(5)) 69 | print("This long computation that waits for a second before " 70 | "returning takes a while to run serially...") 71 | start_time = time.time() 72 | results = list(map(long_computation, range(20), range(20, 40), range(40, 60))) 73 | print(results) 74 | print("That took {} seconds.".format(time.time() - start_time)) 75 | print("Running it in parallel goes much
faster...") 76 | start_time = time.time() 77 | results = list(view.map(long_computation, range(20), range(20, 40), range(40, 60))) 78 | print(results) 79 | print("That took {} seconds.".format(time.time() - start_time)) 80 | 81 | try: 82 | imp.find_module('dill') 83 | found = True 84 | except ImportError: 85 | found = False 86 | 87 | if found: 88 | def make_closure(a): 89 | """make a function with a closure, and return it""" 90 | def has_closure(b): 91 | return a * b 92 | return has_closure 93 | closure = make_closure(5) 94 | print("With dill installed, we can pickle closures!") 95 | print(closure) 96 | print(view.map(closure, [3])) 97 | 98 | with open("test", "w") as test_handle: 99 | print("Does context break it?") 100 | print(view.map(context_test, [3])) 101 | 102 | print("Does context break it with a closure?") 103 | print(view.map(closure, [3])) 104 | 105 | print("But wrapping functions with @require is broken.") 106 | print(view.map(require_test, [3])) 107 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | setuptools>=18.5 2 | pyzmq>=2.1.11 3 | ipython<6.0.0 4 | ipyparallel>=6.0.2 5 | netifaces>=0.10.3 6 | six>=1.10.0 7 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | universal = 1 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import subprocess 3 | import sys 4 | from setuptools import setup, find_packages 5 | from io import open 6 | 7 | def parse_requirements(fn): 8 | """ load requirements from a pip requirements file """ 9 | return [line.strip() for line in open(fn) if line.strip() and not line.startswith("#")] 10 | 11 | reqs = parse_requirements("requirements.txt") 12 | 13 | setup(name="ipython-cluster-helper", 14 | version="0.6.4", 15 | author="Rory Kirchner", 16 | author_email="rory.kirchner@gmail.com", 17 | description="Simplify IPython cluster start up and use for " 18 | "multiple schedulers.", 19 | long_description=(open('README.rst', encoding='utf-8').read()), 20 | license="MIT", 21 | zip_safe=False, 22 | url="https://github.com/roryk/ipython-cluster-helper", 23 | packages=find_packages(), 24 | install_requires=reqs) 25 | --------------------------------------------------------------------------------
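
As a usage note: example/example.py above drives everything through the cluster_view context manager. Where the view needs to outlive a single with block, a minimal sketch of using the ClusterView class from cluster_helper/cluster.py directly is shown below. The scheduler, queue, and resource values are placeholders that would need to match your site, and the shape of extra_params follows the dictionary built in example/example.py.

from cluster_helper.cluster import ClusterView

# Placeholder scheduler/queue/resource values; substitute ones valid on your cluster.
cluster = ClusterView(scheduler="slurm", queue="general", num_jobs=2,
                      cores_per_job=1, wait_for_all_engines=True,
                      extra_params={"resources": "timelimit=04:00:00", "mem": 2})
try:
    # cluster.view is a blocking, load-balanced view once the engines are up.
    print(cluster.view.map(lambda x: x * 2, range(10)))
finally:
    # Shut down the engines and controller and remove any throwaway profile.
    cluster.stop()

The try/finally mirrors what the cluster_view context manager does internally, so the scheduler jobs are cleaned up even if the mapped work raises.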