├── README.md ├── SECURITY.md ├── cancelJobFcn.m ├── cancelTaskFcn.m ├── communicatingJobWrapper.sh ├── communicatingJobWrapperSmpd.sh ├── communicatingSubmitFcn.m ├── deleteJobFcn.m ├── deleteTaskFcn.m ├── discover ├── example.conf └── runDiscovery.sh ├── getJobStateFcn.m ├── independentJobWrapper.sh ├── independentSubmitFcn.m ├── license.txt ├── postConstructFcn.m └── private ├── cancelJobOnCluster.m ├── cancelTaskOnCluster.m ├── createEnvironmentWrapper.m ├── createSubmitScript.m ├── extractJobId.m ├── getCommonSubmitArgs.m ├── getRemoteConnection.m ├── getSimplifiedSchedulerIDsForJob.m ├── getSubmitString.m ├── runSchedulerCommand.m └── validatedPropValue.m /README.md: -------------------------------------------------------------------------------- 1 | # Parallel Computing Toolbox plugin for MATLAB Parallel Server with Slurm 2 | 3 | [![View Parallel Computing Toolbox Plugin for Slurm on File Exchange](https://www.mathworks.com/matlabcentral/images/matlab-file-exchange.svg)](https://www.mathworks.com/matlabcentral/fileexchange/127364-parallel-computing-toolbox-plugin-for-slurm) 4 | 5 | Parallel Computing Toolbox™ provides the `Generic` cluster type for submitting MATLAB® jobs to a cluster running a third-party scheduler. 6 | The `Generic` cluster type uses a set of plugin scripts to define how your machine communicates with your scheduler. 7 | You can customize the plugin scripts to configure how MATLAB interacts with the scheduler to best suit your cluster setup and support custom submission options. 8 | 9 | This repository contains MATLAB code files and shell scripts that you can use to submit jobs from a MATLAB or Simulink session running on Windows®, Linux®, or macOS operating systems to a Slurm® scheduler running on Linux. 10 | 11 | ## Products Required 12 | 13 | - [MATLAB](https://www.mathworks.com/products/matlab.html) and [Parallel Computing Toolbox](https://www.mathworks.com/products/parallel-computing.html), R2017a or newer, installed on your computer. 14 | Refer to the documentation for [how to install MATLAB and toolboxes](https://www.mathworks.com/help/install/index.html) on your computer. 15 | - [MATLAB Parallel Server™](https://www.mathworks.com/products/matlab-parallel-server.html) installed on the cluster. 16 | Refer to the documentation for [how to install MATLAB Parallel Server](https://www.mathworks.com/help/matlab-parallel-server/integrate-matlab-with-third-party-schedulers.html) on your cluster. 17 | The cluster administrator normally does this step. 18 | - [Slurm](https://slurm.schedmd.com/) running on the cluster. 19 | 20 | ## Setup Instructions 21 | 22 | ### Download or Clone this Repository 23 | 24 | To download a zip archive of this repository, at the top of this repository page, select **Code > Download ZIP**. 25 | Alternatively, to clone this repository to your computer with Git software installed, enter this command at your system's command line: 26 | ``` 27 | git clone https://github.com/mathworks/matlab-parallel-slurm-plugin 28 | ``` 29 | You can execute a system command from the MATLAB command prompt by adding `!` before the command. 30 | 31 | ### Cluster Discovery 32 | 33 | Since version R2023a, MATLAB can discover clusters running third-party schedulers such as Slurm. 34 | As a cluster admin, you can create a configuration file that describes how to configure the Parallel Computing Toolbox on the user's machine to submit MATLAB jobs to the cluster. 
35 | The cluster configuration file is a plain text file with the extension `.conf` containing key-value pairs that describe the cluster configuration information. 36 | The MATLAB client will use the cluster configuration file to create a cluster profile for the user who discovers the cluster. 37 | Therefore, users will not need to follow the instructions in the sections below. 38 | You can find an example of a cluster configuration file in [discover/example.conf](discover/example.conf). 39 | For full details on how to make a cluster running a third-party scheduler discoverable, see the documentation for [Configure for Third-Party Scheduler Cluster Discovery](https://www.mathworks.com/help/matlab-parallel-server/configure-for-cluster-discovery.html). 40 | 41 | ### Create a Cluster Profile in MATLAB 42 | 43 | Create a cluster profile by using either the Cluster Profile Manager or the MATLAB Command Window. 44 | 45 | To open the Cluster Profile Manager, on the **Home** tab, in the **Environment** section, select **Parallel > Create and Manage Clusters**. 46 | In the Cluster Profile Manager, select **Add Cluster Profile > Generic** from the menu to create a new `Generic` cluster profile. 47 | 48 | Alternatively, create a new `Generic` cluster object by entering this command in the MATLAB Command Window: 49 | ```matlab 50 | c = parallel.cluster.Generic; 51 | ``` 52 | 53 | ### Configure Cluster Properties 54 | 55 | This table lists the properties that you must specify to configure the `Generic` cluster profile. 56 | For a full list of cluster properties, see the documentation for [`parallel.Cluster`](https://www.mathworks.com/help/parallel-computing/parallel.cluster.html). 57 | 58 | **Property** | **Description** 59 | ------------------------|---------------- 60 | `JobStorageLocation` | Folder in which your machine stores job data. 61 | `NumWorkers` | Number of workers your license allows. 62 | `ClusterMatlabRoot` | Full path to the MATLAB install folder on the cluster. 63 | `OperatingSystem` | Cluster operating system. 64 | `HasSharedFilesystem` | Indication of whether you have a shared file system. Set this property to `true` if a disk location is accessible to your machine and the workers on the cluster. Set this property to `false` if you do not have a shared file system. 65 | `PluginScriptsLocation` | Full path to the plugin script folder that contains this README. If using R2019a or earlier, this property is called IntegrationScriptsLocation. 66 | 67 | In the Cluster Profile Manager, set each property value. 68 | Alternatively, at the command line, set properties using dot notation: 69 | ```matlab 70 | c.JobStorageLocation = 'C:\MatlabJobs'; 71 | ``` 72 | 73 | At the command line, you can also set properties when you create the `Generic` cluster object by using name-value arguments: 74 | ```matlab 75 | c = parallel.cluster.Generic( ... 76 | 'JobStorageLocation', 'C:\MatlabJobs', ... 77 | 'NumWorkers', 20, ... 78 | 'ClusterMatlabRoot', '/usr/local/MATLAB/R2022a', ... 79 | 'OperatingSystem', 'unix', ... 80 | 'HasSharedFilesystem', true, ... 81 | 'PluginScriptsLocation', 'C:\MatlabSlurmPlugin\shared'); 82 | ``` 83 | 84 | To submit from a Windows machine to a Linux cluster, specify `JobStorageLocation` as a structure with the fields `windows` and `unix`. 85 | The fields are the Windows and Unix paths corresponding to the folder in which your machine stores job data. 
86 | For example, if the folder `\\organization\matlabjobs\jobstorage` on Windows corresponds to `/organization/matlabjobs/jobstorage` on the Unix cluster: 87 | ```matlab 88 | struct('windows', '\\organization\matlabjobs\jobstorage', 'unix', '/organization/matlabjobs/jobstorage') 89 | ``` 90 | If you have your `M:` drive mapped to `\\organization\matlabjobs`, set `JobStorageLocation` to: 91 | ```matlab 92 | struct('windows', 'M:\jobstorage', 'unix', '/organization/matlabjobs/jobstorage') 93 | ``` 94 | 95 | You can use `AdditionalProperties` to modify the behaviour of the `Generic` cluster without editing the plugin scripts. 96 | For a full list of the `AdditionalProperties` supported by the plugin scripts in this repository, see [Customize Behavior of Sample Plugin Scripts](https://www.mathworks.com/help/matlab-parallel-server/customize-behavior-of-sample-plugin-scripts.html). 97 | By modifying the plugins, you can add support for your own custom `AdditionalProperties`. 98 | 99 | #### Connect to a Remote Cluster 100 | 101 | To manage work on the cluster, MATLAB calls the Slurm command line utilities. 102 | For example, the `sbatch` command to submit work and `squeue` to query the state of submitted jobs. 103 | If your MATLAB session is running on a machine with the scheduler utilities available, the plugin scripts can call the utilities on the command line. 104 | Scheduler utilities are typically available if your MATLAB session is running on the Slurm cluster to which you want to submit. 105 | 106 | If MATLAB cannot directly access the scheduler utilities on the command line, the plugin scripts create an SSH session to the cluster and run scheduler commands over that connection. 107 | To configure your cluster to submit scheduler commands via SSH, set the `ClusterHost` field of `AdditionalProperties` to the name of the cluster node to which MATLAB connects via SSH. 108 | As MATLAB will run scheduler utilities such as `sbatch` and `squeue`, select the cluster head node or login node. 109 | 110 | In the Cluster Profile Manager, add new `AdditionalProperties` by clicking **Add** under the table corresponding to `AdditionalProperties`. 111 | In the Command Window, use dot notation to add new fields. 112 | For example, if MATLAB should connect to `'slurm01.organization.com'` to submit jobs, set: 113 | ```matlab 114 | c.AdditionalProperties.ClusterHost = 'slurm01.organization.com'; 115 | ``` 116 | 117 | Use this option to connect to a remote cluster to submit jobs from a MATLAB session on a Windows computer to a Linux Slurm cluster on the same network. 118 | Your Windows machine creates an SSH session to the cluster head node to access the Slurm utilities and uses a shared network folder to store job data files. 119 | 120 | If your MATLAB session is running on a compute node of the cluster to which you want to submit work, you can use this option to create an SSH session back to the cluster head node and submit more jobs. 121 | 122 | #### Run Jobs on a Remote Cluster Without a Shared File System 123 | 124 | MATLAB uses files on disk to send tasks to the Parallel Server workers and fetch their results. 125 | This is most effective when the disk location is accessible to your machine and the workers on the cluster. 126 | Your computer can communicate with the workers by reading and writing to this shared file system. 
127 | 
128 | If you do not have a shared file system, MATLAB uses SSH to submit commands to the scheduler and SFTP (SSH File Transfer Protocol) to copy job and task files between your computer and the cluster.
129 | To configure your cluster to move files between the client and the cluster with SFTP, set the `RemoteJobStorageLocation` field of `AdditionalProperties` to a folder on the cluster that the workers can access.
130 | 
131 | Transferring large data files (for example, hundreds of MB) over the SFTP connection can add a noticeable overhead to job submission and fetching results.
132 | For optimal performance, use a shared file system if one is available.
133 | The workers require access to a shared file system, even if your computer cannot access it.
134 | 
135 | ### Save New Profile
136 | 
137 | In the Cluster Profile Manager, click **Done**.
138 | Alternatively, in the Command Window, enter the command:
139 | ```matlab
140 | saveAsProfile(c, 'mySLURMCluster');
141 | ```
142 | Your cluster profile is now ready to use.
143 | 
144 | ### Validate Cluster Profile
145 | 
146 | Cluster validation submits one job of each type to test whether the cluster profile is configured correctly.
147 | In the Cluster Profile Manager, click **Validate**.
148 | If you make a change to a cluster profile, run cluster validation to ensure your changes have introduced no errors.
149 | You do not need to validate the profile each time you use it or each time you start MATLAB.
150 | 
151 | ## Examples
152 | 
153 | Create a cluster object using your profile:
154 | ```matlab
155 | c = parcluster("mySLURMCluster")
156 | ```
157 | 
158 | ### Submit Work for Batch Processing
159 | 
160 | The `batch` command runs a MATLAB script or function on a worker on the cluster.
161 | For more information about batch processing, see the documentation for [`batch`](https://www.mathworks.com/help/parallel-computing/batch.html).
162 | 
163 | ```matlab
164 | % Create a job and submit it to the cluster.
165 | job = batch( ...
166 | c, ... % Cluster object created using parcluster
167 | @sqrt, ... % Function or script to run
168 | 1, ... % Number of output arguments
169 | {[64 100]}); % Input arguments
170 | 
171 | % Your MATLAB session is now available to do other work, such
172 | % as create and submit more jobs to the cluster. You can also
173 | % shut down your MATLAB session and come back later. The work
174 | % continues running on the cluster. After you recreate the
175 | % cluster object using parcluster, you can view existing jobs
176 | % using the Jobs property of the cluster object.
177 | 
178 | % Wait for the job to complete. If the job is already complete,
179 | % the wait function will return immediately.
180 | wait(job);
181 | 
182 | % Retrieve the output arguments for each task. For this example,
183 | % the output is a 1x1 cell array containing the vector [8 10].
184 | results = fetchOutputs(job)
185 | ```
186 | 
187 | ### Open Parallel Pool
188 | 
189 | A parallel pool (parpool) is a group of MATLAB workers on which you can interactively run work.
190 | When you run the `parpool` command, MATLAB submits a special job to the cluster to start the workers.
191 | Once the workers start, your MATLAB session connects to them.
192 | Depending on your organization's network configuration, parallel pools might not work; for example, the network might not permit connections to a program running on a compute node.
193 | For more information about parpools, see the documentation for [`parpool`](https://www.mathworks.com/help/parallel-computing/parpool.html). 194 | 195 | ```matlab 196 | % Open a parallel pool on the cluster. This command returns a 197 | % pool object once the pool is opened. 198 | pool = parpool(c); 199 | 200 | % List the hosts on which the workers are running. For a small pool, 201 | % all the workers are typically on the same machine. For a large 202 | % pool, the workers are usually spread over multiple nodes. 203 | future = parfevalOnAll(pool, @getenv, 1, 'HOST') 204 | wait(future); 205 | fetchOutputs(future) 206 | 207 | % Output the numbers 1 to 10 in a parallel for loop. Unlike a 208 | % regular for loop, the software does not execute iterations 209 | % of the loop in order. 210 | parfor idx = 1:10 211 | disp(idx) 212 | end 213 | 214 | % Use the pool to calculate the first 500 magic squares. 215 | parfor idx = 1:500 216 | magicSquare{idx} = magic(idx); 217 | end 218 | ``` 219 | 220 | ## License 221 | 222 | The license is available in the [license.txt](license.txt) file in this repository. 223 | 224 | ## Community Support 225 | 226 | [MATLAB Central](https://www.mathworks.com/matlabcentral) 227 | 228 | ## Technical Support 229 | 230 | If you require assistance or have a request for additional features or capabilities, please contact [MathWorks Technical Support](https://www.mathworks.com/support/contact_us.html). 231 | 232 | Copyright 2022-2023 The MathWorks, Inc. 233 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Reporting Security Vulnerabilities 2 | 3 | If you believe you have discovered a security vulnerability, please report it to 4 | [security@mathworks.com](mailto:security@mathworks.com). Please see 5 | [MathWorks Vulnerability Disclosure Policy for Security Researchers](https://www.mathworks.com/company/aboutus/policies_statements/vulnerability-disclosure-policy.html) 6 | for additional information. 7 | -------------------------------------------------------------------------------- /cancelJobFcn.m: -------------------------------------------------------------------------------- 1 | function OK = cancelJobFcn(cluster, job) 2 | %CANCELJOBFCN Cancels a job on Slurm 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you cancel a job. 6 | 7 | % Copyright 2010-2023 The MathWorks, Inc. 8 | 9 | OK = cancelJobOnCluster(cluster, job); 10 | 11 | end 12 | -------------------------------------------------------------------------------- /cancelTaskFcn.m: -------------------------------------------------------------------------------- 1 | function OK = cancelTaskFcn(cluster, task) 2 | %CANCELTASKFCN Cancels a task on Slurm 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you cancel a task. 6 | 7 | % Copyright 2020-2023 The MathWorks, Inc. 8 | 9 | OK = cancelTaskOnCluster(cluster, task); 10 | 11 | end 12 | -------------------------------------------------------------------------------- /communicatingJobWrapper.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | # This wrapper script is intended to be submitted to Slurm to support 3 | # communicating jobs. 
4 | # 5 | # This script uses the following environment variables set by the submit MATLAB code: 6 | # PARALLEL_SERVER_CMR - the value of ClusterMatlabRoot (may be empty) 7 | # PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use 8 | # PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use 9 | # PARALLEL_SERVER_TOTAL_TASKS - total number of workers to start 10 | # PARALLEL_SERVER_NUM_THREADS - number of cores needed per worker 11 | # PARALLEL_SERVER_DEBUG - used to debug problems on the cluster 12 | # 13 | # The following environment variables are forwarded through mpiexec: 14 | # PARALLEL_SERVER_DECODE_FUNCTION - the decode function to use 15 | # PARALLEL_SERVER_STORAGE_LOCATION - used by decode function 16 | # PARALLEL_SERVER_STORAGE_CONSTRUCTOR - used by decode function 17 | # PARALLEL_SERVER_JOB_LOCATION - used by decode function 18 | # 19 | # The following environment variables are set by Slurm: 20 | # SLURM_NODELIST - list of hostnames allocated to this Slurm job 21 | 22 | # Copyright 2015-2025 The MathWorks, Inc. 23 | 24 | # If PARALLEL_SERVER_ environment variables are not set, assign any 25 | # available values with form MDCE_ for backwards compatibility 26 | PARALLEL_SERVER_CMR=${PARALLEL_SERVER_CMR:="${MDCE_CMR}"} 27 | PARALLEL_SERVER_MATLAB_EXE=${PARALLEL_SERVER_MATLAB_EXE:="${MDCE_MATLAB_EXE}"} 28 | PARALLEL_SERVER_MATLAB_ARGS=${PARALLEL_SERVER_MATLAB_ARGS:="${MDCE_MATLAB_ARGS}"} 29 | PARALLEL_SERVER_TOTAL_TASKS=${PARALLEL_SERVER_TOTAL_TASKS:="${MDCE_TOTAL_TASKS}"} 30 | PARALLEL_SERVER_NUM_THREADS=${PARALLEL_SERVER_NUM_THREADS:="${MDCE_NUM_THREADS}"} 31 | PARALLEL_SERVER_DEBUG=${PARALLEL_SERVER_DEBUG:="${MDCE_DEBUG}"} 32 | 33 | # Other environment variables to forward 34 | PARALLEL_SERVER_GENVLIST="${PARALLEL_SERVER_GENVLIST},HOME,USER" 35 | 36 | # Echo the nodes that the scheduler has allocated to this job: 37 | echo -e "The scheduler has allocated the following nodes to this job:\n${SLURM_NODELIST:?"Node list undefined"}" 38 | 39 | # Create full path to mw_mpiexec if needed. 40 | FULL_MPIEXEC=${PARALLEL_SERVER_CMR:+${PARALLEL_SERVER_CMR}/bin/}mw_mpiexec 41 | 42 | # Label stdout/stderr with the rank of the process 43 | MPI_VERBOSE=-l 44 | 45 | # Increase the verbosity of mpiexec if PARALLEL_SERVER_DEBUG is set and not false 46 | if [ ! -z "${PARALLEL_SERVER_DEBUG}" ] && [ "${PARALLEL_SERVER_DEBUG}" != "false" ] ; then 47 | MPI_VERBOSE="${MPI_VERBOSE} -v -print-all-exitcodes" 48 | fi 49 | 50 | if [ ! -z "${PARALLEL_SERVER_BIND_TO_CORE}" ] && [ "${PARALLEL_SERVER_BIND_TO_CORE}" != "false" ] ; then 51 | BIND_TO_CORE_ARG="-bind-to core:${PARALLEL_SERVER_NUM_THREADS}" 52 | else 53 | BIND_TO_CORE_ARG="" 54 | fi 55 | 56 | # Construct the command to run. 57 | CMD="\"${FULL_MPIEXEC}\" \ 58 | ${PARALLEL_SERVER_MPIEXEC_ARG} \ 59 | -genvlist ${PARALLEL_SERVER_GENVLIST} \ 60 | ${BIND_TO_CORE_ARG} \ 61 | ${MPI_VERBOSE} \ 62 | -n ${PARALLEL_SERVER_TOTAL_TASKS} \ 63 | \"${PARALLEL_SERVER_MATLAB_EXE}\" \ 64 | ${PARALLEL_SERVER_MATLAB_ARGS}" 65 | 66 | # Echo the command so that it is shown in the output log. 67 | echo $CMD 68 | 69 | # Execute the command. 70 | eval $CMD 71 | 72 | MPIEXEC_EXIT_CODE=${?} 73 | if [ ${MPIEXEC_EXIT_CODE} -eq 42 ] ; then 74 | # Get here if user code errored out within MATLAB. Overwrite this to zero in 75 | # this case. 
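# (A non-zero exit status here would cause Slurm to mark the whole job as
# failed even though MATLAB itself ran; the user-code error is typically
# reported back through the task's Error property instead, so zero is the
# more accurate status to hand to the scheduler.)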
76 | echo "Overwriting MPIEXEC exit code from 42 to zero (42 indicates a user-code failure)"
77 | MPIEXEC_EXIT_CODE=0
78 | fi
79 | echo "Exiting with code: ${MPIEXEC_EXIT_CODE}"
80 | exit ${MPIEXEC_EXIT_CODE}
81 | -------------------------------------------------------------------------------- /communicatingJobWrapperSmpd.sh: --------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # This wrapper script is intended to be submitted to Slurm to support
3 | # communicating jobs.
4 | #
5 | # This script uses the following environment variables set by the submit MATLAB code:
6 | # PARALLEL_SERVER_CMR - the value of ClusterMatlabRoot (might be empty)
7 | # PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
8 | # PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
9 | #
10 | # The following environment variables are forwarded through mpiexec:
11 | # PARALLEL_SERVER_DECODE_FUNCTION - the decode function to use
12 | # PARALLEL_SERVER_STORAGE_LOCATION - used by decode function
13 | # PARALLEL_SERVER_STORAGE_CONSTRUCTOR - used by decode function
14 | # PARALLEL_SERVER_JOB_LOCATION - used by decode function
15 | 
16 | # The following environment variables are set by Slurm:
17 | # SLURM_JOB_ID - id of the Slurm job
18 | # SLURM_JOB_NUM_NODES - number of hosts allocated to Slurm job
19 | # SLURM_JOB_NODELIST - list of hostnames allocated to Slurm job
20 | # SLURM_TASKS_PER_NODE - list containing number of tasks allocated per host to Slurm job
21 | 
22 | # Copyright 2015-2024 The MathWorks, Inc.
23 | 
24 | # If PARALLEL_SERVER_ environment variables are not set, assign any
25 | # available values with form MDCE_ for backwards compatibility
26 | PARALLEL_SERVER_CMR=${PARALLEL_SERVER_CMR:="${MDCE_CMR}"}
27 | PARALLEL_SERVER_MATLAB_EXE=${PARALLEL_SERVER_MATLAB_EXE:="${MDCE_MATLAB_EXE}"}
28 | PARALLEL_SERVER_MATLAB_ARGS=${PARALLEL_SERVER_MATLAB_ARGS:="${MDCE_MATLAB_ARGS}"}
29 | 
30 | # Other environment variables to forward
31 | PARALLEL_SERVER_GENVLIST="${PARALLEL_SERVER_GENVLIST},HOME,USER"
32 | 
33 | # Users of Slurm older than v1.1.34 should uncomment the following code
34 | # to enable mapping from old Slurm environment variables:
35 | 
36 | # SLURM_JOB_ID=${SLURM_JOBID}
37 | # SLURM_JOB_NUM_NODES=${SLURM_NNODES}
38 | # SLURM_JOB_NODELIST=${SLURM_NODELIST}
39 | 
40 | # Create full paths to mw_smpd/mw_mpiexec if needed
41 | FULL_SMPD=${PARALLEL_SERVER_CMR:+${PARALLEL_SERVER_CMR}/bin/}mw_smpd
42 | FULL_MPIEXEC=${PARALLEL_SERVER_CMR:+${PARALLEL_SERVER_CMR}/bin/}mw_mpiexec
43 | 
44 | #########################################################################################
45 | # Work out where we need to launch SMPDs given our hosts file - defines SMPD_HOSTS
46 | chooseSmpdHosts() {
47 | 
48 | # SLURM_JOB_NODELIST is required: the following line either echoes the value, or aborts.
49 | echo Node file: ${SLURM_JOB_NODELIST:?"Node file undefined"}
50 | 
51 | # SMPD_HOSTS is a single-line, space-separated list of hostnames:
52 | # node136 node138 node140 node141 node142 node143 node157
53 | #
54 | # Our source of information is SLURM_JOB_NODELIST in the form:
55 | # node[136,138],node[140-143],node157
56 | #
57 | # 'scontrol show hostname ${SLURM_JOB_NODELIST}' produces multi-line list of hostnames:
58 | # node136
59 | # node138
60 | # node140
61 | # ...
62 | #
63 | # Pipe through "tr" to convert newlines to spaces.
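# This leaves SMPD_HOSTS in the single-line form shown above:
# node136 node138 node140 ...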
64 | 
65 | SMPD_HOSTS=`scontrol show hostname ${SLURM_JOB_NODELIST} | tr '\n', ' '`
66 | }
67 | 
68 | #########################################################################################
69 | # Work out which port to use for SMPD
70 | chooseSmpdPort() {
71 | 
72 | # Extract the numeric part of SLURM_JOB_ID using sed to choose unique port for SMPD to run on.
73 | # Assumes SLURM_JOB_ID starts with a number, such as: 15.slurm-server-host.domain.com
74 | JOB_NUM=`echo ${SLURM_JOB_ID:?"SLURM_JOB_ID undefined"} | sed 's#^\([0-9][0-9]*\).*$#\1#'`
75 | # Base smpd_port on the numeric part of the above
76 | SMPD_PORT=`expr $JOB_NUM % 10000 + 20000`
77 | }
78 | 
79 | #########################################################################################
80 | # Work out how many processes to launch - set MACHINE_ARG
81 | #
82 | # Inputs:
83 | # SLURM_JOB_NUM_NODES Slurm environment variable: Number of nodes allocated to Slurm job
84 | #
85 | # SMPD_HOSTS Space-separated list of hostnames of nodes set by chooseSmpdHosts
86 | #
87 | # SLURM_TASKS_PER_NODE Slurm environment variable: Number of tasks allocated per node.
88 | # If two or more consecutive nodes have the same task count,
89 | # that count is followed by "(x#)" where "#" is the repetition count.
90 | # Output:
91 | # MACHINE_ARG Arguments to pass to mpiexec in the form:
92 | # -hosts num_hosts host1 tasks_on_host1 host2 tasks_on_host2
93 | #
94 | # Example
95 | # -------
96 | # Inputs:
97 | # SLURM_JOB_NUM_NODES 7
98 | # SMPD_HOSTS node136 node138 node140 node141 node142 node143 node157
99 | # SLURM_TASKS_PER_NODE 12(x4),7,9(x2)
100 | # Output:
101 | # -hosts 7 node136 12 node138 12 node140 12 node141 12 node142 7 node143 9 node157 9
102 | #
103 | chooseMachineArg() {
104 | 
105 | # Transform SLURM_TASKS_PER_NODE into TASKS_PER_NODE_LIST
106 | #
107 | # Examples: SLURM_TASKS_PER_NODE -> TASKS_PER_NODE_LIST
108 | # ------- -------------------- -------------------
109 | # Single node has 12 tasks 12 -> 12
110 | # Three nodes have 12 tasks 12(x3) -> 12,12,12
111 | # First two nodes have 7 tasks, the third has 8 tasks 7(x2),8 -> 7,7,8
112 | 
113 | TASKS_PER_NODE_LIST=''
114 | # Replace commas with spaces to create space delimited list to use with for loop
115 | LIST_FROM_SLURM=`echo ${SLURM_TASKS_PER_NODE} | sed 's/,/ /g'`
116 | for ITEM in ${LIST_FROM_SLURM}
117 | do
118 | if [ `echo ${ITEM} | grep -e '^[0-9][0-9]*$' -c` -eq 1 ] ; then
119 | # "NUM_TASKS" == "NUM_TASKS(x1)"
120 | NUM_NODES=1
121 | NUM_TASKS=${ITEM}
122 | else
123 | # "NUM_TASKS(xNUM_NODES)"
124 | NUM_NODES=`echo $ITEM | sed 's/^[0-9][0-9]*(x\([0-9][0-9]*\))$/\1/'`
125 | NUM_TASKS=`echo $ITEM | sed 's/^\([0-9][0-9]*\)(x[0-9][0-9]*)$/\1/'`
126 | fi
127 | 
128 | # Repeat NUM_NODES iterations: append NUM_TASKS to TASKS_PER_NODE_LIST
129 | COUNT=0
130 | while [ ${COUNT} -lt ${NUM_NODES} ]
131 | do
132 | if [ -z "${TASKS_PER_NODE_LIST}" ] ; then
133 | # List empty, therefore adding first item to list - avoid adding comma
134 | TASKS_PER_NODE_LIST=${NUM_TASKS}
135 | else
136 | # Appending to list - add a comma to delimit entries
137 | TASKS_PER_NODE_LIST="${TASKS_PER_NODE_LIST},${NUM_TASKS}"
138 | fi
139 | COUNT=`expr ${COUNT} + 1`
140 | done
141 | done
142 | 
143 | # Add -hosts argument at start of MACHINE_ARG
144 | MACHINE_ARG="-hosts ${SLURM_JOB_NUM_NODES}"
145 | 
146 | # For each hostname in SMPD_HOSTS, append '<hostname> <tasks_on_host>' to MACHINE_ARG
147 | INDEX=0
148 | for HOSTNAME in ${SMPD_HOSTS}
149 | do
150 | INDEX=`expr ${INDEX} + 1`
151 | # Use cut to index the '${INDEX}th' item in TASKS_PER_NODE_LIST
152 | 
TASKS_PER_NODE=`echo ${TASKS_PER_NODE_LIST} | cut -f ${INDEX} -d,` 153 | MACHINE_ARG="${MACHINE_ARG} ${HOSTNAME} ${TASKS_PER_NODE}" 154 | done 155 | echo "Machine args: $MACHINE_ARG" 156 | } 157 | 158 | ######################################################################################### 159 | # Shut down SMPDs and exit with the exit code of the last command executed 160 | cleanupAndExit() { 161 | EXIT_CODE=${?} 162 | 163 | echo "Stopping SMPD ..." 164 | 165 | STOP_SMPD_CMD="srun --ntasks-per-node=1 --ntasks=${SLURM_JOB_NUM_NODES} --cpu-bind=none ${FULL_SMPD} -shutdown -phrase MATLAB -port ${SMPD_PORT}" 166 | echo $STOP_SMPD_CMD 167 | eval $STOP_SMPD_CMD 168 | 169 | echo "Exiting with code: ${EXIT_CODE}" 170 | exit ${EXIT_CODE} 171 | } 172 | 173 | ######################################################################################### 174 | # Use srun to launch the SMPD daemons on each processor 175 | launchSmpds() { 176 | 177 | # Launch the SMPD processes on all hosts using srun 178 | echo "Starting SMPD on ${SMPD_HOSTS} ..." 179 | 180 | START_SMPD_CMD="srun --ntasks-per-node=1 --ntasks=${SLURM_JOB_NUM_NODES} --cpu-bind=none ${FULL_SMPD} -phrase MATLAB -port ${SMPD_PORT} -debug 0 &" 181 | echo $START_SMPD_CMD 182 | eval $START_SMPD_CMD 183 | 184 | # Check that the SMPD processes are running on all hosts 185 | SUCCESS=0 186 | NUM_ATTEMPTS=60 187 | ATTEMPT=1 188 | while [ ${ATTEMPT} -le ${NUM_ATTEMPTS} ] 189 | do 190 | echo "Checking that SMPD processes are running (Attempt ${ATTEMPT} of ${NUM_ATTEMPTS})" 191 | SMPD_LAUNCHED_HOSTS="" 192 | NUM_HOSTS_FOUND=0 193 | for HOST in ${SMPD_HOSTS} 194 | do 195 | CHECK_SMPD_CMD="${FULL_SMPD} -phrase MATLAB -port ${SMPD_PORT} -status ${HOST} > /dev/null 2>&1" 196 | echo $CHECK_SMPD_CMD 197 | eval $CHECK_SMPD_CMD 198 | EXIT_CODE=${?} 199 | if [ $EXIT_CODE -ne 0 ] ; then 200 | echo "No SMPD process running on ${HOST}" 201 | else 202 | echo "SMPD process found running on ${HOST}" 203 | NUM_HOSTS_FOUND=$((NUM_HOSTS_FOUND+1)) 204 | 205 | # Append HOST to SMPD_LAUNCHED_HOSTS if it does not already contain it. 206 | case "${SMPD_LAUNCHED_HOSTS}" in 207 | *$HOST* ) ;; 208 | * ) SMPD_LAUNCHED_HOSTS="${SMPD_LAUNCHED_HOSTS} ${HOST}" ;; 209 | esac 210 | fi 211 | done 212 | if [ ${SLURM_JOB_NUM_NODES} -eq ${NUM_HOSTS_FOUND} ] ; then 213 | SUCCESS=1 214 | break 215 | elif [ ${ATTEMPT} -ne ${NUM_ATTEMPTS} ] ; then 216 | sleep 1 217 | fi 218 | ATTEMPT=$((ATTEMPT+1)) 219 | done 220 | if [ $SUCCESS -ne 1 ] ; then 221 | if [ $NUM_HOSTS_FOUND -eq 0 ] ; then 222 | echo "No SMPD processes were found running. Aborting." 223 | else 224 | echo "Found SMPD processes running on only ${NUM_HOSTS_FOUND} of ${SLURM_JOB_NUM_NODES} nodes. Aborting." 225 | echo "Hosts found: ${SMPD_LAUNCHED_HOSTS}" 226 | fi 227 | exit 1 228 | fi 229 | echo "All SMPDs launched" 230 | } 231 | 232 | ######################################################################################### 233 | runMpiexec() { 234 | 235 | CMD="\"${FULL_MPIEXEC}\" -smpd \ 236 | -phrase MATLAB \ 237 | -port ${SMPD_PORT} \ 238 | -l ${MACHINE_ARG} \ 239 | -genvlist ${PARALLEL_SERVER_GENVLIST} \ 240 | \"${PARALLEL_SERVER_MATLAB_EXE}\" \ 241 | ${PARALLEL_SERVER_MATLAB_ARGS}" 242 | 243 | # As a debug stage: echo the command ... 244 | echo $CMD 245 | 246 | # ... and then execute it. 
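# (eval is used, rather than invoking $CMD directly, so the escaped quotes
# embedded around the executable path are re-parsed by the shell and paths
# containing spaces survive intact.)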
247 | eval $CMD 248 | 249 | MPIEXEC_CODE=${?} 250 | if [ ${MPIEXEC_CODE} -ne 0 ] ; then 251 | exit ${MPIEXEC_CODE} 252 | fi 253 | } 254 | 255 | ######################################################################################### 256 | # Define the order in which we execute the stages defined above 257 | MAIN() { 258 | # Install a trap to ensure that SMPDs are closed if something errors or the 259 | # job is cancelled. 260 | trap "cleanupAndExit" 0 1 2 15 261 | chooseSmpdHosts 262 | chooseSmpdPort 263 | launchSmpds 264 | chooseMachineArg 265 | runMpiexec 266 | exit 0 # Explicitly exit 0 to trigger cleanupAndExit 267 | } 268 | 269 | # Call the MAIN loop 270 | MAIN 271 | -------------------------------------------------------------------------------- /communicatingSubmitFcn.m: -------------------------------------------------------------------------------- 1 | function communicatingSubmitFcn(cluster, job, environmentProperties) 2 | %COMMUNICATINGSUBMITFCN Submit a communicating MATLAB job to a Slurm cluster 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you submit a communicating job. 6 | % 7 | % See also parallel.cluster.generic.communicatingDecodeFcn. 8 | 9 | % Copyright 2010-2024 The MathWorks, Inc. 10 | 11 | % Store the current filename for the errors, warnings and dctSchedulerMessages. 12 | currFilename = mfilename; 13 | if ~isa(cluster, 'parallel.Cluster') 14 | error('parallelexamples:GenericSLURM:NotClusterObject', ... 15 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 16 | end 17 | 18 | decodeFunction = 'parallel.cluster.generic.communicatingDecodeFcn'; 19 | 20 | clusterOS = cluster.OperatingSystem; 21 | if ~strcmpi(clusterOS, 'unix') 22 | error('parallelexamples:GenericSLURM:UnsupportedOS', ... 23 | 'The function %s only supports clusters with the unix operating system.', currFilename) 24 | end 25 | 26 | % Get the correct quote and file separator for the Cluster OS. 27 | % This check is unnecessary in this file because we explicitly 28 | % checked that the clusterOS is unix. This code is an example 29 | % of how to deal with clusters that can be unix or pc. 30 | if strcmpi(clusterOS, 'unix') 31 | quote = ''''; 32 | fileSeparator = '/'; 33 | scriptExt = '.sh'; 34 | shellCmd = 'sh'; 35 | else 36 | quote = '"'; 37 | fileSeparator = '\'; 38 | scriptExt = '.bat'; 39 | shellCmd = 'cmd /c'; 40 | end 41 | 42 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 43 | remoteConnection = getRemoteConnection(cluster); 44 | end 45 | 46 | % Determine the debug setting. Setting to true makes the MATLAB workers 47 | % output additional logging. If EnableDebug is set in the cluster object's 48 | % AdditionalProperties, that takes precedence. Otherwise, look for the 49 | % PARALLEL_SERVER_DEBUG and MDCE_DEBUG environment variables in that order. 50 | % If nothing is set, debug is false. 
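% For example, one way to opt in to debug logging from the client session:
%   c = parcluster("mySLURMCluster");   % profile name assumed
%   c.AdditionalProperties.EnableDebug = true;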
51 | enableDebug = 'false'; 52 | if isprop(cluster.AdditionalProperties, 'EnableDebug') 53 | % Use AdditionalProperties.EnableDebug, if it is set 54 | enableDebug = char(string(cluster.AdditionalProperties.EnableDebug)); 55 | else 56 | % Otherwise check the environment variables set locally on the client 57 | environmentVariablesToCheck = {'PARALLEL_SERVER_DEBUG', 'MDCE_DEBUG'}; 58 | for idx = 1:numel(environmentVariablesToCheck) 59 | debugValue = getenv(environmentVariablesToCheck{idx}); 60 | if ~isempty(debugValue) 61 | enableDebug = debugValue; 62 | break 63 | end 64 | end 65 | end 66 | 67 | % The job specific environment variables 68 | % Remove leading and trailing whitespace from the MATLAB arguments 69 | matlabArguments = strtrim(environmentProperties.MatlabArguments); 70 | 71 | % Where the workers store job output 72 | if cluster.HasSharedFilesystem 73 | storageLocation = environmentProperties.StorageLocation; 74 | else 75 | storageLocation = remoteConnection.JobStorageLocation; 76 | % If the RemoteJobStorageLocation ends with a space, add a slash to ensure it is respected 77 | if endsWith(storageLocation, ' ') 78 | storageLocation = [storageLocation, fileSeparator]; 79 | end 80 | end 81 | variables = { ... 82 | 'PARALLEL_SERVER_DECODE_FUNCTION', decodeFunction; ... 83 | 'PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor; ... 84 | 'PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation; ... 85 | 'PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable; ... 86 | 'PARALLEL_SERVER_MATLAB_ARGS', matlabArguments; ... 87 | 'PARALLEL_SERVER_DEBUG', enableDebug; ... 88 | 'MLM_WEB_LICENSE', environmentProperties.UseMathworksHostedLicensing; ... 89 | 'MLM_WEB_USER_CRED', environmentProperties.UserToken; ... 90 | 'MLM_WEB_ID', environmentProperties.LicenseWebID; ... 91 | 'PARALLEL_SERVER_LICENSE_NUMBER', environmentProperties.LicenseNumber; ... 92 | 'PARALLEL_SERVER_STORAGE_LOCATION', storageLocation; ... 93 | 'PARALLEL_SERVER_CMR', strip(cluster.ClusterMatlabRoot, 'right', '/'); ... 94 | 'PARALLEL_SERVER_TOTAL_TASKS', num2str(environmentProperties.NumberOfTasks); ... 95 | 'PARALLEL_SERVER_NUM_THREADS', num2str(cluster.NumThreads)}; 96 | % Starting in R2025a, IntelMPI is supported via MPIImplementation="IntelMPI" 97 | if ~verLessThan('matlab', '25.1') && ... 98 | isprop(cluster.AdditionalProperties, 'MPIImplementation') %#ok 99 | mpiImplementation = cluster.AdditionalProperties.MPIImplementation; 100 | mustBeMember(mpiImplementation, ["IntelMPI", "MPICH"]); 101 | variables = [variables; {'PARALLEL_SERVER_MPIEXEC_ARG', ['-', char(mpiImplementation)]}]; 102 | end 103 | 104 | % Avoid "-bind-to core:N" if AdditionalProperties.UseBindToCore is false (default: true). 105 | if validatedPropValue(cluster.AdditionalProperties, 'UseBindToCore', 'logical', true) 106 | bindToCoreValue = 'true'; 107 | else 108 | bindToCoreValue = 'false'; 109 | end 110 | variables = [variables; {'PARALLEL_SERVER_BIND_TO_CORE', bindToCoreValue}]; 111 | 112 | if ~verLessThan('matlab', '25.1') %#ok 113 | variables = [variables; environmentProperties.JobEnvironment]; 114 | end 115 | % Environment variable names different prior to 19b 116 | if verLessThan('matlab', '9.7') 117 | variables(:,1) = replace(variables(:,1), 'PARALLEL_SERVER_', 'MDCE_'); 118 | end 119 | % Trim the environment variables of empty values. 
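% For example, if MLM_WEB_ID is empty because MathWorks hosted licensing is
% not in use, its row is removed so that no empty variable is exported.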
120 | nonEmptyValues = cellfun(@(x) ~isempty(strtrim(x)), variables(:,2)); 121 | variables = variables(nonEmptyValues, :); 122 | % List of all the variables to forward through mpiexec to the workers 123 | variables = [variables; ... 124 | {'PARALLEL_SERVER_GENVLIST', strjoin(variables(:,1), ',')}]; 125 | 126 | % The job directory as accessed by this machine 127 | localJobDirectory = cluster.getJobFolder(job); 128 | 129 | % The job directory as accessed by workers on the cluster 130 | if cluster.HasSharedFilesystem 131 | jobDirectoryOnCluster = cluster.getJobFolderOnCluster(job); 132 | else 133 | jobDirectoryOnCluster = remoteConnection.getRemoteJobLocation(job.ID, clusterOS); 134 | end 135 | 136 | % Specify the job wrapper script to use. 137 | % Prior to R2019a, only the SMPD process manager is supported. 138 | if verLessThan('matlab', '9.6') || ... 139 | validatedPropValue(cluster.AdditionalProperties, 'UseSmpd', 'logical', false) 140 | if ~verLessThan('matlab', '25.1') %#ok 141 | % Starting in R2025a, smpd launcher is not supported. 142 | error('parallelexamples:GenericSLURM:SmpdNoLongerSupported', ... 143 | 'The smpd process manager is no longer supported.'); 144 | end 145 | jobWrapperName = 'communicatingJobWrapperSmpd.sh'; 146 | else 147 | jobWrapperName = 'communicatingJobWrapper.sh'; 148 | end 149 | % The wrapper script is in the same directory as this file 150 | dirpart = fileparts(mfilename('fullpath')); 151 | localScript = fullfile(dirpart, jobWrapperName); 152 | % Copy the local wrapper script to the job directory 153 | copyfile(localScript, localJobDirectory, 'f'); 154 | 155 | % The script to execute on the cluster to run the job 156 | wrapperPath = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, jobWrapperName); 157 | quotedWrapperPath = sprintf('%s%s%s', quote, wrapperPath, quote); 158 | 159 | % Choose a file for the output 160 | logFile = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, sprintf('Job%d.log', job.ID)); 161 | quotedLogFile = sprintf('%s%s%s', quote, logFile, quote); 162 | dctSchedulerMessage(5, '%s: Using %s as log file', currFilename, quotedLogFile); 163 | 164 | jobName = sprintf('MATLAB_R%s_Job%d', version('-release'), job.ID); 165 | 166 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 167 | %% CUSTOMIZATION MAY BE REQUIRED %% 168 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 169 | % You might want to customize this section to match your cluster, 170 | % for example to limit the number of nodes for a single job. 
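% As a hypothetical illustration, appending '--nodes=4' to the arguments
% built below would cap each communicating job at four nodes:
%   additionalSubmitArgs = [additionalSubmitArgs ' --nodes=4'];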
171 | additionalSubmitArgs = sprintf('--ntasks=%d --cpus-per-task=%d', environmentProperties.NumberOfTasks, cluster.NumThreads); 172 | commonSubmitArgs = getCommonSubmitArgs(cluster); 173 | additionalSubmitArgs = strtrim(sprintf('%s %s', additionalSubmitArgs, commonSubmitArgs)); 174 | if validatedPropValue(cluster.AdditionalProperties, 'DisplaySubmitArgs', 'logical', false) 175 | fprintf('Submit arguments: %s\n', additionalSubmitArgs); 176 | end 177 | 178 | % Path to the submit script, to submit the Slurm job using sbatch 179 | submitScriptName = sprintf('submitScript%s', scriptExt); 180 | localSubmitScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, submitScriptName); 181 | submitScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, submitScriptName); 182 | quotedSubmitScriptPathOnCluster = sprintf('%s%s%s', quote, submitScriptPathOnCluster, quote); 183 | 184 | % Path to the environment wrapper, which will set the environment variables 185 | % for the job then execute the job wrapper 186 | envScriptName = sprintf('environmentWrapper%s', scriptExt); 187 | localEnvScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, envScriptName); 188 | envScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, envScriptName); 189 | quotedEnvScriptPathOnCluster = sprintf('%s%s%s', quote, envScriptPathOnCluster, quote); 190 | 191 | % Create the scripts to submit a Slurm job. 192 | % These will be created in the job directory. 193 | dctSchedulerMessage(5, '%s: Generating scripts for job %d', currFilename, job.ID); 194 | createEnvironmentWrapper(localEnvScriptPath, quotedWrapperPath, variables); 195 | createSubmitScript(localSubmitScriptPath, jobName, quotedLogFile, ... 196 | quotedEnvScriptPathOnCluster, additionalSubmitArgs); 197 | 198 | % Create the command to run on the cluster 199 | commandToRun = sprintf('%s %s', shellCmd, quotedSubmitScriptPathOnCluster); 200 | 201 | if ~cluster.HasSharedFilesystem 202 | % Start the mirror to copy all the job files over to the cluster 203 | dctSchedulerMessage(4, '%s: Starting mirror for job %d.', currFilename, job.ID); 204 | remoteConnection.startMirrorForJob(job); 205 | end 206 | 207 | if strcmpi(clusterOS, 'unix') 208 | % Add execute permissions to shell scripts 209 | runSchedulerCommand(cluster, sprintf( ... 210 | 'chmod u+x "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 211 | % Convert line endings to Unix 212 | runSchedulerCommand(cluster, sprintf( ... 213 | 'dos2unix --allow-chown "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 214 | end 215 | 216 | % Now ask the cluster to run the submission command 217 | dctSchedulerMessage(4, '%s: Submitting job using command:\n\t%s', currFilename, commandToRun); 218 | try 219 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 220 | catch err 221 | cmdFailed = true; 222 | cmdOut = err.message; 223 | end 224 | if cmdFailed 225 | if ~cluster.HasSharedFilesystem 226 | % Stop the mirroring if we failed to submit the job - this will also 227 | % remove the job files from the remote location 228 | remoteConnection = getRemoteConnection(cluster); 229 | % Only stop mirroring if we are actually mirroring 230 | if remoteConnection.isJobUsingConnection(job.ID) 231 | dctSchedulerMessage(5, '%s: Stopping the mirror for job %d.', currFilename, job.ID); 232 | try 233 | remoteConnection.stopMirrorForJob(job); 234 | catch err 235 | warning('parallelexamples:GenericSLURM:FailedToStopMirrorForJob', ... 
236 | 'Failed to stop the file mirroring for job %d.\nReason: %s', ...
237 | job.ID, err.getReport);
238 | end
239 | end
240 | end
241 | error('parallelexamples:GenericSLURM:FailedToSubmitJob', ...
242 | 'Failed to submit job to Slurm using command:\n\t%s.\nReason: %s', ...
243 | commandToRun, cmdOut);
244 | end
245 | 
246 | % Calculate the schedulerIDs
247 | jobIDs = extractJobId(cmdOut);
248 | if isempty(jobIDs)
249 | error('parallelexamples:GenericSLURM:FailedToParseSubmissionOutput', ...
250 | 'Failed to parse the job identifier from the submission output: "%s"', ...
251 | cmdOut);
252 | end
253 | % jobIDs must be a cell array
254 | if ~iscell(jobIDs)
255 | jobIDs = {jobIDs};
256 | end
257 | 
258 | % Store the scheduler ID for each task and the job cluster data
259 | jobData = struct('type', 'generic');
260 | if isprop(cluster.AdditionalProperties, 'ClusterHost')
261 | % Store the cluster host
262 | jobData.RemoteHost = remoteConnection.Hostname;
263 | end
264 | if ~cluster.HasSharedFilesystem
265 | % Store the remote job storage location
266 | jobData.RemoteJobStorageLocation = remoteConnection.JobStorageLocation;
267 | jobData.HasDoneLastMirror = false;
268 | end
269 | if verLessThan('matlab', '9.7') % schedulerID stored in job data
270 | jobData.ClusterJobIDs = jobIDs;
271 | else % schedulerID on task since 19b
272 | if isscalar(job.Tasks)
273 | schedulerIDs = jobIDs{1};
274 | else
275 | schedulerIDs = repmat(jobIDs, size(job.Tasks));
276 | end
277 | set(job.Tasks, 'SchedulerID', schedulerIDs);
278 | end
279 | cluster.setJobClusterData(job, jobData);
280 | 
281 | end
282 | -------------------------------------------------------------------------------- /deleteJobFcn.m: --------------------------------------------------------------------------------
1 | function deleteJobFcn(cluster, job)
2 | %DELETEJOBFCN Deletes a job on Slurm
3 | %
4 | % Set your cluster's PluginScriptsLocation to the parent folder of this
5 | % function to run it when you delete a job.
6 | 
7 | % Copyright 2017-2023 The MathWorks, Inc.
8 | 
9 | cancelJobOnCluster(cluster, job);
10 | 
11 | end
12 | -------------------------------------------------------------------------------- /deleteTaskFcn.m: --------------------------------------------------------------------------------
1 | function deleteTaskFcn(cluster, task)
2 | %DELETETASKFCN Deletes a task on Slurm
3 | %
4 | % Set your cluster's PluginScriptsLocation to the parent folder of this
5 | % function to run it when you delete a task.
6 | 
7 | % Copyright 2020-2023 The MathWorks, Inc.
8 | 
9 | cancelTaskOnCluster(cluster, task);
10 | 
11 | end
12 | -------------------------------------------------------------------------------- /discover/example.conf: --------------------------------------------------------------------------------
1 | # Since version R2023a, MATLAB can discover clusters running third-party
2 | # schedulers such as Slurm. The Discover Clusters functionality
3 | # automatically configures the Parallel Computing Toolbox to submit MATLAB
4 | # jobs to the cluster. To use this functionality, you must create a cluster
5 | # configuration file and store it at a location accessible to MATLAB users.
6 | #
7 | # This file is an example of a cluster configuration which MATLAB can
8 | # discover. You can copy and modify this file to make your cluster discoverable.
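# (For example, an admin might store a copy of this file in a shared folder
# such as /opt/matlab/discovery -- a hypothetical path -- and direct users to
# that folder when they run cluster discovery.)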
9 | # 10 | # For more information, including the required format for this file, see 11 | # the online documentation for making a cluster running a third-party 12 | # scheduler discoverable: 13 | # https://www.mathworks.com/help/matlab-parallel-server/configure-for-cluster-discovery.html 14 | 15 | # Copyright 2023 The MathWorks, Inc. 16 | 17 | # The name MATLAB will display for the cluster when discovered. 18 | Name = My Slurm cluster 19 | 20 | # Maximum number of MATLAB workers a single user can use in a single job. 21 | # This number must not exceed the number of available MATLAB Parallel 22 | # Server licenses. 23 | NumWorkers = 32 24 | 25 | # Path to the MATLAB install on the cluster for the workers to use. Note 26 | # the variable "$MATLAB_VERSION_STRING" returns the release number of the 27 | # MATLAB client that is running discovery, e.g. 2023a. If multiple versions 28 | # of MATLAB are installed on the cluster, this allows discovery to select 29 | # the correct installation path. Add a leading "R" or "r" if needed to 30 | # complete the MATLAB version. 31 | ClusterMatlabRoot = /opt/matlab/R"$MATLAB_VERSION_STRING" 32 | 33 | # Location where the MATLAB client stores job and task information. 34 | JobStorageLocation = /home/matlabjobs 35 | # If the client and cluster share a filesystem but the client is running 36 | # the Windows operating system and the cluster running a Linux operating 37 | # system, you must specify the JobStorageLocation using a structure by 38 | # commenting out the previous line and uncommenting the following lines. 39 | # The 'windows' and 'unix' fields must correspond to the same folder as 40 | # viewed from each of those operating systems. 41 | #JobStorageLocation.windows = \\organization\home\matlabjobs 42 | #JobStorageLocation.unix = /organization/home/matlabjobs 43 | 44 | # Folder that contains the scheduler plugin scripts that describe how 45 | # MATLAB interacts with the scheduler. A property can take different values 46 | # depending on the operating system of the client MATLAB by specifying the 47 | # name of the OS in parentheses. 48 | PluginScriptsLocation (Windows) = \\organization\matlab\pluginscripts 49 | PluginScriptsLocation (Unix) = /organization/matlab/pluginscripts 50 | 51 | # The operating system on the cluster. Valid values are 'unix' and 'windows'. 52 | OperatingSystem = unix 53 | 54 | # Specify whether client and cluster nodes share JobStorageLocation. To 55 | # configure MATLAB to copy job input and output files to and from the 56 | # cluster using SFTP, set this property to false and specify a value for 57 | # AdditionalProperties.RemoteJobStorageLocation below. 58 | HasSharedFilesystem = true 59 | 60 | # Specify whether the cluster uses online licensing. 61 | RequiresOnlineLicensing = false 62 | 63 | # LicenseNumber for the workers to use. Specify only if 64 | # RequiresOnlineLicensing is set to true. 65 | #LicenseNumber = 123456 66 | 67 | [AdditionalProperties] 68 | 69 | # To configure the user's machine to connect to the submission host via 70 | # SSH, uncomment the following line and enter the hostname of the cluster 71 | # machine that has the scheduler utilities to submit jobs. 72 | #ClusterHost = slurm-headnode 73 | 74 | # If the user's machine and the cluster nodes do not have a shared file 75 | # system, MATLAB can copy job input and output files to and from the 76 | # cluster using SFTP. To activate this feature, set HasSharedFilesystem 77 | # above to false. 
Then uncomment the following lines and enter the location 78 | # on the cluster to store job files. 79 | #RemoteJobStorageLocation (Windows) = /home/"$USERNAME"/.matlab/generic_cluster_jobs 80 | #RemoteJobStorageLocation (Unix) = /home/"$USER"/.matlab/generic_cluster_jobs 81 | 82 | # Username to log in to ClusterHost with. On Linux and Mac, use the USER 83 | # environment variable. On Windows, use the USERNAME variable. 84 | Username (Unix) = "$USER" 85 | Username (Windows) = "$USERNAME" 86 | -------------------------------------------------------------------------------- /discover/runDiscovery.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # Copyright 2023 The MathWorks, Inc. 4 | 5 | usage="$(basename "$0") matlabroot [folder] -- run third-party scheduler discovery in MATLAB R2023a onwards 6 | matlabroot - path to the folder where MATLAB is installed 7 | folder - folder to search for cluster configuration files 8 | (defaults to pwd)" 9 | 10 | # Print usage 11 | if [ -z "$1" ] || [ "$1" = "-h" ] || [ "$1" = "--help" ] ; then 12 | echo "$usage" 13 | exit 0 14 | fi 15 | 16 | # MATLAB executable to launch 17 | matlabExe="$1/bin/matlab" 18 | if [ ! -f "${matlabExe}" ] ; then 19 | echo "Could not find MATLAB executable at ${matlabExe}" 20 | exit 1 21 | fi 22 | 23 | # Folder to run discovery on. If specified, wrap in single-quotes to make a MATLAB charvec. 24 | discoveryFolder="$2" 25 | if [ ! -z "$discoveryFolder" ] ; then 26 | discoveryFolder="'${discoveryFolder}'" 27 | fi 28 | 29 | # Command to run in MATLAB 30 | matlabCmd="parallel.cluster.generic.discoverGenericClusters(${discoveryFolder})" 31 | 32 | # Arguments to pass to MATLAB 33 | matlabArgs="-nojvm -parallelserver -batch" 34 | 35 | # Build and run system command 36 | CMD="\"${matlabExe}\" ${matlabArgs} \"${matlabCmd}\"" 37 | eval $CMD 38 | -------------------------------------------------------------------------------- /getJobStateFcn.m: -------------------------------------------------------------------------------- 1 | function state = getJobStateFcn(cluster, job, state) 2 | %GETJOBSTATEFCN Gets the state of a job from Slurm 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you query the state of a job. 6 | 7 | % Copyright 2010-2024 The MathWorks, Inc. 8 | 9 | % Store the current filename for the errors, warnings and 10 | % dctSchedulerMessages 11 | currFilename = mfilename; 12 | if ~isa(cluster, 'parallel.Cluster') 13 | error('parallelexamples:GenericSLURM:SubmitFcnError', ... 14 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 15 | end 16 | 17 | % Get the information about the actual cluster used 18 | data = cluster.getJobClusterData(job); 19 | if isempty(data) 20 | % This indicates that the job has not been submitted, so just return 21 | dctSchedulerMessage(1, '%s: Job cluster data was empty for job with ID %d.', currFilename, job.ID); 22 | return 23 | end 24 | 25 | % Shortcut if the job state is already finished or failed 26 | jobInTerminalState = strcmp(state, 'finished') || strcmp(state, 'failed'); 27 | if jobInTerminalState 28 | if cluster.HasSharedFilesystem 29 | return 30 | end 31 | try 32 | hasDoneLastMirror = data.HasDoneLastMirror; 33 | catch err 34 | ex = MException('parallelexamples:GenericSLURM:FailedToRetrieveRemoteParameters', ... 
35 | 'Failed to retrieve remote parameters from the job cluster data.');
36 | ex = ex.addCause(err);
37 | throw(ex);
38 | end
39 | % Can only shortcut here if we've already done the last mirror
40 | if hasDoneLastMirror
41 | return
42 | end
43 | end
44 | 
45 | [schedulerIDs, numSubmittedTasks] = getSimplifiedSchedulerIDsForJob(job);
46 | 
47 | jobList = strjoin(schedulerIDs, ',');
48 | commandToRun = sprintf('squeue -j %s --states=all --Format=jobarrayid,state --noheader --array', jobList);
49 | dctSchedulerMessage(4, '%s: Querying cluster for job state using command:\n\t%s', currFilename, commandToRun);
50 | 
51 | try
52 | % We will ignore the status returned from the state command because
53 | % a non-zero status is returned if the job no longer exists
54 | [~, cmdOut] = runSchedulerCommand(cluster, commandToRun);
55 | catch err
56 | ex = MException('parallelexamples:GenericSLURM:FailedToGetJobState', ...
57 | 'Failed to get job state from cluster.');
58 | ex = ex.addCause(err);
59 | throw(ex);
60 | end
61 | 
62 | clusterState = iExtractJobState(cmdOut, numSubmittedTasks);
63 | dctSchedulerMessage(6, '%s: State %s was extracted from cluster output.', currFilename, clusterState);
64 | 
65 | % If we could determine the cluster's state, we'll use that. Otherwise, we assume
66 | % the scheduler is no longer tracking the job because the job has terminated.
67 | if ~strcmp(clusterState, 'unknown')
68 | state = clusterState;
69 | else
70 | state = 'finished';
71 | end
72 | 
73 | if ~cluster.HasSharedFilesystem
74 | % Decide what to do with mirroring based on the cluster's version of job
75 | % state and whether or not the job is currently being mirrored:
76 | % If job is not being mirrored, and job is not finished, resume the mirror
77 | % If job is not being mirrored, and job is finished, do the last mirror
78 | % If the job is being mirrored, and job is finished, do the last mirror
79 | % Otherwise (if job is not finished, and we are mirroring), do nothing
80 | remoteConnection = getRemoteConnection(cluster);
81 | isBeingMirrored = remoteConnection.isJobUsingConnection(job.ID);
82 | isJobFinished = strcmp(state, 'finished') || strcmp(state, 'failed');
83 | if ~isBeingMirrored && ~isJobFinished
84 | % resume the mirror
85 | dctSchedulerMessage(4, '%s: Resuming mirror for job %d.', currFilename, job.ID);
86 | try
87 | remoteConnection.resumeMirrorForJob(job);
88 | catch err
89 | warning('parallelexamples:GenericSLURM:FailedToResumeMirrorForJob', ...
90 | 'Failed to resume mirror for job %d. Your local job files may not be up-to-date.\nReason: %s', ...
91 | job.ID, err.getReport);
92 | end
93 | elseif isJobFinished
94 | dctSchedulerMessage(4, '%s: Doing last mirror for job %d.', currFilename, job.ID);
95 | try
96 | remoteConnection.doLastMirrorForJob(job);
97 | % Store the fact that we have done the last mirror so we can shortcut in the future
98 | data.HasDoneLastMirror = true;
99 | cluster.setJobClusterData(job, data);
100 | catch err
101 | warning('parallelexamples:GenericSLURM:FailedToDoFinalMirrorForJob', ...
102 | 'Failed to do last mirror for job %d. Your local job files may not be up-to-date.\nReason: %s', ...
103 | job.ID, err.getReport);
104 | end
105 | end
106 | end
107 | 
108 | end
109 | 
110 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
111 | function state = iExtractJobState(squeueOut, numJobs)
112 | % Function to extract the job state from the output of squeue
113 | 
114 | numPending = numel(regexp(squeueOut, 'PENDING|SPECIAL_EXIT'));
115 | numRunning = numel(regexp(squeueOut, 'RUNNING|SUSPENDED|COMPLETING|CONFIGURING|STOPPED|RESIZING'));
116 | numFinished = numel(regexp(squeueOut, 'COMPLETED'));
117 | numFailed = numel(regexp(squeueOut, 'CANCELLED|FAIL|TIMEOUT|PREEMPTED|OUT_OF|REVOKED|DEADLINE'));
118 | 
119 | % If all of the jobs that we asked about have finished, then we know the
120 | % job has finished.
121 | if numFinished == numJobs
122 | state = 'finished';
123 | return
124 | end
125 | 
126 | % Any running indicates that the job is running
127 | if numRunning > 0
128 | state = 'running';
129 | return
130 | end
131 | 
132 | % We know numRunning == 0 so if there are some still pending then the
133 | % job must be queued again, even if there are some finished
134 | if numPending > 0
135 | state = 'queued';
136 | return
137 | end
138 | 
139 | % Deal with any tasks that have failed
140 | if numFailed > 0
141 | % Set this job to be failed
142 | state = 'failed';
143 | return
144 | end
145 | 
146 | state = 'unknown';
147 | end
148 | -------------------------------------------------------------------------------- /independentJobWrapper.sh: --------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # This wrapper script is intended to support independent execution.
3 | #
4 | # This script uses the following environment variables set by the submit MATLAB code:
5 | # PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
6 | # PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
7 | 
8 | # Copyright 2010-2024 The MathWorks, Inc.
9 | 
10 | # If PARALLEL_SERVER_ environment variables are not set, assign any
11 | # available values with form MDCE_ for backwards compatibility
12 | PARALLEL_SERVER_MATLAB_EXE=${PARALLEL_SERVER_MATLAB_EXE:="${MDCE_MATLAB_EXE}"}
13 | PARALLEL_SERVER_MATLAB_ARGS=${PARALLEL_SERVER_MATLAB_ARGS:="${MDCE_MATLAB_ARGS}"}
14 | 
15 | # Echo the node that the scheduler has allocated to this job:
16 | echo "The scheduler has allocated the following node to this job: `hostname`"
17 | 
18 | if [ ! -z "${SLURM_ARRAY_TASK_ID}" ] ; then
19 | # Use job arrays
20 | TASK_ID=$((${SLURM_ARRAY_TASK_ID}+${PARALLEL_SERVER_TASK_ID_OFFSET}))
21 | export PARALLEL_SERVER_TASK_LOCATION="${PARALLEL_SERVER_JOB_LOCATION}/Task${TASK_ID}";
22 | export MDCE_TASK_LOCATION="${MDCE_JOB_LOCATION}/Task${TASK_ID}";
23 | fi
24 | 
25 | # Construct the command to run.
26 | CMD="\"${PARALLEL_SERVER_MATLAB_EXE}\" ${PARALLEL_SERVER_MATLAB_ARGS}"
27 | 
28 | # Echo the command so that it is shown in the output log.
29 | echo "Executing: $CMD"
30 | 
31 | # Execute the command.
32 | eval $CMD
33 | 
34 | EXIT_CODE=${?}
35 | echo "Exiting with code: ${EXIT_CODE}"
36 | exit ${EXIT_CODE}
37 | -------------------------------------------------------------------------------- /independentSubmitFcn.m: --------------------------------------------------------------------------------
1 | function independentSubmitFcn(cluster, job, environmentProperties)
2 | %INDEPENDENTSUBMITFCN Submit a MATLAB job to a Slurm cluster
3 | %
4 | % Set your cluster's PluginScriptsLocation to the parent folder of this
5 | % function to run it when you submit an independent job.
6 | % 7 | % See also parallel.cluster.generic.independentDecodeFcn. 8 | 9 | % Copyright 2010-2024 The MathWorks, Inc. 10 | 11 | % Store the current filename for the errors, warnings and dctSchedulerMessages. 12 | currFilename = mfilename; 13 | if ~isa(cluster, 'parallel.Cluster') 14 | error('parallelexamples:GenericSLURM:NotClusterObject', ... 15 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 16 | end 17 | 18 | decodeFunction = 'parallel.cluster.generic.independentDecodeFcn'; 19 | 20 | clusterOS = cluster.OperatingSystem; 21 | if ~strcmpi(clusterOS, 'unix') 22 | error('parallelexamples:GenericSLURM:UnsupportedOS', ... 23 | 'The function %s only supports clusters with the unix operating system.', currFilename) 24 | end 25 | 26 | % Get the correct quote and file separator for the Cluster OS. 27 | % This check is unnecessary in this file because we explicitly 28 | % checked that the clusterOS is unix. This code is an example 29 | % of how to deal with clusters that can be unix or pc. 30 | if strcmpi(clusterOS, 'unix') 31 | quote = ''''; 32 | fileSeparator = '/'; 33 | scriptExt = '.sh'; 34 | shellCmd = 'sh'; 35 | else 36 | quote = '"'; 37 | fileSeparator = '\'; 38 | scriptExt = '.bat'; 39 | shellCmd = 'cmd /c'; 40 | end 41 | 42 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 43 | remoteConnection = getRemoteConnection(cluster); 44 | end 45 | 46 | [useJobArrays, maxJobArraySize] = iGetJobArrayProps(cluster); 47 | % Store data for future reference 48 | cluster.UserData.UseJobArrays = useJobArrays; 49 | if useJobArrays 50 | cluster.UserData.MaxJobArraySize = maxJobArraySize; 51 | end 52 | 53 | % Determine the debug setting. Setting to true makes the MATLAB workers 54 | % output additional logging. If EnableDebug is set in the cluster object's 55 | % AdditionalProperties, that takes precedence. Otherwise, look for the 56 | % PARALLEL_SERVER_DEBUG and MDCE_DEBUG environment variables in that order. 57 | % If nothing is set, debug is false. 58 | enableDebug = 'false'; 59 | if isprop(cluster.AdditionalProperties, 'EnableDebug') 60 | % Use AdditionalProperties.EnableDebug, if it is set 61 | enableDebug = char(string(cluster.AdditionalProperties.EnableDebug)); 62 | else 63 | % Otherwise check the environment variables set locally on the client 64 | environmentVariablesToCheck = {'PARALLEL_SERVER_DEBUG', 'MDCE_DEBUG'}; 65 | for idx = 1:numel(environmentVariablesToCheck) 66 | debugValue = getenv(environmentVariablesToCheck{idx}); 67 | if ~isempty(debugValue) 68 | enableDebug = debugValue; 69 | break 70 | end 71 | end 72 | end 73 | 74 | % The job specific environment variables 75 | % Remove leading and trailing whitespace from the MATLAB arguments 76 | matlabArguments = strtrim(environmentProperties.MatlabArguments); 77 | 78 | % Where the workers store job output 79 | if cluster.HasSharedFilesystem 80 | storageLocation = environmentProperties.StorageLocation; 81 | else 82 | storageLocation = remoteConnection.JobStorageLocation; 83 | % If the RemoteJobStorageLocation ends with a space, add a slash to ensure it is respected 84 | if endsWith(storageLocation, ' ') 85 | storageLocation = [storageLocation, fileSeparator]; 86 | end 87 | end 88 | variables = { ... 89 | 'PARALLEL_SERVER_DECODE_FUNCTION', decodeFunction; ... 90 | 'PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor; ... 91 | 'PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation; ... 
92 | 'PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable; ... 93 | 'PARALLEL_SERVER_MATLAB_ARGS', matlabArguments; ... 94 | 'PARALLEL_SERVER_DEBUG', enableDebug; ... 95 | 'MLM_WEB_LICENSE', environmentProperties.UseMathworksHostedLicensing; ... 96 | 'MLM_WEB_USER_CRED', environmentProperties.UserToken; ... 97 | 'MLM_WEB_ID', environmentProperties.LicenseWebID; ... 98 | 'PARALLEL_SERVER_LICENSE_NUMBER', environmentProperties.LicenseNumber; ... 99 | 'PARALLEL_SERVER_STORAGE_LOCATION', storageLocation}; 100 | if ~verLessThan('matlab', '25.1') %#ok 101 | variables = [variables; environmentProperties.JobEnvironment]; 102 | end 103 | % Environment variable names different prior to 19b 104 | if verLessThan('matlab', '9.7') 105 | variables(:,1) = replace(variables(:,1), 'PARALLEL_SERVER_', 'MDCE_'); 106 | end 107 | % Trim the environment variables of empty values. 108 | nonEmptyValues = cellfun(@(x) ~isempty(strtrim(x)), variables(:,2)); 109 | variables = variables(nonEmptyValues, :); 110 | 111 | % The job directory as accessed by this machine 112 | localJobDirectory = cluster.getJobFolder(job); 113 | 114 | % The job directory as accessed by workers on the cluster 115 | if cluster.HasSharedFilesystem 116 | jobDirectoryOnCluster = cluster.getJobFolderOnCluster(job); 117 | else 118 | jobDirectoryOnCluster = remoteConnection.getRemoteJobLocation(job.ID, clusterOS); 119 | end 120 | 121 | % Name of the wrapper script to launch the MATLAB worker 122 | jobWrapperName = 'independentJobWrapper.sh'; 123 | % The wrapper script is in the same directory as this file 124 | dirpart = fileparts(mfilename('fullpath')); 125 | localScript = fullfile(dirpart, jobWrapperName); 126 | % Copy the local wrapper script to the job directory 127 | copyfile(localScript, localJobDirectory, 'f'); 128 | 129 | % The script to execute on the cluster to run the job 130 | wrapperPath = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, jobWrapperName); 131 | quotedWrapperPath = sprintf('%s%s%s', quote, wrapperPath, quote); 132 | 133 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 134 | %% CUSTOMIZATION MAY BE REQUIRED %% 135 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 136 | additionalSubmitArgs = sprintf('--ntasks=1 --cpus-per-task=%d', cluster.NumThreads); 137 | commonSubmitArgs = getCommonSubmitArgs(cluster); 138 | additionalSubmitArgs = strtrim(sprintf('%s %s', additionalSubmitArgs, commonSubmitArgs)); 139 | if validatedPropValue(cluster.AdditionalProperties, 'DisplaySubmitArgs', 'logical', false) 140 | fprintf('Submit arguments: %s\n', additionalSubmitArgs); 141 | end 142 | 143 | % Only keep and submit tasks that are not cancelled. Cancelled tasks 144 | % will have errors. 145 | isPendingTask = cellfun(@isempty, get(job.Tasks, {'Error'})); 146 | tasks = job.Tasks(isPendingTask); 147 | taskIDs = cell2mat(get(tasks, {'ID'})); 148 | numberOfTasks = numel(tasks); 149 | 150 | % Only use job arrays when you can get enough use out of them. 151 | % The submission method in this function requires a minimum maxJobArraySize 152 | % of 10 to get enough use of job arrays. 153 | if numberOfTasks < 2 || maxJobArraySize < 10 154 | useJobArrays = false; 155 | end 156 | 157 | if useJobArrays 158 | % Check if there are more tasks than will fit in one job array. Slurm 159 | % will not accept a job array index greater than its MaxArraySize 160 | % parameter, as defined in slurm.conf, even if the overall size of the 161 | % array is less than MaxArraySize. 
For example, for the default 162 | % (inclusive) upper limit of MaxArraySize=1000, array indices of 1 to 163 | % 1000 would be accepted, but 1001 or above would not. To get around 164 | % this restriction, submit the full array of tasks in multiple Slurm 165 | % job arrays, hereafter referred to as subarrays. Round the 166 | % MaxArraySize down to the nearest power of 10, as this allows the log 167 | % file of taskX to be named TaskX.log. See iGenerateLogFileName. 168 | if taskIDs(end) > maxJobArraySize 169 | % Use the nearest power of 10 as subarray size. This will make the 170 | % naming of log files easier. 171 | maxJobArraySizeToUse = 10^floor(log10(maxJobArraySize)); 172 | % Group task IDs into bins of jobArraySize size. 173 | groups = findgroups(floor(taskIDs./maxJobArraySizeToUse)); 174 | % Count the number of elements in each group and form subarrays. 175 | jobArraySizes = splitapply(@numel, taskIDs, groups); 176 | else 177 | maxJobArraySizeToUse = maxJobArraySize; 178 | jobArraySizes = numel(tasks); 179 | end 180 | taskIDGroupsForJobArrays = mat2cell(taskIDs,jobArraySizes); 181 | 182 | jobName = sprintf('MATLAB_R%s_Job%d', version('-release'), job.ID); 183 | numJobArrays = numel(taskIDGroupsForJobArrays); 184 | commandsToRun = cell(numJobArrays, 1); 185 | jobIDs = cell(numJobArrays, 1); 186 | schedulerJobArrayIndices = cell(numJobArrays, 1); 187 | for ii = 1:numJobArrays 188 | % Slurm only accepts task IDs up to maxArraySize. Shift all task 189 | % IDs down below the limit. 190 | taskOffset = (ii-1)*maxJobArraySizeToUse; 191 | schedulerJobArrayIndices{ii} = taskIDGroupsForJobArrays{ii} - taskOffset; 192 | % Save the offset as an environment variable to pass to the tasks 193 | % during Slurm submission. 194 | environmentVariables = [variables; ... 195 | {'PARALLEL_SERVER_TASK_ID_OFFSET', num2str(taskOffset)}]; 196 | 197 | % Create a character vector with the ranges of IDs to submit. 198 | jobArrayString = iCreateJobArrayString(schedulerJobArrayIndices{ii}); 199 | 200 | % Choose a file for the output 201 | logFileName = iGenerateLogFileName(ii, maxJobArraySizeToUse); 202 | logFile = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, logFileName); 203 | quotedLogFile = sprintf('%s%s%s', quote, logFile, quote); 204 | dctSchedulerMessage(5, '%s: Using %s as log file', currFilename, quotedLogFile); 205 | 206 | % Path to the submit script, to submit the Slurm job using sbatch 207 | submitScriptName = sprintf('submitScript%d%s', ii, scriptExt); 208 | localSubmitScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, submitScriptName); 209 | submitScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, submitScriptName); 210 | quotedSubmitScriptPathOnCluster = sprintf('%s%s%s', quote, submitScriptPathOnCluster, quote); 211 | 212 | % Path to the environment wrapper, which will set the environment variables 213 | % for the job then execute the job wrapper 214 | envScriptName = sprintf('environmentWrapper%d%s', ii, scriptExt); 215 | localEnvScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, envScriptName); 216 | envScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, envScriptName); 217 | quotedEnvScriptPathOnCluster = sprintf('%s%s%s', quote, envScriptPathOnCluster, quote); 218 | 219 | % Create the scripts to submit a Slurm job. 220 | % These will be created in the job directory. 
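% For reference, each generated submit script ends in a single sbatch
% command of roughly this shape (paths and values are illustrative, not
% literal output; see getSubmitString in the private folder for the exact
% format):
%
%   sbatch --job-name=MATLAB_R2024a_Job1 --array='[1-100]' \
%       --output='/remote/Job1/Task%a.log' --export=NONE \
%       --ntasks=1 --cpus-per-task=1 '/remote/Job1/environmentWrapper1.sh'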
221 | dctSchedulerMessage(5, '%s: Generating scripts for job array %d', currFilename, ii); 222 | createEnvironmentWrapper(localEnvScriptPath, quotedWrapperPath, environmentVariables); 223 | createSubmitScript(localSubmitScriptPath, jobName, quotedLogFile, ... 224 | quotedEnvScriptPathOnCluster, additionalSubmitArgs, jobArrayString); 225 | 226 | % Create the command to run on the cluster 227 | commandsToRun{ii} = sprintf('%s %s', shellCmd, quotedSubmitScriptPathOnCluster); 228 | end 229 | else 230 | % Do not use job arrays and submit each task individually. 231 | taskLocations = environmentProperties.TaskLocations(isPendingTask); 232 | jobIDs = cell(1, numberOfTasks); 233 | commandsToRun = cell(numberOfTasks, 1); 234 | 235 | % Loop over every task we have been asked to submit 236 | for ii = 1:numberOfTasks 237 | taskLocation = taskLocations{ii}; 238 | % Add the task location to the environment variables 239 | if verLessThan('matlab', '9.7') % variable name changed in 19b 240 | environmentVariables = [variables; ... 241 | {'MDCE_TASK_LOCATION', taskLocation}]; 242 | else 243 | environmentVariables = [variables; ... 244 | {'PARALLEL_SERVER_TASK_LOCATION', taskLocation}]; 245 | end 246 | 247 | % Choose a file for the output 248 | logFileName = sprintf('Task%d.log', taskIDs(ii)); 249 | logFile = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, logFileName); 250 | quotedLogFile = sprintf('%s%s%s', quote, logFile, quote); 251 | dctSchedulerMessage(5, '%s: Using %s as log file', currFilename, quotedLogFile); 252 | 253 | % Submit one task at a time 254 | jobName = sprintf('MATLAB_R%s_Job%d.%d', version('-release'), job.ID, taskIDs(ii)); 255 | 256 | % Path to the submit script, to submit the Slurm job using sbatch 257 | submitScriptName = sprintf('submitScript%d%s', ii, scriptExt); 258 | localSubmitScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, submitScriptName); 259 | submitScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, submitScriptName); 260 | quotedSubmitScriptPathOnCluster = sprintf('%s%s%s', quote, submitScriptPathOnCluster, quote); 261 | 262 | % Path to the environment wrapper, which will set the environment variables 263 | % for the job then execute the job wrapper 264 | envScriptName = sprintf('environmentWrapper%d%s', ii, scriptExt); 265 | localEnvScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, envScriptName); 266 | envScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, envScriptName); 267 | quotedEnvScriptPathOnCluster = sprintf('%s%s%s', quote, envScriptPathOnCluster, quote); 268 | 269 | % Create the scripts to submit a Slurm job. 270 | % These will be created in the job directory. 271 | dctSchedulerMessage(5, '%s: Generating scripts for task %d', currFilename, ii); 272 | createEnvironmentWrapper(localEnvScriptPath, quotedWrapperPath, environmentVariables); 273 | createSubmitScript(localSubmitScriptPath, jobName, quotedLogFile, ... 
274 | quotedEnvScriptPathOnCluster, additionalSubmitArgs); 275 | 276 | % Create the command to run on the cluster 277 | commandsToRun{ii} = sprintf('%s %s', shellCmd, quotedSubmitScriptPathOnCluster); 278 | end 279 | end 280 | 281 | if ~cluster.HasSharedFilesystem 282 | % Start the mirror to copy all the job files over to the cluster 283 | dctSchedulerMessage(4, '%s: Starting mirror for job %d.', currFilename, job.ID); 284 | remoteConnection.startMirrorForJob(job); 285 | end 286 | 287 | if strcmpi(clusterOS, 'unix') 288 | % Add execute permissions to shell scripts 289 | runSchedulerCommand(cluster, sprintf( ... 290 | 'chmod u+x "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 291 | % Convert line endings to Unix 292 | runSchedulerCommand(cluster, sprintf( ... 293 | 'dos2unix --allow-chown "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 294 | end 295 | 296 | for ii=1:numel(commandsToRun) 297 | commandToRun = commandsToRun{ii}; 298 | jobIDs{ii} = iSubmitJobUsingCommand(cluster, job, commandToRun); 299 | end 300 | 301 | % Calculate the schedulerIDs 302 | if useJobArrays 303 | % The scheduler ID of each task is a combination of the job ID and the 304 | % scheduler array index. cellfun pairs each job ID with its 305 | % corresponding scheduler array indices in schedulerJobArrayIndices and 306 | % returns the combination of both. For example, if jobIDs = {1,2} and 307 | % schedulerJobArrayIndices = {[1,2];[3,4]}, the schedulerID is given by 308 | % combining 1 with [1,2] and 2 with [3,4], in the canonical form of the 309 | % scheduler. 310 | schedulerIDs = cellfun(@(jobID,arrayIndices) jobID + "_" + arrayIndices, ... 311 | jobIDs, schedulerJobArrayIndices, 'UniformOutput',false); 312 | schedulerIDs = vertcat(schedulerIDs{:}); 313 | else 314 | % The scheduler ID of each task is the job ID. 315 | schedulerIDs = string(jobIDs); 316 | end 317 | 318 | % Store the scheduler ID for each task and the job cluster data 319 | jobData = struct('type', 'generic'); 320 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 321 | % Store the cluster host 322 | jobData.RemoteHost = remoteConnection.Hostname; 323 | end 324 | if ~cluster.HasSharedFilesystem 325 | % Store the remote job storage location 326 | jobData.RemoteJobStorageLocation = remoteConnection.JobStorageLocation; 327 | jobData.HasDoneLastMirror = false; 328 | end 329 | if verLessThan('matlab', '9.7') % schedulerID stored in job data 330 | jobData.ClusterJobIDs = schedulerIDs; 331 | else % schedulerID on task since 19b 332 | set(tasks, 'SchedulerID', schedulerIDs); 333 | end 334 | cluster.setJobClusterData(job, jobData); 335 | 336 | end 337 | 338 | function [useJobArrays, maxJobArraySize] = iGetJobArrayProps(cluster) 339 | % Look for useJobArrays and maxJobArray size in the following order: 340 | % 1. Additional Properties 341 | % 2. User Data 342 | % 3. 
Query scheduler for MaxJobArraySize 343 | 344 | useJobArrays = validatedPropValue(cluster.AdditionalProperties, 'UseJobArrays', 'logical'); 345 | if isempty(useJobArrays) 346 | if isfield(cluster.UserData, 'UseJobArrays') 347 | useJobArrays = cluster.UserData.UseJobArrays; 348 | else 349 | useJobArrays = true; 350 | end 351 | end 352 | 353 | if ~useJobArrays 354 | % Not using job arrays so don't need the max array size 355 | maxJobArraySize = 0; 356 | return 357 | end 358 | 359 | maxJobArraySize = validatedPropValue(cluster.AdditionalProperties, 'MaxJobArraySize', 'numeric'); 360 | if ~isempty(maxJobArraySize) 361 | if maxJobArraySize < 1 362 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 363 | 'MaxJobArraySize must be a positive integer'); 364 | end 365 | return 366 | end 367 | 368 | if isfield(cluster.UserData,'MaxJobArraySize') 369 | maxJobArraySize = cluster.UserData.MaxJobArraySize; 370 | return 371 | end 372 | 373 | % Get job array information by querying the scheduler. 374 | commandToRun = 'scontrol show config'; 375 | try 376 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 377 | catch err 378 | cmdFailed = true; 379 | cmdOut = err.message; 380 | end 381 | if cmdFailed 382 | error('parallelexamples:GenericSLURM:FailedToRetrieveInfo', ... 383 | 'Failed to retrieve Slurm configuration information using command:\n\t%s.\nReason: %s', ... 384 | commandToRun, cmdOut); 385 | end 386 | 387 | maxJobArraySize = 0; 388 | % Extract the maximum array size for job arrays. For Slurm, the 389 | % configuration line that contains the maximum array index looks like this: 390 | % MaxArraySize = 1000 391 | % Use a regular expression to extract this parameter. 392 | tokens = regexp(cmdOut,'MaxArraySize\s*=\s*(\d+)', 'tokens','once'); 393 | 394 | if isempty(tokens) || (str2double(tokens) == 0) 395 | % No job array support. 396 | useJobArrays = false; 397 | return 398 | end 399 | 400 | useJobArrays = true; 401 | % Set the maximum array size. 402 | maxJobArraySize = str2double(tokens{1}); 403 | % In Slurm, MaxArraySize is an exclusive upper bound. Subtract one to obtain 404 | % the inclusive upper bound. 405 | maxJobArraySize = maxJobArraySize - 1; 406 | end 407 | 408 | function jobID = iSubmitJobUsingCommand(cluster, job, commandToRun) 409 | currFilename = mfilename; 410 | % Ask the cluster to run the submission command. 411 | dctSchedulerMessage(4, '%s: Submitting job %d using command:\n\t%s', currFilename, job.ID, commandToRun); 412 | try 413 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 414 | catch err 415 | cmdFailed = true; 416 | cmdOut = err.message; 417 | end 418 | if cmdFailed 419 | if ~cluster.HasSharedFilesystem 420 | % Stop the mirroring if we failed to submit the job - this will also 421 | % remove the job files from the remote location 422 | remoteConnection = getRemoteConnection(cluster); 423 | % Only stop mirroring if we are actually mirroring 424 | if remoteConnection.isJobUsingConnection(job.ID) 425 | dctSchedulerMessage(5, '%s: Stopping the mirror for job %d.', currFilename, job.ID); 426 | try 427 | remoteConnection.stopMirrorForJob(job); 428 | catch err 429 | warning('parallelexamples:GenericSLURM:FailedToStopMirrorForJob', ... 430 | 'Failed to stop the file mirroring for job %d.\nReason: %s', ... 431 | job.ID, err.getReport); 432 | end 433 | end 434 | end 435 | error('parallelexamples:GenericSLURM:FailedToSubmitJob', ... 436 | 'Failed to submit job to Slurm using command:\n\t%s.\nReason: %s', ... 
437 | commandToRun, cmdOut); 438 | end 439 | 440 | jobID = extractJobId(cmdOut); 441 | if isempty(jobID) 442 | error('parallelexamples:GenericSLURM:FailedToParseSubmissionOutput', ... 443 | 'Failed to parse the job identifier from the submission output: "%s"', ... 444 | cmdOut); 445 | end 446 | end 447 | 448 | function rangesString = iCreateJobArrayString(taskIDs) 449 | % Create a character vector with the ranges of task IDs to submit 450 | if taskIDs(end) - taskIDs(1) + 1 == numel(taskIDs) 451 | % There is only one range. 452 | rangesString = sprintf('%d-%d',taskIDs(1),taskIDs(end)); 453 | else 454 | % There are several ranges. 455 | % Calculate the step size between task IDs. 456 | step = diff(taskIDs); 457 | % Where the step changes, a range ends and another starts. Include 458 | % the initial and ending IDs in the ranges as well. 459 | isStartOfRange = [true; step > 1]; 460 | isEndOfRange = [step > 1; true]; 461 | rangesString = strjoin(compose('%d-%d', ... 462 | taskIDs(isStartOfRange),taskIDs(isEndOfRange)),','); 463 | end 464 | end 465 | 466 | function logFileName = iGenerateLogFileName(subArrayIdx, jobArraySize) 467 | % This function builds the log file specifier, which is then passed to 468 | % Slurm to tell it where each task's output should go. This will be equal 469 | % to TaskX.log where X is the MATLAB ID. Slurm will not accept a job array 470 | % index greater than its MaxArraySize parameter. As a result MATLAB IDs 471 | % must be shifted down below MaxArraySize. To ensure that the log file for 472 | % Task X is called TaskX.log, round the maximum array size down to the 473 | % nearest power of 10 and manually construct the log file specifier. For 474 | % example, for a MaxArraySize of 1500, the Slurm job arrays will be of 475 | % size 1000, and MATLAB task IDs will map as illustrated by the following 476 | % table: 477 | % 478 | % MATLAB ID | Slurm ID | Log file specifier 479 | % ----------+----------+-------------------- 480 | % 1- 999 | 1-999 | Task%a.log 481 | % 1000-1999 | 000-999 | Task1%3a.log 482 | % 2000-2999 | 000-999 | Task2%3a.log 483 | % 3000 | 000 | Task3%3a.log 484 | % 485 | % Note that Slurm expands %a to the Slurm ID, and %3a to the Slurm ID 486 | % padded with zeros to 3 digits. 487 | if subArrayIdx == 1 488 | % Job arrays have more than one task. Use %a so that Slurm expands it 489 | % into the actual task ID. 490 | logFileName = 'Task%a.log'; 491 | else 492 | % For subsequent subarrays after the first one, prepend the index to %a 493 | % to identify the batch of log files and form the final log file name. 494 | padding = floor(log10(jobArraySize)); 495 | logFileName = sprintf('Task%d%%%da.log',subArrayIdx-1,padding); 496 | end 497 | end 498 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2022, The MathWorks, Inc. 2 | All rights reserved. 3 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 4 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 5 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 6 | 3. 
In all cases, the software is, and all modifications and derivatives of the software shall be, licensed to you solely for use in conjunction with MathWorks products and service offerings. 7 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 8 | -------------------------------------------------------------------------------- /postConstructFcn.m: -------------------------------------------------------------------------------- 1 | function postConstructFcn(cluster) %#ok 2 | %POSTCONSTRUCTFCN Perform custom configuration after call to PARCLUSTER 3 | % 4 | % POSTCONSTRUCTFCN(CLUSTER) execute code on cluster object CLUSTER. 5 | % 6 | % See also parcluster. 7 | 8 | % Copyright 2023 The MathWorks, Inc. 9 | 10 | end 11 | -------------------------------------------------------------------------------- /private/cancelJobOnCluster.m: -------------------------------------------------------------------------------- 1 | function OK = cancelJobOnCluster(cluster, job) 2 | %CANCELJOBONCLUSTER Cancels a job on the Slurm scheduler 3 | 4 | % Copyright 2010-2023 The MathWorks, Inc. 5 | 6 | % Store the current filename for the errors, warnings and 7 | % dctSchedulerMessages 8 | currFilename = mfilename; 9 | if ~isa(cluster, 'parallel.Cluster') 10 | error('parallelexamples:GenericSLURM:SubmitFcnError', ... 11 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 12 | end 13 | 14 | % Get the information about the actual cluster used 15 | data = cluster.getJobClusterData(job); 16 | if isempty(data) 17 | % This indicates that the job has not been submitted, so return true 18 | dctSchedulerMessage(1, '%s: Job cluster data was empty for job with ID %d.', currFilename, job.ID); 19 | OK = true; 20 | return 21 | end 22 | 23 | % Get a simplified list of schedulerIDs to reduce the number of calls to 24 | % the scheduler. 25 | schedulerIDs = getSimplifiedSchedulerIDsForJob(job); 26 | erroredJobAndCauseStrings = cell(size(schedulerIDs)); 27 | % Get the cluster to delete the job 28 | for ii = 1:length(schedulerIDs) 29 | schedulerID = schedulerIDs{ii}; 30 | commandToRun = sprintf('scancel -v ''%s''', schedulerID); 31 | dctSchedulerMessage(4, '%s: Canceling job on cluster using command:\n\t%s.', currFilename, commandToRun); 32 | try 33 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 34 | catch err 35 | cmdFailed = true; 36 | cmdOut = err.message; 37 | end 38 | % scancel can return 0 even if there is an error, so also check the 39 | % cmdOut does not contain error text. We do not consider attempting 40 | % to cancel a finished job as a failure, so exclude that. 41 | if (cmdFailed || contains(cmdOut, 'error:')) && ... 
42 | ~contains(cmdOut, {'already completing', 'Invalid job id specified'}) 43 | % Keep track of all jobs that errored when being cancelled, either 44 | % through a bad exit code or if an error was thrown. We'll report 45 | % these later on. 46 | erroredJobAndCauseStrings{ii} = sprintf('Job ID: %s\tReason: %s', schedulerID, strtrim(cmdOut)); 47 | dctSchedulerMessage(1, '%s: Failed to cancel job %s on cluster. Reason:\n\t%s', currFilename, schedulerID, cmdOut); 48 | end 49 | end 50 | 51 | if ~cluster.HasSharedFilesystem 52 | % Only stop mirroring if we are actually mirroring 53 | remoteConnection = getRemoteConnection(cluster); 54 | if remoteConnection.isJobUsingConnection(job.ID) 55 | dctSchedulerMessage(5, '%s: Stopping the mirror for job %d.', currFilename, job.ID); 56 | try 57 | remoteConnection.stopMirrorForJob(job); 58 | catch err 59 | warning('parallelexamples:GenericSLURM:FailedToStopMirrorForJob', ... 60 | 'Failed to stop the file mirroring for job %d.\nReason: %s', ... 61 | job.ID, err.getReport); 62 | end 63 | end 64 | end 65 | 66 | % Now warn about those jobs that we failed to cancel. 67 | erroredJobAndCauseStrings = erroredJobAndCauseStrings(~cellfun(@isempty, erroredJobAndCauseStrings)); 68 | if ~isempty(erroredJobAndCauseStrings) 69 | warning('parallelexamples:GenericSLURM:FailedToCancelJob', ... 70 | 'Failed to cancel the following jobs on the cluster:\n%s', ... 71 | sprintf(' %s\n', erroredJobAndCauseStrings{:})); 72 | end 73 | OK = isempty(erroredJobAndCauseStrings); 74 | 75 | end 76 | -------------------------------------------------------------------------------- /private/cancelTaskOnCluster.m: -------------------------------------------------------------------------------- 1 | function OK = cancelTaskOnCluster(cluster, task) 2 | %CANCELTASKONCLUSTER Cancels a task on the Slurm scheduler 3 | 4 | % Copyright 2020-2023 The MathWorks, Inc. 5 | 6 | % Store the current filename for the errors, warnings and 7 | % dctSchedulerMessages 8 | currFilename = mfilename; 9 | if ~isa(cluster, 'parallel.Cluster') 10 | error('parallelexamples:GenericSLURM:SubmitFcnError', ... 11 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 12 | end 13 | 14 | % Get the information about the actual cluster used 15 | data = cluster.getJobClusterData(task.Parent); 16 | if isempty(data) 17 | % This indicates that the parent job has not been submitted, so return true 18 | dctSchedulerMessage(1, '%s: Job cluster data was empty for the parent job with ID %d.', currFilename, task.Parent.ID); 19 | OK = true; 20 | return 21 | end 22 | % We can't cancel a single task of a communicating job on the scheduler 23 | % without cancelling the entire job, so warn and return in this case 24 | if ~strcmpi(task.Parent.Type, 'independent') 25 | OK = false; 26 | warning('parallelexamples:GenericSLURM:FailedToCancelTask', ... 27 | 'Unable to cancel a single task of a communicating job. 
If you want to cancel the entire job, use the cancel function on the job object instead.'); 28 | return 29 | end 30 | 31 | % Get the cluster to delete the task 32 | if verLessThan('matlab', '9.7') % schedulerID stored in job data 33 | schedulerIDs = data.ClusterJobIDs; 34 | schedulerID = schedulerIDs{task.ID}; 35 | else % schedulerID on task since 19b 36 | schedulerID = task.SchedulerID; 37 | end 38 | erroredTaskAndCauseString = ''; 39 | commandToRun = sprintf('scancel -v ''%s''', schedulerID); 40 | dctSchedulerMessage(4, '%s: Canceling task on cluster using command:\n\t%s.', currFilename, commandToRun); 41 | try 42 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 43 | catch err 44 | cmdFailed = true; 45 | cmdOut = err.message; 46 | end 47 | % scancel can return 0 even if there is an error, so also check the 48 | % cmdOut does not contain error text. We do not consider attempting 49 | % to cancel a finished job as a failure, so exclude that. 50 | if (cmdFailed || contains(cmdOut, 'error:')) && ... 51 | ~contains(cmdOut, {'already completing', 'Invalid job id specified'}) 52 | % Record if the task errored when being cancelled, either through a bad 53 | % exit code or if an error was thrown. We'll report this as a warning. 54 | erroredTaskAndCauseString = sprintf('Job ID: %s\tReason: %s', schedulerID, strtrim(cmdOut)); 55 | dctSchedulerMessage(1, '%s: Failed to cancel task %s on cluster. Reason:\n\t%s', currFilename, schedulerID, cmdOut); 56 | end 57 | 58 | % Warn if task cancellation failed. 59 | OK = isempty(erroredTaskAndCauseString); 60 | if ~OK 61 | warning('parallelexamples:GenericSLURM:FailedToCancelTask', ... 62 | 'Failed to cancel the task on the cluster:\n %s\n', ... 63 | erroredTaskAndCauseString); 64 | end 65 | 66 | end 67 | -------------------------------------------------------------------------------- /private/createEnvironmentWrapper.m: -------------------------------------------------------------------------------- 1 | function createEnvironmentWrapper(outputFilename, quotedWrapperPath, environmentVariables) 2 | % Create a script that sets the correct environment variables and then 3 | % calls the job wrapper. 4 | 5 | % Copyright 2023 The MathWorks, Inc. 6 | 7 | dctSchedulerMessage(5, '%s: Creating environment wrapper at %s', mfilename, outputFilename); 8 | 9 | % Open file in binary mode to make it cross-platform. 10 | fid = fopen(outputFilename, 'w'); 11 | if fid < 0 12 | error('parallelexamples:GenericSLURM:FileError', ... 13 | 'Failed to open file %s for writing', outputFilename); 14 | end 15 | fileCloser = onCleanup(@() fclose(fid)); 16 | 17 | % Specify shell to use 18 | fprintf(fid, '#!/bin/sh\n'); 19 | 20 | formatSpec = 'export %s=''%s''\n'; 21 | 22 | % Write the commands to set and export environment variables 23 | for ii = 1:size(environmentVariables, 1) 24 | fprintf(fid, formatSpec, environmentVariables{ii,1}, environmentVariables{ii,2}); 25 | end 26 | 27 | % Write the command to run the job wrapper 28 | fprintf(fid, '%s\n', quotedWrapperPath); 29 | 30 | end 31 | -------------------------------------------------------------------------------- /private/createSubmitScript.m: -------------------------------------------------------------------------------- 1 | function createSubmitScript(outputFilename, jobName, quotedLogFile, ... 2 | quotedWrapperPath, additionalSubmitArgs, jobArrayString) 3 | % Create a script that runs the Slurm sbatch command. 4 | 5 | % Copyright 2010-2024 The MathWorks, Inc. 
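%
% A minimal call sketch (all argument values are illustrative):
%
%   createSubmitScript('/tmp/Job1/submitScript1.sh', 'MATLAB_R2024a_Job1', ...
%       '''/remote/Job1/Task%a.log''', '''/remote/Job1/environmentWrapper1.sh''', ...
%       '--ntasks=1 --cpus-per-task=1', '1-100');
%
% This writes a shell script that first unsets any inherited SLURM_/SBATCH_
% variables and then runs the sbatch command built by getSubmitString.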
6 | 7 | if nargin < 6 8 | jobArrayString = []; 9 | end 10 | 11 | dctSchedulerMessage(5, '%s: Creating submit script for %s at %s', mfilename, jobName, outputFilename); 12 | 13 | % Open file in binary mode to make it cross-platform. 14 | fid = fopen(outputFilename, 'w'); 15 | if fid < 0 16 | error('parallelexamples:GenericSLURM:FileError', ... 17 | 'Failed to open file %s for writing', outputFilename); 18 | end 19 | fileCloser = onCleanup(@() fclose(fid)); 20 | 21 | % Specify shell to use 22 | fprintf(fid, '#!/bin/sh\n'); 23 | 24 | % Unset all SLURM_ and SBATCH_ variables to avoid conflicting options in 25 | % nested jobs, except for SLURM_CONF which is required for the Slurm 26 | % utilities to work 27 | fprintf(fid, '%s\n', ... 28 | 'for VAR_NAME in $(env | cut -d= -f1 | grep -E ''^(SLURM_|SBATCH_)'' | grep -v ''^SLURM_CONF$''); do', ... 29 | ' unset "$VAR_NAME"', ... 30 | 'done'); 31 | 32 | commandToRun = getSubmitString(jobName, quotedLogFile, quotedWrapperPath, ... 33 | additionalSubmitArgs, jobArrayString); 34 | fprintf(fid, '%s\n', commandToRun); 35 | 36 | end 37 | -------------------------------------------------------------------------------- /private/extractJobId.m: -------------------------------------------------------------------------------- 1 | function jobID = extractJobId(sbatchCommandOutput) 2 | % Extracts the job ID from the sbatch command output for Slurm 3 | 4 | % Copyright 2015-2022 The MathWorks, Inc. 5 | 6 | % Output from sbatch expected to be in the following format: 7 | % Submitted batch job 12345 8 | % 9 | % sbatch could also attach a warning to the output, such as: 10 | % 11 | % sbatch: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1 12 | % Submitted batch job 12346 13 | 14 | % Trim sbatch command output for use in debug message 15 | trimmedCommandOutput = strtrim(sbatchCommandOutput); 16 | 17 | % Ignore anything before or after 'Submitted batch job ###', and extract the numeric value. 18 | searchPattern = '.*Submitted batch job ([0-9]+).*'; 19 | 20 | % When we match searchPattern, matchedTokens is a single entry cell array containing the jobID. 21 | % Otherwise we failed to match searchPattern, so matchedTokens is an empty cell array. 22 | matchedTokens = regexp(sbatchCommandOutput, searchPattern, 'tokens', 'once'); 23 | 24 | if isempty(matchedTokens) 25 | % Callers check for error in extracting Job ID using isempty() on return value. 26 | jobID = ''; 27 | dctSchedulerMessage(0, '%s: Failed to extract Job ID from sbatch output: \n\t%s', mfilename, trimmedCommandOutput); 28 | else 29 | jobID = matchedTokens{1}; 30 | dctSchedulerMessage(0, '%s: Job ID %s was extracted from sbatch output: \n\t%s', mfilename, jobID, trimmedCommandOutput); 31 | end 32 | 33 | end 34 | -------------------------------------------------------------------------------- /private/getCommonSubmitArgs.m: -------------------------------------------------------------------------------- 1 | function commonSubmitArgs = getCommonSubmitArgs(cluster) 2 | % Get any additional submit arguments for the Slurm sbatch command 3 | % that are common to both independent and communicating jobs. 4 | 5 | % Copyright 2016-2023 The MathWorks, Inc. 6 | 7 | commonSubmitArgs = ''; 8 | ap = cluster.AdditionalProperties; 9 | 10 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 11 | %% CUSTOMIZATION MAY BE REQUIRED %% 12 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 13 | % You may wish to support further cluster.AdditionalProperties fields here 14 | % and modify the submission command arguments accordingly. 
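% As one sketch of the pattern used below, a hypothetical
% AdditionalProperties field named GPUsPerNode could be forwarded to
% Slurm's --gpus-per-node flag like this (GPUsPerNode is not a standard
% field of this plugin):
%
%   commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ...
%       'GPUsPerNode', 'numeric', '--gpus-per-node=%d');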
15 | 16 | % Account name 17 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 18 | 'AccountName', 'char', '-A %s'); 19 | 20 | % Constraint 21 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 22 | 'Constraint', 'char', '-C %s'); 23 | 24 | % Memory required per CPU 25 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 26 | 'MemPerCPU', 'char', '--mem-per-cpu=%s'); 27 | 28 | % Partition (queue) 29 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 30 | 'Partition', 'char', '-p %s'); 31 | 32 | % Require exclusive use of requested nodes 33 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 34 | 'RequireExclusiveNode', 'logical', '--exclusive'); 35 | 36 | % Reservation 37 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 38 | 'Reservation', 'char', '--reservation=%s'); 39 | 40 | % Wall time 41 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 42 | 'WallTime', 'char', '-t %s'); 43 | 44 | % Email notification 45 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 46 | 'EmailAddress', 'char', '--mail-type=ALL --mail-user=%s'); 47 | 48 | % Catch all: directly append anything in the AdditionalSubmitArgs 49 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 50 | 'AdditionalSubmitArgs', 'char', '%s'); 51 | 52 | % Trim any whitespace 53 | commonSubmitArgs = strtrim(commonSubmitArgs); 54 | 55 | end 56 | 57 | function commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, propName, propType, submitPattern, defaultValue) 58 | % Helper fcn to append a scheduler option to the submit string. 59 | % Inputs: 60 | % commonSubmitArgs: submit string to append to 61 | % ap: AdditionalProperties object 62 | % propName: name of the property 63 | % propType: type of the property, i.e. char, double or logical 64 | % submitPattern: sprintf-style string specifying the format of the scheduler option 65 | % defaultValue (optional): value to use if the property is not specified in ap 66 | 67 | if nargin < 6 68 | defaultValue = []; 69 | end 70 | arg = validatedPropValue(ap, propName, propType, defaultValue); 71 | if ~isempty(arg) && (~islogical(arg) || arg) 72 | commonSubmitArgs = [commonSubmitArgs, ' ', sprintf(submitPattern, arg)]; 73 | end 74 | end 75 | 76 | function commonSubmitArgs = iAppendRequiredArgument(commonSubmitArgs, ap, propName, propType, submitPattern, errMsg) %#ok 77 | % Helper fcn to append a required scheduler option to the submit string. 78 | % An error is thrown if the property is not specified in AdditionalProperties or is empty. 79 | % Inputs: 80 | % commonSubmitArgs: submit string to append to 81 | % ap: AdditionalProperties object 82 | % propName: name of the property 83 | % propType: type of the property, i.e. 
char, double or logical 84 | % submitPattern: sprintf-style string specifying the format of the scheduler option 85 | % errMsg (optional): text to append to the error message if the property is not specified in ap 86 | 87 | if ~isprop(ap, propName) 88 | errorText = sprintf('Required field %s is missing from AdditionalProperties.', propName); 89 | if nargin > 5 90 | errorText = [errorText newline errMsg]; 91 | end 92 | error('parallelexamples:GenericSLURM:MissingAdditionalProperties', errorText); 93 | elseif isempty(ap.(propName)) 94 | errorText = sprintf('Required field %s is empty in AdditionalProperties.', propName); 95 | if nargin > 5 96 | errorText = [errorText newline errMsg]; 97 | end 98 | error('parallelexamples:GenericSLURM:EmptyAdditionalProperties', errorText); 99 | end 100 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, propName, propType, submitPattern); 101 | end 102 | -------------------------------------------------------------------------------- /private/getRemoteConnection.m: -------------------------------------------------------------------------------- 1 | function remoteConnection = getRemoteConnection(cluster) 2 | %GETREMOTECONNECTION Get a connected RemoteClusterAccess 3 | % 4 | % getRemoteConnection will either retrieve a RemoteClusterAccess from the 5 | % cluster's UserData or it will create a new RemoteClusterAccess. 6 | 7 | % Copyright 2010-2024 The MathWorks, Inc. 8 | 9 | % Store the current filename for the dctSchedulerMessages 10 | currFilename = mfilename; 11 | 12 | clusterHost = validatedPropValue(cluster.AdditionalProperties, 'ClusterHost', 'char'); 13 | if isempty(clusterHost) 14 | error('parallelexamples:GenericSLURM:MissingAdditionalProperties', ... 15 | 'Required field %s is missing from AdditionalProperties.', 'ClusterHost'); 16 | end 17 | 18 | if ~cluster.HasSharedFilesystem 19 | remoteJobStorageLocation = validatedPropValue(cluster.AdditionalProperties, ... 20 | 'RemoteJobStorageLocation', 'char'); 21 | if isempty(remoteJobStorageLocation) 22 | error('parallelexamples:GenericSLURM:MissingAdditionalProperties', ... 23 | 'Required field %s is missing from AdditionalProperties.', 'RemoteJobStorageLocation'); 24 | end 25 | 26 | useUniqueSubfolders = validatedPropValue(cluster.AdditionalProperties, ... 27 | 'UseUniqueSubfolders', 'logical', false); 28 | end 29 | 30 | needToCreateNewConnection = false; 31 | if isempty(cluster.UserData) 32 | needToCreateNewConnection = true; 33 | else 34 | if ~isstruct(cluster.UserData) 35 | error('parallelexamples:GenericSLURM:IncorrectUserData', ... 36 | ['Failed to retrieve remote connection from cluster''s UserData.\n' ... 37 | 'Expected cluster''s UserData to be a structure, but found %s'], ... 38 | class(cluster.UserData)); 39 | end 40 | 41 | if isfield(cluster.UserData, 'RemoteConnection') 42 | % Get the remote connection out of the cluster user data 43 | remoteConnection = cluster.UserData.RemoteConnection; 44 | 45 | % And check it is of the type that we expect 46 | if isempty(remoteConnection) || (isa(remoteConnection, "handle") && ~isvalid(remoteConnection)) 47 | needToCreateNewConnection = true; 48 | else 49 | clusterAccessClassname = 'parallel.cluster.RemoteClusterAccess'; 50 | if ~isa(remoteConnection, clusterAccessClassname) 51 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 52 | ['Failed to retrieve remote connection from cluster''s UserData.\n' ... 53 | 'Expected the RemoteConnection field of the UserData to contain an object of type %s, but found %s.'], ... 
54 | clusterAccessClassname, class(remoteConnection)); 55 | end 56 | 57 | if ~cluster.HasSharedFilesystem 58 | if useUniqueSubfolders 59 | username = remoteConnection.Username; 60 | expectedRemoteJobStorageLocation = iBuildUniqueSubfolder(remoteJobStorageLocation, ... 61 | username, iGetFileSeparator(cluster)); 62 | else 63 | expectedRemoteJobStorageLocation = remoteJobStorageLocation; 64 | end 65 | end 66 | 67 | if ~remoteConnection.IsConnected 68 | needToCreateNewConnection = true; 69 | elseif cluster.HasSharedFilesystem && ... 70 | ~strcmpi(remoteConnection.Hostname, clusterHost) 71 | % The connection stored in the user data does not match the cluster host requested 72 | warning('parallelexamples:GenericSLURM:DifferentRemoteParameters', ... 73 | ['The current cluster is already using cluster host %s.\n', ... 74 | 'The existing connection to %s will be replaced.'], ... 75 | remoteConnection.Hostname, remoteConnection.Hostname); 76 | cluster.UserData.RemoteConnection = []; 77 | needToCreateNewConnection = true; 78 | elseif ~cluster.HasSharedFilesystem && ... 79 | (~strcmpi(remoteConnection.Hostname, clusterHost) || ... 80 | ~remoteConnection.IsFileMirrorSupported || ... 81 | ~strcmpi(remoteConnection.JobStorageLocation, expectedRemoteJobStorageLocation)) 82 | % The connection stored in the user data does not match the cluster host 83 | % and remote location requested 84 | warning('parallelexamples:GenericSLURM:DifferentRemoteParameters', ... 85 | ['The current cluster is already using cluster host %s and remote job storage location %s.\n', ... 86 | 'The existing connection to %s will be replaced.'], ... 87 | remoteConnection.Hostname, remoteConnection.JobStorageLocation, remoteConnection.Hostname); 88 | cluster.UserData.RemoteConnection = []; 89 | needToCreateNewConnection = true; 90 | end 91 | end 92 | else 93 | needToCreateNewConnection = true; 94 | end 95 | end 96 | 97 | if ~needToCreateNewConnection 98 | return 99 | end 100 | 101 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 102 | %% CUSTOMIZATION MAY BE REQUIRED %% 103 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 104 | % Get the credential options from the user using simple 105 | % MATLAB dialogs or command line input. You should change 106 | % this section if you wish for users to provide their credential 107 | % options in a different way. 108 | % The pertinent options are: 109 | % username - The username you use when you run commands on the remote host 110 | % authMode - Authentication mode you use when you connect to the cluster. 111 | % Supported options are: 112 | % 'Password' - Enter your SSH password when prompted by MATLAB. 113 | % 'IdentityFile' - Use an identity file on disk. 114 | % 'Agent' - Interface with an SSH agent running on the client machine. 115 | % Supported in R2021b onwards. 116 | % 'Multifactor' - Enable the cluster to prompt you for input one or more 117 | % times. If two-factor authentication (2FA) is enabled on 118 | % the cluster, the cluster will request your password and 119 | % a response for the second authentication factor. 120 | % Supported in R2022a onwards. 121 | % identityFile - Full path to the identity file. 122 | % identityFileHasPassphrase - True if the identity file requires a passphrase 123 | % (true/false). 
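%
% For example, a user could preconfigure identity-file authentication from
% the MATLAB client so that no prompts appear (the file path is
% illustrative):
%
%   c = parcluster;
%   c.AdditionalProperties.AuthenticationMode = 'IdentityFile';
%   c.AdditionalProperties.IdentityFile = '/home/user/.ssh/id_rsa';
%   c.AdditionalProperties.IdentityFileHasPassphrase = false;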
124 | 125 | % Use the UI for prompts if MATLAB has been started with the desktop enabled 126 | useUI = iShouldUseUI(); 127 | username = iGetUsername(cluster, useUI); 128 | 129 | % Decide which authentication mode to use 130 | % Default mechanism is to prompt for password 131 | authMode = 'Password'; 132 | if isprop(cluster.AdditionalProperties, 'AuthenticationMode') 133 | % If AdditionalProperties.AuthenticationMode is defined, use that 134 | authMode = cluster.AdditionalProperties.AuthenticationMode; 135 | elseif isprop(cluster.AdditionalProperties, 'UseIdentityFile') 136 | % Otherwise use an identity file if UseIdentityFile is defined and true 137 | useIdentityFile = validatedPropValue(cluster.AdditionalProperties, 'UseIdentityFile', 'logical'); 138 | if useIdentityFile 139 | authMode = 'IdentityFile'; 140 | end 141 | elseif isprop(cluster.AdditionalProperties, 'IdentityFile') 142 | % Otherwise use an identity file if IdentityFile is defined 143 | authMode = 'IdentityFile'; 144 | else 145 | % Otherwise nothing is specified, ask the user what to do 146 | authMode = iPromptUserForAuthenticationMode(cluster, useUI); 147 | end 148 | 149 | % Build the user arguments to pass to RemoteClusterAccess 150 | userArgs = {username}; 151 | if verLessThan('matlab', '9.11') %#ok<*VERLESSMATLAB> We support back to 17a 152 | if ~ischar(authMode) || ~ismember(authMode, {'IdentityFile', 'Password'}) 153 | % Prior to R2021b, only IdentityFile and Password are supported 154 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 155 | 'AuthenticationMode must be either ''IdentityFile'' or ''Password'''); 156 | end 157 | else 158 | % No need to validate authMode, RemoteClusterAccess will do that for us 159 | userArgs = [userArgs, 'AuthenticationMode', {authMode}]; 160 | end 161 | 162 | % If using identity file, also need the filename and whether a passphrase is needed 163 | if any(strcmp(authMode, 'IdentityFile')) 164 | identityFile = iGetIdentityFile(cluster, useUI); 165 | identityFileHasPassphrase = iGetIdentityFileHasPassphrase(cluster, useUI); 166 | userArgs = [userArgs, 'IdentityFilename', {identityFile}, ... 167 | 'IdentityFileHasPassphrase', identityFileHasPassphrase]; 168 | end 169 | 170 | % Changing SSH port supported for R2021b onwards 171 | if ~verLessThan('matlab', '9.11') 172 | sshPort = validatedPropValue(cluster.AdditionalProperties, 'SSHPort', 'double'); 173 | if ~isempty(sshPort) 174 | userArgs = [userArgs, 'Port', sshPort]; 175 | end 176 | end 177 | 178 | % Now connect and store the connection 179 | dctSchedulerMessage(1, '%s: Connecting to remote host %s', ... 180 | currFilename, clusterHost); 181 | if cluster.HasSharedFilesystem 182 | remoteConnection = parallel.cluster.RemoteClusterAccess.getConnectedAccess(clusterHost, userArgs{:}); 183 | else 184 | if useUniqueSubfolders 185 | remoteJobStorageLocation = iBuildUniqueSubfolder(remoteJobStorageLocation, ... 
186 | username, iGetFileSeparator(cluster)); 187 | end 188 | remoteConnection = parallel.cluster.RemoteClusterAccess.getConnectedAccessWithMirror(clusterHost, remoteJobStorageLocation, userArgs{:}); 189 | end 190 | dctSchedulerMessage(5, '%s: Storing remote connection in cluster''s user data.', currFilename); 191 | cluster.UserData.RemoteConnection = remoteConnection; 192 | 193 | end 194 | 195 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 196 | function useUI = iShouldUseUI() 197 | if verLessThan('matlab', '9.11') 198 | % Prior to R2021b, check for Java AWT components 199 | useUI = isempty(javachk('awt')); 200 | else 201 | % From R2021b onwards, can use the desktop function 202 | useUI = desktop('-inuse'); 203 | end 204 | end 205 | 206 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 207 | function username = iGetUsername(cluster, useUI) 208 | 209 | username = validatedPropValue(cluster.AdditionalProperties, 'Username', 'char'); 210 | if ~isempty(username) 211 | return 212 | end 213 | 214 | if useUI 215 | dlgMessage = sprintf('Enter the username for %s', cluster.AdditionalProperties.ClusterHost); 216 | dlgTitle = 'User Credentials'; 217 | numlines = 1; 218 | usernameResponse = inputdlg(dlgMessage, dlgTitle, numlines); 219 | % Hitting cancel gives an empty cell array, but a user providing an empty string gives 220 | % a (non-empty) cell array containing an empty string 221 | if isempty(usernameResponse) 222 | % User hit cancel 223 | error('parallelexamples:GenericSLURM:UserCancelledOperation', ... 224 | 'User cancelled operation.'); 225 | end 226 | username = char(usernameResponse); 227 | return 228 | end 229 | 230 | % useUI == false 231 | msg = sprintf('Enter the username for %s:\n ', cluster.AdditionalProperties.ClusterHost); 232 | username = input(msg, 's'); 233 | 234 | end 235 | 236 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 237 | function identityFileHasPassphrase = iGetIdentityFileHasPassphrase(cluster, useUI) 238 | 239 | identityFileHasPassphrase = validatedPropValue( ... 240 | cluster.AdditionalProperties, 'IdentityFileHasPassphrase', 'logical'); 241 | if ~isempty(identityFileHasPassphrase) 242 | return 243 | end 244 | 245 | if useUI 246 | dlgMessage = 'Does the identity file require a password?'; 247 | dlgTitle = 'User Credentials'; 248 | passphraseResponse = questdlg(dlgMessage, dlgTitle); 249 | if strcmp(passphraseResponse, 'Cancel') 250 | % User hit cancel 251 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 252 | end 253 | identityFileHasPassphrase = strcmp(passphraseResponse, 'Yes'); 254 | return 255 | end 256 | 257 | % useUI == false 258 | validYesNoResponse = {'y', 'n'}; 259 | passphraseMessage = sprintf('Does the identity file require a password? (y or n)\n '); 260 | passphraseResponse = iLoopUntilValidStringInput(passphraseMessage, validYesNoResponse); 261 | identityFileHasPassphrase = strcmpi(passphraseResponse, 'y'); 262 | 263 | end 264 | 265 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 266 | function identityFile = iGetIdentityFile(cluster, useUI) 267 | 268 | if isprop(cluster.AdditionalProperties, 'IdentityFile') 269 | identityFile = cluster.AdditionalProperties.IdentityFile; 270 | if ~(ischar(identityFile) || isstring(identityFile) || iscellstr(identityFile)) || any(strlength(identityFile) == 0) 271 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 
272 | 'Each IdentityFile must be a nonempty character vector'); 273 | end 274 | else 275 | if useUI 276 | dlgMessage = 'Select Identity File to use'; 277 | [filename, pathname] = uigetfile({'*.*', 'All Files (*.*)'}, dlgMessage); 278 | % If the user hit cancel, then filename and pathname will both be 0. 279 | if isequal(filename, 0) && isequal(pathname,0) 280 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 281 | end 282 | identityFile = fullfile(pathname, filename); 283 | else 284 | msg = sprintf('Please enter the full path to the Identity File to use:\n '); 285 | identityFile = input(msg, 's'); 286 | end 287 | end 288 | 289 | end 290 | 291 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 292 | function authMode = iPromptUserForAuthenticationMode(cluster, useUI) 293 | 294 | promptMessage = sprintf('Select an authentication method to log in to %s', cluster.AdditionalProperties.ClusterHost); 295 | options = {'Password', 'Identity File', 'Cancel'}; 296 | 297 | if useUI 298 | dlgTitle = 'User Credentials'; 299 | defaultOption = 'Password'; 300 | authMode = questdlg(promptMessage, dlgTitle, options{:}, defaultOption); 301 | authMode = strrep(authMode, ' ', ''); 302 | if strcmp(authMode, 'Cancel') || isempty(authMode) 303 | % User hit cancel or closed the window 304 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 305 | end 306 | else 307 | validResponses = {'1', '2', '3'}; 308 | displayItems = [validResponses; options]; 309 | identityFileMessage = [promptMessage, newline, sprintf('%s) %s\n', displayItems{:}), ' ']; 310 | response = iLoopUntilValidStringInput(identityFileMessage, validResponses); 311 | switch response 312 | case '1' 313 | authMode = 'Password'; 314 | case '2' 315 | authMode = 'IdentityFile'; 316 | otherwise 317 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 318 | end 319 | end 320 | 321 | end 322 | 323 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 324 | function returnValue = iLoopUntilValidStringInput(message, validValues) 325 | % Function to loop until a valid response is obtained user input 326 | returnValue = ''; 327 | 328 | while isempty(returnValue) || ~any(strcmpi(returnValue, validValues)) 329 | returnValue = input(message, 's'); 330 | end 331 | end 332 | 333 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 334 | function subfolder = iBuildUniqueSubfolder(remoteJobStorageLocation, username, fileSeparator) 335 | % Function to build unique location using username and MATLAB release version 336 | release = ['R' version('-release')]; 337 | subfolder = [remoteJobStorageLocation fileSeparator username fileSeparator release]; 338 | end 339 | 340 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 341 | function fileSeparator = iGetFileSeparator(cluster) 342 | % Function to return file separator for cluster operating system 343 | if strcmpi(cluster.OperatingSystem, 'unix') 344 | fileSeparator = '/'; 345 | else 346 | fileSeparator = '\'; 347 | end 348 | end 349 | -------------------------------------------------------------------------------- /private/getSimplifiedSchedulerIDsForJob.m: -------------------------------------------------------------------------------- 1 | function [schedulerIDs, numTasks] = getSimplifiedSchedulerIDsForJob(job) 2 | %GETSIMPLIFIEDSCHEDULERIDSFORJOB Returns the smallest possible list of Slurm JobIDs that describe the MATLAB job. 
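%
% As a worked example: a job whose five tasks have SchedulerIDs
% {'1234_1','1234_2','1234_3','1234_4','1235'} (four tasks submitted in
% job array 1234 plus standalone job 1235) yields
% SCHEDULERIDS = {'1234','1235'} and NUMTASKS = 5.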
3 | % 4 | % SCHEDULERIDS = getSimplifiedSchedulerIDsForJob(JOB) returns the smallest 5 | % possible list of Slurm job IDs that describe the MATLAB job JOB. The 6 | % function converts child job IDs of a job array to the parent job ID of 7 | % the array, and removes any duplicates. 8 | % 9 | % [SCHEDULERIDS, NUMTASKS] = getSimplifiedSchedulerIDsForJob(JOB) also 10 | % returns the number of tasks that SCHEDULERIDS represents. 11 | 12 | % Copyright 2019-2022 The MathWorks, Inc. 13 | 14 | if verLessThan('matlab', '9.7') % schedulerID stored in job data 15 | data = job.Parent.getJobClusterData(job); 16 | schedulerIDs = data.ClusterJobIDs; 17 | else % schedulerID on task since 19b 18 | schedulerIDs = job.getTaskSchedulerIDs(); 19 | end 20 | numTasks = numel(schedulerIDs); 21 | 22 | % Child jobs within a job array will have a schedulerID of the form 23 | % _. 24 | schedulerIDs = regexprep(schedulerIDs, '_\d+', ''); 25 | schedulerIDs = unique(schedulerIDs, 'stable'); 26 | end 27 | -------------------------------------------------------------------------------- /private/getSubmitString.m: -------------------------------------------------------------------------------- 1 | function submitString = getSubmitString(jobName, quotedLogFile, quotedCommand, ... 2 | additionalSubmitArgs, jobArrayString) 3 | %GETSUBMITSTRING Gets the correct sbatch command for a Slurm cluster 4 | 5 | % Copyright 2010-2023 The MathWorks, Inc. 6 | 7 | if ~isempty(jobArrayString) 8 | jobArrayString = strcat('--array=''[', jobArrayString, ']'''); 9 | end 10 | 11 | submitString = sprintf('sbatch --job-name=%s %s --output=%s --export=NONE %s %s', ... 12 | jobName, jobArrayString, quotedLogFile, additionalSubmitArgs, quotedCommand); 13 | 14 | end 15 | -------------------------------------------------------------------------------- /private/runSchedulerCommand.m: -------------------------------------------------------------------------------- 1 | function [status, result] = runSchedulerCommand(cluster, cmd) 2 | %RUNSCHEDULERCOMMAND Run a command on the cluster. 3 | 4 | % Copyright 2019-2024 The MathWorks, Inc. 5 | 6 | persistent wrapper 7 | 8 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 9 | % Need to run the command over SSH 10 | remoteConnection = getRemoteConnection(cluster); 11 | [status, result] = remoteConnection.runCommand(cmd); 12 | else 13 | % Can shell out 14 | if isunix 15 | % Some scheduler utility commands on unix return exit codes > 127, which 16 | % MATLAB interprets as a fatal signal. This is not the case here, so wrap 17 | % the system call to the scheduler on UNIX within a shell script to 18 | % sanitize any exit codes in this range. 
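% For example, some Slurm utilities exit with status 255 on error; without
% the wrapper, system() would report that code as a terminating signal
% rather than an ordinary nonzero exit status.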
19 | if isempty(wrapper) 20 | wrapper = iBuildWrapperPath(); 21 | end 22 | cmd = sprintf('%s %s', wrapper, cmd); 23 | end 24 | [status, result] = system(cmd); 25 | end 26 | 27 | end 28 | 29 | function wrapper = iBuildWrapperPath() 30 | if verLessThan('matlab', '9.7') 31 | pctDir = toolboxdir('distcomp'); %#ok<*DCRENAME> 32 | elseif verLessThan('matlab', '25.1') %#ok<*VERLESSMATLAB> 33 | pctDir = toolboxdir('parallel'); 34 | else 35 | pctDir = fullfile(matlabroot, 'toolbox', 'parallel'); 36 | end 37 | wrapper = fullfile(pctDir, 'bin', 'util', 'shellWrapper.sh'); 38 | end 39 | -------------------------------------------------------------------------------- /private/validatedPropValue.m: -------------------------------------------------------------------------------- 1 | function val = validatedPropValue(ap, prop, type, defaultValue) 2 | % If prop is in the AdditionalProperties ap, validate the value is the correct 3 | % type and return it. If prop is not present, return the provided defaultValue. 4 | % If prop is not present and no defaultValue is provided, returns empty. 5 | 6 | % Copyright 2022 The MathWorks, Inc. 7 | 8 | narginchk(3, 4); 9 | 10 | if nargin < 4 11 | % If no defaultValue specified, use empty 12 | defaultValue = []; 13 | end 14 | 15 | if ~isprop(ap, prop) 16 | % prop is not present in ap, use the defaultValue 17 | val = defaultValue; 18 | return 19 | end 20 | 21 | % If we get here then prop is in ap 22 | val = ap.(prop); 23 | switch type 24 | case {'char', 'string'} 25 | validator = @(x) ischar(x) || isstring(x); 26 | case {'double', 'numeric'} 27 | validator = @isnumeric; 28 | case {'bool', 'logical'} 29 | validator = @islogical; 30 | otherwise 31 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 32 | 'Not a valid data type'); 33 | end 34 | 35 | % If the property is not empty, verify that it is set to the correct type: 36 | % char, double, or logical. 37 | if ~isempty(val) && ~validator(val) 38 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 39 | 'Expected property ''%s'' to be of type %s, but it has type %s.', prop, type, class(val)); 40 | end 41 | 42 | end 43 | --------------------------------------------------------------------------------