├── README.md ├── SECURITY.md ├── cancelJobFcn.m ├── cancelTaskFcn.m ├── communicatingJobWrapper.sh ├── communicatingJobWrapperSmpd.sh ├── communicatingSubmitFcn.m ├── deleteJobFcn.m ├── deleteTaskFcn.m ├── discover ├── example.conf └── runDiscovery.sh ├── getJobStateFcn.m ├── independentJobWrapper.sh ├── independentSubmitFcn.m ├── license.txt ├── postConstructFcn.m └── private ├── cancelJobOnCluster.m ├── cancelTaskOnCluster.m ├── createEnvironmentWrapper.m ├── createSubmitScript.m ├── extractJobId.m ├── getCommonSubmitArgs.m ├── getRemoteConnection.m ├── getSimplifiedSchedulerIDsForJob.m ├── getSubmitString.m ├── runSchedulerCommand.m └── validatedPropValue.m /README.md: -------------------------------------------------------------------------------- 1 | # Parallel Computing Toolbox plugin for MATLAB Parallel Server with Slurm 2 | 3 | [![View Parallel Computing Toolbox Plugin for Slurm on File Exchange](https://www.mathworks.com/matlabcentral/images/matlab-file-exchange.svg)](https://www.mathworks.com/matlabcentral/fileexchange/127364-parallel-computing-toolbox-plugin-for-slurm) 4 | 5 | Parallel Computing Toolbox™ provides the `Generic` cluster type for submitting MATLAB® jobs to a cluster running a third-party scheduler. 6 | The `Generic` cluster type uses a set of plugin scripts to define how your machine communicates with your scheduler. 7 | You can customize the plugin scripts to configure how MATLAB interacts with the scheduler to best suit your cluster setup and support custom submission options. 8 | 9 | This repository contains MATLAB code files and shell scripts that you can use to submit jobs from a MATLAB or Simulink session running on Windows®, Linux®, or macOS operating systems to a Slurm® scheduler running on Linux. 10 | 11 | ## Products Required 12 | 13 | - [MATLAB](https://www.mathworks.com/products/matlab.html) and [Parallel Computing Toolbox](https://www.mathworks.com/products/parallel-computing.html), R2017a or newer, installed on your computer. 14 | Refer to the documentation for [how to install MATLAB and toolboxes](https://www.mathworks.com/help/install/index.html) on your computer. 15 | - [MATLAB Parallel Server™](https://www.mathworks.com/products/matlab-parallel-server.html) installed on the cluster. 16 | Refer to the documentation for [how to install MATLAB Parallel Server](https://www.mathworks.com/help/matlab-parallel-server/integrate-matlab-with-third-party-schedulers.html) on your cluster. 17 | The cluster administrator normally does this step. 18 | - [Slurm](https://slurm.schedmd.com/) running on the cluster. 19 | 20 | ## Setup Instructions 21 | 22 | ### Download or Clone this Repository 23 | 24 | To download a zip archive of this repository, at the top of this repository page, select **Code > Download ZIP**. 25 | Alternatively, to clone this repository to your computer with Git software installed, enter this command at your system's command line: 26 | ``` 27 | git clone https://github.com/mathworks/matlab-parallel-slurm-plugin 28 | ``` 29 | You can execute a system command from the MATLAB command prompt by adding `!` before the command. 30 | 31 | ### Cluster Discovery 32 | 33 | Since version R2023a, MATLAB can discover clusters running third-party schedulers such as Slurm. 34 | As a cluster admin, you can create a configuration file that describes how to configure the Parallel Computing Toolbox on the user's machine to submit MATLAB jobs to the cluster. 
35 | The cluster configuration file is a plain text file with the extension `.conf` containing key-value pairs that describe the cluster configuration information. 36 | The MATLAB client will use the cluster configuration file to create a cluster profile for the user who discovers the cluster. 37 | Therefore, users will not need to follow the instructions in the sections below. 38 | You can find an example of a cluster configuration file in [discover/example.conf](discover/example.conf). 39 | For full details on how to make a cluster running a third-party scheduler discoverable, see the documentation for [Configure for Third-Party Scheduler Cluster Discovery](https://www.mathworks.com/help/matlab-parallel-server/configure-for-cluster-discovery.html). 40 | 41 | ### Create a Cluster Profile in MATLAB 42 | 43 | Create a cluster profile by using either the Cluster Profile Manager or the MATLAB Command Window. 44 | 45 | To open the Cluster Profile Manager, on the **Home** tab, in the **Environment** section, select **Parallel > Create and Manage Clusters**. 46 | In the Cluster Profile Manager, select **Add Cluster Profile > Generic** from the menu to create a new `Generic` cluster profile. 47 | 48 | Alternatively, create a new `Generic` cluster object by entering this command in the MATLAB Command Window: 49 | ```matlab 50 | c = parallel.cluster.Generic; 51 | ``` 52 | 53 | ### Configure Cluster Properties 54 | 55 | This table lists the properties that you must specify to configure the `Generic` cluster profile. 56 | For a full list of cluster properties, see the documentation for [`parallel.Cluster`](https://www.mathworks.com/help/parallel-computing/parallel.cluster.html). 57 | 58 | **Property** | **Description** 59 | ------------------------|---------------- 60 | `JobStorageLocation` | Folder in which your machine stores job data. 61 | `NumWorkers` | Number of workers your license allows. 62 | `ClusterMatlabRoot` | Full path to the MATLAB install folder on the cluster. 63 | `OperatingSystem` | Cluster operating system. 64 | `HasSharedFilesystem` | Indication of whether you have a shared file system. Set this property to `true` if a disk location is accessible to your machine and the workers on the cluster. Set this property to `false` if you do not have a shared file system. 65 | `PluginScriptsLocation` | Full path to the plugin script folder that contains this README. If using R2019a or earlier, this property is called IntegrationScriptsLocation. 66 | 67 | In the Cluster Profile Manager, set each property value. 68 | Alternatively, at the command line, set properties using dot notation: 69 | ```matlab 70 | c.JobStorageLocation = 'C:\MatlabJobs'; 71 | ``` 72 | 73 | At the command line, you can also set properties when you create the `Generic` cluster object by using name-value arguments: 74 | ```matlab 75 | c = parallel.cluster.Generic( ... 76 | 'JobStorageLocation', 'C:\MatlabJobs', ... 77 | 'NumWorkers', 20, ... 78 | 'ClusterMatlabRoot', '/usr/local/MATLAB/R2022a', ... 79 | 'OperatingSystem', 'unix', ... 80 | 'HasSharedFilesystem', true, ... 81 | 'PluginScriptsLocation', 'C:\MatlabSlurmPlugin\shared'); 82 | ``` 83 | 84 | To submit from a Windows machine to a Linux cluster, specify `JobStorageLocation` as a structure with the fields `windows` and `unix`. 85 | The fields are the Windows and Unix paths corresponding to the folder in which your machine stores job data. 
86 | For example, if the folder `\\organization\matlabjobs\jobstorage` on Windows corresponds to `/organization/matlabjobs/jobstorage` on the Unix cluster: 87 | ```matlab 88 | struct('windows', '\\organization\matlabjobs\jobstorage', 'unix', '/organization/matlabjobs/jobstorage') 89 | ``` 90 | If you have your `M:` drive mapped to `\\organization\matlabjobs`, set `JobStorageLocation` to: 91 | ```matlab 92 | struct('windows', 'M:\jobstorage', 'unix', '/organization/matlabjobs/jobstorage') 93 | ``` 94 | 95 | You can use `AdditionalProperties` to modify the behaviour of the `Generic` cluster without editing the plugin scripts. 96 | For a full list of the `AdditionalProperties` supported by the plugin scripts in this repository, see [Customize Behavior of Sample Plugin Scripts](https://www.mathworks.com/help/matlab-parallel-server/customize-behavior-of-sample-plugin-scripts.html). 97 | By modifying the plugins, you can add support for your own custom `AdditionalProperties`. 98 | 99 | #### Connect to a Remote Cluster 100 | 101 | To manage work on the cluster, MATLAB calls the Slurm command line utilities. 102 | For example, the `sbatch` command to submit work and `squeue` to query the state of submitted jobs. 103 | If your MATLAB session is running on a machine with the scheduler utilities available, the plugin scripts can call the utilities on the command line. 104 | Scheduler utilities are typically available if your MATLAB session is running on the Slurm cluster to which you want to submit. 105 | 106 | If MATLAB cannot directly access the scheduler utilities on the command line, the plugin scripts create an SSH session to the cluster and run scheduler commands over that connection. 107 | To configure your cluster to submit scheduler commands via SSH, set the `ClusterHost` field of `AdditionalProperties` to the name of the cluster node to which MATLAB connects via SSH. 108 | As MATLAB will run scheduler utilities such as `sbatch` and `squeue`, select the cluster head node or login node. 109 | 110 | In the Cluster Profile Manager, add new `AdditionalProperties` by clicking **Add** under the table corresponding to `AdditionalProperties`. 111 | In the Command Window, use dot notation to add new fields. 112 | For example, if MATLAB should connect to `'slurm01.organization.com'` to submit jobs, set: 113 | ```matlab 114 | c.AdditionalProperties.ClusterHost = 'slurm01.organization.com'; 115 | ``` 116 | 117 | Use this option to connect to a remote cluster to submit jobs from a MATLAB session on a Windows computer to a Linux Slurm cluster on the same network. 118 | Your Windows machine creates an SSH session to the cluster head node to access the Slurm utilities and uses a shared network folder to store job data files. 119 | 120 | If your MATLAB session is running on a compute node of the cluster to which you want to submit work, you can use this option to create an SSH session back to the cluster head node and submit more jobs. 121 | 122 | #### Run Jobs on a Remote Cluster Without a Shared File System 123 | 124 | MATLAB uses files on disk to send tasks to the Parallel Server workers and fetch their results. 125 | This is most effective when the disk location is accessible to your machine and the workers on the cluster. 126 | Your computer can communicate with the workers by reading and writing to this shared file system. 
127 | 
128 | If you do not have a shared file system, MATLAB uses SSH to submit commands to the scheduler and SFTP (SSH File Transfer Protocol) to copy job and task files between your computer and the cluster.
129 | To configure your cluster to move files between the client and the cluster with SFTP, set the `RemoteJobStorageLocation` field of `AdditionalProperties` to a folder on the cluster that the workers can access.
130 | 
131 | Transferring large data files (for example, hundreds of MB) over the SFTP connection can add a noticeable overhead to job submission and fetching results.
132 | For optimal performance, use a shared file system if one is available.
133 | The workers require access to a shared file system, even if your computer cannot access it.
134 | 
135 | ### Save New Profile
136 | 
137 | In the Cluster Profile Manager, click **Done**.
138 | Alternatively, in the Command Window, enter the command:
139 | ```matlab
140 | saveAsProfile(c, 'mySLURMCluster');
141 | ```
142 | Your cluster profile is now ready to use.
143 | 
144 | ### Validate Cluster Profile
145 | 
146 | Cluster validation submits one job of each type to test whether the cluster profile is configured correctly.
147 | In the Cluster Profile Manager, click **Validate**.
148 | If you make a change to a cluster profile, run cluster validation to ensure your changes have introduced no errors.
149 | You do not need to validate the profile each time you use it or each time you start MATLAB.
150 | 
151 | ## Examples
152 | 
153 | Create a cluster object using your profile:
154 | ```matlab
155 | c = parcluster("mySLURMCluster")
156 | ```
157 | 
158 | ### Submit Work for Batch Processing
159 | 
160 | The `batch` command runs a MATLAB script or function on a worker on the cluster.
161 | For more information about batch processing, see the documentation for [`batch`](https://www.mathworks.com/help/parallel-computing/batch.html).
162 | 
163 | ```matlab
164 | % Create a job and submit it to the cluster.
165 | job = batch( ...
166 | c, ... % Cluster object created using parcluster
167 | @sqrt, ... % Function or script to run
168 | 1, ... % Number of output arguments
169 | {[64 100]}); % Input arguments
170 | 
171 | % Your MATLAB session is now available to do other work, such
172 | % as create and submit more jobs to the cluster. You can also
173 | % shut down your MATLAB session and come back later. The work
174 | % continues running on the cluster. After you recreate the
175 | % cluster object using parcluster, you can view existing jobs
176 | % using the Jobs property of the cluster object.
177 | 
178 | % Wait for the job to complete. If the job is already complete,
179 | % the wait function will return immediately.
180 | wait(job);
181 | 
182 | % Retrieve the output arguments for each task. For this example,
183 | % the output is a 1x1 cell array containing the vector [8 10].
184 | results = fetchOutputs(job)
185 | ```
186 | 
187 | ### Open Parallel Pool
188 | 
189 | A parallel pool (parpool) is a group of MATLAB workers on which you can interactively run work.
190 | When you run the `parpool` command, MATLAB submits a special job to the cluster to start the workers.
191 | Once the workers start, your MATLAB session connects to them.
192 | Depending on your organization's network configuration, parallel pools might not work; for example, the network might not permit connections to a program running on a compute node.
193 | For more information about parpools, see the documentation for [`parpool`](https://www.mathworks.com/help/parallel-computing/parpool.html). 194 | 195 | ```matlab 196 | % Open a parallel pool on the cluster. This command returns a 197 | % pool object once the pool is opened. 198 | pool = parpool(c); 199 | 200 | % List the hosts on which the workers are running. For a small pool, 201 | % all the workers are typically on the same machine. For a large 202 | % pool, the workers are usually spread over multiple nodes. 203 | future = parfevalOnAll(pool, @getenv, 1, 'HOST') 204 | wait(future); 205 | fetchOutputs(future) 206 | 207 | % Output the numbers 1 to 10 in a parallel for loop. Unlike a 208 | % regular for loop, the software does not execute iterations 209 | % of the loop in order. 210 | parfor idx = 1:10 211 | disp(idx) 212 | end 213 | 214 | % Use the pool to calculate the first 500 magic squares. 215 | parfor idx = 1:500 216 | magicSquare{idx} = magic(idx); 217 | end 218 | ``` 219 | 220 | ## License 221 | 222 | The license is available in the [license.txt](license.txt) file in this repository. 223 | 224 | ## Community Support 225 | 226 | [MATLAB Central](https://www.mathworks.com/matlabcentral) 227 | 228 | ## Technical Support 229 | 230 | If you require assistance or have a request for additional features or capabilities, please contact [MathWorks Technical Support](https://www.mathworks.com/support/contact_us.html). 231 | 232 | Copyright 2022-2023 The MathWorks, Inc. 233 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Reporting Security Vulnerabilities 2 | 3 | If you believe you have discovered a security vulnerability, please report it to 4 | [security@mathworks.com](mailto:security@mathworks.com). Please see 5 | [MathWorks Vulnerability Disclosure Policy for Security Researchers](https://www.mathworks.com/company/aboutus/policies_statements/vulnerability-disclosure-policy.html) 6 | for additional information. 7 | -------------------------------------------------------------------------------- /cancelJobFcn.m: -------------------------------------------------------------------------------- 1 | function OK = cancelJobFcn(cluster, job) 2 | %CANCELJOBFCN Cancels a job on Slurm 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you cancel a job. 6 | 7 | % Copyright 2010-2023 The MathWorks, Inc. 8 | 9 | OK = cancelJobOnCluster(cluster, job); 10 | 11 | end 12 | -------------------------------------------------------------------------------- /cancelTaskFcn.m: -------------------------------------------------------------------------------- 1 | function OK = cancelTaskFcn(cluster, task) 2 | %CANCELTASKFCN Cancels a task on Slurm 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you cancel a task. 6 | 7 | % Copyright 2020-2023 The MathWorks, Inc. 8 | 9 | OK = cancelTaskOnCluster(cluster, task); 10 | 11 | end 12 | -------------------------------------------------------------------------------- /communicatingJobWrapper.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | # This wrapper script is intended to be submitted to Slurm to support 3 | # communicating jobs. 
4 | # 5 | # This script uses the following environment variables set by the submit MATLAB code: 6 | # PARALLEL_SERVER_CMR - the value of ClusterMatlabRoot (may be empty) 7 | # PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use 8 | # PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use 9 | # PARALLEL_SERVER_TOTAL_TASKS - total number of workers to start 10 | # PARALLEL_SERVER_NUM_THREADS - number of cores needed per worker 11 | # PARALLEL_SERVER_DEBUG - used to debug problems on the cluster 12 | # 13 | # The following environment variables are forwarded through mpiexec: 14 | # PARALLEL_SERVER_DECODE_FUNCTION - the decode function to use 15 | # PARALLEL_SERVER_STORAGE_LOCATION - used by decode function 16 | # PARALLEL_SERVER_STORAGE_CONSTRUCTOR - used by decode function 17 | # PARALLEL_SERVER_JOB_LOCATION - used by decode function 18 | # 19 | # The following environment variables are set by Slurm: 20 | # SLURM_NODELIST - list of hostnames allocated to this Slurm job 21 | 22 | # Copyright 2015-2025 The MathWorks, Inc. 23 | 24 | # If PARALLEL_SERVER_ environment variables are not set, assign any 25 | # available values with form MDCE_ for backwards compatibility 26 | PARALLEL_SERVER_CMR=${PARALLEL_SERVER_CMR:="${MDCE_CMR}"} 27 | PARALLEL_SERVER_MATLAB_EXE=${PARALLEL_SERVER_MATLAB_EXE:="${MDCE_MATLAB_EXE}"} 28 | PARALLEL_SERVER_MATLAB_ARGS=${PARALLEL_SERVER_MATLAB_ARGS:="${MDCE_MATLAB_ARGS}"} 29 | PARALLEL_SERVER_TOTAL_TASKS=${PARALLEL_SERVER_TOTAL_TASKS:="${MDCE_TOTAL_TASKS}"} 30 | PARALLEL_SERVER_NUM_THREADS=${PARALLEL_SERVER_NUM_THREADS:="${MDCE_NUM_THREADS}"} 31 | PARALLEL_SERVER_DEBUG=${PARALLEL_SERVER_DEBUG:="${MDCE_DEBUG}"} 32 | 33 | # Other environment variables to forward 34 | PARALLEL_SERVER_GENVLIST="${PARALLEL_SERVER_GENVLIST},HOME,USER" 35 | 36 | # Echo the nodes that the scheduler has allocated to this job: 37 | echo -e "The scheduler has allocated the following nodes to this job:\n${SLURM_NODELIST:?"Node list undefined"}" 38 | 39 | # Create full path to mw_mpiexec if needed. 40 | FULL_MPIEXEC=${PARALLEL_SERVER_CMR:+${PARALLEL_SERVER_CMR}/bin/}mw_mpiexec 41 | 42 | # Label stdout/stderr with the rank of the process 43 | MPI_VERBOSE=-l 44 | 45 | # Increase the verbosity of mpiexec if PARALLEL_SERVER_DEBUG is set and not false 46 | if [ ! -z "${PARALLEL_SERVER_DEBUG}" ] && [ "${PARALLEL_SERVER_DEBUG}" != "false" ] ; then 47 | MPI_VERBOSE="${MPI_VERBOSE} -v -print-all-exitcodes" 48 | fi 49 | 50 | if [ ! -z "${PARALLEL_SERVER_BIND_TO_CORE}" ] && [ "${PARALLEL_SERVER_BIND_TO_CORE}" != "false" ] ; then 51 | BIND_TO_CORE_ARG="-bind-to core:${PARALLEL_SERVER_NUM_THREADS}" 52 | else 53 | BIND_TO_CORE_ARG="" 54 | fi 55 | 56 | # Construct the command to run. 57 | CMD="\"${FULL_MPIEXEC}\" \ 58 | ${PARALLEL_SERVER_MPIEXEC_ARG} \ 59 | -genvlist ${PARALLEL_SERVER_GENVLIST} \ 60 | ${BIND_TO_CORE_ARG} \ 61 | ${MPI_VERBOSE} \ 62 | -n ${PARALLEL_SERVER_TOTAL_TASKS} \ 63 | \"${PARALLEL_SERVER_MATLAB_EXE}\" \ 64 | ${PARALLEL_SERVER_MATLAB_ARGS}" 65 | 66 | # Echo the command so that it is shown in the output log. 67 | echo $CMD 68 | 69 | # Execute the command. 70 | eval $CMD 71 | 72 | MPIEXEC_EXIT_CODE=${?} 73 | if [ ${MPIEXEC_EXIT_CODE} -eq 42 ] ; then 74 | # Get here if user code errored out within MATLAB. Overwrite this to zero in 75 | # this case. 
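# (A non-zero exit status here would cause Slurm to mark the whole job as
# failed even though MATLAB itself ran; the user-code error is typically
# reported back through the task's Error property instead, so zero is the
# more accurate status to hand to the scheduler.)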
76 | echo "Overwriting MPIEXEC exit code from 42 to zero (42 indicates a user-code failure)"
77 | MPIEXEC_EXIT_CODE=0
78 | fi
79 | echo "Exiting with code: ${MPIEXEC_EXIT_CODE}"
80 | exit ${MPIEXEC_EXIT_CODE}
81 | -------------------------------------------------------------------------------- /communicatingJobWrapperSmpd.sh: --------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # This wrapper script is intended to be submitted to Slurm to support
3 | # communicating jobs.
4 | #
5 | # This script uses the following environment variables set by the submit MATLAB code:
6 | # PARALLEL_SERVER_CMR - the value of ClusterMatlabRoot (might be empty)
7 | # PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
8 | # PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
9 | #
10 | # The following environment variables are forwarded through mpiexec:
11 | # PARALLEL_SERVER_DECODE_FUNCTION - the decode function to use
12 | # PARALLEL_SERVER_STORAGE_LOCATION - used by decode function
13 | # PARALLEL_SERVER_STORAGE_CONSTRUCTOR - used by decode function
14 | # PARALLEL_SERVER_JOB_LOCATION - used by decode function
15 | 
16 | # The following environment variables are set by Slurm:
17 | # SLURM_JOB_ID - id of the Slurm job
18 | # SLURM_JOB_NUM_NODES - number of hosts allocated to Slurm job
19 | # SLURM_JOB_NODELIST - list of hostnames allocated to Slurm job
20 | # SLURM_TASKS_PER_NODE - list containing number of tasks allocated per host to Slurm job
21 | 
22 | # Copyright 2015-2024 The MathWorks, Inc.
23 | 
24 | # If PARALLEL_SERVER_ environment variables are not set, assign any
25 | # available values with form MDCE_ for backwards compatibility
26 | PARALLEL_SERVER_CMR=${PARALLEL_SERVER_CMR:="${MDCE_CMR}"}
27 | PARALLEL_SERVER_MATLAB_EXE=${PARALLEL_SERVER_MATLAB_EXE:="${MDCE_MATLAB_EXE}"}
28 | PARALLEL_SERVER_MATLAB_ARGS=${PARALLEL_SERVER_MATLAB_ARGS:="${MDCE_MATLAB_ARGS}"}
29 | 
30 | # Other environment variables to forward
31 | PARALLEL_SERVER_GENVLIST="${PARALLEL_SERVER_GENVLIST},HOME,USER"
32 | 
33 | # Users of Slurm older than v1.1.34 should uncomment the following code
34 | # to enable mapping from old Slurm environment variables:
35 | 
36 | # SLURM_JOB_ID=${SLURM_JOBID}
37 | # SLURM_JOB_NUM_NODES=${SLURM_NNODES}
38 | # SLURM_JOB_NODELIST=${SLURM_NODELIST}
39 | 
40 | # Create full paths to mw_smpd/mw_mpiexec if needed
41 | FULL_SMPD=${PARALLEL_SERVER_CMR:+${PARALLEL_SERVER_CMR}/bin/}mw_smpd
42 | FULL_MPIEXEC=${PARALLEL_SERVER_CMR:+${PARALLEL_SERVER_CMR}/bin/}mw_mpiexec
43 | 
44 | #########################################################################################
45 | # Work out where we need to launch SMPDs given our hosts file - defines SMPD_HOSTS
46 | chooseSmpdHosts() {
47 | 
48 | # SLURM_JOB_NODELIST is required: the following line either echoes the value, or aborts.
49 | echo Node file: ${SLURM_JOB_NODELIST:?"Node file undefined"}
50 | 
51 | # SMPD_HOSTS is a single-line, space-separated list of hostnames:
52 | # node136 node138 node140 node141 node142 node143 node157
53 | #
54 | # Our source of information is SLURM_JOB_NODELIST in the form:
55 | # node[136,138],node[140-143],node157
56 | #
57 | # 'scontrol show hostname ${SLURM_JOB_NODELIST}' produces multi-line list of hostnames:
58 | # node136
59 | # node138
60 | # node140
61 | # ...
62 | #
63 | # Pipe through "tr" to convert newlines to spaces.
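# This leaves SMPD_HOSTS in the single-line form shown above:
# node136 node138 node140 ...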
64 | 
65 | SMPD_HOSTS=`scontrol show hostname ${SLURM_JOB_NODELIST} | tr '\n', ' '`
66 | }
67 | 
68 | #########################################################################################
69 | # Work out which port to use for SMPD
70 | chooseSmpdPort() {
71 | 
72 | # Extract the numeric part of SLURM_JOB_ID using sed to choose unique port for SMPD to run on.
73 | # Assumes SLURM_JOB_ID starts with a number, such as: 15.slurm-server-host.domain.com
74 | JOB_NUM=`echo ${SLURM_JOB_ID:?"SLURM_JOB_ID undefined"} | sed 's#^\([0-9][0-9]*\).*$#\1#'`
75 | # Base smpd_port on the numeric part of the above
76 | SMPD_PORT=`expr $JOB_NUM % 10000 + 20000`
77 | }
78 | 
79 | #########################################################################################
80 | # Work out how many processes to launch - set MACHINE_ARG
81 | #
82 | # Inputs:
83 | # SLURM_JOB_NUM_NODES Slurm environment variable: Number of nodes allocated to Slurm job
84 | #
85 | # SMPD_HOSTS Space-separated list of hostnames of nodes set by chooseSmpdHosts
86 | #
87 | # SLURM_TASKS_PER_NODE Slurm environment variable: Number of tasks allocated per node.
88 | # If two or more consecutive nodes have the same task count,
89 | # that count is followed by "(x#)" where "#" is the repetition count.
90 | # Output:
91 | # MACHINE_ARG Arguments to pass to mpiexec in the form:
92 | # -hosts num_hosts host1 tasks_on_host1 host2 tasks_on_host2
93 | #
94 | # Example
95 | # -------
96 | # Inputs:
97 | # SLURM_JOB_NUM_NODES 7
98 | # SMPD_HOSTS node136 node138 node140 node141 node142 node143 node157
99 | # SLURM_TASKS_PER_NODE 12(x4),7,9(x2)
100 | # Output:
101 | # -hosts 7 node136 12 node138 12 node140 12 node141 12 node142 7 node143 9 node157 9
102 | #
103 | chooseMachineArg() {
104 | 
105 | # Transform SLURM_TASKS_PER_NODE into TASKS_PER_NODE_LIST
106 | #
107 | # Examples: SLURM_TASKS_PER_NODE -> TASKS_PER_NODE_LIST
108 | # ------- -------------------- -------------------
109 | # Single node has 12 tasks 12 -> 12
110 | # Three nodes have 12 tasks 12(x3) -> 12,12,12
111 | # First two nodes have 7 tasks, the third has 8 tasks 7(x2),8 -> 7,7,8
112 | 
113 | TASKS_PER_NODE_LIST=''
114 | # Replace commas with spaces to create space delimited list to use with for loop
115 | LIST_FROM_SLURM=`echo ${SLURM_TASKS_PER_NODE} | sed 's/,/ /g'`
116 | for ITEM in ${LIST_FROM_SLURM}
117 | do
118 | if [ `echo ${ITEM} | grep -e '^[0-9][0-9]*$' -c` -eq 1 ] ; then
119 | # "NUM_TASKS" == "NUM_TASKS(x1)"
120 | NUM_NODES=1
121 | NUM_TASKS=${ITEM}
122 | else
123 | # "NUM_TASKS(xNUM_NODES)"
124 | NUM_NODES=`echo $ITEM | sed 's/^[0-9][0-9]*(x\([0-9][0-9]*\))$/\1/'`
125 | NUM_TASKS=`echo $ITEM | sed 's/^\([0-9][0-9]*\)(x[0-9][0-9]*)$/\1/'`
126 | fi
127 | 
128 | # Repeat NUM_NODES iterations: append NUM_TASKS to TASKS_PER_NODE_LIST
129 | COUNT=0
130 | while [ ${COUNT} -lt ${NUM_NODES} ]
131 | do
132 | if [ -z "${TASKS_PER_NODE_LIST}" ] ; then
133 | # List empty, therefore adding first item to list - avoid adding comma
134 | TASKS_PER_NODE_LIST=${NUM_TASKS}
135 | else
136 | # Appending to list - add a comma to delimit entries
137 | TASKS_PER_NODE_LIST="${TASKS_PER_NODE_LIST},${NUM_TASKS}"
138 | fi
139 | COUNT=`expr ${COUNT} + 1`
140 | done
141 | done
142 | 
143 | # Add -hosts argument at start of MACHINE_ARG
144 | MACHINE_ARG="-hosts ${SLURM_JOB_NUM_NODES}"
145 | 
146 | # For each hostname in SMPD_HOSTS, append '<hostname> <tasks_on_host>' to MACHINE_ARG
147 | INDEX=0
148 | for HOSTNAME in ${SMPD_HOSTS}
149 | do
150 | INDEX=`expr ${INDEX} + 1`
151 | # Use cut to index the '${INDEX}th' item in TASKS_PER_NODE_LIST
152 | 
TASKS_PER_NODE=`echo ${TASKS_PER_NODE_LIST} | cut -f ${INDEX} -d,` 153 | MACHINE_ARG="${MACHINE_ARG} ${HOSTNAME} ${TASKS_PER_NODE}" 154 | done 155 | echo "Machine args: $MACHINE_ARG" 156 | } 157 | 158 | ######################################################################################### 159 | # Shut down SMPDs and exit with the exit code of the last command executed 160 | cleanupAndExit() { 161 | EXIT_CODE=${?} 162 | 163 | echo "Stopping SMPD ..." 164 | 165 | STOP_SMPD_CMD="srun --ntasks-per-node=1 --ntasks=${SLURM_JOB_NUM_NODES} --cpu-bind=none ${FULL_SMPD} -shutdown -phrase MATLAB -port ${SMPD_PORT}" 166 | echo $STOP_SMPD_CMD 167 | eval $STOP_SMPD_CMD 168 | 169 | echo "Exiting with code: ${EXIT_CODE}" 170 | exit ${EXIT_CODE} 171 | } 172 | 173 | ######################################################################################### 174 | # Use srun to launch the SMPD daemons on each processor 175 | launchSmpds() { 176 | 177 | # Launch the SMPD processes on all hosts using srun 178 | echo "Starting SMPD on ${SMPD_HOSTS} ..." 179 | 180 | START_SMPD_CMD="srun --ntasks-per-node=1 --ntasks=${SLURM_JOB_NUM_NODES} --cpu-bind=none ${FULL_SMPD} -phrase MATLAB -port ${SMPD_PORT} -debug 0 &" 181 | echo $START_SMPD_CMD 182 | eval $START_SMPD_CMD 183 | 184 | # Check that the SMPD processes are running on all hosts 185 | SUCCESS=0 186 | NUM_ATTEMPTS=60 187 | ATTEMPT=1 188 | while [ ${ATTEMPT} -le ${NUM_ATTEMPTS} ] 189 | do 190 | echo "Checking that SMPD processes are running (Attempt ${ATTEMPT} of ${NUM_ATTEMPTS})" 191 | SMPD_LAUNCHED_HOSTS="" 192 | NUM_HOSTS_FOUND=0 193 | for HOST in ${SMPD_HOSTS} 194 | do 195 | CHECK_SMPD_CMD="${FULL_SMPD} -phrase MATLAB -port ${SMPD_PORT} -status ${HOST} > /dev/null 2>&1" 196 | echo $CHECK_SMPD_CMD 197 | eval $CHECK_SMPD_CMD 198 | EXIT_CODE=${?} 199 | if [ $EXIT_CODE -ne 0 ] ; then 200 | echo "No SMPD process running on ${HOST}" 201 | else 202 | echo "SMPD process found running on ${HOST}" 203 | NUM_HOSTS_FOUND=$((NUM_HOSTS_FOUND+1)) 204 | 205 | # Append HOST to SMPD_LAUNCHED_HOSTS if it does not already contain it. 206 | case "${SMPD_LAUNCHED_HOSTS}" in 207 | *$HOST* ) ;; 208 | * ) SMPD_LAUNCHED_HOSTS="${SMPD_LAUNCHED_HOSTS} ${HOST}" ;; 209 | esac 210 | fi 211 | done 212 | if [ ${SLURM_JOB_NUM_NODES} -eq ${NUM_HOSTS_FOUND} ] ; then 213 | SUCCESS=1 214 | break 215 | elif [ ${ATTEMPT} -ne ${NUM_ATTEMPTS} ] ; then 216 | sleep 1 217 | fi 218 | ATTEMPT=$((ATTEMPT+1)) 219 | done 220 | if [ $SUCCESS -ne 1 ] ; then 221 | if [ $NUM_HOSTS_FOUND -eq 0 ] ; then 222 | echo "No SMPD processes were found running. Aborting." 223 | else 224 | echo "Found SMPD processes running on only ${NUM_HOSTS_FOUND} of ${SLURM_JOB_NUM_NODES} nodes. Aborting." 225 | echo "Hosts found: ${SMPD_LAUNCHED_HOSTS}" 226 | fi 227 | exit 1 228 | fi 229 | echo "All SMPDs launched" 230 | } 231 | 232 | ######################################################################################### 233 | runMpiexec() { 234 | 235 | CMD="\"${FULL_MPIEXEC}\" -smpd \ 236 | -phrase MATLAB \ 237 | -port ${SMPD_PORT} \ 238 | -l ${MACHINE_ARG} \ 239 | -genvlist ${PARALLEL_SERVER_GENVLIST} \ 240 | \"${PARALLEL_SERVER_MATLAB_EXE}\" \ 241 | ${PARALLEL_SERVER_MATLAB_ARGS}" 242 | 243 | # As a debug stage: echo the command ... 244 | echo $CMD 245 | 246 | # ... and then execute it. 
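# (eval is used, rather than invoking $CMD directly, so the escaped quotes
# embedded around the executable path are re-parsed by the shell and paths
# containing spaces survive intact.)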
247 | eval $CMD 248 | 249 | MPIEXEC_CODE=${?} 250 | if [ ${MPIEXEC_CODE} -ne 0 ] ; then 251 | exit ${MPIEXEC_CODE} 252 | fi 253 | } 254 | 255 | ######################################################################################### 256 | # Define the order in which we execute the stages defined above 257 | MAIN() { 258 | # Install a trap to ensure that SMPDs are closed if something errors or the 259 | # job is cancelled. 260 | trap "cleanupAndExit" 0 1 2 15 261 | chooseSmpdHosts 262 | chooseSmpdPort 263 | launchSmpds 264 | chooseMachineArg 265 | runMpiexec 266 | exit 0 # Explicitly exit 0 to trigger cleanupAndExit 267 | } 268 | 269 | # Call the MAIN loop 270 | MAIN 271 | -------------------------------------------------------------------------------- /communicatingSubmitFcn.m: -------------------------------------------------------------------------------- 1 | function communicatingSubmitFcn(cluster, job, environmentProperties) 2 | %COMMUNICATINGSUBMITFCN Submit a communicating MATLAB job to a Slurm cluster 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you submit a communicating job. 6 | % 7 | % See also parallel.cluster.generic.communicatingDecodeFcn. 8 | 9 | % Copyright 2010-2024 The MathWorks, Inc. 10 | 11 | % Store the current filename for the errors, warnings and dctSchedulerMessages. 12 | currFilename = mfilename; 13 | if ~isa(cluster, 'parallel.Cluster') 14 | error('parallelexamples:GenericSLURM:NotClusterObject', ... 15 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 16 | end 17 | 18 | decodeFunction = 'parallel.cluster.generic.communicatingDecodeFcn'; 19 | 20 | clusterOS = cluster.OperatingSystem; 21 | if ~strcmpi(clusterOS, 'unix') 22 | error('parallelexamples:GenericSLURM:UnsupportedOS', ... 23 | 'The function %s only supports clusters with the unix operating system.', currFilename) 24 | end 25 | 26 | % Get the correct quote and file separator for the Cluster OS. 27 | % This check is unnecessary in this file because we explicitly 28 | % checked that the clusterOS is unix. This code is an example 29 | % of how to deal with clusters that can be unix or pc. 30 | if strcmpi(clusterOS, 'unix') 31 | quote = ''''; 32 | fileSeparator = '/'; 33 | scriptExt = '.sh'; 34 | shellCmd = 'sh'; 35 | else 36 | quote = '"'; 37 | fileSeparator = '\'; 38 | scriptExt = '.bat'; 39 | shellCmd = 'cmd /c'; 40 | end 41 | 42 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 43 | remoteConnection = getRemoteConnection(cluster); 44 | end 45 | 46 | % Determine the debug setting. Setting to true makes the MATLAB workers 47 | % output additional logging. If EnableDebug is set in the cluster object's 48 | % AdditionalProperties, that takes precedence. Otherwise, look for the 49 | % PARALLEL_SERVER_DEBUG and MDCE_DEBUG environment variables in that order. 50 | % If nothing is set, debug is false. 
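% For example, one way to opt in to debug logging from the client session:
%   c = parcluster("mySLURMCluster");   % profile name assumed
%   c.AdditionalProperties.EnableDebug = true;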
51 | enableDebug = 'false'; 52 | if isprop(cluster.AdditionalProperties, 'EnableDebug') 53 | % Use AdditionalProperties.EnableDebug, if it is set 54 | enableDebug = char(string(cluster.AdditionalProperties.EnableDebug)); 55 | else 56 | % Otherwise check the environment variables set locally on the client 57 | environmentVariablesToCheck = {'PARALLEL_SERVER_DEBUG', 'MDCE_DEBUG'}; 58 | for idx = 1:numel(environmentVariablesToCheck) 59 | debugValue = getenv(environmentVariablesToCheck{idx}); 60 | if ~isempty(debugValue) 61 | enableDebug = debugValue; 62 | break 63 | end 64 | end 65 | end 66 | 67 | % The job specific environment variables 68 | % Remove leading and trailing whitespace from the MATLAB arguments 69 | matlabArguments = strtrim(environmentProperties.MatlabArguments); 70 | 71 | % Where the workers store job output 72 | if cluster.HasSharedFilesystem 73 | storageLocation = environmentProperties.StorageLocation; 74 | else 75 | storageLocation = remoteConnection.JobStorageLocation; 76 | % If the RemoteJobStorageLocation ends with a space, add a slash to ensure it is respected 77 | if endsWith(storageLocation, ' ') 78 | storageLocation = [storageLocation, fileSeparator]; 79 | end 80 | end 81 | variables = { ... 82 | 'PARALLEL_SERVER_DECODE_FUNCTION', decodeFunction; ... 83 | 'PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor; ... 84 | 'PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation; ... 85 | 'PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable; ... 86 | 'PARALLEL_SERVER_MATLAB_ARGS', matlabArguments; ... 87 | 'PARALLEL_SERVER_DEBUG', enableDebug; ... 88 | 'MLM_WEB_LICENSE', environmentProperties.UseMathworksHostedLicensing; ... 89 | 'MLM_WEB_USER_CRED', environmentProperties.UserToken; ... 90 | 'MLM_WEB_ID', environmentProperties.LicenseWebID; ... 91 | 'PARALLEL_SERVER_LICENSE_NUMBER', environmentProperties.LicenseNumber; ... 92 | 'PARALLEL_SERVER_STORAGE_LOCATION', storageLocation; ... 93 | 'PARALLEL_SERVER_CMR', strip(cluster.ClusterMatlabRoot, 'right', '/'); ... 94 | 'PARALLEL_SERVER_TOTAL_TASKS', num2str(environmentProperties.NumberOfTasks); ... 95 | 'PARALLEL_SERVER_NUM_THREADS', num2str(cluster.NumThreads)}; 96 | % Starting in R2025a, IntelMPI is supported via MPIImplementation="IntelMPI" 97 | if ~verLessThan('matlab', '25.1') && ... 98 | isprop(cluster.AdditionalProperties, 'MPIImplementation') %#ok 99 | mpiImplementation = cluster.AdditionalProperties.MPIImplementation; 100 | mustBeMember(mpiImplementation, ["IntelMPI", "MPICH"]); 101 | variables = [variables; {'PARALLEL_SERVER_MPIEXEC_ARG', ['-', char(mpiImplementation)]}]; 102 | end 103 | 104 | % Avoid "-bind-to core:N" if AdditionalProperties.UseBindToCore is false (default: true). 105 | if validatedPropValue(cluster.AdditionalProperties, 'UseBindToCore', 'logical', true) 106 | bindToCoreValue = 'true'; 107 | else 108 | bindToCoreValue = 'false'; 109 | end 110 | variables = [variables; {'PARALLEL_SERVER_BIND_TO_CORE', bindToCoreValue}]; 111 | 112 | if ~verLessThan('matlab', '25.1') %#ok 113 | variables = [variables; environmentProperties.JobEnvironment]; 114 | end 115 | % Environment variable names different prior to 19b 116 | if verLessThan('matlab', '9.7') 117 | variables(:,1) = replace(variables(:,1), 'PARALLEL_SERVER_', 'MDCE_'); 118 | end 119 | % Trim the environment variables of empty values. 
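% For example, if MLM_WEB_ID is empty because MathWorks hosted licensing is
% not in use, its row is removed so that no empty variable is exported.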
120 | nonEmptyValues = cellfun(@(x) ~isempty(strtrim(x)), variables(:,2)); 121 | variables = variables(nonEmptyValues, :); 122 | % List of all the variables to forward through mpiexec to the workers 123 | variables = [variables; ... 124 | {'PARALLEL_SERVER_GENVLIST', strjoin(variables(:,1), ',')}]; 125 | 126 | % The job directory as accessed by this machine 127 | localJobDirectory = cluster.getJobFolder(job); 128 | 129 | % The job directory as accessed by workers on the cluster 130 | if cluster.HasSharedFilesystem 131 | jobDirectoryOnCluster = cluster.getJobFolderOnCluster(job); 132 | else 133 | jobDirectoryOnCluster = remoteConnection.getRemoteJobLocation(job.ID, clusterOS); 134 | end 135 | 136 | % Specify the job wrapper script to use. 137 | % Prior to R2019a, only the SMPD process manager is supported. 138 | if verLessThan('matlab', '9.6') || ... 139 | validatedPropValue(cluster.AdditionalProperties, 'UseSmpd', 'logical', false) 140 | if ~verLessThan('matlab', '25.1') %#ok 141 | % Starting in R2025a, smpd launcher is not supported. 142 | error('parallelexamples:GenericSLURM:SmpdNoLongerSupported', ... 143 | 'The smpd process manager is no longer supported.'); 144 | end 145 | jobWrapperName = 'communicatingJobWrapperSmpd.sh'; 146 | else 147 | jobWrapperName = 'communicatingJobWrapper.sh'; 148 | end 149 | % The wrapper script is in the same directory as this file 150 | dirpart = fileparts(mfilename('fullpath')); 151 | localScript = fullfile(dirpart, jobWrapperName); 152 | % Copy the local wrapper script to the job directory 153 | copyfile(localScript, localJobDirectory, 'f'); 154 | 155 | % The script to execute on the cluster to run the job 156 | wrapperPath = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, jobWrapperName); 157 | quotedWrapperPath = sprintf('%s%s%s', quote, wrapperPath, quote); 158 | 159 | % Choose a file for the output 160 | logFile = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, sprintf('Job%d.log', job.ID)); 161 | quotedLogFile = sprintf('%s%s%s', quote, logFile, quote); 162 | dctSchedulerMessage(5, '%s: Using %s as log file', currFilename, quotedLogFile); 163 | 164 | jobName = sprintf('MATLAB_R%s_Job%d', version('-release'), job.ID); 165 | 166 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 167 | %% CUSTOMIZATION MAY BE REQUIRED %% 168 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 169 | % You might want to customize this section to match your cluster, 170 | % for example to limit the number of nodes for a single job. 
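% As a hypothetical illustration, appending '--nodes=4' to the arguments
% built below would cap each communicating job at four nodes:
%   additionalSubmitArgs = [additionalSubmitArgs ' --nodes=4'];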
171 | additionalSubmitArgs = sprintf('--ntasks=%d --cpus-per-task=%d', environmentProperties.NumberOfTasks, cluster.NumThreads); 172 | commonSubmitArgs = getCommonSubmitArgs(cluster); 173 | additionalSubmitArgs = strtrim(sprintf('%s %s', additionalSubmitArgs, commonSubmitArgs)); 174 | if validatedPropValue(cluster.AdditionalProperties, 'DisplaySubmitArgs', 'logical', false) 175 | fprintf('Submit arguments: %s\n', additionalSubmitArgs); 176 | end 177 | 178 | % Path to the submit script, to submit the Slurm job using sbatch 179 | submitScriptName = sprintf('submitScript%s', scriptExt); 180 | localSubmitScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, submitScriptName); 181 | submitScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, submitScriptName); 182 | quotedSubmitScriptPathOnCluster = sprintf('%s%s%s', quote, submitScriptPathOnCluster, quote); 183 | 184 | % Path to the environment wrapper, which will set the environment variables 185 | % for the job then execute the job wrapper 186 | envScriptName = sprintf('environmentWrapper%s', scriptExt); 187 | localEnvScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, envScriptName); 188 | envScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, envScriptName); 189 | quotedEnvScriptPathOnCluster = sprintf('%s%s%s', quote, envScriptPathOnCluster, quote); 190 | 191 | % Create the scripts to submit a Slurm job. 192 | % These will be created in the job directory. 193 | dctSchedulerMessage(5, '%s: Generating scripts for job %d', currFilename, job.ID); 194 | createEnvironmentWrapper(localEnvScriptPath, quotedWrapperPath, variables); 195 | createSubmitScript(localSubmitScriptPath, jobName, quotedLogFile, ... 196 | quotedEnvScriptPathOnCluster, additionalSubmitArgs); 197 | 198 | % Create the command to run on the cluster 199 | commandToRun = sprintf('%s %s', shellCmd, quotedSubmitScriptPathOnCluster); 200 | 201 | if ~cluster.HasSharedFilesystem 202 | % Start the mirror to copy all the job files over to the cluster 203 | dctSchedulerMessage(4, '%s: Starting mirror for job %d.', currFilename, job.ID); 204 | remoteConnection.startMirrorForJob(job); 205 | end 206 | 207 | if strcmpi(clusterOS, 'unix') 208 | % Add execute permissions to shell scripts 209 | runSchedulerCommand(cluster, sprintf( ... 210 | 'chmod u+x "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 211 | % Convert line endings to Unix 212 | runSchedulerCommand(cluster, sprintf( ... 213 | 'dos2unix --allow-chown "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 214 | end 215 | 216 | % Now ask the cluster to run the submission command 217 | dctSchedulerMessage(4, '%s: Submitting job using command:\n\t%s', currFilename, commandToRun); 218 | try 219 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 220 | catch err 221 | cmdFailed = true; 222 | cmdOut = err.message; 223 | end 224 | if cmdFailed 225 | if ~cluster.HasSharedFilesystem 226 | % Stop the mirroring if we failed to submit the job - this will also 227 | % remove the job files from the remote location 228 | remoteConnection = getRemoteConnection(cluster); 229 | % Only stop mirroring if we are actually mirroring 230 | if remoteConnection.isJobUsingConnection(job.ID) 231 | dctSchedulerMessage(5, '%s: Stopping the mirror for job %d.', currFilename, job.ID); 232 | try 233 | remoteConnection.stopMirrorForJob(job); 234 | catch err 235 | warning('parallelexamples:GenericSLURM:FailedToStopMirrorForJob', ... 
236 | 'Failed to stop the file mirroring for job %d.\nReason: %s', ...
237 | job.ID, err.getReport);
238 | end
239 | end
240 | end
241 | error('parallelexamples:GenericSLURM:FailedToSubmitJob', ...
242 | 'Failed to submit job to Slurm using command:\n\t%s.\nReason: %s', ...
243 | commandToRun, cmdOut);
244 | end
245 | 
246 | % Calculate the schedulerIDs
247 | jobIDs = extractJobId(cmdOut);
248 | if isempty(jobIDs)
249 | error('parallelexamples:GenericSLURM:FailedToParseSubmissionOutput', ...
250 | 'Failed to parse the job identifier from the submission output: "%s"', ...
251 | cmdOut);
252 | end
253 | % jobIDs must be a cell array
254 | if ~iscell(jobIDs)
255 | jobIDs = {jobIDs};
256 | end
257 | 
258 | % Store the scheduler ID for each task and the job cluster data
259 | jobData = struct('type', 'generic');
260 | if isprop(cluster.AdditionalProperties, 'ClusterHost')
261 | % Store the cluster host
262 | jobData.RemoteHost = remoteConnection.Hostname;
263 | end
264 | if ~cluster.HasSharedFilesystem
265 | % Store the remote job storage location
266 | jobData.RemoteJobStorageLocation = remoteConnection.JobStorageLocation;
267 | jobData.HasDoneLastMirror = false;
268 | end
269 | if verLessThan('matlab', '9.7') % schedulerID stored in job data
270 | jobData.ClusterJobIDs = jobIDs;
271 | else % schedulerID on task since 19b
272 | if isscalar(job.Tasks)
273 | schedulerIDs = jobIDs{1};
274 | else
275 | schedulerIDs = repmat(jobIDs, size(job.Tasks));
276 | end
277 | set(job.Tasks, 'SchedulerID', schedulerIDs);
278 | end
279 | cluster.setJobClusterData(job, jobData);
280 | 
281 | end
282 | -------------------------------------------------------------------------------- /deleteJobFcn.m: --------------------------------------------------------------------------------
1 | function deleteJobFcn(cluster, job)
2 | %DELETEJOBFCN Deletes a job on Slurm
3 | %
4 | % Set your cluster's PluginScriptsLocation to the parent folder of this
5 | % function to run it when you delete a job.
6 | 
7 | % Copyright 2017-2023 The MathWorks, Inc.
8 | 
9 | cancelJobOnCluster(cluster, job);
10 | 
11 | end
12 | -------------------------------------------------------------------------------- /deleteTaskFcn.m: --------------------------------------------------------------------------------
1 | function deleteTaskFcn(cluster, task)
2 | %DELETETASKFCN Deletes a task on Slurm
3 | %
4 | % Set your cluster's PluginScriptsLocation to the parent folder of this
5 | % function to run it when you delete a task.
6 | 
7 | % Copyright 2020-2023 The MathWorks, Inc.
8 | 
9 | cancelTaskOnCluster(cluster, task);
10 | 
11 | end
12 | -------------------------------------------------------------------------------- /discover/example.conf: --------------------------------------------------------------------------------
1 | # Since version R2023a, MATLAB can discover clusters running third-party
2 | # schedulers such as Slurm. The Discover Clusters functionality
3 | # automatically configures the Parallel Computing Toolbox to submit MATLAB
4 | # jobs to the cluster. To use this functionality, you must create a cluster
5 | # configuration file and store it at a location accessible to MATLAB users.
6 | #
7 | # This file is an example of a cluster configuration which MATLAB can
8 | # discover. You can copy and modify this file to make your cluster discoverable.
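# (For example, an admin might store a copy of this file in a shared folder
# such as /opt/matlab/discovery -- a hypothetical path -- and direct users to
# that folder when they run cluster discovery.)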
9 | # 10 | # For more information, including the required format for this file, see 11 | # the online documentation for making a cluster running a third-party 12 | # scheduler discoverable: 13 | # https://www.mathworks.com/help/matlab-parallel-server/configure-for-cluster-discovery.html 14 | 15 | # Copyright 2023 The MathWorks, Inc. 16 | 17 | # The name MATLAB will display for the cluster when discovered. 18 | Name = My Slurm cluster 19 | 20 | # Maximum number of MATLAB workers a single user can use in a single job. 21 | # This number must not exceed the number of available MATLAB Parallel 22 | # Server licenses. 23 | NumWorkers = 32 24 | 25 | # Path to the MATLAB install on the cluster for the workers to use. Note 26 | # the variable "$MATLAB_VERSION_STRING" returns the release number of the 27 | # MATLAB client that is running discovery, e.g. 2023a. If multiple versions 28 | # of MATLAB are installed on the cluster, this allows discovery to select 29 | # the correct installation path. Add a leading "R" or "r" if needed to 30 | # complete the MATLAB version. 31 | ClusterMatlabRoot = /opt/matlab/R"$MATLAB_VERSION_STRING" 32 | 33 | # Location where the MATLAB client stores job and task information. 34 | JobStorageLocation = /home/matlabjobs 35 | # If the client and cluster share a filesystem but the client is running 36 | # the Windows operating system and the cluster running a Linux operating 37 | # system, you must specify the JobStorageLocation using a structure by 38 | # commenting out the previous line and uncommenting the following lines. 39 | # The 'windows' and 'unix' fields must correspond to the same folder as 40 | # viewed from each of those operating systems. 41 | #JobStorageLocation.windows = \\organization\home\matlabjobs 42 | #JobStorageLocation.unix = /organization/home/matlabjobs 43 | 44 | # Folder that contains the scheduler plugin scripts that describe how 45 | # MATLAB interacts with the scheduler. A property can take different values 46 | # depending on the operating system of the client MATLAB by specifying the 47 | # name of the OS in parentheses. 48 | PluginScriptsLocation (Windows) = \\organization\matlab\pluginscripts 49 | PluginScriptsLocation (Unix) = /organization/matlab/pluginscripts 50 | 51 | # The operating system on the cluster. Valid values are 'unix' and 'windows'. 52 | OperatingSystem = unix 53 | 54 | # Specify whether client and cluster nodes share JobStorageLocation. To 55 | # configure MATLAB to copy job input and output files to and from the 56 | # cluster using SFTP, set this property to false and specify a value for 57 | # AdditionalProperties.RemoteJobStorageLocation below. 58 | HasSharedFilesystem = true 59 | 60 | # Specify whether the cluster uses online licensing. 61 | RequiresOnlineLicensing = false 62 | 63 | # LicenseNumber for the workers to use. Specify only if 64 | # RequiresOnlineLicensing is set to true. 65 | #LicenseNumber = 123456 66 | 67 | [AdditionalProperties] 68 | 69 | # To configure the user's machine to connect to the submission host via 70 | # SSH, uncomment the following line and enter the hostname of the cluster 71 | # machine that has the scheduler utilities to submit jobs. 72 | #ClusterHost = slurm-headnode 73 | 74 | # If the user's machine and the cluster nodes do not have a shared file 75 | # system, MATLAB can copy job input and output files to and from the 76 | # cluster using SFTP. To activate this feature, set HasSharedFilesystem 77 | # above to false. 
Then uncomment the following lines and enter the location 78 | # on the cluster to store job files. 79 | #RemoteJobStorageLocation (Windows) = /home/"$USERNAME"/.matlab/generic_cluster_jobs 80 | #RemoteJobStorageLocation (Unix) = /home/"$USER"/.matlab/generic_cluster_jobs 81 | 82 | # Username to log in to ClusterHost with. On Linux and Mac, use the USER 83 | # environment variable. On Windows, use the USERNAME variable. 84 | Username (Unix) = "$USER" 85 | Username (Windows) = "$USERNAME" 86 | -------------------------------------------------------------------------------- /discover/runDiscovery.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # Copyright 2023 The MathWorks, Inc. 4 | 5 | usage="$(basename "$0") matlabroot [folder] -- run third-party scheduler discovery in MATLAB R2023a onwards 6 | matlabroot - path to the folder where MATLAB is installed 7 | folder - folder to search for cluster configuration files 8 | (defaults to pwd)" 9 | 10 | # Print usage 11 | if [ -z "$1" ] || [ "$1" = "-h" ] || [ "$1" = "--help" ] ; then 12 | echo "$usage" 13 | exit 0 14 | fi 15 | 16 | # MATLAB executable to launch 17 | matlabExe="$1/bin/matlab" 18 | if [ ! -f "${matlabExe}" ] ; then 19 | echo "Could not find MATLAB executable at ${matlabExe}" 20 | exit 1 21 | fi 22 | 23 | # Folder to run discovery on. If specified, wrap in single-quotes to make a MATLAB charvec. 24 | discoveryFolder="$2" 25 | if [ ! -z "$discoveryFolder" ] ; then 26 | discoveryFolder="'${discoveryFolder}'" 27 | fi 28 | 29 | # Command to run in MATLAB 30 | matlabCmd="parallel.cluster.generic.discoverGenericClusters(${discoveryFolder})" 31 | 32 | # Arguments to pass to MATLAB 33 | matlabArgs="-nojvm -parallelserver -batch" 34 | 35 | # Build and run system command 36 | CMD="\"${matlabExe}\" ${matlabArgs} \"${matlabCmd}\"" 37 | eval $CMD 38 | -------------------------------------------------------------------------------- /getJobStateFcn.m: -------------------------------------------------------------------------------- 1 | function state = getJobStateFcn(cluster, job, state) 2 | %GETJOBSTATEFCN Gets the state of a job from Slurm 3 | % 4 | % Set your cluster's PluginScriptsLocation to the parent folder of this 5 | % function to run it when you query the state of a job. 6 | 7 | % Copyright 2010-2024 The MathWorks, Inc. 8 | 9 | % Store the current filename for the errors, warnings and 10 | % dctSchedulerMessages 11 | currFilename = mfilename; 12 | if ~isa(cluster, 'parallel.Cluster') 13 | error('parallelexamples:GenericSLURM:SubmitFcnError', ... 14 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 15 | end 16 | 17 | % Get the information about the actual cluster used 18 | data = cluster.getJobClusterData(job); 19 | if isempty(data) 20 | % This indicates that the job has not been submitted, so just return 21 | dctSchedulerMessage(1, '%s: Job cluster data was empty for job with ID %d.', currFilename, job.ID); 22 | return 23 | end 24 | 25 | % Shortcut if the job state is already finished or failed 26 | jobInTerminalState = strcmp(state, 'finished') || strcmp(state, 'failed'); 27 | if jobInTerminalState 28 | if cluster.HasSharedFilesystem 29 | return 30 | end 31 | try 32 | hasDoneLastMirror = data.HasDoneLastMirror; 33 | catch err 34 | ex = MException('parallelexamples:GenericSLURM:FailedToRetrieveRemoteParameters', ... 
35 | 'Failed to retrieve remote parameters from the job cluster data.');
36 | ex = ex.addCause(err);
37 | throw(ex);
38 | end
39 | % Can only shortcut here if we've already done the last mirror
40 | if hasDoneLastMirror
41 | return
42 | end
43 | end
44 | 
45 | [schedulerIDs, numSubmittedTasks] = getSimplifiedSchedulerIDsForJob(job);
46 | 
47 | jobList = strjoin(schedulerIDs, ',');
48 | commandToRun = sprintf('squeue -j %s --states=all --Format=jobarrayid,state --noheader --array', jobList);
49 | dctSchedulerMessage(4, '%s: Querying cluster for job state using command:\n\t%s', currFilename, commandToRun);
50 | 
51 | try
52 | % We will ignore the status returned from the state command because
53 | % a non-zero status is returned if the job no longer exists
54 | [~, cmdOut] = runSchedulerCommand(cluster, commandToRun);
55 | catch err
56 | ex = MException('parallelexamples:GenericSLURM:FailedToGetJobState', ...
57 | 'Failed to get job state from cluster.');
58 | ex = ex.addCause(err);
59 | throw(ex);
60 | end
61 | 
62 | clusterState = iExtractJobState(cmdOut, numSubmittedTasks);
63 | dctSchedulerMessage(6, '%s: State %s was extracted from cluster output.', currFilename, clusterState);
64 | 
65 | % If we could determine the cluster's state, we'll use that. Otherwise, we assume
66 | % the scheduler is no longer tracking the job because the job has terminated.
67 | if ~strcmp(clusterState, 'unknown')
68 | state = clusterState;
69 | else
70 | state = 'finished';
71 | end
72 | 
73 | if ~cluster.HasSharedFilesystem
74 | % Decide what to do with mirroring based on the cluster's version of job
75 | % state and whether or not the job is currently being mirrored:
76 | % If job is not being mirrored, and job is not finished, resume the mirror
77 | % If job is not being mirrored, and job is finished, do the last mirror
78 | % If the job is being mirrored, and job is finished, do the last mirror
79 | % Otherwise (if job is not finished, and we are mirroring), do nothing
80 | remoteConnection = getRemoteConnection(cluster);
81 | isBeingMirrored = remoteConnection.isJobUsingConnection(job.ID);
82 | isJobFinished = strcmp(state, 'finished') || strcmp(state, 'failed');
83 | if ~isBeingMirrored && ~isJobFinished
84 | % resume the mirror
85 | dctSchedulerMessage(4, '%s: Resuming mirror for job %d.', currFilename, job.ID);
86 | try
87 | remoteConnection.resumeMirrorForJob(job);
88 | catch err
89 | warning('parallelexamples:GenericSLURM:FailedToResumeMirrorForJob', ...
90 | 'Failed to resume mirror for job %d. Your local job files may not be up-to-date.\nReason: %s', ...
91 | job.ID, err.getReport);
92 | end
93 | elseif isJobFinished
94 | dctSchedulerMessage(4, '%s: Doing last mirror for job %d.', currFilename, job.ID);
95 | try
96 | remoteConnection.doLastMirrorForJob(job);
97 | % Store the fact that we have done the last mirror so we can shortcut in the future
98 | data.HasDoneLastMirror = true;
99 | cluster.setJobClusterData(job, data);
100 | catch err
101 | warning('parallelexamples:GenericSLURM:FailedToDoFinalMirrorForJob', ...
102 | 'Failed to do last mirror for job %d. Your local job files may not be up-to-date.\nReason: %s', ...
103 | job.ID, err.getReport);
104 | end
105 | end
106 | end
107 | 
108 | end
109 | 
110 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
111 | function state = iExtractJobState(squeueOut, numJobs)
112 | % Function to extract the job state from the output of squeue
113 | 
114 | numPending = numel(regexp(squeueOut, 'PENDING|SPECIAL_EXIT'));
115 | numRunning = numel(regexp(squeueOut, 'RUNNING|SUSPENDED|COMPLETING|CONFIGURING|STOPPED|RESIZING'));
116 | numFinished = numel(regexp(squeueOut, 'COMPLETED'));
117 | numFailed = numel(regexp(squeueOut, 'CANCELLED|FAIL|TIMEOUT|PREEMPTED|OUT_OF|REVOKED|DEADLINE'));
118 | 
119 | % If all of the jobs that we asked about have finished, then we know the
120 | % job has finished.
121 | if numFinished == numJobs
122 | state = 'finished';
123 | return
124 | end
125 | 
126 | % Any running indicates that the job is running
127 | if numRunning > 0
128 | state = 'running';
129 | return
130 | end
131 | 
132 | % We know numRunning == 0 so if there are some still pending then the
133 | % job must be queued again, even if there are some finished
134 | if numPending > 0
135 | state = 'queued';
136 | return
137 | end
138 | 
139 | % Deal with any tasks that have failed
140 | if numFailed > 0
141 | % Set this job to be failed
142 | state = 'failed';
143 | return
144 | end
145 | 
146 | state = 'unknown';
147 | end
148 | -------------------------------------------------------------------------------- /independentJobWrapper.sh: --------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # This wrapper script is intended to support independent execution.
3 | #
4 | # This script uses the following environment variables set by the submit MATLAB code:
5 | # PARALLEL_SERVER_MATLAB_EXE - the MATLAB executable to use
6 | # PARALLEL_SERVER_MATLAB_ARGS - the MATLAB args to use
7 | 
8 | # Copyright 2010-2024 The MathWorks, Inc.
9 | 
10 | # If PARALLEL_SERVER_ environment variables are not set, assign any
11 | # available values with form MDCE_ for backwards compatibility
12 | PARALLEL_SERVER_MATLAB_EXE=${PARALLEL_SERVER_MATLAB_EXE:="${MDCE_MATLAB_EXE}"}
13 | PARALLEL_SERVER_MATLAB_ARGS=${PARALLEL_SERVER_MATLAB_ARGS:="${MDCE_MATLAB_ARGS}"}
14 | 
15 | # Echo the node that the scheduler has allocated to this job:
16 | echo "The scheduler has allocated the following node to this job: `hostname`"
17 | 
18 | if [ ! -z "${SLURM_ARRAY_TASK_ID}" ] ; then
19 | # Use job arrays
20 | TASK_ID=$((${SLURM_ARRAY_TASK_ID}+${PARALLEL_SERVER_TASK_ID_OFFSET}))
21 | export PARALLEL_SERVER_TASK_LOCATION="${PARALLEL_SERVER_JOB_LOCATION}/Task${TASK_ID}";
22 | export MDCE_TASK_LOCATION="${MDCE_JOB_LOCATION}/Task${TASK_ID}";
23 | fi
24 | 
25 | # Construct the command to run.
26 | CMD="\"${PARALLEL_SERVER_MATLAB_EXE}\" ${PARALLEL_SERVER_MATLAB_ARGS}"
27 | 
28 | # Echo the command so that it is shown in the output log.
29 | echo "Executing: $CMD"
30 | 
31 | # Execute the command.
32 | eval $CMD
33 | 
34 | EXIT_CODE=${?}
35 | echo "Exiting with code: ${EXIT_CODE}"
36 | exit ${EXIT_CODE}
37 | -------------------------------------------------------------------------------- /independentSubmitFcn.m: --------------------------------------------------------------------------------
1 | function independentSubmitFcn(cluster, job, environmentProperties)
2 | %INDEPENDENTSUBMITFCN Submit a MATLAB job to a Slurm cluster
3 | %
4 | % Set your cluster's PluginScriptsLocation to the parent folder of this
5 | % function to run it when you submit an independent job.
6 | % 7 | % See also parallel.cluster.generic.independentDecodeFcn. 8 | 9 | % Copyright 2010-2024 The MathWorks, Inc. 10 | 11 | % Store the current filename for the errors, warnings and dctSchedulerMessages. 12 | currFilename = mfilename; 13 | if ~isa(cluster, 'parallel.Cluster') 14 | error('parallelexamples:GenericSLURM:NotClusterObject', ... 15 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 16 | end 17 | 18 | decodeFunction = 'parallel.cluster.generic.independentDecodeFcn'; 19 | 20 | clusterOS = cluster.OperatingSystem; 21 | if ~strcmpi(clusterOS, 'unix') 22 | error('parallelexamples:GenericSLURM:UnsupportedOS', ... 23 | 'The function %s only supports clusters with the unix operating system.', currFilename) 24 | end 25 | 26 | % Get the correct quote and file separator for the Cluster OS. 27 | % This check is unnecessary in this file because we explicitly 28 | % checked that the clusterOS is unix. This code is an example 29 | % of how to deal with clusters that can be unix or pc. 30 | if strcmpi(clusterOS, 'unix') 31 | quote = ''''; 32 | fileSeparator = '/'; 33 | scriptExt = '.sh'; 34 | shellCmd = 'sh'; 35 | else 36 | quote = '"'; 37 | fileSeparator = '\'; 38 | scriptExt = '.bat'; 39 | shellCmd = 'cmd /c'; 40 | end 41 | 42 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 43 | remoteConnection = getRemoteConnection(cluster); 44 | end 45 | 46 | [useJobArrays, maxJobArraySize] = iGetJobArrayProps(cluster); 47 | % Store data for future reference 48 | cluster.UserData.UseJobArrays = useJobArrays; 49 | if useJobArrays 50 | cluster.UserData.MaxJobArraySize = maxJobArraySize; 51 | end 52 | 53 | % Determine the debug setting. Setting to true makes the MATLAB workers 54 | % output additional logging. If EnableDebug is set in the cluster object's 55 | % AdditionalProperties, that takes precedence. Otherwise, look for the 56 | % PARALLEL_SERVER_DEBUG and MDCE_DEBUG environment variables in that order. 57 | % If nothing is set, debug is false. 58 | enableDebug = 'false'; 59 | if isprop(cluster.AdditionalProperties, 'EnableDebug') 60 | % Use AdditionalProperties.EnableDebug, if it is set 61 | enableDebug = char(string(cluster.AdditionalProperties.EnableDebug)); 62 | else 63 | % Otherwise check the environment variables set locally on the client 64 | environmentVariablesToCheck = {'PARALLEL_SERVER_DEBUG', 'MDCE_DEBUG'}; 65 | for idx = 1:numel(environmentVariablesToCheck) 66 | debugValue = getenv(environmentVariablesToCheck{idx}); 67 | if ~isempty(debugValue) 68 | enableDebug = debugValue; 69 | break 70 | end 71 | end 72 | end 73 | 74 | % The job specific environment variables 75 | % Remove leading and trailing whitespace from the MATLAB arguments 76 | matlabArguments = strtrim(environmentProperties.MatlabArguments); 77 | 78 | % Where the workers store job output 79 | if cluster.HasSharedFilesystem 80 | storageLocation = environmentProperties.StorageLocation; 81 | else 82 | storageLocation = remoteConnection.JobStorageLocation; 83 | % If the RemoteJobStorageLocation ends with a space, add a slash to ensure it is respected 84 | if endsWith(storageLocation, ' ') 85 | storageLocation = [storageLocation, fileSeparator]; 86 | end 87 | end 88 | variables = { ... 89 | 'PARALLEL_SERVER_DECODE_FUNCTION', decodeFunction; ... 90 | 'PARALLEL_SERVER_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor; ... 91 | 'PARALLEL_SERVER_JOB_LOCATION', environmentProperties.JobLocation; ... 
92 | 'PARALLEL_SERVER_MATLAB_EXE', environmentProperties.MatlabExecutable; ... 93 | 'PARALLEL_SERVER_MATLAB_ARGS', matlabArguments; ... 94 | 'PARALLEL_SERVER_DEBUG', enableDebug; ... 95 | 'MLM_WEB_LICENSE', environmentProperties.UseMathworksHostedLicensing; ... 96 | 'MLM_WEB_USER_CRED', environmentProperties.UserToken; ... 97 | 'MLM_WEB_ID', environmentProperties.LicenseWebID; ... 98 | 'PARALLEL_SERVER_LICENSE_NUMBER', environmentProperties.LicenseNumber; ... 99 | 'PARALLEL_SERVER_STORAGE_LOCATION', storageLocation}; 100 | if ~verLessThan('matlab', '25.1') %#ok 101 | variables = [variables; environmentProperties.JobEnvironment]; 102 | end 103 | % Environment variable names different prior to 19b 104 | if verLessThan('matlab', '9.7') 105 | variables(:,1) = replace(variables(:,1), 'PARALLEL_SERVER_', 'MDCE_'); 106 | end 107 | % Trim the environment variables of empty values. 108 | nonEmptyValues = cellfun(@(x) ~isempty(strtrim(x)), variables(:,2)); 109 | variables = variables(nonEmptyValues, :); 110 | 111 | % The job directory as accessed by this machine 112 | localJobDirectory = cluster.getJobFolder(job); 113 | 114 | % The job directory as accessed by workers on the cluster 115 | if cluster.HasSharedFilesystem 116 | jobDirectoryOnCluster = cluster.getJobFolderOnCluster(job); 117 | else 118 | jobDirectoryOnCluster = remoteConnection.getRemoteJobLocation(job.ID, clusterOS); 119 | end 120 | 121 | % Name of the wrapper script to launch the MATLAB worker 122 | jobWrapperName = 'independentJobWrapper.sh'; 123 | % The wrapper script is in the same directory as this file 124 | dirpart = fileparts(mfilename('fullpath')); 125 | localScript = fullfile(dirpart, jobWrapperName); 126 | % Copy the local wrapper script to the job directory 127 | copyfile(localScript, localJobDirectory, 'f'); 128 | 129 | % The script to execute on the cluster to run the job 130 | wrapperPath = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, jobWrapperName); 131 | quotedWrapperPath = sprintf('%s%s%s', quote, wrapperPath, quote); 132 | 133 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 134 | %% CUSTOMIZATION MAY BE REQUIRED %% 135 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 136 | additionalSubmitArgs = sprintf('--ntasks=1 --cpus-per-task=%d', cluster.NumThreads); 137 | commonSubmitArgs = getCommonSubmitArgs(cluster); 138 | additionalSubmitArgs = strtrim(sprintf('%s %s', additionalSubmitArgs, commonSubmitArgs)); 139 | if validatedPropValue(cluster.AdditionalProperties, 'DisplaySubmitArgs', 'logical', false) 140 | fprintf('Submit arguments: %s\n', additionalSubmitArgs); 141 | end 142 | 143 | % Only keep and submit tasks that are not cancelled. Cancelled tasks 144 | % will have errors. 145 | isPendingTask = cellfun(@isempty, get(job.Tasks, {'Error'})); 146 | tasks = job.Tasks(isPendingTask); 147 | taskIDs = cell2mat(get(tasks, {'ID'})); 148 | numberOfTasks = numel(tasks); 149 | 150 | % Only use job arrays when you can get enough use out of them. 151 | % The submission method in this function requires a minimum maxJobArraySize 152 | % of 10 to get enough use of job arrays. 153 | if numberOfTasks < 2 || maxJobArraySize < 10 154 | useJobArrays = false; 155 | end 156 | 157 | if useJobArrays 158 | % Check if there are more tasks than will fit in one job array. Slurm 159 | % will not accept a job array index greater than its MaxArraySize 160 | % parameter, as defined in slurm.conf, even if the overall size of the 161 | % array is less than MaxArraySize. 
For example, for the default 162 | % (inclusive) upper limit of MaxArraySize=1000, array indices of 1 to 163 | % 1000 would be accepted, but 1001 or above would not. To get around 164 | % this restriction, submit the full array of tasks in multiple Slurm 165 | % job arrays, hereafter referred to as subarrays. Round the 166 | % MaxArraySize down to the nearest power of 10, as this allows the log 167 | % file of taskX to be named TaskX.log. See iGenerateLogFileName. 168 | if taskIDs(end) > maxJobArraySize 169 | % Use the nearest power of 10 as subarray size. This will make the 170 | % naming of log files easier. 171 | maxJobArraySizeToUse = 10^floor(log10(maxJobArraySize)); 172 | % Group task IDs into bins of jobArraySize size. 173 | groups = findgroups(floor(taskIDs./maxJobArraySizeToUse)); 174 | % Count the number of elements in each group and form subarrays. 175 | jobArraySizes = splitapply(@numel, taskIDs, groups); 176 | else 177 | maxJobArraySizeToUse = maxJobArraySize; 178 | jobArraySizes = numel(tasks); 179 | end 180 | taskIDGroupsForJobArrays = mat2cell(taskIDs,jobArraySizes); 181 | 182 | jobName = sprintf('MATLAB_R%s_Job%d', version('-release'), job.ID); 183 | numJobArrays = numel(taskIDGroupsForJobArrays); 184 | commandsToRun = cell(numJobArrays, 1); 185 | jobIDs = cell(numJobArrays, 1); 186 | schedulerJobArrayIndices = cell(numJobArrays, 1); 187 | for ii = 1:numJobArrays 188 | % Slurm only accepts task IDs up to maxArraySize. Shift all task 189 | % IDs down below the limit. 190 | taskOffset = (ii-1)*maxJobArraySizeToUse; 191 | schedulerJobArrayIndices{ii} = taskIDGroupsForJobArrays{ii} - taskOffset; 192 | % Save the offset as an environment variable to pass to the tasks 193 | % during Slurm submission. 194 | environmentVariables = [variables; ... 195 | {'PARALLEL_SERVER_TASK_ID_OFFSET', num2str(taskOffset)}]; 196 | 197 | % Create a character vector with the ranges of IDs to submit. 198 | jobArrayString = iCreateJobArrayString(schedulerJobArrayIndices{ii}); 199 | 200 | % Choose a file for the output 201 | logFileName = iGenerateLogFileName(ii, maxJobArraySizeToUse); 202 | logFile = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, logFileName); 203 | quotedLogFile = sprintf('%s%s%s', quote, logFile, quote); 204 | dctSchedulerMessage(5, '%s: Using %s as log file', currFilename, quotedLogFile); 205 | 206 | % Path to the submit script, to submit the Slurm job using sbatch 207 | submitScriptName = sprintf('submitScript%d%s', ii, scriptExt); 208 | localSubmitScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, submitScriptName); 209 | submitScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, submitScriptName); 210 | quotedSubmitScriptPathOnCluster = sprintf('%s%s%s', quote, submitScriptPathOnCluster, quote); 211 | 212 | % Path to the environment wrapper, which will set the environment variables 213 | % for the job then execute the job wrapper 214 | envScriptName = sprintf('environmentWrapper%d%s', ii, scriptExt); 215 | localEnvScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, envScriptName); 216 | envScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, envScriptName); 217 | quotedEnvScriptPathOnCluster = sprintf('%s%s%s', quote, envScriptPathOnCluster, quote); 218 | 219 | % Create the scripts to submit a Slurm job. 220 | % These will be created in the job directory. 
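% For reference, each generated submit script ends in a single sbatch
% command of roughly this shape (paths and values are illustrative, not
% literal output; see getSubmitString in the private folder for the exact
% format):
%
%   sbatch --job-name=MATLAB_R2024a_Job1 --array='[1-100]' \
%       --output='/remote/Job1/Task%a.log' --export=NONE \
%       --ntasks=1 --cpus-per-task=1 '/remote/Job1/environmentWrapper1.sh'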
221 | dctSchedulerMessage(5, '%s: Generating scripts for job array %d', currFilename, ii); 222 | createEnvironmentWrapper(localEnvScriptPath, quotedWrapperPath, environmentVariables); 223 | createSubmitScript(localSubmitScriptPath, jobName, quotedLogFile, ... 224 | quotedEnvScriptPathOnCluster, additionalSubmitArgs, jobArrayString); 225 | 226 | % Create the command to run on the cluster 227 | commandsToRun{ii} = sprintf('%s %s', shellCmd, quotedSubmitScriptPathOnCluster); 228 | end 229 | else 230 | % Do not use job arrays and submit each task individually. 231 | taskLocations = environmentProperties.TaskLocations(isPendingTask); 232 | jobIDs = cell(1, numberOfTasks); 233 | commandsToRun = cell(numberOfTasks, 1); 234 | 235 | % Loop over every task we have been asked to submit 236 | for ii = 1:numberOfTasks 237 | taskLocation = taskLocations{ii}; 238 | % Add the task location to the environment variables 239 | if verLessThan('matlab', '9.7') % variable name changed in 19b 240 | environmentVariables = [variables; ... 241 | {'MDCE_TASK_LOCATION', taskLocation}]; 242 | else 243 | environmentVariables = [variables; ... 244 | {'PARALLEL_SERVER_TASK_LOCATION', taskLocation}]; 245 | end 246 | 247 | % Choose a file for the output 248 | logFileName = sprintf('Task%d.log', taskIDs(ii)); 249 | logFile = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, logFileName); 250 | quotedLogFile = sprintf('%s%s%s', quote, logFile, quote); 251 | dctSchedulerMessage(5, '%s: Using %s as log file', currFilename, quotedLogFile); 252 | 253 | % Submit one task at a time 254 | jobName = sprintf('MATLAB_R%s_Job%d.%d', version('-release'), job.ID, taskIDs(ii)); 255 | 256 | % Path to the submit script, to submit the Slurm job using sbatch 257 | submitScriptName = sprintf('submitScript%d%s', ii, scriptExt); 258 | localSubmitScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, submitScriptName); 259 | submitScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, submitScriptName); 260 | quotedSubmitScriptPathOnCluster = sprintf('%s%s%s', quote, submitScriptPathOnCluster, quote); 261 | 262 | % Path to the environment wrapper, which will set the environment variables 263 | % for the job then execute the job wrapper 264 | envScriptName = sprintf('environmentWrapper%d%s', ii, scriptExt); 265 | localEnvScriptPath = sprintf('%s%s%s', localJobDirectory, fileSeparator, envScriptName); 266 | envScriptPathOnCluster = sprintf('%s%s%s', jobDirectoryOnCluster, fileSeparator, envScriptName); 267 | quotedEnvScriptPathOnCluster = sprintf('%s%s%s', quote, envScriptPathOnCluster, quote); 268 | 269 | % Create the scripts to submit a Slurm job. 270 | % These will be created in the job directory. 271 | dctSchedulerMessage(5, '%s: Generating scripts for task %d', currFilename, ii); 272 | createEnvironmentWrapper(localEnvScriptPath, quotedWrapperPath, environmentVariables); 273 | createSubmitScript(localSubmitScriptPath, jobName, quotedLogFile, ... 
274 | quotedEnvScriptPathOnCluster, additionalSubmitArgs); 275 | 276 | % Create the command to run on the cluster 277 | commandsToRun{ii} = sprintf('%s %s', shellCmd, quotedSubmitScriptPathOnCluster); 278 | end 279 | end 280 | 281 | if ~cluster.HasSharedFilesystem 282 | % Start the mirror to copy all the job files over to the cluster 283 | dctSchedulerMessage(4, '%s: Starting mirror for job %d.', currFilename, job.ID); 284 | remoteConnection.startMirrorForJob(job); 285 | end 286 | 287 | if strcmpi(clusterOS, 'unix') 288 | % Add execute permissions to shell scripts 289 | runSchedulerCommand(cluster, sprintf( ... 290 | 'chmod u+x "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 291 | % Convert line endings to Unix 292 | runSchedulerCommand(cluster, sprintf( ... 293 | 'dos2unix --allow-chown "%s%s"*.sh', jobDirectoryOnCluster, fileSeparator)); 294 | end 295 | 296 | for ii=1:numel(commandsToRun) 297 | commandToRun = commandsToRun{ii}; 298 | jobIDs{ii} = iSubmitJobUsingCommand(cluster, job, commandToRun); 299 | end 300 | 301 | % Calculate the schedulerIDs 302 | if useJobArrays 303 | % The scheduler ID of each task is a combination of the job ID and the 304 | % scheduler array index. cellfun pairs each job ID with its 305 | % corresponding scheduler array indices in schedulerJobArrayIndices and 306 | % returns the combination of both. For example, if jobIDs = {1,2} and 307 | % schedulerJobArrayIndices = {[1,2];[3,4]}, the schedulerID is given by 308 | % combining 1 with [1,2] and 2 with [3,4], in the canonical form of the 309 | % scheduler. 310 | schedulerIDs = cellfun(@(jobID,arrayIndices) jobID + "_" + arrayIndices, ... 311 | jobIDs, schedulerJobArrayIndices, 'UniformOutput',false); 312 | schedulerIDs = vertcat(schedulerIDs{:}); 313 | else 314 | % The scheduler ID of each task is the job ID. 315 | schedulerIDs = string(jobIDs); 316 | end 317 | 318 | % Store the scheduler ID for each task and the job cluster data 319 | jobData = struct('type', 'generic'); 320 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 321 | % Store the cluster host 322 | jobData.RemoteHost = remoteConnection.Hostname; 323 | end 324 | if ~cluster.HasSharedFilesystem 325 | % Store the remote job storage location 326 | jobData.RemoteJobStorageLocation = remoteConnection.JobStorageLocation; 327 | jobData.HasDoneLastMirror = false; 328 | end 329 | if verLessThan('matlab', '9.7') % schedulerID stored in job data 330 | jobData.ClusterJobIDs = schedulerIDs; 331 | else % schedulerID on task since 19b 332 | set(tasks, 'SchedulerID', schedulerIDs); 333 | end 334 | cluster.setJobClusterData(job, jobData); 335 | 336 | end 337 | 338 | function [useJobArrays, maxJobArraySize] = iGetJobArrayProps(cluster) 339 | % Look for useJobArrays and maxJobArray size in the following order: 340 | % 1. Additional Properties 341 | % 2. User Data 342 | % 3. 
Query scheduler for MaxJobArraySize 343 | 344 | useJobArrays = validatedPropValue(cluster.AdditionalProperties, 'UseJobArrays', 'logical'); 345 | if isempty(useJobArrays) 346 | if isfield(cluster.UserData, 'UseJobArrays') 347 | useJobArrays = cluster.UserData.UseJobArrays; 348 | else 349 | useJobArrays = true; 350 | end 351 | end 352 | 353 | if ~useJobArrays 354 | % Not using job arrays so don't need the max array size 355 | maxJobArraySize = 0; 356 | return 357 | end 358 | 359 | maxJobArraySize = validatedPropValue(cluster.AdditionalProperties, 'MaxJobArraySize', 'numeric'); 360 | if ~isempty(maxJobArraySize) 361 | if maxJobArraySize < 1 362 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 363 | 'MaxJobArraySize must be a positive integer'); 364 | end 365 | return 366 | end 367 | 368 | if isfield(cluster.UserData,'MaxJobArraySize') 369 | maxJobArraySize = cluster.UserData.MaxJobArraySize; 370 | return 371 | end 372 | 373 | % Get job array information by querying the scheduler. 374 | commandToRun = 'scontrol show config'; 375 | try 376 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 377 | catch err 378 | cmdFailed = true; 379 | cmdOut = err.message; 380 | end 381 | if cmdFailed 382 | error('parallelexamples:GenericSLURM:FailedToRetrieveInfo', ... 383 | 'Failed to retrieve Slurm configuration information using command:\n\t%s.\nReason: %s', ... 384 | commandToRun, cmdOut); 385 | end 386 | 387 | maxJobArraySize = 0; 388 | % Extract the maximum array size for job arrays. For Slurm, the 389 | % configuration line that contains the maximum array index looks like this: 390 | % MaxArraySize = 1000 391 | % Use a regular expression to extract this parameter. 392 | tokens = regexp(cmdOut,'MaxArraySize\s*=\s*(\d+)', 'tokens','once'); 393 | 394 | if isempty(tokens) || (str2double(tokens) == 0) 395 | % No job array support. 396 | useJobArrays = false; 397 | return 398 | end 399 | 400 | useJobArrays = true; 401 | % Set the maximum array size. 402 | maxJobArraySize = str2double(tokens{1}); 403 | % In Slurm, MaxArraySize is an exclusive upper bound. Subtract one to obtain 404 | % the inclusive upper bound. 405 | maxJobArraySize = maxJobArraySize - 1; 406 | end 407 | 408 | function jobID = iSubmitJobUsingCommand(cluster, job, commandToRun) 409 | currFilename = mfilename; 410 | % Ask the cluster to run the submission command. 411 | dctSchedulerMessage(4, '%s: Submitting job %d using command:\n\t%s', currFilename, job.ID, commandToRun); 412 | try 413 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 414 | catch err 415 | cmdFailed = true; 416 | cmdOut = err.message; 417 | end 418 | if cmdFailed 419 | if ~cluster.HasSharedFilesystem 420 | % Stop the mirroring if we failed to submit the job - this will also 421 | % remove the job files from the remote location 422 | remoteConnection = getRemoteConnection(cluster); 423 | % Only stop mirroring if we are actually mirroring 424 | if remoteConnection.isJobUsingConnection(job.ID) 425 | dctSchedulerMessage(5, '%s: Stopping the mirror for job %d.', currFilename, job.ID); 426 | try 427 | remoteConnection.stopMirrorForJob(job); 428 | catch err 429 | warning('parallelexamples:GenericSLURM:FailedToStopMirrorForJob', ... 430 | 'Failed to stop the file mirroring for job %d.\nReason: %s', ... 431 | job.ID, err.getReport); 432 | end 433 | end 434 | end 435 | error('parallelexamples:GenericSLURM:FailedToSubmitJob', ... 436 | 'Failed to submit job to Slurm using command:\n\t%s.\nReason: %s', ... 
437 | commandToRun, cmdOut); 438 | end 439 | 440 | jobID = extractJobId(cmdOut); 441 | if isempty(jobID) 442 | error('parallelexamples:GenericSLURM:FailedToParseSubmissionOutput', ... 443 | 'Failed to parse the job identifier from the submission output: "%s"', ... 444 | cmdOut); 445 | end 446 | end 447 | 448 | function rangesString = iCreateJobArrayString(taskIDs) 449 | % Create a character vector with the ranges of task IDs to submit 450 | if taskIDs(end) - taskIDs(1) + 1 == numel(taskIDs) 451 | % There is only one range. 452 | rangesString = sprintf('%d-%d',taskIDs(1),taskIDs(end)); 453 | else 454 | % There are several ranges. 455 | % Calculate the step size between task IDs. 456 | step = diff(taskIDs); 457 | % Where the step changes, a range ends and another starts. Include 458 | % the initial and ending IDs in the ranges as well. 459 | isStartOfRange = [true; step > 1]; 460 | isEndOfRange = [step > 1; true]; 461 | rangesString = strjoin(compose('%d-%d', ... 462 | taskIDs(isStartOfRange),taskIDs(isEndOfRange)),','); 463 | end 464 | end 465 | 466 | function logFileName = iGenerateLogFileName(subArrayIdx, jobArraySize) 467 | % This function builds the log file specifier, which is then passed to 468 | % Slurm to tell it where each task's output should go. This will be equal 469 | % to TaskX.log where X is the MATLAB ID. Slurm will not accept a job array 470 | % index greater than its MaxArraySize parameter. As a result MATLAB IDs 471 | % must be shifted down below MaxArraySize. To ensure that the log file for 472 | % Task X is called TaskX.log, round the maximum array size down to the 473 | % nearest power of 10 and manually construct the log file specifier. For 474 | % example, for a MaxArraySize of 1500, the Slurm job arrays will be of 475 | % size 1000, and MATLAB task IDs will map as illustrated by the following 476 | % table: 477 | % 478 | % MATLAB ID | Slurm ID | Log file specifier 479 | % ----------+----------+-------------------- 480 | % 1- 999 | 1-999 | Task%a.log 481 | % 1000-1999 | 000-999 | Task1%3a.log 482 | % 2000-2999 | 000-999 | Task2%3a.log 483 | % 3000 | 000 | Task3%3a.log 484 | % 485 | % Note that Slurm expands %a to the Slurm ID, and %3a to the Slurm ID 486 | % padded with zeros to 3 digits. 487 | if subArrayIdx == 1 488 | % Job arrays have more than one task. Use %a so that Slurm expands it 489 | % into the actual task ID. 490 | logFileName = 'Task%a.log'; 491 | else 492 | % For subsequent subarrays after the first one, prepend the index to %a 493 | % to identify the batch of log files and form the final log file name. 494 | padding = floor(log10(jobArraySize)); 495 | logFileName = sprintf('Task%d%%%da.log',subArrayIdx-1,padding); 496 | end 497 | end 498 | -------------------------------------------------------------------------------- /license.txt: -------------------------------------------------------------------------------- 1 | Copyright (c) 2022, The MathWorks, Inc. 2 | All rights reserved. 3 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 4 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 5 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 6 | 3. 
In all cases, the software is, and all modifications and derivatives of the software shall be, licensed to you solely for use in conjunction with MathWorks products and service offerings. 7 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 8 | -------------------------------------------------------------------------------- /postConstructFcn.m: -------------------------------------------------------------------------------- 1 | function postConstructFcn(cluster) %#ok 2 | %POSTCONSTRUCTFCN Perform custom configuration after call to PARCLUSTER 3 | % 4 | % POSTCONSTRUCTFCN(CLUSTER) execute code on cluster object CLUSTER. 5 | % 6 | % See also parcluster. 7 | 8 | % Copyright 2023 The MathWorks, Inc. 9 | 10 | end 11 | -------------------------------------------------------------------------------- /private/cancelJobOnCluster.m: -------------------------------------------------------------------------------- 1 | function OK = cancelJobOnCluster(cluster, job) 2 | %CANCELJOBONCLUSTER Cancels a job on the Slurm scheduler 3 | 4 | % Copyright 2010-2023 The MathWorks, Inc. 5 | 6 | % Store the current filename for the errors, warnings and 7 | % dctSchedulerMessages 8 | currFilename = mfilename; 9 | if ~isa(cluster, 'parallel.Cluster') 10 | error('parallelexamples:GenericSLURM:SubmitFcnError', ... 11 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 12 | end 13 | 14 | % Get the information about the actual cluster used 15 | data = cluster.getJobClusterData(job); 16 | if isempty(data) 17 | % This indicates that the job has not been submitted, so return true 18 | dctSchedulerMessage(1, '%s: Job cluster data was empty for job with ID %d.', currFilename, job.ID); 19 | OK = true; 20 | return 21 | end 22 | 23 | % Get a simplified list of schedulerIDs to reduce the number of calls to 24 | % the scheduler. 25 | schedulerIDs = getSimplifiedSchedulerIDsForJob(job); 26 | erroredJobAndCauseStrings = cell(size(schedulerIDs)); 27 | % Get the cluster to delete the job 28 | for ii = 1:length(schedulerIDs) 29 | schedulerID = schedulerIDs{ii}; 30 | commandToRun = sprintf('scancel -v ''%s''', schedulerID); 31 | dctSchedulerMessage(4, '%s: Canceling job on cluster using command:\n\t%s.', currFilename, commandToRun); 32 | try 33 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 34 | catch err 35 | cmdFailed = true; 36 | cmdOut = err.message; 37 | end 38 | % scancel can return 0 even if there is an error, so also check the 39 | % cmdOut does not contain error text. We do not consider attempting 40 | % to cancel a finished job as a failure, so exclude that. 41 | if (cmdFailed || contains(cmdOut, 'error:')) && ... 
42 | ~contains(cmdOut, {'already completing', 'Invalid job id specified'}) 43 | % Keep track of all jobs that errored when being cancelled, either 44 | % through a bad exit code or if an error was thrown. We'll report 45 | % these later on. 46 | erroredJobAndCauseStrings{ii} = sprintf('Job ID: %s\tReason: %s', schedulerID, strtrim(cmdOut)); 47 | dctSchedulerMessage(1, '%s: Failed to cancel job %s on cluster. Reason:\n\t%s', currFilename, schedulerID, cmdOut); 48 | end 49 | end 50 | 51 | if ~cluster.HasSharedFilesystem 52 | % Only stop mirroring if we are actually mirroring 53 | remoteConnection = getRemoteConnection(cluster); 54 | if remoteConnection.isJobUsingConnection(job.ID) 55 | dctSchedulerMessage(5, '%s: Stopping the mirror for job %d.', currFilename, job.ID); 56 | try 57 | remoteConnection.stopMirrorForJob(job); 58 | catch err 59 | warning('parallelexamples:GenericSLURM:FailedToStopMirrorForJob', ... 60 | 'Failed to stop the file mirroring for job %d.\nReason: %s', ... 61 | job.ID, err.getReport); 62 | end 63 | end 64 | end 65 | 66 | % Now warn about those jobs that we failed to cancel. 67 | erroredJobAndCauseStrings = erroredJobAndCauseStrings(~cellfun(@isempty, erroredJobAndCauseStrings)); 68 | if ~isempty(erroredJobAndCauseStrings) 69 | warning('parallelexamples:GenericSLURM:FailedToCancelJob', ... 70 | 'Failed to cancel the following jobs on the cluster:\n%s', ... 71 | sprintf(' %s\n', erroredJobAndCauseStrings{:})); 72 | end 73 | OK = isempty(erroredJobAndCauseStrings); 74 | 75 | end 76 | -------------------------------------------------------------------------------- /private/cancelTaskOnCluster.m: -------------------------------------------------------------------------------- 1 | function OK = cancelTaskOnCluster(cluster, task) 2 | %CANCELTASKONCLUSTER Cancels a task on the Slurm scheduler 3 | 4 | % Copyright 2020-2023 The MathWorks, Inc. 5 | 6 | % Store the current filename for the errors, warnings and 7 | % dctSchedulerMessages 8 | currFilename = mfilename; 9 | if ~isa(cluster, 'parallel.Cluster') 10 | error('parallelexamples:GenericSLURM:SubmitFcnError', ... 11 | 'The function %s is for use with clusters created using the parcluster command.', currFilename) 12 | end 13 | 14 | % Get the information about the actual cluster used 15 | data = cluster.getJobClusterData(task.Parent); 16 | if isempty(data) 17 | % This indicates that the parent job has not been submitted, so return true 18 | dctSchedulerMessage(1, '%s: Job cluster data was empty for the parent job with ID %d.', currFilename, task.Parent.ID); 19 | OK = true; 20 | return 21 | end 22 | % We can't cancel a single task of a communicating job on the scheduler 23 | % without cancelling the entire job, so warn and return in this case 24 | if ~strcmpi(task.Parent.Type, 'independent') 25 | OK = false; 26 | warning('parallelexamples:GenericSLURM:FailedToCancelTask', ... 27 | 'Unable to cancel a single task of a communicating job. 
If you want to cancel the entire job, use the cancel function on the job object instead.'); 28 | return 29 | end 30 | 31 | % Get the cluster to delete the task 32 | if verLessThan('matlab', '9.7') % schedulerID stored in job data 33 | schedulerIDs = data.ClusterJobIDs; 34 | schedulerID = schedulerIDs{task.ID}; 35 | else % schedulerID on task since 19b 36 | schedulerID = task.SchedulerID; 37 | end 38 | erroredTaskAndCauseString = ''; 39 | commandToRun = sprintf('scancel -v ''%s''', schedulerID); 40 | dctSchedulerMessage(4, '%s: Canceling task on cluster using command:\n\t%s.', currFilename, commandToRun); 41 | try 42 | [cmdFailed, cmdOut] = runSchedulerCommand(cluster, commandToRun); 43 | catch err 44 | cmdFailed = true; 45 | cmdOut = err.message; 46 | end 47 | % scancel can return 0 even if there is an error, so also check the 48 | % cmdOut does not contain error text. We do not consider attempting 49 | % to cancel a finished job as a failure, so exclude that. 50 | if (cmdFailed || contains(cmdOut, 'error:')) && ... 51 | ~contains(cmdOut, {'already completing', 'Invalid job id specified'}) 52 | % Record if the task errored when being cancelled, either through a bad 53 | % exit code or if an error was thrown. We'll report this as a warning. 54 | erroredTaskAndCauseString = sprintf('Job ID: %s\tReason: %s', schedulerID, strtrim(cmdOut)); 55 | dctSchedulerMessage(1, '%s: Failed to cancel task %s on cluster. Reason:\n\t%s', currFilename, schedulerID, cmdOut); 56 | end 57 | 58 | % Warn if task cancellation failed. 59 | OK = isempty(erroredTaskAndCauseString); 60 | if ~OK 61 | warning('parallelexamples:GenericSLURM:FailedToCancelTask', ... 62 | 'Failed to cancel the task on the cluster:\n %s\n', ... 63 | erroredTaskAndCauseString); 64 | end 65 | 66 | end 67 | -------------------------------------------------------------------------------- /private/createEnvironmentWrapper.m: -------------------------------------------------------------------------------- 1 | function createEnvironmentWrapper(outputFilename, quotedWrapperPath, environmentVariables) 2 | % Create a script that sets the correct environment variables and then 3 | % calls the job wrapper. 4 | 5 | % Copyright 2023 The MathWorks, Inc. 6 | 7 | dctSchedulerMessage(5, '%s: Creating environment wrapper at %s', mfilename, outputFilename); 8 | 9 | % Open file in binary mode to make it cross-platform. 10 | fid = fopen(outputFilename, 'w'); 11 | if fid < 0 12 | error('parallelexamples:GenericSLURM:FileError', ... 13 | 'Failed to open file %s for writing', outputFilename); 14 | end 15 | fileCloser = onCleanup(@() fclose(fid)); 16 | 17 | % Specify shell to use 18 | fprintf(fid, '#!/bin/sh\n'); 19 | 20 | formatSpec = 'export %s=''%s''\n'; 21 | 22 | % Write the commands to set and export environment variables 23 | for ii = 1:size(environmentVariables, 1) 24 | fprintf(fid, formatSpec, environmentVariables{ii,1}, environmentVariables{ii,2}); 25 | end 26 | 27 | % Write the command to run the job wrapper 28 | fprintf(fid, '%s\n', quotedWrapperPath); 29 | 30 | end 31 | -------------------------------------------------------------------------------- /private/createSubmitScript.m: -------------------------------------------------------------------------------- 1 | function createSubmitScript(outputFilename, jobName, quotedLogFile, ... 2 | quotedWrapperPath, additionalSubmitArgs, jobArrayString) 3 | % Create a script that runs the Slurm sbatch command. 4 | 5 | % Copyright 2010-2024 The MathWorks, Inc. 
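%
% A minimal call sketch (all argument values are illustrative):
%
%   createSubmitScript('/tmp/Job1/submitScript1.sh', 'MATLAB_R2024a_Job1', ...
%       '''/remote/Job1/Task%a.log''', '''/remote/Job1/environmentWrapper1.sh''', ...
%       '--ntasks=1 --cpus-per-task=1', '1-100');
%
% This writes a shell script that first unsets any inherited SLURM_/SBATCH_
% variables and then runs the sbatch command built by getSubmitString.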
6 | 7 | if nargin < 6 8 | jobArrayString = []; 9 | end 10 | 11 | dctSchedulerMessage(5, '%s: Creating submit script for %s at %s', mfilename, jobName, outputFilename); 12 | 13 | % Open file in binary mode to make it cross-platform. 14 | fid = fopen(outputFilename, 'w'); 15 | if fid < 0 16 | error('parallelexamples:GenericSLURM:FileError', ... 17 | 'Failed to open file %s for writing', outputFilename); 18 | end 19 | fileCloser = onCleanup(@() fclose(fid)); 20 | 21 | % Specify shell to use 22 | fprintf(fid, '#!/bin/sh\n'); 23 | 24 | % Unset all SLURM_ and SBATCH_ variables to avoid conflicting options in 25 | % nested jobs, except for SLURM_CONF which is required for the Slurm 26 | % utilities to work 27 | fprintf(fid, '%s\n', ... 28 | 'for VAR_NAME in $(env | cut -d= -f1 | grep -E ''^(SLURM_|SBATCH_)'' | grep -v ''^SLURM_CONF$''); do', ... 29 | ' unset "$VAR_NAME"', ... 30 | 'done'); 31 | 32 | commandToRun = getSubmitString(jobName, quotedLogFile, quotedWrapperPath, ... 33 | additionalSubmitArgs, jobArrayString); 34 | fprintf(fid, '%s\n', commandToRun); 35 | 36 | end 37 | -------------------------------------------------------------------------------- /private/extractJobId.m: -------------------------------------------------------------------------------- 1 | function jobID = extractJobId(sbatchCommandOutput) 2 | % Extracts the job ID from the sbatch command output for Slurm 3 | 4 | % Copyright 2015-2022 The MathWorks, Inc. 5 | 6 | % Output from sbatch expected to be in the following format: 7 | % Submitted batch job 12345 8 | % 9 | % sbatch could also attach a warning to the output, such as: 10 | % 11 | % sbatch: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1 12 | % Submitted batch job 12346 13 | 14 | % Trim sbatch command output for use in debug message 15 | trimmedCommandOutput = strtrim(sbatchCommandOutput); 16 | 17 | % Ignore anything before or after 'Submitted batch job ###', and extract the numeric value. 18 | searchPattern = '.*Submitted batch job ([0-9]+).*'; 19 | 20 | % When we match searchPattern, matchedTokens is a single entry cell array containing the jobID. 21 | % Otherwise we failed to match searchPattern, so matchedTokens is an empty cell array. 22 | matchedTokens = regexp(sbatchCommandOutput, searchPattern, 'tokens', 'once'); 23 | 24 | if isempty(matchedTokens) 25 | % Callers check for error in extracting Job ID using isempty() on return value. 26 | jobID = ''; 27 | dctSchedulerMessage(0, '%s: Failed to extract Job ID from sbatch output: \n\t%s', mfilename, trimmedCommandOutput); 28 | else 29 | jobID = matchedTokens{1}; 30 | dctSchedulerMessage(0, '%s: Job ID %s was extracted from sbatch output: \n\t%s', mfilename, jobID, trimmedCommandOutput); 31 | end 32 | 33 | end 34 | -------------------------------------------------------------------------------- /private/getCommonSubmitArgs.m: -------------------------------------------------------------------------------- 1 | function commonSubmitArgs = getCommonSubmitArgs(cluster) 2 | % Get any additional submit arguments for the Slurm sbatch command 3 | % that are common to both independent and communicating jobs. 4 | 5 | % Copyright 2016-2023 The MathWorks, Inc. 6 | 7 | commonSubmitArgs = ''; 8 | ap = cluster.AdditionalProperties; 9 | 10 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 11 | %% CUSTOMIZATION MAY BE REQUIRED %% 12 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 13 | % You may wish to support further cluster.AdditionalProperties fields here 14 | % and modify the submission command arguments accordingly. 
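% As one sketch of the pattern used below, a hypothetical
% AdditionalProperties field named GPUsPerNode could be forwarded to
% Slurm's --gpus-per-node flag like this (GPUsPerNode is not a standard
% field of this plugin):
%
%   commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ...
%       'GPUsPerNode', 'numeric', '--gpus-per-node=%d');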
15 | 16 | % Account name 17 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 18 | 'AccountName', 'char', '-A %s'); 19 | 20 | % Constraint 21 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 22 | 'Constraint', 'char', '-C %s'); 23 | 24 | % Memory required per CPU 25 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 26 | 'MemPerCPU', 'char', '--mem-per-cpu=%s'); 27 | 28 | % Partition (queue) 29 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 30 | 'Partition', 'char', '-p %s'); 31 | 32 | % Require exclusive use of requested nodes 33 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 34 | 'RequireExclusiveNode', 'logical', '--exclusive'); 35 | 36 | % Reservation 37 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 38 | 'Reservation', 'char', '--reservation=%s'); 39 | 40 | % Wall time 41 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 42 | 'WallTime', 'char', '-t %s'); 43 | 44 | % Email notification 45 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 46 | 'EmailAddress', 'char', '--mail-type=ALL --mail-user=%s'); 47 | 48 | % Catch all: directly append anything in the AdditionalSubmitArgs 49 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, ... 50 | 'AdditionalSubmitArgs', 'char', '%s'); 51 | 52 | % Trim any whitespace 53 | commonSubmitArgs = strtrim(commonSubmitArgs); 54 | 55 | end 56 | 57 | function commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, propName, propType, submitPattern, defaultValue) 58 | % Helper fcn to append a scheduler option to the submit string. 59 | % Inputs: 60 | % commonSubmitArgs: submit string to append to 61 | % ap: AdditionalProperties object 62 | % propName: name of the property 63 | % propType: type of the property, i.e. char, double or logical 64 | % submitPattern: sprintf-style string specifying the format of the scheduler option 65 | % defaultValue (optional): value to use if the property is not specified in ap 66 | 67 | if nargin < 6 68 | defaultValue = []; 69 | end 70 | arg = validatedPropValue(ap, propName, propType, defaultValue); 71 | if ~isempty(arg) && (~islogical(arg) || arg) 72 | commonSubmitArgs = [commonSubmitArgs, ' ', sprintf(submitPattern, arg)]; 73 | end 74 | end 75 | 76 | function commonSubmitArgs = iAppendRequiredArgument(commonSubmitArgs, ap, propName, propType, submitPattern, errMsg) %#ok 77 | % Helper fcn to append a required scheduler option to the submit string. 78 | % An error is thrown if the property is not specified in AdditionalProperties or is empty. 79 | % Inputs: 80 | % commonSubmitArgs: submit string to append to 81 | % ap: AdditionalProperties object 82 | % propName: name of the property 83 | % propType: type of the property, i.e. 
char, double or logical 84 | % submitPattern: sprintf-style string specifying the format of the scheduler option 85 | % errMsg (optional): text to append to the error message if the property is not specified in ap 86 | 87 | if ~isprop(ap, propName) 88 | errorText = sprintf('Required field %s is missing from AdditionalProperties.', propName); 89 | if nargin > 5 90 | errorText = [errorText newline errMsg]; 91 | end 92 | error('parallelexamples:GenericSLURM:MissingAdditionalProperties', errorText); 93 | elseif isempty(ap.(propName)) 94 | errorText = sprintf('Required field %s is empty in AdditionalProperties.', propName); 95 | if nargin > 5 96 | errorText = [errorText newline errMsg]; 97 | end 98 | error('parallelexamples:GenericSLURM:EmptyAdditionalProperties', errorText); 99 | end 100 | commonSubmitArgs = iAppendArgument(commonSubmitArgs, ap, propName, propType, submitPattern); 101 | end 102 | -------------------------------------------------------------------------------- /private/getRemoteConnection.m: -------------------------------------------------------------------------------- 1 | function remoteConnection = getRemoteConnection(cluster) 2 | %GETREMOTECONNECTION Get a connected RemoteClusterAccess 3 | % 4 | % getRemoteConnection will either retrieve a RemoteClusterAccess from the 5 | % cluster's UserData or it will create a new RemoteClusterAccess. 6 | 7 | % Copyright 2010-2024 The MathWorks, Inc. 8 | 9 | % Store the current filename for the dctSchedulerMessages 10 | currFilename = mfilename; 11 | 12 | clusterHost = validatedPropValue(cluster.AdditionalProperties, 'ClusterHost', 'char'); 13 | if isempty(clusterHost) 14 | error('parallelexamples:GenericSLURM:MissingAdditionalProperties', ... 15 | 'Required field %s is missing from AdditionalProperties.', 'ClusterHost'); 16 | end 17 | 18 | if ~cluster.HasSharedFilesystem 19 | remoteJobStorageLocation = validatedPropValue(cluster.AdditionalProperties, ... 20 | 'RemoteJobStorageLocation', 'char'); 21 | if isempty(remoteJobStorageLocation) 22 | error('parallelexamples:GenericSLURM:MissingAdditionalProperties', ... 23 | 'Required field %s is missing from AdditionalProperties.', 'RemoteJobStorageLocation'); 24 | end 25 | 26 | useUniqueSubfolders = validatedPropValue(cluster.AdditionalProperties, ... 27 | 'UseUniqueSubfolders', 'logical', false); 28 | end 29 | 30 | needToCreateNewConnection = false; 31 | if isempty(cluster.UserData) 32 | needToCreateNewConnection = true; 33 | else 34 | if ~isstruct(cluster.UserData) 35 | error('parallelexamples:GenericSLURM:IncorrectUserData', ... 36 | ['Failed to retrieve remote connection from cluster''s UserData.\n' ... 37 | 'Expected cluster''s UserData to be a structure, but found %s'], ... 38 | class(cluster.UserData)); 39 | end 40 | 41 | if isfield(cluster.UserData, 'RemoteConnection') 42 | % Get the remote connection out of the cluster user data 43 | remoteConnection = cluster.UserData.RemoteConnection; 44 | 45 | % And check it is of the type that we expect 46 | if isempty(remoteConnection) || (isa(remoteConnection, "handle") && ~isvalid(remoteConnection)) 47 | needToCreateNewConnection = true; 48 | else 49 | clusterAccessClassname = 'parallel.cluster.RemoteClusterAccess'; 50 | if ~isa(remoteConnection, clusterAccessClassname) 51 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 52 | ['Failed to retrieve remote connection from cluster''s UserData.\n' ... 53 | 'Expected the RemoteConnection field of the UserData to contain an object of type %s, but found %s.'], ... 
54 | clusterAccessClassname, class(remoteConnection)); 55 | end 56 | 57 | if ~cluster.HasSharedFilesystem 58 | if useUniqueSubfolders 59 | username = remoteConnection.Username; 60 | expectedRemoteJobStorageLocation = iBuildUniqueSubfolder(remoteJobStorageLocation, ... 61 | username, iGetFileSeparator(cluster)); 62 | else 63 | expectedRemoteJobStorageLocation = remoteJobStorageLocation; 64 | end 65 | end 66 | 67 | if ~remoteConnection.IsConnected 68 | needToCreateNewConnection = true; 69 | elseif cluster.HasSharedFilesystem && ... 70 | ~strcmpi(remoteConnection.Hostname, clusterHost) 71 | % The connection stored in the user data does not match the cluster host requested 72 | warning('parallelexamples:GenericSLURM:DifferentRemoteParameters', ... 73 | ['The current cluster is already using cluster host %s.\n', ... 74 | 'The existing connection to %s will be replaced.'], ... 75 | remoteConnection.Hostname, remoteConnection.Hostname); 76 | cluster.UserData.RemoteConnection = []; 77 | needToCreateNewConnection = true; 78 | elseif ~cluster.HasSharedFilesystem && ... 79 | (~strcmpi(remoteConnection.Hostname, clusterHost) || ... 80 | ~remoteConnection.IsFileMirrorSupported || ... 81 | ~strcmpi(remoteConnection.JobStorageLocation, expectedRemoteJobStorageLocation)) 82 | % The connection stored in the user data does not match the cluster host 83 | % and remote location requested 84 | warning('parallelexamples:GenericSLURM:DifferentRemoteParameters', ... 85 | ['The current cluster is already using cluster host %s and remote job storage location %s.\n', ... 86 | 'The existing connection to %s will be replaced.'], ... 87 | remoteConnection.Hostname, remoteConnection.JobStorageLocation, remoteConnection.Hostname); 88 | cluster.UserData.RemoteConnection = []; 89 | needToCreateNewConnection = true; 90 | end 91 | end 92 | else 93 | needToCreateNewConnection = true; 94 | end 95 | end 96 | 97 | if ~needToCreateNewConnection 98 | return 99 | end 100 | 101 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 102 | %% CUSTOMIZATION MAY BE REQUIRED %% 103 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 104 | % Get the credential options from the user using simple 105 | % MATLAB dialogs or command line input. You should change 106 | % this section if you wish for users to provide their credential 107 | % options in a different way. 108 | % The pertinent options are: 109 | % username - The username you use when you run commands on the remote host 110 | % authMode - Authentication mode you use when you connect to the cluster. 111 | % Supported options are: 112 | % 'Password' - Enter your SSH password when prompted by MATLAB. 113 | % 'IdentityFile' - Use an identity file on disk. 114 | % 'Agent' - Interface with an SSH agent running on the client machine. 115 | % Supported in R2021b onwards. 116 | % 'Multifactor' - Enable the cluster to prompt you for input one or more 117 | % times. If two-factor authentication (2FA) is enabled on 118 | % the cluster, the cluster will request your password and 119 | % a response for the second authentication factor. 120 | % Supported in R2022a onwards. 121 | % identityFile - Full path to the identity file. 122 | % identityFileHasPassphrase - True if the identity file requires a passphrase 123 | % (true/false). 
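%
% For example, a user could preconfigure identity-file authentication from
% the MATLAB client so that no prompts appear (the file path is
% illustrative):
%
%   c = parcluster;
%   c.AdditionalProperties.AuthenticationMode = 'IdentityFile';
%   c.AdditionalProperties.IdentityFile = '/home/user/.ssh/id_rsa';
%   c.AdditionalProperties.IdentityFileHasPassphrase = false;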
124 | 125 | % Use the UI for prompts if MATLAB has been started with the desktop enabled 126 | useUI = iShouldUseUI(); 127 | username = iGetUsername(cluster, useUI); 128 | 129 | % Decide which authentication mode to use 130 | % Default mechanism is to prompt for password 131 | authMode = 'Password'; 132 | if isprop(cluster.AdditionalProperties, 'AuthenticationMode') 133 | % If AdditionalProperties.AuthenticationMode is defined, use that 134 | authMode = cluster.AdditionalProperties.AuthenticationMode; 135 | elseif isprop(cluster.AdditionalProperties, 'UseIdentityFile') 136 | % Otherwise use an identity file if UseIdentityFile is defined and true 137 | useIdentityFile = validatedPropValue(cluster.AdditionalProperties, 'UseIdentityFile', 'logical'); 138 | if useIdentityFile 139 | authMode = 'IdentityFile'; 140 | end 141 | elseif isprop(cluster.AdditionalProperties, 'IdentityFile') 142 | % Otherwise use an identity file if IdentityFile is defined 143 | authMode = 'IdentityFile'; 144 | else 145 | % Otherwise nothing is specified, ask the user what to do 146 | authMode = iPromptUserForAuthenticationMode(cluster, useUI); 147 | end 148 | 149 | % Build the user arguments to pass to RemoteClusterAccess 150 | userArgs = {username}; 151 | if verLessThan('matlab', '9.11') %#ok<*VERLESSMATLAB> We support back to 17a 152 | if ~ischar(authMode) || ~ismember(authMode, {'IdentityFile', 'Password'}) 153 | % Prior to R2021b, only IdentityFile and Password are supported 154 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 155 | 'AuthenticationMode must be either ''IdentityFile'' or ''Password'''); 156 | end 157 | else 158 | % No need to validate authMode, RemoteClusterAccess will do that for us 159 | userArgs = [userArgs, 'AuthenticationMode', {authMode}]; 160 | end 161 | 162 | % If using identity file, also need the filename and whether a passphrase is needed 163 | if any(strcmp(authMode, 'IdentityFile')) 164 | identityFile = iGetIdentityFile(cluster, useUI); 165 | identityFileHasPassphrase = iGetIdentityFileHasPassphrase(cluster, useUI); 166 | userArgs = [userArgs, 'IdentityFilename', {identityFile}, ... 167 | 'IdentityFileHasPassphrase', identityFileHasPassphrase]; 168 | end 169 | 170 | % Changing SSH port supported for R2021b onwards 171 | if ~verLessThan('matlab', '9.11') 172 | sshPort = validatedPropValue(cluster.AdditionalProperties, 'SSHPort', 'double'); 173 | if ~isempty(sshPort) 174 | userArgs = [userArgs, 'Port', sshPort]; 175 | end 176 | end 177 | 178 | % Now connect and store the connection 179 | dctSchedulerMessage(1, '%s: Connecting to remote host %s', ... 180 | currFilename, clusterHost); 181 | if cluster.HasSharedFilesystem 182 | remoteConnection = parallel.cluster.RemoteClusterAccess.getConnectedAccess(clusterHost, userArgs{:}); 183 | else 184 | if useUniqueSubfolders 185 | remoteJobStorageLocation = iBuildUniqueSubfolder(remoteJobStorageLocation, ... 
186 | username, iGetFileSeparator(cluster)); 187 | end 188 | remoteConnection = parallel.cluster.RemoteClusterAccess.getConnectedAccessWithMirror(clusterHost, remoteJobStorageLocation, userArgs{:}); 189 | end 190 | dctSchedulerMessage(5, '%s: Storing remote connection in cluster''s user data.', currFilename); 191 | cluster.UserData.RemoteConnection = remoteConnection; 192 | 193 | end 194 | 195 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 196 | function useUI = iShouldUseUI() 197 | if verLessThan('matlab', '9.11') 198 | % Prior to R2021b, check for Java AWT components 199 | useUI = isempty(javachk('awt')); 200 | else 201 | % From R2021b onwards, can use the desktop function 202 | useUI = desktop('-inuse'); 203 | end 204 | end 205 | 206 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 207 | function username = iGetUsername(cluster, useUI) 208 | 209 | username = validatedPropValue(cluster.AdditionalProperties, 'Username', 'char'); 210 | if ~isempty(username) 211 | return 212 | end 213 | 214 | if useUI 215 | dlgMessage = sprintf('Enter the username for %s', cluster.AdditionalProperties.ClusterHost); 216 | dlgTitle = 'User Credentials'; 217 | numlines = 1; 218 | usernameResponse = inputdlg(dlgMessage, dlgTitle, numlines); 219 | % Hitting cancel gives an empty cell array, but a user providing an empty string gives 220 | % a (non-empty) cell array containing an empty string 221 | if isempty(usernameResponse) 222 | % User hit cancel 223 | error('parallelexamples:GenericSLURM:UserCancelledOperation', ... 224 | 'User cancelled operation.'); 225 | end 226 | username = char(usernameResponse); 227 | return 228 | end 229 | 230 | % useUI == false 231 | msg = sprintf('Enter the username for %s:\n ', cluster.AdditionalProperties.ClusterHost); 232 | username = input(msg, 's'); 233 | 234 | end 235 | 236 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 237 | function identityFileHasPassphrase = iGetIdentityFileHasPassphrase(cluster, useUI) 238 | 239 | identityFileHasPassphrase = validatedPropValue( ... 240 | cluster.AdditionalProperties, 'IdentityFileHasPassphrase', 'logical'); 241 | if ~isempty(identityFileHasPassphrase) 242 | return 243 | end 244 | 245 | if useUI 246 | dlgMessage = 'Does the identity file require a password?'; 247 | dlgTitle = 'User Credentials'; 248 | passphraseResponse = questdlg(dlgMessage, dlgTitle); 249 | if strcmp(passphraseResponse, 'Cancel') 250 | % User hit cancel 251 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 252 | end 253 | identityFileHasPassphrase = strcmp(passphraseResponse, 'Yes'); 254 | return 255 | end 256 | 257 | % useUI == false 258 | validYesNoResponse = {'y', 'n'}; 259 | passphraseMessage = sprintf('Does the identity file require a password? (y or n)\n '); 260 | passphraseResponse = iLoopUntilValidStringInput(passphraseMessage, validYesNoResponse); 261 | identityFileHasPassphrase = strcmpi(passphraseResponse, 'y'); 262 | 263 | end 264 | 265 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 266 | function identityFile = iGetIdentityFile(cluster, useUI) 267 | 268 | if isprop(cluster.AdditionalProperties, 'IdentityFile') 269 | identityFile = cluster.AdditionalProperties.IdentityFile; 270 | if ~(ischar(identityFile) || isstring(identityFile) || iscellstr(identityFile)) || any(strlength(identityFile) == 0) 271 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 
272 | 'Each IdentityFile must be a nonempty character vector'); 273 | end 274 | else 275 | if useUI 276 | dlgMessage = 'Select Identity File to use'; 277 | [filename, pathname] = uigetfile({'*.*', 'All Files (*.*)'}, dlgMessage); 278 | % If the user hit cancel, then filename and pathname will both be 0. 279 | if isequal(filename, 0) && isequal(pathname,0) 280 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 281 | end 282 | identityFile = fullfile(pathname, filename); 283 | else 284 | msg = sprintf('Please enter the full path to the Identity File to use:\n '); 285 | identityFile = input(msg, 's'); 286 | end 287 | end 288 | 289 | end 290 | 291 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 292 | function authMode = iPromptUserForAuthenticationMode(cluster, useUI) 293 | 294 | promptMessage = sprintf('Select an authentication method to log in to %s', cluster.AdditionalProperties.ClusterHost); 295 | options = {'Password', 'Identity File', 'Cancel'}; 296 | 297 | if useUI 298 | dlgTitle = 'User Credentials'; 299 | defaultOption = 'Password'; 300 | authMode = questdlg(promptMessage, dlgTitle, options{:}, defaultOption); 301 | authMode = strrep(authMode, ' ', ''); 302 | if strcmp(authMode, 'Cancel') || isempty(authMode) 303 | % User hit cancel or closed the window 304 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 305 | end 306 | else 307 | validResponses = {'1', '2', '3'}; 308 | displayItems = [validResponses; options]; 309 | identityFileMessage = [promptMessage, newline, sprintf('%s) %s\n', displayItems{:}), ' ']; 310 | response = iLoopUntilValidStringInput(identityFileMessage, validResponses); 311 | switch response 312 | case '1' 313 | authMode = 'Password'; 314 | case '2' 315 | authMode = 'IdentityFile'; 316 | otherwise 317 | error('parallelexamples:GenericSLURM:UserCancelledOperation', 'User cancelled operation.'); 318 | end 319 | end 320 | 321 | end 322 | 323 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 324 | function returnValue = iLoopUntilValidStringInput(message, validValues) 325 | % Function to loop until a valid response is obtained user input 326 | returnValue = ''; 327 | 328 | while isempty(returnValue) || ~any(strcmpi(returnValue, validValues)) 329 | returnValue = input(message, 's'); 330 | end 331 | end 332 | 333 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 334 | function subfolder = iBuildUniqueSubfolder(remoteJobStorageLocation, username, fileSeparator) 335 | % Function to build unique location using username and MATLAB release version 336 | release = ['R' version('-release')]; 337 | subfolder = [remoteJobStorageLocation fileSeparator username fileSeparator release]; 338 | end 339 | 340 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 341 | function fileSeparator = iGetFileSeparator(cluster) 342 | % Function to return file separator for cluster operating system 343 | if strcmpi(cluster.OperatingSystem, 'unix') 344 | fileSeparator = '/'; 345 | else 346 | fileSeparator = '\'; 347 | end 348 | end 349 | -------------------------------------------------------------------------------- /private/getSimplifiedSchedulerIDsForJob.m: -------------------------------------------------------------------------------- 1 | function [schedulerIDs, numTasks] = getSimplifiedSchedulerIDsForJob(job) 2 | %GETSIMPLIFIEDSCHEDULERIDSFORJOB Returns the smallest possible list of Slurm JobIDs that describe the MATLAB job. 
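%
% As a worked example: a job whose five tasks have SchedulerIDs
% {'1234_1','1234_2','1234_3','1234_4','1235'} (four tasks submitted in
% job array 1234 plus standalone job 1235) yields
% SCHEDULERIDS = {'1234','1235'} and NUMTASKS = 5.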
3 | % 4 | % SCHEDULERIDS = getSimplifiedSchedulerIDsForJob(JOB) returns the smallest 5 | % possible list of Slurm job IDs that describe the MATLAB job JOB. The 6 | % function converts child job IDs of a job array to the parent job ID of 7 | % the array, and removes any duplicates. 8 | % 9 | % [SCHEDULERIDS, NUMTASKS] = getSimplifiedSchedulerIDsForJob(JOB) also 10 | % returns the number of tasks that SCHEDULERIDS represents. 11 | 12 | % Copyright 2019-2022 The MathWorks, Inc. 13 | 14 | if verLessThan('matlab', '9.7') % schedulerID stored in job data 15 | data = job.Parent.getJobClusterData(job); 16 | schedulerIDs = data.ClusterJobIDs; 17 | else % schedulerID on task since 19b 18 | schedulerIDs = job.getTaskSchedulerIDs(); 19 | end 20 | numTasks = numel(schedulerIDs); 21 | 22 | % Child jobs within a job array will have a schedulerID of the form 23 | % _. 24 | schedulerIDs = regexprep(schedulerIDs, '_\d+', ''); 25 | schedulerIDs = unique(schedulerIDs, 'stable'); 26 | end 27 | -------------------------------------------------------------------------------- /private/getSubmitString.m: -------------------------------------------------------------------------------- 1 | function submitString = getSubmitString(jobName, quotedLogFile, quotedCommand, ... 2 | additionalSubmitArgs, jobArrayString) 3 | %GETSUBMITSTRING Gets the correct sbatch command for a Slurm cluster 4 | 5 | % Copyright 2010-2023 The MathWorks, Inc. 6 | 7 | if ~isempty(jobArrayString) 8 | jobArrayString = strcat('--array=''[', jobArrayString, ']'''); 9 | end 10 | 11 | submitString = sprintf('sbatch --job-name=%s %s --output=%s --export=NONE %s %s', ... 12 | jobName, jobArrayString, quotedLogFile, additionalSubmitArgs, quotedCommand); 13 | 14 | end 15 | -------------------------------------------------------------------------------- /private/runSchedulerCommand.m: -------------------------------------------------------------------------------- 1 | function [status, result] = runSchedulerCommand(cluster, cmd) 2 | %RUNSCHEDULERCOMMAND Run a command on the cluster. 3 | 4 | % Copyright 2019-2024 The MathWorks, Inc. 5 | 6 | persistent wrapper 7 | 8 | if isprop(cluster.AdditionalProperties, 'ClusterHost') 9 | % Need to run the command over SSH 10 | remoteConnection = getRemoteConnection(cluster); 11 | [status, result] = remoteConnection.runCommand(cmd); 12 | else 13 | % Can shell out 14 | if isunix 15 | % Some scheduler utility commands on unix return exit codes > 127, which 16 | % MATLAB interprets as a fatal signal. This is not the case here, so wrap 17 | % the system call to the scheduler on UNIX within a shell script to 18 | % sanitize any exit codes in this range. 
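% For example, some Slurm utilities exit with status 255 on error; without
% the wrapper, system() would report that code as a terminating signal
% rather than an ordinary nonzero exit status.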
19 | if isempty(wrapper) 20 | wrapper = iBuildWrapperPath(); 21 | end 22 | cmd = sprintf('%s %s', wrapper, cmd); 23 | end 24 | [status, result] = system(cmd); 25 | end 26 | 27 | end 28 | 29 | function wrapper = iBuildWrapperPath() 30 | if verLessThan('matlab', '9.7') 31 | pctDir = toolboxdir('distcomp'); %#ok<*DCRENAME> 32 | elseif verLessThan('matlab', '25.1') %#ok<*VERLESSMATLAB> 33 | pctDir = toolboxdir('parallel'); 34 | else 35 | pctDir = fullfile(matlabroot, 'toolbox', 'parallel'); 36 | end 37 | wrapper = fullfile(pctDir, 'bin', 'util', 'shellWrapper.sh'); 38 | end 39 | -------------------------------------------------------------------------------- /private/validatedPropValue.m: -------------------------------------------------------------------------------- 1 | function val = validatedPropValue(ap, prop, type, defaultValue) 2 | % If prop is in the AdditionalProperties ap, validate the value is the correct 3 | % type and return it. If prop is not present, return the provided defaultValue. 4 | % If prop is not present and no defaultValue is provided, returns empty. 5 | 6 | % Copyright 2022 The MathWorks, Inc. 7 | 8 | narginchk(3, 4); 9 | 10 | if nargin < 4 11 | % If no defaultValue specified, use empty 12 | defaultValue = []; 13 | end 14 | 15 | if ~isprop(ap, prop) 16 | % prop is not present in ap, use the defaultValue 17 | val = defaultValue; 18 | return 19 | end 20 | 21 | % If we get here then prop is in ap 22 | val = ap.(prop); 23 | switch type 24 | case {'char', 'string'} 25 | validator = @(x) ischar(x) || isstring(x); 26 | case {'double', 'numeric'} 27 | validator = @isnumeric; 28 | case {'bool', 'logical'} 29 | validator = @islogical; 30 | otherwise 31 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 32 | 'Not a valid data type'); 33 | end 34 | 35 | % If the property is not empty, verify that it is set to the correct type: 36 | % char, double, or logical. 37 | if ~isempty(val) && ~validator(val) 38 | error('parallelexamples:GenericSLURM:IncorrectArguments', ... 39 | 'Expected property ''%s'' to be of type %s, but it has type %s.', prop, type, class(val)); 40 | end 41 | 42 | end 43 | --------------------------------------------------------------------------------