
Matlab on Caviness

For use on Caviness, MATLAB projects should be developed using a Desktop installation of MATLAB and then copied to Caviness to be run in batch. Here an extended MATLAB example is considered involving one simple MATLAB function and two MATLAB scripts: one to execute this function in a loop, and another to execute it in parallel using the Parallel Computing Toolbox.

Details on how to run these two scripts in batch are given with the resulting output files. There is also a section with UNIX commands you can use to watch your jobs and gather timing and core count numbers. It is important to know how much memory will be needed and how many cores will be used to set your resource requirements. If you do not ask for enough memory your job will fail. If you do not ask for enough cores, the job will take longer.

Even though it is easier to develop on a desktop, MATLAB can be run interactively on Caviness; however, this is not recommended for scripts that are long or computationally intensive. Two interactive jobs are demonstrated. The first shows how to test the function by executing it one time. The second shows an interactive session that starts a MATLAB pool of multiple workers to execute the function in a Parallel Computing Toolbox loop, parfor. The Parallel Computing Toolbox gives a faster time to completion, but consumes more memory and CPU resources.

You can run MATLAB as a Desktop GUI application on Caviness, but that is not recommended as the graphics are slow to display especially with a lower bandwidth network connection.

Many MATLAB research projects fall in the “high throughput computing” category. One run can be done on the desktop, but the goal is to complete hundreds or thousands of independent runs. This greatly increases disk, memory and CPU requirements. Thus we have a final example that gives the recommended workflow to scale your job to multiple nodes: compile the MATLAB code with the single-thread option and deploy the job as a Slurm array job.

The MATLAB distributed computing server (MDCS) is not installed on Caviness. This means jobs run with the Parallel Computing toolbox can only run on one node. This limits both the size of the job and the number of workers you can use. That is why an array job of compiled MATLAB is recommended for large jobs.

Getting Started

Several examples are covered in the following sections. To make them easier to follow, we suggest making a new directory in your home directory ~/ or in your workgroup directory $WORKDIR, and then using cd to change into it. In the new directory you can add the maxEig.m and script.m files. These two files will be used in several of the examples.

[traine@login00 ~]$ mkdir matlab_example
[traine@login00 ~]$ cd matlab_example
Example Directories

As you go through the following examples, it is suggested that you also create a new directory for each of them. This makes it easier to follow and track the output files of the different jobs that you will be running.

Now create the following file in the ~/matlab_example directory.

We will use this sample function on the Caviness cluster in multiple demonstrations.

maxEig.m
function maxe = maxEig(sd,dim)
%  maxEig  maximum real eigenvalue of a normally distributed random matrix
%     Input parameters
%       sd - seed for random generator
%       dim - size of the square matrix
%     Output value
%       maxe - maximum real eigenvalue
  if (isdeployed)
    sd = str2num(sd)
    dim = str2num(dim)
  end
 
  rng(sd);
  ev = eig( randn(dim) );
  maxe = max( ev(imag(ev)==0) )
end

This page uses this MATLAB function to illustrate running MATLAB in batch and interactively. The function will be executed interactively on multiple cores using multiple computational threads, and with 12 workers from a MATLAB pool. A MATLAB script will be run in batch to loop with multiple computational threads, and again with a MATLAB pool.

Finally it will be compiled and deployed using the Matlab Compiler Runtime (MCR) environment.

We want to select only the real eigenvalues when computing the maximum. The matrix is a full matrix with both positive and negative elements, so the eigenvalues will be both real and complex. MATLAB has a function, isreal, but it is useless for selecting the real values in a complex array: it tests the array as a whole and returns false for any array stored as complex. Thus we select the reals by the property that their imaginary part is 0.0. This may be subject to round-off errors, either by selecting complex numbers with very small imaginary parts, or by not selecting some real eigenvalues whose imaginary part is non-zero from rounding.
The last line of this function does not have a semicolon. Thus, the value is displayed with three lines of output for every function call. This is not what you want once you are confident your code is producing good results. To make this function silent, just add a semicolon. To produce more information, packed into one line, you could add the fprintf function:
  maxe = max(ev(imag(ev)==0));
  fprintf('sd=%d counte=%d maxe=%.4f\n', sd, length(ev(imag(ev)==0)), maxe)

First, write a Matlab script file. It should have a comment on the first line describing the purpose of the script and have the quit command on the last line. This script will call the maxEig function 200 times and report the average:

script.m
% script to run maxEig function 200 times and print average.
 
count = 200;
dim = 5001;
sumMaxe = 0;
tic;
for i=1:count;
  sumMaxe = sumMaxe + maxEig(i,dim);
end;
toc
avgMaxEig = sumMaxe/count
 
quit

This is a detailed script example, which calls the maxEig function. This example does no file I/O, all the I/O is to standard out. In Matlab, assignments, not terminated by a semicolon, are displayed on the screen (standard out in batch).

This script ends in a quit command (equivalent to MATLAB exit). This is meant to be a complete script, which terminates MATLAB when done. If you run this from the bash command line with the -r script option, it will come back with a bash prompt when completed. If this is run from a batch job, then you can do other commands in your batch script after the MATLAB script completes.

Without the quit you will come back to the MATLAB prompt on completion for an interactive job. If this is the last line of a batch queue script, then the only difference will be the MATLAB prompt >> at the very end of the output file. MATLAB treats the end of a batch script file the same as exiting the window, which is the preferred way to exit the MATLAB GUI.

Copy the project folder to a directory on the cluster. Use any file transfer client to copy your entire project directory.

Batch Job

You should have a copy of your MATLAB project directory on the cluster.

Versions of MATLAB

MATLAB has a new version twice a year. It is important to keep the version you use on your desktop the same as the one on the cluster. The command

vpkg_versions matlab

will show you the versions available on a cluster. Choose the one that matches the version on your desktop. We recommend you do not upgrade MATLAB in the middle of a project, unless there is a new feature or bug fix you need.

Two directories

It is frequently advisable to keep your MATLAB project clean from non-MATLAB files such as the queue script file and the script output file. But you may combine them, and even use the MATLAB editor to create the script file and look at the output file. If you create the file on a PC, take care to not transfer the files as binary. See Transfer Files for the appropriate cluster.

When you have one combined directory, do not put the cd command in the queue script; instead, change to the project directory using cd on the command line, before submitting your job.

You should create a job script file to submit a batch job. Start by modifying a job template file: for example, to submit a serial job on one core of a compute node, copy the serial template (/opt/templates/slurm/generic/serial.qs). In your newly copied file, named matlab_first.qs here, add the lines below.

 
[traine@login00 matlab_example]$ cp /opt/templates/slurm/generic/serial.qs matlab_first.qs
# Add vpkg_require commands after this line:
vpkg_require matlab
#Running the Matlab main_script
matlab -nodisplay -singleCompThread -r main_script

Now make a new file called main_script.m and add the line below to it.

display 'Hello World'

When this script runs it will display Hello World .

Your shell must be in a workgroup environment to submit any jobs. Use the sbatch command to submit a batch job and note the «JOBID» that is assigned to your job. For example, if your queue script file name is matlab_first.qs, submit the job with:

sbatch matlab_first.qs
WARNING: Please choose a workgroup before submitting jobs

This is the message you get if you are not in the appropriate workgroup.

   sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Bash script vs queue script

It is true that a queue script file is (usually) a bash script, but it must be submitted with the sbatch command instead of being run with the sh command. This way the Slurm directives will be processed, and the job will be run on a compute node.

You can check on the status of your job with the scontrol show job command. For example, to list the information for job «JOBID», type

scontrol show job <<JOBID>>

For long running jobs, you could change your queue script to notify you via an e-mail message when the job is complete.

Post process job

All MATLAB output data files will be in the project directory, but the MATLAB standard output will be in the current directory, from which you submitted the job. Look for a file ending in your assigned «JOBID».
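As a minimal sketch of locating that file, the helper below lists the files in a directory whose names contain a given job ID. The function name find_job_output is hypothetical, and the slurm-JOBID.out name mentioned in the note afterwards is Slurm's default when no --output option is set; your job template may choose a different pattern.

```shell
#!/bin/bash
# find_job_output JOBID [DIR]: print files in DIR (default: current directory)
# whose names contain JOBID. Hypothetical helper, not part of Slurm.
find_job_output() {
  local jobid=$1 dir=${2:-.}
  local f
  for f in "$dir"/*"$jobid"*; do
    # An unmatched glob stays literal, so check that the file really exists.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}
```

For example, find_job_output 12345 run in the directory you submitted from would list slurm-12345.out if the job used Slurm's default output name.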

Interactive job

Here are specific details for running MATLAB as an interactive job on a compute node. You should have a copy of your MATLAB project directory on the cluster; it is referred to as project_directory in the examples below.

You should work on a compute node when in command-line MATLAB. Your shell must be in a workgroup environment to submit a single threaded interactive job using salloc.

[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ salloc --partition=_workgroup_
salloc: Pending job allocation 7809686
salloc: job 7809686 queued and waiting for resources
salloc: job 7809686 has been allocated resources
salloc: Granted job allocation 7809686
salloc: Waiting for resource configuration
salloc: Nodes r00n16 are ready for job
[traine@r00n16 ~]$ vpkg_require matlab
Adding package `matlab/r2018b` to your environment
[traine@r00n16 ~]$ cd matlab_example
[traine@r00n16 matlab_example]$ matlab -nodesktop -singleCompThread
MATLAB is selecting SOFTWARE OPENGL rendering.

                                                   < M A T L A B (R) >
                                         Copyright 1984-2018 The MathWorks, Inc.
                                          R2018b (9.5.0.944444) 64-bit (glnxa64)
                                                    August 28, 2018


To get started, type doc.
For product information, visit www.mathworks.com.
>>

This will start an interactive command-line session in your terminal window. When done, type quit or exit to terminate the MATLAB session and then exit to terminate the salloc session.

MATLAB is selecting SOFTWARE OPENGL rendering.

                                                   < M A T L A B (R) >
                                          Copyright 1984-2018 The MathWorks, Inc.
                                           R2018b (9.5.0.944444) 64-bit (glnxa64)
                                                     August 28, 2018


To get started, type doc.
For product information, visit www.mathworks.com.
>>quit
[traine@r00n16 matlab_example]$ exit
exit
salloc: Relinquishing job allocation 7809686
[(it_css:traine)@login00 ~]$

You should be on a compute node before you start MATLAB. To start a MATLAB desktop (GUI mode) on a cluster, you must be running an X11 server and you must have connected using X11 tunneling.

Your shell must be in a workgroup environment to submit a job using salloc.

[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ salloc --x11 -N1 -n1 --partition=_workgroup_
salloc: Pending job allocation 7790913
salloc: job 7790913 queued and waiting for resources
salloc: job 7790913 has been allocated resources
salloc: Granted job allocation 7790913
salloc: Waiting for resource configuration
salloc: Nodes r00n10 are ready for job
[traine@r00n10 ~]$ vpkg_require matlab
Adding package `matlab/r2018b` to your environment
[traine@r00n10 ~]$ matlab
MATLAB is selecting SOFTWARE OPENGL rendering.

This will start an interactive DESKTOP session on your X11 screen.

When done, type quit or exit in the command window or just close the window. When back at the terminal bash prompt, type exit to terminate the salloc session.

See tips on starting Matlab in an interactive session without the desktop, including executing a script.

For more information on setting up X11 over an SSH connection, please visit http://www1.udel.edu/it/research/training/config_laptop/. Directions are provided there on how to set up X11 SSH connections for Windows, Mac, and Linux.

For more information on launching GUI Applications on Caviness visit Launching GUI Applications (X11 Forwarding)

Compiling with MATLAB

We show the three most common ways to work with compilers when using MATLAB.

  1. Compiling your Matlab code to run in the MCR (Matlab Compiler Runtime)
  2. Compiling your C or Fortran program to call MATLAB engine.
  3. Compiling your own function in C or Fortran to be used in a MATLAB session.
Make sure your compiler is at least as new as the one required by your MATLAB version. In these examples MATLAB requires gcc 4.7 or newer. You may get the warning:
Warning: You are using gcc version '4.9.3'. The version currently supported 
with MEX is '4.7.x'. For a list of currently supported compilers see: 
http://www.mathworks.com/support/compilers/current_release.

But the compilation completes successfully.

There is an example MCR project in the /opt/templates/ directory for you to copy and try. Copy it on the head node, then use salloc to compile with MATLAB on the devel partition, or compile on the head node if you need more than 2 hours (i.e., the maximum request time on the devel partition is 2 hours). Once your program is compiled, you can run it interactively or in batch, without needing a MATLAB license.

On the head node, copy the example project into your current directory using the following commands

[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ cp -r /opt/templates/dev-projects/Projects/MCR .
[(it_css:traine)@login00 matlab_example]$ cd MCR

Now compile on the compute node by using

[(it_css:traine)@login00 MCR]$ salloc --partition=devel
salloc: Granted job allocation 7861739
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[traine@r00n56 MCR]$
Remember you must be in a workgroup before using salloc. The prompt ([(it_css:traine)@login00 MCR]$) displays the workgroup (it_css in this example).

Resulting output from the make command:

[traine@r00n56 MCR]$ make
Adding package `mcr/2018b:nojvm` to your environment
make[1]: Entering directory `/home/1201/MCR'
mcc -o maxEig -I ./common -R "-nojvm,-nodesktop,-singleCompThread" -mv maxEig.m
Compiler version: 7.0 (R2018b)
Dependency analysis by REQUIREMENTS.
Parsing file "/home/1201/MCR/maxEig.m"
        (Referenced from: "Compiler Command Line").
Deleting 0 temporary MEX authorization files.
Generating file "/home/1201/MCR/readme.txt".
Generating file "run_maxEig.sh".
make[1]: Leaving directory `/home/1201/MCR'

Take note of the package added and the files that are generated. You can remove the generated readme.txt and run_maxEig.sh files, as they are not needed. In your batch script, or when testing interactively, you must use VALET to add the version of the mcr package that matches the version of Matlab used to compile.

To test interactively on the same compute node.

[traine@r00n56 MCR]$ vpkg_require mcr/2018b:nojvm
[traine@r00n56 MCR]$ time ./maxEig 20.8

maxe =

   5.0101e+03


real    6m58.608s
user    6m38.486s
sys     0m6.114s
This example is designed as a test for batch computing, and takes 5 to 15 minutes to complete. If you change the MATLAB statement dim=10000 to dim=1000 and recompile, it will take about 10 seconds.

Back to the head node

When done, exit the compute node.

[traine@r00n56 MCR]$ exit
exit
salloc: Relinquishing job allocation 7862447
[(it_css:traine)@login00 MCR]$

Copy array job example

Copy the matlab-mcr.qs template file to your current directory using the following command

[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ cp -r /opt/templates/dev-projects/Projects/MCR MCR_array
[(it_css:traine)@login00 matlab_example]$ cd MCR_array
[(it_css:traine)@login00 MCR_array]$ cp /opt/templates/slurm/applications/matlab-mcr.qs .

The lines below need to be changed or added in the matlab-mcr.qs file. The number preceding each line is the line number in the file where the alteration is needed.

...
29 #SBATCH --mem=3G
...
48 #SBATCH --job-name=matlab_mcr_arrray
...
58 #SBATCH --partition=_workgroup_
...
86 #SBATCH --output arrayJob-%A-%3a.out
...
102 # Setting the job array options
103 #SBATCH --array=1-100:1
...
159 echo "Job Running on Host: $HOSTNAME"
159
160 start=$(date "+%s")
161 echo "Job Start: ${start}"
162
163 #Getting the task ID that will be passed as an argument
164 let lambda=$SLURM_ARRAY_TASK_ID
165
166 # Execute your MCR program(s) here; prefix with UD_EXEC to
167 # ensure the job can/will respond to preemption/termination
168 # signals by calling your UD_JOB_EXIT_FN.
169 #
170 # Duplicate all three commands for each MCR program you run
171 # in sequence below.
172 #
173 #UD_EXEC my_mcr_program arg1 arg2
174 #mcr_rc=$?
175 #if [ $mcr_rc -ne 0 ]; then exit $mcr_rc; fi
176
177 #Lines added for MCR_array example
178 UD_EXEC ${HOME}/matlab_example/MCR_array/maxEig $lambda
179 mcr_rc=$?
180 if [ $mcr_rc -ne 0 ]; then exit $mcr_rc; fi
181
182 finish=$(date "+%s")
183 echo "Job Finish: ${finish}"
184
185 runtime=$(($finish-$start))
186
187 echo "Total Runtime: ${runtime}"
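
The link between the array index, the program argument, and the output file name can be sketched as follows. This is only an illustration under stated assumptions: the helper array_outfile is hypothetical, the fallback job and task IDs are sample values for running outside Slurm, and the name pattern is the one given by the --output arrayJob-%A-%3a.out line above, where %A is the array job ID and %3a is the task ID zero-padded to three digits.

```shell
#!/bin/bash
# array_outfile JOBID TASKID: reproduce the file name Slurm builds from
# "--output arrayJob-%A-%3a.out". Hypothetical helper for illustration.
array_outfile() {
  printf 'arrayJob-%s-%03d.out\n' "$1" "$2"
}

# Outside Slurm these variables are unset, so fall back to sample values.
job_id=${SLURM_ARRAY_JOB_ID:-9803575}
task_id=${SLURM_ARRAY_TASK_ID:-50}

lambda=$task_id   # the single argument handed to the compiled program
echo "task ${task_id}: would run maxEig ${lambda}, output in $(array_outfile "$job_id" "$task_id")"
```

Each of the 100 tasks requested with --array=1-100:1 runs the same script with a different SLURM_ARRAY_TASK_ID, so every task receives its own argument and writes to its own file.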

Example sbatch Submission

[(it_css:traine)@login00 MCR]$  sbatch matlab-mcr.qs
Submitted batch job 9803575
[(it_css:traine)@login00 MCR]$ date
Fri Oct  30 08:57:58 EDT 2020
[(it_css:traine)@login00 MCR]$ date
Fri Oct  30 08:58:19 EDT 2020
[(it_css:traine)@traine MCR]$ ls -l MCR_array-9803575* | wc -l
100

There are 100 output files with the names MCR_array-9803575-001.out to MCR_array-9803575-100.out. For example, file 50, which is MCR_array-9803575-050.out, looks like this:

Adding package `mcr/2019b` to your environment
-- Matlab MCR environment setup complete (on r00n13):
--  MCR_ROOT             = /opt/shared/matlab/r2019b
--  MCR_CACHE_ROOT       = /tmp/job_9803625

Job Running on Host: r00n13.localdomain.hpc.udel.edu
Job Start: 1604062673

maxe =

  525.9320

Job Finish: 1604062704
Total Runtime: 31
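
The runtime line is simply the difference of the two epoch-second timestamps. A minimal sketch of that arithmetic, using the values from the sample output above, is shown here; the elapsed_hms helper is a hypothetical addition for printing the duration as HH:MM:SS.

```shell
#!/bin/bash
# Epoch-second values copied from the sample output file above.
start=1604062673
finish=1604062704

runtime=$(( finish - start ))
echo "Total Runtime: ${runtime} seconds"

# elapsed_hms SECONDS: format a duration as HH:MM:SS (hypothetical helper).
elapsed_hms() {
  printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}
elapsed_hms "$runtime"
```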

more examples Under construction: Stay tuned

There is a simple example function, fengdemo.F, coded in Fortran, which you can copy and use as a starting point.

On the head node and in a workgroup shell:

[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ mkdir matlab_compile
[(it_css:traine)@login00 matlab_example]$ cd matlab_compile
[(it_css:traine)@login00 matlab_compile]$ vpkg_require matlab/r2019a gcc/9.1
[(it_css:traine)@login00 matlab_compile]$ cp $MATLABROOT/extern/examples/eng_mat/fengdemo.F  .
[(it_css:traine)@login00 matlab_compile]$ export LD_LIBRARY_PATH=$MATLABROOT/bin/glnxa64:$MATLABROOT/sys/os/glnxa64:$LD_LIBRARY_PATH
[(it_css:traine)@login00 matlab_compile]$ mex -client engine fengdemo.F
Warning: MATLAB FORTRAN MEX Files are now defaulting to -largeArrayDims and 8 byte integers.
         If you are building a FORTRAN S-Function, please recompile using the -compatibleArrayDims flag.
         You can find more about adapting code to use 64-bit array dimensions at:
         https://www.mathworks.com/help/matlab/matlab_external/upgrading-mex-files-to-use-64-bit-api.html.
Building with 'gfortran'.
MEX completed successfully.
[(it_css:traine)@login00 matlab_compile]$

Running this program requires an interactive session on a compute node with X11 forwarding enabled.

[(it_css:traine)@login00 matlab_compile]$ salloc --x11 -N1 -n1 --partition=_workgroup_
salloc: Granted job allocation 7915683
salloc: Waiting for resource configuration
salloc: Nodes r03g07 are ready for job
[traine@r03g07 matlab_compile]$ vpkg_require matlab/r2019a gcc/9.1
Adding package `matlab/r2019a` to your environment
Adding package `gcc/9.1.0` to your environment
[traine@r03g07 matlab_compile]$ export LD_LIBRARY_PATH=$MATLABROOT/bin/glnxa64:$MATLABROOT/sys/os/glnxa64:$LD_LIBRARY_PATH
[traine@r03g07 matlab_compile]$ ./fengdemo

Shortly after the program starts to run, a Matlab window will open and display a chart.

After the Matlab window opens, you will see a prompt in the terminal to “Exit” or “Continue”. Typing “1” and pressing “Enter” will return a table, which is shown below.

After the table is returned, close the Matlab window with the chart. Then use the exit command to release the compute node.

 Type 0 <return> to Exit
 Type 1 <return> to continue
1
 MATLAB computed the following distances:
   time(s)  distance(m)
   1.00     -4.90
   2.00     -19.6
   3.00     -44.1
   4.00     -78.4
   5.00     -123.
   6.00     -176.
   7.00     -240.
   8.00     -314.
   9.00     -397.
   10.0     -490.
[traine@r03g07 matlab_compile]$ exit
salloc: Relinquishing job allocation 7915683
[(it_css:traine)@login00 matlab_compile]$

There is a simple example function, timestwo.c, coded in C, which you can copy and use as a starting point.

On the head node and in a workgroup shell:

[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ mkdir matlab_function
[(it_css:traine)@login00 matlab_example]$ cd matlab_function
[(it_css:traine)@login00 matlab_function]$ vpkg_require matlab/r2019a gcc/9.1
Adding package `matlab/r2019a` to your environment
Adding package `gcc/9.1.0` to your environment
[(it_css:traine)@login00 matlab_function]$ cp $MATLABROOT/extern/examples/refbook/timestwo.c .
[(it_css:traine)@login00 matlab_function]$ mex timestwo.c
Building with 'gcc'.
Warning: You are using gcc version '9.1.0'. The version of gcc is not supported. The version currently supported with MEX is '6.3.x'. For a list of currently supported compilers see: https://www.mathworks.com/support/compilers/current_release.
MEX completed successfully.
[(it_css:traine)@login00 matlab_function]$

To start MATLAB on a compute node to test this new function:

[(it_css:traine)@login00 matlab_function]$ salloc --partition=devel
salloc: Pending job allocation 7916296
salloc: job 7916296 queued and waiting for resources
salloc: job 7916296 has been allocated resources
salloc: Granted job allocation 7916296
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[traine@r00n56 matlab_function]$ vpkg_require matlab/r2019a gcc/9.1
[traine@r00n56 matlab_function]$ matlab -nodesktop
MATLAB is selecting SOFTWARE OPENGL rendering.

                                             < M A T L A B (R) >
                                    Copyright 1984-2019 The MathWorks, Inc.
                                    R2019a (9.6.0.1072779) 64-bit (glnxa64)
                                                March 8, 2019

To get started, type doc.
For product information, visit www.mathworks.com.

>>

Now test the function by running timestwo(4). The results are shown below. Afterwards use quit to exit Matlab and exit to release the compute node.

>> timestwo(4)

ans =

     8

>> quit
[traine@r00n56 matlab_function]$ exit
exit
salloc: Relinquishing job allocation 7916296
[(it_css:traine)@login00 matlab_function]$

Batch job serial example

Second, write a shell script file to set the Matlab environment and start Matlab running your script file. The following script file will set the Matlab environment and run the command in the script.m file:

script.m calls the maxEig function in the maxEig.m file. Make sure both files are in the directory.
[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ mkdir matlab_slurm
[(it_css:traine)@login00 matlab_example]$ cd matlab_slurm
[(it_css:traine)@login00 matlab_slurm]$ cp /opt/templates/slurm/generic/serial.qs batch.qs
[(it_css:traine)@login00 matlab_slurm]$ vim batch.qs
batch.qs
...
40 #SBATCH --job-name=script.m
...
50 #SBATCH --partition=_workgroup_
...
67 #SBATCH --time=0-03:00:00
...
76 #SBATCH --output %x-%j.out
77 #SBATCH --error %x-%j.out
...
86 #SBATCH --mail-user='traine@udel.edu'
87 #SBATCH --mail-type=END,FAIL,TIME_LIMIT_90
...
94 #SBATCH --exclusive
...
137 #
138 # [EDIT] Add your script statements hereafter, or execute a script or program
139 #        using the srun command.
140 #
141 #srun date
142 #Loading MATLAB
143 vpkg_require matlab/r2018b
144 #Running the matlab script
145 matlab -nodisplay -nojvm -r script

The -nodisplay indicates no X11 graphics, which implies -nosplash -nodesktop. The -nojvm indicates no Java. (Java is needed for some functions, e.g., print graphics, but should be excluded for most computational jobs.) The -r is followed by a Matlab command, enclosed in quotes when there are spaces in the command.

Exclusive access to node: The #SBATCH --exclusive option tells the scheduler to wait until your job can get exclusive access to the node. Since your job is the only job on the node, it can use all the memory and all the cores. Matlab assumes you want to use the full node to run as fast as possible. The goal is to reduce real time (wall clock time), not CPU time. When you use exclusive you should monitor the job to see the average core count and the maximum memory usage. With hindsight, this job should have used:
#SBATCH --ntasks=5
#SBATCH --mem=1G

If everyone in your group carefully sets these values, multiple jobs can run concurrently on the node. You can also use --exclusive=user to allow only your own jobs to run on a node; this way you could potentially run more than one job (Matlab or another application) at the same time on a given node if you know you have accurately specified the resources for each job.

See Setting maximum number of computational threads

Errors in the Matlab script: The command script will execute the lines in the script.m file. For some errors Matlab will display the error message and wait for a response – clearly not appropriate for a batch job. Consider replacing script with the compound command
"try; script; catch ERR; disp(getReport(ERR,'extended')); quit; end"

The purpose of the try/catch block is to catch the first error in the script, and display a report. With the extended option the report will include a stack trace at the point of the error.
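
As a hedged sketch of how the compound command could be passed on the matlab command line in batch.qs, the snippet below keeps it in a shell variable (the name matlab_cmd is an assumption for illustration) and prints the resulting command line rather than running it, so the quoting can be checked on a machine without MATLAB.

```shell
#!/bin/bash
# Compound MATLAB command from the text, kept in one shell variable so the
# single quotes around 'extended' pass through the shell untouched.
matlab_cmd="try; script; catch ERR; disp(getReport(ERR,'extended')); quit; end"

# Print the command line instead of running it.
printf 'matlab -nodisplay -nojvm -r "%s"\n' "$matlab_cmd"

# In batch.qs the actual call would be:
#   matlab -nodisplay -nojvm -r "$matlab_cmd"
```

When script.m runs without error, its own final quit ends MATLAB; when an error occurs, the catch branch prints the report and then quits, so the batch job does not sit waiting at a prompt.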

Graphics in the Matlab script
  • Do not include the -nojvm on the matlab command.
  • Do set paper dimensions and print each figure to a file.

The text output will be included in the standard Slurm output file, but not any graphics. All figures must be exported using the print command. Normally the print command will print on an 8 1/2 by 11 inch page with margins that are for a printed page of paper. The size and margins will not work if you plan to include the figure in a paper or a web page.

We suggest setting the current figure's PaperUnits, PaperSize and PaperPosition. Matlab provides a handle to the current figure (gcf). For example, the commands

  set(gcf,'PaperUnits','inches','PaperSize',[4,3],'PaperPosition',[0 0 4 3]);
  print('-dpng','-r100','maxe.png');

will set the current figure to be 4 x 3 inches with no margins, and then print the figure as a 400×300 resolution png file.
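
The pixel size follows from multiplying the paper size in inches by the resolution given with -r. A small sketch of that arithmetic is below; the helper name png_pixels is hypothetical, and it handles whole-inch sizes only because shell arithmetic is integer.

```shell
#!/bin/bash
# png_pixels WIDTH_IN HEIGHT_IN DPI: pixel dimensions of the exported image.
# Hypothetical helper; integer inches and dots per inch only.
png_pixels() {
  echo "$(( $1 * $3 ))x$(( $2 * $3 ))"
}

png_pixels 4 3 100   # the 4 x 3 inch figure printed with -r100
```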

Third, from the directory with script.m, maxEig.m and batch.qs, submit the batch job with the command:

sbatch batch.qs
You should request the required Matlab licenses from the scheduler as a resource, especially if there is a limited number of license seats available for particular toolboxes.

In this example you will only need a license for the base Matlab, and the parallel toolbox needs one license. We are using the default local scheduler which will give you workers on the same node with one license.

Toolbox dependencies: You should also include toolbox dependencies in your batch script to help avoid a failure, which will occur if the job starts with no licenses available.

For example, the Bioinformatics toolbox only has one seat, and in addition it requires the Statistics and Machine Learning toolbox, as well as the core MATLAB. So you would add the line:

#$ -l MLM.MATLAB=1,MLM.Statistics_Toolbox=1,MLM.Bioinformatics_Toolbox=1

to your job script.

Finally, wait for the mail notification, which will be sent to traine@udel.edu. When the job is done, the output from the Matlab command will be in a file named with the pattern script.m-JOBID.out, where JOBID is the number assigned to your job.

After waiting for about 2 or 3 hours, a message was received from the SLURM Administrator. The email will have a subject like the one shown below, and there will be no content in the body.

SLURM Job_id=7937771 Name=script.m Ended, Run time 02:47:11, COMPLETED, ExitCode 0

The results for job 7937771 are in the file

script.m-7937771.out
Fri Apr 10 16:36:38 EDT 2020
Adding package `matlab/r2018b` to your environment
 
                            < M A T L A B (R) >
                  Copyright 1984-2018 The MathWorks, Inc.
                   R2018b (9.5.0.944444) 64-bit (glnxa64)
                              August 28, 2018
 
 
For online documentation, see https://www.mathworks.com/support
For product information, visit www.mathworks.com.
 
 
maxe =
 
   70.0220
 
 
maxe =
 
   71.7546
 
 
maxe =
 
   70.8331
 
 
maxe =
 
   70.5714
 
 
maxe =
 
   69.4923
 
 
maxe =
 
   67.7814
 
 
maxe =
 
   70.5037
 
 
maxe =
 
   68.3293
 
 
maxe =
 
   69.5694
 
       ...  //Skipping 953 similar displays of variable maxe//
 
maxe =
 
    67.4221
Elapsed time is 10023.165546 seconds.
 
avgMaxEig =
 
    69.5131

Consider a batch job run with these Slurm options:

     #SBATCH --ntasks=5
     #SBATCH --mem=1G
     #SBATCH --job-name=script_opt.m

The sbatch command will give you the job ID, and once the job starts running, the squeue command will give you the node you are running on; here it is saved in the shell variable n (n=r01n17). After about 10 minutes of running:

[(it_css:traine)@login00 matlab_slurm]$ ssh $n ps -eo pid,ruser,pcpu,pmem,thcount,stime,time,command | egrep '(COMMAND|matlab)'
  PID RUSER    %CPU %MEM THCNT STIME     TIME COMMAND
10853 traine    100  0.6    12 13:56 00:19:53 /opt/shared/matlab/r2018b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm

This ps command will give the percent CPU, which can exceed 100% for multi-core jobs; the percent memory; the thread count, which is greater than 5; the start time; the execution time; and finally the full command used to start the job.

Given the reported PID, 10853, you can drill down and see which of the 12 threads are consuming CPU time:

[(it_css:traine)@login00 matlab_slurm]$ ssh $n ps -eLf | egrep '(PID|10853)' | grep -v ' 0  '
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
traine   10853 10778 10906 99   12 13:56 ?        00:27:05 /opt/shared/matlab/r2018b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm

While the batch job was running on node r01n17, the top command was run in batch mode for a single iteration (-b -n 1) to sample the resources being used by Matlab. The -H option was used to display each individual thread, rather than a summary of all threads in a process.

[(it_css:traine)@login00 matlab_slurm]$ ssh $n top -H -b -n 1 | egrep '(COMMAND|MATLAB)' | grep -v 'S  0'
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
10906 traine    20   0 2450792 859664 105120 R 99.9  0.7  28:47.25 MATLAB     
[(it_css:traine)@login00 matlab_slurm]$ qhost -h $n
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR NLOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
r01n17                  E5-2695v4      36    2   36   36  0.03  124.0G 1024.0M     0.0     0.0

After the job is done you can use sacct to get a recap of resources used:

[(it_css:traine)@login00 matlab_slurm]$ sacct -N r01n17 -j 7935464 -o jobName,jobID,Nodelist,maxVMSize,MaxRSS,CPUTime,Start,End,Elapsed,State
   JobName        JobID        NodeList  MaxVMSize     MaxRSS    CPUTime               Start                 End    Elapsed      State
---------- ------------ --------------- ---------- ---------- ---------- ------------------- ------------------- ---------- ----------
script_op+ 7935464               r01n17                         14:06:35 2020-04-10T13:56:18 2020-04-10T16:45:37   02:49:19  COMPLETED
     batch 7935464.bat+          r01n17    209328K    755008K   14:06:35 2020-04-10T13:56:18 2020-04-10T16:45:37   02:49:19  COMPLETED
    extern 7935464.ext+          r01n17    182456K        20K   14:06:40 2020-04-10T13:56:18 2020-04-10T16:45:38   02:49:20  COMPLETED
      date 7935464.0             r01n17    277996K       488K   00:00:05 2020-04-10T13:56:21 2020-04-10T13:56:22   00:00:01  COMPLETED
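
As a rough cross-check on the sacct recap, the CPUTime column is the elapsed time multiplied by the number of allocated CPUs, so dividing the two recovers the CPU count charged to the job. The sketch below assumes times in Slurm's [D-]HH:MM:SS form; the function names to_seconds and alloc_cpus are illustrative only.

```bash
#!/bin/bash
# Convert a Slurm time string ([D-]HH:MM:SS) to seconds.
to_seconds() {
  t=$1
  days=0
  case $t in
    *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the optional day count
  esac
  # awk does the arithmetic in base 10, so fields such as "08" are safe
  echo "$t" | awk -F: -v d="$days" '{ print d * 86400 + $1 * 3600 + $2 * 60 + $3 }'
}

# Allocated CPUs implied by CPUTime / Elapsed (integer division).
alloc_cpus() {
  cpu=$(to_seconds "$1")
  wall=$(to_seconds "$2")
  echo $(( cpu / wall ))
}

# CPUTime and Elapsed of job 7935464 from the sacct output above
alloc_cpus 14:06:35 02:49:19
```

For the job above this prints 5, which agrees with the --ntasks=5 request.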

Batch job parallel example

The Matlab parallel toolbox uses the JVM to manage the workers and communicate while you are running. You need to set up the Matlab pools in your script.

Here is the slightly modified MATLAB script.

Add the necessary commands to configure your parcluster and parpool, and change for to parfor.

pscript.m
% script to run maxEig function 200 times
%% Configure parpool
myCluster = parcluster('local');
myCluster.NumWorkers = str2double(getenv('SLURM_CPUS_PER_TASK'));
myCluster.JobStorageLocation = getenv('TMPDIR');
myPool = parpool(myCluster, myCluster.NumWorkers);
 
count = 200;
dim = 5001;
sumMaxe = 0;
 
tic
parfor i=1:count;
  sumMaxe = sumMaxe + maxEig(i,dim);
end
toc
avgMaxEig = sumMaxe/count
 
delete(myPool);
exit

Take out the -nojvm option from the matlab command, because the JVM is needed for the parpool. The script also requires the distributed computing toolbox. Copy the template job script and name it pbatch.qs:

cp /opt/templates/slurm/generic/serial.qs ./pbatch.qs

Make the following changes to pbatch.qs:

pbatch.qs
...
21 #SBATCH --cpus-per-task=20
...
31 #SBATCH --mem-per-cpu=4G
...
48 #SBATCH --job-name=pscript
...
59 #SBATCH --partition=_workgroup_
...
69 #SBATCH --time=0-01:00:00
...
76 #SBATCH --time-min=0-00:30:00
...
84 #SBATCH --output=%x-%j.out
85 #SBATCH --error=%x-%j.out
...
154 #srun /opt/shared/slurm/templates/share/threads.sh
155 vpkg_require matlab/r2019b
156 matlab -nodisplay -r pscript
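
One caveat with the parpool setup in pscript.m: Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is given, and str2double returns NaN for an empty string. A guard like the following in the job script is one way to catch that early. It is only a sketch; the function name pool_size and the fallback of 1 worker are assumptions, not part of the template.

```bash
#!/bin/bash
# Print the worker count MATLAB should use: the value of
# SLURM_CPUS_PER_TASK when it is a positive integer, otherwise 1.
pool_size() {
  case ${SLURM_CPUS_PER_TASK:-} in
    ''|*[!0-9]*) echo 1 ;;              # unset, empty, or not a number
    0)           echo 1 ;;              # zero workers makes no sense
    *)           echo "$SLURM_CPUS_PER_TASK" ;;
  esac
}

echo "MATLAB pool size: $(pool_size)"
```

With the pbatch.qs settings above the variable would be 20, so the guard changes nothing; it only matters when the option is forgotten.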

Reported usage for the same job run with the loop (script.m) and with the parallel toolbox (pscript):

[(it_css:traine)@login00 matlab_slurm]$ sacct  -j 7994763,7997035 -o jobName,jobID,Nodelist,maxVMSize,MaxRSS,CPUTime,Start,End,Elapsed,State
   JobName        JobID        NodeList  MaxVMSize     MaxRSS    CPUTime               Start                 End    Elapsed      State
---------- ------------ --------------- ---------- ---------- ---------- ------------------- ------------------- ---------- ----------
  script.m 7994763               r01n49                       4-05:01:12 2020-04-15T11:16:06 2020-04-15T14:04:28   02:48:22  COMPLETED
     batch 7994763.bat+          r01n49    209328K    804888K 4-05:01:12 2020-04-15T11:16:06 2020-04-15T14:04:28   02:48:22  COMPLETED
    extern 7994763.ext+          r01n49    107904K        20K 4-05:01:12 2020-04-15T11:16:06 2020-04-15T14:04:28   02:48:22  COMPLETED
   pscript 7997035               r03n38                         09:33:20 2020-04-15T15:56:42 2020-04-15T16:11:02   00:14:20  COMPLETED
     batch 7997035.bat+          r03n38    209460K  14462792K   09:33:20 2020-04-15T15:56:42 2020-04-15T16:11:02   00:14:20  COMPLETED
    extern 7997035.ext+          r03n38    107952K          0   09:33:20 2020-04-15T15:56:42 2020-04-15T16:11:02   00:14:20  COMPLETED

Compare script vs pscript

Job       Elapsed Time  CPUTime     Max RSS
script.m  02:48:22      4-05:01:12  804888K
pscript   00:14:20      09:33:20    14462792K

The job script.m used more CPU resources with its multiple computational threads, while pscript used more memory resources with 20 single-threaded workers.
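
The comparison can be reduced to two numbers: the speedup and the memory per worker. The sketch below uses the elapsed times reported by sacct above (02:48:22 = 10102 s and 00:14:20 = 860 s) and assumes that the K suffix on MaxRSS means KiB; compare_runs is a hypothetical helper name.

```bash
#!/bin/bash
# Arguments: serial elapsed (s), parallel elapsed (s),
#            parallel MaxRSS (KiB, assumed), number of workers.
compare_runs() {
  awk -v s="$1" -v p="$2" -v rss="$3" -v w="$4" 'BEGIN {
    printf "speedup %.1f\n", s / p               # wall-clock speedup
    printf "MiB_per_worker %.0f\n", rss / w / 1024
  }'
}

compare_runs 10102 860 14462792 20
```

Dividing MaxRSS evenly among the workers is only an approximation, since the MATLAB client process is included in the total.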

Interactive job example

These are the basic steps for running MATLAB interactively on a compute node that will dedicate all of its resources exclusively to your job.

Create a directory and add maxEig.m and script.m to it.

[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ mkdir matlab_interact

[(it_css:traine)@login00 matlab_example]$ cp maxEig.m script.m matlab_interact/
[(it_css:traine)@login00 matlab_example]$ cd matlab_interact
[(it_css:traine)@login00 matlab_interact]$ ls
maxEig.m  script.m

Start an interactive session on a compute node with the salloc command. You will also want to include the options --exclusive and --partition=_workgroup_.

[(it_css:traine)@login00 matlab_interact]$ salloc --exclusive --partition=_workgroup_
salloc: Pending job allocation 7985695
salloc: job 7985695 queued and waiting for resources
salloc: job 7985695 has been allocated resources
salloc: Granted job allocation 7985695
salloc: Waiting for resource configuration
salloc: Nodes r01n10 are ready for job
[traine@r01n10 matlab_interact]$
[traine@r01n10 matlab_interact]$ vpkg_require matlab/r2019b
Adding package `matlab/r2019b` to your environment
[traine@r01n10 matlab_interact]$
[traine@r01n10 matlab_interact]$ matlab -nodesktop -nosplash
MATLAB is selecting SOFTWARE OPENGL rendering.

                                       < M A T L A B (R) >
                             Copyright 1984-2019 The MathWorks, Inc.
                             R2019b (9.7.0.1190202) 64-bit (glnxa64)
                                         August 21, 2019


To get started, type doc.
For product information, visit www.mathworks.com.

>>
>> help maxEig
   maxEig   Maximum Eigenvalue of a random matrix
      Input parameters
        sd - seed for uniform random generator
        dim - size of the square matrix (should be odd)
      Output value
        maxe - maximum real eigvalue
        

Use the tic and toc commands to report the elapsed time to generate the random matrix, find all eigenvalues and report the maximum real eigenvalue.

>> tic; maxEig(1,5001); toc

maxe =

   70.0220

Elapsed time is 54.781289 seconds.
>> exit
[traine@r01n10 matlab_interact]$ exit
exit
salloc: Relinquishing job allocation 7985695
[(it_css:traine)@login00 matlab_interact]$

This example is based on the matlab_interact directory that was created in the Interactive job example demo shown above.

When you are using the parallel toolbox, you should log on to a compute node with the --exclusive option and on a workgroup partition:

[(it_css:traine)@login00 matlab_interact]$ salloc --exclusive --partition=_workgroup_ --cpus-per-task=20 --mem-per-cpu=2G --mem=40G
salloc: Pending job allocation 7993736
salloc: job 7993736 queued and waiting for resources
salloc: job 7993736 has been allocated resources
salloc: Granted job allocation 7993736
salloc: Waiting for resource configuration
salloc: Nodes r00g01 are ready for job
[traine@r00g01 matlab_interact]$ vpkg_require matlab/r2019b
[traine@r00g01 matlab_interact]$ matlab -nodesktop -nosplash

This will effectively reserve the entire node for your MATLAB workers. The default number of parallel workers is 12, but you can ask for more, up to the number of cores on the node, when using the local scheduler.

Here we request 20 workers with the parpool function, and then use parfor to send a different seed to each worker. The output is from the workers, as they complete, but the order is not deterministic.

Make sure the workers are not doing exactly the same computations. In this example, the different seed passed to the function causes all the random values to be different on each worker.

It took about 100 seconds for all 20 workers to produce one result. Since they are working in parallel the elapsed time to complete 200 results is about

MATLAB is selecting SOFTWARE OPENGL rendering.

                                       < M A T L A B (R) >
                             Copyright 1984-2019 The MathWorks, Inc.
                             R2019b (9.7.0.1190202) 64-bit (glnxa64)
                                         August 21, 2019


To get started, type doc.
For product information, visit www.mathworks.com.

>> myCluster = parcluster('local');
>> myCluster.NumWorkers = str2double(getenv('SLURM_CPUS_PER_TASK'));
>> myCluster.JobStorageLocation = getenv('TMPDIR');
>> myPool = parpool(myCluster);
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 12).
>>  tic; parfor sd = 1:20; maxEig(sd,5001); end; toc

maxe =

   67.1320


maxe =

   70.8721


maxe =

   71.3507

... skipped lines ...

maxe =

   70.7506


maxe =

   70.2656


maxe =

   71.4253

Elapsed time is 1368.233648 seconds.

Once the job is completed, exit MATLAB and release the interactive compute node.

>> exit
[traine@r00g01 matlab_interact]$ exit
exit
salloc: Relinquishing job allocation 7993736
[(it_css:traine)@login00 matlab_interact]$

MCR array job example

Most Matlab functions can be compiled using the Matlab Compiler (MCC) and then deployed to run on the compute nodes in the MATLAB Compiler Runtime (MCR). The MCR is a prerequisite for deployment, and is installed on all the compute nodes. You must use VALET to set up the libraries you will need to run your function from the command line. You do not need to use the shell script (.sh file) that the compiler creates.

There are two ways to run compiled MATLAB jobs in a shared environment, such as Caviness.

  1. Compile to produce an executable that uses a single computational thread - MATLAB option '-singleCompThread'
  2. Submit the job to use the nodes exclusively - Slurm option --exclusive

You can run more jobs on each node when they are compiled using just one core (Single Comp Thread). This will give you higher throughput for an array job, but not higher performance.
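
To estimate how many tasks can share a node, take the smaller of the core limit and the memory limit. The sketch below is illustrative only: tasks_per_node is a hypothetical helper, and the sample numbers (36 cores and 124 GB per node, 2 CPUs and 3 GB per task) are assumptions taken from the node and job settings shown elsewhere on this page.

```bash
#!/bin/bash
# Arguments: cores per node, memory per node (GB),
#            CPUs per task, memory per task (GB).
tasks_per_node() {
  by_cpu=$(( $1 / $3 ))   # how many tasks the cores allow
  by_mem=$(( $2 / $4 ))   # how many tasks the memory allows
  if [ "$by_cpu" -lt "$by_mem" ]; then
    echo "$by_cpu"
  else
    echo "$by_mem"
  fi
}

tasks_per_node 36 124 2 3
```

With these sample numbers the cores are the limiting resource, so the estimate is 18 concurrent tasks per node.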

Make a new directory named MCR_array_II, and then copy the maxEig function from the matlab_example directory to the new MCR_array_II directory.

[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ mkdir MCR_array_II
[(it_css:traine)@login00 matlab_example]$ cp maxEig.m MCR_array_II/
[(it_css:traine)@login00 matlab_example]$ cd MCR_array_II

The maxEig function has a conditional statement to make it work when deployed.

  if (isdeployed)
    sd = str2num(sd)
    dim = str2num(dim)
  end

All arguments of the function are taken as tokens on the shell command used to execute the script, and they are all strings. You must convert numbers from strings to numbers. You can use the same variable names so that the rest of the script will behave the same when deployed or executed directly in Matlab.

You can convert this function into an executable that uses a single computational thread by using the Matlab compiler mcc. To do this, create a file compile.sh and add the lines below to the file.

prog=maxEig
opt='-nojvm,-nodisplay,-singleCompThread'
version='r2019b'

vpkg_require matlab/$version
mcc -R "$opt" -mv $prog.m

[ -d ${WORKDIR}/${USER}/sw/bin ] && mv $prog ${WORKDIR}/${USER}/sw/bin
Keep these commands in a file: Even though this is just two commands, we recommend you keep these commands, including the shell assignment statements, as a record of the MATLAB version and options you used to create the executable maxEig. You will need to know these if you want to use the executable in a shell script. You can source this file when you want to rebuild maxEig.
You can get mcc usage instructions with mcc -help: The string following the -R flag contains the Matlab options you want to use at run time. The -m option tells mcc to build a standalone application to be deployed using MCR. The -v option is for verbose mode.
You cannot execute a file from a directory on the Lustre file system. That is why the executable $prog is moved to the special directory, which is added to your path when a new workgroup shell is started or when a queue script is submitted.
[ -d $WORKDIR/sw/bin ] && mv $prog $WORKDIR/sw/bin

Make the directory where the maxEig executable will be placed when the function is compiled.

[(it_css:traine)@login00 MCR_array_II]$ mkdir -p ${WORKDIR}/${USER}/sw/bin
If you have a permission error, check to make sure that you are in your workgroup.

Now request an interactive compute node and run the compile.sh script.

[(it_css:traine)@login00 MCR_array_II]$ salloc --partition=devel
salloc: Granted job allocation 9804138
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[traine@r00n56 MCR_array_II]$ ls
compile.sh  maxEig.m
[traine@r00n56 MCR_array_II]$ . compile.sh
Adding package `matlab/r2019b` to your environment
Compiler version: 7.1 (R2019b)
Dependency analysis by REQUIREMENTS.
Parsing file "/home/1201/matlab_example/MCR_array_II/maxEig.m"
        (referenced from command line).
Generating file "/home/1201/matlab_example/MCR_array_II/readme.txt".
Generating file "run_maxEig.sh".
[traine@r00n56 MCR_array_II]$ ls
compile.sh  maxEig.m  mccExcludedFiles.log  readme.txt  requiredMCRProducts.txt  run_maxEig.sh
[traine@r00n56 MCR_array_II]$ exit
exit
salloc: Relinquishing job allocation 9804138
[(it_css:traine)@login01 MCR_array_II]$

The mcc command will generate a .sh file (run_maxEig.sh) that should not be used. This run script does not use VALET and does not have the appropriate Slurm commands. Instead, you should copy the Slurm template file /opt/templates/slurm/applications/matlab-mcr.qs by using the following command

[(it_css:traine)@login00 MCR_array_II]$ cp /opt/templates/slurm/applications/matlab-mcr.qs .

and make the changes shown below.

...
20 #SBATCH --cpus-per-task=2
...
29 #SBATCH --mem=3G
30 #SBATCH --mem-per-cpu=1024M
...
47 #SBATCH --job-name=maxEig
...
58 #SBATCH --partition=_workgroup_
...
85 #SBATCH --output %x-%A-%3a.out
...
102 # Setting the job array options
103 #SBATCH --array=1-200:1
...
148 # Load a specific Matlab MCR package into the runtime environment:
149 #
150 vpkg_require mcr/2019b:nojvm
151 export MCR_CACHE_ROOT="$TMPDIR"
152
153 #
154 # Do standard MCR environment setup:
155 #
156 . /opt/shared/slurm/templates/libexec/matlab-mcr.sh
157
158 #
159 date "+Start %s"
160 echo "Host ${HOSTNAME}"
161
162
163 #Getting the task ID that will be passed as an argument
164 let seed=$SLURM_ARRAY_TASK_ID
165 let dim=5001
166
167 # Execute your MCR program(s) here; prefix with UD_EXEC to
168 # ensure the job can/will respond to preemption/termination
169 # signals by calling your UD_JOB_EXIT_FN.
170 #
171 # Duplicate all three commands for each MCR program you run
172 # in sequence below.
173 #
174 #UD_EXEC my_mcr_program arg1 arg2
175 #mcr_rc=$?
176 #if [ $mcr_rc -ne 0 ]; then exit $mcr_rc; fi
177 UD_EXEC ${WORKDIR}/${USER}/sw/bin/maxEig $seed $dim
178 mcr_rc=$?
179 if [ $mcr_rc -ne 0 ]; then exit $mcr_rc; fi
180
181 date "+Finish %s"
182


The two date commands record the start and finish time in seconds for each task. These are then used to compute the total runtime. The echoed host name can be used to calculate the overlapping use of the compute nodes. Since maxEig was compiled as a single-threaded job, the CPU time will be very close to the elapsed (wall clock) time. We do not send email notifications since that would generate 200 email messages, one for each task.

To test the example compiled Matlab job on the it_css owner queues, we first compiled the code with mcc and then submitted it with sbatch.

[(it_css:traine)@login00 MCR_array_II]$ sbatch matlab-mcr.qs
Submitted batch job 9805558

The assigned job ID is 9805558. After a few minutes 200 files were created in the current directory.

  maxEig-9805558-001.out  ...   maxEig-9805558-200.out
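
Before gathering results it can be worth confirming that every task produced a non-empty output file. The sketch below follows the maxEig-JOBID-NNN.out naming shown above; the helper name missing_tasks is hypothetical, and the demonstration runs on a temporary directory with made-up files rather than real job output.

```bash
#!/bin/bash
# Print the task ids (1..last) whose output file is missing or empty.
# Arguments: directory, job id, last task id.
missing_tasks() {
  dir=$1; jobid=$2; last=$3
  i=1
  while [ "$i" -le "$last" ]; do
    f=$(printf '%s/maxEig-%s-%03d.out' "$dir" "$jobid" "$i")
    [ -s "$f" ] || echo "$i"
    i=$(( i + 1 ))
  done
}

# Demonstration with placeholder files: tasks 1, 2 and 4 exist, 3 does not.
demo=$(mktemp -d)
for t in 1 2 4; do
  printf 'maxe = 70\n' > "$(printf '%s/maxEig-9805558-%03d.out' "$demo" "$t")"
done
missing_tasks "$demo" 9805558 4
rm -rf "$demo"
```

For the real job, a call such as missing_tasks . 9805558 200 should print nothing when all 200 tasks have finished.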

Each file holds the output of one task. For example, for task ID 125:

Adding package `mcr/2019b:nojvm` to your environment
-- OpenMP job setup complete:
--  OMP_NUM_THREADS      = 2
--  OMP_PROC_BIND        = true
--  OMP_PLACES           = cores
--  MP_BLIST             = 32,33

-- Matlab MCR environment setup complete (on r00n10):
--  MCR_ROOT             = /opt/shared/matlab/r2019b
--  MCR_CACHE_ROOT       = /tmp/job_9805684

Start 1604084921
Host r00n10.localdomain.hpc.udel.edu

sd =

   125


dim =

        5001


maxe =

   70.4891

Finish 1604085004
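
The Start and Finish lines make it easy to compute the run time of a single task. The following sketch does this with awk; task_elapsed is a hypothetical helper name, and the demonstration uses a temporary file holding only the two timestamps from the task 125 output above.

```bash
#!/bin/bash
# Print Finish minus Start (in seconds) for one task output file.
task_elapsed() {
  awk '$1 == "Start"  { s = $2 }
       $1 == "Finish" { f = $2 }
       END { if (s != "" && f != "") print f - s }' "$1"
}

# Demonstration with the timestamps of task 125
demo=$(mktemp)
printf 'Start 1604084921\nHost r00n10.localdomain.hpc.udel.edu\nFinish 1604085004\n' > "$demo"
task_elapsed "$demo"
rm -f "$demo"
```

For task 125 this gives 83 seconds, which is consistent with the per-task times of about 82 to 85 seconds reported for node r00n10.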

Now we will use wikigather.pl to gather all the information from these files and return the avgMaxEig value. Copy the Perl code listed under wikigather.pl below, then create a new file named wikigather.pl in your current directory and paste the copied code into that file.

After copying the code, make sure that you change the job id value in the pattern variable to match your job id:
[(it_css:traine)@login00 MCR_array_II]$ perl wikigather.pl
avgMaxEig = 69.5131125

The script will also create three new .data files and one new .txt file. We are really only interested in the result9805558.data and wiki9805558.txt files. Examples of them are shown below.

sd dim maxe
1 5001 70.0220
2 5001 71.7546
3 5001 70.8331
4 5001 70.5714
5 5001 69.4923
....
195 5001 68.7440
196 5001 71.5652
197 5001 69.8530
198 5001 70.1213
199 5001 70.7535
200 5001 67.4221
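
The average reported by wikigather.pl can be cross-checked directly from the result file. This is a minimal sketch, assuming the three-column layout shown above with a one-line header; avg_maxe is a hypothetical helper name, and the demonstration uses only the first two data rows.

```bash
#!/bin/bash
# Average the third column (maxe) of a result file, skipping the header.
avg_maxe() {
  awk 'NR > 1 && NF == 3 { sum += $3; n++ }
       END { if (n > 0) printf "%.4f\n", sum / n }' "$1"
}

# Demonstration with the first two rows of the listing above
demo=$(mktemp)
printf 'sd dim maxe\n1 5001 70.0220\n2 5001 71.7546\n' > "$demo"
avg_maxe "$demo"
rm -f "$demo"
```

Run against the full result file, the same function should reproduce the avgMaxEig value printed by the Perl script, rounded to four decimals.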

The values in the result file are the same results we got from both the Matlab loop and the parallel toolbox, but they were computed in just about 8.5 minutes. To see this we gather the start/finish times in seconds and the host name.

wiki9805558.txt Output:

SGE array job started Fri 30 Oct 2020 03:04:16 PM EDT

Used a total of 16585 CPU seconds over 525 seconds of elapsed time on 2 nodes

Node                                    Real Clock Time          Ratio
Name                             Count  Min    Max    Average    User/Real
r00n10.localdomain.hpc.udel.edu  108    82.00  85.00  83.32      1.00000
r00n47.localdomain.hpc.udel.edu  92     58.00  84.00  82.46      1.00000

Using gnuplot we get a time chart of usage on the 2 nodes and total CPU usage. Create a file named plot«JOB ID».gnuplot and add the following code to it.

  set terminal png size 640,640
  set output "wiki9805558.png"
  set multiplot layout 2,1
  set xrange [0:550]
  set yrange [0:80]
  set key on
  set title "Tasks on 2 nodes by time (seconds)"
  plot "count.data" u 1:3 t "r00n10.localdomain.hpc.udel.edu" w filledcurves,"count.data" u 1:2 t "r00n47.localdomain.hpc.udel.edu" w filledcurves
  set title "User time usage rate on all nodes"
  plot "usage.data" u 1:2 w steps t "CPU"
You will need to update the “plot” line with the correct number of nodes and their respective names for the job you ran.

To create the plot we will need to request an interactive compute node on the devel partition. Once the request has been filled we will use VALET to load the gnuplot application and run the plot«JOB ID».gnuplot script that we just created. After the script has run we will release the node and then view the .png that was created by the script.

[(it_css:traine)@login00 MCR_array_II]$ salloc --partition=devel
salloc: Granted job allocation 9805971
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[traine@r00n56 MCR_array_II]$ vpkg_require gnuplot
Adding package `gnuplot/5.2.4` to your environment
[traine@r00n56 MCR_array_II]$ gnuplot plot9805558.gnuplot
[traine@r00n56 MCR_array_II]$ exit
[(it_css:traine)@login00 MCR_array_II]$ display wiki9805558.png
Make sure that you have X11 forwarding setup on your SSH connection to be able to view the image.

An example of the .png file that is created by the plot9805558.gnuplot script.

wikigather.pl
$pattern = '\-(9805558)\-(\d+)\.(out)'; # Make sure to change 9805558 to match your job id.
$countFile = 'count.data'; # task count on nodes by seconds
$usageFile = 'usage.data'; # accumulate user time on all nodes by seconds
$nodeUsageFile = "nodeusage.data"; #detail of node usage by seconds
$nodeUsageFiles = "%s_usage.data"; # %s -> host
@varNames = qw/sd dim maxe/; # used for columns in resultfile
$resultFile = "result%s.data"; # %s -> project id
&scandir(".");
 
@node = sort keys %hostCount;
 
foreach $jobid (keys %startTime) {
  my $file = sprintf "wiki%s.txt", $jobid;
  open(WIKI, ">$file");
  print WIKI `date -d \@$startTime{$jobid} +\"SGE array job started %c\n"`;
  print WIKI "Used a total of $userTotal{$jobid} CPU seconds ";
  print WIKI "over ",$stopTime{$jobid}-$startTime{$jobid}," seconds of elapsed time ";
  print WIKI "on ",0+@node," nodes\n";
 
  $baseTime = $startTime{$jobid} if (!defined $baseTime or $startTime{$jobid} < $baseTime);
 
  $avgMaxEig=0;
  $count=0;
  if ($resultFile) {
    my $file = sprintf $resultFile, $jobid;
    open(DATA, ">$file");
    print DATA "@varNames\n";
    foreach $task (sort { $a <=> $b } keys %{$result{$jobid}}) {
      my %var = split($;,$result{$jobid}{$task});
      print DATA "@var{@varNames}\n";
      $avgMaxEig += $var{'maxe'};
      $count += 1;
    }
    close(DATA);
    print 'avgMaxEig = ', $avgMaxEig/$count, "\n";
  }
 
  printf WIKI "^ %18s ^^ %30s ^^^ %12s ^\n","Node ","Real Clock Time ","Ratio ";
  printf WIKI "^ %8s ^ %8s ^ %9s ^ %9s ^ %9s ^ %12s ^\n","Name ","Count ","Min ","Max ","Average ","User/Real ";
  foreach (@node) {
    if ( $hostCountByJob{$jobid}{$_} > 0) {
      printf WIKI "|%8s|%8d| %9.2f|%9.2f|%9.2f |%12.5f|\n", $_, $hostCountByJob{$jobid}{$_},
        $hostRealMin{$jobid}{$_},$hostRealMax{$jobid}{$_}, 
        $hostReal{$jobid}{$_}/$hostCountByJob{$jobid}{$_}, 
        $hostUser{$jobid}{$_}/$hostReal{$jobid}{$_};
    }
  }
  close(WIKI);
}
 
if ($countFile and open(DATA,">$countFile")) {
  my(@col,%byNode,$time,$count);
  $col[$_] = 0 for $[ .. $#node;
  foreach $time (sort { $a <=> $b } keys %timeCount) {
    printf DATA "%d %s\n", $time-$baseTime, "@col";
    $byNode{$_} += $timeCount{$time}{$_} foreach keys %{$timeCount{$time}};
    $count=0;
    $col[$_] = $count += $byNode{$node[$_]} for $[ .. $#node;
    printf DATA "%d %s\n", $time-$baseTime, "@col";
  }
  close(DATA);
}
 
if ($usageFile and open(DATA,">$usageFile")) {
  my ($time, $lastTime, $slope, $usage);
  foreach $time (sort { $a <=> $b } keys %timeRate) {
    $usage += $slope*($time - $lastTime);
    $slope += $timeRate{$time}{$_} foreach keys %{$timeRate{$time}};
    printf DATA "%d %.4f %.4f\n", $time-$baseTime, $slope, $usage;
    $lastTime = $time;
  }
  close(DATA);
}
 
if ($countFile and $usageFile) {
  foreach $jobid (keys %startTime) {
    my $plotTitle = 'Number of tasks on %s by time (seconds)'; # %s -> nodes
    my(@plot);
    $plot[$_] = "\"$countFile\" u 1:".(2+$_-$[)." t \"$node[$_]\" w filledcurves"
      for  $[ .. $#node;
    my $plotTop = join(",",reverse @plot);
    my $titleTop = sprintf $plotTitle, 0+@node." nodes";
    my $key = "off";
    my ($t1,$t2) = (30*int(($startTime{$jobid}-$baseTime)/30),30*int(2+($stopTime{$jobid}-$baseTime)/30));
    $titleTop = sprintf $plotTitle, "nodes @node" if $#node < 5;
    $key = "out horiz top right" if $#node < 9;
 
    open (PLOT, "| gnuplot" );
    print PLOT <<"EOP";
  set term pngcairo font "sans,10" size 640,640
  set output "wiki$jobid.png"
  set multiplot layout 2,1
  set xrange [$t1:$t2]
  set key $key
  set ylabel "Number of Tasks on node"
  plot $plotTop
  set key out horiz top right
  set ylabel "Total CPU usage"
  set xlabel "Time (seconds)"
  plot "$usageFile" u 1:3 w lines t "CPU seconds"
EOP
  }
}
 
sub scanfile {
   my($file) = @_;
   my($jobid,$taskid) = ($file =~ /$pattern/);
   my($host,$start,$finish,$usr1,$usr2,$real,$user,$sys,$lhs,%var);
   open(FILE,$file) || next;
   local $/ = undef;  #Read file as one string
   while (<FILE>) {
      study;
      /^Host (\S+)/m and $host=$1;
      /^Start (\d+)/m and $start=$1;
      /^Finish (\d+)/m and $finish=$1;
      /^SIGUSR1 (\d+)/m and $usr1=$1;
      /^SIGUSR2 (\d+)/m and $usr2=$1;
      /^real(.*?)m(.*?)s/m and $real=60*$1+$2;
      /^user(.*?)m(.*?)s/m and $user=60*$1+$2;
      /^sys(.*?)m(.*?)s/m and $sys=60*$1+$2;
      while(/(\S+)\s*=\s*(.*)/g) { $var{$1}=$2 };
   }
   close(FILE);
   $result{$jobid}{$taskid} = join($;,%var);
   $SGEfile{$file} = sprintf "| %s | %.2f %8.2f %8.2f |", $host, $real, $user, $sys;
   $SGEfile{$file} .= sprintf " %d %d |", $usr1, $usr2;
   $SGEfile{$file} .= join(',', map {" $_=$var{$_}"} keys %var );
   $finish = $usr2 if( $finish==0 );
   $finish = $usr1 if( $finish==0 );
   $finish > 0 || next;
   $real = $finish-$start if($real==0);
   $user = $real-$sys if($user==0);
   $startTime{$jobid} = $start if (!defined $startTime{$jobid} or $start < $startTime{$jobid});
   $stopTime{$jobid} = $finish if (!defined $stopTime{$jobid} or $finish > $stopTime{$jobid});
   $userTotal{$jobid} += $user;
 
   $hostCount{$host} += 1;
   $hostCountByJob{$jobid}{$host} += 1;
   $hostReal{$jobid}{$host} += $real;
   $hostRealMax{$jobid}{$host} = $real if (!defined $hostRealMax{$jobid}{$host} or 
                                          $real > $hostRealMax{$jobid}{$host});
   $hostRealMin{$jobid}{$host} = $real if (!defined $hostRealMin{$jobid}{$host} or 
                                          $real < $hostRealMin{$jobid}{$host});
   $hostUser{$jobid}{$host} += $user;
 
   $timeCount{$start}{$host} += 1;
   $timeCount{$finish}{$host} -= 1;
   $timeRate{$start}{$host} += $user/($finish-$start);
   $timeRate{$finish}{$host} -= $user/($finish-$start);
}
 
sub scandir {
   my($basedir) = @_;
   my(@file,@dir);
 
   opendir(DIR, $basedir) || return;
   foreach ( grep (/^[^\.]/,readdir(DIR)) ) { # ignore hidden files
      next if -l "$basedir/$_" ; # skip sym links
      push @file,$_ if /$pattern/; # save files with this pattern
      push @dir,$_ if -d "$basedir/$_" ; # save directories for recursion
   }
   closedir(DIR);
 
   foreach (@file) {
      &scanfile("$basedir/$_");
   }
 
   foreach (@dir) {
      &scandir("$basedir/$_");
   }
}

Adding checkpoints to your Matlab job

Adding checkpoints to your job can help it handle kill signals from the system more gracefully. Proper handling of these signals can help you restart your job without having to start over completely. In the following example, we will modify previously used scripts and functions to track which iteration the loop stops at when the job times out.

First we'll create a new directory and copy the needed code into it.

[(it_css:traine)@login00 ~]$ cd matlab_example
[(it_css:traine)@login00 matlab_example]$ mkdir matlab_checkpoint
[(it_css:traine)@login00 matlab_example]$ cd matlab_checkpoint
[(it_css:traine)@login00 matlab_checkpoint]$ cp /opt/templates/slurm/generic/serial.qs batch.qs

You will also want to add maxEig.m and script.m to your matlab_checkpoint/ directory.

Now we will need to make changes to script.m.

% script to run maxEig function 200 times and print average.
count = 200;
dim = 5001;
sumMaxe = 0;
i = 0;
tic;
for i=1:count;
        sumMaxe = sumMaxe + maxEig(i,dim);
        counter = "counter: "+i; %Add this line
        disp(counter); %Add this line
end;
toc
avgMaxEig = sumMaxe/count
quit

The following changes will need to be added to batch.qs

...
40 #SBATCH --job-name=checkpoint
...
60 #SBATCH --time=0-01:30:00
...
75 #SBATCH --output %x-%j.out
76 #SBATCH --error %x-%j.out
...
85 #SBATCH --mail-user='traine@udel.edu'
86 #SBATCH --mail-type=END,FAIL,TIME_LIMIT_90
...
108 job_exit_handler() {
109   counter=$(tail -n 2  ${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out | head -n 1)
110   echo "Job ${SLURM_JOB_NAME} ended on ${counter}"
111   #matlab -nodisplay -nojvm -r "disp(getReport(err,'extended')); quit;"
112   # Copy all our output files back to the original job directory:
113   #cp * "$SLURM_SUBMIT_DIR"
114
115   # Don't call again on EXIT signal, please:
116   trap - EXIT
117   exit 0
118 }
119 export UD_JOB_EXIT_FN=job_exit_handler
...
142 #
143 #srun date
144 export UD_JOB_EXIT_FN_SIGNALS="SIGTERM EXIT"
145 #Loading MATLAB
146 vpkg_require matlab/r2018b
147 #Running the matlab script
148 UD_EXEC matlab -nodisplay -nojvm -r "try; script; catch ERR; disp(job_exit_handler(ERR.getReport)); quit; end"

We know from the earlier batch job example that this script takes between 2 and 3 hours to run. In the changes we made to the batch.qs script, we set the wall clock limit to 1 hour and 30 minutes. That should ensure that the wall clock runs out before this script can complete. This is shown in the following job submission example.

[(it_css:traine)@login01 matlab_checkpoint]$ sbatch batch.qs
Submitted batch job 8382365

After the wall clock runs out we will see the following output.

[(it_css:traine)@login01 matlab_checkpoint]$ tail -n 30 checkpoint-8382365.out

   65.5558

counter: 105

maxe =

   70.1761

counter: 106

maxe =

   68.9765

counter: 107

maxe =

   69.9773

counter: 108

maxe =

   69.3456

counter: 109
slurmstepd: error: *** JOB 8382365 ON r00n47 CANCELLED AT 2020-05-20T19:58:11 DUE TO TIME LIMIT ***
Job checkpoint ended on counter: 109

Now we know that the script completed 109 of the 200 loop iterations before the wall clock expired. This example could be expanded so that the job is re-queued and starts at the last loop iteration instead of restarting from the beginning.
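
As a first step toward such a restart, the last completed iteration can be read back from the job's output file. The sketch below assumes the "counter: N" lines printed by the modified script.m; the helper names last_counter and next_start are hypothetical, and the demonstration uses a temporary file with a few lines copied from the output above.

```bash
#!/bin/bash
# Print the number from the last "counter: N" line in an output file.
last_counter() {
  sed -n 's/.*counter: \([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Print the iteration a restarted loop should begin with
# (1 when no counter line was found).
next_start() {
  last=$(last_counter "$1")
  echo $(( ${last:-0} + 1 ))
}

# Demonstration with the tail of checkpoint-8382365.out
demo=$(mktemp)
printf 'counter: 108\n\nmaxe =\n\n   69.3456\n\ncounter: 109\nJob checkpoint ended on counter: 109\n' > "$demo"
next_start "$demo"
rm -f "$demo"
```

Because the counter is printed after maxEig returns, iteration 109 is complete and a restart would begin at 110. Note that the running sum sumMaxe would also have to be saved for the final average to be correct.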

If you don't want to wait the full amount of time that the wall clock is set to, you can use scancel to manually stop the job and trigger the job_exit_handler() function, as shown for job 8390581:
[(it_css:traine)@login01 matlab_checkpoint]$ sbatch batch.qs
Submitted batch job 8390581
[(it_css:traine)@login01 matlab_checkpoint]$ scancel 8390581
[(it_css:traine)@login01 matlab_checkpoint]$ cat checkpoint-8390581.out
-- Registered exit function 'job_exit_handler' for signal(s) SIGTERM

Adding package `matlab/r2018b` to your environment

                            < M A T L A B (R) >
                  Copyright 1984-2018 The MathWorks, Inc.
                   R2018b (9.5.0.944444) 64-bit (glnxa64)
                              August 28, 2018


For online documentation, see https://www.mathworks.com/support
For product information, visit www.mathworks.com.


maxe =

   70.0220

counter: 1
...

maxe =

   67.7814

counter: 6

maxe =

   70.5037

counter: 7
slurmstepd: error: *** JOB 8390581 ON r00n17 CANCELLED AT 2020-05-21T10:55:03 ***
Job checkpoint ended on counter: 7
  • software/matlab/caviness.txt
  • Last modified: 2020-10-30 16:16
  • by mkyle