software:matlab:caviness

Differences

This shows you the differences between two versions of the page.

software:matlab:caviness [2021-01-13 18:19] – [Batch job serial example] anita
software:matlab:caviness [2023-10-06 10:29] (current) – [Running the checkpoint job and its output] thuachen
Line 134: Line 134:
 </note> </note>
 ===== Create a job script file ===== ===== Create a job script file =====
-You should create a job script file to submit a batch job. Start by modifying a batch job script template file (''/opt/templates/slurm/generic/serial.qs''), for example, to submit a serial job using one core on a compute node, +You should create a job script file to submit a batch job. Start by modifying a batch job script template file (''/opt/shared/templates/slurm/generic/serial.qs''), for example, to submit a serial job using one core on a compute node.
 In your newly copied serial.qs file, add the following lines at the end. In your newly copied serial.qs file, add the following lines at the end.
 <code>  <code> 
-[traine@login00 matlab_example]$ cp /opt/templates/slurm/generic/serial.qs matlab_first.qs+[traine@login00 matlab_example]$ cp /opt/shared/templates/slurm/generic/serial.qs matlab_first.qs
 </code> </code>
 <code> <code>
Line 297: Line 297:
 ===== Compiling your Matlab code ===== ===== Compiling your Matlab code =====
  
-There is an example MCR project in the ''/opt/templates/'' directory for you to copy and try.  Copy on the head node and use ''salloc'' to compile with MATLAB on the ''devel'' partition (.i.e max request time on the ''devel'' partition is 2 hours and 4 cores).  Once your program is compiled, you can run it interactively or batch, without needing a MATLAB license.+There is an example MCR project in the ''/opt/shared/templates/'' directory for you to copy and try.  Copy it on the head node and use ''salloc'' to compile with MATLAB on the ''devel'' partition (i.e., the max request time on the ''devel'' partition is 2 hours and 4 cores).  Once your program is compiled, you can run it interactively or in batch, without needing a MATLAB license.
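For instance, an allocation on the ''devel'' partition for compiling could be requested along the following lines (the core count and time shown are only an illustration, staying within the ''devel'' limits mentioned above):
<code>
[(it_css:traine)@login00 MCR]$ salloc --partition=devel --ntasks=4 --time=2:00:00
</code>
Once your program is compiled there (e.g., with the ''make'' step shown below), type ''exit'' to release the allocation.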
  
 ==== Copy dev-projects template ==== ==== Copy dev-projects template ====
Line 305: Line 305:
 [traine@login00 ~]$ workgroup -g it_css [traine@login00 ~]$ workgroup -g it_css
 [(it_css:traine)@login00 ~]$ cd matlab_example [(it_css:traine)@login00 ~]$ cd matlab_example
-[(it_css:traine)@login00 matlab_example]$ cp -r /opt/templates/dev-projects/Projects/MCR .+[(it_css:traine)@login00 matlab_example]$ cp -r /opt/shared/templates/dev-projects/Projects/MCR .
 [(it_css:traine)@login00 matlab_example]$ cd MCR [(it_css:traine)@login00 matlab_example]$ cd MCR
  
Line 373: Line 373:
 <code> <code>
 [(it_css:traine)@login00 ~]$ cd matlab_example [(it_css:traine)@login00 ~]$ cd matlab_example
-[(it_css:traine)@login00 matlab_example]$ cp -r /opt/templates/dev-projects/Projects/MCR MCR_array+[(it_css:traine)@login00 matlab_example]$ cp -r /opt/shared/templates/dev-projects/Projects/MCR MCR_array
 [(it_css:traine)@login00 ~]$ cd MCR_array [(it_css:traine)@login00 ~]$ cd MCR_array
-[(it_css:traine)@login00 MCR_array]$ cp /opt/templates/slurm/applications/matlab-mcr.qs .+[(it_css:traine)@login00 MCR_array]$ cp /opt/shared/templates/slurm/applications/matlab-mcr.qs .
 [(it_css:traine)@login00 MCR_array]$  make [(it_css:traine)@login00 MCR_array]$  make
 Adding package `mcr/2019b` to your environment Adding package `mcr/2019b` to your environment
Line 623: Line 623:
 [(it_css:traine)@login00 matlab_example]$ mkdir matlab_slurm [(it_css:traine)@login00 matlab_example]$ mkdir matlab_slurm
 [(it_css:traine)@login00 matlab_example]$ cd matlab_slurm [(it_css:traine)@login00 matlab_example]$ cd matlab_slurm
-[(it_css:traine)@login00 matlab_slurm]$ cp /opt/templates/slurm/generic/serial.qs batch.qs+[(it_css:traine)@login00 matlab_slurm]$ cp /opt/shared/templates/slurm/generic/serial.qs batch.qs
 [(it_css:traine)@login00 matlab_slurm]$ vim batch.qs [(it_css:traine)@login00 matlab_slurm]$ vim batch.qs
 </code> </code>
Line 639: Line 639:
 86 #SBATCH --mail-user='traine@udel.edu' 86 #SBATCH --mail-user='traine@udel.edu'
 87 #SBATCH --mail-type=END,FAIL,TIME_LIMIT_90 87 #SBATCH --mail-type=END,FAIL,TIME_LIMIT_90
-... 
-94 #SBATCH --exclusive 
 ... ...
 137 # 137 #
Line 657: Line 655:
 Make sure you change the ''%%--%%mail-user'' from ''traine@udel.edu'' to your preferred email address. The ''-nodisplay'' indicates no X11 graphics, which implies ''-nosplash -nodesktop''. The ''-nojvm'' indicates no Java. (Java is needed for some functions, e.g., print graphics, but should be excluded for most computational jobs.) Make sure you change the ''%%--%%mail-user'' from ''traine@udel.edu'' to your preferred email address. The ''-nodisplay'' indicates no X11 graphics, which implies ''-nosplash -nodesktop''. The ''-nojvm'' indicates no Java. (Java is needed for some functions, e.g., print graphics, but should be excluded for most computational jobs.)
 The ''-batch'' is followed by a Matlab command, enclosed in quotes when there are spaces in the command.  The ''-batch'' is followed by a Matlab command, enclosed in quotes when there are spaces in the command. 
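As an illustration, using the ''script.m'' example from this page, a batch-mode run with no graphics and no Java would look something like this:
<code>
matlab -nodisplay -nojvm -batch "script"
</code>
A longer command that contains spaces, such as ''-batch "try; script; catch; end"'', must keep the surrounding quotes.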
- 
-<note important>**Exclusive access to node**: 
-The ''#SBATCH %%--%%exclusive'' tells the scheduler to wait until your job can get exclusive access to the node.  Since your job is the only job on the node, it can use all the memory and all the cores.  Matlab assumes you want to use the full node to run as fast as possible.  The goal is to reduce real time (wall clock time), not CPU time.  When you use exclusive you should monitor the job to see the average core count and the maximum memory usage.  With hind sight, this job should have used: 
-<code> 
-#SBATCH --ntasks=5 
-#SBATCH --mem=1G 
-</code> 
-If everyone in your group carefully set these values, multiply jobs can run concurrently on the node. Also you can use ''%%--%%exclusive=user'' to allow your jobs to run on a node, so this way you could potentially run more than one (Matlab or other application) job at the same time on a given node if you know you have accurately specified the resources for each job. 
- 
-See [[maxNumCompThreadsGridEngine|Setting maximum number of computational threads]] 
- 
-</note> 
  
 <note tip>**Errors in the Matlab script**: <note tip>**Errors in the Matlab script**:
Line 684: Line 670:
   * Do set paper dimensions and print each figure to a file.   * Do set paper dimensions and print each figure to a file.
  
-The text output will be included in the standard Grid Engine output file, but not any graphics.  All figures must be exported using the **print** command. Normally the **print** command will print on an 8 1/2 by 11 inch page with margins that are for a printed page of paper.  The size and margins will not work if you plan to include the figure in a paper or a web page.+The text output will be included in the standard Slurm output file, but not any graphics.  All figures must be exported using the **print** command. Normally the **print** command will print on an 8 1/2 by 11 inch page with margins that are for a printed page of paper.  The size and margins will not work if you plan to include the figure in a paper or a web page.
  
 We suggest setting the current figure's ''PaperUnits'', ''PaperSize'' and ''PaperPosition''. Matlab provides a handle to the current figure (**gcf**).  For example, the commands We suggest setting the current figure's ''PaperUnits'', ''PaperSize'' and ''PaperPosition''. Matlab provides a handle to the current figure (**gcf**).  For example, the commands
Line 855: Line 841:
 %% Configure parpool %% Configure parpool
 myCluster = parcluster('local'); myCluster = parcluster('local');
-myCluster.NumWorkers = str2double(getenv('SLURM_CPUS_ON_NODE')) / str2double(getenv('SLURM_CPUS_PER_TASK'));+myCluster.NumWorkers = str2double(getenv('SLURM_NTASKS'));
 myCluster.JobStorageLocation = getenv('TMPDIR'); myCluster.JobStorageLocation = getenv('TMPDIR');
 myPool = parpool(myCluster, myCluster.NumWorkers); myPool = parpool(myCluster, myCluster.NumWorkers);
Line 879: Line 865:
 Copy the template ''matlab.qs'' script and name it ''pbatch.qs'' by typing Copy the template ''matlab.qs'' script and name it ''pbatch.qs'' by typing
 <code> <code>
-cp /opt/templates/slurm/applications/matlab.qs ./pbatch.qs+cp /opt/shared/templates/slurm/applications/matlab.qs ./pbatch.qs
 </code> </code>
 Make the following changes to the code Make the following changes to the code
Line 886: Line 872:
 19 #SBATCH --ntasks=20 19 #SBATCH --ntasks=20
 ... ...
-37 #SBATCH --mem-per-cpu=4G+37 #SBATCH --mem=60G
 ... ...
 54 #SBATCH --job-name=matlab-pscript 54 #SBATCH --job-name=matlab-pscript
Line 933: Line 919:
 ====== Interactive job example ====== ====== Interactive job example ======
  
-The basic steps to running a [[:software:matlab:caviness#interactive-job|MATLAB]] interactively on a compute node that will dedicate all resources exclusively to your job. +These are the basic steps to run [[:software:matlab:caviness#interactive-job|MATLAB]] interactively on a compute node that will dedicate specific resources to your job. 
  
  
Line 940: Line 926:
  
  
-==== Scheduling exclusive interactive job ====+==== Scheduling interactive job ====
 Create a directory and add [[#matlab-function|maxEig.m]] and [[#slurm-script|script.m]] to it. Create a directory and add [[#matlab-function|maxEig.m]] and [[#slurm-script|script.m]] to it.
 <code> <code>
Line 951: Line 937:
 maxEig.m  script.m maxEig.m  script.m
 </code> </code>
-Start an interactive session on a compute node with the ''salloc'' command. You will also want to include the options ''--exclusive'' and ''--partition=_workgroup_''.+Start an interactive session on a compute node with the ''salloc'' command. You will also want to include the options ''--ntasks'' (the number of cores) and ''--partition=_workgroup_''.
 <code> <code>
-[(it_css:traine)@login00 matlab_interact]$ salloc --exclusive --partition=_workgroup_+[(it_css:traine)@login00 matlab_interact]$ salloc --partition=_workgroup_ --ntasks=20
 salloc: Pending job allocation 7985695 salloc: Pending job allocation 7985695
 salloc: job 7985695 queued and waiting for resources salloc: job 7985695 queued and waiting for resources
Line 1029: Line 1015:
 This example is based on the ''matlab_interact'' directory that was created in the [[software:matlab:caviness#interactive-job|Interactive job example]] demo shown above. This example is based on the ''matlab_interact'' directory that was created in the [[software:matlab:caviness#interactive-job|Interactive job example]] demo shown above.
  
-When you using the parallel toolbox, you should logon to a compute node and with the ''--exclusive'' option and on a work group partition:+When you are using the parallel toolbox, you should log on to a compute node using a workgroup partition and specify the number of tasks and memory required:
  
 <code> <code>
-[(it_css:traine)@login00 matlab_interact]$ salloc --exclusive --partition=_workgroup_ --cpus-per-task=20 --mem-per-cpu=2G --mem=40G+[(it_css:traine)@login00 matlab_interact]$ salloc --partition=_workgroup_ --ntasks=20 --mem=40G
 salloc: Pending job allocation 7993736 salloc: Pending job allocation 7993736
 salloc: job 7993736 queued and waiting for resources salloc: job 7993736 queued and waiting for resources
Line 1043: Line 1029:
 </code> </code>
      
-This will effectively reserve the entire node for your MATLAB workers.  The is default number of parallel workers is 12but you can ask for more -- up to the number of cores on the node when using the local scheduler.+This will effectively reserve 20 CPUs and 40G of memory for your interactive job.  The default number of parallel workers when using the parallel toolbox is 12, but you can define the number of workers based on the number of tasks requested.
  
-Here we request 20 workers with the parpool function, and then use parfor to send a different seed to each worker.  The output is from the workers, as they complete, but the order is not deterministic.+Here we request 20 workers with the ''parpool'' function, and then use ''parfor'' to send a different seed to each worker.  The output is from the workers, as they complete, but the order is not deterministic.
  
 <note important>**Make sure the workers are not doing exactly the same computations**  In this example, the different seed, passed to the function, causes all the random values to be different on each worker.</note> <note important>**Make sure the workers are not doing exactly the same computations**  In this example, the different seed, passed to the function, causes all the random values to be different on each worker.</note>
  
-It took about 100 seconds for all 20 workers to produce on result. Since they are working in parallel the elapsed time to complete 200 results is about +It took about 100 seconds for each of the 20 workers to produce one result; however, since the 20 workers run in parallel, the elapsed time to complete 200 results is about 918 seconds. 
  
 <code> <code>
Line 1064: Line 1050:
  
 >> myCluster = parcluster('local'); >> myCluster = parcluster('local');
->> myCluster.NumWorkers = str2double(getenv('SLURM_CPUS_ON_NODE')) / str2double(getenv('SLURM_CPUS_PER_TASK'));+>> myCluster.NumWorkers = str2double(getenv('SLURM_NTASKS'));
 >> myCluster.JobStorageLocation = getenv('TMPDIR'); >> myCluster.JobStorageLocation = getenv('TMPDIR');
->> myPool = parpool(myCluster);+>> myPool = parpool(myCluster, myCluster.NumWorkers);
 Starting parallel pool (parpool) using the 'local' profile ... Starting parallel pool (parpool) using the 'local' profile ...
-Connected to the parallel pool (number of workers: 12). +Connected to the parallel pool (number of workers: 20). 
->>  tic; parfor sd = 1:20; maxEig(sd,5001); end; toc+>>  tic; parfor sd = 1:200; maxEig(sd,5001); end; toc
  
 maxe = maxe =
Line 1101: Line 1087:
    71.4253    71.4253
  
-Elapsed time is 1368.233648 seconds.+Elapsed time is 918.822702 seconds.
  
 </code> </code>
  
-Once the job is completed exit MATLAB and release the interactive compute node. +Once the job is completed, delete your pool and exit MATLABand release the interactive compute node by typing ''exit''
  
 <code> <code>
 +>> delete(myPool);
 +Parallel pool using the 'local' profile is shutting down.
 >> exit >> exit
 [traine@r00g01 matlab_interact]$ exit [traine@r00g01 matlab_interact]$ exit
Line 1210: Line 1198:
  
 The ''mcc'' command will generate a ''.sh'' file that should **not** be used.  This run script does not use VALET and does not have the appropriate Slurm commands.  Instead, you should copy the Slurm template in the file  The ''mcc'' command will generate a ''.sh'' file that should **not** be used.  This run script does not use VALET and does not have the appropriate Slurm commands.  Instead, you should copy the Slurm template in the file 
-''/opt/templates/slurm/applications/matlab-mcr.qs'' by using the following command+''/opt/shared/templates/slurm/applications/matlab-mcr.qs'' by using the following command
 <code> <code>
-[(it_css:traine)@login00 MCR_array_II]$ cp /opt/templates/slurm/applications/matlab-mcr.qs .+[(it_css:traine)@login00 MCR_array_II]$ cp /opt/shared/templates/slurm/applications/matlab-mcr.qs .
  
 </code> </code>
Line 1218: Line 1206:
 <code> <code>
 ... ...
-20 #SBATCH --cpus-per-task=2+20 #SBATCH --ntasks=2
 ... ...
 29 #SBATCH --mem=3G 29 #SBATCH --mem=3G
-30 #SBATCH --mem-per-cpu=1024M 
 ... ...
 47 #SBATCH --job-name=maxEig 47 #SBATCH --job-name=maxEig
Line 1579: Line 1566:
  
 ==== Gathering code for job example ==== ==== Gathering code for job example ====
-First we'll create a new directory and copy the needed code into it. +Firstwe'll create a new directory and copy the needed code into it. 
  
 <code> <code>
Line 1585: Line 1572:
 [(it_css:traine)@login00 matlab_example]$ mkdir matlab_checkpoint [(it_css:traine)@login00 matlab_example]$ mkdir matlab_checkpoint
 [(it_css:traine)@login00 ~]$ cd matlab_checkpoint [(it_css:traine)@login00 ~]$ cd matlab_checkpoint
-[(it_css:traine)@login00 matl_checkpoint]$ cp /opt/templates/slurm/generic/serial.qs batch.qs+[(it_css:traine)@login00 matl_checkpoint]$ cp /opt/shared/templates/slurm/generic/serial.qs batch.qs
 </code> </code>
  
 You will also want to put a copy of the [[#matlab-function|maxEig.m]] and [[#matlab-script|script.m]] into your ''matlab_checkpoint'' directory. You will also want to put a copy of the [[#matlab-function|maxEig.m]] and [[#matlab-script|script.m]] into your ''matlab_checkpoint'' directory.
  
-Now we will need to make changes to ''script.m''.+Now we will need to make changes to ''script.m''. The added code prints the counter at each loop iteration and, when the job is restarted, reads the last counter from the job's output file and continues the loop from that checkpoint instead of starting over from the beginning.
 <code> <code>
 % script to run maxEig function 200 times and print average. % script to run maxEig function 200 times and print average.
Line 1597: Line 1584:
 sumMaxe = 0; sumMaxe = 0;
 i = 0; i = 0;
 +id = str2num(getenv('SLURM_JOB_ID'));
 +rc = 0;
 +rc = str2num(getenv('SLURM_RESTART_COUNT'));
 tic; tic;
-for i=1:count;+if isempty(rc); 
 +   for i=1:count;
         sumMaxe = sumMaxe + maxEig(i,dim);         sumMaxe = sumMaxe + maxEig(i,dim);
         counter = "counter: "+i; %Add this line         counter = "counter: "+i; %Add this line
         disp(counter); %Add this line         disp(counter); %Add this line
 +   end;
 +else
 +   filename = ['checkpoint-', num2str(id), '.out']; % Specify the file name where you want to search
 +   searchString = 'ended on counter'; % Specify the string you want to search for
 +   fileID = fopen(filename, 'r'); % Open the text file for reading
 +        if fileID == -1
 +           error('Unable to open the file.');
 +        end
 +   lineNumber = 0;
 +
 +% Read lines from the file and search for the target string
 +    while ~feof(fileID)
 +        line = fgetl(fileID);
 +        if ischar(line)
 +                lineNumber = lineNumber + 1;
 +                if ~isempty(strfind(line, searchString))
 +                num=regexp(line,'counter:\s(\d+)', 'tokens');
 +                counterNumber = str2double(num{1}{1});% Record the counter number
 +                end
 +        end
 +    end
 +    fclose(fileID); % Close the file
 +    for i =counterNumber:count; % Once the job restarted, it will continue from the last counter number
 +       sumMaxe = sumMaxe + maxEig(i,dim);
 +       counter = "counter: "+i;
 +       disp(counter);
 +    end;
 end; end;
 toc toc
Line 1609: Line 1627:
 </code> </code>
  
-The following changes will need to be added to batch.qs+The following changes will need to be added to ''batch.qs''. The option ''--requeue'' allows the job to be requeued, and ''scontrol requeue'' will automatically restart the job once it fails.
 <code> <code>
 ... ...
Line 1616: Line 1634:
 60 #SBATCH --time=0-01:30:00 60 #SBATCH --time=0-01:30:00
 ... ...
-75 #SBATCH --output %x-%j.out +75 #SBATCH --output=%x-%j.out 
-76 #SBATCH --error %x-%j.out+76 #SBATCH --error=%x-%j.out
 ... ...
 85 #SBATCH --mail-user='traine@udel.edu' 85 #SBATCH --mail-user='traine@udel.edu'
 86 #SBATCH --mail-type=END,FAIL,TIME_LIMIT_90 86 #SBATCH --mail-type=END,FAIL,TIME_LIMIT_90
 +87 #SBATCH --requeue # allow job requeue
 +88 #SBATCH --open-mode=append # the output will append
 ... ...
-108 job_exit_handler() { +90 max_restarts=1  #only requires a single restart 
-109   counter=$(tail -n 2  ${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out | head -n 1) +91 scontext=$(scontrol show job $SLURM_JOB_ID) 
-110   echo "Job ${SLURM_JOB_NAME} ended on ${counter}" +92 restarts=$(echo "$scontext" | grep -o 'Restarts=.' | cut -d= -f2) # get the restart number 
-111   #matlab -nodisplay -nojvm -batch disp(getReport(err,'extended')); quit;" +93 job_exit_handler() { 
-112   # Copy all our output files back to the original job directory: +94 counter=$(tail -n 2  ${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out | head -n 1) 
-113   #cp * "$SLURM_SUBMIT_DIR" +95   echo "Job ${SLURM_JOB_NAME} ended on ${counter}" 
-114 +96   if [[ $restarts -lt $max_restarts ]];then 
-115   # Don't call again on EXIT signal, please: +97        scontrol requeue ${SLURM_JOB_ID} #automatically resubmit the job once 
-116   trap - EXIT +98   #matlab -nodisplay -nojvm -batch disp(getReport(err,'extended')); quit;" 
-117   exit 0 +99   # Copy all our output files back to the original job directory: 
-118 +100   #cp * "$SLURM_SUBMIT_DIR" 
-119 export UD_JOB_EXIT_FN=job_exit_handler+101 
 +102   # Don't call again on EXIT signal, please: 
 +103      trap - EXIT 
 +104      exit 0 
 +105  else 
 +106      trap - EXIT 
 +107      echo "Your job is over the Maximum restarts limit" 
 +108      exit 1 
 +109  fi 
 +110 
 +111  
 +112 export UD_JOB_EXIT_FN=job_exit_handler
 ... ...
 142 # 142 #
Line 1639: Line 1670:
 144 export UD_JOB_EXIT_FN_SIGNALS="SIGTERM EXIT" 144 export UD_JOB_EXIT_FN_SIGNALS="SIGTERM EXIT"
 145 #Loading MATLAB 145 #Loading MATLAB
-146 vpkg_require matlab/r2018b+146 vpkg_require matlab/r2019b
 147 #Running the matlab script 147 #Running the matlab script
 148 UD_EXEC matlab -nodisplay -nojvm -batch "try; script; catch ERR; disp(job_exit_handler(ERR.getReport)); quit; end" 148 UD_EXEC matlab -nodisplay -nojvm -batch "try; script; catch ERR; disp(job_exit_handler(ERR.getReport)); quit; end"
Line 1645: Line 1676:
 </code> </code>
 ==== Running the checkpoint job and its output ==== ==== Running the checkpoint job and its output ====
-We know from the MCR example that this script takes between 2-3 hours to run. In the changes we made to ''batch.qs'' script, we set the wall clock to 1 hour and 30 minutes. That should insure that this script will fail to complete before the wall clock runs out of time. This is shown in the following job submission example.+We know from the MCR example that this script takes between 2-3 hours to run. In the changes we made to the ''batch.qs'' script, we set the wall clock to 40 minutes to demonstrate how the checkpointing works. That should ensure that the script will not finish before the wall clock runs out of time. This is shown in the following job submission example.
  
 <code> <code>
 [(it_css:traine)@login01 matlab_checkpoint]$ sbatch batch.qs [(it_css:traine)@login01 matlab_checkpoint]$ sbatch batch.qs
-Submitted batch job 8382365+Submitted batch job 20426672
 </code> </code>
 After the wall clock runs out we will see the following output. After the wall clock runs out we will see the following output.
 <code> <code>
-[(it_css:traine)@login01 matlab_checkpoint]$ tail -n 30 checkpoint-8382365.out+[(it_css:traine)@login01 matlab_checkpoint]$ less checkpoint-20426672.out 
 +Adding package `matlab/r2019b` to your environment 
 +-- Registered exit function 'job_exit_handler' for signal(s) SIGTERM EXIT
  
-   65.5558 
  
-counter: 105+maxe = 
 + 
 +   70.0220 
 + 
 +counter: 1
  
 maxe = maxe =
  
-   70.1761+   71.7546
  
-counter: 106+counter: 2
  
 maxe = maxe =
  
-   68.9765+   70.8331
  
-counter: 107+counter: 3
  
 maxe = maxe =
  
-   69.9773+   70.5714
  
-counter: 108+counter: 4
  
 maxe = maxe =
  
-   69.3456+   69.4923
  
-counter: 109 +counter: 
-slurmstepd: error: *** JOB 8382365 ON r00n47 CANCELLED AT 2020-05-20T19:58:11 DUE TO TIME LIMIT *** +... 
-Job checkpoint ended on counter: 109+maxe = 
 + 
 +   70.2614 
 + 
 +counter: 53 
 +slurmstepd: error: *** JOB 20426672 ON r01n13 CANCELLED AT 2023-10-03T17:31:54 DUE TO TIME LIMIT *** 
 +Job 20426672 ended on counter: 53 
 +Adding package `matlab/r2019b` to your environment 
 +-- Registered exit function 'job_exit_handler' for signal(s) SIGTERM EXIT 
 + 
 + 
 +maxe = 
 + 
 +   70.2614 
 + 
 +counter: 53 
 + 
 +maxe = 
 + 
 +   58.9360 
 + 
 +counter: 54 
 + 
 +maxe = 
 + 
 +   69.0119 
 + 
 +counter: 55 
 +... 
 +maxe = 
 + 
 +   70.7254 
 + 
 +counter: 104 
 + 
 +maxe = 
 + 
 +   65.5558 
 + 
 +counter: 105 
 +slurmstepd: error: *** JOB 20426672 ON r01n13 CANCELLED AT 2023-10-03T18:19:25 DUE TO TIME LIMIT *** 
 +Job 20426672 ended on counter: 105 
 +Your job is over the Maximum restarts limit
 </code> </code>
  
-Now we know that the script completed about 100 of the 200 loop intervals before the wall clock expired. This example could be expanded to allow the job to be re-queued (by using the Slurm option ''--requeue''and start at the last loop interval instead of restarting from the beginning if the logic in the ''script.m'' is modified accordingly+Now we know that the script completed about 53 of the 200 loop intervals before the wall clock expired. The job was then requeued and restarted from the 53rd loop interval, and it stopped again at counter 105 when the wall clock expired; because the maximum restart limit we set had been reached, the job was not requeued a second time.
  
 <note tip>If you don't want to wait the full amount of time that the wall clock is set to you can use the command "scancel" to manually stop the job and trigger the job_exit_handler() function. As shown for job 8390581 <note tip>If you don't want to wait the full amount of time that the wall clock is set to you can use the command "scancel" to manually stop the job and trigger the job_exit_handler() function. As shown for job 8390581
Line 1696: Line 1774:
 -- Registered exit function 'job_exit_handler' for signal(s) SIGTERM -- Registered exit function 'job_exit_handler' for signal(s) SIGTERM
  
-Adding package `matlab/r2018b` to your environment +Adding package `matlab/r2019b` to your environment 
- +                    
-                            < M A T L A B (R) > +
-                  Copyright 1984-2018 The MathWorks, Inc. +
-                   R2018b (9.5.0.944444) 64-bit (glnxa64) +
-                              August 28, 2018 +
- +
- +
-For online documentation, see https://www.mathworks.com/support +
-For product information, visit www.mathworks.com. +
- +
 maxe = maxe =
  