Differences
This shows you the differences between two versions of the page.
software:matlab:caviness [2021-01-13 18:19] – [Batch job serial example] anita | software:matlab:caviness [2023-10-06 10:29] (current) – [Running the checkpoint job and its output] thuachen | ||
---|---|---|---|
Line 134: | Line 134: | ||
</ | </ | ||
===== Create a job script file ===== | ===== Create a job script file ===== | ||
- | You should create a job script file to submit a batch job. Start by modifying a batch job script template file (''/ | + | You should create a job script file to submit a batch job. Start by modifying a batch job script template file (''/ |
In your newly copied serial.qs file, add the following lines at the end. | In your newly copied serial.qs file, add the following lines at the end. | ||
< | < | ||
- | [traine@login00 matlab_example]$ cp / | + | [traine@login00 matlab_example]$ cp /opt/shared/ |
</ | </ | ||
< | < | ||
Line 297: | Line 297: | ||
===== Compiling your Matlab code ===== | ===== Compiling your Matlab code ===== | ||
- | There is an example MCR project in the ''/ | + | There is an example MCR project in the ''/ |
==== Copy dev-projects template ==== | ==== Copy dev-projects template ==== | ||
Line 305: | Line 305: | ||
[traine@login00 ~]$ workgroup -g it_css | [traine@login00 ~]$ workgroup -g it_css | ||
[(it_css: | [(it_css: | ||
- | [(it_css: | + | [(it_css: |
[(it_css: | [(it_css: | ||
Line 373: | Line 373: | ||
< | < | ||
[(it_css: | [(it_css: | ||
- | [(it_css: | + | [(it_css: |
[(it_css: | [(it_css: | ||
- | [(it_css: | + | [(it_css: |
[(it_css: | [(it_css: | ||
Adding package `mcr/2019b` to your environment | Adding package `mcr/2019b` to your environment | ||
Line 623: | Line 623: | ||
[(it_css: | [(it_css: | ||
[(it_css: | [(it_css: | ||
- | [(it_css: | + | [(it_css: |
[(it_css: | [(it_css: | ||
</ | </ | ||
Line 639: | Line 639: | ||
86 #SBATCH --mail-user=' | 86 #SBATCH --mail-user=' | ||
87 #SBATCH --mail-type=END, | 87 #SBATCH --mail-type=END, | ||
- | ... | ||
- | 94 #SBATCH --exclusive | ||
... | ... | ||
137 # | 137 # | ||
Line 657: | Line 655: | ||
Make sure you change the '' | Make sure you change the '' | ||
The '' | The '' | ||
- | |||
- | <note important> | ||
- | The ''# | ||
- | < | ||
- | #SBATCH --ntasks=5 | ||
- | #SBATCH --mem=1G | ||
- | </ | ||
- | If everyone in your group sets these values carefully, multiple jobs can run concurrently on the node. Also you can use '' | ||
- | |||
- | See [[maxNumCompThreadsGridEngine|Setting maximum number of computational threads]] | ||
- | |||
- | </ | ||
<note tip> | <note tip> | ||
Line 684: | Line 670: | ||
* Do set paper dimensions and print each figure to a file. | * Do set paper dimensions and print each figure to a file. | ||
- | The text output will be included in the standard | + | The text output will be included in the standard |
We suggest setting the current figure' | We suggest setting the current figure' | ||
Line 855: | Line 841: | ||
%% Configure parpool | %% Configure parpool | ||
myCluster = parcluster(' | myCluster = parcluster(' | ||
- | myCluster.NumWorkers = str2double(getenv(' | + | myCluster.NumWorkers = str2double(getenv(' |
myCluster.JobStorageLocation = getenv(' | myCluster.JobStorageLocation = getenv(' | ||
myPool = parpool(myCluster, | myPool = parpool(myCluster, | ||
Line 879: | Line 865: | ||
Copy the template '' | Copy the template '' | ||
< | < | ||
- | cp / | + | cp /opt/shared/ |
</ | </ | ||
Make the following changes to the code | Make the following changes to the code | ||
Line 886: | Line 872: | ||
19 #SBATCH --ntasks=20 | 19 #SBATCH --ntasks=20 | ||
... | ... | ||
- | 37 #SBATCH --mem-per-cpu=4G | + | 37 #SBATCH --mem=60G |
... | ... | ||
54 #SBATCH --job-name=matlab-pscript | 54 #SBATCH --job-name=matlab-pscript | ||
Line 933: | Line 919: | ||
====== Interactive job example ====== | ====== Interactive job example ====== | ||
- | The basic steps to running a [[: | + | The basic steps to running a [[: |
Line 940: | Line 926: | ||
- | ==== Scheduling | + | ==== Scheduling interactive job ==== |
Create a directory and add [[# | Create a directory and add [[# | ||
< | < | ||
Line 951: | Line 937: | ||
maxEig.m | maxEig.m | ||
</ | </ | ||
- | Start an interactive session on a compute node with the '' | + | Start an interactive session on a compute node with the '' |
< | < | ||
- | [(it_css: | + | [(it_css: |
salloc: Pending job allocation 7985695 | salloc: Pending job allocation 7985695 | ||
salloc: job 7985695 queued and waiting for resources | salloc: job 7985695 queued and waiting for resources | ||
Line 1029: | Line 1015: | ||
This example is based on the '' | This example is based on the '' | ||
- | When using the parallel toolbox, you should log on to a compute node and with the '' | + | When using the parallel toolbox, you should log on to a compute node using a workgroup partition | ||
< | < | ||
- | [(it_css: | + | [(it_css: |
salloc: Pending job allocation 7993736 | salloc: Pending job allocation 7993736 | ||
salloc: job 7993736 queued and waiting for resources | salloc: job 7993736 queued and waiting for resources | ||
Line 1043: | Line 1029: | ||
</ | </ | ||
| | ||
- | This will effectively reserve | + | This will effectively reserve |
- | Here we request 20 workers with the parpool function, and then use parfor to send a different seed to each worker. | + | Here we request 20 workers with the '' |
<note important> | <note important> | ||
- | It took about 100 seconds for all 20 workers to produce | + | It took about 100 seconds for all 20 workers to produce |
< | < | ||
Line 1064: | Line 1050: | ||
>> myCluster = parcluster(' | >> myCluster = parcluster(' | ||
- | >> myCluster.NumWorkers = str2double(getenv(' | + | >> myCluster.NumWorkers = str2double(getenv(' |
>> myCluster.JobStorageLocation = getenv(' | >> myCluster.JobStorageLocation = getenv(' | ||
- | >> myPool = parpool(myCluster); | + | >> myPool = parpool(myCluster, myCluster.NumWorkers); |
Starting parallel pool (parpool) using the ' | Starting parallel pool (parpool) using the ' | ||
- | Connected to the parallel pool (number of workers: | + | Connected to the parallel pool (number of workers: |
- | >> | + | >> |
maxe = | maxe = | ||
Line 1101: | Line 1087: | ||
| | ||
- | Elapsed time is 1368.233648 | + | Elapsed time is 918.822702 |
</ | </ | ||
- | Once the job is completed exit MATLAB and release the interactive compute node. | + | Once the job is completed, delete your pool and exit MATLAB, and release the interactive compute node by typing '' |
< | < | ||
+ | >> delete(myPool); | ||
+ | Parallel pool using the ' | ||
>> exit | >> exit | ||
[traine@r00g01 matlab_interact]$ exit | [traine@r00g01 matlab_interact]$ exit | ||
Line 1210: | Line 1198: | ||
The '' | The '' | ||
- | ''/ | + | ''/ |
< | < | ||
- | [(it_css: | + | [(it_css: |
</ | </ | ||
Line 1218: | Line 1206: | ||
< | < | ||
... | ... | ||
- | 20 #SBATCH --cpus-per-task=2 | + | 20 #SBATCH --ntasks=2 |
... | ... | ||
29 #SBATCH --mem=3G | 29 #SBATCH --mem=3G | ||
- | 30 #SBATCH --mem-per-cpu=1024M | ||
... | ... | ||
47 #SBATCH --job-name=maxEig | 47 #SBATCH --job-name=maxEig | ||
Line 1579: | Line 1566: | ||
==== Gathering code for job example ==== | ==== Gathering code for job example ==== | ||
- | First we'll create a new directory and copy the needed code into it. | + | First, we'll create a new directory and copy the needed code into it. |
< | < | ||
Line 1585: | Line 1572: | ||
[(it_css: | [(it_css: | ||
[(it_css: | [(it_css: | ||
- | [(it_css: | + | [(it_css: |
</ | </ | ||
You will also want to put a copy of the [[# | You will also want to put a copy of the [[# | ||
- | Now we will need to make changes to '' | + | Now we will need to make changes to '' |
< | < | ||
% script to run maxEig function 200 times and print average. | % script to run maxEig function 200 times and print average. | ||
Line 1597: | Line 1584: | ||
sumMaxe = 0; | sumMaxe = 0; | ||
i = 0; | i = 0; | ||
+ | id = str2num(getenv(' | ||
+ | rc = 0; | ||
+ | rc = str2num(getenv(' | ||
tic; | tic; | ||
- | for i=1:count; | + | if isempty(rc); |
+ | for i=1:count; | ||
sumMaxe = sumMaxe + maxEig(i, | sumMaxe = sumMaxe + maxEig(i, | ||
counter = " | counter = " | ||
disp(counter); | disp(counter); | ||
+ | end; | ||
+ | else | ||
+ | | ||
+ | | ||
+ | | ||
+ | if fileID == -1 | ||
+ | | ||
+ | end | ||
+ | | ||
+ | |||
+ | % Read lines from the file and search for the target string | ||
+ | while ~feof(fileID) | ||
+ | line = fgetl(fileID); | ||
+ | if ischar(line) | ||
+ | lineNumber = lineNumber + 1; | ||
+ | if ~isempty(strfind(line, | ||
+ | num=regexp(line,' | ||
+ | counterNumber = str2double(num{1}{1}); | ||
+ | end | ||
+ | end | ||
+ | end | ||
+ | fclose(fileID); | ||
+ | for i =counterNumber: | ||
+ | | ||
+ | | ||
+ | | ||
+ | end; | ||
end; | end; | ||
toc | toc | ||
Line 1609: | Line 1627: | ||
</ | </ | ||
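The restart branch added to the script above reopens the job's output file and looks for the most recent ''counter:'' line so the loop can resume from that iteration. The same lookup can be sketched in shell, which is handy for checking an output file by hand before resubmitting. This is only a sketch under assumptions: the helper name `last_counter` is hypothetical, the progress lines are assumed to look like `counter: 53` as in the job output on this page, and the demo log contents are made-up sample values.

```shell
#!/bin/bash
# Hypothetical helper: print the number from the last "counter: N" line of a log.
# Assumes progress lines start with "counter:"; prints nothing if none is found.
last_counter() {
    sed -n 's/^counter: *\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Demo with a throwaway log file (sample values, not real job output).
demo_log=$(mktemp)
printf 'maxe =\n\n   1.0\n\ncounter: 52\nmaxe =\n\n   2.0\n\ncounter: 53\n' > "$demo_log"
echo "resume from iteration $(last_counter "$demo_log")"
rm -f "$demo_log"
```

Taking the last match rather than the first matters because the output file is opened in append mode, so earlier runs leave their own counter lines in the same file.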
- | The following changes will need to be added to batch.qs | + | The following changes will need to be added to batch.qs. The option '' |
< | < | ||
... | ... | ||
Line 1616: | Line 1634: | ||
60 #SBATCH --time=0-01: | 60 #SBATCH --time=0-01: | ||
... | ... | ||
- | 75 #SBATCH --output %x-%j.out | + | 75 #SBATCH --output=%x-%j.out |
- | 76 #SBATCH --error %x-%j.out | + | 76 #SBATCH --error=%x-%j.out |
... | ... | ||
85 #SBATCH --mail-user=' | 85 #SBATCH --mail-user=' | ||
86 #SBATCH --mail-type=END, | 86 #SBATCH --mail-type=END, | ||
+ | 87 #SBATCH --requeue # allow job requeue | ||
+ | 88 #SBATCH --open-mode=append # the output will append | ||
... | ... | ||
- | 108 job_exit_handler() { | + | 90 max_restarts=1 |
- | 109 counter=$(tail -n 2 ${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out | head -n 1) | + | 91 scontext=$(scontrol show job $SLURM_JOB_ID) |
- | 110 echo "Job ${SLURM_JOB_NAME} ended on ${counter}" | + | 92 restarts=$(echo " |
- | 111 #matlab -nodisplay -nojvm -batch disp(getReport(err,' | + | 93 job_exit_handler() { |
- | 112 # Copy all our output files back to the original job directory: | + | 94 counter=$(tail -n 2 ${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out | head -n 1) |
- | 113 #cp * " | + | 95 echo "Job ${SLURM_JOB_NAME} ended on ${counter}" |
- | 114 | + | 96 if [[ $restarts -lt $max_restarts ]];then |
- | 115 # Don't call again on EXIT signal, please: | + | 97 scontrol requeue ${SLURM_JOB_ID} # |
- | 116 trap - EXIT | + | 98 #matlab -nodisplay -nojvm -batch disp(getReport(err,' |
- | 117 exit 0 | + | 99 # Copy all our output files back to the original job directory: |
- | 118 } | + | 100 #cp * " |
- | 119 export UD_JOB_EXIT_FN=job_exit_handler | + | 101 |
+ | 102 # Don't call again on EXIT signal, please: | ||
+ | 103 | ||
+ | 104 | ||
+ | 105 else | ||
+ | 106 trap - EXIT | ||
+ | 107 echo "Your job is over the Maximum restarts limit" | ||
+ | 108 exit 1 | ||
+ | 109 fi | ||
+ | 110 } | ||
+ | 111 | ||
+ | 112 export UD_JOB_EXIT_FN=job_exit_handler | ||
... | ... | ||
142 # | 142 # | ||
Line 1639: | Line 1670: | ||
144 export UD_JOB_EXIT_FN_SIGNALS=" | 144 export UD_JOB_EXIT_FN_SIGNALS=" | ||
145 #Loading MATLAB | 145 #Loading MATLAB | ||
- | 146 vpkg_require matlab/r2018b | + | 146 vpkg_require matlab/r2019b |
147 #Running the matlab script | 147 #Running the matlab script | ||
148 UD_EXEC matlab -nodisplay -nojvm -batch "try; script; catch ERR; disp(job_exit_handler(ERR.getReport)); | 148 UD_EXEC matlab -nodisplay -nojvm -batch "try; script; catch ERR; disp(job_exit_handler(ERR.getReport)); | ||
Line 1645: | Line 1676: | ||
</ | </ | ||
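The exit handler above requeues the job only while the restart count is below `max_restarts`. A minimal sketch of that decision follows, assuming the count is read from a `Restarts=N` field in the text printed by `scontrol show job`. The helper name `parse_restarts` is hypothetical, and the sample string is a stand-in for illustration rather than captured output. The truncated command in the script above is not reproduced here.

```shell
#!/bin/bash
# Hypothetical helper: extract N from a "Restarts=N" field in job-description text.
# Falls back to 0 when the field is absent.
parse_restarts() {
    local n
    n=$(printf '%s\n' "$1" | grep -oE 'Restarts=[0-9]+' | head -n 1 | cut -d= -f2)
    echo "${n:-0}"
}

max_restarts=1
# Stand-in text; inside a job this would come from: scontrol show job "$SLURM_JOB_ID"
sample='Requeue=1 Restarts=1 BatchFlag=1'
restarts=$(parse_restarts "$sample")

if [ "$restarts" -lt "$max_restarts" ]; then
    echo "would requeue (restart $restarts of $max_restarts)"
else
    echo "restart limit reached"
fi
```

Defaulting to 0 when the field is missing keeps the numeric comparison from failing on an empty string.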
==== Running the checkpoint job and its output ==== | ==== Running the checkpoint job and its output ==== | ||
- | We know from the MCR example that this script takes between 2 and 3 hours to run. In the changes we made to '' | + | We know from the MCR example that this script takes between 2 and 3 hours to run. In the changes we made to '' | ||
< | < | ||
[(it_css: | [(it_css: | ||
- | Submitted batch job 8382365 | + | Submitted batch job 20426672 |
</ | </ | ||
After the wall clock runs out we will see the following output. | After the wall clock runs out we will see the following output. | ||
< | < | ||
- | [(it_css: | + | [(it_css: |
+ | Adding package `matlab/ | ||
+ | -- Registered exit function ' | ||
- | | ||
- | counter: | + | maxe = |
+ | |||
+ | | ||
+ | |||
+ | counter: | ||
maxe = | maxe = | ||
- | 70.1761 | + | 71.7546 |
- | counter: | + | counter: |
maxe = | maxe = | ||
- | 68.9765 | + | 70.8331 |
- | counter: | + | counter: |
maxe = | maxe = | ||
- | 69.9773 | + | 70.5714 |
- | counter: | + | counter: |
maxe = | maxe = | ||
- | 69.3456 | + | 69.4923 |
- | counter: | + | counter: |
- | slurmstepd: error: *** JOB 8382365 | + | ... |
- | Job checkpoint | + | maxe = |
+ | |||
+ | | ||
+ | |||
+ | counter: 53 | ||
+ | slurmstepd: error: *** JOB 20426672 | ||
+ | Job 20426672 ended on counter: 53 | ||
+ | Adding package `matlab/ | ||
+ | -- Registered exit function ' | ||
+ | |||
+ | |||
+ | maxe = | ||
+ | |||
+ | | ||
+ | |||
+ | counter: 53 | ||
+ | |||
+ | maxe = | ||
+ | |||
+ | 58.9360 | ||
+ | |||
+ | counter: 54 | ||
+ | |||
+ | maxe = | ||
+ | |||
+ | | ||
+ | |||
+ | counter: 55 | ||
+ | ... | ||
+ | maxe = | ||
+ | |||
+ | | ||
+ | |||
+ | counter: 104 | ||
+ | |||
+ | maxe = | ||
+ | |||
+ | | ||
+ | |||
+ | counter: 105 | ||
+ | slurmstepd: error: *** JOB 20426672 ON r01n13 CANCELLED AT 2023-10-03T18: | ||
+ | Job 20426672 | ||
+ | Your job is over the Maximum restarts limit | ||
</ | </ | ||
- | Now we know that the script completed about 100 of the 200 loop iterations before the wall clock expired. | + | Now we know that the script completed about 53 of the 200 loop iterations before the wall clock expired. | ||
<note tip>If you don't want to wait the full amount of time that the wall clock is set to you can use the command " | <note tip>If you don't want to wait the full amount of time that the wall clock is set to you can use the command " | ||
Line 1696: | Line 1774: | ||
-- Registered exit function ' | -- Registered exit function ' | ||
- | Adding package `matlab/r2018b` to your environment | + | Adding package `matlab/r2019b` to your environment |
- | + | ||
- | < M A T L A B (R) > | + | |
- | Copyright 1984-2018 The MathWorks, Inc. | + | |
- | | + | |
- | August 28, 2018 | + | |
- | + | ||
- | + | ||
- | For online documentation, | + | |
- | For product information, | + | |
- | + | ||
maxe = | maxe = | ||