====== Scheduling Jobs on Caviness ======

<note warning>In order to schedule any job (interactively or batch) on a cluster, you must set your **[[abstract:caviness:app_dev:compute_env#using-workgroup-and-directories|workgroup]]** to define your cluster group or //investing-entity// compute nodes. 
</note>

For example,

<code bash>
[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$
</code>

will set the workgroup to ''it_css'' for account ''traine'' which is reflected in the prompt change ''[(it_css:traine)@login00 ~]$'' showing the workgroup.

Keep in mind that job scheduling is complex. A job is not considered for execution the moment it is submitted. On each scheduling cycle, Slurm considers only the next N pending jobs for execution, so the more jobs users submit, the longer your job may wait to be considered. All users should therefore be good citizens: do not over-submit, be patient, and do not kill and resubmit jobs in an attempt to increase your priority.

<note tip>It is a good idea to periodically check in ''/opt/shared/templates/slurm/'' for updated or new [[technical:slurm:caviness:templates:start|templates]] to use as job scripts to run generic or specific applications designed to provide the best performance on Caviness.</note>

Need help? See [[http://www.hpc.udel.edu/presentations/intro_to_slurm/|Introduction to Slurm]] in UD's HPC community cluster environment.

===== Interactive jobs (salloc) =====

All interactive jobs should be scheduled to run on the compute nodes, not the login/head node.

An interactive session (job) can often be made non-interactive ([[abstract/caviness/runjobs/schedule_jobs#batch-jobs-qsub|batch job]]) by putting the input in a file, using the redirection symbols **<** and **>**, and making the entire command a line in a job script file:

<WRAP box>//program_name//  < //input_command_file//  > //output_command_file//</WRAP>

The resulting job script can then be scheduled as a [[abstract/caviness/runjobs/schedule_jobs#batch-jobs-qsub|batch job]].
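
As a minimal sketch of this pattern, the job script below wraps one redirected command. Here ''sort'' is only a stand-in for //program_name//, the file names mirror the placeholders above, and the ''#SBATCH'' values and the temporary directory are illustrative assumptions rather than site recommendations.

```bash
#!/bin/bash
# Illustrative job name and limits; adjust for your own work.
#SBATCH --job-name=redirect_demo
#SBATCH --ntasks=1
#SBATCH --time=10

# Work in a throwaway directory so the sketch can be tried anywhere.
workdir="$(mktemp -d)"

# Stand-in for the input you would otherwise type interactively.
printf '3\n1\n2\n' > "$workdir/input_command_file"

# Pattern: program_name < input_command_file > output_command_file
# ("sort -n" is a placeholder for your own program.)
sort -n < "$workdir/input_command_file" > "$workdir/output_command_file"

cat "$workdir/output_command_file"
```

Because bash treats the ''#SBATCH'' lines as comments, the same file can be run directly with ''bash'' for a quick check before it is submitted with ''sbatch''.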

=== Starting an interactive session ===

Remember you must specify your **[[abstract:caviness:app_dev:compute_env#using-workgroup-and-directories|workgroup]]** to define your cluster group or //investing-entity// compute nodes before submitting any job, and this includes starting an interactive session. Now use the Slurm command **salloc** on the login (head) node. Slurm will look for a node with a free //scheduling slot// (processor core) and a sufficiently light load, and then assign your session to it. If no such node becomes available, your **salloc** request will eventually time out. 

Type

<WRAP box>workgroup -g <<//investing-entity//>></WRAP>
<WRAP box>salloc</WRAP>

to start a remote interactive shell on a node in the standard partition. Remember that jobs in the [[:abstract:caviness:runjobs:queues#the-standard-partition|standard partition]] can be preempted. To start a session on your workgroup's partition, use ''%%-%%-partition=<<//investing-entity//>>''.

<code bash>
[(it_css:traine)@login00 ~]$ salloc --partition=it_css
salloc: Granted job allocation 35789
salloc: Waiting for resource configuration
salloc: Nodes r01n48 are ready for job
[traine@r01n48 ~]$
</code>

Alternatively, write ''_workgroup_'' as the value of the ''%%-%%-partition'' option and Slurm will use the partition of the current workgroup.

<code bash>
[(it_css:traine)@login00 ~]$ salloc --partition=_workgroup_
salloc: Granted job allocation 35789
salloc: Waiting for resource configuration
salloc: Nodes r01n48 are ready for job
[traine@r01n48 ~]$
</code>


Type

<WRAP box>salloc %%-%%-partition=<<//investing-entity//>> %%-%%-nodes=2 /bin/bash -i</WRAP>

to open a shell on the login node itself and execute a series of ''srun'' commands against that allocation. Each use of ''srun'' inside the ''salloc'' session represents a job step.

Type
<code bash>
    exit
</code>

to terminate the interactive shell and release the scheduling slot(s).

All of the above commands work only after you have set a workgroup. If you do not specify a **[[abstract:caviness:app_dev:compute_env#using-workgroup-and-directories|workgroup]]**, you will get an error similar to this:

<code bash>
[traine@login00 ~]$ salloc
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified
</code>

There is no way to avoid running the ''workgroup'' command before submitting a job or requesting an interactive session.

== Acceptable nodes for interactive sessions ==

Use the login (head) node for interactive program development including Fortran, C, and C++ program compilation. Use Slurm (**salloc**) to start interactive shells on your workgroup //investing-entity// compute nodes for testing or running applications.
 
===== Batch jobs (sbatch) =====

A batch job is a command to be executed now or at any time in the future. Batch jobs are encapsulated as a shell script (called a job script). The job script can contain special comment lines that provide flags to influence its submission and scheduling. By contrast, the ''srun'' and ''salloc'' commands attempt to execute remote commands immediately; if resources are not available, they will not return until resources become available or the user cancels them (by means of <Ctrl>-C).

Slurm provides the **sbatch** command for scheduling batch jobs:

^ command ^ Action ^
| ''sbatch'' <<//command_line_options//>> <<//job_script//>> | Submit job with script command in the file <<//job_script//>> |

For example,

   sbatch myproject.qs

The file ''myproject.qs'' contains bash shell commands and **SBATCH** statements that specify options and resource requests. Each **SBATCH** statement is a comment line beginning with ''#SBATCH''.
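
For illustration only, a job script of this kind might look like the sketch below. The option values are assumptions chosen for the example, not recommended settings, and the site templates remain the better starting point for real work.

```bash
#!/bin/bash
# Name shown for the job.
#SBATCH --job-name=myproject
# Use the partition of the current workgroup, as described for salloc above.
#SBATCH --partition=_workgroup_
# One copy of the command, with a wall time limit of 30 minutes.
#SBATCH --ntasks=1
#SBATCH --time=30

# Ordinary bash commands follow the #SBATCH lines.
# SLURM_JOB_ID is set by Slurm; the fallback text only matters outside a job.
message="Job ${SLURM_JOB_ID:-(not under Slurm)} running on $(hostname)"
echo "$message"
```

Options given on the ''sbatch'' command line use the same names as the ''#SBATCH'' lines in the file.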

<note important>
We strongly recommend that you use a script file that you pattern after the prototypes in ''/opt/shared/templates'' by using one of our [[technical:slurm:caviness:templates:start|templates]] and save your job script files within a ''$WORKDIR'' (private work) directory. There are ''README.md'' files in each subdirectory to explain the use of these templates.

Reusable job scripts help you maintain a consistent batch environment across runs. 
</note>

<note important>See also [[abstract:caviness:runjobs:schedule_jobs#command-options-for-sbatch|resource options]] to specify [[abstract:caviness:runjobs:schedule_jobs#time|time]], [[abstract:caviness:runjobs:schedule_jobs#cpu-cores|cpu]] cores,  [[abstract:caviness:runjobs:schedule_jobs#memory|memory]] free and/or available, and also request [[abstract:caviness:runjobs:schedule_jobs#caviness-exclusive-access|exclusive]] access.</note>

===== Slurm environment variables =====

In every batch session, Slurm sets environment variables that are useful within job scripts. Here are some common examples. The rest can be found online in the Slurm documentation.

^ Environment variable ^ Contains ^
| **HOSTNAME** | Name of the execution (compute) node |
| **SLURM_JOB_ID** | Batch job id assigned by Slurm |
| **SLURM_JOB_NAME** | Name you assigned to the batch job |
| **SLURM_JOB_NUM_NODES** | Number of //nodes// allocated to job |
| **SLURM_CPUS_PER_TASK** | Number of cpus requested per task. Only set if the ''%%--%%cpus-per-task'' option is specified for a threaded job |
| **SLURM_ARRAY_TASK_ID** | Task id of an array job sub-task (See [[#array-jobs|Array jobs]]) |
| **SLURM_TMPDIR** | Name of directory on the (compute) node scratch filesystem |

When Slurm assigns one of your job's tasks to a particular node, it creates a temporary work directory on that node's local scratch disk (900GB SSD for base nodes or 32TB (8 x 4TB SSD) for enhanced local scratch nodes). When the task assigned to that node finishes, Slurm removes the directory and its contents. The form of the directory name is

<code bash>
/tmp/[$SLURM_JOB_ID]/0
</code>

For example, after typing ''salloc'' on the head node, an interactive job 1185 (''$SLURM_JOB_ID'') is allocated on node ''r00n45''
<code bash>
[traine@login00 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ salloc
salloc: Granted job allocation 1185
salloc: Waiting for resource configuration
salloc: Nodes r00n45 are ready for job
[traine@r00n45 ~]$
</code>

and now we are ready to use our interactive session on node ''r00n45'', so using ''echo $TMPDIR'' we can see the name of the node scratch directory for this interactive job.

<code bash>
[traine@r00n45 ~]$ echo $TMPDIR
/tmp/1185/0
</code>
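
One common use of this directory, sketched below under stated assumptions, is to keep intermediate files on node-local scratch and copy only the final result elsewhere. The fallback to ''/tmp'' and the ''results_dir'' location are hypothetical stand-ins so that the sketch also runs outside a Slurm job; in a real job the results would more likely go to a directory under ''$WORKDIR''.

```bash
#!/bin/bash
# Node-local scratch: TMPDIR inside a job (see the transcript above);
# /tmp is only a fallback for trying the sketch outside Slurm.
scratch="${TMPDIR:-/tmp}"

# Hypothetical destination for results that must outlive the job.
results_dir="$(mktemp -d)"

# Write intermediate data to scratch.
workfile="$(mktemp "$scratch/demo.XXXXXX")"
echo "intermediate data" > "$workfile"

# Keep only a small summary (the byte count of the intermediate file).
wc -c < "$workfile" > "$results_dir/summary.txt"

# Clean up explicitly; Slurm also removes the job's scratch directory afterwards.
rm -f "$workfile"
cat "$results_dir/summary.txt"
```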

See [[:abstract:caviness:filesystems:filesystems|Filesystems]] and [[:abstract:caviness:app_dev:compute_env|Computing environment]] for more information about the node scratch filesystem and using environment variables.

===== Command options =====

^Options^Description^
|''%%--%%array=//<indexes>//''|job array specifications for sbatch only (See [[#array-jobs|Array jobs]])|
|''%%--%%comment=//<string>//''|alternate description of the job (more verbose than job name)|
|''%%--%%cpus-per-task=//<#>//''|each copy of the command should have this many CPU cores allocated to it|
|''%%--%%exclusive''|node(s) allocated to the job must have no other jobs running on them|
|''%%--%%exclusive=user''|node(s) allocated to the job must have no jobs from other users running on them; other jobs submitted by the same user may share the node(s)|
|''%%--%%job-name=//<string>//''|descriptive name for the job|
|''%%--%%mail-user=//<email-address>//''|deliver state-change notification emails to this address|
|''%%--%%mail-type=//<state>{,<state>..}//''|deliver notification emails when the job enters the state(s) indicated|
|''%%--%%mem=//<#>//''|total amount of real memory to allocate to the job|
|''%%--%%mem-per-cpu=//<#>//''|amount of memory to allocate to each CPU core allocated to the job|
|''%%--%%nodes=//<#>//''|execute the command on this many distinct nodes|
|''%%--%%ntasks=//<#>//''|execute this many copies of the command|
|''%%--%%ntasks-per-node=//<#>//''|execute this many copies of the command on each distinct node|
|''%%--%%partition=//<partition-name>//''|execute the command in this partition|
|''%%--%%requeue''|if this job is preempted by a higher-priority job, automatically resubmit it to execute again using the same parameters and job script|
|''%%--%%time=//<time-spec>//''|indicates a maximum wall time limit for the job|

Slurm tries to satisfy all of the resource-management options you specify in a job script or as sbatch command-line options. 

==== Time ====

If no ''%%--%%time=//<time-spec>//'' option is specified, then the default time allocated is 30 minutes.

The ''//<time-spec>//'' can be of the following formats:
  * ''<#>'' - minutes
  * ''<#>:<#>'' - minutes and seconds
  * ''<#>:<#>:<#>'' - hours, minutes, and seconds
  * ''<#>-<#>'' - days and hours
  * ''<#>-<#>:<#>'' - days, hours, and minutes
  * ''<#>-<#>:<#>:<#>'' - days, hours, minutes, and seconds
Thus, specifying ''%%--%%time=4'' indicates a wall time limit of four minutes and ''%%--%%time=4-0'' indicates four days.

<note warning>
Make sure the wall time is written in one of the formats listed above. One of the most frequently seen errors is a job terminating about 1 minute after it starts (with a "TIMEOUT" error message).

#SBATCH %%-%%-time=1 00:00:00

doesn't look that different from

#SBATCH %%-%%-time=1-00:00:00
</note>

In the above case, the former is interpreted as 1 minute with a trailing second argument of "00:00:00" while the second equals 1 day.  
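
The formats can be checked with a small helper before submitting. The function below is only a sketch of the rules listed in this section, not Slurm's own parser, and ''timespec_to_seconds'' is a hypothetical name introduced for the example.

```bash
#!/bin/bash
# Convert a --time value to seconds, following the six formats listed above.
timespec_to_seconds() {
    local spec="$1" days=0 hours=0 minutes=0 seconds=0
    local -a fields
    if [[ "$spec" == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days="${spec%%-*}"
        IFS=: read -r -a fields <<< "${spec#*-}"
        hours="${fields[0]:-0}"
        minutes="${fields[1]:-0}"
        seconds="${fields[2]:-0}"
    else
        # minutes, minutes:seconds, or hours:minutes:seconds
        IFS=: read -r -a fields <<< "$spec"
        case "${#fields[@]}" in
            1) minutes="${fields[0]}" ;;
            2) minutes="${fields[0]}"; seconds="${fields[1]}" ;;
            3) hours="${fields[0]}"; minutes="${fields[1]}"; seconds="${fields[2]}" ;;
            *) echo "unrecognized time-spec: $spec" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

timespec_to_seconds 4            # four minutes: prints 240
timespec_to_seconds 4-0          # four days: prints 345600
timespec_to_seconds 1-00:00:00   # one day: prints 86400
```

Passing only ''1'', which is what remains of ''1 00:00:00'' once the space splits the value, gives 60 seconds, while ''1-00:00:00'' gives 86400.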

==== CPU cores ====

The number of CPU cores associated with a job and the scheme by which they are allocated on nodes can be controlled loosely or strictly by the flags mentioned above. Omitting all such flags results in the default of a single task on a single node, meaning 1 CPU core will be allocated for your job.

<note important>
Always associate //tasks// with the number of copies of a program, and //cpus-per-task// with the number of threads each copy of the program may use.  While tasks can be distributed across multiple nodes, the cores indicated by cpus-per-task must all be present on the same node.  Thus, programs parallelized with OpenMP directives would primarily be submitted using the ''%%--%%cpus-per-task'' flag, while MPI programs would use the ''%%--%%ntasks'' or ''%%--%%ntasks-per-node'' flag.  Programs capable of hybrid MPI execution would use a combination of the two.
</note>
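
As a hedged sketch of the threaded case, the job script below requests one task with four cores and passes the core count to OpenMP through ''OMP_NUM_THREADS''. The program name is hypothetical, and the fallback to one thread reflects the table above, which notes that ''SLURM_CPUS_PER_TASK'' is only set when ''%%--%%cpus-per-task'' is specified.

```bash
#!/bin/bash
# One task (one copy of the program) with four cores for its threads.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given,
# so fall back to a single thread otherwise.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Each task may use ${OMP_NUM_THREADS} thread(s)"

# ./my_openmp_program   (hypothetical executable; replace with your own)
```

An MPI job would instead raise ''%%--%%ntasks'' (or ''%%--%%ntasks-per-node'') and leave ''%%--%%cpus-per-task'' at its default.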

For example, putting the lines 
<file>
#SBATCH --time=60
#SBATCH --ntasks=4
</file>
in the job script tells Slurm to set a hard limit of 1 hour on the wall time for the job and requests 4 tasks, each mapped to a single processor core.

==== Memory ====

When reserving memory for your job with the ''%%--%%mem'' or ''%%--%%mem-per-cpu'' option, the value is taken to be in MB if no unit is given; otherwise use a suffix of k, M, G or T to denote kibibytes, mebibytes, gibibytes or tebibytes. By default, if no memory specification is provided, Slurm will allocate 1G per core for your job. For example, specifying

''%%--%%mem=8G''

tells Slurm to reserve 8 gibibytes of memory for your job. However, specifying the following two options

''%%--%%mem-per-cpu=8G %%--%%ntasks=4''

tells Slurm to allocate 8 gibibytes of memory per core, for a total of 32 gibibytes of memory for your job.

<note tip>Kibibyte, mebibyte, gibibyte and tebibyte are defined as powers of 1024, whereas kilobyte, megabyte, gigabyte and terabyte are defined as powers of 1000.</note>
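
The arithmetic behind the ''%%--%%mem-per-cpu'' example can be sketched as follows. The helper ''to_mib'' is a hypothetical name, handles whole-number values only, and assumes the default of one core per task.

```bash
#!/bin/bash
# Convert a Slurm memory value (optional k/M/G/T suffix) to MiB.
to_mib() {
    local value="$1" number unit
    number="${value%[kKmMgGtT]}"   # strip one trailing unit letter, if any
    unit="${value#"$number"}"      # whatever was stripped
    case "$unit" in
        ""|m|M) echo "$number" ;;
        k|K)    echo $(( number / 1024 )) ;;
        g|G)    echo $(( number * 1024 )) ;;
        t|T)    echo $(( number * 1024 * 1024 )) ;;
    esac
}

# --mem-per-cpu=8G with --ntasks=4 (one core per task assumed)
mem_per_cpu_mib="$(to_mib 8G)"
ntasks=4
total_mib=$(( mem_per_cpu_mib * ntasks ))
echo "--mem-per-cpu=8G --ntasks=4 reserves ${total_mib} MiB in total"
```

Here 8G is 8192 MiB per core, so four tasks reserve 32768 MiB, which is the 32 gibibytes stated above.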


The total memory purchased by each investing-entity (workgroup) is used to limit the HPC resources allowed in the priority-access (workgroup) partitions (previously configured as node-count and later removed due to the problems and solutions described in [[technical:slurm:caviness:arraysize-and-nodecounts|Revisions to Slurm Configuration v1.1.2 on Caviness]]). As a result, it is essential that every job is submitted with the amount of memory it will actually use, so that your workgroup's purchased HPC resources are used to best effect.

<note tip>In the process of addressing the QosGrpNodeLimit issue, it became evident that some additional adjustments to the configured node and workgroup memory sizes would be necessary. These issues and the proposed adjustments are outlined in greater detail on the UD HPC wiki: [[technical:slurm:caviness:node-memory-sizes|Revisions to Slurm Configuration v1.1.3 on Caviness]]. As a result, a completely new set of usable memory limits was defined for each node. </note>


The table below provides the usable memory values available for each type of node on Caviness.

^Node type               ^Slurm selection options                         ^RealMemory/MiB  ^RealMemory/GiB^
|Gen1/128 GiB            |%%--%%constraint='Gen1&128GB'                       |  126976|  124|
|Gen1/256 GiB            |%%--%%constraint='Gen1&256GB'                       |  256000|  250|
|Gen1/512 GiB            |%%--%%constraint='Gen1&512GB'                       |  514048|  502|
|Gen1/GPU/128 GiB        |%%--%%constraint='Gen1&128GB' %%--%%gres=gpu:p100:<N>   |  126976|  124|
|Gen1/GPU/256 GiB        |%%--%%constraint='Gen1&256GB' %%--%%gres=gpu:p100:<N>   |  256000|  250|
|Gen1/GPU/512 GiB        |%%--%%constraint='Gen1&512GB' %%--%%gres=gpu:p100:<N>   |  514048|  502|
|Gen1/NVMe/256 GiB       |%%--%%constraint=Gen1 %%--%%gres=nvme:1                 |  256000|  250|
|Gen2/192 GiB            |%%--%%constraint='Gen2&192GB'                       |  191488|  187|
|Gen2/384 GiB            |%%--%%constraint='Gen2&384GB'                       |  385024|  376|
|Gen2/768 GiB            |%%--%%constraint='Gen2&768GB'                       |  772096|  754|
|Gen2/1 TiB              |%%--%%constraint='Gen2&1024GB'                      |   1030144|  1006|
|Gen2/T4 GPU/192 GiB     |%%--%%constraint='Gen2&192GB' %%--%%gres=gpu:t4:1       |  191488|  187|
|Gen2/T4 GPU/384 GiB     |%%--%%constraint='Gen2&384GB' %%--%%gres=gpu:t4:1       |  385024|  376|
|Gen2/T4 GPU/768 GiB     |%%--%%constraint='Gen2&768GB' %%--%%gres=gpu:t4:1       |  772096|  754|
|Gen2/V100 GPU/384 GiB   |%%--%%constraint='Gen2&384GB' %%--%%gres=gpu:v100:<N>   |  385024|  376|
|Gen2/V100 GPU/768 GiB   |%%--%%constraint='Gen2&768GB' %%--%%gres=gpu:v100:<N>   |  772096|  754|
|Gen2.1/192 GiB          |%%--%%constraint='Gen2.1&192GB'                       |  191488|  187|
|Gen2.1/384 GiB          |%%--%%constraint='Gen2.1&384GB'                       |  385024|  376|
|Gen2.1/768 GiB          |%%--%%constraint='Gen2.1&768GB'                       |  772096|  754|
|Gen2.1/1 TiB            |%%--%%constraint='Gen2.1&1024GB'                      |  1030144|  1006|
|Gen3/192 GiB            |%%--%%constraint='Gen3&192GB'                       |  191488|  187|
|Gen3/384 GiB            |%%--%%constraint='Gen3&384GB'                       |  385024|  376|
|Gen3/768 GiB            |%%--%%constraint='Gen3&768GB'                       |  772096|  754|
|Gen3/1 TiB              |%%--%%constraint='Gen3&1024GB'                      |   1030144|  1006|
|Gen3/GPU/2 TiB          |%%--%%constraint='Gen3&2048GB' %%--%%gres=gpu:a40:<N>   |  2060288|  2012|
|Gen3/GPU/256 GiB        |%%--%%constraint='Gen3&256GB' %%--%%gres=gpu:a100:<N>   |  256000|  250|
|Gen3/GPU/512 GiB        |%%--%%constraint='Gen3&512GB' %%--%%gres=gpu:a100:<N>   |  514048|  502|
|Gen3/GPU/1 TiB          |%%--%%constraint='Gen3&1024GB' %%--%%gres=gpu:a100:<N>   |  1030144|  1006|
|Gen3/GPU/2 TiB          |%%--%%constraint='Gen3&2048GB' %%--%%gres=gpu:a100:<N>   |  2060288|  2012|

where '<N>' should be the number of GPUs and depends on the specific [[abstract:caviness:caviness#compute-nodes|compute node specifications]].

<note important>**VERY IMPORTANT:** Keep in mind that not all memory can be reserved for a node due to a small amount required for system use.  As a result, the maximum amount of memory that can be specified is based on what Slurm shows as available. For example, the baseline nodes in Caviness show a memory size of 124 GiB versus the 128 GiB of physical memory present in them. This means if you try to specify the full amount of memory (i.e. 128G), then Slurm will try to run the job on a larger memory node as long as you have access to a larger memory node. This will work if you specify the standard partition or if you specify a workgroup partition and your research group purchased a larger memory node, otherwise your job will never run. You may also use ''%%--%%mem=0'' to request all the memory on a node.</note>
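
To make the 124 GiB versus 128 GiB point concrete, the sketch below compares a request against the Gen1 RealMemory values copied from the table above. The function name is hypothetical, and the comparison only illustrates the rule described in the note; Slurm's actual node selection also depends on partition access and availability.

```bash
#!/bin/bash
# RealMemory values in MiB for the three Gen1 node sizes, copied from the table above.
gen1_real_memory_mib="126976 256000 514048"

# Print the smallest Gen1 RealMemory value that can hold the request (in MiB).
smallest_gen1_fit() {
    local request_mib="$1" real
    for real in $gen1_real_memory_mib; do
        if [ "$request_mib" -le "$real" ]; then
            echo "$real"
            return 0
        fi
    done
    echo "no Gen1 node can satisfy ${request_mib} MiB" >&2
    return 1
}

smallest_gen1_fit $(( 124 * 1024 ))   # --mem=124G fits a 128 GiB node: prints 126976
smallest_gen1_fit $(( 128 * 1024 ))   # --mem=128G needs the next size up: prints 256000
```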

==== Exclusive access ====

If a job is submitted with the ''%%--exclusive%%'' resource, the allocated nodes cannot be shared with other running jobs.

A job running on a node with ''%%--exclusive%%'' will block any other jobs from making use of resources on that host.
To make sure your program uses all the cores on a node when requesting exclusive access, include the ''%%--ntasks%%'' option in the job script, e.g., ''%%--ntasks=36%%''.

Job script example:
<code bash>
#SBATCH --nodes=2
# The exclusive flag requests that no other jobs share the nodes allocated to this job
#SBATCH --exclusive
#SBATCH --ntasks=36


...
</code>

The exclusive option works in two different ways in Slurm on Caviness: specifying ''%%-%%-exclusive'' or specifying ''%%-%%-exclusive=user'' when submitting a job. With the first form, the job is allocated all the resources available on the node regardless of what it requested; however, the job will only use the number of CPUs specified by the ''%%-%%-ntasks'' option. With the second form, ''=user'' means that multiple jobs submitted by the same user may run at the same time on the node assigned for that user's exclusive access.

==== GPU nodes ====

After setting your workgroup, GPU nodes can be requested through an interactive session using ''salloc'' or through batch submission using ''sbatch''. An appropriate partition name (such as your workgroup partition for running jobs, or ''devel'' if you need to compile on a GPU node) and a GPU resource and type **must** be specified when running the command, as shown below.

<code bash>
[(it_css:traine)@login00 matrixMul]$ salloc --partition=it_css --gres=gpu:p100
salloc: Granted job allocation 2239
salloc: Waiting for resource configuration
salloc: Nodes r01g00 are ready for job
[traine@r01g00 ~]$ nvidia-smi
Tue Aug 28 15:25:31 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   32C    P0    27W / 250W |      0MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
</code>

Also, if your workgroup has purchased more than one kind of GPU node, you need to specify the GPU type to target, such as ''%%--%%gres=gpu:p100'', ''%%--%%gres=gpu:v100'', ''%%--%%gres=gpu:t4'' or ''%%--%%gres=gpu:a100'', which requests 1 GPU by default, or use the form ''%%--%%gres=gpu:<<GPU type>>:<<#>>'' to request a specific number.  See [[abstract:caviness:runjobs:job_status#sworkgroup|sworkgroup]] to determine your workgroup resources including GPU node type. In the example below, this particular workgroup has (2) ''gpu:p100'', (2) ''gpu:v100'' and (2) ''gpu:a100'' types of GPUs available:

<code bash>
[traine@login00 ~]$ sworkgroup -g ececis_research --limits
Partition       Per user Per job Per workgroup
---------------+--------+-------+-----------------------------------------------------------------
devel           3 jobs   cpu=4
ececis_research                  cpu=248,mem=3075G,gres/gpu:p100=2,gres/gpu:v100=2,gres/gpu:a100=2
reserved
standard        cpu=720  cpu=360
</code>

Any user can employ a GPU by running in the ''standard'' partition. Keep in mind, however, that a GPU type **must** be specified and that jobs can be preempted, so your batch job script would require [[abstract:caviness:runjobs:schedule_jobs#handling-system-signals-aka-checkpointing|checkpointing]].  The interactive session example below requests any node with (2) v100 GPUs, plus 1 core, 1 GB of memory and 30 minutes of time (the default values when not specified), on the ''standard'' partition.

<code bash>
salloc --partition=standard --gres=gpu:v100:2
</code>
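
The same request could be written as a batch job along the lines of the sketch below. It is an illustration only: the use of ''%%--%%requeue'' is an assumption about what a preemptible job might want, and the guard around ''nvidia-smi'' exists so the script can be tried on machines without NVIDIA tools.

```bash
#!/bin/bash
# Same request as the salloc example above, expressed as a batch job.
#SBATCH --partition=standard
#SBATCH --gres=gpu:v100:2
# Ask Slurm to resubmit the job if it is preempted (see the options table).
#SBATCH --requeue

# Report the GPUs visible to the job; fall back to a message when
# nvidia-smi is missing or cannot talk to a driver.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_report="$(nvidia-smi 2>&1)" || gpu_report="nvidia-smi failed on $(hostname)"
else
    gpu_report="nvidia-smi not available on $(hostname)"
fi
echo "$gpu_report"
```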

If you are unsure of the GPU types and counts available in the ''standard'' partition, see [[abstract:caviness:caviness#compute-nodes|Compute Nodes]] on Caviness.
==== Enhanced Local Scratch nodes ====

The Lustre file system is typically leveraged when scratch storage in excess of the 960 GB provided by local SSD is necessary. As a network-shared medium, though, some workloads do not perform as well as they would with larger, faster storage local to the node. Software that caches data and accesses that data frequently or in small, random-access patterns may not perform well on Lustre. Some stakeholders indicated a desire for a larger amount of fast scratch storage physically present in the compute node.

Generation 1 featured two compute nodes with dual 3.2 TB NVMe storage devices. A scratch file system was striped across the two NVMe for efficient utilization of both devices. These nodes (''r02s00'' and ''r02s01'') are available to all Caviness users for testing.

The Generation 2 design significantly increases the capacity of the fast local scratch, providing a 100 Gbps Intel Omni-Path network port and eight 4 TB NVMe storage devices.

Using the option ''%%-%%-gres=nvme'' will target the enhanced local scratch nodes for jobs.
====== Interactive Jobs ======

As discussed, an //interactive job// allows a user to enter a sequence of commands manually.  The following qualify as being interactive jobs:

  * A program with a GUI:  e.g. creating graphs in Matlab
  * A program that requires manual input:  e.g. a menu-driven post-processing program
  * Any task that is more easily performed manually

As far as the final bullet point goes, suppose a user has a long-running batch job and must later extract results from its output using a single command that will execute for a short time (say five minutes).  While the user could go to the effort of creating a batch job, it may be easier to just run the command interactively and visually note its output.

===== Submitting an Interactive Job =====

In Slurm, interactive jobs are submitted to the job scheduler using the ''salloc'' command:

<code bash>
[(it_css:traine)@login00 ~]$ salloc    # after setting the workgroup with: workgroup -g it_css
salloc: Granted job allocation 906
salloc: Waiting for resource configuration
salloc: Nodes r00n45 are ready for job
[traine@r00n45 ~]$
</code>

Dissecting this output, we see that:

  - the job was assigned a numerical //job identifier// or //job id// of 906
  - the job is assigned to the ''standard'' partition with job resources tracked against the account workgroup //investing-entity//,  ''it_css''
  - the job is executing on compute node ''r00n45''
  - the final line is a shell prompt, running on ''r00n45'' and waiting for commands to be typed

All of the options applicable to ''sbatch'' in the [[abstract:caviness:runjobs:schedule_jobs#command-options-for-sbatch|above-mentioned table]] can also be specified when running the ''salloc'' command.

Example : 

<code bash>
[(it_css:traine)@login00 generic]$ salloc --mem=120G
salloc: Granted job allocation 7396
salloc: Waiting for resource configuration
salloc: Nodes r01n55 are ready for job
[traine@r01n55 generic]$


</code>

<code bash>
[(it_css:traine)@login00 generic]$ salloc --mem-per-cpu=120G
salloc: Granted job allocation 7403
salloc: Waiting for resource configuration
salloc: Nodes r01n55 are ready for job
[traine@r01n55 generic]$

</code>

What is not apparent from the text:

  * the shell prompt on compute node ''r01n55'' has as its working directory the directory in which the ''salloc'' command was typed (''generic'')
  * memory specified as 120G is the maximum amount of memory that can be specified for a 128G node; keep in mind you may get a larger memory node since all nodes are available in the ''standard'' partition
  * if resources had not been immediately available to this job, ''salloc'' would have waited, reporting that the job is queued and waiting for resources, and later resumed with the message about the allocation being granted

Another important command for running interactive jobs within Slurm is ''srun''. It lets you run simple commands on a node in the cluster:

<code bash>
[(it_css:traine)@login00 ~]$ srun /bin/hostname    # after setting the workgroup
r01n55.localdomain.hpc.udel.edu
</code>

The command can have arguments presented to it:

<code bash>
[(it_css:traine)@login00 ~]$ printf "%s - %s\n" "$(hostname)" "$(date)"
login00 - Mon Jul 23 15:53:31 EDT 2018

[(it_css:traine)@login00 ~]$ srun /bin/bash -c 'printf "%s - %s\n" "$(hostname)" "$(date)"'
r01n55.localdomain.hpc.udel.edu - Mon Jul 23 15:53:01 EDT 2018
</code>

The ''srun'' command accepts the same commonly-used options discussed for ''sbatch'' and ''salloc'' above.

By default, ''salloc'' will start a remote interactive shell on a node in the cluster.  The alternative use is to open a shell on the login node itself and execute a series of ''srun'' commands against that allocation:

<code bash>
[(it_css:traine)@login00 ~]$ salloc --nodes=2 /bin/bash -i
salloc: Granted job allocation 908
salloc: Waiting for resource configuration
salloc: Nodes r01n[46,51] are ready for job
[(it_css:traine)@login00 ~]$ hostname
login00
[(it_css:traine)@login00 ~]$ srun hostname
r01n46.localdomain.hpc.udel.edu
r01n51.localdomain.hpc.udel.edu
[(it_css:traine)@login00 ~]$ srun date
Mon Jul 23 16:27:41 EDT 2018
Mon Jul 23 16:27:41 EDT 2018
[(it_css:traine)@login00 ~]$ exit
exit
salloc: Relinquishing job allocation 908
</code>

Each use of ''srun'' inside the ''salloc'' session represents a //job step//.  The first use of ''srun'' is job step zero (0), the second job step 1, etc.  When referring to a specific job step, the syntax is ''<job-id>.<job-step>''.  The Slurm accounting mechanisms retain usage data for each job step as well as an aggregate for the entire job.
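
The numbering can be sketched with a short script. It assumes, as described above, that each ''srun'' call creates the next step in sequence; the ''launch'' helper is hypothetical and runs the command directly when no Slurm allocation is present, so the sketch can be tried on any machine.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2

# Use srun inside a Slurm allocation; otherwise run the command directly.
launch() {
    if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        srun "$@"
    else
        "$@"
    fi
}

# The first srun is step 0, the second is step 1, and so on.
step=0
for cmd in hostname date; do
    echo "job step ${SLURM_JOB_ID:-<job-id>}.${step}: ${cmd}"
    launch "$cmd"
    step=$(( step + 1 ))
done
```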

To dedicate (reserve) an entire node to your programs only, use the ''%%-%%-exclusive'' option. For more details, read about [[abstract:caviness:runjobs:schedule_jobs#caviness-exclusive-access|exclusive access]].
==== Naming your Job ====

It can be confusing if a user has many interactive jobs submitted at one time.  Taking a moment to name each interactive job according to its purpose may save the user a lot of effort later:

<code bash>
[(it_css:traine)@login00 it_css]$ salloc --job-name=test --partition=it_css    # after setting the workgroup
salloc: Granted job allocation 1164
salloc: Waiting for resource configuration
salloc: Nodes r00n45 are ready for job
[traine@r00n45 ~]$ echo $SLURM_JOB_NAME
test
</code>


The name provided with the ''%%-%%-job-name'' command-line option will be assigned to the interactive session/job that the user started.

===== Launching GUI Applications (VNC for X11 Applications) =====

Please review [[ technical:recipes/vnc-usage | using VNC for X11 Applications]] as an alternative to X11 Forwarding.


===== Launching GUI Applications (X11 Forwarding) =====

GUI applications can be launched on Caviness using X11 forwarding. However, some prerequisites must be met first.

For Windows, Xming is an X11 display server that must be installed and running (Windows XP and later), and a PuTTY session must be configured with X11 before launching GUI applications on Caviness. For help on configuring a PuTTY session with X11 see  [[http://www.udel.edu/it/research/training/config_laptop/puTTY.shtml|X-Windows (X11) and SSH document]] for Windows desktop use.

For Mac OS, the SSH connection has to be started with the ''-Y'' argument, ''ssh -Y caviness.hpc.udel.edu'', and [[https://support.apple.com/en-us/HT201341|XQuartz an X11 display server]] must be installed and running.

Once an SSH connection is established using X11 (and an X11 display server, Xming or XQuartz, is running), follow the steps below to test the session.

Type
<code bash>
[traine@login00 ~]$ workgroup -g it_css
</code>

<code bash>
[(it_css:traine)@login00 ~]$ echo $DISPLAY
localhost:15.0
</code>

Check if the current session is being run with X11 using ''xdpyinfo | grep display'' and the name of the display should match the output above.

<code bash>
[(it_css:traine)@login00 ~]$ xdpyinfo | grep display
name of display:    localhost:15.0
</code> 

<note important>If the current session is **not** being run with X11 then you will likely get an error. Below is an example of an error when Xming was not running for a Windows PuTTY session:

<code bash>
$ xdpyinfo | grep display
PuTTY X11 proxy: unable to connect to forwarded X server: Network error: Connection refused
xdpyinfo:  unable to open display "localhost:15.0".
</code>
</note>

Once the session is confirmed to be properly configured with X11 forwarding, we are ready to launch a GUI application on a compute node.

Type 
<code bash>
[(it_css:traine)@login00 ~]$ salloc --x11 -N1 -n1 --partition=_workgroup_
salloc: Granted job allocation 30298
salloc: Waiting for resource configuration
salloc: Nodes r01n10 are ready for job
</code>

This will launch an interactive job on one of the compute nodes with one CPU (core), 1G of memory and 30 minutes of run time (the default if no ''%%--%%time'' option is specified), using the current workgroup partition set with the [[abstract:caviness:app_dev:compute_env#using-workgroup-and-directories|workgroup]] command. If ''%%--%%partition'' is omitted, the job will be launched in the [[abstract:caviness:runjobs:queues#the-standard-partition|standard partition]], which is the default and can be preempted (killed without warning to make way for jobs requesting resources for a workgroup partition). 

Now the compute node and environment will be ready to launch any program that has a GUI (Graphical User Interface) and be displayed on your local computer display.

<note important>
The X11 protocol was never meant to handle graphically intensive operations (in terms of bitmaps/textures). Expect noticeable latency when running graphical interfaces over X11 on Linux/Unix systems.
</note>


Additionally, the ''%%-%%-x11'' argument can be augmented in this fashion ''%%-%%-x11=[batch|first|last|all]'' to the following effects:

  * ''%%-%%-x11=first'' This is the default, and provides X11 forwarding to the first of the compute hosts allocated.
  * ''%%-%%-x11=last'' This provides X11 forwarding to the last of the compute hosts allocated.
  * ''%%-%%-x11=all'' This provides X11 forwarding from all allocated compute hosts, which can be quite resource heavy and is an extremely rare use-case.
  * ''%%-%%-x11=batch'' This supports use in a batch job submission, and will provide X11 forwarding to the first node allocated to a batch job. 

These options can be tested using the ''xdpyinfo | grep display'' and ''echo $DISPLAY'' commands shown above.
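
The two checks used in this section (''echo $DISPLAY'' and ''xdpyinfo'') can be combined into one shell function to run before requesting an X11-enabled allocation. This is a minimal sketch; the function name ''x11_ready'' is hypothetical and is not part of the Caviness environment.

```bash
# x11_ready: succeed only if an X11 display is set and reachable.
# (Hypothetical helper for illustration.)
x11_ready() {
  # No DISPLAY means the SSH session was started without X11 forwarding
  if [ -z "${DISPLAY:-}" ]; then
    echo "DISPLAY is not set: X11 forwarding is not active" >&2
    return 1
  fi
  # xdpyinfo is the same tool used in the manual check above
  if ! command -v xdpyinfo >/dev/null 2>&1; then
    echo "xdpyinfo not found; cannot verify display $DISPLAY" >&2
    return 2
  fi
  # xdpyinfo fails if the local X server (Xming, XQuartz) is not running
  if ! xdpyinfo >/dev/null 2>&1; then
    echo "cannot open display $DISPLAY" >&2
    return 1
  fi
  echo "X11 forwarding OK on $DISPLAY"
}
```

A typical use would be ''x11_ready && salloc %%--%%x11 ...'', so that the allocation is only requested when the display can actually be opened.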
====== Batch Jobs (Script) ======

Options embedded in a job script as ''#SBATCH'' comments may be overridden by values on the ''sbatch'' command line.  The job script must:
  * use Unix-style newlines
  * have its executable bit set (e.g. using the ''chmod u+x'' command)
  * have the interpreting shell [[https://en.wikipedia.org/wiki/Shebang_(Unix)|shebang]] present on its first line
A collection of job script [[technical:slurm:caviness:templates:start|templates]] is maintained by the IT-HPC staff in the ''/opt/shared/templates'' directory on Caviness.  All templates therein are written for the Bash shell (the default shell on Linux).
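
The three requirements listed above can be checked before submission. The following is a minimal sketch; the function name ''check_job_script'' is hypothetical, and the checks are only approximations of what the scheduler expects.

```bash
# check_job_script FILE: report violations of the three job script
# requirements. (Hypothetical helper for illustration.)
check_job_script() {
  local script=$1 status=0

  # 1. Unix-style newlines: any carriage return suggests DOS line endings
  if grep -q $'\r' "$script"; then
    echo "$script: contains DOS (CRLF) line endings"
    status=1
  fi

  # 2. Executable bit (fix with: chmod u+x FILE)
  if [ ! -x "$script" ]; then
    echo "$script: executable bit is not set"
    status=1
  fi

  # 3. Shebang: the first two bytes of the file must be '#!'
  if [ "$(head -c 2 "$script")" != '#!' ]; then
    echo "$script: no shebang on the first line"
    status=1
  fi

  return "$status"
}
```

For example, ''check_job_script job_script_01.qs && sbatch job_script_01.qs'' submits the script only when all three checks pass.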
===== Submitting the Job =====

Batch jobs are submitted to the job scheduler using the ''sbatch'' command:

<code bash>
[(it_css:traine)@login00 it_css]$ sbatch job_script_01.qs 
Submitted batch job 1146
</code>

Notice that the job name defaults to the name of the job script; as discussed in the previous section, a job name can also be provided explicitly:

<file bash job_script_02.qs>
#!/bin/bash
#SBATCH --job-name=testing002
#SBATCH --output=my_job_op%j.txt

echo "Hello, world."
</file>
==== Specifying Options on the Command Line ====

It has already been demonstrated that command-line options to the ''sbatch'' command can be embedded in a job script.  Likewise, the options can be specified on the command line.  For example:

<code bash>
[(it_css:traine)@login00 it_css]$ sbatch --output 'output%j.txt' job_script_02.qs  # after setting the workgroup
Submitted batch job 1158
</code>

The ''%%-%%-output'' option was provided in the job script __and__ on the command line itself:  Slurm will honor options from the command line in preference to those embedded in the script.  Thus, in this case the "''output%j.txt''" provided on the command line overrode the "''my_job_op%j.txt''" from the job script.

The ''sbatch'' command has many options available, all of which are documented in its man page.  A few of the often-used options will be discussed here.

=== Default Options ===

There are several default options that are automatically added to every ''sbatch'' by Slurm as well as default resource requirements supplied, however an explanation of each is beyond the scope of this section.  Providing an alternate value for any of these arguments -- in the job script or on the ''sbatch'' command line -- overrides the default value.

=== Email Notifications ===

Since batch jobs can run unattended, the user may want to be notified of status changes for a job:  when the job begins executing; when the job finishes; or if the job was killed.  Slurm will deliver such notifications (as emails) to a job's owner if the owner requests them using the ''%%--%%mail-user'' option:

^Option^Description^
|''%%--%%mail-user=//<email-address>//''|deliver state-change notification emails to this address|
|''%%--%%mail-type=//<state>{,<state>..}//''|deliver notification emails when the job enters the state(s) indicated|
|''%%--%%requeue''|if this job is preempted by a higher-priority job, automatically resubmit it to execute again using the same parameters and job script|

Consult the man page for the ''sbatch'' command for a deeper discussion of each of the ''%%--%%mail-type'' states.  Valid state names are NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT_50, TIME_LIMIT_80, TIME_LIMIT_90, TIME_LIMIT, ARRAY_TASKS.  The time limit states with numbers indicate a percentage of the full runtime:  so enabling TIME_LIMIT_50 will see an email notification being delivered once 50% of the job's maximum runtime has elapsed.


==== Handling System Signals aka Checkpointing ====

Generally, there are two cases in which jobs are killed: (1) preemption and (2) the walltime configured in the job script has elapsed. Checkpointing can be used to intercept and handle the system signals in each of these cases to write out a restart file, perform cleanup or backup operations, or carry out any other tasks before the job gets killed. Of course, this depends on whether the application or software you are using is checkpoint-enabled.

<note important>Please review the comments provided in the Slurm job script templates available in ''/opt/shared/templates'', which demonstrate ways to trap these signals.</note> 

"TERM" is the most common system signal triggered in both of the above cases. Preemption, however, follows the specific sequence described below.

When a job gets submitted to a workgroup-specific partition and resources are tied-up by jobs in the ''standard'' partition, the jobs in the ''standard'' partition will be preempted to make way.  Slurm sends a preemption signal to the job (SIGCONT followed by SIGTERM) then waits for a grace period (5 minutes) before signaling again (SIGCONT followed by SIGTERM) then killing it (SIGKILL).  However, if the job is able to simply be re-run as-is, the user can submit with ''%%-%%-requeue'' to indicate that a ''standard'' job that was preempted should be rerun on the ''standard'' partition (possibly restarting immediately on different nodes, otherwise it will need to wait for resources to become available).

For example, using the logic provided in one of the Slurm job script templates, one can catch these signals during preemption and handle them by cleaning up or backing up the job results, as follows.

<code bash>
#SBATCH --job-name="atest"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:02:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL

#
# [EDIT] Define a Bash function and set this variable to its
#        name if you want to have the function called when the
#        job terminates (time limit reached or job preempted).
#
job_exit_handler() {
  # Copy all our output files back to the original job directory:
  cp * "$SLURM_SUBMIT_DIR"

  # Don't call again on EXIT signal, please:
  trap - EXIT
  exit 0
}
export UD_JOB_EXIT_FN=job_exit_handler


#
# [EDIT] By default, the function defined above is registered
#        to respond to the SIGTERM signal that Slurm sends
#        when jobs reach their runtime limit or are
#        preempted.  You can override with your own
#        list of signals using this variable -- as in this
#        example, which registers for both SIGTERM and the
#        EXIT pseudo-signal that Bash calls when the script
#        ends.  In effect, no matter whether the job is
#        terminated or completes, the UD_JOB_EXIT_FN will be
#        called.
#
export UD_JOB_EXIT_FN_SIGNALS="SIGTERM EXIT"

#Do your normal work here
UD_EXEC python test.py

</code>

To [[technical:slurm:caviness:templates:start#signal-handling|catch signals]] asynchronously in Bash, you have to run commands in the background and "wait" for them to complete.  This is why the [[technical:slurm:caviness:templates:start|templates]] include a shell function named UD_EXEC when you set UD_JOB_EXIT_FN to a trap function name. 

If you implement the restart logic at the start of the script, then you can avoid signal handling entirely by using the ''%%-%%-requeue'' option with ''sbatch''. This option tells Slurm to automatically move the job back into the queue to execute again when it is preempted.
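
The background-and-wait pattern mentioned above can be sketched in plain Bash. This is a minimal illustration of the technique, not the ''UD_EXEC'' implementation from the templates; the names ''run_interruptible'', ''on_term'' and ''HANDLER_RAN'' are hypothetical.

```bash
#!/bin/bash
# Flag set by the handler so later code can tell it ran (illustration only)
HANDLER_RAN=0

# Handler invoked when SIGTERM arrives; real cleanup would go here
on_term() {
  HANDLER_RAN=1
  echo "SIGTERM caught: write restart file / copy results here"
}
trap on_term TERM

# Run a command in the background and wait for it.  Bash only runs a
# trap handler between commands, so a foreground command would delay
# the handler until it finished; the wait builtin returns immediately
# (status > 128) when a trapped signal arrives.
run_interruptible() {
  local pid rc
  "$@" &
  pid=$!
  wait "$pid"
  rc=$?
  # If wait was interrupted and the command is still alive, stop it
  if [ "$rc" -gt 128 ] && kill -0 "$pid" 2>/dev/null; then
    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
  fi
  return "$rc"
}

# In a job script the work would be started as, for example:
#   run_interruptible python test.py
```

The design choice is the same one the text describes: the real work runs as a background child, so the shell itself stays free to react to the signal within the grace period.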

===== Job Output =====

Equally as important as executing the job is capturing any output produced by the job.  By default, all output (stdout and stderr) is sent to a single file named according to the formula

<code bash>
slurm-[job id].out
</code>

For the weather-processing example above, the output would be found in

<code bash>
[(it_css:traine)@login00 it_css]$ sbatch process_weather.qs
Submitted batch job 1158
[(it_css:traine)@login00 it_css]$ 
#
#   ... some time goes by ...
#
[(it_css:traine)@login00 it_css]$ ls *.o*
slurm-1158.out
</code>

<note tip>In the job script itself it is often counterproductive to redirect a constituent command's output to a file.  Allowing all output to stdout/stderr to be directed to the file provided by Slurm automatically provides a degree of "versioning" of all runs of the job by way of the ''-[job id]'' suffix on the output file's name.</note>

The name of the output file can be overridden using the ''%%--%%output'' command-line option to ''sbatch''.  The argument to this option is the name of the file, possibly containing special characters that will be replaced by the job id, job name, etc.  See the ''sbatch'' man page for a complete description.

<note>To redirect error output to a separate file (by default stdout and stderr are directed to the same file), use the ''%%--%%error'' option; stderr is then written to a file named according to the pattern provided. </note>


===== Array Jobs =====

An array job runs the same job script many times, once per array task. For each task, the environment variable **SLURM_ARRAY_TASK_ID** is set to a unique value, which the job script can use as input.

The ''%A_%a'' construct in the output and error file names generates unique file names from the master job ID (''%A'') and the array task ID (''%a''), so each array task writes to its own output and error files.

Example: ''%%#SBATCH --output=arrayJob_%A_%a.out%%''

<note tip>
The ''SLURM_ARRAY_TASK_ID'' is the key to make the array jobs useful.  Use it in your bash script, or pass it as a parameter so your program can decide how to complete the assigned task.
  
For example, the ''SLURM_ARRAY_TASK_ID'' sequence values of 2, 4, 6, ... , 5000 might be passed as an initial data value to 2500 repetitions of a simulation model. Alternatively, each iteration (task) of a job might use a different data file with filenames of ''data$SLURM_ARRAY_TASK_ID'' (i.e., data1, data2, data3, ..., data2000).
</note>

The general form is:

<WRAP left box 100%>
%%-%%-array=   //start_value// - //stop_value// : //step_size//
</WRAP>

For example, specifying a step size 2

<WRAP left box 100%>
%%--%%array=1-7:2
</WRAP>

produces index values of ''1,3,5,7''. The following explicitly sets the indexes as ''1,2,5,19,27''.

<WRAP left box 100%>
%%--%%array=1,2,5,19,27
</WRAP>
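
To preview which index values a given specification produces before submitting, the range-and-step notation can be expanded with a few lines of Bash. This is only an illustrative sketch of the notation shown above, not how Slurm itself parses the option; the function name ''expand_array_spec'' is hypothetical, and other forms accepted by ''sbatch'' are not handled.

```bash
# expand_array_spec SPEC: print the comma-separated index values for a
# specification such as "1-7:2" or "1,2,5,19,27".
# (Hypothetical helper for illustration.)
expand_array_spec() {
  local spec=$1 part range step start stop i
  local -a parts indexes=()
  IFS=, read -ra parts <<< "$spec"
  for part in "${parts[@]}"; do
    step=1
    range=$part
    # Optional ":step" suffix
    if [[ $part == *:* ]]; then
      step=${part##*:}
      range=${part%%:*}
    fi
    [ "$step" -ge 1 ] || return 1      # a zero step would never terminate
    # Either "start-stop" or a single value
    if [[ $range == *-* ]]; then
      start=${range%%-*}
      stop=${range##*-}
    else
      start=$range
      stop=$range
    fi
    for ((i = start; i <= stop; i += step)); do
      indexes+=("$i")
    done
  done
  # Join the values with commas
  local IFS=,
  echo "${indexes[*]}"
}
```

With this helper, ''expand_array_spec 1-7:2'' prints ''1,3,5,7'', matching the example above.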

<note important>The default job array size limits are set to 10000 for Slurm on Caviness to avoid oversubscribing the scheduler node's own resource limits (causing scheduling to become sluggish or even unresponsive). See the [[technical:slurm:caviness:arraysize-and-nodecounts#job-array-size-limits|technical explanation]] for why this is necessary.
</note>

For more details and information see [[abstract:caviness:runjobs:schedule_jobs#array-jobs1|Array Jobs]].
===== Chaining Jobs =====

If you have multiple jobs and want some of them to run automatically after another job finishes, you can chain them. When you chain jobs, remember to check the status of the earlier job to determine whether it completed successfully; this prevents flooding the scheduler with jobs that are bound to fail.  The example below chains three submissions of a single job script, ''do1.qs''.

A job can be held until a particular job completes, either to avoid "hogging" resources or because the output of one job is needed as input for the next. Job dependencies defer the start of a job until the specified dependencies have been satisfied. They are specified with the ''%%--%%dependency'' option to ''sbatch''.

The ''%%--%%dependency'' portion of the ''sbatch'' man page lists the dependency types that can be used to chain jobs. In the format below, "type" is the dependency type.

<code bash>
sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...
</code>

The following ''do1.qs'' script does three things:
   * It first sleeps for 30 seconds. This gives us time to start dependent jobs.
   * It runs ''ls'' on a nonexistent file, which produces a non-zero exit code.
   * It runs the "hello world" program ''phostname''.

<code - do1.qs>
#!/bin/bash
#SBATCH --job-name="atest"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:02:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL

#----------------------
cd $SLURM_SUBMIT_DIR
date
srun -n 8 sleep 30
date

ls this_file_does_not_exist

srun -n 8 /opt/utility/phostname -F
</code>

The same script can be run multiple times to demonstrate the dependency option. ''afterok'' and ''afterany'' options are used for this purpose to establish dependency.

<code bash>
[(it_css:traine)@login00 it_css]$ sbatch --partition=devel do1.qs
Submitted batch job 36805
[(it_css:traine)@login00 it_css]$ sbatch --dependency=afterany:36805 do1.qs 
Submitted batch job 36806
[(it_css:traine)@login00 it_css]$ sbatch --dependency=afterok:36805 do1.qs 
Submitted batch job 36807
</code>

Job 36806 will start only after the initial run, 36805, has finished execution, irrespective of its exit status. This is implemented using the ''afterany'' flag in the ''sbatch'' command. In the other case, job 36807 will start only after the first run, 36805, finishes successfully (runs to completion with an exit code of zero).
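
When chaining is scripted rather than typed by hand, the job ID has to be extracted from the ''sbatch'' output and placed into the dependency string. The following is a minimal sketch; the function names ''job_id_from_output'' and ''dependency_option'' are hypothetical helpers, and the parsing assumes the ''Submitted batch job <id>'' message shown above.

```bash
# Read sbatch output on stdin and print the job ID, e.g.
# "Submitted batch job 36805" -> 36805
job_id_from_output() {
  awk '/^Submitted batch job/ { print $4 }'
}

# dependency_option TYPE JOBID [JOBID...]: build the option in the
# type:job_id[:job_id] form described above
dependency_option() {
  local type=$1
  shift
  local IFS=:
  echo "--dependency=${type}:$*"
}

# Intended use inside a submission script (not executed here):
#   first=$(sbatch do1.qs | job_id_from_output)
#   sbatch "$(dependency_option afterok "$first")" do1.qs
```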

The result of the ''ls'' command does not affect the overall status of the job, because the exit status of a job script is that of the last command it runs. So it might not always be sufficient to just use ''afterok'' when chaining jobs. The alternative is to check the error status of individual commands within the script manually: the error status of the most recent command is held in the variable ''$?''. This can be checked, and the script can then be forced to exit. For example, we can add the line

<code bash>
if [ $? -ne 0 ] ; then exit 1234 ; fi
</code>

<code - do1.qs>
#!/bin/bash
#SBATCH --job-name="atest"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:02:00
#SBATCH -o stdout.%j
#SBATCH -e stderr.%j
#SBATCH --export=ALL

#----------------------
cd $SLURM_SUBMIT_DIR
date
srun -n 8 sleep 30
date

ls this_file_does_not_exist
if [ $? -ne 0 ] ; then exit 1234 ; fi

srun -n 8 /opt/utility/phostname -F
</code>

Now job 36807 will not run, because the initial job, 36805, exits with a non-zero status due to the ''if'' test added to the script above.

This is how chained jobs can be implemented using the ''%%--%%dependency'' option.
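
When a script has several steps that must all succeed, the ''$?'' test can be wrapped in a small function instead of being repeated after every command. This is a minimal sketch; the function name ''run_step'' is hypothetical.

```bash
# run_step COMMAND [ARGS...]: run one step of a job script and report
# a failure on stderr, passing the command's exit code back to the caller.
# (Hypothetical helper for illustration.)
run_step() {
  "$@"
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "step failed with exit code $rc: $*" >&2
  fi
  return "$rc"
}

# In a job script, stop at the first failing step so that an
# afterok dependency on this job is not satisfied:
#   run_step ls this_file_does_not_exist || exit 1
```

Shell exit statuses are reported modulo 256, so a value between 1 and 255 is the safest choice for signalling failure.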
====== Running Jobs with Parallelism ======

The interactive and batch jobs discussed thus far have all been //serial// in nature:  they exist as a sequence of instructions executed in order on a single CPU core.  Many problems solved on a computer can be solved more quickly by breaking the job into pieces that can be solved concurrently.  If one worker moves a pile of bricks from point A to point B in 30 minutes, then employing a second worker to carry bricks should see the job completed in just 15 minutes.  Adding a third worker should decrease the time to 10 minutes.  //Job parallelism// likewise coordinates between multiple serial workers to finish a computation more quickly than if it had been done by a single worker.  Parallelism can take many forms, the two most prevalent being threading and message passing.  Popular implementations of threading and message passing are the [[http://openmp.org/wp/|OpenMP]] and [[http://www.mpi-forum.org/|MPI]] standards.

Sometimes a more loosely-coupled form of parallelism can be used by a job.  Suppose a user has a collection of 100 files, each containing the full text of a novel.  The user would like to run a program for each file that counts the number of gerunds occurring in the text.  The counting program is a simple serial program, but the task can be completed more quickly by analyzing many files concurrently.  This form of parallelism requires no threading or message passing, and in Slurm parlance is called an //[[#array-jobs|array job]]//.

Need help? See [[http://www.hpc.udel.edu/presentations/intro_to_slurm/|Introduction to Slurm]] in UD's HPC community cluster environment.

===== Threads =====

Programs that use OpenMP or some other form of thread parallelism should use the "threads" parallel environment.  This environment logically limits jobs to run on a single node only, which in turn limits the maximum number of workers to be the CPU core count for a node.

For more details, please look at the job script template ''/opt/shared/templates/slurm/generic/threads.qs''.

===== MPI =====

It is the user's responsibility to set up the MPI environment before running the actual MPI job. The job script template found in ''/opt/shared/templates/slurm/generic/mpi/mpi.qs'' will set up a job requiring a generic MPI parallel environment.  This parallel environment spans multiple nodes and allocates workers by "filling-up" one node before moving on. Slurm uses ''%%--%%ntasks-per-node'' to restrict the allocations per node as part of the filling-up strategy. If it is not specified, the default filling-up behavior applies.  When a job starts, an MPI "machines" file is automatically generated and placed in the job's temporary directory at ''${TMPDIR}/machines''.  This file should be copied to a job's working directory or passed directly to the ''mpirun''/''mpiexec'' command used to execute the MPI program.

<note important>Software that uses MPI but is not started using ''mpirun'' or ''mpiexec'' will often have arguments or environment variables which can be set to indicate on which hosts the job should run or what file to consult for that list.  Please consult software manuals and online support resources before contacting UD-IT for help determining how to pass this information to the program.</note>

===== Submitting a Parallel Job =====

As with choosing a parallel environment in Grid Engine, choosing the appropriate number of tasks, threads, and CPUs for the job is an important step in Slurm. The template job scripts document this extensively in their comments. In addition, below are a few Slurm options that matter most when running a parallel job.

^Options^Description^
|''%%--%%nodes=//<#>//''|execute the command on this many distinct nodes|
|''%%--%%ntasks=//<#>//''|execute this many copies of the command|
|''%%--%%ntasks-per-node=//<#>//''|execute this many copies of the command on each distinct node|
|''%%--%%cpus-per-task=//<#>//''|each copy of the command should have this many CPU cores allocated to it|
|''%%--%%mem=//<#>//''|total amount of real memory to allocate to the job|
|''%%--%%mem-per-cpu=//<#>//''|amount of memory to allocate to each CPU core allocated to the job|

A clear understanding of the differences between these options is necessary to work comfortably with parallel jobs.

<note tip>Using ''%%--%%nodes'' together with ''%%--%%ntasks-per-node'' is equivalent to specifying ''%%--%%ntasks'', since the number of nodes multiplied by the number of tasks per node gives the total number of tasks into which the problem has been divided.</note>

When a parallel job executes, the following environment variables will be set by Slurm:

^Variable^Description^
|''SLURM_CPUS_PER_TASK''|The number of CPU cores requested for each task (the value of ''%%--%%cpus-per-task'').  OpenMP jobs should assign the value of ''$SLURM_CPUS_PER_TASK'' to the ''OMP_THREAD_LIMIT'' environment variable, for example.|
|''SLURM_JOB_NODELIST''|List of nodes allocated to the job.|
|''SLURM_TASKS_PER_NODE''|Number of tasks to be initiated on each node|


Keep in mind, Slurm defaults to a node count of 1 on any submitted job, so the mechanism by which you can spread your job across more nodes is a bit more complex.

In essence, if your MPI job wants N CPUs and you're willing to have as few as M of them running per node, then the maximum node count is µ=⌈N/M⌉.

<code>
    sbatch --nodes=1-<µ> --ntasks=<N> --cpus-per-task=1 ...
</code>

Order is significant. For N=20, if you are willing to run as few as 6 per node, then µ=⌈20/6⌉=4, so use

<code>
    sbatch --nodes=1-4 --ntasks=20 --cpus-per-task=1 ...
</code>
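
The ceiling in µ=⌈N/M⌉ can be computed with integer arithmetic in Bash when building the ''sbatch'' command in a script. This is a minimal sketch; the function name ''max_nodes'' is hypothetical.

```bash
# max_nodes N M: print the ceiling of N/M, i.e. the largest node count
# needed when N tasks run with at least M tasks per node.
# (Hypothetical helper for illustration.)
max_nodes() {
  local n=$1 m=$2
  # Adding m-1 before the integer division rounds the result up
  echo $(( (n + m - 1) / m ))
}

# Intended use (not executed here):
#   sbatch --nodes=1-"$(max_nodes 20 6)" --ntasks=20 --cpus-per-task=1 ...
```

For N=20 and M=6 the helper prints ''4'', which gives the ''%%--%%nodes=1-4'' range used above.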


Do not rely on the output of ''scontrol show job'' or ''squeue'' with regard to the node count while the job is pending; it will not be accurate.  Only once the job is scheduled will it show the actual value.

For example,

<code>
    $ sbatch --nodes=3-40 --ntasks=80 --cpus-per-task=1
    #!/bin/bash

    env

    Submitted batch job 701892

    $ scontrol show job 701892
    JobId=701892 JobName=sbatch
       UserId=frey(1001) GroupId=everyone(900) MCS_label=N/A
       Priority=5961 Nice=0 Account=it_nss QOS=normal
       JobState=PENDING Reason=Resources Dependency=(null)
         :
       NumNodes=9-40 NumCPUs=80 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

    ....some time goes by....

    $ scontrol show job 701892
    JobId=701892 JobName=sbatch
       UserId=frey(1001) GroupId=everyone(900) MCS_label=N/A
       Priority=5961 Nice=0 Account=it_nss QOS=normal
       JobState=COMPLETED Reason=None Dependency=(null)
         :
       NumNodes=5 NumCPUs=80 NumTasks=80 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
</code>

The scheduler found 5 nodes with 80 free CPUs (1@r00n17, 35@r01n03, 35@r01n12, 8@r01n16, 1@r01n50):

<code>
    SLURM_NTASKS=80
    SLURM_TASKS_PER_NODE=1,35(x2),8,1
    SLURM_NODELIST=r00n17,r01n[03,12,16,50]
    SLURMD_NODENAME=r00n17
</code>
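
The compact ''SLURM_TASKS_PER_NODE'' notation, where ''35(x2)'' means two consecutive nodes with 35 tasks each, can be expanded inside a job script to obtain one count per node. This is a minimal sketch; the function names ''expand_tasks_per_node'' and ''total_tasks'' are hypothetical.

```bash
# expand_tasks_per_node SPEC: print one task count per node, e.g.
# "1,35(x2),8,1" -> "1 35 35 8 1".  (Hypothetical helper.)
expand_tasks_per_node() {
  local spec=$1 item i
  local re='^([0-9]+)\(x([0-9]+)\)$'   # matches COUNT(xREPEAT)
  local -a items counts=()
  IFS=, read -ra items <<< "$spec"
  for item in "${items[@]}"; do
    if [[ $item =~ $re ]]; then
      # Repeat the count once for each of the REPEAT nodes
      for ((i = 0; i < BASH_REMATCH[2]; i++)); do
        counts+=("${BASH_REMATCH[1]}")
      done
    else
      counts+=("$item")
    fi
  done
  echo "${counts[*]}"
}

# total_tasks SPEC: sum the per-node counts
total_tasks() {
  local total=0 c
  for c in $(expand_tasks_per_node "$1"); do
    total=$((total + c))
  done
  echo "$total"
}
```

Applied to the value shown above, ''total_tasks "1,35(x2),8,1"'' prints ''80'', which agrees with ''SLURM_NTASKS''.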

==== Job Templates ====

Detailed information pertaining to individual kinds of parallel jobs  are provided by UD IT in a collection of job template scripts on a per-cluster basis under the ''/opt/shared/templates/slurm/generic'' directory.  For example, on Caviness this directory looks like:

<code bash>
[(it_css:traine)@login00 generic]$ ls -l
total 31
drwxr-sr-x 5 frey sysadmin    5 Sep 14 09:36 mpi
-rwxr-xr-x 1 frey sysadmin 5016 Oct 29 10:06 serial.qs
-rwxr-xr-x 1 frey sysadmin 5438 Jan 14 11:53 threads.qs
</code>

The directory layout is self-explanatory:  script templates for all MPI jobs can be found in the ''mpi'' directory (Open MPI in the ''openmpi'' directory, generic MPI in the ''generic'' directory, and MPICH in the ''mpich'' directory, all under ''mpi''); ''serial.qs'' is a template for serial jobs, and ''threads.qs'' should be used for OpenMP jobs. These scripts are heavily documented to aid in users' choice of appropriate templates and are updated as we uncover best practices and performance issues.  Please copy a fresh script template for each new project rather than reusing a potentially older version from a previous project. See [[technical:slurm:caviness:templates:start|Caviness Slurm Job Script Templates]] for more details.

Need help? See [[http://www.hpc.udel.edu/presentations/intro_to_slurm/|Introduction to Slurm]] in UD's HPC community cluster environment.

===== Array Jobs =====

Hearkening back to the text-processing example cited above, the analysis of each of the 100 files could be performed by submitting 100 separate jobs to Slurm, each modified to work on a different file.  Using an array job helps to automate this task:  each //sub-task// of the array job gets assigned a unique integer identifier.  Each sub-task can find its sub-task identifier in the ''SLURM_ARRAY_TASK_ID'' environment variable.  

Consider the following job submission script file called ''array_demo.qs'': 

<code bash>
#!/bin/bash

#SBATCH --job-name=arrayJob
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-4
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

######################
# Begin work section #
######################

# Print this sub-job's task ID
echo "My Task ID is : " $SLURM_ARRAY_TASK_ID

# Do some work based on the SLURM_ARRAY_TASK_ID
# For example: 
# ./my_process $SLURM_ARRAY_TASK_ID
# 
# where my_process is your executable.
</code>

<code bash>
[(it_css:traine)@login00 it_css]$ sbatch array_demo.qs
Submitted batch job 1176
[(it_css:traine)@login00 it_css]$ ...time passes...
[(it_css:traine)@login00 it_css]$ ls -1 arrayJob_*
arrayJob_1176_1.err
arrayJob_1176_1.out
arrayJob_1176_2.err
arrayJob_1176_2.out
arrayJob_1176_3.err
arrayJob_1176_3.out
arrayJob_1176_4.err
arrayJob_1176_4.out
[(it_css:traine)@login00 it_css]$ cat arrayJob_1176_3.out
My Task ID is :  3
</code>

Four sub-tasks are executed, numbered from 1 through 4.  The starting index must be zero or greater, and the ending index must be greater than or equal to the starting index.  The //step size// going from one index to the next defaults to one, but can be any positive integer. A step size is appended to the sub-task range as in ''2-20:2'', which proceeds from 2 up to 20 in steps of 2, i.e. 2, 4, 6, 8, 10, and so on.

<note important>The default job array size limits are set to 10000 for Slurm on Caviness to avoid oversubscribing the scheduler node's own resource limits (causing scheduling to become sluggish or even unresponsive). See the [[technical:slurm:caviness:arraysize-and-nodecounts#job-array-size-limits|technical explanation]] for why this is necessary.
</note>
==== Partitioning Job Data ====

There are essentially two methods for partitioning input data for array jobs.  Both methods make use of the sub-task identifier in locating the input for a particular sub-task.

If 100 novels were in files with names fitting the pattern ''novel_''<<''sub-task-id''>>''.txt'' then the analysis could be performed with the following job script ''gerund_array.qs'':

<code bash>
#!/bin/bash

#SBATCH --job-name=gerunds
#SBATCH --output=gerund_count_%a.out
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

######################
# Begin work section #
######################

# Count gerunds in the file:
./gerund_count "novel_${SLURM_ARRAY_TASK_ID}.txt"
</code>

<code bash>
[(it_css:traine)@login00 novels]$ sbatch --array=1-100 gerund_array.qs
Submitted batch job 1176
</code>

When complete, the job will produce 100 files named ''gerund_count_''<<''sub-task-id''>>''.out'' where the ''sub-task-id'' relates each result to its input file.

An alternate method of organizing the chaos associated with large array jobs is to partition the data in directories:  the sub-task identifier is not applied to the filenames but is used to set the working directory for each sub-task. With this kind of logic, the job script ''gerund_array.qs'' looks like:

<code bash>
#!/bin/bash

#SBATCH --job-name=gerunds
#SBATCH --output=gerund_count.out
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

######################
# Begin work section #
######################

cd ${SLURM_ARRAY_TASK_ID}
../gerund_count novel.txt > gerund_count
</code>

<code bash>
[(it_css:traine)@login00 novels]$ sbatch --array=1-100 gerund_array.qs
Submitted batch job 1177
</code>

When complete, each directory will have a file named ''gerund_count'' containing the output of the ''gerund_count'' command.
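
If the input files start out with the flat ''novel_''<<''sub-task-id''>>''.txt'' naming, the per-directory layout can be created once before submitting the array job. This is a minimal sketch; the function name ''partition_into_dirs'' is hypothetical.

```bash
# partition_into_dirs DIR: move DIR/novel_<N>.txt to DIR/<N>/novel.txt
# for every matching file.  (Hypothetical helper for illustration.)
partition_into_dirs() {
  local dir=$1 file id
  for file in "$dir"/novel_*.txt; do
    [ -e "$file" ] || continue   # nothing matched: the glob stays literal
    id=${file##*/novel_}         # strip the path and the "novel_" prefix
    id=${id%.txt}                # strip the ".txt" suffix
    mkdir -p "$dir/$id"
    mv "$file" "$dir/$id/novel.txt"
  done
}
```

Running ''partition_into_dirs .'' in the directory holding the novels leaves one sub-directory per sub-task identifier, which is the layout the job script above expects.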

=== Using an Index File ===

The partitioning scheme can be as complex as the user desires.  If the directories were not named "1"  through "100" but instead used the name of the novel contained within, an index file could be created containing the directory names, one per line:

<code>
Great_Expectations
Atlas_Shrugged
The_Great_Gatsby
  :
</code>

The job submission script ''gerund_array.qs'' might then look like:

<code bash>
#!/bin/bash

#SBATCH --job-name=gerunds
#SBATCH --output=gerund_count.out
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

######################
# Begin work section #
######################

NOVEL_FOR_TASK=$(sed -n "${SLURM_ARRAY_TASK_ID}p" index.txt)
cd "$NOVEL_FOR_TASK" || exit 1
../gerund_count novel.txt > gerund_count
</code>

<code bash>
[(it_css:traine)@login00 novels]$ sbatch --array=1-100 gerund_array.qs
Submitted batch job 1178
</code>

The ''sed'' command selects a single line of the ''index.txt'' file; for sub-task 1 the first line is selected, sub-task 2 the second line, etc.
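
The index file itself can be generated from the directory names, and the ''sed'' lookup can be tried out before submitting. This is a minimal sketch; the function names ''build_index'' and ''entry_for_task'' are hypothetical, and the index is assumed to live in the same directory as the novel directories.

```bash
# build_index DIR: write DIR/index.txt with the name of each
# sub-directory of DIR, one per line.  (Hypothetical helper.)
build_index() {
  local dir=$1 d
  for d in "$dir"/*/; do
    [ -d "$d" ] || continue   # no sub-directories: the glob stays literal
    d=${d%/}                  # drop the trailing slash
    echo "${d##*/}"           # keep only the directory name
  done > "$dir/index.txt"
}

# entry_for_task INDEX_FILE TASK_ID: print line TASK_ID of the index,
# using the same sed expression as the job script
entry_for_task() {
  sed -n "${2}p" "$1"
}
```

The number of lines in ''index.txt'' (''wc -l < index.txt'') gives the upper bound to use in ''%%--%%array=1-N'', so that every sub-task finds an entry.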