====== Revision to Slurm job submission to require GPU types ======

This document summarizes an alteration to the job submission plugin to prevent non-specific requests for GPU resources.

===== Issues =====

Workgroup resource limits are enforced through a Slurm //Quality of Service// (QOS) record. For example:

<code>
[user@login00.caviness ~]$ sacctmgr show qos -np workgroup_X | cut -d\| -f9
cpu=248,mem=3075G,gres/gpu:p100=2,gres/gpu:v100=2,gres/gpu:a100=2
</code>

This workgroup purchased GPU nodes in all generations of Caviness:

^ Generation ^ GPU type ^ Count ^
| 1 | P100 | 2 |
| 2 | V100 | 2 |
| 3 | A100 | 2 |

Generically speaking, workgroup_X has access to 6 GPU devices: Slurm implicitly includes a generic resource limit for workgroup_X that would be displayed as ''gres/gpu=6''.

When a job is submitted with the flag ''%%--%%gres=gpu:a100:1'', Slurm generates the following resource debits for that job:

<code>
[user@login00.caviness ~]$ scontrol show job 9897654321 | grep 'TRES='
TRES=cpu=24,mem=500G,node=1,gres/gpu=1,gres/gpu:a100=1
</code>

The type-specific (A100) resource limit is affected, as is the generic implicit limit. Consider, though, a job that is submitted with the flag ''%%--%%gres=gpu'':

<code>
[user@login00.caviness ~]$ scontrol show job 123456789 | grep 'TRES='
TRES=cpu=5,mem=700G,node=1,billing=716805,gres/gpu=1
</code>

In this case, **only the generic implicit limit is affected.** Job 123456789 can use any type of GPU to which workgroup_X has access, but no type-specific limit will influence the scheduler's choice of GPU type. Even if workgroup_X already has running jobs using 2 of 2 A100 GPUs, job 123456789 would be allowed to use a third A100 GPU, effectively borrowing an A100 against the quota of P100 and V100 GPUs the workgroup also purchased. This is not the intended behavior.

**[VERY IMPORTANT]** Once this change goes into effect, all job scripts and command-line requests for GPUs using the syntax ''%%--%%gres=gpu'' or ''%%--%%gres=gpu:<<#>>'' **must** be altered to include the desired GPU type: for example, ''%%--%%gres=gpu:a100'' or ''%%--%%gres=gpu:a100:<<#>>''. The command ''sworkgroup -g <<workgroup>> %%--%%limits'' displays the GPU types and counts available to your workgroup for jobs in the workgroup partition. Jobs submitted to the ''standard'' partition (for which workgroup GPU limits do not apply) **must** also specify the GPU type once the change goes into effect. If you are unsure of the GPU types and counts available in the ''standard'' partition, see [[abstract:caviness:caviness#compute-nodes|Compute Nodes]] on Caviness.

==== Generational Change ====

When Caviness was first built and the job submission plugin was written, the cluster contained **only** P100 GPUs. Wherever the generic ''gres/gpu'' corresponds to a single type-specific resource (as ''gres/gpu:p100'' did originally), jobs that omit the GPU type at submission still debit the resource limit appropriately. In Generation 2 of Caviness the V100 GPU had become NVIDIA's model of choice, and the ''v100'' GPU type was added to the Slurm configuration in addition to the ''p100'' type. At that time the issue with non-specific GPU requests was not anticipated or noted. Generation 3 added the ''a100'' and ''a40'' GPU types.

===== Implementation =====

The job submission plugin for Caviness' Slurm job scheduler must be altered to detect the omission of a GPU type and raise an error that prevents the job from being accepted for scheduling.
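
For batch jobs this means that any ''#SBATCH'' GPU request in the script must name a type. A minimal sketch of the required edit is shown below; the ''v100'' type and the count of 1 are only examples, so substitute a type and count available to your workgroup (''sworkgroup -g <<workgroup>> %%--%%limits'' lists them):

<code bash>
#!/bin/bash
#
# Old directive (no GPU type) -- will be rejected once the change takes effect:
# #SBATCH --gres=gpu:1
#
# New directive -- the GPU type is named explicitly; v100 is only an example,
# substitute a type your workgroup has access to:
#SBATCH --gres=gpu:v100:1
</code>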
Users attempting to submit a job from the command line using non-specific GPU syntax:

<code>
[user@login00.caviness ~]$ sbatch --gres gpu:2 --partition workgroup_X …
</code>

will receive the error message:

<code>
No GPU type requested: gpu:2
</code>

In this case, the user must choose the specific type of GPU the job requires:

<code>
[user@login00.caviness ~]$ sbatch --gres gpu:a100:2 --partition workgroup_X …
</code>

If you are unsure of the GPU types and counts available in your workgroup partition, use the command ''sworkgroup -g workgroup_X %%--%%limits''. Remember that jobs submitted to the ''standard'' partition (for which workgroup GPU limits do not apply) **must** also specify the GPU type. If you are unsure of the GPU types and counts available in the ''standard'' partition, see [[abstract:caviness:caviness#compute-nodes|Compute Nodes]] on Caviness.

===== Impact =====

The Slurm scheduler will be restarted to load the updated job submission plugin. Job submission and query commands (''sbatch'', ''sacct'', and ''squeue'', for example) will hang for a period anticipated to be less than one minute.

===== Timeline =====

^ Date ^ Time ^ Goal/Description ^
| 2024-01-18 | | Authoring of this document |
| 2024-01-18 | | Alteration of job submission plugin |
| 2024-02-01 | 10:00 | Implementation |
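
Ahead of the implementation date above, users may wish to scan their existing job scripts for GPU requests that omit a type. One possible check is sketched below; the search path is only an example, and the pattern flags ''%%--%%gres'' values of the form ''gpu'' or ''gpu:<<#>>'' that name no type:

<code bash>
# Report any --gres request of the form "gpu" or "gpu:<count>" with no GPU type.
# Adjust the search path to wherever your job scripts are kept.
grep -rnE -e '--gres[= ]gpu(:[0-9]+)?([^:A-Za-z0-9]|$)' ~/job_scripts
</code>

Any lines the command reports should be updated to name a GPU type before the implementation on 2024-02-01 at 10:00.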