technical:slurm:caviness:mandatory_gpu_type [2024-01-18 10:30] – created frey | technical:slurm:caviness:mandatory_gpu_type [2024-01-30 17:17] (current) – [Issues] anita | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== Revision to Slurm job submission to require GPU types ====== | ====== Revision to Slurm job submission to require GPU types ====== | ||
+ | This document summarizes a change to the job submission plugin that rejects GPU requests that do not name a GPU type. | ||
+ | ===== Issues ===== | ||
+ | |||
+ | Workgroup resource limits are effected through a Slurm //Quality of Service// (QOS) record. | ||
+ | |||
+ | <code> | ||
+ | [user@login00.caviness ~]$ sacctmgr show qos -np workgroup_X | cut -d\| -f9 | ||
+ | cpu=248, | ||
+ | </code> | ||
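A comma-separated limit string like the one above is easier to read one entry per line. The following is a minimal sketch; `split_tres` is a hypothetical helper name, and the sample values in the usage line are made up for illustration and are not the limits of any Caviness workgroup.

```shell
# Hypothetical helper: print each entry of a comma-separated TRES string on its own line.
# Empty entries (for example from a trailing comma) are dropped.
split_tres() {
    printf '%s\n' "$1" | tr ',' '\n' | sed '/^$/d'
}

# Usage with made-up values (NOT real Caviness limits):
split_tres 'cpu=4,mem=16G,gres/gpu=1'
```

On a login node the same helper could presumably be fed the output of the `sacctmgr` command shown above.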
+ | |||
+ | This workgroup purchased GPU nodes in all generations of Caviness: | ||
+ | |||
+ | ^Generation^GPU type^Count^ | ||
+ | |1|P100|2| | ||
+ | |2|V100|2| | ||
+ | |3|A100|2| | ||
+ | |||
+ | Generically speaking, workgroup_X has access to 6 GPU devices: | ||
+ | |||
+ | When a job is submitted with the flag '' | ||
+ | |||
+ | <code> | ||
+ | [user@login00.caviness ~]$ scontrol show job 9897654321 | grep ' | ||
+ | | ||
+ | </code> | ||
+ | |||
+ | The type-specific (A100) resource limit is affected, as is the generic implicit limit. | ||
+ | |||
+ | <code> | ||
+ | [user@login00.caviness ~]$ scontrol show job 123456789 | grep ' | ||
+ | | ||
+ | </code> | ||
+ | |||
+ | In this case, **only the generic implicit limit is affected.** | ||
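The difference between the two cases can be illustrated with a small sketch. It only mirrors the accounting behaviour described above; it is not Slurm's own logic. The function name `charged_limits` is hypothetical, and the sketch assumes the limits are tracked under the TRES names `gres/gpu` (generic) and `gres/gpu:<type>` (type-specific), with a count of 1 when none is given.

```shell
# Sketch only: reports which limits a single --gres entry would count against,
# following the behaviour described in this section.
# Assumptions: TRES names gres/gpu and gres/gpu:<type>; missing count means 1.
charged_limits() {
    rest=${1#gpu}        # drop the leading "gpu"
    rest=${rest#:}       # drop the separator, if any
    first=${rest%%:*}    # first remaining field: a type, a count, or empty
    case "$first" in
        '')                       # bare "gpu"
            gpu_type=''; count=1 ;;
        *[!0-9]*)                 # not purely numeric, so it is a type name
            gpu_type=$first
            case "$rest" in
                *:*) count=${rest#*:} ;;
                *)   count=1 ;;
            esac ;;
        *)                        # purely numeric, so it is a count
            gpu_type=''; count=$first ;;
    esac
    echo "gres/gpu=$count"
    if [ -n "$gpu_type" ]; then echo "gres/gpu:$gpu_type=$count"; fi
}

charged_limits gpu:a100:2   # typed request: generic and A100 limits
charged_limits gpu:2        # untyped request: generic limit only
```

The typed request is reported against both limits, while the untyped request is reported against the generic limit alone.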
+ | |||
+ | <note important> | ||
+ | |||
+ | ==== Generational Change ==== | ||
+ | |||
+ | When Caviness was first built and the job submission plugin written, the cluster **only** contained P100 GPUs. In all cases where the generic '' | ||
+ | |||
+ | In Generation 2 of Caviness the V100 GPU had become the model of choice from NVIDIA and the Slurm configuration added the '' | ||
+ | |||
+ | Generation 3 added '' | ||
+ | |||
+ | ===== Implementation ===== | ||
+ | |||
+ | The job submission plugin for Caviness' | ||
+ | |||
+ | <code> | ||
+ | [user@login00.caviness ~]$ sbatch --gres gpu:2 --partition workgroup_X … | ||
+ | </code> | ||
+ | |||
+ | will receive the error message: | ||
+ | |||
+ | <code> | ||
+ | No GPU type requested: gpu:2 | ||
+ | </code> | ||
+ | |||
+ | In this case, the user must choose the specific type of GPU the job requires: | ||
+ | |||
+ | <code> | ||
+ | [user@login00.caviness ~]$ sbatch --gres gpu:a100:2 --partition workgroup_X … | ||
+ | </code> | ||
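A submission script or wrapper can catch the problem before calling `sbatch`. The following is a minimal client-side sketch, not the plugin's actual code, which this document does not show. The function names are hypothetical, and the sketch assumes that a `gpu` entry whose second field is empty or purely numeric (for example `gpu` or `gpu:2`) carries no type.

```shell
# Hypothetical pre-submission check; the real plugin's logic may differ.
# Assumption: in "gpu[:type][:count]", a second field that is empty or all
# digits is a count, so no GPU type was named.
check_gres_entry() {
    item=$1
    kind=${item%%:*}
    [ "$kind" = "gpu" ] || return 0      # non-GPU resources are not checked
    rest=${item#gpu}
    rest=${rest#:}
    second=${rest%%:*}
    case "$second" in
        *[!0-9]*) return 0 ;;            # contains a non-digit: a type name
        *)  echo "No GPU type requested: $item" >&2
            return 1 ;;
    esac
}

# Check every entry of a comma-separated --gres value.
gres_has_gpu_type() {
    for gres_entry in $(printf '%s' "$1" | tr ',' ' '); do
        check_gres_entry "$gres_entry" || return 1
    done
    return 0
}

gres_has_gpu_type "gpu:a100:2" && echo "ok to submit"
gres_has_gpu_type "gpu:2" || echo "rejected before submission"
```

The message printed to standard error deliberately mimics the plugin's error text quoted above, so a user sees the same wording either way.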
+ | |||
+ | If you are unsure of the GPU types and counts available in your workgroup partition, use the command '' | ||
+ | |||
+ | ===== Impact ===== | ||
+ | |||
+ | The Slurm scheduler will be restarted to load the updated job submission plugin. | ||
+ | |||
+ | ===== Timeline ===== | ||
+ | |||
+ | ^Date^Time^Goal/ | ||
+ | |2024-01-18| |Authoring of this document| | ||
+ | |2024-01-18| |Alteration of job submission plugin| | ||
+ | |2024-02-01|10: |