technical:slurm:scheduler-params [2020-02-26 13:41] – frey
===== Issues =====
==== Large queue sizes ====
There are currently no limits on the number of jobs each user can submit on Caviness.
Many Caviness users are used to submitting a job and immediately seeing it begin to execute.
One reason the Slurm queue on Caviness can see degraded scheduling efficiency when filled with too many jobs relates to the ordering of the jobs -- and thus to the job priority.
^factor^multiplier^notes^
|fair-share| 4000|see ''…''|
|partition id| 2000|1.0 for all partitions|
|job size| 1|all resources in partition=1.0|
Next to priority access, wait time is the largest factor.
Taken together, these factors allow a single user (even one with a very small share of purchased cluster resources) to submit thousands of jobs that quickly sort to the head of the pending queue due to their wait time. The weight on wait time then begins to prioritize those jobs over jobs submitted by users who have not been using the cluster, contrary to the goals of fair-share.
===== Solutions =====
On many HPC systems, per-user limits are enacted to restrict how many pending jobs can be present in the queue.
It would be preferable to avoid enacting such limits on Caviness.  Should user behavior change over time (with users routinely abusing this leniency), submission limits may become necessary.
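Should limits ever become necessary, Slurm supports per-user submission caps at the QOS level.  As a sketch only (the QOS name and limit value here are hypothetical, not a proposal for Caviness):

<code bash>
$ sudo sacctmgr modify qos normal set MaxSubmitJobsPerUser=2000
</code>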
==== Altered priority weights ====
The dominance of wait time in priority calculations is probably the factor contributing most greatly to this problem.  Two additional factors relate to the size of the job's resource request:

  - the job size factor
  - the TRES size factor, calculated using **PriorityWeightTRES** weights and requested resource values
The Slurm documentation also points out that the priority factor weights should be of a magnitude that allows enough significant digits from each factor (a minimum of 1000 for important factors).  Applying this guidance to Caviness:

  - Partitions all contribute the same value, therefore the weight can be 0
  - Priority-access should unequivocally bias the priority higher; as a binary factor (0.0 or 1.0), very few bits should be necessary
  - Fair-share should outweigh the remaining factors
  - Wait time and job size should be considered equivalent (or nearly so, with wait time weighted above job size)
    * The job size is determined by the **PriorityWeightTRES** option; currently set to the default, which is empty, which yields 0.0 for every job(!)
It seems appropriate to split the 32-bit priority value into groups of bits that represent each priority-weighting tier:

^mask^tier^
|3 << 30 = ''0xC0000000''|priority access|
|262143 << 12 = ''0x3FFFF000''|fair-share|
|4095 = ''0x00000FFF''|wait time and job size|
The wait time and job size group of bits is split 60% to wait time and 40% to job size:

^mask^sub-factor^
|2457 = ''0x999''|wait time|
|1638 = ''0x666''|job size|
The **PriorityWeightTRES** option must be set, as well, to yield a non-zero contribution for job size; internally, the weights in **PriorityWeightTRES** are converted to double-precision floating point values.

<note important>
The TRES weights should sum to a maximum that keeps the job size contribution within its allotted bits.
</note>
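For illustration only, **PriorityWeightTRES** in ''slurm.conf'' takes a comma-separated list of TRES=weight pairs; the weights shown here are hypothetical, not the values chosen for Caviness:

<code>
PriorityWeightTRES=CPU=1000,Mem=500,GRES/gpu=2000
</code>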
===== Implementation =====
The priority weight factors will be adjusted in the Slurm configuration as follows:
^Configuration Key^Old Value^New Value^
|…|…|…|
Priorities will make use of the full 32 bits available, with the fair-share factor dominating and having the most precision assigned to it.
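Under this scheme, a job's priority is the sum of each weight multiplied by its normalized (0.0 to 1.0) factor.  A sketch in Python, using the weights from the tables above (the factor values passed in are invented for illustration):

<code python>
def job_priority(priority_access, fair_share, wait_time, job_size):
    """Combine normalized factors (each in [0.0, 1.0]) into a 32-bit priority."""
    return (int((3 << 30) * priority_access)
            + int((262143 << 12) * fair_share)
            + int(2457 * wait_time)
            + int(1638 * job_size))

# A priority-access job outranks any non-priority job, no matter
# how the other factors compare.
assert job_priority(1.0, 0.0, 0.0, 0.0) > job_priority(0.0, 1.0, 1.0, 1.0)

# With every factor at its maximum, the full 32-bit range is used.
assert job_priority(1.0, 1.0, 1.0, 1.0) == 0xFFFFFFFF
</code>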
The modifications to ''slurm.conf'' will be made at the scheduled implementation time.
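Typically a change to these scheduler parameters is activated by asking ''slurmctld'' to reread its configuration:

<code bash>
$ sudo scontrol reconfigure
</code>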
===== Impact =====
^Date ^Time ^Goal/Activity^
|2019-10-24| |Authoring of this document|
|2020-03-09|09:00|Implementation|