
Revisions to Slurm Configuration v2.0.0 on Caviness

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

There are currently no limits on the number of jobs each user can submit on Caviness. Submitted jobs must be repeatedly evaluated by Slurm to determine if/when they should execute. The evaluation includes:

  • Calculation of owning user's fair-share priority (based on decaying usage history)
  • Calculation of overall job priority (fair-share, wait time, size, partition id)
  • Sorting of all jobs in the queue based on priority
  • From the head of the queue up:
    • Search for free resources matching requested resources
    • Start execution if the job is eligible and resources are free
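
A schematic sketch of that evaluation loop, in Python, may help; it is not Slurm's actual implementation, and the Job and Cluster objects and helper functions below are hypothetical:

  # Schematic sketch of one scheduling pass, as outlined above.
  # The Job/Cluster types and helper functions are hypothetical.
  def scheduling_pass(pending_jobs, cluster):
      # 1-2. Recompute fair-share and overall priority for every pending job;
      #      each factor is a float in [0.0, 1.0] before weighting.
      for job in pending_jobs:
          job.fairshare = compute_fairshare(job.user)   # decaying usage history
          job.priority = compute_priority(job)          # weighted sum of factors

      # 3. Sort the entire queue, highest priority first.
      pending_jobs.sort(key=lambda j: j.priority, reverse=True)

      # 4. Walk from the head of the queue, starting any eligible job whose
      #    requested resources are currently free.
      for job in pending_jobs:
          if job.is_eligible() and cluster.has_free_resources(job.request):
              cluster.allocate(job.request)
              job.start()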

The fair-share calculations require extensive queries against the job database, and locating free resources is a complex operation. Thus, as the number of jobs that are pending (in the queue, not yet executing) increases, the time required to process all the jobs increases. Eventually the system may reach a point where the priority calculations and queue sort dominate the allotted scheduling run time.

Many Caviness users are used to submitting a job and immediately seeing (via squeue --job=<job-id>) scheduling status for that job. Some special events (job submission, job completion) trigger Slurm to process a limited number of pending jobs at the head of the queue, to increase responsiveness. That limit (currently the default of 100 jobs) becomes less effective when the queue is filled with many jobs, and least effective when all of those jobs are owned by a single user who has already reached his/her resource limits.

One reason the Slurm queue on Caviness can see degraded scheduling efficiency when filled with too many jobs relates to the ordering of the jobs – and thus to the job priority. Job priority is currently calculated as the weighted sum of the following factors (which are each valued in the range [0.0,1.0]):

factor                          multiplier  notes
qos override (priority-access)  20000       standard, devel = 0.0; _workgroup_ = 1.0
wait time (age)                 8000        longest wait time in queue = 1.0
fair-share                      4000        see sshare
partition id                    2000        1.0 for all partitions
job size                        1           all resources in partition = 1.0
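
As a sketch (not actual Slurm code), the current calculation amounts to the following weighted sum, where each *_factor argument is the per-job value in [0.0, 1.0] from the table above:

  # Sketch of the current priority formula using the weights tabulated above.
  def job_priority(qos_factor, age_factor, fairshare_factor,
                   partition_factor, jobsize_factor):
      return int(20000 * qos_factor        # priority-access QOS
               +  8000 * age_factor        # wait time
               +  4000 * fairshare_factor  # fair-share
               +  2000 * partition_factor  # partition (1.0 for all partitions)
               +     1 * jobsize_factor)   # job size

  # Example: a priority-access job with the longest wait time in the queue:
  #   job_priority(1.0, 1.0, 0.2, 1.0, 0.1) == 30800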

Next to priority access, wait time is the largest factor: the longer a job waits to execute, the higher its priority to be scheduled. This seems appropriate, but it competes against the fair-share factor, which prioritizes jobs for underserved users.

Taken together, these factors allow a single user (even one with a very small share of purchased cluster resources) to submit thousands of jobs that, as they accumulate wait time, quickly sort to the head of the pending queue. The weight on wait time then prioritizes those jobs over jobs submitted by users who have not been using the cluster, contrary to the goals of fair-share.

On many HPC systems, per-user limits are enacted to restrict how many pending jobs can be present in the queue: for example, a limit of 10 jobs in the queue at once. When the user submits an 11th job, the submission fails and the job is NOT added to the queue. Each Slurm partition can have this sort of limit placed on it (both per-user and aggregate), and each QOS can override those limits.

It would be preferable to avoid enacting such limits on Caviness. Should user behavior change over time (with users routinely abusing this leniency), submission limits may become necessary.

The dominance of wait time in the priority calculation is probably the factor contributing most to this problem. Wait time should not dominate fair-share, so at the very least the magnitudes of those two weights should be swapped. The job size weight of 1 makes that factor effectively useless: the vast majority of jobs request less than 50% of a partition's cores, so their contribution rounds to 0 (see the illustration after the list below). In reality, wait time and job size should be considered on a more equal footing. However, there are multiple job size contributions to the priority:

  1. the job size factor already discussed (percentage of cores) weighted by PriorityJobSize
  2. the TRES size factor calculated using PriorityWeightTRES weights and requested resource values
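
To illustrate the first contribution: with a weight of 1, any job requesting less than half of a partition's cores contributes nothing to the priority (the rounding shown mirrors the behavior described above and is illustrative only):

  # Why PriorityWeightJobSize=1 is effectively useless: the factor is the
  # fraction of the partition's cores, so a weight of 1 rounds most jobs to 0.
  for cores_fraction in (0.05, 0.25, 0.49, 0.75):
      contribution = round(1 * cores_fraction)
      print(f"{cores_fraction:.2f} of partition cores -> +{contribution} priority")
  # 0.05 -> +0, 0.25 -> +0, 0.49 -> +0, 0.75 -> +1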

The Slurm documentation also points out that the priority factor weights should be of a magnitude that allows enough significant digits from each factor (minimum 1000 for important factors). The scheduling priority is an unsigned 32-bit integer (ranging [0,4294967295]). In constructing a new set of priority weights:

  1. Partitions all contribute the same value, therefore the weight can be 0
  2. Priority-access should unequivocally bias the priority higher; as a binary (0.0 or 1.0) very few bits should be necessary
  3. Fair-share should outweigh the remaining factors in importance
  4. Wait time and job size should be considered equivalent (or nearly so with wait time greater than job size)
    • The job size is determined by the PriorityWeightTRES option, which is currently set to the default (empty) and therefore yields 0.0 for every job(!)

It seems appropriate to split the 32-bit value into groups that represent each priority-weighting tier:

mask                       tier
3 << 30 = 0xC0000000       qos (priority-access)
262143 << 12 = 0x3FFFF000  fair-share
4095 = 0x00000FFF          wait time and job size

The wait time and job size group of bits is split 60% to wait time, 40% to job size:

mask               sub-factor
2457 = 0x00000999  wait time
1638 = 0x00000666  job size
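
These masks can be sanity-checked with a few lines of Python (a sketch; the constants come directly from the tables above):

  # The tier masks cover all 32 bits without overlapping, and the low 12 bits
  # split 60%/40% between wait time and job size.
  QOS_MASK       = 3 << 30        # 0xC0000000, qos (priority-access)
  FAIRSHARE_MASK = 262143 << 12   # 0x3FFFF000, fair-share
  AGE_JOBSIZE    = 4095           # 0x00000FFF, wait time and job size

  assert QOS_MASK | FAIRSHARE_MASK | AGE_JOBSIZE == 0xFFFFFFFF
  assert QOS_MASK & FAIRSHARE_MASK == 0 and FAIRSHARE_MASK & AGE_JOBSIZE == 0

  AGE_WEIGHT     = 2457           # 0x999, 60% of 4095
  JOBSIZE_WEIGHT = 1638           # 0x666, 40% of 4095
  assert AGE_WEIGHT + JOBSIZE_WEIGHT == AGE_JOBSIZE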

The PriorityWeightTRES must be set, as well, to yield a non-zero contribution for job size; internally, the weights in PriorityWeightTRES are converted to double-precision floating point values. From the Slurm source code, the calculation of job size depends on the TRES limits associated with the partition(s) to which the job was submitted (or in which it is running). The job's requested resource value is divided by the maximum value for the partition, yielding the resource fractions associated with the job. The fractions are multiplied by the weighting values in PriorityWeightTRES.
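
A simplified sketch of that calculation, as described, follows; the partition maxima, the example request, and the helper function are illustrative assumptions, not the actual Slurm source:

  # Simplified sketch of the TRES-based job size contribution: each requested
  # resource is divided by the partition's maximum for that resource, and the
  # resulting fraction is multiplied by its PriorityWeightTRES weight.
  def tres_job_size(request, partition_max, tres_weights):
      total = 0.0
      for tres, amount in request.items():
          fraction = amount / partition_max[tres]        # in [0.0, 1.0]
          total += fraction * tres_weights.get(tres, 0.0)
      return total

  # Hypothetical partition limits, with the proposed weights:
  partition_max = {"cpu": 4000, "mem": 16_000_000, "node": 100}
  tres_weights  = {"cpu": 819, "mem": 245, "GRES/gpu": 245, "node": 327}
  # A 40-core, 160 GB, single-node request contributes roughly 13.9:
  print(tres_job_size({"cpu": 40, "mem": 160_000, "node": 1},
                      partition_max, tres_weights))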

Calculating job size relative to each partition is counterintuitive for our usage: jobs in partitions with fewer resources are likely to have routinely higher resource fractions than jobs in partitions with more resources. Calculating job size relative to the full complement of cluster resources would be a very useful option, but it does not yet exist. It is possible that the partition contribution to priority is intended as the way to balance against job size (a lower value on larger partitions), but that seems far more complicated than having a PRIORITY_* flag to select partition-relative versus global fractions.

The TRES weights should sum to a maximum of 1638 (see the job size factor above), with the weight on cpu dominating.

The Priority weight factors will be adjusted as follows:

Configuration Key        Old Value                         New Value
PriorityWeightAge        8000                              2457
PriorityWeightFairshare  4000                              1073737728
PriorityWeightJobSize    1                                 0
PriorityWeightPartition  2000                              0
PriorityWeightQOS        20000                             3221225472
PriorityWeightTRES       unset                             cpu=819,mem=245,GRES/gpu=245,node=327
PriorityFlags            FAIR_TREE,SMALL_RELATIVE_TO_TIME  FAIR_TREE
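
The new values can be sanity-checked against the bit-field design above (simple arithmetic, not Slurm configuration syntax):

  # The new weights match the 32-bit tier layout described earlier.
  assert 3221225472 == 3 << 30          # PriorityWeightQOS       = 0xC0000000
  assert 1073737728 == 262143 << 12     # PriorityWeightFairshare = 0x3FFFF000
  assert 2457 == 0x999                  # PriorityWeightAge, 60% of the low 12 bits
  # The TRES weights sum to at most the 1638 (0x666) job size budget:
  assert 819 + 245 + 245 + 327 <= 1638  # 1636 <= 1638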

Priorities will make use of the full 32 bits available, with the fair-share factor dominating and having the most precision assigned to it.

Since the PriorityWeightJobSize will not be used, the more complex "small-relative-to-time" algorithm will be disabled.

The modifications to slurm.conf must be pushed to all systems. The scontrol reconfigure command should be all that is required to activate the altered priority calculation scheme.

One of the test nodes containing 6 TiB of NVMe storage has been reconstructed with the NVMe devices configured as swap rather than as file storage. This configuration is currently being tested for viability as a solution for jobs requiring extremely large amounts of allocatable memory: the node has 256 GiB of physical RAM and 6 TiB of NVMe swap. A job that allocates more than 256 GiB of RAM will force the OS to move 4 KiB memory pages between the NVMe storage and the 256 GiB of physical RAM (swapping); the idea is that NVMe performance is a much closer match to physical RAM than a hard disk's, so the performance penalty may be low enough to make this design an attractive option for some workgroups.

A special-access partition has been added to feed jobs to this node. Access is by request only.

No downtime is expected to be required.

Date        Time   Goal/Description
2019-10-24         Authoring of this document
2020-03-09  10:45  Implementation