Revisions to Slurm Configuration v2.0.0 on Caviness


This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

Issues

Users submitting large numbers of jobs

There are currently no limits on the number of jobs each user can submit on Caviness. Submitted jobs must be repeatedly evaluated by Slurm to determine if/when they should execute. The evaluation includes:

Calculation of owning user's fair-share priority (based on decaying usage history)
Calculation of overall job priority (fair-share, wait time, size, partition id)
Sorting of all jobs in the queue based on priority
From the head of the queue up:
- Search for free resources matching requested resources
- Start execution if the job is eligible and resources are free

The fair-share calculations require extensive queries against the job database, and locating free resources is a complex operation. Thus, as the number of jobs that are pending (in the queue, not yet executing) increases, the time required to process all the jobs increases. Eventually the system may reach a point where the priority calculations and queue sort dominate the allotted scheduling run time.
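
To gauge how much of that workload each user contributes, the owners of the pending jobs can be tallied. The following is a minimal sketch rather than part of the Caviness configuration: the function name `count_pending_by_user` is made up for illustration, and it reads one user name per line from standard input so that it can be tried without a scheduler.

```shell
#!/bin/bash
# Tally pending jobs per user, busiest user first.
# Input: one user name per line on stdin (one line per pending job).
count_pending_by_user() {
    sort | uniq -c | sort -k1,1nr -k2,2
}

# On a Slurm cluster the input would come from squeue, for example:
#   squeue --noheader --states=PENDING --format=%u | count_pending_by_user
# Here a canned list of owners stands in for the squeue output.
printf '%s\n' alice bob alice carol alice bob | count_pending_by_user
```

With the canned input, alice is listed first with three jobs, followed by bob with two and carol with one.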

Many Caviness users are used to submitting a job and immediately seeing (via squeue --job=<job-id>) scheduling status for that job. Some special events (job submission, job completion) trigger Slurm's processing of a limited number of pending jobs at the head of the queue, to increase responsiveness. That limit (currently the default, 100 jobs) becomes less effective when the queue is filled with many jobs, and least effective when all of those jobs are owned by a single user who is already at his/her resource limits.

One reason the Slurm queue on Caviness can see degraded scheduling efficiency when filled with too many jobs relates to the ordering of the jobs, and thus to the job priority. Job priority is currently calculated as the weighted sum of the following factors (each of which is valued in the range [0.0,1.0]):

factor	multiplier	notes
qos override (priority-access)	20000	standard,devel=0.0, _workgroup_=1.0
wait time (age)	8000	longest wait time in queue=1.0
fair-share	4000	see `sshare`
partition id	2000	1.0 for all partitions
job resource size	1	largest resource request=1.0

Next to priority access, wait time is the largest factor: the longer a job waits to execute, the higher its priority to be scheduled. This seems appropriate, but it competes against the fair-share factor, which prioritizes jobs for underserved users.

Taken together, these factors allow a single user to submit thousands of jobs (even if s/he has a very small share of purchased cluster resources) that quickly sort to the head of the pending queue due to their wait time. The weight on wait time then begins to prioritize those jobs over jobs submitted by users who have not been using the cluster.

Solutions

Job submission limits

On many HPC systems per-user limits are enacted to restrict how many pending jobs can be present in the queue: for example, a limit of 10 jobs in the queue at once. When the 11th job is submitted by the user the submission fails and the job is NOT added to the queue. Each Slurm partition can have this sort of limit placed on it (both per-user and aggregate) and each QOS can override those limits.

It would be preferable to avoid enacting such limits on Caviness.

Altered priority weights

The dominance of wait time in priority calculations is probably the largest contributor to this problem. Wait time should not dominate over fair-share, so at the very least the magnitudes of those two weights must be reversed. The job size weight of 1 also tends to let small jobs starve out larger jobs on the cluster; in reality, wait time and job size should be considered on a more equal footing.
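
The effect of reversing the two weights can be checked with a little arithmetic. The sketch below evaluates the weighted sum from the table above for two illustrative jobs, first with the current weights and then with the wait-time and fair-share weights simply swapped. The swapped values are a hypothetical example, not the weights actually adopted, and the `priority` helper and the factor values chosen for the two jobs are assumptions made for illustration.

```shell
#!/bin/bash
# priority W_QOS W_AGE W_FAIRSHARE W_PARTITION W_SIZE  QOS AGE FAIRSHARE PARTITION SIZE
# Prints the weighted sum of the five factors, rounded to an integer.
priority() {
    awk -v wq="$1" -v wa="$2" -v wf="$3" -v wp="$4" -v ws="$5" \
        -v q="$6" -v a="$7" -v f="$8" -v p="$9" -v s="${10}" \
        'BEGIN { printf "%d\n", wq*q + wa*a + wf*f + wp*p + ws*s + 0.5 }'
}

current="20000 8000 4000 2000 1"    # weights from the table above
reversed="20000 4000 8000 2000 1"   # age and fair-share swapped (hypothetical)

# Job A: heavy user (fair-share 0.1) whose job has waited longest (age 1.0).
# Job B: idle user (fair-share 1.0) whose job was just submitted (age 0.0).
# Both run in the standard QOS (0.0); the size factor is set to 0 for simplicity.
for weights in "$current" "$reversed"; do
    # $weights is deliberately unquoted so that it splits into five arguments.
    echo "weights=[$weights] A=$(priority $weights 0 1.0 0.1 1.0 0) B=$(priority $weights 0 0.0 1.0 1.0 0)"
done
```

With the current weights this prints A=10400 and B=6000, so the long-waiting job of the heavy user outranks the new job of the idle user. With the two weights swapped it prints A=6800 and B=10000, and the order is reversed.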

The Slurm documentation also points out that the priority factor weights should be of large enough magnitude to allow a wide number of

Addition of GRES

New Generic RESource types will be added to represent the GPU devices present in Generation 2 nodes.

GRES name	Type	Description
gpu	v100	nVidia Volta GPU
gpu	t4	nVidia T4 GPU

Addition of Nodes

Kind	Features	Nodes	GRES
Baseline (2 x 20C, 192 GB)	Gen2,Gold-6230,6230,192GB	r03n[29-57]
Large memory (2 x 20C, 384 GB)	Gen2,Gold-6230,6230,384GB	r03n[00-23],r03n28
X-large memory (2 x 20C, 768 GB)	Gen2,Gold-6230,6230,768GB	r03n27
XX-large memory (2 x 20C, 1024 GB)	Gen2,Gold-6230,6230,1024GB	r03n[24-26]
Low-end GPU (2 x 20C, 192 GB, 1 x T4)	Gen2,Gold-6230,6230,192GB	r03g[00-02]	`gpu:t4:1`
Low-end GPU (2 x 20C, 384 GB, 1 x T4)	Gen2,Gold-6230,6230,384GB	r03g[03-04]	`gpu:t4:1`
Low-end GPU (2 x 20C, 768 GB, 1 x T4)	Gen2,Gold-6230,6230,768GB	r03g[07-08]	`gpu:t4:1`
All-purpose GPU (2 x 20C, 384 GB, 2 x V100)	Gen2,Gold-6230,6230,384GB	r03g05	`gpu:v100:2`
All-purpose GPU (2 x 20C, 768 GB, 2 x V100)	Gen2,Gold-6230,6230,768GB	r03g06	`gpu:v100:2`

The Features column is a comma-separated list of tags that a job can match against. For example, to request that a job execute on node(s) with Gold 6230 processors and (nominally) 768 GB of RAM:

$ sbatch --constraint='Gold-6230&768GB' …

All previous-generation nodes' feature lists will have Gen1 added to allow jobs to target a specific generation of the cluster:

$ sbatch --constraint=Gen1 …
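
Feature constraints and GRES requests can be combined in a batch script. The sketch below prints a small job script that asks for one V100 on a Gen2 node with (nominally) 768 GB of RAM, which the node table above suggests only r03g06 would satisfy. The function name `gen2_gpu_job_script`, the choice of the standard partition, and the script body are assumptions made for illustration.

```shell
#!/bin/bash
# Print a batch script requesting one V100 on a Gen2 node with 768 GB of RAM.
# Inside a job script the #SBATCH lines are parsed by sbatch, not by the shell,
# so the "&" in the constraint is not treated as a background operator there.
gen2_gpu_job_script() {
    cat <<'EOF'
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --gres=gpu:v100:1
#SBATCH --constraint=Gen2&768GB
echo "Running on $(hostname)"
EOF
}

# Typical use (requires Slurm, so it is left commented out):
#   gen2_gpu_job_script > gpu_job.sh && sbatch gpu_job.sh
gen2_gpu_job_script
```

On the command line the same constraint has to be quoted, because an unquoted `&` would end the command and run it in the background.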

Changes to Workgroup Accounts, QOS, and Shares

Any existing workgroups with additional purchased resource capacity will have their QOS updated to reflect the aggregate core count, memory capacity, and GPU count.
New workgroups will have an appropriate Slurm account created and populated with sponsored users. A QOS will be created with purchased core count, memory capacity, and GPU count.
Slurm cluster shares (for fair-share scheduling) are proportional to each workgroup's percentage of the full value of the cluster. All workgroups will have their relative scheduling priority adjusted accordingly.
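
The proportionality can be illustrated with a short calculation. This is a sketch under simplifying assumptions: the workgroup names and purchase values are invented, and the full value of the cluster is taken to be the sum of the listed values, which may not match how the actual shares are derived.

```shell
#!/bin/bash
# Read "workgroup purchase_value" lines on stdin and print each
# workgroup's percentage of the total value.
share_percent() {
    awk '{ name[NR] = $1; value[NR] = $2; total += $2 }
         END { for (i = 1; i <= NR; i++)
                   printf "%s %.1f\n", name[i], 100 * value[i] / total }'
}

# Hypothetical workgroups and purchase values, for illustration only.
printf '%s\n' 'wg_a 300000' 'wg_b 100000' 'wg_c 100000' | share_percent
```

For this made-up input the result is 60.0 for wg_a and 20.0 each for wg_b and wg_c. When another workgroup buys in, every existing percentage shrinks, which is why all workgroups have their relative priority adjusted.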

Changes to Partitions

The standard partition node list will be augmented to:

Nodes=r[00-01]n[01-55],r00g[01-04],r01g[00-04],r02s[00-01],r03n[00-57],r03g[00-08]

Various workgroups' priority-access partitions will also be modified to include node kinds purchased, and any new stakeholders will have their priority-access partition added.

Network Topology Changes

The addition of the new rack brings two new OPA switches into the high-speed network topology. The topology.conf file can be adjusted by running the opa2slurm utility once the rack is integrated and online.

Changes to auto_tmpdir

The directory removal error message has been changed to an internal informational message that users will not see. Additional changes were made to the plugin to address the race condition itself.

An additional option has been added to the plugin to request shared per-job (and per-step) temporary directories on the Lustre file system:

--use-shared-tmpdir     Create temporary directories on shared storage (overridden
                        by --tmpdir).  Use "--use-shared-tmpdir=per-node" to create
                        unique sub-directories for each node allocated to the job
                        (e.g. <base>/job_<jobid>/<nodename>).

Variant	Node	TMPDIR
job, no per-node	`r00n00`	`/lustre/scratch/slurm/job_12345`
job, no per-node	`r00n01`	`/lustre/scratch/slurm/job_12345`
step, no per-node	`r00n00`	`/lustre/scratch/slurm/job_12345/step_0`
step, no per-node	`r00n01`	`/lustre/scratch/slurm/job_12345/step_0`
job, per-node	`r00n00`	`/lustre/scratch/slurm/job_12345/r00n00`
job, per-node	`r00n01`	`/lustre/scratch/slurm/job_12345/r00n01`
step, per-node	`r00n00`	`/lustre/scratch/slurm/job_12345/r00n00/step_0`
step, per-node	`r00n01`	`/lustre/scratch/slurm/job_12345/r00n01/step_0`
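
The layout in the table can be summarized as a path-building rule: the job directory comes first, then an optional node name, then an optional step directory. The function below is only an illustration of that rule in shell; it is not the plugin's code, and the name `shared_tmpdir` and its argument order are assumptions.

```shell
#!/bin/bash
# shared_tmpdir BASE JOBID NODE PER_NODE [STEPID]
# PER_NODE is "yes" or "no"; STEPID is omitted for the job-level directory.
# The body runs in a subshell so its variable does not leak to the caller.
shared_tmpdir() (
    path="$1/job_$2"
    if [ "$4" = "yes" ]; then path="$path/$3"; fi
    if [ -n "${5:-}" ]; then path="$path/step_$5"; fi
    printf '%s\n' "$path"
)

shared_tmpdir /lustre/scratch/slurm 12345 r00n01 no
shared_tmpdir /lustre/scratch/slurm 12345 r00n01 yes 0
```

The first call prints `/lustre/scratch/slurm/job_12345` and the second prints `/lustre/scratch/slurm/job_12345/r00n01/step_0`.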

The auto_tmpdir plugin has already been compiled and debugged/tested on another Slurm cluster. The code has been compiled on Caviness; to activate it, it must be installed (make install) from the current build directory. The plugin is loaded by slurmstepd as jobs are launched but is not used by slurmd itself, so this change does not require a restart of slurmd on the Gen1 compute nodes.

Implementation

To make all of these changes atomic (in a sense), all nodes will be put in the DRAIN state so that no additional jobs can be scheduled, while jobs that are already running are left alone.

Next, the Slurm accounting database must be updated with:

Changes to cluster share for existing workgroup accounts
Changes to resource levels for the QOS's of existing workgroups that purchased Gen2 resources
Addition of new workgroup accounts
Addition of new workgroup QOS's

Adding new nodes and partitions to the Slurm configuration requires the scheduler (slurmctld) to be fully restarted.

Once slurmctld has been restarted, the execution daemons (slurmd) on Gen1 compute nodes can be informed of the new configuration using scontrol reconfigure; they should not require a restart. Finally, the Gen1 nodes can be shifted out of the DRAIN state and Gen2 compute nodes can have slurmd started.

The sequence of operations looks something like this (on r02mgmt00):

$ scontrol update nodename=r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01] state=DRAIN reason=reconfiguration
$ # …update existing workgroups' cluster share using sacctmgr…
$ # …update existing workgroups' QOS resource levels using sacctmgr…
$ # …add new workgroups' accounts using sacctmgr…
$ # …add new workgroups' QOS resource levels using sacctmgr…

The updated Slurm configuration must be pushed to all compute nodes:

$ wwsh provision set r03g\* --fileadd=gen2-gpu-cgroup.conf
$ wwsh provision set r03g[00-04] r03g[07-08] --fileadd=gen2-gpu-t4-gres.conf
$ wwsh provision set r03g[05-06] --fileadd=gen2-gpu-v100-gres.conf
$ wwsh file sync slurm-nodes.conf slurm-partitions.conf topology.conf
$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
193
$ # …wait…
$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
115
$ # …repeat until…
$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
0
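
The repeated check above can be wrapped in a small polling loop. The following is a minimal sketch: `wait_until_zero` and `count_stale` are hypothetical names, and the demonstration replaces the pdsh pipeline with a stand-in that counts down from a temporary file so that the loop can be run anywhere.

```shell
#!/bin/bash
# wait_until_zero INTERVAL COMMAND [ARGS...]
# Re-run COMMAND every INTERVAL seconds until it prints 0.
wait_until_zero() {
    local interval=$1 remaining
    shift
    while :; do
        remaining=$("$@")
        echo "nodes still holding the old file: $remaining"
        [ "$remaining" -eq 0 ] && break
        sleep "$interval"
    done
}

# In production, count_stale would wrap the pipeline shown above, e.g.:
#   count_stale() { pdsh -w "$nodes" ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l; }
# Stand-in for demonstration: report a counter kept in a file, then decrement it.
state=$(mktemp)
echo 2 > "$state"
count_stale() {
    local n
    n=$(cat "$state")
    echo "$n"
    if [ "$n" -gt 0 ]; then echo $((n - 1)) > "$state"; fi
}

wait_until_zero 0 count_stale
rm -f "$state"
```

The stand-in reports 2, 1, and then 0, at which point the loop exits. An interval of 30 seconds or more would be more sensible against real nodes.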

Now that all nodes have the correct configuration files, the scheduler configuration is copied into place and both instances are restarted:

$ sudo cp /opt/slurm-conf/controller/gres.conf /etc/slurm/gres.conf
$ sudo cp /opt/slurm-conf/nodes.conf /etc/slurm/nodes.conf
$ sudo cp /opt/slurm-conf/partitions.conf /etc/slurm/partitions.conf
$ sudo cp /opt/slurm-conf/topology.conf /etc/slurm/topology.conf
$ sudo rsync -arv /etc/slurm/ root@r02mgmt01:/etc/slurm/
$ sudo systemctl restart slurmctld
$ sudo ssh r02mgmt01 systemctl restart slurmctld

Return the Gen1 nodes to service:

$ scontrol reconfigure
$ scontrol update nodename=r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01] state=UNDRAIN

And the Gen2 nodes can have their Slurm service started:

$ pdsh -w r03n[00-57],r03g[00-08] systemctl start slurmd
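
As a sanity check on the host lists used in these commands, the bracketed ranges can be expanded and counted. On a Slurm host, `scontrol show hostnames` performs this expansion; the sketch below approximates it with awk so that it runs without Slurm. The helper names are assumptions made for illustration.

```shell
#!/bin/bash
# expand_range PREFIX FIRST LAST
# Print PREFIX followed by each number in FIRST..LAST, zero-padded to 2 digits.
expand_range() {
    awk -v p="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { for (i = lo; i <= hi; i++) printf "%s%02d\n", p, i }'
}

gen1_nodes() {   # r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01]
    expand_range r00n 0 56; expand_range r01n 0 56
    expand_range r00g 0 4;  expand_range r01g 0 4
    expand_range r02s 0 1
}

gen2_nodes() {   # r03n[00-57],r03g[00-08]
    expand_range r03n 0 57; expand_range r03g 0 8
}

# Count the lines arriving on stdin.
count() { awk 'END { print NR }'; }

echo "Gen1: $(gen1_nodes | count)"
echo "Gen2: $(gen2_nodes | count)"
echo "all:  $( { gen1_nodes; gen2_nodes; } | count)"
```

This reports 126 Gen1 nodes and 67 Gen2 nodes, 193 in total, which agrees with the first count returned by the pdsh check above.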

Impact

No downtime is expected to be required.

Timeline

Date	Time	Goal/Description
2019-09-09		Authoring of this document
2019-10-23	11:00	Changes made
