This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.
The Slurm configuration major version will increase to 2 for these changes, reflecting the addition of Generation 2 hardware to the cluster. The changes described herein will represent v2.0.0.
At the end of June 2019, the first addition of resources to the Caviness cluster was purchased. The purchase adds:
Beyond simply booting the new nodes, the Slurm configuration must be adjusted to account for the new hardware, the new workgroups, and adjustments to the resource limits of existing workgroups that purchased Generation 2 nodes.
Our Slurm auto_tmpdir plugin automatically manages per-job (and per-step) temporary file storage:
The plugin offers options to:
/tmp
Normally the plugin creates directories on local scratch storage (/tmp): the temporary files on one node are not visible to other nodes participating in the job. Placing the TMPDIR on Lustre would mean all nodes participating in the job could see the same temporary files. Users could leverage the existing --tmpdir=<path> flag with their own arbitrary path on /lustre/scratch, but that would spread such files across the file system. Having a flag that selects an IT-specified path would keep all shared TMPDIR storage collocated on the file system. This also opens the possibility in the future of having that directory (and all content under it) make use of an OST pool backed by faster media (SSD, NVMe).
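As an illustration of how a job script typically consumes the directory the plugin provides, the minimal sketch below writes scratch data under TMPDIR. The fallback to /tmp when the variable is unset (for example, when the script runs outside of Slurm) and the `scratch.` file name prefix are assumptions made for this example only.

```shell
#!/bin/bash
# Use the per-job directory exported as TMPDIR; fall back to /tmp when
# the variable is unset (an assumption for this example).
workdir="${TMPDIR:-/tmp}"

# Create a uniquely named scratch file inside that directory.
scratch_file="$(mktemp "${workdir}/scratch.XXXXXX")"
echo "intermediate data" > "${scratch_file}"
echo "scratch file: ${scratch_file}"
```

Whether the file is visible to other nodes in the job depends only on where TMPDIR points: a node-local /tmp path is private to each node, while a path on Lustre is shared.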
Some users have seen the following message in a job output file and assumed their job failed:
auto_tmpdir: remote: failed stat check of <tmpdir> (uid = <uid#>, st_mode = <perms>, errno = <int-code>)
Slurm was reporting this as an error when, in reality, the message was the result of a race condition whereby the job's temporary directory was removed before a job step had completed and removed its own temporary directory (which is inside the job's temporary directory, e.g. /tmp/job_34523/step_1). While it is logically an error in the context of the job scheduler, it is not an error from the point of view of the job or the user who submitted the job.
The following changes to the Slurm configuration will be necessary.
New Generic RESource types will be added to represent the GPU devices present in Generation 2 nodes.
GRES name | Type | Description |
---|---|---|
gpu | v100 | NVIDIA Volta (V100) GPU |
gpu | t4 | NVIDIA T4 GPU |
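A job would select one of these GRES types with Slurm's standard `--gres` option. The batch script below is a minimal sketch, not a site-provided template: the body simply reports which devices the job can see, and it assumes Slurm exports CUDA_VISIBLE_DEVICES for GPU allocations, which is its usual behavior when GPU GRES is configured.

```shell
#!/bin/bash
#SBATCH --gres=gpu:t4:1
# The directive above requests one T4; gpu:v100:2 would request both
# V100 devices in an all-purpose GPU node.

# Outside of a job the variable is unset, so report that case explicitly.
gpu_list="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpu_list}"
```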
Kind | Features | Nodes | GRES |
---|---|---|---|
Baseline (2 x 20C, 192 GB) | Gen2,Gold-6230,6230,192GB | r03n[29-57] | |
Large memory (2 x 20C, 384 GB) | Gen2,Gold-6230,6230,384GB | r03n[00-23],r03n28 | |
X-large memory (2 x 20C, 768 GB) | Gen2,Gold-6230,6230,768GB | r03n27 | |
XX-large memory (2 x 20C, 1024 GB) | Gen2,Gold-6230,6230,1024GB | r03n[24-26] | |
Low-end GPU (2 x 20C, 192 GB, 1 x T4) | Gen2,Gold-6230,6230,192GB | r03g[00-02] | gpu:t4:1 |
Low-end GPU (2 x 20C, 384 GB, 1 x T4) | Gen2,Gold-6230,6230,384GB | r03g[03-04] | gpu:t4:1 |
Low-end GPU (2 x 20C, 768 GB, 1 x T4) | Gen2,Gold-6230,6230,768GB | r03g[07-08] | gpu:t4:1 |
All-purpose GPU (2 x 20C, 384 GB, 2 x V100) | Gen2,Gold-6230,6230,384GB | r03g05 | gpu:v100:2 |
All-purpose GPU (2 x 20C, 768 GB, 2 x V100) | Gen2,Gold-6230,6230,768GB | r03g06 | gpu:v100:2 |
The Features column is a comma-separated list of tags that a job can match against. For example, to request that a job execute on node(s) with Gold 6230 processors and (nominally) 768 GB of RAM:
$ sbatch --constraint='Gold-6230&768GB' …
All previous-generation nodes' feature lists will have Gen1 added to allow jobs to target a specific generation of the cluster:
$ sbatch --constraint=Gen1 …
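When several feature tags must all match, they are joined with `&`, which the shell would otherwise treat as its background operator. The sketch below shows one way to build and quote such a constraint; `join_features` and the script name `job.qs` are hypothetical names used only for illustration, and the command is printed rather than submitted.

```shell
#!/bin/bash
# Join feature tags with '&' (the AND operator in a --constraint expression).
# join_features is a hypothetical helper, not part of Slurm.
join_features() {
    local IFS='&'
    echo "$*"
}

constraint="$(join_features Gen2 Gold-6230 768GB)"

# Keep the expansion quoted: an unquoted '&' typed on the command line
# would be parsed by the shell as the background operator.
echo sbatch "--constraint=${constraint}" job.qs
```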
The standard partition node list will be augmented to:
Nodes=r[00-01]n[01-55],r00g[01-04],r01g[00-04],r02s[00-01],r03n[00-57],r03g[00-08]
Various workgroups' priority-access partitions will also be modified to include node kinds purchased, and any new stakeholders will have their priority-access partition added.
The addition of the new rack brings two new OPA switches into the high-speed network topology. The topology.conf file can be adjusted by running the opa2slurm utility once the rack is integrated and online.
The directory removal error message has been changed to an internal informational message that users will not see. Additional changes were made to the plugin to address the race condition itself.
An additional option has been added to the plugin to request shared per-job (and per-step) temporary directories on the Lustre file system:
--use-shared-tmpdir Create temporary directories on shared storage (overridden by --tmpdir). Use "--use-shared-tmpdir=per-node" to create unique sub-directories for each node allocated to the job (e.g. <base>/job_<jobid>/<nodename>).
Variant | Node | TMPDIR |
---|---|---|
job, no per-node | r00n00 | /lustre/scratch/slurm/job_12345 |
job, no per-node | r00n01 | /lustre/scratch/slurm/job_12345 |
step, no per-node | r00n00 | /lustre/scratch/slurm/job_12345/step_0 |
step, no per-node | r00n01 | /lustre/scratch/slurm/job_12345/step_0 |
job, per-node | r00n00 | /lustre/scratch/slurm/job_12345/r00n00 |
job, per-node | r00n01 | /lustre/scratch/slurm/job_12345/r00n01 |
step, per-node | r00n00 | /lustre/scratch/slurm/job_12345/r00n00/step_0 |
step, per-node | r00n01 | /lustre/scratch/slurm/job_12345/r00n01/step_0 |
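The naming pattern in the table, which follows the `<base>/job_<jobid>/<nodename>` form given in the option description, can be sketched as a small shell function. `shared_tmpdir` is a hypothetical helper written for this illustration; it mirrors the table and is not the plugin's actual code.

```shell
#!/bin/bash
# shared_tmpdir BASE JOBID NODE PER_NODE [STEP]
# Hypothetical helper mirroring the table; PER_NODE is "yes" or "no".
shared_tmpdir() {
    local base="$1" jobid="$2" node="$3" per_node="$4" step="${5:-}"
    local path="${base}/job_${jobid}"

    # Per-node mode inserts a sub-directory named after the node.
    if [ "${per_node}" = "yes" ]; then
        path="${path}/${node}"
    fi

    # A job step gets its own directory beneath the job's (or node's).
    if [ -n "${step}" ]; then
        path="${path}/step_${step}"
    fi
    echo "${path}"
}

shared_tmpdir /lustre/scratch/slurm 12345 r00n01 no
shared_tmpdir /lustre/scratch/slurm 12345 r00n01 yes 0
```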
The auto_tmpdir plugin has already been compiled, debugged, and tested on another Slurm cluster. The code has been compiled on Caviness; to activate it, it must be installed (make install) from the current build directory. The plugin is loaded by slurmstepd as jobs are launched but is not used by slurmd itself, so a restart of slurmd on all Gen1 compute nodes is not necessary in this regard.
To make all of these changes atomic (in a sense), all nodes will be put in the DRAIN state to prevent additional jobs from being scheduled, while jobs that are already running are left alone.
Next, the Slurm accounting database must be updated with:
Adding new nodes and partitions to the Slurm configuration requires the scheduler (slurmctld) to be fully restarted.
Once slurmctld has been restarted, the execution daemons (slurmd) on Gen1 compute nodes can be informed of the new configuration using scontrol reconfigure; they should not require a restart. Finally, the Gen1 nodes can be shifted out of the DRAIN state and Gen2 compute nodes can have slurmd started.
The sequence of operations looks something like this (on r02mgmt00):
$ scontrol update nodename=r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01] state=DRAIN reason=reconfiguration
$ # …update existing workgroups' cluster share using sacctmgr…
$ # …update existing workgroups' QOS resource levels using sacctmgr…
$ # …add new workgroups' accounts using sacctmgr…
$ # …add new workgroups' QOS resource levels using sacctmgr…
The updated Slurm configuration must be pushed to all compute nodes:
$ wwsh provision set r03g\* --fileadd=gen2-gpu-cgroup.conf
$ wwsh provision set r03g[00-04] r03g[07-08] --fileadd=gen2-gpu-t4-gres.conf
$ wwsh provision set r03g[05-06] --fileadd=gen2-gpu-v100-gres.conf
$ wwsh file sync slurm-nodes.conf slurm-partitions.conf topology.conf
$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
193
$ # …wait…
$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
115
$ # …repeat until…
$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
0
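The repeated check above counts the nodes whose nodes.conf still matches the grep pattern (presumably the old file) and is re-run by hand until the count reaches zero. That wait could be automated with a polling loop along the following lines. This is only a sketch: `wait_until_zero` is a hypothetical helper, and the counting command must be wrapped in a function of your own, as outlined in the closing comment.

```shell
#!/bin/bash
# wait_until_zero INTERVAL COMMAND...
# Re-run COMMAND until it prints 0, sleeping INTERVAL seconds between tries.
# Hypothetical helper; COMMAND must print a single integer.
wait_until_zero() {
    local interval="$1"
    shift
    local count
    while :; do
        count="$("$@")"
        echo "nodes still pending: ${count}"
        if [ "${count}" -eq 0 ]; then
            break
        fi
        sleep "${interval}"
    done
}

# Example (not executed here): wrap the pdsh pipeline in a function, e.g.
#   count_stale() { pdsh -w <nodes> ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l; }
#   wait_until_zero 30 count_stale
```

Passing the command as a function name keeps the pipeline intact; handing the pipeline itself to the helper would not work, because the shell would split it at the `|` before the helper ever ran.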
Now that all nodes have the correct configuration files, the scheduler configuration is copied into place and both instances are restarted:
$ sudo cp /opt/slurm-conf/controller/gres.conf /etc/slurm/gres.conf
$ sudo cp /opt/slurm-conf/nodes.conf /etc/slurm/nodes.conf
$ sudo cp /opt/slurm-conf/partitions.conf /etc/slurm/partitions.conf
$ sudo cp /opt/slurm-conf/topology.conf /etc/slurm/topology.conf
$ sudo rsync -arv /etc/slurm/ root@r02mgmt01:/etc/slurm/
$ sudo systemctl restart slurmctld
$ sudo ssh r02mgmt01 systemctl restart slurmctld
Return the Gen1 nodes to service:
$ scontrol reconfigure
$ scontrol update nodename=r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01] state=UNDRAIN
And the Gen2 nodes can have their Slurm service started:
$ pdsh -w r03n[00-57],r03g[00-08] systemctl start slurmd
No downtime is expected to be required.
Date | Time | Goal/Description |
---|---|---|
2019-09-09 | | Authoring of this document |
2019-10-23 | 11:00 | Changes made |