Revisions to Slurm Configuration v1.1.5 on Caviness

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

Issues

Addition of Generation 2 nodes to cluster

At the end of June 2019, the first addition of resources to the Caviness cluster was purchased. The purchase adds:

70 new nodes of varying specification
2 new kinds of GPU (V100 and T4)
Several new stakeholder workgroups

Beyond simply booting the new nodes, the Slurm configuration must be adjusted to account for the new hardware, the new workgroups, and adjustments to the resource limits of existing workgroups that purchased Generation 2 nodes.

Minor changes to automatic TMPDIR plugin

Our Slurm auto_tmpdir plugin automatically manages per-job (and per-step) temporary file storage:

When a job starts, a temporary directory is created and `TMPDIR` is set accordingly in the job's environment
Optionally, when a job step starts, a temporary directory is created within the job's directory and `TMPDIR` is set accordingly in the step's environment
As steps complete, their temporary directories are removed
When the job completes, its temporary directory is removed

The plugin offers options to:

prohibit removal of the temporary directories
create per-job directories in a directory other than /tmp
prohibit the creation of per-step temporary directories (steps inherit the job temporary directory)
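
From inside a job, the directory the plugin prepared is reached through `TMPDIR`. The following is a minimal Python sketch of a program that places its scratch files there; the helper names `job_scratch_dir` and `scratch_file` are hypothetical, and the `/tmp` fallback is an assumption for runs outside the scheduler.

```python
import os
import tempfile


def job_scratch_dir() -> str:
    """Return the directory named by TMPDIR, falling back to /tmp.

    Under auto_tmpdir the variable is expected to name the per-job (or
    per-step) directory. The /tmp fallback is an assumption for runs
    outside of a Slurm job.
    """
    return os.environ.get("TMPDIR", "/tmp")


def scratch_file(suffix: str = "") -> str:
    """Create an empty, uniquely named file in the scratch directory."""
    fd, path = tempfile.mkstemp(suffix=suffix, dir=job_scratch_dir())
    os.close(fd)  # the caller reopens the file by path as needed
    return path
```

Because the plugin removes the directory when the job (or step) completes, anything written this way that must outlive the job has to be copied elsewhere before the job ends.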

Normally the plugin creates directories on local scratch storage (/tmp), so the temporary files on one node are not visible to other nodes participating in the job. Placing the TMPDIR on Lustre would mean all nodes participating in the job could see the same temporary files. Users could use the existing `--tmpdir=<path>` flag with their own arbitrary path on /lustre/scratch, but that would scatter such files across the file system. A flag that selects an IT-specified path would keep all shared TMPDIR storage collocated on the file system. It also opens the possibility of later having that directory (and all content under it) use an OST pool backed by faster media (SSD, NVMe).

Some users have seen the following message in a job output file and assumed their job failed:

auto_tmpdir: remote: failed stat check of <tmpdir> (uid = <uid#>, st_mode = <perms>, errno = <int-code>)

Slurm was reporting this as an error when, in reality, the message was the result of a race condition whereby the job's temporary directory was removed before a job step had completed and removed its own temporary directory (which is inside the job's temporary directory, e.g. /tmp/job_34523/step_1). While it is logically an error in the context of the job scheduler, it is not an error from the point of view of the job or the user who submitted the job.
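
The ordering problem can be reproduced in a few lines. This is only an illustration of the race in Python, not the plugin's own code; `remove_tree_tolerant` is a hypothetical helper that treats an already-missing directory as a non-error, which matches the user's point of view described above.

```python
import os
import shutil
import tempfile


def remove_tree_tolerant(path: str) -> bool:
    """Remove a directory tree; return False if it was already gone."""
    try:
        shutil.rmtree(path)
        return True
    except FileNotFoundError:
        # Someone else removed it first: nothing is left to clean up.
        return False


# Recreate the layout from the text: a step directory inside a job directory.
base = tempfile.mkdtemp()
job_dir = os.path.join(base, "job_34523")
step_dir = os.path.join(job_dir, "step_1")
os.makedirs(step_dir)

# The job-level cleanup wins the race and removes everything ...
job_removed = remove_tree_tolerant(job_dir)
# ... so the later step-level cleanup finds nothing to remove.
step_removed = remove_tree_tolerant(step_dir)

os.rmdir(base)
print(job_removed, step_removed)  # True False
```

In both orderings the files end up deleted; only the second cleanup's bookkeeping differs, which is why the message did not indicate a failed job.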

Solutions

Configuration changes

The following changes to the Slurm configuration will be necessary.

Addition of GRES

New Generic RESource types will be added to represent the GPU devices present in Generation 2 nodes.

GRES name	Type	Description
gpu	v100	NVIDIA Volta GPU
gpu	t4	NVIDIA T4 GPU
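
Jobs select one of these devices with Slurm's `--gres` option, whose value has the form `name:type:count` and is counted per node. The sketch below builds that value in Python; `gres_request` and `MAX_GPUS_PER_NODE` are hypothetical names, and the per-node maxima are read off the node table in this document rather than from the live configuration.

```python
# GPUs per node for each GRES type, taken from the "Addition of Nodes" table.
MAX_GPUS_PER_NODE = {"v100": 2, "t4": 1}


def gres_request(gpu_type: str, count: int = 1) -> str:
    """Build a value for Slurm's --gres option, e.g. 'gpu:v100:2'."""
    if gpu_type not in MAX_GPUS_PER_NODE:
        raise ValueError(f"unknown GPU type: {gpu_type!r}")
    limit = MAX_GPUS_PER_NODE[gpu_type]
    if not 1 <= count <= limit:
        raise ValueError(f"{gpu_type} nodes offer at most {limit} GPU(s) per node")
    return f"gpu:{gpu_type}:{count}"


print("--gres=" + gres_request("v100", 2))  # --gres=gpu:v100:2
print("--gres=" + gres_request("t4"))       # --gres=gpu:t4:1
```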

Addition of Nodes

Kind	Features	Nodes
Baseline (2 x 20C, 192 GB)	Gold-6230,6230,192GB	r03n[29-57]
Large memory (2 x 20C, 384 GB)	Gold-6230,6230,384GB	r03n[00-23],r03n28
X-large memory (2 x 20C, 768 GB)	Gold-6230,6230,768GB	r03n27
XX-large memory (2 x 20C, 1024 GB)	Gold-6230,6230,1024GB	r03n[24-26]
Low-end GPU (2 x 20C, 192 GB, 1 x T4)	Gold-6230,6230,192GB	r03g[00-02]
Low-end GPU (2 x 20C, 384 GB, 1 x T4)	Gold-6230,6230,384GB	r03g[03-04]
Low-end GPU (2 x 20C, 768 GB, 1 x T4)	Gold-6230,6230,768GB	r03g[07-08]
All-purpose GPU (2 x 20C, 384 GB, 2 x V100)	Gold-6230,6230,384GB	r03g05
All-purpose GPU (2 x 20C, 768 GB, 2 x V100)	Gold-6230,6230,768GB	r03g06
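
The Nodes column uses Slurm's bracketed hostlist notation. On the cluster, `scontrol show hostnames <expression>` expands such an expression; the Python sketch below does the same for the simple forms used in this document (numeric ranges and comma lists inside brackets). The function names are hypothetical, and the parser does not attempt to cover every form Slurm accepts.

```python
import itertools
import re


def split_top_level(expr: str) -> list:
    """Split on commas that are not inside [...]."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_bracket(body: str) -> list:
    """Expand '00-02,05' to ['00', '01', '02', '05'], keeping zero padding."""
    out = []
    for item in body.split(","):
        lo, _, hi = item.partition("-")
        hi = hi or lo  # a single number is a range of one
        width = len(lo)
        out.extend(f"{n:0{width}d}" for n in range(int(lo), int(hi) + 1))
    return out


def expand_hostlist(expr: str) -> list:
    """Expand e.g. 'r03n[00-23],r03n28' into individual node names."""
    hosts = []
    for part in split_top_level(expr):
        # re.split with a capture group alternates literal text and bracket bodies.
        tokens = re.split(r"\[([^\]]*)\]", part)
        choices = [
            [tok] if i % 2 == 0 else expand_bracket(tok)
            for i, tok in enumerate(tokens)
        ]
        hosts.extend("".join(combo) for combo in itertools.product(*choices))
    return hosts


print(len(expand_hostlist("r03n[00-23],r03n28")))  # 25
print(expand_hostlist("r03g[07-08]"))              # ['r03g07', 'r03g08']
```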

Changes to Workgroup Accounts, QOS, and Shares

Any existing workgroups with additional purchased resource capacity will have their QOS updated to reflect the aggregate core count, memory capacity, and GPU count.
New workgroups will have an appropriate Slurm account created and populated with sponsored users. A QOS will be created with purchased core count, memory capacity, and GPU count.
Slurm cluster shares (for fair-share scheduling) are proportional to each workgroup's percentage of the full value of the cluster. All workgroups will have their relative scheduling priority adjusted accordingly.
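
The share rule in the last item is a simple proportion, sketched below in Python. The workgroup names and purchase values are invented for illustration, and `fair_share_fractions` is a hypothetical helper; the actual accounting used to value the cluster is not described in this document.

```python
def fair_share_fractions(purchase_value: dict) -> dict:
    """Map each workgroup to its fraction of the total cluster value."""
    total = sum(purchase_value.values())
    if total <= 0:
        raise ValueError("total cluster value must be positive")
    return {group: value / total for group, value in purchase_value.items()}


# Hypothetical values before the expansion ...
before = fair_share_fractions({"wg_a": 300_000, "wg_b": 100_000})
# ... and after a new stakeholder (wg_c) buys in.
after = fair_share_fractions({"wg_a": 300_000, "wg_b": 100_000, "wg_c": 100_000})

for group in after:
    print(group, before.get(group, 0.0), after[group])
```

A workgroup that bought nothing new still sees its fraction fall once the total grows, which is why every workgroup's relative priority is adjusted and not only those that purchased Generation 2 nodes.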

Changes to Partitions

The standard partition node list will be augmented to:

Nodes=r[00-01]n[01-55],r00g[01-04],r01g[00-04],r02s[00-01],r03n[00-57],r03g[00-08]

Various workgroups' priority-access partitions will also be modified to include node kinds purchased, and any new stakeholders will have their priority-access partition added.

Changes to auto_tmpdir

The directory removal error message has been changed to an internal informational message that users will not see. Additional changes were made to the plugin to address the race condition itself.

An additional option has been added to the plugin to request shared per-job (and per-step) temporary directories on the Lustre file system:

--use-shared-tmpdir     Create temporary directories on shared storage (overridden
                        by --tmpdir).  Use "--use-shared-tmpdir=per-node" to create
                        unique sub-directories for each node allocated to the job
                        (e.g. <base>/job_<jobid>/<nodename>).

Variant	Node	TMPDIR
job, no per-node	`r00n00`	`/lustre/scratch/slurm/job_12345`
job, no per-node	`r00n01`	`/lustre/scratch/slurm/job_12345`
step, no per-node	`r00n00`	`/lustre/scratch/slurm/job_12345/step_0`
step, no per-node	`r00n01`	`/lustre/scratch/slurm/job_12345/step_0`
job, per-node	`r00n00`	`/lustre/scratch/slurm/job_12345/r00n00`
job, per-node	`r00n01`	`/lustre/scratch/slurm/job_12345/r00n01`
step, per-node	`r00n00`	`/lustre/scratch/slurm/job_12345/r00n00/step_0`
step, per-node	`r00n01`	`/lustre/scratch/slurm/job_12345/r00n01/step_0`
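
The naming scheme in the table can be written as a small function. This is a sketch of the layout only, not the plugin's code: `shared_tmpdir` is a hypothetical name, and the default base path is the one shown in the table's examples.

```python
import posixpath


def shared_tmpdir(job_id, node, step_id=None, per_node=False,
                  base="/lustre/scratch/slurm"):
    """Compose the shared TMPDIR path for one node of a job.

    The node name only appears in the path when per-node directories
    were requested; step directories nest inside the job (or node) one.
    """
    parts = [base, f"job_{job_id}"]
    if per_node:
        parts.append(node)
    if step_id is not None:
        parts.append(f"step_{step_id}")
    return posixpath.join(*parts)


print(shared_tmpdir(12345, "r00n01"))
# /lustre/scratch/slurm/job_12345
print(shared_tmpdir(12345, "r00n01", step_id=0, per_node=True))
# /lustre/scratch/slurm/job_12345/r00n01/step_0
```

Without the per-node variant every node of the job resolves to the same path, so processes on different nodes must choose distinct file names themselves.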

Implementation

The addition of new partitions, nodes, and network topology information to the Slurm configuration should not require a full restart of all daemons.

Impact

No downtime is expected.

Timeline

Date	Time	Goal/Description
2019-09-09		Authoring of this document
		Changes made
