====== Revisions to Slurm Configuration v1.1.5 on Caviness ======
This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Addition of Generation 2 nodes to cluster ====

At the end of June 2019, the first addition of resources to the Caviness cluster was purchased. The expansion comprises:

  * 70 new nodes of varying specification
  * 2 new kinds of GPU (V100 and T4)
  * Several new stakeholder workgroups

Beyond simply booting the new nodes, the Slurm configuration must be adjusted to account for the new hardware, the new workgroups, and the revised resource limits of existing workgroups that purchased Generation 2 nodes.

==== Minor changes to automatic TMPDIR plugin ====

Our Slurm ''auto_tmpdir'' plugin automatically manages temporary directories for jobs:

  - When a job starts, a temporary directory is created and ''TMPDIR'' is set accordingly in the job's environment
  - Optionally, when a job step starts, a temporary directory within the job directory is created and ''TMPDIR'' is set accordingly in the step's environment
  - As steps complete, their temporary directories are removed
  - When the job completes, its temporary directory is removed
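
For example (a minimal sketch; the resource requests and commands are placeholders), a batch script can use ''TMPDIR'' directly without creating or cleaning up any directories itself:

<code bash>
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# The plugin has already created the per-job temporary directory and
# pointed TMPDIR at it before this script begins executing.
echo "Job-level TMPDIR: ${TMPDIR}"

# A job step may receive its own TMPDIR beneath the job's directory;
# it is removed automatically when the step finishes.
srun --ntasks=1 bash -c 'echo "Step-level TMPDIR: ${TMPDIR}"'
</code>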

The plugin offers options to:

  - prohibit removal of the temporary directories
  - create per-job directories in a directory other than the default (''/…'')
  - prohibit the creation of per-step temporary directories (steps inherit the job temporary directory)
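
The alternate-location behavior is exposed via the ''--tmpdir'' option mentioned later in this document; a hedged example, with an entirely illustrative path and script name:

<code bash>
# Ask the plugin to create the per-job temporary directory under an
# alternate base path (the path shown here is purely illustrative)
$ sbatch --tmpdir=/lustre/scratch/my_workgroup job_script.qs
</code>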

Normally the plugin creates its directories on each node's local scratch storage (''/…'').

Some users have seen the following message in a job output file and assumed their job failed:

<code>
auto_tmpdir: …
</code>

Slurm was reporting this as an error when, in reality, the message was the result of a race condition whereby the job's temporary directory was removed //before// a job step had completed and removed its own temporary directory (which lies inside the job's temporary directory). The message does not indicate that the job itself failed.

===== Solutions =====

==== Configuration changes ====

The following changes to the Slurm configuration will be necessary.

=== Addition of GRES ===

New Generic RESource types will be added to represent the GPU devices present in Generation 2 nodes.

^GRES name^Type^Description^
|gpu|v100|NVIDIA V100 (Volta) GPU|
|gpu|t4|NVIDIA T4 GPU|
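
Once these GRES are defined, jobs will be able to request the new GPU types by name. A brief illustration (the device counts and omitted options are placeholders):

<code bash>
# Request a single T4 device (illustrative)
$ sbatch --gres=gpu:t4:1 …

# Request both V100 devices in an all-purpose GPU node (illustrative)
$ sbatch --gres=gpu:v100:2 …
</code>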

=== Addition of Nodes ===

^Kind^Features^Nodes^GRES^
|Baseline (2 x 20C, 192 GB)|Gen2, …|…| |
|Large memory (2 x 20C, 384 GB)|Gen2, …|…| |
|X-large memory (2 x 20C, 768 GB)|Gen2, …|…| |
|XX-large memory (2 x 20C, 1024 GB)|Gen2, …|…| |
|Low-end GPU (2 x 20C, 192 GB, 1 x T4)|Gen2, …|…|gpu:t4:1|
|Low-end GPU (2 x 20C, 384 GB, 1 x T4)|Gen2, …|…|gpu:t4:1|
|Low-end GPU (2 x 20C, 768 GB, 1 x T4)|Gen2, …|…|gpu:t4:1|
|All-purpose GPU (2 x 20C, 384 GB, 2 x V100)|Gen2, …|…|gpu:v100:2|
|All-purpose GPU (2 x 20C, 768 GB, 2 x V100)|Gen2, …|…|gpu:v100:2|

The Features column is a comma-separated list of tags that a job can match against, for example:

<code bash>
$ sbatch --constraint='Gold-6230&…' …
</code>

All previous-generation nodes' feature lists will have ''Gen1'' added to them, so a job can be constrained to run only on first-generation hardware:

<code bash>
$ sbatch --constraint=Gen1 …
</code>
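
To review the feature tags and GRES each node advertises, something like the following can be used (the output format string is illustrative):

<code bash>
# List every node with its feature tags and generic resources
$ sinfo -N -o '%N %f %G'
</code>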


=== Changes to Workgroup Accounts, QOS, and Shares ===

  * Any existing workgroups with additional purchased resource capacity will have their QOS updated to reflect the aggregate core count, memory capacity, and GPU count.
  * New workgroups will have an appropriate Slurm account created and populated with sponsored users.
  * Slurm cluster shares (for fair-share scheduling) are proportional to each workgroup's purchased resources, so shares will be updated to reflect the Generation 2 purchases.
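
A hedged sketch of the kind of ''sacctmgr'' operations involved; the account names, user, and TRES values below are purely illustrative:

<code bash>
# Raise an existing workgroup's aggregate limits to cover its Generation 2 purchase
$ sacctmgr modify qos where name=some_workgroup set GrpTRES=cpu=400,mem=1536G,gres/gpu=2

# Create an account for a new stakeholder workgroup and add a sponsored user
$ sacctmgr add account new_workgroup cluster=caviness
$ sacctmgr add user jdoe account=new_workgroup

# Adjust fair-share weighting in proportion to purchased resources
$ sacctmgr modify account where name=new_workgroup set fairshare=10
</code>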

=== Changes to Partitions ===

The **standard** partition node list will be augmented to:

<code>
Nodes=r[00-01]n[01-55],…
</code>

Various workgroups' partitions will likewise have their newly purchased nodes added to their node lists.
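
Once the change is in place, the updated node lists can be confirmed with, for example:

<code bash>
# Display the standard partition's node list and limits
$ scontrol show partition standard
</code>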

=== Network Topology Changes ===

The addition of the new rack brings two new OPA switches into the high-speed network topology.
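
The switch hierarchy Slurm is using can be checked after the update (assuming the tree topology plugin remains in use):

<code bash>
# Show the switch/node topology Slurm currently knows about
$ scontrol show topology
</code>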


==== Changes to auto_tmpdir ====

The directory removal error message has been changed to an internal informational message that users will not see. Additional changes were made to the plugin to address the race condition itself.

An additional option has been added to the plugin to request shared per-job (and per-step) temporary directories on the Lustre file system:

<code>
--use-shared-tmpdir     Create temporary directories on shared (Lustre) storage
                        rather than local scratch (the location can be overridden
                        by --tmpdir).  Use "--use-shared-tmpdir=per-node" to create
                        unique sub-directories for each node allocated to the job
                        (e.g. <…>).
</code>

^Variant^Node^TMPDIR^
|job, no per-node|…|…|
|:::|…|…|
|step, no per-node|…|…|
|:::|…|…|
|job, per-node|…|…|
|:::|…|…|
|step, per-node|…|…|
|:::|…|…|
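
A hedged usage sketch, assuming the per-node behavior is selected with ''--use-shared-tmpdir=per-node'' as outlined in the option summary above (the script name is a placeholder):

<code bash>
# Place the job's TMPDIR on shared (Lustre) storage instead of node-local scratch
$ sbatch --use-shared-tmpdir job_script.qs

# Same, but with a unique sub-directory for each node allocated to the job
$ sbatch --use-shared-tmpdir=per-node job_script.qs
</code>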

===== Implementation =====

The addition of new partitions, nodes, and network topology information to the Slurm configuration should not require a full restart of all daemons.
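
If a full restart is indeed unnecessary, the change amounts to distributing the updated configuration files and asking the daemons to re-read them (a sketch of the usual procedure):

<code bash>
# Ask slurmctld and all slurmd daemons to re-read slurm.conf and related files
$ scontrol reconfigure
</code>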

===== Impact =====

No downtime is expected to be required.

===== Timeline =====

^Date^Time^Goal/Milestone^
|2019-09-09| |Authoring of this document|
| | |Changes made|