====== Revisions to Slurm Configuration v2.2.1 on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

See [[technical:slurm:swap-control|this document]] for a discussion of swap limits in Slurm jobs.

===== Implementation =====

  * The ''dynamic_swap_limits'' SPANK plugin will be compiled and installed
  * The ''plugstack.conf'' configuration file will be modified to require the ''dynamic_swap_limits'' plugin

The plugin configuration will reflect the following aspects of the cluster hardware:

  * In all cases, the active swap present on the node will be used as the ''max_swap'' value
  * The majority of Gen 1 nodes have 36 cores; 1/36 = 2.778%, so jobs will be granted 2.5% of ''max_swap'' per CPU
  * The three nodes in the Gen 1 devel partition have hyperthreads enabled, for 72 logical cores; 1/72 = 1.389%, so jobs will be granted 1.1% of ''max_swap'' per CPU
  * The Gen 2 and 2.1 nodes have 40 cores; 1/40 = 2.5%, so jobs will be granted 2.3% of ''max_swap'' per CPU
  * Jobs running in the //lg-swap// partition on the extended memory node(s) will have no swap limits enforced (jobs are scheduled user-exclusive on this node, so any failures will not impact other users' jobs)

These details produce the following configuration string:

  partition(lg-swap)=none,host(r[00,01]n[00,56],r00g00)=1.1%/CPU,host(r[00,01]n[01-55]r[00-01]g[01-04],r1g00,r02s[00-01])=2.5%/CPU,host(r03n[00-57],r03g[00-08],r04n[00-76])=2.3%/CPU,default()=0MiB

===== Impact =====

No downtime is expected. The ''slurmd'' daemon must be restarted on all compute nodes, but currently-executing jobs and job steps should not be affected (they will reconnect to the new ''slurmd'' as necessary to communicate job status, etc.). The ''slurmctld'' daemons do not use the SPANK plugin, so they do not need to be restarted.
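
As a rough illustration of the per-CPU percentages listed under Implementation, the following Python sketch computes the share of ''max_swap'' a job would receive. It assumes the limit scales linearly with the number of CPUs allocated to the job, which is one reading of "per CPU" and not a statement of how the ''dynamic_swap_limits'' plugin is implemented. The node-class labels, the function names, and the 16384 MiB swap size are hypothetical and exist only in this example.

```python
from decimal import Decimal

# Core counts and per-CPU percentages as stated in the Implementation section.
# The dictionary keys are illustrative labels, not names known to Slurm or the plugin.
NODE_CLASSES = {
    "gen1": {"cpus": 36, "percent_per_cpu": Decimal("2.5")},
    "gen1-devel": {"cpus": 72, "percent_per_cpu": Decimal("1.1")},
    "gen2": {"cpus": 40, "percent_per_cpu": Decimal("2.3")},
}


def swap_share_percent(node_class: str, job_cpus: int) -> Decimal:
    """Percentage of max_swap for a job, ASSUMING linear scaling by CPU count."""
    info = NODE_CLASSES[node_class]
    if not 1 <= job_cpus <= info["cpus"]:
        raise ValueError(f"job_cpus must be between 1 and {info['cpus']}")
    return info["percent_per_cpu"] * job_cpus


def swap_limit_mib(node_class: str, job_cpus: int, max_swap_mib: int) -> Decimal:
    """Swap limit in MiB for a job, given the node's active swap (max_swap) in MiB."""
    return swap_share_percent(node_class, job_cpus) * max_swap_mib / 100


if __name__ == "__main__":
    # Share of max_swap when a single job holds every CPU on the node.
    for name, info in NODE_CLASSES.items():
        full = swap_share_percent(name, info["cpus"])
        print(f"{name}: whole-node job is limited to {full}% of max_swap")

    # Hypothetical node with 16384 MiB of active swap and a 4-CPU job.
    print(swap_limit_mib("gen2", 4, 16384), "MiB")
```

Under that assumption, a job holding every CPU of a node would be limited to 90% (Gen 1), 79.2% (Gen 1 devel), or 92% (Gen 2 and 2.1) of the node's swap.
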
===== Timeline =====

^Date ^Time ^Goal/Description ^
|2021-12-01| |Authoring of this document|
|2021-12-08|09:00|Implementation|