====== Revisions to Slurm Configuration v1.1.3 on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Nominal node memory size is not an appropriate limit ====

When the [[technical:slurm:arraysize-and-nodecounts|v1.1.3 configuration]] was activated on 2019-02-18, all nodes in the cluster began transitioning into the DRAIN state.  The DRAIN state keeps a node online while preventing any new jobs from being scheduled to it.  The reason Slurm cited for this was:

<code>
Reason=Low RealMemory
</code>

Each node runs a Slurm job execution daemon (slurmd) that reports back to the scheduler every few minutes; included in that report are the base resource levels:  socket count, core count, physical memory size, and ''/tmp'' disk size.  To effect the v1.1.3 changes we set //FastSchedule=1//, which makes the scheduler consult only the resource levels explicitly specified in the Slurm configuration and ignore what the nodes report for those levels.  As it turns out:

<WRAP negative round>
Slurm //FastSchedule=1// causes the scheduler to ignore resource levels reported by slurmd when scheduling jobs, but **does not** prevent it from noticing disparity between configured and actual levels and reacting by DRAINing the node.
</WRAP>

Many nodes transitioned to the DRAIN state within the first 30 minutes after the v1.1.3 changes were activated:  the scheduler noticed that the reported memory size (e.g. 128813 MiB) was less than the configured size (nominal 128 GiB = 131072 MiB).

The changes did not need to be rolled back, since an additional mode (//FastSchedule=2//) instructs Slurm to ignore such disparity.  Thus, the v1.1.3 changes went into production in that mode circa 10:00.

One additional problem could present itself under the v1.1.3 use of nominal physical memory size for the nodes.  Consider the following:

  * A node runs a job requesting 28 cores and 100 GiB of memory, leaving 8 cores and 28 GiB of memory available according to the node configuration.
  * The same node runs a second job that requests 4 cores and 28 GiB of memory.

Since the OS itself occupies a non-trivial amount of physical memory, the two jobs together can eventually push memory usage beyond the physical memory present in the node.  The excess spills into swap space, slowing all jobs on the node considerably.
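
The arithmetic behind this scenario can be sketched in shell.  The figures are the ones quoted on this page (131072 MiB configured, 128813 MiB reported as an example); applying the example reported size to this particular node is an assumption made for illustration.

<code bash>
# All sizes in MiB.
configured=131072          # nominal 128 GiB RealMemory used in v1.1.3
reported=128813            # example physical memory size reported by slurmd (assumed for this node)
job1=$((100 * 1024))       # first job:  100 GiB
job2=$((28 * 1024))        # second job:  28 GiB

allocated=$((job1 + job2))
overcommit=$((allocated - reported))

echo "allocated=${allocated} MiB of ${configured} MiB configured"
echo "exceeds reported physical memory by ${overcommit} MiB"
</code>

The two requests exactly fill the configured 131072 MiB, yet that is already more than the node physically has, before the OS' own usage is even counted.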

<WRAP negative round>
Choosing to use the nominal memory size of each node for its RealMemory limit was meant to keep requests like ''--mem=128GB'' satisfiable by nodes with a nominal 128 GiB of memory.  However, that level does not reflect reality and is in conflict with Slurm's proper production functioning.
</WRAP>

==== FastSchedule requires explicit specification of all resources ====

In previous configurations, the RealMemory and TmpDisk resource levels were not explicitly specified in the node configurations.  The TmpDisk level defaults to zero if not explicitly specified.  Thus, after the v1.1.3 changes were initially activated, nodes were reporting:

<code>
$ scontrol show node r00n22
NodeName=r00n22 Arch=x86_64 CoresPerSocket=18
      :
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
      :
</code>

Any user submitting a job which requests a minimum amount of ''/tmp'' space (e.g. ''--tmp=4G'') would be met with an error (no nodes satisfy given resource constraints).

<WRAP negative round>
Slurm //FastSchedule=(1|2)// honors **only** the resource levels present in the node configuration file.  No operational mode exists that allows only selected levels reported by slurmd to be overridden by statically configured values while the remaining levels continue to come from slurmd.
</WRAP>

This situation was addressed by augmenting the node configurations with explicit TmpDisk values shortly after the v1.1.3 configuration was initially activated.

===== Solutions =====

==== Use realistic RealMemory levels ====

For each type of node present in Caviness, a RealMemory size will be chosen that is less than the size reported by slurmd (to prevent DRAIN state transitions).  The size will also reflect the memory realistically available on the node, so that scheduling additional jobs does not push the node into heavy swapping, which degrades the performance of every job on that node.

<WRAP positive round>
Node configurations will be updated to reflect the chosen sub-nominal RealMemory sizes.  The //FastSchedule// mode will be restored to //FastSchedule=1//.
</WRAP>
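
As a rough illustration only, the combination might look like the following ''slurm.conf'' fragment.  The node names, the ''Sockets'' count, and the TmpDisk value are hypothetical placeholders and not the actual Caviness configuration; the RealMemory value is the size proposed for //Gen1/128 GiB// nodes in the table further down this page.

<code>
# slurm.conf fragment (sketch; node names, Sockets and TmpDisk are placeholders)
FastSchedule=1

# RealMemory and TmpDisk are both given in MiB
NodeName=node[001-002] Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=126976 TmpDisk=1000000 Feature=E5-2695,E5-2695v4,128GB
</code>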

Under //FastSchedule=1//, nodes reporting memory below the RealMemory limit or ''/tmp'' storage below the TmpDisk size will (appropriately) enter the DRAIN state.  Such conditions are indicative of hardware issues, so this behavior agrees with the intent of the Slurm developers.

<note important>Note that slurmd reports the total capacity of the ''/tmp'' filesystem, not the available capacity.  Since filesystem capacity cannot be reserved the way memory limits are enforced on jobs, a request of ''--tmp=X'' does not guarantee that a job can actually write that much data to a node's ''/tmp'' directory.</note>
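
The difference between total and available capacity can be inspected on any node with a sketch like the one below.  It relies only on the POSIX output format of ''df''; the variable names are illustrative.

<code bash>
# Column 2 of "df -Pk" is the total size and column 4 the available space,
# both in KiB; convert to MiB.
set -- $(df -Pk /tmp | awk 'NR==2 {print int($2/1024), int($4/1024)}')
total_mib=$1
avail_mib=$2

echo "/tmp total capacity: ${total_mib} MiB (the quantity TmpDisk describes)"
echo "/tmp available now:  ${avail_mib} MiB"
</code>

A job that needs a given amount of scratch space may want to perform a similar check itself at startup rather than rely on ''--tmp'' alone.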

<WRAP positive round>
Workgroup QOS configurations will be updated to reflect the sum over sub-nominal RealMemory sizes rather than nominal sizes used in the v1.1.3 configuration.
</WRAP>

In v1.1.3 the node counts in workgroup QOS definitions were replaced by aggregate memory sizes which summed over the nominal sizes (128 GiB, 256 GiB, 512 GiB).  In concert with changing the nodes' RealMemory size, the QOS aggregate must change.

==== Proposed RealMemory sizes ====

^Node type^(PHYS_PAGES*PAGESIZE)/MiB^RealMemory/MiB^RealMemory/GiB^
|Gen1/128 GiB|128813|126976|124|
|Gen1/256 GiB|257843|256000|250|
|Gen1/512 GiB|515891|514048|502|
|Gen1/GPU/128 GiB|128813|126976|124|
|Gen1/GPU/256 GiB|257843|256000|250|
|Gen1/GPU/512 GiB|515891|514048|502|
|Gen1/NVMe/256 GiB|257842|256000|250|
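
The second column can be reproduced on a node with a short shell sketch, assuming a Linux system whose ''getconf'' exposes ''_PHYS_PAGES'' and ''PAGESIZE''.  Whether slurmd derives its reported value in exactly this way is not asserted here.

<code bash>
# Physical memory in MiB = (number of physical pages * page size in bytes) / 2^20
pages=$(getconf _PHYS_PAGES)
pagesize=$(getconf PAGESIZE)
phys_mib=$(( pages * pagesize / 1048576 ))

echo "physical memory: ${phys_mib} MiB"
</code>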

A workgroup QOS which under v1.1.3 had ''cpu=756,mem=3801088'' reflecting (13) //Gen1/128 GiB// nodes and (8) //Gen1/256 GiB// nodes will change to ''cpu=756,mem=3698688'' to reflect the RealMemory sizes specified above.
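
The aggregates in this example can be checked with a few lines of shell arithmetic.  The figure of 36 cores per node is inferred from the ''cpu=756'' total over 21 nodes; the variable names are illustrative.

<code bash>
n128=13                    # Gen1/128 GiB nodes in the workgroup
n256=8                     # Gen1/256 GiB nodes in the workgroup
cores_per_node=36          # inferred: 756 cores / 21 nodes

cpus=$(( (n128 + n256) * cores_per_node ))
nominal_mem=$(( n128 * 131072 + n256 * 262144 ))   # nominal sizes, MiB
real_mem=$(( n128 * 126976 + n256 * 256000 ))      # proposed RealMemory, MiB

echo "v1.1.3:  cpu=${cpus},mem=${nominal_mem}"
echo "revised: cpu=${cpus},mem=${real_mem}"
</code>

This prints ''cpu=756,mem=3801088'' for the nominal sizes and ''cpu=756,mem=3698688'' for the proposed sizes.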

==== Requesting specific node types ====

Our Slurm configuration has always defined several //features// on each node in the cluster:

<code bash>
$ scontrol show node r00n00 
NodeName=r00n00 Arch=x86_64 CoresPerSocket=36
   CPUAlloc=0 CPUErr=0 CPUTot=72 CPULoad=0.07
   AvailableFeatures=E5-2695,E5-2695v4,128GB,HT
   ActiveFeatures=E5-2695,E5-2695v4,128GB,HT
      :
</code>

It is possible to constrain a job to execute on a node with a specific nominal memory size using these features:

<code bash>
$ sbatch … --constraint=128GB …
$ sbatch … --constraint=256GB …
$ sbatch … --constraint=512GB …
</code>

The other features reflect the processor model present in the node.  All Gen1 nodes use the Intel ''E5-2695v4''; node ''r00n00'' is part of the ''devel'' partition and has hyperthreading enabled, hence the ''HT'' feature.

===== Implementation =====

All changes are effected by altering the Slurm configuration files, pushing the changed files to all nodes, and signaling a change in configuration so all daemons refresh their configuration.

===== Impact =====

No downtime is expected.

===== Timeline =====

^Date ^Time ^Goal/Description ^
|2019-02-18| |Authoring of this document|
|2019-02-18| |Document shared with Caviness community for feedback|
|2019-02-18| |Add announcement of impending change to login banner|
|2019-02-25|09:00|Configuration changes pushed to cluster nodes|
| |09:30|Restart scheduler, notify compute nodes of reconfiguration|
|2019-02-27| |Remove announcement from login banner|