technical:slurm:node-memory-sizes

</code>
  
Each node runs a Slurm job execution daemon (slurmd) that reports back to the scheduler every few minutes; included in that report are the base resource levels:  socket count, core count, physical memory size, ''/tmp'' disk size.  To effect the v1.1.3 changes we altered Slurm to use //FastSchedule=1//, which only consults the resource levels explicitly specified in the Slurm configuration and ignores what the nodes report for those levels.  As it turns out:
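
For reference, the resource levels a node would report can be viewed directly on that node by asking slurmd to print its detected hardware configuration; this is a sketch of a quick check, not part of the production workflow:

<code bash>
# Run on a compute node: print the hardware configuration slurmd would report
# (CPUs, sockets, cores per socket, memory size, etc.) and exit.
$ slurmd -C
</code>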
  
<WRAP negative round>
The changes did not need to be rolled back, since an additional mode (//FastSchedule=2//) instructs Slurm to ignore such disparity.  Thus, the v1.1.3 changes went into production in that mode circa 10:00.
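
The //FastSchedule// mode currently in effect can be confirmed from the scheduler's running configuration, for example:

<code bash>
# Show the FastSchedule setting the scheduler is currently using.
$ scontrol show config | grep -i fastschedule
</code>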
  
One additional problem could present itself under the v1.1.3 use of nominal physical memory size for the nodes.  Consider the following:
  
  * A node runs a job requesting 28 cores and 100 GiB of memory, leaving 8 cores and 28 GiB of memory available according to the node configuration.
  * The same node runs a second job, from a different user, that requests 4 cores and 28 GiB of memory.
  
Since the OS itself occupies some non-trivial amount of the physical memory, the second job eventually pushes memory usage beyond the amount of physical memory present in the node.  This causes memory usage to spill into swap space, slowing all jobs on the node considerably.
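
A sketch of that scenario, with placeholder job scripts, might look like the following:

<code bash>
# Hypothetical submissions; job1.qs and job2.qs are placeholder scripts.
$ sbatch --ntasks=28 --mem=100G job1.qs   # node config now shows 8 cores, 28 GiB free
$ sbatch --ntasks=4 --mem=28G job2.qs     # total requested: 128 GiB, the node's nominal size
# The OS already occupies part of that 128 GiB, so combined usage can spill into swap.
</code>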
  
<WRAP negative round>
</code>
  
Any user submitting a job that requests a minimum amount of ''/tmp'' space (e.g. ''--tmp=4G'') would be met with an error (no nodes satisfy the given resource constraints).
  
<WRAP negative round>
</WRAP>
  
This situation was addressed by augmenting the node configurations with explicit TmpDisk values shortly after the v1.1.3 configuration was initially activated.
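
The TmpDisk size now recorded for each node can be checked with a ''sinfo'' format string such as:

<code bash>
# List each node alongside its configured temporary disk space (in MB).
$ sinfo -N -o "%N %d"
</code>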
  
===== Solutions =====
  
==== Use realistic RealMemory levels ====
  
For each type of node present in Caviness, a RealMemory size less than that reported by slurmd (to prevent DRAIN state transitions) will be chosen.  The size will reflect the reality of the nodes, which also prevents additional jobs from pushing nodes into heavy swapping and degrading the performance of all jobs on the node.
  
<WRAP positive round>
Node configurations will be updated to reflect the chosen sub-nominal RealMemory sizes.  The //FastSchedule// mode will be restored to //FastSchedule=1//.
</WRAP>
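
As an illustration only (the exact node ranges, weights, and TmpDisk values will differ from this sketch), an updated baseline-node entry in the Slurm configuration would take a form like the following, using the RealMemory size proposed in the table below:

<code>
# Sketch, not the literal production configuration:
FastSchedule=1
NodeName=r00n[01-17,45-55] Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=126976 Feature="E5-2695,E5-2695v4,128GB"
</code>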
  
Under mode 1 of //FastSchedule//, nodes reporting memory below the RealMemory limit or ''/tmp'' storage below the TmpDisk size will (appropriately) enter the DRAIN state, since such conditions are indicative of hardware issues; this behavior agrees with the intent of the Slurm developers.
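
Nodes that do enter the DRAIN state can be listed, along with the recorded reason, via:

<code bash>
# Show nodes in the down/drained/failing states and why.
$ sinfo -R
</code>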
  
<note important>Note that slurmd reports the total capacity of the ''/tmp'' filesystem, not the available capacity.  Since filesystem capacity cannot be reserved the way memory limits are enforced on jobs, a ''--tmp=X'' request does not guarantee the ability to actually write that much data to a node's ''/tmp'' directory.</note>
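
Jobs that rely on substantial ''/tmp'' scratch space may therefore want to check what is actually free at run time, e.g. from within the job script:

<code bash>
# Report the actual free space on the node's /tmp filesystem at run time.
df -h /tmp
</code>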
  
<WRAP positive round>
Workgroup QOS configurations will be updated to reflect the sum over the sub-nominal RealMemory sizes rather than the nominal sizes used in the v1.1.3 configuration.
</WRAP>
  
In v1.1.3 the node counts in workgroup QOSs were replaced by aggregate memory sizes that summed the nominal sizes (128 GiB, 256 GiB, 512 GiB).  In concert with changing the nodes' RealMemory sizes, the QOS aggregates must change as well.
  
==== Proposed RealMemory sizes ====
  
^Node type^Reported (PHYS_PAGES*PAGESIZE) / MiB^RealMemory / MiB^RealMemory / GiB^
|Gen1/128 GiB|128813|126976|124|
|Gen1/256 GiB|257843|256000|250|
|Gen1/512 GiB|515891|514048|502|
|Gen1/GPU/128 GiB|128813|126976|124|
|Gen1/GPU/256 GiB|257843|256000|250|
|Gen1/GPU/512 GiB|515891|514048|502|
|Gen1/NVMe/256 GiB|257842|256000|250|
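
The values in the second column can be reproduced on a node with ''getconf'', for example:

<code bash>
# Physical memory in MiB, computed as (PHYS_PAGES * PAGESIZE) / 2^20.
$ echo $(( $(getconf PHYS_PAGES) * $(getconf PAGESIZE) / 1048576 ))
</code>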
  
A workgroup QOS which under v1.1.3 had ''cpu=756,mem=3801088'', reflecting (13) //Gen1/128 GiB// nodes and (8) //Gen1/256 GiB// nodes, will change to ''cpu=756,mem=3698688'' to reflect the RealMemory sizes specified above.
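
That new aggregate follows directly from the table above; a quick arithmetic check, and one way to inspect the limits attached to a workgroup QOS (the QOS name here is a placeholder):

<code bash>
# 13 baseline nodes at 126976 MiB plus 8 nodes at 256000 MiB:
$ echo $(( 13 * 126976 + 8 * 256000 ))
3698688

# Inspect the aggregate limits on a workgroup QOS (name is illustrative):
$ sacctmgr show qos <workgroup> format=Name,GrpTRES
</code>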
  
==== Requesting specific node types ====
  
Our Slurm has always defined several //features// on each node in the cluster:
  
<code bash>
$ scontrol show node r00n00
NodeName=r00n00 Arch=x86_64 CoresPerSocket=36
   CPUAlloc=0 CPUErr=0 CPUTot=72 CPULoad=0.07
   AvailableFeatures=E5-2695,E5-2695v4,128GB,HT
   ActiveFeatures=E5-2695,E5-2695v4,128GB,HT
      :
</code>
  
It is possible to constrain a job to execute on a node with a specific nominal memory size using these features:

<code bash>
$ sbatch … --constraint=128GB …
$ sbatch … --constraint=256GB …
$ sbatch … --constraint=512GB …
</code>
  
The other features reflect the processor model present in the node.  All Gen1 nodes use the Intel ''E5-2695v4''; node ''r00n00'' is part of the ''devel'' partition and has hyperthreading enabled, hence the ''HT'' feature.
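
The full set of features defined across the cluster's nodes can be surveyed with a ''sinfo'' format string such as:

<code bash>
# List each node alongside its available features.
$ sinfo -N -o "%N %f"
</code>
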
===== Implementation =====
  
  
^Date ^Time ^Goal/Description ^
|2019-02-18| |Authoring of this document|
|2019-02-18| |Document shared with Caviness community for feedback|
|2019-02-18| |Add announcement of impending change to login banner|
|2019-02-25|09:00|Configuration changes pushed to cluster nodes|
| |09:30|Restart scheduler, notify compute nodes of reconfiguration|
|2019-02-27| |Remove announcement from login banner|