The changes did not need to be rolled back, since an additional mode (//FastSchedule=2//) instructs Slurm to ignore such disparity.  Thus, the v1.1.3 changes went into production in that mode circa 10:00.
  
One additional problem could present itself under the v1.1.3 use of nominal physical memory size for the nodes.  Consider the following:
  
  * A node runs a job requesting 28 cores and 100 GiB of memory, leaving 8 cores and 28 GiB of memory available according to the node configuration.
  * The same node runs a second job that requests 4 cores and 28 GiB of memory.
  
Since the OS itself occupies some non-trivial amount of the physical memory, the second job eventually extends memory usage above and beyond the amount of physical memory present in the node.  This causes memory usage to spill into swap space, slowing all jobs on the node considerably.
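
For concreteness, a sketch of the two submissions in this scenario (the script names and exact options are illustrative, not taken from a real workflow):

<code bash>
# First job: 28 cores and 100 GiB, leaving 8 cores and 28 GiB
# according to the node's nominal 128 GiB configuration
$ sbatch --nodes=1 --ntasks=28 --mem=100G job1.qs

# Second job: 4 cores and 28 GiB -- exactly what appears to remain
$ sbatch --nodes=1 --ntasks=4 --mem=28G job2.qs

# Together the jobs are granted 128 GiB, the node's full nominal size,
# yet the OS's own footprint means less than that is actually available.
</code>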
  
<WRAP negative round>
  
<WRAP positive round>
Node configurations will be updated to reflect the chosen sub-nominal RealMemory sizes.  The //FastSchedule// mode will be restored to //FastSchedule=1//.
</WRAP>
  
Under mode 1 of //FastSchedule//, nodes reporting memory below the RealMemory limit or ''/tmp'' storage below the TmpDisk size will (appropriately) enter the DRAIN state; such conditions are indicative of hardware issues, which is in line with the Slurm developers' intent for this mode.
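
Nodes that have entered the DRAIN state for this reason should be easy to spot; a quick sketch (output omitted):

<code bash>
# Show drained nodes, one per line
$ sinfo -t drain -N

# Show the reason each node was drained
$ sinfo -R
</code>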
  
<note important>Note that slurmd reports the total capacity of the ''/tmp'' filesystem, not the available capacity.  Since filesystem capacity cannot be reserved the way memory limits are enforced on jobs, the requested ''--tmp=X'' does not reflect the ability to actually write that much data to a node's ''/tmp'' directory.</note>
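
In other words, a request like the following only selects nodes whose ''/tmp'' filesystem is at least 100 GiB in total size; it does not guarantee that 100 GiB is free at run time (the value here is illustrative):

<code bash>
$ sbatch … --tmp=100G …
</code>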
  
<WRAP positive round>
Workgroup QOS configurations will be updated to reflect the sum over sub-nominal RealMemory sizes rather than the nominal sizes used in the v1.1.3 configuration.
</WRAP>

In v1.1.3 the node counts in workgroup QOSs were replaced by aggregate memory sizes that summed over the nominal sizes (128 GiB, 256 GiB, 512 GiB).  In concert with the change to the nodes' RealMemory sizes, the QOS aggregates must change as well.

==== Proposed RealMemory sizes ====

^Node type^(PHYS_PAGES*PAGESIZE)/MiB^RealMemory/MiB^RealMemory/GiB^
|Gen1/128 GiB|128813|126976|124|
|Gen1/256 GiB|257843|256000|250|
|Gen1/512 GiB|515891|514048|502|
|Gen1/GPU/128 GiB|128813|126976|124|
|Gen1/GPU/256 GiB|257843|256000|250|
|Gen1/GPU/512 GiB|515891|514048|502|
|Gen1/NVMe/256 GiB|257842|256000|250|

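The measured sizes in the second column follow from the C library's page accounting; a minimal sketch of how such a figure can be reproduced on a Linux node:

<code bash>
# Physical memory in MiB, computed as (PHYS_PAGES * PAGESIZE) / 2^20;
# on a Gen1/128 GiB node this should print the 128813 shown above
$ echo $(( $(getconf _PHYS_PAGES) * $(getconf PAGESIZE) / 1048576 ))
</code>
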
A workgroup QOS which under v1.1.3 had ''cpu=756,mem=3801088'' reflecting (13) //Gen1/128 GiB// nodes and (8) //Gen1/256 GiB// nodes will change to ''cpu=756,mem=3698688'' to reflect the RealMemory sizes specified above.
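
The arithmetic behind those two ''mem='' values can be checked directly in the shell:

<code bash>
# v1.1.3 aggregate: 13 x 131072 MiB (128 GiB) + 8 x 262144 MiB (256 GiB)
$ echo $(( 13 * 131072 + 8 * 262144 ))
3801088

# New aggregate using the sub-nominal RealMemory sizes from the table
$ echo $(( 13 * 126976 + 8 * 256000 ))
3698688
</code>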

==== Requesting specific node types ====

Our Slurm has always defined several //features// on each node in the cluster:

<code bash>
$ scontrol show node r00n00
NodeName=r00n00 Arch=x86_64 CoresPerSocket=36
   CPUAlloc=0 CPUErr=0 CPUTot=72 CPULoad=0.07
   AvailableFeatures=E5-2695,E5-2695v4,128GB,HT
   ActiveFeatures=E5-2695,E5-2695v4,128GB,HT
      :
</code>

It is possible to constrain a job to execute on a node with a specific nominal memory size using these features:

<code bash>
$ sbatch … --constraint=128GB …
$ sbatch … --constraint=256GB …
$ sbatch … --constraint=512GB …
</code>

The other features reflect the processor model present in the node.  All Gen1 nodes use the Intel ''E5-2695v4''; node ''r00n00'' is part of the ''devel'' partition and has hyperthreading enabled, hence the ''HT'' feature.
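
Once the sub-nominal RealMemory sizes are in effect, a feature constraint can be paired with a memory request sized to match; a sketch (the script name is illustrative):

<code bash>
# Request a nominal 256 GiB node and its full sub-nominal RealMemory
# allotment of 250 GiB (256000 MiB)
$ sbatch --constraint=256GB --mem=250G job.qs
</code>
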
===== Implementation =====
  
  
^Date ^Time ^Goal/Description ^
|2019-02-18| |Authoring of this document|
|2019-02-18| |Document shared with Caviness community for feedback|
|2019-02-18| |Add announcement of impending change to login banner|
|2019-02-25|09:00|Configuration changes pushed to cluster nodes|
| |09:30|Restart scheduler, notify compute nodes of reconfiguration|
|2019-02-27| |Remove announcement from login banner|