technical:slurm:node-memory-sizes

  
<WRAP positive round>
Node configurations will be updated to reflect the chosen sub-nominal RealMemory sizes.  The //FastSchedule// mode will be restored to //FastSchedule=1//.
</WRAP>
  
Under mode 1 of //FastSchedule//, nodes reporting memory below the RealMemory limit or ''/tmp'' storage below the TmpDisk size will (appropriately) enter the DRAIN state; such conditions are indicative of hardware issues, which matches the Slurm developers' intent for this mode.
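
For illustration, the relevant ''slurm.conf'' stanzas would resemble the following sketch; the node name, CPU count, and TmpDisk value are placeholders, not the actual Caviness definitions:

<code>
# Trust the configured node definitions: any node that registers
# with less memory or /tmp space than configured is set to DRAIN.
FastSchedule=1

# A Gen1/128 GiB node using the sub-nominal RealMemory size proposed
# below (node name, CPUs, and TmpDisk are illustrative only).
NodeName=r00n00 CPUs=36 RealMemory=126976 TmpDisk=900000
</code>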
  
<note important>Note that slurmd reports the total capacity of the ''/tmp'' filesystem, not the available capacity.  Since filesystem capacity cannot be reserved the way memory limits are enforced on jobs, the requested ''--tmp=X'' does not guarantee the ability to actually write that much data to a node's ''/tmp'' directory.</note>
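
What a node actually registered can be checked against these limits with ''scontrol'' (the node name below is a placeholder):

<code bash>
# RealMemory, FreeMem, and TmpDisk all appear in the node record
scontrol show node r00n00 | grep -E 'RealMemory|TmpDisk'
</code>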
  
<WRAP positive round>
Workgroup QOS configurations will be updated to reflect the sum over sub-nominal RealMemory sizes rather than the nominal sizes used in the v1.1.3 configuration.
</WRAP>

In v1.1.3 the node counts in workgroup QOSes were replaced by aggregate memory sizes that summed over the nominal sizes (128 GiB, 256 GiB, 512 GiB).  In concert with the change to the nodes' RealMemory sizes, the QOS aggregates must change as well.
 +
==== Proposed RealMemory sizes ====
 +
^Node type^(PHYS_PAGES*PAGESIZE)/MiB^RealMemory/MiB^RealMemory/GiB^
|Gen1/128 GiB|128813|126976|124|
|Gen1/256 GiB|257843|256000|250|
|Gen1/512 GiB|515891|514048|502|
|Gen1/GPU/128 GiB|128813|126976|124|
|Gen1/GPU/256 GiB|257843|256000|250|
|Gen1/GPU/512 GiB|515891|514048|502|
|Gen1/NVMe/256 GiB|257842|256000|250|
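
The second column is simply the physical memory visible to the C library on each node; it can be reproduced with standard ''getconf'' variables on any Linux system:

<code bash>
# Physical memory in MiB: (PHYS_PAGES * PAGESIZE) / 2^20
echo $(( $(getconf PHYS_PAGES) * $(getconf PAGESIZE) / 1048576 ))
</code>
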
A workgroup QOS which under v1.1.3 had ''cpu=756,mem=3801088'', reflecting (13) //Gen1/128 GiB// nodes and (8) //Gen1/256 GiB// nodes, will change to ''cpu=756,mem=3698688'' to reflect the sub-nominal RealMemory sizes specified above.
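
Assuming the aggregate is carried in each QOS's //GrpTRES// (a sketch; the workgroup QOS name ''it_css'' is hypothetical), the corresponding update would look like:

<code bash>
# 13 * 126976 MiB + 8 * 256000 MiB = 3698688 MiB
sacctmgr modify qos it_css set GrpTRES=cpu=756,mem=3698688
</code>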
===== Implementation =====
  
  
^Date ^Time ^Goal/Description ^
|2019-02-18| |Authoring of this document|
|2019-02-18| |Document shared with Caviness community for feedback|
|2019-02-18| |Add announcement of impending change to login banner|
|2019-02-25|09:00|Configuration changes pushed to cluster nodes|
| |09:30|Restart scheduler, notify compute nodes of reconfiguration|
|2019-02-27| |Remove announcement from login banner|