technical:slurm:scheduler-params

|fair-share| 4000|see ''sshare''|
|partition id| 2000|1.0 for all partitions|
|job size| 1|all resources in partition=1.0|
  
Next to priority access, wait time is the largest factor:  the longer a job waits to execute, the higher its priority to be scheduled.  This seems appropriate, but it competes against the fair-share factor, which prioritizes jobs for underserved users.
==== Altered priority weights ====
  
The dominance of wait time in priority calculations is probably the factor contributing most to this problem.  Wait time should not dominate fair-share, so at the very least those two weights' magnitudes must be reversed.  The job size weight of 1 makes that factor more or less useless, since the vast majority of jobs request fewer than 50% of the cores, mapping them to a contribution of 0.  In reality, wait time and job size should be considered on a more equal footing.  However, there are multiple //job size// contributions to the priority (a rough numerical sketch follows this list):

  - the job size factor already discussed (percentage of cores), weighted by **PriorityWeightJobSize**
  - the TRES size factor, calculated from the **PriorityWeightTRES** weights and the job's requested resource values
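
As a rough illustration of the first contribution (a sketch only: it assumes the job-size factor is simply the job's core fraction normalized to [0,1], and the core counts below are example values):

<code python>
# Sketch: weighted job-size contribution, in priority points.
# Assumption: the factor is the job's core fraction, in [0, 1].
def job_size_points(requested_cores, total_cores, weight=1):
    factor = min(requested_cores / total_cores, 1.0)
    return weight * factor

# With PriorityWeightJobSize=1, even a large job contributes less than
# one priority point, so the factor is effectively invisible:
print(job_size_points(72, 7000))     # ~0.01
print(job_size_points(3400, 7000))   # ~0.49 -- still effectively 0
</code>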
  
The Slurm documentation also points out that the priority factor weights should be of a magnitude that allows enough significant digits from each factor (minimum 1000 for important factors).  The scheduling priority is an unsigned 32-bit integer (ranging [0,4294967295]).  In constructing a new set of priority weights:
|1638 = ''0x00000666''|job size|
  
The **PriorityWeightTRES** must be set, as well, to yield a non-zero contribution for job size; internally, the weights in **PriorityWeightTRES** are converted to double-precision floating point values.  From the Slurm source code, the calculation of job size depends on the TRES limits associated with the partition(s) to which the job was submitted (or in which it is running).  The job's requested resource value is divided by the maximum value for the partition, yielding the resource fractions associated with the job.  The fractions are multiplied by the weighting values in **PriorityWeightTRES**.
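
The following is a minimal sketch of that per-partition calculation as we read it; the weights, partition limits, and job request below are illustrative values, not the production configuration:

<code python>
# Sketch of the per-partition TRES job-size contribution: each requested
# TRES value is divided by the partition's maximum for that TRES, and the
# resulting fractions are weighted by PriorityWeightTRES and summed.
tres_weights  = {"cpu": 819,  "mem": 245,   "gres/gpu": 245}  # PriorityWeightTRES
partition_max = {"cpu": 1440, "mem": 7.5e6, "gres/gpu": 8}    # partition TRES limits
job_request   = {"cpu": 72,   "mem": 256e3, "gres/gpu": 1}    # job's requested TRES

contribution = 0.0
for tres, weight in tres_weights.items():
    fraction = min(job_request.get(tres, 0) / partition_max[tres], 1.0)
    contribution += weight * fraction

print(round(contribution))   # priority points this job gains from the TRES factor
</code>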
  
<note important>Calculating job size relative to each partition is counterintuitive for our usage:  jobs in partitions with fewer resources are likely to have routinely higher resource fractions than jobs in partitions with more resources.  Calculating job size relative to the full complement of cluster resources would be a very useful option (it does not exist yet).  It is possible that the partition contribution to priority is seen as the way to balance against job size (a lower value on larger partitions), but that seems far more complicated than having a ''PRIORITY_*'' flag to select partition versus global fractions.</note>
  
The TRES weights should sum to a maximum of 1638 (see the job size factor above), with the weight on ''cpu'' dominating.
  
===== Implementation =====
  
The priority weight factors will be adjusted as follows:

^Configuration Key^Old Value^New Value^
|''PriorityWeightAge''| 8000| 2457|
|''PriorityWeightFairshare''| 4000| 1073737728|
|''PriorityWeightJobSize''| 1| 0|
|''PriorityWeightPartition''| 2000| 0|
|''PriorityWeightQOS''| 20000| 3221225472|
|''PriorityWeightTRES''| unset| ''cpu=819,mem=245,GRES/gpu=245,node=327''|
|''PriorityFlags''| ''FAIR_TREE,SMALL_RELATIVE_TO_TIME''| ''FAIR_TREE''|

Priorities will make use of the full 32 bits available, with the fairshare factor dominating and having the most precision assigned to it.
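
As a quick check on that claim (assuming each factor is normalized to [0,1], so a weight is also that factor's maximum possible contribution), the new weights sum to just under the 32-bit limit:

<code python>
# Verify that the new weights fit within the unsigned 32-bit priority value.
max_contrib = {
    "QOS":       3221225472,             # 0xC0000000
    "Fairshare": 1073737728,             # 0x3FFFF000
    "Age":             2457,             # 0x00000999
    "TRES":      819 + 245 + 245 + 327,  # 1636, within the 0x666 job-size budget
}

total = sum(max_contrib.values())
print(hex(total), total <= 0xFFFFFFFF)   # 0xfffffffd True
</code>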

Since ''PriorityWeightJobSize'' will not be used, the more complex "small-relative-to-time" algorithm will be disabled.

The modifications to ''slurm.conf'' must be pushed to all systems.  The ''scontrol reconfigure'' command should be all that is required to activate the altered priority calculation scheme.

==== Addendum: lg-swap partition ====

One of the test nodes containing 6 TiB of NVMe storage has been rebuilt with the NVMe devices configured as swap rather than as file storage.  This configuration is currently being tested for viability as a solution for jobs requiring extremely large amounts of allocatable memory:  the node has 256 GiB of physical RAM and 6 TiB of NVMe swap.  A job that allocates more than 256 GiB of RAM will force the OS to move 4 KiB memory pages between the NVMe storage and the physical RAM (swapping); the idea is that NVMe performance is close enough to physical RAM (versus a hard disk) that the performance penalty may be low enough to make this design an attractive option for some workgroups.

A special-access partition has been added to feed jobs to this node.  Access is by request only.
  
===== Impact =====

^Date ^Time ^Goal/Description ^
|2019-10-24| |Authoring of this document|
|2020-03-09|10:45|Implementation|