Differences
This shows you the differences between two versions of the page.
technical:slurm:scheduler-params [2019-10-24 14:06] – frey | technical:slurm:scheduler-params [2020-02-20 14:55] – [Implementation] frey | ||
---|---|---|---|
Line 27: | Line 27: | ||
|fair-share| 4000|see '' | |fair-share| 4000|see '' | ||
|partition id| 2000|1.0 for all partitions| | |partition id| 2000|1.0 for all partitions| | ||
- | |job resource | + | |job size| 1|all cores in cluster=1.0| |
Next to priority access, wait time is the largest factor: | Next to priority access, wait time is the largest factor: | ||
- | Taken together, these factors allow a single user to submit thousands of jobs (even if s/he has a very small share of purchased cluster resources) that quickly sort to the head of the pending queue due to their wait time. The weight on wait time then begins to prioritize those jobs over jobs submitted by users who have not been using the cluster. | + | Taken together, these factors allow a single user to submit thousands of jobs (even if s/he has a very small share of purchased cluster resources) that quickly sort to the head of the pending queue due to their wait time. The weight on wait time then begins to prioritize those jobs over jobs submitted by users who have not been using the cluster, contrary to the goals of fair-share. |
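The effect described above can be illustrated with a minimal sketch, assuming the priority is a weighted sum of factors that each lie between 0.0 and 1.0, as in Slurm's multifactor priority plugin. The weights for fair-share, partition id, and job size are taken from the table; the wait time weight (`WAIT_TIME_WEIGHT`) is a hypothetical placeholder and not the value configured on the cluster, the priority-access factor is left out for brevity, and the factor values for the two jobs are invented for illustration.

```python
# Sketch of a multifactor priority: a weighted sum of factors in [0.0, 1.0].
# Fair-share, partition, and job size weights come from the table above.
# WAIT_TIME_WEIGHT is a hypothetical placeholder, not the cluster's real setting.
WAIT_TIME_WEIGHT = 8000  # assumption for illustration only

WEIGHTS = {
    "wait_time": WAIT_TIME_WEIGHT,
    "fair_share": 4000,
    "partition": 2000,
    "job_size": 1,
}


def job_priority(factors, weights=WEIGHTS):
    """Return the integer priority for a dict of factors, each in [0.0, 1.0]."""
    total = sum(weights[name] * factors.get(name, 0.0) for name in weights)
    return int(total)


# A heavy user whose job has waited a long time: fair-share factor near zero.
heavy_user = job_priority(
    {"wait_time": 1.0, "fair_share": 0.05, "partition": 1.0, "job_size": 0.01}
)

# An idle user who just submitted: full fair-share factor, no accrued wait time.
idle_user = job_priority(
    {"wait_time": 0.0, "fair_share": 1.0, "partition": 1.0, "job_size": 0.01}
)

print(heavy_user, idle_user)
```

Under these assumed numbers the long-waiting job from the heavy user outranks the new job from the idle user, which is the ordering that fair-share is meant to prevent.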
===== Solutions ===== | ===== Solutions ===== | ||
Line 39: | Line 39: | ||
On many HPC systems per-user limits are enacted to restrict how many pending jobs can be present in the queue: | On many HPC systems per-user limits are enacted to restrict how many pending jobs can be present in the queue: | ||
- | It would be preferable to avoid enacting such limits on Caviness. | + | It would be preferable to avoid enacting such limits on Caviness. Should user behavior change over time (with users routinely abusing this lenience), submission limits may become necessary. |
==== Altered priority weights ==== | ==== Altered priority weights ==== | ||
- | The dominance of wait time in priority calculations is probably the factor contributing most to this problem. | + | The dominance of wait time in priority calculations is probably the factor contributing most to this problem. |
+ | |||
+ | - the job size factor already discussed (percentage of cores) weighted by PriorityJobSize | ||
+ | - the TRES size factor calculated using PriorityWeightTRES weights and requested resource values | ||
The Slurm documentation also points out that the priority factor weights should be of a magnitude that allows enough significant digits from each factor (minimum 1000 for important factors). | The Slurm documentation also points out that the priority factor weights should be of a magnitude that allows enough significant digits from each factor (minimum 1000 for important factors). | ||
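The point about significant digits can be shown with a short sketch, assuming the weighted contribution ends up in an integer priority so that any fractional part is lost. The helper `weighted_contribution` and the two job size factors are hypothetical and chosen only for illustration.

```python
def weighted_contribution(weight, factor):
    """Integer part of weight * factor, as it would survive in an integer priority."""
    return int(weight * factor)


# Hypothetical job size factors for a small and a large job.
small_job, large_job = 0.25, 0.75

# With a weight of 1, both contributions truncate to 0: size cannot affect ordering.
low = [weighted_contribution(1, f) for f in (small_job, large_job)]

# With a weight of 1000, the two jobs are separated by 500 priority points.
high = [weighted_contribution(1000, f) for f in (small_job, large_job)]

print(low, high)
```

A weight of 1, as in the job size row of the earlier table, leaves the factor with no resolution at all, while a weight of 1000 preserves roughly three significant digits.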
Line 66: | Line 69: | ||
|1638 = '' | |1638 = '' | ||
- | The **PriorityWeightTRES** must also be set to yield a non-zero contribution for the job size factor; internally, the weights in **PriorityWeightTRES** are converted to double-precision floating-point values. | + | The **PriorityWeightTRES** must also be set to yield a non-zero contribution for job size; internally, the weights in **PriorityWeightTRES** are converted to double-precision floating-point values. |
- | ^TRES^min^max^notes^ | + | <note important> |
- | |cpu|1|10000|current CPU count is ~7000| | + | |
- | |mem|1024|107374182400|100 TiB; current total is 46 TiB| | + | |
- | |gpu|1|40|current GPU count is 33 (22 P100, 4 V100, 7 T4)| | + | |
+ | The TRES weights should sum to a maximum of 1638 (see the job size factor above), with the weight on '' | ||
===== Implementation ===== | ===== Implementation ===== | ||
+ | |||
+ | The Priority weight factors will be adjusted as follows: | ||
+ | |||
+ | ^Configuration Key^Old Value^New Value^ | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
+ | | '' | ||
===== Impact ===== | ===== Impact ===== |