Differences

This shows you the differences between two versions of the page.

--- technical:slurm:scheduler-params [2019-10-24 13:36] – frey
+++ technical:slurm:scheduler-params [2020-02-26 15:20] – [Implementation] frey
@@ Line 27: / Line 27: @@
 |fair-share| 4000|see ''sshare''|
 |partition id| 2000|1.0 for all partitions|
-|job resource size| 1|largest resource request=1.0|
+|job size| 1|all resources in partition=1.0|
 Next to priority access, wait time is the largest factor:  the longer a job waits to execute, the higher its priority to be scheduled.  This seems appropriate, but it competes against the fair-share factor, which prioritizes jobs for underserved users.
-Taken together, these factors allow a single user to submit thousands of jobs (even if s/he has a very small share of purchased cluster resources) that quickly sort to the head of the pending queue due to their wait time.  The weight on wait time then begins to prioritize those jobs over jobs submitted by users who have not been using the cluster.
+Taken together, these factors allow a single user to submit thousands of jobs (even if s/he has a very small share of purchased cluster resources) that quickly sort to the head of the pending queue due to their wait time.  The weight on wait time then begins to prioritize those jobs over jobs submitted by users who have not been using the cluster, contrary to the goals of fair-share.
 ===== Solutions =====
@@ Line 39: / Line 39: @@
 On many HPC systems per-user limits are enacted to restrict how many pending jobs can be present in the queue:  for example, a limit of 10 jobs in the queue at once.  When the 11th job is submitted by the user the submission fails and the job is NOT added to the queue.  Each Slurm partition can have this sort of limit placed on it (both per-user and aggregate) and each QOS can override those limits.
-It would be preferable to avoid enacting such limits on Caviness.
+It would be preferable to avoid enacting such limits on Caviness.  Should user behavior change over time (for example, if users routinely abuse this leniency), submission limits may become necessary.
 ==== Altered priority weights ====
-The dominance of wait time in priority calculations is probably the factor contributing most greatly to this problem.  Wait time should not dominate over fair-share, so at the very least those two weights' magnitudes must be reversed.  The job size weight of 1 also tends to lead toward issues with small jobs starving-out larger jobs on the cluster; in reality, wait time and job size should be considered on a more equal footing.
+The dominance of wait time in priority calculations is probably the factor contributing most to this problem.  Wait time should not dominate over fair-share, so at the very least those two weights' magnitudes must be reversed.  The job size weight of 1 makes that factor more or less useless, since the vast majority of jobs request fewer than 50% of the cores, which maps them to a contribution of 0.  In reality, wait time and job size should be considered on a more equal footing.  However, there are multiple //job size// contributions to the priority:
+  - the job size factor already discussed (percentage of cores) weighted by PriorityWeightJobSize
+  - the TRES size factor calculated using PriorityWeightTRES weights and requested resource values
 The Slurm documentation also points out that the priority factor weights should be of a magnitude that allows enough significant digits from each factor (minimum 1000 for important factors).  The scheduling priority is an unsigned 32-bit integer (ranging [0,4294967295]).  In constructing a new set of priority weights:
@@ Line 62: / Line 65: @@
 The wait time and job size group of bits is split 60% to wait time, 40% to job size:
-^mask^
+^mask^sub-factor^
 |2457 = ''0x00000999''|wait time|
 |1638 = ''0x00000666''|job size|
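The 60/40 split can be checked with shell arithmetic.  This is a minimal sketch, assuming the wait time and job size group spans the low 12 bits (''0x00000FFF''); that width is inferred from the sum of the two masks in the table, not stated explicitly here.

```bash
# Assumption: the wait time + job size group is the low 12 bits (0xFFF = 4095).
group_mask=$(( 0xFFF ))

# Integer split of the group: 60% to wait time, 40% to job size.
wait_mask=$(( group_mask * 60 / 100 ))
size_mask=$(( group_mask * 40 / 100 ))

# Print each value in decimal and as a zero-padded 32-bit hex mask.
printf 'wait time: %d = 0x%08X\n' "$wait_mask" "$wait_mask"
printf 'job size:  %d = 0x%08X\n' "$size_mask" "$size_mask"
```

Under that assumption the two values reproduce the masks in the table and together fill the 12-bit group exactly.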
-The **PriorityWeightTRES** must be set, as well, to yield a non-zero contribution for the job size factor.
+The **PriorityWeightTRES** must be set, as well, to yield a non-zero contribution for job size; internally, the weights in **PriorityWeightTRES** are converted to double-precision floating point values.  From the Slurm source code, the calculation of job size depends on the TRES limits associated with the partition(s) to which the job was submitted (or in which it is running).  The job's requested resource value is divided by the maximum value for the partition, yielding the resource fractions associated with the job.  The fractions are multiplied by the weighting values in **PriorityWeightTRES**.
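The calculation described above (requested value divided by the partition maximum, then multiplied by the TRES weight) can be sketched as follows.  The weights are the ones proposed in the Implementation section; the partition maxima and the job's request are hypothetical numbers chosen only for illustration, and the sketch ignores whatever rounding Slurm applies internally.

```bash
# Weights come from the proposed PriorityWeightTRES setting.
# Partition maxima and the job request are hypothetical example values.
tres_priority=$(awk 'BEGIN {
    # weight * (requested / partition maximum), summed over TRES types
    p  = 819 * (40  / 400)     # cpu:  40 of 400 cores
    p += 245 * (100 / 1000)    # mem:  100 of 1000 GB
    p += 245 * (0   / 10)      # gpu:  0 of 10 devices
    p += 327 * (1   / 10)      # node: 1 of 10 nodes
    printf "%.1f", p
}')
echo "TRES contribution: $tres_priority"
```

Because each fraction is relative to the partition, the same job submitted to a partition with half the resources would see its fractions, and therefore this contribution, double.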
-=== Addition of GRES ===
+<note important>Calculating job size relative to each partition is counterintuitive for our usage:  jobs in partitions with fewer resources are likely to have routinely higher resource fractions than jobs in partitions with more resources.  Calculating job size relative to the full complement of cluster resources would be a very useful option (does not exist yet).  It's possible that the partition contribution to priority is seen as the way to balance against job size (lower value on larger partitions) but that seems far more complicated versus having a ''PRIORITY_*'' flag to select partition versus global fractions.</note>
-New Generic RESource types will be added to represent the GPU devices present in Generation 2 nodes.
+The TRES weights should sum to a maximum of 1638 (see the job size factor above), with the weight on ''cpu'' dominating.
-^GRES name^Type^Description^
-|gpu|v100|nVidia Volta GPU|
-|gpu|t4|nVidia T4 GPU|
-=== Addition of Nodes ===
-^Kind^Features^Nodes^GRES^
-|Baseline (2 x 20C, 192 GB)|Gen2,Gold-6230,6230,192GB|r03n[29-57]||
-|Large memory (2 x 20C, 384 GB)|Gen2,Gold-6230,6230,384GB|r03n[00-23],r03n28||
-|X-large memory (2 x 20C, 768 GB)|Gen2,Gold-6230,6230,768GB|r03n27||
-|XX-large memory (2 x 20C, 1024 GB)|Gen2,Gold-6230,6230,1024GB|r03n[24-26]||
-|Low-end GPU (2 x 20C, 192 GB, 1 x T4)|Gen2,Gold-6230,6230,192GB|r03g[00-02]|''gpu:t4:1''|
-|Low-end GPU (2 x 20C, 384 GB, 1 x T4)|Gen2,Gold-6230,6230,384GB|r03g[03-04]|''gpu:t4:1''|
-|Low-end GPU (2 x 20C, 768 GB, 1 x T4)|Gen2,Gold-6230,6230,768GB|r03g[07-08]|''gpu:t4:1''|
-|All-purpose GPU (2 x 20C, 384 GB, 2 x V100)|Gen2,Gold-6230,6230,384GB|r03g05|''gpu:v100:2''|
-|All-purpose GPU (2 x 20C, 768 GB, 2 x V100)|Gen2,Gold-6230,6230,768GB|r03g06|''gpu:v100:2''|
-The Features column is a comma-separated list of tags that a job can match against.  For example, to request that a job execute on node(s) with Gold 6230 processors and (nominally) 768 GB of RAM:
-<code bash>
-$ sbatch --constraint='Gold-6230&768GB' …
-</code>
-All previous-generation nodes' feature lists will have ''Gen1'' added to allow jobs to target a specific generation of the cluster:
-<code bash>
-$ sbatch --constraint=Gen1 …
-</code>
-=== Changes to Workgroup Accounts, QOS, and Shares ===
-  * Any existing workgroups with additional purchased resource capacity will have their QOS updated to reflect the aggregate core count, memory capacity, and GPU count.
-  * New workgroups will have an appropriate Slurm account created and populated with sponsored users.  A QOS will be created with purchased core count, memory capacity, and GPU count.
-  * Slurm cluster shares (for fair-share scheduling) are proportional to each workgroup's percentage of the full value of the cluster.  All workgroups will have their relative scheduling priority adjusted accordingly.
-=== Changes to Partitions ===
-The **standard** partition node list will be augmented to:
-<code>
-Nodes=r[00-01]n[01-55],r00g[01-04],r01g[00-04],r02s[00-01],r03n[00-57],r03g[00-08]
-</code>
-Various workgroups' priority-access partitions will also be modified to include node kinds purchased, and any new stakeholders will have their priority-access partition added.
-=== Network Topology Changes ===
-The addition of the new rack brings two new OPA switches into the high-speed network topology.  The ''topology.conf'' file can be adjusted by running the [[http://gitlab.com/jtfrey/opa2slurm|opa2slurm]] utility once the rack is integrated and online.
-==== Changes to auto_tmpdir ====
-The directory removal error message has been changed to an internal informational message that users will not see.  Additional changes were made to the plugin to address the race condition itself.
-An additional option has been added to the plugin to request shared per-job (and per-step) temporary directories on the Lustre file system:
-<code>
---use-shared-tmpdir     Create temporary directories on shared storage (overridden
-                        by --tmpdir).  Use "--use-shared-tmpdir=per-node" to create
-                        unique sub-directories for each node allocated to the job
-                        (e.g. <base>/job_<jobid>/<nodename>).
-</code>
-^Variant^Node^TMPDIR^
-|job, no per-node|''r00n00''|''/lustre/scratch/slurm/job_12345''|
-|:::|''r00n01''|''/lustre/scratch/slurm/job_12345''|
-|step, no per-node|''r00n00''|''/lustre/scratch/slurm/job_12345/step_0''|
-|:::|''r00n01''|''/lustre/scratch/slurm/job_12345/step_0''|
-|job, per-node|''r00n00''|''/lustre/scratch/slurm/job_12345/r00n00''|
-|:::|''r00n01''|''/lustre/scratch/slurm/job_12345/r00n01''|
-|step, per-node|''r00n00''|''/lustre/scratch/slurm/job_12345/r00n00/step_0''|
-|:::|''r00n01''|''/lustre/scratch/slurm/job_12345/r00n01/step_0''|
 ===== Implementation =====
-The auto_tmpdir plugin has already been compiled and debugged/tested on another Slurm cluster.  The code has been compiled on Caviness and to activate must be installed (''make install'') from the current build directory.  The plugin is loaded by ''slurmstepd'' as jobs are launched but is not used by ''slurmd'' itself, so a restart of ''slurmd'' on all Gen1 compute nodes is not necessary in this regard.
+The priority weight factors will be adjusted as follows:
-To make all of these changes atomic (in a sense), all nodes will be put in the **DRAIN** state to prohibit additional jobs' being scheduled while jobs already running are left alone.
-Next, the Slurm accounting database must be updated with:
-  * Changes to cluster share for existing workgroup accounts
-  * Changes to resource levels for existing workgroup QOS's who purchased Gen2 resources
-  * Addition of new workgroup accounts
-  * Addition of new workgroup QOS's
-Adding new nodes and partitions to the Slurm configuration requires the scheduler (''slurmctld'') to be fully restarted.
-The execution daemons (''slurmd'') on Gen1 compute nodes can be informed of the new configuration once ''slurmctld'' has been restarted using ''scontrol reconfigure'' but should not require a restart.  Finally, the Gen1 nodes can be shifted out of the **DRAIN** state and Gen2 compute nodes can have ''slurmd'' started.
-The sequence of operations looks something like this (on ''r02mgmt00''):
-<code bash>
-$ scontrol update nodename=r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01] state=DRAIN reason=reconfiguration
-$ # …update existing workgroups' cluster share using sacctmgr…
-$ # …update existing workgroups' QOS resource levels using sacctmgr…
-$ # …add new workgroups' accounts using sacctmgr…
-$ # …add new workgroups' QOS resource levels using sacctmgr…
-</code>
-The updated Slurm configuration must be pushed to all compute nodes:
-<code bash>
-$ wwsh provision set r03g\* --fileadd=gen2-gpu-cgroup.conf
-$ wwsh provision set r03g[00-04] r03g[07-08] --fileadd=gen2-gpu-t4-gres.conf
-$ wwsh provision set r03g[05-06] --fileadd=gen2-gpu-v100-gres.conf
-$ wwsh file sync slurm-nodes.conf slurm-partitions.conf topology.conf
-$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
-$ # …wait…
-$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
-$ # …repeat until…
-$ pdsh -w r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01],r03n[00-57],r03g[00-08] ls -ld /etc/slurm/nodes.conf | grep '_slurmadm *3074' | wc -l
-</code>
-Now that all nodes have the correct configuration files, the scheduler configuration is copied into place and both instances are restarted:
-<code bash>
+^Configuration Key^Old Value^New Value^
-$ sudo cp /opt/slurm-conf/controller/gres.conf /etc/slurm/gres.conf
+|''PriorityWeightAge''| 8000| 2457|
-$ sudo cp /opt/slurm-conf/nodes.conf /etc/slurm/nodes.conf
+|''PriorityWeightFairshare''| 4000| 1073737728|
-$ sudo cp /opt/slurm-conf/partitions.conf /etc/slurm/partitions.conf
+|''PriorityWeightJobSize''| 1| 0|
-$ sudo cp /opt/slurm-conf/topology.conf /etc/slurm/topology.conf
+|''PriorityWeightPartition''| 2000| 0|
-$ sudo rsync -arv /etc/slurm/ root@r02mgmt01:/etc/slurm/
+|''PriorityWeightQOS''| 20000| 3221225472|
-$ sudo systemctl restart slurmctld
+|''PriorityWeightTRES''| unset| ''cpu=819,mem=245,GRES/gpu=245,node=327''|
-$ sudo ssh r02mgmt01 systemctl restart slurmctld
+|''PriorityFlags''| ''FAIR_TREE,SMALL_RELATIVE_TO_TIME''| ''FAIR_TREE''|
-</code>
-Return the Gen1 nodes to service:
+Priorities will make use of the full 32 bits available, with the fair-share factor dominating and having the most precision assigned to it.
-<code bash>
+Since the ''PriorityWeightJobSize'' will not be used, the more complex "small-relative-to-time" algorithm will be disabled.
-$ scontrol reconfigure
-$ scontrol update nodename=r[00-01]n[00-56],r[00-01]g[00-04],r02s[00-01] state=UNDRAIN
-</code>
-And the Gen2 nodes can have their Slurm service started:
+The modifications to ''slurm.conf'' must be pushed to all systems.  The ''scontrol reconfigure'' command should be all that is required to activate the altered priority calculation scheme.
-<code bash>
-$ pdsh -w r03n[00-57],r03g[00-08] systemctl start slurmd
-</code>
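Before pushing the new ''slurm.conf'', the weights in the table above can be sanity-checked against the 32-bit priority range.  This is a minimal sketch, assuming every priority factor (and every TRES fraction) is normalized to at most 1.0, so the sum of the weights is the largest priority a job can reach; it also assumes a shell with 64-bit integer arithmetic.

```bash
# New weights, copied from the table above.
w_qos=3221225472
w_fairshare=1073737728
w_age=2457
# PriorityWeightTRES: cpu + mem + GRES/gpu + node
w_tres=$(( 819 + 245 + 245 + 327 ))

# Largest possible priority when every factor is 1.0 (assumption).
max_priority=$(( w_qos + w_fairshare + w_age + w_tres ))

printf 'QOS        0x%08X\n' "$w_qos"
printf 'fair-share 0x%08X\n' "$w_fairshare"
printf 'TRES sum   %d (limit 1638)\n' "$w_tres"
printf 'maximum    %d (limit 4294967295)\n' "$max_priority"
```

The hex output shows that the QOS and fair-share weights occupy disjoint bit ranges, with the low 12 bits left for wait time and job size.  After ''scontrol reconfigure'', running ''sprio -w'' should list the configured weights for comparison.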
 ===== Impact =====
@@ Line 221: / Line 101: @@
 ^Date ^Time ^Goal/Description ^
-|2019-09-09| |Authoring of this document|
+|2019-10-24| |Authoring of this document|
-|2019-10-23|11:00|Changes made|
+|2020-03-09|09:00|Implementation|