====== Revisions to Slurm Configuration v1.1.2 on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Priority-access partition node counts ====

When the priority-access //workgroup partitions// were created, each workgroup's partition:
  * would encompass //all// nodes of the variet(ies) that workgroup purchased
  * would have resource limits in place to prevent running jobs from using:
    * more than the number of cores purchased
    * more than the number of nodes purchased
    * more than the number of GPUs purchased
A workgroup that purchased (1) baseline node and (1) 256 GiB node has priority access to 72 cores and 2 nodes, drawn from the pool of all baseline and 256 GiB nodes.
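
In Slurm terms, a cap of that sort is typically expressed as ''GrpTRES'' limits on a QOS attached to the workgroup partition. The following is only a sketch; the partition and QOS names are hypothetical, not the actual Caviness configuration:

<code>
# Hypothetical QOS for the example workgroup: at most 72 cores and 2 nodes
# in use across all of the group's running jobs.
sacctmgr add qos it_css-qos
sacctmgr modify qos it_css-qos set GrpTRES=cpu=72,node=2

# slurm.conf: attach the QOS to the workgroup's priority-access partition.
PartitionName=it_css Nodes=... QOS=it_css-qos
</code>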

This configuration was predicated on the assumption that the node-count limit would apply to the number of nodes occupied by a workgroup's running jobs.

<WRAP negative round>
Slurm node-count limits apply to the number of nodes requested by the job, not the number of nodes actually occupied by jobs.
</WRAP>

The workgroup cited above quickly noticed that having 2 jobs running that each requested just 1 core would saturate their node-count limit — despite the fact that the scheduler packed those 2 jobs onto a single node, leaving 70 of their 72 cores unusable.
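
As a concrete illustration (the script name is a placeholder), two submissions like the following count as 2 nodes against the group's limit even when Slurm places both on the same node:

<code>
# Each submission requests a single task; together they saturate a node=2
# group limit even if both jobs end up running on one physical node.
sbatch --partition=it_css --ntasks=1 job.qs
sbatch --partition=it_css --ntasks=1 job.qs
</code>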

As currently configured, priority-access partitions are only optimal for jobs that can make use of whole nodes (36N cores).

==== Job array size limits ====

On Mills and Farber, Grid Engine job arrays have the following limits per array job:
  * no more than 2000 array indices will be scheduled concurrently
  * the number of indices cannot exceed 75000
The //number of indices// is affected by the latter limit, not the //values// of the indices: for example, an array submitted with ''-t 2-150000:2'' contains 75000 indices and is acceptable, even though the index values themselves run well past 75000.

Slurm job arrays also have limits applied to them. Caviness currently uses the default limits:
  * job arrays can only use indices in the range ''0-1000''
  * the maximum array size, ''MaxArraySize'', is the default of 1001
Unlike Grid Engine, where the number of indices was limited, Slurm limits the index value range itself.
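
For example (the script name is a placeholder), under the default ''MaxArraySize'' the first of these submissions is accepted and the second is rejected, even though the second describes only a single task:

<code>
# Accepted: every index value is below the default MaxArraySize of 1001.
sbatch --array=0-1000 job.qs

# Rejected: the index value 75000 exceeds MaxArraySize, even though the
# array contains just one task.
sbatch --array=75000 job.qs
</code>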

Slurm treats an array job like a template that generates unique jobs: each index is internally handled like a standard job, inheriting its resource requirements from the template.

<WRAP negative round>
Slurm job arrays use a fixed range of indices in the configured range ''0'' through ''MaxArraySize - 1''; an index value outside that range is rejected regardless of how few indices the array contains.
</WRAP>
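
The limits the controller is currently using can be checked directly, for example:

<code>
# Report the array-index and total-job limits in the running configuration.
scontrol show config | grep -E 'MaxArraySize|MaxJobCount'
</code>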

===== Solutions =====

==== Remove node-count limits ====

The manner by which Slurm applies node-count limits is an acknowledged problem in the HPC community.

Removing the node-count limits on priority-access partitions must be accompanied by the addition of a limit that still prevents jobs from using more of the cluster's resources than the workgroup purchased.

<WRAP positive round>
The node-count limit will be removed from priority-access workgroup partitions.
</WRAP>

In the example above, the workgroup would have the ''cpu=72'' limit remaining; a memory limit is also needed so that those 72 cores cannot be paired with more memory than the workgroup purchased. Complicating this, nodes do not report a round memory size to the scheduler (a nominal 128 GiB node reports roughly 125.8 GiB as available).

The solution to this additional issue is to force a memory size in the Slurm configuration for each node, rather than letting the node report how much memory it has available.

<code>
NodeName=r00n[01-17,
</code>

effecting a limit of 124 GiB of usable memory on each node (leaving 4 GiB for the OS and job dispatch, in line with the 125.8 GiB reported by the node).
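
As an illustration of the intent (the node list and core layout below are examples only, and 124 GiB is written as 126976 MiB), such an entry might look like:

<code>
# Illustrative only: a baseline node defined with an explicit RealMemory
# of 124 GiB (126976 MiB) rather than the value the node detects itself.
NodeName=r00n[01-17] Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=126976
</code>

After such a change, ''scontrol show node'' reports the forced value in its ''RealMemory'' field rather than the detected amount.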

=== Possible issues ===

The problem with being forced to use a single partition backed by a variety of node kinds is that the workgroup above — with a baseline node and a 256 GiB node — could submit a sequence of jobs that only require 1 GiB of memory each. In that case, the jobs will tend to pack onto baseline nodes, leaving the 256 GiB node untouched.

Ideally, we would want to be able to express a workgroup limit that differentiates between the kinds of nodes that service a partition.
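
Until something like that exists, a job can at least be steered toward the larger node through its own memory request; for example (a sketch, with a hypothetical partition and script name):

<code>
# A 200 GiB request cannot be satisfied by a ~124 GiB baseline node, so
# the job can only be scheduled on the workgroup's 256 GiB node.
sbatch --partition=it_css --mem=200G job.qs
</code>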

==== Increase MaxArraySize and MaxJobCount ====

The job scheduler currently uses the default job count limit (''MaxJobCount'') of 10000 and the default array size limit (''MaxArraySize'') of 1001.

<WRAP positive round>
Increasing both MaxArraySize and MaxJobCount by an order of magnitude should be permissible without negatively impacting the cluster.
</WRAP>

The range for array job indices would become ''0-10000'', and the scheduler would be able to track up to 100000 jobs at once.
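
In slurm.conf terms, the change amounts to something like the following (a sketch; the exact values adopted on Caviness may differ):

<code>
# One order of magnitude above the defaults of 1001 and 10000.
MaxArraySize=10001
MaxJobCount=100000
</code>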

===== Implementation =====

All changes are effected by altering the Slurm configuration files, pushing the changed files to all nodes, and restarting the controller daemon.
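
A sketch of that procedure, assuming the configuration files are distributed by the cluster's usual configuration-management mechanism:

<code>
# Push the updated slurm.conf to all nodes (method depends on local
# configuration management), then restart the controller and have the
# compute-node daemons re-read their configuration.
systemctl restart slurmctld     # on the scheduler host
scontrol reconfigure            # propagate the new settings to slurmd
</code>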

===== Impact =====

No downtime is expected to be required.

===== Timeline =====

^Date ^Time ^Goal/Milestone ^
|2019-02-04| |Authoring of this document|