
Revisions to Slurm Configuration v1.1.2 on Caviness

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

When the priority-access workgroup partitions were created, the idea was that each workgroup's partition:

  • would encompass all nodes of the variety (or varieties) that the workgroup purchased
  • would have resource limits in place to prevent running jobs from using:
    • more than the number of cores purchased
    • more than the number of nodes purchased
    • more than the number of GPUs purchased

A workgroup that purchased one baseline node and one 256 GiB node has priority access to 72 cores and 2 nodes, drawn from the pool of all baseline and 256 GiB nodes. This combines the workgroup queue and spillover queue found on Mills and Farber; since Slurm only executes a job in a single partition, a combined scheme of this kind is necessary. The node-count limit prevents that same workgroup from requesting enough memory per job that its 72 cores would be drawn from e.g. 4 nodes instead of just 2.
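
As a rough sketch, limits of this kind are typically expressed as aggregate TRES limits on the workgroup's QOS; the QOS name below is hypothetical:

  # One baseline node + one 256 GiB node purchased:
  #   at most 72 cores spread across at most 2 nodes
  sacctmgr modify qos <workgroup-qos> set GrpTRES=cpu=72,node=2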

This configuration was predicated on the assumption that the node-count limit would apply to the number of nodes occupied by a workgroup's running jobs. In reality:

Slurm node-count limits apply to the number of nodes requested by the jobs, not the number of nodes actually occupied by jobs. When the node count is exceeded, jobs will remain pending with a reason of QOSGrpNodeLimit displayed by the squeue command.

The workgroup cited above quickly noticed that 2 running jobs, each requesting just 1 core, saturated their node-count limit, even though the scheduler packed those 2 jobs onto a single node, leaving 70 of their 72 cores unusable.
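
A hedged illustration of that behavior, using a hypothetical workgroup partition name and batch script:

  # Two single-core jobs; Slurm may pack both onto one physical node,
  # yet each counts as one node against the QOS node-count limit
  sbatch --partition=<workgroup> --ntasks=1 job.qs
  sbatch --partition=<workgroup> --ntasks=1 job.qs
  # A third such job would remain pending; its reason (QOSGrpNodeLimit)
  # is visible via squeue, e.g.:
  squeue -u $USER --Format=jobid,partition,state,reason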

As currently configured, priority-access partitions are only optimal for jobs that can make use of whole nodes, i.e. multiples of 36 cores. Workgroups with high-throughput workflows are most negatively impacted by this issue.

On Mills and Farber, Grid Engine job arrays have the following limits per array job:

  • no more than 2000 array indices will be scheduled concurrently
  • the number of indices cannot exceed 75000

The latter limit constrains the number of indices, not their values: using -t 75000-155000 would fail (80001 indices), but -t 75000-155000:4 is fine (20001 indices).

Slurm job arrays also have limits applied to them. Caviness currently uses the default limits:

  • job arrays can only use indices in the range [0,N)
    • the maximum array size, N, defaults to 1001 and cannot exceed 4000001

Unlike Grid Engine, where the number of indices was limited, Slurm limits the index value range itself.
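
As an illustration, assuming the default maximum array size of 1001 (job.qs is a placeholder script name):

  sbatch --array=0-1000 job.qs     # accepted: all indices lie within [0,1001)
  sbatch --array=0-1001 job.qs     # rejected: index 1001 falls outside that range
  sbatch --array=500-1500 job.qs   # also rejected, despite requesting only 1001 indices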

Slurm treats an array job like a template that generates unique jobs: each index is internally handled like a standard job, inheriting its resource requirements from the template. Slurm also limits the total number of jobs that it will accept (whether their state is running or pending) to 10000 by default. Since Slurm holds all job information in memory until the job completes, that limit must be adjusted carefully to avoid oversubscribing the scheduler node's own resource limits (causing scheduling to become sluggish or even unresponsive).

Slurm job arrays use indices drawn from the configured range [0,N). The default configuration sets N (MaxArraySize) to 1001 and the maximum number of jobs (MaxJobCount) to 10000. Increasing N typically requires MaxJobCount to be increased as well. Great attention must be paid to memory consumption by the scheduler as these values are increased.
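
In slurm.conf terms, these two limits correspond to the following parameters (defaults shown):

  MaxArraySize=1001     # array indices must lie in [0,1001)
  MaxJobCount=10000     # total running + pending jobs retained in scheduler memory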

The manner in which Slurm applies node-count limits is an acknowledged problem in the HPC community. One of Slurm's strengths is its plugin-oriented construction, which allows the configuration file to select which functionalities are used. This scheme is applied to many components, including the scheduling algorithm and resource assignment. But the plugin scheme is also a weakness: greater sophistication requires changes to the API shared by a plugin type, and possibly updates to every plugin using that API; and adding such sophistication to a generalized API (like those typical of most Slurm plugins) is extremely difficult. Hence, the recommendation from the developers of Slurm is to not use node-count limits.

Removing the node-count limits on priority-access partitions must be accompanied by some other limit that still prevents jobs from using more nodes than were purchased. The natural choice is the physical memory in the nodes purchased by the workgroup.

The node-count limit will be removed from priority-access workgroup partitions. A memory limit will be added that reflects the total physical memory purchased by the workgroup.

Slurm configures the memory size for each node in one of two ways:

  • When the job execution daemon (slurmd) starts, it reports the amount of available system memory at that time to the scheduler
  • The Slurm configuration file specifies a nominal memory size for the node

Caviness has been using the first method, since it discounts the memory consumed by the operating system and reflects what is truly available to jobs. For example, baseline nodes in Caviness show a memory size of 125.8 GiB versus the 128 GiB of physical memory present in them. What this has meant for users is that a job submitted with the requirement --mem=128G actually needs to run on a 256 GiB or 512 GiB node, when often the user believed the option indicated that a baseline node would suffice.
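
For example, with a hypothetical batch script job.qs:

  # Intended to target a "128 GB" baseline node, but such nodes only offer
  # about 125.8 GiB to jobs, so the request can only be met by a larger node:
  sbatch --mem=128G job.qs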

Rather than letting the node report how much memory it has available, the second method cited above will now be used with the nominal amount of physical memory present in the node. Thus, for baseline nodes the Slurm configuration would be changed to:

NodeName=r00n[01-17,45-55] CPUs=36 Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=131072 Feature="E5-2695,E5-2695v4,128GB" Weight=10

The FastSchedule option will be enabled to force the scheduler to consult the values in the configuration file rather than the real values reported by the nodes themselves.
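
In slurm.conf this amounts to something along the lines of:

  FastSchedule=1    # trust the configured node definitions rather than what slurmd reports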

Workgroup QOS aggregate memory limits will be the sum, over node types, of each type's nominal memory size times the number of nodes of that type purchased. In the example above, the workgroup would have the node=2 limit removed and replaced with mem=393216 (128 GiB + 256 GiB = 384 GiB = 393216 MiB).
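
Continuing the hypothetical QOS sketch from earlier, the change would resemble:

  # Clear the node-count limit (-1) and add the aggregate memory limit in MiB:
  sacctmgr modify qos <workgroup-qos> set GrpTRES=cpu=72,node=-1,mem=393216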

Possible issues

The problem with being forced to use a single partition backed by a variety of node kinds is that the workgroup above (with one baseline node and one 256 GiB node) could submit a sequence of jobs that each require only 1 GiB of memory. In that case, the jobs will tend to pack onto baseline nodes, leaving the 256 GiB node untouched. A second workgroup with a similar purchase profile could likewise submit jobs that require just 1 GiB of memory, but an absence of free baseline nodes would see those jobs running on the 256 GiB nodes. Now suppose the first workgroup's jobs complete and one of its jobs requiring an entire 256 GiB node becomes eligible to execute. That 256 GiB node is still occupied by the second workgroup's small-memory jobs, so the eligible job must wait for them to finish.

Ideally, we would want to be able to express a workgroup limit that differentiates between the kinds of nodes that service a partition. This is simply not possible in Slurm.1) The only mitigation for the situation outlined above is to have additional capacity, above and beyond what was purchased, and to rely on walltime limits and workgroup inactivity to keep nodes free.

The job scheduler currently uses the default job count limit of 10000. Slurm documentation recommends this limit not be increased above a few hundred thousand. The current MaxArraySize limit of 1001 has seemed to pair well with the MaxJobCount of 10000. The scheduler nodes have a nominal 256 GiB of memory, and the scheduler is currently occupying just over 14 GiB of memory.

Increasing both MaxArraySize and MaxJobCount by an order of magnitude should be permissible without negatively impacting the cluster.

The range of valid array job indices would become [0,10001), i.e. 0 through 10000.
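
A sketch of the corresponding slurm.conf change (the exact values chosen here are illustrative):

  MaxArraySize=10001    # array indices 0 through 10000
  MaxJobCount=100000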

All changes are effected by altering the Slurm configuration files, pushing the changed files to all nodes, and signaling a change in configuration so all daemons refresh their configuration.
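
Assuming the usual mechanism, the final step is typically something like:

  # after the updated configuration files have been distributed to all nodes:
  scontrol reconfigure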

No downtime is expected to be required.

Date        Time   Goal/Description
2019-02-04         Authoring of this document
2019-02-06         Document shared with Caviness community for feedback
2019-02-13         Add announcement of impending change to login banner
2019-02-18  09:00  Configuration changes pushed to cluster nodes
2019-02-18  09:30  Restart scheduler, notify compute nodes of reconfiguration
2019-02-20         Remove announcement from login banner

1)
We could invent our own generic resources (GRES) that qualify the cores in nodes, e.g. core_gen1_128GB, core_gen1_256GB. To enforce such limits would require that our job submission plugin determine what kind of node each job should run on and add the necessary GRES request. Such determination would need to also be aware of the chosen partition, the workgroup, and possibly what kinds of node that workgroup purchased. With each expansion the complexity of this determination would grow. In short, this solution is not sustainable.