
Augmenting Slurm Swap Control

Slurm implements per-job RAM limits. The memory requested by a job is debited from the configured amount of RAM on the node(s) on which it runs to prevent RAM from being oversubscribed. Runtime RAM limits are enforced on Linux nodes by means of the memory cgroup controller: the job's cgroup RAM limit is set to the requested RAM limit.
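For reference, RAM enforcement of this kind is typically enabled through the task/cgroup plugin and a handful of cgroup.conf settings; a minimal sketch (the values shown are illustrative, not our production configuration) would be:

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainRAMSpace=yes
AllowedRAMSpace=100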

Memory is a tiered subsystem, though: physical RAM and swap sit underneath virtual memory spaces. Pages of virtual memory reside in one or the other of these hardware tiers. Swap is backed by storage media (HDD, SSD, NVMe) that operate at speeds well below that of RAM, and the CPU will only process instructions and data resident in RAM: the OS must (more slowly) move virtual memory pages back and forth between the two, hence the term swapping and the name swap. The memory cgroup controller also has a virtual memory limit property (essentially the sum of the physical RAM limit and an amount of swap).
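Under the cgroup v1 memory controller those two limits are exposed as separate files within the job's cgroup; a hypothetical job granted an 8 GiB RAM limit plus 2 GiB of allowed swap would look roughly like this (the path and values are illustrative):

[root@r04n00 ~]# cd /sys/fs/cgroup/memory/slurm/uid_1001/job_1234
[root@r04n00 job_1234]# cat memory.limit_in_bytes
8589934592
[root@r04n00 job_1234]# cat memory.memsw.limit_in_bytes
10737418240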

The speed of NVMe storage has led us to offer extended memory nodes in the Caviness and DARWIN clusters. On Caviness the NVMe storage is used solely as swap, yielding a virtual memory system comprising 384 GiB of DDR4 RAM and two (2) 15.36 TB NVMe devices:

[root@r04s00 ~]# egrep '(Mem|Swap)Total:' /proc/meminfo
MemTotal:       394821768 kB
SwapTotal:      30000596984 kB

With /tmp as a virtual memory-backed file system, the NVMe is made available as both scratch storage and allocatable memory. The speed of the underlying NVMe hardware means the OS can swap pages more quickly, so the performance penalty is not as great as with HDD- or SSD-backed swap. Other nodes in the clusters minimize the amount of swap present. Again on Caviness, for example, with SSD present in the nodes:

[root@r04n00 ~]# egrep '(Mem|Swap)Total:' /proc/meminfo 
MemTotal:       196638672 kB
SwapTotal:       1048572 kB

These nodes favor physical RAM over swap. Thus, there are conflicting swap profiles within the same cluster.

Unfortunately, Slurm does not treat swap as a configurable, consumable resource on the nodes. The memory cgroup handling in Slurm uses cluster-wide global percentages rather than per-node values, such that

MaxSwap_{node} = MaxSwapPercent/100 * RealMemory_{node}

and on a per-job basis

UsableSwap_{job,node} = Min(AllowedSwapSpace/100 * MemoryLimit_{job,node}, MaxSwap_{node})

On the extended memory nodes, the swap space is 7600% of the nominal RAM size, while on other nodes swap space is a variable percentage much less than 100% (e.g. 0.53% for the example above). So static percentages do not yield a workable solution in our clusters.
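For reference, the percentages in question live in cgroup.conf and apply cluster-wide, so a single sketch like the following (values are illustrative) has to cover every node type at once; a MaxSwapPercent sized for the extended-memory nodes would be meaningless on nodes carrying roughly 1 GiB of swap, while a small value walls the extended-memory nodes off from their NVMe swap:

ConstrainSwapSpace=yes
AllowedSwapSpace=50
MaxSwapPercent=100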

Solutions

Several solutions were considered.

Ideally, Slurm would be modified to treat swap in the same way physical RAM is treated: per-node configurable and consumable by jobs. Such a change would alter the node configuration data structures throughout the code base and would require alterations to scheduling and cgroup plugins to track, schedule, and enforce the limit (the newer cons_tres scheduling plugin handles the same board, socket, CPU, core, and memory consumables as cons_res but adds NVIDIA and AMD GPUs to the list).
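Had swap been modeled that way, node definitions might hypothetically carry a per-node swap size alongside RealMemory, with the scheduler debiting it as jobs request swap. The Swap= keyword below does not exist in Slurm and is shown only to illustrate the idea (memory figures are in MiB, rounded from the meminfo output above):

# hypothetical slurm.conf syntax; Swap= is NOT a real Slurm node parameter
NodeName=r04s00 RealMemory=385568 Swap=29297458
NodeName=r04n00 RealMemory=192030 Swap=1024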

Additional scheduling complications would be produced in this case: on the DARWIN cluster, for example, jobs are billed by a uniform rate per-core and per-MiB of memory. A job asking for all the memory in a node effectively uses all cores, as well. If swap were a consumable resource, then in similar fashion requesting all of the swap on a node would effectively consume all memory and cores, as well, and would have to be billed.

Note that a consumable GRES would not necessitate extending the cons_res or cons_tres plugin, but it would yield a solely per-node request of the resource. Enforcement of a GRES-based swap limit would still require additional plugins or modifications to Slurm.
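Because GRES names are arbitrary, such a swap GRES could be declared today without touching the selection plugins; a sketch (counts are assumed to be in MiB) might be:

# slurm.conf
GresTypes=swap
NodeName=r04s00 Gres=swap:29297458
NodeName=r04n00 Gres=swap:1024

# gres.conf
NodeName=r04s00 Name=swap Count=29297458
NodeName=r04n00 Name=swap Count=1024

A job would then request it with e.g. --gres=swap:4096, which is applied per node of the job; nothing in the declaration causes the limit to actually be enforced, hence the need for additional plugins or modifications.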

A SPANK plugin could be structured to alter the job/step virtual memory limit in the associated memory cgroups. This alteration would be effected at some point after the slurmstepd process has forked and the task_cgroup plugin has created the cgroup hierarchy for the job/step, but before the step executes any user code (e.g. in the slurm_spank_task_post_fork() callback).
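A minimal sketch of such a callback follows. It assumes the cgroup v1 hierarchy laid out by the task_cgroup plugin (/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_<stepid>) and writes the memsw limit file directly with a fixed value rather than computing one or using slurmd's xcgroup API, so the path construction and the limit shown are purely illustrative:

/* illustrative SPANK plugin sketch, not production code */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <slurm/spank.h>

SPANK_PLUGIN(swap_limit, 1);

int
slurm_spank_task_post_fork(spank_t sp, int argc, char **argv)
{
    uint32_t job_id = 0, step_id = 0;
    uid_t uid = 0;
    char path[4096];
    FILE *fp;

    /* Only act inside slurmstepd (the remote context). */
    if (spank_context() != S_CTX_REMOTE)
        return ESPANK_SUCCESS;

    if ((spank_get_item(sp, S_JOB_ID, &job_id) != ESPANK_SUCCESS) ||
        (spank_get_item(sp, S_JOB_STEPID, &step_id) != ESPANK_SUCCESS) ||
        (spank_get_item(sp, S_JOB_UID, &uid) != ESPANK_SUCCESS))
        return ESPANK_ERROR;

    /* Assumed cgroup v1 path created by the task_cgroup plugin. */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/memory/slurm/uid_%u/job_%u/step_%u/"
             "memory.memsw.limit_in_bytes",
             (unsigned) uid, job_id, step_id);

    /* A real plugin would compute RAM limit + allowed swap here;
     * 9 GiB is a placeholder value. */
    if ((fp = fopen(path, "w")) == NULL) {
        slurm_error("swap_limit: unable to open %s", path);
        return ESPANK_ERROR;
    }
    fprintf(fp, "%" PRIu64 "\n", (uint64_t) 9663676416ULL);
    fclose(fp);
    return ESPANK_SUCCESS;
}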

The SPANK API does not make requested GRES's available to the plugin, so the job consumable GRES would be unknown to the plugin; this makes a SPANK plugin a non-starter with regard to implementing GRES-based swap limits. But a job-agnostic configurable approach is feasible: the plugin argument list would be leveraged to provide selective swap limit criteria to the plugin. For example:

host(r04s[00-01])=none,partition(reserved)=1%/task,partition(standard)=0MiB,default(){min_swap=250MiB}=50MiB/cpu

Consider jobs that use 2 tasks and 4 CPUs on r03n01:

  • --partition=standard matches with the third term in the list, and swap is limited to zero bytes beyond the physical RAM limit
  • --partition=workgroup matches with the fourth term in the list (the default), and a swap limit is calculated as 50 MiB per CPU times 4 CPUs = 200 MiB; the properties on the term indicate a minimum of 250 MiB, so the 200 MiB calculated limit is raised to 250 MiB

Additionally:

  • for any job executing on the extended-memory nodes the first term would be matched and no virtual memory limit would be assigned to the job (we use user-exclusive scheduling of those nodes)
  • --partition=reserved matches the second term, and 1% of the available swap on each node is multiplied by the number of tasks to yield the swap limit on that node
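In practice the criteria string would be handed to the plugin as its arguments on a plugstack.conf line; a sketch (the plugin path and file name are assumptions) would look like:

required  /usr/lib64/slurm/swap_limit.so  host(r04s[00-01])=none,partition(reserved)=1%/task,partition(standard)=0MiB,default(){min_swap=250MiB}=50MiB/cpu

Since the string contains no whitespace, it reaches the plugin as a single argument.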

Memory cgroup limits are computed and effected by the task_cgroup plugin. A single function in this plugin computes the swap limit for a job. The task_cgroup plugin has access to the job step data structure, which contains fields that hold the job and step GRES's, local task and CPU counts, and job environment. Thus, a GRES-based consumable swap limit would be possible if the task_cgroup plugin were modified. Likewise, the job-agnostic configurable approach outlined above would also be feasible in this context.

The primary drawback to this approach is that it requires modification of the Slurm base source code. Any changes made by the Slurm developers have the potential to interfere with the changes introduced here. So this mechanism would be the most efficient means to implement our dynamic swap limits, but it would require additional validation and possible redesign for each future release of Slurm.

Implementation

A SPANK plugin was created with a slurm_spank_task_post_fork() function. All calls to Slurm APIs are contained in that function, including effecting the cgroup memory limit settings via the slurmd xcgroup API.

Code implementing the parsing of the swap limit configuration string, matching job parameters to a term in that string, and computing the term's swap limit given job parameters was abstracted into a separate API. That API is called from the slurm_spank_task_post_fork() function. A separate test program was created to exercise the functionality.
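The rough shape of that API is sketched below; all of the type and function names here are invented for illustration and are not the identifiers used in the actual source:

/* illustrative interface only; names are hypothetical */
#include <stdbool.h>
#include <stdint.h>

typedef struct swap_limit_config swap_limit_config_t;   /* opaque parsed configuration */

/* Per-node job parameters that drive term matching and limit computation. */
typedef struct {
    const char *hostname;      /* node on which the step runs */
    const char *partition;     /* job's partition */
    uint32_t    local_tasks;   /* tasks on this node */
    uint32_t    local_cpus;    /* CPUs allocated on this node */
    uint64_t    mem_limit;     /* per-node RAM limit in bytes */
    uint64_t    node_swap;     /* swap present on this node in bytes */
} swap_limit_job_params_t;

/* Parse the plugin argument string into an ordered list of terms. */
swap_limit_config_t *swap_limit_config_parse(const char *config_str);

/* Match the job parameters against the terms and compute the swap limit in
 * bytes; returns false when the matching term is "none" and no virtual
 * memory limit should be applied. */
bool swap_limit_config_evaluate(const swap_limit_config_t *config,
                                const swap_limit_job_params_t *params,
                                uint64_t *swap_limit_bytes);

void swap_limit_config_free(swap_limit_config_t *config);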

The code is available on GitLab.
