====== Revisions to Slurm Configuration v1.1.2 on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Priority-access partition node counts ====

When the priority-access //workgroup partitions// were created, each workgroup's partition:
  * would encompass //all// nodes of the variet(ies) that workgroup purchased
  * would have resource limits in place to prevent running jobs from using:
    * more than the number of cores purchased
    * more than the number of nodes purchased
    * more than the number of GPUs purchased
A workgroup that purchased (1) baseline node and (1) 256 GiB node has priority access to 72 cores and 2 nodes, drawn from the pool of all baseline and 256 GiB nodes.
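
In Slurm terms, a cap of that sort is typically expressed as ''GrpTRES'' limits on a QOS attached to the workgroup partition. The following is only a sketch; the partition and QOS names are hypothetical, not the actual Caviness configuration:

<code>
# Hypothetical QOS for the example workgroup: at most 72 cores and 2 nodes
# in use across all of the group's running jobs.
sacctmgr add qos it_css-qos
sacctmgr modify qos it_css-qos set GrpTRES=cpu=72,node=2

# slurm.conf: attach the QOS to the workgroup's priority-access partition.
PartitionName=it_css Nodes=... QOS=it_css-qos
</code>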

This configuration was predicated on the assumption that the node-count limit would apply to the number of nodes occupied by a workgroup's running jobs.

<WRAP negative round>
Slurm node-count limits apply to the number of nodes requested by the job, not the number of nodes actually occupied by jobs.
</WRAP>

The workgroup cited above quickly noticed that having 2 jobs running that each requested just 1 core would saturate their node-count limit — despite the fact that the scheduler packed those 2 jobs onto a single node, leaving 70 of their 72 cores unusable.
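
As a concrete illustration (the script name is a placeholder), two submissions like the following count as 2 nodes against the group's limit even when Slurm places both on the same node:

<code>
# Each submission requests a single task; together they saturate a node=2
# group limit even if both jobs end up running on one physical node.
sbatch --partition=it_css --ntasks=1 job.qs
sbatch --partition=it_css --ntasks=1 job.qs
</code>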

As currently configured, priority-access partitions are only optimal for jobs that can make use of whole nodes (36N cores).

==== Job array size limits ====

On Mills and Farber, Grid Engine job arrays have the following limits per array job:
  * no more than 2000 array indices will be scheduled concurrently
  * the number of indices cannot exceed 75000
The //number of indices// is affected by the latter limit, not the //values// of the indices: for example, an array submitted with ''-t 2-150000:2'' contains 75000 indices and is acceptable, even though the index values themselves run well past 75000.

Slurm job arrays also have limits applied to them. Caviness currently uses the default limits:
  * job arrays can only use indices in the range ''0-1000''
  * the maximum array size, ''MaxArraySize'', is the default of 1001
Unlike Grid Engine, where the number of indices was limited, Slurm limits the index value range itself.
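
For example (the script name is a placeholder), under the default ''MaxArraySize'' the first of these submissions is accepted and the second is rejected, even though the second describes only a single task:

<code>
# Accepted: every index value is below the default MaxArraySize of 1001.
sbatch --array=0-1000 job.qs

# Rejected: the index value 75000 exceeds MaxArraySize, even though the
# array contains just one task.
sbatch --array=75000 job.qs
</code>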

Slurm treats an array job like a template that generates unique jobs: each index is internally handled like a standard job, inheriting its resource requirements from the template.

<WRAP negative round>
Slurm job arrays use a fixed range of indices in the configured range ''0'' through ''MaxArraySize - 1''; an index value outside that range is rejected regardless of how few indices the array contains.
</WRAP>
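
The limits the controller is currently using can be checked directly, for example:

<code>
# Report the array-index and total-job limits in the running configuration.
scontrol show config | grep -E 'MaxArraySize|MaxJobCount'
</code>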

===== Solutions =====

==== Remove node-count limits ====

The manner by which Slurm applies node-count limits is an acknowledged problem in the HPC community.

Removing the node-count limits on priority-access partitions must be accompanied by the addition of a limit that still prevents jobs from using more of the cluster's resources than the workgroup purchased.

<WRAP positive round>
The node-count limit will be removed from priority-access workgroup partitions.
</WRAP>

In the example above, the workgroup would have the ''cpu=72'' limit remaining; a memory limit is also needed so that those 72 cores cannot be paired with more memory than the workgroup purchased. Complicating this, nodes do not report a round memory size to the scheduler (a nominal 128 GiB node reports roughly 125.8 GiB as available).

The solution to this additional issue is to force a memory size in the Slurm configuration for each node, rather than letting the node report how much memory it has available.

<code>
NodeName=r00n[01-17,
</code>

effecting a limit of 124 GiB of usable memory on each node (leaving 4 GiB for the OS and job dispatch, in line with the 125.8 GiB reported by the node).
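
As an illustration of the intent (the node list and core layout below are examples only, and 124 GiB is written as 126976 MiB), such an entry might look like:

<code>
# Illustrative only: a baseline node defined with an explicit RealMemory
# of 124 GiB (126976 MiB) rather than the value the node detects itself.
NodeName=r00n[01-17] Sockets=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=126976
</code>

After such a change, ''scontrol show node'' reports the forced value in its ''RealMemory'' field rather than the detected amount.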

=== Possible issues ===

The problem with being forced to use a single partition backed by a variety of node kinds is that the workgroup above — with a baseline node and a 256 GiB node — could submit a sequence of jobs that only require 1 GiB of memory each. In that case, the jobs will tend to pack onto baseline nodes, leaving the 256 GiB node untouched.

Ideally, we would want to be able to express a workgroup limit that differentiates between the kinds of nodes that service a partition.
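
Until something like that exists, a job can at least be steered toward the larger node through its own memory request; for example (a sketch, with a hypothetical partition and script name):

<code>
# A 200 GiB request cannot be satisfied by a ~124 GiB baseline node, so
# the job can only be scheduled on the workgroup's 256 GiB node.
sbatch --partition=it_css --mem=200G job.qs
</code>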

==== Increase MaxArraySize and MaxJobCount ====

The job scheduler currently uses the default job count limit (''MaxJobCount'') of 10000 and the default array size limit (''MaxArraySize'') of 1001.

<WRAP positive round>
Increasing both MaxArraySize and MaxJobCount by an order of magnitude should be permissible without negatively impacting the cluster.
</WRAP>

The range for array job indices would become ''0-10000'', and the scheduler would be able to track up to 100000 jobs at once.
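
In slurm.conf terms, the change amounts to something like the following (a sketch; the exact values adopted on Caviness may differ):

<code>
# One order of magnitude above the defaults of 1001 and 10000.
MaxArraySize=10001
MaxJobCount=100000
</code>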

===== Implementation =====

All changes are effected by altering the Slurm configuration files, pushing the changed files to all nodes, and restarting the controller daemon.
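
A sketch of that procedure, assuming the configuration files are distributed by the cluster's usual configuration-management mechanism:

<code>
# Push the updated slurm.conf to all nodes (method depends on local
# configuration management), then restart the controller and have the
# compute-node daemons re-read their configuration.
systemctl restart slurmctld     # on the scheduler host
scontrol reconfigure            # propagate the new settings to slurmd
</code>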

===== Impact =====

No downtime is expected to be required.

===== Timeline =====

^Date ^Time ^Goal/Milestone ^
|2019-02-04| |Authoring of this document|