====== Revisions to Slurm Configuration v1.1.2 on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Priority-access partition node counts ====
When the priority-access //workgroup partitions// were created, each partition:
  * would encompass //all// nodes of the variet(ies) that workgroup purchased
  * would have resource limits in place to prevent running jobs from using:
    * more than the number of cores purchased
    * more than the number of nodes purchased
    * more than the number of GPUs purchased
A workgroup that purchased (1) baseline node and (1) 256 GiB node has priority access to 72 cores and 2 nodes, drawn from the pool of all baseline and 256 GiB nodes.
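
These per-workgroup limits are typically carried on a QOS associated with the partition; the following is a minimal sketch, assuming a hypothetical workgroup QOS named ''it_nss'' for the example above:

<code>
# 72 cores and 2 nodes purchased; limits applied to the workgroup QOS:
sacctmgr modify qos it_nss set GrpTRES=cpu=72,node=2
</code>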

This configuration was predicated on the assumption that the node-count limit would apply to the number of nodes occupied by a workgroup's running jobs.

<WRAP negative round>
Slurm node-count limits apply to the number of nodes requested by jobs, not the number of nodes actually occupied by jobs. When the node count is exceeded, jobs will remain pending with a reason of ''QOSGrpNodeLimit''.
</WRAP>
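
Jobs held by this limit show that reason in ''squeue'' output (job id and name here are illustrative):

<code>
$ squeue -t PD -O JobID,Name,Reason
JOBID               NAME                REASON
1234567             job2                QOSGrpNodeLimit
</code>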

The workgroup cited above quickly noticed that 2 running jobs, each requesting just 1 core, were enough to saturate their node-count limit, despite the fact that the scheduler had packed those 2 jobs onto a single node, leaving 70 of their 72 cores unusable.

As currently configured, priority-access partitions are therefore only optimal for jobs that can make use of 36N cores, i.e. jobs sized to consume whole 36-core nodes.
==== Job array size limits ====

On Mills and Farber, Grid Engine job arrays have the following limits per array job:
  * no more than 2000 array indices will be scheduled concurrently
  * the number of indices cannot exceed 75000
The //number// of indices is constrained by the latter limit, not the //values// of the indices: an array job using indices 100000 through 174999 is acceptable, since it contains exactly 75000 indices even though every index value exceeds 75000.
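
For illustration, such an array job would be submitted to Grid Engine as follows (script name hypothetical):

<code>
# 75000 indices (100000, 100001, ..., 174999) -- within the per-job limit:
qsub -t 100000-174999 my_array_job.qs
</code>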

Slurm job arrays also have limits applied to them. Caviness currently uses the default limits:
  * job arrays can only use index values in the range ''[0, MaxArraySize - 1]''
  * the maximum array size, ''MaxArraySize'', defaults to 1001 (permitting indices 0 through 1000)
Unlike Grid Engine, where the number of indices was limited, Slurm limits the index value range itself.

Slurm treats an array job like a template that generates unique jobs: each index is internally handled like a standard job, inheriting its resource requirements from the template.
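
The limits in effect can be checked on any node running Slurm; the values shown here are the defaults cited above:

<code>
$ scontrol show config | grep -E 'MaxArraySize|MaxJobCount'
MaxArraySize            = 1001
MaxJobCount             = 10000
</code>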

<WRAP negative round>
Slurm job arrays use a fixed range of indices in the configured range ''[0, MaxArraySize - 1]''; with the default ''MaxArraySize'' of 1001, no array job may use an index greater than 1000.
</WRAP>
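
Submitting an array job with an index outside this range fails immediately; the exact wording varies by Slurm version, but the rejection looks similar to (script name again hypothetical):

<code>
$ sbatch --array=0-1500 my_array_job.qs
sbatch: error: Batch job submission failed: Invalid job array specification
</code>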

===== Solutions =====

==== Remove node-count limits ====

The manner by which Slurm applies node-count limits is an acknowledged problem in the HPC community.

Removing the node-count limits on priority-access partitions must be accompanied by the addition of a limit that still prevents jobs from consuming more resources than the workgroup purchased; an aggregate memory limit will serve that purpose.

<WRAP positive round>
The node-count limit will be removed from priority-access workgroup partitions.
</WRAP>

Slurm configures the memory size for each node in one of two ways:
  * When the job execution daemon (slurmd) starts, it reports the amount of available system memory //at that time// to the scheduler
  * The Slurm configuration file specifies a nominal memory size for the node
Caviness has been using the first method, since it discounts the memory consumed by the operating system and reflects what is truly available to jobs. For example, baseline nodes in Caviness show a memory size of 125.8 GiB versus the 128 GiB of physical memory present in them. What this has meant for users is that a job submitted with the requirement ''--mem=128G'' will never be eligible to run on a baseline node, even though such a node nominally has 128 GiB of memory.

Rather than letting the node report how much memory it has available, the second method cited above will now be used, with the nominal amount of physical memory present in the node. Thus, for baseline nodes the Slurm configuration would be changed to:

<code>
# Nominal 128 GiB expressed in MiB (remainder of the node list elided):
NodeName=r00n[01-17,…] RealMemory=131072
</code>
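
Once the change is in place, it can be verified per node (output abbreviated):

<code>
$ scontrol show node r00n01 | grep RealMemory
   RealMemory=131072 AllocMem=0 FreeMem=…
</code>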

Workgroup QOS aggregate memory limits will be computed as the sum, over the node types purchased, of each type's nominal memory size times the number of nodes of that type.
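
For the example workgroup above, a sketch of the resulting limit (the QOS name ''it_nss'' is again hypothetical, and memory is expressed in MiB):

<code>
# 1 x baseline node (128 GiB) + 1 x 256 GiB node:
#   131072 + 262144 = 393216
sacctmgr modify qos it_nss set GrpTRES=cpu=72,mem=393216
</code>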

=== Possible issues ===

The problem with being forced to use a single partition backed by a variety of node kinds is that the workgroup above, with a baseline node and a 256 GiB node, could submit a sequence of jobs that only require 1 GiB of memory each. In that case, the jobs will tend to pack onto baseline nodes, leaving the 256 GiB node untouched.

Ideally, we would want to be able to express a workgroup limit that differentiates between the kinds of nodes that service a partition.

==== Increase MaxArraySize and MaxJobCount ====

The job scheduler currently uses the default job count limit (''MaxJobCount'') of 10000. Since each index of an array job is internally handled as a distinct job, ''MaxJobCount'' must be raised in step with ''MaxArraySize''.

<WRAP positive round>
Increasing both MaxArraySize and MaxJobCount by an order of magnitude should be permissible without negatively impacting the cluster.
</WRAP>

The range for array job indices would become ''[0, 10000]'', and up to 100000 jobs could exist in the scheduler at once.
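
A sketch of the corresponding ''slurm.conf'' settings, assuming the new values sit exactly one order of magnitude above the defaults:

<code>
MaxArraySize=10001
MaxJobCount=100000
</code>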

===== Implementation =====

All changes are effected by altering the Slurm configuration files, pushing the changed files to all nodes, and restarting the controller daemon.
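
A minimal sketch of that procedure, assuming ''pdcp'' (from pdsh) is available for distributing files to all nodes:

<code>
# Distribute the updated configuration to every node, then restart the
# controller daemon so settings like MaxArraySize take effect:
pdcp -a /etc/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmctld
</code>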

===== Impact =====

No downtime is expected to be required.

===== Timeline =====

^Date ^Time ^Goal/Description ^
|2019-02-04| |Authoring of this document|
|2019-02-06| |Document shared with Caviness community for feedback|
|2019-02-13| |Add announcement of impending change to login banner|
|2019-02-18|09:…| |
| |09:…| |
|2019-02-20| |Remove announcement from login banner|