====== Revisions to Slurm Configuration v1.1.3 on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Nominal node memory size is not an appropriate limit ====

When the v1.1.3 configuration was activated, nodes began transitioning to the DRAIN state with the reason:

<code>
Reason=Low RealMemory
</code>

Each node runs a Slurm job execution daemon (slurmd) that reports back to the scheduler every few minutes; included in that report are the base resource levels: core count, physical memory size, and temporary disk space.

<WRAP negative round>
Slurm //drains// a node whenever the resource levels reported by slurmd fall below those declared for the node in the scheduler's configuration.
</WRAP>
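The values slurmd reports can be inspected directly on a node: ''slurmd -C'' prints the detected hardware configuration. The hostname and figures below are illustrative, not actual Caviness output:

<code>
$ slurmd -C
NodeName=r00n22 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=128819
UpTime=12-04:13:16
</code>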

Many nodes transitioned to the DRAIN state within the first 30 minutes after the v1.1.3 changes were activated.
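Drained nodes and the reasons behind their state can be listed with ''sinfo''; the output below is a mock-up of what was observed (timestamp and node list are illustrative):

<code>
$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2019-02-18T09:05:00 r00n[01-17]
</code>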

The changes did not need to be rolled back; the drained nodes were returned to service manually while a permanent correction to the configured memory sizes was prepared.

One additional problem could present itself under the v1.1.3 use of nominal physical memory size for the nodes. Consider the following:

  * A node runs a job requesting 28 cores and 100 GiB of memory, leaving 8 cores and 28 GiB of memory available according to the node configuration.
  * A second job from a different user, requesting 4 cores and 28 GiB of memory, is scheduled on the node.

Since the OS itself occupies some non-trivial amount of the physical memory, the second job eventually extends memory usage above and beyond the amount of physical memory present, pushing the node into swap and degrading performance for every job on it.
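To make the arithmetic concrete, here is the same scenario with a hypothetical 2 GiB operating-system footprint (the actual overhead varies by node):

<code>
Nominal (configured) memory:      128 GiB
OS footprint (hypothetical):      - 2 GiB
Actually available to jobs:      ~126 GiB

Job 1 allocation:                 100 GiB
Job 2 allocation:                  28 GiB
Total promised by scheduler:      128 GiB  >  ~126 GiB actually available
</code>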

<WRAP negative round>
Choosing to use the nominal memory size of each node for its RealMemory limit was meant to keep requests like ''--mem=128G'' eligible to run on baseline nodes, but it allows the scheduler to promise more memory than a node can actually deliver.
</WRAP>

==== FastSchedule requires explicit specification of all resources ====

In previous configurations, node resource levels were taken from what each slurmd reported, so values such as temporary disk space never had to be written into the configuration. With FastSchedule in effect, the scheduler consults only the node configuration, and any resource not explicitly specified there assumes its default value; for TmpDisk that default is zero:

<code>
$ scontrol show node r00n22
NodeName=r00n22 Arch=x86_64 CoresPerSocket=18
   :
   ... TmpDisk=0 ...
   :
</code>

Any user submitting a job which requests a minimum amount of /tmp space (e.g. via the ''--tmp'' option to sbatch or salloc) would wait forever, since no node appeared to have any temporary disk space at all.
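For example, a submission along these lines (the script name and the 4 GiB size are arbitrary) could never be scheduled while every node advertised TmpDisk=0:

<code>
$ sbatch --tmp=4G job_script.qs
</code>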

<WRAP negative round>
With //FastSchedule// enabled, Slurm trusts only the resource levels spelled out in the node configuration; any resource omitted there (such as TmpDisk) defaults to zero, making the node appear to lack that resource entirely.
</WRAP>

This situation was addressed by augmenting the node configurations with explicit TmpDisk values shortly after the v1.1.3 configuration was activated.
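The augmented node entries take roughly the following form; the TmpDisk figure shown here (in megabytes) is illustrative, not the actual Caviness value:

<code>
NodeName=r00n22 CoresPerSocket=18 ... TmpDisk=102400
</code>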

===== Solutions =====

==== Determine appropriate RealMemory levels ====

For each type of node present in Caviness, a RealMemory size will be chosen that is less than the amount reported by slurmd, to prevent DRAIN state transitions.

<WRAP positive round>
Node configurations will be updated with explicit RealMemory values that sit safely below the levels slurmd reports for each node type.
</WRAP>

Slurm configures the memory size for each node in one of two ways:
  * When the job execution daemon (slurmd) starts, it reports the amount of available system memory //at that time// to the scheduler.
  * The Slurm configuration file specifies a nominal memory size for the node.

Caviness has been using the first method, since it discounts the memory consumed by the operating system and reflects what is truly available to jobs. For example, baseline nodes in Caviness show a memory size of 125.8 GiB versus the 128 GiB of physical memory present in them. What this has meant for users is that a job submitted with the requirement ''--mem=128G'' would never be eligible to execute on a baseline node.
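Concretely, using the figures above (Slurm expresses memory sizes in MiB):

<code>
--mem=128G requested               ->  131072 MiB
baseline node reports (125.8 GiB)  ->  ~128819 MiB

131072 MiB > 128819 MiB  =>  no baseline node can ever satisfy the request
</code>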

Rather than letting the node report how much memory it has available, the second method cited above will now be used with the nominal amount of physical memory present in the node. Thus, for baseline nodes the Slurm configuration would be changed to:

<code>
NodeName=r00n[01-17,...]  ...  RealMemory=131072
</code>

The ''RealMemory'' value is expressed in MiB, so the nominal 128 GiB of a baseline node appears as 131072.

Workgroup QOS aggregate memory limits will be computed as the sum, over the node types a workgroup purchased, of each type's nominal memory size multiplied by the number of nodes of that type.
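As an illustration, consider a hypothetical workgroup that purchased one baseline (128 GiB) node and one 256 GiB node; the QOS name below is made up, and ''GrpTRES'' is one way such an aggregate limit can be expressed in Slurm's accounting database:

<code>
# 1 x 131072 MiB + 1 x 262144 MiB = 393216 MiB aggregate memory
$ sacctmgr modify qos some_workgroup set GrpTRES=mem=393216
</code>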

=== Possible issues ===

The problem with being forced to use a single partition backed by a variety of node kinds is that the workgroup above (with one baseline node and one 256 GiB node) could submit a sequence of jobs that only require 1 GiB of memory each. In that case, the jobs will tend to pack onto baseline nodes, leaving the 256 GiB node untouched.

Ideally, we would want to be able to express a workgroup limit that differentiates between the kinds of nodes that service a partition.

==== Increase MaxArraySize and MaxJobCount ====

The job scheduler currently uses the default job count limit (MaxJobCount) of 10000 and the default array size limit (MaxArraySize) of 1001.

<WRAP positive round>
Increasing both MaxArraySize and MaxJobCount by an order of magnitude should be permissible without negatively impacting the cluster.
</WRAP>

The range for array job indices would become ''0-9999'', and the scheduler would accept up to 100000 queued jobs.
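In slurm.conf the new limits would look something like the following (the values shown simply reflect the order-of-magnitude increase described above):

<code>
MaxArraySize=10000
MaxJobCount=100000
</code>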

===== Implementation =====

All changes are effected by altering the Slurm configuration files, pushing the changed files to all nodes, and signaling a change in configuration so all daemons refresh their configuration.
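The final signaling step is typically a single ''scontrol'' command issued on the scheduler host (shown as a sketch; the exact deployment procedure on Caviness may differ):

<code>
$ scontrol reconfigure
</code>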

===== Impact =====

No downtime is expected to be required.

===== Timeline =====

^Date ^Time ^Goal/Milestone ^
|2019-02-04| |Authoring of this document|
|2019-02-06| |Document shared with Caviness community for feedback|
|2019-02-13| |Add announcement of impending change to login banner|
|2019-02-18|09:00|Activate the configuration changes|
| |09:30|Notify Caviness community that the changes are in effect|
|2019-02-20| |Remove announcement from login banner|