====== Revisions to Slurm Configuration v1.0.0 on Caviness ======
This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.
===== Issues =====
==== Priority-access partitions ====
The Slurm job scheduler handles the task of accepting computational work and metadata concerning the resources that work will require (a job); prioritizing a list of zero or more jobs that are awaiting execution; and allocating resources to pending jobs and starting their execution. The resources on which jobs execute (nodes) are split into one or more groups to which unique limits and controls are applied (partitions).
There exists a default partition (the standard partition) on Caviness to which jobs are assigned when no partition is explicitly requested. This partition spans all nodes in the cluster. Jobs in the standard partition may make use of resources that were not purchased by the workgroup associated with the job, providing opportunistic use of idle cluster resources. Because the standard partition carries a seven-day run time limit, a workgroup could wait roughly 20 times the period associated with the Farber and Mills standby queues before a standard job vacates the resources that workgroup owns.
Additional hardware-specific partitions were created to provide preemptive access to owned resources. On Mills and Farber, owner queues existed that were backed by nodes directly assigned to a workgroup, and spillover queues for each kind of node acted as second-tier resource pools to decrease wait time while standby jobs occupied nodes owned by a workgroup. On Caviness, partitions exist for each kind of node (similar to spillover on Mills and Farber) and only those workgroups that purchased that kind of node are given access. Accompanying that access are resource limits: the number of nodes, cores, and GPUs purchased. The limits are aggregate across all hardware-specific partitions to which the workgroup has access. When a job is submitted to a hardware-specific partition and the necessary resources are currently occupied by jobs in the standard partition, the latter will be preempted to free up resources for the former.
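For illustration, the pairing of a hardware-specific partition with an aggregate workgroup QOS might look roughly like the sketch below. The partition and workgroup names are reused from examples elsewhere in this document; the node list and TRES counts are invented for illustration and are not actual Caviness values.
<code bash>
# Aggregate purchased-resource limits carried by a workgroup QOS
# (workgroup name reused from a later example; limits are invented):
sacctmgr modify qos it_nss set GrpTRES=cpu=54,node=2,gres/gpu=1

# Hardware-specific partition granting that QOS access
# (slurm.conf excerpt; the node list is invented):
#   PartitionName=gpu-128GB Nodes=r00g[00-07] PriorityTier=10 AllowQos=it_nss,it_css
</code>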
=== Problem 1: Single Partition Per Job ===
When allocating processing slots to a job, Grid Engine (on Mills and Farber) would start with a list of all queues that satisfied the job's resource requirements and the user's access. If the allocation exceeded the slots provided by the first queue in the list, it would spill over into the second (then third, fourth, etc.) queue in the list. This is how the spillover queues augmented the owner queues.
When jobs are submitted to Slurm, zero or more partitions may be requested. While the job is pending, Slurm will attempt to schedule the job in each of the requested partitions. However, to be eligible to execute, the job MUST fit entirely within one of those partitions: jobs cannot span multiple partitions. Under the current configuration of Slurm partitions:
a workgroup that purchases nodes of multiple kinds CANNOT run a priority-access job that spans those node kinds.
Consider a workgroup who purchased one baseline node and one GPU node. Assume the state of the associated partitions is:
^ Partition ^ Available Cores ^
| compute-128GB | 9 |
| gpu-128GB | 18 |
Job 1234 is submitted requesting 20 cores. Though 27 cores are currently available in aggregate across the two partitions, neither partition alone has enough resources to satisfy the job. Job 1234 therefore remains pending until 20 cores are free in a single partition, and jobs in the standard partition may be preempted to facilitate its execution despite sufficient idle capacity existing in aggregate.
Jobs with heterogeneous resource requests can also never be executed in priority-access partitions. For example, a job that requests 4 baseline nodes and one node with a GPU by definition spans multiple priority-access partitions.
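For example, even listing both priority-access partitions at submission time does not help: the 20 cores must still come entirely from one of them. The command below is purely illustrative (the partition names come from the table above, and ''job_script.qs'' is a hypothetical job script):
<code bash>
# Hypothetical submission naming both priority-access partitions; Slurm will
# only start the job once 20 cores are free within a SINGLE listed partition:
sbatch --partition=compute-128GB,gpu-128GB --ntasks=20 job_script.qs
</code>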
=== Problem 2: Quality-of-Service ===
Fine-grain control over resource limits on Slurm partitions must be implemented with a quality-of-service (QOS) definition. Each partition has a baseline QOS that is optionally augmented by a single QOS provided by the user when the job is submitted.
The current configuration requires that each workgroup receive a QOS containing their aggregate purchased-resource limits, and that QOS be allowed to augment the baseline QOS of each partition to which the workgroup has access. On job submission, any job targeting priority-access partitions automatically has the flag ''--qos=<workgroup>'' associated with it. Thus:
every job submitted to a priority-access partition must have an overriding workgroup QOS associated with it, which effectively disables use of QOS for other purposes.
QOS is most often used to alter the scheduling behavior of a job, increasing or decreasing the baseline priority or run time limit, for example.
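As a sketch of what this forecloses, consider a QOS meant only to alter scheduling behavior. The QOS name, its settings, and the job script below are hypothetical; the point is that the forced workgroup QOS displaces any such request.
<code bash>
# A hypothetical QOS intended purely to alter scheduling behavior:
sacctmgr add qos urgent
sacctmgr modify qos urgent set Priority=10000 MaxWall=04:00:00

# A user would normally opt into it at submission time...
sbatch --qos=urgent --partition=compute-128GB job_script.qs
# ...but in the current configuration the forced --qos=<workgroup> replaces it.
</code>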
=== Problem 3: Addition of Partitions ===
There are currently six hardware-specific partitions configured on Caviness. As racks are added to the cluster, node specifications will change:
as Caviness evolves over time, more and more hardware-specific partitions will need to be created, which will exacerbate Problem 1.
Following our recommendation to purchase shares annually, a workgroup could easily end up with access to many hardware-specific partitions and no effective way to use all of its purchased resources at priority.
==== Allocation of Cores on GPU Nodes ====
The Slurm resource allocation algorithm configured as the default on Caviness assigns individual processor cores to jobs. However, the hardware-specific partitions associated with GPUs currently override that default and instead allocate by socket only: cores are allocated in sets of 18, corresponding to the socket to which each GPU is attached.
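In Slurm terms this override is presumably a partition-level allocation setting (by-socket rather than the cluster-wide by-core default). From the user's side, the effect is visible in the size of the resulting allocation; the commands below are a hypothetical example (''job_script.qs'' and the job ID are placeholders):
<code bash>
# Hypothetical single-GPU request against a GPU hardware-specific partition;
# under by-socket allocation the job is charged all 18 cores of the owning
# socket even though it asked for a single task:
sbatch --partition=gpu-128GB --gres=gpu:1 --ntasks=1 job_script.qs

# The resulting allocation can be inspected once the job is queued:
scontrol show job <jobid> | egrep -i '(NumCPUs|TRES)='
</code>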
=== Problem 4: No Fine-Grain Control over CPUs in GPU Nodes ===
When GPUs were introduced in Farber, some workgroups desired that GPU-bound jobs requiring only a single controlling CPU core be scheduled as such, leaving the other cores on that CPU available for non-GPU workloads. The way the hardware-specific partitions on Caviness are configured:
jobs always occupy all cores associated with an assigned GPU.
Thus, there is no way to pack traditional non-GPU workloads and GPU workloads onto those priority-access partitions.
===== Solution =====
The decision to provide priority-access via hardware-specific partitions is the root of the problem. Since each kind of node was purchased by multiple workgroups, the baseline QOS cannot be used for a partition's default resource limits, which is its actual purpose. Also, the fact that Slurm jobs cannot span multiple partitions adversely impacts the efficiency of the job scheduler in matching jobs to available resources.
The use of workgroup partitions, akin to the owner queues on Mills and Farber, suggests itself as a viable solution. Unlike on those clusters, however, distinct nodes would not be assigned to a workgroup:
Each workgroup partition will encompass all nodes of the kinds purchased by that workgroup.
In the spirit of the spillover queues on Mills and Farber, the workgroup has priority access to the kinds of nodes they purchased, not just specific nodes in the cluster. In terms of limiting priority-access to the levels actually purchased by a workgroup:
Each workgroup partition will reuse the existing workgroup QOS as its default (baseline) QOS.
This not only provides the necessary resource quota on the partition, but leaves the override QOS available for other purposes (as discussed in Problem 2). Since QOS resource limits are aggregate across all partitions using that QOS:
hardware-specific and workgroup partitions could coexist in the Slurm configuration.
This solution would not address the addition of partitions over time:
The addition of each new workgroup to the cluster will necessitate a new partition in the Slurm configuration.
Existing workgroups who augment their purchase would have their existing partition altered accordingly. This solution does address the issue of priority-access jobs being unable to span all nodes that a workgroup has purchased:
Since each workgroup partition spans all nodes of all kinds purchased by the workgroup, jobs submitted to that partition can be scheduled more efficiently.
Likewise, Problem 4 is addressed:
Each workgroup partition will use the default (by-core) resource allocation algorithm.
Priority-access to GPU nodes would no longer be allocated by socket.
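For instance, with by-core allocation a GPU job needing only a single controlling core and a CPU-only job could be packed onto the purchased GPU hardware. The commands below are a sketch; the workgroup partition name ''it_nss'' is borrowed from a later example and the job scripts are hypothetical:
<code bash>
# A GPU job that needs only one controlling core is allocated exactly that...
sbatch --partition=it_nss --gres=gpu:1 --ntasks=1 gpu_job.qs

# ...leaving the remaining cores on the GPU nodes available to CPU-only work:
sbatch --partition=it_nss --ntasks=17 cpu_job.qs
</code>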
===== Implementation =====
The existing hardware-specific partition configuration informs the creation of the workgroup partitions:
* For each workgroup, determine all hardware-specific partitions allowing that workgroup's QOS
* Concatenate the node lists of those hardware-specific partitions to produce the workgroup partition's node list
The existing workgroup QOS definitions need no modifications. The resulting configuration entry would look like:
<code>
PartitionName=<workgroup> Default=NO PriorityTier=10 Nodes=<nodelist> MaxTime=7-00:00:00 DefaultTime=30:00 QOS=<workgroup> Shared=YES TRESBillingWeights=CPU=1.0,Mem=1.0
</code>
The following Bash script was used to convert the hardware-specific partitions and their AllowQos lists to workgroup partitions:
<code bash>
#!/bin/bash
#
# Convert the hardware-specific partitions in partitions.conf into per-workgroup
# partitions based on which workgroup QOS each partition allows.
#
WORKGROUPS="$(sacctmgr --noheader --parsable list account | awk -F\| '{print $1;}')"
for WORKGROUP in ${WORKGROUPS}; do
    # Merge the node lists of all hardware-specific partitions that mention
    # this workgroup into a single, compressed node list:
    WORKGROUP_NODELIST="$(
        grep "$WORKGROUP" partitions.conf | awk '
            BEGIN {
                nodelist = "";
            }
            /PartitionName=/ {
                for ( i=1; i <= NF; i++ ) {
                    if ( match($i, "^Nodes=(.*)", pieces) > 0 ) {
                        if ( nodelist ) {
                            nodelist = nodelist "," pieces[1];
                        } else {
                            nodelist = pieces[1];
                        }
                        break;
                    }
                }
            }
            END {
                printf("%s\n", nodelist);
            }
        ' | snodelist --nodelist=- --unique --compress
    )"
    if [ -n "$WORKGROUP_NODELIST" ]; then
        # Emit the workgroup partition definition:
        cat <<EOT
PartitionName=${WORKGROUP} Default=NO PriorityTier=10 Nodes=${WORKGROUP_NODELIST} MaxTime=7-00:00:00 DefaultTime=30:00 QOS=${WORKGROUP} Shared=YES TRESBillingWeights=CPU=1.0,Mem=1.0
EOT
    fi
done
</code>
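Assuming the script above is saved as ''generate-workgroup-partitions.sh'' (a hypothetical name) in the same directory as the existing ''partitions.conf'', the generated entries could be reviewed and then appended to the configuration:
<code bash>
# Hypothetical invocation; the script and output file names are illustrative:
./generate-workgroup-partitions.sh > workgroup-partitions.conf
less workgroup-partitions.conf              # review the generated entries
cat workgroup-partitions.conf >> partitions.conf
</code>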
==== Job Submission Plugin ====
The job submission plugin has been modified to remove the forced assignment of ''--qos=<workgroup>'' for jobs submitted to hardware-specific partitions. A flag is present in the project's CMake configuration that controls inclusion or omission of this feature.
A second flag was added to include or omit a check for the presence of the name ''_workgroup_'' in the list of partitions requested for the job. If enabled, the word ''_workgroup_'' is replaced with the workgroup under which the user submitted the job:
<code>
$ workgroup -g it_nss
$ sbatch --verbose --account=it_css --partition=_workgroup_ …
   :
sbatch: partition        : _workgroup_
   :
Submitted batch job 1234
$ scontrol show job 1234 | egrep -i '(partition|account)='
   Priority=2014 Nice=0 Account=it_css QOS=normal
   Partition=it_nss AllocNode:Sid=login01:7280
</code>
Job 1234 is billed against the it_css account but executes in the it_nss workgroup partition (assuming the it_css account has been granted access to that partition). When the job executes, all processes start with the it_nss Unix group.
All changes have been implemented and are visible in the [[https://github.com/jtfrey/ud_slurm_addons|online code repository]]. The code changes were debugged and tested on the venus.mseg.udel.edu cluster, where all pre-delivery builds and tests were performed for the Caviness cluster.
==== Job Script Templates ====
The Slurm job script templates available under ''/opt/shared/templates'' on the cluster all mention hardware-specific partitions. They will be modified to instead mention the use of ''--partition=_workgroup_'' for priority-access.
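A revised template header might look roughly like the following sketch; the actual templates under ''/opt/shared/templates'' contain additional options and commentary, and the resource values shown here are placeholders:
<code bash>
#!/bin/bash -l
#
# Request priority access to the resources purchased by your workgroup
# (replaces the former hardware-specific partition names):
#SBATCH --partition=_workgroup_
#
# Placeholder resource requests; adjust to suit the job:
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
</code>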
===== Impact =====
At this time, the hardware-specific partitions are seeing relatively little use on Caviness. The cluster has been open to production use for less than a month, so a very limited number of users would need to modify their workflow. Even with the addition of workgroup partitions, the existing hardware-specific partitions could be left in service for a limited time until users have completed their move off of them.
New partitions are added to the Slurm configuration (a text file), which is then distributed to all participating controller and compute nodes. The altered job submission plugin will necessitate a hard restart of the slurmctld service, but a reconfiguration RPC should suffice to bring the compute nodes up to date:
<code bash>
# 1. Copy new Slurm configs into place on all nodes
# 2. Install updated plugins
$ systemctl restart slurmctld
$ ssh r02mgmt01 systemctl restart slurmctld
$ scontrol reconfigure
</code>
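Once the reconfiguration completes, the result could be spot-checked from a login node. The commands below are a sketch; the partition name is borrowed from the earlier example and ''test_job.qs'' is a hypothetical job script:
<code bash>
# Confirm a workgroup partition is known to the controller:
scontrol show partition it_nss

# Confirm that the plugin's _workgroup_ substitution routes a job to it:
sbatch --verbose --partition=_workgroup_ --ntasks=1 test_job.qs
</code>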
===== Timeline =====
^Date ^Time ^Goal/Description ^
|2018-10-19| |Limitations of hardware-specific partitions discussed|
|2018-10-24| |Project planning to //replace hardware-specific partitions// completed|
|2018-10-25| |Modifications to job submission plugin completed|
| | |Altered plugin tested and debugged on //Venus// cluster|
| | |Project documentation added to HPC wiki|
|2018-10-26| |Workgroup partition configurations generated and staged for enablement|
| | |Announcement added to login nodes' SSH banner directing users to project documentation|
| | |Job script templates updated and staged for deployment|
|2018-10-29|09:00|Workgroup partitions enabled|
| |09:00|Modified job submission plugin enabled|
| |09:00|Modified job script templates available|
|2018-11-05|09:00|Hardware-specific partitions removed|
| |09:00|Announcement removed from login nodes' SSH banner|