====== Revisions to Slurm v1.0.0 Configuration on Caviness ======

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

===== Issues =====

==== Priority-access partitions ====

The Slurm job scheduler handles the tasks of accepting computational work and the metadata describing the resources that work will require (a job); prioritizing the list of zero or more jobs awaiting execution; and allocating resources to pending jobs and starting their execution.

There is a default partition (the standard partition) on Caviness to which jobs are assigned when no partition is explicitly requested. This partition spans all nodes in the cluster.

Additional hardware-specific partitions were created to provide preemptive, priority access to owned resources.
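
As a rough sketch only (the partition names, node list, Unix group, and priority tier below are hypothetical, not the actual Caviness settings), the two kinds of partition might be declared in slurm.conf along these lines:

<code bash>
# Default partition: spans every node and is used when no partition is requested.
PartitionName=standard Nodes=ALL Default=YES

# Hardware-specific priority-access partition: only the owning workgroup's
# Unix group may use it, and its higher priority tier lets its jobs preempt
# standard-partition jobs running on the same nodes.
PartitionName=compute-128GB Nodes=r00n[00-09] AllowGroups=it_css PriorityTier=2
</code>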

=== Problem 1: Single Partition Per Job ===

When allocating processing slots to a job, Grid Engine (on Mills and Farber) would start with a list of all queues that satisfied the job's resource requirements and the user's access.

When jobs are submitted to Slurm, zero or more partitions may be requested, but the resources allocated to a job must all come from a single partition.

Consider a workgroup that purchased one baseline node and one GPU node. Assume the state of the associated partitions is:

^ Partition ^ Available Cores ^
| compute-128GB | 9 |
| gpu-128GB | 18 |

Job 1234 is submitted requesting 20 cores. Neither partition alone can satisfy the request, so the job cannot run at priority even though the workgroup has 27 idle cores across its purchased nodes.
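
For instance, a submission along the following lines (the script name is hypothetical) could list both priority-access partitions, yet the 20-core allocation would still have to come from a single one of them and so could not start at priority:

<code bash>
# Listing both partitions does not allow the allocation to span them.
$ sbatch --ntasks=20 --partition=compute-128GB,gpu-128GB job.qs
</code>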

Jobs with heterogeneous resource requests can likewise never execute in the priority-access partitions, since such a job would need nodes drawn from more than one hardware-specific partition.

=== Problem 2: Quality-of-Service ===

Fine-grain control over resource limits on Slurm partitions must be implemented with a quality-of-service (QOS) definition.

The current configuration requires that each workgroup receive a QOS containing their aggregate purchased-resource limits, and that this QOS be allowed to augment the baseline QOS of each partition to which the workgroup has access.
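
A sketch of what such a workgroup QOS might look like in sacctmgr follows; the limit is a made-up number, and the OverPartQOS flag is shown only as an assumption about how a job's QOS is permitted to take precedence over a partition's baseline QOS limits:

<code bash>
# Create a per-workgroup QOS carrying its aggregate purchased-resource limits
# (the CPU count here is hypothetical).
$ sacctmgr add qos it_css
$ sacctmgr modify qos it_css set GrpTRES=cpu=72

# Allow this QOS's limits to take precedence over partition QOS limits.
$ sacctmgr modify qos it_css set Flags=OverPartQOS
</code>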

A QOS is most often used to alter the scheduling behavior of a job, for example by increasing or decreasing its baseline priority or run time limit. Using the workgroup QOS to carry purchased-resource limits therefore occupies a mechanism better reserved for those scheduling adjustments.

=== Problem 3: Addition of Partitions ===

There are currently six hardware-specific partitions configured on Caviness.

Following our recommendation to purchase shares annually, a workgroup could easily end up with access to many hardware-specific partitions and no effective way to use all of its purchased resources at priority.

==== Allocation of Cores on GPU Nodes ====

The Slurm resource allocation algorithm configured as the default on Caviness assigns individual processor cores to jobs. However, the hardware-specific partitions associated with GPUs currently override that default and instead allocate by socket only: cores are allocated in sets of 18, to correspond with the GPU controlled by that socket.
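
One way such an override can be expressed is sketched below; the node names are invented and the actual Caviness settings may differ:

<code bash>
# Cluster-wide default: allocate individual cores (and memory) to jobs.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# GPU partition overriding the default so that whole sockets (18 cores each)
# are allocated, matching the GPU attached to each socket.
PartitionName=gpu-128GB Nodes=r00g[00-01] SelectTypeParameters=CR_Socket_Memory
</code>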

=== Problem 4: No Fine-Grain Control over CPUs in GPU Nodes ===

When GPUs were introduced on Farber, some workgroups desired that GPU-bound jobs requiring only a single controlling CPU core be scheduled as such, leaving the other cores on that CPU available for non-GPU workloads.

With whole sockets allocated to every job, there is no way to pack traditional non-GPU workloads and GPU workloads together onto those priority-access partitions.

===== Solution =====

The decision to provide priority access via hardware-specific partitions is the root of the problem.

The use of workgroup partitions, akin to the owner queues on Mills and Farber, suggests itself as a viable solution.

In the spirit of the spillover queues on Mills and Farber, the workgroup has priority access to the kinds of nodes it purchased, not just to specific nodes in the cluster.

This not only provides the necessary resource quota on the partition, but leaves the override QOS available for other purposes (as discussed in Problem 2). Since QOS resource limits are aggregate across all partitions using that QOS, a single workgroup QOS attached to the workgroup partition is sufficient to enforce the workgroup's total purchased-resource limits.
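
Schematically (all names here are hypothetical), the workgroup QOS is attached to the workgroup partition itself, while a job-level QOS remains free for scheduling adjustments:

<code bash>
# slurm.conf: the workgroup QOS (and its GrpTRES limits) applies to every job
# that runs in the workgroup partition.
PartitionName=it_css Nodes=r00n[00-09] QOS=it_css

# A separate QOS can still be requested per job for other purposes.
$ sbatch --partition=it_css --qos=long-workflow job.qs
</code>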

This solution would not eliminate the addition of partitions over time, since each new workgroup would still receive a workgroup partition of its own when it buys into the cluster.

Existing workgroups that augment their purchase would simply have their existing workgroup partition altered accordingly, rather than gaining access to yet another hardware-specific partition.
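
For example (partition, QOS, and node names are hypothetical), extending the workgroup's holdings amounts to appending the new nodes to the existing partition definition, either in slurm.conf or on the running controller:

<code bash>
# slurm.conf: the newly purchased node is appended to the existing
# workgroup partition instead of appearing as a new partition.
PartitionName=it_css Nodes=r00n[00-09],r01g02 QOS=it_css

# The equivalent change applied to a running controller:
$ scontrol update PartitionName=it_css Nodes=r00n[00-09],r01g02
</code>

The workgroup's QOS limits would presumably be raised at the same time to reflect the larger purchase.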

Likewise, Problem 4 is addressed: priority access to GPU nodes would no longer be allocated by socket, so a GPU job needing only a single controlling core leaves the remaining cores on that socket available to other jobs in the workgroup partition.
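
Under core-level allocation, such a job might be submitted as sketched below (the script name is hypothetical, and the _workgroup_ partition placeholder is described under Job Submission Plugin below):

<code bash>
# One GPU plus a single controlling CPU core; the socket's remaining
# 17 cores stay available to other jobs in the workgroup partition.
$ sbatch --partition=_workgroup_ --gres=gpu:1 --ntasks=1 --cpus-per-task=1 gpu_job.qs
</code>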

===== Implementation =====

The existing hardware-specific partition configuration would inform the genesis of workgroup partitions:

  * For each workgroup, determine the hardware-specific partitions to which that workgroup currently has access.
  * Concatenate the node lists for those hardware-specific partitions to produce the workgroup partition node list.

The existing workgroup QOS definitions need no modifications.

<code bash>
PartitionName=<workgroup>
Nodes=<workgroup partition node list>
QOS=<workgroup QOS>
</code>

===== Job Submission Plugin =====

The job submission plugin has been modified to remove the forced assignment of "

A second flag was added to include/

<code bash>
$ workgroup -g it_nss
$ sbatch --verbose --account=it_css --partition=_workgroup_ …
:
sbatch: partition
:
Submitted batch job 1234
$ scontrol show job 1234 | egrep -i '
</code>

Job 1234 is billed against the it_css account but executes in the it_nss workgroup partition (assuming the it_css account has been granted access to that partition).

All changes have been implemented and are visible in the [[https://

===== Impact =====

At this time, the hardware-specific partitions are seeing relatively little use on Caviness.

New partitions are added to the Slurm configuration (a text file) and distributed to all participating controller and compute nodes.

<code bash>
# 1. Copy new slurm configs into place on all nodes
# 2. Install updated plugins
# 3. Restart the slurmctld daemons (locally and on r02mgmt01):
$ systemctl restart slurmctld
$ ssh r02mgmt01 systemctl restart slurmctld
# 4. Ask all Slurm daemons to re-read the configuration:
$ scontrol reconfigure
</code>

===== Timeline =====

(Timeline figure)