====== Revisions to Slurm Configuration v1.0.0 on Caviness ======
  
This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.
===== Issues =====
  
==== Priority-access partitions ====
  
The Slurm job scheduler handles the tasks of accepting computational work and the metadata describing the resources that work will require (a job); prioritizing a list of zero or more jobs that are awaiting execution; and allocating resources to pending jobs and starting their execution.  The resources on which jobs execute (nodes) are split into one or more groups to which unique limits and controls are applied (partitions).
When jobs are submitted to Slurm, zero or more partitions may be requested.  While the job is pending, Slurm will attempt to schedule the job in each of the requested partitions.  However, to be eligible to execute, the job MUST fit entirely within one of those partitions: jobs cannot span multiple partitions.  Under the current configuration of Slurm partitions:
  
<WRAP negative round>a workgroup who purchases nodes of multiple kinds CANNOT run a priority-access job that spans those node kinds.</WRAP>
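For example, ''sbatch'' accepts a comma-separated partition list; the job becomes eligible for each listed partition but will execute entirely within whichever one can satisfy it first (the partition names here are illustrative):

<code bash>
# Eligible for either partition, but the job will never span both.
sbatch --partition=compute,gpu job.qs
</code>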
  
Consider a workgroup who purchased one baseline node and one GPU node.  Assume the state of the associated partitions is:
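As a purely hypothetical illustration (the partition and node names are invented), ''sinfo'' might report the two priority-access partitions as:

<code>
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
compute       up 7-00:00:00       1   idle  r00n17
gpu           up 7-00:00:00       1   idle  r00g00
</code>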
Fine-grained control over resource limits on Slurm partitions must be implemented with a quality-of-service (QOS) definition.  Each partition has a baseline QOS that is optionally augmented by a single QOS provided by the user when the job is submitted.
  
The current configuration requires that each workgroup receive a QOS containing their aggregate purchased-resource limits, and that QOS be allowed to augment the baseline QOS of each partition to which the workgroup has access.  On job submission, any job targeting priority-access partitions automatically has the flag ''--qos=<workgroup>'' associated with it.  Thus:
  
<WRAP negative round>every job submitted to a priority-access partition must have an overriding workgroup QOS associated with it, which effectively disables use of QOS for other purposes.</WRAP>
  
QOS is most often used to alter the scheduling behavior of a job, increasing or decreasing the baseline priority or run time limit, for example.
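As a rough sketch of how a workgroup QOS carrying aggregate purchased-resource limits is typically expressed with Slurm's ''sacctmgr'' (the workgroup name and limit values are illustrative, not Caviness's actual settings):

<code bash>
# Create the workgroup QOS and attach its aggregate purchased limits.
sacctmgr add qos it_css
sacctmgr modify qos it_css set GrpTRES=cpu=72,gres/gpu=2

# Review the limits attached to the QOS.
sacctmgr show qos it_css format=Name,GrpTRES
</code>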
There are currently six hardware-specific partitions configured on Caviness.  As racks are added to the cluster, node specifications will change:
  
<WRAP negative round>as Caviness evolves over time, more and more hardware-specific partitions will need to be created, which will exacerbate Problem 1.</WRAP>
  
Following our recommendation to purchase shares annually, a workgroup could easily end up with access to many hardware-specific partitions and no effective means of using all of its purchased resources at priority.
When GPUs were introduced on Farber, some workgroups desired that GPU-bound jobs requiring only a single controlling CPU core be scheduled as such, leaving the other cores on that CPU available for non-GPU workloads.  The way the hardware-specific partitions on Caviness are configured:
  
<WRAP negative round>jobs always occupy all cores associated with an assigned GPU.</WRAP>
  
===== Solution =====

The use of workgroup partitions, akin to the owner queues on Mills and Farber, suggests itself as a viable solution.  Distinct nodes would not be assigned to a workgroup as on those clusters:
  
<WRAP positive round>Each workgroup partition will encompass all nodes of the kinds purchased by that workgroup.</WRAP>
  
In the spirit of the spillover queues on Mills and Farber, the workgroup has priority access to the kinds of nodes it purchased, not just specific nodes in the cluster.  In terms of limiting priority-access to the levels actually purchased by a workgroup:
  
<WRAP positive round>Each workgroup partition will reuse the existing workgroup QOS as its default (baseline) QOS.</WRAP>
  
This not only provides the necessary resource quota on the partition, but leaves the override QOS available for other purposes (as discussed in Problem 2).  Since QOS resource limits are aggregate across all partitions using that QOS:
  
<WRAP positive round>hardware-specific and workgroup partitions could coexist in the Slurm configuration.</WRAP>
  
This solution would not address the addition of partitions over time:
  
<WRAP positive round>The addition of each new workgroup to the cluster will necessitate a new partition in the Slurm configuration.</WRAP>
  
Existing workgroups who augment their purchase would have their existing partition altered accordingly.  This solution does address the issue of priority-access jobs being unable to span all nodes that a workgroup has purchased:
  
<WRAP positive round>Since each workgroup partition spans all nodes of all kinds purchased by the workgroup, jobs submitted to that partition can be scheduled more efficiently.</WRAP>
  
Likewise, Problem 4 is addressed:
  
<WRAP positive round>Each workgroup partition will use the default (by-core) resource allocation algorithm.</WRAP>
  
Priority-access to GPU nodes would no longer be allocated by socket.
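Under by-core allocation, a GPU-bound job needing only a single controlling core could be expressed as in the following sketch (the resource values are illustrative), leaving the node's remaining cores schedulable for other work:

<code bash>
# One task, one CPU core, one GPU; the rest of the node remains available.
sbatch --partition=<workgroup> --ntasks=1 --cpus-per-task=1 --gres=gpu:1 job.qs
</code>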
The existing workgroup QOS definitions need no modifications.  The resulting configuration entry, wrapped here for readability, would look like:
  
<code>
PartitionName=<workgroup> Default=NO PriorityTier=10
Nodes=<workgroup-nodelist> MaxTime=7-00:00:00 DefaultTime=30:00
QOS=<workgroup> Shared=YES TRESBillingWeights=CPU=1.0,Mem=1.0
</code>
  
The following Bash script was used to convert the hardware-specific partitions and their AllowedQOS levels to workgroup partitions:
  
-The job submission plugin has been modified to remove the forced assignment of "--qos=<workgroup>for jobs submitted to hardware-specific partitions.  A flag is present in the project's CMake configuration that directs inclusion/omission of this feature.+<file bash convert-hw-parts.sh> 
 +#!/bin/bash 
 + 
 +WORKGROUPS="$(sacctmgr --noheader --parsable list account | awk -F\| '{print $1;}')" 
 + 
 +for WORKGROUP in ${WORKGROUPS}; do 
 +    WORKGROUP_NODELIST="$( 
 +        grep $WORKGROUP partitions.conf | awk ' 
 +        BEGIN { 
 +          nodelist=""; 
 +        } 
 +        /PartitionName=/
 +          for ( i=1; i <= NF; i++ ) { 
 +            if ( match($i, "^Nodes=(.*)", pieces) > 0 ) { 
 +              if ( nodelist ) { 
 +            nodelist = nodelist "," pieces[1]; 
 +              } else { 
 +            nodelist = pieces[1]; 
 +              } 
 +              break; 
 +            } 
 +          } 
 +        } 
 +        END { 
 +          printf("%s\n", nodelist); 
 +        } 
 +        ' | snodelist --nodelist=- --unique --compress 
 +      )" 
 +    if [ -n "$WORKGROUP_NODELIST" ]; then 
 +        cat <<EOT 
 +
 +# ${WORKGROUP} (gid $(getent group ${WORKGROUP} | awk -F: '{print $3;}')) partition: 
 +
 +PartitionName=${WORKGROUP} Default=NO PriorityTier=10 Nodes=${WORKGROUP_NODELIST} MaxTime=7-00:00:00 DefaultTime=30:00 QOS=${WORKGROUP} Shared=YES TRESBillingWeights=CPU=1.0,Mem=1.0 
 + 
 +EOT 
 +    fi 
 +done 
 + 
 +</file> 
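A minimal usage sketch, assuming the script is run from a directory containing the current ''partitions.conf'' and that its output is collected into a new file (the output file name is illustrative):

<code bash>
bash convert-hw-parts.sh > workgroup-partitions.conf
</code>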

==== Job Submission Plugin ====

The job submission plugin has been modified to remove the forced assignment of ''--qos=<workgroup>'' for jobs submitted to hardware-specific partitions.  A flag is present in the project's CMake configuration that directs inclusion/omission of this feature.
  
A second flag was added to include/omit a check for the presence of the name _workgroup_ in the list of requested partitions for the job.  If enabled, the word _workgroup_ is replaced with the workgroup name under which the job was submitted by the user:
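For example (the workgroup name here is illustrative), a user who sets a workgroup and then submits against the literal partition name ''_workgroup_'' would have it rewritten by the plugin:

<code bash>
# Enter the workgroup (Slurm account) under which jobs will be submitted:
workgroup -g it_css

# The plugin rewrites _workgroup_ to the submitting workgroup, so this is
# equivalent to requesting --partition=it_css:
sbatch --partition=_workgroup_ job.qs
</code>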
All changes have been implemented and are visible in the [[https://github.com/jtfrey/ud_slurm_addons|online code repository]].  The code changes were debugged and tested on the venus.mseg.udel.edu cluster, where all pre-delivery builds and tests were performed for the Caviness cluster.
  
==== Job Script Templates ====

The Slurm job script templates available under ''/opt/templates'' on the cluster all mention hardware-specific partitions.  They will be modified to instead mention the use of ''--partition=_workgroup_'' for priority-access.
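A minimal sketch of how an updated template's preamble would direct jobs to the workgroup partition (all other directives omitted):

<code bash>
#!/bin/bash
#
# Submit to the workgroup partition for priority access:
#SBATCH --partition=_workgroup_
</code>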
===== Impact =====
  
At this time, the hardware-specific partitions are seeing relatively little use on Caviness.  The cluster has been open to production use for less than a month, so a very limited number of users would need to modify their workflow.  Even with the addition of workgroup partitions, the existing hardware-specific partitions could be left in service for a limited time until users have completed their move off of them.
  
New partitions are added to the Slurm configuration (a text file) and distributed to all participating controller and compute nodes.  The altered job submission plugin will necessitate a hard restart of the slurmctld service, but a reconfiguration RPC should suffice to bring compute nodes up-to-date:
  
<code bash>
# Hard restart of the controller so the rebuilt job submission plugin is loaded:
systemctl restart slurmctld

# Reconfiguration RPC pushes the updated partition configuration to compute nodes:
scontrol reconfigure
</code>
===== Timeline =====
  
^Date ^Time ^Goal/Description ^
|2018-10-19| |Limitations of hardware-specific partitions discussed|
|2018-10-24| |Project planning to //replace hardware-specific partitions// completed|
|2018-10-25| |Modifications to job submission plugin completed|
| | |Altered plugin tested and debugged on //Venus// cluster|
| | |Project documentation added to HPC wiki|
|2018-10-26| |Workgroup partition configurations generated and staged for enablement|
| | |Announcement added to login nodes' SSH banner directing users to project documentation|
| | |Job script templates updated and staged for deployment|
|2018-10-29|09:00|Workgroup partitions enabled|
| |09:00|Modified job submission plugin enabled|
| |09:00|Modified job script templates available|
|2018-11-05|09:00|Hardware-specific partitions removed|
| |09:00|Announcement removed from login nodes' SSH banner|