====== The job queues (partitions) on Caviness ======

The Caviness cluster has several kinds of partition (queue) available in which to run jobs:

^Kind^Description^
|standard|The default partition if no ''%%--%%partition'' submission flag is specified|
|devel|A partition with very short runtime limits and small resource limits|
|workgroup-specific|Partitions associated with specific kinds of compute equipment in the cluster purchased by a research group <<//investing-entity//>> (workgroup)|

==== The standard partition ====

This partition is the default when no ''%%--%%partition'' submission flag is specified, and any user on Caviness can request resources from it. However, job preemption logic (discussed below) is applied to this partition to ensure that workgroup-specific jobs are prioritized.

The standard partition is partly similar in concept to the spillover queues on the earlier clusters.

Limits on jobs submitted to this partition are listed below, with an example submission after the list:
  * a maximum runtime of 7 days (the default is 30 minutes)
  * a maximum of 360 CPUs per job
  * a maximum of 720 CPUs per user
  * per-workgroup resource limits based on
    * how many nodes your research group (workgroup) purchased (node=#)
    * how many cores your research group (workgroup) purchased (cpu=#)
    * how many GPUs your research group (workgroup) purchased (gres/gpu:<kind>=#)
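
For example, a job within these limits might be submitted as follows. This is only a sketch: ''myjob.qs'' is a hypothetical batch script, the job number shown is illustrative, and the ''%%--%%partition=standard'' flag may be omitted since this partition is the default.
<code bash>
[(it_css:traine)@login00 ~]$ sbatch --partition=standard --ntasks=8 --time=1-00:00:00 myjob.qs
Submitted batch job 1233
</code>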

The standard partition is subject to job preemption. When a job is submitted to a workgroup-specific partition and the resources it needs are tied up by jobs in the ''standard'' partition, those ''standard'' jobs will be preempted to make way. For more information on how to handle preemption, refer to [[abstract:caviness:runjobs:schedule_jobs#Handling-System-Signals-during-preemption-aka- Checkpointing|Checkpointing]].
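
As a minimal sketch of reacting to preemption (the full checkpointing approach is described on the page linked above), a batch script can trap the termination signal that Slurm typically delivers before a preempted job is killed and save some state before exiting. The script and program names below are illustrative, not a documented Caviness recipe.
<code bash>
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00

# Hypothetical cleanup routine: record enough state to resume later.
save_state() {
    echo "preemption signal received, saving state" >&2
    touch "$SLURM_SUBMIT_DIR/preempted.flag"   # placeholder for a real checkpoint
    exit 0
}

# Slurm normally sends SIGTERM before forcibly ending a preempted job, so trap it.
trap save_state SIGTERM

# Run the (hypothetical) workload in the background so the trap can fire promptly,
# then wait for it to finish.
./my_long_running_program &
wait $!
</code>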
- 
==== The devel partition ====
- 
This partition is used for short-lived jobs with minimal resource needs.  Typical uses for the ''devel'' queue include:
  * Performing lengthy compiles of code projects
  * Running test jobs to vet programs or changes to programs
  * Testing correctness of program parallelization
  * Interactive sessions
Because performance is not critical for these use cases, the nodes serviced by the ''devel'' partition have hyperthreads enabled, effectively doubling the number of CPUs available.

Limits on jobs submitted to this partition are:
  * a maximum runtime of 2 hours (the default is 30 minutes)
  * each user can submit up to 2 jobs
  * each job can use up to 4 cores on a single node

For example:
<code bash>
[(it_css:traine)@login00 ~]$ srun --partition=devel --nodes=1 --ntasks=1 --cpus-per-task=4 date
Mon Jul 23 15:25:07 EDT 2018
</code>

One copy of the ''date'' command is executed on one node in the ''devel'' partition; the command has four cores (or in this case, hyperthreads) allocated to it.  An interactive shell in the ''devel'' partition with two cores available would be started via:
<code bash>
[traine@login01 ~]$ workgroup -g it_css
[(it_css:traine)@login01 ~]$ salloc --partition=devel --cpus-per-task=2
salloc: Granted job allocation 940
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[traine@r00n56 ~]$ echo $SLURM_CPUS_ON_NODE
2
</code>

==== The workgroup-specific partitions ====

The //investing-entity// (workgroup) partitions (queues) are similar to the owner queues on Mills and Farber; however, on Caviness distinct nodes are not assigned to a workgroup-specific partition. Instead, the //investing-entity// (workgroup) is given priority access spanning all of the cluster resources for each type of node it purchased on Caviness. Each workgroup-specific partition reuses the existing workgroup QOS as its default (baseline) QOS, which limits resources while guaranteeing access based on what was purchased, by preempting (killing) jobs in the standard partition to make way for jobs submitted to the workgroup-specific partitions. Job submissions are also checked for the name ''_workgroup_'' in the list of requested partitions; if present, ''_workgroup_'' is replaced with the //investing-entity// (workgroup) name under which the user submitted the job (e.g. as set by ''workgroup -g <<//investing-entity//>>'').

Limits on jobs submitted to workgroup-specific partitions are:
  * a maximum runtime of 7 days (the default is 30 minutes)
  * per-workgroup resource limits (QOS) based on
    * how many nodes your research group (workgroup) purchased (node=#)
    * how many cores your research group (workgroup) purchased (cpu=#)
    * how many GPUs your research group (workgroup) purchased (gres/gpu:<kind>=#)

For example:

<code bash>
$ workgroup -g it_nss
$ sbatch --verbose --account=it_css --partition=_workgroup_ …
  :
sbatch: partition         : _workgroup_
  :
Submitted batch job 1234
$ scontrol show job 1234 | egrep -i '(partition|account)='
   Priority=2014 Nice=0 Account=it_css QOS=normal
   Partition=it_nss AllocNode:Sid=login01:7280
</code>

Job 1234 is billed against the it_css account but executes in the it_nss workgroup partition (assuming the it_css account has been granted access to that partition).  When the job executes, all of its processes run under the it_nss Unix group.
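
An interactive session can use the same mechanism. The following is only a sketch; the job number and node name are illustrative:
<code bash>
[(it_css:traine)@login00 ~]$ salloc --partition=_workgroup_
salloc: Granted job allocation 1235
salloc: Waiting for resource configuration
salloc: Nodes r00n45 are ready for job
[traine@r00n45 ~]$
</code>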

To check which resources your workgroup has access to and what is guaranteed on Caviness, refer to [[abstract:caviness:runjobs:job_status#Checking-the-available-resources|Resources]].
  