The Caviness cluster has several kinds of partitions (queues) available in which to run jobs:
|standard|The default partition if no ''%%--%%partition'' submission flag is specified; jobs can be preempted (killed)|''scontrol show partition standard''|
|devel|A partition with very short runtime limits and small resource limits; important to use for any development that uses compilers|''scontrol show partition devel''|
|workgroup-specific|Partitions associated with specific kinds of compute equipment in the cluster purchased by a research group <<//investing-entity//>> (workgroup)|''scontrol show partition'' <<//workgroup//>>|
===== The standard partition =====
This partition is the default when no ''%%--%%partition'' submission flag is specified. Anyone on Caviness can request resources from the standard partition. However, job preemption logic (discussed below) is implemented on this partition to ensure that workgroup-specific jobs are prioritized.
* Maximum CPUs per user = 720
Jobs in the standard partition are subject to preemption: a job submitted to a workgroup-specific partition can reclaim resources tied up by jobs in the standard partition. In summary, jobs in the standard partition will be preempted (killed after a 5 minute grace period) to release resources for the workgroup-specific partition job. For more information on how to handle your job if it is preempted, please refer to [[abstract:caviness:runjobs:schedule_jobs#Handling-System-Signals-aka- Checkpointing|Checkpointing]].
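One way a batch script might use the grace period is to trap the termination signal and save its progress before exiting. The following is a minimal sketch, not the documented Caviness procedure: it assumes that ''SIGTERM'' is delivered to the batch script when the grace period starts, and the checkpoint file name, the three-step loop, and the ''sleep'' stand-in for real work are all hypothetical. The Checkpointing page linked above remains the authoritative reference.

```bash
#!/bin/bash
#SBATCH --partition=standard
# Sketch only: the checkpoint file name and the work loop are hypothetical.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"
step=0

# Resume from the last recorded step if a checkpoint file exists.
if [ -f "$CHECKPOINT_FILE" ]; then
    step=$(cat "$CHECKPOINT_FILE")
fi

save_checkpoint() {
    echo "$step" > "$CHECKPOINT_FILE"
}

# Assumption: SIGTERM reaches this script when the job is preempted.
# Save progress and exit cleanly before the grace period runs out.
on_term() {
    save_checkpoint
    exit 0
}
trap on_term TERM

while [ "$step" -lt 3 ]; do
    # Stand-in for one unit of real work. Running it in the background
    # and waiting lets the trap fire without waiting for the unit to end.
    sleep 1 &
    wait $!
    step=$((step + 1))
done

# Record the final state so a rerun does not repeat finished steps.
save_checkpoint
echo "finished at step $step"
```

When the same script is resubmitted after a preemption, it picks up from the step recorded in the checkpoint file instead of starting over.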
===== The devel partition =====
This partition is used for short-lived jobs with minimal resource needs. Typical uses for the ''devel'' queue include:
* Compiling code for projects when the build cannot be done on the login (head) node; this also ensures you are allocated a compute node with the development tools, libraries, etc. that compilers need
* Running test jobs to vet programs or changes to programs
* Testing correctness of program parallelization
* Interactive sessions
* Removing files, especially when cleaning up many files and directories in ''$HOME'', ''$WORKDIR'' and ''/lustre/scratch''
Because performance is not critical for these use cases, the nodes serviced by the ''devel'' partition have hyperthreads enabled, effectively doubling the number of CPUs available.
For example:
<code bash>
[traine@login01 ~]$ workgroup -g it_css
[(it_css:traine)@login00 ~]$ srun --partition=devel --nodes=1 --ntasks=1 --cpus-per-task=4 date
Mon Jul 23 15:25:07 EDT 2018
</code>
One copy of the ''date'' command is executed on one node in the ''devel'' partition; the command has four cores (in this case, hyperthreads) allocated to it. An interactive shell in the ''devel'' partition with two cores and one hour of time available would be started via:
The //investing-entity// (workgroup) partitions (queues) are similar to the owner queues on Mills and Farber; however, on Caviness distinct nodes are not assigned to a workgroup-specific partition. Instead, the //investing-entity// (workgroup) is given priority access that spans all of the cluster resources for each type of node purchased by the workgroup. Each workgroup-specific partition reuses the existing workgroup QOS as its default (baseline) QOS. This limits the resources available and, at the same time, guarantees access to what was purchased by preempting (killing) jobs in the standard queue to make way for jobs submitted to the workgroup-specific queues. There is a special flag that checks for the name ''_workgroup_'' in the list of partitions requested for the job. If enabled, the word ''_workgroup_'' is replaced with the //investing-entity// (workgroup) name under which the user submitted the job (e.g. ''workgroup -g <<//investing-entity//>>'').
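The effect of the ''_workgroup_'' substitution can be illustrated with a small shell function. This only mimics the behavior described above and is not the scheduler's actual code; the function name ''resolve_partitions'' is hypothetical, and ''it_css'' stands in for whatever workgroup the job was submitted under.

```bash
# Illustration only: mimics the described substitution, not the scheduler's code.
# $1 = comma-separated list of requested partitions
# $2 = workgroup name the job was submitted under
resolve_partitions() {
    printf '%s\n' "$1" \
        | tr ',' '\n' \
        | sed "s/^_workgroup_\$/$2/" \
        | paste -sd, -
}

# A request for only the placeholder resolves to the workgroup's partition.
resolve_partitions "_workgroup_" "it_css"

# Other partitions in the list are left untouched.
resolve_partitions "devel,_workgroup_" "it_css"
```

Only list entries that are exactly ''_workgroup_'' are replaced, so a partition name that merely contains that text would be left alone in this sketch.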
<code bash>
$ scontrol show job 1234 | egrep -i '(partition|account)='
   Priority=2014 Nice=0 Account=it_nss QOS=normal
   Partition=it_nss AllocNode:Sid=login01:7280
</code>
Job 1234 is billed against the it_nss account because it runs in the it_nss workgroup partition. When the job executes, all processes start with the it_nss Unix group.
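If the account and partition are needed in a script, the two fields can be extracted from that output. A minimal sketch follows: the sample text is copied from the output above so that the snippet runs without Slurm, and the variable names are illustrative. On the cluster, the ''printf'' of the sample would be replaced by the ''scontrol show job'' command for the job of interest.

```bash
# Sample text copied from the example output; on the cluster this would
# come from "scontrol show job <jobid>" instead.
sample='   Priority=2014 Nice=0 Account=it_nss QOS=normal
   Partition=it_nss AllocNode:Sid=login01:7280'

# grep -o prints only the matching key=value token; cut keeps the value.
account=$(printf '%s\n' "$sample" | grep -o 'Account=[^ ]*' | cut -d= -f2)
partition=$(printf '%s\n' "$sample" | grep -o 'Partition=[^ ]*' | cut -d= -f2)

echo "account=$account partition=$partition"
```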
To check what your workgroup has access to, and its guaranteed resources on Caviness, refer to [[abstract:caviness:runjobs:job_status#Available-Resources|Resources]].