
The job queues (partitions) on DARWIN

The DARWIN cluster provides several partitions (queues) that you can specify when running jobs. Each partition corresponds to one of the node types available in the cluster, except idle, which spans all nodes:

| Partition Name | Description | Node Names |
|---|---|---|
| standard | Contains all 48 standard memory nodes (64 cores, 512 GiB memory per node) | r1n00 - r1n47 |
| large-mem | Contains all 32 large memory nodes (64 cores, 1024 GiB memory per node) | r2l00 - r2l31 |
| xlarge-mem | Contains all 11 extra-large memory nodes (64 cores, 2048 GiB memory per node) | r2x00 - r2x10 |
| extended-mem | Contains the single extended memory node (64 cores, 1024 GiB memory + 2.73 TiB NVMe swap) | r2e00 |
| gpu-t4 | Contains all 9 NVIDIA Tesla T4 GPU nodes (64 cores, 512 GiB memory, 1 T4 GPU per node) | r1t00 - r1t07, r2t08 |
| gpu-v100 | Contains all 3 NVIDIA Tesla V100 GPU nodes (48 cores, 768 GiB memory, 4 V100 GPUs per node) | r2v00 - r2v02 |
| gpu-mi50 | Contains the single AMD Radeon Instinct MI50 GPU node (64 cores, 512 GiB memory, 1 MI50 GPU) | r2m00 |
| gpu-mi100 | Contains the single AMD Radeon Instinct MI100 GPU node (64 cores, 512 GiB memory, 1 MI100 GPU) | r2m01 |
| idle | Contains all nodes in the cluster. Jobs on this partition can be preempted but are not charged against your allocation. | |

All partitions on DARWIN have two requirements for submitting jobs:

  1. You must set an allocation workgroup prior to submitting a job by using the workgroup command (e.g., workgroup -g it_nss). This ensures jobs are billed against the correct account in Slurm.
  2. You must explicitly request a single partition in your job submission using --partition or -p.
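As an illustration of the two requirements, here is a minimal sketch of a batch script that names a single partition. The job name is hypothetical; `SLURM_JOB_PARTITION` and `SLURM_JOB_ACCOUNT` are variables Slurm exports inside a running job, and the fallbacks only apply if the script is run by hand outside of Slurm.

```shell
#!/bin/bash
#SBATCH --partition=standard    # exactly one partition must be named
#SBATCH --job-name=hello        # hypothetical job name

# Report which partition and account the job was assigned.
# Outside of Slurm these variables are unset, hence the fallbacks.
job_partition="${SLURM_JOB_PARTITION:-none}"
job_account="${SLURM_JOB_ACCOUNT:-none}"
echo "partition=${job_partition} account=${job_account}"
```

Assuming the script is saved under a name such as hello.qs (hypothetical), it would be submitted with sbatch hello.qs from a shell in which workgroup -g it_nss has already been run. The partition could instead be given on the command line, as in sbatch -p standard hello.qs.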

All partitions on DARWIN except idle have the following defaults:

  • Default run time of 30 minutes
  • Default resources of 1 node, 1 CPU, and 1 GiB memory
  • Default no preemption

All partitions on DARWIN except idle have the following limits:

  • Maximum run time of 7 days
  • Maximum of 400 jobs per user per partition
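Jobs that need more than the defaults must ask for it explicitly. The following sketch overrides each default while staying inside the limits listed above; the specific values (2 days, 4 CPUs, 16 GiB) are arbitrary examples, not recommendations.

```shell
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --time=2-00:00:00       # 2 days; default is 30 minutes, maximum is 7 days
#SBATCH --nodes=1               # default: 1 node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4       # default: 1 CPU
#SBATCH --mem=16G               # default: 1 GiB

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 if the script is run outside of a job.
threads="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$threads"
echo "using ${OMP_NUM_THREADS} thread(s)"
```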

The idle partition has the same defaults and limits as above with the following differences:

  • Preemption is enabled for all jobs
  • Maximum of 320 jobs per user
  • Maximum of 640 CPUs per user (across all jobs in the partition)

Each type of node (and thus, partition) has a limited amount of memory available for jobs. A small amount of memory must be subtracted from the nominal size listed in the table above for the node's operating system and Slurm. The remainder is the upper limit requestable by jobs, summarized by partition below:

| Partition Name | Maximum (by node) | Maximum (by core) |
|---|---|---|
| standard | --mem=499712M | --mem-per-cpu=7808M |
| large-mem | --mem=999424M | --mem-per-cpu=15616M |
| xlarge-mem | --mem=2031616M | --mem-per-cpu=31744M |
| extended-mem | --mem=999424M | --mem-per-cpu=15616M |
| gpu-t4 | --mem=491520M | --mem-per-cpu=7680M |
| gpu-v100 | --mem=737280M | --mem-per-cpu=15360M |
| gpu-mi50 | --mem=491520M | --mem-per-cpu=7680M |
| gpu-mi100 | --mem=491520M | --mem-per-cpu=7680M |
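The per-core maximum is simply the per-node maximum divided by the node's core count. The short sketch below works this through for the standard partition using the values from the table above; the 16-core job is a hypothetical example.

```shell
#!/bin/bash
# Values for the standard partition, taken from the table above.
node_mem_mb=499712      # --mem maximum, in MiB
cores_per_node=64

# Per-core share of the requestable memory.
per_core_mb=$(( node_mem_mb / cores_per_node ))
echo "--mem-per-cpu=${per_core_mb}M"

# A hypothetical 16-core job asking for its full per-core share.
cores=16
job_mem_mb=$(( per_core_mb * cores ))
echo "--mem=${job_mem_mb}M"
```

The same arithmetic applies to the other partitions, with 48 cores per node for gpu-v100.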

Because access to the swap cannot be limited via Slurm, the extended-mem partition is configured to run all jobs in exclusive user mode. This means only a single user can be on the node at a time, but that user can run one or more jobs on the node. All jobs on the node will have access to the full amount of swap available, so care must be taken in usage of swap when running multiple jobs.

Jobs that will run in one of the GPU partitions must request GPU resources using ONE of the following flags:

| Flag | Description |
|---|---|
| --gpus=<count> | <count> GPUs total for the job, regardless of node count |
| --gpus-per-node=<count> | <count> GPUs are required on each node allocated to the job |
| --gpus-per-socket=<count> | <count> GPUs are required on each socket allocated to the job |
| --gpus-per-task=<count> | <count> GPUs are required for each task in the job |

If you do not specify one of these flags, your job will not be permitted to run in the GPU partitions.

On DARWIN the --gres flag should NOT be used to request GPU resources. The GPU type will be inferred from the partition to which the job is submitted if not specified.
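A minimal sketch of a GPU job script follows. It requests two GPUs in the gpu-v100 partition with --gpus and leaves the GPU type to be inferred from the partition. It assumes that Slurm exposes the granted devices through `CUDA_VISIBLE_DEVICES`, which is typical for NVIDIA GPUs but depends on site configuration.

```shell
#!/bin/bash
#SBATCH --partition=gpu-v100
#SBATCH --gpus=2                # GPU type is inferred from the partition
#SBATCH --time=0-01:00:00

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs granted NVIDIA GPUs.
# The fallback applies if the variable is unset, e.g. outside of a job.
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${visible_gpus}"
```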

The idle partition contains all nodes in the cluster. Jobs submitted to the idle partition can be preempted when the resources are required for jobs submitted to the other partitions. Your job should support checkpointing to effectively use the idle partition and avoid lost work.

Be aware that implementing checkpointing is highly dependent on the nature of your job and the ability of your code or software to handle interruptions and restarts. For this reason, we can only provide limited support for the idle partition.

Jobs in the idle partition that have been running for less than 10 minutes are not considered for preemption by Slurm. Additionally, there is a 5-minute grace period between the delivery of the initial preemption signal (SIGCONT+SIGTERM) and the end of the job (SIGCONT+SIGTERM+SIGKILL). This means jobs in the idle partition will have a minimum of 15 minutes of execution time once started. Jobs submitted using the --requeue flag automatically return to the queue to be rescheduled once resources are available again.
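One way to use the grace period is to trap SIGTERM in the batch script and save progress before the job is killed. The sketch below is only an illustration under stated assumptions: the checkpoint file name and the counting loop are hypothetical stand-ins for real application state, and it assumes the preemption signal reaches the batch shell, which depends on how Slurm is configured.

```shell
#!/bin/bash
#SBATCH --partition=idle
#SBATCH --requeue               # return to the queue after preemption
#SBATCH --time=0-04:00:00

# Hypothetical checkpoint file holding the last completed step.
checkpoint="checkpoint.txt"
total_steps=5
step=0

# Resume from the saved step if an earlier run was preempted.
if [ -f "$checkpoint" ]; then
    step=$(cat "$checkpoint")
fi

# On SIGTERM, record progress and exit before the grace period ends.
save_and_exit() {
    echo "$step" > "$checkpoint"
    exit 0
}
trap save_and_exit TERM

while [ "$step" -lt "$total_steps" ]; do
    step=$(( step + 1 ))        # stand-in for one unit of real work
    echo "$step" > "$checkpoint"
done
echo "finished at step ${step}"
```

Note that bash runs a trap handler only after the current foreground command returns, so a long-running program is usually started in the background and followed by wait to let the handler fire promptly.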

Jobs that execute in the idle partition do not result in charges against your allocation(s). However, they do accumulate resource usage for the sake of scheduling priority to ensure fair access to this partition. If your jobs can support checkpointing, the idle partition will enable you to continue your research even if you exhaust your allocation(s).

Since the idle partition contains all nodes in the cluster, you will need to request a specific GPU type if your job needs GPU resources. The three GPU types are:

| Type | Description |
|---|---|
| tesla_t4 | NVIDIA Tesla T4 |
| tesla_v100 | NVIDIA Tesla V100 |
| amd_mi50 | AMD Radeon Instinct MI50 |

To request a specific GPU type while using the idle partition, include the --gpus=<type>:<count> flag with your job submission. For example, --gpus=tesla_t4:4 would request 4 NVIDIA Tesla T4 GPUs.
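The flag can be assembled from a type and a count, as in this small sketch. It only prints the resulting submission command rather than running it; the script name job.qs and the chosen type and count are hypothetical.

```shell
#!/bin/bash
# Pick one of the GPU types from the table above and a GPU count.
gpu_type="tesla_v100"
gpu_count=2

# Build the typed GPU request in the <type>:<count> form.
gpu_flag="--gpus=${gpu_type}:${gpu_count}"

# Print the command that would submit a hypothetical script to idle.
submit_cmd="sbatch --partition=idle ${gpu_flag} job.qs"
echo "$submit_cmd"
```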

  • abstract/darwin/runjobs/queues.txt
  • Last modified: 2023-07-10 08:51
  • by frey