
Job Accounting on DARWIN

Accounting for jobs on DARWIN varies with the type of node used within a given allocation type. There are two types of allocations:

  1. Compute - for CPU-based nodes with 512 GiB, 1024 GiB, or 2048 GiB of RAM
  2. GPU - for GPU-based nodes with NVIDIA Tesla T4, NVIDIA Tesla V100, or AMD Radeon Instinct MI50 GPUs

For all allocations and node types, usage is defined in terms of a Service Unit (SU). The definition of an SU varies with the type of node being used.

IMPORTANT: When a job is submitted, its SUs are calculated and pre-debited based on the resources requested, placing a hold on that amount of your project/workgroup's allocation credit. Once the job completes, the final debit is based on the actual time used. Keep in mind that billing is based on the resources requested, not the resources used: if you request 20 cores and your job only takes advantage of 10, the job is still billed for the requested 20 cores. Likewise, specifying a time limit of 2 days rather than 2 hours may prevent others in your project/workgroup from running jobs, because the held SUs remain unavailable until the job completes. On the other hand, if you do not request enough resources and your job fails (e.g., not enough time or not enough cores), you will still be billed for those SUs. See Scheduling Jobs Command options for help with specifying resources.

Moral of the story: Request only the resources your job needs. Over- or under-requesting resources wastes allocation credits for everyone in your project/workgroup.

Interactive jobs: An interactive job is billed SUs for the full wall time of its execution, not just for the CPU time accrued during it. For example, if you leave an interactive job running for 2 hours but execute code for only 2 minutes, your allocation is billed for 2 hours, not 2 minutes. Please review the SU conversions for each type of resource requested (compute, GPU) and the associated SUs billed per hour.
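The pre-debit behavior described above can be sketched numerically. This is a hypothetical illustration (the function names are ours, not part of DARWIN), assuming a standard compute node where 1 SU corresponds to 1 core for one hour:

```python
import math

# Hypothetical sketch of DARWIN's pre-debit/final-debit accounting,
# assuming 1 SU = 1 core for 1 hour on a standard compute node.

def su_hold(cores: int, hours: float) -> int:
    """SUs pre-debited (held) at submission, based on the resources requested."""
    return math.ceil(cores * hours)

def su_final(cores_requested: int, hours_used: float) -> int:
    """SUs debited at completion: requested cores times actual wall time.
    Note the charge still uses the *requested* cores, even if fewer were used."""
    return math.ceil(cores_requested * hours_used)

# Requesting 20 cores for a 2-day limit holds 960 SUs from the allocation...
hold = su_hold(cores=20, hours=48)
# ...but if the job finishes in 2 hours, only 40 SUs are finally debited.
final = su_final(cores_requested=20, hours_used=2)
print(hold, final)  # 960 40
```

Until the job completes, the full 960-SU hold counts against the workgroup's balance, which is why tight time limits help everyone sharing the allocation.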

If you need SU-to-dollar conversions for your DARWIN allocation for grant proposals or reports, please submit a Research Computing High Performance Computing (HPC) Clusters Help Request, completing the form with DARWIN selected and indicating in the description field that you are requesting SU-to-dollar conversions.

A Compute allocation on DARWIN can be used on any of the four compute node types. Each compute node has 64 cores but the amount of memory varies by node type. The available resources for each node type are below:

Compute Node        Number of Nodes  Total Cores  Memory per Node         Total Memory
Standard            48               3,072        512 GiB                 24 TiB
Large Memory        32               2,048        1,024 GiB               32 TiB
Extra-Large Memory  11               704          2,048 GiB               22 TiB
Extended Memory     1                64           1,024 GiB + 2.73 TiB1)  3.73 TiB
Total               92               5,888                                81.73 TiB

A Service Unit (SU) on compute nodes corresponds to the use of one compute core for one hour. The number of SUs charged for a job is based on the fraction of total cores or fraction of total memory the job requests, whichever is larger. This results in the following SU conversions:

Compute Node        SU Conversion
Standard            1 unit = 1 core + 8 GiB RAM for one hour
Large Memory        1 unit = 1 core + 16 GiB RAM for one hour
Extra-Large Memory  1 unit = 1 core + 32 GiB RAM for one hour
Extended Memory     64 units = 64 cores + 1,024 GiB RAM + 2.73 TiB swap for one hour2)

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

Node Type     Cores  Memory                 SUs billed per hour
Standard      1      1 GiB to 8 GiB         1 SU
Standard      1      504 GiB to 512 GiB     64 SUs3)
Standard      64     1 GiB to 512 GiB       64 SUs
Standard      2      1 GiB to 16 GiB        2 SUs
Large Memory  2      > 32 GiB and ≤ 48 GiB  3 SUs4)
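The per-hour charge in the examples above can be sketched as the larger of the requested cores and the requested memory expressed in core-equivalents. This is our reading of the conversion table, not an official implementation, and it omits the Extended Memory node, which is always billed as the whole node (64 SUs):

```python
import math

# Sketch of the compute-node SU formula: charge the larger of the core
# fraction and the memory fraction, expressed in core-equivalents.
# (An assumption based on the conversion table, not DARWIN's actual code.)

GIB_PER_CORE = {             # memory bundled with each core, by node type
    "standard": 8,           # 512 GiB / 64 cores
    "large-memory": 16,      # 1,024 GiB / 64 cores
    "xlarge-memory": 32,     # 2,048 GiB / 64 cores
}

def compute_sus_per_hour(node_type: str, cores: int, mem_gib: float) -> int:
    """SUs billed per hour: max of cores requested and core-equivalents of memory."""
    core_equiv = math.ceil(mem_gib / GIB_PER_CORE[node_type])
    return max(cores, core_equiv)

print(compute_sus_per_hour("standard", 1, 8))       # 1 SU
print(compute_sus_per_hour("standard", 1, 512))     # 64 SUs (memory-driven)
print(compute_sus_per_hour("large-memory", 2, 48))  # 3 SUs (memory-driven)
```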

Note that these are estimates based on nominal memory. Actual charges are based on available memory, which is lower than nominal memory due to the memory required by the OS and system daemons.

A GPU allocation on DARWIN can be used on any of the three GPU node types. The NVIDIA-T4 and AMD MI50 nodes have 64 cores each, while the NVIDIA-V100 nodes have 48 cores each. The available resources for each node type are below:

GPU Node     Number of Nodes  Total Cores  Memory per Node  Total Memory  Total GPUs
nvidia-T4    9                576          512 GiB          4.5 TiB       9
nvidia-V100  3                144          768 GiB          2.25 TiB      12
AMD-MI50     1                64           512 GiB          0.5 TiB       1
Total        13               784                           7.25 TiB      22

A Service Unit (SU) on GPU nodes corresponds to the use of one GPU device for one hour. The number of SUs charged for a job is based on the fraction of total GPUs, fraction of total cores, or fraction of total memory the job requests, whichever is largest. Because the NVIDIA T4 and AMD MI50 nodes have only 1 GPU each, you have access to all available memory and cores for 1 SU. The NVIDIA V100 nodes have 4 GPUs each, so the available memory and cores per GPU are 1/4 of the totals available on a node. This results in the following SU conversions:

GPU Node     SU Conversion
nvidia-T4    1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour
AMD-MI50     1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour
nvidia-V100  1 unit = 1 GPU + 12 cores + 192 GiB RAM for one hour

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

Node Type    GPUs  Cores     Memory                   SUs billed per hour
nvidia-T4    1     1 to 64   1 GiB to 512 GiB         1 SU
nvidia-T4    2     2 to 128  2 GiB to 1024 GiB        2 SUs
AMD-MI50     1     1 to 64   1 GiB to 512 GiB         1 SU
nvidia-V100  1     1 to 12   1 GiB to 192 GiB         1 SU
nvidia-V100  2     1 to 24   1 GiB to 384 GiB         2 SUs
nvidia-V100  1     13 to 24  1 GiB to 192 GiB         2 SUs5)
nvidia-V100  1     1 to 24   > 192 GiB and ≤ 384 GiB  2 SUs6)
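As with compute nodes, the GPU charge in the examples above can be sketched as the largest of the GPU, core, and memory requests expressed in GPU-equivalents. This is our reading of the conversion table (assuming per-GPU slices of 12 cores and 192 GiB on the V100 nodes), not an official implementation:

```python
import math

# Sketch of the GPU-node SU formula: charge the largest of the GPU, core,
# and memory fractions, expressed in GPU-equivalents.
# (An assumption based on the conversion table, not DARWIN's actual code.)

PER_GPU = {                    # (cores, GiB RAM) bundled with one GPU
    "nvidia-t4": (64, 512),    # 1 GPU per node: whole node per SU
    "amd-mi50": (64, 512),     # 1 GPU per node: whole node per SU
    "nvidia-v100": (12, 192),  # 48 cores / 768 GiB split across 4 GPUs
}

def gpu_sus_per_hour(node_type: str, gpus: int, cores: int, mem_gib: float) -> int:
    """SUs billed per hour: max of GPUs and GPU-equivalents of cores/memory."""
    cores_per_gpu, gib_per_gpu = PER_GPU[node_type]
    return max(gpus,
               math.ceil(cores / cores_per_gpu),
               math.ceil(mem_gib / gib_per_gpu))

print(gpu_sus_per_hour("nvidia-t4", 1, 64, 512))    # 1 SU
print(gpu_sus_per_hour("nvidia-v100", 2, 24, 384))  # 2 SUs
print(gpu_sus_per_hour("nvidia-v100", 1, 12, 384))  # 2 SUs (memory-driven)
```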

Note that these are estimates based on nominal memory. Actual charges are based on available memory, which is lower than nominal memory due to the memory required by the OS and system daemons.

Jobs that execute in the idle partition do not result in charges against your allocation(s). If your jobs can support checkpointing, the idle partition will enable you to continue your research even if you exhaust your allocation(s). However, jobs submitted to the other partitions which do get charged against allocations will take priority and may cause idle partition jobs to be preempted.

Since jobs in the idle partition do not result in charges, you will not see them in the output of the sproject command documented below. You can still use standard Slurm commands to check the status of those jobs.

Checking Allocation Usage

UD IT has created the sproject command to allow various queries against allocations (UD and XSEDE) on DARWIN. You can see the help documentation for sproject by running sproject -h or sproject --help. The -h/--help flag also works for any of the subcommands: sproject allocations -h, sproject projects -h, sproject jobs -h, or sproject failures -h.

For all sproject commands you can specify an output-format of table, csv, or json using the --format <output-format> or -f <output-format> options.

The allocations subcommand shows information for resource allocations granted to projects/workgroups on DARWIN of which you are a member. To see a specific workgroup's allocations, use the -g <workgroup> option as in this example for workgroup it_css:

$ sproject allocations -g it_css
Project id Alloc id Alloc descr Category RDR Start date                End date
---------- -------- ----------- -------- --- ------------------------- -------------------------
         2        3 it_css::cpu startup  cpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00
         2        4 it_css::gpu startup  gpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00
         2       43 it_css::cpu startup  cpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00
         2       44 it_css::gpu startup  gpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00

The --detail flag will show additional information reflecting the credits, running + completed job charges, debits, and balance of each allocation:

$ sproject allocations -g it_css --detail
Project id Alloc id Alloc descr Category RDR Credit Run+Cmplt Debit Balance
---------- -------- ----------- -------- --- ------ --------- ----- -------
         2        3 it_css::cpu startup  cpu 108500         0 -1678  106822
         2        4 it_css::gpu startup  gpu    417         0   -16     401
         2       43 it_css::cpu startup  cpu  33333         0     0   33333
         2       44 it_css::gpu startup  gpu  33333         0     0   33333

The --by-user flag is helpful to see detailed allocation usage broken out by project user:

$ sproject allocations -g it_css --by-user
Project id Alloc id Alloc descr Category RDR User     Transaction Amount Comments
---------- -------- ----------- -------- --- -------- ----------- ------ -------------------------
         2        3 it_css::cpu startup  cpu          credit         167 Request approved jnhuffma
                  3 it_css::cpu                       credit      108333 Testing allocation
                  3 it_css::cpu              jnhuffma debit        -1678
         2        4 it_css::gpu startup  gpu jnhuffma debit          -16
                  4 it_css::gpu                       credit         417 Testing allocation
         2       43 it_css::cpu startup  cpu          credit       33333
         2       44 it_css::gpu startup  gpu          credit       33333

The projects subcommand shows information (such as the project id, group id, name, and creation date) for projects/workgroups on DARWIN of which you are a member. To see a specific project/workgroup, use the -g <workgroup> option as in this example for workgroup it_css:

$ sproject projects -g it_css
Project id Account Group id Group name  Creation date
---------- ------- -------- ----------  -------------------------
         2 it_css      1002 it_css      2021-07-12 14:51:57-04:00

Adding the --detail flag will also show each allocation associated with the project.

$ sproject projects -g it_css --detail
Project id Account Group id Group name Allocation id Category RDR Start date                End date                   Creation date
---------- ------- -------- ---------- ------------- -------- --- ------------------------- -------------------------  -------------------------
         2 it_css      1002 it_css                 3 startup  cpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00  2021-07-12 15:00:46-04:00
                                                   4 startup  gpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00  2021-07-12 15:00:54-04:00
                                                  43 startup  cpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00  2021-07-26 14:47:56-04:00
                                                  44 startup  gpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00  2021-07-26 14:47:56-04:00

The jobs subcommand shows information (such as the Slurm job id, owner, and amount charged) for individual jobs billed against resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use sproject jobs -h for complete details. To see jobs associated with a specific project/workgroup, use the -g <workgroup> option as in this example for workgroup it_css:

$ sproject jobs -g it_css
Activity id Alloc id Alloc descr  Job id Owner    Status    Amount  Creation date            
----------- -------- ------------ ------ -------- --------- ------  -------------------------
      25000       43 it_css::cpu   80000 traine   executing  -3966  2021-08-10 18:27:59-04:00
      25001       43 it_css::cpu   80001 traine   executing  -3966  2021-08-10 18:52:11-04:00
      25002       43 it_css::cpu   80002 trainf   executing  -4200  2021-08-11 13:06:26-04:00
      25003       43 it_css::cpu   80003 traine   executing  -1200  2021-08-11 16:07:42-04:00

Jobs that complete execution will be displayed with a status of completed and the actual billable amount used by the job. At the top and bottom of each hour, completed jobs are resolved into per-user debits and disappear from the jobs listing (see the sproject allocations section above for the display of resource allocation credits, debits, and pre-debits).

The failures subcommand shows information (such as the Slurm job id, owner, and amount charged) for all jobs that failed to execute due to insufficient allocation balance on resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use sproject failures -h for complete details. To see failures associated with jobs run as a specific project/workgroup, use the -g <workgroup> option as in this example for workgroup it_css:

$ sproject failures -g it_css
Job id Error message
------ -------------------------------------------------------------
 60472 Requested allocation has insufficient balance: 3641 < 7680
 60473 Requested allocation has insufficient balance: 3641 < 7680
 60476 Requested allocation has insufficient balance: 3641 < 7680
 60474 Requested allocation has insufficient balance: 3641 < 7680
 60475 Requested allocation has insufficient balance: 3641 < 7680
 60471 Requested allocation has insufficient balance: 3641 < 7680
 60478 Requested allocation has insufficient balance: 3641 < 7680
 60477 Requested allocation has insufficient balance: 3641 < 7680
 60487 Requested allocation has insufficient balance: 12943 < 368640
 60488 Requested allocation has insufficient balance: 12943 < 276480
 60489 Requested allocation has insufficient balance: 12943 < 184320
 60490 Requested allocation has insufficient balance: 12943 < 76800
 60491 Requested allocation has insufficient balance: 12943 < 76800
 60492 Requested allocation has insufficient balance: 12943 < 61440
 60493 Requested allocation has insufficient balance: 12943 < 15360

Adding the --detail flag provides further information such as the owner, amount, and creation date.

$ sproject failures -g it_css --detail
Activity id Alloc id Alloc descr Job id Owner    Amount Error message                                                  Creation date
----------- -------- ----------- ------ -------- ------ -------------------------------------------------------------  -------------------------
         75        3 it_css::cpu  60472 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:10-04:00
         76        3 it_css::cpu  60473 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:10-04:00
         77        3 it_css::cpu  60476 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:10-04:00
         78        3 it_css::cpu  60474 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:10-04:00
         79        3 it_css::cpu  60475 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:10-04:00
         80        3 it_css::cpu  60471 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:10-04:00
         81        3 it_css::cpu  60478 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:13-04:00
         82        3 it_css::cpu  60477 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680     2021-07-13 08:57:13-04:00
         92        3 it_css::cpu  60487 jnhuffma   6144 Requested allocation has insufficient balance: 12943 < 368640  2021-07-13 11:09:00-04:00
         93        3 it_css::cpu  60488 jnhuffma   4608 Requested allocation has insufficient balance: 12943 < 276480  2021-07-13 11:09:24-04:00
         94        3 it_css::cpu  60489 jnhuffma   3072 Requested allocation has insufficient balance: 12943 < 184320  2021-07-13 11:09:40-04:00
         95        3 it_css::cpu  60490 jnhuffma   1280 Requested allocation has insufficient balance: 12943 < 76800   2021-07-13 11:15:08-04:00
         96        3 it_css::cpu  60491 jnhuffma   1280 Requested allocation has insufficient balance: 12943 < 76800   2021-07-13 11:16:00-04:00
         97        3 it_css::cpu  60492 jnhuffma   1024 Requested allocation has insufficient balance: 12943 < 61440   2021-07-13 11:19:16-04:00
         98        3 it_css::cpu  60493 jnhuffma    256 Requested allocation has insufficient balance: 12943 < 15360   2021-07-13 11:20:01-04:00

For ACCESS allocations on DARWIN, you may use the ACCESS Allocations portal to check allocation usage; however, keep in mind that the sproject command available on DARWIN provides the most up-to-date usage information, since the ACCESS Allocations portal is only updated nightly.

Every DARWIN Compute or GPU allocation has a storage allocation associated with it on the DARWIN Lustre file system. These allocations are measured in tebibytes and the default amount is 1 TiB. There are no SUs deducted from your allocation for the space you use, but you will be limited to a storage quota based on your awarded allocation.

Each project/workgroup has a folder associated with it referred to as workgroup storage. Every file in that folder will count against that project/workgroup's allocated quota for their workgroup storage.

You can use the my_quotas command to check storage usage.


1)
1024 GiB of system memory and 2.73 TiB of swap on high-speed Intel Optane NVMe storage
2)
always billed as the entire node
3)
512 GiB RAM on a standard node is equivalent to using all 64 cores, so you are charged as if you used 64 cores
4)
RAM usage exceeds what is available with 2 cores on a large memory node, so you are charged as if you used 3 cores
5)
billed as if you were using 2 GPUs due to the proportion of CPU cores used
6)
billed as if you were using 2 GPUs due to the proportion of memory used
  • Last modified: 2024-02-29 13:14
  • by anita