Job Accounting on DARWIN
Accounting for jobs on DARWIN varies with the type of node used within a given allocation type. There are two types of allocations:
- Compute - for CPU-based nodes with 512 GiB, 1,024 GiB, or 2,048 GiB of RAM
- GPU - for GPU-based nodes with NVIDIA Tesla T4, NVIDIA Tesla V100, or AMD Radeon Instinct MI50 GPUs
For all allocations and node types, usage is defined in terms of a Service Unit (SU). The definition of an SU varies with the type of node being used.
Moral of the story: request only the resources your job actually needs. Over- or under-requesting resources wastes the allocation credits shared by everyone in your project/workgroup.
If you need SU-to-dollar conversions based on your DARWIN allocation for grant proposals or reports, please submit a Research Computing High Performance Computing (HPC) Clusters Help Request, select DARWIN on the form, and indicate in the description field that you are requesting SU-to-dollar conversions.
Compute Allocations
A Compute allocation on DARWIN can be used on any of the four compute node types. Each compute node has 64 cores, but the amount of memory varies by node type. The available resources for each node type are listed below:
Compute Node | Number of Nodes | Total Cores | Memory per Node | Total Memory |
---|---|---|---|---|
Standard | 48 | 3,072 | 512 GiB | 24 TiB |
Large Memory | 32 | 2,048 | 1,024 GiB | 32 TiB |
Extra-Large Memory | 11 | 704 | 2,048 GiB | 22 TiB |
Extended Memory | 1 | 64 | 1,024 GiB + 2.73 TiB 1) | 3.73 TiB |
Total | 92 | 5,888 | | 81.73 TiB |
A Service Unit (SU) on compute nodes corresponds to the use of one compute core for one hour. The number of SUs charged for a job is based on the fraction of total cores or fraction of total memory the job requests, whichever is larger. This results in the following SU conversions:
Compute Node | SU Conversion |
---|---|
Standard | 1 unit = 1 core + 8 GiB RAM for one hour |
Large Memory | 1 unit = 1 core + 16 GiB RAM for one hour |
Extra-Large Memory | 1 unit = 1 core + 32 GiB RAM for one hour |
Extended Memory | 64 units = 64 cores + 1,024 GiB RAM + 2.73 TiB swap for one hour 2) |
See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:
Node Type | Cores | Memory | SUs billed per hour |
---|---|---|---|
Standard | 1 | 1 GiB to 8 GiB | 1 SU |
Standard | 1 | 504 GiB to 512 GiB | 64 SUs 3) |
Standard | 64 | 1 GiB to 512 GiB | 64 SUs |
Standard | 2 | 1 GiB to 16 GiB | 2 SUs |
Large Memory | 2 | > 32 GiB and ≤ 48 GiB | 3 SUs 4) |
Note that these are estimates based on nominal memory. Actual charges are based on available memory, which will be lower than nominal memory due to the memory required by the OS and system daemons.
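As a rough illustration of the charging rule above, the shell sketch below estimates the per-hour SU charge for a request on a Standard node. It assumes the nominal 8 GiB-per-core figure from the conversion table and a simple round-up of the memory share; because actual charges use available rather than nominal memory, requests near a node's full memory can bill the entire node, as in the footnoted example.

```bash
#!/bin/bash
# Rough estimate of SUs billed per hour on a Standard compute node
# (64 cores, nominally 8 GiB of RAM per core). Values are illustrative.
cores=2          # cores requested
mem_gib=16       # memory requested, in GiB
gib_per_core=8   # Standard node: 512 GiB / 64 cores

# Memory expressed as whole core-equivalents, rounded up
mem_units=$(( (mem_gib + gib_per_core - 1) / gib_per_core ))

# The charge is whichever share is larger: cores or memory
sus=$(( cores > mem_units ? cores : mem_units ))
echo "Estimated SUs per hour: ${sus}"   # prints 2, matching the 2-core/16 GiB row above
```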
GPU Allocations
A GPU allocation on DARWIN can be used on any of the three GPU node types. The NVIDIA-T4 and AMD MI50 nodes have 64 cores each, while the NVIDIA-V100 nodes have 48 cores each. The available resources for each node type are listed below:
GPU Node | Number of Nodes | Total Cores | Memory per Node | Total Memory | Total GPUs |
---|---|---|---|---|---|
nvidia-T4 | 9 | 576 | 512 GiB | 4.5 TiB | 9 |
nvidia-V100 | 3 | 144 | 768 GiB | 2.25 TiB | 12 |
AMD-MI50 | 1 | 64 | 512 GiB | 0.5 TiB | 1 |
Total | 13 | 784 | | 7.25 TiB | 22 |
A Service Unit (SU) on GPU nodes corresponds to the use of one GPU device for one hour. The number of SUs charged for a job is based on the fraction of total GPUs, fraction of total cores, or fraction of total memory the job requests, whichever is larger. Because the NVIDIA T4 and AMD MI50 nodes only have 1 GPU each, you have access to all available memory and cores for 1 SU. The NVIDIA V100 nodes have 4 GPUs each, so the available memory and cores per GPU is 1/4 of the total available on a node. This results in the following SU conversions:
GPU Node | SU Conversion |
---|---|
nvidia-T4 | 1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour |
AMD-MI50 | 1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour |
nvidia-V100 | 1 unit = 1 GPU + 12 cores + 192 GiB RAM for one hour |
See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:
Node Type | GPUs | Cores | Memory | SUs billed per hour |
---|---|---|---|---|
nvidia-T4 | 1 | 1 to 64 | 1 GiB to 512 GiB | 1 SU |
nvidia-T4 | 2 | 2 to 128 | 2 GiB to 1024 GiB | 2 SUs |
AMD-MI50 | 1 | 1 to 64 | 1 GiB to 512 GiB | 1 SU |
nvidia-V100 | 1 | 1 to 12 | 1 GiB to 192 GiB | 1 SU |
nvidia-V100 | 2 | 1 to 24 | 1 GiB to 384 GiB | 2 SUs |
nvidia-V100 | 1 | 25 to 48 | 1 GiB to 192 GiB | 2 SUs 5) |
nvidia-V100 | 1 | 1 to 24 | > 192 GiB and ≤ 384 GiB | 2 SUs 6) |
Note that these are estimates based on nominal memory. Actual charges are based on available memory, which will be lower than nominal memory due to the memory required by the OS and system daemons.
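The same kind of estimate can be sketched for GPU nodes. The example below uses the nominal per-GPU figures for an nvidia-V100 node (12 cores and 192 GiB per GPU) from the conversion table; as noted above, actual charges are based on available memory, so this is only an approximation.

```bash
#!/bin/bash
# Rough estimate of SUs billed per hour on an nvidia-V100 node
# (4 GPUs per node; nominally 12 cores and 192 GiB of RAM per GPU).
gpus=1
cores=10
mem_gib=300
cores_per_gpu=12
gib_per_gpu=192

# Cores and memory expressed as whole GPU-equivalents, rounded up
core_units=$(( (cores + cores_per_gpu - 1) / cores_per_gpu ))
mem_units=$(( (mem_gib + gib_per_gpu - 1) / gib_per_gpu ))

# The charge is the largest of the GPU, core, and memory shares
sus=$gpus
(( core_units > sus )) && sus=$core_units
(( mem_units > sus )) && sus=$mem_units
echo "Estimated SUs per hour: ${sus}"   # prints 2: 300 GiB spans two GPUs' worth of memory
```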
The idle partition
Jobs that execute in the `idle` partition do not result in charges against your allocation(s). If your jobs can support checkpointing, the `idle` partition will enable you to continue your research even if you exhaust your allocation(s). However, jobs submitted to the other partitions, which do get charged against allocations, take priority and may cause `idle` partition jobs to be preempted.

Because jobs in the `idle` partition do not result in charges, you will not see them in the output of the `sproject` command documented below. You can still use standard Slurm commands to check the status of those jobs.
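If you want to take advantage of this, a minimal batch script for targeting the `idle` partition might look like the sketch below. The partition name comes from this page; the job name, resource requests, and the checkpoint-aware application are placeholders you would adapt to your own workflow, and any preemption/requeue settings should be confirmed against DARWIN's Slurm configuration.

```bash
#!/bin/bash
#SBATCH --job-name=idle-example    # placeholder job name
#SBATCH --partition=idle           # runs without charging SUs, but may be preempted
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --requeue                  # ask Slurm to requeue this job if it is preempted

# The application below is hypothetical; the important part is that it writes
# periodic checkpoints and can resume from the latest one after a requeue.
./my_checkpointing_app --resume-from-latest-checkpoint
```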
Checking Allocation Usage
sproject
UD IT has created the `sproject` command to allow various queries against allocations (UD and XSEDE) on DARWIN. You can see the help documentation for `sproject` by running `sproject -h` or `sproject --help`. The `-h`/`--help` flag also works for any of the subcommands: `sproject allocations -h`, `sproject projects -h`, `sproject jobs -h`, or `sproject failures -h`.

For all `sproject` commands you can specify an output format of `table`, `csv`, or `json` using the `--format <output-format>` or `-f <output-format>` option.
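For example, an allocation summary could be captured in CSV form for use in a spreadsheet or script; this is simply an illustrative combination of the options documented above (the output file name is arbitrary):

```
$ sproject allocations -g it_css -f csv > it_css_allocations.csv
```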
sproject allocations
The `allocations` subcommand shows information for resource allocations granted to projects/workgroups on DARWIN of which you are a member. To see a specific workgroup's allocations, use the `-g <workgroup>` option as in this example for workgroup `it_css`:
```
$ sproject allocations -g it_css
Project id  Alloc id  Alloc descr  Category  RDR  Start date                 End date
----------  --------  -----------  --------  ---  -------------------------  -------------------------
         2         3  it_css::cpu  startup   cpu  2021-07-12 00:00:00-04:00  2021-07-25 23:59:59-04:00
         2         4  it_css::gpu  startup   gpu  2021-07-12 00:00:00-04:00  2021-07-25 23:59:59-04:00
         2        43  it_css::cpu  startup   cpu  2021-07-26 00:00:00-04:00  2022-07-31 23:59:50-04:00
         2        44  it_css::gpu  startup   gpu  2021-07-26 00:00:00-04:00  2022-07-31 23:59:50-04:00
```
The `--detail` flag will show additional information reflecting the credits, running + completed job charges, debits, and balance of each allocation:
```
$ sproject allocations -g it_css --detail
Project id  Alloc id  Alloc descr  Category  RDR  Credit  Run+Cmplt  Debit  Balance
----------  --------  -----------  --------  ---  ------  ---------  -----  -------
         2         3  it_css::cpu  startup   cpu  108500          0  -1678   106822
         2         4  it_css::gpu  startup   gpu     417          0    -16      401
         2        43  it_css::cpu  startup   cpu   33333          0      0    33333
         2        44  it_css::gpu  startup   gpu   33333          0      0    33333
```
The `--by-user` flag is helpful to see detailed allocation usage broken out by project user:
```
$ sproject allocations -g it_css --by-user
Project id  Alloc id  Alloc descr  Category  RDR  User      Transaction  Amount  Comments
----------  --------  -----------  --------  ---  --------  -----------  ------  -------------------------
         2         3  it_css::cpu  startup   cpu            credit          167  Request approved
                   3  it_css::cpu                 jnhuffma  credit       108333  Testing allocation
                   3  it_css::cpu                 jnhuffma  debit         -1678
         2         4  it_css::gpu  startup   gpu  jnhuffma  debit           -16
                   4  it_css::gpu                           credit          417  Testing allocation
         2        43  it_css::cpu  startup   cpu            credit        33333
         2        44  it_css::gpu  startup   gpu            credit        33333
```
sproject projects
The `projects` subcommand shows information (such as the project id, group id, name, and creation date) for projects/workgroups on DARWIN of which you are a member. To see a specific project/workgroup, use the `-g <workgroup>` option as in this example for workgroup `it_css`:
```
$ sproject projects -g it_css
Project id  Account  Group id  Group name  Creation date
----------  -------  --------  ----------  -------------------------
         2  it_css       1002  it_css      2021-07-12 14:51:57-04:00
```
Adding the `--detail` flag will also show each allocation associated with the project.
```
$ sproject projects -g it_css --detail
Project id  Account  Group id  Group name  Allocation id  Category  RDR  Start date                 End date                   Creation date
----------  -------  --------  ----------  -------------  --------  ---  -------------------------  -------------------------  -------------------------
         2  it_css       1002  it_css                  3  startup   cpu  2021-07-12 00:00:00-04:00  2021-07-25 23:59:59-04:00  2021-07-12 15:00:46-04:00
                                                        4  startup   gpu  2021-07-12 00:00:00-04:00  2021-07-25 23:59:59-04:00  2021-07-12 15:00:54-04:00
                                                       43  startup   cpu  2021-07-26 00:00:00-04:00  2022-07-31 23:59:50-04:00  2021-07-26 14:47:56-04:00
                                                       44  startup   gpu  2021-07-26 00:00:00-04:00  2022-07-31 23:59:50-04:00  2021-07-26 14:47:56-04:00
```
sproject jobs
The `jobs` subcommand shows information (such as the Slurm job id, owner, and amount charged) for individual jobs billed against resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use `sproject jobs -h` for complete details. To see jobs associated with a specific project/workgroup, use the `-g <workgroup>` option as in this example for workgroup `it_css`:
```
$ sproject jobs -g it_css
Activity id  Alloc id  Alloc descr   Job id  Owner   Status     Amount  Creation date
-----------  --------  ------------  ------  ------  ---------  ------  -------------------------
      25000        43  it_css::cpu    80000  traine  executing   -3966  2021-08-10 18:27:59-04:00
      25001        43  it_css::cpu    80001  traine  executing   -3966  2021-08-10 18:52:11-04:00
      25002        43  it_css::cpu    80002  trainf  executing   -4200  2021-08-11 13:06:26-04:00
      25003        43  it_css::cpu    80003  traine  executing   -1200  2021-08-11 16:07:42-04:00
```
Jobs that complete execution will be displayed with a status of `completed` and the actual billable amount used by the job. At the top and bottom of each hour, completed jobs are resolved into per-user debits and disappear from the `jobs` listing (see the sproject allocations section above for the display of resource allocation credits, debits, and pre-debits).
sproject failures
The `failures` subcommand shows information (such as the Slurm job id, owner, and amount charged) for all jobs that failed to execute due to insufficient allocation balance on resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use `sproject failures -h` for complete details. To see failures associated with jobs run as a specific project/workgroup, use the `-g <workgroup>` option as in this example for workgroup `it_css`:
```
$ sproject failures -g it_css
Job id  Error message
------  -------------------------------------------------------------
 60472  Requested allocation has insufficient balance: 3641 < 7680
 60473  Requested allocation has insufficient balance: 3641 < 7680
 60476  Requested allocation has insufficient balance: 3641 < 7680
 60474  Requested allocation has insufficient balance: 3641 < 7680
 60475  Requested allocation has insufficient balance: 3641 < 7680
 60471  Requested allocation has insufficient balance: 3641 < 7680
 60478  Requested allocation has insufficient balance: 3641 < 7680
 60477  Requested allocation has insufficient balance: 3641 < 7680
 60487  Requested allocation has insufficient balance: 12943 < 368640
 60488  Requested allocation has insufficient balance: 12943 < 276480
 60489  Requested allocation has insufficient balance: 12943 < 184320
 60490  Requested allocation has insufficient balance: 12943 < 76800
 60491  Requested allocation has insufficient balance: 12943 < 76800
 60492  Requested allocation has insufficient balance: 12943 < 61440
 60493  Requested allocation has insufficient balance: 12943 < 15360
```
Adding the `--detail` flag provides further information such as the owner, amount, and creation date.
```
$ sproject failures -g it_css --detail
Activity id  Alloc id  Alloc descr  Job id  Owner     Amount  Error message                                                   Creation date
-----------  --------  -----------  ------  --------  ------  --------------------------------------------------------------  -------------------------
         75         3  it_css::cpu   60472  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:10-04:00
         76         3  it_css::cpu   60473  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:10-04:00
         77         3  it_css::cpu   60476  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:10-04:00
         78         3  it_css::cpu   60474  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:10-04:00
         79         3  it_css::cpu   60475  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:10-04:00
         80         3  it_css::cpu   60471  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:10-04:00
         81         3  it_css::cpu   60478  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:13-04:00
         82         3  it_css::cpu   60477  jnhuffma     128  Requested allocation has insufficient balance: 3641 < 7680       2021-07-13 08:57:13-04:00
         92         3  it_css::cpu   60487  jnhuffma    6144  Requested allocation has insufficient balance: 12943 < 368640    2021-07-13 11:09:00-04:00
         93         3  it_css::cpu   60488  jnhuffma    4608  Requested allocation has insufficient balance: 12943 < 276480    2021-07-13 11:09:24-04:00
         94         3  it_css::cpu   60489  jnhuffma    3072  Requested allocation has insufficient balance: 12943 < 184320    2021-07-13 11:09:40-04:00
         95         3  it_css::cpu   60490  jnhuffma    1280  Requested allocation has insufficient balance: 12943 < 76800     2021-07-13 11:15:08-04:00
         96         3  it_css::cpu   60491  jnhuffma    1280  Requested allocation has insufficient balance: 12943 < 76800     2021-07-13 11:16:00-04:00
         97         3  it_css::cpu   60492  jnhuffma    1024  Requested allocation has insufficient balance: 12943 < 61440     2021-07-13 11:19:16-04:00
         98         3  it_css::cpu   60493  jnhuffma     256  Requested allocation has insufficient balance: 12943 < 15360     2021-07-13 11:20:01-04:00
```
ACCESS (XSEDE) Allocations
For ACCESS allocations on DARWIN, you may use the ACCESS Allocations portal to check allocation usage; however, keep in mind that the `sproject` command available on DARWIN provides the most up-to-date allocation usage information, since the ACCESS Allocations portal is only updated nightly.
Storage Allocations
Every DARWIN Compute or GPU allocation has a storage allocation associated with it on the DARWIN Lustre file system. These allocations are measured in tebibytes and the default amount is 1 TiB. There are no SUs deducted from your allocation for the space you use, but you will be limited to a storage quota based on your awarded allocation.
Each project/workgroup has a folder associated with it referred to as workgroup storage. Every file in that folder will count against that project/workgroup's allocated quota for their workgroup storage.
You can use the `my_quotas` command to check storage usage.
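For example, running the command from a DARWIN login node reports your current usage against your quota; the exact output format is site-specific and not reproduced here:

```
$ my_quotas
```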