====== Job Accounting on DARWIN ======

Accounting for jobs on DARWIN varies with the type of node used within a given allocation type. There are two types of allocations:

  - Compute - for CPU-based nodes with 512 GiB, 1,024 GiB, or 2,048 GiB of RAM
  - GPU - for GPU-based nodes with NVIDIA Tesla T4, NVIDIA Tesla V100, or AMD Radeon Instinct MI50 GPUs

For all allocations and node types, usage is defined in terms of a Service Unit (SU). The definition of an SU varies with the type of node being used.

**IMPORTANT:** When a job is submitted, its SUs are calculated and pre-debited based on the resources requested, placing a hold on those SUs and deducting them from your project/workgroup's allocation credit. Once the job completes, the amount of SUs actually debited is based on the time used. Keep in mind that if you request 20 cores and your job only takes advantage of 10, the job is still billed for the requested 20 cores. Likewise, specifying a time limit of 2 days rather than 2 hours may prevent others in your project/workgroup from running jobs, since those SUs are unavailable until the job completes. On the other hand, if you do not request enough resources and your job fails (e.g., not enough time or not enough cores), you will still be billed for those SUs. See [[abstract:darwin:runjobs:schedule_jobs#command-options|Scheduling Jobs Command options]] for help with specifying resources.

**Moral of the story:** Request only the resources your job needs. Over- or under-requesting resources wastes allocation credits for everyone in your project/workgroup.

**Interactive jobs:** An interactive job is billed the SUs associated with the full wall time of its execution, not just the CPU time accrued during its duration. For example, if you leave an interactive job running for 2 hours but execute code for only 2 minutes, your allocation is billed for 2 hours of time, not 2 minutes.
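As a hypothetical worked example of the pre-debit behavior described above (assuming a standard compute node, where 1 SU corresponds to one core for one hour; the numbers are illustrative only):

```python
# Hypothetical request on a standard compute node, where 1 SU = 1 core-hour.
requested_cores = 20   # cores requested in the job script
requested_hours = 48   # a 2-day time limit
actual_hours = 2       # the job actually finishes in 2 hours

# Pre-debited (held) against the allocation when the job is submitted,
# based on the resources requested:
pre_debit = requested_cores * requested_hours    # 960 SUs held

# Final charge once the job completes: still the requested 20 cores,
# but only for the time actually used:
final_charge = requested_cores * actual_hours    # 40 SUs billed
```

Until the job completes, the full 960 SUs are unavailable to others in the workgroup, even though only 40 SUs are ultimately debited.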
Please review the SUs associated with each type of resource requested ([[abstract:darwin:runjobs:accounting#compute-allocations|compute]], [[abstract:darwin:runjobs:accounting#gpu-allocations|gpu]]) and the associated SUs billed per hour. //If you need SU-to-dollar conversions based on your DARWIN allocation for grant proposals or reports, please submit a [[https://services.udel.edu/TDClient/32/Portal/Requests/TicketRequests/NewForm?ID=D5ZRIgFlfLw_|Research Computing High Performance Computing (HPC) Clusters Help Request]] and complete the form, including DARWIN and indicating in the description field that you are requesting SU-to-dollar conversions.//

===== Compute Allocations =====

A Compute allocation on DARWIN can be used on any of the four compute node types. Each compute node has 64 cores, but the amount of memory varies by node type. The available resources for each node type are below:

^Compute Node ^Number of Nodes ^Total Cores ^Memory per Node ^Total Memory^
|Standard | 48| 3,072| 512 GiB| 24 TiB|
|Large Memory | 32| 2,048| 1,024 GiB| 32 TiB|
|Extra-Large Memory | 11| 704| 2,048 GiB| 22 TiB|
|Extended Memory | 1| 64| 1,024 GiB + 2.73 TiB((1,024 GiB of system memory and 2.73 TiB of swap on high-speed Intel Optane NVMe storage))| 3.73 TiB|
|**Total** | 92| 5,888| | 81.73 TiB|

A Service Unit (SU) on compute nodes corresponds to the use of one compute core for one hour. The number of SUs charged for a job is based on the fraction of total cores or the fraction of total memory the job requests, whichever is larger.
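As a rough sketch (not an official billing tool), this "whichever is larger" rule can be modeled by converting requested memory into core-equivalents using each node type's GiB-per-core ratio (memory per node divided by its 64 cores). The function and node-type names below are illustrative assumptions, and the Extended Memory node is omitted because it is always billed as a whole node:

```python
import math

# GiB of RAM per core on each compute node type
# (memory per node divided by the node's 64 cores).
GIB_PER_CORE = {
    "standard": 8,        # 512 GiB / 64 cores
    "large-memory": 16,   # 1,024 GiB / 64 cores
    "xlarge-memory": 32,  # 2,048 GiB / 64 cores
}

def compute_sus_per_hour(node_type: str, cores: int, mem_gib: float) -> int:
    """Estimated SUs billed per hour: the larger of the cores requested
    and the requested memory expressed in whole cores' worth of RAM."""
    mem_in_core_equivalents = math.ceil(mem_gib / GIB_PER_CORE[node_type])
    return max(cores, mem_in_core_equivalents)

# 1 core + 512 GiB on a standard node is billed like all 64 cores:
compute_sus_per_hour("standard", 1, 512)     # 64 SUs per hour
# 2 cores + 48 GiB on a large-memory node needs 3 cores' worth of RAM:
compute_sus_per_hour("large-memory", 2, 48)  # 3 SUs per hour
```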
This results in the following SU conversions:

^Compute Node ^SU Conversion ^
|Standard |1 unit = 1 core + 8 GiB RAM for one hour |
|Large Memory |1 unit = 1 core + 16 GiB RAM for one hour |
|Extra-Large Memory |1 unit = 1 core + 32 GiB RAM for one hour |
|Extended Memory |64 units = 64 cores + 1,024 GiB RAM + 2.73 TiB swap for one hour((always billed as the entire node)) |

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

^Node Type ^Cores ^Memory ^SUs billed per hour ^
|Standard | 1| 1 GiB to 8 GiB| 1 SU |
|Standard | 1| 504 GiB to 512 GiB| 64 SUs((512 GiB RAM on a standard node is equivalent to using all 64 cores, so you are charged as if you used 64 cores)) |
|Standard | 64| 1 GiB to 512 GiB| 64 SUs |
|Standard | 2| 1 GiB to 16 GiB| 2 SUs |
|Large Memory | 2| > 32 GiB and ≤ 48 GiB| 3 SUs((RAM usage exceeds what is available with 2 cores on a large memory node, so you are charged as if you used 3 cores)) |

Note that these are estimates based on nominal memory. Actual charges are based on available memory, which is lower than nominal memory due to the memory requirements of the OS and system daemons.

===== GPU Allocations =====

A GPU allocation on DARWIN can be used on any of the three GPU node types. The NVIDIA T4 and AMD MI50 nodes have 64 cores each, while the NVIDIA V100 nodes have 48 cores each. The available resources for each node type are below:

^GPU Node ^Number of Nodes ^Total Cores ^Memory per Node ^Total Memory ^Total GPUs^
|nvidia-T4 | 9| 576| 512 GiB| 4.5 TiB| 9|
|nvidia-V100 | 3| 144| 768 GiB| 2.25 TiB| 12|
|AMD-MI50 | 1| 64| 512 GiB| 0.5 TiB| 1|
|**Total** | 13| 784| | 7.25 TiB| 22|

A Service Unit (SU) on GPU nodes corresponds to the use of one GPU device for one hour. The number of SUs charged for a job is based on the fraction of total GPUs, fraction of total cores, or fraction of total memory the job requests, whichever is larger.
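The same largest-fraction rule can be sketched for GPU nodes using hypothetical per-GPU bundles derived from the table above (a node's cores and memory divided by its GPU count). The ceiling rounding of partial bundles is an assumption of this sketch; the conversion table and billed examples below are authoritative where they differ:

```python
import math

# Hypothetical per-GPU resource bundles (cores, GiB RAM), derived from the
# node table above: node resources divided by the number of GPUs per node.
PER_GPU = {
    "nvidia-t4":   (64, 512),  # 1 GPU per node
    "amd-mi50":    (64, 512),  # 1 GPU per node
    "nvidia-v100": (12, 192),  # 4 GPUs per node: 48 cores / 4, 768 GiB / 4
}

def gpu_sus_per_hour(node_type: str, gpus: int, cores: int, mem_gib: float) -> int:
    """Estimated SUs billed per hour: the largest of the GPU count and the
    requested cores/memory expressed in whole GPU-equivalent bundles."""
    cores_per_gpu, mem_per_gpu = PER_GPU[node_type]
    return max(gpus,
               math.ceil(cores / cores_per_gpu),
               math.ceil(mem_gib / mem_per_gpu))

# A full T4 node is 1 SU per hour:
gpu_sus_per_hour("nvidia-t4", 1, 64, 512)    # 1 SU per hour
# 1 V100 GPU with 384 GiB RAM uses 2 GPUs' worth of memory:
gpu_sus_per_hour("nvidia-v100", 1, 12, 384)  # 2 SUs per hour
```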
Because the NVIDIA T4 and AMD MI50 nodes have only 1 GPU each, you have access to all available memory and cores for 1 SU. The NVIDIA V100 nodes have 4 GPUs each, so the available memory and cores per GPU are 1/4 of the totals available on a node. This results in the following SU conversions:

^GPU Node ^SU Conversion ^
|nvidia-T4 |1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour |
|AMD-MI50 |1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour |
|nvidia-V100 |1 unit = 1 GPU + 12 cores + 192 GiB RAM for one hour |

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

^Node Type ^GPUs ^Cores ^Memory ^SUs billed per hour ^
|nvidia-T4 | 1| 1 to 64| 1 GiB to 512 GiB| 1 SU |
|nvidia-T4 | 2| 2 to 128| 2 GiB to 1,024 GiB| 2 SUs |
|AMD-MI50 | 1| 1 to 64| 1 GiB to 512 GiB| 1 SU |
|nvidia-V100 | 1| 1 to 12| 1 GiB to 192 GiB| 1 SU |
|nvidia-V100 | 2| 1 to 24| 1 GiB to 384 GiB| 2 SUs |
|nvidia-V100 | 1| 25 to 48| 1 GiB to 192 GiB| 2 SUs((billed as if you were using 2 GPUs due to the proportion of CPU cores used)) |
|nvidia-V100 | 1| 1 to 24| > 192 GiB and ≤ 384 GiB| 2 SUs((billed as if you were using 2 GPUs due to the proportion of memory used)) |

Note that these are estimates based on nominal memory. Actual charges are based on available memory, which is lower than nominal memory due to the memory requirements of the OS and system daemons.

===== The idle partition =====

Jobs that execute in the [[abstract:darwin:runjobs:queues#the-idle-partition|idle partition]] do not result in charges against your allocation(s). If your jobs can support [[abstract:darwin:runjobs:schedule_jobs#handling-system-signals-aka-checkpointing|checkpointing]], the idle partition enables you to continue your research even if you exhaust your allocation(s). However, jobs submitted to the other partitions, which are charged against allocations, **take priority** and may cause ''idle'' partition jobs to be **preempted**.
Since jobs in the ''idle'' partition do not result in charges, you will not see them in the output of the ''sproject'' command documented below. You can still use [[abstract:darwin:runjobs:job_status#checking-job-status|standard Slurm commands to check the status]] of those jobs.

====== Checking Allocation Usage ======

===== sproject =====

UD IT has created the ''sproject'' command to allow various queries against allocations (UD and XSEDE) on DARWIN. You can see the help documentation for ''sproject'' by running ''sproject -h'' or ''sproject %%--%%help''. The ''-h/%%--%%help'' flag also works for any of the subcommands: ''sproject allocations -h'', ''sproject projects -h'', ''sproject jobs -h'', or ''sproject failures -h''. For all ''sproject'' commands you can specify an //output-format// of ''table'', ''csv'', or ''json'' using the ''%%--%%format'' or ''-f'' options.

==== sproject allocations ====

The ''allocations'' subcommand shows information for resource allocations granted to projects/workgroups on DARWIN of which you are a member.
To see a specific workgroup's allocations, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject allocations -g it_css
Project id Alloc id Alloc descr Category RDR Start date                End date
---------- -------- ----------- -------- --- ------------------------- -------------------------
         2        3 it_css::cpu startup  cpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00
         2        4 it_css::gpu startup  gpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00
         2       43 it_css::cpu startup  cpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00
         2       44 it_css::gpu startup  gpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00
</code>

The ''%%--%%detail'' flag will show additional information reflecting the credits, running + completed job charges, debits, and balance of each allocation:

<code>
$ sproject allocations -g it_css --detail
Project id Alloc id Alloc descr Category RDR Credit Run+Cmplt Debit Balance
---------- -------- ----------- -------- --- ------ --------- ----- -------
         2        3 it_css::cpu startup  cpu 108500         0 -1678  106822
         2        4 it_css::gpu startup  gpu    417         0   -16     401
         2       43 it_css::cpu startup  cpu  33333         0     0   33333
         2       44 it_css::gpu startup  gpu  33333         0     0   33333
</code>

The ''%%--%%by-user'' flag is helpful for seeing detailed allocation usage broken out by project user:

<code>
$ sproject allocations -g it_css --by-user
Project id Alloc id Alloc descr Category RDR User Transaction Amount Comments
---------- -------- ----------- -------- --- -------- ----------- ------ -------------------------
2 3 it_css::cpu startup cpu credit 167 Request approved
jnhuffma 3 it_css::cpu credit 108333 Testing allocation
3 it_css::cpu jnhuffma debit -1678
2 4 it_css::gpu startup gpu jnhuffma debit -16
4 it_css::gpu credit 417 Testing allocation
2 43 it_css::cpu startup cpu credit 33333
2 44 it_css::gpu startup gpu credit 33333
</code>

==== sproject projects ====

The ''projects'' subcommand shows information (such as the project id, group id, name, and creation date) for projects/workgroups on DARWIN of which you are a member.
To see a specific project/workgroup, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject projects -g it_css
Project id Account Group id Group name Creation date
---------- ------- -------- ---------- -------------------------
         2 it_css      1002 it_css     2021-07-12 14:51:57-04:00
</code>

Adding the ''%%--%%detail'' flag will also show each allocation associated with the project:

<code>
$ sproject projects -g it_css --detail
Project id Account Group id Group name Allocation id Category RDR Start date                End date                  Creation date
---------- ------- -------- ---------- ------------- -------- --- ------------------------- ------------------------- -------------------------
         2 it_css      1002 it_css                 3 startup  cpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00 2021-07-12 15:00:46-04:00
                                                   4 startup  gpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00 2021-07-12 15:00:54-04:00
                                                  43 startup  cpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00 2021-07-26 14:47:56-04:00
                                                  44 startup  gpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00 2021-07-26 14:47:56-04:00
</code>

==== sproject jobs ====

The ''jobs'' subcommand shows information (such as the Slurm job id, owner, and amount charged) for individual jobs billed against resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use ''sproject jobs -h'' for complete details.
To see jobs associated with a specific project/workgroup, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject jobs -g it_css
Activity id Alloc id Alloc descr  Job id Owner  Status    Amount Creation date
----------- -------- ------------ ------ ------ --------- ------ -------------------------
      25000       43 it_css::cpu   80000 traine executing  -3966 2021-08-10 18:27:59-04:00
      25001       43 it_css::cpu   80001 traine executing  -3966 2021-08-10 18:52:11-04:00
      25002       43 it_css::cpu   80002 trainf executing  -4200 2021-08-11 13:06:26-04:00
      25003       43 it_css::cpu   80003 traine executing  -1200 2021-08-11 16:07:42-04:00
</code>

Jobs that complete execution are displayed with a status of ''completed'' and the actual billable amount used by the job. At the top and bottom of each hour, completed jobs are //resolved// into per-user debits and disappear from the ''jobs'' listing (see the **sproject allocations** section above for the display of resource allocation credits, debits, and pre-debits).

==== sproject failures ====

The ''failures'' subcommand shows information (such as the Slurm job id, owner, and amount charged) for all jobs that failed to execute due to insufficient allocation balance on resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use ''sproject failures -h'' for complete details.
To see failures associated with jobs run as a specific project/workgroup, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject failures -g it_css
Job id Error message
------ -------------------------------------------------------------
 60472 Requested allocation has insufficient balance: 3641 < 7680
 60473 Requested allocation has insufficient balance: 3641 < 7680
 60476 Requested allocation has insufficient balance: 3641 < 7680
 60474 Requested allocation has insufficient balance: 3641 < 7680
 60475 Requested allocation has insufficient balance: 3641 < 7680
 60471 Requested allocation has insufficient balance: 3641 < 7680
 60478 Requested allocation has insufficient balance: 3641 < 7680
 60477 Requested allocation has insufficient balance: 3641 < 7680
 60487 Requested allocation has insufficient balance: 12943 < 368640
 60488 Requested allocation has insufficient balance: 12943 < 276480
 60489 Requested allocation has insufficient balance: 12943 < 184320
 60490 Requested allocation has insufficient balance: 12943 < 76800
 60491 Requested allocation has insufficient balance: 12943 < 76800
 60492 Requested allocation has insufficient balance: 12943 < 61440
 60493 Requested allocation has insufficient balance: 12943 < 15360
</code>

Adding the ''%%--%%detail'' flag provides further information such as the owner, amount, and creation date:
<code>
$ sproject failures -g it_css --detail
Activity id Alloc id Alloc descr Job id Owner    Amount Error message                                                 Creation date
----------- -------- ----------- ------ -------- ------ ------------------------------------------------------------- -------------------------
         75        3 it_css::cpu  60472 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         76        3 it_css::cpu  60473 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         77        3 it_css::cpu  60476 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         78        3 it_css::cpu  60474 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         79        3 it_css::cpu  60475 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         80        3 it_css::cpu  60471 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         81        3 it_css::cpu  60478 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:13-04:00
         82        3 it_css::cpu  60477 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:13-04:00
         92        3 it_css::cpu  60487 jnhuffma   6144 Requested allocation has insufficient balance: 12943 < 368640 2021-07-13 11:09:00-04:00
         93        3 it_css::cpu  60488 jnhuffma   4608 Requested allocation has insufficient balance: 12943 < 276480 2021-07-13 11:09:24-04:00
         94        3 it_css::cpu  60489 jnhuffma   3072 Requested allocation has insufficient balance: 12943 < 184320 2021-07-13 11:09:40-04:00
         95        3 it_css::cpu  60490 jnhuffma   1280 Requested allocation has insufficient balance: 12943 < 76800  2021-07-13 11:15:08-04:00
         96        3 it_css::cpu  60491 jnhuffma   1280 Requested allocation has insufficient balance: 12943 < 76800  2021-07-13 11:16:00-04:00
         97        3 it_css::cpu  60492 jnhuffma   1024 Requested allocation has insufficient balance: 12943 < 61440  2021-07-13 11:19:16-04:00
         98        3 it_css::cpu  60493 jnhuffma    256 Requested allocation has insufficient balance: 12943 < 15360  2021-07-13 11:20:01-04:00
</code>

===== ACCESS (XSEDE) Allocations =====

For ACCESS allocations on DARWIN, you may use the [[https://allocations.access-ci.org/allocations/summary|ACCESS Allocations portal]] to check allocation usage. However, keep in mind that the [[abstract:darwin:runjobs:accounting#sproject|sproject]] command available on DARWIN provides the most up-to-date allocation usage information, since the ACCESS Allocations portal is only updated nightly.

===== Storage Allocations =====

Every DARWIN Compute or GPU allocation has a storage allocation associated with it on the DARWIN Lustre file system. These allocations are measured in tebibytes, and the default amount is 1 TiB. No SUs are deducted from your allocation for the space you use, but you will be limited to a storage quota based on your awarded allocation. Each project/workgroup has a folder associated with it, referred to as [[abstract:darwin:filesystems:filesystems#workgroup-storage|workgroup storage]]. Every file in that folder counts against the project/workgroup's allocated quota for workgroup storage. You can use the [[abstract:darwin:filesystems:filesystems#quotas-and-usage|my_quotas]] command to check storage usage.