====== Job Accounting on DARWIN ======

Accounting for jobs on DARWIN varies with the type of node used within a given allocation type. There are two types of allocations:

  - Compute - for CPU-based nodes with 512 GiB, 1,024 GiB, or 2,048 GiB of RAM
  - GPU - for GPU-based nodes with NVIDIA Tesla T4, NVIDIA Tesla V100, or AMD Radeon Instinct MI50 GPUs

For all allocations and node types, usage is defined in terms of a Service Unit (SU). The definition of an SU varies with the type of node being used.

**IMPORTANT:** When a job is submitted, its SUs are calculated and pre-debited based on the resources requested, placing a hold on those SUs and deducting them from your project/workgroup's allocation credit. Once the job completes, the amount of SUs actually debited is based on the time used. Keep in mind that if you request 20 cores and your job only takes advantage of 10, the job is still billed for the requested 20 cores. Likewise, specifying a time limit of 2 days rather than 2 hours may prevent others in your project/workgroup from running jobs, since those SUs are unavailable until the job completes. On the other hand, if you do not request enough resources and your job fails (e.g., not enough time or not enough cores), you will still be billed for those SUs. See [[abstract:darwin:runjobs:schedule_jobs#command-options|Scheduling Jobs Command options]] for help with specifying resources.

**Moral of the story:** Request only the resources your job needs. Over- or under-requesting resources wastes allocation credits for everyone in your project/workgroup.

**Interactive jobs:** An interactive job is billed the SUs associated with the full wall time of its execution, not just the CPU time accrued during its duration. For example, if you leave an interactive job running for 2 hours but execute code for only 2 minutes, your allocation is billed for 2 hours of time, not 2 minutes.
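As a hypothetical worked example of the pre-debit behavior described above (assuming a standard compute node, where 1 SU corresponds to one core for one hour; the numbers are illustrative only):

```python
# Hypothetical request on a standard compute node, where 1 SU = 1 core-hour.
requested_cores = 20   # cores requested in the job script
requested_hours = 48   # a 2-day time limit
actual_hours = 2       # the job actually finishes in 2 hours

# Pre-debited (held) against the allocation when the job is submitted,
# based on the resources requested:
pre_debit = requested_cores * requested_hours    # 960 SUs held

# Final charge once the job completes: still the requested 20 cores,
# but only for the time actually used:
final_charge = requested_cores * actual_hours    # 40 SUs billed
```

Until the job completes, the full 960 SUs are unavailable to others in the workgroup, even though only 40 SUs are ultimately debited.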
Please review the SUs associated with each type of resource requested ([[abstract:darwin:runjobs:accounting#compute-allocations|compute]], [[abstract:darwin:runjobs:accounting#gpu-allocations|gpu]]) and the associated SUs billed per hour. //If you need SU-to-dollar conversions based on your DARWIN allocation for grant proposals or reports, please submit a [[https://services.udel.edu/TDClient/32/Portal/Requests/TicketRequests/NewForm?ID=D5ZRIgFlfLw_|Research Computing High Performance Computing (HPC) Clusters Help Request]] and complete the form, including DARWIN and indicating in the description field that you are requesting SU-to-dollar conversions.//

===== Compute Allocations =====

A Compute allocation on DARWIN can be used on any of the four compute node types. Each compute node has 64 cores, but the amount of memory varies by node type. The available resources for each node type are below:

^Compute Node ^Number of Nodes ^Total Cores ^Memory per Node ^Total Memory^
|Standard | 48| 3,072| 512 GiB| 24 TiB|
|Large Memory | 32| 2,048| 1,024 GiB| 32 TiB|
|Extra-Large Memory | 11| 704| 2,048 GiB| 22 TiB|
|Extended Memory | 1| 64| 1,024 GiB + 2.73 TiB((1,024 GiB of system memory and 2.73 TiB of swap on high-speed Intel Optane NVMe storage))| 3.73 TiB|
|**Total** | 92| 5,888| | 81.73 TiB|

A Service Unit (SU) on compute nodes corresponds to the use of one compute core for one hour. The number of SUs charged for a job is based on the fraction of total cores or the fraction of total memory the job requests, whichever is larger.
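As a rough sketch (not an official billing tool), this "whichever is larger" rule can be modeled by converting requested memory into core-equivalents using each node type's GiB-per-core ratio (memory per node divided by its 64 cores). The function and node-type names below are illustrative assumptions, and the Extended Memory node is omitted because it is always billed as a whole node:

```python
import math

# GiB of RAM per core on each compute node type
# (memory per node divided by the node's 64 cores).
GIB_PER_CORE = {
    "standard": 8,        # 512 GiB / 64 cores
    "large-memory": 16,   # 1,024 GiB / 64 cores
    "xlarge-memory": 32,  # 2,048 GiB / 64 cores
}

def compute_sus_per_hour(node_type: str, cores: int, mem_gib: float) -> int:
    """Estimated SUs billed per hour: the larger of the cores requested
    and the requested memory expressed in whole cores' worth of RAM."""
    mem_in_core_equivalents = math.ceil(mem_gib / GIB_PER_CORE[node_type])
    return max(cores, mem_in_core_equivalents)

# 1 core + 512 GiB on a standard node is billed like all 64 cores:
compute_sus_per_hour("standard", 1, 512)     # 64 SUs per hour
# 2 cores + 48 GiB on a large-memory node needs 3 cores' worth of RAM:
compute_sus_per_hour("large-memory", 2, 48)  # 3 SUs per hour
```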
This results in the following SU conversions:

^Compute Node ^SU Conversion ^
|Standard |1 unit = 1 core + 8 GiB RAM for one hour |
|Large Memory |1 unit = 1 core + 16 GiB RAM for one hour |
|Extra-Large Memory |1 unit = 1 core + 32 GiB RAM for one hour |
|Extended Memory |64 units = 64 cores + 1,024 GiB RAM + 2.73 TiB swap for one hour((always billed as the entire node)) |

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

^Node Type ^Cores ^Memory ^SUs billed per hour ^
|Standard | 1| 1 GiB to 8 GiB| 1 SU |
|Standard | 1| 504 GiB to 512 GiB| 64 SUs((512 GiB RAM on a standard node is equivalent to using all 64 cores, so you are charged as if you used 64 cores)) |
|Standard | 64| 1 GiB to 512 GiB| 64 SUs |
|Standard | 2| 1 GiB to 16 GiB| 2 SUs |
|Large Memory | 2| > 32 GiB and ≤ 48 GiB| 3 SUs((RAM usage exceeds what is available with 2 cores on a large memory node, so you are charged as if you used 3 cores)) |

Note that these are estimates based on nominal memory. Actual charges are based on available memory, which is lower than nominal memory due to the memory requirements of the OS and system daemons.

===== GPU Allocations =====

A GPU allocation on DARWIN can be used on any of the three GPU node types. The NVIDIA T4 and AMD MI50 nodes have 64 cores each, while the NVIDIA V100 nodes have 48 cores each. The available resources for each node type are below:

^GPU Node ^Number of Nodes ^Total Cores ^Memory per Node ^Total Memory ^Total GPUs^
|nvidia-T4 | 9| 576| 512 GiB| 4.5 TiB| 9|
|nvidia-V100 | 3| 144| 768 GiB| 2.25 TiB| 12|
|AMD-MI50 | 1| 64| 512 GiB| 0.5 TiB| 1|
|**Total** | 13| 784| | 7.25 TiB| 22|

A Service Unit (SU) on GPU nodes corresponds to the use of one GPU device for one hour. The number of SUs charged for a job is based on the fraction of total GPUs, fraction of total cores, or fraction of total memory the job requests, whichever is larger.
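The same largest-fraction rule can be sketched for GPU nodes using hypothetical per-GPU bundles derived from the table above (a node's cores and memory divided by its GPU count). The ceiling rounding of partial bundles is an assumption of this sketch; the conversion table and billed examples below are authoritative where they differ:

```python
import math

# Hypothetical per-GPU resource bundles (cores, GiB RAM), derived from the
# node table above: node resources divided by the number of GPUs per node.
PER_GPU = {
    "nvidia-t4":   (64, 512),  # 1 GPU per node
    "amd-mi50":    (64, 512),  # 1 GPU per node
    "nvidia-v100": (12, 192),  # 4 GPUs per node: 48 cores / 4, 768 GiB / 4
}

def gpu_sus_per_hour(node_type: str, gpus: int, cores: int, mem_gib: float) -> int:
    """Estimated SUs billed per hour: the largest of the GPU count and the
    requested cores/memory expressed in whole GPU-equivalent bundles."""
    cores_per_gpu, mem_per_gpu = PER_GPU[node_type]
    return max(gpus,
               math.ceil(cores / cores_per_gpu),
               math.ceil(mem_gib / mem_per_gpu))

# A full T4 node is 1 SU per hour:
gpu_sus_per_hour("nvidia-t4", 1, 64, 512)    # 1 SU per hour
# 1 V100 GPU with 384 GiB RAM uses 2 GPUs' worth of memory:
gpu_sus_per_hour("nvidia-v100", 1, 12, 384)  # 2 SUs per hour
```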
Because the NVIDIA T4 and AMD MI50 nodes have only 1 GPU each, you have access to all available memory and cores for 1 SU. The NVIDIA V100 nodes have 4 GPUs each, so the available memory and cores per GPU are 1/4 of the totals available on a node. This results in the following SU conversions:

^GPU Node ^SU Conversion ^
|nvidia-T4 |1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour |
|AMD-MI50 |1 unit = 1 GPU + 64 cores + 512 GiB RAM for one hour |
|nvidia-V100 |1 unit = 1 GPU + 12 cores + 192 GiB RAM for one hour |

See the examples below for illustrations of how SUs are billed by the intervals in the conversion table:

^Node Type ^GPUs ^Cores ^Memory ^SUs billed per hour ^
|nvidia-T4 | 1| 1 to 64| 1 GiB to 512 GiB| 1 SU |
|nvidia-T4 | 2| 2 to 128| 2 GiB to 1,024 GiB| 2 SUs |
|AMD-MI50 | 1| 1 to 64| 1 GiB to 512 GiB| 1 SU |
|nvidia-V100 | 1| 1 to 12| 1 GiB to 192 GiB| 1 SU |
|nvidia-V100 | 2| 1 to 24| 1 GiB to 384 GiB| 2 SUs |
|nvidia-V100 | 1| 25 to 48| 1 GiB to 192 GiB| 2 SUs((billed as if you were using 2 GPUs due to the proportion of CPU cores used)) |
|nvidia-V100 | 1| 1 to 24| > 192 GiB and ≤ 384 GiB| 2 SUs((billed as if you were using 2 GPUs due to the proportion of memory used)) |

Note that these are estimates based on nominal memory. Actual charges are based on available memory, which is lower than nominal memory due to the memory requirements of the OS and system daemons.

===== The idle partition =====

Jobs that execute in the [[abstract:darwin:runjobs:queues#the-idle-partition|idle partition]] do not result in charges against your allocation(s). If your jobs can support [[abstract:darwin:runjobs:schedule_jobs#handling-system-signals-aka-checkpointing|checkpointing]], the idle partition enables you to continue your research even if you exhaust your allocation(s). However, jobs submitted to the other partitions, which are charged against allocations, **take priority** and may cause ''idle'' partition jobs to be **preempted**.
Since jobs in the ''idle'' partition do not result in charges, you will not see them in the output of the ''sproject'' command documented below. You can still use [[abstract:darwin:runjobs:job_status#checking-job-status|standard Slurm commands to check the status]] of those jobs.

====== Checking Allocation Usage ======

===== sproject =====

UD IT has created the ''sproject'' command to allow various queries against allocations (UD and XSEDE) on DARWIN. You can see the help documentation for ''sproject'' by running ''sproject -h'' or ''sproject %%--%%help''. The ''-h/%%--%%help'' flag also works for any of the subcommands: ''sproject allocations -h'', ''sproject projects -h'', ''sproject jobs -h'', or ''sproject failures -h''. For all ''sproject'' commands you can specify an //output-format// of ''table'', ''csv'', or ''json'' using the ''%%--%%format'' or ''-f'' options.

==== sproject allocations ====

The ''allocations'' subcommand shows information for resource allocations granted to projects/workgroups on DARWIN of which you are a member.
To see a specific workgroup's allocations, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject allocations -g it_css
Project id Alloc id Alloc descr Category RDR Start date                End date
---------- -------- ----------- -------- --- ------------------------- -------------------------
         2        3 it_css::cpu startup  cpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00
         2        4 it_css::gpu startup  gpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00
         2       43 it_css::cpu startup  cpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00
         2       44 it_css::gpu startup  gpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00
</code>

The ''%%--%%detail'' flag will show additional information reflecting the credits, running + completed job charges, debits, and balance of each allocation:

<code>
$ sproject allocations -g it_css --detail
Project id Alloc id Alloc descr Category RDR Credit Run+Cmplt Debit Balance
---------- -------- ----------- -------- --- ------ --------- ----- -------
         2        3 it_css::cpu startup  cpu 108500         0 -1678  106822
         2        4 it_css::gpu startup  gpu    417         0   -16     401
         2       43 it_css::cpu startup  cpu  33333         0     0   33333
         2       44 it_css::gpu startup  gpu  33333         0     0   33333
</code>

The ''%%--%%by-user'' flag is helpful for seeing detailed allocation usage broken out by project user:

<code>
$ sproject allocations -g it_css --by-user
Project id Alloc id Alloc descr Category RDR User Transaction Amount Comments
---------- -------- ----------- -------- --- -------- ----------- ------ -------------------------
2 3 it_css::cpu startup cpu credit 167 Request approved
jnhuffma 3 it_css::cpu credit 108333 Testing allocation
3 it_css::cpu jnhuffma debit -1678
2 4 it_css::gpu startup gpu jnhuffma debit -16
4 it_css::gpu credit 417 Testing allocation
2 43 it_css::cpu startup cpu credit 33333
2 44 it_css::gpu startup gpu credit 33333
</code>

==== sproject projects ====

The ''projects'' subcommand shows information (such as the project id, group id, name, and creation date) for projects/workgroups on DARWIN of which you are a member.
To see a specific project/workgroup, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject projects -g it_css
Project id Account Group id Group name Creation date
---------- ------- -------- ---------- -------------------------
         2 it_css      1002 it_css     2021-07-12 14:51:57-04:00
</code>

Adding the ''%%--%%detail'' flag will also show each allocation associated with the project:

<code>
$ sproject projects -g it_css --detail
Project id Account Group id Group name Allocation id Category RDR Start date                End date                  Creation date
---------- ------- -------- ---------- ------------- -------- --- ------------------------- ------------------------- -------------------------
         2 it_css      1002 it_css                 3 startup  cpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00 2021-07-12 15:00:46-04:00
                                                   4 startup  gpu 2021-07-12 00:00:00-04:00 2021-07-25 23:59:59-04:00 2021-07-12 15:00:54-04:00
                                                  43 startup  cpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00 2021-07-26 14:47:56-04:00
                                                  44 startup  gpu 2021-07-26 00:00:00-04:00 2022-07-31 23:59:50-04:00 2021-07-26 14:47:56-04:00
</code>

==== sproject jobs ====

The ''jobs'' subcommand shows information (such as the Slurm job id, owner, and amount charged) for individual jobs billed against resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use ''sproject jobs -h'' for complete details.
To see jobs associated with a specific project/workgroup, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject jobs -g it_css
Activity id Alloc id Alloc descr  Job id Owner  Status    Amount Creation date
----------- -------- ------------ ------ ------ --------- ------ -------------------------
      25000       43 it_css::cpu   80000 traine executing  -3966 2021-08-10 18:27:59-04:00
      25001       43 it_css::cpu   80001 traine executing  -3966 2021-08-10 18:52:11-04:00
      25002       43 it_css::cpu   80002 trainf executing  -4200 2021-08-11 13:06:26-04:00
      25003       43 it_css::cpu   80003 traine executing  -1200 2021-08-11 16:07:42-04:00
</code>

Jobs that complete execution are displayed with a status of ''completed'' and the actual billable amount used by the job. At the top and bottom of each hour, completed jobs are //resolved// into per-user debits and disappear from the ''jobs'' listing (see the **sproject allocations** section above for the display of resource allocation credits, debits, and pre-debits).

==== sproject failures ====

The ''failures'' subcommand shows information (such as the Slurm job id, owner, and amount charged) for all jobs that failed to execute due to insufficient allocation balance on resource allocations for projects/workgroups on DARWIN of which you are a member. Various options are available for sorting and filtering; use ''sproject failures -h'' for complete details.
To see failures associated with jobs run as a specific project/workgroup, use the ''-g'' option, as in this example for workgroup ''it_css'':

<code>
$ sproject failures -g it_css
Job id Error message
------ -------------------------------------------------------------
 60472 Requested allocation has insufficient balance: 3641 < 7680
 60473 Requested allocation has insufficient balance: 3641 < 7680
 60476 Requested allocation has insufficient balance: 3641 < 7680
 60474 Requested allocation has insufficient balance: 3641 < 7680
 60475 Requested allocation has insufficient balance: 3641 < 7680
 60471 Requested allocation has insufficient balance: 3641 < 7680
 60478 Requested allocation has insufficient balance: 3641 < 7680
 60477 Requested allocation has insufficient balance: 3641 < 7680
 60487 Requested allocation has insufficient balance: 12943 < 368640
 60488 Requested allocation has insufficient balance: 12943 < 276480
 60489 Requested allocation has insufficient balance: 12943 < 184320
 60490 Requested allocation has insufficient balance: 12943 < 76800
 60491 Requested allocation has insufficient balance: 12943 < 76800
 60492 Requested allocation has insufficient balance: 12943 < 61440
 60493 Requested allocation has insufficient balance: 12943 < 15360
</code>

Adding the ''%%--%%detail'' flag provides further information such as the owner, amount, and creation date:
<code>
$ sproject failures -g it_css --detail
Activity id Alloc id Alloc descr Job id Owner    Amount Error message                                                 Creation date
----------- -------- ----------- ------ -------- ------ ------------------------------------------------------------- -------------------------
         75        3 it_css::cpu  60472 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         76        3 it_css::cpu  60473 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         77        3 it_css::cpu  60476 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         78        3 it_css::cpu  60474 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         79        3 it_css::cpu  60475 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         80        3 it_css::cpu  60471 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:10-04:00
         81        3 it_css::cpu  60478 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:13-04:00
         82        3 it_css::cpu  60477 jnhuffma    128 Requested allocation has insufficient balance: 3641 < 7680    2021-07-13 08:57:13-04:00
         92        3 it_css::cpu  60487 jnhuffma   6144 Requested allocation has insufficient balance: 12943 < 368640 2021-07-13 11:09:00-04:00
         93        3 it_css::cpu  60488 jnhuffma   4608 Requested allocation has insufficient balance: 12943 < 276480 2021-07-13 11:09:24-04:00
         94        3 it_css::cpu  60489 jnhuffma   3072 Requested allocation has insufficient balance: 12943 < 184320 2021-07-13 11:09:40-04:00
         95        3 it_css::cpu  60490 jnhuffma   1280 Requested allocation has insufficient balance: 12943 < 76800  2021-07-13 11:15:08-04:00
         96        3 it_css::cpu  60491 jnhuffma   1280 Requested allocation has insufficient balance: 12943 < 76800  2021-07-13 11:16:00-04:00
         97        3 it_css::cpu  60492 jnhuffma   1024 Requested allocation has insufficient balance: 12943 < 61440  2021-07-13 11:19:16-04:00
         98        3 it_css::cpu  60493 jnhuffma    256 Requested allocation has insufficient balance: 12943 < 15360  2021-07-13 11:20:01-04:00
</code>

===== ACCESS (XSEDE) Allocations =====

For ACCESS allocations on DARWIN, you may use the [[https://allocations.access-ci.org/allocations/summary|ACCESS Allocations portal]] to check allocation usage. However, keep in mind that the [[abstract:darwin:runjobs:accounting#sproject|sproject]] command available on DARWIN provides the most up-to-date allocation usage information, since the ACCESS Allocations portal is only updated nightly.

===== Storage Allocations =====

Every DARWIN Compute or GPU allocation has a storage allocation associated with it on the DARWIN Lustre file system. These allocations are measured in tebibytes, and the default amount is 1 TiB. No SUs are deducted from your allocation for the space you use, but you will be limited to a storage quota based on your awarded allocation. Each project/workgroup has a folder associated with it, referred to as [[abstract:darwin:filesystems:filesystems#workgroup-storage|workgroup storage]]. Every file in that folder counts against the project/workgroup's allocated quota for workgroup storage. You can use the [[abstract:darwin:filesystems:filesystems#quotas-and-usage|my_quotas]] command to check storage usage.