abstract:darwin:earlyaccess

  
  * All research groups (PIs only) must submit an [[https://docs.google.com/forms/d/e/1FAIpQLSfM5tkR_HtxRFWTj68rdEVZcJeJ5Z3xkETnZIGt8rxbyunH6w/viewform|application for a startup allocation]] to be able to submit jobs on DARWIN during Phase 2 early access. Again, priority is being given to those who participated in the first early access period.
  * The workgroup **unsponsored** will no longer be available for job submission and has been changed to read-only. All those granted a startup allocation will be notified of their new workgroup and should move any files from ''/lustre/unsponsored/users/<//uid//>'' into the new workgroup directory or move them to alternative storage (see the example after this list).
  * All those granted a startup allocation must be willing to provide feedback to IT during the Phase 2 early access period.
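For example, a minimal sketch of copying files out of the read-only area (the workgroup ''it_css'' and uid ''1201'' are the sample values used later on this page):

<code bash>
# The unsponsored area is read-only, so copy (rather than move) files into
# the new workgroup storage; "it_css" and "1201" are example values.
rsync -av /lustre/unsponsored/users/1201/ /lustre/it_css/users/1201/
</code>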
  
Lustre is designed to use parallel I/O techniques to reduce file-access time. The Lustre filesystems in use at UD are composed of many physical disks using RAID technologies to give resilience, data integrity, and parallelism at multiple levels. There is approximately 1.1 PiB of Lustre storage available on DARWIN. It uses high-bandwidth interconnects such as Mellanox HDR100. Lustre should be used for storing input files, supporting data files, work files, and output files associated with computational tasks run on the cluster.
  
During the Phase 2 early access period:
  
  * Each allocation will be assigned workgroup storage in the Lustre directory (''/lustre/<<//workgroup//>>/'').
  * Each workgroup storage area will have a users directory (''/lustre/<<//workgroup//>>/users/<<//uid//>>'') for each user of the workgroup, to be used as a personal directory for running jobs and storing larger amounts of data.
  * Each workgroup storage area will have a software and VALET directory (''/lustre/<<//workgroup//>>/sw/'' and ''/lustre/<<//workgroup//>>/sw/valet'') to allow users of the workgroup to install software and create VALET package files that need to be shared by others in the workgroup.
  * There will be a 1 TiB quota limit for the workgroup storage (see the sketch below).
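As a rough sketch (workgroup ''it_css'' and uid ''1201'' are the sample values used elsewhere on this page, and the quota check assumes the 1 TiB limit is enforced as a Lustre group quota):

<code bash>
# Workgroup storage layout: expect at least sw/ and users/ per the list above
ls /lustre/it_css/

# Your personal directory within the workgroup storage
cd /lustre/it_css/users/1201

# One way to check usage against the 1 TiB workgroup quota,
# assuming it is enforced as a Lustre group quota
lfs quota -h -g it_css /lustre
</code>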
  
<note important>While all filesystems on the DARWIN cluster utilize hardware redundancies to protect data, there is **no** backup or replication and **no** recovery available for the home or Lustre filesystems.
Each node scratch filesystem disk is only accessible by the node in which it is physically installed. The job scheduling system creates a temporary directory associated with each running job on this filesystem. When your job terminates, the job scheduler automatically erases that directory and its contents.
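A minimal job-script sketch of using the node scratch area (''my_program'' and the file names are hypothetical, and it is assumed the scheduler exports the per-job scratch directory as ''$TMPDIR''; check the DARWIN filesystem documentation for the exact variable):

<code bash>
# Stage work files on the node-local scratch for fast local I/O
cp big-input.dat "$TMPDIR/"
cd "$TMPDIR"
./my_program big-input.dat > results.out

# Copy results back before the job ends; the per-job scratch directory
# and its contents are erased automatically when the job terminates.
cp results.out "$SLURM_SUBMIT_DIR/"
</code>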
  
More detailed information about DARWIN storage and quotas can be found on the <html><span style="color:#ffffff;background-color:#2fa4e7;padding:3px 7px !important;border-radius:4px;">sidebar</span></html> under [[abstract:darwin:filesystems:filesystems|Storage]].
===== Software =====
  
A list of installed software that IT builds and maintains for DARWIN users can be found by [[abstract:darwin:system_access:system_access#logging-on-to-caviness|logging into DARWIN]] and using the VALET command ''vpkg_list''.
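For example (the package name ''openmpi'' is only an illustration; ''vpkg_list'' shows what is actually installed):

<code bash>
# List the software packages IT builds and maintains via VALET
vpkg_list

# Show the versions available for a particular package (example name only)
vpkg_versions openmpi

# Add a package to the current shell environment
vpkg_require openmpi
</code>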
  
Documentation for all software is organized in alphabetical order on the <html><span style="color:#ffffff;background-color:#2fa4e7;padding:3px 7px !important;border-radius:4px;">sidebar</span></html> under [[software:software|Software]]. There will likely not be cluster-specific details for DARWIN yet; however, the Caviness details should still be applicable for now.

There will **not** be a full set of software during early access and testing, but we will be continually installing and updating software. Installation priority will go to compilers, system libraries, and highly utilized software packages. Please DO let us know if there are packages that you would like to use on DARWIN, as that will help us prioritize user needs, but understand that we may not be able to fulfill software requests in a timely manner.
      
<note important>Users will be able to compile and install software packages in their home or workgroup directories. There will be very limited support for helping with user-compiled installs or debugging during early access. Please reference [[technical:recipes:software-managment|basic software building and management]] to get started with software installations utilizing VALET (versus Modules) as suggested and used by IT RCI staff on our HPC systems.</note>
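A minimal sketch of a user-compiled install into the workgroup software area (the package name ''myapp'', its version, and the build steps are hypothetical; ''$WORKDIR'' points at the workgroup storage once the workgroup is set):

<code bash>
# Enter your workgroup so $WORKDIR points at the workgroup storage
workgroup -g it_css

# Build and install into the shared workgroup software area
# ("myapp" and its configure/make build system are hypothetical)
cd ~/src/myapp-1.0
./configure --prefix="$WORKDIR/sw/myapp/1.0"
make
make install

# A corresponding VALET package definition can then be placed under
# $WORKDIR/sw/valet/ so others in the workgroup can vpkg_require it.
</code>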
  
Please review the following documents if you are planning to compile and install your own software:
===== Scheduler =====
  
DARWIN uses the Slurm scheduler, like Caviness; it is the most common scheduler among XSEDE resources. Slurm on DARWIN is configured for fairshare scheduling, with each user given equal shares of the HPC resources currently available on DARWIN.
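For instance, standard Slurm commands can show how fairshare is applied to your account (exact output depends on your allocation):

<code bash>
# Show your fairshare shares and effective usage
sshare -u $USER

# Show the priority factors (including fairshare) of your pending jobs
sprio -u $USER
</code>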
  
==== Queues (Partitions) ====
  
During Phase 2 early access, partitions have been created based on the different node types to align with allocation requests moving forward. There is no default partition, and you may only specify one partition at a time; it is not possible to specify multiple partitions in Slurm to span different node types.
  
See [[abstract/darwin/runjobs/queues|Queues]] on the <html><span style="color:#ffffff;background-color:#2fa4e7;padding:3px 7px !important;border-radius:4px;">sidebar</span></html> for detailed information about the available partitions on DARWIN.
  
We fully expect the partition limits to be changed and adjusted during the early access period.
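To inspect the partitions and their current limits directly on the cluster (the partition name ''standard'' below is only a placeholder):

<code bash>
# List all partitions with their availability, time limits, and node counts
sinfo -o "%P %a %l %D"

# Show the full configuration of a single partition (name is a placeholder)
scontrol show partition standard
</code>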
==== Run Jobs ====
  
In order to schedule any job (interactive or batch) on the DARWIN cluster, you must set your workgroup to define your cluster group. For Phase 2 early access, each research group has been assigned a unique workgroup, and each research group should have received this information in their Phase 2 early access welcome email. For example,
  
<code bash>
workgroup -g it_css
</code>
  
will enter the workgroup ''it_css''. You will know you are in your workgroup based on the change in your bash prompt. See the following example for user ''traine'':
  
<code bash>
[traine@login00.darwin ~]$ workgroup -g it_css
[(it_css:traine)@login00.darwin ~]$ printenv USER HOME WORKDIR WORKGROUP WORKDIR_USER
traine
/home/1201
/lustre/it_css
it_css
/lustre/it_css/users/1201
[(it_css:traine)@login00.darwin ~]$
</code>
  
Now we can use ''salloc'' or ''sbatch'' to submit an interactive or batch job, respectively, as long as a [[abstract:darwin:runjobs:queues|partition]] is also specified. See the DARWIN [[abstract:darwin:runjobs:runjobs|Run Jobs]], [[abstract:darwin:runjobs:schedule_jobs|Schedule Jobs]] and [[abstract:darwin:runjobs:job_status|Managing Jobs]] wiki pages for more help with Slurm, including how to specify resources and check on the status of your jobs.
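As a brief sketch (the partition name ''standard'', the script name ''myjob.qs'', and the resource values are placeholders; pick a partition from the Queues page):

<code bash>
# Interactive job: one task for 30 minutes (partition name is a placeholder)
salloc --partition=standard --ntasks=1 --time=30:00

# Batch job: the partition can also be set with "#SBATCH --partition=..." in the script
sbatch --partition=standard myjob.qs
</code>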
  
  
<note tip>It is a good idea to periodically check ''/opt/shared/templates/slurm/'' for updated or new [[technical:slurm:darwin:templates:start|templates]] to use as job scripts to run generic or specific applications, designed to provide the best performance on DARWIN.</note>
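For example, to browse the templates and copy one as a starting point (the destination name ''myjob.qs'' is arbitrary):

<code bash>
# Browse the maintained job script templates
ls -R /opt/shared/templates/slurm/

# Copy a template of your choice into your work area, e.g.:
#   cp /opt/shared/templates/slurm/<template>.qs myjob.qs
</code>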
See [[abstract/darwin/runjobs/|Run jobs]] on the <html><span style="color:#ffffff;background-color:#2fa4e7;padding:3px 7px !important;border-radius:4px;">sidebar</span></html> for detailed information about running jobs on DARWIN, and specifically [[abstract:darwin:runjobs:schedule_jobs#command-options|Schedule job options]] for memory, time, GPUs, etc.
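For illustration only (the partition name, GPU request, and values below are placeholders; the supported options for DARWIN are documented on the Schedule job options page):

<code bash>
# Request 4 tasks, 16 GiB of memory, one GPU, and a 2-hour time limit
# (all values and the partition name are placeholders)
sbatch --partition=gpu-t4 --ntasks=4 --mem=16G --gpus=1 --time=2:00:00 myjob.qs
</code>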
===== Help =====
  