Getting started on Caviness

The Caviness cluster, UD's third Community Cluster, was deployed in July 2018 and is a distributed-memory Linux cluster. It is based on a rolling-upgradeable model for expansion and replacement of hardware over time. The current configuration consists of 265 compute nodes, 10124 traditional CPU cores, 110080 CUDA cores, 4800 Turing tensor cores, 77 TiB of RAM and 238 TB of NFS storage. An OmniPath network fabric supports high-speed communication and the Lustre filesystem (approximately 476 TB of usable space). Gigabit and 10-Gigabit Ethernet networks provide access to additional filesystems and the campus network. The cluster was purchased with a proposed five-year life for the first-generation hardware, putting its refresh in the April 2023 to June 2023 time period.

For general information and specifications about the Caviness cluster, visit the IT Research Computing website. To cite the Caviness cluster for grants, proposals and publications, use these HPC templates.

An HPC system always has one or more public-facing systems known as login nodes. The login nodes are supplemented by many compute nodes which are connected by a private network. One or more head nodes run programs that manage and facilitate the functioning of the cluster. (In some clusters, the head node functionality is present on the login nodes.) Each compute node typically has several multi-core processors that share memory. Finally, all the nodes share one or more filesystems over a high-speed network.

Login (head) nodes are the gateway into the cluster and are shared by all cluster users. Their computing environment is a full standard variant of Linux configured for scientific applications. This includes command documentation (man pages), scripting tools, compiler suites, debugging/profiling tools, and application software. In addition, the login nodes have several tools to help you move files between the HPC filesystems and your local machine, other clusters, and web-based services.

Login nodes should be used to set up and submit job workflows and to compile programs. You should generally use compute nodes to run or debug application software or your own executables.

If your work requires highly interactive graphics and animations, these are best done on your local workstation rather than on the cluster. Use the cluster to generate files containing the graphics information, and download them from the HPC system to your local system for visualization.

When you use SSH to connect to caviness.hpc.udel.edu, your computer will choose one of the login (head) nodes at random. The default command line prompt indicates which login node you have connected to: for example, [traine@login01 ~]$ is shown for account traine when connected to login node login01.caviness.hpc.udel.edu.

Only use SSH to connect to a specific login node if you have existing processes present on it, for example, if you used the 'screen' or 'tmux' utility to preserve your session after logout.
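As a small illustration of the prompt convention described above, the following Python sketch reads the login node name out of a default prompt and builds the ssh arguments needed to return to that same node, for instance to reattach a 'screen' or 'tmux' session. The function name reconnect_command and the regular expression are hypothetical and inferred only from the single example prompt shown above; a customized prompt will not match.

```python
import re

# Matches the default prompt form shown above, e.g. "[traine@login01 ~]$".
# Assumption: the pattern is inferred from that single example.
PROMPT_RE = re.compile(r"^\[(?P<user>[^@\s]+)@(?P<node>[^\s\]]+)\s+[^\]]*\]\$\s*$")

CLUSTER_DOMAIN = "caviness.hpc.udel.edu"


def reconnect_command(prompt):
    """Build an ssh argument list targeting the login node named in the prompt.

    Returns None when the prompt does not have the default form.
    """
    match = PROMPT_RE.match(prompt)
    if match is None:
        return None
    host = f"{match.group('node')}.{CLUSTER_DOMAIN}"
    return ["ssh", f"{match.group('user')}@{host}"]


if __name__ == "__main__":
    print(reconnect_command("[traine@login01 ~]$"))
```

The function only builds the argument list and does not open a connection, so it is safe to run anywhere.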

There are many compute nodes with different configurations. Each node consists of multi-core processors (CPUs), memory, and local disk space. Nodes can have different OS versions or OS configurations, but this document assumes all the compute nodes have the same OS and almost the same configuration. Some nodes may have more cores, more memory, GPUs, or more disk.
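Because core counts differ between nodes, a program is better off discovering how many cores it may use than hard-coding a number. Here is a minimal Python sketch. It assumes a Linux node, and that the job scheduler confines a job to its allocated cores through CPU affinity, which this page does not state. The function name usable_cpu_count is hypothetical.

```python
import os


def usable_cpu_count():
    """Return the number of CPU cores this process is allowed to run on."""
    try:
        # Linux only: the set of cores the current process may be scheduled on.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total core count.
        return os.cpu_count() or 1


if __name__ == "__main__":
    print("usable cores:", usable_cpu_count())
```

Sizing a worker pool from this value, rather than from the node's total core count, avoids oversubscribing cores when a job has been granted only part of a node.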

The standard Linux on the compute nodes is configured to support just the running of your jobs, particularly parallel jobs. For example, there are no man pages on the compute nodes. Large components of the OS, such as development tools, are only added to that environment when needed.

All the multi-core CPUs and GPUs share the same memory in what may be a complicated manner. To add more processing capability while keeping hardware expense and power requirements down, most architectures use Non-Uniform Memory Access (NUMA). In addition, the processors may share hardware, such as floating-point units (FPUs).

Commercial applications, and normally your programs, will use a layer of abstraction called a programming model. Consult the cluster-specific documentation for advanced techniques to take advantage of the low-level architecture.

Permanent filesystems

At UD, permanent cluster filesystems are those that are backed up or replicated at an off-site disaster recovery facility. This always includes the home filesystem, which contains each user's home directory and has a modest per-user quota. A cluster may also have a larger permanent filesystem used for research group projects. The system is designed to let you recover older versions of files through a self-service process.

High-performance filesystems

One important component of HPC designs is to provide fast access to large files and to many small files. These days, high-performance filesystems have capacities ranging from hundreds of terabytes to petabytes. They are designed to use parallel I/O techniques to reduce file-access time. The Lustre filesystems in use at UD are composed of many physical disks using RAID technologies to give resilience, data integrity, and parallelism at multiple levels. They use high-bandwidth interconnects such as InfiniBand and 10-Gigabit Ethernet.

Large-capacity high-performance filesystems are typically designed as volatile scratch storage systems. The amount of data present makes backing up the filesystem practically and financially infeasible. However, the underlying design increases user confidence by providing a high level of built-in redundancy against hardware failure.

Local filesystems

Each node has an internal, locally connected disk. Its capacity is measured in terabytes. Part of the local disk is used for system tasks such as memory management, which might include cache memory and virtual memory. The remainder of the disk is ideal for applications that need a moderate amount of scratch storage for the duration of a job's run. That portion is referred to as the node scratch filesystem.

Each node scratch filesystem disk is only accessible by the node in which it is physically installed. The job scheduling system creates a temporary directory associated with each running job on this filesystem. When your job terminates, the job scheduler automatically erases that directory and its contents.
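Since the per-job directory is erased when the job ends, any results written there must be copied to a permanent or high-performance filesystem before the job finishes. The Python sketch below shows one way to structure that. It assumes the scheduler exposes the per-job directory through the TMPDIR environment variable (which Python's tempfile module honors); this page does not confirm that, so check the cluster's scheduler documentation. The names run_in_scratch, result_dir, and work are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(result_dir, work):
    """Run work(scratch_path) in a scratch directory, then copy its files to result_dir.

    Returns the sorted names of the files that were copied.
    """
    # mkdtemp creates the directory under tempfile.gettempdir(), which uses
    # TMPDIR when it is set (assumed here to point at the per-job directory).
    scratch = Path(tempfile.mkdtemp(prefix="work_"))
    try:
        work(scratch)
        dest = Path(result_dir)
        dest.mkdir(parents=True, exist_ok=True)
        copied = []
        for item in sorted(scratch.iterdir()):
            if item.is_file():
                shutil.copy2(item, dest / item.name)
                copied.append(item.name)
        return copied
    finally:
        # Clean up our own subdirectory; the scheduler removes the rest.
        shutil.rmtree(scratch, ignore_errors=True)


if __name__ == "__main__":
    def demo(scratch):
        (scratch / "result.txt").write_text("42\n")

    print(run_in_scratch("results", demo))
```

Only top-level regular files are copied in this sketch; a job that writes nested output directories would need shutil.copytree instead.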

A list of installed software that IT builds and maintains for Caviness users can be found by logging into Caviness and using the VALET command vpkg_list.
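If you want to capture that listing from a script, for example to search it or save it to a file, a sketch along the following lines may help. It assumes vpkg_list is an executable found on PATH after login and not a shell function, and it makes no assumption about the output format, returning the raw text unchanged. The function name run_vpkg_list is hypothetical.

```python
import shutil
import subprocess


def run_vpkg_list(command="vpkg_list"):
    """Return the raw text printed by vpkg_list, or None if it is not on PATH."""
    if shutil.which(command) is None:
        # Not logged into the cluster, or VALET is not available in this shell.
        return None
    result = subprocess.run([command], capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    listing = run_vpkg_list()
    if listing is None:
        print("vpkg_list not found; run this on a Caviness login node")
    else:
        print(listing)
```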

Documentation for all software is organized in alphabetical order on the sidebar under Software.

Review NVIDIA's GPU-Accelerated Applications list for applications optimized to work with GPUs. These applications can take advantage of nodes equipped with NVIDIA P100 “Pascal” GPU coprocessors.

Use of some commercial software on Caviness may require that your research group purchase a right-to-use license (e.g. ANSYS, IDL and COMSOL).

If you are experiencing a system-related problem, first check Caviness cluster monitoring and system alerts. To report a new problem, or if you cannot find the help you need on this wiki, submit a Research Computing High Performance Computing (HPC) Clusters Help Request. Complete the form, including Caviness and your problem details in the description field.

hpc-ask is a Google group established to stimulate interactions within UD’s broader HPC community and is based on members helping members. This is a great venue to post a question about HPC, start a discussion, or share an upcoming event with the community. Anyone may request membership. Messages are sent as a daily summary to all group members. This list is archived, public, and searchable by anyone.

HPC templates are available to use for a proposal or publication to acknowledge use of or describe UD’s Information Technologies HPC resources.

  • abstract/caviness/caviness.txt
  • Last modified: 2020-10-26 12:51
  • by anita