In Generation 1 of Caviness, two node types were offered to stakeholders: traditional compute nodes and GPU-enhanced compute nodes. Each type was upgradeable (at purchase) to memory sizes above the baseline 128 GB. Based on stakeholder feedback, Generation 2 of Caviness will offer one new node type and will expand the GPU-enhanced type into three variants.
All nodes will use Intel Cascade Lake Xeon Scalable processors, which are two generations newer than the Broadwell processors present in Generation 1 nodes. Cascade Lake offers significant performance improvements over Broadwell, including the AVX-512 instruction set extensions with the additional Vector Neural Network Instructions (VNNI). A small amount of local scratch storage and a high-speed Intel Omni-Path network port will also be present.
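VNNI's benefit is easiest to see in code. The sketch below is illustrative only (not taken from any Caviness software): it computes an int8 dot product where a single VNNI instruction handles 64 multiply-accumulate pairs, work that a Broadwell CPU would need a longer multiply/widen/add sequence to perform.

```c
/* Minimal AVX-512 VNNI sketch: int8 dot product. Compile with
 * -march=cascadelake (GCC/Clang); array contents are arbitrary. */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a[64];
    int8_t  b[64];
    for (int i = 0; i < 64; i++) { a[i] = 1; b[i] = 2; }

    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);

    /* VPDPBUSD: multiply u8 x s8 pairs, sum each group of four, and
     * accumulate into sixteen 32-bit lanes -- one instruction for all
     * 64 element pairs.                                              */
    __m512i acc = _mm512_dpbusd_epi32(_mm512_setzero_si512(), va, vb);

    /* Horizontal sum of the sixteen 32-bit partial results. */
    int32_t out = _mm512_reduce_add_epi32(acc);
    printf("dot product = %d\n", out);  /* 64 * 1 * 2 = 128 */
    return 0;
}
```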
Each Intel Cascade Lake processor has six memory channels versus a Broadwell processor's four, so optimally balanced memory sizes consist of six memory DIMMs per processor. With four additional cores per node, a baseline 192 GB memory size has been adopted in Generation 2:
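As a quick check of that baseline figure (assuming the common 16 GB DIMM size, which is not stated here):

$$2\ \text{CPUs} \times 6\ \text{DIMMs per CPU} \times 16\ \text{GB per DIMM} = 192\ \text{GB}$$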
Nodes can have system memory upgraded at the time of purchase:
While the 1024 GB (1 TB) option does use more than six memory DIMMs per CPU, the vendor cited workload performance tests that demonstrated no significant performance impact.
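A hedged sketch of why the 1 TB size breaks the six-DIMMs-per-CPU balance, assuming standard power-of-two DIMM capacities (not specified here): six equal DIMMs per CPU yield either 6 × 64 = 384 GB or 6 × 128 = 768 GB, bracketing the 512 GB per CPU that 1 TB requires, so a population such as eight 64 GB DIMMs per CPU would be needed:

$$\frac{1024\ \text{GB}}{2\ \text{CPUs}} = 512\ \text{GB per CPU} = 8 \times 64\ \text{GB}$$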
Generation 1 made available GPU nodes containing one nVidia P100 (Pascal) GPGPU coprocessor attached to each CPU's PCIe bus. Stakeholder feedback over the past year has highlighted several deficiencies in that design:
For point (1), the nVidia T4 is a lower-power, physically smaller model suited to A.I. inference (not training) workloads and to general CUDA functionality:
For other workloads, the nVidia V100 (Volta) GPGPU is a generation newer than the P100 used in Generation 1. Each V100 features:
To address point (3), the final metric for each V100 is augmented:
These data led to the design of three GPU node variants for inclusion in Caviness Generation 2.
The low-end GPU design is suited for workloads that are not necessarily GPU-intensive. Programs that offload some processing via CUDA libraries (either directly or via OpenACC language extensions) may see some performance enhancement on this node type:
Since the nVidia T4 does include tensor cores, this node may also efficiently handle some A.I. inference workloads. A.I. training workloads are better handled by the other GPU node types.
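As an illustration of the offload model described above (a minimal sketch, not code from the Caviness documentation), the loop below carries an OpenACC directive; a compiler such as nvc with -acc generates the CUDA kernel and data movement, so the program can run on a T4 node with no explicit CUDA code:

```c
/* Illustrative OpenACC offload sketch: a simple vector update.
 * Sizes and values are arbitrary; compile with e.g. nvc -acc. */
#include <stdio.h>

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The compiler offloads this loop to the GPU, handling kernel
     * generation and host/device data transfers automatically.    */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
    return 0;
}
```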
Akin to the GPU node design in Generation 1, the all-purpose GPU design features one nVidia V100 per CPU. While any GPGPU workload is permissible on this node type, inter-GPGPU performance is not maximized:
This node type is optimal for workgroups with mixed GPGPU workloads: CUDA offload, A.I. inference and training, and traditional non-GPGPU tasks.
Maximizing both the number of GPGPU coprocessors and inter-GPGPU performance, the high-end GPU design doubles the GPGPU count and links the devices via an NVLink interconnect:
The high-end option is meant to address stakeholder workloads that are very GPU-intensive.
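One way to observe the difference between designs from software is to query whether the GPUs in a node can perform direct peer-to-peer transfers; on NVLink-equipped nodes those transfers traverse the much faster NVLink fabric rather than PCIe. The sketch below uses the standard CUDA runtime API (output format is illustrative; error handling is minimal):

```c
/* Query direct peer-to-peer access between all GPU pairs in a node.
 * Compile with nvcc.                                                */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d: peer access %s\n",
                   i, j, can ? "available" : "unavailable");
        }
    }
    return 0;
}
```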
On Caviness the Lustre file system is typically leveraged when scratch storage in excess of the 960 GB provided by local SSD is necessary. Because Lustre is a network-shared medium, though, some workloads do not perform as well on it as they would with larger, faster storage local to the node. Software that caches data and accesses it frequently or in small, random-access patterns may not perform well on Lustre. Some stakeholders indicated a desire for a larger amount of fast scratch storage physically present in the compute node.
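To make that access pattern concrete, here is a minimal sketch (hypothetical file path, arbitrary loop count) of the small random reads that tend to perform poorly on Lustre but well on node-local storage:

```c
/* Illustrative only: many small reads at random offsets. Assumes the
 * file is much larger than one 4 KiB block.                          */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/nvme-scratch/cache.dat", O_RDONLY); /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    off_t size = lseek(fd, 0, SEEK_END);

    /* Each 4 KiB read at a random offset costs a network round trip
     * on Lustre but only a fast local operation on node-local NVMe. */
    for (long i = 0; i < 100000; i++) {
        off_t off = rand() % (size - (off_t)sizeof buf);
        if (pread(fd, buf, sizeof buf, off) < 0) perror("pread");
    }
    close(fd);
    return 0;
}
```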
Generation 1 featured two compute nodes with dual 3.2 TB NVMe storage devices. A scratch file system was striped across the two NVMe devices to make efficient use of both. These nodes (r02s00 and r02s01) are available to all Caviness users for testing.
The Generation 2 design increases the capacity of the fast local scratch significantly: