June 2019 Caviness Expansion: Node Specifications
In Generation 1 of Caviness, two node types were offered to stakeholders: traditional compute nodes and GPU-enhanced compute nodes. Each type could be upgraded (at purchase) to memory sizes above the baseline of 128 GB. Based on stakeholder feedback, Generation 2 of Caviness will offer one new node type and will expand the GPU-enhanced type into three variants.
Standard features
All nodes will use Intel Xeon Scalable (Cascade Lake) processors, which are two generations newer than the Broadwell processors present in Generation 1 nodes. Cascade Lake offers significant performance improvements over Broadwell, including the AVX-512 instruction set extensions with the additional Vector Neural Network Instructions (VNNI). A small amount of local scratch storage and a high-speed Intel Omni-path network port will also be present.
- (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
- (1) 960 GB SSD local scratch disk
- (1) port, 100 Gbps Intel Omni-path network
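As a quick sanity check from a batch job on the new nodes, AVX-512 and VNNI capability can be confirmed from the CPU feature flags reported by the kernel. A minimal sketch in Python (assuming a Linux node; avx512f and avx512_vnni are the flag names used in /proc/cpuinfo):

```python
# Minimal sketch: confirm AVX-512 and VNNI support from the Linux CPU flags.
# Assumes a Linux node; flag names are those reported in /proc/cpuinfo.

def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported by the kernel."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx512f", "avx512_vnni"):
    print(f"{feature}: {'present' if feature in flags else 'absent'}")
```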
System memory
Each Intel Cascade Lake processor has six memory channels versus a Broadwell processor's four, so an optimally balanced memory configuration comprises six memory DIMMs per processor (twelve per node). With four additional cores per node, a baseline memory size of 192 GB has been adopted in Generation 2:
- 192 GB — 12 x 16 GB DDR4-2666MHz (registered ECC)
Nodes can have system memory upgraded at the time of purchase:
- 384 GB — 12 x 32 GB DDR4-2666MHz (registered ECC)
- 768 GB — 12 x 64 GB DDR4-2666MHz (registered ECC)
- 1024 GB — 16 x 64 GB DDR4-2666MHz (registered ECC)
While the 1024 GB (1 TB) option does use more than six memory DIMMs per CPU (eight per CPU, leaving the memory channels unevenly populated), the vendor cited workload performance tests that demonstrated no significant performance impact.
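The sizing arithmetic behind these options is straightforward; the short sketch below simply restates it (two sockets, six channels per socket) and is not vendor configuration tooling:

```python
# Sketch of the memory sizing arithmetic described above.
SOCKETS = 2            # CPUs per node
CHANNELS_PER_CPU = 6   # Cascade Lake memory channels per processor

def node_memory_gb(dimm_size_gb, dimms_per_cpu=CHANNELS_PER_CPU):
    """Total node memory for a given DIMM size and DIMM count per CPU."""
    return SOCKETS * dimms_per_cpu * dimm_size_gb

print(node_memory_gb(16))                    # 192 GB baseline (12 x 16 GB)
print(node_memory_gb(32))                    # 384 GB (12 x 32 GB)
print(node_memory_gb(64))                    # 768 GB (12 x 64 GB)
print(node_memory_gb(64, dimms_per_cpu=8))   # 1024 GB (16 x 64 GB, channels unevenly populated)
```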
GPU nodes
Generation 1 made available GPU nodes containing one nVidia P100 (Pascal) GPGPU coprocessor attached to each CPU's PCIe bus. Stakeholder feedback over the past year has highlighted several deficiencies in that design:
1. High-end GPGPU coprocessors are expensive and have features that some workloads will never utilize; would it be possible to include a less-expensive, lower-end GPGPU in nodes?
2. Would it be possible to have more than two high-end GPGPU coprocessors in a node?
3. Would it be possible to use the nVidia NVLINK GPGPU interconnect to maximize performance?
To address point (1), the nVidia T4 is a lower-power, physically smaller model suitable for A.I. inference (not training) workloads and general CUDA functionality:
- 2560 CUDA cores
- 320 Turing tensor cores
- 16 GB GDDR6 ECC memory
- 32 GB/s inter-GPGPU bandwidth (PCIe interface)
For other workloads, the nVidia V100 (Volta) GPGPU is a generation newer than the P100 used in Generation 1. Each V100 features:
- 5120 CUDA cores
- 640 tensor cores
- 32 GB HBM2 ECC memory
- 32 GB/s inter-GPGPU bandwidth (PCIe interface)
To address point (3), the NVLINK-capable SXM2 form of the V100 improves on the final metric above:
- 300 GB/s inter-GPGPU bandwidth (SXM2 interface)
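A back-of-the-envelope comparison of the two interconnect figures illustrates why NVLINK matters for GPU-to-GPU traffic; this is purely illustrative and ignores latency and contention:

```python
# Illustrative only: time to move a GPU-resident data set between two V100s
# at the nominal PCIe and NVLINK (SXM2) bandwidths quoted above.
PCIE_GBPS = 32     # GB/s, PCIe interface
NVLINK_GBPS = 300  # GB/s, SXM2/NVLINK interface

data_gb = 16  # e.g., half of a V100's 32 GB HBM2 memory

print(f"PCIe:   {data_gb / PCIE_GBPS:.3f} s")    # ~0.500 s
print(f"NVLINK: {data_gb / NVLINK_GBPS:.3f} s")  # ~0.053 s
```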
These data led to the design of three GPU node variants for inclusion in Caviness Generation 2.
Low-end GPU
The low-end GPU design is suited for workloads that are not necessarily GPU-intensive. Programs that offload some processing via CUDA libraries (either directly or via OpenACC language extensions) may see some performance enhancement on this node type:
- (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
- (1) 960 GB SSD local scratch disk
- (1) port, 100 Gbps Intel Omni-path network
- (1) nVidia T4
- 2560 CUDA cores
- 320 Turing tensor cores
Since the nVidia T4 does include tensor cores, this node may also efficiently handle some A.I. inference workloads. A.I. training workloads are better handled by the other GPU node types.
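As an illustration of the library-based offload this node type targets, the sketch below uses the CuPy package (an assumption here; it is one of several CUDA-accelerated libraries, and OpenACC-annotated code follows the same pattern) to move a matrix product onto the GPGPU and copy the result back:

```python
# Sketch of CUDA library offload; assumes the CuPy package is installed
# and a CUDA-capable GPGPU (such as the T4) is visible to the job.
import numpy as np
import cupy as cp

a_host = np.random.rand(4096, 4096).astype(np.float32)
b_host = np.random.rand(4096, 4096).astype(np.float32)

a_gpu = cp.asarray(a_host)   # copy inputs to GPGPU memory
b_gpu = cp.asarray(b_host)
c_gpu = a_gpu @ b_gpu        # matrix multiply runs on the GPGPU
c_host = cp.asnumpy(c_gpu)   # copy the result back to host memory
```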
All-purpose GPU
Akin to the GPU node design in Generation 1, the all-purpose GPU design features one nVidia V100 per CPU. While any GPGPU workload is permissible on this node type, the two V100s communicate over PCIe, so inter-GPGPU performance is not maximized:
- (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
- (1) 960 GB SSD local scratch disk
- (1) port, 100 Gbps Intel Omni-path network
- (2) nVidia V100 (PCIe interface)
- 10240 CUDA cores
- 1280 tensor cores
This node type is optimal for workgroups with mixed GPGPU workloads: CUDA offload, A.I. inference and training, and traditional non-GPGPU tasks.
High-end GPU
The high-end GPU design maximizes both the number of GPGPU coprocessors and inter-GPGPU performance: it doubles the GPGPU count of the all-purpose design to four and links the coprocessors via the NVLINK interconnect:
- (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
- (1) 960 GB SSD local scratch disk
- (1) port, 100 Gbps Intel Omni-path network
- (4) nVidia V100 (SXM2 interface)
- 20480 CUDA cores
- 2560 tensor cores
The high-end option is meant to address stakeholder workloads that are very GPU-intensive.
Enhanced local scratch
On Caviness, the Lustre file system is typically leveraged when scratch storage in excess of the 960 GB provided by the local SSD is necessary. Because Lustre is a network-shared medium, though, some workloads do not perform as well on it as they would with larger, fast storage local to the node. Software that caches data and accesses that data frequently, or in small, random-access patterns, may not perform well on Lustre. Some stakeholders indicated a desire for a larger amount of fast scratch storage physically present in the compute node.
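The problematic access pattern can be pictured with a short sketch that issues many small, random reads; the scratch path below is hypothetical and would be pointed at node-local storage or at Lustre to compare behavior:

```python
# Sketch of a small, random-access read pattern that tends to favor fast
# node-local scratch over a network-shared file system such as Lustre.
# SCRATCH_FILE is a hypothetical path; substitute an actual scratch location.
import os
import random

SCRATCH_FILE = "/tmp/scratch_test.bin"   # assumption: a node-local scratch path
BLOCK = 4096                             # 4 KB reads
READS = 10_000

# Create a modest test file if one is not already present.
if not os.path.exists(SCRATCH_FILE):
    with open(SCRATCH_FILE, "wb") as f:
        f.write(os.urandom(256 * 1024 * 1024))   # 256 MB

size = os.path.getsize(SCRATCH_FILE)
with open(SCRATCH_FILE, "rb") as f:
    for _ in range(READS):
        f.seek(random.randrange(0, size - BLOCK))
        f.read(BLOCK)
```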
Generation 1 featured two compute nodes with dual 3.2 TB NVMe storage devices; a scratch file system was striped across the two devices to utilize both efficiently. These nodes (r02s00 and r02s01) are available to all Caviness users for testing.
The Generation 2 design increases the capacity of the fast local scratch significantly, from 6.4 TB (2 x 3.2 TB) per node in Generation 1 to 32 TB (8 x 4 TB):
- (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
- (1) port, 100 Gbps Intel Omni-path network
- (8) 4 TB NVMe