White Papers

Some of the content in this area is in PDF format and may need to be downloaded before it can be read.

rJava: When Compilers Get Too Smart

While installing all of CRAN's nearly 17,000 packages against a recent R 4.1.3 build, many packages that depend on rJava would hang when being tested. GDB debugging and analysis of both the C source and the runtime assembly code revealed an interesting problem with GCC 11.2's compilation of the code.

Open MPI, PSM2, and MPI_Comm_spawn()

The MPI process-spawning API has not been frequently used on our clusters. A user reported an issue with the Rmpi library and example code that spawns R workers via MPI_Comm_spawn() on the Caviness cluster. The issue was debugged and addressed for all pertinent versions of Open MPI, and is summarized here.
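
For context, Rmpi's worker-spawning routines ultimately reduce to a call to MPI_Comm_spawn(). The minimal C sketch below shows that call in isolation; the "./worker" executable name and the process count of 4 are placeholders, not details taken from the user's report.

  /* Minimal sketch of an MPI_Comm_spawn() call (placeholder worker name and count). */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Comm intercomm;

      MPI_Init(&argc, &argv);

      /* Ask the MPI runtime to launch 4 copies of ./worker and return an
       * inter-communicator connecting this process to the new ones. */
      MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                     0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

      printf("workers spawned\n");

      MPI_Comm_disconnect(&intercomm);
      MPI_Finalize();
      return 0;
  }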

Mellanox UCX and Open MPI on DARWIN

During early-access testing of the DARWIN cluster, several users reported MPI jobs crashing unexpectedly at code locations that had worked on previous clusters (like Caviness). The full troubleshooting and mitigation of the issue should be instructive for users who build and manage their own Open MPI libraries on DARWIN.

/dev/shm exhaustion

As time goes by, the /dev/shm filesystem on compute nodes can fill with orphaned files. Because the nodes lack swap matching the amount of RAM, these files consume memory that cannot be paged out and put pressure on subsequent applications that run on the node. In Automated /dev/shm cleanup, a method of removing orphaned files from /dev/shm is outlined.
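
As a rough illustration of the general idea (not necessarily the method outlined in that write-up), a cleanup pass could delete regular files in /dev/shm whose owning user no longer has any processes on the node. A C sketch under that assumption:

  /* Hypothetical sketch: remove /dev/shm files whose owner has no processes
   * left on this node.  Ownership of /proc/<pid> is used as a proxy for the
   * owning user of each running process. */
  #include <ctype.h>
  #include <dirent.h>
  #include <limits.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Return non-zero if any running process appears to belong to uid. */
  static int uid_has_processes(uid_t uid)
  {
      DIR *proc = opendir("/proc");
      struct dirent *de;
      char path[PATH_MAX];
      struct stat st;
      int found = 0;

      if (!proc)
          return 1;   /* be conservative on error: treat the uid as active */
      while (!found && (de = readdir(proc)) != NULL) {
          if (!isdigit((unsigned char)de->d_name[0]))
              continue;
          snprintf(path, sizeof(path), "/proc/%s", de->d_name);
          if (stat(path, &st) == 0 && st.st_uid == uid)
              found = 1;
      }
      closedir(proc);
      return found;
  }

  int main(void)
  {
      DIR *shm = opendir("/dev/shm");
      struct dirent *de;
      char path[PATH_MAX];
      struct stat st;

      if (!shm)
          return 1;
      while ((de = readdir(shm)) != NULL) {
          if (de->d_name[0] == '.')
              continue;
          snprintf(path, sizeof(path), "/dev/shm/%s", de->d_name);
          if (lstat(path, &st) != 0 || !S_ISREG(st.st_mode))
              continue;
          if (!uid_has_processes(st.st_uid)) {   /* orphaned file */
              printf("removing %s\n", path);
              unlink(path);
          }
      }
      closedir(shm);
      return 0;
  }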

R: runtime configurable BLAS/LAPACK

The R statistical computing software can be built atop a variety of BLAS and LAPACK libraries – including its own internal Rblas and Rlapack libraries. Creating alternate builds of R that vary ONLY in the identity of the underlying BLAS/LAPACK implementation can consume extremely large amounts of disk space (and time!). The runtime-configurable R BLAS/LAPACK whitepaper documents the scheme used on our latest HPC cluster to make the choice of library a runtime configurable option.

Mills: threading performance study

The behavior of the Mills cluster's cutting-edge Interlagos processors is studied under multi-threaded and multi-process workloads. The influence of compiler and BLAS/LAPACK library choice is also presented.

Download the PDF

Mills: AMD Opteron 6200 Unix Tuning Guide

The nodes on the Mills cluster have 2 or 4 AMD Opteron 6200 series sockets. Each socket is a multi-chip module containing two CPU dies interconnected by a HyperTransport link, and each die is organized as 3 core pairs (Interlagos modules). Thus the socket appears to the OS as 2 dies × 3 modules × 2 cores = 12 logical CPUs (a 12-core socket). Resources such as memory and floating-point units are shared between the cores of each pair.

This technical tuning guide is intended for "systems admins, application end-users, and developers on a Linux platform who perform application development, code tuning, optimization, and initial system installation". The document describes this resource sharing and its effect on your applications.

Download the PDF from the AMD developer site

HPC Challenge Awards Competition at SC Conference

The SC1) High Performance Computing Challenge includes the benchmarks:

  1. HPL – measures the floating point rate of execution for solving a linear system of equations
  2. DGEMM – measures the floating point rate of execution of double precision real matrix-matrix multiplication
  3. STREAM – measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for a simple vector kernel (a sketch of its triad kernel appears after this list)
  4. PTRANS (parallel matrix transpose) – exercises communications between pairs of processors; it is a useful test of the total communications capacity of the network
  5. RandomAccess – measures the rate of integer random updates of memory (GUPS)
  6. FFT – measures the floating point rate of execution of double precision complex one-dimensional Discrete Fourier Transform (DFT)
  7. Communication bandwidth and latency – measures latency and bandwidth of a number of simultaneous communication patterns; based on b_eff (effective bandwidth benchmark)
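
For reference, the STREAM triad kernel behind benchmark 3 is just a scaled vector addition timed over arrays far larger than cache. A minimal C sketch (array size and scalar chosen arbitrarily):

  /* Sketch of the STREAM "triad" operation; the benchmark reports the
   * sustained bytes/second moved through memory while running it. */
  #include <stdio.h>
  #include <stdlib.h>

  #define N 20000000   /* elements per array; large enough to defeat caches */

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double *b = malloc(N * sizeof *b);
      double *c = malloc(N * sizeof *c);
      double scalar = 3.0;

      for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      /* Triad: two loads, one store, and two flops per element. */
      for (size_t i = 0; i < N; i++)
          a[i] = b[i] + scalar * c[i];

      printf("a[0] = %g\n", a[0]);
      free(a); free(b); free(c);
      return 0;
  }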

Download the PDF

Matlab: Computational threads on a shared cluster

By default Matlab uses multiple computational threads for standard linear algebra calculations. Without the option -singleCompThread it will use libraries tuned to the computational hardware, for example the sunperf library on Solaris (Strauss) and the MKL library on Intel hardware, including Mills.

To fully use the computational threads you must call the built-in high-level functions or data-parallel constructs in Matlab. For example, it is easy to write loops to do a matrix multiply, but those loops will not take advantage of the multithreaded libraries the way the built-in matrix multiplication does.
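
The same distinction can be illustrated outside of Matlab: a call into a tuned BLAS uses the multithreaded, hardware-specific kernels, while an equivalent hand-written loop nest does not. A C sketch, assuming a CBLAS interface (e.g. from OpenBLAS; MKL names the header mkl_cblas.h) is available and linked:

  /* Sketch: the same matrix product computed two ways.  cblas_dgemm() runs
   * through the tuned, multithreaded BLAS; the triple loop runs on one core
   * with no tuned kernels.  N is arbitrary. */
  #include <cblas.h>
  #include <stdlib.h>

  #define N 1024

  int main(void)
  {
      double *A = calloc((size_t)N * N, sizeof *A);
      double *B = calloc((size_t)N * N, sizeof *B);
      double *C = calloc((size_t)N * N, sizeof *C);
      double *D = calloc((size_t)N * N, sizeof *D);

      /* Tuned path: C = 1.0 * A * B + 0.0 * C */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  N, N, N, 1.0, A, N, B, N, 0.0, C, N);

      /* Naive path into D: same mathematical result, much slower. */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  D[i*N + j] += A[i*N + k] * B[k*N + j];

      free(A); free(B); free(C); free(D);
      return 0;
  }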

Mills: Using ACML In High Performance Computing Challenge

For Mills, the recommended libraries include Open MPI, ACML, and FFTW, and the AMD-recommended compilers include Open64 and PGI. The following document from AMD includes instructions for installing these libraries, but this is not needed on Mills since they are already available as VALET packages.

Download the PDF from the AMD developer site

Mills: Benchmarking studies

High Performance Computing Challenge studies

1)
The International Conference for High Performance Computing, Networking, Storage and Analysis