UCX MR caching and large-scale MPI collectives

This document summarizes an issue affecting a variety of MPI workflows on the DARWIN cluster and the mitigation that was put in place.

Problem

From time to time, IT RCI has observed that users of the VASP software on DARWIN have had jobs end prematurely, after which Slurm transitions the nodes involved into the DRAIN state because the processes do not terminate when signaled. On those nodes the kernel logs messages of the following form:

[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] mlx5_core 0000:01:00.0: mlx5_cmd_check:794:(pid 57074): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
  :
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57074): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57109): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57097): create mkey failed

Core dumps were eventually collected from one affected user. For the given job, they showed the crash occurring systematically at the same code location and on the same node (3 of 4) across several separate runs, with both homogeneous and heterogeneous node types. All 64 ranks on node 3 of 4 crashed while distributing initial state among their orthogonal k-point ranks:

KPAR_SYNC_ALL() → KPAR_SYNC_FERTOT() → M_sum_master_d() → MPI_Reduce()

The M_sum_master_d() subroutine splits the incoming array of REAL*8 into chunks of a smaller dimension (the compile-time constant MPI_BLOCK; 8000 words by default in the builds in question) and calls MPI_Reduce() on each chunk in a loop. Monitoring the MR cache state (via the properties exposed under /sys/class/infiniband/mlx5_0/mr_cache) showed cache size and usage growing very quickly (and very large) right up until the first create mkey failed message was logged.
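The pattern at issue is easy to illustrate. The following is a minimal sketch of a chunked global sum to the master rank, assuming a fixed MPI_BLOCK of 8000 words; it is not the vendor's actual mpi.F code, and the routine name is illustrative:

! Minimal sketch of the chunked-reduction pattern (illustrative only;
! this is not the vendor's mpi.F).  The array is summed to the master
! rank in MPI_BLOCK-sized pieces, one MPI_Reduce() per chunk.
subroutine sum_master_chunked(vec, n, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, comm
  real(8), intent(inout) :: vec(n)
  integer, parameter :: MPI_BLOCK = 8000  ! static, compile-time chunk size
  real(8) :: work(MPI_BLOCK)              ! static work array
  integer :: i, m, rank, ierr

  call MPI_Comm_rank(comm, rank, ierr)
  do i = 1, n, MPI_BLOCK
     m = min(MPI_BLOCK, n - i + 1)
     ! Each chunk is a separate collective; under a zero-copy rendezvous
     ! protocol each transfer can pin additional memory regions.
     call MPI_Reduce(vec(i), work, m, MPI_DOUBLE_PRECISION, MPI_SUM, &
                     0, comm, ierr)
     if (rank == 0) vec(i:i+m-1) = work(1:m)
  end do
end subroutine sum_master_chunked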

Since Open MPI was being used, it was important to note which component was carrying out the MPI_Reduce(). The core dumps revealed that the UCX PML module was involved and that the fault surfaced downstream in UCX's ucp_rndv_progress_rma_zcopy_common() function.

Debugging

The solution to the issue came in two distinct stages.

MPI_BLOCK

The first attempt at a mitigation was to increase the size of the chunks in M_sum_master_d(). For the sake of testing, the desire was to make the chunk size configurable at runtime. The vendor's code (see mpi.F) includes static work arrays that are dimensioned at compile time; the code was rewritten to dynamically allocate those work arrays, using the compiled-in size by default and overriding it with the value of VASP_MPI_BLOCK in the environment, e.g.

$ mpirun vasp_std
   :
 Using default MPI_BLOCK size:        8000
 MPI work arrays allocated using MPI_BLOCK
   :
 
$ VASP_MPI_BLOCK=131072 mpirun vasp_std
   :
 Using MPI_BLOCK size from environment:      131072
 MPI work arrays allocated using MPI_BLOCK
   : 
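For reference, here is a minimal sketch of how such an override can be implemented; the routine name is hypothetical, and the real patches live in the GitHub repository noted below:

! Sketch of the runtime override (hypothetical routine name).  The chunk
! size defaults to the compiled-in value and may be overridden via the
! VASP_MPI_BLOCK environment variable.
subroutine init_mpi_block(mpi_block, work)
  implicit none
  integer, parameter :: DEFAULT_MPI_BLOCK = 8000  ! compiled-in default
  integer, intent(out) :: mpi_block
  real(8), allocatable, intent(out) :: work(:)
  character(len=32) :: buf
  integer :: stat

  mpi_block = -1
  call get_environment_variable('VASP_MPI_BLOCK', buf, status=stat)
  if (stat == 0) read(buf, *, iostat=stat) mpi_block
  if (stat == 0 .and. mpi_block > 0) then
     write(*,'(A,I12)') ' Using MPI_BLOCK size from environment:', mpi_block
  else
     mpi_block = DEFAULT_MPI_BLOCK
     write(*,'(A,I12)') ' Using default MPI_BLOCK size:', mpi_block
  end if
  allocate(work(mpi_block))   ! dynamically-allocated work array
  write(*,'(A)') ' MPI work arrays allocated using MPI_BLOCK'
end subroutine init_mpi_block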

Testing showed that the introduction of the variable MPI_BLOCK and the dynamically-allocated work arrays did not affect VASP's performance. However, the change only delayed the MR cache exhaustion; it did not remove it. Nevertheless, a runtime-variable MPI_BLOCK size seems to be a very useful feature in its own right and has been brought to the attention of the VASP developers. Patches for various VASP releases exist in this GitHub repository.

Ideally, it would be even more useful to respond to VASP_MPI_BLOCK=none or VASP_MPI_BLOCK=0 by not fragmenting the array at all and instead issuing a single MPI_Reduce(), as sketched below. Since many modern transport libraries underpinning MPI perform fragmentation themselves, and only as necessary, the conditions that motivated M_sum_master_d() back in the era of the Intel Pentium no longer exist.
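A hypothetical sketch of that behavior, under the same assumptions as the earlier examples (this is not part of the existing patches):

! Hypothetical handling of VASP_MPI_BLOCK=0 / "none": reduce the whole
! array at once and let the transport layer fragment only as it sees fit.
subroutine sum_master_whole(vec, n, comm)
  use mpi
  implicit none
  integer, intent(in)    :: n, comm
  real(8), intent(inout) :: vec(n)
  real(8), allocatable :: work(:)
  integer :: rank, ierr

  call MPI_Comm_rank(comm, rank, ierr)
  allocate(work(n))            ! single full-size work array
  call MPI_Reduce(vec, work, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  if (rank == 0) vec = work
  deallocate(work)
end subroutine sum_master_whole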

UCX control

When runtime variation of MPI_BLOCK did not completely remove the issue, further testing was performed. The data collected eventually led to web searches that returned a GitHub issue in the openucx project. In the discussion attached to that issue, one interesting point was raised:

#
# Maximal number of regions in the registration cache
#
# syntax:    unsigned long: <number>, "inf", or "auto"
# inherits:  UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf

By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware. If the MR cache can accommodate N_lines memory regions and an MPI job uses N_r ranks on the node for a collective, then each rank has an effective limit of N_lines / N_r memory regions it can register. As N_r grows, the MPI program can therefore saturate the MR cache at a rate N_r times faster than a serial task.

Empirical observation of the MR cache showed sizes in the neighborhood of 600000 when the mkey registration failed: for a node with 64 CPU cores and an MPI job exercising all 64 ranks, a per-rank limit of ca. 9375 is indicated. A very conservative limit on UCX's registration behavior was tried first:

$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std

With the addition of a finite UCX_IB_RCACHE_MAX_REGIONS to the environment, the program made it past the initial KPAR_SYNC_ALL() call and successfully iterated through the wavefunction minimization loop.

Solution

The default of UCX_IB_RCACHE_MAX_REGIONS=inf does not necessarily impact all workloads: the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein. But the fact that the hardware capabilities are over-provisioned by default in UCX is problematic, since most users will have a difficult time debugging this issue when and if it arises.

To address the problem in a global sense on DARWIN, the ucx VALET package has been modified:

  1. A ucx/system version is now present that makes no modifications to PATH et al.
  2. For all versions, UCX_IB_RCACHE_MAX_REGIONS is set to 1000 in the environment

The openmpi VALET package has also been modified to load a version of ucx with every version of openmpi: older Open MPI releases now have ucx/system as a dependency, while newer releases already include a dependency on ucx/1.13.1. Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have UCX_IB_RCACHE_MAX_REGIONS set in their runtime environment:

$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment
 
$ echo $UCX_IB_RCACHE_MAX_REGIONS 
1000
 
$ vpkg_rollback all
 
$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment
 
$ echo $UCX_IB_RCACHE_MAX_REGIONS 
1000

Users who have built their own MPI libraries and do not use VALET, or whose builds do not have ucx as a dependency, are encouraged to effect this change in their own runtime environments (e.g. by adding export UCX_IB_RCACHE_MAX_REGIONS=1000 to their job scripts).

When setting UCX_IB_RCACHE_MAX_REGIONS in a job's runtime environment, please do not exceed a value of 9000 unless you have explicitly allocated more on-node tasks to the job than you will use. For example, requesting --nodes=1 --ntasks=8 and running the MPI program with just 2 ranks implies that UCX_IB_RCACHE_MAX_REGIONS=$((9000*4)) is permissible.