This document summarizes an issue affecting a variety of MPI workflows on the DARWIN cluster and the mitigation that was put in place.
IT RCI has noticed from time to time that users of the VASP software on DARWIN have had jobs end prematurely, with Slurm transitioning the nodes involved into the DRAIN state because the processes did not terminate when signaled. On those nodes the kernel logged messages of this form:
```
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] mlx5_core 0000:01:00.0: mlx5_cmd_check:794:(pid 57074): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
  :
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57074): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57109): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57097): create mkey failed
```
Core dumps were eventually collected from one affected user. They indicated that, for the given job, the crash systematically occurred at the same code location and on the same node (node 3 of 4) across several separate runs, including runs on homogeneous and heterogeneous node types. All 64 ranks on node 3 of 4 crashed while distributing initial state among their orthogonal k-point ranks:
KPAR_SYNC_ALL() ⇒ KPAR_SYNC_FERTOT() ⇒ M_sum_master_d() ⇒ MPI_Reduce()
The M_sum_master_d() subroutine splits the incoming array of REAL*8 into chunks of a smaller dimension (1000 words by default) and calls MPI_Reduce() on each chunk in a loop. Monitoring the MR cache state (via properties exposed under /sys/class/infiniband/mlx5_0/mr_cache) showed cache size and usage growing very quickly (and very large) right up until the first create mkey failed message was logged.
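To make the pattern concrete, here is a minimal Fortran sketch of a chunked reduction in the spirit of M_sum_master_d(); the routine and variable names are illustrative and do not reproduce VASP's actual mpi.F source:

```fortran
! Minimal sketch of the chunked-reduction pattern; names are
! illustrative, not VASP's actual mpi.F source.
SUBROUTINE sum_master_chunked(vec, n, comm)
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)   :: n, comm
  REAL*8, INTENT(INOUT) :: vec(n)
  INTEGER, PARAMETER    :: NBLK = 1000   ! chunk size (compile-time MPI_BLOCK)
  REAL*8                :: work(NBLK)    ! statically-dimensioned work array
  INTEGER               :: i, m, rank, ierr

  CALL MPI_COMM_RANK(comm, rank, ierr)
  ! Reduce the array one NBLK-sized chunk at a time; each MPI_Reduce()
  ! presents another buffer region to the transport layer.
  DO i = 1, n, NBLK
     m = MIN(NBLK, n - i + 1)
     CALL MPI_REDUCE(vec(i), work, m, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, 0, comm, ierr)
     IF (rank == 0) vec(i:i+m-1) = work(1:m)
  END DO
END SUBROUTINE sum_master_chunked
```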
Since Open MPI was being used, it was important to determine which collective mechanism was employed underneath MPI_Reduce(). The core dumps revealed that the UCX PML module was involved and that the fault was raised downstream, in UCX's ucp_rndv_progress_rma_zcopy_common() function.
The solution to the issue came in two distinct stages.
With MPI_Reduce() operations over a very large array, the first attempted mitigation was to increase the size of the chunks in M_sum_master_d(). For the sake of testing, the desire was to make the chunk size configurable at runtime. The vendor's code (see mpi.F) includes static work arrays that are dimensioned at compile time. The code was rewritten to dynamically allocate those work arrays, using the compiled-in size by default but overriding it with the value of VASP_MPI_BLOCK in the environment, e.g.
```
$ mpirun vasp_std
  :
 Using default MPI_BLOCK size:  8000
 MPI work arrays allocated using MPI_BLOCK
  :
$ VASP_MPI_BLOCK=131072 mpirun vasp_std
  :
 Using MPI_BLOCK size from environment:  131072
 MPI work arrays allocated using MPI_BLOCK
  :
```
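For illustration, a hedged sketch of how such a runtime override might look: read VASP_MPI_BLOCK from the environment (falling back to the compiled-in default) and allocate the work arrays to that size. The module and routine names here are hypothetical; the actual patch lives in the repository referenced below.

```fortran
! Hypothetical sketch of runtime-sized work arrays; names are
! illustrative, not the patched mpi.F itself.
MODULE mpi_block_mod
  IMPLICIT NONE
  INTEGER :: mpi_block = 8000            ! compiled-in default size
  REAL*8, ALLOCATABLE :: dwork(:)        ! work array, sized at startup
CONTAINS
  SUBROUTINE init_mpi_block()
    CHARACTER(LEN=32) :: str
    INTEGER :: stat, val
    ! Override the default with VASP_MPI_BLOCK, if present and valid.
    CALL GET_ENVIRONMENT_VARIABLE('VASP_MPI_BLOCK', str, STATUS=stat)
    IF (stat == 0) THEN
       READ(str, *, IOSTAT=stat) val
       IF (stat == 0 .AND. val > 0) mpi_block = val
    END IF
    ALLOCATE(dwork(mpi_block))
  END SUBROUTINE init_mpi_block
END MODULE mpi_block_mod
```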
Testing showed that the introduction of the variable MPI_BLOCK and the dynamically-allocated work arrays did not affect the performance of VASP. However, the change only delayed the onset of the MR cache exhaustion; it did not remove it. Nevertheless, a variable MPI_BLOCK size seems to be a very useful feature and has been brought to the attention of the VASP developers. Patches for various VASP releases exist in this GitHub repository.
Ideally, it would be even more useful to respond to VASP_MPI_BLOCK=none or VASP_MPI_BLOCK=0 by not fragmenting the array at all and instead issuing a single MPI_Reduce(). Since many modern transport libraries underpinning MPI perform fragmentation themselves, and only as necessary, the conditions that prompted M_sum_master_d()'s chunking back in the era of the Intel Pentium no longer exist.
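If implemented, that unfragmented path might look like the following sketch (illustrative only, not part of the existing patches): a single MPI_Reduce() over the full array, with a temporary receive buffer on the root.

```fortran
! Illustrative only: the proposed unfragmented path for
! VASP_MPI_BLOCK=none/0 -- one MPI_Reduce() over the whole array.
SUBROUTINE sum_master_whole(vec, n, comm)
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)   :: n, comm
  REAL*8, INTENT(INOUT) :: vec(n)
  REAL*8, ALLOCATABLE   :: recv(:)
  INTEGER :: rank, ierr

  CALL MPI_COMM_RANK(comm, rank, ierr)
  ALLOCATE(recv(n))
  ! Let the transport layer fragment the transfer as it sees fit.
  CALL MPI_REDUCE(vec, recv, n, MPI_DOUBLE_PRECISION, MPI_SUM, 0, comm, ierr)
  IF (rank == 0) vec = recv
  DEALLOCATE(recv)
END SUBROUTINE sum_master_whole
```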
When runtime variation of MPI_BLOCK did not completely remove the issue, further testing was performed. The data collected eventually led to Google searches that returned a GitHub issue in the openucx project. In the dialog associated with that issue, one interesting point was raised:
```
#
# Maximal number of regions in the registration cache
#
# syntax:    unsigned long: <number>, "inf", or "auto"
# inherits:  UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf
```
By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware. If the MR cache can accommodate N_lines memory regions and an MPI job uses N_r ranks for a collective, then each rank has an effective limit of N_lines / N_r memory regions it can register. Obviously, as N_r grows, the MPI program has the potential to saturate the MR cache at a rate N_r times faster than a serial task.
Empirical observation of the MR cache behavior showed sizes in the neighborhood of 600k when the mkey registration failed: for a node with 64 CPU cores and an MPI job communicating across all 64 ranks, a per-rank limit of ca. 9375 (600000 / 64) is indicated. A very conservative limit on UCX's registration behavior was tried first:
```
$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std
```
With the addition of a finite UCX_IB_RCACHE_MAX_REGIONS to the environment, the program made it past the initial KPAR_SYNC_ALL() call and successfully iterated through the wavefunction minimization loop.
The default of UCX_IB_RCACHE_MAX_REGIONS=inf does not necessarily impact all workloads: the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein. But the fact that UCX by default over-subscribes the hardware's capabilities is problematic, since most users will have a difficult time debugging this issue when and if it arises.
To address the problem in a global sense on DARWIN, the ucx VALET package has been modified:

- a ucx/system version is now present that makes no modifications to PATH et al.
- UCX_IB_RCACHE_MAX_REGIONS is set to 1000 in the environment
The openmpi VALET package has also been modified to load a version of ucx with every version of openmpi: older Open MPI releases have ucx/system as a dependency, while newer releases already included a dependency on ucx/1.13.1. Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have UCX_IB_RCACHE_MAX_REGIONS set in their runtime environment:
```
$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000

$ vpkg_rollback all

$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000
```
Users who have built their own MPI libraries and do not use VALET (or do not have ucx as a dependency) are encouraged to effect this change in their own runtime environments.
When setting UCX_IB_RCACHE_MAX_REGIONS in a job's runtime environment, please do not exceed a value of 9000 unless you have explicitly allocated more on-node tasks to the job than you will use. E.g. requesting --nodes=1 --ntasks=8 but running the MPI program with just 2 ranks implies that UCX_IB_RCACHE_MAX_REGIONS=$((9000*4)) is permissible, since each of the 2 ranks can claim the cache share budgeted for 4 of the allocated tasks.