UCX MR caching and large-scale MPI collectives

This document summarizes an issue reported for a variety of MPI workflows on the DARWIN cluster and a mitigation.

IT RCI has noticed from time to time that users of the VASP software on DARWIN have had jobs end prematurely, with Slurm transitioning the nodes involved into the DRAIN state because the processes did not terminate when signaled. On those nodes the kernel would log messages of the following form:

[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] mlx5_core 0000:01:00.0: mlx5_cmd_check:794:(pid 57074): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
  :
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57074): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57109): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57097): create mkey failed

Core dumps were eventually collected from one affected user. They showed that, for the given job, the crash systematically occurred at the same code location and on the same node (node 3 of 4) across several separate runs, including runs on both homogeneous and heterogeneous node types. All 64 ranks on node 3 of 4 crashed while distributing initial state among their orthogonal k-point ranks:

KPAR_SYNC_ALL() → KPAR_SYNC_FERTOT() → M_sum_master_d() → MPI_Reduce()

The M_sum_master_d() subroutine splits the incoming array of REAL*8 into chunks of a smaller dimension (1000 words by default) and calls MPI_Reduce() on each chunk in a loop. Monitoring the MR cache state (via the properties exposed under /sys/class/infiniband/mlx5_0/mr_cache) showed the cache size and usage growing very quickly, and to very large values, right up until the first create mkey failed message was logged.
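
For illustration, the following is a minimal sketch of that chunked-reduction pattern (hypothetical names, not the vendor's mpi.F): each pass of the loop sums one block of the array onto the master rank, so a very large array produces many back-to-back MPI_Reduce() calls.

 subroutine chunked_sum_master(comm, vec, n, nblk)
    use mpi
    implicit none
    integer, intent(in)    :: comm, n, nblk
    real(8), intent(inout) :: vec(n)
    real(8), allocatable   :: work(:)
    integer :: i, nchunk, rank, ierr

    allocate(work(nblk))
    call MPI_Comm_rank(comm, rank, ierr)
    do i = 1, n, nblk
       nchunk = min(nblk, n - i + 1)
       ! One reduction per chunk: with a small nblk and a very large vec,
       ! this loop issues a long burst of MPI_Reduce() calls.
       call MPI_Reduce(vec(i), work, nchunk, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, 0, comm, ierr)
       if (rank == 0) vec(i:i+nchunk-1) = work(1:nchunk)
    end do
    deallocate(work)
 end subroutine chunked_sum_master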

Since Open MPI was being used, it was important to note which collective mechanism was employed under the MPI_Reduce(). The core dumps revealed that the UCX PML module was involved and the fault asserted downstream in UCX's ucp_rndv_progress_rma_zcopy_common() function.

The solution to the issue came in two distinct stages.

With

  • the terminal VASP function stacking many small MPI_Reduce() operations over a very large array
  • a large number of ranks (64 per node) hitting the MR cache concurrently
  • the error indicating exhaustion of that cache

the first attempt at a mitigation was to increase the size of the chunks in M_sum_master_d(). For testing purposes it was desirable to make the chunk size configurable at runtime. The vendor's code (see mpi.F) includes static work arrays that are dimensioned at compile time. The code was rewritten to dynamically allocate those work arrays, using the compiled-in size by default and overriding it with the value of VASP_MPI_BLOCK in the environment, e.g.

$ mpirun vasp_std
   :
 Using default MPI_BLOCK size:        8000
 MPI work arrays allocated using MPI_BLOCK
   :
 
$ VASP_MPI_BLOCK=131072 mpirun vasp_std
   :
 Using MPI_BLOCK size from environment:      131072
 MPI work arrays allocated using MPI_BLOCK
   : 
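
A minimal sketch of that change follows (hypothetical module and variable names, not the vendor's actual mpi.F): the work-array dimension falls back to the compiled-in default of 8000 unless VASP_MPI_BLOCK is present in the environment, and the work arrays are then allocated at runtime.

 module mpi_block_config
    implicit none
    integer, parameter   :: MPI_BLOCK_DEFAULT = 8000
    integer              :: mpi_block = MPI_BLOCK_DEFAULT
    real(8), allocatable :: dwork(:)
 contains
    subroutine init_mpi_block()
       character(len=32) :: val
       integer :: stat, requested

       ! Honor VASP_MPI_BLOCK if it is set to a positive integer;
       ! otherwise keep the compiled-in default.
       call get_environment_variable('VASP_MPI_BLOCK', val, status=stat)
       if (stat == 0) then
          read(val, *, iostat=stat) requested
          if (stat == 0 .and. requested > 0) mpi_block = requested
       end if
       ! Work arrays are allocated here at runtime rather than being
       ! statically dimensioned at compile time.
       allocate(dwork(mpi_block))
    end subroutine init_mpi_block
 end module mpi_block_config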