====== UCX MR caching and large-scale MPI collectives ======

This document summarizes an issue observed across a variety of MPI workflows on the DARWIN cluster and the mitigation that was put in place.

===== Problem =====

IT RCI has noticed from time to time that users of the [[https://www.vasp.at/|VASP]] software on DARWIN have had jobs end prematurely, after which Slurm would transition the nodes involved into the DRAIN state because the processes did not terminate when signaled. On those nodes the kernel would log messages of this form:

<code>
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] mlx5_core 0000:01:00.0: mlx5_cmd_check:794:(pid 57074): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
[Thu Oct 31 21:32:31 2024] infiniband mlx5_0: create_mkey_callback:148:(pid 57074): async reg mr failed. status -12
   :
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57074): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57109): create mkey failed
[Thu Oct 31 21:32:34 2024] infiniband mlx5_0: reg_create:1089:(pid 57097): create mkey failed
</code>

Eventually, core dumps were collected from one affected user. They showed that, for the given job, the crash systematically occurred at the same code location and on the same node (3 of 4) across several separate runs (including runs on homogeneous and heterogeneous node types). All 64 ranks on node 3 of 4 crashed while distributing initial state among their orthogonal k-point ranks:

|''KPAR_SYNC_ALL()''|⇒|''KPAR_SYNC_FERTOT()''|⇒|''M_sum_master_d()''|⇒|''MPI_Reduce()''|

The ''M_sum_master_d()'' subroutine splits the incoming array of ''REAL*8'' values into chunks of a smaller fixed dimension (1000 words by default) and calls ''MPI_Reduce()'' on each chunk in a loop.

Monitoring the MR cache state (via the properties exposed under ''/sys/class/infiniband/mlx5_0/mr_cache'') showed cache size and usage growing very quickly (and very large) right up until the first ''create mkey failed'' message was logged; a monitoring sketch is shown at the end of this section.

Since Open MPI was being used, it was important to note which collective mechanism was employed underneath the ''MPI_Reduce()''. The core dumps revealed that the UCX PML module was involved and that the fault asserted downstream, in UCX's ''ucp_rndv_progress_rma_zcopy_common()'' function.
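The cache growth described above can be watched from a shell on an affected node. The following is a minimal sketch, not the exact procedure used; it assumes an mlx5 driver that exposes a per-bucket ''size'' attribute under the directory named above, and attribute names can vary with driver and kernel version.

<code bash>
#!/bin/bash
# Poll the mlx5 MR cache counters once per second and print a running total
# of cached memory regions.  The per-bucket "size" attribute is an assumption
# here; adjust for the attributes your driver version actually exposes.
MR_CACHE_DIR=/sys/class/infiniband/mlx5_0/mr_cache

while true; do
    total=0
    for f in "${MR_CACHE_DIR}"/*/size; do
        [ -r "$f" ] && total=$(( total + $(cat "$f") ))
    done
    printf '%s  total cached MRs: %d\n' "$(date +%T)" "$total"
    sleep 1
done
</code>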
===== Debugging =====

The solution to the issue came in two distinct stages.

==== MPI_BLOCK ====

With

  * the terminal VASP function stacking many small ''MPI_Reduce()'' operations over a very large array,
  * a large number of ranks hitting the MR cache concurrently (64 ranks per node), and
  * the error being an exhaustion of that cache,

the first attempt at a mitigation was to increase the size of the chunks in ''M_sum_master_d()''. For testing purposes it was desirable to make the chunk size configurable at runtime. The vendor's code (see ''mpi.F'') includes static work arrays that are dimensioned at compile time. The code was rewritten to dynamically allocate those work arrays, using the compiled-in size by default and overriding it with the value of ''VASP_MPI_BLOCK'' when present in the environment, e.g.

<code>
$ mpirun vasp_std
   :
 Using default MPI_BLOCK size: 8000
 MPI work arrays allocated using MPI_BLOCK
   :

$ VASP_MPI_BLOCK=131072 mpirun vasp_std
   :
 Using MPI_BLOCK size from environment: 131072
 MPI work arrays allocated using MPI_BLOCK
   :
</code>

Testing showed that the introduction of the runtime-variable ''MPI_BLOCK'' and dynamically-allocated work arrays did **not** affect the performance of VASP. However, the change only //delayed// the occurrence of the MR cache exhaustion; it did not remove it.

Nevertheless, a variable ''MPI_BLOCK'' size seems to be a very useful feature and has been brought to the attention of the VASP developers. Patches for various VASP releases exist in this [[https://github.com/jtfrey/vasp-dynamic-mpi-block-patch|Github repository]]. Ideally, it would be even more useful to respond to ''VASP_MPI_BLOCK=none'' or ''VASP_MPI_BLOCK=0'' by **not** fragmenting the array at all and instead issuing a single ''MPI_Reduce()''. Since many modern transport libraries underpinning MPI perform fragmentation themselves, and only as necessary, the conditions that prompted ''M_sum_master_d()'' back in the era of the Intel Pentium no longer exist.

==== UCX control ====

When runtime variation of ''MPI_BLOCK'' did not completely remove the issue, further testing was performed. The data collected eventually led to Google searches that returned a [[https://github.com/openucx/ucx/issues/6264|Github issue in the openucx project]]. In the dialog associated with that issue, one interesting point was raised:

<code>
#
# Maximal number of regions in the registration cache
#
# syntax: unsigned long: <number>, "inf", or "auto"
# inherits: UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf
</code>

By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware. If the MR cache can accommodate //N_lines// memory regions and an MPI job uses //N_r// ranks for a collective, then each rank has an effective limit of //N_lines / N_r// memory regions it can register. As //N_r// grows, the MPI program can therefore saturate the MR cache at a rate //N_r// times faster than a serial task would. Empirical observation of the MR cache behavior showed sizes in the neighborhood of 600k when mkey registration failed: for a node with 64 CPU cores and an MPI job exercising all 64 ranks at once, a per-rank limit of ca. 9375 (600000 / 64) is indicated.

A very conservative limit on UCX's registration behavior was tried first:

<code>
$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std
</code>

With a finite ''UCX_IB_RCACHE_MAX_REGIONS'' added to the environment, the program made it past the initial ''KPAR_SYNC_ALL()'' call and successfully iterated through the wavefunction minimization loop.
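Whether a particular runtime environment is still subject to the unlimited default can be checked with UCX's own tooling. A minimal sketch, assuming the ''ucx_info'' utility from the UCX installation in use is on the ''PATH'':

<code bash>
# Print the registration-cache limits as UCX resolves them, including any
# UCX_* overrides present in the environment.
ucx_info -c | grep RCACHE_MAX_REGIONS

# A value of "inf" for UCX_IB_RCACHE_MAX_REGIONS means the unlimited default
# is in effect; a finite value (e.g. the 500 used in the test above) means a
# cap has been applied.
</code>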
===== Solution =====

The default of ''UCX_IB_RCACHE_MAX_REGIONS=inf'' does not necessarily impact all workloads: the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein. But the fact that UCX's default over-subscribes the hardware's capabilities is problematic, since most users will have a difficult time debugging this issue when and if it arises. To address the problem in a global sense on DARWIN, the ''ucx'' VALET package has been modified:

  - A ''ucx/system'' version is now present that makes no modifications to ''PATH'' et al.
  - For all versions, ''UCX_IB_RCACHE_MAX_REGIONS'' is set to ''1000'' in the environment.

The ''openmpi'' VALET package has also been modified to load a version of ''ucx'' with every version of ''openmpi'': older Open MPI releases now have ''ucx/system'' as a dependency, while newer releases already included a dependency on ''ucx/1.13.1''. Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have ''UCX_IB_RCACHE_MAX_REGIONS'' set in their runtime environment:

<code>
$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000

$ vpkg_rollback all

$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000
</code>

Users who have built their own MPI libraries and do not use VALET, or do not have ''ucx'' as a dependency, are encouraged to effect this change in their own runtime environments (a batch-script sketch is shown below). When setting ''UCX_IB_RCACHE_MAX_REGIONS'' in a job's runtime environment, **please do not exceed a value of 9000** unless you have explicitly allocated more on-node tasks to the job than you will actually use. E.g. requesting ''--nodes=1 --ntasks=8'' and running the MPI program with just 2 ranks leaves a 4x margin (8 allocated tasks shared by 2 ranks), so ''%%UCX_IB_RCACHE_MAX_REGIONS=$((9000*4))%%'' is permissible.
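For reference, a minimal Slurm batch-script sketch for such a self-managed toolchain follows. The scheduler options and the ''mpirun vasp_std'' launch line are placeholders for whatever the user's own workflow requires; only the exported variables reflect the guidance above.

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64

# Cap UCX's per-process IB registration cache; stay at or below 9000
# per the guidance above (this example runs one rank per allocated task).
export UCX_IB_RCACHE_MAX_REGIONS=1000

# Optional: enlarge VASP's reduction chunk size (only meaningful if the
# VASP build includes the dynamic MPI_BLOCK patch referenced earlier).
export VASP_MPI_BLOCK=131072

# Placeholder launch command for a user-built Open MPI + VASP installation.
mpirun vasp_std
</code>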