Since Open MPI was being used, it was important to note which collective mechanism was employed under the ''MPI_Reduce()''.  The core dumps revealed that the UCX PML module was involved and the fault asserted downstream in UCX's ''ucp_rndv_progress_rma_zcopy_common()'' function.
  
===== Debugging =====
  
The solution to the issue came in two distinct stages.
</code>
  
Testing showed that the introduction of the variable ''MPI_BLOCK'' and dynamically-allocated work arrays did **not** affect the performance of VASP.  However, the change only //delayed// the occurrence of the MR cache exhaustion; it did not remove it.  Nevertheless, a runtime-variable ''MPI_BLOCK'' size seems to be a very useful feature and has been brought to the attention of the VASP developers.  Patches for various VASP releases exist in this [[https://github.com/jtfrey/vasp-dynamic-mpi-block-patch|Github repository]].

Ideally, it would be even more useful to respond to ''VASP_MPI_BLOCK=none'' or ''VASP_MPI_BLOCK=0'' by **not** fragmenting the array at all and instead issuing a single ''MPI_Reduce()''.  Since many modern transport libraries underpinning MPI effect fragmentation themselves, and only as necessary, the conditions that prompted ''M_sum_master_d()'' back in the era of the Intel Pentium no longer exist.

==== UCX control ====

When runtime variation of ''MPI_BLOCK'' did not completely remove the issue, further testing was performed.  The data collected eventually led to Google searches that returned a [[https://github.com/openucx/ucx/issues/6264|Github issue with the openucx project]].  In the dialog associated with the issue, one interesting point was raised:

<code>
#
# Maximal number of regions in the registration cache
#
# syntax:    unsigned long: <number>, "inf", or "auto"
# inherits:  UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf
</code>

By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware.  If the MR cache can accommodate //N_lines// memory regions and an MPI job uses //N_r// ranks for a collective, then each rank has an effective limit of //N_lines / N_r// memory regions it can register.  Obviously, as //N_r// grows, the MPI program has the potential to saturate the MR cache at a rate //N_r// times faster than a serial task.

Empirical observation of the MR cache behavior showed sizes in the neighborhood of 600k regions when the memkey registration failed:  for a node with 64 CPU cores and an MPI job performing an I/O operation across all 64 ranks, a per-rank limit of ca. 9375 is indicated.  A very conservative limit on UCX's registration behavior was tried first:

<code bash>
$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std
</code>

With the addition of a finite ''UCX_IB_RCACHE_MAX_REGIONS'' to the environment, the program made it past the initial ''KPAR_SYNC_ALL()'' call and successfully iterated through the wavefunction minimization loop.
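
As a quick sanity check that such an override is actually visible to UCX, the effective configuration can be dumped and filtered.  This is a minimal sketch, assuming the ''ucx_info'' utility shipped with the UCX installation is on the ''PATH''; the reported value should reflect the override rather than the ''inf'' default:

<code bash>
$ UCX_IB_RCACHE_MAX_REGIONS=500 ucx_info -c | grep IB_RCACHE_MAX_REGIONS
UCX_IB_RCACHE_MAX_REGIONS=500
</code>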

===== Solution =====

The default of ''UCX_IB_RCACHE_MAX_REGIONS=inf'' does not necessarily impact all workloads:  the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein.  But the fact that the hardware capabilities are over-provisioned by default in UCX is problematic, since most users will have a difficult time debugging this issue when and if it arises.

To address the problem in a global sense on DARWIN, the ''ucx'' VALET package has been modified:

  - A ''ucx/system'' version is now present that makes no modifications to ''PATH'' et al.
  - For all versions, ''UCX_IB_RCACHE_MAX_REGIONS'' is set to ''1000'' in the environment.

The ''openmpi'' VALET package has also been modified to load a version of ''ucx'' with every version of ''openmpi'': older Open MPI releases now have ''ucx/system'' as a dependency, while newer releases already include a dependency on ''ucx/1.13.1''.  Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have ''UCX_IB_RCACHE_MAX_REGIONS'' set in their runtime environment:

<code bash>
$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000

$ vpkg_rollback all

$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000
</code>

Users who have built their own MPI libraries and do not use VALET, or whose builds do not have ''ucx'' as a dependency, are encouraged to effect this change in their own runtime environments.
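
For example, a minimal sketch of doing so in a Slurm batch script follows; the executable name is a placeholder, and the limit should be chosen per the note below:

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64

# Cap UCX's InfiniBand memory-region cache for this job before launching
# the MPI program (my_mpi_program is a placeholder for your own executable).
export UCX_IB_RCACHE_MAX_REGIONS=1000

mpirun ./my_mpi_program
</code>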

<WRAP center round alert 60%>
When setting ''UCX_IB_RCACHE_MAX_REGIONS'' in a job's runtime environment, **please do not exceed a value of 9000** unless you have explicitly allocated more on-node tasks to the job than you will use.  E.g. requesting ''--nodes=1 --ntasks=8'' and running the MPI program with just 2 ranks implies that ''%%UCX_IB_RCACHE_MAX_REGIONS=$((9000*4))%%'' is permissible.
</WRAP>
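
As a concrete illustration of that rule, the scenario above works out as follows.  This is a sketch only; the shell variable names are hypothetical:

<code bash>
# 8 on-node tasks allocated but only 2 MPI ranks launched leaves a
# 4x headroom factor over the 9000-region baseline.
ALLOCATED_TASKS=8
RANKS_USED=2
export UCX_IB_RCACHE_MAX_REGIONS=$(( 9000 * ALLOCATED_TASKS / RANKS_USED ))
echo $UCX_IB_RCACHE_MAX_REGIONS    # 36000
</code>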