technical:generic:mpi-and-ucx-mr-cache

Since Open MPI was being used, it was important to note which collective mechanism was employed under the ''MPI_Reduce()''.  The core dumps revealed that the UCX PML module was involved and the fault asserted downstream in UCX's ''ucp_rndv_progress_rma_zcopy_common()'' function.
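
The frames cited above are the sort of thing a stack backtrace of one of the core files exposes.  The sketch below is illustrative only; the executable and core-file names are placeholders, not the actual paths used:

<code bash>
$ gdb ./vasp_std core.12345     # placeholder binary and core-file names
(gdb) bt
   :
</code>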
  
===== Debugging =====
  
The solution to the issue came in two distinct stages.
   :
</code>

Testing showed that the introduction of the variable ''MPI_BLOCK'' and dynamically-allocated work arrays did **not** affect the performance of VASP.  However, the change only //delayed// the occurrence of the MR cache exhaustion; it did not remove it.  Nevertheless, the variable ''MPI_BLOCK'' size seems to be a very useful feature and has been brought to the attention of the VASP developers.  Patches for various VASP releases exist in this [[https://github.com/jtfrey/vasp-dynamic-mpi-block-patch|GitHub repository]].

Ideally, it would be even more useful to respond to ''VASP_MPI_BLOCK=none'' or ''VASP_MPI_BLOCK=0'' by **not** fragmenting the array and instead issuing a single ''MPI_Reduce()''.  Since many modern transport libraries underpinning MPI effect fragmentation themselves, and only as necessary, the conditions that prompted ''M_sum_master_d()'' back in the era of the Intel Pentium no longer exist.

==== UCX control ====

When runtime variation of ''MPI_BLOCK'' did not completely remove the issue, further testing was performed.  The data collected eventually led to Google searches that returned a [[https://github.com/openucx/ucx/issues/6264|GitHub issue in the openucx project]].  In the discussion attached to that issue, one interesting point was raised:

<code>
#
# Maximal number of regions in the registration cache
#
# syntax:    unsigned long: <number>, "inf", or "auto"
# inherits:  UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf
</code>

By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware.  If the MR cache can accommodate //N_lines// memory regions and an MPI job uses //N_r// ranks for a collective, then each rank has an effective limit of //N_lines / N_r// memory regions it can register.  Obviously, as //N_r// grows, the MPI program has the potential to saturate the MR cache at a rate //N_r// times faster than a serial task.

Empirical observation of the MR cache behavior showed sizes in the neighborhood of 600k when the memkey registration failed:  for a node with 64 CPU cores and an MPI job doing an I/O operation across all 64 ranks, a per-rank limit of ca. 600000 / 64 ≈ 9375 is indicated.  A very conservative limit on UCX's registration behavior was tried first:

<code bash>
$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std
</code>

With the addition of a finite ''UCX_IB_RCACHE_MAX_REGIONS'' to the environment, the program made it past the initial ''KPAR_SYNC_ALL()'' call and successfully iterated through the wavefunction minimization loop.

===== Solution =====

The default of ''UCX_IB_RCACHE_MAX_REGIONS=inf'' does not necessarily impact all workloads:  the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein.  But the fact that UCX's default allows the hardware's capabilities to be oversubscribed is problematic, since most users will have a difficult time debugging this issue when and if it arises.

To address the problem in a global sense on DARWIN, the ''ucx'' VALET package has been modified:

  - A ''ucx/system'' version is now present that makes no modifications to ''PATH'' et al.
  - For all versions, ''UCX_IB_RCACHE_MAX_REGIONS'' is set to ''1000'' in the environment.

The ''openmpi'' VALET package has also been modified to load a version of ''ucx'' with every version of ''openmpi'': older Open MPI releases now have ''ucx/system'' as a dependency, while newer releases already include a dependency on ''ucx/1.13.1''.  Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have ''UCX_IB_RCACHE_MAX_REGIONS'' set in their runtime environment:

<code bash>
$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000

$ vpkg_rollback all

$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000
</code>

Users who have built their own MPI libraries and do not use VALET, or do not have ''ucx'' as a dependency, are encouraged to effect this change in their own runtime environments.
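
A minimal sketch of how that might look in a batch job script follows; the scheduler directives, the value ''1000'' (chosen to match the VALET packages above), and the ''vasp_std'' executable are illustrative rather than prescriptive:

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64

# Cap UCX's InfiniBand memory-region registration cache for this job:
export UCX_IB_RCACHE_MAX_REGIONS=1000

mpirun vasp_std
</code>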

<WRAP center round alert 60%>
When setting ''UCX_IB_RCACHE_MAX_REGIONS'' in a job's runtime environment, **please do not exceed a value of 9000** unless you have explicitly allocated more on-node tasks to the job than you will use.  E.g. requesting ''--nodes=1 --ntasks=8'' and running the MPI program with just 2 ranks implies that ''%%UCX_IB_RCACHE_MAX_REGIONS=$((9000*4))%%'' is permissible.
</WRAP>
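
In other words, the 9000-region budget may be scaled by the ratio of tasks allocated to ranks actually launched.  A hypothetical sketch of that arithmetic follows; ''NRANKS'' is a placeholder for whatever rank count is passed to ''mpirun'', and ''SLURM_NTASKS'' is the task count set by the scheduler (assuming the Slurm syntax used in the example above):

<code bash>
# 8 tasks allocated (--ntasks=8) but only 2 ranks launched => 9000 * 8 / 2 = 36000
NRANKS=2
export UCX_IB_RCACHE_MAX_REGIONS=$(( 9000 * SLURM_NTASKS / NRANKS ))

mpirun -np "$NRANKS" vasp_std
</code>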