Since Open MPI was being used, it was important to note which collective mechanism was employed under the ''MPI_Reduce()''.  The core dumps revealed that the UCX PML module was involved and the fault asserted downstream in UCX's ''ucp_rndv_progress_rma_zcopy_common()'' function.
  
===== Debugging =====
  
The solution to the issue came in two distinct stages.
</code>
  
Testing showed that the introduction of the variable ''MPI_BLOCK'' and dynamically-allocated work arrays did **not** affect the performance of VASP.  However, the change only //delayed// the occurrence of the MR cache exhaustion; it did not remove it.  Nevertheless, a runtime-variable ''MPI_BLOCK'' size seems to be a very useful feature and has been brought to the attention of the VASP developers.  Patches for various VASP releases exist in this [[https://github.com/jtfrey/vasp-dynamic-mpi-block-patch|Github repository]].

Ideally, it would be even more useful to respond to ''VASP_MPI_BLOCK=none'' or ''VASP_MPI_BLOCK=0'' by **not** fragmenting the array at all and instead issuing a single ''MPI_Reduce()''.  Since many modern transport libraries underpinning MPI effect fragmentation themselves, and only as necessary, the conditions that prompted ''M_sum_master_d()'' back in the era of the Intel Pentium no longer exist.

==== UCX control ====

When runtime variation of ''MPI_BLOCK'' did not completely remove the issue, further testing was performed.  The data collected eventually led to Google searches that returned a [[https://github.com/openucx/ucx/issues/6264|Github issue with the openucx project]].  In the dialog associated with the issue, one interesting point was raised:

<code>
#
# Maximal number of regions in the registration cache
#
# syntax:    unsigned long: <number>, "inf", or "auto"
# inherits:  UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf
</code>

By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware.  If the MR cache can accommodate //N_lines// memory regions and an MPI job uses //N_r// ranks for a collective, then each rank has an effective limit of //N_lines / N_r// memory regions it can register.  Obviously, as //N_r// grows, the MPI program has the potential to saturate the MR cache at a rate //N_r// times faster than a serial task.

Empirical observation of the MR cache behavior showed sizes in the neighborhood of 600k regions when the memkey registration failed:  for a node with 64 CPU cores and an MPI job performing an I/O operation across all 64 ranks, a per-rank limit of ca. 9375 is indicated.  A very conservative limit on UCX's registration behavior was tried first:

<code bash>
$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std
</code>

With the addition of a finite ''UCX_IB_RCACHE_MAX_REGIONS'' to the environment, the program made it past the initial ''KPAR_SYNC_ALL()'' call and successfully iterated through the wavefunction minimization loop.
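
As a quick sanity check that such an override is actually visible to UCX, the effective configuration can be dumped and filtered.  This is a minimal sketch, assuming the ''ucx_info'' utility shipped with the UCX installation is on the ''PATH''; the reported value should reflect the override rather than the ''inf'' default:

<code bash>
$ UCX_IB_RCACHE_MAX_REGIONS=500 ucx_info -c | grep IB_RCACHE_MAX_REGIONS
UCX_IB_RCACHE_MAX_REGIONS=500
</code>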

===== Solution =====

The default of ''UCX_IB_RCACHE_MAX_REGIONS=inf'' does not necessarily impact all workloads:  the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein.  But the fact that the hardware capabilities are over-provisioned by default in UCX is problematic, since most users will have a difficult time debugging this issue when and if it arises.

To address the problem in a global sense on DARWIN, the ''ucx'' VALET package has been modified:

  - A ''ucx/system'' version is now present that makes no modifications to ''PATH'' et al.
  - For all versions, ''UCX_IB_RCACHE_MAX_REGIONS'' is set to ''1000'' in the environment.

The ''openmpi'' VALET package has also been modified to load a version of ''ucx'' with every version of ''openmpi'': older Open MPI releases now have ''ucx/system'' as a dependency, while newer releases already include a dependency on ''ucx/1.13.1''.  Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have ''UCX_IB_RCACHE_MAX_REGIONS'' set in their runtime environment:

<code bash>
$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000

$ vpkg_rollback all

$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000
</code>

Users who have built their own MPI libraries and do not use VALET, or whose builds do not have ''ucx'' as a dependency, are encouraged to effect this change in their own runtime environments.
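
For example, a minimal sketch of doing so in a Slurm batch script follows; the executable name is a placeholder, and the limit should be chosen per the note below:

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64

# Cap UCX's InfiniBand memory-region cache for this job before launching
# the MPI program (my_mpi_program is a placeholder for your own executable).
export UCX_IB_RCACHE_MAX_REGIONS=1000

mpirun ./my_mpi_program
</code>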

<WRAP center round alert 60%>
When setting ''UCX_IB_RCACHE_MAX_REGIONS'' in a job's runtime environment, **please do not exceed a value of 9000** unless you have explicitly allocated more on-node tasks to the job than you will use.  E.g. requesting ''--nodes=1 --ntasks=8'' and running the MPI program with just 2 ranks implies that ''%%UCX_IB_RCACHE_MAX_REGIONS=$((9000*4))%%'' is permissible.
</WRAP>
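
As a concrete illustration of that rule, the scenario above works out as follows.  This is a sketch only; the shell variable names are hypothetical:

<code bash>
# 8 on-node tasks allocated but only 2 MPI ranks launched leaves a
# 4x headroom factor over the 9000-region baseline.
ALLOCATED_TASKS=8
RANKS_USED=2
export UCX_IB_RCACHE_MAX_REGIONS=$(( 9000 * ALLOCATED_TASKS / RANKS_USED ))
echo $UCX_IB_RCACHE_MAX_REGIONS    # 36000
</code>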