Since Open MPI was being used, it was important to note which collective mechanism was employed under the ''MPI_Reduce()''. The core dumps revealed that the UCX PML module was involved and the fault asserted downstream in UCX's ''ucp_rndv_progress_rma_zcopy_common()'' function.
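(For reference, which PML Open MPI selects can be confirmed at runtime with standard MCA options; the diagnostic output format varies by Open MPI release, and ''vasp_std'' here is just the program under study.)

<code bash>
# Report which PML component Open MPI selects for a short test run,
# or force the UCX PML explicitly to make the choice unambiguous.
$ mpirun --mca pml_base_verbose 10 vasp_std 2>&1 | grep -i pml
$ mpirun --mca pml ucx vasp_std
</code>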

===== Debugging =====

The solution to the issue came in two distinct stages: first, replacing VASP's compiled-in ''MPI_BLOCK'' fragment size with a value that can be varied at runtime, and second, limiting the size of UCX's memory-region registration cache.

==== Variable MPI_BLOCK ====

Testing showed that the introduction of the variable ''MPI_BLOCK'' and dynamically-allocated work arrays did **not** affect the performance of VASP. However, the change only //delayed// the occurrence of the MR cache exhaustion; it did not remove it. Nevertheless, the variable ''MPI_BLOCK'' size seems to be a very useful feature and has been brought to the attention of the VASP developers. Patches for various VASP releases exist in this [[https://github.com/jtfrey/vasp-dynamic-mpi-block-patch|Github repository]].
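As an illustration only, a job using one of those patched builds might select the fragment size in its batch script rather than at compile time; the Slurm directives and the block size shown here are arbitrary examples:

<code bash>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64
#
# Hypothetical job fragment for a VASP build carrying the dynamic
# MPI_BLOCK patch: the fragment size is read from the environment at
# runtime rather than fixed when VASP is compiled.
export VASP_MPI_BLOCK=131072

mpirun vasp_std
</code>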

Ideally, it would be even more useful to respond to ''VASP_MPI_BLOCK=none'' or ''VASP_MPI_BLOCK=0'' by **not** fragmenting the array and instead issuing a single ''MPI_Reduce()''. Since many modern transport libraries underpinning MPI effect fragmentation themselves and only as necessary, the conditions that prompted ''M_sum_master_d()'' back in the era of the Intel Pentium no longer exist.

==== UCX control ====

When runtime variation of ''MPI_BLOCK'' did not completely remove the issue, further testing was performed. The data collected eventually led to Google searches that returned a [[https://github.com/openucx/ucx/issues/6264|Github issue with the openucx project]]. In the dialog associated with the issue, one interesting point was raised:

<code>
#
# Maximal number of regions in the registration cache
#
# syntax: unsigned long: <number>, "inf", or "auto"
# inherits: UCX_RCACHE_MAX_REGIONS
#
UCX_IB_RCACHE_MAX_REGIONS=inf
</code>
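The effective value on a given node can be checked with UCX's ''ucx_info'' configuration dump; the grep pattern below is just one way to pick the setting out, and the exact variables listed depend on the UCX build:

<code bash>
# Dump UCX's effective configuration and locate the MR cache limit;
# a value of "inf" means no limit is imposed by default.
$ ucx_info -c | grep RCACHE_MAX_REGIONS
</code>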

By default, the UCX library does not limit the number of memory regions it attempts to register with the underlying InfiniBand hardware. If the MR cache can accommodate //N_lines// memory regions and an MPI job uses //N_r// ranks for a collective, then each rank has an effective limit of //N_lines / N_r// memory regions it can register. Obviously, as //N_r// grows, the MPI program has the potential to saturate the MR cache at a rate //N_r// times faster than a serial task.
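The arithmetic behind the per-rank figure quoted below is simple enough to sketch in shell; the cache capacity used here is only the observed order of magnitude, not a hardware specification:

<code bash>
# Illustrative arithmetic only: the per-rank share of the MR cache when
# every rank on the node registers regions concurrently.
N_LINES=600000    # approximate cache capacity observed at failure
N_RANKS=64        # MPI ranks per node
echo $(( N_LINES / N_RANKS ))    # => 9375
</code>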

Empirical observation of the MR cache behavior showed sizes in the neighborhood of 600k when the memkey registration failed: for a node with 64 CPU cores and an MPI job doing an I/O operation across all 64 ranks, a per-rank limit of ca. 9375 is indicated. A very conservative limit on UCX's registration behavior was tried first:

<code bash>
$ UCX_IB_RCACHE_MAX_REGIONS=500 VASP_MPI_BLOCK=131072 mpirun vasp_std
</code>

With the addition of a finite ''UCX_IB_RCACHE_MAX_REGIONS'' to the environment, the program made it past the initial ''KPAR_SYNC_ALL()'' call and successfully iterated through the wavefunction minimization loop.

===== Solution =====

The default of ''UCX_IB_RCACHE_MAX_REGIONS=inf'' does not necessarily impact all workloads: the majority of MPI jobs run on DARWIN have not encountered the MR cache exhaustion discussed herein. But the fact that UCX, by default, allows the hardware's registration capacity to be oversubscribed is problematic, since most users will have a difficult time debugging this issue when and if it arises.

To address the problem in a global sense on DARWIN, the ''ucx'' VALET package has been modified:

  - A ''ucx/system'' version is now present that makes no modifications to ''PATH'' et al.
  - For all versions, ''UCX_IB_RCACHE_MAX_REGIONS'' is set to ''1000'' in the environment.

The ''openmpi'' VALET package has also been modified to load a version of ''ucx'' with every version of ''openmpi'': older Open MPI releases have ''ucx/system'' as a dependency, while newer releases already include a dependency on ''ucx/1.13.1''. Thus, all jobs using an IT RCI-provided Open MPI library will henceforth have ''UCX_IB_RCACHE_MAX_REGIONS'' set in their runtime environment:

<code bash>
$ vpkg_require openmpi/1.8.8
Adding dependency `ucx/system` to your environment
Adding package `openmpi/1.8.8` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000

$ vpkg_rollback all

$ vpkg_require openmpi/4.1.5:intel-2020
Adding dependency `intel/2020u4` to your environment
Adding dependency `ucx/1.13.1` to your environment
Adding package `openmpi/4.1.5:intel-2020` to your environment

$ echo $UCX_IB_RCACHE_MAX_REGIONS
1000
</code>

If you have built your own MPI libraries and do not use VALET or do not have ''ucx'' as a dependency, you are encouraged to effect this change in your own runtime environment.
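For example, the variable can simply be exported in the job script (or shell startup file) before the MPI program is launched; a minimal sketch using the same default adopted by the VALET packages, with ''vasp_std'' as a stand-in for your own program:

<code bash>
# Cap the number of memory regions UCX will register per process; 1000
# matches the default now set by the VALET ucx packages.  See the warning
# below before choosing a larger value.
export UCX_IB_RCACHE_MAX_REGIONS=1000

mpirun vasp_std
</code>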

<WRAP center round alert 60%>
When setting ''UCX_IB_RCACHE_MAX_REGIONS'' in a job's runtime environment, **please do not exceed a value of 9000** unless you have explicitly allocated more on-node tasks to the job than you will use. For example, requesting ''--nodes=1 --ntasks=8'' and running the MPI program with just 2 ranks implies that ''%%UCX_IB_RCACHE_MAX_REGIONS=$((9000*4))%%'' is permissible.
</WRAP>
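Stated as arithmetic, the cap scales with the ratio of allocated on-node tasks to ranks actually launched; a sketch of that calculation, using the numbers from the example above:

<code bash>
# Illustrative only: scale the 9000 per-task guideline by the ratio of
# allocated tasks to MPI ranks actually used (8 allocated, 2 used => 4x).
TASKS_ALLOCATED=8
RANKS_USED=2
echo $(( 9000 * TASKS_ALLOCATED / RANKS_USED ))    # => 36000
</code>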