


Mellanox UCX and Open MPI on DARWIN

During early-access testing of the DARWIN cluster, several users reported unexpected crashes of their Open MPI applications. The crashes were accompanied by a running stream of kernel messages:

  :
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
  :
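
When a node emits these messages by the thousands, it can help to summarize them rather than read them. The following is a minimal Python sketch, not a site-provided tool, that counts CREATE_MKEY failures per PCI device and status text. The function name count_mkey_failures and the regular expression are illustrative assumptions, derived only from the message format shown above.

```python
import re
from collections import Counter

# Pattern modeled on the mlx5_core lines shown above (an assumption about
# the message layout; other driver versions may format this differently).
MKEY_FAILURE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-fA-F:.]+): .*CREATE_MKEY\(0x[0-9a-fA-F]+\)"
    r".*failed, status (?P<status>.+?)\(0x[0-9a-fA-F]+\)"
)


def count_mkey_failures(lines):
    """Count CREATE_MKEY failures keyed by (PCI address, status text)."""
    counts = Counter()
    for line in lines:
        match = MKEY_FAILURE.search(line)
        if match:
            key = (match.group("pci"), match.group("status").strip())
            counts[key] += 1
    return counts


# Two of the kernel messages quoted above, used as sample input.
SAMPLE = [
    "[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:"
    "(pid 0): async reg mr failed. status -12",
    "[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:"
    "(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits "
    "exceeded(0x8), syndrome (0x59c8a4)",
]

for (pci, status), n in count_mkey_failures(SAMPLE).items():
    print(f"{pci}: {n} CREATE_MKEY failure(s), status '{status}'")
```

In practice the input would be the lines of saved kernel log output instead of the SAMPLE list.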

In the 4.x releases of Open MPI, the low-level InfiniBand BTL driver (openib) has been deprecated in favor of the Unified Communication X (UCX) framework. The Mellanox OFED software stack present on each node in DARWIN ships with a copy of the UCX library, so Open MPI versions that integrate with UCX build those modules by default. Older releases (1.6, 1.8) continue to build and use the openib BTL for low-level InfiniBand communications. The builds with UCX support also build that module, including the 4.x releases that have deprecated its use.

For Open MPI 4.x releases that do build and include the openib BTL module, a warning is produced when a job first begins running:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              <nodename>
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   <nodename>
  Local device: mlx5_0
--------------------------------------------------------------------------

Since a UCX library is provided within the OS on DARWIN, it makes sense to disable the openib module by default to avoid this message and ensure the use of the UCX modules. This change is easily effected in the etc/openmpi-mca-params.conf file that is part of the Open MPI install:

btl = ^openib
pml = ucx

# Never use the IPoIB interfaces for TCP communications:
oob_tcp_if_exclude = ib0
btl_tcp_if_exclude = ib0

With these settings the openib BTL is disabled, the ucx PML module is the only PML eligible for selection, and the IPoIB interface (ib0) is excluded from TCP/IP communications (out-of-band signaling, for example).
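
Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>, which is useful for testing the same settings in a single job without editing the installed file. The Python sketch below translates the file contents shown above into the equivalent export lines. The helper name mca_params_to_env is hypothetical, and the parser makes a simplifying assumption: it handles only "name = value" lines and whole-line "#" comments.

```python
def mca_params_to_env(conf_text):
    """Map 'name = value' lines to OMPI_MCA_<name> environment settings.

    Simplifying assumption: only whole-line '#' comments and blank lines
    are skipped; anything else without '=' is ignored.
    """
    env = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        env["OMPI_MCA_" + name] = value
    return env


# The settings from the openmpi-mca-params.conf fragment above.
CONF = """\
btl = ^openib
pml = ucx

# Never use the IPoIB interfaces for TCP communications:
oob_tcp_if_exclude = ib0
btl_tcp_if_exclude = ib0
"""

for name, value in mca_params_to_env(CONF).items():
    # Single quotes keep the shell from interpreting characters like '^'.
    print(f"export {name}='{value}'")
```

The printed lines can be pasted into a job script ahead of the mpirun command. Values set this way apply only to that job's environment, so the file-based defaults remain the right place for a cluster-wide policy.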

  • Last modified: 2021-02-12 16:18
  • by frey