Table of Contents

Mellanox UCX and Open MPI on DARWIN

During early-access testing of the DARWIN cluster, several users reported unexpected crashes of their Open MPI applications. The crashes were accompanied by a running stream of kernel messages:

  :
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)
[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
  :

In the 4.x releases of Open MPI the low-level InfiniBand BTL driver (openib) has been deprecated in favor of the Unifiec Communication X framework. The Mellanox OFED software stack present on each node in DARWIN ships with a copy of the UCX library, so by default Open MPI versions which integrate with UCX build those modules by default. Older releases (1.6, 1.8) continue to build and use the openib BTL for low-level InfiniBand communications, and the builds with UCX support also build that module — even the 4.x releases that have deprecated its use.

MCA Defaults

For Open MPI 4.x releases that do build and include the openib BTL module, a warning is produced when a job first begins running:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              <nodename>
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   <nodename>
  Local device: mlx5_0
--------------------------------------------------------------------------

Since a UCX library is provided within the OS on DARWIN, it makes sense to disable the openib module by default to avoid this message and ensure the use of the UCX modules. This change is easily effected in the etc/openmpi-mca-params.conf file that is part of the Open MPI install:

btl = ^openib
pml = ucx

# Never use the IPoIB interfaces for TCP communications:
oob_tcp_if_exclude = ib0
btl_tcp_if_exclude = ib0

The openib BTL is disabled, the ucx PML module is selected as the only option, and the InfiniBand's IPoIB interface is excluded from use for TCP/IP communications (out-of-band signaling, for example).

Ongoing Errors

With the MCA defaults outlined above, some applications were still seeing the issues that were accompanied by CREATE_MKEY kernel messages:

[Sat Feb  6 16:51:55 2021] infiniband mlx5_0: create_mkey_callback:148:(pid 0): async reg mr failed. status -12
[Sat Feb  6 16:51:55 2021] mlx5_core 0000:81:00.0: mlx5_cmd_check:794:(pid 0): CREATE_MKEY(0x200) op_mod(0x0) failed, status limits exceeded(0x8), syndrome (0x59c8a4)

The error number (-12) corresponds to ENOMEM; examining the mlx5 kernel driver source code, this (with the status limits exceeded message) implies that the Mellanox hardware could not create an additional memory-mapping record.

The Mellanox ConnectX-6 network interface breaks any message into a series of memory mappings of fixed size. The larger the memory region being communicated via RDMA, the more memory-mapping records are necessary. The network interface has a finite amount of memory available for these records, and the kernel messages indicate that that table has filled, the interface is in the process of sending the data represented by those records, and the caller should try again. So:

  1. A large data transmission via RDMA is requested
  2. The Mellanox network interface's MTT begins to fill with memory-mapping records at the rate X
  3. Data begins transmitting, MTT records are completed and removed at rate Y < X
  4. Kernel messages are produced with each failed MTT addition request
  5. Once all MTT records have been added, the network interface processes the remainder while no kernel messages are produced

On occasion, the job did eventually crash or the node itself would encounter a kernel panic and go offline, so the issue was not a facile annoyance.

Google searches eventually turned up a known bug in all Open MPI releases prior to 4.1.0 with respect to newer releases of UCX. The UCX library began promoting a newer API for creating memory keys (recall CREATE_MKEY in the kernel messages) called NBX over an earlier NB API. A copy of Open MPI 4.1.0 was built and one of the applications that was failing reliably (with both 4.0.5 and 3.1.6) was recompiled on Open MPI 4.1.0. Subsequent runs no longer failed or produced the kernel messages regarding MTT exhaustion.

Older Open MPI Releases

The UCX changes implemented in Open MPI 4.1.0 have not been backported to any of the older releases of the software, so this issue does persist. The issue arises when attempting large data broadcasts, for example:

      Allocate(Rmat(2,144259970)
        :
      Call MPI_Bcast(Rmat, 2*144259970, MPI_REAL, 0, MPI_COMM_WORLD, mpierr)

The array in question is 1154079760 Bytes (slightly over 1.0 GiB).

The OpenFabrics Alliance produces a unified messaging library similar to UCX called OFI or libfabric. The OFI library supports Omni-path PSM2 and a low-level InfiniBand verbs module, so we have used it under many Open MPI variants provided on Caviness. The OFI library seemed a logical addition to our pre-4.1.0 DARWIN Open MPI builds as an alternative to UCX for large messages.

Unfortunately, the first test of the code above (with UCX transport disabled and OFI transport enabled) failed:

Message size 1154079760 bigger than supported by selected transport. Max = 1073741824
[r2x00:21827] *** An error occurred in MPI_Bcast

The size cited is exactly the size of the Rmat matrix. Further investigation revealed that this is a known issue in the Open MPI MTL (Matching Transport Layer) whereby large memory regions are not broken into chunks, but rather fail when they exceed the maximum message size allowed by the chosen transport interface. The Mellanox InfiniBand network interface indeed has a maximum message size of 1.0 GiB (1073741824 Bytes). Open MPI developers admit that automatic "chunking" of a too-large buffer (making use of the maximum allowable size returned by OFI) is probably permissible and appropriate, but it has not yet been implemented in the MTL modules.

So the implication is that Open MPI releases prior to 4.1.0 cannot internally handle an MPI_Bcast() in excess of 1.0 GiB on DARWIN. Explicitly reenabling the openib BTL and disabling the UCX and OFI modules may address the issue by forcing the use of a now-deprecated less-efficient mechanism.

Chunked Broadcast

Open MPI MTL does not yet implement "chunking" of a large buffer into multiple smaller ones for broadcast. But it is relatively straightforward for the application developer to implement such functionality:

mpi_utils.f90
Module mpi_utils
    !
    ! This module implements MPI broadcast functions that break a buffer
    ! that may be too large into multiple chunks.  The <stride> passed
    ! to each of the subroutines dictates how many 8-byte elements should
    ! be passed to each underlying MPI_Bcast().
    !
    ! If the <stride> is zero, then the environment variable CONF_BROADCAST_MAX
    ! is checked for an integer value.  If no value is found, or the value
    ! exceeds the maximum threshold, the maximum threshold is used.  The
    ! maximum equates to 1 GiB worth of elements.
    !
    Implicit None
 
    Private
 
    Public :: BroadcastI, BroadcastR, BroadcastD
 
    Integer, Parameter :: MaxStride8Byte = 134217728
 
  Contains
 
    Subroutine BroadcastI(buffer, count, stride, rank, comm, mpierr)
        Use mpi_f08
        Implicit None
 
        Type(MPI_Comm), Intent(In)                  :: comm
        Integer, Intent(In)                         :: count, stride, rank
        Integer, Dimension(*), Intent(InOut)        :: buffer
        Integer, Intent(InOut)                      :: mpierr
        Character(Len=255)                          :: envvar
        Integer                                     :: use_stride, count_remain, i
 
        If (stride .le. 0) Then
            Call GetEnv('CONF_BROADCAST_MAX', envvar)
            Read(envvar, '(i)') use_stride
            If (use_stride .le. 0) use_stride = MaxStride8Byte * 2
        Else
            use_stride = stride
        End If
        If (stride .gt. MaxStride8Byte * 2) use_stride = MaxStride8Byte * 2
        count_remain = count
        i = 1
        mpierr = 0
        Do While (count_remain .gt. use_stride)
            Call MPI_Bcast(buffer(i:i+use_stride), use_stride, MPI_INTEGER, rank, comm, mpierr) 
            If (mpierr .ne. 0 ) Then
                Write(*,*) 'Failure broadcasting integer range ',i,':',i+use_stride
                Return
            End If
            count_remain = count_remain - use_stride
            i = i + use_stride
        End Do
        If (count_remain .gt. 0) Then
            Call MPI_Bcast(buffer(i:i+use_stride), use_stride, MPI_INTEGER, rank, comm, mpierr)
            If (mpierr .ne. 0 ) Then
                Write(*,*) 'Failure broadcasting integer range ',i,':',i+use_stride
                Return
            End If
        End If
    End Subroutine BroadcastI
 
    Subroutine BroadcastR(buffer, count, stride, rank, comm, mpierr)
        Use mpi_f08
        Implicit None
 
        Type(MPI_Comm), Intent(In)                  :: comm
        Integer, Intent(In)                         :: count, stride, rank
        Real, Dimension(*), Intent(InOut)           :: buffer
        Integer, Intent(InOut)                      :: mpierr
        Character(Len=255)                          :: envvar
        Integer                                     :: use_stride, count_remain, i
 
        If (stride .le. 0) Then
            Call GetEnv('CONF_BROADCAST_MAX', envvar)
            Read(envvar, '(i)') use_stride
            If (use_stride .le. 0) use_stride = MaxStride8Byte * 2
        Else
            use_stride = stride
        End If
        If (stride .gt. MaxStride8Byte * 2) use_stride = MaxStride8Byte * 2
        count_remain = count
        i = 1
        mpierr = 0
        Do While (count_remain .gt. use_stride)
            Call MPI_Bcast(buffer(i:i+use_stride), use_stride, MPI_REAL, rank, comm, mpierr)
            If (mpierr .ne. 0 ) Then
                Write(*,*) 'Failure broadcasting real range ',i,':',i+use_stride
                Return
            End If
            count_remain = count_remain - use_stride
            i = i + use_stride
        End Do
        If (count_remain .gt. 0) Then
            Call MPI_Bcast(buffer(i:i+use_stride), use_stride, MPI_REAL, rank, comm, mpierr)
            If (mpierr .ne. 0 ) Then
                Write(*,*) 'Failure broadcasting real range ',i,':',i+use_stride
                Return
            End If
        End If
    End Subroutine BroadcastR
 
    Subroutine BroadcastD(buffer, count, stride, rank, comm, mpierr)
        Use mpi_f08
        Use, intrinsic :: iso_fortran_env
        Implicit None
 
        Type(MPI_Comm), Intent(In)                  :: comm
        Integer, Intent(In)                         :: count, stride, rank
        Real(real64), Dimension(*), Intent(InOut)   :: buffer
        Integer, Intent(InOut)                      :: mpierr
        Character(Len=255)                          :: envvar
        Integer                                     :: use_stride, count_remain, i
 
        If (stride .le. 0) Then
            Call GetEnv('CONF_BROADCAST_MAX', envvar)
            Read(envvar, '(i)') use_stride
            If (use_stride .le. 0) use_stride = MaxStride8Byte
        Else
            use_stride = stride
        End If
        If (stride .gt. MaxStride8Byte) use_stride = MaxStride8Byte
        count_remain = count
        i = 1
        mpierr = 0
        Do While (count_remain .gt. use_stride)
            Call MPI_Bcast(buffer(i:i+use_stride), use_stride, MPI_DOUBLE_PRECISION, rank, comm, mpierr)
            If (mpierr .ne. 0 ) Then
                Write(*,*) 'Failure broadcasting real range ',i,':',i+use_stride
                Return
            End If
            count_remain = count_remain - use_stride
            i = i + use_stride
        End Do
        If (count_remain .gt. 0) Then
            Call MPI_Bcast(buffer(i:i+use_stride), use_stride, MPI_DOUBLE_PRECISION, rank, comm, mpierr)
            If (mpierr .ne. 0 ) Then
                Write(*,*) 'Failure broadcasting real range ',i,':',i+use_stride
                Return
            End If
        End If
    End Subroutine BroadcastD
 
End Module

The MaxStride8Byte constant represents 1.0 GiB worth of 8-byte entities (a double-precision real on x86_64 is 8 bytes wide).

Using this module, the previous code is transformed to:

      Use mpi_utils
        :
      Allocate(Rmat(2,144259970)
        :
      Call BroadcastR(Rmat, 2*144259970, 0, 0, MPI_COMM_WORLD, mpierr)

The data is now broadcast as a 268435456-element chunk followed by a 20084484-element chunk, both of which are well below the 1.0 GiB limit associated with the OFI MTL. The efficiency of MTL should be well in excess of the overhead involved in the "chunking," making this an efficient broadcast mechanism on older Open MPI releases on DARWIN.