Use of UCX PML in Open MPI 4.x on DARWIN

This document explores a bug in releases of Open MPI 4.x prior to 4.1.6. Multiple variants of 4.0.5, 4.1.0, 4.1.2, 4.1.4, and 4.1.5 on DARWIN contained this bug.

UCX is a communications library that provides an abstract transport interface on top of multiple protocols and devices. The same API can be used to move data between processes via shared memory, TCP sockets, and low-latency networks like InfiniBand. The UCX library is favored by Mellanox (now NVIDIA) for their InfiniBand networks.

The Open MPI libraries built and maintained by IT RCI on DARWIN include several components that make use of UCX for accelerated data movement:

the UCX Point-to-point Management Layer (PML) component
the UCX One-Sided Communications (OSC) component

By default, the Open MPI libraries are configured to use the UCX PML.

While testing a workgroup's large-scale MPI code, the following sequence:

Integer (kind=int64) :: n_threshold
   :
If (rank == 0) then
    Call GetEnvVarInteger8('N_THRESHOLD', 10240, n_threshold)
End If
Call MPI_Bcast(n_threshold, 1, MPI_INTEGER8, 0, MPI_COMM_WORLD, mpierr)
Write(*,*) rank, n_threshold

produced the following output:

       0         10240
       1 1133871376384
       2 1133871376384
          :

At first it looked like the data were NOT broadcast to the other ranks, but mpierr was 0. The values in hexadecimal are

10240	`0x000000002800`
1133871376384	`0x010800002800`

The lowest 32 bits of the 64-bit variable have received the lowest 32 bits of the n_threshold value sent by rank 0; whatever was present in the upper 32 bits is left unaltered by the receive operation.
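
The observed value can be reproduced with plain integer arithmetic. The sketch below is illustrative only and is not Open MPI code: the hypothetical helper `partial_receive` overwrites the low 32 bits of a 64-bit destination and keeps the high 32 bits, and the stale value `0x0000010800000000` is inferred from the output above rather than taken from a memory dump.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of Open MPI or UCX): models a receive that
 * writes only the low 32 bits of a 64-bit destination. */
static uint64_t partial_receive(uint64_t stale, uint64_t sent)
{
    const uint64_t low_mask = UINT64_C(0xFFFFFFFF);

    /* Keep the destination's old high half, take the sender's low half. */
    return (stale & ~low_mask) | (sent & low_mask);
}

/* Usage sketch: rank 0 sent 10240; the receiver's variable held leftover
 * bits (assumed value) in its upper half. */
void partial_receive_example(void)
{
    uint64_t received = partial_receive(UINT64_C(0x0000010800000000), 10240);

    assert(received == UINT64_C(1133871376384));
}
```

If the destination had happened to be zero beforehand, the same partial write would have produced the correct 10240, which is one reason a bug of this kind can go unnoticed.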

- Changing this code to send the variable type-cast as an array of TWO 32-bit integers (totaling 64 bits of data) succeeded.
- Changing this code to make n_threshold an array of dimension TWO and sending:
  - just the first element of the array (ONE 64-bit integer) failed
  - both elements of the array (TWO 64-bit integers) succeeded

In fact, a test program also showed that sending ONE double-precision floating-point value (type MPI_DOUBLE) failed. The failure of the test program removed suspicion from the workgroup's large-scale MPI code. A C translation of the test program also failed, so the choice of language was not a factor. The blame fell squarely on the Open MPI library or an underlying library.

Eventually debugging demonstrated that the UCX PML, when registering a UCX-native datatype to be associated with an MPI-native datatype, was producing an incorrect byte size for any 8-byte type. In the Open MPI 4.x code, an optimization had been added that chose to use a bit shift instead of multiplication when the size is a power of 2 (1, 2, 4, 8, 16, etc.). When the size was detected to be a power of 2, the exponent was determined using:

pml_datatype->size_shift = (int)(log(size) / log(2.0)); /* log2(size) */

Mathematically that expression is exact and accurate, but floating-point arithmetic isn't always exact. The value of this expression for size = 8 was evaluating to 2.9999999. Depending on the rounding mode chosen by the application or the method of truncation to integer form, this value could end up being 3 or 2. When it became 2, the byte size was registered as 1 << 2 = 4 rather than 1 << 3 = 8, which is why UCX was sending and receiving just the lowest 32 bits of a 64-bit value.
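
The effect of the cast can be seen without involving `log()` at all. The following sketch is not the Open MPI source; `shift_by_truncation` and `bytes_for_shift` are hypothetical helpers, and 2.9999999 is simply the value quoted above. Whether `log(8) / log(2.0)` actually lands below 3 depends on the compiler and math library, so the example feeds the quoted value in directly.

```c
#include <assert.h>
#include <stddef.h>

/* Same conversion as the (int) cast in the original expression:
 * C discards the fractional part when converting double to int. */
static int shift_by_truncation(double log2_estimate)
{
    return (int)log2_estimate;
}

/* Byte size implied by a shift count. */
static size_t bytes_for_shift(int shift)
{
    assert(shift >= 0 && (size_t)shift < 8 * sizeof(size_t));
    return (size_t)1 << shift;
}
```

With these helpers, `bytes_for_shift(shift_by_truncation(2.9999999))` is 4, while an exact 3.0 gives 8.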

Rather than using floating-point mathematics, simple integer methods are appropriate. Many processors have native instructions for determining how many contiguous zero bits exist from the right or left end of the value. In C, that operation looks like

int ctz(unsigned int v)
{
    int     l;
 
    if ( v == 0 ) return 8 * sizeof(v);
    l = 0;
    while ( (v & 1) == 0 ) l++, v >>= 1;
    return l;
}

Adding the precondition that v ≠ 0 removes the leading conditional

int ctz(unsigned int v)
{
    int     l = 0;
 
    while ( (v & 1) == 0 ) l++, v >>= 1;
    return l;
}

and if v is guaranteed to be a power of two – implying a single bit is set – the code becomes

int ctz(unsigned int v)
{
    int     l = -1;
 
    do { l++, v >>= 1; } while (v);
    return l;
}
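
One caveat about the last form: it counts how many right shifts are needed to clear the value, which is the position of the highest set bit. That matches the trailing-zero count only when exactly one bit is set. A small sketch with the hypothetical names `ctz_loop` and `highbit_loop` (copies of the second and third variants above, with an added assertion for the v ≠ 0 precondition) makes the boundary visible.

```c
#include <assert.h>

/* Second variant above: count trailing zero bits; requires v != 0. */
static int ctz_loop(unsigned int v)
{
    int l = 0;

    assert(v != 0);
    while ((v & 1) == 0) {
        l++;
        v >>= 1;
    }
    return l;
}

/* Third variant above: index of the highest set bit; requires v != 0. */
static int highbit_loop(unsigned int v)
{
    int l = -1;

    assert(v != 0);
    do {
        l++;
        v >>= 1;
    } while (v);
    return l;
}
```

For 8 (binary 1000) both functions return 3. For 12 (binary 1100) `ctz_loop` returns 2 and `highbit_loop` returns 3, so the shortcut is only safe behind a power-of-two check.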

The GCC compiler implements a __builtin_ctz() function that may directly produce machine-level assembly code (very fast) when the target ISA supports it. The surrounding code in the UCX PML ensures the integer value is a power of 2 and non-zero, so the final form above is permissible:

#if OPAL_C_HAVE_BUILTIN_CTZ
        pml_datatype->size_shift = __builtin_ctzll(size);
#else
        size_t        lsize = size >> 1;
        pml_datatype->size_shift = 0;
        while ( lsize ) pml_datatype->size_shift++, lsize >>= 1;
#endif
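
To try the replacement logic outside of Open MPI, it can be wrapped in a standalone function. This is a sketch under assumptions, not the actual patch: `size_shift_for` is a hypothetical name, the `__GNUC__` test stands in for the `OPAL_C_HAVE_BUILTIN_CTZ` configure macro, and the power-of-two assertion is one common way to express the guarantee that the surrounding PML code is described as providing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper: returns log2(size) for a nonzero power of two. */
static int size_shift_for(size_t size)
{
    /* Exactly one bit set: nonzero, and clearing the lowest set bit leaves 0. */
    assert(size != 0 && (size & (size - 1)) == 0);

#if defined(__GNUC__)
    /* GCC/Clang builtin; the argument type is unsigned long long. */
    return __builtin_ctzll((unsigned long long)size);
#else
    int shift = 0;

    /* Portable fallback: count shifts until the single set bit is gone. */
    for (size >>= 1; size != 0; size >>= 1) {
        shift++;
    }
    return shift;
#endif
}
```

Here `size_shift_for(8)` yields 3, so `1 << 3` gives back the full 8-byte size, and no floating-point arithmetic is involved on either branch.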

This issue has been noted by the Open MPI developers. A patch has been made and is slated for inclusion in the 4.1.6 and 5.0.0 releases of Open MPI.

The source code for all versions and variants of Open MPI in the 4.x release sequence has been patched according to the information above. All were recompiled and reinstalled on April 25, 2023 to mitigate this issue.

DARWIN users who experienced problems with their MPI code are encouraged to try to determine whether the send/broadcast of a single 64-bit integer or floating-point (double) value may have been involved. Whether or not this is possible, rerunning failed programs that use these MPI libraries may now yield correctly working executions.
