Use of UCX PML in Open MPI 4.x on DARWIN
This document explores a bug in releases of Open MPI 4.x prior to 4.1.6. Multiple variants of 4.0.5, 4.1.0, 4.1.2, 4.1.4, and 4.1.5 present on DARWIN contained this bug.
What is UCX
UCX is a communications library that provides an abstract transport interface on top of multiple protocols and devices. The same API can be used to move data between processes via shared memory, TCP sockets, and low-latency networks like InfiniBand. The UCX library is favored by Mellanox (now NVIDIA) for their InfiniBand networks.
The Open MPI libraries built and maintained by IT RCI on DARWIN include several components that make use of UCX for accelerated data movement:
- the UCX Point-to-point Management Layer (PML) component
- the UCX One-Sided Communications (OSC) component
By default, the Open MPI libraries are configured to use the UCX PML.
What is the issue
While testing a workgroup's large-scale MPI code, the following sequence:
Integer (kind=int64) :: n_threshold
    :
If (rank == 0) then
    Call GetEnvVarInteger8('N_THRESHOLD', 10240, n_threshold)
End If
Call MPI_Bcast(n_threshold, 1, MPI_INTEGER8, 0, MPI_COMM_WORLD, mpierr)
Write(*,*) rank, n_threshold
produced the following output:
0 10240
1 1133871376384
2 1133871376384
:
At first it looked like the data were NOT broadcast to the other ranks, but mpierr was 0. The values in hexadecimal are:
| Decimal       | Hexadecimal    |
| 10240         | 0x000000002800 |
| 1133871376384 | 0x010800002800 |
The lowest 32 bits of the 64-bit variable received the lowest 32 bits of the n_threshold value sent by rank 0; whatever was present in the upper 32 bits was left unaltered by the receive operation.
- Changing this code to send the variable type-cast as an array of TWO 32-bit integers (totaling 64 bits of data) succeeded
- Changing this code to make n_threshold an array of dimension TWO and sending:
  - just the first element of the array (ONE 64-bit integer) failed
  - both elements of the array (TWO 64-bit integers) succeeded
In fact, a test program also showed that sending ONE double-precision floating-point value (type MPI_DOUBLE) failed. The failure of the test program removed suspicion from the workgroup's large-scale MPI code. The test program was then translated to C and that variant also failed, so the choice of language was not a factor. The blame fell squarely on the Open MPI library or an underlying library.
Eventually debugging demonstrated that the UCX PML, when registering a UCX-native datatype to be associated with an MPI-native datatype, was producing an incorrect byte size for any 8-byte type. In the Open MPI 4.x code, an optimization had been added that chose to use a bit shift instead of multiplication when the size is a power of 2 (1, 2, 4, 8, 16, etc.). When the size was detected to be a power of 2, the exponent was determined using:
pml_datatype->size_shift = (int)(log(size) / log(2.0)); /* log2(size) */
Mathematically that expression is exact, but floating-point arithmetic isn't always exact. For size = 8 the expression was evaluating to 2.9999999. Depending on the rounding mode chosen by the application or the method of truncation to integer form, this value could end up being 3 or 2. This meant that the byte size was registered as 1 << 2 = 4 rather than 1 << 3 = 8, which is why UCX was sending and receiving just the lowest 32 bits of a 64-bit value.
How is it fixed
Rather than using floating-point mathematics, simple integer methods are appropriate. Many processors have native instructions for counting the contiguous zero bits at the right or left end of a value. In C, counting the zero bits from the right end (count trailing zeros) looks like:
int ctz(unsigned int v)
{
    int l;

    if ( v == 0 ) return 8 * sizeof(v);
    l = 0;
    while ( (v & 1) == 0 ) l++, v >>= 1;
    return l;
}
Adding the precondition that v ≠ 0 removes the leading conditional:
int ctz(unsigned int v)
{
    int l = 0;

    while ( (v & 1) == 0 ) l++, v >>= 1;
    return l;
}
and if v is guaranteed to be a power of two (implying a single bit is set) the code becomes:
int ctz(unsigned int v)
{
    int l = -1;

    do {
        l++, v >>= 1;
    } while (v);
    return l;
}
The GCC compiler implements a __builtin_ctz() function that may directly produce machine-level assembly code (very fast) when the target ISA supports it. The surrounding code in the UCX PML ensures the integer value is a power of 2 and non-zero, so the final form above is permissible:
#if OPAL_C_HAVE_BUILTIN_CTZ
    pml_datatype->size_shift = __builtin_ctzll(size);
#else
    size_t lsize = size >> 1;
    pml_datatype->size_shift = 0;
    while ( lsize ) pml_datatype->size_shift++, lsize >>= 1;
#endif
This issue has been noted by the Open MPI developers. A patch has been made and is slated for inclusion in the 4.1.6 and 5.0.0 releases of Open MPI.
Changes on DARWIN
The source code for all versions and variants of Open MPI in the 4.x release sequence has been patched according to the information above. All were recompiled and reinstalled on April 25, 2023 to mitigate this issue.
DARWIN users who experienced problems with their MPI code are encouraged to try to determine whether the send/broadcast of a single 64-bit integer or floating-point (double) value may have been involved. Whether or not this can be determined, rerunning failed programs that use these MPI libraries may now yield correct results.