--- technical:generic:openmpi-4-ucx-issue [2023-04-25 18:39] – [How is it fixed] frey
+++ technical:generic:openmpi-4-ucx-issue [2024-12-05 10:47] (current) – [How is it fixed] frey
@@ Line 1: / Line 1: @@
+====== Use of UCX PML in Open MPI 4.x on DARWIN ======
+This document explores a bug in releases of Open MPI 4.x prior to 4.1.6.  Multiple variants of 4.0.5, 4.1.0, 4.1.2, 4.1.4, and 4.1.5 present on DARWIN contained this bug.
+===== What is UCX =====
+UCX is a communications library that provides an abstract transport interface on top of multiple protocols and devices.  The same API can be used to move data between processes via shared memory, TCP sockets, and low-latency networks like InfiniBand.  The UCX library is favored by Mellanox (now NVIDIA) for their InfiniBand networks.
+The Open MPI libraries built and maintained by IT RCI on DARWIN include several components that make use of UCX for accelerated data movement:
+  * the UCX Point-to-point Messaging Layer (PML) component
+  * the UCX One-Sided Communications (OSC) component
+By default, the Open MPI libraries are configured to use the UCX PML.
+===== What is the issue =====
+While testing a workgroup's large-scale MPI code, the following sequence:
+<code fortran>
+Integer (kind=int64) :: n_threshold
+   :
+If (rank == 0) then
+    Call GetEnvVarInteger8('N_THRESHOLD', 10240, n_threshold)
+End If
+Call MPI_Bcast(n_threshold, 1, MPI_INTEGER8, 0, MPI_COMM_WORLD, mpierr)
+Write(*,*) rank, n_threshold
+</code>
+produced the following output:
+<code>
+         10240
+1133871376384
+1133871376384
+          :
+</code>
+At first it looked like the data were NOT broadcast to the other ranks, but ''mpierr'' was ''0''.  The values in hexadecimal are
+^ Decimal ^ Hexadecimal (64-bit) ^
+|10240| ''0x0000000000002800''|
+|1133871376384| ''0x0000010800002800''|
+The lowest 32 bits of the 64-bit variable //have// received the lowest 32 bits of the ''n_threshold'' value sent by rank 0; whatever was present in the upper 32 bits is left unaltered by the receive operation.
+  * Changing this code to send the variable type-cast as an array of TWO 32-bit integers (totaling 64 bits of data) succeeded
+  * Changing this code to make ''n_threshold'' an array of dimension TWO and sending:
+    * just the first element of the array (ONE 64-bit integer) failed
+    * both elements of the array (TWO 64-bit integers) succeeded
+In fact, a test program also showed that sending ONE double-precision floating-point value (type ''MPI_DOUBLE'') failed.  Because the test program failed too, the workgroup's large-scale MPI code was no longer suspect.  A C translation of the test program also failed, so the choice of language was not a factor.  The blame fell squarely on the Open MPI library or an underlying library.
+Eventually debugging demonstrated that the UCX PML, when registering a UCX-native datatype to be associated with an MPI-native datatype, was producing an incorrect byte size for any 8-byte type.  In the Open MPI 4.x code, an optimization had been added that chose to use a bit shift instead of multiplication when the size is a power of 2 (1, 2, 4, 8, 16, etc.).  When the size was detected to be a power of 2, the exponent was determined using:
+<code c>
+pml_datatype->size_shift = (int)(log(size) / log(2.0)); /* log2(size) */
+</code>
+Mathematically that expression is exact, but floating-point arithmetic isn't always exact.  For ''size = 8'' the expression was evaluating to 2.9999999 rather than 3.  Depending on the rounding mode chosen by the application or the method of conversion to integer form, this value could end up being **3 or 2**.  When it became 2, the byte size was registered as ''1 << 2 = 4'' rather than ''1 << 3 = 8'', which is why UCX was sending and receiving just the lowest 32 bits of a 64-bit value.
+==== How is it fixed ====
+Rather than using floating-point mathematics, simple integer methods are appropriate.  Many processors have native instructions for determining how many contiguous zero bits exist from the right or left end of the value.  In C, that operation looks like
+<code c>
+int ctz(unsigned int v)
+{
+    int     l;
+    if ( v == 0 ) return 8 * sizeof(v);
+    l = 0;
+    while ( (v & 1) == 0 ) l++, v >>= 1;
+    return l;
+}
+</code>
+Adding the precondition that ''v ≠ 0'' removes the leading conditional
+<code c>
+int ctz(unsigned int v)
+{
+    int     l = 0;
+    while ( (v & 1) == 0 ) l++, v >>= 1;
+    return l;
+}
+</code>
+and if ''v'' is guaranteed to be a power of two -- implying a single bit is set -- the code becomes
+<code c>
+int ctz(unsigned int v)
+{
+    int     l = -1;
+    do { l++, v >>= 1; } while (v);
+    return l;
+}
+</code>
+The GCC compiler implements a ''%%__builtin_ctz()%%'' function (and a ''%%__builtin_ctzll()%%'' variant for ''unsigned long long'' arguments) that may directly produce machine-level assembly code (very fast) when the target ISA supports it.  The surrounding code in the UCX PML ensures the integer value is a non-zero power of 2, so the final form above is permissible:
+<code c>
+#if OPAL_C_HAVE_BUILTIN_CTZ
+        pml_datatype->size_shift = __builtin_ctzll(size);
+#else
+        size_t        lsize = size >> 1;
+        pml_datatype->size_shift = 0;
+        while ( lsize ) pml_datatype->size_shift++, lsize >>= 1;
+#endif
+</code>
+This issue has been noted by the Open MPI developers.  A patch has been made and is slated for inclusion in the 4.1.6 and 5.0.0 releases of Open MPI.
+===== Changes on DARWIN =====
+The source code for all versions and variants of Open MPI in the 4.x release sequence has been patched according to the information above.  All were recompiled and reinstalled on April 25, 2023 to mitigate this issue.
+DARWIN users who experienced problems with their MPI code are encouraged to determine whether the send or broadcast of a single 64-bit integer or double-precision floating-point value may have been involved.  Even if that cannot be determined, rerunning failed programs that use these MPI libraries may now produce correct results.