technical:generic:openmpi-4-ucx-issue [2023-04-25 18:39] – [How is it fixed] frey | technical:generic:openmpi-4-ucx-issue [2024-12-05 10:47] (current) – [How is it fixed] frey | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Use of UCX PML in Open MPI 4.x on DARWIN ====== | ||
+ | This document explores a bug in releases of Open MPI 4.x prior to 4.1.6. | ||
+ | |||
+ | ===== What is UCX ===== | ||
+ | |||
+ | UCX is a communications library that provides an abstract transport interface on top of multiple protocols and devices. | ||
+ | |||
+ | The Open MPI libraries built and maintained by IT RCI on DARWIN include several components that make use of UCX for accelerated data movement: | ||
+ | |||
+ | * the UCX Point-to-point Management Layer (PML) component | ||
+ | * the UCX One-Sided Communications (OSC) component | ||
+ | |||
+ | By default, the Open MPI libraries are configured to use the UCX PML. | ||
+ | |||
+ | ===== What is the issue ===== | ||
+ | |||
+ | While testing a workgroup' | ||
+ | |||
+ | <code fortran> | ||
+ | Integer (kind=int64) :: n_threshold | ||
+ | : | ||
+ | If (rank == 0) then | ||
+ | Call GetEnvVarInteger8(' | ||
+ | End If | ||
+ | Call MPI_Bcast(n_threshold, | ||
+ | Write(*,*) rank, n_threshold | ||
+ | </code> | ||
+ | |||
+ | produced the following output: | ||
+ | |||
+ | <code> | ||
+ | | ||
+ | 1 1133871376384 | ||
+ | 2 1133871376384 | ||
+ | : | ||
+ | </code> | ||
+ | |||
+ | At first it looked like the data were NOT broadcast to the other ranks, but '' | ||
+ | |||
+ | |10240| '' | ||
+ | |1133871376384| '' | ||
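The two values in the table can be compared word by word. The following is a minimal sketch, assuming that 10240 is the value held on rank 0 and 1133871376384 is the value printed by the other ranks; the helper names `low32`, `high32`, and `check_low_words_match` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a 64-bit value. */
uint32_t low32(uint64_t v)  { return (uint32_t)(v & 0xFFFFFFFFu); }

/* High 32 bits of a 64-bit value. */
uint32_t high32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Values copied from the table above (assumed roles: rank 0 vs. other ranks). */
const uint64_t sent_value     = 10240ULL;
const uint64_t received_value = 1133871376384ULL;

/* The low words agree; only the high words differ. */
void check_low_words_match(void)
{
    assert(low32(received_value) == low32(sent_value));
    assert(high32(received_value) != high32(sent_value));
}
```

Since 1133871376384 = 264 × 2^32 + 10240, the low 32-bit word matches exactly, which is what would be expected if only 4 of the 8 bytes had been overwritten on the receiving ranks.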
+ | |||
+ | The lowest 32 bits of the 64-bit variable //have// received the lowest 32 bits of the '' | ||
+ | |||
+ | * Changing this code to send the variable type-cast as an array of TWO 32-bit integers (totaling 64 bits of data) succeeded | ||
+ | * Changing this code to make '' | ||
+ | * just the first element of the array (ONE 64-bit integer) failed | ||
+ | * both elements of the array (TWO 64-bit integers) succeeded | ||
+ | |||
+ | In fact, a test program also showed that sending ONE double-precision floating-point value (type '' | ||
+ | |||
+ | Eventually, debugging demonstrated that the UCX PML, when registering a UCX-native datatype to be associated with an MPI-native datatype, was producing an incorrect byte size for any 8-byte type. In the Open MPI 4.x code, an optimization had been added that uses a bit shift instead of multiplication when the size is a power of 2 (1, 2, 4, 8, 16, etc.). | ||
+ | |||
+ | <code c> | ||
+ | pml_datatype-> | ||
+ | </code> | ||
+ | |||
+ | Mathematically, that expression is exact, but floating-point arithmetic isn't always exact. | ||
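To illustrate why an inexact floating-point result matters here, the sketch below assumes that the shift is obtained by converting a floating-point value to an integer, which in C truncates toward zero. The value 2.9999999999999996 is a hypothetical stand-in for a slightly-too-small result, not a measured one, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A C cast from double to int discards the fractional part. */
int truncate_to_int(double x) { return (int)x; }

/* Byte length of `count` elements when the element size is stored as a shift. */
size_t bytes_from_shift(size_t count, int size_shift)
{
    return count << size_shift;
}

void demo_truncation(void)
{
    /* An exact result of 3 gives the right shift for an 8-byte type. */
    assert(truncate_to_int(3.0) == 3);
    assert(bytes_from_shift(1, 3) == 8);

    /* A result just below 3 (hypothetical) truncates to 2 ... */
    assert(truncate_to_int(2.9999999999999996) == 2);
    /* ... so one 8-byte element would be described as only 4 bytes. */
    assert(bytes_from_shift(1, 2) == 4);
}
```

A 4-byte length for a single 8-byte element is consistent with the symptom described earlier, where only the lowest 32 bits of the 64-bit variable arrived.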
+ | |||
+ | ==== How is it fixed ==== | ||
+ | |||
+ | Rather than using floating-point mathematics, | ||
+ | |||
+ | <code c> | ||
+ | int ctz(unsigned int v) | ||
+ | { | ||
+ | int l; | ||
+ | | ||
+ | if ( v == 0 ) return 8 * sizeof(v); | ||
+ | l = 0; | ||
+ | while ( (v & 1) == 0 ) l++, v >>= 1; | ||
+ | return l; | ||
+ | } | ||
+ | </code> | ||
+ | |||
+ | Adding the precondition that '' | ||
+ | |||
+ | <code c> | ||
+ | int ctz(unsigned int v) | ||
+ | { | ||
+ | int l = 0; | ||
+ | | ||
+ | while ( (v & 1) == 0 ) l++, v >>= 1; | ||
+ | return l; | ||
+ | } | ||
+ | </code> | ||
+ | |||
+ | and if '' | ||
+ | |||
+ | <code c> | ||
+ | int ctz(unsigned int v) | ||
+ | { | ||
+ | int l = -1; | ||
+ | | ||
+ | do { l++, v >>= 1; } while (v); | ||
+ | return l; | ||
+ | } | ||
+ | </code> | ||
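The three variants above can be checked against each other. This sketch renames them (`ctz_general`, `ctz_nonzero`, and `ctz_pow2` are names chosen here so that all three can live in one file) and confirms that they agree on every power of two, while the last one is only valid under its power-of-two precondition.

```c
#include <assert.h>

/* General form: returns the bit width of the type when v == 0. */
int ctz_general(unsigned int v)
{
    int l = 0;
    if (v == 0) return (int)(8 * sizeof(v));
    while ((v & 1) == 0) { l++; v >>= 1; }
    return l;
}

/* Precondition: v != 0. */
int ctz_nonzero(unsigned int v)
{
    int l = 0;
    while ((v & 1) == 0) { l++; v >>= 1; }
    return l;
}

/* Precondition: v is a power of two (exactly one bit set). */
int ctz_pow2(unsigned int v)
{
    int l = -1;
    do { l++; v >>= 1; } while (v);
    return l;
}

/* All three agree on 1, 2, 4, ..., 2^30. */
void check_variants(void)
{
    for (int s = 0; s <= 30; s++) {
        unsigned int v = 1u << s;
        assert(ctz_general(v) == s);
        assert(ctz_nonzero(v) == s);
        assert(ctz_pow2(v) == s);
    }
}
```

Outside its precondition the last variant counts the position of the highest set bit instead: for 12 (binary 1100) it returns 3, while the other two return 2.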
+ | |||
+ | The GCC compiler implements a '' | ||
+ | |||
+ | <code c> | ||
+ | #if OPAL_C_HAVE_BUILTIN_CTZ | ||
+ | pml_datatype-> | ||
+ | #else | ||
+ | size_t | ||
+ | pml_datatype-> | ||
+ | while ( lsize ) pml_datatype-> | ||
+ | #endif | ||
+ | </code> | ||
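The same builtin-or-fallback pattern can be written as a standalone helper. This is a minimal sketch, not the Open MPI patch itself: `size_to_shift` is a hypothetical name, and the `__GNUC__` test stands in for the `OPAL_C_HAVE_BUILTIN_CTZ` configure-time macro, which is not available outside the Open MPI build.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the shift s such that (count << s) == count * size.
 * Precondition: size is a nonzero power of two.
 */
int size_to_shift(size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
#if defined(__GNUC__)
    /* GCC and compatible compilers: count trailing zero bits directly. */
    return __builtin_ctzl((unsigned long)size);
#else
    /* Portable fallback: count how many times size can be halved. */
    int shift = 0;
    while (size >>= 1) shift++;
    return shift;
#endif
}
```

With integer arithmetic only, `size_to_shift(8)` is 3 on every platform, so five 8-byte elements are described as `5 << 3`, or 40 bytes.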
+ | |||
+ | This issue has been noted by the Open MPI developers. | ||
+ | |||
+ | ===== Changes on DARWIN ===== | ||
+ | |||
+ | The source code for all versions and variants of Open MPI in the 4.x release sequence has been patched according to the information above. | ||
+ | |||
+ | DARWIN users who experienced problems with their MPI code are encouraged to try to determine if the send/ |