The UCX changes implemented in Open MPI 4.1.0 have not been backported to any of the older releases of the software, so this issue persists in those releases.  The issue arises when attempting large data broadcasts, for example:
<code fortran>
      Allocate(Rmat(2,144259970))

      Call MPI_Bcast(Rmat, 2*144259970, MPI_REAL, 0, MPI_COMM_WORLD, mpierr)
</code>
The array in question is 1154079760 bytes (2 × 144259970 single-precision elements at 4 bytes each), slightly over 1.0 GiB.

The OpenFabrics Alliance produces a unified messaging library similar to UCX, called OFI or libfabric.  The OFI library supports Omni-Path PSM2 and a low-level InfiniBand verbs module, so we have used it under many of the Open MPI variants provided on Caviness.  The OFI library seemed a logical addition to our pre-4.1.0 DARWIN Open MPI builds as an alternative to UCX for large messages.
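Transport selection of this kind is typically expressed with MCA parameters when the job is launched.  One way it is commonly written is sketched below (a sketch, not a transcript of our test; ''./my_program'' is a placeholder, and choosing the ''cm'' PML, which hosts the MTL modules, implicitly bypasses the UCX PML):
<code bash>
# Select the cm PML and its OFI (libfabric) MTL in place of UCX:
mpirun --mca pml cm --mca mtl ofi ./my_program
</code>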

Unfortunately, the first test of the code above (with the UCX transport disabled and the OFI transport enabled) failed:
<code>
Message size 1154079760 bigger than supported by selected transport. Max = 1073741824
[r2x00:21827] *** An error occurred in MPI_Bcast
</code>
The size cited is exactly the size of the ''Rmat'' matrix.  Further investigation revealed that this is a known issue in the Open MPI MTL (Matching Transport Layer): large memory regions are not broken into chunks, but instead fail outright when they exceed the maximum message size allowed by the chosen transport interface.  The Mellanox InfiniBand network interface indeed has a maximum message size of 1.0 GiB (1073741824 bytes).  The Open MPI developers admit that automatic "chunking" of a too-large buffer (making use of the maximum allowable size returned by OFI) is probably feasible and appropriate, but it has not yet been implemented in the MTL modules.

The implication is that Open MPI releases prior to 4.1.0 cannot internally handle an ''MPI_Bcast()'' in excess of 1.0 GiB on DARWIN.  Explicitly re-enabling the ''openib'' BTL and disabling the UCX and OFI modules may address the issue, at the cost of forcing the use of a now-deprecated, less-efficient mechanism.
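As a sketch only — we have not benchmarked this fallback, and ''./my_program'' is again a placeholder — such a selection would typically look like:
<code bash>
# Force the ob1 PML with the deprecated openib BTL, bypassing UCX and OFI.
# Under Open MPI 4.0.x the openib BTL must also be explicitly permitted
# on InfiniBand hardware:
mpirun --mca pml ob1 --mca btl openib,vader,self \
       --mca btl_openib_allow_ib 1 \
       ./my_program
</code>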

===== Chunked Broadcast =====

The Open MPI MTL does not yet implement "chunking" of a large buffer into multiple smaller broadcasts, but it is relatively straightforward for the application developer to implement such functionality:

<file fortran mpi_utils.f90>
Module mpi_utils
    !
    ! This module implements MPI broadcast functions that break a buffer
    ! that may be too large into multiple chunks.  The <stride> passed
    ! to each of the subroutines dictates how many elements of the
    ! buffer's type should be passed to each underlying MPI_Bcast().
    !
    ! If the <stride> is zero, then the environment variable CONF_BROADCAST_MAX
    ! is checked for an integer value.  If no value is found, or the value
    ! exceeds the maximum threshold, the maximum threshold is used.  The
    ! maximum equates to 1 GiB worth of elements.
    !
    Implicit None

    Private

    Public :: BroadcastI, BroadcastR, BroadcastD

    ! 1 GiB worth of 8-byte elements; the 4-byte variants below may use
    ! twice this count:
    Integer, Parameter :: MaxStride8Byte = 134217728

  Contains

    Subroutine BroadcastI(buffer, count, stride, rank, comm, mpierr)
        Use mpi_f08
        Implicit None

        Type(MPI_Comm), Intent(In)                  :: comm
        Integer, Intent(In)                         :: count, stride, rank
        Integer, Dimension(*), Intent(InOut)        :: buffer
        Integer, Intent(InOut)                      :: mpierr
        Character(Len=255)                          :: envvar
        Integer                                     :: use_stride, count_remain, i

        If (stride .le. 0) Then
            Call GetEnv('CONF_BROADCAST_MAX', envvar)
            Read(envvar, '(i12)') use_stride
            If (use_stride .le. 0) use_stride = MaxStride8Byte * 2
        Else
            use_stride = stride
        End If
        If (use_stride .gt. MaxStride8Byte * 2) use_stride = MaxStride8Byte * 2
        count_remain = count
        i = 1
        mpierr = 0
        ! Broadcast full chunks of use_stride elements:
        Do While (count_remain .gt. use_stride)
            Call MPI_Bcast(buffer(i:i+use_stride-1), use_stride, MPI_INTEGER, rank, comm, mpierr)
            If (mpierr .ne. 0) Then
                Write(*,*) 'Failure broadcasting integer range ',i,':',i+use_stride-1
                Return
            End If
            count_remain = count_remain - use_stride
            i = i + use_stride
        End Do
        ! Broadcast the remaining count_remain (<= use_stride) elements:
        If (count_remain .gt. 0) Then
            Call MPI_Bcast(buffer(i:i+count_remain-1), count_remain, MPI_INTEGER, rank, comm, mpierr)
            If (mpierr .ne. 0) Then
                Write(*,*) 'Failure broadcasting integer range ',i,':',i+count_remain-1
                Return
            End If
        End If
    End Subroutine BroadcastI

    Subroutine BroadcastR(buffer, count, stride, rank, comm, mpierr)
        Use mpi_f08
        Implicit None

        Type(MPI_Comm), Intent(In)                  :: comm
        Integer, Intent(In)                         :: count, stride, rank
        Real, Dimension(*), Intent(InOut)           :: buffer
        Integer, Intent(InOut)                      :: mpierr
        Character(Len=255)                          :: envvar
        Integer                                     :: use_stride, count_remain, i

        If (stride .le. 0) Then
            Call GetEnv('CONF_BROADCAST_MAX', envvar)
            Read(envvar, '(i12)') use_stride
            If (use_stride .le. 0) use_stride = MaxStride8Byte * 2
        Else
            use_stride = stride
        End If
        If (use_stride .gt. MaxStride8Byte * 2) use_stride = MaxStride8Byte * 2
        count_remain = count
        i = 1
        mpierr = 0
        Do While (count_remain .gt. use_stride)
            Call MPI_Bcast(buffer(i:i+use_stride-1), use_stride, MPI_REAL, rank, comm, mpierr)
            If (mpierr .ne. 0) Then
                Write(*,*) 'Failure broadcasting real range ',i,':',i+use_stride-1
                Return
            End If
            count_remain = count_remain - use_stride
            i = i + use_stride
        End Do
        If (count_remain .gt. 0) Then
            Call MPI_Bcast(buffer(i:i+count_remain-1), count_remain, MPI_REAL, rank, comm, mpierr)
            If (mpierr .ne. 0) Then
                Write(*,*) 'Failure broadcasting real range ',i,':',i+count_remain-1
                Return
            End If
        End If
    End Subroutine BroadcastR

    Subroutine BroadcastD(buffer, count, stride, rank, comm, mpierr)
        Use mpi_f08
        Use, Intrinsic :: iso_fortran_env
        Implicit None

        Type(MPI_Comm), Intent(In)                  :: comm
        Integer, Intent(In)                         :: count, stride, rank
        Real(real64), Dimension(*), Intent(InOut)   :: buffer
        Integer, Intent(InOut)                      :: mpierr
        Character(Len=255)                          :: envvar
        Integer                                     :: use_stride, count_remain, i

        If (stride .le. 0) Then
            Call GetEnv('CONF_BROADCAST_MAX', envvar)
            Read(envvar, '(i12)') use_stride
            If (use_stride .le. 0) use_stride = MaxStride8Byte
        Else
            use_stride = stride
        End If
        If (use_stride .gt. MaxStride8Byte) use_stride = MaxStride8Byte
        count_remain = count
        i = 1
        mpierr = 0
        Do While (count_remain .gt. use_stride)
            Call MPI_Bcast(buffer(i:i+use_stride-1), use_stride, MPI_DOUBLE_PRECISION, rank, comm, mpierr)
            If (mpierr .ne. 0) Then
                Write(*,*) 'Failure broadcasting double precision range ',i,':',i+use_stride-1
                Return
            End If
            count_remain = count_remain - use_stride
            i = i + use_stride
        End Do
        If (count_remain .gt. 0) Then
            Call MPI_Bcast(buffer(i:i+count_remain-1), count_remain, MPI_DOUBLE_PRECISION, rank, comm, mpierr)
            If (mpierr .ne. 0) Then
                Write(*,*) 'Failure broadcasting double precision range ',i,':',i+count_remain-1
                Return
            End If
        End If
    End Subroutine BroadcastD

End Module mpi_utils
</file>
The ''MaxStride8Byte'' constant represents 1.0 GiB worth of 8-byte entities (134217728 × 8 bytes = 1073741824 bytes); a double-precision real on x86_64 is 8 bytes wide, so ''BroadcastD'' uses this value directly, while the 4-byte integer and real variants may use twice the count.

Using this module, the previous code is transformed to:
<code fortran>
      Use mpi_utils
        :
      Allocate(Rmat(2,144259970))
        :
      Call BroadcastR(Rmat, 2*144259970, 0, 0, MPI_COMM_WORLD, mpierr)
</code>
The data is now broadcast as a 268435456-element chunk (exactly the 1.0 GiB maximum for 4-byte reals) followed by a 20084484-element chunk, both within the message-size limit associated with the OFI MTL.  The efficiency of the MTL should far outweigh the overhead involved in the "chunking," making this an effective broadcast mechanism for older Open MPI releases on DARWIN.
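The chunk size can also be tuned without recompiling by way of the ''CONF_BROADCAST_MAX'' environment variable that the module consults.  A hypothetical launch (''./my_program'' is a placeholder; ''-x'' is Open MPI's flag for exporting an environment variable to all ranks):
<code bash>
# Limit each underlying MPI_Bcast() to 64 Mi elements:
export CONF_BROADCAST_MAX=67108864
mpirun -x CONF_BROADCAST_MAX ./my_program
</code>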