====== Building GCC 12.2 with OpenACC Offload ======

The GNU compiler suite includes an implementation of the OpenACC standard.  Offload is supported to AMD and NVIDIA GPU devices (as well as a single-thread host CPU fallback).

<WRAP center round important 60%>
The GNU OpenACC implementation is still a work in progress and is very likely not the most efficient OpenACC runtime for your code.  The NVIDIA HPC SDK (known as the Portland Group compilers before the company was purchased by NVIDIA) seems to be generally considered the most feature-complete and performant OpenACC implementation available.
</WRAP>

Building a GCC compiler toolchain with one or more of the accelerator engines is procedurally similar to building alternative-architecture cross-compile functionality.  Tools are built for each accelerator before a final host-native build of GCC -- with the accelerator architectures included -- completes the process.

This document summarizes that build procedure, then provides examples of using the compiler to build a simplistic OpenACC program for execution on NVIDIA and AMD accelerators.
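
All of the builds below share a single installation prefix; for orientation, the hierarchy that emerges looks roughly like this (a sketch assembled from the paths used throughout this document):

<code>
/opt/shared/gcc/12.2.0-openacc/     # installation prefix shared by all build passes
    src/                            # unpacked sources: gcc-12.2.0, nvptx-tools, newlib, llvm-project
    bin/                            # host compilers and nvptx-none-* tools
    nvptx-none/                     # NVIDIA accelerator toolchain components
    amdgcn-amdhsa/                  # AMD accelerator toolchain components
</code>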

===== Build NVIDIA tools =====

The NVIDIA offload engine uses the PTX (Parallel Thread Execution) infrastructure provided by NVIDIA.  PTX is a virtual execution environment and associated ISA (Instruction Set Architecture) for dispatching multiple execution threads to NVIDIA devices.  The CUDA tools include the PTX assembler, ''ptxas'', which completes the compilation of PTX threads on the execution host.

The following software packages are required for this build:

  * GCC 12.2.0 source
  * GCC support libraries in-tree (GMP, ISL, MPC, MPFR)
  * nvptx-tools (https://github.com/MentorEmbedded/nvptx-tools)
  * newlib (git://sourceware.org/git/newlib-cygwin.git)
  * CUDA SDK

On DARWIN, binutils 2.35 was used in preference to the native 2.27 version provided by the OS.

==== Unpacking the source ====

A directory hierarchy was set up and existing downloads of the source were unpacked:

<code bash>
$ mkdir -p /opt/shared/gcc/12.2.0-openacc/src
$ cd /opt/shared/gcc/12.2.0-openacc/src
$ tar -xf ../../attic/gcc-12.2.0.tar.gz
$ git clone https://github.com/MentorEmbedded/nvptx-tools nvptx-tools
$ git clone git://sourceware.org/git/newlib-cygwin.git newlib
$ cd gcc-12.2.0
$ for x in gmp-6.1.2.tar.bz2 isl-0.24.tar.bz2 mpc-1.2.1.tar.gz mpfr-4.1.0.tar.bz2 ; do tar -xf ../../../attic/${x}; done
$ ln -s gmp-6.1.2 gmp
$ ln -s isl-0.24 isl
$ ln -s mpc-1.2.1 mpc
$ ln -s mpfr-4.1.0 mpfr
$ ln -s ../newlib/newlib .
</code>
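
The symlinks matter:  GCC's build machinery looks for directories named ''gmp'', ''isl'', ''mpc'', ''mpfr'', and ''newlib'' inside its source tree.  A quick listing confirms they resolve as intended:

<code bash>
$ ls -ld gmp isl mpc mpfr newlib
</code>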

The AMD components will later be unpacked in this same hierarchy.

==== Compile and install ====

First, the nvptx-tools must be built:

<code bash>
$ cd /opt/shared/gcc/12.2.0-openacc/src/nvptx-tools
$ vpkg_require cuda/11.6.2 binutils/2.35
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding package `binutils/2.35.1` to your environment

$ ./configure --prefix=/opt/shared/gcc/12.2.0-openacc

$ make -j 4

$ make install

$ cd ..
$ ls -l ../bin
total 424
lrwxrwxrwx 1 frey swmgr     34 Apr 13 14:46 nvptx-none-ar -> /opt/shared/binutils/2.35.1/bin/ar
-rwxr-xr-x 1 frey swmgr 203320 Apr 13 14:46 nvptx-none-as
-rwxr-xr-x 1 frey swmgr 181328 Apr 13 14:46 nvptx-none-ld
-rwxr-xr-x 1 frey swmgr 168032 Apr 13 14:46 nvptx-none-nm
lrwxrwxrwx 1 frey swmgr     38 Apr 13 14:46 nvptx-none-ranlib -> /opt/shared/binutils/2.35.1/bin/ranlib
-rwxr-xr-x 1 frey swmgr  39344 Apr 13 14:46 nvptx-none-run
-rwxr-xr-x 1 frey swmgr    909 Apr 13 14:46 nvptx-none-run-single
</code>
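
The nvptx-tools installer should also populate un-prefixed copies of the tools under ''nvptx-none/bin'' in the prefix; the GCC configure step below points at that directory via ''--with-build-time-tools'', so it is worth confirming it exists before proceeding:

<code bash>
$ ls /opt/shared/gcc/12.2.0-openacc/nvptx-none/bin
</code>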

With the tools built and installed, the GCC tools can be built:

<code bash>
$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0
$ mkdir build-nvptx
$ cd build-nvptx
$ GCC_TARGET=$(../config.guess)
$ ../configure --prefix=/opt/shared/gcc/12.2.0-openacc --target=nvptx-none --with-build-time-tools=/opt/shared/gcc/12.2.0-openacc/nvptx-none/bin --enable-as-accelerator-for=$GCC_TARGET --disable-sjlj-exceptions --enable-newlib-io-long-long

$ make -j 4

$ make install
</code>
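
Because of ''--enable-as-accelerator-for'', the offload compiler is installed under a name embedding the host triplet (e.g. ''x86_64-pc-linux-gnu-accel-nvptx-none-gcc'' -- the exact name depends on ''$GCC_TARGET'').  A quick spot-check that the tools landed in the prefix:

<code bash>
$ ls /opt/shared/gcc/12.2.0-openacc/bin | grep nvptx
</code>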

If successful, the NVIDIA PTX cross-compile tools are now present.

===== Build AMD tools =====

AMD Graphics Core Next (GCN) is similar to NVIDIA PTX:  coprocessing hardware with its own ISA.  The host environment's interface to the device is described by the Heterogeneous System Architecture (HSA).  These acronyms are mentioned here primarily because they will come up in the context of building and using the AMD OpenACC offload engine.

Sadly, AMD has not released its own low-level development tools (assembler, etc.), but LLVM provides them.  So the first step is a minimal build of LLVM for GCN to obtain those tools.

<code bash>
$ cd /opt/shared/gcc/12.2.0-openacc/src
$ wget 'https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-13.0.1.tar.gz'
$ tar -xf llvmorg-13.0.1.tar.gz
$ rm llvmorg-13.0.1.tar.gz
$ cd gcc-12.2.0
$ mkdir -p build-amdgcn/llvm
$ cd build-amdgcn/llvm
$ vpkg_require cmake
Adding package `cmake/3.21.4` to your environment

$ cmake -D 'LLVM_TARGETS_TO_BUILD=X86;AMDGPU' -D LLVM_ENABLE_PROJECTS=lld ../../../llvm-project-llvmorg-13.0.1/llvm
   :
CMake Error at cmake/modules/CheckCompilerVersion.cmake:38 (message):
  Host GCC version must be at least 5.1, your version is 4.8.5.
   :

$ vpkg_require gcc/7.3
Adding package `gcc/7.3.0` to your environment

$ CC=$(which gcc) CXX=$(which g++) cmake -D 'LLVM_TARGETS_TO_BUILD=X86;AMDGPU' -D LLVM_ENABLE_PROJECTS=lld ../../../llvm-project-llvmorg-13.0.1/llvm

$ make -j 4
</code>

The results are **not** installed!  Instead, only the necessary components are symlinked into place by hand:

<code bash>
$ cd /opt/shared/gcc/12.2.0-openacc
$ mkdir -p amdgcn-amdhsa/bin
$ cd amdgcn-amdhsa/bin
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-ar ar
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-ar ranlib
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-mc as
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-nm nm
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/lld ld
</code>
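
LLVM's ''llvm-mc'' stands in for the assembler here.  As a quick sanity check that the symlinks resolve to working binaries (these LLVM tools accept ''--version''):

<code bash>
$ ./as --version
$ ./nm --version
</code>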

Finally, the AMD GCC tools can be built and installed (''GCC_TARGET'' is assumed to still be set from the NVIDIA build above):

<code bash>
$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0/build-amdgcn
$ mkdir gcc
$ cd gcc
$ ../../configure --target=amdgcn-amdhsa --enable-languages=c,lto,fortran --disable-sjlj-exceptions --with-newlib --enable-as-accelerator-for=${GCC_TARGET} --with-build-time-tools=/opt/shared/gcc/12.2.0-openacc/amdgcn-amdhsa/bin --disable-libquadmath --prefix=/opt/shared/gcc/12.2.0-openacc

$ make -j 4

$ make install
</code>
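
As with the NVIDIA pass, a spot-check confirms the AMD offload compiler (named with the host triplet and an ''accel-amdgcn-amdhsa'' infix) was installed:

<code bash>
$ ls /opt/shared/gcc/12.2.0-openacc/bin | grep amdgcn
</code>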

The AMD tools are done.  Time to do the host build of GCC.

===== Build host tools =====

With the two accelerator tool sets built and installed, the host compiler can be produced.

<code bash>
$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0
$ mkdir build-host
$ cd build-host
$ ../configure --build=${GCC_TARGET} --host=${GCC_TARGET} --target=${GCC_TARGET} --enable-offload-targets=nvptx-none=/opt/shared/gcc/12.2.0-openacc/nvptx-none,amdgcn-amdhsa=/opt/shared/gcc/12.2.0-openacc/amdgcn-amdhsa --with-cuda-driver-include=${CUDA_HOME}/include --with-cuda-driver-lib=${CUDA_HOME}/targets/x86_64-linux/lib/stubs --prefix=/opt/shared/gcc/12.2.0-openacc --disable-multilib --disable-bootstrap

$ make -j 4

$ make install
</code>
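
A quick way to confirm both offload targets were compiled in is to ask the new driver for its configuration; when offloading is enabled, ''gcc -v'' should report an ''OFFLOAD_TARGET_NAMES'' line similar to:

<code bash>
$ /opt/shared/gcc/12.2.0-openacc/bin/gcc -v 2>&1 | grep -i offload
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
</code>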

The GCC 12.2.0 toolchain with AMD and NVIDIA OpenACC offload is now complete.

===== VALET definitions =====

The GCC 12.2.0 compiler produced this way does not always require additional CUDA or ROCm software in its environment:  if you are building software with just the host compiler, the OpenACC components are not even used.  It can be assumed, though, that the runtime environment for software compiled with offload enabled will require the GCC 12.2.0 compiler and CUDA or ROCm libraries, so three variants will be configured in VALET:

  * The 12.2.0 compiler without dependency on a CUDA or ROCm library
  * The 12.2.0 compiler with dependency on **any** CUDA library
  * The 12.2.0 compiler with dependency on **any** ROCm library

<code yaml>
    "12.2.0:openacc":
      prefix: 12.2.0-openacc
      description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (no specific offload)
      actions:
        - variable: CXX17
          action:   set
          contexts: development
          value:    g++
      dependencies:
        - binutils/2.35
    "12.2.0:openacc,amd":
      prefix: 12.2.0-openacc
      description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (AMD offload)
      actions:
        - variable: CXX17
          action:   set
          contexts: development
          value:    g++
      dependencies:
        - binutils/2.35
        - amd-rocm/=>4
    "12.2.0:openacc,nvidia":
      prefix: 12.2.0-openacc
      description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (NVIDIA offload)
      actions:
        - variable: CXX17
          action:   set
          contexts: development
          value:    g++
      dependencies:
        - binutils/2.35
        - cuda/=>9
</code>

A CUDA or ROCm library (or both!) can be loaded into the runtime environment before loading GCC 12.2.0 -- this is the recommended mode of operation.  The library-specific variants are present purely as a convenience.

<code bash>
$ vpkg_require cuda/11.6.2 amd-rocm/5.4.3 gcc/12.2.0:openacc
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding package `amd-rocm/5.4.3` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment
</code>
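
Once loaded, the toolchain should resolve ahead of the system compiler (path shown for illustration):

<code bash>
$ which gcc
/opt/shared/gcc/12.2.0-openacc/bin/gcc
</code>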

===== Example program =====

The following program includes simple OpenACC directives on the matrix multiplication loop:

<file c matmul.c>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openacc.h>

#ifndef N
#define N 10
#endif

int
main()
{
    int              n = N;
    static float     a[N*N], b[N*N], c[N*N];
    int              i, j, k;

    #define ELEMENT(M, I, J) (M)[(J) * N + (I)]

    for ( i=1; i<=n; i++ ) {
        for ( j=1; j<=n; j++ ) {
            ELEMENT(a, i-1, j-1) = (float)(i * j);
            ELEMENT(b, i-1, j-1) = (float)(j * i);
        }
    }

#if _OPENACC
    acc_init(acc_device_not_host);
#endif

    printf("START GEMM()...\n"); fflush(stdout);
#pragma omp parallel for shared(n, a, b, c)
#pragma acc kernels loop
    for ( i=0; i<n; i++ ) {
    #pragma acc loop
        for ( j=0; j<n; j++ ) {
            float t = 0.0f;

        #pragma acc loop reduction (+:t)
            for ( k=0; k<n; k++ ) t += ELEMENT(a, i, k) * ELEMENT(b, k, j);
            ELEMENT(c, i, j) = t;
        }
    }
    printf("...GEMM() DONE\n");
    printf("{ %f %f %f ... }\n", c[0], c[n], c[2*n]);
    return 0;
}
</file>

Note the ''static'' storage class on the arrays:  without it, the program crashes with a segmentation fault at larger ''N'' because the arrays exceed the OS stack size limit:

<code bash>
$ ...edit matmul.c to remove 'static' storage class on arrays...
$ gcc -o matmul matmul.c -DN=1000
$ ./matmul
Segmentation fault (core dumped)

$ ulimit -s
8192

$ ulimit -s unlimited

$ ./matmul
START GEMM()...
...GEMM() DONE
{ 333833152.000000 667666304.000000 1001501248.000000 ... }
</code>
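
For comparison, the same source can also be built with OpenACC enabled but offload explicitly disabled, forcing the single-thread host fallback (''-foffload=disable'' is a standard GCC option):

<code bash>
$ gcc -fopenacc -foffload=disable -O2 matmul.c -o matmul -DN=1000
$ ./matmul
</code>
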
==== NVIDIA offload ====

The program can be compiled for NVIDIA GPU offload:

<code bash>
$ vpkg_require cuda/11.6.2 gcc/12.2.0:openacc
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment

$ gcc -fopenacc -foffload=nvptx-none -O2 matmul.c -o matmul -DN=1000
</code>

The resulting program can be run on a DARWIN T4 node:

<code bash>
$ salloc --partition=gpu-t4 --gpus=1

[r1t04]$ vpkg_require cuda/11.6.2 gcc/12.2.0:openacc
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment

[r1t04]$ ACC_DEVICE_TYPE=nvidia ACC_DEVICE_NUM=0 GOMP_DEBUG=1 ./matmul
GOMP_NVPTX_JIT: <Not defined>
Loading:
---
// BEGIN PREAMBLE
.version 6.0
.target sm_30
   :
@ %r63 bra.uni $L5;
$L1:
ret;
}

---
Linking
Link complete: 0.000000ms
Link log info    : 0 bytes gmem
info    : Function properties for 'main$_omp_fn$0':
info    : used 35 registers, 0 stack, 0 bytes smem, 360 bytes cmem[0], 0 bytes lmem
START GEMM()...
GOACC_parallel_keyed: mapnum=4, hostaddrs=0x7ffdd24f0cd0, size=0x4040a0, kinds=0x404080
  GOMP_OFFLOAD_openacc_exec: prepare mappings
  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32
  nvptx_exec: kernel main$_omp_fn$0: finished
...GEMM() DONE
{ 333833152.000000 667666304.000000 1001501248.000000 ... }
</code>

==== AMD offload ====

The program can be compiled for AMD GPU offload:

<code bash>
$ vpkg_rollback all
$ vpkg_require amd-rocm/5.4.3 gcc/12.2.0:openacc
Adding package `amd-rocm/5.4.3` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment

$ gcc -fopenacc -foffload=amdgcn-amdhsa -O2 matmul.c -o matmul -foffload=-march=gfx908
</code>

<WRAP center round important 60%>
Without the ''-foffload=-march=gfx908'' flag, the compiler produced code that would not execute on the AMD MI100 in DARWIN.  The desired architecture was determined by executing the ''rocminfo'' command on the MI100 node and searching for "gfx" in the output.
</WRAP>
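
For reference, that query looks something like this on the MI100 node (output abbreviated and illustrative):

<code bash>
[r0m01]$ rocminfo | grep gfx
  Name:                    gfx908
</code>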

The resulting program can be run on the DARWIN MI100 node:

<code bash>
$ salloc --partition=gpu-mi100 --gpus=1

[r0m01]$ vpkg_require amd-rocm/5.4.3 gcc/12.2.0:openacc
Adding package `amd-rocm/5.4.3` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment

[r0m01]$ ACC_DEVICE_TYPE=gcn ACC_DEVICE_NUM=0 GOMP_DEBUG=1 ./matmul
libgomp: while loading libgomp-plugin-nvptx.so.1: libcuda.so.1: cannot open shared object file: No such file or directory
START GEMM()...
GOACC_parallel_keyed: mapnum=4, hostaddrs=0x7ffda87a5680, size=0x4040a0, kinds=0x404080
...GEMM() DONE
{ 385.000000 770.000000 1155.000000 ... }
</code>

The message about the GNU OpenMP (GOMP) NVPTX plugin failing to load appears because no CUDA library is present and the runtime tries to load //every// plugin available to it.  It can safely be ignored for host and AMD GCN OpenACC builds.