Building GCC 12.2 with OpenACC Offload
The GNU compiler suite includes an implementation of the OpenACC standard. Offload is supported to AMD and NVIDIA GPU devices (as well as single-thread host CPU fallback).
The GNU OpenACC implementation is currently still a work in progress and is very likely not the most efficient OpenACC runtime for your code. The NVIDIA HPC SDK (known as Portland Group compilers before they were purchased by NVIDIA) seems to generally be considered the most feature-complete and performant OpenACC implementation available.
Building a GCC compiler toolchain with one or more of the accelerator engines is procedurally similar to building alternative architecture cross-compile functionality. Tools are built for each accelerator before a final host-native build of GCC – with the accelerator architectures included – completes the process.
This document summarizes that build procedure then provides examples of using the compiler to build a simplistic OpenACC program for execution on NVIDIA and AMD accelerators.
Build NVIDIA tools
The NVIDIA offload engine uses the PTX (Parallel Thread Execution) infrastructure provided by NVIDIA. PTX is a virtual execution environment and associated ISA (Instruction Set Architecture) for dispatching multiple execution threads to NVIDIA devices. The CUDA tools include the PTX Assembler, ptxas
, that completes the compilation of PTX threads on the execution host.
The following software packages are required for this build:
- GCC 12.2.0 source
- GCC support libraries in-tree (GMP, ISL, MPC, MPFR)
- nvptx-tools (https://github.com/MentorEmbedded/nvptx-tools)
- newlib (git:sourceware.org/git/newlib-cygwin.git) * CUDA SDK On DARWIN binutils 2.35 was used in preference to the native 2.27 version provided by the OS. ==== Unpacking the source ==== A directory hierarchy was setup and existing downloads of the source were unpacked: <code bash> $ mkdir -p /opt/shared/gcc/12.2.0-openacc/src $ cd /opt/shared/gcc/12.2.0-openacc/src $ tar -xf ../../attic/gcc-12.2.0.tar.gz $ git clone https://github.com/MentorEmbedded/nvptx-tools nvptx-tools $ git clone git:sourceware.org/git/newlib-cygwin.git newlib
$ cd gcc-12.2.0 $ for x in gmp-6.1.2.tar.bz2 isl-0.24.tar.bz2 mpc-1.2.1.tar.gz mpfr-4.1.0.tar.bz2 ; do tar -xf ../../../attic/${x} .; done $ ln -s gmp-6.1.2 gmp $ ln -s isl-0.24 isl $ ln -s mpc-1.2.1 mpc $ ln -s mpfr-4.1.0 mpfr $ ln -s ../newlib/newlib . </code>
The AMD components will later be unpacked in this same hierarchy.
Compile and install
First, the nvptx-tools must be built:
$ cd /opt/shared/gcc/12.2.0-openacc/src/nvptx-tools $ vpkg_require cuda/11.6.2 binutils/2.35 Adding package `cuda/11.6.2-510.47.03` to your environment Adding package `binutils/2.35.1` to your environment $ ./configure --prefix=/opt/shared/gcc/12.2.0-openacc $ make -j 4 $ make install $ cd .. $ ls -l ../bin total 424 lrwxrwxrwx 1 frey swmgr 34 Apr 13 14:46 nvptx-none-ar -> /opt/shared/binutils/2.35.1/bin/ar -rwxr-xr-x 1 frey swmgr 203320 Apr 13 14:46 nvptx-none-as -rwxr-xr-x 1 frey swmgr 181328 Apr 13 14:46 nvptx-none-ld -rwxr-xr-x 1 frey swmgr 168032 Apr 13 14:46 nvptx-none-nm lrwxrwxrwx 1 frey swmgr 38 Apr 13 14:46 nvptx-none-ranlib -> /opt/shared/binutils/2.35.1/bin/ranlib -rwxr-xr-x 1 frey swmgr 39344 Apr 13 14:46 nvptx-none-run -rwxr-xr-x 1 frey swmgr 909 Apr 13 14:46 nvptx-none-run-single
With the tools built and installed, the GCC tools can be built:
$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0 $ mkdir build-nvptx $ cd build-nvptx $ GCC_TARGET=$(../config.guess) $ ../configure --prefix=/opt/shared/gcc/12.2.0-openacc --target=nvptx-none --with-build-time-tools=/opt/shared/gcc/12.2.0-openacc/nvptx-none/bin --enable-as-accelerator-for=$GCC_TARGET --disable-sjlj-exceptions --enable-newlib-io-long-long $ make -j 4 $ make install
If successful, the NVIDIA PTX cross-compile tools are now present.
Build AMD tools
The AMD Graphics Core Next (GCN) is similar to NVIDIA PTX: coprocessing hardware with its own ISA. The interface of the host environment with the device is described by a Heterogeneous System Architecture (HSA). These acronyms are mentioned here primarily because they will come up in the context of building and using the AMD OpenACC offload engine.
Sadly, AMD has not released its own low-level development tools (assembler, etc.) while LLVM has produced them. So the first step is doing a minimal build of LLVM for GCN to get those tools.
$ cd /opt/shared/gcc/12.2.0-openacc/src $ wget 'https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-13.0.1.tar.gz' $ tar -xf llvmorg-13.0.1.tar.gz $ rm llvmorg-13.0.1.tar.gz $ cd gcc-12.2.0 $ mkdir -p build-amdgcn/llvm $ cd build-amdgcn/llvm $ vpkg_require cmake Adding package `cmake/3.21.4` to your environment $ cmake -D 'LLVM_TARGETS_TO_BUILD=X86;AMDGPU' -D LLVM_ENABLE_PROJECTS=lld ../../../llvm-project-llvmorg-13.0.1/llvm : CMake Error at cmake/modules/CheckCompilerVersion.cmake:38 (message): Host GCC version must be at least 5.1, your version is 4.8.5. : $ vpkg_require gcc/7.3 Adding package `gcc/7.3.0` to your environment $ CC=$(which gcc) CXX=$(which g++) cmake -D 'LLVM_TARGETS_TO_BUILD=X86;AMDGPU' -D LLVM_ENABLE_PROJECTS=lld ../../../llvm-project-llvmorg-13.0.1/llvm $ make -j 4
The results are not installed! Instead, only the necessary components are copied into place by hand:
$ cd /opt/shared/gcc/12.2.0-openacc $ mkdir -p amdgcn-amdhsa/bin $ cd amdgcn-amdhsa/bin $ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-ar ar $ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-ar ranlib $ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-mc as $ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-nm nm $ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/lld ld
Finally, the AMD GCC tools can be built and installed:
$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0/build-amdgcn $ mkdir gcc $ cd gcc $ ../../configure --target=amdgcn-amdhsa --enable-languages=c,lto,fortran --disable-sjlj-exceptions --with-newlib --enable-as-accelerator-for=${GCC_TARGET} --with-build-time-tools=/opt/shared/gcc/12.2.0-openacc/amdgcn-amdhsa/bin --disable-libquadmath --prefix=/opt/shared/gcc/12.2.0-openacc $ make -j 4 $ make install
The AMD tools are done. Time to do the host build of GCC.
Build host tools
With the two accelerator tool sets built and installed, the host compiler can be produced.
$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0 $ mkdir build-host $ cd build-host $ ../configure --build=${GCC_TARGET} --host=${GCC_TARGET} --target=${GCC_TARGET} --enable-offload-targets=nvptx-none=/opt/shared/gcc/12.2.0-openacc/nvptx-none,amdgcn-amdhsa=/opt/shared/gcc/12.2.0-openacc/amdgcn-amdhsa --with-cuda-driver-include=${CUDA_HOME}/include --with-cude-driver-lib=${CUDA_HOME}/targets/x86_64-linux/lib/stubs --prefix=/opt/shared/gcc/12.2.0-openacc --disable-multilib --disable-bootstrap $ make -j 4 $ make install
The GCC 12.2.0 toolchain with AMD and NVIDIA OpenACC offload is now complete.
VALET definitions
The GCC 12.2.0 compiler produced this way does not automatically and always require additional CUDA or ROCM software in its environment. If you are building software with just the host compiler then the OpenACC components will not even be used. It can be assumed that the runtime environment for software compiled with this toolchain will require the GCC 12.2.0 compiler and CUDA or ROCM libraries, so three variants will be configured in VALET:
- The 12.2.0 compiler without dependency on a CUDA or ROCM library
- The 12.2.0 compiler with dependency on any CUDA library
- The 12.2.0 compiler with dependency on any ROCM library
"12.2.0:openacc": prefix: 12.2.0-openacc description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (no specific offload) actions: - variable: CXX17 action: set contexts: development value: g++ dependencies: - binutils/2.35 "12.2.0:openacc,amd": prefix: 12.2.0-openacc description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (AMD offload) actions: - variable: CXX17 action: set contexts: development value: g++ dependencies: - binutils/2.35 - amd-rocm/=>4 "12.2.0:openacc,nvidia": prefix: 12.2.0-openacc description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (NVIDIA offload) actions: - variable: CXX17 action: set contexts: development value: g++ dependencies: - binutils/2.35 - cuda/=>9
A CUDA or ROCM library (or both!) can be loaded into the runtime environment before loading GCC 12.2.0 – this is the recommended mode of operation. The library-specific variants are present purely as a convenience.
$ vpkg_require cuda/11.6.2 amd-rocm/5.4.3 gcc/12.2.0:openacc Adding package `cuda/11.6.2-510.47.03` to your environment Adding package `amd-rocm/5.4.3` to your environment Adding dependency `binutils/2.35.1` to your environment Adding package `gcc/12.2.0:openacc` to your environment
Example program
The following program includes simple OpenACC directives on the matrix multiplication loop:
- matmul.c
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <openacc.h> #ifndef N #define N 10 #endif int main() { int n = N; static float a[N*N], b[N*N], c[N*N]; int i, j, k; #define ELEMENT(M, I, J) (M)[(J) * N + (I)] for ( i=1; i<=n; i++ ) { for ( j=1; j<=n; j++ ) { ELEMENT(a, i-1, j-1) = (float)(i * j); ELEMENT(b, i-1, j-1) = (float)(j * i); } } #if _OPENACC acc_init(acc_device_not_host); #endif printf("START GEMM()...\n"); fflush(stdout); #pragma omp parallel for shared(n, a, b, c) #pragma acc kernels loop for ( i=0; i<n; i++ ) { #pragma acc loop for ( j=0; j<n; j++ ) { float t = 0.0f; #pragma acc loop reduction (+:t) for ( k=0; k<n; k++ ) t += ELEMENT(a, i, k) * ELEMENT(b, k, j); ELEMENT(c, i, j) = t; } } printf("...GEMM() DONE\n"); printf("{ %f %f %f ... }\n", c[0], c[n], c[2*n]); return 0; }
Note the static
scope on the arrays: lacking this scoping token, the OpenACC programs would crash with segmentation faults. This was the result of the OS stack size limit:
$ ...edit matmul.c to remote 'static' scope on arrays... $ gcc -o matmul matmul.c -DN=1000 $ ./matmul Segmentation fault (core dumped) $ ulimit -s 8192 $ ulimit -s unlimited $ ./matmul START GEMM()... ...GEMM() DONE { 333833152.000000 667666304.000000 1001501248.000000 ... }
NVIDIA offload
The program can be compiled for NVIDIA GPU offload:
$ vpkg_require cuda/11.6.2 gcc/12.2.0:openacc Adding package `cuda/11.6.2-510.47.03` to your environment Adding dependency `binutils/2.35.1` to your environment Adding package `gcc/12.2.0:openacc` to your environment $ gcc -fopenacc -foffload=nvptx-none -O2 matmul.c -o matmul -DN=1000
The resulting program can be run on a DARWIN T4 node:
$ salloc --partition=gpu-t4 --gpus=1 [r1t04]$ vpkg_require cuda/11.6.2 gcc/12.2.0:openacc Adding package `cuda/11.6.2-510.47.03` to your environment Adding dependency `binutils/2.35.1` to your environment Adding package `gcc/12.2.0:openacc` to your environment [r1t04]$ ACC_DEVICE_TYPE=nvidia ACC_DEVICE_NUM=0 GOMP_DEBUG=1 ./matmul GOMP_NVPTX_JIT: <Not defined> Loading: --- // BEGIN PREAMBLE .version 6.0 .target sm_30 : @ %r63 bra.uni $L5; $L1: ret; } --- Linking Link complete: 0.000000ms Link log info : 0 bytes gmem info : Function properties for 'main$_omp_fn$0': info : used 35 registers, 0 stack, 0 bytes smem, 360 bytes cmem[0], 0 bytes lmem START GEMM()... GOACC_parallel_keyed: mapnum=4, hostaddrs=0x7ffdd24f0cd0, size=0x4040a0, kinds=0x404080 GOMP_OFFLOAD_openacc_exec: prepare mappings nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32 nvptx_exec: kernel main$_omp_fn$0: finished ...GEMM() DONE { 333833152.000000 667666304.000000 1001501248.000000 ... }
AMD offload
The program can be compiled for AMD GPU offload:
$ vpkg_rollback all $ vpkg_require amd-rocm/5.4.3 gcc/12.2.0:openacc Adding package `amd-rocm/5.4.3` to your environment Adding dependency `binutils/2.35.1` to your environment Adding package `gcc/12.2.0:openacc` to your environment $ gcc -fopenacc -foffload=amdgcn-amdhsa -O2 matmul.c -o matmul -foffload=-march=gfx908
Without the -foffload=-march=gfx908
flag the compiler produced code that would not execute on the AMD Mi100 in DARWIN. The desired architecture was determined by executing the rocminfo
command on the Mi100 node and searching for "gfx" in the output.
The resulting program can be run on the DARWIN Mi100 node:
$ salloc --partition=gpu-mi100 --gpus=1 [r0m01]$ vpkg_require amd-rocm/5.4.3 gcc/12.2.0:openacc Adding package `amd-rocm/5.4.3` to your environment Adding dependency `binutils/2.35.1` to your environment Adding package `gcc/12.2.0:openacc` to your environment [r0m01]$ ACC_DEVICE_TYPE=gcn ACC_DEVICE_NUM=0 GOMP_DEBUG=1 ./matmul libgomp: while loading libgomp-plugin-nvptx.so.1: libcuda.so.1: cannot open shared object file: No such file or directory START GEMM()... GOACC_parallel_keyed: mapnum=4, hostaddrs=0x7ffda87a5680, size=0x4040a0, kinds=0x404080 ...GEMM() DONE { 385.000000 770.000000 1155.000000 ... }
The message regarding the inability to load the GNU OpenMP (GOMP) NVPTX plugin appears because no CUDA library is present and the OpenMP runtime will try to load every plugin available to it. It can be ignored for host and AMD GCN OpenACC builds.