Building GCC 12.2 with OpenACC Offload

The GNU compiler suite includes an implementation of the OpenACC standard. Offload is supported to AMD and NVIDIA GPU devices (as well as single-thread host CPU fallback).

The GNU OpenACC implementation is currently still a work in progress and is very likely not the most efficient OpenACC runtime for your code. The NVIDIA HPC SDK (known as Portland Group compilers before they were purchased by NVIDIA) seems to generally be considered the most feature-complete and performant OpenACC implementation available.

Building a GCC compiler toolchain with one or more of the accelerator engines is procedurally similar to building alternative architecture cross-compile functionality. Tools are built for each accelerator before a final host-native build of GCC – with the accelerator architectures included – completes the process.

This document summarizes that build procedure then provides examples of using the compiler to build a simplistic OpenACC program for execution on NVIDIA and AMD accelerators.

The NVIDIA offload engine uses the PTX (Parallel Thread Execution) infrastructure provided by NVIDIA. PTX is a virtual execution environment and associated ISA (Instruction Set Architecture) for dispatching multiple execution threads to NVIDIA devices. The CUDA tools include the PTX Assembler, ptxas, that completes the compilation of PTX threads on the execution host.

The following software packages are required for this build:

GCC 12.2.0 source
GCC support libraries in-tree (GMP, ISL, MPC, MPFR)
nvptx-tools (https://github.com/MentorEmbedded/nvptx-tools)
newlib (git:\/\/sourceware.org/git/newlib-cygwin.git)
CUDA SDK

On DARWIN binutils 2.35 was used in preference to the native 2.27 version provided by the OS.

A directory hierarchy was setup and existing downloads of the source were unpacked:

$ mkdir -p /opt/shared/gcc/12.2.0-openacc/src
$ cd /opt/shared/gcc/12.2.0-openacc/src
$ tar -xf ../../attic/gcc-12.2.0.tar.gz
$ git clone https://github.com/MentorEmbedded/nvptx-tools nvptx-tools
$ git clone git://sourceware.org/git/newlib-cygwin.git newlib
$ cd gcc-12.2.0
$ for x in gmp-6.1.2.tar.bz2 isl-0.24.tar.bz2 mpc-1.2.1.tar.gz mpfr-4.1.0.tar.bz2 ; do tar -xf ../../../attic/${x} .; done
$ ln -s gmp-6.1.2 gmp
$ ln -s isl-0.24 isl
$ ln -s mpc-1.2.1 mpc
$ ln -s mpfr-4.1.0 mpfr
$ ln -s ../newlib/newlib .

The AMD components will later be unpacked in this same hierarchy.

First, the nvptx-tools must be built:

$ cd /opt/shared/gcc/12.2.0-openacc/src/nvptx-tools
$ vpkg_require cuda/11.6.2 binutils/2.35
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding package `binutils/2.35.1` to your environment
 
$ ./configure --prefix=/opt/shared/gcc/12.2.0-openacc
 
$ make -j 4
 
$ make install
 
$ cd ..
$ ls -l ../bin
total 424
lrwxrwxrwx 1 frey swmgr     34 Apr 13 14:46 nvptx-none-ar -> /opt/shared/binutils/2.35.1/bin/ar
-rwxr-xr-x 1 frey swmgr 203320 Apr 13 14:46 nvptx-none-as
-rwxr-xr-x 1 frey swmgr 181328 Apr 13 14:46 nvptx-none-ld
-rwxr-xr-x 1 frey swmgr 168032 Apr 13 14:46 nvptx-none-nm
lrwxrwxrwx 1 frey swmgr     38 Apr 13 14:46 nvptx-none-ranlib -> /opt/shared/binutils/2.35.1/bin/ranlib
-rwxr-xr-x 1 frey swmgr  39344 Apr 13 14:46 nvptx-none-run
-rwxr-xr-x 1 frey swmgr    909 Apr 13 14:46 nvptx-none-run-single

With the tools built and installed, the GCC tools can be built:

$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0
$ mkdir build-nvptx
$ cd build-nvptx
$ GCC_TARGET=$(../config.guess)
$ ../configure --prefix=/opt/shared/gcc/12.2.0-openacc --target=nvptx-none --with-build-time-tools=/opt/shared/gcc/12.2.0-openacc/nvptx-none/bin --enable-as-accelerator-for=$GCC_TARGET --disable-sjlj-exceptions --enable-newlib-io-long-long
 
$ make -j 4
 
$ make install

If successful, the NVIDIA PTX cross-compile tools are now present.

The AMD Graphics Core Next (GCN) is similar to NVIDIA PTX: coprocessing hardware with its own ISA. The interface of the host environment with the device is described by a Heterogeneous System Architecture (HSA). These acronyms are mentioned here primarily because they will come up in the context of building and using the AMD OpenACC offload engine.

Sadly, AMD has not released its own low-level development tools (assembler, etc.) while LLVM has produced them. So the first step is doing a minimal build of LLVM for GCN to get those tools.

$ cd /opt/shared/gcc/12.2.0-openacc/src
$ wget 'https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-13.0.1.tar.gz'
$ tar -xf llvmorg-13.0.1.tar.gz
$ rm llvmorg-13.0.1.tar.gz
$ cd gcc-12.2.0
$ mkdir -p build-amdgcn/llvm
$ cd build-amdgcn/llvm
$ vpkg_require cmake
Adding package `cmake/3.21.4` to your environment
 
$ cmake -D 'LLVM_TARGETS_TO_BUILD=X86;AMDGPU' -D LLVM_ENABLE_PROJECTS=lld ../../../llvm-project-llvmorg-13.0.1/llvm
   :
CMake Error at cmake/modules/CheckCompilerVersion.cmake:38 (message):
  Host GCC version must be at least 5.1, your version is 4.8.5.
   :
 
$ vpkg_require gcc/7.3
Adding package `gcc/7.3.0` to your environment
 
$ CC=$(which gcc) CXX=$(which g++) cmake -D 'LLVM_TARGETS_TO_BUILD=X86;AMDGPU' -D LLVM_ENABLE_PROJECTS=lld ../../../llvm-project-llvmorg-13.0.1/llvm
 
$ make -j 4

The results are not installed! Instead, only the necessary components are copied into place by hand:

$ cd /opt/shared/gcc/12.2.0-openacc
$ mkdir -p amdgcn-amdhsa/bin
$ cd amdgcn-amdhsa/bin
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-ar ar
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-ar ranlib
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-mc as
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/llvm-nm nm
$ ln -s ../../src/gcc-12.2.0/build-amdgcn/llvm/bin/lld ld

Finally, the AMD GCC tools can be built and installed:

$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0/build-amdgcn
$ mkdir gcc
$ cd gcc
$ ../../configure --target=amdgcn-amdhsa --enable-languages=c,lto,fortran --disable-sjlj-exceptions --with-newlib --enable-as-accelerator-for=${GCC_TARGET} --with-build-time-tools=/opt/shared/gcc/12.2.0-openacc/amdgcn-amdhsa/bin --disable-libquadmath --prefix=/opt/shared/gcc/12.2.0-openacc
 
$ make -j 4
 
$ make install

The AMD tools are done. Time to do the host build of GCC.

With the two accelerator tool sets built and installed, the host compiler can be produced.

$ cd /opt/shared/gcc/12.2.0-openacc/src/gcc-12.2.0
$ mkdir build-host
$ cd build-host
$ ../configure --build=${GCC_TARGET} --host=${GCC_TARGET} --target=${GCC_TARGET} --enable-offload-targets=nvptx-none=/opt/shared/gcc/12.2.0-openacc/nvptx-none,amdgcn-amdhsa=/opt/shared/gcc/12.2.0-openacc/amdgcn-amdhsa --with-cuda-driver-include=${CUDA_HOME}/include --with-cude-driver-lib=${CUDA_HOME}/targets/x86_64-linux/lib/stubs --prefix=/opt/shared/gcc/12.2.0-openacc --disable-multilib --disable-bootstrap
 
$ make -j 4
 
$ make install

The GCC 12.2.0 toolchain with AMD and NVIDIA OpenACC offload is now complete.

The GCC 12.2.0 compiler produced this way does not automatically and always require additional CUDA or ROCM software in its environment. If you are building software with just the host compiler then the OpenACC components will not even be used. It can be assumed that the runtime environment for software compiled with this toolchain will require the GCC 12.2.0 compiler and CUDA or ROCM libraries, so three variants will be configured in VALET:

The 12.2.0 compiler without dependency on a CUDA or ROCM library
The 12.2.0 compiler with dependency on any CUDA library
The 12.2.0 compiler with dependency on any ROCM library

    "12.2.0:openacc":
      prefix: 12.2.0-openacc
      description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (no specific offload)
      actions:
        - variable: CXX17
          action:   set
          contexts: development
          value:    g++
      dependencies:
        - binutils/2.35
    "12.2.0:openacc,amd":
      prefix: 12.2.0-openacc
      description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (AMD offload)
      actions:
        - variable: CXX17
          action:   set
          contexts: development
          value:    g++
      dependencies:
        - binutils/2.35
        - amd-rocm/=>4
    "12.2.0:openacc,nvidia":
      prefix: 12.2.0-openacc
      description: GCC with C, C++, Obj-C, Obj-C++, Fortran, OpenACC (NVIDIA offload)
      actions:
        - variable: CXX17
          action:   set
          contexts: development
          value:    g++
      dependencies:
        - binutils/2.35
        - cuda/=>9

A CUDA or ROCM library (or both!) can be loaded into the runtime environment before loading GCC 12.2.0 – this is the recommended mode of operation. The library-specific variants are present purely as a convenience.

$ vpkg_require cuda/11.6.2 amd-rocm/5.4.3 gcc/12.2.0:openacc
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding package `amd-rocm/5.4.3` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment

The following program includes simple OpenACC directives on the matrix multiplication loop:

matmul.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openacc.h>
 
#ifndef N
#define N 10
#endif
 
int
main()
{
    int              n = N;
    static float     a[N*N], b[N*N], c[N*N];
    int              i, j, k;
 
    #define ELEMENT(M, I, J) (M)[(J) * N + (I)]
 
    for ( i=1; i<=n; i++ ) {
        for ( j=1; j<=n; j++ ) {
            ELEMENT(a, i-1, j-1) = (float)(i * j);
            ELEMENT(b, i-1, j-1) = (float)(j * i);
        }
    }
 
#if _OPENACC
    acc_init(acc_device_not_host);
#endif
 
    printf("START GEMM()...\n"); fflush(stdout);
#pragma omp parallel for shared(n, a, b, c)
#pragma acc kernels loop
    for ( i=0; i<n; i++ ) {
    #pragma acc loop
        for ( j=0; j<n; j++ ) {
            float t = 0.0f;
 
        #pragma acc loop reduction (+:t)
            for ( k=0; k<n; k++ ) t += ELEMENT(a, i, k) * ELEMENT(b, k, j);
            ELEMENT(c, i, j) = t;
        }
    }
    printf("...GEMM() DONE\n");
    printf("{ %f %f %f ... }\n", c[0], c[n], c[2*n]);
    return 0;
}

Note the static scope on the arrays: lacking this scoping token, the OpenACC programs would crash with segmentation faults. This was the result of the OS stack size limit:

$ ...edit matmul.c to remote 'static' scope on arrays...
$ gcc -o matmul matmul.c -DN=1000
$ ./matmul 
Segmentation fault (core dumped)
 
$ ulimit -s
8192
 
$ ulimit -s unlimited
 
$ ./matmul
START GEMM()...
...GEMM() DONE
{ 333833152.000000 667666304.000000 1001501248.000000 ... }

The program can be compiled for NVIDIA GPU offload:

$ vpkg_require cuda/11.6.2 gcc/12.2.0:openacc
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment
 
$ gcc -fopenacc -foffload=nvptx-none -O2 matmul.c -o matmul -DN=1000

The resulting program can be run on a DARWIN T4 node:

$ salloc --partition=gpu-t4 --gpus=1
 
[r1t04]$ vpkg_require cuda/11.6.2 gcc/12.2.0:openacc
Adding package `cuda/11.6.2-510.47.03` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment
 
[r1t04]$ ACC_DEVICE_TYPE=nvidia ACC_DEVICE_NUM=0 GOMP_DEBUG=1 ./matmul
GOMP_NVPTX_JIT: <Not defined>
Loading:
---
// BEGIN PREAMBLE
.version 6.0
.target sm_30
   :
@ %r63 bra.uni $L5;
$L1:
ret;
}
 
---
Linking
Link complete: 0.000000ms
Link log info    : 0 bytes gmem
info    : Function properties for 'main$_omp_fn$0':
info    : used 35 registers, 0 stack, 0 bytes smem, 360 bytes cmem[0], 0 bytes lmem
START GEMM()...
GOACC_parallel_keyed: mapnum=4, hostaddrs=0x7ffdd24f0cd0, size=0x4040a0, kinds=0x404080
  GOMP_OFFLOAD_openacc_exec: prepare mappings
  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=1, vectors=32
  nvptx_exec: kernel main$_omp_fn$0: finished
...GEMM() DONE
{ 333833152.000000 667666304.000000 1001501248.000000 ... }

The program can be compiled for AMD GPU offload:

$ vpkg_rollback all
$ vpkg_require amd-rocm/5.4.3 gcc/12.2.0:openacc
Adding package `amd-rocm/5.4.3` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment
 
$ gcc -fopenacc -foffload=amdgcn-amdhsa -O2 matmul.c -o matmul -foffload=-march=gfx908

Without the -foffload=-march=gfx908 flag the compiler produced code that would not execute on the AMD Mi100 in DARWIN. The desired architecture was determined by executing the rocminfo command on the Mi100 node and searching for "gfx" in the output.

The resulting program can be run on the DARWIN Mi100 node:

$ salloc --partition=gpu-mi100 --gpus=1
 
[r0m01]$ vpkg_require amd-rocm/5.4.3 gcc/12.2.0:openacc
Adding package `amd-rocm/5.4.3` to your environment
Adding dependency `binutils/2.35.1` to your environment
Adding package `gcc/12.2.0:openacc` to your environment
 
[r0m01]$ ACC_DEVICE_TYPE=gcn ACC_DEVICE_NUM=0 GOMP_DEBUG=1 ./matmul
libgomp: while loading libgomp-plugin-nvptx.so.1: libcuda.so.1: cannot open shared object file: No such file or directory
START GEMM()...
GOACC_parallel_keyed: mapnum=4, hostaddrs=0x7ffda87a5680, size=0x4040a0, kinds=0x404080
...GEMM() DONE
{ 385.000000 770.000000 1155.000000 ... }

The message regarding the inability to load the GNU OpenMP (GOMP) NVPTX plugin appears because no CUDA library is present and the OpenMP runtime will try to load every plugin available to it. It can be ignored for host and AMD GCN OpenACC builds.

Building GCC 12.2 with OpenACC Offload

Build NVIDIA tools

Unpacking the source

Compile and install

Build AMD tools

Build host tools

VALET definitions

Example program

NVIDIA offload

AMD offload

hpc documentation