====== Compilation of Gaussian '16 for NVIDIA Ampere GPUs ======

<WRAP center round important 60%>
In compliance with Gaussian licensing, this article includes no reproduction of Gaussian '16 source code or build files.
</WRAP>


The third-generation expansion to UD's Caviness cluster includes Intel Xeon Gold 6240R processors and NVIDIA A100 and A40 GPUs.  These CPUs are considerably newer than Haswell, which is the most recent architecture the Gaussian '16 build system can target for optimization.  We also found that the existing GPU-enabled binaries, built with the Portland Group 17 compiler suite, could not make use of the A100 and A40 coprocessors in the nodes:

<code>
[frey@r06g00 ~]$ PGI_ACC_DEBUG=1 openacc-test/a.out.17
 Calling ACC_Init(ACC_Device_Nvidia)
ACC: detected 4 CUDA devices
ACC: initialized 0 CUDA devices
ACC: device[1] is PGI native
ACC: device[0] is PGI native
pinitialize for thread 1
 Calling ACC_Get_Num_Devices(ACC_Device_Nvidia)
                        0
 Calling ACC_Set_Device_Num(0, ACC_Device_Nvidia)
pgi_uacc_set_device_num(devnum=0,devtype=4,threadid=1)
 Calling ACC_Get_Free_Memory()
curr_devid for thread 1 is 0
                        0
 Calling ACC_Get_Memory()
curr_devid for thread 1 is 0
                        0
</code>

The PGI 17 OpenACC runtime detects the four A100s but initializes none of them, so they cannot be used.  In Gaussian runs the problem manifests as the following message:

<code>
   :
 SetGPE:  set environment variable "MP_BIND" = "yes"
 SetGPE:  set environment variable "MP_BLIST" = "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47"
 MDGPrp: Not enough Memory on GPU.
 Error termination via Lnk1e in /opt/shared/gaussian/g16c01/gpu/g16/l1.exe at Thu May 25 09:55:13 2023.
 Job cpu time:       0 days  1 hours  5 minutes 56.1 seconds.
 Elapsed time:       0 days  0 hours  1 minutes 23.6 seconds.
</code>

Gaussian calls ''ACC_Set_Device_Num()'' to bind the current thread to an OpenACC device, but the call is not checked for failure.  Execution eventually reaches the PGI Fortran helper function ''ACC_Get_Free_Memory()'', which returns zero.  Zero is always less than the amount of memory required, which produces the error message cited above.

With the NVIDIA HPC SDK 22.7 compiler suite, which is the PGI suite rebranded after NVIDIA purchased PGI, the OpenACC runtime is able to use the Ampere GPUs:

<code>
[frey@r06g00 ~]$ PGI_ACC_DEBUG=1 openacc-test/a.out.22
 Calling ACC_Init(ACC_Device_Nvidia)
ACC: detected 4 CUDA devices
cuda_initdev thread:0 data.default_device_num:0 pdata.cuda.default_device_num:0
ACC: device[1] is NVIDIA CUDA device 0 compute capability 8.0
ACC: device[2] is NVIDIA CUDA device 1 compute capability 8.0
ACC: device[3] is NVIDIA CUDA device 2 compute capability 8.0
ACC: device[4] is NVIDIA CUDA device 3 compute capability 8.0
ACC: initialized 4 CUDA devices
ACC: device[5] is PGI native
pinitialize (threadid=1)
cuda_init_device thread:1 data.default_device_num:1 pdata.cuda.default_device_num:1
cuda_init_device(threadid=1, device 0) dindex=1, api_context=(nil)
cuda_init_device(threadid=1, device 0) dindex=1, setting api_context=(nil)
cuda_init_device(threadid=1, device 0) dindex=1, new api_context=0x97efb0
argument memory for queue 32 device:0x7f51a3200000 host:0x7f51a3400000
 Calling ACC_Get_Num_Devices(ACC_Device_Nvidia)
                        4
 Calling ACC_Set_Device_Num(0, ACC_Device_Nvidia)
pgi_uacc_set_device_num(devnum=0,devtype=4,threadid=1)
pgi_uacc_set_device_num(devnum=0,devtype=4,threadid=1) cuda devid=1 dindex=1
 Calling ACC_Get_Free_Memory()
              84592361472
 Calling ACC_Get_Memory()
              85031714816
</code>
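
The difference between the two runs shows up in the ''ACC: initialized N CUDA devices'' line of the debug output.  As a minimal sketch, a job script could run the same test program and refuse to start Gaussian when that count is zero.  The function names ''acc_initialized_count'' and ''acc_devices_usable'' are hypothetical, and the sketch assumes the debug line has exactly the format shown in the transcripts above:

```shell
# Read OpenACC runtime debug output on stdin and print the number of CUDA
# devices reported as initialized.  Prints 0 when no such line is present.
# Assumption: the line looks like "ACC: initialized 4 CUDA devices".
acc_initialized_count() {
    awk '/^ACC: initialized [0-9]+ CUDA devices/ { n = $3; found = 1 }
         END { print (found ? n : 0) }'
}

# Succeed (exit status 0) only if at least one device was initialized.
acc_devices_usable() {
    [ "$(acc_initialized_count)" -gt 0 ]
}
```

A possible use is ''PGI_ACC_DEBUG=1 openacc-test/a.out.22 2>&1 | acc_devices_usable || exit 1''.  Standard error is merged into the pipe because it is an assumption here, not something verified, which stream the runtime writes its debug lines to.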

To produce Gaussian '16 binaries that can make use of the A100 and A40 GPUs in Caviness, both the build system and the source code must be altered.

===== Add Skylake Target =====

Since we are making changes anyway, directives to target Skylake-generation Intel CPUs were added to the build system.  The token ''skylake'' identifies these changes:  ''skylake'' is passed as an argument to ''bsd/bldg16'', and both ''bsd/set-mflags'' and ''bsd/setup-make'' have been adapted to recognize it.

The ''bsd/bldg16'' script was updated to create a ''skylake.flag'' file that triggers the added functionality in ''bsd/set-mflags'' and ''bsd/setup-make''.

==== Flags ====

The Haswell flags were reused, with the target processor changed to "skylake".

==== BLAS/LAPACK ====

The Haswell-optimized ATLAS library included with the Gaussian '16 source is used for the ''skylake'' binaries.

<WRAP center round tip 60%>
Substituting the Intel MKL (serial) library for the ATLAS BLAS/LAPACK is another optimization variant we might try.  It would be a very significant change, though, and the resulting binaries would require validation before being used for production computation.
</WRAP>

===== Add Ampere OpenACC Target =====

The token ''ampere'' identifies these changes:  ''ampere'' is passed as an argument to ''bsd/bldg16'', and both ''bsd/set-mflags'' and ''bsd/setup-make'' have been adapted to recognize it.

==== Flags ====

NVIDIA HPC SDK 22.7 flags that enable the necessary functionality are:

<code>
… -cuda -acc=gpu,nowait -gpu=cuda11.7,ccall,flushz,unroll,fma -cudalib=cublas …
</code>

==== OpenACC Changes ====

The OpenACC directive syntax accepted by NVIDIA HPC SDK 22.7 differs from what PGI 17 accepted:

  * The syntax ''C$ACC Do …'' has been replaced with ''C$ACC Loop …''
  * The syntax ''C$ACC Kernels Do …'' has been replaced with ''C$ACC Kernels Loop …''
  * The syntax ''C$ACC Parallel Do …'' has been replaced with ''C$ACC Parallel Loop …''
  * The ''NoCreate()'' ACC clause has been replaced with ''No_Create()''
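
Substitutions like these are mechanical, so they can be scripted.  The following is a rough sketch only, not the procedure actually used for this build: the function name ''acc_modernize'' is hypothetical, and it assumes fixed-form Fortran with the ''C$ACC'' sentinel in column 1, spelled exactly as in the list above.  Lines that do not begin with the sentinel, including ordinary Fortran ''Do'' loops, are left untouched.

```shell
# Rewrite PGI 17 era OpenACC spellings read from stdin into the forms
# listed above, writing the result to stdout.
# Assumption: directives start with the literal sentinel "C$ACC" in column 1.
acc_modernize() {
    sed -E \
        -e '/^C\$ACC/ s/^(C\$ACC[[:space:]]+((Kernels|Parallel)[[:space:]]+)?)Do([[:space:]]|$)/\1Loop\4/' \
        -e '/^C\$ACC/ s/NoCreate([[:space:]]*\()/No_Create\1/g'
}
```

It could be applied as ''acc_modernize < old.F > new.F'', where both file names are placeholders.  Any output should be reviewed by hand (for example with ''diff'') before compiling, since directive spellings that vary in case or spacing from the list above are not matched.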