In compliance with Gaussian licensing, this article includes no reproduction of Gaussian '16 source code or build files.
The third-generation expansion to UD's Caviness cluster includes Intel Xeon Gold 6240R processors and NVIDIA A100 and A40 GPUs. The CPUs are significantly newer than Haswell, the newest architecture the Gaussian '16 build system will optimize for. We also found that the existing GPU-enabled binaries, built with the Portland Group (PGI) 17 compiler suite, were unable to make use of the A100 and A40 coprocessors in the nodes:
[frey@r06g00 ~]$ PGI_ACC_DEBUG=1 openacc-test/a.out.17
Calling ACC_Init(ACC_Device_Nvidia)
ACC: detected 4 CUDA devices
ACC: initialized 0 CUDA devices
ACC: device[1] is PGI native
ACC: device[0] is PGI native
pinitialize for thread 1
Calling ACC_Get_Num_Devices(ACC_Device_Nvidia)
0
Calling ACC_Set_Device_Num(0, ACC_Device_Nvidia)
pgi_uacc_set_device_num(devnum=0,devtype=4,threadid=1)
Calling ACC_Get_Free_Memory()
curr_devid for thread 1 is 0
0
Calling ACC_Get_Memory()
curr_devid for thread 1 is 0
0
The PGI 17 OpenACC runtime detects the four A100s but initializes none of them, so they cannot be used. The problem manifests in Gaussian runs as the following message:
:
SetGPE: set environment variable "MP_BIND" = "yes"
SetGPE: set environment variable "MP_BLIST" = "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47"
MDGPrp: Not enough Memory on GPU.
Error termination via Lnk1e in /opt/shared/gaussian/g16c01/gpu/g16/l1.exe at Thu May 25 09:55:13 2023.
Job cpu time: 0 days 1 hours 5 minutes 56.1 seconds.
Elapsed time: 0 days 0 hours 1 minutes 23.6 seconds.
When Gaussian calls ACC_Set_Device_Num() to bind the current thread to an OpenACC device, the call is not checked for failure. The code eventually reaches a call to the PGI Fortran helper function ACC_Get_Free_Memory(), which returns zero. Zero is always less than the amount of memory required, which produces the error message cited above.
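Because Gaussian itself does not check for this condition, it can be caught before a job is submitted by inspecting the runtime's debug output. The following is a minimal sketch, not part of Gaussian or the PGI toolchain: check_acc_devices is a hypothetical helper name, and it assumes the "ACC: detected N CUDA devices" and "ACC: initialized N CUDA devices" lines appear exactly as in the PGI_ACC_DEBUG output shown earlier.

```shell
# Hypothetical helper: reads PGI_ACC_DEBUG output on stdin, prints the
# detected/initialized device counts, and returns non-zero when the
# runtime initialized no CUDA devices.
check_acc_devices() {
    awk '
        /^ACC: detected [0-9]+ CUDA devices/    { detected = $3 }
        /^ACC: initialized [0-9]+ CUDA devices/ { initialized = $3 }
        END {
            printf "detected=%d initialized=%d\n", detected, initialized
            exit (initialized > 0 ? 0 : 1)
        }'
}

# Possible usage (stderr is merged in case the debug lines are written there):
#   PGI_ACC_DEBUG=1 openacc-test/a.out.17 2>&1 | check_acc_devices \
#       || echo "OpenACC runtime cannot use the GPUs on this node"
```

With the PGI 17 output above, this sketch would report detected=4 initialized=0 and return a failure status.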
With the NVIDIA HPC SDK 22.7 compiler suite (the PGI compilers, rebranded after NVIDIA purchased the company), the OpenACC runtime is able to use the Ampere GPUs:
[frey@r06g00 ~]$ PGI_ACC_DEBUG=1 openacc-test/a.out.22
Calling ACC_Init(ACC_Device_Nvidia)
ACC: detected 4 CUDA devices
cuda_initdev thread:0 data.default_device_num:0 pdata.cuda.default_device_num:0
ACC: device[1] is NVIDIA CUDA device 0 compute capability 8.0
ACC: device[2] is NVIDIA CUDA device 1 compute capability 8.0
ACC: device[3] is NVIDIA CUDA device 2 compute capability 8.0
ACC: device[4] is NVIDIA CUDA device 3 compute capability 8.0
ACC: initialized 4 CUDA devices
ACC: device[5] is PGI native
pinitialize (threadid=1)
cuda_init_device thread:1 data.default_device_num:1 pdata.cuda.default_device_num:1
cuda_init_device(threadid=1, device 0) dindex=1, api_context=(nil)
cuda_init_device(threadid=1, device 0) dindex=1, setting api_context=(nil)
cuda_init_device(threadid=1, device 0) dindex=1, new api_context=0x97efb0
argument memory for queue 32 device:0x7f51a3200000 host:0x7f51a3400000
Calling ACC_Get_Num_Devices(ACC_Device_Nvidia)
4
Calling ACC_Set_Device_Num(0, ACC_Device_Nvidia)
pgi_uacc_set_device_num(devnum=0,devtype=4,threadid=1)
pgi_uacc_set_device_num(devnum=0,devtype=4,threadid=1) cuda devid=1 dindex=1
Calling ACC_Get_Free_Memory()
84592361472
Calling ACC_Get_Memory()
85031714816
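The two byte counts reported by ACC_Get_Free_Memory() and ACC_Get_Memory() are easier to judge once converted to GiB. A small sketch of that arithmetic follows; bytes_to_gib is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: convert a byte count to GiB (1 GiB = 1073741824 bytes),
# rounded to one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1073741824 }'
}

bytes_to_gib 84592361472   # free memory on device 0:  prints 78.8
bytes_to_gib 85031714816   # total memory on device 0: prints 79.2
```

Nearly all of the device memory is reported as free, in contrast to the zero returned under PGI 17. The values would also be consistent with 80 GB A100 devices, though that memory size is an assumption; it is not stated in this article.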
In order to produce Gaussian '16 binaries that can make use of the A100 and A40 GPUs in Caviness, the build system and source code must be altered.
Since we are making changes anyway, directives to target Skylake-generation Intel CPUs were added to the build system. The token skylake was used in association with these changes: skylake is passed as an argument to bsd/bldg16, and bsd/set-mflags and bsd/setup-make have been adapted to recognize it.
The bsd/bldg16 script was updated to create a skylake.flag file that triggers the added functionality in bsd/set-mflags and bsd/setup-make.
The flags for Haswell were reused with a target processor of "skylake" instead.
The Haswell-optimized ATLAS library included with the Gaussian '16 source is used for the skylake binaries.
Substituting the Intel MKL (serial) library for the ATLAS BLAS/LAPACK is another optimization variant we might try. That would be a very significant change, however, and would require validation before the binaries are used for production computation.
The token ampere was used in association with the GPU-related changes: ampere is passed as an argument to bsd/bldg16, and bsd/set-mflags and bsd/setup-make have been adapted to recognize it.
NVIDIA HPC SDK 22.7 flags that enable the necessary functionality are:
… -cuda -acc=gpu,nowait -gpu=cuda11.7,ccall,flushz,unroll,fma -cudalib=cublas …
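For readers unfamiliar with these options, the fragment below restates the same flags one group at a time. It is an annotated sketch only: ACC_FLAGS is a hypothetical variable name, not one taken from the Gaussian build system, and the comments reflect our reading of the NVIDIA HPC compiler help text, so they are worth confirming against nvfortran -help for the installed SDK.

```shell
# Hypothetical variable name; the flag values are exactly those listed above.
ACC_FLAGS="-cuda"                         # enable CUDA support, link the CUDA runtime
ACC_FLAGS="$ACC_FLAGS -acc=gpu,nowait"    # compile OpenACC for GPU only; do not
                                          # wait for each device kernel to finish
ACC_FLAGS="$ACC_FLAGS -gpu=cuda11.7,ccall,flushz,unroll,fma"
                                          # cuda11.7: use the CUDA 11.7 toolkit
                                          # ccall:    all supported compute capabilities
                                          # flushz:   flush-to-zero mode on the GPU
                                          # unroll:   automatic inner-loop unrolling
                                          # fma:      fused multiply-add instructions
ACC_FLAGS="$ACC_FLAGS -cudalib=cublas"    # link the cuBLAS library

echo "$ACC_FLAGS"
```

Since ccall generates code for every supported compute capability, the resulting binaries are not tied to the A100 (compute capability 8.0, as reported in the debug output above).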
The OpenACC in NVIDIA HPC SDK 22.7 differs from the version present in PGI 17:
C$ACC Do … has been replaced with C$ACC Loop …
C$ACC Kernels Do … has been replaced with C$ACC Kernels Loop …
C$ACC Parallel Do … has been replaced with C$ACC Parallel Loop …
The NoCreate() ACC clause has been replaced with No_Create()
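The substitutions listed above are mechanical, so they lend themselves to a scripted rewrite. The sketch below is one way this might be done with sed and is not the procedure actually used for the Caviness build. migrate_acc is a hypothetical helper name, and it assumes fixed-form source in which every directive line starts with C$ACC in column 1 and uses the capitalization shown above; other spellings would need additional patterns.

```shell
# Hypothetical helper: reads fixed-form Fortran on stdin and writes it to
# stdout with the PGI 17 era OpenACC spellings updated. Lines that do not
# start with the C$ACC sentinel are passed through untouched.
migrate_acc() {
    sed -E \
        -e '/^[Cc]\$ACC/ s/^([Cc]\$ACC[[:space:]]+((Kernels|Parallel)[[:space:]]+)?)Do([[:space:]]|$)/\1Loop\4/' \
        -e '/^[Cc]\$ACC/ s/NoCreate\(/No_Create(/g'
}

# Possible usage on an illustrative line (the clause shown is made up):
#   printf '%s\n' 'C$ACC Parallel Do Collapse(2)' | migrate_acc
```

Restricting both substitutions to lines that begin with the directive sentinel keeps ordinary Fortran Do statements out of reach of the rewrite. Any output would still need to be reviewed and compiled before use.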