Here are some modifications for Mills, based on recommendations from Using ACML (AMD Core Math Library) in the High Performance Computing Challenge (HPCC).
Changes to the make file hpl/setup/Make.Linux_ATHLON_FBLAS, copied to hpl/Make.open64-acml (see the sketch after this list):

- Comment out the variables in the MP or LA sections (the exported LAinc and LAlib below supply the library settings).
- Change the C compiler (CC) from /usr/bin/gcc to mpicc.
- Change the linker (LINKER) from /usr/bin/g77 to mpif77.
- Add -DHPCC_FFT_235 to CCFLAGS.
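As a reference, here is a minimal sketch of the edited lines in hpl/Make.open64-acml. It assumes the standard variable names from the Make.Linux_ATHLON_FBLAS template (CC, LINKER, CCFLAGS, and the MP/LA sections); only the lines that change are shown, and the rest of the file is left as in the template.

```make
# hpl/Make.open64-acml -- sketch of the edited lines only
# (variable names assumed from the Make.Linux_ATHLON_FBLAS template)
ARCH         = open64-acml   # assumption: set to match the make arch= value

# MP section: commented out; the mpicc/mpif77 wrappers supply the MPI paths
# MPdir      =
# MPinc      =
# MPlib      =

# LA section: commented out so the exported LAinc/LAlib (set below) carry through
# LAdir      =
# LAinc      =
# LAlib      =

# Compiler and linker changed to the MPI wrappers
CC           = mpicc
LINKER       = mpif77

# -DHPCC_FFT_235 appended; keep the template's original optimization flags too
CCFLAGS      = $(HPL_DEFS) -DHPCC_FFT_235
```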
The VALET commands are:

vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64
Exported variables (to set values for the commented-out LAinc and LAlib):

export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"
Make command using 4 parallel make jobs:
make -j 4 arch=open64-acml
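Putting these steps together, a build session on a login node would look roughly like this (package versions as listed above; the compiler and library paths come from VALET, so nothing needs to be hard-coded):

```bash
# Run from the top of the HPCC source tree.
# Load ACML and Open MPI; vpkg_devrequire also populates CPPFLAGS/LDFLAGS.
vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64

# Pass the VALET-provided paths through to HPL's commented-out LA variables.
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"

# Build HPCC with 4 parallel make jobs against hpl/Make.open64-acml.
make -j 4 arch=open64-acml
```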
Packages used for these runs: `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`.
N = 30000, NB = 100, P = 6, Q = 8

These runs need 48 processes (a 6 × 8 process grid). The same number of processes is run with either 48 or 96 slots.
Option | Grid Engine | MPI flags |
---|---|---|
NCPU=1 | -pe openmpi 48 | --bind-to-core --bycore |
NCPU=2 | -pe openmpi 96 | --bind-to-core --bycore --cpus-per-proc 2 -np 48 --loadbalance |
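For illustration, the corresponding mpirun invocations inside the job script would look roughly like this (the benchmark binary name `./hpcc` is an assumption; the host list comes from the Grid Engine openmpi parallel environment):

```bash
# NCPU=1: 48 slots, one MPI rank per core
mpirun --bind-to-core --bycore ./hpcc

# NCPU=2: 96 slots, but only 48 ranks, each bound to a pair of cores
mpirun --bind-to-core --bycore --cpus-per-proc 2 -np 48 --loadbalance ./hpcc
```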
HPCC benchmark results for two runs:
Result | NCPU=1 | NCPU=2 |
---|---|---|
HPL_Tflops | 0.0769491 | 1.54221 |
StarDGEMM_Gflops | 1.93686 | 14.6954 |
SingleDGEMM_Gflops | 11.5042 | 15.6919 |
MPIRandomAccess_LCG_GUPs | 0.0195047 | 0.00352421 |
MPIRandomAccess_GUPs | 0.0194593 | 0.00410853 |
StarRandomAccess_LCG_GUPs | 0.0113424 | 0.0302748 |
SingleRandomAccess_LCG_GUPs | 0.0448261 | 0.0568664 |
StarRandomAccess_GUPs | 0.0113898 | 0.0288637 |
SingleRandomAccess_GUPs | 0.0521811 | 0.053262 |
StarFFT_Gflops | 0.557555 | 1.14746 |
SingleFFT_Gflops | 1.2178 | 1.45413 |
MPIFFT_Gflops | 5.31624 | 34.3552 |
Packages used for these runs: `acml/5.3.0-open64-fma4`, `open64/4.5`, and one of `openmpi/1.4.4-open64`, `openmpi/1.6.1-open64`, or `openmpi/1.6.3-open64`.
N = 72000, NB = 100, P = 12, Q = 16
nproc = 2 × 192 (384 slots, with 192 MPI workers each bound to a Bulldozer core pair)
The runs differ mainly in their use of QLogic PSM endpoints. For full examples of Grid Engine scripts, see the files /opt/shared/templates/openmpi-ibverb.qs and /opt/shared/templates/openmpi-psm.qs.

All jobs bind processes by core, using 2 CPUs per process.
Common flags
Grid Engine | -pe openmpi 384 -l exclusive=1 |
---|---|
MPI flags | --bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance |
Flags for ibverbs vs. psm
Option | Grid Engine | MPI flags |
---|---|---|
ibverbs | | --mca btl ^tcp --mca mtl ^psm |
psm | -l psm_endpoints=1 | --mca btl ^tcp |
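Combining the common flags with the interconnect-specific ones, the two mpirun variants would look roughly like this (again, `./hpcc` as the benchmark binary is an assumption):

```bash
# ibverbs (Open MPI 1.4.4): disable the TCP BTL and the PSM MTL
mpirun --mca btl ^tcp --mca mtl ^psm \
       --bind-to-core --bycore --cpus-per-proc 2 -np 192 --loadbalance ./hpcc

# PSM (Open MPI 1.6.x): disable only TCP; the job must also request
# the Grid Engine resource -l psm_endpoints=1
mpirun --mca btl ^tcp \
       --bind-to-core --bycore --cpus-per-proc 2 -np 192 --loadbalance ./hpcc
```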
The job submission command:

qsub -q standby-4h.q@@24core -l standby=1,h_rt=4:00:00 bhpcc.qs

The host group @24core is specified to make sure we only run on 24-core nodes, because the load-balancing algorithm requires the same core count on all the nodes. Use the qconf command to see the nodes in the host group @24core:

qconf -shgrp @24core
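For orientation, a minimal bhpcc.qs for the PSM case might look like the sketch below. This is an assumption about the script's contents; the templates in /opt/shared/templates mentioned above are the real starting points, and the queue, standby, and h_rt options are supplied on the qsub command line as shown.

```bash
#!/bin/bash
#$ -N bhpcc
#$ -cwd
#$ -pe openmpi 384
#$ -l exclusive=1
#$ -l psm_endpoints=1

# Load the same packages used for the build
vpkg_require acml/5.3.0-open64-fma4
vpkg_require openmpi/1.6.1-open64

# PSM variant of the MPI flags from the tables above
mpirun --mca btl ^tcp \
       --bind-to-core --bycore --cpus-per-proc 2 -np 192 --loadbalance ./hpcc
```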
HPCC benchmark results for three runs:
Result | ibverbs (Open MPI 1.4.4) | psm (Open MPI 1.6.1) | psm (Open MPI 1.6.3) |
---|---|---|---|
HPL_Tflops | 1.68496 | 2.08056 | 2.08719 |
StarDGEMM_Gflops | 14.6933 | 14.8339 | 14.8108 |
SingleDGEMM_Gflops | 15.642 | 15.536 | 15.7652 |
PTRANS_GBs | 9.25899 | 18.4793 | 21.2181 |
StarFFT_Gflops | 1.19982 | 1.25452 | 1.24688 |
StarSTREAM_Triad | 3.62601 | 3.65631 | 3.5967 |
SingleFFT_Gflops | 1.44111 | 1.44416 | 1.45183 |
MPIFFT_Gflops | 7.67835 | 77.603 | 68.98 |
RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 | 2.39858 |
N = 72000, NB = 100, P = 12, Q = 16, NP = 384

Packages for the two runs:

- `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
- `acml/5.3.0-open64-fma4`, `open64/4.5`, `openmpi/1.6.1-open64`

HPCC benchmark results for the two runs:
Result | 139765 | 145105 |
---|---|---|
HPL_Tflops | 1.54221 | 0.364243 |
StarDGEMM_Gflops | 14.6954 | 13.6194 |
SingleDGEMM_Gflops | 15.6919 | 15.453 |
PTRANS_GBs | 1.14913 | 1.07982 |
MPIRandomAccess_GUPs | 0.00410853 | 0.00679052 |
StarSTREAM_Triad | 3.39698 | 2.83863 |
StarFFT_Gflops | 1.14746 | 0.737805 |
SingleFFT_Gflops | 1.45413 | 1.3756 |
MPIFFT_Gflops | 34.3552 | 32.3555 |
RandomlyOrderedRingLatency_usec | 77.9332 | 76.9595 |