Here are some modifications made on Mills, based on recommendations from "Using ACML (AMD Core Math Library) In High Performance Computing Challenge (HPCC)".
Changes to the make file `hpl/setup/Make.Linux_ATHLON_FBLAS`, copied to `hpl/Make.open64-acml`:
- Comment out the lines beginning with `MP` or `LA`.
- Change `/usr/bin/gcc` to `mpicc`.
- Change `/usr/bin/g77` to `mpif77`.
- Add `-DHPCC_FFT_235` to `CCFLAGS`.
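The four edits above can be sketched as a single `sed` filter. This is only an illustration: the function name `patch_makefile` is hypothetical, and the sample lines are assumed to resemble the template rather than copied from it.

```shell
#!/bin/sh
# Sketch: apply the make file edits as a stdin-to-stdout filter.
# patch_makefile is a hypothetical helper name, not part of HPCC.
patch_makefile() {
  sed -E \
    -e 's/^(MP|LA)/#&/' \
    -e 's|/usr/bin/gcc|mpicc|' \
    -e 's|/usr/bin/g77|mpif77|' \
    -e '/^CCFLAGS/ s/$/ -DHPCC_FFT_235/'
}

# Demo on a few sample lines (assumed to resemble the template).
printf '%s\n' \
  'MPdir        = /usr/local/mpi' \
  'LAlib        = $(LAdir)/libf77blas.a' \
  'CC           = /usr/bin/gcc' \
  'LINKER       = /usr/bin/g77' \
  'CCFLAGS      = $(HPL_DEFS) -O3' | patch_makefile
```

On the real files this would be run as `patch_makefile < hpl/setup/Make.Linux_ATHLON_FBLAS > hpl/Make.open64-acml`. Review the result by hand, since the template may contain lines this sketch does not anticipate.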
The Valet commands are:
vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64
Exported variables (to set values for the commented-out `LAinc` and `LAlib`):
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"
Make command with 4 parallel jobs:
make -j 4 arch=open64-acml
Packages: `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
N = 30000, NB = 100, P = 6, Q = 8
These runs need 48 processes (a 6 x 8 process grid: P = 6 rows and Q = 8 columns). The same number of processes is run with either 48 or 96 slots.
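The arithmetic behind these numbers can be checked with a short shell sketch. The function name `hpl_layout` is hypothetical. The memory figure counts only the N x N double-precision matrix (8 bytes per element) and ignores HPL workspace, so treat it as a lower bound.

```shell
#!/bin/sh
# Sketch: process count, slot count, and matrix memory for an HPL run.
# Assumes the shell uses 64-bit integer arithmetic (true for bash and dash
# on 64-bit Linux); hpl_layout is a hypothetical helper name.
hpl_layout() {
  n=$1; p=$2; q=$3; ncpu=$4
  np=$((p * q))               # MPI processes in the P x Q grid
  slots=$((np * ncpu))        # Grid Engine slots requested
  bytes=$((n * n * 8))        # N x N matrix of 8-byte doubles
  per_proc=$((bytes / np))    # even share of the matrix per process
  echo "np=$np slots=$slots matrix_bytes=$bytes bytes_per_process=$per_proc"
}

hpl_layout 30000 6 8 1
hpl_layout 30000 6 8 2
```

For N = 30000 the matrix occupies 7.2 GB in total, or 150 MB per process across the 48 processes.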
Options | Grid Engine | MPI flags |
---|---|---|
NCPU=1 | -pe openmpi 48 | --bind-to-core |
NCPU=2 | -pe openmpi 96 | --bind-to-core --bycore --cpus-per-proc 2 -np 48 |
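As a rough sketch of how the two rows map onto a job script, the helper below prints the Grid Engine directive and the `mpirun` line for a given process count and NCPU value. The function name `hpcc_launch` and the executable path `./hpcc` are assumptions for illustration. The NCPU=1 case relies on Open MPI starting one process per allocated slot when `-np` is omitted.

```shell
#!/bin/sh
# Sketch: print the parallel-environment request and mpirun line.
# hpcc_launch and ./hpcc are hypothetical names used for illustration.
hpcc_launch() {
  np=$1; ncpu=$2
  echo "#\$ -pe openmpi $((np * ncpu))"
  if [ "$ncpu" -eq 1 ]; then
    # One process per slot, each bound to a single core.
    echo "mpirun --bind-to-core ./hpcc"
  else
    # Fewer processes than slots: give each process ncpu cores.
    echo "mpirun --bind-to-core --bycore --cpus-per-proc $ncpu -np $np ./hpcc"
  fi
}

hpcc_launch 48 1
hpcc_launch 48 2
```

The binding flags are the Open MPI 1.4/1.6 spellings from the table above; later Open MPI releases changed these options.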
HPCC benchmark results for two runs:
Result | NCPU=1 | NCPU=2 |
---|---|---|
HPL_Tflops | 0.0769491 | 1.54221 |
StarDGEMM_Gflops | 1.93686 | 14.6954 |
SingleDGEMM_Gflops | 11.5042 | 15.6919 |
MPIRandomAccess_LCG_GUPs | 0.0195047 | 0.00352421 |
MPIRandomAccess_GUPs | 0.0194593 | 0.00410853 |
StarRandomAccess_LCG_GUPs | 0.0113424 | 0.0302748 |
SingleRandomAccess_LCG_GUPs | 0.0448261 | 0.0568664 |
StarRandomAccess_GUPs | 0.0113898 | 0.0288637 |
SingleRandomAccess_GUPs | 0.0521811 | 0.053262 |
StarFFT_Gflops | 0.557555 | 1.14746 |
SingleFFT_Gflops | 1.2178 | 1.45413 |
MPIFFT_Gflops | 5.31624 | 34.3552 |
Packages: `acml/5.3.0-open64-fma4`, `open64/4.5`, and `openmpi/1.4.4-open64` or `openmpi/1.6.1-open64`
N = 72000, NB = 100, P = 12, Q = 16
nproc = 2x192 (384 slots, with 192 MPI workers each bound to a Bulldozer core pair)
The two runs differ mainly in the use of QLogic PSM endpoints.
Result | No PSM (v1.4.4) | PSM (v1.6.1) |
---|---|---|
HPL_Tflops | 1.68496 | 2.08056 |
StarDGEMM_Gflops | 14.6933 | 14.8339 |
SingleDGEMM_Gflops | 15.642 | 15.536 |
PTRANS_GBs | 9.25899 | 18.4793 |
StarFFT_Gflops | 1.19982 | 1.25452 |
StarSTREAM_Triad | 3.62601 | 3.65631 |
SingleFFT_Gflops | 1.44111 | 1.44416 |
MPIFFT_Gflops | 7.67835 | 77.603 |
RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 |
N = 72000, NB = 100, P = 12, Q = 16, NP = 384
Packages: `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
Packages: `acml/5.3.0-open64-fma4`, `open64/4.5`, `openmpi/1.6.1-open64`
Result | 139765 | 145105 |
---|---|---|
HPL_Tflops | 1.54221 | 0.364243 |
StarDGEMM_Gflops | 14.6954 | 13.6194 |
SingleDGEMM_Gflops | 15.6919 | 15.453 |
PTRANS_GBs | 1.14913 | 1.07982 |
MPIRandomAccess_GUPs | 0.00410853 | 0.00679052 |
StarSTREAM_Triad | 3.39698 | 2.83863 |
StarFFT_Gflops | 1.14746 | 0.737805 |
SingleFFT_Gflops | 1.45413 | 1.3756 |
MPIFFT_Gflops | 34.3552 | 32.3555 |
RandomlyOrderedRingLatency_usec | 77.9332 | 76.9595 |