HPCC with open64 compiler, ACML and base FFT

Make

Here are some modifications made on Mills, based on recommendations from "Using ACML (AMD Core Math Library) In High Performance Computing Challenge (HPCC)".

Copy the makefile hpl/setup/Make.Linux_ATHLON_FBLAS to hpl/Make.open64-acml and make these changes:

  1. Comment out lines beginning with MP or LA
  2. Change /usr/bin/gcc to mpicc
  3. Change /usr/bin/g77 to mpif77
  4. Append -DHPCC_FFT_235 to CCFLAGS
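
The four edits above can also be applied with sed. This is a minimal sketch: the function name `patch_hpl_makefile` is hypothetical, and it assumes the variable assignments start in column 1 and that CCFLAGS is defined on a single line, so check the result by hand before building.

```shell
# Sketch: apply the four makefile edits with sed.
# patch_hpl_makefile is a hypothetical helper name, not part of HPCC.
patch_hpl_makefile() {
    src=$1   # e.g. hpl/setup/Make.Linux_ATHLON_FBLAS
    dst=$2   # e.g. hpl/Make.open64-acml
    # 1. comment out lines beginning with MP or LA
    # 2. replace /usr/bin/gcc with mpicc
    # 3. replace /usr/bin/g77 with mpif77
    # 4. append -DHPCC_FFT_235 to the CCFLAGS line
    sed -E \
        -e 's/^(MP|LA)/# &/' \
        -e 's#/usr/bin/gcc#mpicc#' \
        -e 's#/usr/bin/g77#mpif77#' \
        -e '/^CCFLAGS[[:space:]]*=/ s/[[:space:]]*$/ -DHPCC_FFT_235/' \
        "$src" > "$dst"
}
```

For example, `patch_hpl_makefile hpl/setup/Make.Linux_ATHLON_FBLAS hpl/Make.open64-acml` writes the patched copy and leaves the original untouched.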

The Valet commands are

vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64

Exported variables (these supply values for the commented-out LAinc and LAlib)

export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"

Make command with 4 parallel jobs

make -j 4 arch=open64-acml

NCPU=1 vs 2

package `acml/5.2.0-open64-fma4` 
package `open64/4.5` 
package `openmpi/1.4.4-open64` 
N = 30000, NB = 100, P = 6, Q=8

These runs need 48 processes, arranged as a 6 x 8 process grid (P = 6 process rows and Q = 8 process columns). The same number of processes is run with either 48 or 96 slots.

Option   Grid Engine      MPI flags
NCPU=1   -pe openmpi 48   --bind-to-core --bycore
NCPU=2   -pe openmpi 96   --bind-to-core --bycore --cpus-per-proc 2 -np 48 --loadbalance
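
The slot counts requested with -pe follow from the process grid: the number of MPI processes is P x Q, and each process takes NCPU slots. A small sketch of that arithmetic, where `slots_needed` is a hypothetical helper name used only for illustration:

```shell
# slots_needed P Q NCPU: print the MPI process count and the
# Grid Engine slot count for a P x Q process grid.
slots_needed() {
    p=$1; q=$2; ncpu=$3
    nproc=$((p * q))          # one MPI process per grid cell
    slots=$((nproc * ncpu))   # each process occupies NCPU slots
    echo "np=$nproc slots=$slots"
}

slots_needed 6 8 1    # np=48 slots=48
slots_needed 6 8 2    # np=48 slots=96
```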

HPCC benchmark results for two runs:

Result                        NCPU=1      NCPU=2
HPL_Tflops                    0.0769491   1.54221
StarDGEMM_Gflops              1.93686     14.6954
SingleDGEMM_Gflops            11.5042     15.6919
MPIRandomAccess_LCG_GUPs      0.0195047   0.00352421
MPIRandomAccess_GUPs          0.0194593   0.00410853
StarRandomAccess_LCG_GUPs     0.0113424   0.0302748
SingleRandomAccess_LCG_GUPs   0.0448261   0.0568664
StarRandomAccess_GUPs         0.0113898   0.0288637
SingleRandomAccess_GUPs       0.0521811   0.053262
StarFFT_Gflops                0.557555    1.14746
SingleFFT_Gflops              1.2178      1.45413
MPIFFT_Gflops                 5.31624     34.3552

ibverbs vs psm

package `acml/5.3.0-open64-fma4` 
package `open64/4.5` 
package `openmpi/1.4.4-open64` or  `openmpi/1.6.1-open64` or  `openmpi/1.6.3-open64`
N = 72000, NB = 100, P = 12, Q = 16
nproc = 2x192   (384 slots, with 192 MPI workers each bound to a Bulldozer core pair)

The runs differ mainly in whether they use QLogic PSM endpoints. For full examples of Grid Engine scripts, see the files /opt/shared/templates/openmpi-ibverb.qs and /opt/shared/templates/openmpi-psm.qs.

All jobs bind processes by core, using 2 CPUs per process.

Common flags

Grid Engine   -pe openmpi 384 -l exclusive=1
MPI flags     --bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance

Flags for ibverbs vs psm

Option    Grid Engine          MPI flags
ibverbs   (none)               --mca btl ^tcp --mca mtl ^psm
psm       -l psm_endpoints=1   --mca btl ^tcp
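
Putting the common flags and the per-transport flags together, the mpirun line for each case can be sketched as below. The helper name `mpirun_cmd` is hypothetical and the executable path `./hpcc` is an assumption; the function only prints the command so it can be inspected before being placed in a job script.

```shell
# mpirun_cmd TRANSPORT: print the mpirun command line for "ibverbs" or "psm".
# ./hpcc is an assumed executable path; adjust it to your build.
mpirun_cmd() {
    common="--bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance"
    case $1 in
        ibverbs) mca="--mca btl ^tcp --mca mtl ^psm" ;;   # exclude TCP and the PSM MTL
        psm)     mca="--mca btl ^tcp" ;;                  # exclude TCP only
        *)       echo "unknown transport: $1" >&2; return 1 ;;
    esac
    echo "mpirun $common $mca ./hpcc"
}

mpirun_cmd ibverbs
mpirun_cmd psm
```

For the psm case the job also needs the Grid Engine resource -l psm_endpoints=1 from the table above; that is a qsub option, not an mpirun flag.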

The job submission command

 qsub -q standby-4h.q@@24core -l standby=1,h_rt=4:00:00 bhpcc.qs

Since we need more than 240 slots (384 slots give enough cores for 192 processes taking 2 cores each), we specify a hard time limit of 4 hours so that our jobs can run in the standby-4h queue. We use the host group @24core to make sure we only run on 24-core nodes, because the load balancing algorithm requires the same core count on all nodes. Use the qconf command to see the nodes in host group @24core:

 qconf -shgrp @24core
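
As a rough check on how many hosts such a job occupies, the slot count can be divided by the cores per node, rounding up. This is a sketch assuming every node contributes all of its cores, which is the case with exclusive=1 on 24-core nodes; `nodes_needed` is a hypothetical helper name.

```shell
# nodes_needed SLOTS CORES_PER_NODE: print the number of whole nodes
# required, rounding up when the slots do not divide evenly.
nodes_needed() {
    slots=$1; cores_per_node=$2
    echo $(( (slots + cores_per_node - 1) / cores_per_node ))
}

nodes_needed 384 24    # 16
```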

HPCC benchmark results for three runs:

Result                            ibverbs (v1.4.4)   psm (v1.6.1)   psm (v1.6.3)
HPL_Tflops                        1.68496            2.08056        2.08719
StarDGEMM_Gflops                  14.6933            14.8339        14.8108
SingleDGEMM_Gflops                15.642             15.536         15.7652
PTRANS_GBs                        9.25899            18.4793        21.2181
StarFFT_Gflops                    1.19982            1.25452        1.24688
StarSTREAM_Triad                  3.62601            3.65631        3.5967
SingleFFT_Gflops                  1.44111            1.44416        1.45183
MPIFFT_Gflops                     7.67835            77.603         68.98
RandomlyOrderedRingLatency_usec   65.8478            2.44898        2.39858

More results

N = 72000, NB = 100, P = 12, Q=16, NP=384

package `acml/5.2.0-open64-fma4`
package `open64/4.5`
package `openmpi/1.4.4-open64`

package `acml/5.3.0-open64-fma4`
package `open64/4.5`
package `openmpi/1.6.1-open64`

Result                            139765       145105
HPL_Tflops                        1.54221      0.364243
StarDGEMM_Gflops                  14.6954      13.6194
SingleDGEMM_Gflops                15.6919      15.453
PTRANS_GBs                        1.14913      1.07982
MPIRandomAccess_GUPs              0.00410853   0.00679052
StarSTREAM_Triad                  3.39698      2.83863
StarFFT_Gflops                    1.14746      0.737805
SingleFFT_Gflops                  1.45413      1.3756
MPIFFT_Gflops                     34.3552      32.3555
RandomlyOrderedRingLatency_usec   77.9332      76.9595