HPCC with open64 compiler, ACML and base FFT
Make
Here are some modifications for Mills based on recommendations from "Using ACML (AMD Core Math Library) in the High Performance Computing Challenge (HPCC)".
Changes to the make file hpl/setup/Make.Linux_ATHLON_FBLAS, copied to hpl/Make.open64-acml (a scripted sketch of these edits follows the list):

- Comment lines beginning in `MP` or `LA`
- Change `/usr/bin/gcc` to `mpicc`
- Change `/usr/bin/g77` to `mpif77`
- Append `-DHPCC_FFT_235` to `CCFLAGS`
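As a sketch, the same edits could be applied with sed after copying the file; the sed patterns are assumptions about the stock Make.Linux_ATHLON_FBLAS layout, so inspect the resulting file by hand.

```bash
#!/bin/bash
# Sketch only: copy the reference makefile and apply the edits listed above.
# The sed patterns assume the stock Make.Linux_ATHLON_FBLAS layout.
cd hpl                                   # from the top of the hpcc source tree (path assumed)
cp setup/Make.Linux_ATHLON_FBLAS Make.open64-acml

# Comment out lines beginning with MP or LA
sed -i 's/^\(MP\|LA\)/# \1/' Make.open64-acml

# Use the MPI compiler wrappers instead of the system compilers
sed -i -e 's|/usr/bin/gcc|mpicc|g' -e 's|/usr/bin/g77|mpif77|g' Make.open64-acml

# Append -DHPCC_FFT_235 to the CCFLAGS definition
sed -i 's/^\(CCFLAGS[[:space:]]*=.*\)/\1 -DHPCC_FFT_235/' Make.open64-acml
```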
The VALET commands are:

vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64
Exported variables (to set values for the commented-out LAinc and LAlib):

export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"
Make command, building with 4 parallel make jobs:
make -j 4 arch=open64-acml
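Putting the pieces together, a build session might look like the following sketch, assuming it is run from the top of the hpcc source tree in a shell where VALET is initialized.

```bash
#!/bin/bash
# Sketch of the full build sequence described above (working directory assumed
# to be the top of the hpcc source tree).
vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64

# The *_devrequire form populates CPPFLAGS/LDFLAGS; export LAinc/LAlib so they
# replace the commented-out settings in hpl/Make.open64-acml.
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"

# Build against hpl/Make.open64-acml with 4 parallel make jobs.
make -j 4 arch=open64-acml
```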
NCPU=1 vs 2
Packages: `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
N = 30000, NB = 100, P = 6, Q = 8

These runs need 48 processes (a 6 × 8 process grid). The same number of processes is run with either 48 or 96 slots; the flags for the two cases are shown in the table, and a sketch of the corresponding mpirun lines follows it.
Options | Grid Engine | MPI flags |
---|---|---|
NCPU=1 | -pe openmpi 48 | --bind-to-core --bycore |
NCPU=2 | -pe openmpi 96 | --bind-to-core --bycore --cpus-per-proc 2 -np 48 --loadbalance |
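As a sketch, the mpirun invocations inside the job script would combine those flags as follows; the ./hpcc path is an assumption, and the slot and host handling come from the Grid Engine openmpi parallel environment.

```bash
#!/bin/bash
# Sketch only: mpirun lines matching the table above.
# The ./hpcc path is an assumption; slots and hosts come from the openmpi PE.

# NCPU=1 (-pe openmpi 48): one process per core, 48 processes in total
mpirun --bind-to-core --bycore ./hpcc

# NCPU=2 (-pe openmpi 96): 48 processes, each bound to a pair of cores
mpirun --bind-to-core --bycore --cpus-per-proc 2 -np 48 --loadbalance ./hpcc
```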
HPCC benchmark results for two runs:
result | NCPU=1 | NCPU=2 |
---|---|---|
HPL_Tflops | 0.0769491 | 1.54221 |
StarDGEMM_Gflops | 1.93686 | 14.6954 |
SingleDGEMM_Gflops | 11.5042 | 15.6919 |
MPIRandomAccess_LCG_GUPs | 0.0195047 | 0.00352421 |
MPIRandomAccess_GUPs | 0.0194593 | 0.00410853 |
StarRandomAccess_LCG_GUPs | 0.0113424 | 0.0302748 |
SingleRandomAccess_LCG_GUPs | 0.0448261 | 0.0568664 |
StarRandomAccess_GUPs | 0.0113898 | 0.0288637 |
SingleRandomAccess_GUPs | 0.0521811 | 0.053262 |
StarFFT_Gflops | 0.557555 | 1.14746 |
SingleFFT_Gflops | 1.2178 | 1.45413 |
MPIFFT_Gflops | 5.31624 | 34.3552 |
ibverbs vs psm
Packages: `acml/5.3.0-open64-fma4`, `open64/4.5`, and `openmpi/1.4.4-open64`, `openmpi/1.6.1-open64`, or `openmpi/1.6.3-open64`
N = 72000, NB = 100, P = 12, Q = 16
nproc = 2 × 192 (384 slots, with each of the 192 MPI workers bound to a Bulldozer core pair)
The runs differ mainly in whether QLogic PSM endpoints are used. For full examples of Grid Engine scripts, see the files /opt/shared/templates/openmpi-ibverb.qs and /opt/shared/templates/openmpi-psm.qs.

All jobs bind processes by core, using 2 CPUs per process.
Common flags
Grid Engine | -pe openmpi 384 -l exclusive=1 |
---|---|
MPI flags | --bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance |
Flags specific to ibverbs vs psm (example mpirun lines follow the table):
Option | Grid Engine | MPI flags |
---|---|---|
ibverbs | | --mca btl ^tcp --mca mtl ^psm |
psm | -l psm_endpoints=1 | --mca btl ^tcp |
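As a sketch (the ./hpcc path is an assumption), the common flags and the transport-specific MCA options combine on the mpirun command line like this:

```bash
#!/bin/bash
# Sketch only: combine the common flags with the transport-specific MCA options
# from the tables above.
COMMON="--bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance"

# ibverbs run: exclude the TCP BTL and the PSM MTL
mpirun $COMMON --mca btl ^tcp --mca mtl ^psm ./hpcc

# PSM run (job submitted with -l psm_endpoints=1): exclude only the TCP BTL
mpirun $COMMON --mca btl ^tcp ./hpcc
```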
The job submission command:

qsub -q standby-4h.q@@24core -l standby=1,h_rt=4:00:00 bhpcc.qs

The host group @24core ensures the job runs only on 24-core nodes; this is required because the load-balancing algorithm needs the same core count on every node. Use the qconf command to list the nodes in host group @24core:

qconf -shgrp @24core
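For reference, a minimal sketch of what a bhpcc.qs script for the PSM run could look like, assembled from the flags above; the VALET lines and the ./hpcc path are assumptions, and a real script should start from /opt/shared/templates/openmpi-psm.qs.

```bash
#!/bin/bash
#$ -N bhpcc
#$ -cwd
#$ -pe openmpi 384
#$ -l exclusive=1
#$ -l psm_endpoints=1
#
# Sketch only: PSM-variant job script assembled from this section's tables.
# Start a real script from /opt/shared/templates/openmpi-psm.qs.

vpkg_require acml/5.3.0-open64-fma4    # assumed; match the packages used for the build
vpkg_require openmpi/1.6.3-open64

mpirun --bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance \
       --mca btl ^tcp ./hpcc
```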
HPCC benchmark results for three runs:
Result | ibverb (v1.4.4) | psm (v1.6.1) | psm (v1.6.3) |
---|---|---|---|
HPL_Tflops | 1.68496 | 2.08056 | 2.08719 |
StarDGEMM_Gflops | 14.6933 | 14.8339 | 14.8108 |
SingleDGEMM_Gflops | 15.642 | 15.536 | 15.7652 |
PTRANS_GBs | 9.25899 | 18.4793 | 21.2181 |
StarFFT_Gflops | 1.19982 | 1.25452 | 1.24688 |
StarSTREAM_Triad | 3.62601 | 3.65631 | 3.5967 |
SingleFFT_Gflops | 1.44111 | 1.44416 | 1.45183 |
MPIFFT_Gflops | 7.67835 | 77.603 | 68.98 |
RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 | 2.39858 |
More results
N = 72000, NB = 100, P = 12, Q = 16, NP = 384

Packages for the two runs:

- `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
- `acml/5.3.0-open64-fma4`, `open64/4.5`, `openmpi/1.6.1-open64`
Result | Job 139765 | Job 145105 |
---|---|---|
HPL_Tflops | 1.54221 | 0.364243 |
StarDGEMM_Gflops | 14.6954 | 13.6194 |
SingleDGEMM_Gflops | 15.6919 | 15.453 |
PTRANS_GBs | 1.14913 | 1.07982 |
MPIRandomAccess_GUPs | 0.00410853 | 0.00679052 |
StarSTREAM_Triad | 3.39698 | 2.83863 |
StarFFT_Gflops | 1.14746 | 0.737805 |
SingleFFT_Gflops | 1.45413 | 1.3756 |
MPIFFT_Gflops | 34.3552 | 32.3555 |
RandomlyOrderedRingLatency_usec | 77.9332 | 76.9595 |