HPCC with Intel compiler, MKL and base FFT
Make
Start by downloading and extracting the hpcc-1.4.3 directory:
curl -s http://icl.cs.utk.edu/projectsfiles/hpcc/download/hpcc-1.4.3.tar.gz | tar zx
The hpcc-1.4.3 directory will have all the files you need to run the benchmark. Our job is to modify the setup for the Intel compiler and MKL on Farber, which uses VALET.
Copy the make file `hpl/setup/Make.Linux_ATHLON_FBLAS` to `hpl/Make.intel-mkl`, then edit the copy:
- Comment out the lines beginning with `MP` or `LA`
- Change `/usr/bin/gcc` to `mpicc`
- Change `/usr/bin/g77` to `mpif77`
- Change `CCFLAGS` to `-mkl -O3 -fno-alias -DHPCC_FFT_235`
- Change `LINKFLAGS` to `-mkl -nofor-main`
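The copy and the edits listed above can also be scripted. The following is a minimal sketch, not the procedure the page's author used: `patch_hpl_makefile` is a hypothetical helper name, and it assumes the template uses the stock HPL variable names (`MPdir`, `LAdir`, `CC`, `CCFLAGS`, `LINKER`, `LINKFLAGS`). Keeping `$(HPL_DEFS)` at the front of `CCFLAGS` is also an assumption, since the text gives only the compiler flags.

```shell
# Sketch only: write an edited copy of the template Make file.
# Usage: patch_hpl_makefile <template> <output>
# Edits, in order: comment out MP*/LA* lines, swap gcc/g77 for the MPI
# wrappers, then replace the CCFLAGS and LINKFLAGS values.
# ASSUMPTION: $(HPL_DEFS) is kept in CCFLAGS; the text does not say so.
patch_hpl_makefile() {
    sed -E \
        -e 's/^(MP|LA)/# &/' \
        -e 's|/usr/bin/gcc|mpicc|' \
        -e 's|/usr/bin/g77|mpif77|' \
        -e 's|^(CCFLAGS[[:space:]]*=).*|\1 $(HPL_DEFS) -mkl -O3 -fno-alias -DHPCC_FFT_235|' \
        -e 's|^(LINKFLAGS[[:space:]]*=).*|\1 -mkl -nofor-main|' \
        "$1" > "$2"
}
```

From the hpcc-1.4.3 directory it would be called as `patch_hpl_makefile hpl/setup/Make.Linux_ATHLON_FBLAS hpl/Make.intel-mkl`. Writing to a new file leaves the template untouched, so the result can be compared with `diff` before building.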
The VALET commands are:
vpkg_devrequire intel
vpkg_devrequire openmpi/1.8.2-intel64
Exported variables (to set values for the commented-out `LAinc` and `LAlib`):
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -nofor-main"
Make command with 4 parallel jobs:
make -j 4 arch=intel-mkl
Runs: N = 30000
Packages: `intel/2015.0.090`, `openmpi/1.8.2-intel64`
N = 30000, NB = 200, P = 5, Q = 8
These runs need 40 MPI processes, arranged as a 5 x 8 process grid (P = 5 process rows, Q = 8 process columns). The job was run with 40 slots, one per process.
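The process count follows from the grid dimensions, and the problem size N fixes how much memory the HPL matrix needs (N x N double-precision values at 8 bytes each). The sketch below does that arithmetic. `hpl_grid_info` is a hypothetical helper and not part of HPCC, and it needs a shell with 64-bit integer arithmetic.

```shell
# Hypothetical helper (not part of HPCC): summarize an HPL problem size.
# Usage: hpl_grid_info N P Q
hpl_grid_info() {
    n="$1"; p="$2"; q="$3"
    # The P x Q process grid needs one MPI process per grid cell.
    nprocs=$((p * q))
    # N x N matrix of 8-byte doubles, in decimal megabytes.
    matrix_mb=$((n * n * 8 / 1000000))
    echo "processes=$nprocs matrix_MB=$matrix_mb per_process_MB=$((matrix_mb / nprocs))"
}

hpl_grid_info 30000 5 8
# prints: processes=40 matrix_MB=7200 per_process_MB=180
```

The same call with `72000 12 16` covers the larger run. These figures count the matrix only, so actual memory use is higher once the other HPCC tests and MPI buffers are included.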
WEB NAME | VALUE | UNITS |
---|---|---|
G-HPL | 0.6201 | TeraFlops/Sec |
G-PTRANS | 0.0127 | TeraBytes/Sec |
G-RandomAccess | 0.0789 | GigaUpdates/Sec |
G-FFT | 0.0222 | TeraFlops/Sec |
EP-STREAM Sys | 0.1638 | TeraBytes/Sec |
EP-STREAM Triad | 4.0951 | GigaBytes/Sec |
EP-DGEMM | 14.7898 | GigaFlops/Sec |
RandomRing Bandwidth | 0.5619 | GigaBytes/Sec |
RandomRing Latency | 2.1133 | micro-seconds |
qacct values:
Field | Value |
---|---|
ru_wallclock | 130.716 |
ru_utime | 5114.084 |
ru_stime | 39.425 |
maxvmem | 25.174G |
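One way to read these numbers is to compare the accumulated CPU time (user plus system) with the wall-clock time multiplied by the number of slots. The sketch below is an illustration only. `cpu_efficiency` is a hypothetical helper, and this definition of efficiency is an assumption, not something qacct reports.

```shell
# Hypothetical helper: percentage of the allocated slot time spent on CPU.
# Usage: cpu_efficiency <ru_wallclock> <ru_utime> <ru_stime> <slots>
cpu_efficiency() {
    awk -v w="$1" -v u="$2" -v s="$3" -v n="$4" \
        'BEGIN { printf "%.1f\n", 100 * (u + s) / (w * n) }'
}

cpu_efficiency 130.716 5114.084 39.425 40
# prints: 98.6
```

A value near 100 suggests that the 40 processes were busy for almost the whole run.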
Runs: N = 72000
Packages: `acml/5.3.0-open64-fma4`, `open64/4.5`, and `openmpi/1.4.4-open64` or `openmpi/1.6.1-open64`
N = 72000, NB = 100, P = 12, Q = 16
nproc = 2x192 (384 slots, with 192 MPI workers each bound to a Bulldozer core pair)
The two runs differ mainly in the use of QLogic PSM endpoints.
Result | ^ PSM (v1.4.4) | PSM (v1.6.1) |
---|---|---|
HPL_Tflops | 1.68496 | 2.08056 |
StarDGEMM_Gflops | 14.6933 | 14.8339 |
SingleDGEMM_Gflops | 15.642 | 15.536 |
PTRANS_GBs | 9.25899 | 18.4793 |
StarFFT_Gflops | 1.19982 | 1.25452 |
StarSTREAM_Triad | 3.62601 | 3.65631 |
SingleFFT_Gflops | 1.44111 | 1.44416 |
MPIFFT_Gflops | 7.67835 | 77.603 |
RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 |