====== HPCC with Intel compiler, MKL and base FFT ======

===== Make =====

Start by downloading and extracting the hpcc-1.4.3 directory:

<code bash>
curl -s http://icl.cs.utk.edu/projectsfiles/hpcc/download/hpcc-1.4.3.tar.gz | tar zx
</code>

The ''hpcc-1.4.3'' directory has all the files you need to run the benchmark. Our job is to modify the setup for the Intel compiler and MKL on Farber, which uses VALET.

Copy the make file ''hpl/setup/Make.Linux_ATHLON_FBLAS'' to ''hpl/Make.intel-mkl'' and edit the copy:

  - Comment out the lines beginning with ''MP'' or ''LA''
  - Change ''/usr/bin/gcc'' to ''mpicc''
  - Change ''/usr/bin/g77'' to ''mpif77''
  - Change ''CCFLAGS'' to ''-mkl -O3 -fno-alias -DHPCC_FFT_235''
  - Change ''LINKFLAGS'' to ''-mkl -nofor-main''

The VALET commands are

<code bash>
vpkg_devrequire intel
vpkg_devrequire openmpi/1.8.2-intel64
</code>

Exported variables (these set values for the commented-out ''LAinc'' and ''LAlib'')

<code bash>
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -nofor-main"
</code>

Make command with 4 parallel jobs

<code bash>
make -j 4 arch=intel-mkl
</code>

==== runs: N = 30000 ====

  * package ''intel/2015.0.090''
  * package ''openmpi/1.8.2-intel64''

N = 30000, NB = 200, P = 5, Q = 8

These runs need 40 processes (a P x Q grid of 5 process rows by 8 process columns). The job is run with 40 slots, one per process.
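The process count follows directly from the grid, since HPL places one MPI process in each cell of the P x Q grid. As a rough sanity check before submitting a job, the arithmetic can be sketched in bash. The variable names below (''SLOTS'', ''NPROC'', ''MATRIX_BYTES'', ''PER_PROC_MIB'') are made up for this illustration, and the memory figure counts only the N x N matrix of 8-byte doubles, not the workspace used by HPL or the other HPCC tests.

<code bash>
#!/usr/bin/env bash
# Sanity-check an HPL process grid and matrix size (illustrative sketch).
# N, P and Q are the values quoted above; SLOTS is the slot count requested.
N=30000
P=5
Q=8
SLOTS=40

NPROC=$(( P * Q ))                 # one MPI process per grid cell
MATRIX_BYTES=$(( 8 * N * N ))      # N x N matrix of 8-byte doubles
PER_PROC_MIB=$(( MATRIX_BYTES / NPROC / 1024 / 1024 ))   # integer division

echo "grid ${P}x${Q} needs ${NPROC} processes"
echo "matrix: ${MATRIX_BYTES} bytes total, about ${PER_PROC_MIB} MiB per process"

# Warn if the grid and the requested slots disagree.
if [ "$NPROC" -ne "$SLOTS" ]; then
    echo "warning: grid needs ${NPROC} processes but ${SLOTS} slots were requested" >&2
fi
</code>

The same check applies to any other grid by changing ''N'', ''P'', ''Q'' and ''SLOTS''.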
^ WEB NAME ^ VALUE ^ UNITS ^
| G-HPL | 0.6201 | TeraFlops/Sec |
| G-PTRANS | 0.0127 | TeraBytes/Sec |
| G-RandomAccess | 0.0789 | GigaUpdates/Sec |
| G-FFT | 0.0222 | TeraFlops/Sec |
| EP-STREAM Sys | 0.1638 | TeraBytes/Sec |
| EP-STREAM Triad | 4.0951 | GigaBytes/Sec |
| EP-DGEMM | 14.7898 | GigaFlops/Sec |
| RandomRing Bandwidth | 0.5619 | GigaBytes/Sec |
| RandomRing Latency | 2.1133 | micro-seconds |

qacct values:

  * ''ru_wallclock'' 130.716
  * ''ru_utime'' 5114.084
  * ''ru_stime'' 39.425
  * ''maxvmem'' 25.174G

==== runs: N = 72000 ====

  * package ''acml/5.3.0-open64-fma4''
  * package ''open64/4.5''
  * package ''openmpi/1.4.4-open64'' or ''openmpi/1.6.1-open64''

N = 72000, NB = 100, P = 12, Q = 16

nproc = 2x192 (384 slots, with 192 MPI workers each bound to a Bulldozer core pair)

The two runs differ mostly in the use of QLogic PSM endpoints.

^ Result ^ %%^%%PSM (v1.4.4) ^ PSM (v1.6.1) ^
| HPL_Tflops | 1.68496 | 2.08056 |
| StarDGEMM_Gflops | 14.6933 | 14.8339 |
| SingleDGEMM_Gflops | 15.642 | 15.536 |
| PTRANS_GBs | 9.25899 | 18.4793 |
| StarFFT_Gflops | 1.19982 | 1.25452 |
| StarSTREAM_Triad | 3.62601 | 3.65631 |
| SingleFFT_Gflops | 1.44111 | 1.44416 |
| MPIFFT_Gflops | 7.67835 | 77.603 |
| RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 |

software/hpcc/farber.txt · Last modified: 2018-05-08 13:28 by sraskar