====== HPCC with Intel compiler, MKL and base FFT ======

===== Make =====

Start by downloading and extracting the HPCC 1.4.3 source tarball, which creates the ''hpcc-1.4.3'' directory:
<code bash>
curl -s http://icl.cs.utk.edu/projectsfiles/hpcc/download/hpcc-1.4.3.tar.gz | tar zx
</code> 

The ''hpcc-1.4.3'' directory contains all the files needed to build and run the benchmark.  Our job
is to modify the build setup for the Intel compiler and MKL on farber, which uses VALET.

Copy the make file ''hpl/setup/Make.Linux_ATHLON_FBLAS'' to ''hpl/Make.intel-mkl'', then edit the copy:

  - Comment out the lines beginning with ''MP'' or ''LA''
  - Change ''/usr/bin/gcc'' to ''mpicc''
  - Change ''/usr/bin/g77'' to ''mpif77''
  - Change ''CCFLAGS'' to ''-mkl -O3 -fno-alias -DHPCC_FFT_235''
  - Change ''LINKFLAGS'' to ''-mkl -nofor-main''
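
The edits listed above can also be scripted. The following is a minimal sketch with ''sed'', not a tested recipe for every HPCC release. The function name ''patch_hpl_makefile'' is hypothetical, and keeping ''$(HPL_DEFS)'' at the front of ''CCFLAGS'' is an assumption (the stock file passes the HPL include paths that way). Review the result with ''diff'' before building.

<code bash>
#!/usr/bin/env bash
# Sketch: apply the Makefile edits from the list above with sed.
# patch_hpl_makefile is a hypothetical helper, not part of HPCC.
patch_hpl_makefile() {
    local src=$1 dst=$2
    # 1. comment out every line that starts with MP or LA
    # 2./3. swap the hard-coded compilers for the MPI wrappers
    # 4. replace the CCFLAGS value (keeping $(HPL_DEFS) is an assumption)
    # 5. replace the LINKFLAGS value (the linker-flags variable in HPL make files)
    sed -E \
        -e 's/^(MP|LA)/# &/' \
        -e 's|/usr/bin/gcc|mpicc|' \
        -e 's|/usr/bin/g77|mpif77|' \
        -e 's|^(CCFLAGS[[:space:]]*=).*|\1 $(HPL_DEFS) -mkl -O3 -fno-alias -DHPCC_FFT_235|' \
        -e 's|^(LINKFLAGS[[:space:]]*=).*|\1 -mkl -nofor-main|' \
        "$src" > "$dst"
}

# Apply to the real files only when run from inside hpcc-1.4.3.
if [ -f hpl/setup/Make.Linux_ATHLON_FBLAS ]; then
    patch_hpl_makefile hpl/setup/Make.Linux_ATHLON_FBLAS hpl/Make.intel-mkl
fi
</code>

Afterwards, ''diff hpl/setup/Make.Linux_ATHLON_FBLAS hpl/Make.intel-mkl'' should show only the changes from the list.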
  
The VALET commands that set up the build environment are:

<code bash>
vpkg_devrequire intel
vpkg_devrequire openmpi/1.8.2-intel64
</code>

Export these variables to supply values for the commented-out ''LAinc'' and ''LAlib'':

<code bash>
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -nofor-main"
</code>
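
Exporting works because make turns every environment variable into a make variable unless the make file assigns that name itself. That is why the ''LA'' lines have to be commented out first. The throwaway demo below illustrates the mechanism only: the temporary make file and the ''-L/hypothetical/mkl/lib'' value are placeholders, not the real farber settings.

<code bash>
#!/usr/bin/env bash
# Sketch: a make file that does not assign LAlib picks it up from the environment.
demo_mk=$(mktemp)
printf '%s\n' '# LAlib is commented out, as in Make.intel-mkl' '.PHONY: show' 'show:' > "$demo_mk"
printf '\t@echo "LAlib=$(LAlib)"\n' >> "$demo_mk"

# Placeholder value; on farber the real one comes from $LDFLAGS set by VALET.
result=$(LAlib="-L/hypothetical/mkl/lib -nofor-main" make -s -f "$demo_mk" show)
echo "$result"
rm -f "$demo_mk"
</code>

If the make file contained an uncommented ''LAlib = ...'' line, that assignment would take precedence over the environment, and the exported value would be ignored.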

Run make with 4 parallel jobs:

<code bash>
make -j 4 arch=intel-mkl
</code>

==== runs: N = 30000 ====

  package `intel/2015.0.090` 
  package `openmpi/1.8.2-intel64` 

  N = 30000, NB = 200, P = 5, Q = 8
  
These runs need 40 MPI processes arranged in a 5 x 8 process grid (P = 5 process rows, Q = 8 process columns), so the job was run with 40 slots, one per process.
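
The process count and the size of the HPL matrix follow directly from N, P and Q. Here is a small sketch of that arithmetic. It assumes the matrix is N x N double-precision values (8 bytes each) and ignores every other buffer the benchmark allocates. The helper name ''hpl_grid_summary'' is hypothetical.

<code bash>
#!/usr/bin/env bash
# Sketch: process count and HPL matrix size for a given N, P, Q.
# hpl_grid_summary is a hypothetical helper; sizes use 1 MB = 10^6 bytes.
hpl_grid_summary() {
    local n=$1 p=$2 q=$3
    local nprocs=$(( p * q ))                    # one MPI process per grid cell
    local matrix_mb=$(( n * n * 8 / 1000000 ))   # N x N doubles, 8 bytes each
    echo "$nprocs processes, $matrix_mb MB matrix, $(( matrix_mb / nprocs )) MB per process"
}

hpl_grid_summary 30000 5 8
# prints: 40 processes, 7200 MB matrix, 180 MB per process
</code>

The same helper can be applied to the larger run further down the page with ''hpl_grid_summary 72000 12 16''.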

^ WEB NAME              ^         VALUE ^ UNITS ^
| G-HPL                 |        0.6201 | TeraFlops/Sec   |
| G-PTRANS              |        0.0127 | TeraBytes/Sec   |
| G-RandomAccess        |        0.0789 | GigaUpdates/Sec |
| G-FFT                 |        0.0222 | TeraFlops/Sec   |
| EP-STREAM Sys         |        0.1638 | TeraBytes/Sec   |
| EP-STREAM Triad       |        4.0951 | GigaBytes/Sec   |
| EP-DGEMM              |       14.7898 | GigaFlops/Sec   |
| RandomRing Bandwidth  |        0.5619 | GigaBytes/Sec   |
| RandomRing Latency    |        2.1133 | micro-seconds   |

Accounting values reported by ''qacct'' (times in seconds):

  ru_wallclock 130.716      
  ru_utime     5114.084     
  ru_stime     39.425       
  maxvmem      25.174G
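
One way to read these numbers is as a CPU efficiency: the CPU time actually used (user plus system) divided by the CPU time available (wallclock times slots). The sketch below assumes each of the 40 slots corresponds to one core. The function name ''cpu_efficiency'' is hypothetical.

<code bash>
#!/usr/bin/env bash
# Sketch: percent of the reserved CPU time that the job actually used.
# Arguments: ru_wallclock ru_utime ru_stime slots (times in seconds).
cpu_efficiency() {
    awk -v w="$1" -v u="$2" -v s="$3" -v n="$4" \
        'BEGIN { printf "%.1f\n", 100 * (u + s) / (w * n) }'
}

cpu_efficiency 130.716 5114.084 39.425 40
# prints: 98.6
</code>

A value close to 100 suggests the processes spent nearly the whole run computing or busy-waiting inside MPI rather than sleeping. It says nothing about how many of those cycles did useful floating-point work.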



==== runs: N = 72000 ====

  package `acml/5.3.0-open64-fma4` 
  package `open64/4.5` 
  package `openmpi/1.4.4-open64` or  `openmpi/1.6.1-open64`

  N = 72000, NB = 100, P = 12, Q = 16

  nproc = 2x192   (384 slots, with each of the 192 MPI workers bound to a Bulldozer core pair)

The two runs differ mainly in whether QLogic PSM endpoints were used:


^ Result  ^  no PSM (v1.4.4)  ^  PSM (v1.6.1) ^
| HPL_Tflops |  1.68496 |  2.08056 |
| StarDGEMM_Gflops |  14.6933 |  14.8339 |
| SingleDGEMM_Gflops |  15.642 |  15.536 |
| PTRANS_GBs |  9.25899 |  18.4793 |
| StarFFT_Gflops |  1.19982 |  1.25452 |
| StarSTREAM_Triad |  3.62601 |  3.65631 |
| SingleFFT_Gflops |  1.44111 |  1.44416 |
| MPIFFT_Gflops |  7.67835 |  77.603 |
| RandomlyOrderedRingLatency_usec |  65.8478 |  2.44898 |
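
To see at a glance which results changed between the two runs, the ratio of the v1.6.1 value to the v1.4.4 value can be computed per row. Here is a minimal sketch with ''awk'', using four rows copied from the table above. The function name ''run_ratios'' is hypothetical.

<code bash>
#!/usr/bin/env bash
# Sketch: ratio of the second value column to the first, per metric.
# Input columns: metric name, v1.4.4 value, v1.6.1 value.
run_ratios() {
    awk '{ printf "%-32s %6.2fx\n", $1, $3 / $2 }'
}

run_ratios <<'EOF'
HPL_Tflops 1.68496 2.08056
PTRANS_GBs 9.25899 18.4793
MPIFFT_Gflops 7.67835 77.603
RandomlyOrderedRingLatency_usec 65.8478 2.44898
EOF
</code>

A ratio above 1 means the v1.6.1 run reported the larger number. For the latency row a smaller number is better, so a ratio well below 1 is the improvement there.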