====== HPCC with open64 compiler, ACML and base FFT ======

===== Make =====

Here are some modifications, made on Mills, based on recommendations from [[http://developer.amd.com/wordpress/media/2012/10/ACMLinHPCC.pdf|Using ACML (AMD Core Math Library) In High Performance Computing Challenge (HPCC)]].

Changes to the makefile ''hpl/setup/Make.Linux_ATHLON_FBLAS'', copied to ''hpl/Make.open64-acml'':

  - Comment out lines beginning with ''MP'' or ''LA''
  - Change ''/usr/bin/gcc'' to ''mpicc''
  - Change ''/usr/bin/g77'' to ''mpif77''
  - Append ''-DHPCC_FFT_235'' to ''CCFLAGS''

The VALET commands are:

  vpkg_devrequire acml/5.2.0-open64-fma4
  vpkg_devrequire openmpi/1.6.1-open64

Exported variables (to supply values for the commented-out ''LAinc'' and ''LAlib''):

  export LAinc="$CPPFLAGS"
  export LAlib="$LDFLAGS -lacml"

Make command, using 4 parallel make jobs:

  make -j 4 arch=open64-acml

==== runs: N = 30000 ====

Packages:

  package `acml/5.2.0-open64-fma4`
  package `open64/4.5`
  package `openmpi/1.4.4-open64`

N = 30000, NB = 100, P = 6, Q = 8

These runs need 48 MPI processes (a 6 x 8 process grid). The same number of processes is run in either 48 or 96 Grid Engine slots.

^ Options ^ Grid Engine ^ MPI flags ^
| NCPU=1 | -pe openmpi 48 | --bind-to-core |
| NCPU=2 | -pe openmpi 96 | --bind-to-core --bycore --cpus-per-proc 2 -np 48 |

HPCC benchmark results for the two runs:

^ Result ^ NCPU=1 ^ NCPU=2 ^
| HPL_Tflops | 0.0769491 | 1.54221 |
| StarDGEMM_Gflops | 1.93686 | 14.6954 |
| SingleDGEMM_Gflops | 11.5042 | 15.6919 |
| MPIRandomAccess_LCG_GUPs | 0.0195047 | 0.00352421 |
| MPIRandomAccess_GUPs | 0.0194593 | 0.00410853 |
| StarRandomAccess_LCG_GUPs | 0.0113424 | 0.0302748 |
| SingleRandomAccess_LCG_GUPs | 0.0448261 | 0.0568664 |
| StarRandomAccess_GUPs | 0.0113898 | 0.0288637 |
| SingleRandomAccess_GUPs | 0.0521811 | 0.053262 |
| StarFFT_Gflops | 0.557555 | 1.14746 |
| SingleFFT_Gflops | 1.2178 | 1.45413 |
| MPIFFT_Gflops | 5.31624 | 34.3552 |

==== runs: N = 72000 ====

Packages:

  package `acml/5.3.0-open64-fma4`
  package `open64/4.5`
  package `openmpi/1.4.4-open64` or `openmpi/1.6.1-open64`

N = 72000, NB = 100, P = 12, Q = 16

nproc = 2x192 (384 slots, with 192 MPI workers each bound to a Bulldozer core pair)

The two runs differ mainly in the use of QLogic PSM endpoints:

^ Result ^ no PSM (v1.4.4) ^ PSM (v1.6.1) ^
| HPL_Tflops | 1.68496 | 2.08056 |
| StarDGEMM_Gflops | 14.6933 | 14.8339 |
| SingleDGEMM_Gflops | 15.642 | 15.536 |
| PTRANS_GBs | 9.25899 | 18.4793 |
| StarFFT_Gflops | 1.19982 | 1.25452 |
| StarSTREAM_Triad | 3.62601 | 3.65631 |
| SingleFFT_Gflops | 1.44111 | 1.44416 |
| MPIFFT_Gflops | 7.67835 | 77.603 |
| RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 |

==== more ====

N = 72000, NB = 100, P = 12, Q = 16, NP = 384

Packages for the two runs:

  package `acml/5.2.0-open64-fma4` to your environment
  package `open64/4.5` to your environment
  package `openmpi/1.4.4-open64` to your environment

  package `acml/5.3.0-open64-fma4` to your environment
  package `open64/4.5` to your environment
  package `openmpi/1.6.1-open64` to your environment

^ Result ^ 139765 ^ 145105 ^
| HPL_Tflops | 1.54221 | 0.364243 |
| StarDGEMM_Gflops | 14.6954 | 13.6194 |
| SingleDGEMM_Gflops | 15.6919 | 15.453 |
| PTRANS_GBs | 1.14913 | 1.07982 |
| MPIRandomAccess_GUPs | 0.00410853 | 0.00679052 |
| StarSTREAM_Triad | 3.39698 | 2.83863 |
| StarFFT_Gflops | 1.14746 | 0.737805 |
| SingleFFT_Gflops | 1.45413 | 1.3756 |
| MPIFFT_Gflops | 34.3552 | 32.3555 |
| RandomlyOrderedRingLatency_usec | 77.9332 | 76.9595 |
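
For reference, a minimal Grid Engine submission script consistent with the NCPU=2 configuration above might look like the sketch below. The job name, the use of ''vpkg_require'' at run time (and its availability in the batch shell), and the location of the ''hpcc'' binary are assumptions, not details taken from the runs documented here; only the ''-pe openmpi 96'' request and the mpirun flags come from the table in the N = 30000 section.

<code bash>
#!/bin/bash
# Grid Engine directives: 96 slots, as in the NCPU=2 row of the options table.
#$ -N hpcc-open64-acml
#$ -pe openmpi 96
#$ -cwd

# Load the run-time environment (same packages used for the build).
vpkg_require acml/5.2.0-open64-fma4
vpkg_require openmpi/1.6.1-open64

# 48 MPI ranks, each bound to a Bulldozer core pair (2 cores per rank),
# matching the MPI flags in the NCPU=2 row.
mpirun --bind-to-core --bycore --cpus-per-proc 2 -np 48 ./hpcc
</code>

With ''-pe openmpi 48'', plain ''--bind-to-core'', and no ''-np'' override, the same script would correspond to the NCPU=1 row instead.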