====== HPCC with open64 compiler, ACML and base FFT ======

===== Make =====

Here are some modifications, made on Mills, based on recommendations from [[http://developer.amd.com/wordpress/media/2012/10/ACMLinHPCC.pdf|Using ACML (AMD Core Math Library) In High Performance Computing Challenge (HPCC)]].

Changes to the make file ''hpl/setup/Make.Linux_ATHLON_FBLAS'', copied to ''hpl/Make.open64-acml'':

  - Comment out lines beginning with ''MP'' or ''LA''
  - Change ''/usr/bin/gcc'' to ''mpicc''
  - Change ''/usr/bin/g77'' to ''mpif77''
  - Append ''-DHPCC_FFT_235'' to ''CCFLAGS''

The VALET commands:

<code bash>
vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64
</code>

Exported variables, to set values for the commented-out ''LAinc'' and ''LAlib'' from the ''CPPFLAGS'' and ''LDFLAGS'' that ''vpkg_devrequire'' provides:

<code bash>
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"
</code>

Make command, building with 4 parallel jobs and the ''open64-acml'' makefile:

<code bash>
make -j 4 arch=open64-acml
</code>

==== NCPU=1 vs 2 ====

Packages:
  * ''acml/5.2.0-open64-fma4''
  * ''open64/4.5''
  * ''openmpi/1.4.4-open64''

N = 30000, NB = 100, P = 6, Q = 8

These runs need 48 processes (a 6 x 8 process grid). The same number of processes is run with either 48 or 96 slots.

^ Options ^ Grid Engine ^ MPI flags ^
| ''NCPU=1'' | ''-pe openmpi 48'' | ''%%--bind-to-core --bycore%%'' |
| ''NCPU=2'' | ''-pe openmpi 96'' | ''%%--bind-to-core --bycore --cpus-per-proc 2 -np 48 --loadbalance%%'' |

HPCC benchmark results for the two runs:

^ Result ^ NCPU=1 ^ NCPU=2 ^
| HPL_Tflops | 0.0769491 | 1.54221 |
| StarDGEMM_Gflops | 1.93686 | 14.6954 |
| SingleDGEMM_Gflops | 11.5042 | 15.6919 |
| MPIRandomAccess_LCG_GUPs | 0.0195047 | 0.00352421 |
| MPIRandomAccess_GUPs | 0.0194593 | 0.00410853 |
| StarRandomAccess_LCG_GUPs | 0.0113424 | 0.0302748 |
| SingleRandomAccess_LCG_GUPs | 0.0448261 | 0.0568664 |
| StarRandomAccess_GUPs | 0.0113898 | 0.0288637 |
| SingleRandomAccess_GUPs | 0.0521811 | 0.053262 |
| StarFFT_Gflops | 0.557555 | 1.14746 |
| SingleFFT_Gflops | 1.2178 | 1.45413 |
| MPIFFT_Gflops | 5.31624 | 34.3552 |

==== ibverb vs psm ====

Packages:
  * ''acml/5.3.0-open64-fma4''
  * ''open64/4.5''
  * ''openmpi/1.4.4-open64'', ''openmpi/1.6.1-open64'' or ''openmpi/1.6.3-open64''

N = 72000, NB = 100, P = 12, Q = 16

nproc = 2 x 192 (384 slots, with each of the 192 MPI workers bound to a Bulldozer core pair)

The runs differ mainly in their use of QLogic PSM endpoints. For full examples of Grid Engine scripts, see the files ''/opt/shared/templates/openmpi-ibverb.qs'' and ''/opt/shared/templates/openmpi-psm.qs''. All jobs bind the processes by core, using 2 cpus per process.

Common flags:

^ Grid Engine | %%-pe openmpi 384 -l exclusive=1%% |
^ MPI flags | %%--bind-to-core --bycore --cpus-per-proc 2 --np 192 --loadbalance%% |

Flags for ''ibverbs'' vs ''psm'':

^ Option ^ Grid Engine ^ MPI flags ^
| ibverbs | | %%--mca btl ^tcp --mca mtl ^psm%% |
| psm | ''-l psm_endpoints=1'' | %%--mca btl ^tcp%% |

The job submission command:

<code bash>
qsub -q standby-4h.q@@24core -l standby=1,h_rt=4:00:00 bhpcc.qs
</code>

Since we need more than 240 slots (384 slots, to have enough cores for 192 processes taking 2 cores each), we specify a hard run-time limit of 4 hours so the job can run in the standby-4h queue. We use the host group ''@24core'' to make sure we only run on 24-core nodes, because the load-balancing algorithm requires the same core count on all the nodes.
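For illustration, here is a minimal sketch of what a ''bhpcc.qs'' script for the psm run could look like, assembled from the flags listed above. The ''vpkg_require'' lines, the working-directory handling, and the path to the ''hpcc'' binary are assumptions; the site templates in ''/opt/shared/templates'' remain the authoritative starting point.

<code bash>
#!/bin/bash
# Sketch of bhpcc.qs for the psm run; see /opt/shared/templates/openmpi-psm.qs
# for the full, supported template.
#$ -cwd
#$ -pe openmpi 384
#$ -l exclusive=1
#$ -l psm_endpoints=1

# Load the runtime environment matching the build (assumed VALET packages;
# vpkg_require is the runtime counterpart of vpkg_devrequire).
vpkg_require acml/5.3.0-open64-fma4
vpkg_require open64/4.5
vpkg_require openmpi/1.6.1-open64

# 192 MPI workers, each bound to a Bulldozer core pair; TCP is disabled so
# the QLogic PSM transport is used.
mpirun --mca btl ^tcp \
       --bind-to-core --bycore --cpus-per-proc 2 \
       -np 192 --loadbalance \
       ./hpcc            # path to the hpcc binary is an assumption
</code>

The queue, standby resource, and 4-hour run-time limit are supplied on the ''qsub'' command line shown above, so they do not need to be repeated inside the script.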
Use the ''qconf'' command to see the nodes in the host group ''@24core'':

<code bash>
qconf -shgrp @24core
</code>

HPCC benchmark results for the three runs:

^ Result ^ ibverb (v1.4.4) ^ psm (v1.6.1) ^ psm (v1.6.3) ^
| HPL_Tflops | 1.68496 | 2.08056 | 2.08719 |
| StarDGEMM_Gflops | 14.6933 | 14.8339 | 14.8108 |
| SingleDGEMM_Gflops | 15.642 | 15.536 | 15.7652 |
| PTRANS_GBs | 9.25899 | 18.4793 | 21.2181 |
| StarFFT_Gflops | 1.19982 | 1.25452 | 1.24688 |
| StarSTREAM_Triad | 3.62601 | 3.65631 | 3.5967 |
| SingleFFT_Gflops | 1.44111 | 1.44416 | 1.45183 |
| MPIFFT_Gflops | 7.67835 | 77.603 | 68.98 |
| RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 | 2.39858 |

==== more ====

N = 72000, NB = 100, P = 12, Q = 16, NP = 384

Environment for the first run (139765):
  * ''acml/5.2.0-open64-fma4''
  * ''open64/4.5''
  * ''openmpi/1.4.4-open64''

Environment for the second run (145105):
  * ''acml/5.3.0-open64-fma4''
  * ''open64/4.5''
  * ''openmpi/1.6.1-open64''

^ Result ^ 139765 ^ 145105 ^
| HPL_Tflops | 1.54221 | 0.364243 |
| StarDGEMM_Gflops | 14.6954 | 13.6194 |
| SingleDGEMM_Gflops | 15.6919 | 15.453 |
| PTRANS_GBs | 1.14913 | 1.07982 |
| MPIRandomAccess_GUPs | 0.00410853 | 0.00679052 |
| StarSTREAM_Triad | 3.39698 | 2.83863 |
| StarFFT_Gflops | 1.14746 | 0.737805 |
| SingleFFT_Gflops | 1.45413 | 1.3756 |
| MPIFFT_Gflops | 34.3552 | 32.3555 |
| RandomlyOrderedRingLatency_usec | 77.9332 | 76.9595 |