====== Scalapack linsolve benchmark ======

===== Fortran 90 source code =====

We base this benchmark on ''linsolve.f90'' from the online tutorial //Running on Mio Workshop//, presented January 25, 2011 at the Colorado School of Mines; see the [[http://inside.mines.edu/mio/tutorial/pdf/libs.pdf|slides]]. We get the program with

<code bash>
if [ ! -f "linsolve.f90" ]; then
  wget http://inside.mines.edu/mio/tutorial/libs/solvers/linsolve.f90
fi
</code>

This program reads one line of input to start the benchmark. The line must contain 5 numbers:

  * ''N'': order of the linear system
  * ''NPROC_ROWS'': number of rows in the process grid
  * ''NPROC_COLS'': number of columns in the process grid
  * ''ROW_BLOCK_SIZE'': blocking size for matrix rows
  * ''COL_BLOCK_SIZE'': blocking size for matrix columns

''ROW_BLOCK_SIZE * COL_BLOCK_SIZE'' cannot exceed ''500*500''. For this benchmark we will set ''N = NPROC_ROWS*ROW_BLOCK_SIZE = NPROC_COLS*COL_BLOCK_SIZE''. For example, to construct the one-line file ''in.dat'' from ''N'' and the maximum block sizes of 500:

<code bash>
let N=3000
let ROW_BLOCK_SIZE=500
let COL_BLOCK_SIZE=500
let NPROC_ROWS=$N/$ROW_BLOCK_SIZE
let NPROC_COLS=$N/$COL_BLOCK_SIZE
echo "$N $NPROC_ROWS $NPROC_COLS $ROW_BLOCK_SIZE $COL_BLOCK_SIZE" > in.dat
</code>

To allow larger blocks you could increase the two ''MAX'' parameters in the ''linsolve.f90'' file, for example:

  * ''MAX_VECTOR_SIZE'' from 1000 to 2000
  * ''MAX_MATRIX_SIZE'' from 250000 to 1000000

To accommodate these larger sizes, some of the ''FORMAT'' statements should use ''I4'' instead of ''I2'' and ''I3''.

===== Compiling =====

First set the variables:

  * ''packages'': the VALET packages
  * ''libs'': the libraries
  * ''f90flags'': the compiler flags

Since this test is completely contained in one Fortran 90 program, you can compile, link and load with one command:

<code bash>
vpkg_rollback all
vpkg_devrequire $packages
mpif90 $f90flags -o solve linsolve.f90 $LDFLAGS $libs
</code>

Some version of the ''mpif90'' wrapper must be added to your environment, either explicitly or implicitly by VALET. VALET will also set ''LDFLAGS'' for your compile statement.

===== Grid engine script file =====

You must run the ''solve'' executable on the compute nodes, so submit it with Grid Engine. You will need a script, which we copy from ''/opt/shared/templates/openmpi/openmpi-ibverb.qs'' and modify so that:

  * ''$MY_EXE'' is ''./solve < in.dat''
  * ''NPROC'' is ''$NPROC_ROWS*$NPROC_COLS''
  * the ''vpkg_require'' line includes the VALET packages for the benchmark.

For example, with the ''packages'', ''NPROC_ROWS'' and ''NPROC_COLS'' variables set as above:

<code bash>
let NPROC=$NPROC_ROWS*$NPROC_COLS
if [ ! -f "template.qs" ]; then
  sed -e 's/\$MY_EXE/.\/solve < in.dat/g;s/^echo "\(.*\)--"/date "+\1 %D %T--"/' \
    /opt/shared/templates/openmpi/openmpi-ibverb.qs > template.qs
  echo "new copy of template in template.qs"
fi
sed "s/NPROC/$NPROC/g;s#^vpkg_require.*#vpkg_require $packages#" template.qs > solve.qs
</code>

The file ''solve.qs'' will contain a script to run the executable ''./solve'', which reads its data from ''in.dat''. ''NPROC'' is set to the number of processors required to handle the ''$NPROC_ROWS x $NPROC_COLS'' process grid.

There is only one executable, ''solve'', one data file, ''in.dat'', and one batch script file, ''solve.qs''. There may be different test versions to try alternate compilers, libraries and problem sizes, but you should not change these files until the Grid Engine jobs are done. Use the ''qsub -sync y'' command to wait for all jobs to complete (a sketch is shown in the next section).

===== Submitting =====

There is only one linear system to solve, and it should take just a few seconds. We will use the under-4-hour standby queue for testing.
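Since every test reuses the same ''solve'', ''in.dat'' and ''solve.qs'' files, it helps to let ''qsub'' block until each job finishes. Here is a minimal sketch of submitting the three repeats reported below with ''qsub -sync y''; it assumes the same queue options as the single submission command in the next paragraph.

<code bash>
# Submit three repeats of the same test.  With -sync y each qsub call
# returns only after its job has completed, so solve, in.dat and
# solve.qs are never changed while a job is still running.
for repeat in 1 2 3; do
  qsub -sync y -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs
done
</code>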
The variables ''name'' and ''N'' are set as described above.

<code bash>
qsub -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs
</code>

===== Tests =====

==== gcc ====

<code bash>
name=gcc
packages=scalapack/2.0.1-openmpi-gcc
libs="-lscalapack -llapack -lblas"
f90flags=''
</code>

==== gcc and atlas ====

<code bash>
name=gcc_atlas
packages='scalapack atlas'
libs="-lscalapack -llapack -lcblas -lf77blas -latlas"
f90flags=''
</code>

The documentation in ''/opt/shared/atlas/3.6.0-gcc/doc/LibReadme.txt'' gives the loader flags to use to load the ''atlas'' libraries; they are ''-LLIBDIR -llapack -lcblas -lf77blas -latlas''. The ''-L'' directory is in ''LDFLAGS'', which is supplied by VALET. Also from the same documentation: "ATLAS does not provide a full LAPACK library." This means the order in which the VALET packages are added is important. If we add ''atlas'' before ''scalapack'', then the loader will not be able to find some LAPACK routines. But this may not be optimal; again from the same documentation: "Just linking in ATLAS's liblapack.a first will not get you the best LAPACK performance, mainly because LAPACK's untuned ILAENV will be used instead of ATLAS's tuned one."

With these variables set and ''packages'' changed to ''scalapack atlas'', we get ''ld'' errors:

<code>
...
/opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER/libf77blas.a(xerbla.o): In function `xerbla_':
xerbla.f:(.text+0x18): undefined reference to `s_wsfe'
...
</code>

Explanation: the function ''xerbla_'' in the ''f77blas'' library was compiled with ''g77'', not ''gfortran'', so it needs the support routines in the ''g2c'' library. Knowing the name of this library, we can find a copy:

<code>
find /usr -name libg2c.a
find: `/usr/lib64/audit': Permission denied
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/32/libg2c.a
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/libg2c.a
</code>

To remove these errors, change ''libs'' to:

<code bash>
libs="-lscalapack -llapack -lcblas -lf77blas -latlas -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6 -lg2c"
</code>

This produces new ''ld'' errors:

<code>
...
linsolve.f90:(.text+0xd00): undefined reference to `slarnv_'
...
</code>

Explanation: the ''atlas'' version of ''lapack'' does not have all of the LAPACK routines.

<code bash>
nm -g /opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER/liblapack.a | grep slarnv_
nm -g /opt/shared/scalapack/2.0.1-openmpi-gcc/lib/liblapack.a | grep slarnv_
</code>

<code>
                 U slarnv_
0000000000000000 T slarnv_
                 U slarnv_
                 U slarnv_
</code>

There is no output from the first ''nm'' command.

You can copy the full ''atlas'' library directory into your working directory and then follow the directions in ''/opt/shared/atlas/3.6.0-gcc/doc/LibReadme.txt'', section ''**** GETTING A FULL LAPACK LIB ****''. We call this library ''myatlas''.

==== gcc and myatlas ====

<code bash>
name=gcc_myatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lcblas -lf77blas -latlas"
f90flags=''
</code>

This requires a copy of ''atlas'' in your own directory, ''./lib'', and uses the ''g2c'' library from the ''/usr/lib'' directory. You need to build this copy of ''atlas'' yourself. Assuming you do not already have a ''lib'' directory in your working directory:

<code bash>
# Start from the system ATLAS libraries
cp -a /opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER lib
# Extract the tuned ATLAS LAPACK objects
ar x lib/liblapack.a
# Replace lib/liblapack.a with the full LAPACK from scalapack,
# then put the tuned ATLAS objects back in
cp /opt/shared/scalapack/2.0.1-openmpi-gcc/lib/liblapack.a lib
ar r lib/liblapack.a *.o
rm *.o
# Keep a copy of the g2c support library alongside
cp /usr/lib/gcc/x86_64-redhat-linux/3.4.6/libg2c.a lib
</code>

Now you have a ''./lib'' directory with your own copy of the ''atlas'' libraries and the ''g2c'' library. You no longer need to specify the ''atlas'' VALET package.
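Putting the pieces together, here is a sketch of the complete ''gcc_myatlas'' build, reusing the ''./lib'' directory created above and the generic compile command from the Compiling section:

<code bash>
# Variables for the gcc_myatlas test (from the section above)
name=gcc_myatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lcblas -lf77blas -latlas"
f90flags=''

# One-command compile, link and load (see the Compiling section)
vpkg_rollback all
vpkg_devrequire $packages
mpif90 $f90flags -o solve linsolve.f90 $LDFLAGS $libs
</code>

The other tests are built the same way, only with different values for ''name'', ''packages'', ''libs'' and ''f90flags''.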
==== gcc and myptatlas ====

<code bash>
name=gcc_myptatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lptcblas -lptf77blas -latlas"
f90flags=''
</code>

The parallel threads will dynamically use all the cores available at compile time (24), but only if the problem size indicates they will help.

==== pgi and acml ====

<code bash>
name=pgi_acml
packages=scalapack/2.0.1-openmpi-pgi
libs="-lscalapack -lacml"
f90flags=''
</code>

==== intel and mkl ====

<code bash>
name=intel_mkl
packages=openmpi/1.4.4-intel64
libs="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
f90flags="-mkl"
</code>

===== Results N=4000 =====

==== BLOCK=1000, NPROCS=16 ====

Each test is repeated three times.

^ File name ^ Time ^
| gcc4000.o91943 | Elapsed time = 0.613728D+05 milliseconds |
| gcc4000.o92019 | Elapsed time = 0.862935D+05 milliseconds |
| gcc4000.o92030 | Elapsed time = 0.826695D+05 milliseconds |
| gcc_atlas4000.o91945 | Elapsed time = 0.386161D+04 milliseconds |
| gcc_atlas4000.o92023 | Elapsed time = 0.433195D+04 milliseconds |
| gcc_atlas4000.o92035 | Elapsed time = 0.424980D+04 milliseconds |
| gcc_myatlas4000.o92009 | Elapsed time = 0.448106D+04 milliseconds |
| gcc_myatlas4000.o92026 | Elapsed time = 0.461706D+04 milliseconds |
| gcc_myatlas4000.o92032 | Elapsed time = 0.441593D+04 milliseconds |
| intel_mkl4000.o91611 | Elapsed time = 0.222194D+05 milliseconds |
| intel_mkl4000.o92016 | Elapsed time = 0.215223D+05 milliseconds |
| intel_mkl4000.o92039 | Elapsed time = 0.214088D+05 milliseconds |
| pgi_acml4000.o91466 | Elapsed time = 0.238015D+04 milliseconds |
| pgi_acml4000.o92017 | Elapsed time = 0.261841D+04 milliseconds |
| pgi_acml4000.o92040 | Elapsed time = 0.222939D+04 milliseconds |

==== BLOCK=800, NPROCS=25 ====

Each test is repeated three times.

^ File name ^ Time ^
| gcc4000.o92335 | Elapsed time = 0.638246D+05 milliseconds |
| gcc4000.o92386 | Elapsed time = 0.633060D+05 milliseconds |
| gcc4000.o92412 | Elapsed time = 0.629561D+05 milliseconds |
| gcc_atlas4000.o92336 | Elapsed time = 0.314615D+04 milliseconds |
| gcc_atlas4000.o92389 | Elapsed time = 0.358208D+04 milliseconds |
| gcc_atlas4000.o92413 | Elapsed time = 0.334147D+04 milliseconds |
| gcc_myatlas4000.o92337 | Elapsed time = 0.363176D+04 milliseconds |
| gcc_myatlas4000.o92390 | Elapsed time = 0.306922D+04 milliseconds |
| gcc_myatlas4000.o92414 | Elapsed time = 0.333779D+04 milliseconds |
| intel_mkl4000.o92339 | Elapsed time = 0.433877D+05 milliseconds |
| intel_mkl4000.o92393 | Elapsed time = 0.400862D+05 milliseconds |
| intel_mkl4000.o92417 | Elapsed time = 0.409855D+05 milliseconds |
| pgi_acml4000.o92338 | Elapsed time = 0.234248D+04 milliseconds |
| pgi_acml4000.o92392 | Elapsed time = 0.276856D+04 milliseconds |
| pgi_acml4000.o92415 | Elapsed time = 0.211567D+04 milliseconds |

==== BLOCK=500, NPROCS=64 ====

Each test is repeated three times.
^ File name ^ Time ^
| gcc4000.o92123 | Elapsed time = 0.284893D+05 milliseconds |
| gcc4000.o92144 | Elapsed time = 0.278744D+05 milliseconds |
| gcc4000.o92150 | Elapsed time = 0.289137D+05 milliseconds |
| gcc_atlas4000.o92130 | Elapsed time = 0.296471D+04 milliseconds |
| gcc_atlas4000.o92142 | Elapsed time = 0.264463D+04 milliseconds |
| gcc_atlas4000.o92148 | Elapsed time = 0.269103D+04 milliseconds |
| gcc_myatlas4000.o92133 | Elapsed time = 0.280457D+04 milliseconds |
| gcc_myatlas4000.o92138 | Elapsed time = 0.312135D+04 milliseconds |
| gcc_myatlas4000.o92153 | Elapsed time = 0.286337D+04 milliseconds |
| intel_mkl4000.o92134 | Elapsed time = 0.436288D+05 milliseconds |
| intel_mkl4000.o92140 | Elapsed time = 0.413780D+05 milliseconds |
| intel_mkl4000.o92152 | Elapsed time = 0.401095D+05 milliseconds |
| pgi_acml4000.o92137 | Elapsed time = 0.234475D+04 milliseconds |
| pgi_acml4000.o92145 | Elapsed time = 0.214514D+04 milliseconds |
| pgi_acml4000.o92149 | Elapsed time = 0.293480D+04 milliseconds |

==== BLOCK=250, NPROCS=256 ====

Each test is repeated three times.

^ File name ^ Time ^
| gcc4000.o92164 | Elapsed time = 0.148302D+05 milliseconds |
| gcc4000.o92168 | Elapsed time = 0.144862D+05 milliseconds |
| gcc4000.o92317 | Elapsed time = 0.160144D+05 milliseconds |
| gcc_atlas4000.o92167 | Elapsed time = 0.785104D+04 milliseconds |
| gcc_atlas4000.o92171 | Elapsed time = 0.749285D+04 milliseconds |
| gcc_atlas4000.o92318 | Elapsed time = 0.798376D+04 milliseconds |
| gcc_myatlas4000.o92165 | Elapsed time = 0.797618D+04 milliseconds |
| gcc_myatlas4000.o92222 | Elapsed time = 0.792745D+04 milliseconds |
| gcc_myatlas4000.o92320 | Elapsed time = 0.720193D+04 milliseconds |
| intel_mkl4000.o92162 | Elapsed time = 0.636915D+05 milliseconds |
| intel_mkl4000.o92169 | Elapsed time = 0.733785D+05 milliseconds |
| intel_mkl4000.o92324 | Elapsed time = 0.653791D+05 milliseconds |
| pgi_acml4000.o92161 | Elapsed time = 0.740457D+04 milliseconds |
| pgi_acml4000.o92170 | Elapsed time = 0.733668D+04 milliseconds |
| pgi_acml4000.o92322 | Elapsed time = 0.769606D+04 milliseconds |

===== Summary =====

==== 4000 x 4000 matrix ====

=== Time to solve linear system ===

A randomly generated matrix is solved using ScaLAPACK with different block sizes. The times are the average elapsed time in seconds, as reported by ''MPI_WTIME'', over three batch job submissions.

^ Test ^ N=4000 ^^^^
^ name ^ np=16 ^ np=25 ^ np=64 ^ np=256 ^
| [[#gcc|gcc]] | 76.779 | 63.362 | 28.426 | 15.110 |
| [[#gcc_and_atlas|gcc_atlas]] | 4.148 | 3.357 | 2.767 | 7.776 |
| [[#gcc_and_myatlas|gcc_myatlas]] | 4.505 | 3.346 | 2.930 | 7.702 |
| [[#intel_and_mkl|intel_mkl]] | 21.717 | 41.486 | 41.705 | 67.483 |
| [[#pgi_and_acml|pgi_acml]] | 2.409 | 2.409 | 2.475 | 7.479 |

There is not much difference between ''gcc_atlas'' and ''gcc_myatlas''.

=== Speedup ===

The speedup for ''ATLAS'', ''MKL'' and ''ACML'' is relative to ''GCC'' with no special BLAS libraries (speedup = GCC time / library time; for example, ''gcc_atlas'' at np=16 gives 76.779 / 4.148 ≈ 18.5).

{{:clusters:mills:speedup.png?500|}}

==== 16000 x 16000 matrix ====

=== Time to solve linear system ===

A randomly generated matrix is solved using ScaLAPACK with different block sizes. The times are the average elapsed time in seconds, as reported by ''MPI_WTIME'', over two batch job submissions.
^ Test ^ N=16000 ^^^
^ name ^ np=16 ^ np=64 ^ np=256 ^
| [[#gcc|gcc]] | 5627.655 | 1682.308 | 474.087 |
| [[#gcc_and_atlas|gcc_atlas]] | 231.552 | 81.281 | 44.890 |
| [[#gcc_and_myatlas|gcc_myatlas]] | 235.682 | 88.871 | 39.559 |
| [[#gcc_and_myptatlas|gcc_myptatlas]] | 218.815 | 79.013 | 47.388 |
| [[#intel_and_mkl|intel_mkl]] | 132.071 | 174.859 | 224.309 |
| [[#pgi_and_acml|pgi_acml]] | 161.941 | 64.373 | 34.463 |

=== Speedup ===

Speedup for ATLAS, MKL and ACML compared to the reference GCC with no optimized library. ATLAS has two variants: ''myatlas'' has the complete LAPACK merged in, and ''myptatlas'' loads the multi-threaded versions of the libraries.

{{:clusters:mills:speedup16000.png?500|}}

=== Time plot ===

Elapsed time for ATLAS, MKL and ACML.

{{:clusters:mills:plot16000.png?500|}}
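For reference, the averages in the summary tables can be reproduced from the ''Elapsed time'' lines in the job output files. The following is a small sketch; it assumes the output files are named as in the results tables above and that the elapsed times are printed in milliseconds with Fortran ''D'' exponents:

<code bash>
# Average the elapsed times for one test, e.g. gcc_atlas at N=4000.
# The D exponent is changed to E so awk can parse the number,
# and the mean is reported in seconds.
grep -h 'Elapsed time' gcc_atlas4000.o* |
  awk '{ gsub(/D/, "E", $4); sum += $4; n++ }
       END { printf "%.3f seconds over %d runs\n", sum/n/1000, n }'
</code>

Run on the three np=16 output files of ''gcc_atlas'', for example, this gives 4.148 seconds, matching the summary table.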